Python Crawler Growth Path (1): Scraping Stock Data from Securities Star

      Data acquisition is an essential part of data analysis, and a web crawler is one of the important channels for obtaining data. With that in mind, I picked up Python as my weapon and set out on the road of web crawling.

      This post uses Python 3.5 and aims to scrape all of the day's A-share data from Securities Star. The program is divided into three parts: fetching the page source, extracting the required content, and cleaning up the results.

First, fetching the page source

      One of the reasons many people like writing crawlers in Python is that it is easy to get started. Just a few lines of code can fetch the source of most web pages.

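      For example, a minimal fetch with urllib might look like this (the URL is illustrative and the page encoding is assumed to be GBK):

import urllib.request

# Illustrative URL for a Securities Star A-share ranking page.
url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_1.html'
request = urllib.request.Request(url)
content = urllib.request.urlopen(request).read().decode('gbk')  # encoding assumed to be GBK
print(content[:200])                                            # peek at the first characters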

      Fetching the source of one page is easy, but scraping page source in bulk from a single site often gets intercepted by the server, and suddenly the world feels full of malice. So I began studying techniques for breaking through anti-crawler restrictions.

      1. Disguise the request headers

      Many servers check the headers a browser sends to confirm whether the request comes from a human user, so we can build request headers that imitate a browser's behavior when sending requests. The server inspects certain parameters to decide whether you are human; many sites check the User-Agent parameter, so it is best to include it in the request header. Some more vigilant sites may check other parameters as well, for example Accept-Language to tell whether you are a human user, and sites with anti-hotlinking protection also require the Referer parameter, and so on.
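
      For instance, a request header imitating a desktop browser can be attached like this (the header values and URL below are illustrative, not taken from the original program):

import urllib.request

url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_1.html'  # illustrative URL
# Header values imitating a desktop browser; the exact strings are examples only.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Referer': 'http://quote.stockstar.com/'
}
request = urllib.request.Request(url, headers=headers)
content = urllib.request.urlopen(request).read().decode('gbk')  # encoding assumed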

      2. Randomly generate the User-Agent

      Securities Star returns page content as long as the User-Agent parameter is present, but after a few consecutive pages the server blocked me anyway. So I decided to simulate a different browser each time I fetched data; since the server tells browsers apart by their User-Agent, each request can carry a randomly generated UA in its header.
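
      A simple way to do this is to keep a small pool of UA strings and pick one at random for each request; the strings and helper below are just an illustration:

import random
import urllib.request

# A small example pool of User-Agent strings; any realistic set of browser UAs will do.
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.56 (KHTML, like Gecko) Version/9.0 Safari/601.1.56'
]

def build_request(url):
    # Every call attaches a different, randomly chosen User-Agent.
    headers = {'User-Agent': random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)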

      3. Slow down the crawl

      Even while simulating different browsers, I found that during some periods I could crawl hundreds of pages of data, while at other times only a dozen or so; apparently the server also uses your access frequency to decide whether you are a human user or a crawler. So I made the program rest for a random few seconds after each page, and with that one line added it could crawl large amounts of stock data in any period.
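
      The pause itself is a single line; a sketch with a random delay of one to four seconds (the exact range is a matter of taste) looks like this:

import random
import time

# Rest a random 1-4 seconds after each page so the access frequency looks human.
time.sleep(random.uniform(1, 4))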

      4. Use proxy IPs

      Accidents happen, though: the program tested fine at the office, but back in the dorm it could again fetch only a few pages before being blocked. In a panic I asked Baidu and learned that the server can identify your IP and count how many times that IP has visited. The answer is to use high-anonymity proxy IPs and keep swapping them during the crawl, so the server cannot work out who the real culprit is. I have not mastered this technique yet; for what happens next, stay tuned for the next installment.
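
      With urllib this might look roughly like the following; the proxy addresses are placeholders, not working proxies:

import random
import urllib.request

# Placeholder proxy addresses; replace them with real high-anonymity proxies.
proxies = ['111.111.111.111:8080', '122.122.122.122:3128']

proxy_handler = urllib.request.ProxyHandler({'http': 'http://' + random.choice(proxies)})
opener = urllib.request.build_opener(proxy_handler)
content = opener.open('http://quote.stockstar.com/stock/ranklist_a_3_1_1.html').read()  # illustrative URL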

      5. Other ways around anti-crawler restrictions

      Many servers send a cookie to the browser when handling its request and then use that cookie to track your visit, so to keep the server from recognizing you as a crawler it is best to carry the cookie along while fetching data. If you run into a site that requires logging in, you can register a batch of accounts and crawl with those instead, so your own account does not get blacklisted; that involves simulated login, captcha recognition and other topics I will not dig into here. In short, some crawlers really are a nuisance to site owners, who come up with many ways to keep them out, so after forcing our way in we should mind our manners and not bring someone else's site to its knees.
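
      With the standard library, carrying cookies along means installing a cookie-aware opener, for example (the URL is illustrative):

import http.cookiejar
import urllib.request

# Reuse one opener so cookies set by the server are sent back on later requests.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
content = opener.open('http://quote.stockstar.com/stock/ranklist_a_3_1_1.html').read()  # illustrative URL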

Second, extracting the required content

      After obtaining the page source, we can extract the data we need from it. There are many ways to pull the required information out of the source; regular expressions are one of the classic approaches. In the collected source, the stock data sits inside the <tbody> section of a table.

      To reduce interference, I first use a regular expression to match the table body out of the whole page source, and then match each stock's information out of that body. The code is as follows.

import re   # regular expression module

pattern = re.compile('<tbody[\s\S]*</tbody>')
body = re.findall(pattern, str(content))     # match everything between <tbody and </tbody>
pattern = re.compile('>(.*?)<')
stock_page = re.findall(pattern, body[0])    # match all the information between > and <

      Here the compile method compiles the matching pattern, and findall uses that pattern to match out the required information, returning it as a list. Regular expression syntax is quite extensive; below I list only the meanings of the symbols used here.

Syntax    Explanation
.         Matches any character except the newline "\n"
*         Matches the preceding character zero or more times
?         Matches the preceding character zero or one time
\s        A whitespace character: [<space>\t\r\n\f\v]
\S        A non-whitespace character: [^\s]
[...]     A character set; the character at that position may be any character from the set
(...)     The parenthesized expression forms a group, which usually contains the content we want to extract

      There is a lot more to regular expression syntax; no doubt some expert could pull out everything I want with a single expression. While extracting the main stock section I noticed that others use XPath expressions, which look more concise, so it seems I still have a long way to go with page parsing.

Third, cleaning up the results

      Matching everything between > and < with the non-greedy pattern (.*?) also picks up some blank entries, so we use the following code to remove them.

stock_last = stock_total[:]   # stock_total: the matched stock data
for data in stock_total:      # stock_last: the stock data after cleanup
    if data == '':
        stock_last.remove('')

      Finally, we can print a few columns of the data to see the effect; the code is as follows.

print('Code', '\t', 'Name', '   ', '\t', 'Latest price', '\t', 'Change %', '\t', 'Change amount', '\t', '5-min change')
for i in range(0, len(stock_last), 13):   # each row on the page has 13 columns of data
    print(stock_last[i], '\t', stock_last[i+1], ' ', '\t', stock_last[i+2], '  ', '\t', stock_last[i+3], '  ', '\t', stock_last[i+4], '  ', '\t', stock_last[i+5])

      Part of the printed result is shown below.

      The complete program scrapes all of the day's A-share data from Securities Star.

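      A minimal sketch of how the pieces above could be stitched together is shown below; the ranking-page URL pattern, the number of pages, and the GBK decoding are assumptions for illustration rather than details taken from the original program:

import random
import re
import time
import urllib.request

# Example User-Agent pool; any realistic set of browser strings will do.
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0'
]

body_pattern = re.compile(r'<tbody[\s\S]*</tbody>')  # the table body of the ranking page
field_pattern = re.compile(r'>(.*?)<')               # every field between > and <

def get_page(url):
    # Fetch one page with a randomly chosen User-Agent; GBK encoding is assumed.
    headers = {'User-Agent': random.choice(user_agents)}
    request = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(request).read().decode('gbk', errors='ignore')

stock_total = []                      # all matched fields across pages
for page in range(1, 6):              # assumed number of ranking pages
    # Assumed URL pattern for the A-share ranking pages.
    url = 'http://quote.stockstar.com/stock/ranklist_a_3_1_%s.html' % page
    content = get_page(url)
    body = body_pattern.findall(content)
    if not body:
        continue
    stock_total.extend(field_pattern.findall(body[0]))
    time.sleep(random.uniform(1, 4))  # rest a few seconds so the crawl looks human

# Remove the blank entries that the non-greedy match picks up.
stock_last = [data for data in stock_total if data != '']

# Print the first six of the 13 columns in each row.
print('Code', 'Name', 'Latest price', 'Change %', 'Change amount', '5-min change', sep='\t')
for i in range(0, len(stock_last), 13):
    row = stock_last[i:i + 13]
    print('\t'.join(row[:6]))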

