Crawling the Butian vendor list with multiprocessing

Copyright notice: this is an original article by the author and may not be reproduced without permission. https://blog.csdn.net/ZZZJX7/article/details/52860161

Work has been pretty hectic lately and my head has been a bit scattered; it feels like time to calm down and focus.

I had wanted a crawler for Butian's vendor list for a while, and I happened to come across a post on a forum, so I reworked it a bit; call it a derivative piece. My own additions are multiprocessing and automatic detection of the final page number. The full script follows.

#coding=utf-8
import sys
reload(sys)                        # Python 2 trick: re-expose setdefaultencoding
sys.setdefaultencoding("utf-8")    # so UTF-8 vendor names can be written without decode errors
import multiprocessing
import time
import requests as req
import re
import lxml
from bs4 import BeautifulSoup


def Spide(url):
    try:
        html = req.get(url, timeout=60).text
        print url
        html = html.encode('utf-8')
        # Each vendor name sits inside a left-aligned <td> cell on the listing page
        pat = '<td  align="left" style="padding-left:20px;">.*</td>'
        u = re.compile(pat)
        ress = u.findall(html)
        res = []
        for i in ress:
            # Pull out the text between the tag brackets, e.g. ">SomeVendor<"
            u = re.compile('>.*<')
            res += u.findall(i)
        for i in res:
            a = i.strip('<>')
            # Append each vendor name to the output file; mode 'a+' so results accumulate across processes
            with open('360.txt', 'a+') as f:
                f.write(a + '\n')
    except Exception as e:
        # Skip pages that fail to download or parse
        pass

def get_page(url):
    # Read the pagination bar and return the number of the last page
    a = req.get(url)
    if a.status_code == 200:
        soup = BeautifulSoup(a.text, "lxml")
        # The last <a> in the pager links to ".../page/<last>", so take the trailing path segment
        pages = soup.select("div.pages > a")[-1].get('href').split('/')[-1]
        return pages

if __name__ == "__main__":
    # Four worker processes, one page of the vendor list per task
    pool = multiprocessing.Pool(processes=4)
    url_list = []
    url = 'https://butian.360.cn/company/lists/page/1'
    page = int(get_page(url))
    # range() excludes the upper bound, so go to page + 1 to include the last page
    for i in range(1, page + 1):
        url = 'https://butian.360.cn/company/lists/page/' + str(i)
        url_list.append(url)
    pool.map(Spide, url_list)
    pool.close()
    pool.join()
    print("Done!")

The results are shown below and look decent enough:

[screenshot: the vendor names collected in 360.txt]