JieMian News Scraping | Crawler

# JieMian News scraping
# Tech news section: http://www.jiemian.com/lists/65.html

# Analysis:
#     Open DevTools (F12) and check whether the data we want can be found in
#     the page itself -- here, each article's link, title, and image URL;
#     it all appears in the response, so the data can be scraped directly.
#     Question: how do we fetch the paginated content?
#         Pagination is driven by a click-to-load element:
#         """<div class="load-more" onclick="AsyncLoadList(this)" url="https://a.jiemian.com/index.php?m=lists&amp;a=cLists&amp;id=242&amp;type=card&amp;notid=2080130,2074075,2070788" page="2">加载更多</div>"""
#     Approach: clicking it simply requests the link in its url attribute.
#         Clear the Network panel, trigger the page's load-more event, and an
#         index.php?... request appears in the panel;
#         under Headers -> Query String Parameters you can see the parameters
#         the GET request needs:
#             m: lists
#             a: cLists
#             id: 242
#             type: card
#             notid: 2080130,2074075,2070788
#             callback: jQuery1102017521587218181867_1524626577671
#             page: 2
#             _: 1524626577673
#     Experimenting shows that the callback and _ parameters are only used by
#     the server to construct the JSONP response, so they can be omitted.
#     That yields the URL we need, parameterized by page:
#         https://a.jiemian.com/index.php?m=lists&a=cLists&id=242&type=card&notid=2080130,2074075,2070788&page={0}
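The URL above can also be assembled from a parameter dict with the standard library, which keeps the query readable; a minimal sketch (`build_list_url` is an illustrative helper, not part of the original post):

```python
from urllib.parse import urlencode

BASE = "https://a.jiemian.com/index.php"

def build_list_url(page):
    """Build the paginated list URL; callback and _ are omitted,
    since the server does not require them."""
    params = {
        "m": "lists",
        "a": "cLists",
        "id": 242,
        "type": "card",
        "notid": "2080130,2074075,2070788",
        "page": page,
    }
    # safe="," keeps the commas in notid unescaped, matching the raw URL
    return BASE + "?" + urlencode(params, safe=",")

print(build_list_url(2))
# → https://a.jiemian.com/index.php?m=lists&a=cLists&id=242&type=card&notid=2080130,2074075,2070788&page=2
```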
           
# Code:
import json
import requests
from lxml import etree


class JieMianSpider(object):
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
        }
        self.proxies = {
            # NOTE: proxy dict keys must be scheme names ("https"), not "https:"
            # "https": "https://113.87.160.73:8181",
            "https": "https://218.72.110.144:18118",
        }

    def get_page(self, url):
        response = requests.get(url, headers=self.headers, proxies=self.proxies)
        response.encoding = response.apparent_encoding
        return response.text

    def run(self):
        with open('jiedian.txt', 'w', encoding='utf-8') as f:
            page = 0
            while True:
                page += 1
                url = "https://a.jiemian.com/index.php?m=lists&a=cLists&id=242&type=card&notid=2080130,2074075,2070788&page={0}".format(page)
                # 1. Send the request
                response = self.get_page(url)
                # 2. The body is JSONP-wrapped; strip the surrounding
                #    parentheses and parse the JSON into a dict
                res_dict = json.loads(response[1:-1])
                # print(res_dict)
                res_data = res_dict['rst']
                # 3. Locate the card nodes and extract the data
                html = etree.HTML(res_data)
                el_objs = html.xpath('//div[@class="news-img"]')
                result = []
                for el in el_objs:
                    # each xpath() call returns a list of matching values
                    url = el.xpath('./a/@href')
                    img = el.xpath('./a/img/@src')
                    title = el.xpath('./a/@title')
                    result.append({
                        "url": url[0] if url else None,
                        "img": img[0] if img else None,
                        "title": title[0] if title else None,
                    })
                result = json.dumps(result, ensure_ascii=False)
                print(result)
                f.write(result + '\n')
                if page == 20:
                    break


if __name__ == '__main__':
    JM_spider = JieMianSpider()
    JM_spider.run()
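The `response[1:-1]` slice above assumes the body is always wrapped in parentheses. If the endpoint ever returns bare JSON (as it may once `callback` is omitted), a tolerant parser avoids corrupting the payload; a sketch under that assumption (`jsonp_to_dict` is a hypothetical helper, not part of the original script):

```python
import json
import re

def jsonp_to_dict(text):
    """Parse a body that may be bare JSON or JSONP (callbackName({...});).

    A regex strips the callback wrapper when present, instead of blindly
    slicing off the first and last character.
    """
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.S)
    if match:
        text = match.group(1)
    return json.loads(text)

# Both shapes parse to the same dict:
print(jsonp_to_dict('{"rst": "<div>card</div>"}'))
print(jsonp_to_dict('jQuery110_123({"rst": "<div>card</div>"});'))
```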

Reposted from www.cnblogs.com/pymkl/p/8941996.html