To scrape articles from the WeChat public account 'Toutiao' (今日头条), you could use Macaca to simulate page taps and swipes; this article uses the requests library instead.
Open the account's article-list page in WeChat, capture the traffic, and you get a URL like the following:
url = 'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MjM5ODEyOTAyMA==&scene=124&devicetype=android-22&version=2605083a&lang=zh_CN&nettype=WIFI&a8scene=3&pass_ticket=iD8S0RbTAzk%2Bjvb11FqZS0ds6KHxqYsUcOaC%2FBVr6ZW%2F7scu856kxVTy0i4x2beq&wx_header=1'
Swipe up on the page, and the next captured request URL looks like this:
'https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MjM5ODEyOTAyMA==&f=json&offset=10&count=10&is_ok=1&scene=124&uin=777&key=777&pass_ticket=iD8S0RbTAzk%2Bjvb11FqZS0ds6KHxqYsUcOaC%2FBVr6ZW%2F7scu856kxVTy0i4x2beq&wxtoken=&appmsg_token=931_vA3uwr%252F7pziEtbv2FTsuWOm-Z1jXfr95atu1EQ~~&x5=1&f=json'
In this URL, __biz identifies the public account, offset is the pagination offset into the article list, and count is the number of articles returned per request.
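As a quick sanity check, these pagination parameters can be pulled out of the captured URL with the standard library; nothing here is specific to WeChat (the URL below is a shortened version of the capture above):

```python
from urllib.parse import urlparse, parse_qs

url = ('https://mp.weixin.qq.com/mp/profile_ext?action=getmsg'
       '&__biz=MjM5ODEyOTAyMA==&f=json&offset=10&count=10&is_ok=1')

# parse_qs splits each pair on the first '=', so the base64-style
# trailing '==' in __biz is preserved in the value
params = parse_qs(urlparse(url).query)
print(params['__biz'][0])   # the public account id
print(params['offset'][0])  # pagination offset: 10, 20, 30, ...
print(params['count'][0])   # articles per request
```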
Inspect the cookies captured from the same session:
cookies = {
    'wap_sid2': 'CM2s/eMBElxiVjQzcXZoYUFEMWN4VGVWMmxZZ29pMnZLbW02LTZqUlFXTmF4MDhtbFhzQ1d0ZUlWWm41OFlYWXo1Vk54eHJTYmlBRVYzazhHNWZ5cC03SUVsZFQwNk1EQUFBfjCEksnQBTgMQJRO',
}
'wap_sid2' has a limited lifetime and expires after a while, so it has to be re-captured periodically.
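It helps to detect expiry instead of silently getting empty results. This is a hypothetical heuristic, not part of the documented API: the assumption (based on the capture above) is that a response with a valid session carries the general_msg_list field, and an expired session does not.

```python
def session_expired(payload: dict) -> bool:
    # Hypothetical check: field names are assumed from the traffic
    # capture; adjust them to whatever your own capture shows.
    return 'general_msg_list' not in payload

print(session_expired({'ret': -3, 'errmsg': 'no session'}))   # True
print(session_expired({'general_msg_list': '{"list": []}'}))  # False
```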
The full code is as follows:
import json

import requests

# verify=False below disables TLS verification; silence the resulting warnings
requests.packages.urllib3.disable_warnings()

headers = {
    'user-agent': 'Mozilla/5.0 (Linux; Android 5.1.1; ATH-AL00 Build/HONORATH-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 BaiduWallet-7.0.0.4-Android-walletapp_1080_1776_ATH-AL00-HWATH_22_5.1.1_3.2.0_320',
}

# Cookies for the request; wap_sid2 expires periodically and must be re-captured
cookies = {
    'wap_sid2': 'CM2s/eMBElxiVjQzcXZoYUFEMWN4VGVWMmxZZ29pMnZLbW02LTZqUlFXTmF4MDhtbFhzQ1d0ZUlWWm41OFlYWXo1Vk54eHJTYmlBRVYzazhHNWZ5cC03SUVsZFQwNk1EQUFBfjCEksnQBTgMQJRO',
}

def crawl_wx():
    sess = requests.Session()
    sess.cookies = requests.utils.cookiejar_from_dict(cookies)
    # Fetch article info from the first 5 pages
    for page in range(5):
        url = ('https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz=MjM5ODEyOTAyMA=='
               '&f=json&offset={}&count=10&is_ok=1&scene=124&uin=777&key=777'
               '&pass_ticket=iD8S0RbTAzk%2Bjvb11FqZS0ds6KHxqYsUcOaC%2FBVr6ZW%2F7scu856kxVTy0i4x2beq'
               '&wxtoken=&appmsg_token=931_vA3uwr%252F7pziEtbv2FTsuWOm-Z1jXfr95atu1EQ~~'
               '&x5=1&f=json').format(page * 10)
        res = sess.get(url, headers=headers, verify=False)
        # general_msg_list is itself a JSON string embedded in the JSON response
        items = json.loads(res.json()['general_msg_list'])['list']
        # Extract each article's title and link; with the link you can then
        # fetch the article body directly with requests
        infos = [(item['app_msg_ext_info']['title'], item['app_msg_ext_info']['content_url'])
                 for item in items]
        for info in infos:
            print(info)

if __name__ == '__main__':
    crawl_wx()
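The double-decoding step is the easy part to get wrong: general_msg_list arrives as a JSON string nested inside the JSON response. The sketch below exercises that extraction against a mock payload (field names match the real capture; the titles and URLs are made up for illustration):

```python
import json

# Mock of the response body: general_msg_list is a JSON *string*,
# so it needs a second json.loads after parsing the outer response
sample = {
    'general_msg_list': json.dumps({
        'list': [
            {'app_msg_ext_info': {'title': 'Article A', 'content_url': 'https://example.com/a'}},
            {'app_msg_ext_info': {'title': 'Article B', 'content_url': 'https://example.com/b'}},
        ]
    })
}

items = json.loads(sample['general_msg_list'])['list']
infos = [(i['app_msg_ext_info']['title'], i['app_msg_ext_info']['content_url'])
         for i in items]
print(infos)  # [('Article A', 'https://example.com/a'), ('Article B', 'https://example.com/b')]
```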