1. Find the content you need: open the browser DevTools (F12), switch to the Network tab, and filter for XHR requests.
2. Analyze the request URLs
Link 1: https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFEYzFqUFVzZ2tQWVNDekQ1TXBRLWdnWHNzU01UQmktZjV4Y2VablhPVDdrWWg0WDFmbEJOdE9ycnU0WFY3SXk5U3hRZjR2VllkOXdPVWxJbDNHT2t6VQ%3D%3D%22%7D
Link 2: https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFDX0ZJdTN0SDNTY3dDcmJ5dmc5OWNCWkwxQXBWZmZ4SU56bFpOTTVqM1FtN29XQzBrT192aXdEclJWdXlHSk9YZGY3dWNYMTltTW9YOVhJbFBVUG5mMQ%3D%3D%22%7D
Comparing the two, the URLs are identical except for one segment. Writing the differing segment as a {} placeholder gives the template:
https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22{}%3D%3D%22%7D
The remaining task is to obtain the value that fills this {} placeholder.
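Decoding the variables parameter shows it is nothing more than URL-encoded JSON. As a minimal sketch (standard library only; the query_hash, user id, and page size are copied from the links above, while the cursor value is a placeholder), the same URL can be assembled like this:

import json
from urllib.parse import quote

variables = {
    "id": "2176779867",              # the account's numeric user id
    "first": 12,                     # number of posts per page
    "after": "PUT_END_CURSOR_HERE",  # paging cursor (placeholder)
}
url = ('https://www.instagram.com/graphql/query/'
       '?query_hash=103056d32c2554def88228bc3fd9668a'
       '&variables=' + quote(json.dumps(variables, separators=(',', ':'))))
print(url)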
3. The key step: how to obtain the {} value
Comparing the responses shows that the missing part of link 2 is exactly the "end_cursor" value found in the response to link 1.
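As a quick sanity check, here is a minimal sketch of reading end_cursor back out of a response saved by the script below. The JSON path is the same one the script uses; has_next_page is assumed to sit in the same page_info object and signals whether a further page exists:

import json

with open('instagram/1.json') as f:    # a page saved by the script below
    res = json.load(f)

page_info = res["data"]["user"]["edge_owner_to_timeline_media"]["page_info"]
print(page_info.get("has_next_page"))  # False once the last page is reached
print(page_info["end_cursor"])         # the value that fills the {} slot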
4. Summary: the URL structure is now known, so the crawl can be implemented.
The full implementation code is as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
from selenium.webdriver.support.ui import WebDriverWait
import json
from bs4 import BeautifulSoup
options = Options()
options.add_argument('--lang=en')
options.add_argument('--start-maximized')
# Hide automation flags so Instagram is less likely to detect Selenium
options.add_experimental_option("excludeSwitches", ['enable-automation'])
options.add_argument("--disable-blink-features=AutomationControlled")
# Reuse the local Chrome profile so the session is already logged in
options.add_argument(r'--user-data-dir=C:\Users\Administrator\AppData\Local\Google\Chrome\User Data')
driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver, 30)
# driver.get('https://www.instagram.com/bedsurehome/')
# Retrieving the 'after' parameter
"""
# Fetch the account's latest posts
driver.get('https://www.instagram.com/bedsurehome/?__a=1')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
cc = soup.select('pre')[0]
res = json.loads(cc.text)
with open('instagram/0.json', 'w') as f:
    json.dump(res, f)
"""
# How the URL is constructed
URL_FIRST='https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFEbS1NcUxsa3VZN3JzbW5CdHZ5MXl1OEdYaElxY2Y2R1poWW05QlROSlhjaE53UjNsSHhER0c5TzJRSGw5bmNvQnF6M19DTlRZWFYtamVCbFFKTlNtTw%3D%3D%22%7D'
driver.get(URL_FIRST)
# 1. Save the JSON response
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
cc = soup.select('pre')[0]
res = json.loads(cc.text)
with open('instagram/1.json', 'w') as f:
    json.dump(res, f)
print(html)
# The cursor always ends in '==', which the URL template below already
# encodes as %3D%3D, so strip those trailing two characters here
after = res["data"]["user"]["edge_owner_to_timeline_media"]["page_info"]["end_cursor"][:-2]
count = 2
for i in range(0, 136):
    url = 'https://www.instagram.com/graphql/query/?query_hash=103056d32c2554def88228bc3fd9668a&variables=%7B%22id%22%3A%222176779867%22%2C%22first%22%3A12%2C%22after%22%3A%22{}%3D%3D%22%7D'.format(
        after)
    print(after)
    driver.get(url)
    # 1. Save the JSON response
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    cc = soup.select('pre')[0]
    res = json.loads(cc.text)
    with open('instagram/{}.json'.format(count), 'w') as f:
        json.dump(res, f)
    count += 1
    after = res["data"]["user"]["edge_owner_to_timeline_media"]["page_info"]["end_cursor"][:-2]
    time.sleep(1)
driver.close()
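Once the pages are saved, the JSON files can be processed offline. Below is a minimal sketch (not part of the original script) that walks every saved page and prints each post's shortcode and image URL; the shortcode and display_url field names are assumptions based on the usual shape of this GraphQL response:

import glob
import json

for path in sorted(glob.glob('instagram/*.json')):
    with open(path) as f:
        res = json.load(f)
    edges = res["data"]["user"]["edge_owner_to_timeline_media"]["edges"]
    for edge in edges:
        node = edge["node"]
        # Field names assumed from the typical response shape
        print(node.get("shortcode"), node.get("display_url"))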
If anything is unclear, feel free to reach out to the author.