Web Scraping Notes (Day 3)

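When the page is first opened, the first Ajax request is sent with max_id=-1 and count=10, as shown in the figure below.

Scrolling down triggers a second Ajax request (see the figure below). The only parts of the URL that differ are max_id and count: max_id is the id of the last item returned by the previous Ajax request, and every request after the first uses count=15. So the URL has to be built anew for each request.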
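The paging rule described above can be sketched as a small generator. This is only an illustration under assumptions: `paginate` and its `fetch` argument are hypothetical names (not part of any library), and `fetch(max_id, count)` is assumed to return the already-parsed JSON dict of one response. The `next_id` / `next_max_id` keys are the ones read by the full script in these notes.

```python
def paginate(fetch, pages):
    """Yield `pages` parsed responses, carrying max_id from one request to the next.

    `fetch` is a hypothetical callable: fetch(max_id, count) -> dict.
    """
    max_id, count = -1, 10  # values of the very first Ajax request
    for _ in range(pages):
        page = fetch(max_id, count)
        yield page
        # The full script reads 'next_id' after the first request
        # and 'next_max_id' after every later one.
        max_id = page['next_id'] if count == 10 else page['next_max_id']
        count = 15  # every request after the first uses count=15
```

Keeping the paging state inside one generator means the code that downloads and the code that stores data do not need to know about max_id or count at all.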

The URL is built as follows:

 url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id={}&count={}&category=111'.format(str(max_id), str(count))
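As an alternative to formatting the query string by hand, the same URL can be assembled from a dict with the standard library's `urllib.parse.urlencode`. This is a minimal sketch, not the method used in these notes; `BASE_URL` and `build_url` are names introduced here for illustration only.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above; BASE_URL is just an illustrative name.
BASE_URL = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json'


def build_url(max_id, count, category=111):
    """Return the timeline URL for one Ajax request."""
    params = {
        'since_id': -1,
        'max_id': max_id,
        'count': count,
        'category': category,
    }
    # urlencode joins the pairs with '&' and escapes values where needed
    return BASE_URL + '?' + urlencode(params)
```

Calling `build_url(-1, 10)` gives the URL of the first request, and `build_url(max_id, 15)` the URL of each later one.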

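After that, it is a matter of locating the data: using Python dicts, lists, and json, pick out the fields that are needed, then connect to the database and store them. The complete code is below.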

import requests
import json
import pymysql

# Open the database connection
db = pymysql.connect(host="localhost", user="root", password="8888", database="test")
# Create a cursor object with cursor()
cursor = db.cursor()
# ----------------------------------------------------------------
i = 1  # Request counter; with the loop condition below it limits how many pages are crawled
count = 10   # The first request uses count=10; every later request uses 15
max_id = -1
while i < 10:
    # Outer loop: controls the total number of Ajax requests made
    url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id={}&count={}&category=111'.format(str(max_id), str(count))
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        'Cookie': 'aliyungf_tc=AQAAABS6+HG0OAQAUhVFeZTrWYKcrmDe; xq_a_token=584d0cf8d5a5a9809761f2244d8d272bac729ed4; xq_a_token.sig=x0gT9jm6qnwd-ddLu66T3A8KiVA; xq_r_token=98f278457fc4e1e5eb0846e36a7296e642b8138a; xq_r_token.sig=2Uxv_DgYTcCjz7qx4j570JpNHIs; _ga=GA1.2.2007990410.1534303926; _gid=GA1.2.1454932696.1534303926; u=781534303927452; device_id=0883ecbffed505f2f843656aec9a0524; Hm_lvt_1db88642e346389874251b5a1eded6e3=1534303929,1534303938,1534344710; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1534344710; _gat_gtag_UA_16079156_4=1'
    }
    response = requests.get(url, headers=headers)
    
    msg = response.content.decode('utf-8')
    msg_dict = json.loads(msg)
    # print(type(msg_dict))
    # print(msg_dict)
    if count == 10:
        max_id = msg_dict['next_id']
    else:
        max_id = msg_dict['next_max_id']
    
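    # Walk every item actually returned instead of assuming exactly `count` items
    for item in msg_dict['list']:
        # Each item's 'data' field is itself a JSON string, so decode it again
        data_dict = json.loads(item['data'])
        # print(data_dict)

        uid = data_dict['id']
        title = data_dict['title']
        description = data_dict['description']
        target = data_dict['target']
        # --------------------- save to the database ---------------------
        # Parameterized INSERT: the driver escapes the values, so a quote in a
        # title or description no longer breaks the statement
        sql = """INSERT INTO xueqiu(uid, title, description, target)
                     VALUES (%s, %s, %s, %s)"""
        try:
            # Execute the SQL statement
            cursor.execute(sql, (uid, title, description, target))
            # Commit the transaction
            db.commit()
        except pymysql.MySQLError:
            # Roll back if the insert fails
            db.rollback()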

    count = 15
    i = i + 1
# Close the database connection
db.close()
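The field extraction in the script can also be pulled out into a function that is easy to try without a network connection or a database. The sketch below assumes the response shape the script relies on: a top-level `list` whose items each carry a `data` field holding a JSON string. `extract_records` is a hypothetical helper name, and the sample payload in the usage note is made up for illustration.

```python
import json


def extract_records(page):
    """Return (id, title, description, target) tuples from one parsed response."""
    records = []
    for item in page.get('list', []):
        # 'data' is a JSON string nested inside the outer JSON document
        data = json.loads(item['data'])
        records.append((
            data.get('id'),
            data.get('title'),
            data.get('description'),
            data.get('target'),
        ))
    return records
```

For example, `extract_records({'list': [{'data': '{"id": 1, "title": "t", "description": "d", "target": "/x"}'}]})` yields one tuple, which can then be passed as the parameter tuple of an INSERT statement.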
