[Python Crawler] Scraping Investor Interaction Q&A Data

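Note: Because this is a crawler project, please do not reply about commercial transactions or similar topics. All of the content is shared publicly.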

Introduction: For academic research purposes, I taught myself web-crawling techniques and put them into practice in this project.

1. Importing the Libraries

import requests
from lxml import etree
import csv

For my research, the goal is to scrape the investor interaction Q&A entries that match given keywords and write them to CSV files.

2. Initial Parameters

# Initialise the input parameters: the keywords to search for
key_list = ["keyword1", "keyword2"]

for key in key_list:
    page = 1
    print(key)
    # Create one CSV file per keyword and write the header row
    # (the file name prefix "上交所关键词：" means "SSE keyword:")
    with open('上交所关键词：{0}.csv'.format(key), 'w', newline='', encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        writer.writerow(["time", "name", "code", "Question", "Answer"])
    error_count = 0
    while True:
        print("\nstart No.{} page crawl...".format(page))
        # Query string: date range, keyword and page number
        query = "sdate=2018-01-01&edate=2022-12-31&keyword={}&type=1&page={}&comId=".format(key, page)
        res = requests.get("http://sns.sseinfo.com/getNewData.do", params=query)
        res.encoding = "utf8"
        # The site returns this message ("no Q&A content for now") once the results run out
        if "暂时没有问答内容" in res.text:
            print("crawl finish")
            break

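Note 1: key_list is the list of keywords and can be set as needed. Setting too many keywords at once is not recommended; instead, wrap the crawl in another loop and process the keywords in batches of five.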
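
The batching suggested in Note 1 could be sketched as follows. This is a minimal illustration, not part of the original script: the helper name chunked and the sample keyword list are assumptions made for the example.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical keyword list, for illustration only.
sample_keys = ["kw{}".format(n) for n in range(1, 13)]

for batch_no, batch in enumerate(chunked(sample_keys, 5), start=1):
    print("batch {}: {}".format(batch_no, batch))
    # Run the per-keyword crawl loop from this article on `batch` here,
    # optionally pausing between batches.
```

With twelve keywords this produces three batches (five, five, and two keywords), so each run of the crawl loop only handles a small group.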

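Note 2: sdate=2018-01-01&edate=2022-12-31 sets the start date and end date. To see why the filter takes this form, look at the search page of the 上证e website (sns.sseinfo.com).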

Note 3: keyword={} sets the search keyword; the 上证e website supports searching by keyword.
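
To make the parameters in Notes 2 and 3 easier to adjust, the query string could be assembled by a small helper. This is a sketch under stated assumptions: build_query is a hypothetical name, and the values of type and comId are simply copied from the script above because their meaning is not explained in this article.

```python
from urllib.parse import urlencode


def build_query(keyword, page, sdate="2018-01-01", edate="2022-12-31"):
    """Return the query string for one result page of a keyword search."""
    params = {
        "sdate": sdate,      # start date, YYYY-MM-DD
        "edate": edate,      # end date, YYYY-MM-DD
        "keyword": keyword,  # search keyword
        "type": 1,           # copied from the original script
        "page": page,        # page number, starting at 1
        "comId": "",         # left empty, as in the original script
    }
    # urlencode percent-encodes non-ASCII keywords as UTF-8.
    return urlencode(params)


print(build_query("abc", 1))
```

The result can be passed to requests.get through the params argument in the same way as the hand-written string.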

3. Scraping

        # Each Q&A entry sits in a div with class "m_feed_item"
        divs = etree.HTML(res.text).xpath("//div[@class='m_feed_item']")

        for div in divs:
            try:
                # Question and answer text of an answered entry
                text = div.xpath(".//div[@class='m_feed_detail m_qa_detail']//div[@class='m_feed_txt']//text()")
                text = [i.strip() for i in text if i.strip() != '']
                answer = div.xpath(".//div[@class='m_feed_detail m_qa']//div[@class='m_feed_txt']//text()")
                answer = [i.strip() for i in answer if i.strip() != '']

                # Fall back to the other layout and mark the answer as "暂无回应" (no reply yet)
                if len(text) == 0:
                    text = div.xpath(".//div[@class='m_feed_detail ']//div[@class='m_feed_txt']//text()")
                    text = [i.strip() for i in text if i.strip() != '']
                    answer = ["暂无回应"]

                text = "".join(text)
                answer = "".join(answer)
                # The joined text is expected to have the form "name(code)question"
                code = text.split(")")[0].split("(")[1]
                name = text.split(")")[0].split("(")[0]
                time = "".join(div.xpath(".//div[@class='m_feed_from']//span//text()")).split(" ")[0]
                # Split only at the first ")" so parentheses inside the question are kept
                text = text.split(")", 1)[1]
                row = [time, name, code, text, answer]
                print(row)
                with open('上交所关键词：{0}.csv'.format(key), 'a', newline='', encoding="utf-8") as csv_file:
                    writer = csv.writer(csv_file, delimiter=',')
                    writer.writerow(row)
                error_count = 0
            except Exception:
                # Skip entries that do not match the expected layout
                error_count = error_count + 1
                continue

        # Stop this keyword if the page ended with more than three consecutive failures
        if error_count > 3:
            print("process miss the defence...")
            break

        page = page + 1
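
The string handling in the loop above relies on the question text starting with the company name followed by the stock code in half-width parentheses. As an alternative sketch (a regular expression swapped in for the chained split calls, not the method used above), the same split can be written as one function. The name split_question and the sample strings are hypothetical.

```python
import re

# name, then "(code)", then the rest of the text as the question.
HEADER_RE = re.compile(r"^(?P<name>[^(]*)\((?P<code>[^)]*)\)(?P<question>.*)$", re.S)


def split_question(text):
    """Split 'name(code)question' into (name, code, question)."""
    match = HEADER_RE.match(text)
    if match is None:
        # No half-width "(code)" prefix found: report it instead of guessing.
        raise ValueError("unexpected question layout: {!r}".format(text[:30]))
    return match.group("name"), match.group("code"), match.group("question")


# Hypothetical sample; parentheses inside the question are preserved.
print(split_question("ABC(600000)Is the plan (phase 2) on schedule?"))
```

Raising ValueError for text without the expected prefix fits the existing try/except, which counts such entries as errors and moves on.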

4. Final Notes

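Because the crawl runs once per keyword, the same Q&A entry can appear in more than one file. It is therefore advisable to merge all of the CSV files and then remove the duplicate rows. Word-frequency statistics are covered in a follow-up code article.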
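
The merge-and-deduplicate step could look roughly like this. It is a minimal sketch using only the standard library, assuming every file shares the header written by the script above and that two rows count as duplicates only when all columns match. The function name merge_and_dedupe and the output file name are hypothetical.

```python
import csv


def merge_and_dedupe(paths, out_path):
    """Merge CSV files with a shared header, dropping rows already seen."""
    seen = set()
    header = None
    unique_rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            if file_header is None:
                continue  # empty file, nothing to merge
            if header is None:
                header = file_header  # keep the header of the first file
            for row in reader:
                key = tuple(row)
                if key not in seen:
                    seen.add(key)
                    unique_rows.append(row)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        writer.writerows(unique_rows)
    return len(unique_rows)


# Example call (file names are assumptions):
# import glob
# merge_and_dedupe(glob.glob("上交所关键词：*.csv"), "merged.csv")
```

The function returns the number of unique rows written, which gives a quick check of how many duplicates were removed.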
