Python crawler: scraping NetEase News (the NetEase rankings)

Link to my post on scraping Sina News:

https://blog.csdn.net/Iv_zzy/article/details/107535041

Link to my post on scraping 中国新闻网 (China News):

https://blog.csdn.net/Iv_zzy/article/details/107537295

The approach here differs from the Sina News scraper: there, all the links were first saved to a CSV file and then parsed together in one pass. For NetEase News, I fetch each link's content as soon as the link is parsed (across these posts I try to demonstrate different methods; adapt whichever one you need).

The NetEase ranking page looks like this:

[figure: the NetEase news ranking page]

Taking entertainment news as an example, clicking the Entertainment tab opens the following page:

[figure: the entertainment news ranking page]

The page's URL is:

http://news.163.com/special/0001386F/rank_ent.html

1. First, grab the titles (each carrying its link) from the ranking page.
This function returns a list:

import requests
from bs4 import BeautifulSoup

def Initpage(url, headers):
    res = requests.get(url, headers=headers)
    html = res.content.decode('gb18030', 'ignore')  # the page is GBK-encoded; decode explicitly
    soup = BeautifulSoup(html, 'html.parser')
    # the ranking list sits in the left column, inside the active tab
    titles = soup.find('div', 'area-half left').find('div', 'tabContents active').find_all('a')
    return titles  # a list of <a> tags

The result looks like this:

[figure: the list of headline links returned by Initpage]
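In text form, each element of the returned list is an &lt;a&gt; tag whose text is the headline and whose href is the article URL. A quick check (the User-Agent string here is my assumption; any common browser value works):

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed UA string
titles = Initpage('http://news.163.com/special/0001386F/rank_ent.html', headers)
for t in titles[:3]:
    print(t.text, t.get('href'))  # headline and article URL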
2. Walk through the list, extract each title's link, and fetch its content:

def parse(titles, headers):
    count = 0
    for title in titles:
        # extract the article URL from the <a> tag
        news_url = title.get('href')
        # fetch and parse the article page
        news_response = requests.get(news_url, headers=headers)
        news_html = news_response.text
        news_soup = BeautifulSoup(news_html, 'html.parser')
        # skip pages that don't follow the standard article layout
        if news_soup.find('div', 'post_text') is None:
            continue
        news_title = news_soup.find('h1').text
        # drop the last two <p> tags (editor byline / footer)
        contents = news_soup.find('div', 'post_text').find_all('p')[:-2]
        news_contents = ""
        for content in contents:
            # skip empty paragraphs and embedded video/image captions
            if len(content.text) <= 0 or ("video" in content.text) or ("img" in content.text):
                continue
            news_contents += content.text.strip()
        count += 1
        try:
            print(news_title, news_contents)
            print('News #' + str(count) + ' written successfully')
        except Exception:
            print('Failed to scrape news #' + str(count) + ', trying the next one')
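Putting the two steps together takes only a couple of lines; again, the User-Agent header is an assumed value:

url = 'http://news.163.com/special/0001386F/rank_ent.html'
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed UA string
parse(Initpage(url, headers), headers)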

In addition, I also tried storing the news content in a MySQL database; if you need that, the following may help.

Connecting to the database:

import pymysql

def con_db():
    global db
    try:
        db = pymysql.connect(host='localhost', user='root', password='123456',
                             database='newsDB', charset='utf8')
    except pymysql.Error as e:
        print("Error: {}".format(e))
        raise  # no point continuing without a connection
    cur = db.cursor()
    print('connection success')
    return cur
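The WYnews table itself is not shown in the original post; judging from the insert statement below, it needs at least the three text columns. A one-off setup sketch, with the schema being my assumption:

cur = con_db()
cur.execute('''
    create table if not exists WYnews (
        id int primary key auto_increment,  -- assumed surrogate key
        category varchar(32),
        newsTitle varchar(512),
        newsContent text
    ) default charset=utf8
''')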

Inserting the data:

import time

def insert_news(news_title, news_contents):
    category = '娱乐'  # entertainment; change the category per ranking page
    sqli = '''
        insert into WYnews(category, newsTitle, newsContent)
        values (%s, %s, %s)
    '''
    # a parameterized execute() escapes the values safely (no manual escape_string needed)
    cur.execute(sqli, (category, news_title, news_contents))
    db.commit()  # the insert is not persisted without a commit
    time.sleep(1)
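To wire this into the crawler, call cur = con_db() once at startup and, inside parse(), replace the print(news_title, news_contents) line with insert_news(news_title, news_contents). Note that the values are passed to cur.execute() as parameters rather than interpolated into the SQL string; parameterized queries are the idiomatic pymysql approach and avoid quoting bugs.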

If the data volume gets large, I recommend multiprocessing; the approach is briefly introduced in my earlier post (a minimal sketch follows the link below):

https://blog.csdn.net/Iv_zzy/article/details/107535041
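A minimal sketch along those lines: one worker process per ranking page. The commented-out second URL is hypothetical, assuming the other categories follow the same naming pattern; also note that with the database enabled, each worker would need its own connection.

from multiprocessing import Pool

def crawl_category(url):
    headers = {'User-Agent': 'Mozilla/5.0'}  # assumed UA string
    parse(Initpage(url, headers), headers)

if __name__ == '__main__':
    urls = [
        'http://news.163.com/special/0001386F/rank_ent.html',
        # hypothetical second category URL, assuming the same naming scheme:
        # 'http://news.163.com/special/0001386F/rank_sports.html',
    ]
    with Pool(len(urls)) as pool:
        pool.map(crawl_category, urls)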

If you repost this, please credit the source. Thanks!

Reposted from blog.csdn.net/Iv_zzy/article/details/107536472