# Scrape web articles with Python (fetch index -> fetch pages -> parse -> save to CSV)

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import urllib.request
from requests.exceptions import RequestException
import csv
import pandas as pd


import random


def getUrl():
    """Collect article URLs from the site's index page.

    Fetches the listing page, parses every ``#list li`` entry, and builds an
    absolute URL from each entry's first ``<a href>``.

    Returns:
        list[str]: absolute article URLs; empty list when the request fails
        or no links are found.
    """
    data = []
    try:
        # timeout so a hung connection cannot block the whole crawl
        res = requests.get('https://xxx.com/', timeout=10)
        res.raise_for_status()
    except RequestException:
        print('===request exception===')
        return data
    res.encoding = 'utf-8'  # force UTF-8 to avoid mojibake in the parsed text

    soup = BeautifulSoup(res.text, 'html.parser')

    for news in soup.select('#list li'):
        link = news.find('a')
        # skip list items without a usable link instead of raising AttributeError
        if link is None or not link.get('href'):
            continue
        # NOTE(review): index is fetched from xxx.com but links are prefixed
        # with xxx.org — confirm this mismatch is intentional
        data.append('https://xxx.org' + link.get('href'))
    return data

# Module-level crawl of the index page: `urls` is consumed by article() below.
# NOTE(review): this runs network I/O at import time — confirm that is intended.
urls = getUrl()

# 获取页面内容
def getHtml(url):
    """Fetch a page and return its HTML text.

    Args:
        url: absolute URL to download.

    Returns:
        str | None: the response body on HTTP 200; ``None`` on any other
        status code or on a request-level failure.
    """
    try:
        # timeout so one slow page cannot stall the whole crawl
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        # explicit None instead of silently falling off the end of the function
        return None
    except RequestException:
        print('===request exception===')
        return None

# 解析网页
def parse_html(html):
    """Extract (title, content) from an article page.

    Looks for the article under ``#entry``: the title comes from its ``<h1>``,
    the body text from ``#entrybody``.

    Args:
        html: page HTML, or ``None`` when the fetch failed.

    Returns:
        tuple[str, str] | None: ``(title, content)``, or ``None`` when the
        page is missing, does not match the expected layout, or parsing fails.
    """
    if html is None:
        return None
    try:
        soup = BeautifulSoup(html, 'html.parser')

        # Pre-initialize so a page with no #entry yields None instead of
        # raising NameError on the return statement (the original bug).
        title = None
        content = None

        for tag in soup.select('#entry'):
            heading = tag.find('h1')
            if heading is not None:
                title = heading.get_text()

            for art in tag.select('#entrybody'):
                content = art.get_text()

        if title is None or content is None:
            return None
        return title, content
    except Exception:
        # best-effort parser: report and let the caller fall back
        print('===parseHtml exception===')
        return None

# 保存到csv表中
def save2csv(title, content):
    with open('xx.csv', 'a+', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['title', 'content'])
        writer.writerow([title, content])
        pd.read_csv('xx.csv')

def article():
    """Crawl every URL in the module-level ``urls`` list and save each to CSV.

    For each URL: fetch the page, parse out (title, content), and append a row
    via ``save2csv``. When a page cannot be fetched or parsed, the URL itself
    is recorded in both columns as a placeholder.
    """
    for url in urls:
        html = getHtml(url)
        info = parse_html(html)

        # `is None` (identity), not `== None` (equality), per PEP 8
        if info is None:
            # fall back to recording the URL so the failure is visible in the CSV
            title = url
            content = url
        else:
            title, content = info

        save2csv(title, content)


# Run the crawl only when executed as a script, not when imported as a module.
if __name__ == '__main__':
    article()

# Adapted from: blog.csdn.net/for_get_love/article/details/88865195