# 爬虫04-网易科技新闻 (Crawler 04 — NetEase tech news)

"""
__title__ = ''
__author__ = 'Thompson'
__mtime__ = '2018/7/26'
# code is far away from bugs with the god animal protecting
    I love animals. They taste delicious.
              ┏┓      ┏┓
            ┏┛┻━━━┛┻┓
            ┃      ☃      ┃
            ┃  ┳┛  ┗┳  ┃
            ┃      ┻      ┃
            ┗━┓      ┏━┛
                ┃      ┗━━━┓
                ┃  神兽保佑    ┣┓
                ┃ 永无BUG!   ┏┛
                ┗┓┓┏━┳┓┏┛
                  ┃┫┫  ┃┫┫
                  ┗┻┛  ┗┻┛
"""

import json
import os
import random
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Launch Chrome, load the NetEase tech page, and keep scrolling to the
# bottom until the lazy-loaded feed stops growing; then grab the final
# page source.
browser = webdriver.Chrome()
try:
    browser.get("http://tech.163.com/")
    last_height = browser.execute_script("return document.body.scrollHeight")
    while True:
        print('页面加载中...')
        # Scroll once to the bottom to trigger the next lazy-load batch.
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Randomized 2-10 s wait: the original random.random()*10 could
        # sleep ~0 s and declare the page "done" before content loaded.
        time.sleep(2 + random.random() * 8)
        # Compare the new scroll height against the previous one; if the
        # page stopped growing there is nothing left to load.
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    html = browser.page_source
finally:
    # quit() shuts down the chromedriver process too; the original
    # close() only closed the window and leaked the driver process.
    browser.quit()

# Data extraction: pull the headline title and URL out of every news
# row and write them as one JSON object per line (JSON Lines format).
soup = BeautifulSoup(html, 'lxml')
ls = soup.select('div.data_row.news_article.clearfix')
print(len(ls))

# Create the output directory if needed — the original crashed with
# FileNotFoundError when ./data did not exist.
os.makedirs('./data', exist_ok=True)
# 'with' guarantees the file is closed even if extraction raises.
with open('./data/163tech.json', 'w', encoding='utf-8') as file:
    for item in ls:
        links = item.select('h3 > a')
        if not links:
            # Skip rows without a headline link instead of raising
            # IndexError on select(...)[0].
            continue
        title = links[0].get_text()
        print('title:', title)
        url = links[0]['href']
        print('url:', url)
        content = json.dumps({'title': title, 'url': url}, ensure_ascii=False) + "\n"
        file.write(content)

# Sanity check: read the JSON-lines file back and print each record as
# a dict. 'with' closes the handle (the original leaked it), and
# iterating the file streams line by line instead of readlines().
with open('./data/163tech.json', 'r', encoding='utf-8') as file:
    for line in file:
        print(json.loads(line))


# 猜你喜欢 (blog-page residue, not part of the script)
#
# 转载自blog.csdn.net/qwerLoL123456/article/details/83143193