Scraping New Oriental's Past Postgraduate Entrance Exam Papers with a Python Crawler

Other 2018-11-01 06:41:34

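My girlfriend is preparing for the postgraduate entrance exam and wanted to look at past papers from various universities. After searching online, I found that the postgraduate section of New Oriental's past-paper library is fairly complete. URL: http://new.bj.xdf.cn/zhentiku/daxue/kaoyan/kyzyk/list_381_1.html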

So I wrote a quick crawler to fetch all the journalism exam papers. The script is as follows:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# The list pages are numbered 1 through 16.
for page in range(1, 17):
    first_url = "http://new.bj.xdf.cn/zhentiku/daxue/kaoyan/kyzyk/list_381_" + str(page) + ".html"
    print("[Begin] scrape page", first_url)
    data = urlopen(first_url).read()
    bsobj = BeautifulSoup(data, "html.parser")

    # Keep only the links whose title contains "新闻" (journalism).
    links = bsobj.find_all("a", {"title": re.compile(u"新闻")})

    for link in links:
        url = "http://new.bj.xdf.cn" + link.attrs["href"]
        filename = link.attrs["title"] + ".html"
        subdata = BeautifulSoup(urlopen(url).read(), "html.parser")
        # Declare UTF-8 so a browser renders the saved Chinese text correctly.
        with open(filename, "w", encoding="utf-8") as f:
            f.write('<meta charset="UTF-8">\n')
            # select() returns a list of tags; write their HTML, not the list repr.
            f.write("".join(str(tag) for tag in subdata.select(".article-wrap")))
    print("[End]")
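
Two details of the script are easy to get wrong when adapting it: the list pages are numbered from 1 to 16, and each link title is used directly as a filename, which can fail if a title contains a slash or other reserved character. The following is a minimal sketch of both pieces using only the standard library. The helper names `page_urls` and `safe_filename` are hypothetical, and the set of characters being replaced is an assumption, not something taken from the site.

```python
import re

BASE_URL = "http://new.bj.xdf.cn"
LIST_PATH = "/zhentiku/daxue/kaoyan/kyzyk/list_381_{page}.html"


def page_urls(first=1, last=16):
    """Yield list-page URLs for pages first..last inclusive (hypothetical helper)."""
    for page in range(first, last + 1):
        yield BASE_URL + LIST_PATH.format(page=page)


def safe_filename(title, suffix=".html"):
    """Turn a link title into a filename (hypothetical helper).

    Assumption: path separators, Windows-reserved punctuation, and
    whitespace are the characters worth replacing with underscores.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "untitled") + suffix


urls = list(page_urls())
print(urls[0])
print(len(urls))
print(safe_filename("2016年/新闻学 真题?"))
```

In the script, `safe_filename(link.attrs["title"])` could stand in for the plain string concatenation when building the output filename.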


Reposted from blog.csdn.net/u012675539/article/details/53169374
