Scraping The Atlantic's Daily News


Overview

I've been studying English recently and wanted some authentic English reading material. After browsing The Economist, The New York Times, and The Atlantic, I found The Atlantic suited me best, so I wrote a crawler that fetches its daily news and saves each article as a Markdown file, ready to push to my blog.

Recommended reading:

Issues:

  • My regular expressions are mostly forgotten
  • Scrapy too; tonight I reviewed how to crawl a page and save the data, but haven't looked at the configuration yet
import requests
from lxml import etree
import re
# url = 'https://www.theatlantic.com/science/archive/2018/10/horsepox-smallpox-virus-science-ethics-debate/572200/'

url_root = 'https://www.theatlantic.com/latest/'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
}

def get_urlLists(html):
    """Extract article URLs from the /latest/ listing page."""
    selector = etree.HTML(html)
    # each article link sits in <ul class="river"><li><a href="...">
    url_lists = selector.xpath('//ul[@class="river"]/li/a/@href')
    # the hrefs are relative, so prepend the site root
    url_lists = ['https://www.theatlantic.com{}'.format(url) for url in url_lists]
    return url_lists

root_html = requests.get(url_root,headers=headers).text
url_lists = get_urlLists(root_html)
len(url_lists)
30

How lxml parses a page (both modes are sketched below)

  • Parsing a string
    • etree.HTML(str)
  • Parsing an HTML file
    • etree.parse('path/to/file.html', etree.HTMLParser())
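A minimal sketch of both modes; page.html here is a hypothetical local file:

from lxml import etree

# Mode 1: parse HTML held in a string (what this crawler uses)
tree = etree.HTML('<html><body><p>hello</p></body></html>')
print(tree.xpath('//p/text()'))  # ['hello']

# Mode 2: parse an HTML file from disk
# tree = etree.parse('page.html', etree.HTMLParser())
# print(tree.xpath('//p/text()'))
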
def get_MarkDown_by_url(url):
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        selector = etree.HTML(resp.text)
        p_list = []
        # each article body sits inside the l-article__container div
        body = selector.xpath('//div[@class="l-article__container"]/div[@class="blah"]')
        for i in body:
            # string() pulls all text inside each <p>, nested tags included
            p_list.append([p.xpath('string()').strip() for p in i.xpath('div/section//p')])
        head = selector.xpath('//header')[0]
        header = head.xpath('h1/text()')
        meta = head.xpath('div[contains(@class,"c-article-meta")]/p/text()')
        time = head.xpath('div[contains(@class,"c-article-meta")]/time/text()')
        to_MarkDown(header, meta, time, p_list)
        print('{} | is OK'.format(header[0]))

First, get all the p tags under each div[@class="blah"], then extract each p tag's text directly. Note that string() and text() behave differently:

<p>123
    <a href="">456</a>
</p>

Note: xpath('./text()') extracts only the 123; string() extracts all the text, including the 456 inside the nested a tag.
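
A minimal demo of the difference, using the snippet above:

from lxml import etree

p = etree.HTML('<p>123<a href="">456</a></p>').xpath('//p')[0]
print(p.xpath('./text()'))  # ['123']: only the p element's own text nodes
print(p.xpath('string()'))  # '123456': all descendant text concatenated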

Don't confuse //div with /div: //div selects every div in the document, so it can't be used to get just the divs under a particular tag; /div selects only the direct children (which you can then process one by one).
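
For example, on a made-up snippet:

from lxml import etree

outer = etree.HTML(
    '<div id="outer"><div>child</div>'
    '<section><div>nested</div></section></div>'
).xpath('//div[@id="outer"]')[0]
print(len(outer.xpath('div')))    # 1: only div children directly under outer
print(len(outer.xpath('//div')))  # 3: // searches the whole document, outer included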

For more on how XPath works, see the recommended articles at the top of this post.

Testing (with the analysis process)

(screenshot: the HTML structure of the article body)

As the screenshot shows, the structure is very simple.

p_content = []
p_lists = body[0].xpath('div/section//p')
# for p in p_lists:
#     p_content.append(p.xpath('string()').strip())
# appending the whole comprehension result nests it one level deep
p_content.append([p.xpath('string()').strip() for p in p_lists])
len(p_lists)
5
p_content
[['Over the past 12 months, three scholars—James Lindsay, Helen Pluckrose, and Peter Boghossian—wrote 20 fake papers using fashionable jargon to argue for ridiculous conclusions, and tried to get them placed in high-profile journals in fields including gender studies, queer studies, and fat studies. Their success rate was remarkable: By the time they took their experiment public late on Tuesday, seven of their articles had been accepted for publication by ostensibly serious peer-reviewed journals. Seven more were still going through various stages of the review process. Only six had been rejected.',
  'We’ve been here before.',
  'In the late 1990s, Alan Sokal, a professor of physics at New York University, began a soon-to-be-infamous article by setting out some of his core beliefs:',
  'that there exists an external world, whose properties are independent of any individual human being and indeed of humanity as a whole; that these properties are encoded in “eternal” physical laws; and that human beings can obtain reliable, albeit imperfect and tentative, knowledge of these laws by hewing to the “objective” procedures and epistemological strictures prescribed by the (so-called) scientific method.',
  'Sokal went on to “disprove” his credo in fashionable jargon. “Feminist and poststructuralist critiques have demystified the substantive content of mainstream Western scientific practice, revealing the ideology of domination concealed behind the façade of ‘objectivity,’” he claimed. “It has thus become increasingly apparent that physical ‘reality,’ no less than social ‘reality,’ is at bottom a social and linguistic construct.”']]
p_lists[0].xpath('string()').strip()
'Last year, the world learned that researchers led by David Evans from the University of Alberta had resurrected a virus called horsepox. The virus hasn’t been seen in nature for decades, but Evans’s team assembled it using genetic material that they ordered from a company that synthesizes DNA.'

The test works. Note that p_content is a nested list, since the whole comprehension result was appended as one element; the p_list built in get_MarkDown_by_url has the same shape, which is why to_MarkDown joins each inner list.



Adding the article title, time, and teaser

For a tag like div[class="class1 class2 class3"], how do you select the div by just one of its class values (say, class1)?
xpath('//div[contains(@class,"class1")]')

This selects every div whose class attribute contains class1.
xpath('//div[contains(@class,"a") and contains(@class,"b")]')
selects elements whose class contains both a and b.

contains(@class,"a") and @id="c"
matches tags whose class contains a and whose id is c.
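
A quick sketch of these selectors on made-up markup; note that contains() does plain substring matching, so very short class names can over-match:

from lxml import etree

html = etree.HTML(
    '<div class="class1 class2 class3">x</div>'
    '<div class="a b" id="c">y</div>'
)
print(html.xpath('//div[contains(@class,"class1")]/text()'))  # ['x']
print(html.xpath('//div[contains(@class,"a") and contains(@class,"b")]/text()'))  # ['y']
print(html.xpath('//div[contains(@class,"a") and @id="c"]/text()'))  # ['y']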

(screenshot: the HTML structure of the article header)
As the screenshot shows, the title block's structure is also simple.

head = selector.xpath('//header')[0]
header = head.xpath('h1/text()')
meta = head.xpath('div[contains(@class,"c-article-meta")]/p/text()')
time = head.xpath('div[contains(@class,"c-article-meta")]/time/text()')
header[0]
'What an Audacious Hoax Reveals About Academia'
meta
['Three scholars wrote 20 fake papers using fashionable jargon to argue for ridiculous conclusions.']
time
['  6:00 AM ET  ']

That covers the analysis process.

Writing the file

Here we write the Markdown directly:

def to_MarkDown(header, meta, time, p_list):
    # one Markdown file per article, named after the title
    with open('./《Atlantic》__{}.md'.format(header[0].strip()), 'w+', encoding='utf-8') as f:
        f.writelines('## {}'.format(header[0].strip()) + '\n')
        f.writelines('**{}**'.format(time[0].strip()) + '爬取自《The Atlantic》\n\n')
        f.writelines('> 导读:**{}**'.format(meta[0].strip()) + '\n\n')
        f.write('\n&nbsp;&nbsp;')
        for p in p_list:
            # p is a list of paragraph strings; indent each with &nbsp; entities
            f.write('\n\n&nbsp;&nbsp;'.join(p))
            f.write('\n\n&nbsp;&nbsp;')
    print('./《Atlantic》__{}.md | 写入成功'.format(header[0].strip()))

Run results

(screenshots: the generated Markdown files)
It'll do for now; I'm considering adding a translation API next time so the Chinese and English are written side by side.

for url in url_lists[:10]:
    get_MarkDown_by_url(url)

./《Atlantic》__Something Went Wrong in Chicago.md | 写入成功
Something Went Wrong in Chicago | is OK
./《Atlantic》__What Six Senators Said About Their Kavanaugh Votes.md | 写入成功
What Six Senators Said About Their Kavanaugh Votes | is OK
./《Atlantic》__The Senate’s Ill Winds Blow Across the Kavanaugh Confirmation.md | 写入成功
The Senate’s Ill Winds Blow Across the Kavanaugh Confirmation | is OK
./《Atlantic》__Daily: Scrutinized, Demonized, and Shrouded in Mystery.md | 写入成功
Daily: Scrutinized, Demonized, and Shrouded in Mystery | is OK
./《Atlantic》__Susan Collins Gambles With the Future of.md | 写入成功
Susan Collins Gambles With the Future of | is OK
./《Atlantic》__The Powerful, Pained Raps of Sacramento’s Mozzy.md | 写入成功
The Powerful, Pained Raps of Sacramento’s Mozzy | is OK
./《Atlantic》__Politics & Policy Daily: The Maine Event.md | 写入成功
Politics & Policy Daily: The Maine Event | is OK
./《Atlantic》__Susan Collins Says She Believes Survivors—Just Not Ford.md | 写入成功
Susan Collins Says She Believes Survivors—Just Not Ford | is OK
./《Atlantic》__Read Susan Collins’s Historic Floor Speech on Brett Kavanaugh.md | 写入成功
Read Susan Collins’s Historic Floor Speech on Brett Kavanaugh | is OK
./《Atlantic》__Harvard Admissions on Trial.md | 写入成功
Harvard Admissions on Trial | is OK

Update, October 10

While fetching articles today, I found a category of articles whose structure differs from what's described above (example: , click through to view it), but it's also simple to analyze and makes good practice. :-)

