Reddit web scraper using BeautifulSoup4 returns no results

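I tried to build a web scraper for Reddit's /r/all that collects the links of the top posts. I have been following part one of thenewboston's web crawler tutorial series on YouTube.

In my code, I removed the while loop that sets the number of pages to crawl in thenewboston's example (I only want the top 25 posts of /r/all, which is a single page). I made these changes to suit the purpose of my scraper.

I also changed the URL variable to 'http://www.reddit.com/r/all/' (for obvious reasons) and changed the iterable of the for loop to Soup.findAll('a', {'class': 'title may-blank loggedin'}) (title may-blank loggedin is the class of a post title on Reddit).

Here is my code: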

import requests
from bs4 import BeautifulSoup

def redditSpider():
    URL = 'http://www.reddit.com/r/all/'
    sourceCode = requests.get(URL)
    plainText = sourceCode.text
    Soup = BeautifulSoup(plainText)
    for link in Soup.findAll('a', {'class': 'title may-blank loggedin'}):
        href = 'http://www.reddit.com/r/all/' + link.get('href')
        print(href)

redditSpider()

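I did some amateur debugging by putting print statements between the lines of code, and it turns out the for loop is never executed.

To follow along or compare thenewboston's code with mine, skip to part two of his mini-series and find a point in the video where his code is shown.

Edit: thenewboston's code, as requested: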

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://buckysroom.org/trade/search.php?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'item-name'}):
            href = 'http://buckysroom.org' + link.get('href')
            print(href)
        page += 1

trade_spider(1)  # max_pages is required; 1 crawls a single page

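2. Solutions

Answer 1:

This isn't an exact answer, but I thought I'd let you know that there is a Reddit API wrapper for Python called PRAW (Python Reddit API Wrapper). You may want to look into it, since it can do what you are after in a simpler way.

Link: https://praw.readthedocs.org/en/v2.1.20/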

Answer 2:

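First, thenewboston's tutorial appears to be a screencast, so including the code in your question would be helpful.

Second, I'd recommend writing the page to a local file so you can open it in a browser and poke around with the developer tools to find what you want. I'd also suggest using ipython to run BeautifulSoup against the local file rather than scraping the site every time.

If you add this, you can do that: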

plainText = sourceCode.text
# Save a local copy so it can be opened in a browser and inspected.
with open('something.html', 'w', encoding='utf8') as f:
    f.write(plainText)
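
As a rough sketch of that workflow, you could save the response text once and then parse the local copy as often as you like. The helper names save_page and load_soup are made up for illustration, and the built-in 'html.parser' backend is just one possible choice:

```python
from bs4 import BeautifulSoup


def save_page(html_text, path):
    """Write downloaded HTML to a local file (hypothetical helper)."""
    with open(path, 'w', encoding='utf8') as f:
        f.write(html_text)


def load_soup(path):
    """Parse a previously saved HTML file without touching the network."""
    with open(path, encoding='utf8') as f:
        return BeautifulSoup(f.read(), 'html.parser')
```

After calling save_page(sourceCode.text, 'something.html') once, load_soup('something.html') can be rerun from ipython while you experiment with different findAll arguments.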

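When I ran your code, I first had to wait, because several times the page I got back was an error page saying I was making requests too often. That may be your first problem.

When I did get the page, there were plenty of links, but none of them matched your class. I couldn't tell what 'title may-blank loggedin' was supposed to be without watching the whole YouTube series.

Now I see the problem: that is the class used when you are logged in, and your scraper isn't logged in.

You shouldn't need to log in just to see /r/all. Just use this instead: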

soup.findAll('a', {'class': 'title may-blank '})
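
To see why the original search comes back empty, here is a small self-contained sketch. The HTML is made up to imitate a logged-out listing (it is not copied from Reddit), and find_all with class_ and select are offered as alternatives that don't depend on the exact class string:

```python
from bs4 import BeautifulSoup

# Made-up markup imitating a logged-out listing; not copied from Reddit.
SAMPLE_HTML = '''
<div>
  <a class="title may-blank " href="/r/pics/comments/1/">First post</a>
  <a class="title may-blank " href="http://example.com/story">Second post</a>
  <a class="comments may-blank" href="/r/pics/comments/1/">12 comments</a>
</div>
'''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The string from the question includes "loggedin", which is on none of
# these tags, so nothing matches and a for loop over it never runs.
logged_in = soup.find_all('a', {'class': 'title may-blank loggedin'})

# Searching for the single class "title" matches whatever other classes
# happen to be on the tag.
by_single_class = soup.find_all('a', class_='title')

# A CSS selector requires both classes but ignores order and extras.
by_selector = soup.select('a.title.may-blank')

print(len(logged_in), len(by_single_class), len(by_selector))
```

Matching on one class or on a selector tends to be less brittle than matching the full attribute string, since it keeps working if the site adds or removes an unrelated class.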

Answer 3:

You aren't "logged in", so that class is never applied. This code works without logging in:

import requests
from bs4 import BeautifulSoup

def redditSpider():
    URL = 'http://www.reddit.com/r/all'
    source = requests.get(URL)
    Soup = BeautifulSoup(source.text)
    for link in Soup.findAll('a', attrs={'class': 'title may-blank '}):
        href = 'http://www.reddit.com/r/all/' + link.get('href')
        print(href)

redditSpider()
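
One thing to watch: prefixing every href with the /r/all/ URL only gives a valid link if the href is a relative fragment. Assuming the page mixes site-relative paths and full external URLs (I haven't checked the current markup), a safer sketch is to resolve each href with urljoin from the standard library. The helper name absolute_link is made up for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.reddit.com/r/all/'


def absolute_link(href, base=BASE_URL):
    """Resolve href against base; absolute hrefs pass through unchanged."""
    return urljoin(base, href)


print(absolute_link('/r/pics/comments/abc/'))
print(absolute_link('http://example.com/story'))
```

Inside the loop this would replace the string concatenation with href = absolute_link(link.get('href')).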
