Selenium practice: scraping Zongheng Chinese Network (纵横中文网)

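Task

Use selenium to scrape any one novel from the Zongheng novel site
Get familiar with how selenium is used

(Note: this is only an exercise; scraping this site does not actually require selenium.)

Implementation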

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
import os
import time


def switch_pages(n):
    # Switch to the window at index n (0-based) in driver.window_handles
    handles = driver.window_handles
    driver.switch_to.window(handles[n])


def write_file(file, title, content):
    # Write the chapter title and body to a file
    with open(file, 'w', encoding='utf-8') as f:
        f.write('\t\t\t\t' + title + '\n\n')
        f.write(content)


def content_processing(content):
    # Join the text of every paragraph element, separated by blank lines
    return ''.join(c.text + '\n\n' for c in content)


with webdriver.Chrome() as driver:
    driver.implicitly_wait(10)                      # implicit wait
    # wait = WebDriverWait(driver, 10)
    driver.get("http://betawww.zongheng.com/")
    book = driver.find_element(By.XPATH, '//div[@class="bookname"]/a')
    novel_name = book.text
    os.makedirs(novel_name, exist_ok=True)          # create the output folder if it is missing
    print(novel_name)
    book.click()                # click the 1st book

    switch_pages(1)             # switch to the 2nd window

    driver.execute_script('document.querySelector(".all-catalog").click();')     # click "all chapters"

    switch_pages(2)             # switch to the 3rd window
    driver.find_element(By.XPATH, '//li[@class=" col-4"]/a').click()    # click the 1st chapter

    switch_pages(3)             # switch to the 4th window

    while True:
        title = driver.find_element(By.CLASS_NAME, "title_txtbox").text
        content = content_processing(driver.find_elements(By.XPATH, '//div[@class="content"]/p'))

        print(title, content)

        write_file(os.path.join(novel_name, title + '.txt'), title, content)    # write to file
        driver.find_element(By.CLASS_NAME, "nextchapter").click()               # click "next chapter"
        time.sleep(2)
        # Stop after the last chapter: the page then shows
        # "您已经读完最新一章" ("you have finished the latest chapter")
        end = driver.find_elements(By.TAG_NAME, "h4")
        if any(e.text == "您已经读完最新一章" for e in end):
            break
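
The script uses each chapter title directly as a file name, which may fail if a title contains characters the file system rejects. A minimal sketch of one way to guard against that is shown here; `safe_filename` is a hypothetical helper that is not part of the original script, and the set of replaced characters is an assumption based on what Windows forbids.

```python
import re

# Characters that Windows forbids in file names; "/" is also illegal on Linux and macOS.
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement="_"):
    """Hypothetical helper: make a chapter title usable as a file name."""
    cleaned = _ILLEGAL.sub(replacement, title).strip()
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"
```

It could be used when building the path, for example `os.path.join(novel_name, safe_filename(title) + '.txt')`.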

Result

[Image]

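Common problems

Selenium cannot find the element
Description: selenium opened another page in a new window, and it keeps reporting that the element cannot be found.

Solution: switch to the new window handle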
```
num = driver.window_handles
driver.switch_to.window(num[1])   # num[1] is the 2nd window
```
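
Hard-coded indexes such as `num[1]` assume the windows always open in the same order. As a hedged alternative, the sketch below records the handles before a click and then switches to whichever handle is new. `switch_to_new_window` is a hypothetical helper, not something from the original script; it relies only on `driver.window_handles` and `driver.switch_to.window`, and it assumes the new window has already opened when it is called.

```python
def switch_to_new_window(driver, old_handles):
    """Switch to the first window handle that is not in old_handles.

    Hypothetical helper; returns the new handle, or None if no new window exists.
    """
    for handle in driver.window_handles:
        if handle not in old_handles:
            driver.switch_to.window(handle)
            return handle
    return None
```

A possible usage: store `before = driver.window_handles`, call `book.click()`, then call `switch_to_new_window(driver, before)`.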
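Clicking a button has no effect
Description: the button element can be found, but clicking it does nothing.

Solution:
- Use a key press instead (if possible)
- Use JavaScript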
```
driver.execute_script('document.querySelector(".all-catalog").click();')
```
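
The JavaScript approach can also be applied to an element that selenium has already located, instead of repeating a CSS selector inside the script. The following is a minimal sketch; `js_click` is a hypothetical helper name, and it relies on `execute_script` passing its extra arguments to the script as `arguments[0]`, `arguments[1]`, and so on.

```python
def js_click(driver, element):
    """Click an already-located element through JavaScript.

    Hypothetical helper; `element` is a WebElement returned by selenium.
    """
    # Extra arguments to execute_script are exposed to the script as arguments[0], ...
    driver.execute_script("arguments[0].click();", element)
```

For example, `js_click(driver, driver.find_element(By.CLASS_NAME, "all-catalog"))` would be roughly equivalent to the `querySelector` call above.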
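Message: stale element reference: element is not attached to the page document
Cause: the next-page link was clicked, but the page change had not finished yet, so the script scraped the current page again. By the time it tried to click "next" once more, the page had refreshed, so the next-page element found earlier was stale and could not be clicked.

Solution:
- Add a delay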
```
time.sleep(2)
```
- Use an explicit wait
```
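from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# `driver` is the WebDriver instance created earlier
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='bookreadercontent']/p")))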
```
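
A fixed delay can be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the chapter title differs from the one just scraped. The sketch below is a hand-rolled stand-in for that idea, not selenium's own API; `wait_for_change` and its parameters are hypothetical names. It catches any exception from the read, on the assumption that the element may be missing or stale while the page loads.

```python
import time


def wait_for_change(read_value, old_value, timeout=10.0, poll=0.5):
    """Poll read_value() until it returns something other than old_value.

    Hypothetical helper. Returns the new value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = read_value()
        except Exception:
            # Assumption: a failed read means the page is still loading, so retry.
            value = old_value
        if value != old_value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not change within %.1f s" % timeout)
        time.sleep(poll)
```

After clicking "next chapter", it might be called as `wait_for_change(lambda: driver.find_element(By.CLASS_NAME, "title_txtbox").text, title)`. On the last chapter the title presumably never changes, so the `TimeoutError` would need to be caught there.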
