Crawling 300 lazily loaded product entries from JD.com with Selenium

1. Page analysis

First, consider the URL of the data to be crawled (the full search URL appears in the main function below). When the JD search page first opens, only the images currently visible on screen have been loaded; everything below the fold has not. Only when we scroll the page are the corresponding resources requested and loaded. So if we request the page source directly with the requests module, the source we get is bound to be incomplete.
In this case we can use Selenium to drive a browser and imitate human browsing by scrolling, so that once the browser has finished loading its resources we can grab the page source, which is then essentially complete.
This crawl collects 300 product entries: book titles, prices, and image download links. Once obtained, they are saved into a MySQL database.
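The functions in the following sections rely on a module-level Selenium driver and a handful of imports that are never shown explicitly. A minimal setup sketch, assuming Chrome with chromedriver on the PATH (Selenium 3 style, matching the find_element_by_xpath calls used below):

import random
from time import sleep

import pymysql
from lxml import etree
from selenium import webdriver

# Module-level driver shared by all functions below;
# assumes chromedriver is available on the PATH.
driver = webdriver.Chrome()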

2. Slide the scroll bar to get the full page source code

First pick a reference element for the scroll, then execute a snippet of JavaScript that scrolls this element into view. To make sure every image gets loaded, the page is divided into six parts here, and the page source is grabbed only after six successive scrolls.
The code is as follows:

def get_page_text(url):
    driver.get(url)
    sleep(random.random() * 3)    # random pause so the first batch of content can load
    # Scroll to every 10th product (items 10, 20, ..., 60) so all images get requested
    for i in range(10, 61, 10):
        sleep(1)
        target = driver.find_element_by_xpath(f'//*[@id="J_goodsList"]/ul/li[{i}]')
        driver.execute_script("arguments[0].scrollIntoView();", target)
    sleep(random.random())
    page_text = driver.page_source
    return page_text
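If no convenient anchor element exists, a common alternative to scrollIntoView (a sketch, not what this post uses) is to scroll down in equal fractions of the total page height:

def scroll_in_steps(steps=6, pause=1.0):
    # Scroll down the page in equal steps, pausing after each one
    # so lazy-loaded images have time to be requested.
    height = driver.execute_script("return document.body.scrollHeight")
    for i in range(1, steps + 1):
        driver.execute_script("window.scrollTo(0, arguments[0]);", height * i // steps)
        sleep(pause)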

3. Parse the source code to get the tag content

Now that we have the fully loaded page source, we can use the browser's developer tools to locate the product list: every product is an li element under the ul inside the container with id J_goodsList.
Save the obtained li tags to a list, then locate the book title, price, and image address inside each li.

Book title: the text inside the em element under div.p-name/a (screenshot omitted).
Book price: the text inside the i element under div.p-price/strong (screenshot omitted).
Image address: the src attribute (or data-lazy-img if the image has not loaded yet) of the img under div.p-img/a (screenshot omitted).
The code is as follows:

# Parse the page source and extract tag information
def parse_page(page_text):
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="J_goodsList"]/ul/li')
    book_name_list = []        # book titles
    book_price_list = []       # book prices
    book_imageUrl_list = []    # book image URLs
    for li in li_list:
        book_name_list.append(''.join(li.xpath('.//div[@class="p-name"]/a/em//text()')))
        book_price_list.append(li.xpath('.//div[@class="p-price"]/strong/i/text()')[0])

        # Because of network instability some images may not have finished loading,
        # so the src attribute may be missing
        image_url = li.xpath('.//div[@class="p-img"]/a/img/@src')
        if len(image_url) == 0:
            # If the image has not loaded yet, its URL is in the img tag's
            # data-lazy-img attribute instead
            image_url = li.xpath('.//div[@class="p-img"]/a/img/@data-lazy-img')
        book_imageUrl_list.append('http:' + image_url[0])
    return connectDB(book_name_list, book_price_list, book_imageUrl_list)
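The src / data-lazy-img fallback can be checked in isolation. A tiny self-contained demo, using simplified, hypothetical markup reconstructed from the XPaths above (real JD markup carries many more attributes):

from lxml import etree

# Simplified, hypothetical markup: one loaded image, one still-lazy image.
sample = '''
<ul>
  <li><div class="p-img"><a><img src="//img.example.com/loaded.jpg"></a></div></li>
  <li><div class="p-img"><a><img data-lazy-img="//img.example.com/lazy.jpg"></a></div></li>
</ul>
'''
tree = etree.HTML(sample)
for li in tree.xpath('//li'):
    url = li.xpath('.//div[@class="p-img"]/a/img/@src')
    if len(url) == 0:
        # Fall back to the lazy-load attribute
        url = li.xpath('.//div[@class="p-img"]/a/img/@data-lazy-img')
    print('http:' + url[0])
# http://img.example.com/loaded.jpg
# http://img.example.com/lazy.jpg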

4. Save the information to the database

The code is as follows:

# Connect to the database and save the data
def connectDB(book_name_list, book_price_list, book_imageUrl_list):
    conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                           password='password', db='books_jd', charset='utf8')
    # Create a cursor object
    cursor = conn.cursor()
    for i in range(len(book_name_list)):
        # Parameterized query, so quotes inside a book title cannot break the SQL
        sql = 'INSERT INTO books(name, price, image_url) VALUES (%s, %s, %s)'
        try:
            cursor.execute(sql, (book_name_list[i], book_price_list[i], book_imageUrl_list[i]))
            conn.commit()      # commit the transaction
        except Exception as e:
            print(e)
            conn.rollback()    # roll back the transaction
    cursor.close()
    conn.close()
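The code above assumes the books_jd database and its books table already exist; the schema itself is never shown. A plausible one-off setup sketch (the column types here are assumptions, not taken from the post):

# One-off setup: create the database and table that connectDB() writes to.
# Column types are assumptions.
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                       password='password', charset='utf8')
cursor = conn.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS books_jd CHARACTER SET utf8')
cursor.execute('''
    CREATE TABLE IF NOT EXISTS books_jd.books (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255),
        price VARCHAR(32),
        image_url VARCHAR(255)
    )
''')
conn.commit()
cursor.close()
conn.close()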

Main function code. In JD's search URL, each visible results page maps to an odd page value (page=1, 3, 5, ...) paired with an s offset; the five pairs below cover five pages of 60 items each, i.e. the 300 items in total:

if __name__ == '__main__':
    # s offsets that JD pairs with page values 1, 3, 5, 7, 9
    j = [1, 57, 117, 176, 236]
    for i in range(1, 10, 2):
        url = 'https://search.jd.com/Search?keyword=python&wq=python&page={0}&s={1}&click=0'.format(i, j[(i-1)//2])
        page_text = get_page_text(url)
        parse_page(page_text)
        sleep(random.random()*4)
        print('Page {} crawled successfully!'.format((i+1)//2))
    # Quit the browser and clear its cache
    driver.quit()

5. Running results

(Screenshots of the running results: the console output and the populated books table.)

Origin: blog.csdn.net/qq_43965708/article/details/109521728