Crawling JD's lazy-loaded product information with Selenium
1. Page analysis
First, the URL of the data to be crawled is a JD search results page. When the page first opens, only the images visible in the viewport are loaded; the content below the fold is not. The corresponding resources are requested and loaded only as we scroll down the page. So if we request the page source directly with the requests module, the source code we get back is necessarily incomplete.
Instead, we can use selenium to drive a real browser and simulate a human scrolling through the page. Once the browser has finished loading the resources, the page source we retrieve is essentially complete.
This crawl collects 300 product records: book titles, prices, and image download links. Once collected, they are saved to a MySQL database.
2. Slide the scroll bar to get the full page source code
First select a reference element for the scroll, then execute a snippet of JavaScript to scroll that element into view. To make sure every image gets loaded, the page is divided into six parts, and the page source is grabbed after six such scrolls.
The code is as follows:
import random
from time import sleep

import pymysql
from lxml import etree
from selenium import webdriver

# Assumes Chrome; use whichever browser/driver you have installed
driver = webdriver.Chrome()


def get_page_text(url):
    driver.get(url)
    sleep(random.random() * 3)
    # Scroll to every 10th li element so the whole product list is brought into view
    for i in range(10, 61, 10):
        sleep(1)
        target = driver.find_element_by_xpath(f'//*[@id="J_goodsList"]/ul/li[{i}]')
        driver.execute_script("arguments[0].scrollIntoView();", target)
        sleep(random.random())
    page_text = driver.page_source
    return page_text
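As a standalone check (no browser needed), the six XPath targets generated by the loop above can be listed directly; li indices 10 through 60 in steps of 10 cover the 60 products on one search page:

```python
# The six scroll targets used above: every 10th li out of the 60
# products on a JD search page.
targets = [f'//*[@id="J_goodsList"]/ul/li[{i}]' for i in range(10, 61, 10)]
for t in targets:
    print(t)
```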
3. Parse the source code to get the tag content
Now that we have the fully loaded page source, we can use the browser's developer tools to locate the product list, which looks roughly as follows.
Save the li tags to a list, then locate the book title, price, and image address inside each li.
Book title:
Book price:
Image address:
The code is as follows:
# Parse the page and extract the tag information
def parse_page(page_text):
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="J_goodsList"]/ul/li')
    book_name_list = []      # book titles
    book_price_list = []     # book prices
    book_imageUrl_list = []  # book image URLs
    for li in li_list:
        book_name_list.append(''.join(li.xpath('.//div[@class="p-name"]/a/em//text()')))
        book_price_list.append(li.xpath('.//div[@class="p-price"]/strong/i/text()')[0])
        # Because of network jitter some images may not have finished loading,
        # in which case the src attribute is missing
        image_url = li.xpath('.//div[@class="p-img"]/a/img/@src')
        if len(image_url) == 0:
            # For an image that has not loaded yet, the URL is stored in the
            # img tag's data-lazy-img attribute instead
            image_url = li.xpath('.//div[@class="p-img"]/a/img/@data-lazy-img')
        book_imageUrl_list.append('http:' + image_url[0])
    return connectDB(book_name_list, book_price_list, book_imageUrl_list)
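The XPaths above, including the data-lazy-img fallback, can be sanity-checked against a minimal HTML fragment that mimics JD's markup (the sample below is hypothetical; the second item simulates an image that has not been lazy-loaded, so its URL lives only in data-lazy-img):

```python
from lxml import etree

# Hypothetical stand-in for two product <li> entries in JD's markup
sample = '''
<div id="J_goodsList"><ul>
  <li>
    <div class="p-img"><a><img src="//img.example.com/a.jpg"/></a></div>
    <div class="p-price"><strong><i>59.00</i></strong></div>
    <div class="p-name"><a><em>Python <font>crawler</font> basics</em></a></div>
  </li>
  <li>
    <div class="p-img"><a><img data-lazy-img="//img.example.com/b.jpg"/></a></div>
    <div class="p-price"><strong><i>79.00</i></strong></div>
    <div class="p-name"><a><em>Advanced Python</em></a></div>
  </li>
</ul></div>
'''

tree = etree.HTML(sample)
names, prices, urls = [], [], []
for li in tree.xpath('//*[@id="J_goodsList"]/ul/li'):
    # em//text() joins the title fragments that JD splits across <font> tags
    names.append(''.join(li.xpath('.//div[@class="p-name"]/a/em//text()')))
    prices.append(li.xpath('.//div[@class="p-price"]/strong/i/text()')[0])
    # Fall back to data-lazy-img when src is absent
    src = (li.xpath('.//div[@class="p-img"]/a/img/@src')
           or li.xpath('.//div[@class="p-img"]/a/img/@data-lazy-img'))
    urls.append('http:' + src[0])

print(names, prices, urls)
```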
4. Save the information to the database
The code is as follows:
# Connect to the database and save the data
def connectDB(book_name_list, book_price_list, book_imageUrl_list):
    conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                           password='password', db='books_jd', charset='utf8')
    # Create a cursor object
    cursor = conn.cursor()
    # Parameterized query: book titles containing quotes would break an
    # INSERT assembled with str.format()
    sql = 'INSERT INTO books(name, price, image_url) VALUES (%s, %s, %s)'
    for i in range(len(book_name_list)):
        try:
            cursor.execute(sql, (book_name_list[i], book_price_list[i], book_imageUrl_list[i]))
            conn.commit()  # commit the transaction
        except Exception as e:
            print(e)
            conn.rollback()  # roll back the transaction
    cursor.close()
    conn.close()
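If no MySQL server is at hand, the same insert logic can be exercised with the standard-library sqlite3 module, assuming the same three-column table schema; this also illustrates why parameterized placeholders are safer than building the SQL with str.format():

```python
import sqlite3

# sqlite3 stand-in for the MySQL books table (assumed schema)
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE books(name TEXT, price TEXT, image_url TEXT)')

rows = [
    ('"Fluent" Python', '99.00', 'http://img.example.com/a.jpg'),
    ('Python basics', '59.00', 'http://img.example.com/b.jpg'),
]
# The quoted title in the first row would break a str.format()-built INSERT;
# placeholders handle it safely
cursor.executemany('INSERT INTO books(name, price, image_url) VALUES (?, ?, ?)', rows)
conn.commit()

saved = cursor.execute('SELECT name, price FROM books').fetchall()
print(saved)
cursor.close()
conn.close()
```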
Main function code:
if __name__ == '__main__':
    j = [1, 57, 117, 176, 236]
    for i in range(1, 10, 2):
        url = 'https://search.jd.com/Search?keyword=python&wq=python&page={0}&s={1}&click=0'.format(i, j[(i - 1) // 2])
        page_text = get_page_text(url)
        parse_page(page_text)
        sleep(random.random() * 4)
        print('Page {} crawled successfully!'.format((i + 1) // 2))
    # Quit and clean up the browser session
    driver.quit()
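The pagination scheme in the main loop is worth spelling out: JD's search uses an odd-numbered page parameter (1, 3, 5, 7, 9) for each visible result page, paired with a result offset s taken from the j list. The five URLs the loop visits can be reconstructed without a browser:

```python
# Rebuild the five search URLs from the main loop's pagination scheme
j = [1, 57, 117, 176, 236]
urls = [
    'https://search.jd.com/Search?keyword=python&wq=python&page={0}&s={1}&click=0'.format(i, j[(i - 1) // 2])
    for i in range(1, 10, 2)
]
for u in urls:
    print(u)
```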
5. Running results