The following content is original. Everyone is welcome to read and learn from it; commercial use is forbidden, and please credit the source when reprinting.
Hello everyone! I'm Yhen, and I've been learning Python for a month now. I'm very happy to share my learning experience with you here. As a beginner, I run into all kinds of bugs when I write code, so I'm sharing some of my experiences in the hope that they'll be useful to you!
Today I'll show you a special crawling technique: using selenium to scrape the One Piece pictures from Baidu Images. The full source code is at the end. Since the walkthrough is fairly detailed, this post may run long, so if you only want the result, you can skip straight to the source code at the back.
———————— Manual dividing line ————————————————
Alright, let's start today's sharing.
The other day I had nothing to do and wanted to crawl some anime pictures for fun.
But which anime should I crawl?
I opened Baidu Image Search, typed in "wallpaper",
and browsed through the anime column.
Hey ~ I've decided, it's you:
"One Piece"
I haven't actually watched much of this anime,
but I've heard plenty about its superb artwork,
and I believe many of you love this series.
The pictures in it are all really cool.
The search keywords in the URL: 壁纸 卡通动漫 海贼王 (wallpaper / cartoon & anime / One Piece)
And you know, there are far more than the few dozen pictures you see at first:
scrolling all the way to the bottom, there are
447 pictures in total.
Our goal today is to crawl all 447 of them.
Now that we have a goal, let's start analyzing.
Since what we're crawling is pictures, it's natural to think of the usual approach:
1. Send a request to the homepage to get the page data
2. Extract the data to get each picture's link
3. Send a request to each picture link to get the image data
4. Save the pictures locally
Isn't this the same as crawling emoticons last time? So easy, done in 10 minutes!
But... is it really that simple?
Come on, let me show you the usual way of crawling pictures.
First, a simple import and request:
# import the requests crawler library
import requests
# import the pyquery data-extraction library
from pyquery import PyQuery as pq
# homepage URL
url = "https://image.baidu.com/search/index?ct=&z=&tn=baiduimage&ipn=r&word=%E5%A3%81%E7%BA%B8%20%E5%8D%A1%E9%80%9A%E5%8A%A8%E6%BC%AB%20%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=0&istype=2&ie=utf-8&oe=utf-8&cl=&lm=-1&st=-1&fr=&fmq=1587020770329_R&ic=&se=&sme=&width=1920&height=1080&face=0&hd=&latest=&copyright="
# request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36"}
# request the homepage and take the response text
response = requests.get(url, headers=headers).text
print(response)
The data comes back normally.
Next, data extraction.
First, open the browser's inspector (F12)
and locate the first picture.
On the right you can see that the element's class is main_img img-hover, and its data-imgurl attribute holds a link: https://ss0.bdstatic.com/70cFvHSh_Q1YnxGkpoWK1HF6hhy/it/u=1296489273,320485179&fm=26&gp=0.jpg
Let's visit it: it turns out to be exactly the picture we're looking for.
So next we use pyquery to extract the data and see whether we can pull out that link.
Since this isn't today's focus, I'll show you the code directly.
If you want to know how to use pyquery,
check out my earlier blog posts.
# initialize the document
doc = pq(response)
# extract via the class selector main_img img-hover -- note: the space in the class is replaced with a dot
main_img = doc(".main_img.img-hover").text()
print(main_img)
Print it and see whether we get the data we want...
Oh no, why is there nothing?!
After confirming there's no problem with the code we wrote,
my first reaction was: we got anti-crawled!!!
No matter: if it's anti-crawling, we just add a few more parameters to the request headers.
I added the accepted data type, user info, and anti-hotlinking referer to the headers.
# request headers
headers = {
    # browser type
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36",
    # accepted data type
    "Accept": "application/json, text/javascript, */*; q=0.01",
    # user info
    "Cookie": "BIDUPSID=19D65DF48337FDD785B388B0DF53C923; PSTM=1585231725; BAIDUID=19D65DF48337FDD770FCA7C7FB5EE199:FG=1; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; indexPageSugList=%5B%22%E9%AB%98%E6%B8%85%E5%A3%81%E7%BA%B8%22%2C%22%E5%A3%81%E7%BA%B8%22%5D; delPer=0; PSINO=1; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BDRCVFR[tox4WRQ4-Km]=mk3SLVN4HKm; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; BCLID=8092759760795831765; BDSFRCVID=KH_OJeC62A1E9y7u9Ovg2mkxL2uBKEJTH6aoBC3ekpDdtYkQoCaWEG0PoM8g0KubBuN4ogKK3gOTH4AF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tJCHoK_MfCD3HJbpq45HMt00qxby26niWNO9aJ5nJDoNhqKw2jJhef4BbN5LabvrtjTGah5FQpP-HJ7tLTbqMn8vbhOkahoy0K6UKl0MLn7Ybb0xynoDLRLNjMnMBMPe52OnaIbp3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDcnK4-XD653jN3P; ZD_ENTRY=baidu; H_PS_PSSID=30963_1440_21081_31342_30824_26350_31164",
    # anti-hotlinking referer
    "Referer": "https://image.baidu.com/search/index?ct=&z=&tn=baiduimage&ipn=r&word=%E5%A3%81%E7%BA%B8%20%E5%8D%A1%E9%80%9A%E5%8A%A8%E6%BC%AB%20%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=0&istype=2&ie=utf-8&oe=utf-8&cl=&lm=-1&st=-1&fr=&fmq=1587020770329_R&ic=&se=&sme=&width=1920&height=1080&face=0&hd=&latest=&copyright="
}
Request again and see whether we can get the data we want.
......
Still... nothing?
Time to curl up in despair...
But it's fine: how could a setback like this stop me!!!
Let's analyze it in reverse.
We used the class selector main_img img-hover to locate the data,
but we got nothing, and this happened
even though the code is correct and we weren't being anti-crawled.
Then
...there is only one truth!
The homepage data we requested never contained the main_img img-hover class selector at all!!!
Let's verify whether this homepage data is the culprit.
First print the requested homepage data, then search it for main_img img-hover.
The search box turns red: no matches at all.
So this page was fooling me all along and simply not giving me the data!
Which means the homepage content is rendered dynamically, and its data interface is not the homepage URL!
In this situation there are two solutions:
1. Hunt through the vast pile of network requests to find the interface that actually serves the homepage data
2. Use selenium to request the page directly
I don't know which you'd pick, but making me dig through that much data for an interface: I refuse!!! Far too much effort.
Isn't selenium so much nicer?
Why?
Because when the webpage is opened with selenium, all the content gets loaded into Elements, and then the dynamic page can be crawled with the same methods as a static page.
In other words,
as long as selenium makes the request to the homepage, the data we get back is the same source code we see in the console after pressing F12! No need to painstakingly hunt for an interface!
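To make the difference concrete, here is a small self-contained illustration (with made-up HTML stand-ins, not real responses from Baidu) of why searching the static response for the class name finds nothing, while the browser-rendered source contains it:

```python
def selector_present(html: str, class_names: str) -> bool:
    """Crude check: does every class name appear somewhere in the raw HTML?"""
    return all(name in html for name in class_names.split())

# Hypothetical stand-in for what requests.get() returns: a shell page
# whose content is filled in later by JavaScript
static_html = "<html><body><script>/* images loaded by JS */</script></body></html>"
# Hypothetical stand-in for driver.page_source after the browser has rendered
rendered_html = '<img class="main_img img-hover" data-imgurl="...">'

print(selector_present(static_html, "main_img img-hover"))    # False
print(selector_present(rendered_html, "main_img img-hover"))  # True
```

The static HTML is a near-empty shell, so no amount of header tweaking on the requests side would have made the selector match.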
As for selenium itself, its most common use is actually automation: it can launch a browser and simulate user actions, such as automatic login, automatic page turning, and so on.
If you want to know more, you can refer to this Chinese translation of the documentation:
https://selenium-python-zh.readthedocs.io/en/latest/
OK, let's just do it.
First come the imports,
then the browser configuration. Selenium has a visible mode (it opens a real browser, and you can watch it operate) and a headless mode (it runs in the background, invisible).
We'll use headless mode today: when the focus is the crawler, you don't need to watch the browser operate, and opening a window on every run is annoying and eats memory.
Another important point: to use selenium, you must first install the webdriver matching your browser and put it in the same path as your .py file, so that selenium can drive the browser.
For download links for the various browser drivers, see
https://www.jianshu.com/p/6185f07f46d4
The webdriver goes in the same folder as the .py file.
Below is selenium's headless-mode configuration. It is
a bit fiddly, but it's pure boilerplate; just get familiar with it.
from selenium import webdriver  # the webdriver module
from selenium.webdriver.chrome.options import Options  # the Options class
chrome_options = Options()  # instantiate Options
chrome_options.add_argument('--headless')  # launch the browser in headless mode
driver = webdriver.Chrome(options=chrome_options)  # use Chrome as the browser engine
If you ask me why each step is set up this way, honestly I don't know; if you want the details, go read the documentation.
But only headless mode is this fiddly; visible mode takes just two or three lines of code.
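For comparison, a minimal visible-mode setup might look like this (a sketch, assuming chromedriver sits next to the script or on your PATH; run it yourself to watch the window open):

```python
from selenium import webdriver

# Visible mode: no Options needed at all -- a real Chrome window opens,
# so you can watch every click and scroll happen
driver = webdriver.Chrome()
driver.get("https://image.baidu.com")
driver.quit()  # close the browser when done
```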
With that configured,
we can use selenium to send the request.
# request the homepage
driver.get('https://image.baidu.com/search/index?ct=&z=&tn=baiduimage&ipn=r&word=%E5%A3%81%E7%BA%B8%20%E5%8D%A1%E9%80%9A%E5%8A%A8%E6%BC%AB%20%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=0&istype=2&ie=utf-8&oe=utf-8&cl=&lm=-1&st=-1&fr=&fmq=1587020770329_R&ic=&se=&sme=&width=1920&height=1080&face=0&hd=&latest=&copyright=')
# get the page source
response = driver.page_source
Note that to get the page source we use driver.page_source,
which hands the data back to us directly as a string.
Let's print the returned data.
Success: we got it.
Now let's search the returned data for main_img img-hover again
and see whether this time we really have what we want.
Ta-da!
This time there's definitely no problem: the image URLs we want are right here.
Then we can
use pyquery to extract the image URLs.
First initialize the document, then extract the elements through the class selector, then iterate and pull each image link out of the "data-imgurl" attribute.
The code is as follows:
from pyquery import PyQuery as pq
# initialize the document
doc = pq(response)
# extract elements via the class selector
x = doc(".main_img.img-hover").items()
# iterate over the elements
for main_img in x:
    # take the image link from the "data-imgurl" attribute
    image_url = main_img.attr("data-imgurl")
    print(image_url)
After all that hard work, I
finally got the picture URLs.
It wasn't easy,
so I excitedly printed them out and
found...
WHAT???
The page just had more than 400 pictures,
and you return me 20 URLs??? Did you eat the rest????
Interesting. As the saying goes, "as virtue rises one foot, the devil rises ten": solve one problem and a bigger one appears.
I went back to the webpage.
In the source we just fetched, I pressed ctrl + F to open the search function
and searched for main_img img-hover:
only 20 matches.
Then I immediately realized that the four-hundred-plus pictures earlier only appeared because we kept scrolling down.
So most likely the initial page simply hasn't loaded everything; to get the remaining URLs, we must make selenium drive the browser to keep scrolling down.
How do we do that?
I had no idea hahahaha,
but I know Baidu: let's ask it.
I found a CSDN article describing how to use selenium to simulate scrolling to the bottom of the page.
Original link:
https://blog.csdn.net/weixin_43632109/article/details/86797701
Be aware, though, that scrolling to the bottom once is not enough to load all 400+ pictures. Selenium must scroll to the bottom many times before grabbing the source; only then do we get all the data.
So how do we perform the operation multiple times?
I set up a for loop.
First, look at the code:
import time
# scroll to the bottom of the page 25 times
for a in range(25):
    # move the scrollbar to the bottom of the page
    js = "var q=document.documentElement.scrollTop=1000000"
    driver.execute_script(js)
    # wait one second so the page data has time to load
    time.sleep(1)
I set the loop count to 25.
Why 25?
Because I tested it myself beforehand: exactly 25 scrolls are enough to reach the last picture.
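The fixed count of 25 worked for this particular page, but a more robust variant (my own sketch, not part of the original tutorial) is to keep scrolling until the page height stops growing, which adapts to any number of pictures:

```python
import time

def scroll_to_end(driver, pause=1.0, max_rounds=50):
    """Scroll down until document.body.scrollHeight stops changing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded images time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared, so we are at the real bottom
        last_height = new_height
```

Here `max_rounds` is just a safety cap so an endlessly growing page cannot loop forever.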
Now let's take a look at the picture links we get.
Clearly we got a lot more data this time.
I clicked the last link,
and as you can see, it really is the link to the last picture on the page,
so we've successfully obtained all the picture links.
The next step is to send a request to each of these picture links, get the data, and save it locally.
import requests
# initialize count to 0
count = 0
# iterate over the elements
for main_img in x:
    # take the image link from the "data-imgurl" attribute
    image_url = main_img.attr("data-imgurl")
    # request the image link to get the image data
    image = requests.get(image_url)
    # save as a .jpg in the 海贼王图片下载 (One Piece downloads) folder; "wb" = write binary
    f = open("海贼王图片下载/" + "{}.jpg".format(count), "wb")
    # write the fetched data; .content gives the raw bytes
    f.write(image.content)
    # close the file
    f.close()
    # i.e. count = count + 1
    count += 1
To send the requests we again use our classic crawler library, requests.
First request the image data,
then save it as a .jpg file in the One Piece download folder, opening the file with "wb" (w for write, b for binary mode).
Write the fetched data into it,
and finally close the file.
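As a side note, a slightly safer version of this save step (my own sketch, not the original code) creates the folder if it doesn't exist yet and uses a `with` block so the file is closed automatically even on errors:

```python
import os

def save_image(data: bytes, folder: str, count: int) -> str:
    """Write raw image bytes to <folder>/<count>.jpg, creating the folder."""
    os.makedirs(folder, exist_ok=True)   # avoid FileNotFoundError on first run
    path = os.path.join(folder, "{}.jpg".format(count))
    with open(path, "wb") as f:          # "wb": write binary, auto-closed
        f.write(data)
    return path

# usage inside the loop (assuming `image` is the requests.Response):
# save_image(image.content, "海贼王图片下载", count)
```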
After running the program,
let's see whether all the pictures came down.
Perfect! All 447 pictures on the page were downloaded successfully.
Throw the flowers, we're done!
Finally, here is the full source code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
from pyquery import PyQuery as pq
import requests

# instantiate an Options object
chrome_options = Options()
# put the browser in headless mode
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
# request the homepage
driver.get('https://image.baidu.com/search/index?ct=&z=&tn=baiduimage&ipn=r&word=%E5%A3%81%E7%BA%B8%20%E5%8D%A1%E9%80%9A%E5%8A%A8%E6%BC%AB%20%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=0&istype=2&ie=utf-8&oe=utf-8&cl=&lm=-1&st=-1&fr=&fmq=1587020770329_R&ic=&se=&sme=&width=1920&height=1080&face=0&hd=&latest=&copyright=')
# scroll to the bottom of the page 25 times
for a in range(25):
    # move the scrollbar to the bottom of the page
    js = "var q=document.documentElement.scrollTop=1000000"
    driver.execute_script(js)
    # wait one second so the page data has time to load
    time.sleep(1)
# get the page source
response = driver.page_source
# print(response)
# initialize the document
doc = pq(response)
# extract elements via the class selector
x = doc(".main_img.img-hover").items()
# initialize count to 0
count = 0
# iterate over the elements
for main_img in x:
    # take the image link from the "data-imgurl" attribute
    image_url = main_img.attr("data-imgurl")
    # request the image link to get the image data
    image = requests.get(image_url)
    # save as a .jpg in the 海贼王图片下载 folder; "wb" = write binary
    f = open("海贼王图片下载/" + "{}.jpg".format(count), "wb")
    # write the fetched data; .content gives the raw bytes
    f.write(image.content)
    # close the file
    f.close()
    # i.e. count = count + 1
    count += 1
Now for a bit of my rambling.
[Yhen said]
Long time no see, everyone. For various reasons I haven't posted a crawler article in a while. After reviewing my last article at noon the day before yesterday, I spent the afternoon thinking about what to write next, and it suddenly occurred to me to try crawling Baidu's pictures, so I gave it a shot myself. I thought it would go smoothly, that the ordinary crawler method described above would just work. This time the whole approach was my own, rather than following a teacher as before. The teacher at Six Star also has a tutorial on crawling Baidu pictures, but I deliberately didn't watch it before writing this article; I wanted to see whether I could complete a project independently. After finishing, I looked up the teacher's video and found he used the find-the-interface method, which I think is a clever alternative. Even after switching to selenium, the incomplete-data problem still appeared and troubled me for a while, but after some research I solved it. Isn't learning exactly this process of constantly discovering problems and then solving them? When I finally crawled down all 447 pictures, I felt very fulfilled, hahaha. So I hope you'll give it a try when you're free: within the bounds of the site's rules, use crawlers to do things you're interested in. You may hit many setbacks along the way, but only you will know how excited you are when you succeed. It also proves that you can really use Python after learning it, and that your time wasn't wasted, right? Keep it up!
Yesterday I actually found the article about crawling novels that I wrote reposted on some unknown website... without any attribution. I'm getting ready to contact CSDN and the relevant people to sort it out. So let me repeat: everyone is welcome to read my articles and learn from them, but please credit the source when reprinting, and commercial use is prohibited! Thank you for your cooperation.
I'm very happy to share my experience with you here, and I hope it helps. If there's anything you don't understand, or any suggestions for me, please leave a message in the comments!
If you feel this post has helped you, please give it a little like, and a follow would be even better. Your support is my motivation; I'll share more experience with everyone in the future.
I'm Yhen. See you next time!
[Review of previous articles]
[Crawler] Yhen takes you by the hand through crawling with Python