Python Crawler for Beginners (2): Preparation (Part 1): Installing the Basic Libraries

Life is short, I use Python

Previous posts in this series:

Python Crawler for Beginners (1): Introduction

This post is on the longer side, so feel free to bookmark it and come back to it later.

Before we start talking about crawlers, let's get the environment set up first. As the saying goes, to do a good job, one must first sharpen one's tools.

This article introduces the request libraries and parsing libraries used in Python crawling: request libraries fetch the target content, and parsing libraries extract data from the content that comes back.

Development environment

First, my local development environment:

  • Python3.7.4
  • win10

That's about it for the basics. Everything else we need, we'll install one by one, starting now.

Request libraries

Although Python ships with a built-in HTTP request library, urllib, its API is not very elegant. Many third-party HTTP libraries are simpler and nicer to use, so let's start with those.

Requests

Requests is a third-party library for sending synchronous HTTP requests. Compared with Python's built-in urllib, it is far more convenient and concise.

Python provides the package management tool pip, which makes installation very easy. The install command is as follows:

pip install requests

Verification:

C:\Users\inwsy>python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests

First, enter python at the CMD prompt to start the Python interactive shell, then type import requests. If no error appears, the Requests library was installed successfully.
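As a quick taste of the library, here is a minimal sketch of fetching a page with Requests. The URL is just a placeholder and the helper name is mine; we'll cover real usage in later posts.

```python
import requests

def fetch(url):
    """Send a GET request and return the status code and body text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.status_code, response.text

if __name__ == "__main__":
    status, html = fetch("https://example.com")
    print(status)      # 200 if the request succeeded
    print(html[:60])   # the first 60 characters of the page
```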

Selenium

Selenium is best known these days as an automated testing tool, and there is no shortage of books about it. At the same time, we can also use it as a crawling tool. After all, automated testing means driving the browser to perform whatever actions we want, such as clicking a button or scrolling the page, which makes it very convenient for simulating a real user.

The install command is as follows:

pip install selenium

Verification:

C:\Users\inwsy>python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import selenium

Again, no error means the installation is complete. But surely a bare import isn't all there is to it? Read on.

ChromeDriver

For selenium to work, we also need a browser for it to drive. We're developers, after all, so the mainstream options are Chrome and Firefox; if you were about to suggest IE or the 360 browser, please don't.

I won't cover installing the Chrome browser itself here.

Next, we install ChromeDriver. Once ChromeDriver is in place, the selenium we just installed can drive Chrome to perform all kinds of operations.

First, check your Chrome version: click the three dots in the top-right corner of Chrome, then Help -> About Google Chrome, as shown below:

Note down this version number. Mine is: Version 78.0.3904.97 (Official Build) (64-bit).

Next, go to the ChromeDriver official site to find the driver version corresponding to your Chrome.

Official website address: https://sites.google.com/a/chromium.org/chromedriver/

For well-known reasons, reaching that site from China may require some extra means. If you can't access it, here is a version correspondence table I prepared for you:

ChromeDriver Version Chrome Version
78.0.3904.11 78
77.0.3865.40 77
77.0.3865.10 77
76.0.3809.126 76
76.0.3809.68 76
76.0.3809.25 76
76.0.3809.12 76
75.0.3770.90 75
75.0.3770.8 75
74.0.3729.6 74
73.0.3683.68 73
72.0.3626.69 72
2.46 71-73
2.45 70-72
2.44 69-71
2.43 69-71
2.42 68-70
2.41 67-69
2.40 66-68
2.39 66-68
2.38 65-67
2.37 64-66
2.36 63-65
2.35 62-64

I also found a domestic mirror site for downloading ChromeDriver, provided by Taobao:

http://npm.taobao.org/mirrors/chromedriver

My minor version doesn't match exactly, but as long as the major version matches there should be no problem. So head to the mirror site and download the corresponding version; I downloaded 78.0.3904.70, the last minor release of ChromeDriver 78.

Once downloaded, move the executable chromedriver.exe into the Scripts directory under your Python installation directory. If you used the default installation path, that is %homepath%\AppData\Local\Programs\Python\Python37\Scripts; if you changed it, you're on your own.

After adding chromedriver.exe, the directory looks like this:

Verification:

Again at the CMD prompt, enter the following:

C:\Users\inwsy>python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from selenium import webdriver
>>> browser = webdriver.Chrome()

If Chrome opens with a blank page, the installation succeeded.

GeckoDriver

Above, by installing ChromeDriver, we connected Selenium with Chrome. To connect Selenium with Firefox, we need to install a different driver: GeckoDriver.

I won't cover installing Firefox here either; it's best to download it from the official site:

Firefox official site: http://www.firefox.com.cn/

GeckoDriver has to be downloaded from GitHub. I've already found the download page for you; just pick the latest version from the releases.

Download: https://github.com/mozilla/geckodriver/releases

Choose the build matching your own environment. I picked win64, version v0.26.0.

The setup is the same as above: put the .exe executable into the %homepath%\AppData\Local\Programs\Python\Python37\Scripts directory.

Verification:

Again at the CMD prompt, enter the following:

C:\Users\inwsy>python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from selenium import webdriver
>>> browser = webdriver.Firefox()

If everything is normal, Firefox should open with a blank page. The result looks like this:

Note: the current version of GeckoDriver has a known bug on Windows; you need to install a Microsoft runtime to resolve it. The release notes read as follows:

You must still have the Microsoft Visual Studio redistributable runtime installed on your system for the binary to run. This is a known bug which we weren't able fix for this release.

Runtime download: https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads

Choose the version matching your own system, then download and install it.

Aiohttp

Above we covered Requests, a synchronous HTTP request library; Aiohttp, by contrast, provides asynchronous HTTP requests.

So, what is a synchronous request, and what is an asynchronous request?

  • Synchronous: blocking. Simply put, after a request is sent, the program waits for the response to that request, and only continues with the next step once the response arrives.
  • Asynchronous: non-blocking. In the same example, after a request is sent, the program does not stop and wait for the response; it can go off and do other things in the meantime.

In terms of resource consumption and efficiency, asynchronous requests clearly beat synchronous ones, which is why asynchronous requests achieve higher throughput. Using asynchronous requests when scraping data can greatly improve crawling efficiency.

If you want to learn more about aiohttp, see the official documentation: https://aiohttp.readthedocs.io/en/stable/.

Install aiohttp as follows:

pip install aiohttp

aiohttp also recommends installing two companion libraries: cchardet, a character-encoding detection library, and aiodns, a library that speeds up DNS resolution.

Install the cchardet library:

pip install cchardet

Install the aiodns library:

pip install aiodns

aiohttp thoughtfully provides a combined install command, so you don't have to type each one separately:

pip install aiohttp[speedups]

Verification:

C:\Users\inwsy>python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import aiohttp

If no error appears, the installation succeeded.
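To make the synchronous/asynchronous difference concrete, here is a minimal sketch of fetching several pages concurrently with aiohttp. The URLs are placeholders and the helper names are mine; this is just one way to structure it.

```python
import asyncio
import aiohttp

async def fetch(session, url):
    """Fetch a single URL and return its body text."""
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    # One session is reused for all requests; the fetches run concurrently.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    pages = asyncio.run(main(["https://example.com", "https://example.org"]))
    for page in pages:
        print(len(page))  # length of each downloaded page
```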

Parsing library

lxml

lxml is a Python parsing library that supports parsing both HTML and XML, supports the XPath query language, and is very efficient.

What is XPath?

XPath (XML Path Language) is a language for addressing parts of an XML document.
XPath is based on the tree structure of XML documents and provides the ability to navigate the tree and select nodes. XPath was originally proposed as a common syntax model shared between XPointer and XSL.

Source: Baidu Encyclopedia

In my own words: it's a path language that lets you quickly locate the content you want in an XML or HTML document.

Still not clear? No worries: with XPath we can quickly extract the values we want from an XML or HTML document. We'll cover usage in detail later.

Install the lxml library:

pip install lxml

Verification:

C:\Users\inwsy>python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml

If no error appears, the installation succeeded.
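To give you a first taste of XPath before the detailed lessons, here is a small self-contained example; the HTML snippet is made up for illustration.

```python
from lxml import etree

html = """
<html><body>
  <ul id="menu">
    <li class="item"><a href="/a">First</a></li>
    <li class="item"><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

# Parse the HTML string into an element tree
tree = etree.HTML(html)

# XPath: select the text of every <a> inside the list with id="menu"
links = tree.xpath('//ul[@id="menu"]/li/a/text()')
print(links)  # ['First', 'Second']

# XPath: select the href attribute of the second <li>'s link
href = tree.xpath('//ul[@id="menu"]/li[2]/a/@href')
print(href)   # ['/b']
```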

Beautiful Soup

Beautiful Soup is another Python library for parsing HTML and XML. Its parsing capabilities are powerful, and it makes extracting data from HTML documents very convenient.

First, here is Beautiful Soup's official site; whatever questions you run into, you can consult the documentation there. Get into the good habit of checking the official docs whenever you hit a problem. Yes, they're in English, but with Chrome's built-in translation feature you can just about manage.

Official Website: https://www.crummy.com/software/BeautifulSoup/

Installation, again with pip:

pip install beautifulsoup4

Beautiful Soup relies on an underlying parser for HTML and XML, and lxml is the usual choice, so make sure you have successfully installed the lxml library first.

Verification:

C:\Users\inwsy>python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup

If no error appears, the installation succeeded.
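Here is a minimal sketch of extracting data with Beautiful Soup, using lxml as the parser; the HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="post">
    <h1>Hello Soup</h1>
    <p class="intro">A short intro.</p>
    <p>Another paragraph.</p>
  </div>
</body></html>
"""

# Parse with the lxml parser installed earlier
soup = BeautifulSoup(html, "lxml")

title = soup.h1.get_text()              # text of the first <h1>
intro = soup.find("p", class_="intro")  # first <p> with class "intro"
paragraphs = [p.get_text() for p in soup.find_all("p")]

print(title)             # Hello Soup
print(intro.get_text())  # A short intro.
print(paragraphs)        # ['A short intro.', 'Another paragraph.']
```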

pyquery

pyquery is also a web page parsing library. What sets it apart from the previous two is that it provides a jQuery-like syntax for parsing HTML documents; students with front-end experience should take to this library very quickly.

As usual, here is the address of pyquery's official documentation first.

The official document: https://pyquery.readthedocs.io/en/latest/

Installation:

pip install pyquery

Verification:

C:\Users\inwsy>python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyquery

If no error appears, the installation succeeded.

That's it for this post. Please install everything described above on your own machine, ready for the lessons that follow.


Origin www.cnblogs.com/babycomeon/p/11909567.html