How to get started with Python crawlers? A detailed tutorial

Following my own habits and understanding, I will use the most concise wording to introduce the definition, components, and workflow of a crawler, and walk through sample code.

Basics

Definition of a crawler: a program that fetches Internet content (mostly web pages) in a targeted way and processes the data automatically. It is mainly used to collect and structure loose, massive amounts of information, providing raw material for data analysis and mining.

Toutiao ("Today's Headlines"), for example, is essentially one huge crawler.

The crawler consists of a URL library, a collector, and a parser.

Process

As long as the URL library to be crawled is not empty, the collector keeps fetching the corresponding pages and hands the results to the parser; the parser extracts the target content and writes it to a file or a database.
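A minimal sketch of this loop, with illustrative names for the three components (not code from the original post):

```python
# Illustrative sketch of the crawl loop: while the URL library is not empty,
# collect each page, parse it, and write the extracted records to a file.
def crawl(url_library, collector, parser, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        while url_library:                      # URL library not empty yet
            url = url_library.pop()             # take one URL to crawl
            html = collector(url)               # collector fetches the page
            if html is None:                    # fetch failed, skip this URL
                continue
            for record in parser(html):         # parser extracts the target content
                f.write(str(record) + "\n")     # write to a file (or a database)
```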

The code

Step 1: Write a Collector

The following is a relatively simple collector function; it relies on the requests library.
First, construct an HTTP header containing information such as the browser and operating system. Without this forged header, the request may be flagged as bot traffic by the target site's WAF or other protective devices and blocked.

Then use the get method of the requests library to fetch the URL. If the HTTP response code is 200 OK, the page was accessed normally, and the function returns the HTML content as text.

If the response code is not 200 OK, the page could not be accessed normally, and the function returns a special string or code instead.
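A minimal sketch along these lines, assuming `None` as the "special" return value for failed requests (not the author's exact code):

```python
import requests

def collector(url):
    """Fetch one page and return its HTML text, or None if the request fails."""
    # Forge a browser-like header so the request is less likely to be
    # flagged as a bot by the target site's WAF or similar protection.
    headers = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0 Safari/537.36")
    }
    try:
        resp = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException:
        return None                      # network error: "special" return value
    if resp.status_code == 200:          # 200 OK: page accessed normally
        return resp.text                 # HTML content as text
    return None                          # any other status: "special" return value
```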
Step 2: Parser

The parser's job is to filter the HTML returned by the collector and extract the required content.
As a loyal Douban user of 14 years, of course I will use Douban as the example.

We plan to crawl 8 parameters for the movies in the Douban Top 250 ranking: rank, movie URL, movie name, director, release year, country, genre, and rating, then organize them into a dictionary and write them to a text file.

The list to be crawled spans 10 pages, with 25 movies per page.
Here we must praise Douban's front-end engineers: the HTML tags are laid out neatly and hierarchically, which makes information extraction very convenient.

The reference HTML is the block corresponding to "The Shawshank Redemption"; the 8 parameters to be extracted all appear in its tags. Write the parser function against this HTML to extract the 8 fields; the function's return value is an iterable sequence.
I personally like to use re (regular expressions) to extract content: the 8 (.*?) groups correspond exactly to the fields that need to be extracted.
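A parser along these lines might look like the sketch below. The pattern is only illustrative: Douban's markup changes over time, so the 8 `(.*?)` groups have to be adapted to the actual HTML of the page being crawled.

```python
import re

# Illustrative pattern with 8 (.*?) groups: rank, movie URL, name, director,
# year, country, genre, rating. Adapt it to the live Douban markup before use.
PATTERN = re.compile(
    r'<em class="">(.*?)</em>.*?'               # 1 rank
    r'<a href="(.*?)">.*?'                      # 2 movie URL
    r'<span class="title">(.*?)</span>.*?'      # 3 movie name
    r'导演:\s*(.*?)&nbsp;.*?'                    # 4 director
    r'<br>\s*(.*?)&nbsp;/&nbsp;'                 # 5 release year
    r'(.*?)&nbsp;/&nbsp;'                        # 6 country
    r'(.*?)\s*</p>.*?'                           # 7 genre
    r'property="v:average">(.*?)</span>',        # 8 rating
    re.S,
)

def parser(html):
    """Yield one dict with the 8 fields for every movie matched in the page."""
    keys = ("rank", "url", "name", "director", "year", "country", "genre", "rating")
    for groups in PATTERN.findall(html):
        yield dict(zip(keys, (g.strip() for g in groups)))
```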
Each match yields the 8 extracted fields for one movie, assembled into a dictionary.
Putting it together into a complete program (ignoring fault tolerance for the time being):
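A sketch, assuming the `collector` and `parser` functions defined above are in scope and that the list pages follow the usual `https://movie.douban.com/top250?start=N` pattern (25 movies per page, 10 pages):

```python
# Putting the pieces together; no fault tolerance yet, as noted above.
def main():
    with open("douban_top250.txt", "w", encoding="utf-8") as f:
        for start in range(0, 250, 25):                   # 10 pages of 25 movies each
            url = f"https://movie.douban.com/top250?start={start}"
            html = collector(url)
            if html is None:                              # skip pages that failed to load
                continue
            for movie in parser(html):                    # one dict per movie
                f.write(str(movie) + "\n")                # write the dictionary to the text file

if __name__ == "__main__":
    main()
```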
The code is very concise, in keeping with Python's simple and efficient character.

1. Python learning routes for all directions

The technical points of every Python direction are organized into a summary of knowledge points by field. Its usefulness is that you can find matching learning resources for each point and make sure your study is comprehensive.

2. Essential development tools for Python


3. Excellent Python learning books

Once you have some foundation and can understand things on your own, read books or notes compiled by those who came before you. These record their detailed understanding of particular technical points; the perspectives are often distinctive and can teach you a different way of thinking.

4. Python video collection

Watching beginner-oriented videos is the fastest and most effective way to learn. By following the teacher's train of thought in the videos, it is easy to move from the basics to in-depth topics.

5. Practical cases

Theory alone is useless; you have to follow along and do it yourself to put what you have learned into practice. At this stage, practical case studies are a good way to learn.

6. Python exercises

Check the learning results.

7. Interview information

Many of us learn Python in order to find a high-paying job. The following interview questions are the latest interview materials from first-tier Internet companies such as Alibaba, Tencent, and ByteDance, with authoritative answers provided by senior Alibaba engineers. After working through this set of materials, everyone should be able to find a satisfying job.
This complete set of Python learning materials has been uploaded to CSDN; if you need it, it can be obtained for free through CSDN's officially certified channel.
