[Day] Python learning: getting started with the Scrapy framework


      Hello everyone. I've been a bit busy lately and haven't had time to update the blog. These days I've been learning the Scrapy framework, and I split the study into two parts: first, searching Baidu for introductory knowledge about Scrapy; second, searching GitHub for Scrapy projects to see how other people implement their requirements.

  Step One: understand how the Scrapy framework works

     (Scrapy architecture diagram)

     This is a schematic I found online; let me talk through my understanding of it.

    A crawler fetches pages under the same premise as when we visit a web page normally: a request has to be sent and a result comes back. A Request asks for the page; a Response carries the result.

    The body of the Scrapy framework has the Spider (crawler) at its core. The Item Pipeline can be understood as defining and processing the content to be collected. The Scheduler handles task scheduling: a website contains a lot of content and may need several crawlers working on it, and the Scheduler coordinates them.

  The Downloader is the component that downloads pages; it can be understood as the tool that produces the result data.

  How a Scrapy crawl runs:

  1. The Engine takes a link (URL) from the Scheduler for the next page to crawl
  2. The Engine wraps the URL in a request (Request) and sends it to the Downloader
  3. The Downloader downloads the resource and packages it as a response (Response)
  4. The Spider parses the Response
  5. Parsed entities (Items) are handed to the Item Pipeline for further processing
  6. Parsed links (URLs) are sent back to the Scheduler for crawling

 

 

  Step Two: install the Scrapy framework

      My environment: Windows 10, Python 3.7, Anaconda3

      pip install scrapy

     The installation often fails partway, so make sure your network is reliable and confirm that the installation actually completes.
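
     One quick way to confirm it (my own habit, not a step from the original post) is to print the version and make sure it runs without errors:

     scrapy version    # prints the installed Scrapy version if the install succeeded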

   

     Command syntax:

     scrapy startproject <project_name>    # create a new project with the standard directory structure

   

    scrapy list      # list the spiders in the project

 

   scrapy crawl <spider_name>    # start crawling

 

    

 

     This is the directory structure of a project I created:
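
     For reference, a project created with scrapy startproject test_scrapy (the project name that appears later in this post) is laid out roughly like this:

     test_scrapy/
         scrapy.cfg            # deployment configuration
         test_scrapy/
             __init__.py
             items.py          # defines the data to collect
             middlewares.py    # downloader / spider middlewares
             pipelines.py      # persistence and processing of items
             settings.py       # project configuration
             spiders/          # custom crawlers go here
                 __init__.py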

    Spiders for different business needs can be customized under the spiders directory.

    items.py defines the format of the data to be collected.

    pipelines.py handles persistence and processing of the collected data.

    settings.py is the project configuration file, including things like the output data format.

   

Step Three: a practice project

   For technical subjects I put more weight on practice; practice is the sole criterion for testing truth.

  For a first study project, don't go looking for something highly difficult; find something easy to practice on.

   Goal: collect the article directory of a Chinese news website and save it in JSON format.

 

   The project has been generated

 

  The first step: we need to define the Items

  

 

 We need to capture the headline, link, description, and release time, so first define these fields in items.py.
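
 A minimal sketch of what that items.py could look like; the field names (title, link, desc, pub_time) are my own guesses, since the original code was shown in a screenshot:

     import scrapy

     class NewsItem(scrapy.Item):
         # hypothetical field names for the four things we want to capture
         title = scrapy.Field()     # headline
         link = scrapy.Field()      # article URL
         desc = scrapy.Field()      # description / summary
         pub_time = scrapy.Field()  # release time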

 

The second step: write the spider. First, create one.
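
 Scrapy has a built-in command for this; the domain below is an assumption based on the spider name chinanews used later:

     scrapy genspider chinanews www.chinanews.com    # creates spiders/chinanews.py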

 

 First, define the URLs to be collected. Be careful to write them correctly: the http:// or https:// prefix must be included.

 Requesting that link returns a list of categories, as you can see.

 

 

   The parse method finds the category links, then follows them with a second request to get to the actual data.

 

  

The parse_feed method parses the data we need out of those secondary links.
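
Putting the two methods together, here is a sketch of how such a spider might look. The start URL and every CSS selector are assumptions (the original code was shown only in screenshots), so treat them as placeholders:

    import scrapy
    from test_scrapy.items import NewsItem  # path assumes the project layout above

    class ChinanewsSpider(scrapy.Spider):
        name = "chinanews"
        # assumed entry page; must include the http:// or https:// prefix
        start_urls = ["https://www.chinanews.com/"]

        def parse(self, response):
            # first pass: pull out the category links and follow each one
            for href in response.css("div.nav a::attr(href)").getall():
                yield response.follow(href, callback=self.parse_feed)

        def parse_feed(self, response):
            # second pass: extract the four fields from each article entry
            for entry in response.css("ul.list li"):
                item = NewsItem()
                item["title"] = entry.css("a::text").get()
                item["link"] = response.urljoin(entry.css("a::attr(href)").get(default=""))
                item["desc"] = entry.css("p::text").get()
                item["pub_time"] = entry.css("span.time::text").get()
                yield item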

With that, the spider is finished.

I looked it up, and exporting the data in JSON format is very simple; it only takes a little configuration.

 

 

The JSON data can be saved to a local file through the pipeline.
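
A minimal sketch of such a pipeline, following the pattern in the Scrapy docs (the class name and output file name are my own choices); it would live in pipelines.py and be enabled via ITEM_PIPELINES in settings.py:

    import json

    class JsonWriterPipeline:
        def open_spider(self, spider):
            self.file = open("result.json", "w", encoding="utf-8")

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # one JSON object per line; ensure_ascii=False keeps Chinese text readable
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item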

 

 

 When saving I ran into a problem: the data came out garbled. I looked it up; setting the output encoding to UTF-8 on the command line fixes it.


'scheduler/dequeued': 33,
'scheduler/dequeued/memory': 33,
'scheduler/enqueued': 33,
'scheduler/enqueued/memory': 33,
'start_time': datetime.datetime(2019, 11, 25, 3, 19, 0, 393647)}
2019-11-25 11:19:03 [scrapy.core.engine] INFO: Spider closed (finished)

D:\Python\Study\test_scrapy>scrapy crawl chinanews -o result.json -s FEED_EXPORT_ENCODING=utf-8
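
Rather than passing -s on every run, the same setting can live in settings.py (FEED_EXPORT_ENCODING is a standard Scrapy setting):

    FEED_EXPORT_ENCODING = "utf-8"    # fixes garbled non-ASCII output in exported feeds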

 

The correct data output looks like this:
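
Each exported record has roughly this shape (illustrative placeholders, not real scraped data):

    {"title": "...", "link": "https://...", "desc": "...", "pub_time": "..."}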

 

 

That's as far as I'll take Scrapy for now.

I think learning anything is the same: you need teachers. As the saying goes, when three people walk together, one of them can be my teacher. Learn from and exchange with each other to improve your technical skills!

Keep it up!

