Writing a Scrapy Crawler
1. Create a new project (on the command line, run `scrapy startproject xxx`): this scaffolds a new crawler project.
2. Open the project in PyCharm and inspect the project directory.
![13148195-a266765af4506d87.png](https://upload-images.jianshu.io/upload_images/13148195-a266765af4506d87.png)
3. Define the crawl target (edit items.py: this holds the data model code). Be clear about what you want to crawl before writing the spider.
![13148195-0d1976d18e12b175.png](https://upload-images.jianshu.io/upload_images/13148195-0d1976d18e12b175.png)
4. Write the spider (spiders/xxspider.py): create the spider that does the actual crawling.
(1) Generate the spider file; a new xxspider.py will appear under the spiders directory:
scrapy genspider xxx xxx.com
(2) Edit the spider file: send requests, handle responses, and extract the data (yield item)
![13148195-695a545b03a49143.png](https://upload-images.jianshu.io/upload_images/13148195-695a545b03a49143.png)
Spider attributes:
① name = 'tencent'  # spider name, required as the argument when launching the spider
② allowed_domains = ['tencent.com']  # crawl scope: the spider only follows links within these domains (optional)
③ start_urls = []  # list of start URLs; the spider's first requests are taken from this list
5. Store the content (write the pipeline file pipelines.py): design pipelines that store and process the items the spider returns, e.g. persisting them to local storage.
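A pipeline is a plain class with a `process_item` method; Scrapy calls it once per item the spider yields. Below is a minimal sketch that appends each item to a JSON Lines file (class and file names are illustrative):

```python
import json

class JsonWriterPipeline:
    """Hypothetical pipeline: appends each item to a JSON Lines file."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open("items.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item; must return the item (or raise DropItem).
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```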
6. Edit the settings file settings.py: enable the pipeline components and adjust related settings.
![13148195-f1f4d0c871012cde.png](https://upload-images.jianshu.io/upload_images/13148195-f1f4d0c871012cde.png)
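Enabling a pipeline means registering it in `ITEM_PIPELINES` in settings.py. A sketch, assuming the project is named `xxx` and the pipeline class from pipelines.py is called `JsonWriterPipeline` (both hypothetical):

```python
# settings.py (excerpt)
# The number (0-1000) sets the order when several pipelines run in sequence:
# lower numbers run first.
ITEM_PIPELINES = {
    "xxx.pipelines.JsonWriterPipeline": 300,  # hypothetical project/pipeline names
}

# Optional but commonly recommended settings:
ROBOTSTXT_OBEY = True   # respect robots.txt
DOWNLOAD_DELAY = 1      # seconds between requests, to avoid hammering the site
```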
7. Run the spider:
scrapy crawl xxx
8. Four ways to store the crawled data; specify the output file format with -o:
(1) JSON (Unicode-escaped by default): scrapy crawl xxx -o xxx.json
(2) JSON Lines (Unicode-escaped by default): scrapy crawl xxx -o xxx.jsonl
(3) CSV (comma-separated values, can be opened in Excel): scrapy crawl xxx -o xxx.csv
(4) XML: scrapy crawl xxx -o xxx.xml
Reproduced from: https://www.jianshu.com/p/f94f4514e60d