Python Scrapy framework (4): detailed usage examples for the settings.py file

The settings.py file is where a Scrapy project's crawl-related configuration lives. In Scrapy, we can customize a crawler's behavior by modifying settings.py, including defining global variables, configuring download delays, setting up a User-Agent pool, configuring proxies, and adjusting other crawler-related options. The following is a detailed explanation of the settings.py file with usage examples:

1. Set global variables
In the settings.py file, we can define global variables that apply to the whole crawl. For example, we can set the USER_AGENT variable to control the User-Agent header sent with every request:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'

2. Configure the download delay
In the settings.py file, you can configure a download delay by setting the DOWNLOAD_DELAY parameter, which controls the crawling speed. DOWNLOAD_DELAY is measured in seconds and also accepts decimal values such as 0.5. For example:

DOWNLOAD_DELAY = 1
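
Two related settings, not covered in the original example, can refine this behaviour: RANDOMIZE_DOWNLOAD_DELAY jitters the wait between roughly 0.5x and 1.5x of DOWNLOAD_DELAY, and the AutoThrottle extension adjusts the delay dynamically based on server response times. A minimal sketch (the values shown are only illustrative):

# Randomize the wait between requests (roughly 0.5x-1.5x of DOWNLOAD_DELAY); on by default
RANDOMIZE_DOWNLOAD_DELAY = True

# Optionally let the AutoThrottle extension tune the delay automatically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10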

3. Configure the UA pool
To make it harder for a website to identify the crawler, we can set up a User-Agent pool and have each request pick a User-Agent at random. USER_AGENT_POOL is not a built-in Scrapy setting; it is a custom setting that we define in settings.py and read ourselves in the Spider, for example:

USER_AGENT_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebK...
]

Then, pick a random User-Agent in the Spider when sending each request:

import random

import scrapy
from scrapy import Spider
from scrapy.utils.project import get_project_settings

class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['http://www.example.com']  # placeholder URL

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        # Load the project settings so the custom USER_AGENT_POOL is available
        self.settings = get_project_settings()

    def start_requests(self):
        for url in self.start_urls:
            # Choose a random User-Agent from the pool for this request
            yield scrapy.Request(
                url,
                headers={'User-Agent': random.choice(self.settings['USER_AGENT_POOL'])},
            )
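
Instead of picking a User-Agent inside each Spider, the same rotation can be implemented once in a downloader middleware so that every outgoing request gets a random User-Agent. The following is a minimal sketch; RandomUserAgentMiddleware is a hypothetical name, and the class would still need to be enabled via DOWNLOADER_MIDDLEWARES (see section 9):

import random

class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware that rotates User-Agent headers."""

    def __init__(self, user_agent_pool):
        self.user_agent_pool = user_agent_pool

    @classmethod
    def from_crawler(cls, crawler):
        # Read the custom USER_AGENT_POOL setting defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_POOL'))

    def process_request(self, request, spider):
        # Assign a random User-Agent to every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_pool)
        return None  # let the request continue through the middleware chain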

4. Set up proxies
If you need to crawl through proxies, you can define a PROXIES list in the settings.py file. Like USER_AGENT_POOL, this is a custom setting that we read ourselves in the Spider. For example:

PROXIES = [
    'http://proxy1.example.com:8888',
    'http://proxy2.example.com:8888',
    'http://proxy3.example.com:8888',
]

Then, pick a random proxy in the Spider when sending each request:

import random

import scrapy
from scrapy import Spider
from scrapy.utils.project import get_project_settings

class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['http://www.example.com']  # placeholder URL

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        # Load the project settings so the custom PROXIES list is available
        self.settings = get_project_settings()

    def start_requests(self):
        for url in self.start_urls:
            # Route each request through a randomly chosen proxy
            yield scrapy.Request(
                url,
                meta={'proxy': random.choice(self.settings['PROXIES'])},
            )
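
Proxy selection can likewise be centralized in a downloader middleware rather than repeated in every Spider. A minimal sketch under the same assumptions (hypothetical RandomProxyMiddleware class name, custom PROXIES setting, enabled via DOWNLOADER_MIDDLEWARES):

import random

class RandomProxyMiddleware:
    """Hypothetical downloader middleware that assigns a random proxy per request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Read the custom PROXIES setting defined in settings.py
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy']
        request.meta['proxy'] = random.choice(self.proxies)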

5. Other crawler-related configuration items
In the settings.py file, you can also set other crawler-related configuration items, such as the log level, crawl depth limit, and HTTP cache behaviour. The following are some common configuration items:

# Log level
LOG_LEVEL = 'INFO'

# Bot (project) name
BOT_NAME = 'my_bot'

# Crawl depth limit
DEPTH_LIMIT = 3

# Whether to obey robots.txt
ROBOTSTXT_OBEY = True

# Whether to enable the HTTP cache
HTTPCACHE_ENABLED = True

# Cache expiration time in seconds (0 means cached pages never expire)
HTTPCACHE_EXPIRATION_SECS = 0

# Cache storage directory
HTTPCACHE_DIR = 'httpcache'

# Cache storage backend
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

These are just some of the common configuration items in the settings.py file; you can add or modify configuration items as needed. The following are more options you may need:

6. Enable and configure custom extensions
The Scrapy framework allows developers to write custom extensions to enhance crawler functionality. These extensions are enabled and ordered via the EXTENSIONS setting in the settings.py file. For example, to enable a custom extension MyExtension:

EXTENSIONS = {
    'myextension.MyExtension': 500,
}
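
For reference, a custom extension is usually a plain class that hooks into Scrapy signals through a from_crawler class method. The sketch below is a minimal, hypothetical MyExtension that logs spider open and close events; the myextension module path matches the EXTENSIONS entry above:

import logging

from scrapy import signals

logger = logging.getLogger(__name__)

class MyExtension:
    """Minimal hypothetical extension that logs spider open/close events."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Connect the extension's callbacks to Scrapy's signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        logger.info("Spider opened: %s", spider.name)

    def spider_closed(self, spider):
        logger.info("Spider closed: %s", spider.name)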

7. Configure the number of retries
Requests may fail during a crawl. You can control the number of automatic retries and which HTTP response status codes trigger a retry by configuring the RETRY_TIMES and RETRY_HTTP_CODES parameters. For example, to allow at most 3 retries and retry only on 500 and 502 responses:

RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502]

8. Configure the number of concurrent requests
Crawling efficiency can be improved by sending requests concurrently; the maximum number of simultaneous requests is set with the CONCURRENT_REQUESTS parameter. For example, to send at most 10 requests at the same time:

CONCURRENT_REQUESTS = 10
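
Concurrency can also be capped per target site, which is often what matters most for politeness. The sketch below uses two related Scrapy settings, CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP (the values shown are only illustrative):

# Limit concurrent requests sent to any single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# If non-zero, limit per IP instead (the per-domain limit is then ignored)
CONCURRENT_REQUESTS_PER_IP = 0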

9. Configure downloader middleware and spider middleware
The Scrapy framework provides downloader middleware and spider middleware for customizing request and response processing. These middlewares are enabled and ordered via the DOWNLOADER_MIDDLEWARES and SPIDER_MIDDLEWARES parameters. For example, to enable the custom downloader middleware MyDownloaderMiddleware and spider middleware MySpiderMiddleware:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyDownloaderMiddleware': 543,
}
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.MySpiderMiddleware': 543,
}
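
As a rough sketch of what such classes might look like (the myproject.middlewares module and both class names come from the example above; the method bodies are hypothetical), a downloader middleware typically implements process_request and process_response, while a spider middleware typically implements process_spider_output:

class MyDownloaderMiddleware:
    """Hypothetical downloader middleware: sees every request and response."""

    def process_request(self, request, spider):
        spider.logger.debug("Requesting %s", request.url)
        return None  # returning None lets the request continue through the chain

    def process_response(self, request, response, spider):
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response  # must return a Response (or a new Request)

class MySpiderMiddleware:
    """Hypothetical spider middleware: post-processes what the spider yields."""

    def process_spider_output(self, response, result, spider):
        # Pass through every item and request produced by the spider callback
        for item_or_request in result:
            yield item_or_request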

10. Configure default request headers
You can configure default headers for all requests by setting the DEFAULT_REQUEST_HEADERS parameter. For example, to set Referer and Cookie headers:

DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://www.example.com',
    'Cookie': 'session_id=xxxxx',
}

11. Configure whether to enable redirects
You can control whether requests follow redirects with the REDIRECT_ENABLED parameter. For example, to disable redirects:

REDIRECT_ENABLED = False

12. Configure the deduplication filter
The Scrapy framework has a built-in deduplication filter that skips URLs that have already been requested. The filter implementation can be swapped out via the DUPEFILTER_CLASS parameter. For example, to use the Redis-based deduplication filter provided by the scrapy-redis package:

DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
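
Note that the scrapy-redis dupefilter is normally used together with that package's scheduler and a Redis connection. A hedged sketch of the companion settings (names taken from the scrapy-redis project; the Redis URL is a placeholder):

# Use the scrapy-redis scheduler so request fingerprints are shared through Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# Keep the Redis queues between runs (allows pausing and resuming crawls)
SCHEDULER_PERSIST = True

# Connection string for the Redis server holding the fingerprints and queues
REDIS_URL = 'redis://localhost:6379'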

These are just some of the possible configuration items in the settings.py file. Depending on your actual needs, you can further customize settings.py with the many options the Scrapy framework provides to meet your crawler's requirements.
