Scrapy Framework: A Detailed Guide to settings.py
The settings.py file centralizes the global configuration of a Scrapy project. By changing the options in this file you can tune the crawler's behavior, performance, and data handling without touching the spider code itself.
① The default settings.py generated by Scrapy
# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "douban"

SPIDER_MODULES = ["douban.spiders"]
NEWSPIDER_MODULE = "douban.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "douban (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "douban.middlewares.doubanSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "douban.middlewares.doubanDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "douban.pipelines.doubanPipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
The sections below walk through each group of these settings in turn.
1. Basic project settings

# Project name
BOT_NAME = "douban"

# Where Scrapy looks for spider modules
SPIDER_MODULES = ["douban.spiders"]
NEWSPIDER_MODULE = "douban.spiders"
- BOT_NAME: the name of this project's crawler; it appears in logs and crawl statistics.
- SPIDER_MODULES: a list of Python modules in which Scrapy looks for spiders. Here, Scrapy searches for spider classes in the douban.spiders module.
- NEWSPIDER_MODULE: when you create a new spider with the scrapy genspider command, the generated spider file is placed in the douban.spiders module.
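As a quick illustration, any running spider can read the merged project settings through its settings attribute. A minimal sketch; the spider name and URL are hypothetical:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # hypothetical spider
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # self.settings is a read-only view of the project settings
        self.logger.info("Bot name: %s", self.settings.get("BOT_NAME"))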
2. User agent and robots.txt rules

# Identify your crawler (the user agent)
#USER_AGENT = "douban (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
- USER_AGENT: uncomment this line to set the crawler's user agent, which tells the server what kind of client issued the request. Setting it to a specific value lets the server identify your crawler and the website it belongs to.
- ROBOTSTXT_OBEY: when True, Scrapy obeys the target site's robots.txt file, using its rules to decide whether a given page may be fetched.
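A common tweak is to present a browser-style user agent instead; the string below is shown purely for illustration:

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
ROBOTSTXT_OBEY = True  # keep obeying robots.txt unless you have a good reason not to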
3. Concurrency and download delay

# Maximum number of concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Delay between requests to the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also the autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
- CONCURRENT_REQUESTS: the maximum number of requests Scrapy performs concurrently; the default is 16.
- DOWNLOAD_DELAY: the delay, in seconds, between requests to the same website. This helps avoid putting excessive load on the target site.
- CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP: these cap concurrent requests per domain or per IP address, and only one of them is honored at a time: if CONCURRENT_REQUESTS_PER_IP is non-zero, it takes precedence, and the download delay is then enforced per IP rather than per domain.
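Putting these together, a "polite crawling" profile might look like the sketch below; the numbers are illustrative and should be tuned for the target site:

CONCURRENT_REQUESTS = 8             # overall cap
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # per-domain cap
DOWNLOAD_DELAY = 2                  # ~2 s between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # the default; jitters the delay between 0.5x and 1.5x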
4. Cookies and the Telnet console

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable the Telnet console (enabled by default)
#TELNETCONSOLE_ENABLED = False
- COOKIES_ENABLED: when False, Scrapy does not handle cookies at all. Cookie handling is enabled by default.
- TELNETCONSOLE_ENABLED: when False, the Telnet console is disabled. It is enabled by default and lets you interact with the crawler while it is running.
- COOKIES_DEBUG: when True, enables cookie debugging: Scrapy logs detailed cookie information, including the cookies sent with each request and those received in each response. Debugging is disabled by default.
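While developing, it can help to switch the debugging aids on; a sketch (the Telnet console listens on port 6023 by default):

COOKIES_ENABLED = True        # the default, shown here for clarity
COOKIES_DEBUG = True          # log cookies sent and received per request/response
TELNETCONSOLE_ENABLED = True  # the default; connect with: telnet localhost 6023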
5. Default request headers

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}
- DEFAULT_REQUEST_HEADERS: overrides the default headers sent with every request. The example above sets the Accept and Accept-Language headers.
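Headers set here apply to every request; individual requests can still override them. A sketch with illustrative values:

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Referer": "https://www.douban.com/",
}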
6. Middleware settings

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "douban.middlewares.doubanSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "douban.middlewares.doubanDownloaderMiddleware": 543,
#}
- SPIDER_MIDDLEWARES: enables or disables spider middlewares, components that process requests and responses as they pass between the engine and the spiders. The number 543 is the middleware's position in the chain: lower values sit closer to the engine, higher values closer to the spider.
- DOWNLOADER_MIDDLEWARES: enables or disables downloader middlewares, which hook into requests before they are downloaded and into responses before they reach the engine.
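To make the mechanics concrete, here is a minimal downloader middleware sketch; the class and header names are illustrative:

# middlewares.py
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Stamp every outgoing request; returning None lets processing continue.
        request.headers.setdefault("X-Crawler", spider.settings.get("BOT_NAME"))
        return None

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "douban.middlewares.CustomHeaderMiddleware": 543,
}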
7. Extension settings

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}
- EXTENSIONS: enables or disables Scrapy extensions, components that add functionality to Scrapy itself. Setting a value to None disables that extension; the example above would disable the Telnet console.
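Extensions are often driven purely by settings. For instance, the built-in CloseSpider extension (enabled by default) stops a crawl once a limit is hit; the values below are illustrative:

CLOSESPIDER_PAGECOUNT = 1000  # stop after ~1000 responses have been received
CLOSESPIDER_TIMEOUT = 3600    # or after one hour, whichever comes first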
8. Item pipeline settings

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "douban.pipelines.doubanPipeline": 300,
#}
- ITEM_PIPELINES: configures the item pipelines, components that process scraped items after the spider extracts them, for example cleaning, validation, or storage. The number 300 is the pipeline's order; pipelines with lower values run first (values conventionally fall in the 0-1000 range).
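A minimal pipeline sketch, assuming items are dict-like and the class and field names are illustrative: it drops items without a title and strips whitespace from the rest.

# pipelines.py
from scrapy.exceptions import DropItem

class CleanTitlePipeline:
    def process_item(self, item, spider):
        title = item.get("title")
        if not title:
            raise DropItem("missing title")
        item["title"] = title.strip()
        return item

# settings.py
ITEM_PIPELINES = {
    "douban.pipelines.CleanTitlePipeline": 300,
}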
9. The AutoThrottle extension

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should send in parallel to each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Show throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
- AUTOTHROTTLE_ENABLED: when True, enables the AutoThrottle extension, which adjusts the request delay automatically based on the target site's response latency.
- AUTOTHROTTLE_START_DELAY: the initial download delay, in seconds.
- AUTOTHROTTLE_MAX_DELAY: the maximum download delay allowed under high latency, in seconds.
- AUTOTHROTTLE_TARGET_CONCURRENCY: the average number of requests Scrapy should send in parallel to each remote server.
- AUTOTHROTTLE_DEBUG: when True, throttle statistics are shown for every response received.
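A gentle AutoThrottle profile might look like this; the numbers are illustrative:

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # aim for ~2 in-flight requests per server
AUTOTHROTTLE_DEBUG = True              # print throttle stats while tuning, then turn off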
10. HTTP cache settings

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
- HTTPCACHE_ENABLED: when True, enables HTTP caching.
- HTTPCACHE_EXPIRATION_SECS: the cache expiration time in seconds; 0 means cached responses never expire.
- HTTPCACHE_DIR: the directory where cache files are stored.
- HTTPCACHE_IGNORE_HTTP_CODES: a list of HTTP status codes whose responses should not be cached.
- HTTPCACHE_STORAGE: the cache storage backend; here, filesystem storage.
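The cache is especially useful during development, when you re-run a spider against the same pages while iterating on the parsing code. A sketch with illustrative values:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400  # re-fetch anything older than a day
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # never cache server errors
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"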
11. Other settings

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
- REQUEST_FINGERPRINTER_IMPLEMENTATION: selects the request-fingerprinting implementation; "2.7" opts into the algorithm introduced in Scrapy 2.7 so the project stays forward-compatible.
- TWISTED_REACTOR: selects the Twisted reactor implementation; AsyncioSelectorReactor runs Twisted on top of asyncio, enabling asynchronous I/O.
- FEED_EXPORT_ENCODING: sets the encoding used when exporting data to UTF-8.
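With the asyncio reactor enabled, spider callbacks may also be written as coroutines. A minimal sketch; the spider name and URL are hypothetical:

import scrapy

class ApiSpider(scrapy.Spider):
    name = "api"
    start_urls = ["https://example.com/api"]

    async def parse(self, response):
        # Coroutine callbacks can await asyncio code before yielding items.
        yield {"status": response.status, "url": response.url}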
② Other common settings.py options
1. Logging

# Log level: one of 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'
LOG_LEVEL = 'DEBUG'
# Log file path; if set, logs go to this file instead of the console
LOG_FILE = 'scrapy.log'
- LOG_LEVEL: controls how verbose Scrapy's log output is. For example, DEBUG produces the most detailed output, while CRITICAL logs only critical errors.
- LOG_FILE: writes the log to the given file instead of the console, which makes later inspection and analysis easier.
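A quieter profile for production runs, as a sketch (the path is illustrative; the LOG_FORMAT shown is Scrapy's default layout):

LOG_LEVEL = 'INFO'
LOG_FILE = 'logs/run.log'
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'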
2. Download timeout and retries

# Download timeout in seconds
DOWNLOAD_TIMEOUT = 180
# Number of retries
RETRY_TIMES = 3
# HTTP status codes that trigger a retry
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
- DOWNLOAD_TIMEOUT: a request that receives no response within this many seconds is treated as timed out.
- RETRY_TIMES: how many times a failed request is retried.
- RETRY_HTTP_CODES: responses with these HTTP status codes cause Scrapy to retry the request.
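Both limits can also be overridden per request through Request.meta, using keys recognized by the built-in middlewares. A sketch; the spider name and URL are hypothetical:

import scrapy

class FlakySpider(scrapy.Spider):
    name = "flaky"

    def start_requests(self):
        # meta keys override the global settings for this request only
        yield scrapy.Request(
            "https://example.com/flaky-endpoint",
            meta={"max_retry_times": 5, "download_timeout": 30},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("got %s", response.status)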
3. Proxies

Note that Scrapy has no HTTP_PROXY setting in settings.py. The built-in HttpProxyMiddleware picks up proxies from the standard http_proxy / https_proxy environment variables, or you can set a proxy per request via request.meta['proxy'], as shown below.
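A per-request proxy sketch; the proxy URL and spider name are illustrative:

import scrapy

class ProxiedSpider(scrapy.Spider):
    name = "proxied"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/",
            meta={"proxy": "http://proxy.example.com:8080"},
        )

    def parse(self, response):
        yield {"url": response.url}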
4. Feed export (data storage)

# Export format, e.g. 'json', 'csv', 'xml'
FEED_FORMAT = 'json'
# Export file path
FEED_URI = 'output.json'

FEED_FORMAT and FEED_URI still work but have been deprecated since Scrapy 2.1; new projects should use the FEEDS setting instead, as shown below.
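The modern equivalent with FEEDS, which maps each output URI to its per-feed options:

FEEDS = {
    "output.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,
    },
}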
5. Scheduler queues

# Queue types used by the scheduler; the FIFO variants below give
# breadth-first crawling (the LIFO defaults give depth-first)
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
- SCHEDULER_DISK_QUEUE and SCHEDULER_MEMORY_QUEUE: select the disk-backed and in-memory queue types used by the scheduler. The defaults (PickleLifoDiskQueue and LifoMemoryQueue) are LIFO, which makes the crawl roughly depth-first; the FIFO variants shown above make it breadth-first.
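Combined with DEPTH_PRIORITY (covered in the depth settings below), this is the breadth-first recipe from the Scrapy documentation:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'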
6. Per-domain and per-IP concurrency

# Concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8
# Concurrent requests per IP
CONCURRENT_REQUESTS_PER_IP = 8

As noted earlier, a non-zero CONCURRENT_REQUESTS_PER_IP takes precedence over the per-domain cap.
7. Downloader middleware: rotating the User-Agent

# Swap the built-in User-Agent middleware for a randomizing one
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
- scrapy.downloadermiddlewares.useragent.UserAgentMiddleware: Scrapy's built-in downloader middleware for the User-Agent header. Its behavior is basic: it applies the single User-Agent string from the USER_AGENT setting (or the spider's user_agent attribute) to every request. Setting its value to None in the configuration above disables it.
- scrapy_fake_useragent.middleware.RandomUserAgentMiddleware: comes from the third-party scrapy-fake-useragent package, which picks a different random User-Agent for each request to mimic different clients and reduce the chance of being identified as a crawler. Install it with pip install scrapy-fake-useragent.
8. Depth limits

# Maximum crawl depth
DEPTH_LIMIT = 3
# Depth priority adjustment; positive values favor breadth-first crawling
DEPTH_PRIORITY = 1
- DEPTH_LIMIT: caps how deep the crawl may go, so the spider does not wander arbitrarily deep and collect an unmanageable amount of data.
- DEPTH_PRIORITY: adjusts request priority based on depth. A positive value lowers the priority of deeper requests, favoring breadth-first order (together with the FIFO queues from section 5), while a negative value favors depth-first order; 0, the default, applies no adjustment.