Python Learning: Day 28

news 2025/11/5 18:41:56

Logging

Log levels

DEBUG - debugging information
INFO - general information
WARNING - warnings
ERROR - errors
CRITICAL - severe errors
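
Scrapy logs through Python's standard logging module, so the ordering of these levels can be checked with a small stand-alone sketch. The logger name demo and the ListHandler helper below are illustrative assumptions, not part of Scrapy.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: collects formatted records in a list."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def emit_all_levels(threshold):
    """Log one message per level; return the level names that passed the threshold."""
    logger = logging.getLogger('demo')  # illustrative logger name
    logger.setLevel(threshold)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = ListHandler()
    handler.setFormatter(logging.Formatter('%(levelname)s'))
    logger.handlers = [handler]

    logger.debug('debug message')
    logger.info('info message')
    logger.warning('warning message')
    logger.error('error message')
    logger.critical('critical message')
    return handler.lines


print(emit_all_levels(logging.INFO))
# → ['INFO', 'WARNING', 'ERROR', 'CRITICAL']
```

With the threshold at INFO, the DEBUG message is filtered out; the same filtering applies to the LOG_LEVEL setting.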

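Usage

Configure in settings

# Log level
LOG_LEVEL = 'INFO'

# Log file path. The log/ directory must be created beforehand; otherwise the crawl fails with "No such file or directory"
LOG_FILE = '../log/scrapy-test.log'

# Log format
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

# Log date format
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'

# Whether to append to an existing log file; False overwrites it
LOG_FILE_APPEND = False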
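
Because the log directory has to exist before the crawl starts, one option is to create it from the settings module itself. This is a minimal sketch under that assumption; the helper name ensure_log_dir is hypothetical and not part of Scrapy.

```python
from pathlib import Path


def ensure_log_dir(log_file):
    """Create the parent directory of log_file if it is missing.

    Returns the path unchanged (as a string) so it can be assigned to LOG_FILE.
    """
    path = Path(log_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    return str(path)


# Possible use in settings (assumption, relative to the working directory):
# LOG_FILE = ensure_log_dir('../log/scrapy-test.log')
```

Note that a relative path such as ../log/ is resolved against the directory the crawl is started from, so the directory is created there.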

# Inside a spider, log through self.logger
def parse(self, response):
    self.logger.debug('This is a debug message')
    self.logger.info('This is an info message')
    self.logger.warning('This is a warning')
    self.logger.error('This is an error')

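Log format

Parameter	Description
`%(asctime)s`	Time the log record was created
`%(name)s`	Logger name (usually the spider name)
`%(levelname)s`	Log level (DEBUG, INFO, etc.)
`%(message)s`	Log message text
`%(pathname)s`	Full path of the source file that produced the record
`%(filename)s`	File name portion of the path
`%(module)s`	Module name (file name without the extension)
`%(funcName)s`	Function name
`%(lineno)d`	Source line number
`%(process)d`	Process ID
`%(thread)d`	Thread ID
`%(threadName)s`	Thread name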
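
To see how these placeholders expand, the LOG_FORMAT and LOG_DATEFORMAT values from the settings can be fed to the standard library's logging.Formatter, which consumes the same %-style fields. A minimal sketch: the record values (logger name test_except, the message text) are made up for illustration, and the timestamp is pinned to the Unix epoch in UTC so the result is reproducible.

```python
import logging
import time

LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'


def render(name, level, message, created):
    """Format one hand-built log record with the settings above."""
    record = logging.LogRecord(
        name=name, level=level, pathname='example.py', lineno=1,
        msg=message, args=None, exc_info=None,
    )
    record.created = created  # pin the timestamp (seconds since the epoch)
    formatter = logging.Formatter(LOG_FORMAT, LOG_DATEFORMAT)
    formatter.converter = time.gmtime  # render in UTC instead of local time
    return formatter.format(record)


print(render('test_except', logging.INFO, 'Spider opened', 0))
# → 1970-01-01 00:00:00 [test_except] INFO: Spider opened
```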

Exceptions

Exception categories

CloseSpider - stop the spider deliberately
DropItem - discard an item
IgnoreRequest - ignore a request
NotConfigured - a component is not configured

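Common exceptions

Category	Exception class	Raised when	Typical handling	Example
Spider control	`CloseSpider`	The spider needs to stop itself	Log the reason, then stop the spider	`raise CloseSpider('max pages reached')`
	`NotConfigured`	A component lacks required configuration	The component is skipped at load time	`raise NotConfigured('missing API key')`
Data handling	`DropItem`	An item fails validation	Drop the item and log it	`raise DropItem('missing required field')`
	`ItemError`	Generic error during item processing	Handle according to the concrete subclass	`raise ItemError('bad data format')`
Request control	`IgnoreRequest`	A request should be filtered out	Skip the request	`raise IgnoreRequest('blacklisted domain')`
	`RetryRequest`	A request should be retried	Reschedule after a delay	`raise RetryRequest('service unavailable')`
Download errors	`TimeoutError`	The request timed out	Retry or log	`failure.check(TimeoutError)`
	`ConnectionError`	The connection failed	Check the network or retry	`except ConnectionError:`
	`DNSLookupError`	DNS resolution failed	Check the domain or retry	`failure.check(DNSLookupError)`
Response handling	`HttpError`	The response status is not 2xx	Inspect the status code	`raise HttpError(response)`
	`ResponseNeverReceived`	No response was received	Check the network or retry	`failure.check(ResponseNeverReceived)`

Note: `ItemError` and `RetryRequest` are not defined in `scrapy.exceptions`; treat them as user-defined names. `TimeoutError` and `DNSLookupError` are imported from `twisted.internet.error`, `ResponseNeverReceived` from `twisted.web._newclient`, and `HttpError` from `scrapy.spidermiddlewares.httperror`, where it is raised by Scrapy's HttpErrorMiddleware and usually inspected in an errback.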
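
As an illustration of the `DropItem` row in the table above, an item pipeline can raise the exception when a required field is missing. This is a minimal sketch: the pipeline name RequiredFieldsPipeline and the field names title and url are hypothetical, and the except ImportError stand-in exists only so the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in used only when Scrapy is not installed (assumption)."""


class RequiredFieldsPipeline:
    """Hypothetical pipeline: drops items that lack a required field."""

    required = ('title', 'url')  # illustrative field names

    def process_item(self, item, spider):
        # Collect every required field that is absent or empty
        missing = [field for field in self.required if not item.get(field)]
        if missing:
            raise DropItem(f'Missing required fields: {missing}')
        return item
```

In a real project the pipeline would be enabled through the ITEM_PIPELINES setting; Scrapy catches the exception, logs the dropped item, and skips any later pipelines for it.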

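Usage

import scrapy
from pathlib import Path

from scrapy.exceptions import NotConfigured, CloseSpider


# Exception test
class TestExceptSpider(scrapy.Spider):
    name = "test_except"

    def __init__(self, *args, **kwargs):
        # Let the parent Spider initialize itself, then start the counter at 0
        super().__init__(*args, **kwargs)
        self.item_count = 0

    # Alternatively, list the URL in start_urls at the top of the class; the
    # effect is the same. start_requests is inherited from the parent Spider
    # class, so defining it here overrides the parent's method.
    def start_requests(self):
        # Absolute path of test.html next to this file
        file_path = Path(__file__).resolve().parent / 'test.html'
        # as_uri() builds the file:// URL with forward slashes on every platform
        file_url = file_path.as_uri()
        # scrapy.http.Request and scrapy.Request are the same class;
        # scrapy.Request is the more common spelling
        yield scrapy.http.Request(url=file_url, callback=self.parse)

    def parse(self, response):
        # Stop the spider once the limit is reached
        if self.item_count >= 1000:
            raise CloseSpider('Reached the 1000-item limit')
        # Count each parsed response so the limit above can actually trigger
        self.item_count += 1


# Component configuration check
class MyExtension:
    def __init__(self, api_key):
        if not api_key:
            raise NotConfigured('An API key must be configured')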
