
wzjs 2025/7/30 18:47:00


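I. Basics

Create a project

scrapy startproject mySpider    # mySpider is the project name

Create a spider file

scrapy genspider itcast "itcast.cn"    # generates itcast.py automatically

The first argument (itcast) is the spider name; the second (itcast.cn) is the domain to crawl.

Run the spider

scrapy crawl baidu    # baidu is the spider name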

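Typing this command into a terminal for every run is tedious, so we write a run file to serve as the program's entry point. The split() call is required: it turns the command string into a list whose first element is scrapy, second is crawl, and third is baidu.

from scrapy import cmdline

cmdline.execute('scrapy crawl baidu'.split())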
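To see what split() contributes, here is a minimal sketch that uses only the standard library and does not start a crawl. The variable names are illustrative, and the shlex variant is an optional alternative rather than something the run file above uses.

```python
import shlex

# Splitting on whitespace turns the command string into an argv-style list
command = 'scrapy crawl baidu'
argv = command.split()
print(argv)  # ['scrapy', 'crawl', 'baidu']

# shlex.split also respects quotes, which helps when an argument contains spaces
argv_quoted = shlex.split('scrapy crawl baidu -o "my output.json"')
print(argv_quoted)
```

Plain split() is enough as long as no argument contains whitespace.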

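After creation, the directory looks roughly like this:

|-ProjectName              # project folder
   |-ProjectName           # project package
   |  |-items.py           # data structure definitions
   |  |-middlewares.py     # middlewares
   |  |-pipelines.py       # data processing
   |  |-settings.py        # global configuration
   |  |-spiders
   |     |-__init__.py     # package marker
   |     |-itcast.py       # spider file
   |-scrapy.cfg            # basic project configuration file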

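Global project configuration file: settings.py

BOT_NAME: the project name.
USER_AGENT: commented out by default. It matters a great deal: without it, requests are easily identified as coming from a bot. A simple value such as Mozilla/5.0 is enough.
ROBOTSTXT_OBEY: whether to obey robots.txt. The generated settings.py sets it to True; the author changes it to False because many pages cannot be crawled otherwise.

CONCURRENT_REQUESTS: the maximum number of requests Scrapy performs concurrently (default 16). Scrapy uses asynchronous I/O rather than one thread per request.
DOWNLOAD_DELAY: the delay between downloads, in seconds; it controls how fast the spider crawls. Tune it to your project, neither too fast nor too slow. Scrapy's built-in default is 0 (no delay); the 3 shown in the generated settings.py is a commented-out example. One second is a good trade-off, and a few tenths of a second is acceptable when there are many files to fetch.
COOKIES_ENABLED: whether the cookies middleware is enabled. Scrapy's default is True; the generated settings.py contains a commented-out line that turns it off. When enabled, cookies are tracked across requests during the crawl, which is very handy.
DEFAULT_REQUEST_HEADERS: the default request headers. The USER_AGENT above is itself sent as a request header; set the remaining headers to suit the site you are crawling.

ITEM_PIPELINES: the item pipelines. The number (300, for example) is the order: the lower the value, the earlier the pipeline runs.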

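For example, my pipelines.py defines two pipelines, one that handles the crawled pages and one that stores data in a database. I adjusted their order values so that, when an item arrives, the database pipeline runs first.

ITEM_PIPELINES = {
    'scrapyP1.pipelines.BaiduPipeline': 300,
    'scrapyP1.pipelines.BaiduMysqlPipeline': 200,
}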
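Pulled together, the settings described above might look like the following in settings.py. This is only a sketch: the values are illustrative rather than recommendations, BOT_NAME is assumed from the pipeline paths, and the header values are assumptions. The last two lines are a demonstration of the ordering rule and would not appear in a real settings.py.

```python
# settings.py (illustrative values, assumptions marked)
BOT_NAME = 'scrapyP1'            # assumed from the pipeline paths
USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1               # seconds
COOKIES_ENABLED = True
DEFAULT_REQUEST_HEADERS = {      # assumed header values
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'scrapyP1.pipelines.BaiduPipeline': 300,
    'scrapyP1.pipelines.BaiduMysqlPipeline': 200,
}

# Demonstration only: items pass through pipelines in ascending order of value
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With these values the MySQL pipeline (200) comes before BaiduPipeline (300) in run_order.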

II. Example: Douban movies

1. items.py: the data class

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class GoogleTrendsCrawlerItem(scrapy.Item):
    pass


class doubanitem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()  # movie title
    genre = scrapy.Field()  # movie rating
    # pass

douban.py: the spider file

import scrapy
from ..items import doubanitem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # URL template; {} is filled with the paging offset
    start_urls = ['https://movie.douban.com/top250?start={}&filter=']

    def start_requests(self):
        # offsets 0, 25, 50, 75, 100
        for i in range(0, 121, 25):
            # the original read self.url, which is never defined
            url = self.start_urls[0].format(i)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        movies = response.xpath('/html/body/div[3]/div[1]/div/div[1]/ol/li')
        for movie in movies:
            # create a fresh item per movie so earlier items are not overwritten
            item = doubanitem()
            item["title"] = movie.xpath('./div/div[2]/div[1]/a/span[1]/text()').extract_first()
            item["genre"] = movie.xpath('./div/div[2]/div[2]/div/span[2]/text()').extract_first()
            # yield hands the item over to the pipelines; once they have
            # processed and returned it, execution resumes here
            yield item
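The paging logic in start_requests can be checked without running a crawl. A minimal sketch, reusing only the URL template and the range from the spider; build_page_urls is a hypothetical helper name, not part of Scrapy or of the original project.

```python
URL_TEMPLATE = 'https://movie.douban.com/top250?start={}&filter='


def build_page_urls(stop=121, step=25):
    """Return one URL per paging offset: 0, step, 2*step, ... below stop."""
    return [URL_TEMPLATE.format(offset) for offset in range(0, stop, step)]


urls = build_page_urls()
for url in urls:
    print(url)
```

range(0, 121, 25) yields five offsets (0 through 100), so the spider requests five pages. If the list holds 250 entries at 25 per page, as the URL and the step suggest, range(0, 250, 25) would be needed to cover all ten pages.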

pipelines.py: processes the extracted data, for example by storing it in a database

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

from google_trends_crawler.items import doubanitem


class GoogleTrendsCrawlerPipeline:
    def __init__(self):
        # open the database connection
        self.conn = pymysql.connect(
            host='localhost',  # MySQL server address
            user='root',  # database user
            password='root',  # database password
            database='test',  # database name
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor,
        )
        self.cursor = self.conn.cursor()
        # create the table if it does not exist
        self.create_table()

    def create_table(self):
        create_table_sql = """
        CREATE TABLE IF NOT EXISTS douban_movies (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255) NOT NULL,
            genre VARCHAR(100),
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
        """
        self.cursor.execute(create_table_sql)
        self.conn.commit()

    def process_item(self, item, spider):
        if isinstance(item, doubanitem):  # only handle doubanitem
            # insert the row into MySQL
            sql = """
            INSERT INTO douban_movies (title, genre)
            VALUES (%s, %s)
            """
            self.cursor.execute(sql, (item['title'], item['genre']))
            self.conn.commit()
            spider.logger.info(f"Inserted: {item['title']}")
        return item

    def close_spider(self, spider):
        # close the database connection when the spider closes
        print('Crawl finished')
        self.cursor.close()
        self.conn.close()
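The same pipeline shape (connect in __init__, create the table, insert in process_item, clean up in close_spider) can be tried without a MySQL server. The sketch below swaps pymysql for the standard-library sqlite3 module with an in-memory database, so it is not the author's pipeline. SqlitePipelineSketch and FakeSpider are hypothetical names, the item is a plain dict with placeholder values, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import logging
import sqlite3


class SqlitePipelineSketch:
    """Same shape as the MySQL pipeline, backed by sqlite3 (an assumption)."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS douban_movies (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   title TEXT NOT NULL,
                   genre TEXT,
                   created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
               )"""
        )
        self.conn.commit()

    def process_item(self, item, spider):
        # Parameterized query: values travel separately from the SQL text
        self.conn.execute(
            "INSERT INTO douban_movies (title, genre) VALUES (?, ?)",
            (item['title'], item['genre']),
        )
        self.conn.commit()
        spider.logger.info("Inserted: %s", item['title'])
        return item  # return the item so later pipelines still receive it

    def close_spider(self, spider):
        self.conn.close()


class FakeSpider:
    """Stand-in exposing only the logger attribute the pipeline uses."""
    logger = logging.getLogger('douban')


pipeline = SqlitePipelineSketch()
spider = FakeSpider()
# Placeholder item, not scraped data
returned = pipeline.process_item({'title': 'Example Movie', 'genre': '9.0'}, spider)
rows = pipeline.conn.execute("SELECT title, genre FROM douban_movies").fetchall()
print(rows)
pipeline.close_spider(spider)
```

Passing the values as a tuple, as both versions do, leaves quoting to the database driver instead of building the SQL string by hand.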

Results