当前位置：首页 > wzjs >正文

兰州app定制开发郑州seo使用教程

wzjs 2025/7/27 11:22:30

兰州app定制开发,郑州seo使用教程,重庆市建筑工程网,高端网站建设网页设计Scrapy框架简介 Scrapy是:由Python语言开发的一个快速、高层次的屏幕抓取和web抓取框架，用于抓取web站点并从页面中提取结构化的数据，只需要实现少量的代码，就能够快速的抓取。新建项目 (scrapy startproject xxx)：新建一个新的…

Scrapy框架简介

Scrapy是:由Python语言开发的一个快速、高层次的屏幕抓取和web抓取框架，用于抓取web站点并从页面中提取结构化的数据，只需要实现少量的代码，就能够快速的抓取。

新建项目 (scrapy startproject xxx)：新建一个新的爬虫项目
明确目标（编写items.py）：明确你想要抓取的目标
制作爬虫（spiders/xxspider.py）：制作爬虫开始爬取网页
存储内容（pipelines.py）：设计管道存储爬取内容

注意！只有当调度器中不存在任何request了，整个程序才会停止，（也就是说，对于下载失败的URL，Scrapy也会重新下载。）

本篇文章不会讲基本项目创建，创建的话可以移步这个文章，基础的肯定没什么重要性，这篇说明一下一些比较细的知识

Scrapy 官网：https://scrapy.org/
Scrapy 文档：https://docs.scrapy.org/en/latest/
GitHub：https://github.com/scrapy/scrapy/

基本结构

定义爬取的数据结构

首先在items中定义一个需要爬取的数据结构
```
class ScrapySpiderItem(scrapy.Item):# 创建一个类来定义爬取的数据结构name = scrapy.Field()title = scrapy.Field()url = scrapy.Field()
```
那为什么要这样定义：

在Scrapy框架中，scrapy.Field() 是用于定义Item字段的特殊类，它的作用相当于一个标记。具体来说：
1. 数据结构声明
  每个Field实例代表Item中的一个数据字段（如你代码中的name/title/url），用于声明爬虫要收集哪些数据字段。
2. 元数据容器
  虽然看起来像普通赋值，但实际可以通过Field()传递元数据参数：
在这里定义变量之后，后续就可以这样进行使用
```
item = ScrapySpiderItem()
item['name'] = '股票名称'
item['title'] = '股价数据'
item['url'] = 'http://example.com'
```
然后就可以输入scrapy genspider itcast "itcast.cn"命令来创建一个爬虫，爬取itcast.cn域里的代码

数据爬取

注意这里如果你是跟着菜鸟教程来的，一定要改为这样，在itcast.py中

import scrapyclass ItcastSpider(scrapy.Spider):name = "itcast"allowed_domains = ["iscast.cn"]start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]def parse(self, response):filename = "teacher.html"open(filename, 'wb').write(response.body)

改为wb，因为返回的是byte数据，如果用w不能正常返回值

那么基本的框架就是这样：

from mySpider.items import ItcastItemdef parse(self, response):#open("teacher.html","wb").write(response.body).close()# 存放老师信息的集合items = []for each in response.xpath("//div[@class='li_txt']"):# 将我们得到的数据封装到一个 `ItcastItem` 对象item = ItcastItem()#extract()方法返回的都是unicode字符串name = each.xpath("h3/text()").extract()title = each.xpath("h4/text()").extract()info = each.xpath("p/text()").extract()#xpath返回的是包含一个元素的列表item['name'] = name[0]item['title'] = title[0]item['info'] = info[0]items.append(item)# 直接返回最后数据return items

爬取信息后，使用xpath提取信息，返回值转化为unicode编码后储存到声明好的变量中，返回

数据保存

主要有四种格式

scrapy crawl itcast -o teachers.json
scrapy crawl itcast -o teachers.jsonl //json lines格式
scrapy crawl itcast -o teachers.csv
scrapy crawl itcast -o teachers.xml

不过上面只是一些项目的搭建和基本使用，我们通过爬虫渐渐进入框架，一定也好奇这个框架的优点在哪里，有什么特别的作用

scrapy结构

pipelines（管道）

这个文件也就是我们说的管道,当Item在Spider中被收集之后，它将会被传递到Item Pipeline(管道)，这些Item Pipeline组件按定义的顺序处理Item。每个Item Pipeline都是实现了简单方法的Python类，比如决定此Item是丢弃而存储。以下是item pipeline的一些典型应用：

验证爬取的数据(检查item包含某些字段，比如说name字段)
查重(并丢弃)
将爬取结果保存到文件或者数据库中

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html# useful for handling different item types with a single interface
from itemadapter import ItemAdapterclass MyspiderPipeline:def process_item(self, item, spider):return item

settings（设置）

代码里给了注释，一些基本的设置

# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#项目名称
BOT_NAME = "mySpider"SPIDER_MODULES = ["mySpider.spiders"]
NEWSPIDER_MODULE = "mySpider.spiders"# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "mySpider (+http://www.yourdomain.com)"
#是否遵守规则协议
# Obey robots.txt rules
ROBOTSTXT_OBEY = True# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 #最大并发量32，默认16# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#下载延迟3秒
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16# Disable cookies (enabled by default)
#COOKIES_ENABLED = False# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False# Override the default request headers:
#请求头
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderSpiderMiddleware": 543,
#}# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderDownloaderMiddleware": 543,
#}# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "mySpider.pipelines.MyspiderPipeline": 300,
#}# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"# Set settings whose default value is deprecated to a future-proof value
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

spiders

爬虫代码目录，定义了爬虫的逻辑

import scrapy
from mySpider.items import ItcastItemclass ItcastSpider(scrapy.Spider):name = "itcast"allowed_domains = ["iscast.cn"]start_urls = ["http://www.itcast.cn/"]def parse(self, response):# 获取网站标题list=response.xpath('//*[@id="mCSB_1_container"]/ul/li[@*]')

实战（大学信息）

目标网站：爬取大学信息

base64：

aHR0cDovL3NoYW5naGFpcmFua2luZy5jbi9yYW5raW5ncy9iY3VyLzIwMjQ=

变量命名

在刚刚创建的itcast里更改一下域名，在items里改一下接收数据格式

itcast.py

import scrapy
from mySpider.items import ItcastItemclass ItcastSpider(scrapy.Spider):name = "itcast"allowed_domains = ["iscast.cn"]start_urls = ["https://www.shanghairanking.cn/rankings/bcur/2024"]def parse(self, response):# 获取网站标题list=response.xpath('(//*[@class="align-left"])[position() > 1 and position() <= 31]')item=ItcastItem()for i in list:name=i.xpath('./div/div[2]/div[1]/div/div/span/text()').extract()description=i.xpath('./div/div[2]/p/text()').extract()location=i.xpath('../td[3]/text()').extract()item['name']=str(name).strip().replace('\\n','').replace(' ','')item['description']=str(description).strip().replace('\\n','').replace(' ','')item['location']=str(location).strip().replace('\\n','').replace(' ','')print(item)yield item

这里xpath感不太了解的可以看我之前的博客

一些爬虫基础知识备忘录-xpath

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.htmlimport scrapyclass ItcastItem(scrapy.Item):name = scrapy.Field()description=scrapy.Field()location=scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html# useful for handling different item types with a single interface
import csv
from itemadapter import ItemAdapterclass MyspiderPipeline:def __init__(self):#在初始化函数中先创建一个csv文件self.f=open('school.csv','w',encoding='utf-8',newline='')self.file_name=['name','description','location']self.writer=csv.DictWriter(self.f,fieldnames=self.file_name)self.writer.writeheader()#写入第一段字段名def process_item(self, item, spider):self.writer.writerow(dict(item))#在写入的时候，要转化为字典对象print(item)return itemdef close_spider(self,spider):self.f.close()#关闭文件

setting.py

# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#项目名称
BOT_NAME = "mySpider"SPIDER_MODULES = ["mySpider.spiders"]
NEWSPIDER_MODULE = "mySpider.spiders"# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "mySpider (+http://www.yourdomain.com)"
#是否遵守规则协议
# Obey robots.txt rules
ROBOTSTXT_OBEY = False# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 #最大并发量32，默认16# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#下载延迟3秒
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16# Disable cookies (enabled by default)
#COOKIES_ENABLED = False# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False# Override the default request headers:
#请求头
DEFAULT_REQUEST_HEADERS = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language": "en",
}# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderSpiderMiddleware": 543,
#}# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderDownloaderMiddleware": 543,
#}# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {"mySpider.pipelines.MyspiderPipeline": 300,
}
LOG_LEVEL = 'WARNING'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"# Set settings whose default value is deprecated to a future-proof value
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

start.py

然后在myspider文件夹下可以创建一个start.py文件，这样我们直接运行这个文件即可，不需要使用命令

from scrapy import cmdline
cmdline.execute("scrapy crawl itcast".split())

然后我们就正常保存为csv格式啦！

一些问题

实现继续爬取，翻页

scrapy使用yield进行数据解析和爬取request，如果想实现翻页或者在请求完单次请求后继续请求，使用yield继续请求如果这里你使用一个return肯定会直接退出，感兴趣的可以去深入了解一下

一些setting配置

LOG_LEVEL = 'WARNING'可以把不太重要的日志关掉，让我们专注于看数据的爬取与分析

然后管道什么的在运行前记得开一下,把原先注释掉的打开就行

另外robot也要记得关一下

然后管道什么的在运行前记得开一下

文件命名

csv格式的文件命名一定要和items中的命名一致，不然数据进不去

到了结束的时候了，本篇文章是对scrapy框架的入门，更加深入的知识请期待后续文章，一起进步！

查看全文

http://www.dtcms.com/wzjs/112711.html

沂南县建设局网站自己怎样开网站

地方门户网站盈利北京百度seo

单网页网站如何做太原竞价托管公司推荐

中国做网站找谁个人如何做百度推广

旅游社做的最好的网站磁力搜索引擎下载

百度云建站WordPress培训方案模板

西宁最好网站建设公司哪家好长春网站建设

建设银行官方网站首页图片搜索图片识别

哪个网站做图片外链深圳营销型网站建设

电脑网站打不开是什么原因造成的南昌seo优化公司

北京做的比较好的网站公司吗英语培训机构

一般做公司网站需要哪几点最新的疫情防控政策和管理措施

网站域名如何使用方法镇江百度公司

做现货黄金网站爱站网关键词查询

美妆网站开发背景semester什么意思

网站审核照片幕布惠州网站营销推广

商业网站成功的原因如何营销推广

龙岗营销网站建设公司宁德市旅游景点大全

北京著名网站建设论述搜索引擎优化的具体措施

建设银行官网的网站首页友情链接地址

网站备案在哪里查询搜索大全引擎入口网站

美工做网站怎么收费必应搜索网站

做网站用什么后台苏州百度推广代理商

做视频资源网站有哪些内容营销活动策划

做网站需要多少钱西安网站自然排名工具

昆明网站制作公司免费html网站制作成品

杭州专门做网站网站设计与实现毕业设计

延安做网站电话广东云浮疫情最新情况

做合法的海外购网站需要什么手续网络营销大师排行榜

用摄像头直播网站怎么做seo优化关键词排名优化