
wzjs 2025/7/22 4:09:09


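Introduction to the Scrapy Framework

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It crawls websites and extracts structured data from their pages, and it takes only a small amount of code to get a crawl running.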

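Create a project (scrapy startproject xxx): create a new crawler project
Define the target (write items.py): specify the data you want to scrape
Build the spider (spiders/xxspider.py): write the spider and start crawling pages
Store the content (pipelines.py): design pipelines to store the scraped content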

Note! The program stops only when no requests remain in the scheduler. (In other words, Scrapy also retries URLs whose download failed before it finishes.)
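The retries mentioned in the note above are handled by Scrapy's built-in retry middleware and are bounded rather than endless. A minimal settings.py sketch follows; the values are illustrative assumptions, not recommendations from the original article.

```python
# settings.py fragment (illustrative values)

# Retry failed downloads at all (enabled by default)
RETRY_ENABLED = True

# Maximum number of retries per request, in addition to the first attempt
RETRY_TIMES = 2

# HTTP status codes that should trigger a retry (an example subset)
RETRY_HTTP_CODES = [500, 502, 503, 504]
```

Once a request has used up its retries, it is dropped, so the scheduler can still run empty and the crawl can end.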

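This article does not cover basic project creation (see this article for that). Rather than repeating the basics, it explains some of the finer details.

Scrapy website: https://scrapy.org/
Scrapy documentation: https://docs.scrapy.org/en/latest/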
GitHub：https://github.com/scrapy/scrapy/

Basic structure

Defining the data structure to scrape

First, define the data structure to scrape in items.py:
```
import scrapy


class ScrapySpiderItem(scrapy.Item):
    # A class that defines the structure of the scraped data
    name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
```
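Why define it this way?

In the Scrapy framework, scrapy.Field() is a special class used to define Item fields; it acts as a marker. Specifically:
1. Data structure declaration
  Each Field instance represents one data field of the Item (name/title/url in the code above) and declares which data the spider collects.
2. Metadata container
  Although it looks like a plain assignment, Field() also accepts metadata as keyword arguments.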
Once the fields are defined here, they can be used like this:
```
item = ScrapySpiderItem()
item['name'] = 'stock name'
item['title'] = 'stock price data'
item['url'] = 'http://example.com'
```
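You can then run the command scrapy genspider itcast "itcast.cn" to create a spider that crawls pages within the itcast.cn domain.

Crawling the data

Note: if you are following the 菜鸟教程 (Runoob) tutorial, be sure to change the code in itcast.py to the following: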

import scrapy


class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

    def parse(self, response):
        filename = "teacher.html"
        # response.body is bytes, so the file is opened in binary mode
        with open(filename, "wb") as f:
            f.write(response.body)

Use wb mode because response.body is bytes; a file opened with w (text mode) only accepts str, so the write would fail.
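The difference between the two modes can be reproduced without Scrapy. The sketch below uses a made-up bytes value as a stand-in for response.body and a temporary directory as the output location; both are assumptions for the demonstration.

```python
import os
import tempfile

body = b"<html>teacher list</html>"  # stand-in for response.body (bytes)
path = os.path.join(tempfile.mkdtemp(), "teacher.html")

# Text mode ("w") only accepts str, so writing bytes fails
try:
    with open(path, "w") as f:
        f.write(body)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# Binary mode ("wb") writes the bytes unchanged
with open(path, "wb") as f:
    f.write(body)

with open(path, "rb") as f:
    saved = f.read()

print(type(text_mode_error).__name__)
print(saved == body)
```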

The basic framework then looks like this:

import scrapy
from mySpider.items import ItcastItem


class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

    def parse(self, response):
        # Collection of teacher information
        items = []
        for each in response.xpath("//div[@class='li_txt']"):
            # Wrap the extracted data in an ItcastItem object
            item = ItcastItem()
            # extract() returns a list of str
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()
            # Each XPath result here is a list containing one element
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            items.append(item)
        # Return the collected data directly
        return items

After a page is fetched, XPath extracts the information; the extracted strings are stored in the fields declared earlier and then returned.
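One caveat with the name[0] pattern: extract() returns a list, and that list is empty when the XPath matches nothing, so indexing it raises IndexError. The sketch below uses a hypothetical helper, first_or_default (not part of Scrapy), on made-up lists shaped like extract() results. Scrapy selectors also offer extract_first() and get(), which return None instead of raising when nothing matches.

```python
def first_or_default(values, default=""):
    """Return the first extracted string, stripped, or `default` if the list is empty."""
    return values[0].strip() if values else default


# Lists shaped like the results of .extract(); the data is made up
name = ["  Teacher A "]
title = []  # the XPath matched nothing for this node

print(first_or_default(name))
print(first_or_default(title, default="unknown"))
```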

Saving data

There are four main formats:

scrapy crawl itcast -o teachers.json
scrapy crawl itcast -o teachers.jsonl  # JSON Lines format
scrapy crawl itcast -o teachers.csv
scrapy crawl itcast -o teachers.xml
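The difference between the .json and .jsonl outputs is easiest to see side by side. The sketch below only mimics the shape of the two formats with the standard json module; it is not Scrapy's exporter, and the teacher records are made up.

```python
import json

teachers = [
    {"name": "Teacher A", "title": "Lecturer"},
    {"name": "Teacher B", "title": "Senior Lecturer"},
]

# .json: the whole export is a single JSON array
as_json = json.dumps(teachers, ensure_ascii=False)

# .jsonl: one JSON object per line, so the file can be read line by line
as_jsonl = "\n".join(json.dumps(t, ensure_ascii=False) for t in teachers)

print(as_json)
print(as_jsonl)

# Reading JSON Lines back one record at a time
records = [json.loads(line) for line in as_jsonl.splitlines()]
```

Because each line is a complete record, JSON Lines is convenient for large crawls and for appending to an existing file.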

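The above only covers project setup and basic usage. As we work further into the framework, it is natural to wonder what its advantages are and what each part does.

Scrapy structure

pipelines

This file is what we call the pipeline. After an Item is collected in a Spider, it is passed to the Item Pipeline, whose components process the Item in the defined order. Each Item Pipeline component is a Python class that implements a simple method and decides, for example, whether the Item is dropped or stored. Typical uses of an item pipeline:

Validating scraped data (checking that the item contains certain fields, such as name)
Checking for duplicates (and dropping them)
Saving the scraped results to a file or database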

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MyspiderPipeline:
    def process_item(self, item, spider):
        return item

settings

The generated file comes with comments; these are some of the basic settings.

# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# Project name
BOT_NAME = "mySpider"

SPIDER_MODULES = ["mySpider.spiders"]
NEWSPIDER_MODULE = "mySpider.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "mySpider (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum concurrency of 32 (default 16)

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay of 3 seconds
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "mySpider.pipelines.MyspiderPipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

spiders

The spider code directory, which defines the crawling logic.

import scrapy
from mySpider.items import ItcastItem


class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/"]

    def parse(self, response):
        # Get the site titles
        nodes = response.xpath('//*[@id="mCSB_1_container"]/ul/li[@*]')

Hands-on (university information)

Target site: crawl university information

base64:

aHR0cDovL3NoYW5naGFpcmFua2luZy5jbi9yYW5raW5ncy9iY3VyLzIwMjQ=
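The target address above is given in base64. A short sketch for decoding it with the standard library (the variable names are arbitrary):

```python
import base64

encoded = "aHR0cDovL3NoYW5naGFpcmFua2luZy5jbi9yYW5raW5ncy9iY3VyLzIwMjQ="

# b64decode returns bytes; decode them to get the URL as text
target_url = base64.b64decode(encoded).decode("utf-8")
print(target_url)
```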

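Variable naming

In the itcast spider created earlier, change the domain, and in items.py change the fields to match the data being collected.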

itcast.py

import scrapy
from mySpider.items import ItcastItem


class ItcastSpider(scrapy.Spider):
    name = "itcast"
    # The allowed domain must match the domain in start_urls
    allowed_domains = ["shanghairanking.cn"]
    start_urls = ["https://www.shanghairanking.cn/rankings/bcur/2024"]

    def parse(self, response):
        # Select the 2nd through 31st elements with class "align-left"
        rows = response.xpath('(//*[@class="align-left"])[position() > 1 and position() <= 31]')
        for row in rows:
            # Create a fresh item per row so yielded items do not share state
            item = ItcastItem()
            name = row.xpath('./div/div[2]/div[1]/div/div/span/text()').extract()
            description = row.xpath('./div/div[2]/p/text()').extract()
            location = row.xpath('../td[3]/text()').extract()
            # Join the extracted text nodes, then remove newlines and spaces
            item['name'] = ''.join(name).replace('\n', '').replace(' ', '')
            item['description'] = ''.join(description).replace('\n', '').replace(' ', '')
            item['location'] = ''.join(location).replace('\n', '').replace(' ', '')
            print(item)
            yield item
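One detail worth noting when cleaning values returned by extract(): the result is a list, so calling str() on it produces the list's repr, brackets and quotes included, while ''.join() produces only the text. A sketch with made-up data:

```python
# Made-up data shaped like the result of .extract() on a text node
name = ["\n        Example University\n      "]

# str() on a list gives its repr: brackets, quotes and a literal backslash-n remain
via_str = str(name).strip().replace("\\n", "").replace(" ", "")

# Joining the list gives only the text, which can then be stripped
via_join = "".join(name).strip()

print(via_str)
print(via_join)
```

The joined version also keeps spaces inside the value, which matters for names that contain more than one word.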

If you are not familiar with XPath, see my earlier blog post:

A memo of some crawler basics: XPath

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
    location = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
import csv

from itemadapter import ItemAdapter


class MyspiderPipeline:
    def __init__(self):
        # Create the CSV file in the initializer
        self.f = open('school.csv', 'w', encoding='utf-8', newline='')
        self.file_name = ['name', 'description', 'location']
        self.writer = csv.DictWriter(self.f, fieldnames=self.file_name)
        self.writer.writeheader()  # Write the header row of field names

    def process_item(self, item, spider):
        # The item must be converted to a dict before it is written
        self.writer.writerow(dict(item))
        print(item)
        return item

    def close_spider(self, spider):
        self.f.close()  # Close the file

settings.py

# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# Project name
BOT_NAME = "mySpider"

SPIDER_MODULES = ["mySpider.spiders"]
NEWSPIDER_MODULE = "mySpider.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "mySpider (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum concurrency of 32 (default 16)

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay of 3 seconds
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "mySpider.middlewares.MyspiderDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "mySpider.pipelines.MyspiderPipeline": 300,
}
LOG_LEVEL = 'WARNING'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

start.py

You can then create a start.py file in the mySpider folder. Running this file directly starts the crawl, so there is no need to type the command.

from scrapy import cmdline
cmdline.execute("scrapy crawl itcast".split())

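The data is then saved in CSV format as expected!

Some issues

Continuing the crawl and pagination

Scrapy uses yield both to hand back parsed data and to issue further requests. To implement pagination, or to keep requesting after a single request has completed, yield the follow-up request. A return at that point would end the callback immediately. If you are interested, this is worth exploring in more depth.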

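Some settings configuration

LOG_LEVEL = 'WARNING' turns off the less important log output so that we can focus on the crawled data and its analysis.

Remember to enable the pipeline before running: just uncomment the ITEM_PIPELINES block that was commented out.

Also remember to turn off robots.txt checking (ROBOTSTXT_OBEY = False).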

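File naming

The field names used for the CSV file must match the field names in items.py; otherwise the data cannot be written, because csv.DictWriter rejects keys it does not know.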
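The mismatch can be reproduced with the standard library alone. The sketch below writes to an in-memory buffer instead of school.csv, and the rows are made-up data; the second row uses a city key that is not among the field names.

```python
import csv
import io

fieldnames = ["name", "description", "location"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()

# Keys match the fieldnames: the row is written
writer.writerow({"name": "Example University", "description": "demo", "location": "Beijing"})

# A key that is not in fieldnames is rejected by default (extrasaction="raise")
try:
    writer.writerow({"name": "Other University", "city": "Shanghai"})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(buffer.getvalue())
print(mismatch_error)
```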

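That brings us to the end. This article is an introduction to the Scrapy framework; look out for follow-up articles on more advanced topics. Let's improve together!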
