Atguigu (尚硅谷) Web Scraping Notes 008

news 2025/10/19 17:04:53

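I. Handlers

A handler allows more advanced request customization than a Request object with custom headers can provide (for example, routing requests through a proxy, as in the sections below).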

# -*- coding: utf-8 -*-
# @Time : 2025/2/17 08:55
# @Author : 20250206-里奥
# @File : demo01_urllib_handler处理器的基本使用
# @Project : PythonPro17-21

import urllib.request

# Goal: fetch the Baidu home page source through a handler

url = "http://www.baidu.com"

# Send a browser User-Agent so the request looks like it came from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) Gecko/20100101 Firefox/135.0'
}

# Build a customized request object
request = urllib.request.Request(url=url, headers=headers)

# Three steps: handler -> build_opener -> open
# 1) Create the handler object
handler = urllib.request.HTTPHandler()
# 2) Build an opener from the handler
opener = urllib.request.build_opener(handler)
# 3) Call open() on the opener
response = opener.open(request)

# Read and decode the response body
content = response.read().decode('utf-8')
print(content)
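The three steps above (handler, opener, open) can be wrapped in a small helper. This is a minimal sketch and not part of the original notes: the name `fetch` and its parameters are hypothetical, and the `data:` URL in the last lines is used only so the example runs without network access.

```python
import urllib.request

DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) "
                  "Gecko/20100101 Firefox/135.0"
}


def fetch(url, handler=None, headers=None, encoding="utf-8"):
    """Fetch `url` via handler -> build_opener -> open and return the decoded body.

    `fetch` is a hypothetical helper for illustration, not a urllib API.
    """
    request = urllib.request.Request(url=url, headers=headers or DEFAULT_HEADERS)
    # 1) handler: fall back to a plain HTTPHandler, as in the demo above
    handler = handler or urllib.request.HTTPHandler()
    # 2) opener: build_opener() adds the default handlers alongside ours
    opener = urllib.request.build_opener(handler)
    # 3) open: the response is closed automatically when the block exits
    with opener.open(request) as response:
        return response.read().decode(encoding)


# Offline check with a data: URL; for a real page, call fetch("http://www.baidu.com")
text = fetch("data:text/plain;charset=utf-8,hello")
print(text)
```

Passing a different handler (for example a `ProxyHandler`) changes how the request is sent without touching the rest of the code, which is the main reason to go through an opener instead of calling `urlopen` directly.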

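II. Proxy servers

Common uses:

Bypass access restrictions on your own connection, e.g., to reach sites hosted abroad

Access resources on an internal network

Speed up access

Hide your real IP address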

# -*- coding: utf-8 -*-
# @Time : 2025/2/17 09:13
# @Author : 20250206-里奥
# @File : demo02_urllib_代理的基本使用
# @Project : PythonPro17-21

import urllib.request

# Search Baidu for "ip" to check which IP address the request appears to come from.
# The proxy below is registered for 'http' only, so the URL uses http rather than https.
url = 'http://www.baidu.com/s?word=ip'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) Gecko/20100101 Firefox/135.0'
}

# Build a customized request object
request = urllib.request.Request(url=url, headers=headers)

# Proxy settings are a dict of the form {'scheme': 'ip:port'} (ASCII colon, not '：').
# This address came from a proxy service (Kuaidaili) and has likely expired;
# replace it with a live proxy.
proxies = {
    'http': '117.42.94.226:19198'
}

# handler -> build_opener -> open
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
# Open through the opener; urllib.request.urlopen() would not use this handler
response = opener.open(request)

# Read the response body
content = response.read().decode('utf-8')

# Save it to a local file
with open('代理.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
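A malformed proxy address (for instance a full-width colon typed with a Chinese input method, or a port above 65535) only fails once the request is sent. The sketch below validates the address first. The helpers `parse_proxy` and `build_proxy_opener` are hypothetical names introduced here, and `127.0.0.1:8080` is a placeholder address, not a working proxy.

```python
import urllib.request


def parse_proxy(address):
    """Validate an 'ip:port' string and return (host, port).

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    # rpartition splits on the last ASCII colon; a full-width '：' will not match
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'ip:port' with an ASCII colon, got {address!r}")
    port_number = int(port)
    if not 1 <= port_number <= 65535:
        raise ValueError(f"port must be in 1-65535, got {port_number}")
    return host, port_number


def build_proxy_opener(address, scheme="http"):
    """Return an opener that routes `scheme` requests through the given proxy."""
    host, port = parse_proxy(address)
    handler = urllib.request.ProxyHandler({scheme: f"{host}:{port}"})
    return urllib.request.build_opener(handler)


# Placeholder address: building the opener does not contact the proxy
opener = build_proxy_opener("127.0.0.1:8080")

# Both of these are rejected before any request is made
for bad in ("117.42.94.226：19198", "117.42.94.226:19198111"):
    try:
        parse_proxy(bad)
    except ValueError as error:
        print("rejected:", error)
```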

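III. Proxy pools

First, pick a random proxy from the pool:

# -*- coding: utf-8 -*-
# @Time : 2025/2/17 09:48
# @Author : 20250206-里奥
# @File : demo03_urllib_代理池
# @Project : PythonPro17-21

import random

# A proxy pool is a list of proxy dicts; choosing one at random spreads requests across proxies.
# The two entries are placeholders: replace them with live 'ip:port' values (ports must be 1-65535).
proxies_pool = [
    {'http': '117.42.94.226:19198111'},
    {'http': '117.42.94.226:19198222'},
]

# Pick one proxy at random
proxies = random.choice(proxies_pool)

print(proxies)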

The full script then sends the request through the chosen proxy:

# -*- coding: utf-8 -*-
# @Time : 2025/2/17 09:48
# @Author : 20250206-里奥
# @File : demo03_urllib_代理池
# @Project : PythonPro17-21

import random
import urllib.request

# A proxy pool is a list of proxy dicts; choosing one at random spreads requests across proxies.
# The two entries are placeholders: replace them with live 'ip:port' values (ports must be 1-65535).
proxies_pool = [
    {'http': '117.42.94.226:19198111'},
    {'http': '117.42.94.226:19198222'},
]

# Pick one proxy at random
proxies = random.choice(proxies_pool)

# print(proxies)

url = "http://www.baidu.com"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) Gecko/20100101 Firefox/135.0'
}

request = urllib.request.Request(url=url, headers=headers)

# A proxy can only be applied through a handler
handler = urllib.request.ProxyHandler(proxies=proxies)

# Build the opener and send the request through it
opener = urllib.request.build_opener(handler)
response = opener.open(request)

# Read the response body
content = response.read().decode('utf-8')

# Save it to a local file
with open('代理池.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
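With a single `random.choice`, one dead proxy makes the whole script fail. A possible extension, sketched here under assumptions, is to try the proxies in random order and stop at the first one that works. The function `fetch_via_pool` and its `timeout` parameter are hypothetical and not part of the original notes; the `data:` URL at the end needs no network, so that line only exercises the control flow.

```python
import random
import urllib.request


def fetch_via_pool(request, proxies_pool, timeout=5):
    """Try each proxy once, in random order; return the first successful body.

    Hypothetical helper for illustration. `request` may be a URL string or a
    urllib.request.Request object.
    """
    last_error = None
    # random.sample with k=len(pool) yields a shuffled copy without repeats
    for proxies in random.sample(proxies_pool, k=len(proxies_pool)):
        handler = urllib.request.ProxyHandler(proxies)
        opener = urllib.request.build_opener(handler)
        try:
            with opener.open(request, timeout=timeout) as response:
                return response.read().decode("utf-8")
        except OSError as error:
            # URLError and socket timeouts are OSError subclasses
            last_error = error
    raise RuntimeError("every proxy in the pool failed") from last_error


# Placeholder pool; a data: URL is not proxied, so this runs offline
pool = [{"http": "127.0.0.1:8080"}, {"http": "127.0.0.1:8081"}]
print(fetch_via_pool("data:text/plain;charset=utf-8,ok", pool))
```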

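IV. Parsing

1. XPath

1) Install the browser plugin

File: xpath.crx

In Chrome, open More tools > Extensions and drag the xpath.crx file onto the page.

To check the installation, press Ctrl + Shift + X in Chrome.

A small black box appearing on the page means the plugin was installed successfully.

Error: "Package is invalid" (程序包无效)

Fix: change the file extension from .crx to .zip.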

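2) Usage

a. Install the lxml library

Install location: lxml must be installed for the same Python interpreter that the project uses.

To find the interpreter path (PyCharm): File > Settings > Project > Python Interpreter

Installation method:

Install lxml from cmd.

For example, look up the interpreter path as described above (lxml must be installed under that path).

pip is located in the interpreter's Scripts folder, so run the install from there.

cmd > D: > dir > cd SRC > dir > cd ... > (continue until you are inside the Scripts folder) > pip install lxml

To check that lxml was installed correctly:

1) Create a new Python file

2) Import lxml.etree (if no error is raised, lxml is installed)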
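The same check can be scripted. The sketch below prints which interpreter is running and whether it can import lxml; `is_installed` is a hypothetical helper name. If lxml turns out to be missing, running `python -m pip install lxml` with that same interpreter installs it for that interpreter without navigating to the Scripts folder.

```python
import importlib.util
import sys


def is_installed(package):
    """Return True if the top-level `package` can be imported by this interpreter."""
    return importlib.util.find_spec(package) is not None


# The interpreter running this script; lxml must be installed for this one
print("interpreter:", sys.executable)
print("lxml installed:", is_installed("lxml"))
```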

3) Parsing with XPath

There are two kinds of input:

1. A local file

Use etree.parse():

tree = etree.parse('path/to/file.html')

1) Create an HTML file

2) Create a Python file

# -*- coding: utf-8 -*-
# @Time : 2025/2/17 12:46
# @Author : 20250206-里奥
# @File : demo05_解析本地文件
# @Project : PythonPro17-21

from lxml import etree

# Parse the local HTML file
tree = etree.parse('本地文件解析测试.html')

print(tree)

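Error: etree.parse() fails with a syntax error.

Cause: etree.parse() uses a strict XML parser by default, so every opening tag needs a matching closing tag.

Fix: close the single (void) tags in the HTML file, for example by writing <meta charset="UTF-8" /> instead of <meta charset="UTF-8">.

After this change the file parses without error.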
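An alternative to editing the HTML file is to pass lxml's HTML parser to `etree.parse()`, since that parser tolerates unclosed tags. The sketch below is not from the original notes: the sample markup and the file name `sample.html` are made up, and the file is written to a temporary folder so the example is self-contained.

```python
import os
import tempfile

from lxml import etree

# Made-up sample page: <meta> is left unclosed, as is common in HTML
SAMPLE = (
    '<html><head><meta charset="UTF-8"><title>test</title></head>'
    '<body><ul><li id="1">Beijing</li></ul></body></html>'
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(SAMPLE)

    # Default (XML) parser: the unclosed <meta> is a syntax error
    try:
        etree.parse(path)
        strict_ok = True
    except etree.XMLSyntaxError as error:
        strict_ok = False
        print("strict parse failed:", error)

    # HTML parser: accepts the same file unchanged
    tree = etree.parse(path, etree.HTMLParser())
    cities = tree.xpath("//ul/li/text()")

print(cities)
```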

2. Server response data

response.read().decode('utf-8')

Use etree.HTML() on the decoded string:

# -*- coding: utf-8 -*-
# @Time : 2025/2/17 15:26
# @Author : 20250206-里奥
# @File : demo06_解析_百度一下
# @Project : PythonPro17-21

import urllib.request

# 1. Fetch the page source
# 2. Parse it with etree.HTML()
# 3. Print the result

url = "https://www.baidu.com/"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) Gecko/20100101 Firefox/135.0'
}

request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)

content = response.read().decode('utf-8')

# print(content)

# Parse the page source to extract the data we want
from lxml import etree

# Parse the server's response
tree = etree.HTML(content)

# xpath() always returns a list;
# only one element matches here, so take it by index with [0]
result = tree.xpath('//input[@id = "su"]/@value')[0]

print(result)
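Because `xpath()` always returns a list, indexing with `[0]` raises an IndexError when nothing matches, for example if the page layout changes. A small guard avoids that. In the sketch below, `first` is a hypothetical helper and the HTML fragment is a made-up stand-in for a downloaded page.

```python
from lxml import etree

# Made-up fragment standing in for a downloaded page
content = (
    '<html><body><form>'
    '<input type="submit" id="su" value="百度一下">'
    '</form></body></html>'
)
tree = etree.HTML(content)


def first(tree, expression, default=None):
    """Return the first XPath match, or `default` when nothing matches."""
    matches = tree.xpath(expression)
    return matches[0] if matches else default


print(first(tree, '//input[@id="su"]/@value'))
print(first(tree, '//input[@id="missing"]/@value', default="(not found)"))
```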

2. XPath syntax

# -*- coding: utf-8 -*-
# @Time : 2025/2/17 12:46
# @Author : 20250206-里奥
# @File : demo05_解析本地文件
# @Project : PythonPro17-21

from lxml import etree

# Parse the local HTML file
tree = etree.parse('本地文件解析测试.html')

# print(tree)

# General form: tree.xpath('XPath expression')
# Find the li elements under ul
# li_list = tree.xpath('//ul/li')
#
# # Inspect the list and its length
# print(li_list)
# print(len(li_list))

# Find li elements that have an id attribute
# text() returns the text inside the element
# li_list = tree.xpath('//ul/li[@id]/text()')

# Find the li element whose id is 1; the value "1" must be quoted
# li_list = tree.xpath('//ul/li[@id = "1"]/text()')

# Get the class attribute of the li element whose id is 1
# li_list = tree.xpath('//ul/li[@id = "1"]/@class')

# Find li elements whose id contains "1"
# li_list = tree.xpath('//ul/li[contains(@id,"1")]/text()')

# Find li elements whose id starts with "1"
# li_list = tree.xpath('//ul/li[starts-with(@id,"1")]/text()')

# Find li elements whose id is 111 and whose class is c3
# li_list = tree.xpath('//ul/li[@id = "111" and @class = "c3"]/text()')

# Find li elements whose id is 111 or whose class is c1
li_list = tree.xpath('//ul/li[@id = "111"]/text() | //ul/li[@class = "c1"]/text()')
print(li_list)
print(len(li_list))

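1) Path queries

// : selects all descendant nodes, regardless of depth

/ : selects direct child nodes only

# General form: tree.xpath('XPath expression')
# Find the li elements under ul
# li_list = tree.xpath('//ul/li')

2) Predicate queries

[@id] : selects elements that have an id attribute

# Find li elements that have an id attribute
# text() returns the text inside the element
# li_list = tree.xpath('//ul/li[@id]/text()')

3) Attribute queries

# Find the li element whose id is 1; the value "1" must be quoted
# li_list = tree.xpath('//ul/li[@id = "1"]/text()')

# Get the class attribute of the li element whose id is 1
# li_list = tree.xpath('//ul/li[@id = "1"]/@class')

4) Fuzzy queries

# Find li elements whose id contains "1"
# li_list = tree.xpath('//ul/li[contains(@id,"1")]/text()')

# Find li elements whose id starts with "1"
# li_list = tree.xpath('//ul/li[starts-with(@id,"1")]/text()')

5) Content queries

text() : returns the text inside an element

6) Logical queries

# Find li elements whose id is 111 and whose class is c3
# li_list = tree.xpath('//ul/li[@id = "111" and @class = "c3"]/text()')

# Find li elements whose id is 111 or whose class is c1
li_list = tree.xpath('//ul/li[@id = "111"]/text() | //ul/li[@class = "c1"]/text()')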
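The query types above can be tried side by side on a small document. The original test HTML file is not shown in these notes, so the markup below is made up for illustration, and the dictionary keys are arbitrary labels. The last entry is an extra variant not in the notes: XPath 1.0 also allows `or` inside a single predicate, which selects the same elements as the `|` union of two paths.

```python
from lxml import etree

# Made-up sample markup; the original test file is not shown in these notes
SAMPLE = """
<ul>
    <li id="1" class="c1">Beijing</li>
    <li id="111" class="c3">Shanghai</li>
    <li id="21" class="c1">Shenzhen</li>
    <li>Wuhan</li>
</ul>
"""
tree = etree.HTML(SAMPLE)

queries = {
    "path": '//ul/li',
    "has_id": '//ul/li[@id]/text()',
    "id_equals": '//ul/li[@id = "1"]/text()',
    "class_of_id": '//ul/li[@id = "1"]/@class',
    "contains": '//ul/li[contains(@id, "1")]/text()',
    "starts_with": '//ul/li[starts-with(@id, "1")]/text()',
    "and": '//ul/li[@id = "111" and @class = "c3"]/text()',
    "union": '//ul/li[@id = "111"]/text() | //ul/li[@class = "c1"]/text()',
    "or": '//ul/li[@id = "111" or @class = "c1"]/text()',
}

# Run every expression and keep the resulting lists by label
results = {name: tree.xpath(expression) for name, expression in queries.items()}

for name, found in results.items():
    print(f"{name}: {len(found)} match(es): {found}")
```

Note that `contains` also matches the id "21", while `starts-with` does not, which is the practical difference between the two fuzzy queries.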
