
Data Collection: The BeautifulSoup Library

news 2025/10/29 6:55:56

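I. Overview of the BeautifulSoup Library

BeautifulSoup is a Python library for parsing HTML and XML documents. It makes it easy to extract the data you need from web pages.

The table below summarizes its main functions and methods.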

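Category	Method/Attribute	Description
📁 Initialization	`BeautifulSoup(markup, parser)`	Creates a BeautifulSoup object; `markup` is the HTML/XML string and `parser` names the parser.
🔍 Finding elements	`find(name, attrs, recursive, string, **kwargs)`	Returns the first matching tag.
	`find_all(name, attrs, recursive, string, limit, **kwargs)`	Returns a list of all matching tags; `limit` caps the number of results.
	`find_parent()`, `find_parents()`	Find the parent, or all ancestors, of the current element.
	`find_next_sibling()`, `find_previous_sibling()`	Find the next or previous sibling of the current element.
	`find_all_next()`, `find_all_previous()`	Find all matching elements that come after or before the current element.
	`select(css_selector)`	Finds elements with a CSS selector; returns a list.
	`select_one(css_selector)`	Returns the first element that matches a CSS selector.
📥 Extracting information	`get_text()`	Returns all the text in a tag and its descendants.
	`.string`	Returns the tag's only string child; returns `None` when the tag has several children.
	`.attrs`	Returns all of the tag's attributes as a dictionary.
	`.get(key)`	Returns the value of the named attribute.
🛠️ Modifying the document	`.append(new_tag)`	Appends a new child to the end of the tag's children.
	`.insert(position, new_tag)`	Inserts a new child at the given position.
	`.extract()`	Removes the tag from the tree and returns it.
	`.decompose()`	Removes the tag and destroys it together with all of its contents.
	`.replace_with(new_content)`	Replaces the current tag with new content.
	`.wrap(tag)`	Wraps the current element in the given new tag.
	`.unwrap()`	Removes the tag itself but leaves its children in its place (the opposite of `.wrap()`).
📄 Output	`.prettify()`	Formats the parse tree as an indented Unicode string that is easy to read.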
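Several rows of the table (CSS selectors, navigation, modification) are easier to picture with a small example. The following is a minimal sketch, not taken from the original article: the menu markup is invented for illustration, and the standard-library `html.parser` is used only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
html = """
<ul id="menu">
  <li class="item"><a href="/home">Home</a></li>
  <li class="item active"><a href="/docs">Docs</a></li>
  <li class="item"><a href="/about">About</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors: select() returns a list, select_one() a single tag (or None).
hrefs = [a["href"] for a in soup.select("li.item > a")]
active = soup.select_one("li.active")

# Navigation relative to an element.
parent_id = active.find_parent("ul")["id"]
next_text = active.find_next_sibling("li").get_text()
prev_text = active.find_previous_sibling("li").get_text()

# Modification: unwrap() drops the <a> tag but keeps its text,
# decompose() removes the last <li> together with everything inside it.
active.a.unwrap()
soup.find_all("li")[-1].decompose()
remaining = [li.get_text() for li in soup.find_all("li")]

print(hrefs)
print(parent_id, prev_text, next_text)
print(active)
print(remaining)
```

Note that `find_next_sibling("li")` skips the whitespace strings between the `<li>` tags because it filters by tag name.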

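II. Creating a BeautifulSoup Object

How do you create one?

soup = BeautifulSoup(markup, features)

Parameters:

markup: required; the content to parse.
features: optional; names the parser to use (the lxml parser is recommended).
Other parameters are introduced as they come up.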
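Because lxml is a separate package, a script that names it in `features` fails on machines where it is not installed. The sketch below shows one possible way to handle that; `make_soup` is a hypothetical helper, not part of bs4, and it falls back to the `html.parser` that ships with Python.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, features=preferred)
    except FeatureNotFound:
        # The requested parser is not installed; html.parser is always available.
        return BeautifulSoup(markup, features="html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Different parsers can repair broken markup differently, so a fallback like this is only reasonable when the input is fairly clean.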

from bs4 import BeautifulSoup  # The package is named bs4; BeautifulSoup is a class in that package

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

print(html_doc)

Running it prints:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

This is just the raw string printed as-is. It has no indentation that reflects the nesting, so the structure is hard to follow.

from bs4 import BeautifulSoup  # The package is named bs4; BeautifulSoup is a class in that package

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object; features names the parser used to process the HTML or XML document
soup = BeautifulSoup(html_doc, features='lxml')
print("--------------- Printing soup directly ---------------")
print(soup)  # Prints the whole document as a string; the formatting may differ from the original input
print("--------------- Type of soup (the BeautifulSoup object) ---------------")
print(type(soup))
print("--------------- Pretty-printing with prettify() ---------------")
print(soup.prettify())

Output:

--------------- Printing soup directly ---------------
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;
and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
--------------- Type of soup (the BeautifulSoup object) ---------------
<class 'bs4.BeautifulSoup'>
--------------- Pretty-printing with prettify() ---------------
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;
and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

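III. The prettify() Method of the BeautifulSoup Object

"Prettify" means to make something look nicer.

What it does: converts messy HTML into a clearly structured, consistently indented form, with each tag and each string on its own line.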
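As a small illustration of that effect, the sketch below pretty-prints a one-line fragment. The markup is invented for the example, and `html.parser` is assumed only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# A compact, single-line fragment (hypothetical sample data).
compact = "<ul><li>One</li><li>Two</li></ul>"
soup = BeautifulSoup(compact, "html.parser")

pretty = soup.prettify()
print(pretty)  # each tag and string on its own line, nested items indented
```

`prettify()` is meant for reading. Because it inserts whitespace, its result is not a byte-for-byte equivalent of the original markup.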

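IV. The find() and find_all() Methods: Searching the Document

find() and find_all() are the two methods used most often when parsing and extracting HTML/XML data. They let you quickly locate the tags or content you need.

The table below summarizes their main characteristics and differences:

Feature	`find_all()`	`find()`
Return value	A list of all matching elements 📄	The first matching element 🎯
When nothing matches	Returns an empty list `[]`	Returns `None`
Main parameters	`name`, `attrs`, `string`, `limit`, `recursive`	`name`, `attrs`, `string`, `recursive`
Typical use	Extracting many elements of the same kind, such as all links or all list items	Extracting a unique element, such as the title or the first image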
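The differences in the table can be checked directly. This is a minimal sketch on invented markup (parsed with `html.parser` so it runs without lxml); it shows the two "not found" results as well as the `limit` and `recursive` parameters.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one <p> nested in a <section>, two directly in the <div>.
html = "<div><section><p>one</p></section><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Nothing matches: find() gives None, find_all() gives an empty list.
missing_one = soup.find("table")
missing_all = soup.find_all("table")

# limit stops the search after the given number of matches (document order).
first_two = [p.get_text() for p in soup.find_all("p", limit=2)]

# recursive=False only looks at direct children of the tag being searched.
direct = [p.get_text() for p in soup.div.find_all("p", recursive=False)]

print(missing_one, missing_all)
print(first_two)
print(direct)
```

Since `find()` can return `None`, code that goes on to use its result should check for that case first.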

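1. Searching by tag name (the most direct way to search)

The name parameter identifies the tags to look for. It accepts a string, a regular expression, or a list.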
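The three kinds of value for `name` can be sketched as follows. The markup is invented for the example, and `html.parser` is assumed so the snippet does not depend on lxml.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Top</h1><h2>Sub</h2><p>Text <b>bold</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# A string matches that exact tag name.
by_string = [tag.name for tag in soup.find_all("p")]

# A compiled regular expression is searched against every tag name.
by_regex = [tag.name for tag in soup.find_all(re.compile(r"^h\d$"))]

# A list matches any of the names it contains.
by_list = [tag.name for tag in soup.find_all(["title", "b"])]

print(by_string, by_regex, by_list)
```

The pattern is anchored with `^` and `$` on purpose: an unanchored `h` would also match `html` and `head`.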

1. find_all() returns every matching element (as a list)

from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story" id="first-p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# Find all <a> tags
all_links = soup.find_all('a')
print(all_links)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2. find() returns the first matching element (a single tag, not a list)

# Find the first <p> tag
first_paragraph = soup.find('p')
print(first_paragraph)

Output:

<p class="title"><b>The Dormouse's story</b></p>

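2. Searching by attributes with attrs

The attrs parameter specifies the attributes to match. It takes a dictionary whose keys are attribute names and whose values are the values those attributes must have.

Attribute filters can also be passed as keyword arguments.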
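The keyword-argument form might look like the sketch below. The three links are the same as in the example document, except for a `data-rank` attribute added here purely for illustration; `html.parser` is assumed so the snippet runs without lxml.

```python
import re

from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" class="sister" id="link1" data-rank="1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Any attribute name that is a valid Python identifier can be a keyword argument.
by_id = soup.find(id="link2")

# class is a reserved word in Python, so the keyword is spelled class_.
by_class = soup.find_all(class_="sister")

# Values may also be regular expressions.
by_pattern = soup.find_all(href=re.compile(r"/(elsie|tillie)$"))

# Names such as data-rank are not valid identifiers and need the attrs dictionary.
by_data = soup.find_all(attrs={"data-rank": "1"})

print(by_id.get_text(), len(by_class))
print([a["id"] for a in by_pattern], [a["id"] for a in by_data])
```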

# Find the tag whose id is link2
element = soup.find(attrs={'id': 'link2'})
print(element)

Output:

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

# Find the tag whose href is http://example.com/elsie
links = soup.find(attrs={'href': 'http://example.com/elsie'})
print(links)

Output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

elements = soup.find(attrs={'class': 'sister'})
print(elements)

Output (find() returns only the first match):

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Matching several attributes at once:

elements = soup.find(attrs={'class':'sister','id':'link3'})
print(elements)

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Using the name and attrs parameters together:

# Search with both name and attrs
element = soup.find(name='a', attrs={'id': 'link2'})
print(element)

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

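V. Three Core Attributes and Methods for Extracting Information

.string, .attrs, and .get() are the three core attributes/methods BeautifulSoup offers for extracting information.

🎯 Core concepts compared

Attribute/Method	Purpose	Returns	When to use
`.string`	Extracts the plain text inside a tag	A string or None	The tag contains a single string (no mix of text and nested tags)
`.attrs`	Gets all of a tag's attributes	A dictionary	You need to inspect every attribute of a tag
`.get()`	Safely gets a single attribute value	The attribute value or a default	You want one specific attribute without risking an exception

1. Basic usage of .string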

from bs4 import BeautifulSoup

html = """
<div>
    <p>Simple text</p>
    <p>Complex <b>text</b></p>
    <p><!-- a comment --></p>
</div>
"""
soup = BeautifulSoup(html, 'lxml')

# Case 1: the tag contains only text
simple_p = soup.find('p')
print(simple_p.string)  # Output: Simple text

# Case 2: the tag mixes text with a nested tag
complex_p = soup.find_all('p')[1]
print(complex_p.string)  # Output: None (the <p> has two children: a string and a <b> tag)

# Case 3: the tag contains only a comment
comment_p = soup.find_all('p')[2]
print(comment_p.string)  # Output: " a comment " (the comment's content)

2. Limitations of .string

html = """
<p>This text contains <b>bold</b> content</p>
<p>Several pieces of text <span>and a nested</span> tag</p>
"""
soup = BeautifulSoup(html, 'lxml')

p_tags = soup.find_all('p')
for i, p in enumerate(p_tags):
    print(f".string of <p> #{i+1}: {p.string}")
# Output:
# .string of <p> #1: None
# .string of <p> #2: None

Alternative: use .text or .get_text()

for i, p in enumerate(p_tags):
    print(f".text of <p> #{i+1}: {p.text}")
# Output:
# .text of <p> #1: This text contains bold content
# .text of <p> #2: Several pieces of text and a nested tag
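`get_text()` also takes a `separator` and a `strip` argument, which help when the raw text contains stray whitespace. A minimal sketch on invented markup (parsed with `html.parser` so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = "<div><p> First <b>bold</b> part </p><p>Second part</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Plain get_text() concatenates every string exactly as it appears.
raw = div.get_text()

# strip=True trims each string and drops empty ones; separator joins the rest.
joined = div.get_text(separator=" | ", strip=True)

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(div.stripped_strings)

print(repr(raw))
print(joined)
print(pieces)
```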

3. The .attrs attribute

html = """
<a href="https://example.com" id="link1" class="external btn-primary" data-info='{"type": "social"}'>
    Example link
</a>
<img src="image.jpg" alt="Example image" width="100" height="80">
"""
soup = BeautifulSoup(html, 'lxml')

# Get all attributes
link = soup.find('a')
print(link.attrs)
# Output: {'href': 'https://example.com', 'id': 'link1', 'class': ['external', 'btn-primary'], 'data-info': '{"type": "social"}'}

img = soup.find('img')
print(img.attrs)
# Output: {'src': 'image.jpg', 'alt': 'Example image', 'width': '100', 'height': '80'}

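4. Accessing a specific attribute

# Access through the .attrs dictionary
print(link.attrs['href'])     # "https://example.com"
print(link.attrs['id'])       # "link1"

# Multi-valued attributes (such as class) come back as a list
print(link.attrs['class'])    # ['external', 'btn-primary']

# Check whether an attribute exists
if 'data-info' in link.attrs:
    print("The data-info attribute exists")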
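Two follow-up steps often come after reading these attributes: testing or rejoining the `class` list, and decoding a `data-*` value that holds JSON. The sketch below assumes the same kind of link as in the example above; using the standard `json` module for decoding is this sketch's choice, not something the original article prescribes.

```python
import json

from bs4 import BeautifulSoup

html = (
    '<a href="https://example.com" class="external btn-primary" '
    "data-info='{\"type\": \"social\"}'>Example link</a>"
)
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# class is a list, so membership tests work on whole class names.
has_external = "external" in link.get("class", [])

# Join the list to get back the original attribute string.
class_string = " ".join(link.get("class", []))

# The attribute value is plain text; decode it to work with the JSON inside.
info = json.loads(link["data-info"])

print(has_external, class_string, info["type"])
```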

5. The .get() method

html = """
<a href="https://example.com" id="link1" class="btn">
    Example link
</a>
<div data-user="john" data-role="admin">
    User info
</div>
"""
soup = BeautifulSoup(html, 'lxml')
link = soup.find('a')
div = soup.find('div')

# Attributes that exist
print(link.get('href'))      # "https://example.com"
print(link.get('id'))        # "link1"

# Attributes that do not exist
print(link.get('title'))     # None (the default return value)
print(link.get('title', 'Default title'))  # "Default title" (an explicit default)

# data-* attributes
print(div.get('data-user'))  # "john"
print(div.get('data-role'))  # "admin"

6. .get() compared with direct access

# Safe: use .get()
print(link.get('target'))           # None (no error)
print(link.get('target', '_self'))  # "_self" (the default is returned)

# Risky: direct access
try:
    print(link['target'])           # Raises KeyError
except KeyError as e:
    print(f"Error: {e}")

7. Important notes

1. The .string pitfall

html = "<p>Text 1<span>Text 2</span></p>"
soup = BeautifulSoup(html, 'lxml')
p_tag = soup.find('p')

print(p_tag.string)    # None (the tag contains a nested tag)
print(p_tag.text)      # "Text 1Text 2" (recommended)

2. .attrs returns a dictionary

html = '<div id="main" class="container fluid" data-value="123"></div>'
soup = BeautifulSoup(html, 'lxml')
div_tag = soup.find('div')

attrs = div_tag.attrs
print(type(attrs))        # <class 'dict'>
print(attrs['id'])        # "main"
print(attrs['class'])     # ['container', 'fluid'] (note: class is a list)

3. The safety advantage of .get()

html = '<a href="/page">Link</a>'
soup = BeautifulSoup(html, 'lxml')
a_tag = soup.find('a')

# Safe
print(a_tag.get('target', '_self'))  # Always works

# Risky
# print(a_tag['target'])  # Raises KeyError if target does not exist
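The same concern applies when looping over many tags, where some may lack the attribute you want. The sketch below shows three possible ways to handle it on invented markup (`html.parser` assumed so it runs without lxml).

```python
from bs4 import BeautifulSoup

html = """
<a href="/page">Page</a>
<a name="anchor-only">No href here</a>
<a href="/other" target="_blank">Other</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: test with has_attr() before indexing.
hrefs = [a["href"] for a in soup.find_all("a") if a.has_attr("href")]

# Option 2: let find_all() filter; href=True matches any tag that has an href.
hrefs_filtered = [a["href"] for a in soup.find_all("a", href=True)]

# Option 3: use .get() with a default when every tag should yield a value.
targets = [a.get("target", "_self") for a in soup.find_all("a")]

print(hrefs)
print(hrefs_filtered)
print(targets)
```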

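VI. Shortcut Syntax for Accessing Tags (Dot Access)

Dot access is syntactic sugar provided by BeautifulSoup. (Syntactic sugar is a shorter, more convenient, more readable way of writing something a language can already do. It adds no new capability; it only wraps or simplifies existing functionality.)

Its characteristics:

Concise and intuitive: access a tag directly by its name
Returns the first match: only the first tag with that name in the document
Returns the complete tag: the tag itself, its attributes, and its contents
Good for quick tests: handy when debugging or in short scripts

For production code, prefer the more explicit find() or select_one() methods, which make the code's intent clearer.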
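To make the relationship concrete, the sketch below checks that dot access, `find()`, and `select_one()` all return the same tag, and then shows a filtered query that dot access cannot express. The markup is invented, and `html.parser` is assumed so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<div><p>First</p><p class="lead">Second</p></div>'
soup = BeautifulSoup(html, "html.parser")

# All three forms return the first <p> in the document.
via_dot = soup.p
via_find = soup.find("p")
via_css = soup.select_one("p")
same_tag = via_dot is via_find is via_css

# The explicit forms can also filter, which dot access cannot do.
lead_find = soup.find("p", class_="lead")
lead_css = soup.select_one("p.lead")

print(same_tag)
print(lead_find.get_text(), lead_css.get_text())
```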

1. Basic tag access

from bs4 import BeautifulSoup

html = """
<div>
    <h1>Main title</h1>
    <h2>Subtitle</h2>
    <p>Paragraph content</p>
</div>
"""
soup = BeautifulSoup(html, features='lxml')
print(soup.p)
print(soup.h1)
print(soup.h2)

Output:

<p>Paragraph content</p>
<h1>Main title</h1>
<h2>Subtitle</h2>

2. Accessing nested tags

from bs4 import BeautifulSoup

html = """
<div class="article">
<header>
<h1>Article title</h1>
</header>
<div class="content">
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
</div>
"""
soup = BeautifulSoup(html, features='lxml')

print("Contents of the first div tag:\n", soup.div)

Output:

Contents of the first div tag:
 <div class="article">
<header>
<h1>Article title</h1>
</header>
<div class="content">
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
</div>

print(soup.header)

Output:

<header>
<h1>Article title</h1>
</header>

print(soup.h1)

Output:

<h1>Article title</h1>

print(soup.p)

Output:

<p>First paragraph</p>

3. Accessing tags that have attributes

from bs4 import BeautifulSoup

html = """
<a href="/page1" class="link">Link 1</a>
<a href="/page2" class="link active">Link 2</a>
<img src="image1.jpg" alt="Image 1">
<img src="image2.jpg" alt="Image 2">
"""
soup = BeautifulSoup(html, features='lxml')
print(soup.a)

Output:

<a class="link" href="/page1">Link 1</a>

print(soup.img)

Output:

<img alt="Image 1" src="image1.jpg"/>

4. Important notes

1. Only the first match is returned

html = "<p>First paragraph</p><p>Second paragraph</p><p>Third paragraph</p>"
soup = BeautifulSoup(html, 'lxml')

print(soup.p)  # Prints only: <p>First paragraph</p>

2. None is returned when the tag is not found

html = "<div>No p tag here</div>"
soup = BeautifulSoup(html, 'lxml')

result = soup.p
print(result)        # None
print(type(result))  # <class 'NoneType'>
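One consequence of the None result is that continuing a dot chain through a missing tag raises AttributeError. The sketch below shows the failure and one possible guard; the markup is invented, and `html.parser` is assumed so it runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>No list here</p></div>", "html.parser")

# soup.ul is None, so asking None for .li fails.
try:
    soup.ul.li
    outcome = "found"
except AttributeError:
    outcome = "AttributeError"

# A guarded lookup: check each step before going deeper.
ul = soup.find("ul")
first_item = ul.find("li") if ul is not None else None

print(outcome)
print(first_item)
```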

3. Chained access is supported

html = """
<div>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')

print(soup.ul.li)  # <li>Item 1</li> (the first li of the first ul)
