BeautifulSoup4 Usage and Examples

news 2025/9/9 10:51:23

BeautifulSoup4 is a Python library for parsing HTML and XML documents. It extracts data from web pages, which makes it a good fit for web crawling and data-scraping tasks.

Basic Usage Examples

python

import requests
from bs4 import BeautifulSoup

# Fetch the page content
url = "https://example.com"
response = requests.get(url, timeout=10)
html_content = response.text

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Look up elements
title = soup.title            # the <title> tag
title_text = soup.title.text  # the title text

# Find by tag name
first_paragraph = soup.p             # first <p> tag
all_paragraphs = soup.find_all('p')  # all <p> tags

# Find by class name
elements = soup.find_all(class_='class-name')

# Find by ID
element = soup.find(id='element-id')

# Extract an attribute
link = soup.a                                # first <a> tag, or None if there is none
href = link.get('href') if link else None    # value of the href attribute

# Extract text
text = soup.get_text()
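
The same lookups can be tried without a network request by parsing an inline string. The sketch below is only an illustration: the markup in SAMPLE_HTML and every id, class, and href in it are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, invented for this sketch.
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p id="intro" class="lead">First paragraph.</p>
    <p class="lead">Second paragraph.</p>
    <a href="/docs">Docs</a>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_text = soup.title.text                       # text of the <title> tag
paragraph_count = len(soup.find_all("p"))          # number of <p> tags
lead_texts = [p.get_text() for p in soup.find_all(class_="lead")]
intro_text = soup.find(id="intro").get_text()      # lookup by id
first_href = soup.a.get("href")                    # attribute of the first <a>
missing = soup.find(id="no-such-id")               # find() returns None on no match

print(title_text, paragraph_count, lead_texts, intro_text, first_href, missing)
```

Because find() returns None when nothing matches, it is worth checking the result before calling methods on it when parsing pages you do not control.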

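Complete Example Program

The following example program uses BeautifulSoup4 to scrape a page's title and links and shows them in a small Tkinter window: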

python

import requests
from bs4 import BeautifulSoup
import tkinter as tk
from tkinter import ttk, messagebox


class WebScraperApp:
    def __init__(self, root):
        self.root = root
        self.root.title("BeautifulSoup4 Web Scraper")
        self.root.geometry("600x400")

        # Build the interface widgets
        self.create_widgets()

    def create_widgets(self):
        # URL entry
        ttk.Label(self.root, text="Enter a URL:").pack(pady=5)
        self.url_entry = ttk.Entry(self.root, width=50)
        self.url_entry.insert(0, "https://")
        self.url_entry.pack(pady=5)

        # Scrape button
        self.scrape_button = ttk.Button(self.root, text="Scrape page", command=self.scrape_website)
        self.scrape_button.pack(pady=10)

        # Result area: the text box and its scrollbar share one frame so the
        # scrollbar sits beside the text instead of below it
        ttk.Label(self.root, text="Results:").pack(pady=5)
        result_frame = ttk.Frame(self.root)
        result_frame.pack(pady=5, padx=10, fill=tk.BOTH, expand=True)

        self.result_text = tk.Text(result_frame, height=15, width=70)
        scrollbar = ttk.Scrollbar(result_frame, orient=tk.VERTICAL, command=self.result_text.yview)
        scrollbar.pack(side=tk.RIGHT, fill=tk.Y)
        self.result_text.pack(side=tk.LEFT, fill=tk.BOTH, expand=True)
        self.result_text.configure(yscrollcommand=scrollbar.set)

    def scrape_website(self):
        url = self.url_entry.get().strip()
        if not url.startswith(('http://', 'https://')):
            messagebox.showerror("Error", "Please enter a valid URL")
            return

        try:
            # Send the HTTP request
            response = requests.get(url, timeout=10)
            response.raise_for_status()

            # Parse the HTML
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract the title and every <a> tag with a non-empty href
            title = soup.title.get_text(strip=True) if soup.title else "No title"
            links = [a for a in soup.find_all('a') if a.get('href')]

            # Show the results
            self.result_text.delete("1.0", tk.END)
            self.result_text.insert(tk.END, f"Page title: {title}\n\n")
            self.result_text.insert(tk.END, "Links on the page:\n")
            for i, link in enumerate(links, 1):
                href = link['href']
                text = link.get_text(strip=True)
                self.result_text.insert(tk.END, f"{i}. {text} -> {href}\n")

        except requests.exceptions.RequestException as e:
            messagebox.showerror("Error", f"Could not access URL: {e}")
        except Exception as e:
            messagebox.showerror("Error", f"Unexpected error: {e}")


if __name__ == "__main__":
    root = tk.Tk()
    app = WebScraperApp(root)
    root.mainloop()
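
The parsing step inside scrape_website does not depend on Tkinter, so it can be pulled out into a plain function and checked without opening a window or touching the network. The following is a minimal sketch of that idea; the function name extract_title_and_links and the sample markup are assumptions made for illustration and are not part of the program above.

```python
from bs4 import BeautifulSoup


def extract_title_and_links(html):
    """Return (title, [(text, href), ...]) for an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # Fall back to a placeholder when the page has no <title> tag
    title = soup.title.get_text(strip=True) if soup.title else "No title"
    links = [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)  # skip anchors without an href attribute
    ]
    return title, links


if __name__ == "__main__":
    # Hypothetical markup used only to exercise the function
    sample = '<title>Demo</title><a href="/a">A</a><a name="x">no href</a>'
    print(extract_title_and_links(sample))
```

Keeping the parsing in its own function also means the GUI callback only has to fetch the page, call the function, and write the result into the text box.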

How to Run

Make sure the required libraries are installed:

text

pip install beautifulsoup4 requests
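
To confirm that both packages are visible to the interpreter you intend to use, their installed versions can be queried from Python. This is a small sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_version is an assumption made for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("beautifulsoup4", "requests"):
    print(name, installed_version(name) or "not installed")
```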

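After starting the program, enter the URL you want to scrape in the input box and click the scrape button.

The program then displays the page title and all of the links it found.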
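
The href values are shown exactly as they appear in the page, so relative links such as /about are listed without a host. If absolute URLs are wanted, one option is to resolve each href against the page URL with urllib.parse.urljoin from the standard library. The sketch below assumes a hypothetical page URL and hypothetical href values; the helper name absolutize is also made up for this example.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url; absolute URLs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical page URL and href values, as they might appear in <a> tags.
resolved = absolutize(
    "https://example.com/blog/post.html",
    ["/about", "next.html", "../index.html", "https://other.example/x"],
)
for url in resolved:
    print(url)
```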

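Features

Simple GUI that is easy to use

Displays the page title and all links

Error handling for invalid URLs and failed requests

Scrollbar for viewing long output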
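
The error handling listed above relies on requests.exceptions.RequestException, the base class for the errors that requests raises, including the HTTP errors produced by raise_for_status(). The pattern can be sketched outside the GUI as follows; the function name fetch_html and its (html, error) return convention are assumptions made for this example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return (html, None) on success or (None, error_message) on failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, malformed URLs and HTTP error codes
        return None, f"Could not access URL: {exc}"
    return response.text, None


if __name__ == "__main__":
    html, error = fetch_html("https://example.com")
    print(error or f"fetched {len(html)} characters")
```

Returning the error as a value rather than raising lets the caller decide how to present it, for example in a message box.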
