
Common Python Scraping Libraries: requests and BeautifulSoup Explained

news 2025/10/20 11:29:26

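requests and BeautifulSoup are two of the most widely used libraries for web scraping in Python. requests sends HTTP requests and retrieves page content; BeautifulSoup parses the returned HTML and extracts the data you need. This article walks through the core usage of both libraries and then shows how to combine them.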

I. The `requests` Library

(1) Installing `requests`

pip install requests

(2) Sending HTTP Requests

1. Sending a GET request

import requests

response = requests.get('https://example.com')
print(response.status_code)  # Print the HTTP status code
print(response.text)         # Print the page content

2. Sending a POST request

import requests

data = {'key': 'value'}
response = requests.post('https://example.com', data=data)  # data= is sent as a form-encoded body
print(response.text)
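
To make the difference between a form-encoded body and a JSON body concrete, here is a minimal sketch that builds both kinds of POST request locally with `requests.Request(...).prepare()`, so nothing is sent over the network. The `/submit` endpoint and the payload are assumptions made up for illustration.

```python
import json

import requests

url = 'https://example.com/submit'  # hypothetical endpoint, for illustration only
payload = {'key': 'value'}

# prepare() builds the request locally; nothing is sent over the network.
form_request = requests.Request('POST', url, data=payload).prepare()
json_request = requests.Request('POST', url, json=payload).prepare()

form_content_type = form_request.headers['Content-Type']
json_content_type = json_request.headers['Content-Type']

# data= produces a form-encoded body, json= produces a JSON body
print(form_content_type, form_request.body)
print(json_content_type, json.loads(json_request.body))
```

In practice you would pass the same `data=` or `json=` argument straight to `requests.post`; which one to use depends on what the server expects.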

(3) Handling Request Parameters

1. Adding query parameters

import requests

params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://example.com', params=params)
print(response.url)  # Print the full URL, including the encoded query string
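
The sketch below shows how `params` ends up in the URL, including the escaping of values that contain spaces. It prepares the request without sending it; the `/search` path and the parameter values are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

import requests

params = {'q': 'python scraping', 'page': 2}  # hypothetical query values
prepared = requests.Request(
    'GET', 'https://example.com/search', params=params
).prepare()  # built locally, not sent

built_url = prepared.url
# Decode the query string again to confirm the values round-trip
query = parse_qs(urlsplit(built_url).query)

print(built_url)
print(query)
```

Letting requests encode the parameters is usually safer than concatenating them into the URL by hand, because special characters are escaped for you.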

2. Adding request headers

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://example.com', headers=headers)
print(response.text)
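
When many requests need the same headers, one option is to set them once on a `requests.Session` instead of repeating the `headers` argument. The following is a sketch under that assumption; the User-Agent string is made up, and the request is only prepared, not sent.

```python
import requests

session = requests.Session()
# Headers set on the session apply to every request made through it
session.headers.update({'User-Agent': 'my-crawler/0.1'})  # hypothetical UA string

# Per-request headers are merged with the session-level ones
request = requests.Request(
    'GET', 'https://example.com', headers={'Accept-Language': 'en'}
)
prepared = session.prepare_request(request)

user_agent = prepared.headers['User-Agent']
accept_language = prepared.headers['Accept-Language']
print(user_agent, accept_language)
```

With a live site you would simply call `session.get(url)`; the session also reuses connections and keeps cookies between calls.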

(4) Handling Response Content

1. Getting the response text

import requests

response = requests.get('https://example.com')
print(response.text)  # Print the page content as decoded text

2. Getting the response as JSON

import requests

response = requests.get('https://api.example.com/data')
data = response.json()  # Parse the response body as JSON
print(data)
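
`response.json()` raises an exception when the body is not valid JSON (for example, when an API returns an HTML error page). A minimal defensive sketch follows. `parse_json_or_none` and `FakeResponse` are hypothetical names; the fake class only stands in for `requests.Response` so the example runs offline.

```python
import json


def parse_json_or_none(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # JSON decoding errors are subclasses of ValueError
        return None


class FakeResponse:
    """Hypothetical stand-in for requests.Response, used to run offline."""

    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)


good = parse_json_or_none(FakeResponse('{"items": [1, 2]}'))
bad = parse_json_or_none(FakeResponse('<html>Service unavailable</html>'))
print(good, bad)
```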

(5) Exception Handling

import requests
from requests.exceptions import HTTPError, ConnectionError

try:
    response = requests.get('https://example.com')
    response.raise_for_status()  # Raises HTTPError for 4xx and 5xx status codes
except HTTPError as http_err:
    print(f'HTTP error occurred: {http_err}')
except ConnectionError as conn_err:
    print(f'Connection error occurred: {conn_err}')
except Exception as err:
    print(f'Other error occurred: {err}')
else:
    print('Success!')
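
The same pattern can be wrapped in a small helper that also sets a timeout, since `requests.get` otherwise waits indefinitely for a slow server. This is a sketch, not a definitive implementation: `fetch_html` is a hypothetical name and the 10-second default is an arbitrary assumption. It catches `RequestException`, the base class of the requests exceptions shown above.

```python
import requests
from requests.exceptions import RequestException


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except RequestException as err:
        # Covers HTTPError, ConnectionError, Timeout, invalid URLs, and so on
        print(f'Request failed: {err}')
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic is sent
result = fetch_html('not-a-valid-url')
print(result)
```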

II. The `BeautifulSoup` Library

(1) Installing `BeautifulSoup`

pip install beautifulsoup4

(2) Parsing HTML Content

1. Creating a BeautifulSoup object

from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

2. Extracting the title

print(soup.title.string)  # Prints: The Dormouse's story

3. Extracting all links

for link in soup.find_all('a'):
    print(link.get('href'))
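
Real pages often contain `<a>` tags without an `href`, and the link text is usually wanted alongside the URL. The sketch below collects `(text, href)` pairs and skips anchors that have no `href`. The HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<p>
  <a href="http://example.com/elsie">Elsie</a>
  <a name="anchor-only">No href here</a>
  <a href="http://example.com/lacie"> Lacie </a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only <a> tags that actually carry an href attribute
links = [
    (a.get_text(strip=True), a['href'])
    for a in soup.find_all('a', href=True)
]
print(links)
```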

(3) Searching the Document Tree

1. Searching by tag name

print(soup.find_all('p'))  # Prints a list of all <p> tags

2. Searching by class name

print(soup.find_all(class_='sister'))  # Prints all tags with class="sister"

3. Searching by ID

print(soup.find(id='link1'))  # Prints the tag with id="link1"
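
The three searches above can also be combined or expressed as CSS selectors through `select` and `select_one`, which may read more naturally if you already know CSS. This sketch uses a shortened copy of the sample document.

```python
from bs4 import BeautifulSoup

html = '''
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS selector: <a class="sister"> elements nested inside <p class="story">
sister_ids = [a['id'] for a in soup.select('p.story a.sister')]

# select_one returns the first match (or None), similar to find()
first = soup.select_one('#link1')

# find_all can combine a tag name with a class filter and cap the result count
first_two = soup.find_all('a', class_='sister', limit=2)

print(sister_ids, first.get_text(), len(first_two))
```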

(4) Navigating the Document Tree

1. Getting child nodes

for child in soup.body.children:  # Includes whitespace text nodes between tags
    print(child)

2. Getting the parent tag

print(soup.a.parent.name)  # Prints the name of the first <a> tag's parent: p
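
One detail that often surprises newcomers is that `.children` yields text nodes (such as the newlines between tags) as well as tags. The sketch below, on a made-up snippet, shows one way to keep only the direct child tags, and how to step sideways to a sibling tag.

```python
from bs4 import BeautifulSoup

html = '''<body>
<p class="title">Title</p>
<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>
</body>'''
soup = BeautifulSoup(html, 'html.parser')

# .children includes the newline strings between the <p> tags
all_children = list(soup.body.children)

# find_all(True, recursive=False) returns only the direct child tags
tag_children = soup.body.find_all(True, recursive=False)
child_names = [tag.name for tag in tag_children]

# find_next_sibling skips the ', ' text node and returns the next <a> tag
next_link = soup.find(id='link1').find_next_sibling('a')

print(len(all_children), child_names, next_link['id'])
```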

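(5) Modifying the Document Tree

1. Adding a new tag

# The sample document has no <ul>, so create one before appending to it
new_list = soup.new_tag('ul')
soup.body.append(new_list)
new_tag = soup.new_tag('li')
new_tag.string = 'New item'
soup.ul.append(new_tag)
print(soup.ul)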

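2. Removing a tag

soup.a.decompose()  # Removes the first <a> tag and destroys it
print(soup.find('p', class_='story'))  # The paragraph that contained the link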
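
Besides adding and deleting tags, an existing tag's attributes and text can be edited in place, and `extract()` can be used instead of `decompose()` when the removed tag is still needed afterwards. This is a small sketch on a made-up snippet.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story"><a href="/elsie" id="link1">Elsie</a> and '
    '<a href="/lacie" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, 'html.parser')

first = soup.find(id='link1')
first['href'] = 'http://example.com/elsie'  # Attributes behave like dict items
first.string = 'Elsie (updated)'            # Replace the tag's text

# extract() removes the tag from the tree and returns it for further use
removed = soup.find(id='link2').extract()
remaining_ids = [a['id'] for a in soup.find_all('a')]

print(soup.p)
print(removed)
```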

III. Combining `requests` and `BeautifulSoup`

(1) Scraping Page Data

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title
print(soup.title.string)

# Extract all links
for link in soup.find_all('a'):
    print(link.get('href'))
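
For anything beyond a quick script, it can help to keep the parsing step in its own function so it can be tried on saved HTML without hitting the network. The sketch below does that and also resolves relative links against the page URL. `extract_page_info` and the sample HTML are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_page_info(html, base_url):
    """Return (title, absolute_links) parsed from an HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title is None when the page has no <title> element
    title = soup.title.get_text(strip=True) if soup.title else None
    # urljoin turns relative hrefs into absolute URLs
    links = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
    return title, links


sample_html = '''
<html><head><title>Sample page</title></head>
<body>
  <a href="/about">About</a>
  <a href="news/today.html">News</a>
  <a href="https://example.org/">External</a>
</body></html>
'''
title, links = extract_page_info(sample_html, 'https://example.com/index.html')
print(title)
print(links)
```

With a live page, the same function could be called as `extract_page_info(response.text, response.url)` after a successful `requests.get`.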

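IV. Summary

This article covered the core usage of requests and BeautifulSoup. The key points are:

requests: sends HTTP requests and retrieves page content.
- Send GET and POST requests.
- Handle request parameters, such as query parameters and request headers.
- Handle response content, such as response text and JSON.
- Handle exceptions, including HTTP errors and connection errors.
BeautifulSoup: parses HTML content and extracts the data you need.
- Create a BeautifulSoup object to parse an HTML document.
- Search the document tree by tag name, class name, or ID.
- Navigate the document tree to reach child and parent tags.
- Modify the document tree by adding and removing tags.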