
[Data Engineering] 9. Web Scraping and Web APIs

news 2025/9/17 10:30:19

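Web Scraping and Web APIs

1. Web API Overview

Many websites and online services provide a programmable interface (API) that lets developers retrieve data from code:

Public APIs: open to anyone, e.g. Google Maps and OpenStreetMap (some of these still require an API key).
Registered official APIs: require registration, e.g. Australia's open transport data platform.
Partner APIs: open only to business partners, e.g. Airbnb.
Unofficial third-party APIs: provided by third parties; may be unstable or incomplete.

Advantage: responses are structured data (JSON, XML) that can be used directly, with no web page parsing.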

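2. Getting Data from Websites

There are two main ways to get data from a website:

Downloading data files
- Fetch CSV, JSON, Excel, or similar files with an HTTP request.
- Advantages: simple, fast, and the data is complete.
Web scraping
- Parse data embedded in a page's HTML, such as tables and text.
- Advantage: works for data that has no direct download link, at the cost of having to parse the page structure.

Commonly used Python libraries:

requests: sends HTTP requests
BeautifulSoup: parses HTML
html5lib: a parser backend for BeautifulSoup
pandas: data processing and analysis
Scrapy: crawling framework
Selenium: browser automation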

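Web Request Overview

The client (a browser or a program) sends an HTTP request to the server.
The server returns a response:
- Static content: HTML files, images, PDFs, and so on.
- Dynamic content: content generated per request, such as search results or weather data.
The response is usually HTML, which the browser parses and renders as a page.

Understanding HTTP requests and responses is the foundation for both web scraping and API use.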

3. HTTP Request Basics

GET: retrieves a resource; may carry parameters
POST: sends data to the server

Example:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.example.com")
print(response.status_code)
content = BeautifulSoup(response.text, 'html5lib')

requests.get(URL): fetches the page content
requests.get(URL, params=dict(...)): GET request with query parameters
requests.post(URL, data=dict(...)): POST request that submits form data

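4. URLs and Page Structure

URL (Uniform Resource Locator): the address of a web page, e.g. https://www.health.nsw.gov.au/news/Pages/20220329_00.aspx
URL structure: protocol://domain/path?parameters
Common protocols: http, https, ftp
The path names the resource and may be followed by query parameters (?key=value&…)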
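The URL structure described above can be taken apart and rebuilt with Python's standard urllib.parse module. The following is a minimal sketch; the example.com URL and its q and page parameters are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Hypothetical URL, used only to illustrate the parts of a URL.
url = "https://www.example.com/search/results?q=web+scraping&page=2"

parts = urlsplit(url)
print(parts.scheme)  # protocol
print(parts.netloc)  # domain
print(parts.path)    # path
print(parts.query)   # raw query string

# parse_qs decodes the query string into a dict mapping each key to a list of values.
params = parse_qs(parts.query)
print(params)

# urlencode does the reverse: it builds an encoded query string from a dict.
new_query = urlencode({"q": "web api", "page": 1})
new_url = urlunsplit((parts.scheme, parts.netloc, parts.path, new_query, ""))
print(new_url)
```

This is the same encoding that requests applies when a dict is passed through its params argument.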

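Multi-Page Crawling and Crawler Frameworks

Scrapy:
- A Python crawling framework for writing "spider" programs that automatically follow links and scrape data.
- Can be extended with custom functionality to extract exactly the page content you need.
- Official documentation: https://docs.scrapy.org/en/latest/intro/overview.html
Selenium:
- A programmable browser that simulates user actions, including clicking, scrolling, and running JavaScript.
- Commonly used for automated website testing; also useful for scraping sites with complex interactions.
- Python documentation: https://selenium-python.readthedocs.io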

5. Web Scraping Examples in Python

5.1 Fetching a Page

webpage_source = requests.get("https://www.sydney.edu.au/units/COMP5339/2025-S2C-NE-CC").text
print(webpage_source[:500])  # print only the first 500 characters

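HTML and the DOM

HTML Basics

HTML is the markup language of the web. It defines the structure and content of a page, including text, images, tables, and forms.
A page usually also references CSS (style sheets) and scripts (JavaScript).
Elements are marked up with tags: <tag>content</tag>.
The browser parses the HTML and renders the page.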

HTML Example

<!DOCTYPE html>
<html>
<head><title>Bibliography</title>
</head>
<body>
<h1>References</h1>
<p>Here are some links about web scraping:</p>
<div id="biblist">
<ul>
<li>"Data Science From Scratch", Chapter 23</li>
<li><a href="http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/">Web Scraping for Data Journalists</a></li>
</ul>
</div>
</body>
</html>

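DOM (Document Object Model)

An HTML page can be viewed as a tree (element tree): the root node is <html>, with elements such as <head> and <body> beneath it.
Tags are nested hierarchically, so data can be located through parent, child, and sibling relationships.
Python's BeautifulSoup makes it easy to navigate the DOM and extract data.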
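To make the tree structure concrete, the sketch below prints the nesting of a small page using only the standard library's html.parser. It is a stand-in for illustration rather than the BeautifulSoup workflow used in this tutorial; the TreePrinter class and the shortened sample page are hypothetical.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Record one indented line per opening tag to expose the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Shortened, hypothetical sample page (no void tags such as <img> or <br>).
html = """<html>
<head><title>Bibliography</title></head>
<body>
<h1>References</h1>
<div id="biblist">
<ul><li>First entry</li><li><a href="http://www.example.com/">Second entry</a></li></ul>
</div>
</body>
</html>"""

printer = TreePrinter()
printer.feed(html)
print("\n".join(printer.lines))
```

Each level of indentation corresponds to one parent-child step, so the <a> element is reached via html, body, div, ul, li.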

5.2 Parsing HTML

content = BeautifulSoup(webpage_source, 'html5lib')
print(content.title.text)

5.3 Traversing the DOM

print(content.body.div.div.text)

Data Extraction Example (BeautifulSoup + pandas)

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.example.com")
content = BeautifulSoup(response.text, 'html5lib')
# Take the first table on the page
table = content.find_all('table')[0]
# Convert the HTML table to a DataFrame (StringIO avoids the deprecated literal-string input)
df = pd.read_html(StringIO(str(table)))[0]
countries = df["COUNTRY"]
print(countries)

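5.4 CSS Selectors and Searching

An HTML element can carry several classes and an ID, which are used for styling and for locating the element.
Common selectors:
- By class: table.data
- By ID: div#results or #results
- By position among siblings: e:first-child, e:last-child, e:nth-child(n)
- By position among siblings of the same type: e:nth-of-type(n)

BeautifulSoup supports CSS selectors through select():

# Select the element with id "ship" that also has class "data"
elements = content.select("#ship.data")
for e in elements:
    print(e.text)

print(content.find('div', 'primaryNavigation').header.div.a['class'])
print(content.find('div', 'primaryNavigation').header.div.a.find_next('a').img['class'])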
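The selector forms listed above can be tried on a small inline document without any network access. This is a minimal sketch: the HTML snippet is invented, and it uses Python's built-in "html.parser" backend instead of html5lib so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented HTML used only to exercise the selectors.
html = """
<div id="results">
  <table class="data wide">
    <tr><td>Alpha</td><td>1</td></tr>
    <tr><td>Beta</td><td>2</td></tr>
    <tr><td>Gamma</td><td>3</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("table.data")      # tables that have the class "data"
by_id = soup.select_one("#results")       # the element whose id is "results"
first_cells = [td.text for td in soup.select("td:first-child")]  # first cell of each row
last_row = soup.select_one("tr:last-child")                      # last row of the table
second_row = soup.select_one("tr:nth-of-type(2)")                # second <tr> among its siblings

print(len(by_class), by_id.name)
print(first_cells)
print(last_row.td.text)
print([td.text for td in second_row.find_all("td")])
```

select() returns a list of all matches, while select_one() returns only the first match (or None when nothing matches).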

5.5 Finding All Tags

for heading in content.find_all('h2'):
    print(heading.text.strip())

5.6 Scraping the Social Media Links in the Page Footer

socialmedia = content.find('div', '2/12--tablet-down')
for link in socialmedia.find_all('a'):
    print(link.text.strip(), '-', link['href'])

6. Parsing HTML Tables

details = content.find('div', 'teaching-staff__wrapper')
for row in details.find_all('tr'):
    print(row.th.text.strip(), '=', row.td.text.strip())

6.1 Extracting the Assessment Table Data

assessments = content.find('div', id='assessmentDetails')
headers = [header.text for header in assessments.find_all('th')]
data = []

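7. Web Crawling Basics

Follow robots.txt when crawling multiple pages.
Add delays between requests to avoid overloading the server (or being blocked by it).
Practice on a single page first, then scale up to multiple pages.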
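One way to respect robots.txt from code is the standard library's urllib.robotparser. The sketch below parses an in-memory robots.txt so it runs offline; the rules, the "my-crawler" user agent, and the example.com URLs are all made up. For a live site one would instead call set_url() and read() on the parser.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download the site's own file.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() reports whether the given user agent may request the URL.
allowed = parser.can_fetch("my-crawler", "https://www.example.com/units/COMP5339")
blocked = parser.can_fetch("my-crawler", "https://www.example.com/private/report.html")

# crawl_delay() returns the requested pause in seconds, or None if the file sets none.
delay = parser.crawl_delay("my-crawler")

print(allowed, blocked, delay)
```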

7.1 Link Extraction Example

page = requests.get("https://www.sydney.edu.au/units/COMP5339").text
content = BeautifulSoup(page, 'html5lib')
links = []
oldlinks = content.find('div', id='archivedOutlines').find_all('a')
for link in oldlinks:
    if link.has_attr('href'):
        links.append(link.get('href'))
newlinks = content.find('div', id='currentOutlines').find_all('a')
for link in newlinks:
    if link.has_attr('href'):
        links.append(link.get('href'))
# Turn the site-relative hrefs into full URLs
links = ['http://sydney.edu.au'+link for link in links]
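String concatenation works when every href is site-relative. A more forgiving alternative, sketched here as an option rather than as the tutorial's method, is urllib.parse.urljoin, which also copes with hrefs that are already absolute. The hrefs list is hypothetical.

```python
from urllib.parse import urljoin

base = "https://www.sydney.edu.au/units/COMP5339"

# Hypothetical hrefs as they might appear in <a> tags on the page.
hrefs = [
    "/units/COMP5339/2025-S2C-NE-CC",  # site-relative path
    "https://www.example.com/other",   # already absolute, left unchanged
    "2024-S2C-NE-CC",                  # relative to the base URL's directory
]

absolute = [urljoin(base, href) for href in hrefs]
for link in absolute:
    print(link)
```

Note that the last href resolves against /units/ rather than /units/COMP5339/, because the base URL has no trailing slash.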

7.2 Link Traversal and Wrapping the Extraction in a Function

import pandas as pd

def findAssessments(URL):
    page = requests.get(URL).text
    content = BeautifulSoup(page, 'html5lib')
    assessments = content.find('div', id='assessmentDetails')
    headers = [header.text for header in assessments.find_all('th')]
    data = []
    for row in assessments.find_all('tr'):
        rowdata = {}
        if row.th and row.td:
            rowdata[row.th.text.strip()] = row.td.text.strip()
        data.append(rowdata)
    return pd.DataFrame(data, columns=headers)

7.3 Crawling Multiple Pages

import time as t
df = pd.DataFrame(columns=['Unit', 'Session']+headers[:5])
for link in links:
    URL = link  # links already holds full URLs (the prefix was added in 7.1)
    print(URL)
    t.sleep(2)  # pause between requests
    df = pd.concat([df, findAssessments(URL)])
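A crawl over many pages usually should not stop at the first page that fails. The sketch below shows one possible loop shape with a delay between requests and per-page error handling. The crawl function, its parameters, and fake_fetch are hypothetical names; fetching and sleeping are passed in so that the example runs offline.

```python
import time


def crawl(urls, fetch, delay=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and skipping failures."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        try:
            results[url] = fetch(url)
        except Exception as exc:
            print(f"skipping {url}: {exc}")
    return results


def fake_fetch(url):
    """Stand-in for a real download; fails for URLs ending in /broken."""
    if url.endswith("/broken"):
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"


pauses = []  # records the requested pauses instead of actually sleeping
pages = crawl(
    ["https://www.example.com/a", "https://www.example.com/broken", "https://www.example.com/b"],
    fake_fetch,
    delay=2.0,
    sleep=pauses.append,
)
print(sorted(pages))
print(pauses)
```

In real use, fetch could be something like lambda url: requests.get(url, timeout=10).text, with the default time.sleep left in place.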

8. Storing the Data

df.to_csv("assessments.csv", index=False)

9. Extension: Scraping OLE Units

9.1 Getting the OLE List

OLEpage = requests.get("https://www.sydney.edu.au/handbooks/interdisciplinary_studies/open_learning_environment/open_learning_environment_ad_table.html").text
OLEcontent = BeautifulSoup(OLEpage, 'html5lib')
uoslist = []
for row in OLEcontent.find_all('table')[1].find_all('tr'):
    uos = row.find('td')
    if uos and uos.a:
        uoslist.append(uos.a.string)

9.2 Scraping Assessments for Each OLE Unit

t0 = t.time()
OLEdf = pd.DataFrame(columns=['Unit', 'Session']+headers[:5])
for i, uoscode in enumerate(uoslist[:3]):  # test with the first three units only
    print(f'({i}/{len(uoslist)}) {uoscode}')
    page = requests.get("https://www.sydney.edu.au/units/"+uoscode).text
    content = BeautifulSoup(page, 'html5lib')
    currentOutlines = content.find('div', id='currentOutlines')

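10. Summary and Caveats

Ways to get data: CSV download / API / web scraping
HTML DOM: understanding the tree structure makes elements easier to locate
CSS / XPath selectors: find the target content quickly
pandas / Matplotlib: analyze and visualize the data
Multi-page crawling: add delays to avoid overloading the server
Legality: follow robots.txt and respect copyright and the site's terms of use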

Web API Programming

Getting Data (JSON / XML)

APIs commonly return JSON or XML.
Example JSON:

{
  "uoscode": "COMP5310",
  "title": "Principles of Data Science",
  "lecturers": ["Ben Hachey", "Uwe Roehm"],
  "description": "COMP5310 is about …"
}

Example XML:

<uoscode>COMP5310</uoscode>
<title>Principles of Data Science</title>
<lecturers>
  <name>Ben Hachey</name>
  <name>Uwe Roehm</name>
</lecturers>
<description>COMP5310 is about …</description>
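Web Service Standards

RESTful API: HTTP-based requests for resources using operations such as GET and POST; returns JSON or XML.
SOAP / XML Web Services: a more complex XML messaging standard, suited to enterprise applications.
This tutorial focuses on RESTful APIs.

JSON Structure

Built on key-value pairs; supports nesting and arrays.
Data types: string, number, boolean, null, object, array.
Example structure:

{
  "name": "Alice",
  "age": 25,
  "courses": ["Math", "CS"]
}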
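How the JSON data types map onto Python types can be checked with the standard json module. This sketch extends the example above with three hypothetical fields (enrolled, supervisor, address) so that every JSON type appears once.

```python
import json

# "enrolled", "supervisor", and "address" are hypothetical additions for illustration.
text = """{
  "name": "Alice",
  "age": 25,
  "enrolled": true,
  "supervisor": null,
  "courses": ["Math", "CS"],
  "address": {"city": "Sydney"}
}"""

record = json.loads(text)

# Map each key to the name of the Python type its value was decoded into.
types = {key: type(value).__name__ for key, value in record.items()}
print(types)

# Nested objects and arrays are reached with ordinary indexing.
print(record["address"]["city"], record["courses"][1])

# json.dumps serializes back to text; decoding that text yields an equal structure.
round_trip = json.loads(json.dumps(record))
print(round_trip == record)
```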

Using a Web API from Python

import requests
import pandas as pd

url = 'https://www.data.qld.gov.au/api/3/action/datastore_search?resource_id=...'
response = requests.get(url + "&limit=5&q=Asia")
data = response.json()  # parse the JSON response body
records = pd.json_normalize(data['result']['records'])
print(records[["Vessel","Date of Departure","Place of Arrival"]])

pandas can turn JSON into a DataFrame for easy analysis.
API parameters can filter the data, for example by address or date.
Many APIs require authentication (an API key).
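How an API key is passed differs from one API to another: some expect a request header, others a query parameter. As a rough sketch under that caveat, the code below builds (but does not send) a request with filter parameters and a key in the Authorization header, using only the standard library. The build_api_request helper, the example.com endpoint, and the header choice are assumptions; the API's own documentation is the authority.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_api_request(base_url, params, api_key=None):
    """Build a GET request with encoded query parameters and an optional API key."""
    url = base_url + "?" + urlencode(params)
    headers = {"Accept": "application/json"}
    if api_key is not None:
        # Assumption: this API reads the key from the Authorization header.
        headers["Authorization"] = api_key
    return Request(url, headers=headers)


# Hypothetical endpoint and key; nothing is sent over the network here.
req = build_api_request(
    "https://www.example.com/api/search",
    {"q": "Asia", "limit": 5},
    api_key="MY-KEY",
)

print(req.get_method())
print(req.full_url)
print(req.get_header("Authorization"))
```

With requests, the equivalent would be passing params=... and headers=... to requests.get.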

Worked Example

# 1. Download the NSW Covid data
dfs = pd.read_html("https://www.health.nsw.gov.au/news/Pages/20220329_00.aspx")
covid_df = dfs[0]

Visualization

covid_df.plot.bar(x="LHD", y="Total cases")

Calling an API to Get Ship Records

api_url = "https://www.data.qld.gov.au/api/3/action/datastore_search"
resource_id = "dbcfa4a6-3ec7-4264-bcee-43b21a470d34"
params = {"limit": 10, "q": "Asia"}
api_response = requests.get(api_url, params={**params, "resource_id": resource_id})
ship_data = pd.json_normalize(api_response.json()['result']['records'])
print(ship_data.head())
