
Data Science and Computing: Web Scraping and Data Analysis Case Notes

news 2025/8/13 11:19:26

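Case 1: Scraping and Analyzing China's University Rankings

I. Task Description

Goal: scrape the China university ranking table from 高三网 (Gaosan Wang), extract each school's name, total score, national rank, star rating, and school tier, and save the result as a CSV file.

Page: "2021 China University Rankings List" on 高三网. The code in this note requests http://www.bspider.top/gaosan/.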

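II. Task Analysis

Data source: a table on the web page with the columns "名次" (rank), "学校名称" (school name), "总分" (total score), "全国排名" (national rank), "星级排名" (star rating), and "办学层次" (school tier).

Page structure: the data is nested in table > tbody > tr tags, so the table rows have to be extracted by parsing the HTML.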
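To make the table > tbody > tr structure concrete, here is a minimal sketch that runs the same kind of CSS selector against a small inline HTML fragment. The fragment and its values are invented for illustration and are not taken from the real page. The sketch uses Python's built-in "html.parser" backend so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the layout described above (made-up values).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>名次</th><th>学校名称</th><th>总分</th></tr>
    <tr><td>1</td><td>School A</td><td>100</td></tr>
    <tr><td>2</td><td>School B</td><td></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for tr in soup.select("table > tbody > tr"):
    cells = [td.text.strip() for td in tr.select("td")]
    if not cells:  # the header row holds only <th> cells, so skip it
        continue
    rows.append(cells)

print(rows)
```

Neither html.parser nor lxml inserts a missing tbody element the way browsers do. If the raw HTML has no tbody, a looser selector such as "table tr" may be needed instead.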

III. Code Implementation

1. Importing the core libraries

python

import requests  # send HTTP requests
from bs4 import BeautifulSoup  # parse HTML
import csv  # read and write CSV files

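2. Helper functions

Fetching the page (get_html)

python

def get_html(url, time=3):
    try:
        r = requests.get(url, timeout=time)  # send the GET request; `time` is the timeout in seconds
        r.encoding = r.apparent_encoding  # guess the encoding from the response body
        r.raise_for_status()  # raise an exception for 4xx/5xx status codes
        return r.text  # return the page text
    except Exception as error:
        print(error)
        return None  # the caller has to handle a failed download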
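A single failed request makes get_html give up immediately. As a hedged variation, the sketch below retries a few times before returning None. The function name fetch_html, the retries and delay parameters, and the User-Agent value are assumptions made for this example and are not part of the original code.

```python
import time

import requests


def fetch_html(url, timeout=3, retries=3, delay=1.0):
    """Return the page text, or None if every attempt fails (illustrative sketch)."""
    # Assumption: some sites reject the default requests User-Agent.
    headers = {"User-Agent": "Mozilla/5.0"}
    for attempt in range(1, retries + 1):
        try:
            r = requests.get(url, timeout=timeout, headers=headers)
            r.raise_for_status()  # raise for 4xx/5xx responses
            r.encoding = r.apparent_encoding  # guess the encoding from the body
            return r.text
        except requests.RequestException as error:
            print(f"attempt {attempt} failed: {error}")
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    return None
```

Catching requests.RequestException instead of a bare Exception keeps programming errors visible while still covering timeouts, connection failures, and bad status codes.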

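Parsing the page (parser)

python

def parser(html):
    soup = BeautifulSoup(html, "lxml")  # parse the HTML (requires the lxml package)
    out_list = []
    for row in soup.select("table>tbody>tr"):  # iterate over the table rows
        td_html = row.select("td")  # cells of this row
        if len(td_html) < 6:  # skip header or malformed rows
            continue
        row_data = [
            td_html[1].text.strip(),  # school name
            td_html[2].text.strip(),  # total score
            td_html[3].text.strip(),  # national rank
            td_html[4].text.strip(),  # star rating
            td_html[5].text.strip(),  # school tier
        ]
        out_list.append(row_data)
    return out_list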

Saving to a CSV file (save_csv)

python

def save_csv(item, path):
    with open(path, "wt", newline="", encoding="utf-8") as f:
        csv_write = csv.writer(f)
        csv_write.writerows(item)  # write all rows at once (no header row)
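Because save_csv writes only the data rows, the resulting file has no header line. One possible variant, sketched here under the hypothetical name write_rows_with_header and with made-up sample rows, writes the column names first so that later tools can refer to columns by name. It accepts any file-like object, so it can be tried in memory.

```python
import csv
import io

# Column names matching the five fields kept by parser.
HEADER = ["学校名称", "总分", "全国排名", "星级排名", "办学层次"]


def write_rows_with_header(rows, f):
    """Write HEADER followed by the data rows to an open text file object."""
    writer = csv.writer(f)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Made-up rows for illustration; the second one has an empty total score.
sample_rows = [
    ["School A", "100", "1", "8星", "Tier X"],
    ["School B", "", "2", "7星", "Tier Y"],
]

buffer = io.StringIO(newline="")
write_rows_with_header(sample_rows, buffer)

buffer.seek(0)
records = list(csv.DictReader(buffer))  # the first line supplies the keys
print(records[0]["学校名称"], repr(records[1]["总分"]))
```

A file written this way can be loaded with a plain pd.read_csv(path) call, which takes the first line as the column names.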

3. Main program

python

if __name__ == "__main__":
    url = "http://www.bspider.top/gaosan/"
    html = get_html(url)  # fetch the page
    if html:  # get_html returns None when the request fails
        out_list = parser(html)  # parse the data
        save_csv(out_list, "school.csv")  # save the file

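IV. Data Preprocessing (Handling Missing Values)

Empty values in the "总分" (total score) column can be handled with pandas. The options below are alternatives: apply one of them to a freshly loaded DataFrame rather than all of them in sequence.

Dropping rows that contain empty fields

python

import pandas as pd

# save_csv writes no header row, so the column names are supplied here
columns = ["学校名称", "总分", "全国排名", "星级排名", "办学层次"]
df = pd.read_csv("school.csv", header=None, names=columns)
new_df = df.dropna()  # drop every row that has a missing value
print(new_df.to_string())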

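Replacing empty fields with a fixed value

python

# Replace every missing value with a placeholder text ("no score available").
# This turns a numeric column into text, so mean() and median() no longer work on it.
df.fillna("暂无分数信息", inplace=True)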

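Replacing empty fields with the mean

python

x = df["总分"].mean()  # mean of the non-missing scores
df["总分"] = df["总分"].fillna(x)  # assign back; inplace=True on a selected column is unreliable in recent pandas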

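Replacing empty fields with the median

python

x = df["总分"].median()  # median of the non-missing scores
df["总分"] = df["总分"].fillna(x)  # assign back; inplace=True on a selected column is unreliable in recent pandas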
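The effect of these strategies is easier to see on a tiny example. The sketch below uses an in-memory DataFrame with made-up scores (not data from the ranking page) and compares dropping rows against filling with the mean and the median.

```python
import pandas as pd

# Made-up scores for illustration; None marks a missing total score.
df = pd.DataFrame({
    "学校名称": ["A", "B", "C", "D", "E"],
    "总分": [90.0, None, 60.0, 30.0, 20.0],
})

dropped = df.dropna()  # school B is removed
mean_filled = df["总分"].fillna(df["总分"].mean())  # mean() skips missing values
median_filled = df["总分"].fillna(df["总分"].median())  # median() skips them too

print(len(dropped))            # 4
print(mean_filled.iloc[1])     # 50.0
print(median_filled.iloc[1])   # 45.0
```

The mean and median differ here because the scores are skewed, which is the usual reason to prefer the median when a few very high or very low values are present.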

V. Data Analysis and Visualization

1. Data overview

There are 820 schools in total. By star rating: 8-star (8 schools), 7-star (16), 6-star (36), 5-star (59), 4-star (103), 3-star (190), 2-star (148), 1-star (260).
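As a quick check of the figures above, the following sketch sums the counts and derives each star level's percentage share. The dictionary simply restates the numbers from the overview; the variable names are chosen for this example.

```python
# Schools per star rating, as listed in the overview.
counts = {
    "8星": 8, "7星": 16, "6星": 36, "5星": 59,
    "4星": 103, "3星": 190, "2星": 148, "1星": 260,
}

total = sum(counts.values())

# Percentage share of each star level, rounded to one decimal place.
shares = {star: round(100 * n / total, 1) for star, n in counts.items()}

print(total)
for star, share in shares.items():
    print(star, f"{share}%")
```

Deriving the shares from the counts avoids hand-rounded percentages that drift from the underlying data.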

2. Charts

Bar chart (horizontal / vertical)

python

import matplotlib.pyplot as plt
import numpy as np

plt.rcParams["font.sans-serif"] = ["SimHei"]  # use a font that can render the Chinese labels

x = np.array(["8星", "7星", "6星", "5星", "4星", "3星", "2星", "1星"])
y = np.array([8, 16, 36, 59, 103, 190, 148, 260])

plt.title("不同星级的学校个数")  # "Number of schools per star rating"
plt.bar(x, y)  # vertical bar chart
# plt.barh(x, y)  # horizontal bar chart
plt.show()
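SimHei is typically present on Windows but often missing on other systems, in which case the Chinese labels may render as empty boxes. The sketch below checks which fonts from a candidate list are installed and puts them at the front of the sans-serif fallback list. The candidate names are assumptions about commonly installed CJK fonts, not a definitive list.

```python
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Assumed candidates; adjust to the fonts actually installed on the machine.
CANDIDATES = [
    "SimHei",
    "Microsoft YaHei",
    "PingFang SC",
    "Noto Sans CJK SC",
    "WenQuanYi Micro Hei",
]

# Names of all fonts matplotlib has discovered on this system.
installed = {font.name for font in font_manager.fontManager.ttflist}
available = [name for name in CANDIDATES if name in installed]

if available:
    # Try the CJK fonts first, then fall back to the existing defaults.
    plt.rcParams["font.sans-serif"] = available + list(plt.rcParams["font.sans-serif"])
else:
    print("No candidate CJK font found; Chinese labels may render as boxes.")
```

Running this once before any plotting call leaves the rest of the chart code unchanged.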

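Pie chart (share of each star level)

python

# Schools per star rating; plt.pie normalizes the counts into shares
counts = [8, 16, 36, 59, 103, 190, 148, 260]
labels = ["8星", "7星", "6星", "5星", "4星", "3星", "2星", "1星"]

plt.rcParams["font.sans-serif"] = ["SimHei"]  # use a font that can render the Chinese labels
plt.pie(counts, labels=labels, autopct="%1.1f%%")  # print each share on its wedge
plt.title("不同星级的学校个数占比")  # "Share of schools per star rating"
plt.show()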