
Boston House Price Prediction with Linear Regression: A Walkthrough

news 2025/8/23 15:19:24

1. Import the required libraries

python


import pandas as pd                  # data handling
import numpy as np                   # numerical computation
import matplotlib.pyplot as plt      # plotting
import matplotlib as mpl             # matplotlib configuration
from sklearn.datasets import fetch_openml  # fetch public datasets from OpenML
from sklearn.model_selection import train_test_split  # split into training and test sets
from sklearn.linear_model import LinearRegression      # linear regression model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score  # evaluation metrics
from sklearn.preprocessing import StandardScaler        # feature standardization
from sklearn.pipeline import Pipeline                  # chain preprocessing and model

2. Configure Chinese font display

python


plt.rcParams["font.family"] = ["Microsoft YaHei"]  # use a font that contains Chinese glyphs
plt.rcParams['axes.unicode_minus'] = False          # draw minus signs as ASCII hyphens so they render with this font

3. Load the data and inspect basic information

python


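# Load the Boston housing dataset from OpenML
boston = fetch_openml(name='boston', version=1, as_frame=True)
X = boston.data    # feature data
y = boston.target  # target variable: house price (in thousands of US dollars)

# Print basic information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Feature names: {X.columns.tolist()}")
print("Target variable: house price (thousands of USD)")

# Show summary statistics for the features
print("\nSummary statistics:")
print(X.describe())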

4. Split the dataset

python


# Split the data into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # random_state makes the split reproducible
)
print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

5. Build and train the model

python


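# Build the workflow as a Pipeline: feature standardization followed by linear regression
model = Pipeline([
    ('scaler', StandardScaler()),       # standardize the features
    ('regressor', LinearRegression()),  # linear regression model
])

# Train the model
print("\nTraining the linear regression model...")
model.fit(X_train, y_train)
print("Model training complete!")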

6. Make predictions

python


# Predict on both the training set and the test set
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

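7. Compute model evaluation metrics

Several regression evaluation metrics are computed:

Sum of squared errors (SSE)
Total sum of squares (SST)
Mean squared error (MSE)
Root mean squared error (RMSE)
Mean absolute error (MAE)
Coefficient of determination (R²)

These metrics assess predictive quality from different angles. SSE, MSE, RMSE and MAE measure the size of the prediction errors (lower is better), while R² = 1 − SSE/SST measures the share of the target's variance that the model explains; the closer R² is to 1, the better the fit.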
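The article lists these metrics without showing how they are computed, so here is a minimal sketch of one way to do it with NumPy alone. The helper name `regression_metrics` and the toy numbers are assumptions made for illustration; they are not from the original post. For MSE, MAE and R², the scikit-learn functions imported in section 1 (`mean_squared_error`, `mean_absolute_error`, `r2_score`) should return the same values.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return SSE, SST, MSE, RMSE, MAE and R^2 for one set of predictions.

    Hypothetical helper for illustration; not part of the original article.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    sse = float(np.sum(residuals ** 2))                 # sum of squared errors
    sst = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    mse = sse / y_true.size                             # mean squared error
    rmse = float(np.sqrt(mse))                          # same units as the target
    mae = float(np.mean(np.abs(residuals)))             # mean absolute error
    r2 = 1.0 - sse / sst                                # share of variance explained
    return {"SSE": sse, "SST": sst, "MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

# Toy values for illustration only (not taken from the Boston data)
demo = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 10.0])
for name, value in demo.items():
    print(f"{name}: {value:.4f}")
```

In the workflow above, the same helper could be called as `regression_metrics(y_train, y_pred_train)` and `regression_metrics(y_test, y_pred_test)`; a training R² far above the test R² would suggest overfitting.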

8. Analyze the model coefficients

python


# Get the model coefficients and intercept
coefficients = model.named_steps['regressor'].coef_
intercept = model.named_steps['regressor'].intercept_

# Build a DataFrame of coefficients and sort it by magnitude of effect
coef_df = pd.DataFrame({
    'feature': X.columns,
    'coefficient': coefficients,
    'abs_coefficient': np.abs(coefficients),
})
coef_df = coef_df.sort_values(by='abs_coefficient', ascending=False)
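Because the regressor sits behind a `StandardScaler`, each coefficient is the change in the prediction per one standard deviation of its feature, not per original unit. The sketch below illustrates this on synthetic data with known slopes; the data and the names `X_demo`, `y_demo` and `demo_model` are assumptions for illustration and have nothing to do with the Boston dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: two features on very different scales, with known true slopes
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 100, 200)])
y_demo = 3.0 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + 10.0

demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
demo_model.fit(X_demo, y_demo)

std_coefs = demo_model.named_steps["regressor"].coef_  # effect per one standard deviation
scales = demo_model.named_steps["scaler"].scale_       # per-feature standard deviation
raw_coefs = std_coefs / scales                         # effect per one original unit

print("standardized coefficients:", std_coefs)
print("coefficients in original units:", raw_coefs)
```

Dividing by `scale_` recovers the slopes in original units (3.0 and 0.05 here). The second feature has the smaller raw slope but the larger standardized coefficient, which is why ranking by absolute standardized coefficient is a reasonable way to compare features measured in different units.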

Coefficient analysis helps us understand how each feature affects the house price: