Diabetes Prediction with Multiple Machine Learning Models

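Contents

  • 1. Introduction

    • Problem statement
    • Data description
  • 2. Importing libraries

  • 3. Basic exploration

    • Reading the dataset
    • Basic information
    • Data visualization
  • 4. Data preprocessing

  • 5. Machine learning models

    • Logistic regression
    • Random forest classifier
    • Random forest hyperparameter tuning
    • Decision tree classifier
    • Decision tree hyperparameter tuning
    • K-nearest neighbors classifier
    • KNN hyperparameter tuning
    • Support vector classifier
    • AdaBoost classifier
    • Gradient boosting classifier
    • XGBoost classifier
  • 6. Conclusion

    • Summary
    • Possible future work
  • 7. Author's note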

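1 | Introduction

1.1 | Problem Statement

The goal of this dataset is to build a predictive model that diagnoses whether a female patient, at least 21 years old and of Pima Indian heritage, has diabetes. Based on several diagnostic measurements, the model should predict whether a patient has diabetes (Outcome = 1) or does not (Outcome = 0). The measurements include number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age.

1.2 | Data Description

No.  Column                     Meaning
1    Pregnancies                Number of pregnancies
2    Glucose                    Blood glucose level
3    BloodPressure              Blood pressure measurement
4    SkinThickness              Skin thickness
5    Insulin                    Blood insulin level
6    BMI                        Body mass index
7    DiabetesPedigreeFunction   Score for the likelihood of diabetes based on family history
8    Age                        Age
9    Outcome                    Final result (1: has diabetes; 0: does not have diabetes)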

2 | Importing Libraries

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
import joblib
import warnings
warnings.filterwarnings('ignore')

3 | Basic Exploration

3.1 | Reading the Dataset

df = pd.read_csv('diabetes.csv')

3.2 | Basic Information

3.2.1 | Displaying the Data

styled_df = df.head(5).style
# Set the background color, text color, and border for the whole DataFrame
styled_df.set_properties(**{"background-color": "#254E58", "color": "#e9c46a", "border": "1.5px solid black"})
# Change the text color and background color of the header cells (th)
styled_df.set_table_styles([
    {"selector": "th", "props": [("color", 'white'), ("background-color", "#333333")]}
])

3.2.2 | Number of Rows and Columns

rows, col = df.shape
print(f"Rows : {rows} \nColumns : {col}")

Output:

Rows : 768 
Columns : 9

3.2.3 | Summary Information

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(3), int64(6)
memory usage: 54.1 KB

3.2.4 | Counting Null / Missing Values

df.isnull().sum()

Result:

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

No missing (null) values were found.
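A caveat on the null check: `isnull()` only detects NaN. In this dataset, a value of 0 in columns such as Glucose, BloodPressure, SkinThickness, Insulin, or BMI is commonly treated as a placeholder for a missing measurement. That reading is an assumption here, not something the original analysis states. A minimal sketch of counting such zeros, using a tiny made-up frame in place of `diabetes.csv` (the helper name `count_zeros` is hypothetical):

```python
import pandas as pd

# Columns where 0 is assumed to be physiologically implausible (an assumption,
# not taken from the original post).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zeros(frame: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.Series:
    """Return the number of zero entries in each of the given columns."""
    return (frame[columns] == 0).sum()


# Tiny illustrative frame; the values are made up and only stand in for df.
demo = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "SkinThickness": [35, 0, 0],
    "Insulin": [0, 0, 94],
    "BMI": [33.6, 26.6, 0.0],
})

zero_counts = count_zeros(demo)
print(zero_counts)
```

On the real data, `count_zeros(df)` would show how many rows per column might need imputation before modeling.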

3.2.5 | Descriptive Statistics

styled_df = (
    df.describe().style
    .set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#254E58'), ('color', 'white'), ('font-weight', 'bold'), ('text-align', 'left'), ('padding', '8px')]},
        {'selector': 'td', 'props': [('padding', '8px')]}
    ])
    .set_properties(**{'font-size': '14px', 'background-color': '#F5F5F5', 'border-collapse': 'collapse', 'margin': '10px'})
)

# Display the styled DataFrame
styled_df

import missingno as msno

# Bar chart of non-null counts per column, one color per column
num_columns = len(df.columns)
colors = plt.cm.viridis(np.linspace(0, 1, num_columns))
msno.bar(df, color=colors)
plt.show()

3.3 | Data Visualization

3.3.1 | Attribute Distributions

df.hist(figsize = (10,10))
plt.show()

3.3.2 | Box Plots

num_rows, num_cols = 3, 3

# Create subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 10))

# Flatten the axes for easier iteration
axes = axes.flatten()

# Loop through the columns and create one boxplot per column
for i, column in enumerate(df.columns):
    sns.boxplot(data=df, x=column, ax=axes[i])
    axes[i].set_title(f'Boxplot for {column}')

# Remove any remaining empty subplots
for j in range(len(df.columns), len(axes)):
    fig.delaxes(axes[j])

# Adjust layout
plt.tight_layout()
plt.show()

3.3.3 | Pair Plot of Attributes

sns.pairplot(data = df, hue = 'Outcome' )
plt.show()

3.3.4 | Age and Outcome

sns.set(rc={"axes.facecolor":"#EAE7F9","figure.facecolor":"#EAE7F9"})
p=sns.catplot(x="Outcome",y="Age", data=df, kind='box')
plt.title("Age and Outcome Correlation", size=20, y=1.0);

3.3.5 | Glucose and Outcome

sns.set(rc={"axes.facecolor":"#EAE7F9","figure.facecolor":"#EAE7F9"})
p=sns.catplot(x="Outcome",y="Glucose", data=df, kind='box')
plt.title("Glucose and Outcome Correlation", size=20, y=1.0);

3.3.6 | Correlation Between Attributes

# Lower-triangle heatmap: the mask hides the upper triangle (including the diagonal)
plt.figure(figsize=(20, 17))
matrix = np.triu(df.corr())
sns.heatmap(df.corr(), annot=True, linewidth=.8, mask=matrix, cmap="rocket");

# Full correlation heatmap
plt.figure(figsize=(16,9))
sns.heatmap(df.corr(), annot=True);

# Keep the features whose absolute correlation with Outcome is at least 0.2
hig_corr = df.corr()
hig_corr_features = hig_corr.index[abs(hig_corr["Outcome"]) >= 0.2]
hig_corr_features

Result:

Index(['Pregnancies', 'Glucose', 'BMI', 'Age', 'Outcome'], dtype='object')
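The threshold selection above can be wrapped in a small reusable helper. The sketch below is illustrative only: the function name `correlated_features` and the toy frame are hypothetical, and the 0.2 cutoff simply mirrors the value used in this section.

```python
import pandas as pd


def correlated_features(frame: pd.DataFrame, target: str, threshold: float = 0.2) -> list:
    """Return columns whose absolute Pearson correlation with `target` meets the threshold."""
    corr_with_target = frame.corr()[target]
    return list(corr_with_target.index[corr_with_target.abs() >= threshold])


# Toy data: "a" rises with the target, "b" is uncorrelated with it.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 0, 0, 1, 0, 0],
    "target": [0, 0, 0, 1, 1, 1],
})

selected = correlated_features(demo, "target")
print(selected)
```

The target column always appears in the result because its correlation with itself is 1, which is also why `Outcome` shows up in the index printed above.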

3.3.7 | Variance

# Variance of each column (df.var() returns the variance; use df.std() for the standard deviation)
df.var()

Result:

Pregnancies                    11.354056
Glucose                      1022.248314
BloodPressure                 374.647271
SkinThickness                 254.473245
Insulin                     13281.180078
BMI                            62.159984
DiabetesPedigreeFunction        0.109779
Age                           138.303046
Outcome                         0.227483
dtype: float64
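Since `df.var()` reports variances, the standard deviation of each column is the square root of the value listed. A short sketch of that relationship using only the standard library; the `glucose` list is made-up sample data, and `statistics.variance` divides by n - 1, which matches the pandas default:

```python
import math
import statistics

# Illustrative values only; they are not taken from diabetes.csv.
glucose = [148, 85, 183, 89, 137, 116, 78, 115]

sample_variance = statistics.variance(glucose)  # divides by n - 1, like df.var()
sample_std = statistics.stdev(glucose)          # divides by n - 1, like df.std()

# The standard deviation is the square root of the variance.
assert math.isclose(sample_std, math.sqrt(sample_variance))

# Applied to the Glucose variance printed in the result table:
glucose_std_from_output = math.sqrt(1022.248314)
print(round(glucose_std_from_output, 2))
```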

4 | Data Preprocessing

Handling outliers

numeric_columns = ['Insulin', 'DiabetesPedigreeFunction']

for column_name in numeric_columns:
    # `method=` requires NumPy >= 1.22; older versions name this keyword `interpolation=`
    Q1 = np.percentile(df[column_name], 25, method='midpoint')
    Q3 = np.percentile(df[column_name], 75, method='midpoint')
    IQR = Q3 - Q1
    low_lim = Q1 - 1.5 * IQR
    up_lim = Q3 + 1.5 * IQR

    # Find outliers in the specified column (kept for inspection; not used below)
    outliers = df[(df[column_name] < low_lim) | (df[column_name] > up_lim)][column_name]

    # Replace outliers with the respective lower or upper limit
    df[column_name] = np.where(df[column_name] < low_lim, low_lim, df[column_name])
    df[column_name] = np.where(df[column_name] > up_lim, up_lim, df[column_name])
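The same capping rule (clip everything outside Q1 - 1.5·IQR and Q3 + 1.5·IQR) can be written compactly with `np.clip`. This is a sketch rather than a drop-in replacement: the helper name `cap_outliers_iqr` is hypothetical, and it uses NumPy's default linear percentile interpolation instead of the midpoint rule used above, so the limits can differ slightly on real data.

```python
import numpy as np


def cap_outliers_iqr(values, k=1.5):
    """Clip values to the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return np.clip(values, q1 - k * iqr, q3 + k * iqr)


# Made-up sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
capped = cap_outliers_iqr(sample)
print(capped)
```

Here Q1 is 3 and Q3 is 7, so the upper limit is 13 and only the extreme value is changed.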

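Separating features and target

X = df.drop('Outcome', axis = 1)
y = df['Outcome']

Splitting into training and test sets

from sklearn.model_selection import train_test_split

# No random_state is set here, so the partition (and the scores below) change from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)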
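Because the split above is unseeded, reported accuracies are hard to reproduce. One possible variation, shown on synthetic arrays, fixes `random_state` and passes `stratify` so the class ratio is preserved in both parts. The names `X_demo` and `y_demo` and the 65/35 class ratio are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 samples, 35% positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 65 + [1] * 35)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    random_state=42,   # fixed seed: the same partition on every run
    stratify=y_demo,   # keep the class proportions in train and test
)

print(X_tr.shape, X_te.shape, y_te.mean())
```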

5 | Machine Learning Models

5.1 | Logistic Regression

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C=1, penalty='l2', solver='liblinear', max_iter=200)
log_reg.fit(X_train, y_train)

Result:

LogisticRegression(C=1, max_iter=200, solver='liblinear')

from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

def predict_and_plot(model, inputs, targets, name=''):
    """Print the accuracy and plot a row-normalized confusion matrix."""
    preds = model.predict(inputs)
    accuracy = accuracy_score(targets, preds)
    print("Accuracy: {:.2f}%".format(accuracy * 100))
    cf = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cf, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title('{} Confusion Matrix'.format(name))
    return preds

# Predict and plot on the training data
train_preds = predict_and_plot(log_reg, X_train, y_train, 'Train')
# Predict and plot on the validation data
val_preds = predict_and_plot(log_reg, X_test, y_test, 'Validation')

Output:

Accuracy: 78.18%
Accuracy: 75.32%
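The heatmaps drawn by `predict_and_plot` use `normalize='true'`, which divides each row of the confusion matrix by the number of true samples in that class, so the diagonal shows per-class recall rather than raw counts. A small sketch of that arithmetic with made-up counts (the numbers are illustrative, not results from this dataset):

```python
import numpy as np

# Made-up confusion matrix counts: rows are true classes, columns are predictions.
counts = np.array([[85, 15],
                   [22, 32]])

# Row normalization, equivalent in effect to confusion_matrix(..., normalize='true').
row_normalized = counts / counts.sum(axis=1, keepdims=True)

# Overall accuracy is the share of samples on the diagonal.
accuracy = np.trace(counts) / counts.sum()

print(row_normalized)
print(accuracy)
```

In a case like this the overall accuracy can look acceptable while the recall for the positive class (the second diagonal entry) stays much lower, which is worth checking for a diagnostic task.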

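Evaluation: logistic regression model
Training accuracy - 77.54%
Validation accuracy - 77.68%
(These figures differ from the output printed above, 78.18% and 75.32%. The split in section 4 is not seeded, so the summary was most likely taken from a different run.)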
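A single unseeded split gives a noisy estimate, as the mismatch between runs suggests. One way to get a steadier number is k-fold cross-validation with `cross_val_score`, optionally combined with `StandardScaler` in a pipeline (both are imported in section 2 but not used there). The sketch below runs on synthetic data from `make_classification`; the data, the default solver, and the 5-fold setting are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y): 300 samples, 8 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Scaling lives inside the pipeline, so it is fit on each training fold only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Reporting the mean and spread across folds makes it easier to tell whether a difference between two models is larger than the run-to-run noise.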

5.2 | Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
model_2 = RandomForestClassifier(n_jobs =-1, random_state = 42)
model_2.fit(X_train,y_train)

Result:

RandomForestClassifier(n_jobs=-1, random_state=42)

model_2.score(X_train,y_train)

Result:

1.0

# predict_and_plot is the helper defined in section 5.1
# Predict and plot on the training data
train_preds = predict_and_plot(model_2, X_train, y_train, 'Train')
# Predict and plot on the validation data
val_preds = predict_and_plot(model_2, X_test, y_test, 'Validation')

Output:

Accuracy: 100.00%
Accuracy: 74.03%

Evaluation: random forest model, before tuning
Training accuracy - 96.00%
Validation accuracy - 78.08%
(The printed output above shows 100.00% and 74.03%; the summary figures appear to come from a different run.)
The model appears to overfit: training accuracy is very high while validation accuracy is considerably lower.

5.3 | Random Forest Hyperparameter Tuning

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 20, 30],             # number of trees in the forest
    'max_depth': [10, 20, 30],                # maximum depth of each tree
    'min_samples_split': [2, 5, 10, 15, 20],  # minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4, 6, 8],      # minimum samples required in a leaf node
}

model = RandomForestClassifier(random_state=42, n_jobs=-1)
grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set (refit=True by default)
best_model = grid_search.best_estimator_

# Evaluate the model on the training and validation data
train_accuracy = best_model.score(X_train, y_train)
val_accuracy = best_model.score(X_test, y_test)

# Print the results
print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)

Output:

Training Accuracy: 89.2%
Validation Accuracy: 87.6%

Evaluation: random forest model after hyperparameter tuning
Training accuracy - 89.2%
Validation accuracy - 87.6%
Compared with the untuned model, the gap between training and validation accuracy is smaller, and the reported validation accuracy is higher.
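When reading a grid search result, it helps to look at the selected parameters and the mean cross-validated score next to the held-out score. The sketch below shows those attributes (`best_params_`, `best_score_`) on synthetic data with a deliberately small grid so it runs quickly; the data and grid values are assumptions, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the diabetes features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A reduced grid, for illustration only.
small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=3, scoring="accuracy"
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("mean CV accuracy of best params:", search.best_score_)
print("held-out accuracy:", search.score(X_te, y_te))
```

A held-out accuracy far above the cross-validated score would be a hint that the particular split happened to be easy, rather than that the model improved.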

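5.4 | Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

decision_tree_model = DecisionTreeClassifier(random_state=42)
decision_tree_model.fit(X_train, y_train)

train_accuracy = decision_tree_model.score(X_train, y_train)
val_accuracy = decision_tree_model.score(X_test, y_test)

print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)

Output:

Training Accuracy: 1.0
Validation Accuracy: 0.9415584415584416

Evaluation: decision tree model, before tuning
Training accuracy - 100%
Validation accuracy - 75.0%
(The validation accuracy printed above, 0.94, does not match this summary and appears to come from a different run or split.)
The decision tree overfits the training data: it reaches perfect accuracy on the training set but, according to the summary figures, a much lower accuracy on the validation set.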
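The perfect training score is expected for an unconstrained tree, which keeps splitting until every training sample is classified correctly. A sketch on synthetic, deliberately noisy data illustrates the effect of capping the depth; the data, the 20% label noise, and `max_depth=3` are assumptions chosen for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped at random to mimic noisy measurements.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, tree in [("unconstrained", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, "depth:", tree.get_depth(),
          "train:", tree.score(X_tr, y_tr),
          "validation:", tree.score(X_te, y_te))
```

The unconstrained tree memorizes the training set, noise included, while the shallow tree gives up some training accuracy in exchange for a simpler decision rule.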

5.5 | Decision Tree Hyperparameter Tuning

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10, 15, 20, 25],
    'min_samples_leaf': [1, 3, 5, 7],
    'criterion': ['gini', 'entropy'],  # split quality criterion
}

decision_tree_model = DecisionTreeClassifier(random_state=42)

grid_search = GridSearchCV(decision_tree_model, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# best_estimator_ is already refit on the full training set (refit=True by default)
best_model = grid_search.best_estimator_

train_accuracy = best_model.score(X_train, y_train)
val_accuracy = best_model.score(X_test, y_test)

print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)

Evaluation: decision tree model after hyperparameter tuning
Training accuracy - 82.2%
Validation accuracy - 85.5%
Compared with the untuned model, overfitting is reduced and the reported results improve.

5.6 | K-Nearest Neighbors Classifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Re-split with a fixed seed; from here on the held-out set is called X_val / y_val
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

y_train_pred = knn_model.predict(X_train)
y_val_pred = knn_model.predict(X_val)

train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)

print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)

# Confusion matrix (raw counts) on the validation set
confusion = confusion_matrix(y_val, y_val_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Validation)')
plt.show()

Output:

Training Accuracy: 0.8045602605863192
Validation Accuracy: 0.6688311688311688

Evaluation: K-nearest neighbors classifier, before tuning
Training accuracy - 80.5%
Validation accuracy - 66.9%
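For context, it helps to compare such a score with a trivial baseline that always predicts the most frequent class. The sketch below assumes the class counts usually reported for this dataset, 500 non-diabetic and 268 diabetic samples; that split is consistent with the Outcome variance of 0.227483 printed in section 3.3.7, but it is not stated in the original analysis. The helper name `majority_baseline` is hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Assumed class counts for the full dataset (500 negative, 268 positive).
labels = [0] * 500 + [1] * 268

baseline = majority_baseline(labels)
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

Under that assumption the baseline is roughly 65%, so the untuned KNN validation accuracy is only slightly better than predicting "no diabetes" for everyone.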

5.7 | KNN Hyperparameter Tuning

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9]  # number of neighbors to explore
}

knn_model = KNeighborsClassifier()

grid_search = GridSearchCV(knn_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

y_train_pred = best_model.predict(X_train)
y_val_pred = best_model.predict(X_val)

train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)

print("Training Accuracy with Best Hyperparameters:", train_accuracy)
print("Validation Accuracy with Best Hyperparameters:", val_accuracy)

Output:

Training Accuracy with Best Hyperparameters: 0.7947882736156352
Validation Accuracy with Best Hyperparameters: 0.7272727272727273

Evaluation: KNN model after tuning
Training accuracy - 79.5%
Validation accuracy - 72.7%
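KNN relies on distances, so features with large numeric ranges (Insulin, for example) can dominate those with small ranges (DiabetesPedigreeFunction). A possible variation is to standardize the features inside a pipeline and tune `n_neighbors` on the scaled data. The sketch below uses synthetic data; the step names `scale` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

knn_pipeline = Pipeline([
    ("scale", StandardScaler()),     # zero mean, unit variance per feature
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
knn_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9]}

knn_search = GridSearchCV(knn_pipeline, knn_grid, cv=5, scoring="accuracy")
knn_search.fit(X_demo, y_demo)

print(knn_search.best_params_, knn_search.best_score_)
```

Keeping the scaler inside the pipeline means it is refit on each training fold, which avoids leaking validation statistics into the model.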

5.8 | Support Vector Classifier

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

y_train_pred = svm_model.predict(X_train)
y_val_pred = svm_model.predict(X_val)

train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)

print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)

train_confusion = confusion_matrix(y_train, y_train_pred)
val_confusion = confusion_matrix(y_val, y_val_pred)

# Training and validation confusion matrices side by side
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.heatmap(train_confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Training)')

plt.subplot(1, 2, 2)
sns.heatmap(val_confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Validation)')
plt.show()

Output:

Training Accuracy: 0.7785016286644951
Validation Accuracy: 0.7727272727272727

Evaluation: support vector classifier
Training accuracy - 77.9%
Validation accuracy - 77.3%

5.9 | AdaBoost Classifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

adaboost_model = AdaBoostClassifier(n_estimators=50, random_state=42)
adaboost_model.fit(X_train, y_train)

y_train_pred_adaboost = adaboost_model.predict(X_train)
y_val_pred_adaboost = adaboost_model.predict(X_val)

train_accuracy_adaboost = accuracy_score(y_train, y_train_pred_adaboost)
val_accuracy_adaboost = accuracy_score(y_val, y_val_pred_adaboost)

print("AdaBoost Training Accuracy:", train_accuracy_adaboost)
print("AdaBoost Validation Accuracy:", val_accuracy_adaboost)

confusion_adaboost = confusion_matrix(y_val, y_val_pred_adaboost)

# Plot the confusion matrix
plt.figure()
sns.heatmap(confusion_adaboost, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('AdaBoost Confusion Matrix (Validation)')
plt.show()

Output:

AdaBoost Training Accuracy: 0.8224755700325733
AdaBoost Validation Accuracy: 0.7402597402597403

Evaluation: AdaBoost classifier
Training accuracy - 82.2%
Validation accuracy - 74.0%

5.10 | Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier

# Create a gradient boosting classifier
gbm_model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)

# Fit the GBM model to the training data
gbm_model.fit(X_train, y_train)

# Make predictions on the training data
y_train_pred_gbm = gbm_model.predict(X_train)

# Make predictions on the validation data
y_val_pred_gbm = gbm_model.predict(X_val)

# Calculate the training accuracy
train_accuracy_gbm = accuracy_score(y_train, y_train_pred_gbm)

# Calculate the validation accuracy
val_accuracy_gbm = accuracy_score(y_val, y_val_pred_gbm)

# Print the training and validation accuracies
print("GBM Training Accuracy:", train_accuracy_gbm)
print("GBM Validation Accuracy:", val_accuracy_gbm)

Output:

GBM Training Accuracy: 0.9429967426710097
GBM Validation Accuracy: 0.7532467532467533

Evaluation: gradient boosting classifier
Training accuracy - 94.3%
Validation accuracy - 75.3%

5.11 | XGBoost Classifier

from xgboost import XGBClassifier

# Create an XGBoost classifier
xgboost_model = XGBClassifier(n_estimators=100, max_depth=3, random_state=42)

# Fit the XGBoost model to the training data
xgboost_model.fit(X_train, y_train)

# Make predictions on the training data
y_train_pred_xgboost = xgboost_model.predict(X_train)

# Make predictions on the validation data
y_val_pred_xgboost = xgboost_model.predict(X_val)

# Calculate the training accuracy
train_accuracy_xgboost = accuracy_score(y_train, y_train_pred_xgboost)

# Calculate the validation accuracy
val_accuracy_xgboost = accuracy_score(y_val, y_val_pred_xgboost)

# Print the training and validation accuracies
print("XGBoost Training Accuracy:", train_accuracy_xgboost)
print("XGBoost Validation Accuracy:", val_accuracy_xgboost)

Output:

[14:40:09] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBoost Training Accuracy: 0.988599348534202
XGBoost Validation Accuracy: 0.7272727272727273

Evaluation: XGBoost classifier
Training accuracy - 98.9%
Validation accuracy - 72.7%

6 | Conclusion

Summary

Best reported result: random forest model after hyperparameter tuning
Training accuracy - 89.2%
Validation accuracy - 87.6%
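To compare the remaining models on equal footing, the validation accuracies printed in sections 5.6 to 5.11 can be ranked directly, since those models all share the seeded split (`random_state=42`). The random forest and decision tree figures are left out of this sketch because they were measured on the earlier, unseeded split. The dictionary below simply copies the printed values, rounded to four decimals.

```python
# Validation accuracies copied from the printed outputs in sections 5.6 to 5.11.
validation_accuracy = {
    "KNN (default)": 0.6688,
    "KNN (tuned)": 0.7273,
    "SVC (linear)": 0.7727,
    "AdaBoost": 0.7403,
    "Gradient Boosting": 0.7532,
    "XGBoost": 0.7273,
}

# Sort from best to worst validation accuracy.
ranking = sorted(validation_accuracy.items(), key=lambda item: item[1], reverse=True)

for name, acc in ranking:
    print(f"{name:<20} {acc:.2%}")
```

On that shared split the linear SVC has the highest validation accuracy among these models, with gradient boosting second.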

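Possible future work

- Tune with different hyperparameters to improve the results
- Implement more models to obtain better results
- Use the same approach to predict the response on new data
## [Source](https://mbd.pub/o/bread/YZWYmZxxZQ==)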