
[Statistical Methods] Variable Selection with LASSO

The most basic package for running LASSO is glmnet, loaded with library(glmnet).

If the goal is a pure LASSO analysis, alpha must be set to 1 (alpha = 0 gives ridge regression; values in between give the elastic net).
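For reference, the alpha settings side by side; a minimal sketch (x and y here stand for the feature matrix and response defined in the mtcars example below):

library(glmnet)
lasso_fit <- glmnet(x, y, alpha = 1)    # pure LASSO (L1 penalty)
ridge_fit <- glmnet(x, y, alpha = 0)    # pure ridge (L2 penalty)
enet_fit  <- glmnet(x, y, alpha = 0.5)  # elastic net (L1/L2 mix)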

Standardize the data: LASSO is sensitive to the scale of the features, so they should be standardized (mean 0, variance 1).

Pass the lambda.min or lambda.1se obtained from cv.glmnet to glmnet::glmnet(lambda = ...), as in the example below.

library(glmnet)

# Load the data (mtcars as an example)
data(mtcars)
x <- as.matrix(mtcars[, -1])  # feature matrix (mpg is the response)
y <- mtcars$mpg

# Cross-validation to select the optimal lambda (automatic LASSO)
cv_fit <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_fit$lambda.min

# Refit the final model at the optimal lambda
final_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)

# Inspect the selected variables (nonzero coefficients)
selected_vars <- rownames(coef(final_model))[coef(final_model)[, 1] != 0]
print(selected_vars)
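A common alternative is lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum; it usually selects fewer variables. A minimal sketch reusing cv_fit from above:

plot(cv_fit)  # CV curve; lambda.min and lambda.1se are marked by dotted vertical lines
sparser_model <- glmnet(x, y, alpha = 1, lambda = cv_fit$lambda.1se)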

Manually standardizing the feature matrix:
x_scaled <- scale(x)
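Note that glmnet already standardizes features internally by default (standardize = TRUE) and returns coefficients on the original scale, so manual scaling is optional. A minimal sketch, assuming you scale by hand instead:

# With pre-scaled inputs, turn off the internal standardization
fit_scaled <- cv.glmnet(x_scaled, y, alpha = 1, standardize = FALSE)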

Testing how categorical variables are handled


library(glmnet)

data(iris)
str(iris$Species)  # a factor with 3 levels
df <- iris
# Dummy-code Species; drop the intercept column from model.matrix so
# glmnet is not handed a penalized column of ones
design_matrix <- model.matrix(~ Species, data = df)[, -1]
x <- as.matrix(data.frame(Sepal.Width  = df$Sepal.Width,
                          Petal.Length = df$Petal.Length,
                          Petal.Width  = df$Petal.Width,
                          design_matrix))

fit1 <- cv.glmnet(x = x, y = df$Sepal.Length)
fit1
plot(fit1)

# For comparison: encode Species as a single numeric code (generally not recommended)
iris$Species_num <- as.numeric(iris$Species)
# Select by name; iris[, c(2, 3, 4, 5)] would pull in the Species factor,
# and as.matrix() would then coerce the whole matrix to character
x2 <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width", "Species_num")])
fit2 <- cv.glmnet(x = x2, y = iris$Sepal.Length)
fit2
plot(fit2)
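To see what the two encodings do, compare the coefficients of the two fits: dummy coding gives each non-reference level its own coefficient, while numeric coding forces a single linear effect across the three levels.

coef(fit1, s = "lambda.1se")
coef(fit2, s = "lambda.1se")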

The esophageal cancer data

# -----01-Lasso----
library(glmnet)
df <- read.csv("tab.csv")

# 70/30 train/test split
set.seed(123)
train_index <- caret::createDataPartition(1:nrow(df), p = 0.7, list = TRUE)[["Resample1"]]
test_index <- setdiff(1:nrow(df), train_index)

# First, search for the tuning parameter with cv.glmnet()

# Convert the categorical columns to factors
names(df)
df[, 4:15] <- lapply(df[, 4:15], as.factor)

# Helper for writing out the formula below
paste(names(df[, 4:15]), collapse = "+")
design_matrix <- model.matrix(
  ~ Smoking_status + Alcohol_consumption + Tea_consumption + Sex + Ethnic.group + Residence + Education + Marital.status + History_of_diabetes + Family_history_of_cancer + Occupation + Physical_Activity,
  data = df)[, -1]  # drop the intercept column, as above
# Standardize the continuous columns and spot-check one of them
df[, 16:48] <- scale(df[, 16:48])
summary(df$AAvsEPA); sd(df$AAvsEPA)
x <- as.matrix(data.frame(df[, 16:48], design_matrix))

 

fit1 <- cv.glmnet(x = x[train_index, ], y = df[train_index, ]$Group,
                  alpha = 1, nfolds = 5,
                  type.measure = "mse", family = "binomial")
plot(fit1)
fit1
mean(fit1$cvm)
best_lambda <- fit1$lambda.1se
coefficients <- coef(fit1, s = best_lambda)
selected_vars <- rownames(coefficients)[coefficients[, 1] != 0]
print("Selected variables:")
print(selected_vars)

# Predicted probabilities on the test set; type = "response" is needed because
# the default "link" scale is not comparable to the 0/1 outcome
lasso_pred <- predict(fit1, s = best_lambda, newx = x[test_index, ], type = "response")

# Group is assumed to be coded 0/1
mse <- mean((lasso_pred - df[test_index, ]$Group)^2)
cat("Test MSE:", mse, "\n")



fit <- glmnet(x, df$Group, family = "binomial", maxit = 1000)  # binary outcome, so family = "binomial"
plot(fit)

# Refit glmnet over the same lambda sequence to draw the coefficient paths
final_model <- glmnet(x[train_index, ],
                      df[train_index, ]$Group,
                      family = "binomial",
                      lambda = fit1$lambda,
                      alpha = 1)
plot(final_model, label = TRUE)
plot(final_model, xvar = "lambda", label = TRUE)
plot(final_model, xvar = "dev", label = TRUE)

Feature selection
We found 44 potential features, including demographics and clinical and laboratory variables (Table 1). We performed feature selection using the least absolute shrinkage and selection operator (LASSO), which is among the most widely used feature selection techniques. LASSO constructs a penalty function that compresses some of the regression coefficients, i.e., it forces the sum of the absolute values of the coefficients to be less than some fixed value while setting some regression coefficients at zero, thus obtaining a more refined model. LASSO retains the advantage of subset shrinkage as a biased estimator that deals with data with complex covariance. This algorithm uses LassoCV, a fivefold cross-validation approach, to automatically eliminate factors with zero coefficients (Python version: sklearn 0.22.1).
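In symbols, the constraint described above is the standard LASSO problem (textbook notation, not from the source):

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t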

2.2.2. Feature Selection.
Feature selection was performed using least absolute shrinkage and selection operator (LASSO) regression. The LASSO regression model improves prediction performance by adjusting the hyperparameter λ, which compresses the regression coefficients toward zero, and selecting the feature set that performs best in DN prediction. To determine the best λ value, λ was selected by minimizing the mean cross-validated error using 10-fold cross-validation.
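A minimal glmnet sketch of this selection rule (x_train and y_train are placeholder names, not from the source):

# 10-fold cross-validation; lambda.min minimizes the mean CV error
cv10 <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
cv10$lambda.min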

Detailed steps were as follows: (1) Screening characteristic factors: First, R software (glmnet 4.1.2) was used to conduct the least absolute shrinkage and selection operator (LASSO) regression analysis, adjusting variable screening and model complexity. Then, the LASSO regression results were used to conduct multifactor logistic regression analysis with SPSS, and finally we obtained the characteristic factors with p < 0.05. (2) Data division: The Python (0.22.1) random number method was used to randomly divide the gout patients into a training set and a test set at a ratio of 7:3, with 491 patients in the training set and 211 in the testing set. (3) Classified multi-model comprehensive analysis: eXtreme Gradient Boosting (XGBoost)

