An Appreciation of Experimental Designs in Single-Cell Sequencing Data Analysis (Part 2)
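The experimental design featured this time combines single-cell sequencing analysis with machine learning (including SHAP analysis), which is one of the more common designs. Its main use is screening and identifying gene regulatory networks, though it can also be used to build gene-based prediction models. The publication details are as follows: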
Wang F, Liang Y, Wang QW. Interpretable machine learning-driven biomarker identification and validation for Alzheimer’s disease. Sci Rep. 2024 Dec 28;14(1):30770. doi: 10.1038/s41598-024-80401-6. PMID: 39730451; PMCID: PMC11680850.
Abstract
Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by limited effective treatments, underscoring the critical need for early detection and diagnosis to improve intervention outcomes. This study integrates various bioinformatics methodologies with interpretable machine learning to identify reliable biomarkers for AD diagnosis and treatment. By leveraging differentially expressed genes (DEGs) analysis, weighted gene co-expression network analysis (WGCNA), and construction of Protein-Protein Interaction (PPI) Networks, we meticulously analyzed the AD dataset from the GEO database to pinpoint Hub genes. Subsequently, various machine learning algorithms were employed to construct diagnostic models, which were then elucidated using SHapley Additive exPlanations (SHAP). To visualize our findings, we generated an insightful bioinformatics map of 10 Hub genes. We then conducted experimental validation on less-studied Hub genes, revealing significant differential mRNA expression of MYH9 and RHOQ in an AD cell model. Finally, we explored the biological significance of these two genes at the single-cell transcriptome level. This study not only introduces interactive SHAP panels for precise decision-making in AD but also offers novel insights into the identification of AD biomarkers through interpretable machine learning diagnostic models. Particularly, MYH9 has emerged as a promising new potential biomarker, pointing the way towards enhanced diagnostic accuracy and personalized therapeutic strategies for AD. Although the mRNA expression patterns of RHOQ are opposite in AD cell models and human brain tissue samples, the role of RHOQ in AD remains worthy of further exploration due to the diversity and complexity of biological molecular regulation.
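Description of the experimental design:
Relevant datasets are first obtained from the GEO database, and WGCNA and DEG analysis are used to screen for differentially co-expressed genes. Machine learning algorithms (such as LightGBM) then extract feature genes, followed by GO and KEGG enrichment analysis and PPI network analysis to determine the Hub genes. Based on these Hub genes, several machine learning algorithms are used to build AD diagnostic models; model performance is assessed with 5-fold cross-validation, and the best model is selected, then interpreted and visualized with SHAP. Gene function analyses are also carried out, including GSEA, immune infiltration analysis, and immune checkpoint analysis, and the genes are validated experimentally through RT-qPCR, database validation, and single-cell transcriptome analysis.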
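The model-selection step (several algorithms compared by 5-fold cross-validation, best one kept) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the paper's code: the expression data are synthetic, and the two candidate "models" (a nearest-centroid classifier and a majority-class baseline) are stand-ins chosen for brevity rather than the algorithms the authors used. In practice a library such as scikit-learn would normally handle the splitting and scoring.

```python
import random
from statistics import mean


def stratified_kfold(labels, k=5, seed=0):
    """Return k lists of test indices, each with roughly the same class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class round-robin across folds
    return folds


class NearestCentroid:
    """Assign each sample to the class whose mean expression profile is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for cls in set(y):
            rows = [x for x, label in zip(X, y) if label == cls]
            self.centroids[cls] = [mean(col) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [min(self.centroids, key=lambda c: dist2(x, self.centroids[c])) for x in X]


class MajorityClass:
    """Baseline that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


def cross_val_accuracy(make_model, X, y, k=5, seed=0):
    """Mean accuracy over k stratified folds; the model is refit on every fold."""
    scores = []
    for test_idx in stratified_kfold(y, k, seed):
        held_out = set(test_idx)
        X_train = [x for i, x in enumerate(X) if i not in held_out]
        y_train = [t for i, t in enumerate(y) if i not in held_out]
        model = make_model().fit(X_train, y_train)
        preds = model.predict([X[i] for i in test_idx])
        scores.append(sum(p == y[i] for p, i in zip(preds, test_idx)) / len(test_idx))
    return mean(scores)


# Synthetic data (an assumption for illustration): 50 controls (0) and 50 AD samples (1),
# three hypothetical hub genes, the first two shifted between the groups.
rng = random.Random(42)
X, y = [], []
for label, shift in ((0, 0.0), (1, 1.5)):
    for _ in range(50):
        X.append([rng.gauss(shift, 1.0), rng.gauss(-shift, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)

candidates = {"nearest_centroid": NearestCentroid, "majority_baseline": MajorityClass}
cv_scores = {name: cross_val_accuracy(cls, X, y) for name, cls in candidates.items()}
best_model = max(cv_scores, key=cv_scores.get)
print(cv_scores, "-> best:", best_model)
```

Stratifying the folds keeps the AD/control ratio the same in every split, which matters when the two groups are unbalanced, as is often the case in GEO cohorts.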
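Compared with single-cell sequencing analysis alone
- Multi-dimensional data analysis: single-cell sequencing on its own focuses mainly on gene expression differences at the cell level, whereas the design in this paper integrates several analysis methods. WGCNA and DEG analysis screen for differentially co-expressed genes at the bulk transcriptome level and uncover gene modules associated with the disease; KEGG and GO analysis clarify gene functions and enriched pathways; PPI network analysis determines the Hub genes. Together they examine the relationship between genes and disease from several angles and provide more comprehensive information.
- Building a diagnostic model: machine learning algorithms are used to build an AD diagnostic model, which is then evaluated and optimized. Cross-validation and comparison of different algorithms identify the best model, providing a practical tool for AD diagnosis. Single-cell sequencing analysis alone usually does not involve building diagnostic models, which limits its use in disease diagnosis.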
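- Interpretability: SHAP is used to interpret and visualize the diagnostic model, showing how much each Hub gene contributes to the model's prediction of disease status, which makes the model's results easier to understand and explain. Single-cell sequencing data are comparatively hard to interpret, and analysis of them alone rarely makes the mechanism linking a gene to the disease explicit, so this design has a clear advantage in interpretability.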
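- Comprehensive functional validation: in addition to single-cell transcriptome analysis, gene function is validated with several other approaches, including RT-qPCR and database validation. Validation at multiple levels, from cell models and human tissue samples to animal models, supports the reliability and accuracy of the results; single-cell sequencing analysis alone can hardly achieve validation this comprehensive.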
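- Machine learning plus SHAP analysis shows the association between genes and the clinical phenotype (whether or not a sample is AD) from a different angle. In particular, SHAP analysis can rank genes by importance, which makes it possible to pick out the relatively important ones.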
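The gene ranking mentioned in the last point rests on Shapley values: each gene's contribution to one prediction is averaged over all orders in which genes could be "revealed" to the model, and the global ranking is the mean absolute contribution across samples. The sketch below computes this exactly by enumerating coalitions, which is only feasible for a handful of genes. Everything in it is hypothetical: the gene names `HUB_A`, `HUB_B`, `HUB_C`, the weights of the toy linear model, and the expression values are invented for illustration and are not results from the paper.

```python
from itertools import combinations
from math import factorial
from statistics import mean

GENES = ["HUB_A", "HUB_B", "HUB_C"]  # hypothetical hub genes
WEIGHTS = [1.5, -1.0, 0.1]           # made-up coefficients of a toy linear score


def model(x):
    """Toy diagnostic score: a weighted sum of expression values."""
    return sum(w * v for w, v in zip(WEIGHTS, x))


def coalition_value(model, x, background, known):
    """Mean model output when genes in `known` take x's values and the rest
    are filled in from each background sample in turn."""
    outputs = []
    for b in background:
        mixed = [x[j] if j in known else b[j] for j in range(len(x))]
        outputs.append(model(mixed))
    return mean(outputs)


def shapley_values(model, x, background):
    """Exact Shapley value of every gene for one sample (exponential in gene count)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = (coalition_value(model, x, background, s | {i})
                        - coalition_value(model, x, background, s))
                phi[i] += weight * gain
    return phi


# Invented expression values: rows are samples, columns follow GENES.
samples = [
    [2.0, 0.5, 1.0],
    [1.0, 1.5, 0.8],
    [3.0, 0.2, 1.2],
    [0.5, 2.0, 0.9],
]

# Per-sample attributions, using the samples themselves as the background set.
all_phi = [shapley_values(model, x, samples) for x in samples]

# Global importance: mean absolute Shapley value per gene, then sort descending.
importance = {g: mean(abs(phi[i]) for phi in all_phi) for i, g in enumerate(GENES)}
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

For a real model with ten Hub genes, exact enumeration is still tractable but slow; dedicated SHAP implementations use model-specific shortcuts to obtain the same kind of attribution.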
Areas for further improvement
- Only one clinical phenotype is used, so the account of the AD-related gene network is not comprehensive enough;
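- The SHAP analysis does not go further into individual genes, nor into interactions between genes, so it is not comprehensive enough;
- The aim of the study remains somewhat unclear. The authors put effort into both identifying a gene regulatory network and building a prediction model, but neither is carried through thoroughly. In my view the study could concentrate on the first aim and add SHAP interaction analysis and single-gene SHAP analysis.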
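The two suggested additions can be sketched under strong simplifying assumptions. For interactions, the code below computes the pairwise Shapley interaction index exactly by enumeration (SHAP's interaction values are built on the same pairwise term, split evenly between the two genes). For single-gene analysis, it pairs each sample's expression of one gene with that gene's Shapley value, which is the data behind a SHAP dependence plot. The gene names, the toy model with a built-in `HUB_A` x `HUB_B` interaction, and the expression values are all hypothetical and unrelated to the paper's results.

```python
from itertools import combinations
from math import factorial
from statistics import mean

GENES = ["HUB_A", "HUB_B", "HUB_C"]  # hypothetical hub genes


def model(x):
    """Toy score with an explicit HUB_A x HUB_B interaction; HUB_C has no effect."""
    return x[0] + 0.5 * x[1] + 2.0 * x[0] * x[1]


def value(x, background, known):
    """Mean output when genes in `known` come from x and the rest from the background."""
    return mean([
        model([x[j] if j in known else b[j] for j in range(len(x))])
        for b in background
    ])


def shapley_value(x, background, i):
    """Exact Shapley value of gene i for sample x."""
    n = len(x)
    others = [j for j in range(n) if j != i]
    total = 0.0
    for size in range(n):
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            s = set(subset)
            total += weight * (value(x, background, s | {i}) - value(x, background, s))
    return total


def interaction_index(x, background, i, j):
    """Exact pairwise Shapley interaction index of genes i and j for sample x."""
    n = len(x)
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(n - 1):
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = set(subset)
            # Effect of adding both genes minus the effects of adding each alone.
            joint = (value(x, background, s | {i, j}) - value(x, background, s | {i})
                     - value(x, background, s | {j}) + value(x, background, s))
            total += weight * joint
    return total


# Invented expression values: rows are samples, columns follow GENES.
samples = [
    [2.0, 0.5, 1.0],
    [1.0, 1.5, 0.8],
    [3.0, 0.2, 1.2],
    [0.5, 2.0, 0.9],
]

# Mean absolute interaction strength for every gene pair.
interactions = {
    (GENES[i], GENES[j]): mean([abs(interaction_index(x, samples, i, j)) for x in samples])
    for i, j in combinations(range(len(GENES)), 2)
}
top_pair = max(interactions, key=interactions.get)

# Single-gene view: (expression, Shapley value) pairs for HUB_A, sorted by expression.
dependence = sorted((x[0], shapley_value(x, samples, 0)) for x in samples)
print(top_pair, dependence)
```

A pair whose interaction index is consistently far from zero is a candidate regulatory relationship worth following up, which is the kind of evidence the first aim (identifying a gene regulatory network) would benefit from.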