SSCLMD Model Code Implementation Explained
1. Project Source Structure
The SSCLMD project source tree is organized as follows:
```
SSCLMD-main/
├── README.md
├── ST4.xlsx
├── Supplementary File.docx
├── code/
│   ├── calculating_similarity.py
│   ├── data_preparation.py
│   ├── data_preprocess.py
│   ├── layer.py
│   ├── main.py
│   ├── parms_setting.py
│   ├── train.py
│   └── utils.py
└── data/
    ├── dataset1.rar
    └── dataset2.rar
```
2. Core Model Components
2.1 Model Definition (layer.py)
The model is defined in layer.py and consists of the following key classes:
- The Attention class:
```python
class Attention(nn.Module):
    def __init__(self, in_size, hidden_size=128):  # LDA: 128; MDA, LMI: 16
        super(Attention, self).__init__()
        self.project = nn.Sequential(
            nn.Linear(in_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1, bias=False)
        )

    def forward(self, z):
        w = self.project(z)
        beta = torch.softmax(w, dim=1)
        return (beta * z).sum(1), beta
```
This implements an attention mechanism over views: it scores each view, normalizes the scores with a softmax, and aggregates the per-view features as a weighted sum.
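To make the tensor shapes concrete, here is a minimal usage sketch of the class above (the sizes are illustrative):

```python
import torch

N, V, D = 5, 3, 128                 # nodes, views, embedding dim (illustrative)
att = Attention(in_size=D)          # the Attention class defined above
z = torch.randn(N, V, D)            # per-view node embeddings stacked on dim 1
fused, beta = att(z)
print(fused.shape)                  # torch.Size([5, 128]): weighted sum over views
print(beta.shape)                   # torch.Size([5, 3, 1]): per-node view weights
```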
- The GCN class:
```python
class GCN(nn.Module):
    def __init__(self, nfeat, nhid, out, dropout=0.5):
        super(GCN, self).__init__()
        self.gc1 = GCNConv(nfeat, nhid)
        self.prelu1 = nn.PReLU(nhid)
        self.gc2 = GCNConv(nhid, out)
        self.prelu2 = nn.PReLU(out)
        self.dropout = dropout

    def forward(self, x, adj):
        x = self.prelu1(self.gc1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.prelu2(self.gc2(x, adj))
        return x
```
This is a two-layer graph convolutional network that extracts node features from the graph structure.
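As a shape check, here is a hedged, minimal usage sketch of the GCN on a toy PyG-style graph (the sizes are illustrative):

```python
import torch

# A toy undirected graph: 4 nodes, 16-dim features, edges stored as a 2 x E index
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]], dtype=torch.long)
x = torch.randn(4, 16)

gcn = GCN(nfeat=16, nhid=8, out=4)   # the GCN class defined above
out = gcn(x, edge_index)
print(out.shape)                     # torch.Size([4, 4])
```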
- The Discriminator class:
```python
class Discriminator(nn.Module):
    def __init__(self, dim):
        super(Discriminator, self).__init__()
        self.fn = nn.Bilinear(dim, dim, 1)

    def forward(self, h1, h2, h3, h4, c1, c2):
        c_x1 = c1.expand_as(h1).contiguous()
        c_x2 = c2.expand_as(h2).contiguous()
        # positive
        sc_1 = self.fn(h1, c_x1).squeeze(1)
        sc_2 = self.fn(h2, c_x2).squeeze(1)
        # negative
        sc_3 = self.fn(h3, c_x1).squeeze(1)
        sc_4 = self.fn(h4, c_x2).squeeze(1)
        logits = th.cat((sc_1, sc_2, sc_3, sc_4))
        return logits
```
This is the discriminator for the self-supervised contrastive objective: a bilinear layer scores node embeddings against graph-level summaries, separating positive from negative samples (th is the module's alias for torch).
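To see what the discriminator consumes and produces, here is a minimal shape sketch (sizes illustrative):

```python
import torch as th

N, D = 5, 128
disc = Discriminator(dim=D)                # the Discriminator defined above
h1, h2 = th.randn(N, D), th.randn(N, D)    # clean encodings of the two views
h3, h4 = th.randn(N, D), th.randn(N, D)    # shuffled (negative) encodings
c1, c2 = th.randn(1, D), th.randn(1, D)    # global summaries of the two views
logits = disc(h1, h2, h3, h4, c1, c2)
print(logits.shape)                        # torch.Size([20]): 2N positive scores, then 2N negative
```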
- The SSCLMD class:
```python
class SSCLMD(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim, decoder1):
        super(SSCLMD, self).__init__()
        self.encoder1 = GCN(in_dim, hid_dim, out_dim)
        self.encoder2 = GCN(in_dim, hid_dim, out_dim)
        self.encoder3 = GCN(in_dim, hid_dim, out_dim)
        self.encoder4 = GCN(in_dim, hid_dim, out_dim)
        self.pooling = AvgReadout()
        self.attention = Attention(out_dim)
        self.disc = Discriminator(out_dim)
        self.act_fn = nn.Sigmoid()
        self.local_mlp = nn.Linear(out_dim, out_dim)
        self.global_mlp = nn.Linear(out_dim, out_dim)
        self.decoder1 = nn.Linear(out_dim * 4, decoder1)
        self.decoder2 = nn.Linear(decoder1, 1)
```
This is the main SSCLMD model class, tying together the encoders, the attention module, the discriminator, and the decoder. (Note that encoder4 is instantiated but never used in the forward pass below.)
2.2 The Forward Pass
The forward pass of the SSCLMD model proceeds as follows:
```python
def forward(self, data_s, data_f, idx):
    # Node features and the two graph structures
    feat, s_graph = data_s.x, data_s.edge_index
    shuff_feat, f_graph = data_f.x, data_f.edge_index

    # Encode the structural graph and the attribute (feature) graph
    h1 = self.encoder1(feat, s_graph)
    h2 = self.encoder2(feat, f_graph)
    h1 = self.local_mlp(h1)
    h2 = self.local_mlp(h2)

    # Encode the shuffled features as negative samples
    h3 = self.encoder1(shuff_feat, s_graph)
    h4 = self.encoder2(shuff_feat, f_graph)
    h3 = self.local_mlp(h3)
    h4 = self.local_mlp(h4)

    # Extra encodings for relation prediction (encoder3 on both graphs)
    h5 = self.encoder3(feat, s_graph)
    h6 = self.encoder3(feat, f_graph)

    # Global (graph-level) representations
    c1 = self.act_fn(self.global_mlp(self.pooling(h1)))
    c2 = self.act_fn(self.global_mlp(self.pooling(h2)))

    # Self-supervised contrastive scores
    out = self.disc(h1, h2, h3, h4, c1, c2)

    # Multi-view fusion
    h_com = (h5 + h6) / 2
    emb = torch.stack([h1, h2, h_com], dim=1)
    emb, att = self.attention(emb)

    # Select entity embeddings by task type (args is read from module scope)
    if args.task_type == 'LDA':
        entity1 = emb[idx[0]]
        entity2 = emb[idx[1] + 386]
    if args.task_type == 'MDA':
        entity1 = emb[idx[0] + 702]
        entity2 = emb[idx[1] + 386]
    if args.task_type == 'LMI':
        entity1 = emb[idx[0]]
        entity2 = emb[idx[1] + 702]

    # Multi-relation decoder: sum, Hadamard product, and concatenation
    add = entity1 + entity2
    product = entity1 * entity2
    concatenate = torch.cat((entity1, entity2), dim=1)
    feature = torch.cat((add, product, concatenate), dim=1)
    log1 = F.relu(self.decoder1(feature))
    log = self.decoder2(log1)
    return out, log
```
The hard-coded offsets (386 and 702) appear to shift local entity indices into the global index space of the concatenated node list (lncRNAs, diseases, and miRNAs); the decoder input is 4 × out_dim because the sum (d), product (d), and concatenation (2d) are stacked together.
3. Data Preprocessing
Preprocessing is implemented in data_preprocess.py. The key steps are:
- Loading the data and building positive/negative samples:
```python
positive = np.loadtxt(args.in_file, dtype=np.int64)
link_size = int(positive.shape[0])
np.random.seed(args.seed)
np.random.shuffle(positive)
positive = positive[:link_size]

negative_all = np.loadtxt(args.neg_sample, dtype=np.int64)
np.random.shuffle(negative_all)
negative = np.asarray(negative_all[:positive.shape[0]])

# Append labels: 1 for positive pairs, 0 for negative pairs
positive = np.concatenate([positive, np.ones(positive.shape[0], dtype=np.int64).reshape(-1, 1)], axis=1)
negative = np.concatenate([negative, np.zeros(negative.shape[0], dtype=np.int64).reshape(-1, 1)], axis=1)
all_data = np.vstack((positive, negative))
```
- Building the K-fold cross-validation splits:
```python
kf = KFold(n_splits=n_splits, shuffle=True, random_state=args.seed)
cv_train_loaders = []
cv_test_loaders = []
for train_index, test_index in kf.split(all_data):
    train_data = all_data[train_index]
    test_data = all_data[test_index]
    train_positive = train_data[train_data[:, 2] == 1][:, :2]
    # Build the adjacency matrix from the training positives
    ...
    # Build the data loaders
    training_set = Data_class(train_data)
    train_loader = DataLoader(training_set, **params)
    test_set = Data_class(test_data)
    test_loader = DataLoader(test_set, **params)
    cv_train_loaders.append(train_loader)
    cv_test_loaders.append(test_loader)
```
- Building the graph data structures:
```python
# Build the edge indices
edges_s = s_adj.nonzero()
edge_index_s = torch.tensor(np.vstack((edges_s[0], edges_s[1])), dtype=torch.long)
edges_f = f_adj.nonzero()
edge_index_f = torch.tensor(np.vstack((edges_f[0], edges_f[1])), dtype=torch.long)

# Convert the features to tensors
x = torch.tensor(node_feature, dtype=torch.float)
shuf_feature = torch.tensor(shuf_feature, dtype=torch.float)

# Create the PyG Data objects
data_s = Data(x=x, edge_index=edge_index_s)
data_f = Data(x=shuf_feature, edge_index=edge_index_f)
```
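The shuf_feature matrix above holds corrupted node features that serve as contrastive negatives. One common way to build such a corruption (an assumption; check data_preprocess.py for the project's exact recipe) is a row permutation of the feature matrix:

```python
import numpy as np

node_feature = np.random.rand(997, 512)              # placeholder for the real feature matrix
perm = np.random.permutation(node_feature.shape[0])
shuf_feature = node_feature[perm]                    # each node now carries another node's features
```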
4. The Training Procedure
Training is implemented in train.py and consists of the following steps:
- Model initialization:
```python
model = SSCLMD(in_dim=args.dimensions, hid_dim=args.hidden1, out_dim=args.hidden2, decoder1=args.decoder1)
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)
m = torch.nn.Sigmoid()
loss_fct = torch.nn.BCEWithLogitsLoss()
loss_node = torch.nn.BCELoss()
```
- The training loop:
```python
for epoch in range(args.epochs):
    t = time.time()
    print('-------- Epoch ' + str(epoch + 1) + ' --------')
    y_pred_train = []
    y_label_train = []

    # Contrastive targets: 2N ones then 2N zeros (N = node count)
    lbl_1 = torch.ones(997 * 2)   # dataset1: 997; dataset2: 1071
    lbl_2 = torch.zeros(997 * 2)
    lbl = torch.cat((lbl_1, lbl_2)).cuda()

    for i, (label, inp) in enumerate(train_loader):
        if args.cuda:
            label = label.cuda()
        model.train()
        optimizer.zero_grad()

        # Forward pass
        output, log = model(data_s, data_f, inp)
        log = torch.squeeze(m(log))

        # Loss: supervised link prediction plus the weighted contrastive term
        loss_class = loss_node(log, label.float())
        loss_constra = loss_fct(output, lbl)
        loss_train = loss_class + args.loss_ratio1 * loss_constra

        # Backward pass
        loss_train.backward()
        optimizer.step()

        # Collect predictions
        label_ids = label.to('cpu').numpy()
        y_label_train = y_label_train + label_ids.flatten().tolist()
        y_pred_train = y_pred_train + log.flatten().tolist()
        if i % 100 == 0:
            print('epoch: ' + str(epoch + 1) + '/ iteration: ' + str(i + 1)
                  + '/ loss_train: ' + str(loss_train.cpu().detach().numpy()))

    # ROC AUC on the training set
    roc_train = roc_auc_score(y_label_train, y_pred_train)
    print('epoch: {:04d}'.format(epoch + 1),
          'loss_train: {:.4f}'.format(loss_train.item()),
          'auroc_train: {:.4f}'.format(roc_train),
          'time: {:.4f}s'.format(time.time() - t))
```
- The test procedure:
```python
def test(model, loader, data_s, data_f, args):
    m = torch.nn.Sigmoid()
    loss_fct = torch.nn.BCEWithLogitsLoss()
    loss_node = torch.nn.BCELoss()

    # Contrastive targets, as in training
    lbl_1 = torch.ones(997 * 2)
    lbl_2 = torch.zeros(997 * 2)
    lbl = torch.cat((lbl_1, lbl_2)).cuda()

    inp_id0 = []
    inp_id1 = []
    model.eval()
    y_pred = []
    y_label = []
    with torch.no_grad():
        for i, (label, inp) in enumerate(loader):
            inp_id0.append(inp[0])
            inp_id1.append(inp[1])
            if args.cuda:
                label = label.cuda()

            # Forward pass
            output, log = model(data_s, data_f, inp)
            log = torch.squeeze(m(log))

            # Losses
            loss_class = loss_node(log, label.float())
            loss_constra = loss_fct(output, lbl)
            loss = loss_class + args.loss_ratio1 * loss_constra

            # Collect predictions
            label_ids = label.to('cpu').numpy()
            y_label = y_label + label_ids.flatten().tolist()
            y_pred = y_pred + log.flatten().tolist()

    # Binarize predictions at 0.5 and compute the evaluation metrics
    outputs = np.asarray([1 if i else 0 for i in (np.asarray(y_pred) >= 0.5)])
    return (roc_auc_score(y_label, y_pred),
            average_precision_score(y_label, y_pred),
            f1_score(y_label, outputs),
            loss)
```
5. Main Program Flow (main.py)
The main program is short:
```python
# Parse the settings
args = settings()

# CUDA setup
args.cuda = not args.no_cuda and torch.cuda.is_available()
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if args.cuda:
    torch.cuda.manual_seed(args.seed)

# Load the data; load_data returns one (train, test) loader pair per fold
data_s, data_f, train_loader, test_loader = load_data(args, n_splits=5)

# Train and test on each fold (the loop variables shadow the loader lists)
for fold, (train_loader, test_loader) in enumerate(zip(train_loader, test_loader)):
    print(f"Training on fold {fold+1}")
    train_model(data_s, data_f, train_loader, test_loader, args)
```
6. Parameter Settings (parms_setting.py)
The model's parameters are defined in parms_setting.py:
```python
def settings():
    parser = argparse.ArgumentParser()

    # General parameters
    parser.add_argument('--seed', type=int, default=0,
                        help='Random seed. Default is 0.')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='Disables CUDA training.')
    parser.add_argument('--workers', type=int, default=0,
                        help='Number of parallel workers. Default is 0.')

    # Data path parameters
    parser.add_argument('--in_file', default="dataset1/LDA.edgelist",
                        help='Path to data fold. e.g., data/LDA.edgelist')
    parser.add_argument('--neg_sample', default="dataset1/no_LDA.edgelist",
                        help='Path to data fold. e.g., data/LDA.edgelist')
    parser.add_argument('--task_type', default="LDA", choices=['LDA', 'MDA', 'LMI'],
                        help='Initial prediction task type. Default is LDA.')

    # Training parameters
    parser.add_argument('--lr', type=float, default=5e-4,
                        help='Initial learning rate. Default is 5e-4.')
    parser.add_argument('--dropout', type=float, default=0.5,
                        help='Dropout rate. Default is 0.5.')
    parser.add_argument('--weight_decay', default=5e-4,
                        help='Weight decay (L2 loss on parameters). Default is 5e-4.')
    parser.add_argument('--batch', type=int, default=25,
                        help='Batch size. Default is 25.')
    parser.add_argument('--epochs', type=int, default=80,
                        help='Number of epochs to train. Default is 80.')
    parser.add_argument('--loss_ratio1', type=float, default=0.1,
                        help='Ratio of self_supervision. Default is 1 (LDA), 0.1 (MDA, LMI).')

    # Model parameters
    parser.add_argument('--dimensions', type=int, default=512,
                        help='Dimension of feature d. Default is 512 (LDA), 1024 (MDA and LMI).')
    parser.add_argument('--hidden1', default=256,
                        help='Embedding dimension of encoder layer 1 for SSCLMD. Default is d/2.')
    parser.add_argument('--hidden2', default=128,
                        help='Embedding dimension of encoder layer 2 for SSCLMD. Default is d/4.')
    parser.add_argument('--decoder1', default=512,
                        help='Embedding dimension of decoder layer 1 for SSCLMD. Default is 512.')

    args = parser.parse_args()
    return args
```
7. Similarity Computation (calculating_similarity.py)
This file computes similarities between nodes of each type and uses them to build the intra-type edges of the topology graph.
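The exact similarity measures live in the repository; as one hedged illustration, a Gaussian interaction profile (GIP) kernel similarity, a common choice in lncRNA/miRNA-disease prediction, can be computed from a binary interaction matrix as follows (the function and matrix sizes are illustrative, not the project's code):

```python
import numpy as np

def gip_kernel(interactions: np.ndarray) -> np.ndarray:
    """GIP kernel similarity between the rows of a binary interaction matrix.

    Illustrative stand-in; see calculating_similarity.py for the measures
    the project actually uses.
    """
    norms_sq = (interactions ** 2).sum(axis=1)
    gamma = 1.0 / norms_sq.mean()                        # bandwidth from the mean profile norm
    # Squared Euclidean distances between all row pairs
    d2 = norms_sq[:, None] + norms_sq[None, :] - 2 * interactions @ interactions.T
    return np.exp(-gamma * np.maximum(d2, 0))

A = np.random.randint(0, 2, size=(386, 316))             # hypothetical lncRNA-disease matrix
sim = gip_kernel(A)                                      # (386, 386) lncRNA similarity
```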
8. Data Preparation (data_preparation.py)
This file computes k-mer features for the lncRNA/miRNA sequences and builds the attribute-based KNN graph.
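As a sketch of the idea (the function and parameters are illustrative, not the repository's code): count normalized k-mer frequencies per sequence, then connect each node to its nearest neighbors in that feature space:

```python
import itertools
import numpy as np
from sklearn.neighbors import kneighbors_graph

def kmer_features(seq: str, k: int = 3) -> np.ndarray:
    """Normalized k-mer frequency vector over the RNA alphabet ACGU."""
    kmers = [''.join(p) for p in itertools.product('ACGU', repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    vec = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1
    return vec / max(vec.sum(), 1)

feats = np.vstack([kmer_features(s) for s in ['ACGUACGU', 'UGCAUGCA', 'AAACCCGGG']])
knn_adj = kneighbors_graph(feats, n_neighbors=2, mode='connectivity')  # sparse KNN adjacency
```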
9. Utility Functions (utils.py)
utils.py contains helper functions such as Laplacian normalization and row normalization.
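For reference, minimal versions of these two normalizations might look like this (a sketch, not the repository's exact implementations):

```python
import numpy as np

def row_normalize(mat: np.ndarray) -> np.ndarray:
    """Scale each row to sum to 1 (rows summing to 0 are left untouched)."""
    rowsum = mat.sum(axis=1, keepdims=True)
    rowsum[rowsum == 0] = 1
    return mat / rowsum

def laplacian_normalize(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as used for GCNs."""
    adj = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```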
10. Reproduction Steps
- Environment setup:
  - Install Python 3.7+
  - Install the required dependencies: numpy, torch, sklearn, torch-geometric
- Data preparation:
  - Unpack data/dataset1.rar and data/dataset2.rar
- Feature preprocessing:
  - Run data_preparation.py to generate the k-mer features and the attribute graph
  - Run calculating_similarity.py to compute the similarities and the intra-edges of the topology graph
- Model training and testing:
  - Run main.py to start training and testing
  - Adjust the parameters in parms_setting.py as needed
- Result evaluation:
  - Check the reported AUROC, AUPRC, and F1 scores
  - Optionally save the trained model for later use
11. Suggestions for Improving the Code
- Modularity: separate data loading, model definition, training, and testing more cleanly
- Parameter management: use a configuration file instead of hard-coded parameter values
- Logging: add more detailed logging to ease debugging and analysis
- Visualization: plot training curves such as the loss and the evaluation metrics
- Data parallelism: add parallel data processing for large datasets
- Model saving: periodically save model checkpoints
- Early stopping: stop training when validation performance stops improving, to avoid overfitting (see the sketch below)
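As a starting point for the last suggestion, a minimal early-stopping helper might look like this (names and thresholds are illustrative):

```python
class EarlyStopper:
    """Stop training when the validation metric hasn't improved for `patience` epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('-inf')
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Return True when training should stop. Call once per epoch, e.g. with val AUROC."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In the training loop, call stopper.step(...) with the validation AUROC after each epoch and break out of the loop when it returns True.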