Processing pig hypertension transcriptome (RNA-seq) data
Preface: this analysis was run on a different server than before, so some things may differ, including:
- the job submission command
- software versions
1 Preparation
1.1 Download the sequencing data returned by the company
1.2 Install software: trim-galore (0.6.10), star (2.7.11b), htseq (2.0.5)
1.3 Download the reference genome and annotation file (placed in the reference directory)
wget https://ftp.ensembl.org/pub/release-115/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.115.gtf.gz
wget https://ftp.ensembl.org/pub/release-115/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
gzip -d Sus_scrofa.Sscrofa11.1.115.gtf.gz
gzip -d Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
2 Upstream analysis
2.1 Read cleaning with trim-galore
# Current directory: /share/home/Pengyu/RNA_seq_analysis/ZXL
ls 00_rawdata | grep "R1" > gz1
ls 00_rawdata | grep "R2" > gz2
paste gz1 gz2 > config_file
Write the script directly on Linux to avoid errors (previously I wrote scripts on Windows and uploaded them to the server, which failed because of Windows line-ending characters; see the post "Fixing shell scripts written on Windows that won't run on Linux").
The script content is as follows:
cat > run_clean.sh << 'EOF'
cat config_file | while read id
do
    sample_dir="/share/home/Pengyu/RNA_seq_analysis/ZXL/00_rawdata"
    output_dir="/share/home/Pengyu/RNA_seq_analysis/ZXL/01_cleandata"
    arr=($id)
    fq1=${arr[0]}
    fq2=${arr[1]}
    sample_dir1="$sample_dir/$fq1"
    sample_dir2="$sample_dir/$fq2"
    trim_galore -q 20 --max_n 1 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o "$output_dir" "$sample_dir1" "$sample_dir2"
done
EOF
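Before running run_clean.sh, it is worth checking that each line of config_file really pairs an R1 file with its matching R2 file. A minimal sketch of that check in Python (the file names below are hypothetical examples following the R1/R2 naming pattern above):

```python
def check_pairs(lines):
    """Return the (R1, R2) pairs whose names don't match after swapping
    the _R1/_R2 tag; an empty list means config_file is consistent."""
    bad = []
    for line in lines:
        fq1, fq2 = line.split()
        if fq1.replace("_R1", "_R2") != fq2:
            bad.append((fq1, fq2))
    return bad

# Hypothetical config_file content; the second line is deliberately mispaired
config = ["XJ1_R1.fastq.gz\tXJ1_R2.fastq.gz",
          "XJ2_R1.fastq.gz\tXJ9_R2.fastq.gz"]
print(check_pairs(config))  # → [('XJ2_R1.fastq.gz', 'XJ9_R2.fastq.gz')]
```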
#Run directory
#/share/home/Pengyu/RNA_seq_analysis/ZXL
2.2 Build the STAR index
Note: check the read length of the sequencing reads first (the second line of a FASTQ record is the sequence):
zcat XJ9_R2_val_2.fq.gz | head -n 2 | tail -n 1 | tr -d '\n' | wc -c
#151
The reads are 151 bp, so the sjdbOverhang parameter should be 150 (read length minus 1).
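The read-length check above can also be expressed as a tiny helper: take the second line of the first FASTQ record and subtract 1 for sjdbOverhang. A sketch (operating on FASTQ text lines, so it works the same on a zcat stream):

```python
def read_length(fastq_lines):
    """Read length from FASTQ text lines: line 2 of a record is the sequence."""
    lines = iter(fastq_lines)
    next(lines)                       # @header line
    return len(next(lines).strip())   # sequence line

# A fake 151 bp record; sjdbOverhang is read length - 1
record = ["@r1", "A" * 151, "+", "I" * 151]
print(read_length(record) - 1)  # → 150
```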
The script content is as follows:
cat > run_index_star.sh << 'END'
#!/bin/bash
fa_dir="/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa"
gtf_dir="/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/Sus_scrofa.Sscrofa11.1.115.gtf"

STAR --runMode genomeGenerate \
    --runThreadN 6 \
    --genomeDir "/share/home/Pengyu/RNA_seq_analysis/reference/index/star/sus_scrofa" \
    --genomeFastaFiles "$fa_dir" \
    --sjdbGTFfile "$gtf_dir" \
    --sjdbOverhang 150
END
#Make the script executable
chmod 550 run_index_star.sh
Running it hit another problem:
STAR: /usr/lib64/libstdc++.so.6: version 'GLIBCXX_3.4.29' not found (required by STAR)
Following the post "Fixing libstdc++.so.6: version `GLIBCXX_3.4.29' not found", I created a new conda environment, which solved the problem ✅
The job submission command:
dsub -n "make_index" -N 4 -R "cpu=4;mem=8192" -oo "/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/run_index_out_%J.txt" -eo "/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/run_index_err_%J.txt" --script run_index_star.sh
#Run directory
#/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa
2.3 Alignment
vim run_mapping.sh
#Script content:
#!/bin/bash
GENOME_INDEX="/share/home/Pengyu/RNA_seq_analysis/reference/index/star/sus_scrofa"
OUTPUT_DIR="/share/home/Pengyu/RNA_seq_analysis/ZXL/02_mapping"
THREADS=20

for R1_FILE in /share/home/Pengyu/RNA_seq_analysis/ZXL/01_cleandata/*_R1_val_1.fq.gz; do
    # Strip the '_R1_val_1.fq.gz' suffix to get the sample name
    SAMPLE_NAME=$(basename "$R1_FILE" _R1_val_1.fq.gz)
    # Build the R2 file name following the trim-galore naming scheme
    # (relative path: the script is run from the 01_cleandata directory)
    R2_FILE="${SAMPLE_NAME}_R2_val_2.fq.gz"
    # Make sure the R2 file exists
    if [[ ! -f "$R2_FILE" ]]; then
        echo "Missing R2 file for $R1_FILE"
        continue
    fi
    # The trailing ':' becomes part of the output file names
    # (e.g. XJ6:Aligned.sortedByCoord.out.bam)
    OUT_PREFIX="$OUTPUT_DIR/$SAMPLE_NAME:"
    echo "Processing $SAMPLE_NAME"
    # The inputs are gzip-compressed, so --readFilesCommand zcat is required
    STAR --runThreadN $THREADS \
        --genomeDir "$GENOME_INDEX" \
        --readFilesIn "$R1_FILE" "$R2_FILE" \
        --readFilesCommand zcat \
        --outFileNamePrefix "$OUT_PREFIX" \
        --outSAMtype BAM SortedByCoordinate \
        --sjdbOverhang 150
    # sjdbOverhang must be the same value used when building the index
    if [ $? -eq 0 ]; then
        echo "Completed: $SAMPLE_NAME"
    else
        echo "Failed: $SAMPLE_NAME"
    fi
done
chmod 550 run_mapping.sh
The job submission command:
dsub -n "run_mapping" -N 4 -R "cpu=4;mem=8192" -oo "/share/home/Pengyu/RNA_seq_analysis/ZXL/02_mapping/run_mapping_out_%J.txt" -eo "/share/home/Pengyu/RNA_seq_analysis/ZXL/02_mapping/run_mapping_err_%J.txt" --script run_mapping.sh
#Run directory
#/share/home/Pengyu/RNA_seq_analysis/ZXL/01_cleandata
2.4 Quantification
- Determine whether the library is strand-specific
gtfToGenePred -genePredExt -geneNameAsName2 /share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/Sus_scrofa.Sscrofa11.1.115.gtf gene.tmp
#Extract the needed fields from the GenePred file to build a BED12 file
awk '{print $2"\t"$4"\t"$5"\t"$1"\t0\t"$3"\t"$6"\t"$7"\t0\t"$8"\t"$9"\t"$10}' gene.tmp > genes_refseq.bed12
#With the BED12 file, infer_experiment.py can test whether the data are strand-specific
#Run the following command from the directory containing the BED12 file
infer_experiment.py -r genes_refseq.bed12 -i /share/home/Pengyu/RNA_seq_analysis/ZXL/02_mapping/XJ6:Aligned.sortedByCoord.out.bam
#Current directory
#/share/home/Pengyu/RNA_seq_analysis/ZXL/03_RseQC
The library is not strand-specific (unstranded).
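infer_experiment.py reports the fraction of read pairs explained by each strand configuration; a library is usually called stranded only when one configuration clearly dominates. A minimal sketch of that decision mapped onto htseq-count's -s option (the 0.8 cutoff is my assumed rule of thumb, not an RSeQC default):

```python
def htseq_strand(frac_forward, frac_reverse, threshold=0.8):
    """Map infer_experiment.py fractions to htseq-count's -s value.
    frac_forward: fraction of reads matching the forward-stranded layout
    frac_reverse: fraction matching the reverse-stranded layout
    threshold: assumed dominance cutoff (0.8 here is a rule of thumb)."""
    if frac_forward >= threshold:
        return "yes"       # forward-stranded library
    if frac_reverse >= threshold:
        return "reverse"   # reverse-stranded library (e.g. dUTP protocols)
    return "no"            # roughly 50/50 -> unstranded

print(htseq_strand(0.49, 0.50))  # → no (unstranded, as for this dataset)
```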
- Quantify using protein-coding genes only
grep 'protein_coding' Sus_scrofa.Sscrofa11.1.115.gtf > Sus_scrofa.Sscrofa11.1.115.PCGs.gtf
awk -F'"' '{print $2}' Sus_scrofa.Sscrofa11.1.115.PCGs.gtf | uniq > sus_pcg_ensemid_unique.txt
#Current directory
#/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa
The quantification script (run_count.sh) is as follows:
#!/bin/bash
GENE_ANNOTATION="/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/Sus_scrofa.Sscrofa11.1.115.PCGs.gtf"  # annotation path; note: protein-coding genes only
COUNT_OUTPUT_DIR="/share/home/Pengyu/RNA_seq_analysis/ZXL/04_count"
HTSEQ_STRAND="no"  # "no", "yes" or "reverse", according to your library strandedness

# Loop over the coordinate-sorted BAM files produced by STAR
for BAM_FILE in /share/home/Pengyu/RNA_seq_analysis/ZXL/02_mapping/*Aligned.sortedByCoord.out.bam; do
    # Get the sample name (the BAMs are named like XJ6:Aligned.sortedByCoord.out.bam,
    # so the '_Aligned...' suffix never matches and SAMPLE_NAME keeps the full file name)
    SAMPLE_NAME=$(basename "$BAM_FILE" _Aligned.sortedByCoord.out.bam)
    # Full path of the output count file
    COUNT_FILE="$COUNT_OUTPUT_DIR/${SAMPLE_NAME}_htseq.counts"
    echo "Quantifying gene expression for $SAMPLE_NAME using htseq-count..."
    # Run htseq-count
    htseq-count -f bam -r pos -s "$HTSEQ_STRAND" "$BAM_FILE" "$GENE_ANNOTATION" > "$COUNT_FILE"
    if [ $? -eq 0 ]; then
        echo "Quantification completed: $SAMPLE_NAME"
    else
        echo "Quantification failed: $SAMPLE_NAME"
    fi
done
#Submit the script
dsub -n "run_count" -N 4 -R "cpu=4;mem=8192" -oo "/share/home/Pengyu/RNA_seq_analysis/ZXL/04_count/run_count_out_%J.txt" -eo "/share/home/Pengyu/RNA_seq_analysis/ZXL/04_count/run_count_err_%J.txt" --script run_count.sh
#Current directory
#/share/home/Pengyu/RNA_seq_analysis/ZXL/02_mapping
- Merge the count files
head -n 4 *counts
import pandas as pd
import os

# Collect all .counts files in the target directory
directory = '/share/home/Pengyu/RNA_seq_analysis/ZXL/04_count'
counts_files = [f for f in os.listdir(directory) if f.endswith('.counts')]
file_base_names = [os.path.splitext(f)[0] for f in counts_files]  # column names derived from file names

# Initialise an empty DataFrame to hold the merged result
df_all = pd.DataFrame()

# Iterate over each .counts file
for file_name, base_name in zip(counts_files, file_base_names):
    file_path = os.path.join(directory, file_name)
    # Each file has two columns: gene ID and count
    df_temp = pd.read_csv(file_path, sep='\t', header=None, names=['GeneID', 'Count'])
    # Add this file's counts as a new column, indexed by gene ID
    df_all[base_name] = df_temp.set_index('GeneID')['Count']

# Genes absent from some files get a count of 0
df_all.fillna(0, inplace=True)
# Reset the index so the gene ID becomes a regular column
df_all.reset_index(inplace=True)
# Write the merged table to CSV
output_file = 'merged_counts.csv'
df_all.to_csv(output_file, index=False)
print(f"Merged counts saved to {output_file}")
##Current directory
##/share/home/Pengyu/RNA_seq_analysis/ZXL/04_count
Save the script above as run_merge_counts.py and run it.
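One thing to watch: htseq-count appends summary rows (__no_feature, __ambiguous, ...) to each output file, and the merge script above keeps them. A sketch of stripping those rows from the merged table before differential analysis (the column values here are illustrative):

```python
import pandas as pd

def drop_htseq_summary(df, gene_col="GeneID"):
    """Remove htseq-count's special summary rows, which all start with '__'
    (e.g. __no_feature, __ambiguous, __not_aligned)."""
    return df[~df[gene_col].str.startswith("__")].reset_index(drop=True)

# Illustrative merged table with two htseq summary rows mixed in
merged = pd.DataFrame({
    "GeneID": ["ENSSSCG00000000002", "__no_feature", "__ambiguous"],
    "XJ1": [591, 120345, 8312],
})
print(drop_htseq_summary(merged))
```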
3 Downstream analysis
3.1 Calculate the non-redundant exon length of each gene
#!/usr/bin/perl
use warnings;
use strict;

my %hash;
open IN, "<", "/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/Sus_scrofa.Sscrofa11.1.115.PCGs.gtf" or die "Cannot open file /share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/Sus_scrofa.Sscrofa11.1.115.PCGs.gtf: $!";
open OUT, ">", "Sus_scrofa.Sscrofa11.1.115.PCGs.gtf.Efflens" or die "Can't open file Sus_scrofa.Sscrofa11.1.115.PCGs.gtf.Efflens: $!";
while (<IN>) {
    chomp;
    next if /^#/;
    my @a = split /\t/, $_;
    if ($a[2] eq 'exon') {
        my $g = (split /;/, $a[8])[0];
        $g =~ s/gene_id "//;
        $g =~ s/"//;
        # Mark every base covered by any exon of this gene
        foreach my $n ($a[3] .. $a[4]) {
            $hash{$g}{$n} = 1;
        }
    }
}
close IN;

# The non-redundant exon length is the number of distinct covered positions
foreach my $g (sort keys %hash) {
    my $len = scalar keys %{$hash{$g}};
    print OUT "$g\t$len\n";
}
close OUT;
#Current directory
#/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa
#Save the script above as 1_Calculate_Efflength.pl
Run it with: perl 1_Calculate_Efflength.pl
head -n 10 Sus_scrofa.Sscrofa11.1.115.PCGs.gtf.Efflens
#The results:
ENSSSCG00000000002 2393
ENSSSCG00000000003 2215
ENSSSCG00000000005 2612
ENSSSCG00000000006 3680
ENSSSCG00000000007 1628
ENSSSCG00000000010 3782
ENSSSCG00000000014 2436
ENSSSCG00000000018 7034
ENSSSCG00000000019 5380
ENSSSCG00000000020 2691
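The Perl script above marks every exonic base in a hash, which is simple but memory-hungry. The same non-redundant exon length can be obtained by merging the exon intervals of each gene; a sketch (intervals are 1-based and inclusive, as in GTF):

```python
def nonredundant_length(intervals):
    """Union length of 1-based inclusive (start, end) exon intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the current merged block
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

# Two overlapping exons (100-200, 150-250) plus one separate exon (300-320)
print(nonredundant_length([(100, 200), (150, 250), (300, 320)]))  # → 172
```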
3.2 Calculate TPM expression values
#!/usr/bin/perl
use warnings;
use strict;

# Read the gene length information
my %LEN;
{
    open my $IN, "<", "/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/Sus_scrofa.Sscrofa11.1.115.PCGs.gtf.Efflens" or die "Cannot open gene length file: $!";
    while (<$IN>) {
        chomp;
        my @a = split;
        $LEN{$a[0]} = $a[1];
    }
    close $IN;
}

# Process every counts file given on the command line
foreach my $filter (@ARGV) {
    open my $OUT, ">", "$filter.TPM" or die "Cannot open output file for $filter: $!";
    my $cousum = 0;
    # First pass: total of length-normalised counts (reads per kilobase)
    {
        open my $IN, "<", "$filter" or die "Cannot open input file $filter: $!";
        while (<$IN>) {
            chomp;
            my @a = split;
            next unless exists $LEN{$a[0]};
            my $r1 = $a[1] / ($LEN{$a[0]} / 1000);
            $cousum += $r1;
        }
        close $IN;
    }
    # Second pass: compute and print the TPM of each gene
    {
        open my $IN, "<", "$filter" or die "Cannot reopen input file $filter: $!";
        while (<$IN>) {
            chomp;
            my @a = split;
            next unless exists $LEN{$a[0]};
            my $r1 = $a[1] / ($LEN{$a[0]} / 1000);
            my $tpm = keep2pos(1000000 * ($r1 / $cousum));
            print $OUT "$_\t$tpm\n";
        }
        close $IN;
    }
}

# Truncate to two decimal places; values below 0.01 are set to 0
sub keep2pos {
    my $inv = shift;
    return 0 if $inv < 0.01;
    my @q = split /\./, $inv;
    my $frac = defined $q[1] ? substr($q[1], 0, 2) : 0;
    return $q[0] + 0.01 * $frac;
}
#Current directory
#/share/home/Pengyu/RNA_seq_analysis/ZXL/04_count
#Save the script above as 2_Calculate_TPM.pl
Run it with: perl 2_Calculate_TPM.pl *.counts
head -n 10 *.TPM
#The results:
==> XJ10:Aligned.sortedByCoord.out.bam_htseq.counts.TPM <==
ENSSSCG00000000002 591 18.72
ENSSSCG00000000003 204 6.98
ENSSSCG00000000005 1289 37.41
ENSSSCG00000000006 1418 29.21
ENSSSCG00000000007 252 11.73
ENSSSCG00000000010 1549 31.04
ENSSSCG00000000014 545 16.96
ENSSSCG00000000018 1275 13.74
ENSSSCG00000000019 1163 16.38
ENSSSCG00000000020 11 0.3
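The TPM formula used by the Perl script is: reads per kilobase r_g = count_g / (len_g / 1000), then TPM_g = 1e6 * r_g / sum(r). The same computation in Python, including the script's two-decimal truncation where values below 0.01 become 0 (truncation done here with floor, which matches the Perl substr approach for positive values):

```python
import math

def tpm(counts, lengths):
    """counts: {gene: raw count}; lengths: {gene: non-redundant exon length, bp}.
    Returns {gene: TPM, truncated to 2 decimals, 0 if below 0.01}."""
    # Reads per kilobase, skipping genes with no known length
    rpk = {g: counts[g] / (lengths[g] / 1000) for g in counts if g in lengths}
    total = sum(rpk.values())

    def keep2pos(v):
        # Truncate (not round) to two decimal places
        return 0 if v < 0.01 else math.floor(v * 100) / 100

    return {g: keep2pos(1e6 * r / total) for g, r in rpk.items()}

# Equal reads-per-kilobase -> equal TPM summing to 1e6
print(tpm({"g1": 100, "g2": 300}, {"g1": 1000, "g2": 3000}))  # → {'g1': 500000.0, 'g2': 500000.0}
```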
3.3 Extract the transcription start sites (TSS) and gene symbols of protein-coding genes from the GTF file
#!/usr/bin/perl
use strict;

my %CHR;
foreach my $c (1 .. 18) {
    $CHR{$c} = 1;
}

open IN, "<", "/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa/Sus_scrofa.Sscrofa11.1.115.PCGs.gtf" or die;
open OUT, ">", "Sus_scrofa.Sscrofa11.1.115.PCGs.gtf.TSS" or die "Can't open file Sus_scrofa.Sscrofa11.1.115.PCGs.gtf.TSS: $!";
while (<IN>) {
    chomp;
    next if /^#/;
    my @a = split /\t/, $_;
    #next unless (exists $CHR{$a[0]});   # uncomment to keep autosomes 1-18 only
    if ($a[2] eq 'gene') {
        if (/protein_coding/) {
            my $gid = '';
            my $gsm = 'Unknown';
            my @b = split /;/, $a[8];
            foreach my $f (@b) {
                $gid = $1 if ($f =~ /gene_id \"(\S+)\"/);
                $gsm = $1 if ($f =~ /gene_name \"(\S+)\"/);
            }
            my $ginfo = "$gid\_$gsm";
            # TSS is the start coordinate on the '+' strand, the end coordinate on '-'
            my $stp = ($a[6] eq '+') ? $a[3] : $a[4];
            my $enp = $stp + 1;
            print OUT "$a[0]\t$stp\t$enp\t$ginfo\n";
        }
    }
}
close IN;
close OUT;
#Current directory
#/share/home/Pengyu/RNA_seq_analysis/reference/Sus_scrofa
#Save the script above as 3_Pick_TSS_symbolID_pig.pl
Run it with: perl 3_Pick_TSS_symbolID_pig.pl to obtain the Sus_scrofa.Sscrofa11.1.115.PCGs.gtf.TSS file
3.4 Differential expression analysis
- First look at a PCA to decide which samples to use for the differential analysis
counts <- read.csv("merged_counts.csv", header=TRUE, check.names=FALSE, row.names=1)
group <- read.csv("group.csv", header=TRUE, check.names=FALSE)
data <- scale(counts)
data <- t(scale(t(data), center=TRUE, scale=F))
### Principal component analysis ###
pca <- prcomp(t(data), center=TRUE, scale=FALSE)
### Percentage of variance explained ###
explained_variance <- round(pca$sdev^2 / sum(pca$sdev^2) * 100, 1)
### Extract the PC1 and PC2 scores ###
pc1_scores <- pca$x[, 1]
pc2_scores <- pca$x[, 2]
pca_mat <- data.frame(PC1=pc1_scores, PC2=pc2_scores, sample=colnames(data), group=c(rep("NC",6), rep("HB",6)))
ellipse_colors <- c("#780522", "#f1a5a9")
group_colors <- c("#780522", "#f1a5a9")
library(ggplot2)
library(ggrepel)
p1 <- ggplot(pca_mat, aes(x = PC1, y = PC2, colour = group)) +
  geom_point(aes(colour = group), size = 1, alpha = 0.8) +
  geom_text_repel(aes(label = sample, color = group), size = 3, show.legend = FALSE, max.overlaps = 20) +
  labs(x = paste0('Component 1: ', explained_variance[1], "%"),
       y = paste0('Component 2: ', explained_variance[2], "%")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.title.x = element_text(size = 8),
        axis.title.y = element_text(size = 8),
        axis.text.x = element_text(size = 8),
        axis.text.y = element_text(size = 8)) +
  stat_ellipse(geom = "path",        # path mode (draw the outline only)
               aes(group = group),   # one ellipse per group
               level = 0.90,         # confidence level
               color = "#6d6f6e",    # outline colour
               linetype = 2,
               linewidth = 0.4,      # outline width
               show.legend = FALSE) +
  scale_colour_manual(values = group_colors) +
  coord_cartesian(xlim = c(-30, 30), ylim = c(-30, 30)) +
  theme(# Remove backgrounds and grid lines
        panel.background = element_blank(),  # remove the grey panel background
        plot.background = element_blank(),   # remove the plot background (use element_rect(fill = "white") if a PDF export needs one)
        panel.grid.major = element_blank(),  # remove major grid lines
        panel.grid.minor = element_blank(),  # remove minor grid lines
        axis.line = element_line(color = "black"),  # keep the axis lines
        axis.title = element_text(size = 8),
        axis.text = element_text(size = 8, color = "black"),
        legend.position = "right")
No samples need to be removed.
- Differential analysis
library(DESeq2)
counts <- read.csv("merged_counts.csv", header=TRUE, check.names=FALSE, row.names=1)
group <- read.csv("group.csv", header=TRUE, check.names=FALSE)
colDate <- data.frame(
  row.names = c("XJ1","XJ2","XJ4","XJ5","XJ6","XJ9","XJ10","XJ11","XJ12","XJ14","XJ18","XJ20"),
  condition = factor(c("NC","NC","NC","NC","NC","NC","HB","HB","HB","HB","HB","HB"),
                     levels = c("NC","HB"))  # levels belong inside factor(); NC is the reference level
)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = colDate, design = ~condition)
dds <- DESeq(dds)
sizeFactors(dds)
res <- results(dds, contrast = c("condition","HB","NC"), cooksCutoff = FALSE)
res
class(res)
res <- as.data.frame(res)
head(res)
library(dplyr)
res %>% mutate(Sig = case_when(
  log2FoldChange >= 0.58 & padj <= 0.05 ~ "UP",
  log2FoldChange <= -0.58 & padj <= 0.05 ~ "DOWN",
  TRUE ~ "Not_change")) -> res_1
table(res_1$Sig)
res_1 <- cbind(rownames(res_1), res_1)
head(res_1)
colnames(res_1) <- c('gene',"baseMean","log2FoldChange","lfcSE","stat","pvalue","padj","Sig")
write.table(res_1, "pig_hypertensin_count_DEseq.csv", sep = ',', col.names = T, row.names = F, quote = FALSE)
This yields the differential expression results.
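The UP/DOWN calls above require both |log2FoldChange| ≥ 0.58 (about a 1.5-fold change, since 2^0.58 ≈ 1.49) and padj ≤ 0.05. A sketch of the same classification rule as a small Python function (NA adjusted p-values, represented here as None, are never significant, matching case_when's NA handling):

```python
def classify(log2fc, padj):
    """Replicate the dplyr case_when above: a gene is UP or DOWN only if it
    passes both the effect-size and adjusted-p cutoffs."""
    if padj is not None and padj <= 0.05:
        if log2fc >= 0.58:
            return "UP"
        if log2fc <= -0.58:
            return "DOWN"
    return "Not_change"

print([classify(1.2, 0.001), classify(-0.7, 0.04),
       classify(2.0, 0.3), classify(0.9, None)])
# → ['UP', 'DOWN', 'Not_change', 'Not_change']
```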
3.5 Integrate the results
Move all of the files produced above into one folder, 05_DEG
- Prepare the sample.list file
Sample Group
XJ10 HB
XJ11 HB
XJ12 HB
XJ14 HB
XJ18 HB
XJ20 HB
XJ1 NC
XJ2 NC
XJ4 NC
XJ5 NC
XJ6 NC
XJ9 NC
- Prepare the raw count file, named hypertensin__combined.counts.Normalized.ints
GeneID XJ10 XJ11 XJ12 XJ14 XJ18 XJ20 XJ1 XJ2 XJ4 XJ5 XJ6 XJ9
ENSSSCG00000000002 591 1260 602 485 543 549 529 541 1541 650 601 799
ENSSSCG00000000003 204 402 198 263 292 170 245 237 319 252 247 354
ENSSSCG00000000005 1289 1478 1355 822 1411 1274 1095 1211 1341 1397 1216 1072
- Prepare the differential analysis result file, named hypertension_combined.counts.Normalized.ints.DEseq2.txt.cut
gene log2FoldChange pvalue padj
ENSSSCG00000000002 -0.228626867 0.384739061 0.875126382
ENSSSCG00000000003 -0.090610052 0.667286359 0.947832158
ENSSSCG00000000005 0.102807751 0.35433116 0.860959814
ENSSSCG00000000006 0.196009426 0.295940924 0.831609655
- Run the following script to obtain the final, tidy table
#!/usr/bin/perl
use strict;

my %MTPM;
open IN, "<", "sample.list" or die;
open OUT, ">", "hypertension_combined.counts.Normalized.ints.DEseq2.txt.cut.xls" or die "Can't open file hypertension_combined.counts.Normalized.ints.DEseq2.txt.cut.xls: $!";
print OUT "GeneID\tSymbol\tLog2FC(HB/NC)\tFC(HB/NC)\tPvalue\tFDR\tHB_mean_TPM\tNC_mean_TPM\tHB_mean_Counts\tNC_mean_Counts\n";

# Sum the TPM of each gene per group
while (<IN>) {
    chomp;
    next if /^Sample/;
    my @a = split;
    open INN, "<", "/share/home/Pengyu/RNA_seq_analysis/ZXL/04_count/$a[0]:Aligned.sortedByCoord.out.bam_htseq.counts.TPM" or die;
    while (<INN>) {
        chomp;
        my @b = split;
        $MTPM{$a[1]}{$b[0]} += $b[2];
    }
    close INN;
}
close IN;

# Sum the raw counts of each gene per group
my %MCOUNT;
open IN, "<", "hypertensin__combined.counts.Normalized.ints" or die;
while (<IN>) {
    chomp;
    next if /^GeneID/;
    my @a = split;
    # Adjust the column ranges to match the biological replicates of your own data
    my $v1 = $a[1] + $a[2] + $a[3] + $a[4] + $a[5] + $a[6];     # HB samples
    my $v2 = $a[7] + $a[8] + $a[9] + $a[10] + $a[11] + $a[12];  # NC samples
    $MCOUNT{'HB'}{$a[0]} = $v1;
    $MCOUNT{'NC'}{$a[0]} = $v2;
}
close IN;

# Map gene IDs to symbols using the TSS file
my %MAP;
open IN, "<", "/share/home/Pengyu/RNA_seq_analysis/ZXL/05_DEG/Sus_scrofa.Sscrofa11.1.115.PCGs.gtf.TSS" or die;
while (<IN>) {
    chomp;
    my @a = split;
    my @b = split /_/, $a[3];
    $MAP{$b[0]} = $b[1];
}
close IN;

# Combine everything with the DESeq2 results
open IN, "<", "hypertension_combined.counts.Normalized.ints.DEseq2.txt.cut" or die;
while (<IN>) {
    chomp;
    next if /^gene/;
    my @a = split;
    my $fc = 2 ** $a[1];                # fold change from log2 fold change
    my $Dtpm = $MTPM{'HB'}{$a[0]} / 6;  # group means over 6 replicates each
    my $Wtpm = $MTPM{'NC'}{$a[0]} / 6;
    my $Dcou = $MCOUNT{'HB'}{$a[0]} / 6;
    my $Wcou = $MCOUNT{'NC'}{$a[0]} / 6;
    print OUT "$a[0]\t$MAP{$a[0]}\t$a[1]\t$fc\t$a[2]\t$a[3]\t$Dtpm\t$Wtpm\t$Dcou\t$Wcou\n";
}
close IN;
close OUT;
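The per-gene arithmetic in the final table follows directly from the inputs: FC = 2^log2FC, and each group mean is the group sum divided by the number of replicates (6 per group here). A minimal sketch of that step:

```python
def summarize_gene(log2fc, group_sums, n_reps=6):
    """group_sums: {'HB': summed value, 'NC': summed value} (TPM or counts).
    Returns (fold_change, HB_mean, NC_mean), as in the final .xls columns."""
    fc = 2 ** log2fc                       # back-transform log2 fold change
    return fc, group_sums["HB"] / n_reps, group_sums["NC"] / n_reps

# log2FC of 1.0 means a 2-fold change; sums of 120 and 60 over 6 replicates
print(summarize_gene(1.0, {"HB": 120.0, "NC": 60.0}))  # → (2.0, 20.0, 10.0)
```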