A complete RNA-seq analysis pipeline: HISAT2 + StringTie + Ballgown, plus a comparison of StringTie and Kallisto (including how to obtain FPKM, TPM, and read counts)...

Posted by weixin_33670713 on 2018-11-12

Original URL: https://blog.csdn.net/weixin_33670713/article/details/87119790

Most important reference links

https://pmbio.org/course/#module-06-rnaseq
GitHub link: https://github.com/griffithlab/rnaseq_tutorial
Because these sites may not be reachable from mainland China, I have saved the key code here.

Directory screenshot

image.png

Obtaining FPKM, TPM, and read counts

cd /workspace/rnaseq/ref-only-expression
wget https://raw.githubusercontent.com/griffithlab/rnaseq_tutorial/master/scripts/stringtie_expression_matrix.pl
chmod +x stringtie_expression_matrix.pl

./stringtie_expression_matrix.pl --expression_metric=TPM --result_dirs='RNAseq_Norm_Lane1,RNAseq_Norm_Lane2,RNAseq_Tumor_Lane1,RNAseq_Tumor_Lane2' --transcript_matrix_file=transcript_tpm_all_samples.tsv --gene_matrix_file=gene_tpm_all_samples.tsv
./stringtie_expression_matrix.pl --expression_metric=FPKM --result_dirs='RNAseq_Norm_Lane1,RNAseq_Norm_Lane2,RNAseq_Tumor_Lane1,RNAseq_Tumor_Lane2' --transcript_matrix_file=transcript_fpkm_all_samples.tsv --gene_matrix_file=gene_fpkm_all_samples.tsv
./stringtie_expression_matrix.pl --expression_metric=Coverage --result_dirs='RNAseq_Norm_Lane1,RNAseq_Norm_Lane2,RNAseq_Tumor_Lane1,RNAseq_Tumor_Lane2' --transcript_matrix_file=transcript_coverage_all_samples.tsv --gene_matrix_file=gene_coverage_all_samples.tsv

head transcript_tpm_all_samples.tsv gene_tpm_all_samples.tsv
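
The TPM and FPKM matrices produced above are two scalings of the same abundance estimates. Within a single sample, TPM is the FPKM value rescaled so that all values sum to one million. A minimal Python sketch of that relationship follows; the function name and the numbers are hypothetical, chosen only for illustration, and are not part of the tutorial scripts.

```python
def fpkm_to_tpm(fpkm_values):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm_values)
    if total == 0:
        # Nothing is expressed in this sample, so every TPM is zero
        return [0.0 for _ in fpkm_values]
    return [value * 1e6 / total for value in fpkm_values]


# Hypothetical FPKM values for three transcripts in one sample
sample_fpkm = [10.0, 30.0, 60.0]
sample_tpm = fpkm_to_tpm(sample_fpkm)
print(sample_tpm)
```

Because every sample's TPM values sum to the same total, TPM is usually the easier of the two to compare across samples.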

Ballgown differential expression analysis code

# change working directory
cd /workspace/rnaseq/

# start R
R

# In R, run the following commands
library(ballgown)
library(genefilter)
library(dplyr)
library(devtools)

# create the phenotype data for each sample
#pheno_data = read.csv("RNA_data.csv")
pheno_data <- data.frame(ids=c("RNAseq_Norm", "RNAseq_Norm_Lane1", "RNAseq_Norm_Lane2", "RNAseq_Tumor", "RNAseq_Tumor_Lane1", "RNAseq_Tumor_Lane2"), type=c("normal", "normal", "normal", "tumor", "tumor", "tumor"))

# Load ballgown data structures for each sample
bg = ballgown(dataDir = "ballgown", samplePattern = "RNAseq", pData=pheno_data)

# Filter out low-variance transcripts (keep those whose expression variance across samples is > 1)
bg_filt = subset(bg, "rowVars(texpr(bg)) > 1", genomesubset=TRUE)

# Identify significantly differentially expressed transcripts
results_transcripts = stattest(bg_filt, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")

# Identify significantly differentially expressed genes
results_genes = stattest(bg_filt, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")

# Add transcript/gene names and transcript/gene IDs to the results_transcripts data frame
results_transcripts = data.frame(transcriptNames=ballgown::transcriptNames(bg_filt),transcriptIDs=ballgown::transcriptIDs(bg_filt),geneNames=ballgown::geneNames(bg_filt),geneIDs=ballgown::geneIDs(bg_filt),results_transcripts)

# Add common gene names on
tmp <- unique(results_transcripts[,c("geneNames", "geneIDs")])
results_genes <- merge(results_genes, tmp, by.x=c("id"), by.y=c("geneIDs"), all.x=TRUE)

# Sort from the smallest P value to largest
results_transcripts = arrange(results_transcripts,pval)
results_genes = arrange(results_genes,pval)

# Output as CSV
write.csv(results_transcripts,"RNAseq_transcript_results.csv",row.names=FALSE)
write.csv(results_genes,"RNAseq_genes_results.csv",row.names=FALSE)

# Output as TSV
write.table(results_transcripts,"RNAseq_transcript_results.tsv",sep="\t", quote=FALSE, row.names=FALSE)
write.table(results_genes,"RNAseq_gene_results.tsv",sep="\t", quote=FALSE, row.names=FALSE)

# Show transcripts and genes with p value < 0.05
subset(results_transcripts,results_transcripts$pval<0.05)
subset(results_genes,results_genes$pval<0.05)

# quit R
q()
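
The last step above filters on the raw p value, but the results also carry a qval column (it is used as the color scale in the plot further down). When thousands of genes are tested at once, a raw cutoff of 0.05 lets through many false positives, so a multiple-testing adjustment is usually applied. The Python sketch below implements the standard Benjamini-Hochberg adjustment to show the effect. It is an illustration with made-up p values, and it assumes, rather than confirms, that the qval column is computed in a comparable way.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)

raw_hits = [i for i, p in enumerate(pvals) if p < 0.05]
significant = [i for i, q in enumerate(qvals) if q < 0.05]
print(raw_hits, significant)
```

In this toy input, three genes pass the raw cutoff but fewer survive the adjustment, which is why filtering on the adjusted value is the more conservative choice.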

Visualizing the differential expression results

# start R
R

# set the working directory
setwd("~/workspace/rnaseq")

# load libraries
library(ggplot2)
library(viridis)

# load in the ballgown DE data
de_genes <- read.delim("RNAseq_gene_results.tsv")

# load in the FPKM values from stringtie
expr_tumor <- read.delim("~/workspace/rnaseq/ballgown/RNAseq_Tumor_gene_abundance.out")

# merge the expression and DE results
merged_results <- merge(de_genes, expr_tumor, by.x=c("id"), by.y=c("Gene.ID"), all.x=TRUE)

# log2 the fold change
merged_results$log2_fc <- log2(as.numeric(merged_results$fc))

# keep only entries with an FPKM greater than 1
merged_results <- merged_results[merged_results$FPKM > 1,]

# create an MA-style plot for genes (log2 fold change against tumor FPKM)
pdf(file="ma_plot.pdf", height=5, width=10)
ggplot(data=merged_results) + geom_point(aes(y=log2_fc, x=FPKM, color=qval)) + ylim(c(-10, 10)) + xlim(c(0, 1000)) + scale_colour_viridis(direction=-1, trans='sqrt') + theme_bw() + xlab("FPKM") + ylab("log2 Fold Change")
dev.off()

Figure: ma_plot
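
The two data-preparation steps before the plot, taking log2 of the fold change and keeping only rows above an FPKM threshold, can be sketched in Python as follows. The helper names and the rows are hypothetical, and the sketch treats a missing FPKM (a gene with no matching expression record after the merge) as a row to drop.

```python
import math


def log2_fold_change(fc):
    """Return log2(fc), or None when fc is missing or not positive."""
    if fc is None or fc <= 0:
        return None
    return math.log2(fc)


def keep_expressed(rows, min_fpkm=1.0):
    """Keep rows whose FPKM is present and strictly greater than min_fpkm."""
    return [r for r in rows if r["FPKM"] is not None and r["FPKM"] > min_fpkm]


# Hypothetical merged rows: gene id, fold change, tumor FPKM
rows = [
    {"id": "geneA", "fc": 2.0, "FPKM": 50.0},
    {"id": "geneB", "fc": 0.5, "FPKM": 0.0},
    {"id": "geneC", "fc": 4.0, "FPKM": None},  # no expression record after the merge
]

for row in rows:
    row["log2_fc"] = log2_fold_change(row["fc"])

plotted = keep_expressed(rows)
print([(r["id"], r["log2_fc"]) for r in plotted])
```

The log2 scale makes up- and down-regulation symmetric: a doubling and a halving sit at the same distance from zero. In the R code, a comparison against a missing FPKM yields NA, and indexing a data frame with NA produces NA-filled rows, so dropping missing values explicitly is worth considering there as well.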
