CHIP-seq流程學習筆記(8)-使用MACS2 bdgdiff提取差異peak進行註釋

垚垚爸愛學習發表於2020-11-06

參考文章:

使用MACS2進行差異peak分析
重點推薦:Call differential binding events

MACS2作為使用最廣泛的peak calling軟體,在v2版本中新增了差異peak分析的功能,它的子命令功能如下:
在這裡插入圖片描述

1. 軟體說明

# 用法說明:
usage: macs2 bdgdiff [-h] --t1 T1BDG --t2 T2BDG --c1 C1BDG --c2 C2BDG
                     [-C CUTOFF] [-l MINLEN] [-g MAXGAP] [--d1 DEPTH1]
                     [--d2 DEPTH2] [--outdir OUTDIR]
                     (--o-prefix OPREFIX | -o OFILE OFILE OFILE)
# 可選引數:
optional arguments:
# 引數 --t1是讀取MACS pileup bedGraph for condition 1. 
# 引數 --t2是讀取MACS pileup bedGraph for condition 2. 
# 引數 --c1是讀取MACS control lambda bedGraph for condition 1.
# 引數 --c2是讀取MACS control lambda bedGraph for condition 2.
# 引數 -g 是Maximu gap to merge nearby differential regions.
# 引數 -l Minimum length of differential region. Try bigger value to remove small regions. DEFAULT: 200
# 引數 --d1  Sequencing depth (# of non-redundant reads in million) for condition 1. 
# 引數 --d2  Sequencing depth (# of non-redundant reads in million) for condition 2. 
# 引數 --o-prefix diff_c1_vs_c2儲存輸出檔名。

2. 操作記錄

Step 1: Generate pileup tracks using callpeak module

  1. 使用predictd模式對不同樣品的片段長度進行預測,以設定相同的extsize大小。
    操作記錄如下:
(base) zexing@DNA:~/projects/daizhongye/ChIP_seq/2020_10_29/bam_sort$ macs2 predictd -i Scr.bam.sort
INFO  @ Fri, 06 Nov 2020 16:24:45: # read alignment files...
INFO  @ Fri, 06 Nov 2020 16:24:45: # read treatment tags...
INFO  @ Fri, 06 Nov 2020 16:24:45: Detected format is: BAM
INFO  @ Fri, 06 Nov 2020 16:24:45: * Input file is gzipped.
INFO  @ Fri, 06 Nov 2020 16:24:50:  1000000
INFO  @ Fri, 06 Nov 2020 16:24:56:  2000000
INFO  @ Fri, 06 Nov 2020 16:25:01:  3000000
INFO  @ Fri, 06 Nov 2020 16:25:07:  4000000
INFO  @ Fri, 06 Nov 2020 16:25:12:  5000000
INFO  @ Fri, 06 Nov 2020 16:25:16: 5565504 reads have been read.
INFO  @ Fri, 06 Nov 2020 16:25:16: tag size is determined as 150 bps
INFO  @ Fri, 06 Nov 2020 16:25:16: # tag size = 150
INFO  @ Fri, 06 Nov 2020 16:25:16: # total tags in alignment file: 5565504
INFO  @ Fri, 06 Nov 2020 16:25:16: # Build Peak Model...
INFO  @ Fri, 06 Nov 2020 16:25:16: #2 looking for paired plus/minus strand peaks...
INFO  @ Fri, 06 Nov 2020 16:25:19: #2 number of paired peaks: 75474
INFO  @ Fri, 06 Nov 2020 16:25:19: start model_add_line...
INFO  @ Fri, 06 Nov 2020 16:25:19: start X-correlation...
INFO  @ Fri, 06 Nov 2020 16:25:19: end of X-cor
INFO  @ Fri, 06 Nov 2020 16:25:19: # finished!
INFO  @ Fri, 06 Nov 2020 16:25:19: # predicted fragment length is 275 bps
INFO  @ Fri, 06 Nov 2020 16:25:19: # alternative fragment length(s) may be 275 bps
INFO  @ Fri, 06 Nov 2020 16:25:19: # Generate R script for model : predictd

(base) zexing@DNA:~/projects/daizhongye/ChIP_seq/2020_10_29/bam_sort$ macs2 predictd -i shTgm1-2.bam.sort shTgm2-1.bam.sort
INFO  @ Fri, 06 Nov 2020 16:28:59: # read alignment files...
INFO  @ Fri, 06 Nov 2020 16:28:59: # read treatment tags...
INFO  @ Fri, 06 Nov 2020 16:28:59: Detected format is: BAM
INFO  @ Fri, 06 Nov 2020 16:28:59: * Input file is gzipped.
INFO  @ Fri, 06 Nov 2020 16:29:04:  1000000
INFO  @ Fri, 06 Nov 2020 16:29:09:  2000000
INFO  @ Fri, 06 Nov 2020 16:29:15:  3000000
INFO  @ Fri, 06 Nov 2020 16:29:20:  4000000
INFO  @ Fri, 06 Nov 2020 16:29:25:  5000000
INFO  @ Fri, 06 Nov 2020 16:29:30:  6000000
INFO  @ Fri, 06 Nov 2020 16:29:36: 6730717 reads have been read.
INFO  @ Fri, 06 Nov 2020 16:29:36: Detected format is: BAM
INFO  @ Fri, 06 Nov 2020 16:29:36: * Input file is gzipped.
INFO  @ Fri, 06 Nov 2020 16:29:41:  1000000
INFO  @ Fri, 06 Nov 2020 16:29:46:  2000000
INFO  @ Fri, 06 Nov 2020 16:29:51:  3000000
INFO  @ Fri, 06 Nov 2020 16:29:57:  4000000
INFO  @ Fri, 06 Nov 2020 16:30:02:  5000000
INFO  @ Fri, 06 Nov 2020 16:30:07: 5895422 reads have been read.
INFO  @ Fri, 06 Nov 2020 16:30:07: tag size is determined as 150 bps
INFO  @ Fri, 06 Nov 2020 16:30:07: # tag size = 150
INFO  @ Fri, 06 Nov 2020 16:30:07: # total tags in alignment file: 12626139
INFO  @ Fri, 06 Nov 2020 16:30:07: # Build Peak Model...
INFO  @ Fri, 06 Nov 2020 16:30:07: #2 looking for paired plus/minus strand peaks...
INFO  @ Fri, 06 Nov 2020 16:30:11: #2 number of paired peaks: 94479
INFO  @ Fri, 06 Nov 2020 16:30:11: start model_add_line...
INFO  @ Fri, 06 Nov 2020 16:30:12: start X-correlation...
INFO  @ Fri, 06 Nov 2020 16:30:12: end of X-cor
INFO  @ Fri, 06 Nov 2020 16:30:12: # finished!
INFO  @ Fri, 06 Nov 2020 16:30:12: # predicted fragment length is 258 bps
INFO  @ Fri, 06 Nov 2020 16:30:12: # alternative fragment length(s) may be 258 bps
INFO  @ Fri, 06 Nov 2020 16:30:12: # Generate R script for model : predictd

對於大多數的轉錄因子chip_seq資料,推薦值為200, 對於大部分組蛋白修飾的chip_seq資料,推薦值為147,本次預測的均值為(275+258)/2=260bp。

  1. 根據上邊預測的插入片段長度進行peak calling,並提取每個樣本過濾之後的reads數目(使用-B引數,不能使用–SPMR引數)

    vim 新建macs2_bdgdiff_callpeak_script指令碼如下:

#! /bin/bash
#上面一行宣告這個script的語法使用bash語法,當程式被執行時,能夠載入bash的相關環境配置檔案。
# Program:
#       This program is used for peak calling of ChIP-seq data by macs2.
# History:
# 2020/11/06        zexing              First release
#
# macs2 callpeak --help
# usage: macs2 callpeak -t TFILE     # treat組
#                       [-c [CFILE]] # control 或 mock(非特異性抗體,如IgG)組
#                                    # control:input DNA,沒有經過免疫共沉澱處理;
#                                    # mock:1)未使用抗體富集與蛋白結合的DNA片段;2)非特異性抗體,如IgG
#                       [-f]         # MACS2讀入檔案格式,預設自動檢測輸入檔案格式
#                       [-g GSIZE]   # 有效基因組大小(可比對基因組大小),人類hs,小鼠mm
#                       [-s TSIZE]   # 測序讀長;如果不設定,MACS 利用輸入的前10個序列自動檢測
#                       [--outdir OUTDIR] # MACS2結果檔案儲存路徑
#                       [-n NAME]    # 為MACS2輸出檔案命名
#                       [-B]  # 保留the fragment pileup, control lambda, -log10pvalue 和 -log10qvalue scores到bedGraph 檔案。
#                       [-q QVALUE | -p PVALUE] # qvalue (minimum FDR)設定call significant regions的閾值;
#                                               # 預設,0.01,對於 broad marks(組蛋白修飾的chipseq),可以使用0.05;
#                                               # Q-values are calculated from p-values using Benjamini-Hochberg procedure.
#                                               # 設定p值時, qvalue不再起作用。
# 利用egrep "tags after filtering in treatment|tags after filtering in control"提取每個樣本過濾之後的reads數目。
# 對變數${i}利用 for ${i} in A B C D 的方式遍歷指定
dir=/f/xudonglab/zexing/projects/daizhongye/ChIP_seq/2020_10_29/macs2_bdgdiff
for i in Scr shTgm1-2 shTgm2-1
do
macs2 callpeak -t /f/xudonglab/zexing/projects/daizhongye/ChIP_seq/2020_10_29/bam_sort/${i}.bam.sort --outdir ${dir}/ -n ${i} -q 0.05 -g mm -B --nomodel --extsize 260
egrep "tags after filtering in treatment|tags after filtering in control" ${dir}/${i}_peaks.xls > ${dir}/${i}_tags
done

後臺執行macs2_bdgdiff_callpeak_script指令碼如下:

(base) zexing@DNA:~/projects/daizhongye/ChIP_seq/2020_10_29/scripts_log$ nohup bash macs2_bdgdiff_callpeak_script > macs2_bdgdiff_callpeak_script_log &

Step 2: Call differential regions

vim 新建macs2_bdgdiff_script指令碼如下:

#! /bin/bash
#上面一行宣告這個script的語法使用bash語法,當程式被執行時,能夠載入bash的相關環境配置檔案。
# Program:
#       This program is used for calling differential binding events of ChIP-seq data by macs2.
# History:
# 2020/11/06        zexing              First release
# 用法說明:
# usage: macs2 bdgdiff [-h] --t1 T1BDG --t2 T2BDG --c1 C1BDG --c2 C2BDG
#                     [-C CUTOFF] [-l MINLEN] [-g MAXGAP] [--d1 DEPTH1]
#                      [--d2 DEPTH2] [--outdir OUTDIR]
#                     (--o-prefix OPREFIX | -o OFILE OFILE OFILE)
# 可選引數:
# optional arguments:
# 引數 --t1是讀取MACS pileup bedGraph for condition 1. 
# 引數 --t2是讀取MACS pileup bedGraph for condition 2. 
# 引數 --c1是讀取MACS control lambda bedGraph for condition 1.
# 引數 --c2是讀取MACS control lambda bedGraph for condition 2.
# 引數 -g 是Maximu gap to merge nearby differential regions.
# 引數 -l Minimum length of differential region. Try bigger value to remove small regions. DEFAULT: 200
# 引數 --d1  Sequencing depth (# of non-redundant reads in million) for condition 1. 
# 引數 --d2  Sequencing depth (# of non-redundant reads in million) for condition 2. 
# 引數 --o-prefix diff_c1_vs_c2儲存輸出檔名。
dir=/f/xudonglab/zexing/projects/daizhongye/ChIP_seq/2020_10_29/macs2_bdgdiff

macs2 bdgdiff --t1 ${dir}/Scr_treat_pileup.bdg --c1 ${dir}/Scr_control_lambda.bdg --d1 4236187 \
--t2 ${dir}/shTgm1-2_treat_pileup.bdg --c2 ${dir}/shTgm1-2_control_lambda.bdg --d2 5103633 \
-g 60 -l 260 --o-prefix ${dir}/diff_Scr_vs_shTgm1-2

macs2 bdgdiff --t1 ${dir}/Scr_treat_pileup.bdg --c1 ${dir}/Scr_control_lambda.bdg --d1 4236187 \
--t2 ${dir}/shTgm2-1_treat_pileup.bdg --c2 ${dir}/shTgm2-1_control_lambda.bdg --d2 4491626 \
-g 60 -l 260 --o-prefix ${dir}/diff_Scr_vs_shTgm2-1

後臺執行macs2_bdgdiff_script指令碼如下:

(base) zexing@DNA:~/projects/daizhongye/ChIP_seq/2020_10_29/scripts_log$ nohup bash macs2_bdgdiff_script > macs2_bdgdiff_script_log &

其中-d1和-d2的值就是第二步執行時輸出的reads數目,-o引數指定輸出檔案的字首。執行成功後,會產生3個檔案

diff_Scr_vs_shTgm1-2_c3.0_cond1.bed   # 儲存在condition1中上調的peak

diff_Scr_vs_shTgm1-2_c3.0_cond2.bed   #  儲存了在condition2中上調的peak

diff_Scr_vs_shTgm1-2_c3.0_common.bed  # 儲存的是沒有達到閾值的,非顯著差異peak

上述3個檔案格式是完全相同的,最後一列的內容為log10 likehood ratio值,用來衡量兩個條件之間的差異,預設閾值為3,大於閾值的peak為組間差異顯著的peak, 這個閾值可以通過-c引數進行調整。

Step 3: Annotate peaks from callpeak

相關文章