scPagwas: in-house handling of GWAS data pruning [draft notes, not yet finalized]

Posted by corrschi on 2024-04-27

Three main steps in total:
step 1: extract the 503 EUR samples to get 1000G.EUR.QC.chr*, running plink in batch via a Python script
step 2: extract the SNPs present in my-MDD from 1000G.EUR.QC.chr* to get 1000G.EUR.QC.chr*-sub-SNP, running plink in batch via a Python script
step 3: run pruning to get MDD.chr*_plink_prune_EUR_filtered_LD0.8.prune.in, running plink in batch via a Python script

For each Python script, prepare two argument files in advance: 1. a keep_sample/SNP file; 2. a plink_list_file [column 1: file_name, column 2: chr_name].
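For reference, the plink_list_file is plain whitespace-separated text, one chromosome per row. A toy sketch (the bfile prefixes below are hypothetical placeholders, not the real reference paths) showing the layout and how the scripts below unpack it:

```python
# Write a toy plink_list_file: column 1 = plink bfile prefix, column 2 = chr name.
# The prefixes here are hypothetical placeholders, not the real reference files.
rows = [
    ("ALL.chr21.phase3.genotypes", "chr21"),
    ("ALL.chr22.phase3.genotypes", "chr22"),
]
with open("1kg-plink.txt", "w") as out:
    for b_file, chrome in rows:
        out.write(f"{b_file}\t{chrome}\n")

# Each line can then be unpacked exactly as the batch scripts do:
with open("1kg-plink.txt") as f:
    for line in f:
        b_file, chrome = line.split()
        print(b_file, chrome)
```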

Draft version

## Sample extraction: cat get_extract_sample.py
import sys

if __name__ == '__main__':
    keep_sample_file = sys.argv[1]   # FID/IID list for plink --keep
    plink_list_file = sys.argv[2]    # two columns: bfile prefix, chromosome name

    with open(plink_list_file) as f:
        for line in f:
            b_file, chrome = line.split()
            # one plink command per chromosome; redirect stdout into a shell script
            cmd = rf'''/data/home/lingw/bin/plink --allow-no-sex --bfile {b_file} \
                --keep {keep_sample_file} --make-bed --out 1000G.EUR.QC.{chrome}'''
            print(cmd)
            print('#' * 20)
            print()


##/data/home/lingw/bin/plink --bfile /data/home/lingw/workSpace/reference/1000Genomes/raw_vcf/plink/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf --keep for-subset-sample.txt --make-bed --out 1000G.EUR.QC.chr10

python get_extract_sample.py for-subset-sample.txt 1kg-plink.txt > run.sh

nohup sh run.sh &

# After the run, check that each output contains the 503 samples
wc -l 1000G.EUR.QC.chr22.fam
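The same check can be scripted: a plink .fam file holds one sample per line, so a line count gives the sample count. A minimal sketch (a toy 3-sample .fam is written first so the snippet runs standalone; the real target would be 1000G.EUR.QC.chr*.fam):

```python
# Count lines in a plink .fam file (one sample per line) to confirm
# that the expected 503 EUR samples remain after --keep.
fam_path = "toy.fam"
with open(fam_path, "w") as out:          # toy stand-in for 1000G.EUR.QC.chr22.fam
    for i in range(3):
        out.write(f"FAM{i} IID{i} 0 0 0 -9\n")

def count_samples(path):
    with open(path) as f:
        return sum(1 for _ in f)

print(count_samples(fam_path))  # expect 503 for the real per-chromosome .fam files
```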

## SNP extraction with plink
# Prepare file 1: sub-sample-plink-list.txt
ls 1000G.EUR.QC.chr*.bim|cut -f1-4 -d '.' > aa.txt
ls 1000G.EUR.QC.chr*.bim|cut -f4 -d '.' > bb.txt
paste aa.txt bb.txt > sub-sample-plink-list.txt  # paste uses a tab delimiter by default

# Prepare file 2: extract_snp.txt


awk -F' ' '{print $3}' MDD_gwas_data.txt > extract_snp.txt  # awk splits on whitespace by default; -F sets the field separator (column 3 holds the SNP ID here)
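The awk one-liner can also be done in Python; a sketch under the same assumption that the SNP ID sits in the third whitespace-separated column (a toy input file stands in for MDD_gwas_data.txt):

```python
# Extract column 3 (SNP ID) from a whitespace-delimited GWAS summary file,
# mirroring: awk '{print $3}' MDD_gwas_data.txt > extract_snp.txt
def extract_column(in_path, out_path, col=2):
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()      # like awk's default: split on any whitespace
            if len(fields) > col:
                dst.write(fields[col] + "\n")

# toy input so the sketch is self-contained
with open("toy_gwas.txt", "w") as f:
    f.write("chr1 12345 rs111 A G\nchr1 12400 rs222 C T\n")
extract_column("toy_gwas.txt", "toy_snp.txt")
```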

##keep sample
##extract snp

cat > get_extract_snp.py
import sys

if __name__ == '__main__':
    keep_snp_file = sys.argv[1]      # SNP ID list for plink --extract
    plink_list_file = sys.argv[2]    # two columns: bfile prefix, chromosome name

    with open(plink_list_file) as f:
        for line in f:
            b_file, chrome = line.split()
            # one plink command per chromosome; redirect stdout into a shell script
            cmd = rf'''/data/home/lingw/bin/plink --allow-no-sex --bfile {b_file} \
                --extract {keep_snp_file} --make-bed --out 1000G.EUR.QC.{chrome}-sub-SNP'''
            print(cmd)
            print('#' * 20)
            print()


python get_extract_snp.py extract_snp.txt sub-sample-plink-list.txt > run-extract-snp.sh

nohup sh ./run-extract-snp.sh &
# nohup keeps the job running after disconnect; & sends it to the background
jobs

## Check how many SNPs the MDD GWAS file contains vs. how many were extracted
wc -l extract_snp.txt  #8481297 extract_snp.txt
wc -l 1000G.EUR.QC.chr*-sub-SNP.bim  #7829695 total

Losing ~600k SNPs here is normal; if only a few hundred thousand had been extracted, the loss would be excessive and the extraction would need to be redone.
Possible causes: 1. the SNP IDs differ from those used in 1000G; 2. SNPs in the downloaded GWAS summary are not covered by 1000G (the 1000G data is fairly old). Generally this can be ignored unless too many SNPs are lost.
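The loss can be quantified directly by intersecting the GWAS SNP IDs with those in the .bim files. A minimal sketch with toy ID sets (the real inputs would be extract_snp.txt and the per-chromosome .bim files):

```python
# Fraction of GWAS SNPs recovered after --extract: |gwas ∩ reference| / |gwas|.
def overlap_fraction(gwas_ids, bim_ids):
    gwas = set(gwas_ids)
    kept = gwas & set(bim_ids)
    return len(kept) / len(gwas)

# toy data: 4 GWAS SNPs, 3 of them present in the reference panel
gwas_snps = ["rs1", "rs2", "rs3", "rs4"]
ref_snps = ["rs1", "rs2", "rs4", "rs9"]
print(overlap_fraction(gwas_snps, ref_snps))  # 0.75
```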


## Filter LD information (pruning)
ls 1000G.EUR.QC.chr*-sub-SNP.bim | sed 's/\.bim$//' > aa.txt  # bfile prefix: strip the .bim extension
ls 1000G.EUR.QC.chr*-sub-SNP.bim|cut -f4 -d '.' |cut -f1 -d '-' >bb.txt
paste aa.txt bb.txt  > sub-sample-SNP-list.txt
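The ls/cut/paste pipeline can equally be written in Python; a sketch that derives both columns of the list file from a .bim filename, assuming the 1000G.EUR.QC.chr*-sub-SNP naming pattern used above:

```python
import re

# From "1000G.EUR.QC.chr10-sub-SNP.bim" derive:
#   bfile prefix: "1000G.EUR.QC.chr10-sub-SNP"  (strip the .bim extension)
#   chromosome:   "chr10"
def list_row(bim_name):
    prefix = bim_name[:-len(".bim")] if bim_name.endswith(".bim") else bim_name
    m = re.search(r"(chr[0-9XY]+)", bim_name)
    chrome = m.group(1) if m else ""
    return prefix, chrome

print(list_row("1000G.EUR.QC.chr10-sub-SNP.bim"))
```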

(plink --make-bed writes a new set of plink files; without --make-bed no new files are produced, so the originals are never overwritten.
 --allow-no-sex must be added: some samples have sex recorded as NA, and plink will error out without it.)

--indep-pairwise 50 5 0.8  # window of 50 SNPs, step of 5 SNPs, prune pairs with r² > 0.8
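--indep-pairwise slides a window across each chromosome and, within each window, removes one SNP from every pair whose genotype r² exceeds the threshold. A simplified toy illustration of the idea (not plink's exact algorithm; plink's bookkeeping for window overlap and tie handling differs):

```python
# Toy greedy pairwise pruning: within each window, drop the later SNP of any
# pair whose squared genotype correlation (r^2) exceeds the threshold.
# Simplified sketch of the idea behind --indep-pairwise, not plink's exact code.
def r2(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else (cov * cov) / (vx * vy)

def prune(genos, window=50, step=5, threshold=0.8):
    """genos: dict snp_id -> list of 0/1/2 dosages, keys in map order."""
    ids = list(genos)
    removed = set()
    for start in range(0, len(ids), step):
        win = [s for s in ids[start:start + window] if s not in removed]
        for i in range(len(win)):
            for j in range(i + 1, len(win)):
                a, b = win[i], win[j]
                if a in removed or b in removed:
                    continue
                if r2(genos[a], genos[b]) > threshold:
                    removed.add(b)            # keep the earlier SNP, drop the later
    return [s for s in ids if s not in removed]   # the ".prune.in" set

# toy example: snpB duplicates snpA (r^2 = 1), snpC is independent of both
genos = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],
    "snpC": [2, 0, 1, 1, 0, 2],
}
print(prune(genos, window=3, step=1))  # ['snpA', 'snpC']
```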


## cat get_pruning.py
import sys

if __name__ == '__main__':
    plink_list_file = sys.argv[1]    # two columns: bfile prefix, chromosome name

    with open(plink_list_file) as f:
        for line in f:
            b_file, chrome = line.split()
            # pruning writes .prune.in/.prune.out only; no --make-bed, so no new bed files
            cmd = rf'''/data/home/lingw/bin/plink --allow-no-sex --bfile {b_file} \
                --indep-pairwise 50 5 0.8  --out MDD.{chrome}_plink_prune_EUR_filtered_LD0.8'''
            print(cmd)
            print('#' * 20)
            print()

python get_pruning.py sub-sample-SNP-list.txt > run-pruning.sh

nohup sh ./run-pruning.sh &

wc -l MDD.chr7_plink_prune_EUR_filtered_LD0.8.prune.in MDD.chr7_plink_prune_EUR_filtered_LD0.8.prune.out

wc -l 1000G.EUR.QC.chr7-sub-SNP.bim

head MDD_EUR_LD0.8.prune
wc -l MDD_EUR_LD0.8.prune  #2716381
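The merged MDD_EUR_LD0.8.prune list is presumably the concatenation of the per-chromosome .prune.in files (that merge step is not shown above); a sketch of the concatenation with toy stand-in files:

```python
import glob

# Concatenate per-chromosome .prune.in files into one genome-wide SNP list.
# Toy files stand in for MDD.chr*_plink_prune_EUR_filtered_LD0.8.prune.in.
for chrom, snps in [("chr21", ["rs1", "rs2"]), ("chr22", ["rs3"])]:
    with open(f"toy.{chrom}.prune.in", "w") as f:
        f.write("\n".join(snps) + "\n")

with open("toy.prune.merged", "w") as out:
    for path in sorted(glob.glob("toy.chr*.prune.in")):
        with open(path) as f:
            out.write(f.read())

with open("toy.prune.merged") as f:
    print(sum(1 for _ in f))  # 3 SNPs total
```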
