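While helping my girlfriend with her graduation project recently, I used the codeml program from the PAML package and found very little material about it online. I am sharing the pitfalls I ran into, in the hope that this helps others facing the same problems.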
1. Download and installation
PAML download link:
http://abacus.gene.ucl.ac.uk/software/paml4.9j.tgz
DAMBE download link:
http://dambe.bio.uottawa.ca/DAMBE/dambe_install_win.aspx
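2. Usage
First, prepare your .fas (FASTA) file.
The .fas file has to be converted to the format PAML reads. There are many ways to do this; I describe two here. Both give a file with the same content, only the file extension differs.
Method 1:
Convert with a Python script
Put your *.fas file in the same directory as the script (the script expects it to be named seven.fas) and run it; it generates a .phy file.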
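import re

# Read the aligned FASTA file: capture each name (up to the first space) and its sequence.
with open('seven.fas', 'r') as fin:
    sequences = [(m.group(1), ''.join(m.group(2).split()))
                 for m in re.finditer(r'(?m)^>([^ \n]+)[^\n]*([^>]*)', fin.read())]

# Write a "count length" header, then one "name  sequence" line per sequence.
with open('seven.phy', 'w') as fout:
    fout.write('%d %d\n' % (len(sequences), len(sequences[0][1])))
    for item in sequences:
        # Two spaces after the padded name, so that even names of 20 or more
        # characters stay separated from the sequence by at least 2 spaces.
        fout.write('%-20s  %s\n' % item)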
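Method 2:
Convert with DAMBE
1. Open DAMBE and choose File -> Open standard sequence file, set the file type to one that includes fas, and select your .fas file.
2. Click go.
3. Click File -> save or convert sequence format and choose the paml format.
4. Manually rename the extension from *.pml to *.nuc.
Either method gives you a *.phy or *.nuc file.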
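Next, the stop codons have to be removed from the sequences.
You can do this with find-and-replace in a text editor, replacing TAG/TAA/TGA with ---, but only where the triplet is an actual codon in the reading frame. A plain replace-all also hits triplets that straddle two codons (or sit inside a sequence name) and damages the alignment.
Alternatively, use the following Python script: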
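STOP_CODONS = {'TAG', 'TAA', 'TGA'}  # stop codons of the universal code (icode = 0)

# Assumes one "name  sequence" line per sequence, as written by the Method 1 script.
with open('seven.phy', 'r') as f:
    header, *rows = f.read().splitlines()

with open('sevenend.phy', 'w') as f:
    f.write(header + '\n')
    for row in rows:
        if not row.strip():
            continue
        name, seq = row.split(None, 1)
        seq = ''.join(seq.split())
        # Step through the sequence codon by codon so that only in-frame stop
        # codons are masked; str.replace on the whole file would also change
        # triplets that straddle two codons or occur inside a sequence name.
        codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
        masked = ''.join('---' if c.upper() in STOP_CODONS else c for c in codons)
        f.write('%-20s  %s\n' % (name, masked))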
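This generates a file with the stop codons masked (sevenend.phy).
Now put three files in the \paml4.9j\bin directory: the processed sequence file (*.phy), the tree file, and the control file codeml.ctl.
The contents of codeml.ctl are shown below for reference. Usually you only need to change the first three lines, which are, in order, the sequence file name, the tree file name, and the output file name.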
      seqfile = seven.nuc * sequence data filename
     treefile = Newick * tree structure file name
      outfile = test.txt * main result file name

        noisy = 0 * 0,1,2,3,9: how much rubbish on the screen
      verbose = 0 * 0: concise; 1: detailed, 2: too much
      runmode = -2 * 0: user tree; 1: semi-automatic; 2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

      seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
    CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table

*       ndata = 5504
        clock = 0 * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
       aaDist = 0 * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
   aaRatefile = dat/jones.dat * only used for aa seqs with model=empirical(_F)
                   * dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own

        model = 0
                   * models for codons:
                       * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
                   * models for AAs or codon-translated AAs:
                       * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
                       * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189)

      NSsites = 0 * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
                   * 5:gamma;6:2gamma;7:beta;8:beta&w;9:beta&gamma;
                   * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
                   * 13:3normal>0

        icode = 0 * 0:universal code; 1:mammalian mt; 2-10:see below
        Mgene = 0
                   * codon: 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff
                   * AA: 0:rates, 1:separate

    fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated
        kappa = 2 * initial or fixed kappa
    fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate
        omega = .4 * initial or fixed omega, for codons or codon-based AAs

    fix_alpha = 1 * 0: estimate gamma shape parameter; 1: fix it at alpha
        alpha = 0. * initial or fixed alpha, 0:infinity (constant rate)
       Malpha = 0 * different alphas for genes
        ncatG = 8 * # of categories in dG of NSsites models

        getSE = 0 * 0: don't want them, 1: want S.E.s of estimates
 RateAncestor = 1 * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)

   Small_Diff = .5e-6
    cleandata = 1 * remove sites with ambiguity data (1:yes, 0:no)?
*  fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed, 3: proportional
       method = 0 * Optimization method 0: simultaneous; 1: one branch a time

* Genetic codes: 0:universal, 1:mammalian mt., 2:yeast mt., 3:mold mt.,
* 4: invertebrate mt., 5: ciliate nuclear, 6: echinoderm mt.,
* 7: euplotid mt., 8: alternative yeast nu. 9: ascidian mt.,
* 10: blepharisma nu.
* These codes correspond to transl_table 1 to 11 of GENEBANK.
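If you run codeml on several data sets, editing the first three lines by hand gets tedious. The following is a minimal sketch of how those entries could be rewritten from Python. The helper name set_ctl_options and the file names sevenend.phy and sevenend_result.txt are my own choices for illustration and are not part of PAML; the sketch also assumes file names without spaces.

```python
import re

def set_ctl_options(ctl_text, **options):
    """Return ctl_text with the value of each named option replaced.

    Only active 'name = value' lines are changed. Lines commented out with a
    leading '*' never match, and the trailing '* ...' comments are kept.
    """
    for name, value in options.items():
        pattern = re.compile(r'^(\s*%s\s*=\s*)\S+' % re.escape(name), re.MULTILINE)
        ctl_text, count = pattern.subn(
            lambda m: m.group(1) + str(value), ctl_text, count=1)
        if count == 0:
            raise KeyError('option not found in control file: %s' % name)
    return ctl_text

# The first three lines of the control file shown above.
example = (
    "      seqfile = seven.nuc * sequence data filename\n"
    "     treefile = Newick * tree structure file name\n"
    "      outfile = test.txt * main result file name\n"
)
updated = set_ctl_options(example,
                          seqfile='sevenend.phy',
                          outfile='sevenend_result.txt')
print(updated)
```

In practice you would read codeml.ctl into a string, pass it through the helper, and write the result back before running codeml.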
Open a command prompt in this directory
and enter the following command:
codeml
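The result file test.txt, along with some other files, will then appear in the current directory.
I ran into quite a few errors along the way; here they are for reference.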
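Before running codeml, it can help to confirm that the files named in the control file really are in the working directory. Here is a rough sketch of such a check. The function names read_ctl_options and missing_inputs are hypothetical, and the parsing rule (everything after the first * on a line is a comment) is my reading of the control file shown above, not something taken from the PAML source.

```python
import os
import re

def read_ctl_options(ctl_text):
    """Collect active 'name = value' pairs; text after '*' is treated as a comment."""
    options = {}
    for line in ctl_text.splitlines():
        line = line.split('*', 1)[0]  # drops trailing comments and commented-out lines
        m = re.match(r'\s*(\w+)\s*=\s*(\S+)', line)
        if m:
            options[m.group(1)] = m.group(2)
    return options

def missing_inputs(ctl_text, directory='.'):
    """Return the seqfile/treefile names from the control file that do not exist."""
    options = read_ctl_options(ctl_text)
    names = [options[key] for key in ('seqfile', 'treefile') if key in options]
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]

example = (
    "      seqfile = seven.nuc * sequence data filename\n"
    "     treefile = Newick * tree structure file name\n"
    "      outfile = test.txt * main result file name\n"
    "*       ndata = 5504\n"
)
print(read_ctl_options(example))
print(missing_inputs(example))  # lists whichever input files are absent here
```

An empty list from missing_inputs means both input files were found in the given directory.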
67 columns are converted into ??? because of stop codons
This message appears because the stop codons were not removed from the file; follow the steps above to remove them.
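To see where the offending codons are before masking them, a small helper like the one below can list the in-frame stop codons of a sequence. It is only a sketch: the name find_inframe_stops is made up for this post, it assumes the universal genetic code (icode = 0), and it assumes the sequence starts in reading frame 1.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # universal code only (icode = 0)

def find_inframe_stops(seq):
    """Return (codon_number, codon) pairs for stop codons in reading frame 1."""
    seq = seq.upper()
    hits = []
    # Ignore a trailing partial codon if the length is not a multiple of 3.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            hits.append((i // 3 + 1, codon))  # codon numbers are 1-based
    return hits

print(find_inframe_stops("ATGGCCTAAGGGTGA"))  # [(3, 'TAA'), (5, 'TGA')]
print(find_inframe_stops("ATAGCCGGG"))        # [] because the TAG here is out of frame
```

The second call shows why codon-by-codon scanning matters: the string contains the letters TAG, but they span the codons ATA and GCC, so nothing should be replaced.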
Error: Error in sequence data file: . in 1st seq.?.
Error: check #seqs and tree: perhaps too many '('?.
Make sure to separate the sequence from its name by 2 or more spaces.
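All of the errors above mean that the content or format of your sequence file is wrong. Regenerate the sequence file following the steps above, or compare it against a file that is known to work.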
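A quick format check can catch most of these problems before codeml does. The sketch below assumes the simple layout produced by the Python script in Method 1 (a header line with two integers, then one line per sequence with the name and sequence separated by at least two spaces); the function name check_phylip is hypothetical, and the checks are my own guesses at what commonly goes wrong, not PAML's actual validation.

```python
import re

def check_phylip(text):
    """Return a list of problems found in a one-line-per-sequence PHYLIP-style file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ['file is empty']
    try:
        n_seqs, n_sites = (int(x) for x in lines[0].split()[:2])
    except ValueError:
        return ['first line must hold two integers: number of sequences and sites']
    problems = []
    rows = lines[1:]
    if len(rows) != n_seqs:
        problems.append('header says %d sequences, found %d' % (n_seqs, len(rows)))
    for row in rows:
        parts = re.split(r'\s{2,}', row.strip(), maxsplit=1)
        if len(parts) != 2:
            problems.append('no gap of 2+ spaces between name and sequence: %r' % row[:30])
            continue
        name, seq = parts
        seq = ''.join(seq.split())
        if len(seq) != n_sites:
            problems.append('%s: length %d, header says %d' % (name, len(seq), n_sites))
        if len(seq) % 3:
            problems.append('%s: length %d is not a multiple of 3' % (name, len(seq)))
    return problems

good = "2 6\nseqA  ATGGCC\nseqB  ATGGCA\n"
bad = "2 6\nseqA ATGGCC\nseqB  ATGGC\n"
print(check_phylip(good))
for problem in check_phylip(bad):
    print(problem)
```

An empty list means none of these particular checks failed; it does not guarantee that codeml will accept the file.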