Converting SRA files to FASTQ with sratoolkit on Ubuntu

Posted by KeepLearningBigData on 2016-01-13


Environment: Ubuntu 14.04

Toolkit: sratoolkit.2.5.5-ubuntu64


1. Download

Download page:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software#


2. Convert SRA to FASTQ:

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump SRR003161
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
SRR002664.fastq  SRR002664.sra  SRR003161.fastq  SRR003161.sra
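
When the conversion has to be repeated for several accessions, the command line can be assembled in a script. The following is a minimal sketch, not part of the original session: the function name `fastq_dump_command` is hypothetical, it assumes `fastq-dump` is on `PATH` unless another binary path is passed, and it only uses options that appear in this post (`--fasta N`, `--split-files`, `--split-3`, `--split-spot`).

```python
def fastq_dump_command(accession, fasta_width=None, split=None, binary="fastq-dump"):
    """Build the argument list for one fastq-dump call (hypothetical helper).

    fasta_width: if given, request FASTA output with that many bases per line.
    split: None or one of the split options used later in this post.
    binary: path to the fastq-dump executable (assumed to be on PATH by default).
    """
    allowed_splits = (None, "--split-files", "--split-3", "--split-spot")
    if split not in allowed_splits:
        raise ValueError("unsupported split option: %r" % (split,))

    command = [binary]
    if fasta_width is not None:
        command += ["--fasta", str(fasta_width)]
    if split is not None:
        command.append(split)
    command.append(accession)
    return command


# Example (not executed here, since it needs the toolkit installed):
#   import subprocess
#   subprocess.run(fastq_dump_command("SRR003161"), check=True)
print(" ".join(fastq_dump_command("SRR003161")))
```

Returning a list rather than a single string keeps the arguments separate, so the result can be passed to `subprocess.run` without shell quoting.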
 


For the data files, see: http://blog.csdn.net/xubo245/article/details/50507222

3. Inspect the FASTQ file:

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fastq 

@SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGAATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGNAGNNNANNNNNNNNNNNNNNNNNN
+SRR003161.1 FEKQ5UX01AS5XC length=124
AAAAAAAAAAAAAAAA:::BAAFAABAAB?>>=44!39=<!:866699888220862!08:8002!0200000!022200800!20660000600!000!06!!!6!!!!!!!!!!!!!!!!!!
@SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTCATTCTTGTTGCCCAGGCTGGAGTGCAATGGCGCAATCTCAGCTCACAGCAACCTCCGCCTCCCGGGTTCAAGCGATTCTCCTGCCTCAGCCTCTCGAGTAGCTGGGATTACAGGCATGCACCATCACGCCCAGCTAATTTGCATTTTTTATTAGAGATGGGGTTTCTCCAC
ATTGGTCAGGCTGATCTCGAACTCCTGACCTCAGGTGATCTGCCTGCCTTGGCCTCCCAAAGTGCTGGGATTACAGGCATGAGCCTGAGCCCAACCTATTTACTTTCAATCCATCTTTTCAATAACTTAAATACAAGTGTCAATATATACAATCTTTTCCTCCCTGGTTATCAAGCTTTCTAATATATATG
GATGTATCTTCCAAGGTTTTTGATCCCATTTTACTTTACAGGCTCACTGCTGTGGAACCCAGAGAGCAGTCTCTTTTCAAGGNGGGCTGAGACNCGCAACAGGGGATTAGGCCAAGGCNCAGG
+SRR003161.2 FEKQ5UX01AOE96 length=505
CCCCCCCCCCCCCCCC@@@CCCFEEEFEEG888EEEFFEEEEFGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCA<777@@CCCBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAAACCCCCCCCCCCCCCCCCCCCCCC:93339@A>77//39AC666666C22CAAAA93333///7-0017
>9999>>A???ACCCCCCC2239322>9977<?????CCCCCCCCC877777777111111::::5555:555:::::::::;:555:;;::::0040-----***--467::::;;;;;;:::511155555:555:::;::::::7777744-------///245::;;;::::::;;;;;;;;:5555
4774----------44-----064---------6---522451115247644255-----,4---24464422---------!,,,4464224!11:::7:::111111--7777---!----
@SRR003161.3 FEKQ5UX01ARXN7 length=645
TCAGCATGCTAGACAGAAGAATTCTCAGTAACTTTCTTTGTGCTGTGTGTATTCAACTCACAGAGTTGGAACCGTTCCTTTGTCAACAGAGCTAGAATTTGAAACCNCTCTTGAGGACTACGCGAAANAGGGGANAAGGTCCAAAGGCCAGTANAGGGNTCGGANGTANAAGATNCTNAAAATAAAACNGA
NAGAATCATTCTNAAGAAACTTNTTGNATGTNTGCCCTTTCAAACTCAACAGGAGTTTACCAAACCTTTTCTTTTCTAAAGGAGACTAAGGTTTTAAGAAAACCACTTACTCGGTCTTTGGTTAATGTCTGCAAAGGTGGATTATTGGACCTTCTTGAGGTCCCTTTCGTTGCGTAAAACCGGGGTTTCTT
CCTTTCACTTAGTCGTACGTAACGTAAACGTAAAAGGTAAAGGTTACGTTACGTTAACGTTTAAACGTTTTTTTAACGTTTTGGTTTGGTTTGGTTGTTAGTTTACTTAACCTTAACCTAACCTAAACGTAAAGGTTTAACGGTTAAACCGTTAACGTTACGTTTAACGTTAAGGTAAGGAAGGACGAGTA
AGTTAAGTTAAACTAAACTACTAGTAGACGACGACAACGAAGGAGAGAGAGACGACACGAGGAGGAGNGNNN
+SRR003161.3 FEKQ5UX01ARXN7 length=645
AAAAAAAAAAAAAAAAAAAAAAIFAABA?7792222.,,:3<<<<:0222276:220::20020028662222022000002,220006666=9000669600000!0699788...4877873...!,.333.!......4447........!....!....4!...!..66.!..!....4+++*.!..
!.33333686--!---------!--3!332,!,,,,,,,,*,,,,2,,,,,,,,,2,,,,,,,,,,,,.,,((((,(,,,,,),,,,,,,,,,..000----,,(,,,,,,,,,,,,,,,,)),,,,,10,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,),,,1..,,,,,,,,,,,,))
,,,,,,,,,,,,,,,,,,,03330,,,,,,,)))),,0(((,,,,,100,,,,,,,,0,,,,,,-03,----)))),,'''',,(((,,))),,)),,,,,,,,,,))00,,,,,,,,000,,,,,,,,,))),,,,,)),,,)),,,,,0,,)),,11133-,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,-,,))),,),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10000,,,,,,,,,,!,!!!
@SRR003161.4 FEKQ5UX01AMUAT length=587
TCAGGTTTGGAATGTGGGCTCTGAAGCCATACAACACAGTTTCTACTCTTTATCTTACACCTCCTGACTTTGTGACATTGGTTAAATATTTTATTTATTATNNCATAACTTACTACTTTGTTAAATTAGAAGTACGACTGTCTACACTCTTAGGTAGTTGGTCTGTTGAAATTAAATAATAGNACTTTAAC
TTACTTAAATAGANATACACACGACTTAGTTAGTTGTTGGCTGGAAATTAGGTATNTGTTTTAGTTCCTACACCTTACTTAACCCTAACCTACCATNTAATACTTTTACTTGTTCTCNGANANATNATAGTNTCTACGTTGAGTATATTACTTATATTACACGGTACGACGGACCGACGTCGTACACGTCT
CGTCTTCTNCNANNATGTAGTGAGTCTNTTTATTNTTTCTTAACTACTACTACTCGTTGTAGTAAGTAATAATAANTNNTCTACACCTACGACTGTATTGTAAGTACAAGAAGGACCGACGTTTCGTTACCTTTCTTCTTCGTCCTCTACTTAACCTGTTACTACGTACGCGAACACGGACGTAGGAGGAG
GAGGACACGAACGG
+SRR003161.4 FEKQ5UX01AMUAT length=587
AAAAAAAAAAAAAAAAAAAAAAIEEAIIIIIIAAIIIA:666AAE???<<<@AA===A=>>AAAAAAAAAAAAA?@???980000040....0/**04490!!00000600.........,,.....,.....74..............33.....7.....4..............++664!.000000.
135855----*--!3------------33,,,,,,,,,2222222,,,,*,,,,,!,,,,,,3,((,,,,00,,,,,,,,,,,,,1,,)),,,,01!333001,,,,03((,,,,,,!,,!,!,,!,,3,,!,1,,,,,,,,,,,,,,,,,,,,,,,,3,,,,,433,,,,,,,,13,,,,,,,,,04,,,
,,,,,,,,!,!,!!,,,,10,,,311,!,,,1))!,,,,)),,30,,,0330,,,,,,003333,,,,,0003,,!,!!,,01,,,033,,,,,1,,,,,,,,00,,,,,,,,,1331313/.,,,)),,,,,,,)),,,,,,,,,,010,,,,,,,,,,3303,,,,0000000,,,,03,,,,,0,,,,
,,,,34333,,,,,
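
Each record in the listing above has four parts: an `@` header, the bases, a `+` line, and one quality character per base. In the first record the `N` bases line up with `!` quality characters. The sketch below shows one way to read such records and turn quality characters into numeric scores. It assumes every record occupies exactly four lines in the file (the long reads above appear wrapped only by the display) and that qualities use an ASCII offset of 33. The record in `sample` is made up for illustration.

```python
from collections import namedtuple
from itertools import islice

FastqRecord = namedtuple("FastqRecord", ["name", "sequence", "quality"])


def parse_fastq(lines):
    """Yield FastqRecord tuples from an iterable of lines (4 lines per record)."""
    rows = (line.rstrip("\r\n") for line in lines)
    while True:
        record = list(islice(rows, 4))
        if not record:
            return  # clean end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, plus, quality = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record near %r" % header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in %r" % header)
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in quality]


# Made-up record for illustration; the N base carries the lowest score.
sample = "@demo.1 length=4\nACGN\n+demo.1 length=4\nA:0!\n"
for rec in parse_fastq(sample.splitlines()):
    print(rec.name, rec.sequence, phred_scores(rec.quality))
```

For a real file, an open file handle can be passed instead of `sample.splitlines()`, since the parser reads line by line.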

4. Convert SRA to FASTA:

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 20 SRR003161
2016-01-13T05:33:42 fastq-dump.2.5.5 err: timeout exhausted while reading file within network system module - failed SRR003161

=============================================================
An error occurred during processing.
A report was generated into the file '/home/hadoop/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to 'sra@ncbi.nlm.nih.gov' for assistance.
=============================================================


hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fasta
>SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATG
GTCTCGAATGGAATCNTCTA
NAGAGATGGAATGTATCNCT
CGCCANACGACACNCGAACA
GGGNAAGGCAAGCAGNAGGN
AGNNNANNNNNNNNNNNNNN
NNNN
>SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTC
ATTCTTGTTGCCCAGGCTGG
AGTGCAATGGCGCAATCTCA
GCTCACAGCAACCTCCGCCT
CCCGGGTTCAAGCGATTCTC
CTGCCTCAGCCTCTCGAGTA

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 50 SRR003161
2016-01-13T05:36:52 fastq-dump.2.5.5 err: timeout exhausted while reading file within network system module - failed SRR003161

=============================================================
An error occurred during processing.
A report was generated into the file '/home/hadoop/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to 'sra@ncbi.nlm.nih.gov' for assistance.
=============================================================

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
SRR002664.fastq  SRR002664.sra  SRR003161.fasta  SRR003161.fastq  SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR003161.fasta
>SRR003161.1 FEKQ5UX01AS5XC length=124
TCAGATGCAATCATCGAATGGTCTCGAATGGAATCNTCTANAGAGATGGA
ATGTATCNCTCGCCANACGACACNCGAACAGGGNAAGGCAAGCAGNAGGN
AGNNNANNNNNNNNNNNNNNNNNN
>SRR003161.2 FEKQ5UX01AOE96 length=505
TCAGTTTGAGATGGAGTTTCATTCTTGTTGCCCAGGCTGGAGTGCAATGG
CGCAATCTCAGCTCACAGCAACCTCCGCCTCCCGGGTTCAAGCGATTCTC
CTGCCTCAGCCTCTCGAGTAGCTGGGATTACAGGCATGCACCATCACGCC
CAGCTAATTTGCATTTTTTATTAGAGATGGGGTTTCTCCACATTGGTCAG
GCTGATCTCGAACTCCTGACCTCAGGTGATCTGCCTGCCTTGGCCTCCCA
AAGTGCTGGGATTACAGGCATGAGCCTGAGCCCAACCTATTTACTTTCAA
TCCATCTTTTCAATAACTTAAATACAAGTGTCAATATATACAATCTTTTC
CTCCCTGGTTATCAAGCTTTCTAATATATATGGATGTATCTTCCAAGGTT
TTTGATCCCATTTTACTTTACAGGCTCACTGCTGTGGAACCCAGAGAGCA
GTCTCTTTTCAAGGNGGGCTGAGACNCGCAACAGGGGATTAGGCCAAGGC
NCAGG
>SRR003161.3 FEKQ5UX01ARXN7 length=645
TCAGCATGCTAGACAGAAGAATTCTCAGTAACTTTCTTTGTGCTGTGTGT
ATTCAACTCACAGAGTTGGAACCGTTCCTTTGTCAACAGAGCTAGAATTT
GAAACCNCTCTTGAGGACTACGCGAAANAGGGGANAAGGTCCAAAGGCCA
GTANAGGGNTCGGANGTANAAGATNCTNAAAATAAAACNGANAGAATCAT
TCTNAAGAAACTTNTTGNATGTNTGCCCTTTCAAACTCAACAGGAGTTTA
CCAAACCTTTTCTTTTCTAAAGGAGACTAAGGTTTTAAGAAAACCACTTA

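The error above is unresolved for now, although a SRR003161.fasta file was still produced.



Switching to a different dataset (SRR002664) ran without the error.

Successful run: --fasta 50 writes 50 bases per line.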

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --fasta 50 SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ls
back  SRR002664.fasta  SRR002664.sra  SRR003161.fasta  SRR003161.fastq  SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll -h
total 986M
drwxrwxr-x 3 hadoop hadoop 4.0K  1月 13 13:40 ./
drwxrwxr-x 5 hadoop hadoop 4.0K  1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4.0K  1月 13 13:39 back/
-rw-rw-r-- 1 hadoop hadoop 150M  1月 13 13:40 SRR002664.fasta
-rw-r--r-- 1 hadoop hadoop  17M 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 274M  1月 13 13:36 SRR003161.fasta
-rw-rw-r-- 1 hadoop hadoop 538M  1月 13 13:00 SRR003161.fastq
-rw-r--r-- 1 hadoop hadoop 9.0M 12月 15 23:12 SRR003161.sra
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ more SRR002664.fasta 
>SRR002664.1 FC20KVN01EFCX9 length=192
TCAGCTCACGTCTGTAATCCTAGCATTTTGGGAGGCTGAGACGGGCAGAT
CACTTGAGGTCATGAGTTCGAGACCAGCCTGGCAACCATGGCGAAACCCT
GTCTCTACTAAAATACAAAATTAGCCAGGCATGGTGGCGCATGCCTGTCT
GAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG
>SRR002664.2 FC20KVN01ELL46 length=127
TCAGCAAAGAAAACAAATTCCTTTCTGGCACCACCTCAAAGAAGAATTTC
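
The number after `--fasta` is the line width. In the listings above, the 124-base read SRR003161.1 takes 7 lines at width 20 (six full lines plus 4 bases) and 3 lines at width 50 (50 + 50 + 24). A minimal sketch of that wrapping, with hypothetical function names, could look like this:

```python
def wrap_sequence(sequence, width):
    """Split a sequence into chunks of at most `width` bases."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def fasta_record(name, sequence, width):
    """Format one FASTA record: a '>' header followed by wrapped sequence lines."""
    return "\n".join([">" + name] + wrap_sequence(sequence, width)) + "\n"


# A placeholder 124-base sequence reproduces the line counts seen above.
placeholder = "A" * 124
print([len(chunk) for chunk in wrap_sequence(placeholder, 50)])
print(len(wrap_sequence(placeholder, 20)))
```
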
Verify again with FASTQ output:

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump  SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
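
One way to cross-check the "Written 487522 spots" message is to count the records in the resulting file. The sketch below assumes four lines per FASTQ record and one record per spot, which should hold for a plain run without the split options. The function name is hypothetical.

```python
def count_fastq_records(lines):
    """Count FASTQ records in an iterable of lines, assuming 4 lines per record."""
    line_count = sum(1 for _ in lines)
    if line_count % 4 != 0:
        raise ValueError("line count %d is not a multiple of 4" % line_count)
    return line_count // 4


# Example with a file on disk (not executed here):
#   with open("SRR002664.fastq") as handle:
#       print(count_fastq_records(handle))
demo = ["@r1\n", "ACGT\n", "+r1\n", "IIII\n", "@r2\n", "GG\n", "+r2\n", "II\n"]
print(count_fastq_records(demo))
```

Passing the open file handle streams the file line by line, so a file of several hundred megabytes does not need to fit in memory.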

5. Split options

These options separate the reads of paired-end sequencing files.

(1)--split-files: produces two FASTQ files

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-files SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll -h
total 924M
drwxrwxr-x 3 hadoop hadoop 4.0K  1月 13 14:05 ./
drwxrwxr-x 5 hadoop hadoop 4.0K  1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop 4.0K  1月 13 13:52 back/
-rw-rw-r-- 1 hadoop hadoop  44M  1月 13 14:05 SRR002664_1.fastq
-rw-rw-r-- 1 hadoop hadoop 291M  1月 13 14:05 SRR002664_2.fastq
-rw-rw-r-- 1 hadoop hadoop 291M  1月 13 14:02 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop  17M 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop 274M  1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop 9.0M 12月 15 23:12 SRR003161.sra

(2)--split-3

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-3 SRR002664
Rejected 487522 READS because of filtering out non-biological READS
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll
total 1192100
drwxrwxr-x 3 hadoop hadoop      4096  1月 13 14:21 ./
drwxrwxr-x 5 hadoop hadoop      4096  1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop      4096  1月 13 14:21 back/
-rw-rw-r-- 1 hadoop hadoop 304893796  1月 13 14:21 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop  16874064 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop  42893052  1月 13 14:16 SRR003161_1.fastq
-rw-rw-r-- 1 hadoop hadoop 559892770  1月 13 14:16 SRR003161_2.fastq
-rw-rw-r-- 1 hadoop hadoop 286773153  1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop   9353980 12月 15 23:12 SRR003161.sra


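The documentation describes the --split-3 option as follows:
Legacy 3-file splitting for mate-pairs: first biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads 3 and above are ignored.

In other words, if each spot in the SRA file holds only one biological read, the option has no splitting effect and the reads are written to a single *.fastq file. If each spot holds two reads, the pairs are separated into *_1.fastq and *_2.fastq. If a third file also appears, it holds the reads that have no mate, possibly because the data was filtered before submission and some reads were removed.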

Adapted from reference 【4】.
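
The routing rule in the help text quoted above can be modelled in a few lines. This is a simplified sketch of the described behaviour, not fastq-dump's actual implementation. The function name and the example reads are hypothetical.

```python
def route_split3(accession, biological_reads):
    """Return {output file name: read} for one spot, following the --split-3 text.

    biological_reads: the spot's biological reads, in order, after filtering.
    """
    if not biological_reads:
        return {}  # nothing to write for this spot
    if len(biological_reads) == 1:
        # A lone read goes to the unsuffixed file.
        return {accession + ".fastq": biological_reads[0]}
    # The first two reads form the pair; any further reads are ignored.
    first, second = biological_reads[:2]
    return {accession + "_1.fastq": first, accession + "_2.fastq": second}


print(route_split3("SRR002664", ["ACGT"]))
print(route_split3("SRR002664", ["ACGT", "TTGA"]))
```

Under this model, a run in which every spot has a single biological read yields only the unsuffixed file, which would be consistent with the listing above, where only SRR002664.fastq was written for SRR002664.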



(3)--split-spot

hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ../../code/sratoolkit.2.5.5-ubuntu64/bin/fastq-dump --split-spot SRR002664
Read 487522 spots for SRR002664
Written 487522 spots for SRR002664
hadoop@Mcnode1:~/cloud/adam/down/data/SRA$ ll
total 1236636
drwxrwxr-x 3 hadoop hadoop      4096  1月 13 14:53 ./
drwxrwxr-x 5 hadoop hadoop      4096  1月 12 21:31 ../
drwxrwxr-x 2 hadoop hadoop      4096  1月 13 14:21 back/
-rw-rw-r-- 1 hadoop hadoop 350498654  1月 13 14:54 SRR002664.fastq
-rw-r--r-- 1 hadoop hadoop  16874064 12月 15 22:13 SRR002664.sra
-rw-rw-r-- 1 hadoop hadoop  42893052  1月 13 14:16 SRR003161_1.fastq
-rw-rw-r-- 1 hadoop hadoop 559892770  1月 13 14:16 SRR003161_2.fastq
-rw-rw-r-- 1 hadoop hadoop 286773153  1月 13 13:56 SRR003161.fasta
-rw-r--r-- 1 hadoop hadoop   9353980 12月 15 23:12 SRR003161.sra

    --split-spot Split spots into individual reads.
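
To illustrate what "split spots into individual reads" means, the sketch below assumes a spot is held as one concatenated sequence together with the length of each read in it. That layout, the function name, and the example lengths are assumptions made for illustration. They are not taken from the SRA files used in this post.

```python
def split_spot(spot_sequence, read_lengths):
    """Cut one spot into consecutive reads of the given lengths."""
    if sum(read_lengths) != len(spot_sequence):
        raise ValueError("read lengths do not add up to the spot length")
    reads = []
    start = 0
    for length in read_lengths:
        reads.append(spot_sequence[start:start + length])
        start += length
    return reads


# Made-up spot: a 4-base leading read followed by a 6-base read.
print(split_spot("TCAGACGTAC", [4, 6]))
```

Because each read would then become its own record, the output after --split-spot can hold more records than the plain run, which may explain the larger SRR002664.fastq in the listing above.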





References:

【1】 http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=fastq-dump

【2】 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

【3】 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software#

【4】 http://www.bbioo.com/lifesciences/40-112832-1.html
