YXF - Ontology and Biological Data

Posted by weixin_34128411 on 2017-11-11

This has all felt rather fuzzy to me, so I'll start by writing down the parts I have figured out.

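  1. An ontology amounts to giving a thing or phenomenon one fixed name, so that everyone describes it with the same word and nobody gets confused. In other words, it defines terms. The two ontologies that matter here are SO and GO: SO (Sequence Ontology) names sequence features, and GO (Gene Ontology) names gene functions.
    Gene Ontology:
    It links a gene to one or more of its functions.
    It has three parts: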

  2. cellular component: where does the product exhibit its effect

  3. molecular function: how does it work

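  4. biological process: what is the purpose of the gene product
    The Gene Ontology is a directed acyclic graph (DAG), not a cycle: one term can be related to several other terms, for example by having more than one parent.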
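GO terms form a directed acyclic graph, in which a term may have more than one parent but following parent links never leads back to the starting term. A minimal sketch of that structure, using invented `TERM:*` identifiers rather than real GO accessions; the `PARENTS` table and the `ancestors` helper are hypothetical names for illustration:

```python
# Toy ontology: each term maps to its direct parents.
# The TERM:* identifiers are invented for illustration, not real GO IDs.
PARENTS = {
    "TERM:root": [],
    "TERM:A": ["TERM:root"],
    "TERM:B": ["TERM:root"],
    "TERM:C": ["TERM:A", "TERM:B"],  # one term with two parents
    "TERM:D": ["TERM:C"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("TERM:D")))
```

A gene annotated to `TERM:D` is implicitly annotated to all of its ancestors, which is why enrichment tools usually walk the graph upward in this way.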
    GO data:
    It consists of a gene ontology definition file and a gene association file.
    GO association file format: GAF format
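A GAF file is plain text: comment lines start with `!`, and each annotation line has 17 tab-separated columns in GAF 2.x. The sketch below reads such lines and groups GO IDs by gene symbol. The `EXAMPLE` lines are invented placeholders (database name, gene, and reference are not real records), and the function names are hypothetical:

```python
from collections import defaultdict

# Column names of a GAF 2.x annotation line, in order (17 fields).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf(lines):
    """Yield one dict per annotation line, skipping '!' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        yield dict(zip(GAF_COLUMNS, line.split("\t")))


def terms_by_gene(lines):
    """Map each gene symbol to the set of GO IDs annotated to it."""
    result = defaultdict(set)
    for record in parse_gaf(lines):
        result[record["db_object_symbol"]].add(record["go_id"])
    return dict(result)


# Invented example lines; every value except the GO IDs is a placeholder.
EXAMPLE = [
    "!gaf-version: 2.1",
    "\t".join(["ExampleDB", "X0001", "geneA", "", "GO:0008150", "REF:1",
               "IEA", "", "P", "", "", "protein", "taxon:9606",
               "20170101", "ExampleDB", "", ""]),
    "\t".join(["ExampleDB", "X0001", "geneA", "", "GO:0003674", "REF:1",
               "IEA", "", "F", "", "", "protein", "taxon:9606",
               "20170101", "ExampleDB", "", ""]),
]

if __name__ == "__main__":
    print(terms_by_gene(EXAMPLE))
```

The `aspect` column holds `P`, `F`, or `C`, matching the three parts of GO: biological process, molecular function, and cellular component.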
    Functional analysis:
    ORA (over-representation analysis): finds the functions that are over-represented in a list of genes
    FCS (functional class scoring):
    Gene set enrichment:
    The process of discovering common characteristics potentially present in a list of genes.
    Tools: AgriGO, DAVID, Panther, goatools, ermineJ, GOrilla, ToppFun
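Over-representation is commonly scored with a one-sided hypergeometric test: given a background of genes, how surprising is it to see this many genes with a given term in the list? The sketch below assumes that test; the tools listed above may use other statistics or apply multiple-testing corrections, and `ora_p_value` is a hypothetical name.

```python
from math import comb


def ora_p_value(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: number of genes in the background
    annotated:  background genes that carry the term
    sample:     size of the gene list under study
    overlap:    genes in the list that carry the term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    # Sum the probability of every outcome at least as extreme as observed.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Illustrative numbers only: 20000 background genes, 200 with the term,
    # a list of 100 genes, 10 of which carry the term.
    print(ora_p_value(20000, 200, 100, 10))
```

A small p-value suggests the term appears in the list more often than chance alone would explain. With many terms tested at once, the raw p-values need a correction before they are interpreted.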

  5. Data format
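    Major biological data resources today include GenBank, which is hosted by NCBI.
    DNA sequence databases are coordinated by INSDC (International Nucleotide Sequence Database Collaboration), whose members are NCBI, EMBL, and DDBJ.
    The protein sequence database is UniProt (Universal Protein Resource).
    In addition, PDB (Protein Data Bank) is the repository of 3D structures of biological macromolecules.
    Automate data access:
    Sequencing data formats: GenBank, FASTA, FASTQ
    FASTA format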

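  6. A record starts with ">"

  7. The ">" is followed by a string of characters, the sequence ID

  8. The header may also include some descriptive text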
    Some rules:

  9. Sequence lines should not be too long

  10. The sequence lines should wrap at the same width

  11. Use upper-case letters
    Some FASTA headers carry structured information.
    Lower-case letters may be used to mark repetitive regions of a genome.
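The FASTA layout and rules above can be sketched as a small reader and writer. This is an illustration under simple assumptions (the ID is the first whitespace-delimited token of the header, everything after it is free text), and the function names are hypothetical:

```python
def _build(header, chunks):
    # The ID is the first whitespace-delimited token; the rest is free text.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description, "".join(chunks)


def parse_fasta(lines):
    """Yield (seq_id, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _build(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _build(header, chunks)


def format_fasta(seq_id, description, sequence, width=60):
    """Write one record with all sequence lines wrapped at the same width."""
    header = ">" + seq_id + (" " + description if description else "")
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + wrapped)


if __name__ == "__main__":
    toy = [">seq1 toy record", "ACGT", "acgt", ">seq2", "GG"]
    for record in parse_fasta(toy):
        print(record)
```

The reader keeps the case of each base as it finds it, so lower-case runs that mark repeats survive a round trip through `parse_fasta` and `format_fasta`.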
    FASTQ format
    Each record has four parts:

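  12. A header line that starts with "@"

  13. The sequence itself

  14. A "+" symbol, optionally followed by the same ID as on the first line

  15. Characters that encode the quality of each base in the second part; this line has the same length as the sequence line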
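The four-part FASTQ record can be read four lines at a time. The sketch below assumes each record occupies exactly four lines and that qualities use the Phred+33 encoding; older files may use a different offset, so treat `offset=33` as an assumption. The function names are hypothetical:

```python
def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def parse_fastq(lines):
    """Yield (read_id, sequence, scores) from four-line FASTQ records."""
    rows = [line.rstrip("\r\n") for line in lines]
    rows = [row for row in rows if row]
    if len(rows) % 4:
        raise ValueError("FASTQ input is not a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, sequence, plus, quality = rows[i:i + 4]
        record_number = i // 4 + 1
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record {record_number}")
        # Part four must be exactly as long as part two.
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in record {record_number}")
        read_id = (header[1:].split() or [""])[0]
        yield read_id, sequence, phred_scores(quality)


if __name__ == "__main__":
    toy = ["@r1 extra\n", "ACGT\n", "+\n", "!!II\n"]
    for record in parse_fastq(toy):
        print(record)
```

Reading by position rather than by leading character matters here, because "@" and "+" are also valid quality characters and can appear at the start of the fourth line.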

  16. How to get data
    Where to get data: NCBI, Ensembl, BioMart, UCSC Table Browser
    FASTA/Q manipulation
    Get an overview of the data:
    seqkit stat *.gz
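To make clear what such an overview contains, here is a rough Python sketch that computes the same kind of summary (sequence count, total, minimum, average, and maximum length) for one gzipped FASTA file. It is not a substitute for seqkit; the field names follow the columns seqkit prints, while `fasta_lengths` and `summarize` are hypothetical names, and FASTQ input is not handled:

```python
import gzip


def fasta_lengths(handle):
    """Yield the length of each sequence in an open FASTA text handle."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(path):
    """Return count and length statistics for a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        lengths = list(fasta_lengths(handle))
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0,
                "avg_len": 0.0, "max_len": 0}
    total = sum(lengths)
    return {
        "num_seqs": len(lengths),
        "sum_len": total,
        "min_len": min(lengths),
        "avg_len": total / len(lengths),
        "max_len": max(lengths),
    }
```

Calling `summarize("reads.fa.gz")` on a file of your own (the file name here is only a placeholder) returns a dictionary with those five fields.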
    There are too many FASTA/Q manipulations to cover here, so I only list what you can do with a FASTA/Q file; the answers are in Chapter 7 of the Biostar Handbook.
    How to get the GC content of every sequence in a FASTA/Q file?
    How to extract a subset of sequences from a FASTA/Q file with name/ID list file?
    How to find FASTA/Q sequences containing degenerate bases and locate them?
    How to remove FASTA/Q records with duplicated sequences?
    How to locate motif/subsequence/enzyme digest sites in FASTA/Q sequence?
    How to sort a huge number of FASTA sequences by length?
    How to split FASTA sequences according to information in the header?
    How to search and replace within a FASTA header using character strings from a text file?
    How to extract paired reads from two paired-end reads files?
    How to concatenate two FASTA sequences into one?
    You can follow the answers in the Biostar Handbook if you want to do any of the above.
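To give a feel for what a few of these operations compute, here is a small in-memory Python sketch covering GC content, duplicate removal, sorting by length, and selecting records by ID. These are not the handbook's answers; the function names are hypothetical, records are assumed to be `(seq_id, sequence)` tuples, and holding everything in memory would not suit a truly huge file:

```python
def gc_content(sequence):
    """Fraction of G and C bases, ignoring case; 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def remove_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for seq_id, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((seq_id, sequence))
    return unique


def sort_by_length(records):
    """Return records ordered from shortest to longest sequence."""
    return sorted(records, key=lambda record: len(record[1]))


def select_by_id(records, wanted_ids):
    """Keep only the records whose ID appears in wanted_ids."""
    wanted = set(wanted_ids)
    return [record for record in records if record[0] in wanted]


if __name__ == "__main__":
    toy = [("a", "ACGT"), ("b", "acgt"), ("c", "GC")]
    for seq_id, sequence in remove_duplicates(toy):
        print(seq_id, gc_content(sequence))
```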
