定義基因區域

weixin_34402408發表於2018-10-23

defining_genomic_regions

learing

Defining genomic regions


First let’s install the latest version of BEDTools:

git clone https://github.com/arq5x/bedtools2.git
cd bedtools2
make clean; make all
bin/bedtools --version
#bedtools v2.20.1-4-gb877b35
cd ..

Now download the GENCODE version 19 (which is currently the latest update):

v=19
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_$v/gencode.v$v.annotation.gtf.gz
 
#what does this file look like?
zcat gencode.v19.annotation.gtf.gz | head
##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-12-05
chr1    HAVANA  gene    11869   14412   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1;  exon_id "ENSE00002234944.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2;  exon_id "ENSE00003582793.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3;  exon_id "ENSE00002312635.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

In the third column of the GTF file corresponds to the feature type. How many different types of features are there?

zcat gencode.v19.annotation.gtf.gz | grep -v "^#" | cut -f3 | sort | uniq -c | sort -k1rn
1196293 exon
 723784 CDS
 284573 UTR
 196520 transcript
  84144 start_codon
  76196 stop_codon
  57820 gene
    114 Selenocysteine

So the exonic regions are already defined; however, I would like to remove overlapping exons. The mergeBed utility in BEDTools can accomplish this; however before merging make sure the coordinates are sorted. We can use sortBed to sort the BED file:

v=19
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/mergeBed -i - | gzip > gencode_v${v}_exon_merged.bed.gz

To define intronic regions, we just need to subtract the exonic region from the genic region. The utility subtractBed can do this:

v=19
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/subtractBed -a stdin -b gencode_v${v}_exon_merged.bed.gz |
gzip > gencode_v${v}_intron.bed.gz
 
#let's intersect the two files
#this shouldn't produce any output
intersectBed -a gencode_v${v}_exon_merged.bed.gz -b gencode_v${v}_intron.bed.gz

And finally to define intergenic regions, we use complementBed to find regions not covered by genes.

v=19
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
        "select chrom, size from hg19.chromInfo"  > hg19.genome
 
zcat gencode.v$v.annotation.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools2/bin/sortBed |
bedtools2/bin/complementBed -i stdin -g hg19.genome |
gzip > gencode_v${v}_intergenic.bed.gz

And that’s it.
[圖片上傳失敗...(image-8bf86a-1540296254859)]

  • Here’s a schematic showing the steps we performed above.

Genome coverage

How much of the genome does exons, introns, and intergenic regions cover?

#!/bin/env perl
 
use strict;
use warnings;
 
my $v = 19;
 
my $exon_file       = "gencode_v${v}_exon_merged.bed.gz";
my $intergenic_file = "gencode_v${v}_intergenic.bed.gz";
my $intron_file     = "gencode_v${v}_intron.bed.gz";
 
my $exon_coverage       = coverage($exon_file);
my $intergenic_coverage = coverage($intergenic_file);
my $intron_coverage     = coverage($intron_file);
 
my $total = $exon_coverage + $intergenic_coverage + $intron_coverage;
 
printf "Exon: %.2f\n", $exon_coverage*100/$total;
printf "Intron: %.2f\n", $intron_coverage*100/$total;
printf "Intergenic: %.2f\n", $intergenic_coverage*100/$total;
 
sub coverage {
   my ($infile) = @_;
   my $coverage = 0;
   open(IN,'-|',"zcat $infile") || die "Could not open $infile: $!\n";
   while(<IN>){
      chomp;
      my ($chr, $start, $end) = split(/\t/);
      my $c = $end - $start;
      $coverage += $c;
   }
   close(IN);
   return($coverage);
}
 
exit(0);

Running this script:

coverage.pl
Exon: 3.72
Intron: 48.25
Intergenic: 48.03

I had initially thought that the intergenic region would have the highest percent coverage in the genome compared to introns and exons.

The untranslated region (UTR)

I got a question (see comments) regarding the untranslated region (UTR). To illustrate my answer, I’ll use the transcript ENST00000335295 as an example:

v=19
zcat gencode.v$v.annotation.gtf.gz | grep ENST00000335295
chr11   HAVANA  transcript      5246694 5248301 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  exon    5248160 5248301 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1;  exon_id "ENSE00001829867.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  CDS     5248160 5248251 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1;  exon_id "ENSE00001829867.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  start_codon     5248249 5248251 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 1;  exon_id "ENSE00001829867.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  exon    5247807 5248029 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2;  exon_id "ENSE00001057381.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  CDS     5247807 5248029 .       -       1       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 2;  exon_id "ENSE00001057381.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  exon    5246694 5246956 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3;  exon_id "ENSE00001600613.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  CDS     5246831 5246956 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3;  exon_id "ENSE00001600613.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  stop_codon      5246828 5246830 .       -       0       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; exon_number 3;  exon_id "ENSE00001600613.2";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  UTR     5248252 5248301 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";
chr11   HAVANA  UTR     5246694 5246830 .       -       .       gene_id "ENSG00000244734.2"; transcript_id "ENST00000335295.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "HBB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "HBB-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS7753.1"; havana_gene "OTTHUMG00000066678.3"; havana_transcript "OTTHUMT00000142977.2";

This transcript (HBB) starts at 5248301 and ends at 5246694 on chromosome 11 (the coordinates are reversed because it’s on the negative strand). We see two lines with the UTR feature for this transcript, one starting at 5248301 and the other at 5246830.

So the coordinates for the 5′ UTR are chr11:5248252-5248301 and the 3′ UTR are chr11:5246694-5246830. Perhaps I should have picked a transcript that’s on the positive strand, however this was one example I knew where the transcript is short and well defined (contains the start and end codons, and the 5′ and 3′ UTRs). Which now leads me to wonder, how many transcripts have defined UTRs?

#!/bin/env perl
 
use strict;
use warnings;
 
my $usage = "Usage: $0 <infile.annotation.gtf.gz>\n";
my $infile = shift or die $usage;
 
if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}
 
my %transcript = ();
my $current_transcript = '';
 
while(<IN>){
   chomp;
   next if (/^#/);
   #chr11   HAVANA  transcript      65265233        65273940        .       +       .       gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
 
   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';
 
   if ($type eq 'transcript'){
      foreach my $blah (@annotation){
         my ($type,$name) = split(/\s+/,$blah);
         if ($type eq 'transcript_id'){
            $current_transcript = $name;
            $current_transcript =~ s/"//g;
            $transcript{$current_transcript} = 0;
         }
      }
      if ($current_transcript eq 'none'){
         die "No name for entry $.\n";
      }
   }
 
   if ($type eq 'UTR'){
      $transcript{$current_transcript}++;
   }
}
close(IN);
 
foreach my $transcript (keys %transcript){
   print "$transcript\t$transcript{$transcript}\n";
}
 
exit(0);

Now to run the script:

check_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr_number.out.gz
 
zcat transcript_utr_number.out | wc -l
196520
 
zcat transcript_utr_number.out | cut -f2 | sort | uniq -c | sort -k1rn | head
 103800 0
  34999 2
  24234 3
  11543 1
  10352 4
   4493 5
   2264 6
   1297 7
    796 8
    583 9
 
zcat transcript_utr_number.out | awk '$2==57 {print}'
ENST00000264319.7       57

Running the script above, I find that over half of the GENCODE transcripts (103,800/196,520) don’t have a defined UTRs. 34,999 transcripts have 2 defined UTRs, 24,234 have 3 because one of the UTR is spliced. The transcript ENST00000264319.7 had 57 UTR lines, which was the most UTR annotations for a transcript.

Defining the 5′ and 3′ UTRs

I’m not entirely sure why the UTRs weren’t separated into 5′ and 3′, especially when GENCODE went through the trouble of labelling the start_codon and the stop_codon. I’m going to assume that if a UTR is closer to the starting position of the transcript it’s the 5′ UTR and vice versa.

#!/bin/env perl
 
use strict;
use warnings;
 
my $usage = "Usage: $0 <infile.annotation.gtf>\n";
my $infile = shift or die $usage;
 
if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}
 
my %transcript = ();
my $current_transcript = '';
my $transcript_start = 0;
my $transcript_end = 0;
 
while(<IN>){
   chomp;
   next if (/^#/);
   #chr11   HAVANA  transcript      65265233        65273940        .       +       .       gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
 
   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';
 
   if ($type eq 'transcript'){
      foreach my $blah (@annotation){
         my ($type,$name) = split(/\s+/,$blah);
         if ($type eq 'transcript_id'){
            $current_transcript = $name;
            $current_transcript =~ s/"//g;
            $transcript_start = $start;
            $transcript_end = $end;
         }
      }
      if ($current_transcript eq 'none'){
         die "No name for entry $.\n";
      }
   }
 
   if ($type eq 'UTR'){
      my $region = '';
      if ($strand eq '+'){
         my $dis_to_start = abs($start - $transcript_start);
         my $dis_to_end = abs($start - $transcript_end);
         $region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR';
      } else {
         my $dis_to_start = abs($end - $transcript_end);
         my $dis_to_end = abs($end - $transcript_start);
         $region = $dis_to_start < $dis_to_end ? '5_UTR' : '3_UTR';
      }
      print join ("\t", $chr, $start, $end, $region, $current_transcript, $strand),"\n";
   }
}
close(IN);
 
exit(0);

To run the script:

print_utr.pl gencode.v19.annotation.gtf.gz | gzip > transcript_utr.bed.gz
zcat transcript_utr.bed.gz | head
chr1    70006   70008   3_UTR   ENST00000335137.3       +
chr1    137621  138532  5_UTR   ENST00000423372.3       -
chr1    139310  139379  5_UTR   ENST00000423372.3       -
chr1    134901  135802  3_UTR   ENST00000423372.3       -
chr1    367640  367658  5_UTR   ENST00000426406.1       +
chr1    368595  368634  3_UTR   ENST00000426406.1       +
chr1    621059  621098  3_UTR   ENST00000332831.2       -
chr1    622035  622053  5_UTR   ENST00000332831.2       -
chr1    819981  819983  3_UTR   ENST00000594233.1       +
chr1    860260  860328  5_UTR   ENST00000420190.1       +

Basically the script gets the coordinates of the transcript, and whenever there is an UTR annotation, it checks to see if the UTR coordinates are closer to the start or the end of the transcript.

Promoter region

Here’s a script that reads the GENCODE annotation file, obtains the starting positions of each transcript, adds some user defined padding around this starting position and outputs it as a bed file. The starting position plus some padding is defined as the promoter region. Note that this script assumes that the hg19.genome file is available in the same directory as the script.

#!/bin/env perl
 
use strict;
use warnings;
 
my $usage = "Usage: $0 <infile.annotation.gtf> <padding>\n";
my $infile = shift or die $usage;
my $span = shift or die $usage;
 
if ($span !~ /^\d+$/){
   die "Please enter a numeric value for the padding\n";
}
 
my $hg19 = 'hg19.genome';
my %hg19 = ();
 
open(IN,'<',$hg19) || die "Could not open $hg19: $!\n";
while(<IN>){
   chomp;
   #chr9_gl000201_random    36148
   my ($chr, $end) = split(/\t/);
   $hg19{$chr} = $end;
}
close(IN);
 
if ($infile =~ /\.gz/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}
 
while(<IN>){
   chomp;
   next if (/^#/);
   #chr11   HAVANA  transcript      65265233        65273940        .       +       .       gene_id "ENSG00000251562.3"; transcript_id "ENST00000534336.1"; gene_type "processed_transcript"; gene_status "KNOWN"; gene_name "MALAT1"; transcript_type "non_coding"; transcript_status "KNOWN"; transcript_name "MALAT1-001"; level 2; havana_gene "OTTHUMG00000166322.1"; havana_transcript "OTTHUMT00000389143.1";
   my ($chr,$source,$type,$start,$end,$score,$strand,$phase,$annotation) = split(/\t/);
   next unless $type eq 'transcript';
   my @annotation = split(/;\s/,$annotation);
   my $transcript_id = 'none';
   foreach my $blah (@annotation){
      my ($type,$name) = split(/\s+/,$blah);
      if ($type eq 'transcript_id'){
         $transcript_id = $name;
         $transcript_id =~ s/"//g;
      }
   }
   if ($transcript_id eq 'none'){
      die "No name for entry $.\n";
   }
   my $promoter_start = '';
   my $promoter_end = '';
   if ($strand eq '+'){
      $promoter_start = $start - $span;
      $promoter_end = $start + $span;
   } else {
      $promoter_start = $end - $span;
      $promoter_end = $end + $span;
   }
   if ($promoter_start < 0){
      warn "Adjusted promoter start to 0\n";
      $promoter_start = 0;
   } elsif ($promoter_end > $hg19{$chr}){
      warn "Adjusted promoter end to $hg19{$chr}\n";
      $promoter_end = $hg19{$chr};
   }
   print join("\t",$chr,$promoter_start,$promoter_end,$transcript_id,0,$strand),"\n";
}
close(IN);
 
exit(0);

Running the script:

promoter.pl gencode.v19.annotation.gtf.gz 200  | head
chr1    11669   12069   ENST00000456328.2       0       +
chr1    11672   12072   ENST00000515242.2       0       +
chr1    11674   12074   ENST00000518655.2       0       +
chr1    11810   12210   ENST00000450305.2       0       +
chr1    29170   29570   ENST00000438504.2       0       -
chr1    24686   25086   ENST00000541675.1       0       -
chr1    29170   29570   ENST00000423562.1       0       -
chr1    29370   29770   ENST00000488147.1       0       -
chr1    29606   30006   ENST00000538476.1       0       -
chr1    29354   29754   ENST00000473358.1       0       +

See also
See the BEDTools documentation for cool usage examples.

All code used in this post is available at https://github.com/davetang/defining_genomic_regions.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License.
SHARE THIS:
TwitterGoogleFacebookLinkedInEmail
LIKE THIS:
RELATED
GENCODE
September 12, 2012
In "bioinformatics"
Getting acquainted with analysing DNA sequencing data
December 16, 2015
In "genomics"
ENCODE RNA polymerase II ChIP-seq
March 23, 2014
In "bioinformatics"
Posted in genomics
Tagged bedtools, genome
22 COMMENTS ADD YOURS
ASAKI says:
July 22, 2013 at 10:33 pm
Hi Dave,

Great resources and tutorials to extract the genomic regions from the GENCODE file.

Here, I want to look into the 3’UTR and 5’UTR regions in the gencode gtf file
altough the gtf file has a thrid column containg UTR info, it does not seperate the 3’UTR and 5’UTR.

Moreover, in UCSC browser, it offers us the 3’UTR&5’UTR bed file for V14 gencode.
So my question is how can we find this info from the gencode file from http://www.gencodegenes.org/releases/17.html, i.e., V17, the latest version gtf file?

Thanks

REPLY
DAVO says:
July 22, 2013 at 11:37 pm
Hi Asaki,

Thank you for the compliment. As for the UTRs, since we know the orientation of the gene/transcript, from the gtf file, we know whether the UTR is at the 5′ or 3′ end.

Bioconductor contains GENCODE genes, but it is not the latest version (since it’s derived from the UCSC Genome Browser).

It wouldn’t be too difficult to write a Perl script that extracts UTRs based on the gene orientation but some people are against writing custom scripts. I’m fine with custom scripts as long as they work as expected and are shared.

Cheers,

Dave

REPLY
Pingback: Sequence conservation in vertebrates - Musings from a PhD candidate
SAAD says:
February 2, 2014 at 10:26 pm
How about doing the same thing on a plant genome or for other non model organism for which something like gencode is not available.

REPLY
DAVO says:
February 2, 2014 at 11:12 pm
Yeah just use another set of transcript annotations and adapt the same concept.

REPLY
SAAD KHAN says:
February 14, 2014 at 12:04 am
What if the organism does not have annotations of UTR and also how to get it in a format like gencode has. For example I have this file :-

ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Gmax/annotation/Gmax_189_gene.gff3.gz

what additional steps do I need to follow to get it into a format similar to gencode.

Also gencode has lot of annotations that my file does not so your script does not seems to work with it.

REPLY
DAVO says:
February 17, 2014 at 1:49 am
Hi Saad,

I downloaded the file and had a look; it has annotations for:

CDS
five_prime_UTR
gene
mRNA
three_prime_UTR

It seems the UTRs are defined already?

Cheers,

Dave

REPLY
SAAD KHAN says:
February 17, 2014 at 2:30 am
Hi Dave,

I was more interested to generate a file similar to what UCSC generates. Unfortunately my organism is not there in UCSC browser.

This is the kind of output I am looking for. Do you think if there are already tools to do that or do I have to write a script for it myself.

output by ucsc : –

chr1 14969 15038 NR_024540_utr3_1_0_chr1_14970_r 0 – NR_024540 WASH7P 653635 pseudogene

REPLY
ALOK says:
July 22, 2014 at 7:23 am
Thank you very much for the above executed scripts, i was able to cross check my ouput with that of bedtools. I have no transcript info. All i have is exon and cds.The difference lowest start point of cds and lowest start point of exon will be my 5’UTR and difference in highest will be my 3’UTR.
The following is my GFF file can you please help me extract the UTR regions?
Ca1 Gnomon gene 254 741 . – . ID=gene0;Name=LOC101497325;Dbxref=GeneID:101497325;gbkey=Gene;gene=LOC101497325
Ca1 Gnomon mRNA 254 741 . – . ID=rna0;Name=XM_004486173.1;Parent=gene0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon exon 602 741 . – . ID=id1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon exon 254 506 . – . ID=id2;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XM_004486173.1;gbkey=mRNA;gene=LOC101497325;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;transcript_id=XM_004486173.1
Ca1 Gnomon CDS 602 663 . – 0 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
Ca1 Gnomon CDS 254 506 . – 1 ID=cds0;Name=XP_004486230.1;Parent=rna0;Dbxref=GeneID:101497325,Genbank:XP_004486230.1;gbkey=CDS;product=type II inositol 1%2C4%2C5-trisphosphate 5-phosphatase FRA3-like;protein_id=XP_004486230.1
Ca1 Gnomon gene 18089 20552 . – . ID=gene1;Name=LOC101488339;Dbxref=GeneID:101488339;gbkey=Gene;gene=LOC101488339

Ca1 Gnomon gene 165984 168949 . + . ID=gene10;Name=LOC101493009;Dbxref=GeneID:101493009;gbkey=Gene;gene=LOC101493009
Ca1 Gnomon mRNA 165984 168949 . + . ID=rna18;Name=XM_004485362.1;Parent=gene10;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon exon 165984 166302 . + . ID=id169;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon exon 167289 168949 . + . ID=id170;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XM_004485362.1;gbkey=mRNA;gene=LOC101493009;product=uncharacterized LOC101493009;transcript_id=XM_004485362.1
Ca1 Gnomon CDS 167476 168681 . + 0 ID=cds17;Name=XP_004485419.1;Parent=rna18;Dbxref=GeneID:101493009,Genbank:XP_004485419.1;gbkey=CDS;product=uncharacterized protein LOC101493009;protein_id=XP_004485419.1

REPLY
DAVO says:
July 22, 2014 at 8:01 am
Hi Alok,

if I understood you correctly, then the 5′ UTR for LOC101497325 is Ca1:663-741 and the 3′ UTR is Ca1:254-506 and for LOC101493009 the 5′ UTR is Ca1:165984-166302 and the 3′ UTR is Ca1:168681-168949.

Cheers,

Dave

REPLY
JULIEN says:
September 16, 2014 at 8:37 am
Thanks Dave for this very nice post.
I found a small in the conversion of GTF format to BED format. GTF has 1-based start coordinates, and BED has 0-based start coordinates. Thus the awk command to convert should for example be:

zcat gencode.v$v.annotation.gtf.gz |
awk ‘BEGIN{OFS=”\t”;} $3==”exon” {print $1,$4-1,$5}’
etc.

Thanks
Julien

REPLY
DAVO says:
September 16, 2014 at 8:40 am
Thanks Julien! This off-by-one error caused by 0- and 1-based coordinates is probably the most common bioinformatic error ?

REPLY
BRUCE says:
November 26, 2014 at 4:48 pm
Hi Dave,

nice piece of work, thanks. Wondering about your thoughts on how to avoid issues when merging the exons due to overlapping genes on each strand. How do you then know which gene/transcript is aligned to?

Thanks,

Bruce.

REPLY
DAVO says:
November 27, 2014 at 5:33 am
Hi Bruce,

thanks for the comment. This post was definitely simplifying the task of annotating mapped reads, which as you point out can get quite complicated (and messy).

One approach a colleague took, was to create a table for mapped reads (or clusters of reads), where each column was a genomic property, e.g. antisense to promoter, sense to exon, etc., and each column would be filled with the transcript ID that fulfilled that property. At the end, he could be able to tell the properties of a read. Needless to say he didn’t merge exons, like I did, or else he would lose information.

Cheers,

Dave

REPLY
JAY says:
March 27, 2015 at 11:52 am
Can you share your intergenicRegion.bed file?

REPLY
BENT says:
October 26, 2015 at 8:12 am
Thanks a lot, Dave.
It’s a wonderful post. While when I download gencode.v20.annotation.gtf.gz and so forth, it gives the error messages as below,
====
Checking and Creating Intergenic Regions
Warning: end of BED entry exceeds chromosome length. Please correct.
====
I also check the entry count for each output.
====
152320 gencode.v19.annotation.gtf.gz
8543 gencode_v19_exon_merged.bed.gz
575 gencode_v19_intergenic.bed.gz
6805 gencode_v19_intron.bed.gz
8060 promoter.bed.gz
8243 transcript_utr.bed.gz
3292 transcript_utr_number.out.gz
====
158000 gencode.v23.annotation.gtf.gz
8116 gencode_v23_exon_merged.bed.gz
187 gencode_v23_intergenic.bed.gz
6998 gencode_v23_intron.bed.gz
8731 promoter.bed.gz
8234 transcript_utr.bed.gz
3068 transcript_utr_number.out.gz
====

The gencode_v23_intergenic.bed.gz is much smaller than gencode_v19_intergenic.bed.gz, is it correct output?

Thanks a lot for your comment.

REPLY
DAVO says:
October 26, 2015 at 10:15 am
Hi Ben,

GENCODE version 19 is the last version of GENCODE that uses the hg19 reference genome; that’s why you’re getting the warnings (a good thing!).

In the MySQL command, change it to: “select chrom, size from hg38.chromInfo” and save the output as hg38.genome and change the workflow accordingly.

Cheers,

Dave

REPLY
MAYANK KAASHYAP says:
July 27, 2016 at 6:04 am
Hi Dave,
Very nice post. All goes well except when I try to isolate intergenic regions. I get an error: Less than the req’d two fields were encountered in the genome file. May I please request you to help me!

REPLY
DAVO says:
July 28, 2016 at 6:38 am
Can you confirm that your genome file looks something like this?


cat hg19.genome | head
chrom size
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chrX 155270560
chr8 146364022

REPLY
MAYANK KAASHYAP says:
July 28, 2016 at 10:59 am
Hi Dave,
Thanks so much. It works very well now!
I was put Ca1 instead refseq Sequence NC_021160.1.
Chromosome Size
NC_021160.1 48359943
NC_021161.1 36634854
NC_021162.1 39989001
NC_021163.1 49191682
NC_021164.1 48169137
NC_021165.1 59463898
NC_021166.1 48961560
NC_021167.1 16477302

REPLY
MAYANK KAASHYAP says:
May 16, 2017 at 10:34 am
Hi Dave,
Thansk for this. How can I cite you.

REPLY
DAVO says:
May 17, 2017 at 1:44 am
Hi Mayank,

you can use the URL of the post if you really want to cite me. Otherwise, just cite:

https://www.ncbi.nlm.nih.gov/pubmed/20110278

Cheers,

Dave

REPLY
Leave a Reply
Your email address will not be published. Required fields are marked *

Comment 

Name * 

Email * 

Website 

Save my name, email, and website in this browser for the next time I comment.

  Notify me of follow-up comments by email.

 Notify me of new posts by email.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Search for:
Search …
WHO'S ONLINE
33 visitors online now
23 guests, 10 bots, 0 members
Map of Visitors
Creative Commons License

相關文章