Friday, 06 April 2012 at 10:11 am

GFF and GTF are data formats heavily used for storing annotation information. It's common to see these two formats used interchangeably. However, GFF (general feature format) is meant for any genomic feature, while GTF (gene transfer format) is used strictly for genes.

The two formats are very similar, which makes conversion pretty simple. The only problem in conversion arises when the lines of the GFF file are not arranged in feature blocks. This entry will show you the differences between the two formats and how to convert between them.


Data format

Both GFF and GTF are tab-delimited files consisting of 9 columns. The first 8 columns are the same in both formats, except for a couple of version-specific requirements which I will get to. Briefly, here are descriptions and examples of the first 8 columns:

  1. reference sequence name (chromosome1,refContig1,sequence1)
  2. source of annotation (pfam,blast2go,interpro,est)
  3. type of feature (gene,exon,start_codon,cds,mRNA,zinc_finger,conserved_region)
  4. 1-based, inclusive start coordinate (integer > 0)
  5. 1-based, inclusive end coordinate (integer > 0)
  6. score
  7. strand (+,-,.)
  8. frame (0,1,2)

The start/end coordinates are 1-based, meaning the first base is 1. In contrast, a 0-based coordinate system would start at 0. Inclusive coordinates mean that the feature includes both the start and end positions. For example, a feature with coordinates 10-15 covers positions 10, 11, 12, 13, 14, and 15. In contrast, exclusive coordinates of 10-15 would cover only positions 11, 12, 13, and 14.
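
To make the convention concrete, here is a minimal Python sketch. The helper names (feature_length, to_zero_based_half_open) and the toy sequence are made up for illustration. The 0-based, half-open convention is the one Python slices use, which is why the conversion matters when you pull out the sequence of a feature.

```python
def feature_length(start, end):
    # 1-based, inclusive coordinates: both endpoints count
    return end - start + 1

def to_zero_based_half_open(start, end):
    # shift the start down by one; the end value stays the same
    return start - 1, end

# toy sequence, only here to show the slicing
seq = "ACGTACGTACGTACGTACGT"

start, end = 10, 15
s, e = to_zero_based_half_open(start, end)
sub = seq[s:e]

print(feature_length(start, end))  # number of bases in the feature
print(sub)                         # bases 10 through 15 of the toy sequence
```

The slice has the same number of bases as feature_length reports, which is a handy sanity check when converting coordinates.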

The score column in a GFF file is meant to be used as a probability value for the presence of this feature. It is not used in the GTF format. The strand is the directionality of the feature relative to the reference sequence. It can be a '+' for forward, '-' for reverse, or '.' for unknown. 

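The frame column applies to coding region features. It takes the value 0, 1, or 2, which gives the number of bases to skip from the start of the feature to reach the first base of the next complete codon. Features that are not coding regions use '.' in this column.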
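
As a rough illustration of the column layout, here is a small Python sketch that splits one line into the 9 columns and converts the numeric ones. The Feature tuple and parse_line function are names I made up for this example, and treating '.' as "no value" for score and frame is an assumption based on the example lines in this post.

```python
from collections import namedtuple

# hypothetical container for one annotation line
Feature = namedtuple(
    "Feature",
    "seqname source feature_type start end score strand frame attributes",
)

def parse_line(line):
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(cols))
    seqname, source, feature_type, start, end, score, strand, frame, attributes = cols
    return Feature(
        seqname,
        source,
        feature_type,
        int(start),
        int(end),
        None if score == "." else float(score),  # '.' means no score
        strand,
        None if frame == "." else int(frame),    # '.' means no frame
        attributes,
    )

example = "Contig01\tPFAM\texon\t501\t650\t.\t+\t2\tID=exonA1;Parent=geneA"
feature = parse_line(example)
print(feature.feature_type, feature.start, feature.end)
```

The 9th column is left as a raw string here because its layout differs between GFF and GTF.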


GFF

GFF is a generic feature format that can be used to describe any genomic feature. However, I will only talk about GFF in terms of gene features since we want to convert to GTF, which is a gene annotation format. The GFF format is currently on version 3. GFF allows commenting using the "#" symbol. For example some GFF files may have a line indicating what version it is and date it was created:

##gff-version 2
##created 11/11/11 

Columns 1 - 8 of a GFF file are as described above. Note that the feature type (column 3) is not controlled. You can use any feature name, although for the sake of convention it is better to use sensible names like gene, exon, transcript, or mRNA.

The 9th and last column is the attributes column. It contains information about the feature itself. From version 2 onward this column holds tag/value pairs; in version 3, the tag and value are delimited by a '=' sign. Multiple attributes are delimited by a semi-colon. For example:

ID=geneAExon1;Name=geneA;Parent=geneA;Organism=human

Note that there is no semi-colon at the end of this column. 
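
Here is a minimal sketch of how that column could be turned into a Python dictionary. The function name parse_gff3_attributes is my own, and the sketch assumes values never contain a ';' or '=' character themselves.

```python
def parse_gff3_attributes(column):
    """Split a GFF3-style attributes column into a tag -> value dict."""
    attributes = {}
    for pair in column.strip().split(";"):
        if not pair:
            # tolerate a stray trailing semi-colon
            continue
        tag, _, value = pair.partition("=")
        attributes[tag.strip()] = value.strip()
    return attributes

attrs = parse_gff3_attributes("ID=geneAExon1;Name=geneA;Parent=geneA;Organism=human")
print(attrs["ID"], attrs["Parent"])
```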

The latest version (v3) of the GFF specification defines a set of pre-defined attributes, which are listed in the specification itself. A general rule of thumb is to always include an ID attribute and to make sure IDs are unique within the GFF file. Another important attribute is 'Parent'. The value of the 'Parent' attribute indicates that the current feature is a sub-feature of the feature with that ID. For example, multiple exons with unique IDs can have a gene feature as their parent:

Contig01  PFAM  gene  501  750  .  +  0  ID=geneA;Name=geneA
Contig01  PFAM  exon  501  650  .  +  2  ID=exonA1;Parent=geneA
Contig01  PFAM  exon  700  750  .  +  2  ID=exonA2;Parent=geneA
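
The Parent relationship in the example above can be collected into a simple lookup table. This is only a sketch: get_attribute is a helper I made up, the lines are typed in with explicit tabs, and each feature is assumed to have at most one parent.

```python
from collections import defaultdict

lines = [
    "Contig01\tPFAM\tgene\t501\t750\t.\t+\t0\tID=geneA;Name=geneA",
    "Contig01\tPFAM\texon\t501\t650\t.\t+\t2\tID=exonA1;Parent=geneA",
    "Contig01\tPFAM\texon\t700\t750\t.\t+\t2\tID=exonA2;Parent=geneA",
]

def get_attribute(column, tag):
    # return the value of one tag from a GFF3 attributes column, or None
    for pair in column.split(";"):
        key, _, value = pair.partition("=")
        if key == tag:
            return value
    return None

# map each parent ID to the IDs of its sub-features
children = defaultdict(list)
for line in lines:
    attributes = line.split("\t")[8]
    parent = get_attribute(attributes, "Parent")
    if parent is not None:
        children[parent].append(get_attribute(attributes, "ID"))

print(dict(children))
```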


GTF

The GTF format is currently on version 2. As mentioned earlier, it is used to describe gene annotations. There are really only two hard requirements for GTF files:

  • CDS, start_codon, and stop_codon feature types are required. Beyond that, which feature types must be present depends on the software. Common feature types are exon, transcript, CDS, start_codon, stop_codon.
  • The 9th column attribute fields must begin with 'gene_id' and 'transcript_id' attributes

The 9th attribute column is also structured differently from GFF files. The tag/value pair is delimited by a space. Each attribute must end with a semi-colon. Values that are text must be encased in quotes. Although not specified, I would not include spaces in tag names; instead, use an underscore in place of a space. For example:

gene_id "geneA";transcript_id "geneA.1";database_id "0012";modified_by "Damian";duplicates 0;

Note that there is a semi-colon at the end of the line because each attribute needs to end with a semi-colon. This is different from GFF files.
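
A quick sketch of reading that column in Python, and of checking the 'gene_id'/'transcript_id' requirement. Both function names are hypothetical, and the simple split on ';' assumes no quoted value contains a semi-colon.

```python
def parse_gtf_attributes(column):
    """Return the GTF attributes as an ordered list of (tag, value) pairs."""
    pairs = []
    for field in column.strip().split(";"):
        field = field.strip()
        if not field:
            # the piece after the final semi-colon is empty
            continue
        tag, _, value = field.partition(" ")
        # drop the surrounding quotes from text values
        pairs.append((tag, value.strip().strip('"')))
    return pairs

def has_required_prefix(pairs):
    # the first two attributes must be gene_id and transcript_id
    return [tag for tag, _ in pairs][:2] == ["gene_id", "transcript_id"]

column = 'gene_id "geneA";transcript_id "geneA.1";database_id "0012";modified_by "Damian";duplicates 0;'
pairs = parse_gtf_attributes(column)
print(pairs)
print(has_required_prefix(pairs))
```

I kept the result as a list rather than a dictionary so that the order of the attributes can be checked.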

There is also no feature/sub-feature relationship in GTF files. If there are multiple exons, they are grouped together by having the same transcript_id. Multiple transcripts can have the same gene_id.


Breakdown

Here is a table breaking down the differences between the two formats and versions:

Column                       GTF2                                     GFF3
1. reference sequence name   same                                     same
2. annotation source         same                                     same
3. feature type              CDS, start_codon, stop_codon required;   can be anything
                             other requirements depend on software
4. start coordinate          same                                     same
5. end coordinate            same                                     same
6. score                     not used                                 optional
7. strand                    same                                     same
8. frame                     same                                     same
9. attributes                see below                                see below

Attributes (column 9) in GTF2:
  • tag/value delimited by a space
  • each attribute must end with a semi-colon
  • must begin with gene_id and transcript_id attributes
  • text values must be in quotes
  • ex. gene_id "gene01"; transcript_id "transcript01"; created_by "Damian";

Attributes (column 9) in GFF3:
  • tag/value delimited by '='
  • each attribute delimited by a semi-colon
  • there is a list of pre-defined attributes in the specification
  • must have a unique ID attribute
  • ex. ID=geneA;Parent=geneAP;Name=geneA


Conversion

As mentioned previously, if the lines of a GFF file are not arranged in feature blocks where all lines pertaining to one feature are consecutive, parsing the file will require writing the data to memory:

#geneA and geneB are both arranged in blocks of consecutive lines
Contig01  Twinscan  gene  501  750  .  +  0  ID=geneA;Name=geneA
Contig01  Twinscan  exon  501  650  .  +  2  ID=exonA1;Parent=geneA
Contig01  Twinscan  exon  700  750  .  +  2  ID=exonA2;Parent=geneA
Contig01  Twinscan  gene  700  900  .  +  2  ID=geneB;Name=geneB
Contig01  Twinscan  exon  700  750  .  +  0  ID=geneB1;Parent=geneB
Contig01  Twinscan  exon  800  900  .  +  0  ID=geneB2;Parent=geneB

#lines defining geneA and geneB are arranged randomly 
Contig01  Twinscan  gene  501  750  .  +  0  ID=geneA;Name=geneA
Contig01  Twinscan  gene  700  900  .  +  2  ID=geneB;Name=geneB
Contig01  Twinscan  exon  700  750  .  +  2  ID=exonA2;Parent=geneA
Contig01  Twinscan  exon  700  750  .  +  0  ID=geneB1;Parent=geneB
Contig01  Twinscan  exon  501  650  .  +  2  ID=exonA1;Parent=geneA
Contig01  Twinscan  exon  800  900  .  +  0  ID=geneB2;Parent=geneB

This is because sometimes a GFF file can have multiple levels of feature hierarchy. A gene can be the parent to a transcript, which can be the parent of an exon, which can be the parent of a domain... and so on. I am not going to discuss how we can deal with that kind of file. 
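
That said, for the curious, here is a rough sketch of why such files need to be held in memory. It reads every line first, remembers each feature's parent, and only then walks up the Parent chain to find the top-level feature. The function names and the toy lines are made up, and the sketch assumes every feature has an ID and at most one Parent.

```python
def get_attribute(column, tag):
    # return the value of one tag from a GFF3 attributes column, or None
    for pair in column.split(";"):
        key, _, value = pair.partition("=")
        if key == tag:
            return value
    return None

def top_level_ids(lines):
    # first pass: remember every feature's parent (None for top-level features)
    parent_of = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        attributes = line.rstrip("\n").split("\t")[8]
        parent_of[get_attribute(attributes, "ID")] = get_attribute(attributes, "Parent")

    # second pass: follow the Parent chain upward for every feature
    roots = {}
    for feature_id in parent_of:
        current = feature_id
        seen = set()  # guards against circular Parent references
        while parent_of.get(current) is not None and current not in seen:
            seen.add(current)
            current = parent_of[current]
        roots[feature_id] = current
    return roots

# toy lines in random order, with three levels below the gene
lines = [
    "Contig01\tTwinscan\tdomain\t510\t600\t.\t+\t.\tID=domainA1;Parent=exonA1",
    "Contig01\tTwinscan\texon\t501\t650\t.\t+\t.\tID=exonA1;Parent=mrnaA",
    "Contig01\tTwinscan\tgene\t501\t750\t.\t+\t.\tID=geneA;Name=geneA",
    "Contig01\tTwinscan\tmRNA\t501\t750\t.\t+\t.\tID=mrnaA;Parent=geneA",
]
roots = top_level_ids(lines)
print(roots)
```

The first pass is the part that costs memory: no feature can be resolved until the whole file has been seen.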

There are multiple ways to parse GTF/GFF files and interconvert them. BioPython and BioPerl both have GTF/GFF parsing modules. When converting from GFF to GTF, always look over the GFF file first to see which feature types and attributes are being used.


GTF to GFF conversion

This conversion is pretty straightforward. You just have to take the transcript_id value from the GTF file and convert it into a GFF-formatted attribute. Note that the hierarchical relationship between gene_id and transcript_id is not conserved in the GFF.

import sys

inFile = open(sys.argv[1],'r')

for line in inFile:
  #skip comment lines that start with the '#' character
  if line[0] != '#':
    #split line into columns by tab
    data = line.strip().split('\t')

    #parse the transcript/gene ID. I suck at using regex, so I usually just do a series of splits.
    transcriptID = data[-1].split('transcript_id')[-1].split(';')[0].strip()[1:-1]
    geneID = data[-1].split('gene_id')[-1].split(';')[0].strip()[1:-1]

    #replace the last column with a GFF formatted attributes column
    #I added a GID attribute just to conserve all the GTF data
    data[-1] = "ID=" + transcriptID + ";GID=" + geneID

    #print out this new GFF line
    print('\t'.join(data))

Save this script and use it by:

python myScript.py myFile.gtf > myFile.gff


GFF to GTF conversion

There is no simple script that is guaranteed to convert any GFF to GTF. It all depends on the feature types and attributes used in the GFF file. In this example, I am assuming the parent feature type is "gene", and all other feature types are just sub-features of "gene". Here is an example of such a GFF file:

Contig01  PFAM  gene         501  750  .  +  .  ID=geneA;Name=geneA
Contig01  PFAM  exon         501  650  .  +  .  ID=exonA1;Parent=geneA
Contig01  PFAM  exon         700  750  .  +  .  ID=exonA2;Parent=geneA
Contig01  PFAM  cds          700  750  .  +  2  ID=cdsA1;Parent=geneA
Contig01  PFAM  start_codon  700  750  .  +  .  ID=startA1;Parent=geneA
Contig01  PFAM  domain       700  750  .  +  .  ID=domainA1;Parent=geneA

The feature type "gene" in the above example is the parent feature to all the other subfeatures (exon, cds, start_codon, domain).

Knowing this data structure, we only have to take the ID attribute of the gene feature and the Parent attribute of every other feature, and convert them to gene_id and transcript_id GTF attributes.

import sys

inFile = open(sys.argv[1],'r')

for line in inFile:
  #skip comment lines that start with the '#' character
  if line[0] != '#':
    #split line into columns by tab
    data = line.strip().split('\t')

    ID = ''

    #if the feature is a gene 
    if data[2] == "gene":
      #get the id
      ID = data[-1].split('ID=')[-1].split(';')[0]

    #if the feature is anything else
    else:
      # get the parent as the ID
      ID = data[-1].split('Parent=')[-1].split(';')[0]
    
    #modify the last column; each GTF attribute must end with a semi-colon
    data[-1] = 'gene_id "' + ID + '"; transcript_id "' + ID + '";'

    #print out this new GTF line
    print('\t'.join(data))


Summary

GFF is a more general genomic annotation format while GTF is strictly a gene annotation format. Being more general, GFF is less controlled and thus more variable in content. Converting GTF to GFF is relatively straightforward as you can just convert the required gene/transcript_id attributes to an ID attribute in GFF. However, GFF to GTF conversion can be more complicated depending on the hierarchy of the features, the types of features used, and the attributes used.

I think most bioinformaticians will tell you that a significant amount of the work they do is parsing and converting data formats. It's only within the last decade or so that the field has started to get together and come up with some standardized formats. GTF and GFF are very commonly used formats for annotation and will probably stay that way for a while.







