Annotation pipeline

From GO Wiki

Reference Genome Annotation Pipeline

from orthology sets to comprehensive annotations…

Last Modified: April 7, 2008; (Judy)

Previously Modified: April 1, 2008; (Judy and Suzi)

Here we outline the current procedures for the Reference Genome Project Annotation Pipeline. These procedures were developed to ensure consistent, high-quality annotation efforts among the participating resource providers. The resource providers are:


Reference Genome Group    Contact Person
SGD                       Stacia Engel
MGI                       David Hill
FlyBase                   Susan Tweedie
dictyBase                 Pascale Gaudet
E. coli                   Jim Hu
TAIR                      Tanya Berardini
WormBase                  Kimberly Van Auken
S. pombe                  Val Wood
RGD                       Victoria Petri
Human                     Emily Dimmer
Zebrafish                 Doug Howe
Chicken                   Fiona McCarthy



Here is a figure of the Pipeline.

[Figure: Ref Genome annotation pipelineMar31-08.png]


I. Gp2Protein files are provided by each of the participating genome groups.

a. All genes or gene products known to be within the organism’s genome are to be included.
b. Currently, 'genes' are restricted to protein-coding units, with one representation for each coding unit regardless of the number of isoforms. The longest amino acid sequence is provided for the purpose of this work.
  • this is not what everyone does
  • how do we then annotate gene products (i.e. isoforms)?


II. The software used in the following steps is documented here: http://wiki.geneontology.org/index.php/RG:_Software

III. Software will use the gp2protein files to construct fasta files.

a. Error reports are generated when these are loaded into the GO database
  • what errors are reported?
b. Only the longest amino acid sequence for a given gene will be used when generating the fasta file.
  • Kara had issues generating the FASTA file from the gp2protein file. Chris and Seth say that the FASTA file will now be generated together with the data releases (monthly).
c. P-Pod will be run to generate initial tentative ortholog/homology sets – for brevity we will refer to these sets of proteins as “ortho-sets”, but this is to be understood simply as shorthand for a much more nuanced interpretation.
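
As an illustration of step III.b, here is a minimal sketch of keeping only the longest amino acid sequence per gene and writing the result as FASTA. It is not the actual Reference Genome software: the function names, the (gene, protein, sequence) input layout, the identifiers and the header format are all assumptions made for this example.

```python
def longest_per_gene(isoforms):
    """Keep the longest sequence for each gene.

    isoforms: iterable of (gene_id, protein_id, sequence) tuples.
    This input layout is hypothetical; it is not the gp2protein format.
    On a tie in length, the first isoform seen is kept.
    """
    best = {}
    for gene_id, protein_id, seq in isoforms:
        current = best.get(gene_id)
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (protein_id, seq)
    return best


def to_fasta(best, width=60):
    """Render {gene_id: (protein_id, sequence)} as FASTA text."""
    lines = []
    for gene_id in sorted(best):
        protein_id, seq = best[gene_id]
        # Header layout is an assumption for this sketch.
        lines.append(f">{gene_id} {protein_id}")
        # Wrap the sequence at a fixed line width.
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Made-up identifiers and sequences, for illustration only.
example = [
    ("GENE:1", "PROT:1a", "MKT"),
    ("GENE:1", "PROT:1b", "MKTAYIAK"),
    ("GENE:2", "PROT:2a", "MSS"),
]
fasta_text = to_fasta(longest_per_gene(example))
```

In this sketch a gene with two isoforms contributes only its longer sequence (PROT:1b above), which mirrors the "one representation per coding unit" rule in step I.b.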

IV. Ortho-set(s) are chosen for comprehensive curation.

a. The curator who selects an ortho-set becomes the lead curator for this ortho-set and will oversee the overall annotation process
b. A protein tree is available to evaluate the ortho-sets

V. Each genome group is responsible for vetting its protein members of this set.

a. There are agreed criteria for adding/deleting proteins from this set
b. The ‘vetted’ ortho-set is deposited into the GO-DB – this is the official set; the unit of annotation.
c. Each curator notifies the lead curator (by changing the status to approved) once they have vetted their proteins. The fact that an ortholog was incorrectly called needs to be captured (for later iterations of the ortho-sets and for cases where we need to verify).
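
The vetting in step V could be recorded along the following lines. This is only a sketch under assumed data structures: the dictionaries, the function name `vet_ortho_set` and the identifiers are hypothetical and do not describe the GO database schema. The point it shows is that rejected proteins are logged with a reason rather than silently dropped, so that incorrectly called orthologs remain available for later iterations.

```python
def vet_ortho_set(proposed, rejections):
    """Apply each group's vetting decisions to a proposed ortho-set.

    proposed:   {group: set of protein IDs proposed for that group}
    rejections: {group: {protein ID: reason it was judged incorrectly called}}
    Returns (vetted, rejected_log); both structures are hypothetical.
    """
    vetted = {}
    rejected_log = []
    for group, proteins in proposed.items():
        group_rejections = rejections.get(group, {})
        # The vetted set keeps everything the group did not reject.
        vetted[group] = set(proteins) - set(group_rejections)
        # Keep a record of each incorrect call for later iterations.
        for protein in sorted(group_rejections):
            if protein in proteins:
                rejected_log.append((group, protein, group_rejections[protein]))
    return vetted, rejected_log


# Made-up identifiers, for illustration only.
proposed = {"SGD": {"S1", "S2"}, "MGI": {"M1"}}
rejections = {"SGD": {"S2": "paralog, not ortholog"}}
vetted, rejected_log = vet_ortho_set(proposed, rejections)
```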

VI. Experimental annotations are comprehensively added for all proteins in the ortho-set (no ISS annotations are added at this time)

a. Those groups with no experimental data don’t do any annotation for their proteins in this ortho-set at this time (although they may be working on other annotations and ortho-sets).
b. When finished with comprehensive experimental annotation for a selected set, each genome group marks it as ‘exp captured’. Those without any experimental evidence will set this flag as soon as this absence is determined.
c. When all groups mark as ‘exp captured’, the set is open for ISS inference annotations using the experimental data (automatic notification)
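
The flag logic in VI.b and VI.c can be sketched as a small status tracker. The class name, its attributes and the notification text are assumptions for illustration; the actual tracking is done in the GO database and is not described here.

```python
class OrthoSetStatus:
    """Track the 'exp captured' flag per genome group for one ortho-set."""

    def __init__(self, set_id, groups):
        self.set_id = set_id
        # Every participating group starts with the flag unset.
        self.exp_captured = {group: False for group in groups}
        self.notifications = []

    def open_for_iss(self):
        """True only once every group has marked 'exp captured'."""
        return bool(self.exp_captured) and all(self.exp_captured.values())

    def mark_exp_captured(self, group):
        """Set the flag for one group, including groups with no experimental data."""
        if group not in self.exp_captured:
            raise KeyError(f"{group} is not a participant in {self.set_id}")
        was_open = self.open_for_iss()
        self.exp_captured[group] = True
        # Notify once, at the moment the last group sets its flag.
        if not was_open and self.open_for_iss():
            self.notifications.append(f"{self.set_id}: open for ISS annotation")


# Hypothetical set ID; group names taken from the table of participants.
status = OrthoSetStatus("set-1", ["SGD", "MGI"])
status.mark_exp_captured("SGD")
still_closed = not status.open_for_iss()
status.mark_exp_captured("MGI")
```

A group without experimental data calls `mark_exp_captured` as soon as the absence is determined, so it does not hold up the other groups.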

VII. Each group now adds ISS annotations based on the experimental annotations collected as part of the reference genome project.

a. IEA annotations are not accepted; curators look at all of the ortho-set annotations. Since step V is manual, that step in and of itself may serve to justify the ISS evidence code.
b. The ISA or ISO annotations all have a “with” to another protein in the ortho-set for which experimental data exists.
c. Since the ortho-set has been settled in step V, ISS annotations to proteins outside of the ortho-set are excluded from the ortho-set, even when they use reference genome annotations. The annotations themselves can still be submitted.
  • Should we not submit those to UniProt, to ensure that they are visible in AmiGO?
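
The rules in step VII lend themselves to a simple automated check. The sketch below assumes annotations are plain dictionaries with `protein`, `go_id`, `evidence` and `with` keys. That layout, the function name and the example identifiers are hypothetical, and the list of experimental evidence codes is an assumption about which codes count as experimental for this purpose.

```python
# Assumed set of experimental evidence codes for this sketch.
EXPERIMENTAL_CODES = {"IDA", "IPI", "IMP", "IGI", "IEP"}
# Sequence-similarity codes named in step VII.
SIMILARITY_CODES = {"ISS", "ISA", "ISO"}


def check_similarity_annotation(annotation, ortho_set, all_annotations):
    """Return a list of problems; an empty list means the annotation passes."""
    problems = []
    if annotation["evidence"] not in SIMILARITY_CODES:
        # Covers IEA, which is not accepted at this step.
        problems.append("evidence code is not ISS/ISA/ISO")
    if annotation["protein"] not in ortho_set:
        problems.append("annotated protein is outside the ortho-set")
    source = annotation.get("with")
    if source not in ortho_set:
        problems.append("'with' protein is not in the ortho-set")
    elif not any(
        a["protein"] == source and a["evidence"] in EXPERIMENTAL_CODES
        for a in all_annotations
    ):
        problems.append("'with' protein has no experimental annotation")
    return problems


# Placeholder protein IDs and GO term, for illustration only.
ortho_set = {"P1", "P2", "P3"}
annotations = [
    {"protein": "P1", "go_id": "GO:0000001", "evidence": "IDA", "with": None},
]
good = {"protein": "P2", "go_id": "GO:0000001", "evidence": "ISO", "with": "P1"}
bad = {"protein": "P3", "go_id": "GO:0000001", "evidence": "IEA", "with": "P2"}
good_problems = check_similarity_annotation(good, ortho_set, annotations)
bad_problems = check_similarity_annotation(bad, ortho_set, annotations)
```

The check deliberately does not require the inferred annotation to use the same GO term as the experimental one, since the procedure above does not state such a rule.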

VIII. At the completion of the experimental annotation and the ISS inference additions for a given ortho-set, the lead curator who proposed the annotation of the ortho-set will do QC on the resulting annotations.

a. Again, protein trees will be used to evaluate the consistency of annotations across the genomes.
b. Curators may be asked to revise their annotations if there are inconsistencies.
c. Following this QC, the set is marked as ‘complete’ in GOdb and dated.

IX. Further documentation will indicate criteria and policies for revising and updating GO annotations for these genes.

  • One possible query would be to check whether there are annotations to any of the genes from the 'completely curated' ortho-sets that are more recent than the date the set was last checked. This way we only re-verify genes for which there is new experimental data available.
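
The re-verification query suggested above might look like the following sketch. It works on in-memory data rather than the GO database, and the function name, the data layout and the identifiers are assumptions for illustration. It also compares annotation dates only; restricting the check to experimental evidence codes would be a natural refinement.

```python
from datetime import date


def sets_needing_review(completed_sets, annotations):
    """Find completed ortho-sets with annotations newer than the last check.

    completed_sets: {set_id: (last_checked date, set of gene IDs)}
    annotations:    iterable of (gene_id, annotation date)
    Both layouts are hypothetical.
    """
    # Index genes so each annotation is looked up directly.
    gene_to_sets = {}
    for set_id, (last_checked, genes) in completed_sets.items():
        for gene in genes:
            gene_to_sets.setdefault(gene, []).append((set_id, last_checked))

    stale = set()
    for gene, annotated_on in annotations:
        for set_id, last_checked in gene_to_sets.get(gene, []):
            if annotated_on > last_checked:
                stale.add(set_id)
    return sorted(stale)


# Made-up set IDs, gene IDs and dates, for illustration only.
completed_sets = {
    "set-1": (date(2008, 3, 1), {"G1", "G2"}),
    "set-2": (date(2008, 4, 1), {"G3"}),
}
annotations = [
    ("G1", date(2008, 3, 15)),  # newer than set-1's last check
    ("G3", date(2008, 3, 20)),  # older than set-2's last check
    ("G9", date(2008, 4, 5)),   # gene not in any completed set
]
to_review = sets_needing_review(completed_sets, annotations)
```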