Annotation pipeline (Retired)
Reference Genome Annotation Pipeline: from orthology sets to comprehensive annotations
Last Modified: April 1, 2008; (Judy and Suzi)
Here we outline the current procedures for the Reference Genome Project Annotation Pipeline. These procedures were developed to ensure consistent, high-quality annotation across the participating resource providers. The resource providers are:
|Reference Genome Group||Contact Person|
|WormBase||Kimberly von Auken|
|S. pombe||Val Wood|
I. Gp2Protein files are provided by each of the participating genome groups.
a. All genes or gene products known to be within the organism’s genome are to be included.
b. See documentation here: http://wiki.geneontology.org/index.php/RG:_Software
II. Software will use the gp2protein files to construct fasta files.
a. Error reports are generated when these are loaded into the GO database
b. Only the longest amino acid sequence for a given gene will be used when generating the fasta file.
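The "longest sequence per gene" rule in step II.b can be sketched as follows. This is a minimal illustration, not the pipeline's actual software: the `(gene_id, protein_id, sequence)` record layout, the function names, and the FASTA header format are all assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple


def longest_per_gene(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """For each gene, keep the protein with the longest amino acid sequence.

    records: (gene_id, protein_id, sequence) triples. This layout is a
    hypothetical stand-in for whatever the real loader produces.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for gene_id, protein_id, seq in records:
        current = best.get(gene_id)
        # Only a strictly longer sequence replaces the current choice,
        # so the first protein seen is kept when lengths tie.
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (protein_id, seq)
    return best


def to_fasta(best: Dict[str, Tuple[str, str]], width: int = 60) -> str:
    """Render the selected proteins as FASTA text (header format assumed)."""
    lines = []
    for gene_id, (protein_id, seq) in best.items():
        lines.append(f">{protein_id} gene={gene_id}")
        # Wrap the sequence at a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [
        ("geneA", "P1", "MKT"),
        ("geneA", "P2", "MKTAYI"),  # longer isoform of geneA
        ("geneB", "P3", "MA"),
    ]
    print(to_fasta(longest_per_gene(demo)), end="")
```

How ties between equal-length sequences should be broken is not stated in the procedure; keeping the first one seen is simply the choice made in this sketch.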
III. P-Pod will be run to generate initial tentative ortholog/homology sets – for brevity we will refer to these sets of proteins as “ortho-sets”, but this is to be understood simply as shorthand for a much more nuanced interpretation.
IV. Ortho-set(s) are chosen for comprehensive curation.
a. The curator who selects an ortho-set becomes the lead curator for this ortho-set and will oversee the overall annotation process
b. A protein tree is available to evaluate the ortho-sets
V. Each genome group is responsible for vetting its protein members of this set.
a. There are agreed criteria for adding/deleting proteins from this set
b. The ‘vetted’ ortho-set is deposited into the GO-DB – this is the official set; the unit of annotation.
c. Each curator notifies the lead curator (by changing the status to approved) once they have vetted their proteins.
VI. Experimental annotations are comprehensively added for all proteins in the ortho-set (no ISS annotations are added at this time)
a. Those groups with no experimental data don’t do any annotation for their proteins in this ortho-set at this time (although they may be working on other annotations and ortho-sets).
b. When finished with comprehensive experimental annotation for a selected set, each genome group marks the set as ‘exp captured’. Groups without any experimental evidence will set this flag as soon as this absence is determined.
c. When all groups have marked the set as ‘exp captured’, the set is open for ISS inference annotations using the experimental data.
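The gating rule in step VI can be expressed as a small check: a set opens for ISS annotation only when every participating group has set its ‘exp captured’ flag. The sketch below assumes the flags are held in a plain dictionary keyed by group name; the function names are hypothetical and do not reflect how GO-DB actually stores status.

```python
from typing import Dict, List


def open_for_iss(exp_captured: Dict[str, bool]) -> bool:
    """True once every participating group has marked 'exp captured'.

    An empty mapping is treated as not ready, since no group has reported.
    """
    return bool(exp_captured) and all(exp_captured.values())


def pending_groups(exp_captured: Dict[str, bool]) -> List[str]:
    """Groups that have not yet set the flag, sorted for stable reporting."""
    return sorted(group for group, done in exp_captured.items() if not done)


if __name__ == "__main__":
    flags = {"WormBase": True, "S. pombe": False}
    print(open_for_iss(flags), pending_groups(flags))
    # A group with no experimental data still sets the flag (step VI.b).
    flags["S. pombe"] = True
    print(open_for_iss(flags), pending_groups(flags))
```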
VII. Each group now adds ISS annotations based on the experimental annotations collected as part of the reference genome project.
a. IEA annotations are not accepted; curators look at all of the ortho-set annotations. Since step V is manual, this step in and of itself may serve to justify the ISS evidence code.
b. The ISA or ISO annotations all have a “with” to another protein in the ortho-set for which experimental data exists.
c. Because the ortho-set was settled in step V, ISS annotations to proteins outside the ortho-set are excluded from it, even if they draw on reference genome annotations. The annotations themselves can still be submitted.
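The rules in step VII lend themselves to a simple consistency check. The following is a sketch under stated assumptions, not the project's QC tooling: the `Annotation` record, the `check_iss` function, and the identifiers in the demo are hypothetical placeholders. The set of experimental evidence codes used here (EXP, IDA, IPI, IMP, IGI, IEP) is the standard GO experimental group; the procedure itself does not say whether the “with” protein's experimental annotation must be to the same GO term, so this sketch only requires that some experimental annotation exists.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Standard GO experimental evidence codes.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
# Sequence-similarity codes named in the procedure.
SEQUENCE_BASED = {"ISS", "ISA", "ISO"}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical minimal annotation record for illustration."""
    protein: str
    go_term: str
    evidence: str
    with_id: str = ""


def check_iss(
    ann: Annotation, ortho_set: Set[str], all_annotations: Iterable[Annotation]
) -> List[str]:
    """Return the list of rule violations for one annotation (empty if none)."""
    if ann.evidence == "IEA":
        return ["IEA annotations are not accepted"]
    if ann.evidence not in SEQUENCE_BASED:
        return []  # not a sequence-similarity annotation; nothing to check here
    problems: List[str] = []
    if ann.protein not in ortho_set:
        problems.append("annotated protein is outside the ortho-set")
    # Proteins that carry at least one experimental annotation.
    has_exp = {a.protein for a in all_annotations if a.evidence in EXPERIMENTAL}
    if ann.with_id == ann.protein or ann.with_id not in ortho_set:
        problems.append("'with' must name another protein in the ortho-set")
    elif ann.with_id not in has_exp:
        problems.append("'with' protein has no experimental annotation")
    return problems


if __name__ == "__main__":
    ortho = {"WB:1", "SP:1", "SP:2"}  # placeholder identifiers
    experimental = [Annotation("SP:1", "GO:0000001", "IDA")]
    candidate = Annotation("WB:1", "GO:0000001", "ISO", with_id="SP:1")
    print(check_iss(candidate, ortho, experimental))
```

An annotation flagged as outside the ortho-set is not necessarily invalid; per step VII.c it can still be submitted, it just does not count as part of the set.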
VIII. At the completion of the experimental annotation and the ISS inference additions for a given ortho-set, the lead curator who proposed the annotation of the ortho-set will do QC on the resulting annotations.
a. Again, protein trees will be used to evaluate the consistency of annotations across the genomes.
b. Curators may be asked to revise their annotations if there are inconsistencies.
c. Following this QC, the set is marked as ‘complete’ in GOdb and dated.
IX. Further criteria will indicate expectations and policies for revising and updating GO annotations for these genes.