Annotation pipeline (Retired): Difference between revisions

From GO Wiki
Latest revision as of 11:13, 12 April 2019

Reference Genome Annotation Pipeline

from orthology sets to comprehensive annotations…

Last modified: April 7, 2008 (Judy)

Previous modification: April 1, 2008 (Judy and Suzi)

Here we outline the current procedures for the Reference Genome Project Annotation Pipeline. These procedures were developed to ensure consistent, high-quality annotation efforts among the participating resource providers. The resource providers are:


Reference Genome Group | Contact Person
SGD | Stacia Engel
MGI | David Hill
FlyBase | Susan Tweedie
dictyBase | Pascale Gaudet
E. coli | Jim Hu
TAIR | Tanya Berardini
WormBase | Kimberly Van Auken
S. pombe | Val Wood
RGD | Victoria Petri
Human | Emily Dimmer
Zebrafish | Doug Howe
Chicken | Fiona McCarthy



Here is a figure of the Pipeline.

[Image: Ref_Genome_annotation_pipelineMar31-08.png]


I. gp2protein files are provided by each of the participating genome groups.

a. All genes or gene products known to be within the organism’s genome are to be included.
b. Currently, 'genes' are restricted to protein-coding units, with one representation for each coding unit regardless of the number of isoforms. The longest amino acid (AA) sequence is provided for the purpose of this work.
  • This is not what everyone does.
  • How do we then annotate gene products (i.e., isoforms)?


II. See documentation here: http://wiki.geneontology.org/index.php/RG:_Software

III. Software will use the gp2protein files to construct FASTA files.

a. Error reports are generated when these are loaded into the GO database.
  • What errors are reported?
b. Only the longest amino acid sequence for a given gene will be used when generating the FASTA file.
  • Kara had issues generating the FASTA file from the gp2protein file. Chris and Seth say that the FASTA file will now be generated together with the data releases (monthly).
c. P-Pod will be run to generate initial tentative ortholog/homology sets – for brevity we will refer to these sets of proteins as “ortho-sets”, but this is to be understood simply as shorthand for a much more nuanced interpretation.
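
A minimal sketch of step III.b, assuming the sequences have already been read into (gene ID, protein ID, sequence) tuples. The input shape and the function names `longest_per_gene` and `to_fasta` are hypothetical, chosen for illustration; parsing of the gp2protein file is not shown, and this is not the project's actual software.

```python
def longest_per_gene(records):
    """Keep one (protein_id, sequence) per gene: the longest sequence seen.

    `records` is an iterable of (gene_id, protein_id, sequence) tuples.
    On a tie in length, the first record encountered is kept.
    """
    best = {}
    for gene_id, protein_id, seq in records:
        current = best.get(gene_id)
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (protein_id, seq)
    return best


def to_fasta(best, width=60):
    """Render the selected sequences as FASTA text, wrapped at `width` residues."""
    lines = []
    for gene_id in sorted(best):
        protein_id, seq = best[gene_id]
        lines.append(f">{gene_id} {protein_id}")
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "".join(line + "\n" for line in lines)
```

Choosing the longest isoform this way gives exactly one FASTA record per coding unit, which matches the one-representation rule in step I.b.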

IV. Ortho-set(s) are chosen for comprehensive curation.

a. The curator who selects an ortho-set becomes the lead curator for this ortho-set and will oversee the overall annotation process
b. A protein tree is available to evaluate the ortho-sets

V. Each genome group is responsible for vetting its protein members of this set.

a. There are agreed criteria for adding/deleting proteins from this set
b. The ‘vetted’ ortho-set is deposited into the GO-DB – this is the official set; the unit of annotation.
c. Each curator notifies the lead curator (by changing the status to approved) once they have vetted their proteins. The fact that an ortholog was incorrectly called needs to be captured (for later iterations of the ortho-sets and in cases where we need to verify).

VI. Experimental annotations are comprehensively added for all proteins in the ortho-set (no ISS annotations are added at this time)

a. Those groups with no experimental data don’t do any annotation for their proteins in this ortho-set at this time (although they may be working on other annotations and ortho-sets).
b. When finished with comprehensive experimental annotation for a selected set, each genome group marks as ‘exp captured’. Those without any experimental evidence will set this flag as soon as this absence is determined.
c. When all groups mark as ‘exp captured’, the set is open for ISS inference annotations using the experimental data (automatic notification)
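
The gating rule in step VI.c can be sketched as follows. The dictionary of per-group flags and the function names are assumptions made for illustration; the document does not describe how the flags are stored in the GO database.

```python
def open_for_iss(exp_captured):
    """Return True only when every group has marked the set 'exp captured'.

    `exp_captured` maps a genome group name to a boolean flag.
    An empty mapping is treated as not ready.
    """
    return bool(exp_captured) and all(exp_captured.values())


def pending_groups(exp_captured):
    """List the groups that have not yet set the 'exp captured' flag."""
    return sorted(group for group, done in exp_captured.items() if not done)
```

Groups without experimental data still set the flag (step VI.b), so a single missing flag is enough to keep the set closed to ISS annotation.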

VII. Each group now adds ISS annotations based on the experimental annotations collected as part of the reference genome project.

a. IEA annotations are not accepted; curators look at all of the ortho-set annotations. Since step V is manual, this step in and of itself may serve to justify the ISS evidence code.
b. The ISA or ISO annotations all have a “with” to another protein in the ortho-set for which experimental data exists.
c. Since the ortho-set has been settled in step V, ISS annotations to proteins outside of the ortho-set are excluded from the ortho-set, even if they use reference genome annotations; the annotations themselves can still be submitted.
  • Should we not submit those to UniProt, to ensure that they are visible in AmiGO?
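
The rule in step VII.b can be checked mechanically. The sketch below assumes annotations are plain dictionaries with `protein`, `evidence`, and `with` keys; that record shape and the function names are hypothetical. The set of experimental codes lists the standard GO experimental evidence codes and may need adjusting to the project's policy.

```python
# Standard GO experimental evidence codes (assumed to be the relevant set here).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
# Sequence-similarity codes covered by the rule in step VII.b.
SEQUENCE_CODES = {"ISS", "ISA", "ISO"}


def has_experimental(protein, annotations):
    """True if `protein` has at least one experimentally supported annotation."""
    return any(
        a["protein"] == protein and a["evidence"] in EXPERIMENTAL_CODES
        for a in annotations
    )


def iss_is_supported(annotation, ortho_set, annotations):
    """Check that an ISS-type annotation points, via 'with', to a different
    member of the same ortho-set that has experimental data."""
    if annotation["evidence"] not in SEQUENCE_CODES:
        raise ValueError("expected an ISS, ISA, or ISO annotation")
    target = annotation.get("with")
    return (
        target is not None
        and target != annotation["protein"]
        and target in ortho_set
        and has_experimental(target, annotations)
    )
```

An annotation that fails this check is not necessarily wrong; per step VII.c it may simply fall outside the ortho-set and be submitted separately.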

VIII. At the completion of the experimental annotation and the ISS inference additions for a given ortho-set, the lead curator who proposed the annotation of the ortho-set will do QC on the resulting annotations.

a. Again, protein trees will be used to evaluate the consistency of annotations across the genomes.
b. Curators may be asked to revise their annotations if there are inconsistencies.
c. Following this QC, the set is marked as ‘complete’ in GOdb and dated.

IX. Further documentation will indicate criteria and policies for revising and updating GO annotations for these genes.

  • One possible query would be to check whether there are annotations to one of the genes from the 'completely curated' ortho-sets that are more recent than the date it was last checked. This way we only re-verify genes for which there is new experimental data available.
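
The query suggested above could look roughly like this. It is a sketch under stated assumptions: the last-checked dates are held in a dictionary keyed by gene, annotations arrive as (gene, date) pairs, and the function name is hypothetical. The actual GO database schema is not described here.

```python
from datetime import date


def genes_needing_review(last_checked, annotations):
    """Return genes with at least one annotation newer than their last check.

    `last_checked` maps gene ID -> date the gene was last verified.
    `annotations` is an iterable of (gene_id, annotation_date) pairs.
    Genes without a recorded check date are ignored, since they are not
    part of a 'completely curated' ortho-set.
    """
    stale = set()
    for gene_id, annotated_on in annotations:
        checked = last_checked.get(gene_id)
        if checked is not None and annotated_on > checked:
            stale.add(gene_id)
    return sorted(stale)
```

To restrict the result to new experimental data, as the note intends, the annotation pairs would first be filtered by evidence code.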