Core Consortium annotation activities - 2012: Difference between revisions

From GO Wiki
 
(27 intermediate revisions by 3 users not shown)
Line 1: Line 1:
==Activities to which contributing GO annotation groups must minimally agree==
==Core Annotation Activities asked of GO Annotation Groups==


* Supply an appropriately formatted GAF file with GO annotations that is syntactically valid and meets the GO Consortium data requirements as specified here: [http://www.geneontology.org/GO.format.gaf-2_0.shtml GAF:2.0]. The GOC commits to backward compatibility with this format.
* '''Submission of an annotation file in GAF2.0 format'''


* All annotations must be to UniProtKB identifiers located in the [http://www.ebi.ac.uk/reference_proteomes/ UniProt Reference Proteome Files], unless the group agrees to provide complete gp2protein, gp2rna and unlocalised_gp files, where appropriate (see below).
All GO annotation efforts that wish to supply their annotations to the Consortium must provide an appropriately formatted annotation file that conforms to the Consortium's syntactic and semantic requirements. The primary GO annotation format is [http://www.geneontology.org/GO.format.gaf-2_0.shtml GAF:2.0].


* All groups must support the transition to using ECO identifiers in future annotation formats; this could be future versions of GAF or the GPAD format. ECO identifiers agreed as appropriate for usage in GO annotations may be more or less specific than the current GO evidence code list.  
If you are a new annotation group, please see the [http://wiki.geneontology.org/index.php?title=Contributing_GO_Annotations GOC Annotation guidelines and policies] for making annotations and the [http://gocwiki.geneontology.org/index.php/Submit_GO_annotations FAQ] for assistance in submitting/making an annotation file.


* Curational groups are not necessarily committing to on-going updates to their annotations. In the case of non-recurring submissions or those from annotation groups which are now inactive annotation providers, responsibility for corrections and updates will revert to the GOC.
*''' Annotations should describe activities of UniProtKB protein or NCBI gene product identifiers'''


<span style="color:#009000">Please see the guidance provided  [http://wiki.geneontology.org/index.php?title=Contributing_GO_Annotations here] for full details on how new GO annotation efforts are supported by the GO Consortium.</span>
Ideally, all annotations should describe the activities or locations of UniProtKB identifiers present in the [http://www.ebi.ac.uk/reference_proteomes/ UniProt Reference Proteome Files]. However, if this is not possible, groups should provide identifier mapping files (gp2protein and gp2rna) that supply the equivalent UniProt or NCBI identifiers. A gp_unlocalized file should additionally be provided where no sequence or genomic location is known for a gene identifier.


===Complete identifier mapping files===
* '''Willingness to adopt Evidence Code identifiers'''
 
While the current primary annotation file format applies [http://www.geneontology.org/GO.evidence.shtml GO evidence codes] to describe the category of support available in the cited reference, groups must support the Consortium's intent to transition to using [http://evidenceontology.org/ Evidence Ontology] identifiers in future annotation formats.
 
* '''Annotation update responsibility lies primarily with the submitter, but can revert to the GO Consortium'''
 
Curational groups do not need to commit to supplying regular updates to their annotations. In the case of non-recurring submissions, or submissions from annotation groups that are no longer active annotation providers, responsibility for corrections and updates will revert to the GOC. Please see the guidance provided [http://wiki.geneontology.org/index.php?title=Contributing_GO_Annotations here] for further details.
 
== Complete identifier mapping files==
 
Why do we need mapping files?
* For downloading sequences from UniProt/NCBI. These sequences are used for AmiGO BLAST and for phylogenetic inferencing (PAINT)
 
* To search for GO annotations in AmiGO using other DB cross reference IDs (UniProt or NCBI)
 
* The ID mapping will help with book keeping and tracking IDs and annotations, removing duplicates etc
 
In all cases where identifier mapping is carried out, groups must be aware that due to different database release cycles, sequence identifiers that should correspond with each other may not always display the same data.


====Complete gp2protein file ====


The file must meet the [http://wiki.geneontology.org/index.php/Gp2protein_file gp2protein format specification]


The gp2protein-mapping file must contain the full list of all protein-encoding genes in the respective organism (or community), including those proteins not annotated to GO.   
Line 21: Line 38:
The first column contains all gene or gene product identifiers (these are typically MOD-specific identifiers) and the second column contains mappings to canonical identifiers. Protein-coding genes must map to UniProtKB identifiers (Swiss-Prot in preference, otherwise TrEMBL). If identifiers are truly unavailable in UniProtKB then NCBI identifiers (NP_ and XP_) are permissible.


<span style="color:#009000"> If an annotation group is fully satisfied with the identifier mapping from an external identifier type to UniProtKB accessions, as supplied by the UniProt Knowledgebase cross-references, then UniProtKB is willing to take on the responsibility of supplying the external id -> UniProtKB mapping to the GO Consortium.</span>
If an annotation group is fully satisfied with the identifier mapping from an external identifier type to UniProtKB accessions, as supplied by the UniProt Knowledgebase cross-references, then UniProtKB is willing to take on the responsibility of supplying the external id -> UniProtKB mapping to the GO Consortium.


====Complete gp2rna file====
If your annotation file includes ncRNAs, then the gp2protein files must include all ncRNA-encoding genes currently identified in the genome build including those ncRNAs not annotated to GO.
If your annotation file includes ncRNAs, then your corresponding gp2rna file must include all ncRNA-encoding genes currently identified in the genome build including those ncRNAs not annotated to GO.


Functional ncRNA must map to NCBI identifiers (NR_ or XR_) if available; leave the mapping blank if unavailable.
Line 34: Line 51:
If your database supplies gene identifiers that have been manually curated from the literature, but where no sequence or genomic location is known (such genes have been variously described as 'unlocalised genes', 'single heritable traits' or  'phenotypic orphans'), then you should additionally supply a complete gp_unlocalized_file.   


This file should contain all non-genome localized gene identifiers available, including those not annotated to GO.
This file should contain a list of all the non-genome localized gene identifiers available, including those not annotated to GO.


[http://wiki.geneontology.org/index.php?title=Gp_unlocalized_file gp_unlocalised file format]


====Macromolecular complexes====
If the annotation file includes macromolecular complexes as the subject of the annotation then no corresponding entry is required for the gp2protein file – only gene or gene product mappings should be included.  


Line 45: Line 63:
Groups must regularly update their  gp2protein or gp2rna file (e.g. in response to UniProt-GOA feedback on inclusion of obsolete/secondary UniProtKB accessions in a group’s gp2protein, or obsoletion of NCBI identifiers).


==Additional responsibilities of active GO annotation groups.==
For groups who provide authoritative files for a species, or who are funded by the GO NIH grant, please see the following description of [[GO annotation activities by central GO Consortium members]]
* Be responsive to requests from other curators/external users to correct their annotations when necessary, and to integrate accepted corrections from external sources.
 
* Be responsive to correct annotations in which problems are uncovered using the GOC soft/hard QC checks.
 
* Each annotation group should be represented on the GO Consortium fortnightly annotation calls and frequently attend GO Consortium meetings, to ensure all groups are kept up-to-date with developments in the GO Consortium.
 
==Additional responsibilities of providers who act as the authoritative source for a species==
 
* Authoritative providers of annotations for a particular species should integrate manual and electronic annotations from external sources (e.g. GOC, RefGenome, UniProt, Reactome) on a monthly basis, providing an up-to-date set of merged annotations for their species (as the authoritative source) to the GOC. If an external source includes data in the annotation_extension field or gene_product_form_id (columns 16 and 17 of GAF2.0), then the species owner is required to include this data, even if they are not annotating to these fields themselves.
 
* It is preferable that all annotations are mapped to a common identifier type to enable users to retrieve a standardized annotation format (this could be the group’s MOD identifier type).
 
<span style="color:#009000">* Authoritative species-owning groups should only minimally filter their annotation set to reduce annotation redundancy in their annotation file. The GO Consortium will make less-redundant annotation sets available to users; however, it is important that the Consortium receives all annotations supplied by GOC (automatic and manual) annotation efforts.
 
Document describing the advised mechanism for filtering for annotation redundancy:
 
http://wiki.geneontology.org/index.php/Mechanisms_for_reducing_annotation_redundancy  '': Currently under discussion'' </span>
 
==Activities required of dedicated NIH funded GO curators (optional for others)==
 
===Annotation activities===
 
* Supply appropriately formatted GO annotations at a minimum frequency of once/month. Initially these will be supplied in GAF 2.0 (as above) but GO bio-curators will be required to transition to releasing files to the GO Consortium in GPAD/GPI format. [http://wiki.geneontology.org/index.php/Gene_Product_Association_Data_%28GPAD%29_Format GPAD format]
 
* Transition to annotating identifiers that describe specific protein/complex forms as necessary. Possible ids (covering protein isoforms, post-translationally modified proteins, functional RNAs and protein complexes) include UniProtKB accessions, PRO ids and IntAct complex ids.
 
* Add information to annotations to describe the biological context of a GO annotation (cell type, anatomical structure, other spatial and temporal attributes), as well as data linking functions, processes and components in an annotation. This will require at a minimum annotation extensions (column 16 in GAFs), i.e. a minimum of GAF2.0 or GPAD expressivity. It may also involve annotation relations, which require at a minimum GPAD. This includes enhancements that will become available in the future, such as specifications for more expressive annotations (e.g. LEGO-style annotation, with arbitrary levels of nesting), as the appropriate annotation tools become available.
 
* Carry out phylogenetically-based annotations of PANTHER protein families using PAINT (secondary to curating annotations derived from experimental data).
 
* Commit to submit annotations for all species that are described in the papers they are curating, that is annotation data for species not typically captured by the MOD. This would improve efficiency of literature curation when a paper characterizes gene products from >1 species.  This will be facilitated through the use of the Community Annotation Tool.
 
===Other activities===
 
* Mentor emerging annotation groups outside of the established functional annotation stream as resources allow; allocation of mentoring responsibilities defined by GOC.
 
* Participate in the review of annotations contributed by external community experts (as emailed to GO Help or brought in via the CANTO tool)
 
* Contribute to the development of the expressivity of GO annotations (at least in the format defined by the GO Consortium, if not in the MODs web-display) and determine the priority of these extensions.
 
* Participate in the testing and development of the Central GO Common Annotation Framework.  The Framework comprises both a front-end UI and a back-end system for reasoning, QA and deposition in the GO database, and dedicated curators will participate in the development of either one or both of these elements.  The Framework will support multiple UIs as long as they conform to the back-end specifications and respond in a timely manner to updates in these specifications.
 
* Provide a GPI file (rather than a gp2protein file)
[http://wiki.geneontology.org/index.php/Gene_Product_Association_Data_%28GPAD%29_Format#Proposed_Gene_Product_Information_.28GPI.29_file_format GPI file format]
 
** The GPI file must include identifiers and description of macromolecular complexes that have been annotated.


[[Category:Annotation]]
[[Category:Annotation Archived]]
[[Category:File Format]]
[[Category:Formats]]

Latest revision as of 12:03, 13 April 2019

Core Annotation Activities asked of GO Annotation Groups

  • Submission of an annotation file in GAF2.0 format

All GO annotation efforts that wish to supply their annotations to the Consortium must provide an appropriately formatted annotation file that conforms to the Consortium's syntactic and semantic requirements. The primary GO annotation format is GAF:2.0.

If you are a new annotation group, please see the GOC Annotation guidelines and policies for making annotations and the FAQ for assistance in submitting/making an annotation file.
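A basic syntactic check before submission can catch malformed rows early. The sketch below is not the Consortium's validator; it only assumes that a GAF 2.0 file declares its version in a `!gaf-version: 2.0` header line, that lines starting with `!` are comments, and that each annotation row has 17 tab-separated columns. The function name and the example row are hypothetical.

```python
GAF2_COLUMN_COUNT = 17  # GAF 2.0 annotation rows have 17 tab-separated columns


def check_gaf2_lines(lines):
    """Return (line_number, problem) tuples for basic GAF 2.0 syntax issues.

    Only the version header and the column count are checked; semantic
    requirements (valid GO ids, evidence codes, etc.) are out of scope.
    """
    problems = []
    saw_version = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("!"):
            # Comment/header line; the version declaration is one of these.
            if line.replace(" ", "").lower() == "!gaf-version:2.0":
                saw_version = True
            continue
        columns = line.split("\t")
        if len(columns) != GAF2_COLUMN_COUNT:
            problems.append(
                (number, f"expected {GAF2_COLUMN_COUNT} columns, found {len(columns)}")
            )
    if not saw_version:
        # Line number 0 marks a file-level problem.
        problems.append((0, "missing '!gaf-version: 2.0' header"))
    return problems


# Hypothetical row for illustration only; the values are not real annotations.
EXAMPLE_ROW = "\t".join([
    "UniProtKB", "P12345", "GENE1", "", "GO:0005515", "PMID:0000000", "IPI",
    "UniProtKB:Q00001", "F", "", "", "protein", "taxon:9606", "20120101",
    "UniProt", "", "",
])
```

Calling `check_gaf2_lines(open(path))` on a candidate file returns an empty list when these two checks pass.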

  • Annotations should describe activities of UniProtKB protein or NCBI gene product identifiers

Ideally, all annotations should describe the activities or locations of UniProtKB identifiers present in the UniProt Reference Proteome Files. However, if this is not possible, groups should provide identifier mapping files (gp2protein and gp2rna) that supply the equivalent UniProt or NCBI identifiers. A gp_unlocalized file should additionally be provided where no sequence or genomic location is known for a gene identifier.

  • Willingness to adopt Evidence Code identifiers

While the current primary annotation file format applies GO evidence codes to describe the category of support available in the cited reference, groups must support the Consortium's intent to transition to using Evidence Ontology identifiers in future annotation formats.
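One way a group might prepare for this transition is to keep a lookup from GO evidence codes to Evidence Ontology identifiers alongside its export code. The sketch below covers three codes only; the ECO identifiers shown are the ones commonly given as the default equivalents of IDA, IMP and IEA, and they are an assumption here that should be checked against the mapping maintained by the Evidence Ontology project before use. The function name is hypothetical.

```python
# Illustrative subset only (assumption): verify against the current
# Evidence Ontology mapping before relying on these identifiers.
GO_CODE_TO_ECO = {
    "IDA": "ECO:0000314",  # inferred from direct assay
    "IMP": "ECO:0000315",  # inferred from mutant phenotype
    "IEA": "ECO:0000501",  # inferred from electronic annotation
}


def to_eco(evidence_code):
    """Translate a GO evidence code to an ECO identifier, or fail loudly."""
    try:
        return GO_CODE_TO_ECO[evidence_code]
    except KeyError:
        raise ValueError(f"no ECO mapping recorded for {evidence_code!r}") from None
```

Raising on an unknown code, rather than passing it through, keeps unmapped evidence codes from silently reaching an ECO-based output file.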

  • Annotation update responsibility lies primarily with the submitter, but can revert to the GO Consortium

Curational groups do not need to commit to supplying regular updates to their annotations. In the case of non-recurring submissions, or submissions from annotation groups that are no longer active annotation providers, responsibility for corrections and updates will revert to the GOC. Please see the guidance provided here for further details.

Complete identifier mapping files

Why do we need mapping files?

  • For downloading sequences from UniProt/NCBI. These sequences are used for AmiGO BLAST and for phylogenetic inferencing (PAINT)
  • To search for GO annotations in AmiGO using other DB cross reference IDs (UniProt or NCBI)
  • The ID mapping helps with bookkeeping: tracking IDs and annotations, removing duplicates, etc.

In all cases where identifier mapping is carried out, groups must be aware that due to different database release cycles, sequence identifiers that should correspond with each other may not always display the same data.

Complete gp2protein file

The file must meet the gp2protein format specification

The gp2protein-mapping file must contain the full list of all protein-encoding genes in the respective organism (or community), including those proteins not annotated to GO.

The first column contains all gene or gene product identifiers (these are typically MOD-specific identifiers) and the second column contains mappings to canonical identifiers. Protein-coding genes must map to UniProtKB identifiers (Swiss-Prot in preference, otherwise TrEMBL). If identifiers are truly unavailable in UniProtKB then NCBI identifiers (NP_ and XP_) are permissible.

If an annotation group is fully satisfied with the identifier mapping from an external identifier type to UniProtKB accessions, as supplied by the UniProt Knowledgebase cross-references, then UniProtKB is willing to take on the responsibility of supplying the external id -> UniProtKB mapping to the GO Consortium.
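The two-column rule above lends itself to an automated check. The following is a minimal sketch, not the official gp2protein validator: it assumes the file is tab-separated, that UniProtKB targets carry a `UniProtKB:` prefix, and that NCBI protein accessions may appear with or without a database prefix. These conventions and all function names are assumptions; the gp2protein format specification is authoritative.

```python
import re

UNIPROT_PREFIX = "UniProtKB:"  # assumed prefix for UniProtKB targets
NCBI_PROTEIN = re.compile(r"^(NP|XP)_\d+(\.\d+)?$")  # NP_/XP_ accessions


def classify_protein_target(target):
    """Classify a second-column value as 'uniprotkb', 'ncbi' or 'invalid'."""
    if target.startswith(UNIPROT_PREFIX) and len(target) > len(UNIPROT_PREFIX):
        return "uniprotkb"
    # Accept an optional database prefix in front of the NCBI accession.
    accession = target.split(":")[-1]
    if NCBI_PROTEIN.match(accession):
        return "ncbi"
    return "invalid"


def check_gp2protein(lines):
    """Return (line_number, problem) tuples for rows that break the rules."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 2 or not fields[0]:
            problems.append((number, "expected two tab-separated columns"))
            continue
        if classify_protein_target(fields[1]) == "invalid":
            problems.append((number, f"unrecognised target {fields[1]!r}"))
    return problems
```

Counting how many rows classify as `ncbi` rather than `uniprotkb` also gives a quick view of where UniProtKB identifiers are still missing.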

Complete gp2rna file

If your annotation file includes ncRNAs, then your corresponding gp2rna file must include all ncRNA-encoding genes currently identified in the genome build including those ncRNAs not annotated to GO.

Functional ncRNA must map to NCBI identifiers (NR_ or XR_) if available; leave the mapping blank if unavailable.

gp2rna format
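The ncRNA rule can be checked in the same spirit. This sketch assumes only what is stated above: a target is acceptable if it is blank or an NCBI accession beginning with NR_ or XR_. The optional database prefix and the function name are assumptions, and the gp2rna format page remains the reference.

```python
import re

NCBI_RNA = re.compile(r"^(NR|XR)_\d+(\.\d+)?$")  # NR_/XR_ accessions


def rna_target_is_acceptable(target):
    """Return True for a blank target or an NR_/XR_ accession."""
    target = target.strip()
    if target == "":
        return True  # blank is allowed when no NCBI identifier exists
    # Accept an optional database prefix in front of the accession.
    return bool(NCBI_RNA.match(target.split(":")[-1]))
```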

Complete gp_unlocalized file

If your database supplies gene identifiers that have been manually curated from the literature, but for which no sequence or genomic location is known (such genes have been variously described as 'unlocalised genes', 'single heritable traits' or 'phenotypic orphans'), then you should additionally supply a complete gp_unlocalized file.

This file should contain a list of all the non-genome localized gene identifiers available, including those not annotated to GO.

gp_unlocalised file format

Macromolecular complexes

If the annotation file includes macromolecular complexes as the subject of the annotation then no corresponding entry is required for the gp2protein file – only gene or gene product mappings should be included.

Updates of identifier mapping files.

Groups must regularly update their gp2protein or gp2rna file (e.g. in response to UniProt-GOA feedback on inclusion of obsolete/secondary UniProtKB accessions in a group’s gp2protein, or obsoletion of NCBI identifiers).
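Acting on such feedback amounts to finding the rows whose target accession has been retired. A minimal sketch, assuming the mapping has already been loaded into a dictionary and that the feedback arrives as a plain set of obsolete or secondary accessions; both data shapes and the function name are hypothetical.

```python
def rows_needing_update(mapping, retired_accessions):
    """List gene identifiers whose mapped accession has been retired.

    mapping: dict of gene identifier -> mapped accession
    retired_accessions: set of obsolete or secondary accessions
    """
    return sorted(
        gene_id
        for gene_id, accession in mapping.items()
        if accession in retired_accessions
    )
```

The returned identifiers are the rows to remap before the next release of the gp2protein or gp2rna file.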

For groups who provide authoritative files for a species, or who are funded by the GO NIH grant, please see the following description of GO annotation activities by central GO Consortium members