Procedure for migration of protein annotations to Protein2GO

Summary of integration procedure

UniProt-GOA will run their syntax checker over the MOD's GAF (obtained from the GOC website, post-filtering) to determine how many annotations are suitable for direct incorporation into Protein2GO. Any errors that arise from the syntax check will be supplied to the MOD.
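To give a feel for the kind of problem a syntax check reports, here is a minimal sketch of a per-line GAF check. It is not the UniProt-GOA checker, whose rules are not described here. It only assumes the standard GAF 2.0 layout of 17 tab-separated columns, and the function name, messages and sample line are hypothetical.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")
DATE = re.compile(r"\d{8}")
ASPECTS = {"F", "P", "C"}


def check_gaf_line(line):
    """Return a list of problems found in one GAF 2.0 annotation line."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 17:
        return [f"expected 17 columns, found {len(cols)}"]
    problems = []
    if not cols[1]:
        problems.append("column 2 (DB Object ID) is empty")
    if not GO_ID.fullmatch(cols[4]):
        problems.append("column 5 is not a well-formed GO ID")
    if not cols[5]:
        problems.append("column 6 (reference) is empty")
    if cols[8] not in ASPECTS:
        problems.append("column 9 (aspect) is not F, P or C")
    if not DATE.fullmatch(cols[13]):
        problems.append("column 14 (date) is not YYYYMMDD")
    return problems


# Hypothetical annotation line, built column by column for readability.
example = "\t".join([
    "MOD", "0001", "geneA", "", "GO:0005515", "PMID:0000000", "IPI",
    "UniProtKB:Q4VCS5", "F", "", "", "protein", "taxon:0000", "20120901",
    "MOD", "", "",
])
print(check_gaf_line(example))
```

A line with no problems yields an empty list; anything else is the sort of feedback that would be passed back to the MOD.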

UniProt-GOA will work together with the MOD to ensure that as many annotations as possible are in the correct format for integration into our database. In the past this has involved many communications between the two groups to resolve formatting and syntax errors. Once as many of these issues as possible have been resolved, we will integrate the MOD's 'cleansed' file into our database and the MOD curators will be required to start using Protein2GO (P2GO) as their sole GO curation tool.

UniProt-GOA will carry out a final update from the latest MOD annotations just before the MOD curators start using P2GO. The MOD will then be made an internal curation source, after which the MOD curators should update their annotations in Protein2GO only. Once the MOD curators are using P2GO, we will stop updating from the external MOD GAF.

Considerations prior to integration

Protein2GO only allows curation to UniProt accession numbers (e.g. Q4VCS5). MOD identifiers will therefore be mapped to UniProt accessions, where possible, using the MOD's gp2protein file, which we upload nightly and assume to be the primary identifier mapping source. The GAF provided to the MODs by UniProt-GOA will contain annotations to UniProt accessions; the MOD will need to map these back to its own identifiers if required.
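The mapping step can be sketched roughly as follows. This assumes a gp2protein-style layout of two tab-separated columns (MOD identifier, then one or more target identifiers separated by ';', with UniProt targets prefixed 'UniProtKB:') and '!' comment lines. The function name and the sample identifiers other than Q4VCS5 are hypothetical, and this is not the UniProt-GOA loading code.

```python
def parse_gp2protein(lines):
    """Return {mod_id: [uniprot_accession, ...]} from gp2protein-style lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and '!' comment or header lines.
        if not line or line.startswith("!"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        mod_id, targets = parts[0], parts[1]
        # Keep only targets that point at UniProt accessions.
        accessions = [
            target.split(":", 1)[1]
            for target in targets.split(";")
            if target.startswith("UniProtKB:")
        ]
        if accessions:
            mapping[mod_id] = accessions
    return mapping


sample = [
    "!hypothetical header line\n",
    "MOD:0001\tUniProtKB:Q4VCS5\n",
    "MOD:0002\tNCBI_NP:NP_000000\n",  # no UniProt target, so it cannot be mapped
]
print(parse_gp2protein(sample))
```

Identifiers that do not appear in the returned mapping are the ones whose annotations could not be carried over to a UniProt accession.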

ISS annotations can be accepted from MODs only if they have an identifier in the ‘with’ field that conforms to the regular expressions we hold for many database identifiers. ISS, ISO and ISA annotations that have a PMID reference and no entry in the ‘with’ field will only be accepted if they are dated before April 30th 2008, which is when the GOC guidelines changed to make the ‘with’ field mandatory.
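The acceptance rule for similarity-based annotations can be expressed as a small decision function. This is a sketch under stated assumptions: the single identifier pattern below is a placeholder, not the real list of recognised regular expressions; applying the 'with' check to ISO and ISA as well as ISS is an assumption; and the function name and placeholder PMID are hypothetical.

```python
import re
from datetime import date

# Assumption: placeholder pattern only. The real list of recognised
# identifier patterns is held by UniProt-GOA and is not reproduced here.
WITH_PATTERNS = [re.compile(r"UniProtKB:[A-Z0-9]{6,10}")]
WITH_CUTOFF = date(2008, 4, 30)
SIMILARITY_CODES = {"ISS", "ISO", "ISA"}


def similarity_annotation_ok(evidence, reference, with_field, annotation_date):
    """Apply the 'with' field rule to an ISS, ISO or ISA annotation."""
    if evidence not in SIMILARITY_CODES:
        raise ValueError(f"not a similarity evidence code: {evidence}")
    if with_field:
        # Every identifier in the 'with' field must match a recognised pattern.
        identifiers = re.split(r"[|,]", with_field)
        return all(
            any(pattern.fullmatch(identifier) for pattern in WITH_PATTERNS)
            for identifier in identifiers
        )
    # An empty 'with' field is tolerated only for PMID-referenced
    # annotations dated before the cutoff.
    return reference.startswith("PMID:") and annotation_date < WITH_CUTOFF


print(similarity_annotation_ok("ISS", "PMID:0000000", "", date(2008, 4, 29)))
```

The comparison is strict, so an annotation dated exactly on the cutoff with an empty 'with' field is rejected.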

In cases of chromosome duplication, or where two genes map to the same protein, UniProt intends to demerge such entries so that eventually there will be a 1:1 correspondence between a gene and a protein for each species. In the meantime, the MOD should decide whether it wants us to merge the annotation sets for the MOD identifiers into one UniProt entry and remove the redundancy. When the UniProt file is converted to the MOD-formatted one, the MOD will need both gene identifiers to map to the same UniProt accession in its gp2protein file, so that all the annotations from that UniProt accession are supplied to both MOD identifiers.
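The merge and the later conversion back can be pictured with a toy example. This is an illustration only: annotations are reduced to (identifier, GO ID) pairs, and the function names and MOD identifiers are hypothetical.

```python
def merge_annotations(mod_annotations, mod_to_accession):
    """Collapse (mod_id, go_id) pairs onto UniProt accessions, dropping duplicates."""
    merged = set()
    for mod_id, go_id in mod_annotations:
        accession = mod_to_accession.get(mod_id)
        if accession is not None:
            merged.add((accession, go_id))
    return sorted(merged)


def fan_out(uniprot_annotations, mod_to_accession):
    """Give each (accession, go_id) pair to every MOD identifier mapped to that accession."""
    by_accession = {}
    for mod_id, accession in mod_to_accession.items():
        by_accession.setdefault(accession, []).append(mod_id)
    return [
        (mod_id, go_id)
        for accession, go_id in uniprot_annotations
        for mod_id in by_accession.get(accession, [])
    ]


# Two hypothetical gene identifiers that map to the same protein.
mod_to_accession = {"MOD:0001": "Q4VCS5", "MOD:0002": "Q4VCS5"}
merged = merge_annotations(
    [("MOD:0001", "GO:0005515"), ("MOD:0002", "GO:0005515")], mod_to_accession
)
print(merged)
print(fan_out(merged, mod_to_accession))
```

The two redundant gene-level annotations become one annotation on Q4VCS5, and the conversion back supplies it to both gene identifiers only because both appear in the mapping.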

Preserving original curator and timestamps - Can the original curator and original timestamp be preserved for annotations imported from a MOD?
If there are annotations made by a curator no longer working for that MOD, how should they be entered, and can they be edited at a later date?
If there are duplicate manual annotations from both the MOD and UniProt, how will that be handled?

Required changes to the Protein2GO database

UniProt-GOA will be responsible for making the necessary changes to the database in order to accept the MOD as an internal annotation source. These include:

1. Requesting Protein2GO accounts for the MOD curators

2. Adding in affiliation and editing rights data to the relevant database tables

3. Updating the MOD's source from 'external' to 'internal', which will include assigning stable annotation IDs to the source allowing the MOD's annotations to be audited

4. Updating tables to include the MOD in the relevant release statistics, which will be displayed on the UniProt-GOA website

Examples of annotations we will not accept

1. Those that use an identifier in Column 2 that cannot be mapped to a UniProt accession

2. Those that do not use a public reference, e.g. PMID, or an internal reference that cannot be translated to a GO_REF

Can we list the reference identifiers that are accepted, e.g. PMID, GO_REF? What about DOIs or other database identifiers, e.g. Agricola?

3. Those that use an identifier in the 'with' field that doesn't match the regular expressions that we recognise*

Is the list of regular expressions recognised readily available somewhere for groups to check what they use?

4. Any IEA annotations - as we already supply these using several methods

UniProtKB will not integrate annotations that have the following identifiers in the 'with' field, either because we already have pipelines that cover this data or because the identifiers are not resolvable accessions (i.e. CBS):

CBS:TMHMM

CBS:SignalP

Enzyme Commission Numbers

KEGG or KEGG_PATHWAY identifiers

MetaCyc identifiers
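The exclusion rules in this section can be summarised as a filter that returns the first reason an annotation would be rejected. This is a sketch, not the actual import code: the reference prefixes and the 'with' prefixes (in particular 'EC:', 'KEGG:' and 'MetaCyc:') are assumed spellings, the check against recognised regular expressions (item 3) is left out, and the function name and sample rows are hypothetical.

```python
import re

# Assumed database prefixes for the excluded 'with' identifiers listed above.
EXCLUDED_WITH_PREFIXES = ("CBS:", "EC:", "KEGG:", "KEGG_PATHWAY:", "MetaCyc:")
# Assumed set of public reference prefixes; the accepted list is an open question.
ACCEPTED_REFERENCE_PREFIXES = ("PMID:", "GO_REF:")


def rejection_reason(cols, mappable_ids):
    """Return why a GAF 2.0 row (list of 17 columns) is rejected, or None if it passes."""
    object_id, reference, evidence, with_field = cols[1], cols[5], cols[6], cols[7]
    if object_id not in mappable_ids:
        return "identifier cannot be mapped to a UniProt accession"
    if not any(r.startswith(ACCEPTED_REFERENCE_PREFIXES) for r in reference.split("|")):
        return "no public reference"
    if evidence == "IEA":
        return "IEA annotation"
    for identifier in re.split(r"[|,]", with_field):
        if identifier.startswith(EXCLUDED_WITH_PREFIXES):
            return "excluded identifier in 'with' field"
    return None


base = ["MOD", "0001", "geneA", "", "GO:0005634", "PMID:0000000", "IDA",
        "", "C", "", "", "protein", "taxon:0000", "20120901", "MOD", "", ""]
iea = base[:6] + ["IEA"] + base[7:]
cbs = base[:7] + ["CBS:TMHMM"] + base[8:]
for row in (base, iea, cbs):
    print(rejection_reason(row, {"0001"}))
```

Returning a reason rather than a plain yes or no makes it straightforward to report rejections back to the MOD.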

Using Protein2GO

Protein2GO only allows curation to UniProt accession numbers (e.g. Q4VCS5).

UniProt only allows limited use of the ISS evidence code, e.g. we do not make ISS annotations from InterPro2GO mappings, as we display the InterPro2GO mappings as IEA-evidenced annotations. We only make ISS annotations based on a curator's or author's decision of sequence/structural similarity. Once the MOD curators start using P2GO, they will only be able to make new ISS annotations with a UniProt accession in the ‘with’ field. The MOD needs to check whether GO_REF:0000024 (http://www.geneontology.org/cgi-bin/references.cgi), which we use for our ISS annotations based on curator judgment, sufficiently covers its annotation practice.

Curators do not deal manually with IEA annotations; however, they are displayed in Protein2GO. If there is a problem with an IEA annotation, please contact the annotation provider directly, e.g. InterPro for InterPro2GO (via SourceForge tracker), goa@ebi.ac.uk for UniProt keyword and subcellular2GO, etc.

UniProt-GOA will require the names and email addresses of MOD curators that will use Protein2GO in order to set up a login for them.

Training on the use of Protein2GO will be provided by one of the UniProt full-time GO curators using Webex or equivalent. There will also be an opportunity to try out P2GO in a test environment before curators move to using P2GO full-time.

UniProtKB maintains extensive documentation on how to use the tool. Can we provide the link here?

Release of annotations

Once the MOD's annotations have been integrated into our database, UniProt-GOA will provide the MOD with a file in GAF 2.0 format containing the entire set of GO annotations that match the taxon identifier(s) the MOD is responsible for, as well as any additional annotations the MOD has created for other taxa. When importing the annotations back into its own database, the MOD can either identify the updates in this set from changes in the date attached to each annotation (the date indicates when the annotation was last edited) or carry out a full delete and reload of its GO annotation set.
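For the date-based option, a MOD-side import might select only the rows edited since its previous load. The sketch below assumes the standard GAF 2.0 layout, where column 14 holds the date as YYYYMMDD; the function name and sample rows are hypothetical.

```python
from datetime import date, datetime


def changed_since(gaf_lines, last_import):
    """Return the GAF rows (as column lists) whose date is later than last_import."""
    changed = []
    for line in gaf_lines:
        # Skip '!' header lines and anything too short to carry a date column.
        if line.startswith("!"):
            continue
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 14:
            continue
        edited = datetime.strptime(cols[13], "%Y%m%d").date()
        if edited > last_import:
            changed.append(cols)
    return changed


def make_line(go_id, edited):
    """Build a hypothetical 17-column GAF line for the example."""
    return "\t".join(["UniProtKB", "Q4VCS5", "geneA", "", go_id, "PMID:0000000",
                      "IDA", "", "C", "", "", "protein", "taxon:0000", edited,
                      "MOD", "", ""])


lines = ["!gaf-version: 2.0", make_line("GO:0005634", "20120801"),
         make_line("GO:0005737", "20120910")]
for cols in changed_since(lines, date(2012, 9, 1)):
    print(cols[4], cols[13])
```

A date filter like this cannot see annotations that have been deleted since the last load, which is one reason a full delete and reload may be the simpler choice.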

Any annotations that we cannot accept from the MOD, but which the MOD wants to keep can be appended to the supplied GAF by the MOD, e.g. annotations to non-coding RNAs, annotations using internal references that aren't mapped to a GO_REF, IEA annotations, etc. UniProt-GOA will not store the annotations that are excluded, so it is up to the MOD to keep a record of these.

If required, we can provide IEA annotations, specific for the taxon(s) the MOD is responsible for, in the GAF as well as high-quality protein binding annotations that we import from IntAct. The electronic annotations we provide will not be filtered in any way, so the MOD will need to perform any filtering steps as it sees fit.

An updated MOD file will be created every two weeks: once as part of our main four-weekly release, and once in a supplemental release that occurs two weeks later. All MOD-specific GAF files will be available to download from date-named folders located on the UniProt-GOA FTP site.

Some MODs update their databases on a nightly basis and would therefore like to have more frequent data releases. Is that possible?

The main UniProt-GOA releases are timed to coincide with the UniProtKB releases. With each main release, a page showing a breakdown of the MOD file's annotation statistics will be displayed on the UniProt-GOA website.

Annotations entered into Protein2GO will be made public in the QuickGO browser weekly (each Sunday) and in the GAF file releases fortnightly.

Each MOD will be responsible for supplying their final GAF to the GO Consortium.

Continuing quality assurance

UniProt-GOA runs a number of sanity and syntax checks over the Protein2GO database. Each group contributing annotations to Protein2GO will be sent an email detailing the annotations that have not passed the checks. These regular emails will ask annotation owners to make the requested updates in the way that they feel is most appropriate. The idea behind this is that groups retain control over the dataset exported by the UniProt-GOA pipeline, and that UniProt curators should make few or no manual changes to another group's data. Examples of checks include secondary or obsolete GO terms and secondary UniProt accessions; further checks are likely to be added in the future.
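As an illustration of how such a check could feed the per-group emails, the sketch below flags obsolete and secondary GO terms and groups the failures by contributing group. The data structures, function name and GO identifiers are made up for the example and do not reflect the Protein2GO schema.

```python
from collections import defaultdict


def failed_checks(annotations, obsolete_go_ids, secondary_to_primary):
    """Group annotations that fail the GO term checks by contributing group."""
    report = defaultdict(list)
    for group, accession, go_id in annotations:
        if go_id in obsolete_go_ids:
            report[group].append((accession, go_id, "obsolete GO term"))
        elif go_id in secondary_to_primary:
            primary = secondary_to_primary[go_id]
            report[group].append(
                (accession, go_id, f"secondary GO term, primary is {primary}")
            )
    return dict(report)


# Made-up identifiers; real checks would draw on the current ontology.
annotations = [
    ("MOD_A", "Q4VCS5", "GO:9999991"),
    ("MOD_A", "Q4VCS5", "GO:9999993"),
    ("MOD_B", "Q4VCS5", "GO:9999992"),
]
report = failed_checks(
    annotations,
    obsolete_go_ids={"GO:9999991"},
    secondary_to_primary={"GO:9999992": "GO:9999990"},
)
for group, failures in report.items():
    print(group, failures)
```

Each key in the report corresponds to one email, so a group only sees failures in its own annotations and decides for itself how to fix them.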

What next?

UniProt-GOA will load up the MOD's annotations and we will review which ones are rejected. We will supply the MOD with feedback on the rejections and what can be done to adjust the annotation so that it will be accepted into Protein2GO.

Annotation Questions

The literature for my organism rarely cites a UniProtKB identifier, but instead uses common gene names or synonyms. Can I use these common names as an entry point for annotation in Protein2GO?

My group currently annotates to gene identifiers. If the experiment I wish to annotate does not directly involve a protein product (e.g., genetic ablation leads to a particular phenotype), what UniProtKB identifier should I use and what statement will that make about the experiment?

The paper I'm annotating describes experimental results for a protein, but there isn't enough information in the paper to determine exactly which protein isoform was used. What UniProtKB identifier should I choose and what statement will that make about the experiment?

My group annotates to ncRNAs as well as uncloned loci. How will these annotations be handled?

We like the additional features of the Protein2GO tool and can see the benefits of a common annotation framework, but our GO curation efforts are an integral part of our overall curation process. Is there a way to integrate the Protein2GO tool into our existing framework? Can the Protein2GO tool write annotations to our local database as well as the UniProtKB database?

What external ontologies can be used for Annotation Extensions (Column 16)? If our group has organism-specific ontologies, e.g. anatomy or life-stage ontologies, that we would like to use in Column 16, can they be included in Protein2GO, and in what file format should they be supplied?

[UniProt-GOA September 2012]