Procedure for migration of protein annotations to Protein2GO
Summary of integration procedure
UniProt-GOA will run their syntax checker over the MOD's GAF (obtained from the GOC website, post-filtering) to determine how many annotations are suitable for direct incorporation into Protein2GO. Any errors that arise from the syntax check will be supplied to the MOD.
UniProt-GOA will work together with the MOD to ensure that as many annotations as possible are in the correct format for integration into our database. In the past this has involved many communications between the two groups to resolve formatting and syntax errors. Once as many of these issues as possible have been resolved, we will integrate the MOD's 'cleansed' file into our database and the MOD curators will be required to start using P2GO as their sole GO curation tool.
UniProt-GOA will make an update of all the latest MOD annotations just before the MOD curators start using P2GO. The MOD will then be made an internal curation source, after which the MOD curators should make updates to their annotations in Protein2GO only. Once the MOD curators are using P2GO, we will stop updating from the external MOD GAF.
Considerations prior to integration
Protein2GO only allows curation to UniProt accession numbers (e.g. Q4VCS5). MOD identifiers will therefore be mapped to UniProt accessions, where possible, using the MOD's gp2protein file, which we upload nightly and assume to be the primary identifier mapping source. The GAF provided to the MODs by UniProt-GOA will have annotations to UniProt accessions; the MOD will need to map back to their own identifiers if they so require.
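The two-way mapping described above can be sketched as follows. This is a minimal illustration, not the UniProt-GOA implementation: it assumes gp2protein lines are tab-separated, with a MOD identifier in the first column and one or more 'UniProtKB:'-prefixed accessions (separated by ';') in the second, and that comment lines start with '!'. The function name and the sample identifiers pairing are hypothetical.

```python
from collections import defaultdict


def parse_gp2protein(lines):
    """Build MOD->UniProt and UniProt->MOD lookups from gp2protein-style lines.

    Assumed layout (not confirmed by this document): two tab-separated
    columns, e.g. "WB:WBGene00000865<TAB>UniProtKB:Q4VCS5".
    """
    mod_to_uniprot = defaultdict(set)
    uniprot_to_mod = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # no mapping target on this line
        mod_id, targets = parts[0], parts[1]
        for target in targets.split(";"):
            db, _, accession = target.strip().partition(":")
            if db == "UniProtKB" and accession:
                mod_to_uniprot[mod_id].add(accession)
                uniprot_to_mod[accession].add(mod_id)
    # Return plain dicts so that lookups of unknown keys do not create entries.
    return dict(mod_to_uniprot), dict(uniprot_to_mod)
```

The reverse lookup (UniProt accession to MOD identifiers) is what a MOD would use to map the supplied GAF back to its own identifiers; identifiers with no UniProtKB target simply do not appear in the forward lookup.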
ISS annotations can be accepted from MODs only if they have an identifier in the ‘with’ field that conforms to the regular expressions we have for many database identifiers. ISS, ISO and ISA annotations that have a PMID reference and no entry in the ‘with’ field will only be accepted if their date is before April 30th 2008, which is when the GOC guidelines were changed to make the ‘with’ field mandatory.
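As an illustration of the rule above, the following sketch checks a single ISS/ISO/ISA annotation. It rests on several assumptions: the list of accepted 'with' patterns is a stand-in containing only a UniProtKB accession pattern (the real list of regular expressions is not reproduced here), multiple 'with' identifiers are assumed to be separated by '|', and the function and constant names are hypothetical.

```python
import re
from datetime import date

# Date from which an empty 'with' field is no longer accepted (from the text).
WITH_MANDATORY_FROM = date(2008, 4, 30)
SIMILARITY_CODES = {"ISS", "ISO", "ISA"}

# Stand-in pattern list: UniProtKB accessions only, optionally with an
# isoform suffix. The real set of recognised patterns is much larger.
UNIPROT_ACC = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)
WITH_PATTERNS = [re.compile(r"UniProtKB:" + UNIPROT_ACC + r"(?:-\d+)?")]


def accept_similarity_annotation(evidence, reference, with_field, annotation_date):
    """Return True if a sequence-similarity annotation passes the rule."""
    if evidence not in SIMILARITY_CODES:
        raise ValueError("rule only applies to ISS, ISO and ISA annotations")
    if with_field:
        # Every identifier in the 'with' field must match a known pattern.
        return all(
            any(p.fullmatch(identifier) for p in WITH_PATTERNS)
            for identifier in with_field.split("|")
        )
    # Empty 'with' field: only older, PMID-referenced annotations pass.
    return reference.startswith("PMID:") and annotation_date < WITH_MANDATORY_FROM
```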
In the case of chromosome duplication, or two genes that map to the same protein, UniProt intends to demerge such entries so that eventually there will be a 1:1 correspondence between a gene and protein for each species. In the meantime, however, the MOD should decide whether they want us to merge the annotation sets for the MOD identifiers into one UniProt entry and remove the redundancy. When the UniProt file is converted to the MOD-formatted one, the MOD will need to have both gene identifiers mapping to the same UniProt accession in their gp2protein file in order to obtain annotations for both genes, so that all the annotations from one UniProt accession are supplied to both MOD identifiers.
Regarding the preservation of the original curator and timestamps, the timestamp will be preserved since this is one of the fields in the gene association file. If groups would like to keep the name of the original curator, then we ask that they supply us with one annotation file (in GAF or GPAD format) per curator. The file name should be in the format ‘gene_association.mod.Firstname_Lastname’, e.g. gene_association.wb.Joe_Bloggs
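A short sketch of how the per-curator file name convention could be parsed is given below. The regular expression and function name are hypothetical; they simply encode the ‘gene_association.mod.Firstname_Lastname’ format stated above and make no claim about how UniProt-GOA processes these files.

```python
import re

# gene_association.<mod>.<Firstname>_<Lastname>
FILENAME_RE = re.compile(
    r"^gene_association\.(?P<mod>[A-Za-z0-9]+)\.(?P<first>[^._]+)_(?P<last>[^.]+)$"
)


def curator_from_filename(name):
    """Return (mod, curator name) for a per-curator file, or None if no match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return match.group("mod"), f"{match.group('first')} {match.group('last')}"
```

For the example in the text, curator_from_filename("gene_association.wb.Joe_Bloggs") gives ("wb", "Joe Bloggs"); a file name without a curator suffix gives None.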
UniProt will create Protein2GO user accounts for each curator, whether actively curating or not, to enable us to display the curator name in Protein2GO.
For annotations made by a curator no longer working for the group, UniProt will need to be supplied with a separate annotation file for each curator, as detailed above. Once present in Protein2GO, these annotations will be editable by curators that have the same source as the original curator, e.g. a WB curator can edit all WB-sourced annotations.
Required changes to the Protein2GO database
UniProt-GOA will be responsible for making the necessary changes to the database in order to accept the MOD as an internal annotation source. These include:
1. Requesting Protein2GO accounts for the MOD curators
2. Adding in affiliation and editing rights data to the relevant database tables
3. Updating the MOD's source from 'external' to 'internal', which will include assigning stable annotation IDs to the source allowing the MOD's annotations to be audited
4. Updating tables to include the MOD in the relevant release statistics, which will be displayed on the UniProt-GOA website
Examples of annotations we will not accept
1. Those that use an identifier in Column 2 that cannot be mapped to a UniProt accession
2. Those that do not use a public reference, e.g. PMID, or that use an internal reference that cannot be translated to a GO_REF. Recognised public references include: PMID, DOI, GO_REF, REACT_ (others will be considered if they are (i) commonly used, (ii) not referenced by any of the public references already listed, and (iii) publicly accessible). Any non-public reference needs to be provided with a GO_REF equivalent (this is provided by the external_accession: tags supplied under GO_REFs in http://www.geneontology.org/doc/GO.references). PAINT_REFs are displayed as GO_REF:0000033.
3. Those that use an identifier in the 'with' field that doesn't match the regular expressions that we recognise (see: Regular Expressions accepted by UniProt for use in the 'with' field).
UniProtKB will not integrate annotations that have the following identifiers in the 'with' field, either because we already have pipelines that cover this data or because the identifiers are not resolvable accessions (i.e. CBS):
Enzyme Commission Numbers
KEGG or KEGG_PATHWAY identifiers
4. Any IEA annotations - as we already supply these using several methods
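The exclusion criteria listed above could be pre-checked by a MOD with something like the sketch below. It is illustrative only: the dictionary keys, the prefix spellings (for example how REACT_ references or EC, KEGG and CBS identifiers are written in a real GAF) and the function name are assumptions, and the check against the recognised 'with' regular expressions is omitted because that list is not reproduced here.

```python
# Assumed prefix spellings; adjust to match the identifiers in your own GAF.
PUBLIC_REFERENCE_PREFIXES = ("PMID:", "DOI:", "GO_REF:", "REACT_")
EXCLUDED_WITH_PREFIXES = ("EC:", "KEGG:", "KEGG_PATHWAY:", "CBS:")


def rejection_reasons(annotation, mappable_ids):
    """List the reasons an annotation would not be accepted (empty if none).

    `annotation` is a dict with hypothetical keys: object_id, reference,
    with, evidence. `mappable_ids` is the set of Column 2 identifiers that
    have a UniProt accession in the gp2protein file.
    """
    reasons = []
    if annotation["object_id"] not in mappable_ids:
        reasons.append("identifier cannot be mapped to a UniProt accession")
    if not annotation["reference"].startswith(PUBLIC_REFERENCE_PREFIXES):
        reasons.append("reference is not a recognised public reference")
    with_ids = [w for w in annotation["with"].split("|") if w]
    if any(w.startswith(EXCLUDED_WITH_PREFIXES) for w in with_ids):
        reasons.append("'with' identifier type is not integrated")
    if annotation["evidence"] == "IEA":
        reasons.append("IEA annotations are already supplied by UniProt-GOA")
    return reasons
```

Returning a list of reasons rather than a single yes/no makes it easier to report every problem with an annotation in one pass.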
Protein2GO only allows curation to UniProt accession numbers (e.g. Q4VCS5).
UniProt only allows limited use of the ISS evidence code, e.g. we do not make ISS annotations from InterPro2GO mappings as we display the InterPro2GO mappings as IEA-evidenced annotations. We only make ISS annotations based on a curator's or author's decision of sequence/structural similarity. Once the MOD curators start using P2GO, they will only be able to make new ISS annotations with a UniProt accession in the ‘with’ field. The MOD needs to check whether GO_REF:0000024 (http://www.geneontology.org/cgi-bin/references.cgi), which we use for our ISS annotations based on curator judgment, sufficiently covers their annotation practice.
Curators do not deal manually with IEA annotations; however, they are displayed in Protein2GO. If there is a problem with an IEA annotation, please contact the annotation provider directly, e.g. InterPro for InterPro2GO (via SourceForge tracker), firstname.lastname@example.org for UniProt keyword and subcellular2GO etc.
UniProt-GOA will require the names and email addresses of MOD curators that will use Protein2GO in order to set up a login for them.
Training on the use of Protein2GO will be provided by one of the UniProt full-time GO curators using Webex or equivalent. There will also be an opportunity to try out Protein2GO in a test environment before curators move to using it full-time.
Please see also the Protein2GO manual
For each curator we are able to specify whether their annotations should be made publicly available or not. For instance, curators who are undergoing training in GO curation can be assigned a private source, which means their annotations will not be publicly released until they are approved by a checker.
So that we know what privileges to assign to any new curators you tell us about, you will need to supply us with the following information for each new curator:
1. Name
2. Email address
3. Private or public source (a)
4. Checker email (b)
(a) If the curator is trained in GO annotation and you are happy for their annotations to be made public immediately, choose 'public'. If the curator is new and learning GO annotation and you are actively checking their annotations, you can choose 'private' and we will not publicly release their annotations. The annotations should be checked by their trainer, who should update the source of each annotation to the group's public source when it has been approved.
(b) Fill this in with the trainer's email address if the curator is in training. If not completed, the default email address is the contact address of the annotating group.
Release of annotations
Once the MOD's annotations have been integrated into our database, UniProt-GOA will provide the MOD with a file in the GAF2.0 format containing the entire set of GO annotations that match the taxon identifier(s) the MOD is responsible for, as well as any additional annotations the MOD has created to other taxa. When importing the annotations back into their own database, the MOD can either note the updates made in this set from the changes in the date attached to each annotation (dates indicate when the last edit was made to the annotation) or carry out a full delete and reload of their GO annotation set.
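The date-based option described above can be sketched as a filter over the supplied GAF. This assumes the standard GAF 2.0 layout, in which the date is the 14th tab-separated column in YYYYMMDD form and header lines start with '!'; the function name and the idea of tracking a "last import" date are hypothetical.

```python
from datetime import datetime

DATE_COLUMN = 13  # 14th column of a GAF 2.0 line, zero-based index


def updated_since(gaf_lines, last_import):
    """Return the column lists of annotations edited after `last_import`."""
    updated = []
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= DATE_COLUMN:
            continue  # malformed line: no date column present
        edited = datetime.strptime(cols[DATE_COLUMN], "%Y%m%d").date()
        if edited > last_import:
            updated.append(cols)
    return updated
```

Note that a filter of this kind only finds annotations that are new or edited; annotations that were deleted in Protein2GO are simply absent from the file, which is one reason a MOD might prefer the full delete and reload.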
Any annotations that we cannot accept from the MOD, but which the MOD wants to keep can be appended to the supplied GAF by the MOD, e.g. annotations to non-coding RNAs, annotations using internal references that aren't mapped to a GO_REF, IEA annotations, etc. UniProt-GOA will not store the annotations that are excluded, so it is up to the MOD to keep a record of these.
If required, we can provide IEA annotations, specific for the taxon(s) the MOD is responsible for, in the GAF as well as high-quality protein binding annotations that we import from IntAct. The electronic annotations we provide will not be filtered in any way, so the MOD will need to perform any filtering steps as it sees fit.
An updated MOD file will be created every two weeks: once as part of our main four-weekly release, and once during a supplemental release that occurs two weeks later. All MOD-specific GAF files will be available to download from date-named folders located on the UniProt-GOA FTP site.
The main UniProt-GOA releases are timed to coincide with the UniProtKB releases and at this time a page displaying a breakdown of the MOD file's annotation statistics will be displayed on the UniProt-GOA website.
Annotations entered into Protein2GO will be made public in the QuickGO browser weekly (each Sunday) and in the GAF file releases fortnightly.
Each MOD will be responsible for supplying their final GAF to the GO Consortium.
Continuing quality assurance
UniProt-GOA run a number of sanity and syntax checks over the Protein2GO database. For each group contributing annotations to Protein2GO an email will be sent detailing the annotations that have not passed checks. These regular emails will ask that annotation owners make the requested updates in the way that they feel is most appropriate. The idea behind this is that groups will still have control over the dataset being exported by the UniProt-GOA pipeline and that UniProt curators should carry out little or no manual changes to another group's data. Examples of checks include those for secondary/obsolete GO terms and secondary UniProt accessions; further checks are likely to be added in the future.
UniProt-GOA will load up the MOD's annotations and we will review which ones are rejected. We will supply the MOD with feedback on the rejections and what can be done to adjust the annotation so that it will be accepted into Protein2GO.
Frequently Asked Questions
Q1. If there are duplicate manual annotations from both the MOD and UniProt, how will that be handled?
A1. The UniProt-GOA database can handle duplicate annotations that differ only in source; we will therefore display duplicate annotations. We will be supplying all annotations to the species indicated in the file, regardless of which group created the annotation, so it would be up to each group to decide which they want to keep. However, if annotations from other groups are retained, attribution of these annotations must stay as the original source.
Q2. Some MODs update their databases on a nightly basis and would therefore like to have more frequent data releases. Is that possible?
A2. The default for supplying annotation files to groups is once every two weeks. If any group would like their file more often, we are happy to consider this within reason. There are certain times of the week when it is not possible to generate files (including at weekends) due to scheduling conflicts with other data import/export pipelines.
Q3. The literature for my organism rarely cites a UniProtKB identifier, but instead uses common gene names or synonyms. Can I use these common names as an entry point for annotation in Protein2GO?
A3. You can use a MOD identifier, e.g. WB:WBGene00000865, or gene name or synonym, e.g. ace1, to search for UniProt accessions in Protein2GO.
Q4. My group currently annotates to gene identifiers. If the experiment I wish to annotate does not directly involve a protein product (e.g., genetic ablation leads to a particular phenotype), what UniProtKB identifier should I use and what statement will that make about the experiment?
A4. Use the UniProtKB accession that is present in your gp2protein file for that gene identifier. If there is more than one, check if any are reviewed (i.e. Swiss-Prot entries as opposed to TrEMBL). We would suggest annotating to reviewed entries where possible; if all of them are unreviewed (TrEMBL), then you should consider annotating to all of the UniProtKB accessions. There are a number of possible outcomes when an unreviewed entry is reviewed and becomes a Swiss-Prot entry; the consequences for the manual GO annotations are described for each one below:
i) a TrEMBL entry is reviewed and becomes a Swiss-Prot entry: all annotations are automatically carried over
ii) a TrEMBL entry is split into two Swiss-Prot entries: an alert is sent and a curator must determine on which entry the annotations belong
iii) more than one TrEMBL entry is merged into a Swiss-Prot entry: all annotations are kept on the Swiss-Prot entry; if duplicate annotations exist, an alert is generated and a curator must delete one of the annotations and move the other to the Swiss-Prot entry
During the process of reviewing UniProt entries, curators do not change manual GO annotations assigned to a TrEMBL entry unless they look incorrect.
In the case where multiple TrEMBL entries exist for a gene product and another TrEMBL accession is added for that gene product, no manual annotations are automatically propagated to the new accession.
If you would like to curate two separate TrEMBL entries that are known isoforms of a protein, then you should contact UniProt (email@example.com) requesting the two entries be merged into one Swiss-Prot accession and for isoform IDs to be assigned to each sequence.
Regarding what statement is being made about the experiment: it would indicate that this protein is involved in/located in the GO term chosen. This is no more or less accurate than saying your gene product is involved in the GO term chosen, which is what you are currently doing. Care should be taken when annotating mutant phenotypes in any case.
Note on identifiers: Different annotating groups use different identifiers but no inferences should be made as to whether the annotation concerns a gene, RNA or protein.
Q5. The paper I'm annotating describes experimental results for a protein, but there isn't enough information in the paper to determine exactly which protein isoform was used. What UniProtKB identifier should I choose and what statement will that make about the experiment?
A5. If you are unsure of the particular isoform used, then annotate to the top-level accession, e.g. Q4VCS5, rather than an isoform, e.g. Q4VCS5-1. This states that Q4VCS5 has this function; it might be one isoform or all isoforms, but this is not known from the evidence provided. If an isoform is stated in the paper, then you should always annotate to the specific isoform accession, e.g. Q4VCS5-1, which means only this isoform has this function.
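The distinction between a top-level accession and an isoform accession can be illustrated with two small helpers. These are hypothetical functions written for this example; they assume only the 'ACCESSION-N' isoform form shown above (e.g. Q4VCS5-1).

```python
def split_accession(accession):
    """Split 'Q4VCS5-1' into ('Q4VCS5', 1); a top-level accession gives None."""
    base, sep, isoform = accession.partition("-")
    if sep and not isoform.isdigit():
        raise ValueError(f"unexpected isoform suffix in {accession!r}")
    return base, (int(isoform) if sep else None)


def annotation_target(top_level, isoform_number=None):
    """Choose the accession to annotate to.

    With no isoform stated in the paper, annotate to the top-level accession;
    otherwise annotate to the specific isoform accession.
    """
    if isoform_number is None:
        return top_level
    return f"{top_level}-{isoform_number}"
```

For example, annotation_target("Q4VCS5") returns "Q4VCS5", while annotation_target("Q4VCS5", 1) returns "Q4VCS5-1".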
Q6. My group annotates to ncRNAs as well as uncloned loci. How will these annotations be handled?
A6. These annotations should be appended to the GAF that UniProt supplies you with before you submit to the GO Consortium. You should also supply the GOC with either a gp_unlocalized file which should contain all the non-genome localized identifiers available in your database, including those not annotated to GO (http://wiki.geneontology.org/index.php?title=Gp_unlocalized_file) or a gp2rna file, which must include all ncRNA-encoding genes currently in the genome build including those not annotated to GO (http://wiki.geneontology.org/index.php/Gp2rna_file).
Q7. We like the additional features of the Protein2GO tool and can see the benefits of a common annotation framework, but our GO curation efforts are an integral part of our overall curation process. (i) Is there a way to integrate the Protein2GO tool into our existing framework? (ii) Can the Protein2GO tool write annotations to our local database as well as the UniProtKB database?
A7. (i) We have a webservice mechanism whereby an external tool can send annotations to Protein2GO; however, the annotations will need to be associated with a valid Protein2GO account. Please contact firstname.lastname@example.org for more details. (ii) The current Protein2GO architecture does not support writing to any database other than UniProt.
Q8. What external ontologies can be used for Annotation Extensions (Column 16)? If our group has organism-specific ontologies, e.g. anatomy or lifestage ontologies, we'd like to use for Column 16, can they be included in Protein2GO and in what file format should they be supplied?
A8. External ontologies and vocabularies that are allowable in annotation extensions are described in the GO annotation extension relations file, in the union_of tags of the ENTITY_UNION stanzas.
The specific identifiers used should be present as an entity_type in the GO.xrf_abbs file (http://www.geneontology.org/doc/GO.xrf_abbs). These may be extended to other ontologies/vocabularies that are needed by curators by discussing with those who maintain this file (Midori, Val, Rachael, Chris). Currently, Protein2GO allows CL, PO, UniProtKB, PomBase, CHEBI, Ensembl, GO, IntAct, PR, SO, UBERON, but this can be extended if curators wish to use different types. However, since we use OLS (http://www.ebi.ac.uk/ontology-lookup/) to look up terms from other ontologies, the requested ontology must be supported by OLS.