GOA FAQ

From GO Wiki

What is GOA?

The GOA project aims to provide high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and International Protein Index (IPI), and is a central dataset for other major multi-species databases, such as Ensembl and NCBI.

GOA has been a member of the GO Consortium <http://www.geneontology.org/> since 2001, and is responsible for the integration and release of GO annotations to the human, chicken and cow proteomes. In 2006 GOA became a central participant in the new GOC Reference Genome Annotation project and is committed to the comprehensive annotation of a set of disease-related proteins in human.

Because of the multi-species nature of the UniProtKB, GOA also assists in the curation of another 120,000 species. This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups (e.g. LifeDB). This effort ensures that the GOA dataset remains a key reference and a comprehensive source of GO annotation for all species.

Why are InterPro2GO mappings not updated with GOA releases?

GOA is updated in accordance with the latest data released by its core databases (UniProtKB, IPI, InterPro and Ensembl) as well as mappings of SWISS-PROT Keywords, InterPro and Enzyme Commission (EC) terms to GO. Each of GOA's core databases produces its own releases; for example, InterPro has dependencies on the member databases of InterPro. InterPro2GO is updated at regular intervals, but not always in step with the monthly schedule of GOA releases.

What methods are used for automatic annotation in GOA?

At present, GOA has five independent methods for automatic annotation.

Four of these methods use mappings of concepts from external database systems that have been manually indexed to equivalent GO terms. The mappings used by GOA are of InterPro signatures, Enzyme Commission numbers, Swiss-Prot keywords and HAMAP families to GO terms. Further information on mapping files can be found at: http://www.geneontology.org/GO.indices.shtml.

In addition, GOA has recently provided a new electronic annotation technique in collaboration with the Ensembl group. Using the gene orthology obtained from the Ensembl Compara pipeline, GO terms from a source species have been projected onto one or more target species. This method has provided over 30,000 annotations to the human, mouse, rat, chicken, dog, bovine and Anopheles gambiae proteomes. Only one-to-one and apparent one-to-one orthologies are used, and only manually annotated GO terms with an evidence type of IDA, IEP, IGI, IMP or IPI are projected.

Further detail on these annotation techniques and their format in the GOA gene association file is described in the GOA readme, available at: http://www.ebi.ac.uk/GOA/goaHelp.html
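The orthology-based projection rule in this answer can be sketched as a small filter. This is an illustrative sketch only, not the Ensembl Compara or GOA pipeline: the function name, the tuple layouts, the protein identifiers and the orthology type labels are all assumptions made for the example. Only the list of evidence codes comes from the text above.

```python
# Evidence codes that the FAQ says are eligible for projection.
PROJECTABLE_EVIDENCE = {"IDA", "IEP", "IGI", "IMP", "IPI"}

# Labels for the accepted orthology classes (assumed spellings).
ACCEPTED_ORTHOLOGY = {"one2one", "apparent_one2one"}


def project_annotations(source_annotations, orthologs):
    """Project GO terms from source proteins onto their orthologs.

    source_annotations: list of (source_protein, go_id, evidence_code)
    orthologs: list of (source_protein, target_protein, orthology_type)
    Returns a sorted list of (target_protein, go_id, source_protein).
    """
    targets_by_source = {}
    for source, target, orthology_type in orthologs:
        if orthology_type in ACCEPTED_ORTHOLOGY:
            targets_by_source.setdefault(source, []).append(target)

    projected = set()
    for source, go_id, evidence in source_annotations:
        if evidence not in PROJECTABLE_EVIDENCE:
            continue  # skip annotations without an eligible evidence code
        for target in targets_by_source.get(source, []):
            projected.add((target, go_id, source))
    return sorted(projected)


# Hypothetical identifiers, used only to exercise the filter.
annotations = [
    ("mouse_A", "GO:0019825", "IDA"),
    ("mouse_A", "GO:0005488", "IEA"),  # electronic, so not projected
    ("mouse_B", "GO:0020037", "IMP"),
]
orthologs = [
    ("mouse_A", "human_A", "one2one"),
    ("mouse_B", "human_B", "many2many"),  # not one-to-one, so ignored
]
print(project_annotations(annotations, orthologs))
```

In this toy input only the IDA annotation on mouse_A reaches a target, because the other source annotation is electronic and the other orthology is not one-to-one.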

Is the GOA-Human association file a subset of the GOA UniProt file? What are the differences?

GOA-Human is not a subset of the GOA UniProt file.


The GOA UniProt gene association file contains all manual and electronic annotations that GOA has assigned to UniProtKB entries. This dataset contains annotations to more than 120,000 different species and is redundant for electronic annotations where two different electronic methods have assigned the same or less granular GO term.

The IPI project (http://www.ebi.ac.uk/IPI/) provides the GOA group with minimally redundant yet maximally complete sets of proteins for a number of species, including the human proteome. These sets are assembled from protein sequence information taken from UniProtKB, RefSeq, Ensembl, TAIR, H-InvDB and Vega.

While entries in the UniProt Knowledgebase (Swiss-Prot and TrEMBL) representing proteins with identical sequences are merged, a low level of semantic redundancy remains (where different sequences represent the same underlying entity, perhaps due to biological variation or sequencing error). Using IPI, these surplus entries can be filtered out to produce a non-redundant data set.

For the GO terms in these IPI files, our aim has been to remove those electronic annotations that were created by the same technique and that predict the same or less granular GO terms.

An example would be for annotations created by the InterPro2GO mapping technique. In the redundant UniProt gene association file, there are three annotations to binding terms for protein P02144:

   UniProt P02144 MYG_HUMAN GO:0005488 GOA:interpro IEA InterPro:IPR000971 F IPI00217493 protein taxon:9606 20060125 UniProt
   UniProt P02144 MYG_HUMAN GO:0019825 GOA:interpro IEA InterPro:IPR002335 F IPI00217493 protein taxon:9606 20060125 UniProt
   UniProt P02144 MYG_HUMAN GO:0020037 GOA:interpro IEA InterPro:IPR012292 F IPI00217493 protein taxon:9606 20060125 UniProt

(GO:0005488 - 'binding', GO:0019825 - 'oxygen binding', GO:0020037 - 'heme binding')

However within the human IPI species-specific file there exist only two of these three:

   UniProt P02144 MYG_HUMAN GO:0019825 GOA:interpro IEA InterPro:IPR002335 F Myoglobin IPI00217493 protein taxon:9606 20060223 UniProt
   UniProt P02144 MYG_HUMAN GO:0020037 GOA:interpro IEA InterPro:IPR012292 F Myoglobin IPI00217493 protein taxon:9606 20060223 UniProt

(GO:0019825 - 'oxygen binding' and GO:0020037 - 'heme binding')

The GO term for 'binding' has been removed from the human file because it is a less granular parent of the oxygen and heme binding terms and so gives users no extra information. This can be done because of the 'true path rule' that GO follows.

The true path rule states that "the pathway from a child term all the way up to its top-level parent(s) must always be true", so a protein annotated to a term such as 'oxygen binding' is automatically also correctly annotated to its parent term 'binding'. This is known because the 'binding' GO term is displayed in GO as a parent of 'oxygen binding'.
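The redundancy filter that the true path rule makes possible can be sketched as follows. This is a minimal illustration, not GOA's actual implementation: the parent table is a hand-written toy fragment (the real ontology has intermediate terms between 'heme binding' and 'binding'), and the function names are made up for the example.

```python
# Toy fragment of the GO hierarchy: child term -> direct parents.
# Simplified by hand for illustration; not loaded from the real ontology.
PARENTS = {
    "GO:0019825": ["GO:0005488"],  # oxygen binding -> binding
    "GO:0020037": ["GO:0005488"],  # heme binding -> binding (simplified)
    "GO:0005488": [],              # binding
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def drop_less_granular(terms, parents):
    """Keep only the terms that are not an ancestor of another given term."""
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return sorted(t for t in set(terms) if t not in implied)


# The three InterPro2GO terms listed for P02144 in the UniProt file.
myoglobin_terms = ["GO:0005488", "GO:0019825", "GO:0020037"]
print(drop_less_granular(myoglobin_terms, PARENTS))
```

As the answer describes, such a filter would be applied per protein and per annotation technique, so that a parent term is only dropped when the same technique has also assigned one of its children.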

Most of the publicly available expression data gives GenBank IDs for the human genes but the GO annotation uses UniProtKB accessions or IPI identifiers. Is there an easy way to map between the two?

The Gene Ontology Consortium provides annotations for gene products, so we reference protein sequences and not DNA sequences. It is not necessarily useful to know just the DNA accession number because so many different coding sequences can be within one DNA sequence. There are at least two ways to find Protein_IDs:

By retrieving the SWISS-PROT file from the EBI's FTP site: The DR line of almost every SWISS-PROT entry contains the following, e.g.:

DR EMBL; AF043736; AAC02090.1; -.

AF043736 is the EMBL/Genbank/DDBJ AC number and AAC02090 is the protein identifier for the coding sequence (CDS) within the EMBL/Genbank/DDBJ entry. These are universal IDs shared by all three of the collaborating nucleotide sequence databases.

Additionally (and redundantly), GenBank identifies its proteins by a second identifier, the GI number. SWISS-PROT does not keep cross references to Genbank GI numbers, but you can map between protein identifiers by retrieving the NCBI's non-redundant protein dataset from ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.Z

You can parse the deflines of that file. If two sequences are identical they are merged and their information is merged into the defline. For example, searching for Q9W4P5 hits the following:

gi|18543319|ref|NP_570080.1| (NM_130724) CG2934 gene product [Drosophila melanogaster]gi|12585516|sp|Q9W4P5|V0D1_DROME Vacuolar ATP synthase subunit d 1 (V-ATPase d subunit 1) (Vacuolar proton pump d subunit 1) (V-ATPase 39 KDa subunit 1)gi|7290447|gb|AAF45902.1| (AE003429) CG2934 gene product [Drosophila melanogaster]gi|17862396|gb|AAL39675.1| (AY069530) LD24653p [Drosophila melanogaster]

From the above you can find the Protein_ID (AAL39675.1), the universal DNA accession for the region that contains this CDS (AY069530), and the GI number.
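Parsing such a merged defline can be sketched with a regular expression. This is an illustrative sketch that assumes every merged entry starts with the 'gi|number|database|accession|' pattern seen in the example above; the function name and the dictionary keys are made up for the example.

```python
import re

# The merged defline quoted above, split across string literals for readability.
DEFLINE = (
    "gi|18543319|ref|NP_570080.1| (NM_130724) CG2934 gene product "
    "[Drosophila melanogaster]"
    "gi|12585516|sp|Q9W4P5|V0D1_DROME Vacuolar ATP synthase subunit d 1 "
    "(V-ATPase d subunit 1) (Vacuolar proton pump d subunit 1) "
    "(V-ATPase 39 KDa subunit 1)"
    "gi|7290447|gb|AAF45902.1| (AE003429) CG2934 gene product "
    "[Drosophila melanogaster]"
    "gi|17862396|gb|AAL39675.1| (AY069530) LD24653p [Drosophila melanogaster]"
)

# GI number, source database, accession, and an optional nucleotide
# accession in parentheses directly after the identifier block.
ENTRY = re.compile(r"gi\|(\d+)\|([a-z]+)\|([^|]+)\|(?:\s*\(([A-Z_0-9]+)\))?")


def parse_defline(defline):
    """Split a merged defline into one record per source entry."""
    return [
        {"gi": gi, "db": db, "accession": acc, "nucleotide": nuc or None}
        for gi, db, acc, nuc in ENTRY.findall(defline)
    ]


entries = parse_defline(DEFLINE)
# Only report the cross-references if the Swiss-Prot accession is present.
if any(e["accession"] == "Q9W4P5" for e in entries):
    for e in entries:
        print(e["gi"], e["db"], e["accession"], e["nucleotide"])
```

The Swiss-Prot entry has no nucleotide accession in parentheses immediately after its identifier, so its 'nucleotide' field is left as None.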

By using the EBI's Sequence retrieval system (SRS): You can use SRS to search the EBI's GO annotation (GOA) files or the GO database, which is a mirror of the GO consortium repository.

For example, to search GOA for all proteins that function as transporters (GO:0005215) and that have an experimental evidence code:

  1. Choose 'extended search' and enter the GO identifier '0005215' in the 'goid' search field.
  2. In the 'combine searches with' section of the tool bar on the left-hand side of the page, select the 'BUTNOT' option and, in the 'evidence' field, add the GO evidence code IEA (this means 'inferred from electronic annotation'). This creates a query that searches for all proteins that have been linked to the GO term 'transporter' and that were manually curated.
  3. SRS can link your results to databases that do not contain direct references to each other. For example, in the last search, SWISS-PROT, InterPro and TrEMBL accession numbers will be displayed in the results page, but the search can also be extended to show all transporters in the last search that share EMBL/GenBank/DDBJ accession numbers. To do this, on the results page select the 'link' option on the left-hand tool bar, choose 'EMBL' and hit the 'submit link' button.
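The selection built in steps 1 and 2 (a given GO ID, BUTNOT evidence code IEA) can also be approximated offline on a downloaded gene association file. The sketch below is not part of SRS. It assumes a tab-delimited file in the standard 15-column gene association layout, with the GO ID in column 5 and the evidence code in column 7. The accessions and the reference placeholder in the sample rows are hypothetical.

```python
def manually_curated(lines, go_id):
    """Return (accession, evidence) pairs annotated to go_id without IEA."""
    hits = []
    for line in lines:
        if line.startswith("!"):
            continue  # comment and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # too short to hold a GO ID and an evidence code
        if cols[4] == go_id and cols[6] != "IEA":
            hits.append((cols[1], cols[6]))
    return hits


def make_row(accession, go_id, evidence):
    """Build a placeholder 15-column row; only three fields matter here."""
    return "\t".join([
        "UniProt", accession, "EXAMPLE", "", go_id, "REF:placeholder",
        evidence, "", "F", "", "", "protein", "taxon:9606", "20060125",
        "UniProt",
    ])


# Hypothetical rows, used only to exercise the filter.
sample = [
    "!example header line",
    make_row("P00001", "GO:0005215", "IDA"),
    make_row("P00002", "GO:0005215", "IEA"),  # electronic, so excluded
    make_row("P00003", "GO:0005488", "IMP"),  # different GO term
]
print(manually_curated(sample, "GO:0005215"))
```

This matches the exact GO ID only. Proteins annotated to more specific child terms of the transporter term would need the ontology hierarchy to be taken into account.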

Is there a mapping of GO IDs to UniGene IDs?

GOA provides a UniGene to UniProtKB mapping (available at: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/gp2protein/gp2protein.unigene.gz). Using either one of GOA's non-redundant species-specific files (Arabidopsis, chicken, cow, human, mouse, rat and zebrafish proteomes) or the GOA UniProt gene association file, you should be able to parse out the appropriate set of GO terms for a set of UniGene IDs.
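The two-step lookup described above (UniGene ID to UniProtKB accession, then accession to GO terms) can be sketched as a simple join. This is a sketch under stated assumptions: the mapping file is assumed to hold one tab-separated 'UniGene ID, UniProtKB accession' pair per line, which should be checked against the actual gp2protein file, and the gene association file is assumed to use the standard 15-column tab-delimited layout. The UniGene IDs in the sample are hypothetical.

```python
from collections import defaultdict


def load_mapping(lines):
    """Read UniGene ID -> set of UniProtKB accessions (assumed 2 columns)."""
    mapping = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        unigene_id, accession = line.rstrip("\n").split("\t")[:2]
        mapping[unigene_id].add(accession)
    return mapping


def load_go_terms(lines):
    """Read accession -> set of GO IDs from gene association lines."""
    terms = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 5:
            terms[cols[1]].add(cols[4])
    return terms


def go_terms_for_unigene(unigene_ids, mapping, go_terms):
    """Collect the GO IDs of every protein mapped to each UniGene ID."""
    result = {}
    for unigene_id in unigene_ids:
        found = set()
        for accession in mapping.get(unigene_id, ()):
            found |= go_terms.get(accession, set())
        result[unigene_id] = sorted(found)
    return result


# Hypothetical UniGene IDs; the GO rows reuse the P02144 example above.
mapping_lines = ["Hs.0001\tP02144"]
association_lines = [
    "\t".join(["UniProt", "P02144", "MYG_HUMAN", "", go_id, "GOA:interpro",
               "IEA", with_id, "F", "Myoglobin", "IPI00217493", "protein",
               "taxon:9606", "20060223", "UniProt"])
    for go_id, with_id in [("GO:0019825", "InterPro:IPR002335"),
                           ("GO:0020037", "InterPro:IPR012292")]
]
lookup = go_terms_for_unigene(
    ["Hs.0001", "Hs.0002"],
    load_mapping(mapping_lines),
    load_go_terms(association_lines),
)
print(lookup)
```

UniGene IDs with no mapped protein, or whose proteins carry no annotation, come back with an empty list rather than being dropped, which makes gaps in coverage easy to spot.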

How do I associate an EMBL/DDBJ/Genbank nucleotide sequence accession number with the GO ID?

In addition to the gene association files produced by the GOA project, we also provide mapping files between the entries in these (Arabidopsis, chicken, cow, human, mouse, rat and zebrafish) sets and other databases such as the EMBL/GenBank/DDBJ nucleotide sequence databases, HUGO, LocusLink and RefSeq at the NCBI. The readme for these files can be found at: http://www.geneontology.org/doc/goa.README


The information needed to link the protein and nucleotide data also exists in almost every UniProtKB entry. Cross-references from Swiss-Prot or TrEMBL to coding sequences (CDS) in the DDBJ/EMBL/GenBank nucleotide sequence database are given in the DR line, e.g.:

   DR EMBL; AF043736; AAC02090.1; -.
   AF043736 is the EMBL/GenBank/DDBJ Accession number
   AAC02090 is the protein-id/Protein Sequence Identifier for the CDS within the EMBL/GenBank/DDBJ entry.
   These two are universal IDs shared by all 3 of the collaborating nucleotide sequence databases.
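Extracting the two identifiers from such a DR line can be sketched as below. This is a minimal illustration that assumes the 'DR EMBL; accession; protein-id; ...' layout shown in the example; the function name is made up, and lines that are not EMBL cross-references are simply skipped.

```python
import re

# Matches the start of an EMBL cross-reference line and captures the
# nucleotide accession and the protein identifier that follow it.
DR_EMBL = re.compile(r"^DR\s+EMBL;\s*([^;]+);\s*([^;]+);")


def parse_dr_embl(line):
    """Return (nucleotide_accession, protein_id), or None for other lines."""
    match = DR_EMBL.match(line.strip())
    if match is None:
        return None
    accession, protein_id = (field.strip() for field in match.groups())
    if protein_id == "-":
        protein_id = None  # no protein identifier recorded for this entry
    return accession, protein_id


print(parse_dr_embl("   DR EMBL; AF043736; AAC02090.1; -."))
```

The protein identifier is returned with its version suffix (AAC02090.1); strip everything after the dot if only the unversioned identifier is needed.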

We have released all GO annotation to SWISS-PROT and TrEMBL, and we are working on adding this GO annotation directly to the EMBL-Bank database; this will be possible by the EMBL Christmas 2002 release. EMBL, DDBJ and GenBank form an international collaboration and exchange information on a daily basis. For the time being, the only way to download EMBL/GenBank/DDBJ accession numbers with GO annotation is to use the EBI's sequence retrieval system (SRS: http://srs.ebi.ac.uk/).

Search the GOA database by GO ID or GO evidence code and then use the link option to link to EMBL-Bank database.

GO IDs are also displayed by the Ensembl database for the human genome. For more information read the GOA or Ensembl home pages, http://www.ebi.ac.uk/GOA and http://www.ebi.ac.uk/ensembl .