Annotation FAQ

From GO Public

What is annotation?

What does it mean to do GO annotation of genes or proteins?

Terms from the Gene Ontology are applied in the annotation of gene products in biological databases. GO annotations are associations made between gene products and the GO terms that describe them. Because a single gene may encode different products with very different attributes, GO recommends associating GO terms with database objects representing gene products rather than genes. If identifiers are not available to distinguish individual gene products, GO terms may be associated with an identifier for a gene; a gene object is associated with all GO terms applicable to any of its products.
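The gene-level rule in the last sentence can be sketched as a set union. This is only an illustration: the gene and product identifiers and the two mappings are invented, and real data would come from an annotation file.

```python
# Hypothetical identifiers: two products of one gene, each with its own GO terms.
product_terms = {
    "geneA-product1": {"GO:0004672", "GO:0005634"},
    "geneA-product2": {"GO:0005737"},
}
gene_products = {"geneA": ["geneA-product1", "geneA-product2"]}


def gene_level_terms(gene, gene_products, product_terms):
    """Return every GO term applicable to any product of the gene."""
    terms = set()
    for product in gene_products.get(gene, []):
        terms |= product_terms.get(product, set())
    return terms


print(sorted(gene_level_terms("geneA", gene_products, product_terms)))
```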

What is a 'gene product'?

GO uses the term 'gene product' to refer collectively to genes and any entities encoded by the gene, e.g. proteins and functional RNAs.

How are gene products associated with GO terms?

A gene product can be annotated to zero or more nodes of each ontology, at any level within each ontology; annotation of a gene product to one ontology is independent of its annotation to other ontologies. Annotations should reflect the normal function, process, or localization (component) of the gene product; an activity or location observed only in a mutant or disease state is therefore not usually included.

The member databases of the GO Consortium use manual and automated methods to annotate genes or gene products using GO terms. Both manual and automated annotations are made according to two principles: first, every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis; second, the annotation must indicate what kind of evidence is found in the cited source to support the association between the gene product and the GO term. GO uses a simple controlled vocabulary to indicate the type of evidence found in the cited reference to support the annotation.

See the GO annotation guide and evidence code documentation for more information.
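As a rough sketch of the two principles above, an annotation can be modelled as a record that is rejected unless it names both a source and an evidence code. The class name, field names and example values are assumptions made for illustration; they are not part of any GO software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_product: str  # identifier from the contributing database
    go_id: str         # GO term identifier, e.g. "GO:0005634"
    reference: str     # source: literature reference, database or analysis
    evidence: str      # evidence code, e.g. "IDA" or "IEA"

    def __post_init__(self):
        # Principle 1: every annotation is attributed to a source.
        if not self.reference:
            raise ValueError("every annotation must be attributed to a source")
        # Principle 2: every annotation states the kind of evidence.
        if not self.evidence:
            raise ValueError("every annotation must carry an evidence code")


example = Annotation("EX:0001", "GO:0005634", "EXREF:1", "IDA")
```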

What criteria are used to annotate genes with GO terms?

A variety of criteria are used for each annotation including experimental results, sequence similarity and curator judgement.

See "How are gene products associated with GO terms?" in this FAQ and the GO annotation guide for more information.

What is an evidence code?

Every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis. The annotation must indicate what kind of evidence is found in the cited source to support the association between the gene product and the GO term. A simple controlled vocabulary is used to record evidence; the evidence codes are the short (two- or three-letter) codes used to signify the type of evidence cited. More information on the meaning and use of the evidence codes can be found in the GO evidence codes documentation.

Can a gene or gene product be annotated to more than one term from an ontology?

Yes, a gene product can be annotated to zero or more nodes of each ontology, at any level within each ontology.

See "How are gene products associated with GO terms?" in this FAQ and the GO annotation guide for more information.

Can a gene product be annotated to more than one ontology?

Yes, annotators are encouraged to annotate to terms from all three ontologies. Annotation of a gene product to one ontology is independent of its annotation to other ontologies.

See "How are gene products associated with GO terms?" in this FAQ and the GO annotation guide for more information.

Why are some gene products annotated to both a parent term and a child term?

Question:

I was wondering why some gene products in the gene_product table are assigned to terms that are a parent, grandparent, etc. of another term that has been assigned to the gene product. For example, why is "protein kinase activity" assigned to the gene product with symbol KFMS_HUMAN if "protein tyrosine kinase activity" was already assigned to this gene?

This is done when there is explicit evidence to support separate annotations; usually it means that there is strong evidence for a more general annotation (parent term) and weaker evidence supporting a more specific annotation (child term).

From the GO annotation guide:

Uncertain knowledge of where a gene product operates should be denoted by annotating it to two nodes, one of which can be a parent of the other. For instance, a yeast gene product known to be in the nucleolus, but also experimentally observed in the nucleus generally, can be annotated to both nucleolus and nucleus in the cell component ontology. Even though annotation to nucleolus alone implies that a gene product is also in the nucleus, annotate to both so as to explicitly indicate that it has been reported in the two locations. The two annotations may have the same or different supporting evidence.
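To see which gene products carry this kind of deliberate parent-and-child annotation, one option is to walk the ontology upwards from each annotated term. The sketch below uses a toy child-to-parent table with term names instead of GO IDs; the table and function names are assumptions for illustration, not the GO database schema.

```python
# Toy fragment of the cellular component ontology (child -> parents); illustrative only.
parents = {
    "nucleolus": ["nucleus"],
    "nucleus": ["intracellular"],
    "intracellular": [],
}


def ancestors(term, parents):
    """Collect all ancestors of a term by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def parent_child_pairs(annotated_terms, parents):
    """Return (ancestor, descendant) pairs that are both directly annotated."""
    return sorted(
        (anc, term)
        for term in annotated_terms
        for anc in ancestors(term, parents)
        if anc in annotated_terms
    )


print(parent_child_pairs({"nucleolus", "nucleus"}, parents))
```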

Where can I view or download the complete sets of GO annotations?

As with the vocabularies, the gene product/GO association sets from contributing groups are available at the GO web site. Tab-delimited files of the associations between gene products and GO terms made by the member organizations are available from their individual FTP sites, from the GO FTP site (ftp://ftp.geneontology.org/pub/go/gene-associations), or from a link on the Current Annotations table.

The gene association file format is described in the GO annotation guide. These files store IDs for objects (genes/gene products) in the database that contributed the file (e.g. FlyBase IDs, Swiss-Prot accession IDs for proteins) as well as citation and evidence data. There are also files containing Swiss-Prot/TrEMBL protein sequence identifiers for gene products that have been annotated using GO terms; they are available via FTP.
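A minimal sketch of reading such a tab-delimited file is shown here. The column order for the first nine fields and the "!" comment convention are assumptions based on the commonly documented gene association format and should be checked against the GO annotation guide; the sample line and its identifiers are invented.

```python
import csv
import io

# Assumed order of the first nine tab-delimited columns.
COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence", "with_from", "aspect",
]


def read_associations(handle):
    """Yield one dict per annotation line, skipping blank and '!' comment lines."""
    lines = (line for line in handle if line.strip() and not line.startswith("!"))
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        yield dict(zip(COLUMNS, row))


# Invented sample content standing in for a downloaded file.
sample = io.StringIO(
    "!example header comment\n"
    "EXDB\tEX:0001\tabc1\t\tGO:0005634\tEXREF:1\tIDA\t\tC\n"
)
records = list(read_associations(sample))
print(records[0]["db_object_id"], records[0]["go_id"], records[0]["evidence"])
```

For a real file, pass an open file handle instead of the `io.StringIO` sample.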

You can also download the annotations in MySQL format from http://www.geneontology.org/GO.downloads.database.shtml. Note, however, that the MySQL database dumps should not be treated as flat files to be parsed directly; they are meant to be loaded into a MySQL database and queried.

Note that the data in [1] and the GO database do not agree on what constitutes a single "annotation". The former counts each line in a gene_association file as an annotation, whereas the GO database (and the database dump, if parsed directly) counts a gene product-GO term association referenced by a publication as a single annotation, even if more than one piece of evidence supports it.
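The two counting conventions can be illustrated with a few invented records. This is a sketch of the idea only; the tuple layout and identifiers are assumptions and do not reflect the GO database schema.

```python
# Each tuple: (gene product, GO ID, reference, evidence code); values are invented.
lines = [
    ("EX:0001", "GO:0005634", "EXREF:1", "IDA"),
    ("EX:0001", "GO:0005634", "EXREF:1", "IMP"),  # same association, second evidence
    ("EX:0001", "GO:0005730", "EXREF:2", "IDA"),
]


def count_by_line(lines):
    """Count every file line as one annotation."""
    return len(lines)


def count_by_association(lines):
    """Count each distinct (gene product, GO term, reference) once."""
    return len({(gp, go_id, ref) for gp, go_id, ref, _evidence in lines})


print(count_by_line(lines), count_by_association(lines))
```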

Where can I find GO annotations of proteins and ESTs?

Gene objects in model organism databases typically have multiple nucleotide sequences from the public databases associated with them, including expressed sequence tags (ESTs) and one or more protein sequences. There are two ways to obtain sets of sequences with GO annotations:

  1. from the model organism databases; or
  2. from the annotation sets for transcripts and proteins contributed to GO by Compugen and UniProt.


Obtaining GO annotations for model organism sequence sets

In the gene association files, the GO terms are associated with an ID for a gene or gene product from the contributing data resource. Usually, the mappings of gene IDs to sequence IDs are also available from the contributing model organism database. For example, the Mouse Genome Informatics FTP site includes the gene association files contributed to the GO, and other reports that include official mouse gene symbols and names and all curated gene : sequence ID associations.

Obtaining GO annotations for transcripts and proteins in general

Large transcript and protein sequence data sets are annotated to GO terms by Compugen and UniProt, respectively. These files can be downloaded directly from the GO web site. The species of origin for each sequence is included in the association files.

How do I find all the human genes that have been annotated with a particular GO term?

GO terms have been associated with a non-redundant set of human proteins described in Swiss-Prot/TrEMBL/InterPro and Ensembl. These annotations are available in the GOA Human file on the EBI and GO FTP sites.

GOA project data are also accessible from Ensembl and from the EMBL/DDBJ/GenBank nucleotide sequences stored at EMBL-Bank. For more information about browsing GOA project data at EBI, see the EBI's GOA page (http://www.ebi.ac.uk/GOA/index.html).
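Once the GOA Human file has been downloaded and parsed, listing the gene products for one term is a simple filter. The sketch below assumes the records are already dicts with the hypothetical keys "db_object_symbol" and "go_id", and that the caller supplies the IDs of any child terms to include, since gene products annotated to a child term also belong to the parent term. The symbols are invented.

```python
# Invented records standing in for parsed lines of an annotation file.
records = [
    {"db_object_symbol": "abc1", "go_id": "GO:0004672"},
    {"db_object_symbol": "xyz2", "go_id": "GO:0004713"},
    {"db_object_symbol": "def3", "go_id": "GO:0005634"},
]


def gene_products_for_term(records, go_id, descendants=()):
    """Return sorted symbols annotated to go_id or to any supplied descendant."""
    wanted = {go_id, *descendants}
    return sorted({r["db_object_symbol"] for r in records if r["go_id"] in wanted})


# Direct annotations only, then including one child term.
print(gene_products_for_term(records, "GO:0004672"))
print(gene_products_for_term(records, "GO:0004672", descendants=["GO:0004713"]))
```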

How can I get sequences of proteins annotated to a GO term?

The AmiGO browser allows you to search the GO annotations contributed by all the participating databases and retrieve protein sequences for annotated gene products (if available).

To retrieve sequences using AmiGO, go to http://amigo.geneontology.org/cgi-bin/amigo/go.cgi and enter your chosen GO term (e.g. 'mitochondrion') in the 'search for GO terms and associated genes' field. On the page listing the annotated genes, check the genes whose sequences you require and, at the bottom of the page, set the option box to 'Get Fasta Sequences'. Then click the 'Submit' button.

If you would like to retrieve sequences for only one species or only one data source then please choose a filter setting using the menus under the heading 'Filter Associations'.

Why can't I retrieve sequences for some annotated proteins?

The AmiGO browser uses the GO database, which is built using submissions from various bioinformatics groups and model organism databases (GO Consortium members). Some of these groups also submit an optional mapping from their IDs to protein sequence database IDs. Not all do, which is why you only get a subset.

Can I sort or filter annotations by evidence code?

I want the most reliable data available, so I want to retrieve only the annotations that were done manually. How can I do this?

Many GO browsers, including AmiGO, allow you to select one or more evidence codes, and retrieve annotations using the selected codes. In AmiGO, use the advanced query and select codes from the Evidence Type list.
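The same filtering can be done on a downloaded annotation set. The following is a sketch under two assumptions: the annotations are held as simple tuples with invented identifiers, and "manual" is approximated as every evidence code except IEA.

```python
# Invented sample annotations: (gene product, GO ID, evidence code).
annotations = [
    ("EX:0001", "GO:0005634", "IDA"),
    ("EX:0002", "GO:0005634", "IEA"),
    ("EX:0003", "GO:0005737", "TAS"),
]


def filter_by_evidence(annotations, include=None, exclude=()):
    """Keep annotations whose code is in `include` (if given) and not in `exclude`."""
    kept = []
    for gene_product, go_id, evidence in annotations:
        if include is not None and evidence not in include:
            continue
        if evidence in exclude:
            continue
        kept.append((gene_product, go_id, evidence))
    return kept


# Approximate "manual only" by dropping electronically inferred annotations.
manual_only = filter_by_evidence(annotations, exclude={"IEA"})
print(manual_only)
```

Passing `include={"IDA"}` instead would keep only annotations carrying that one code.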

What are the advantages and disadvantages of manual annotation?

The most reliable annotations are those made manually by database curators based on primary and review literature. Manual annotations often cite experimental evidence that provides strong support for the association of a GO term with a gene product, and can be done at a very detailed level. The chief disadvantage of manual annotation is that it is labor-intensive, requiring a lot of time and effort from trained biologists.

What are the advantages and disadvantages of automatic annotation?

The main advantage of automatic annotation is simply speed: wholly or partially automated methods facilitate the annotation of much larger sets of known or predicted gene products than can be produced manually. Automation, however, yields more general (less detailed) annotations compared to manual annotation, and automated methods are more error-prone. Annotations made by automated methods are therefore regarded as less reliable than manual ones.

How is annotation quality controlled to ensure consistency between databases?

The accuracy of GO annotations is a high priority for all members of the GO Consortium. Each member organization is responsible for keeping its own annotations accurate and up to date, and for correcting any errors. Users can report errors by writing to the GO helpdesk; any comments on annotations will be forwarded to the appropriate contributing group.

The GO Consortium is also looking into possible ways to improve quality assurance further, such as manually reviewing selected annotations and developing tools to automate detection of potentially erroneous annotations.

How often does automatic annotation give results that are consistent with manual annotation?

In general, electronic annotations are rarely incorrect, because they are usually made to very high-level GO terms. For example, the GOA group at EBI reports:

Usually manual annotation simply provides deeper-level terms in GO. In 93% of cases GOA's electronic annotation is in the same GO lineage as the manual annotation. Some users have used our manual annotation to assess the quality of their automatic GO annotation techniques. They have found a few manual annotation errors by Proteome Inc. but no errors (so far) of manual annotation by Swiss-Prot staff have been reported to GOA. A few InterPro2GO errors have been reported but not very many. So, in general, our electronic techniques are very accurate, and are sometimes based on manual annotation. For example, Swiss-Prot keywords are usually manually annotated to Swiss-Prot entries; by using a mapping of Swiss-Prot keywords to GO, GOA inherits the high quality of Swiss-Prot manual annotation.

This topic was investigated further in 2005:

"The quality of electronic annotation has recently been assessed in some detail (Camon et al., 2005). This research found that in the worst case scenario, the generation of electronic annotations using the interpro2go, spkw2go, and ec2go mapping files precisely predicted the correct GO term 60% to 70% of the time, with the remainder of the predictions being to insufficiently specific GO terms. The high precision was found to be due to the basing of electronic annotations on manually curated mapping files. Curators noted that it was more important for database curation to be accurate than to have complete coverage, and the figures above demonstrate that this is the tendency with electronic annotation." Text from (Clark et al., 2005).

How do I annotate ESTs?

To make electronic GO annotations for ESTs, the usual approach is to BLAST the EST sequences against sequences that have been manually annotated and transfer the annotations from similar sequences, adding the evidence code IEA.
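The transfer step might look like the sketch below. It assumes BLAST was run with the standard 12-column tabular output (query ID in the first column, subject ID in the second, e-value in the eleventh) and that the GO terms of the annotated sequences are already in a dict. The e-value cutoff of 1e-10, the function name and all identifiers are illustrative assumptions, not a recommended protocol.

```python
def transfer_annotations(blast_lines, subject_terms, max_evalue=1e-10):
    """Map each EST to the GO terms of its qualifying hits, tagged with IEA."""
    transferred = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or non-tabular lines
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > max_evalue:
            continue  # hit too weak to justify a transfer
        for go_id in subject_terms.get(subject, ()):
            transferred.setdefault(query, set()).add((go_id, "IEA"))
    return transferred


# Invented BLAST hits: one strong, one weak.
blast_lines = [
    "est1\tEX:0001\t98.0\t200\t4\t0\t1\t200\t10\t209\t1e-50\t350\n",
    "est1\tEX:0002\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t20\n",
]
# Invented manual annotations of the subject sequences.
subject_terms = {"EX:0001": {"GO:0005634"}, "EX:0002": {"GO:0005737"}}

result = transfer_annotations(blast_lines, subject_terms)
print(result)
```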

Some useful tools for EST annotation:

The AmiGO browser has a BLAST query feature built in, which you can use to query annotated gene products in the GO database. For large batch queries, you may want to download the file of annotated sequences and use it to run BLAST locally. The file is available from the GO ftp site (ftp://ftp.geneontology.org/pub/go) and is updated regularly.

Another option might be to install the AmiGO code and GO database locally.

The underlying data are in flat files in the gp2protein and gene-associations directories on the GO FTP site. There is a README for the gp2protein directory. The format of the files in the gene-associations directory is described in the GO annotation guide. Please let us know if you have questions about these files.

- You could also try using InterProScan to find protein domains/motifs encoded by the ESTs, and transfer GO terms that have been associated with InterPro entries. See InterPro for more information.

- TIGR's file of GO associations for the Tentative Consensus sequences in TIGR's Gene Index databases is included on the GO web site. The Tentative Consensus sequences are assemblies of ESTs. All of the GO terms were assigned by computer and are therefore IEA [electronic, i.e. computed and not reviewed by a human].

- Several other groups have done automated assignment of GO terms to genes or proteins, including ESTs, and many of them would probably be willing to share their methods and software. The GO Bibliography includes a list of publications on GO annotations.

Is there a file that shows the numerical identifiers for GO terms and the corresponding annotated genes?

There is a directory containing flat files, contributed by several different databases, that associate genes or gene products with GO terms: ftp://ftp.geneontology.org/pub/go/gene-associations. The file format is documented in the GO annotation guide.

Why is my gene/protein not in the GO database?

GO is a work in progress, so not all genes and proteins have GO terms associated with them yet. To check on the annotation status of a particular gene or protein, please email the GO helpdesk.

I have a question about gene or protein nomenclature

The GO Consortium is not involved in naming genes at all, in any organism. The GO vocabularies describe attributes of gene products; they are not collections of gene names or protein names.

Gene names are generally standardized within an organism but not necessarily between organisms (with some notable exceptions, such as the ongoing effort to make human and mouse gene names consistent). I suggest that you direct your query to the database or nomenclature committee for your organism. For example, human gene names are maintained by the HUGO Gene Nomenclature Committee (HGNC), mouse gene names by MGI, etc.

How do I submit annotations to GO?

Please write to the GO helpdesk with your suggestions.

Suggestions for updating existing annotations will be forwarded to the contributing group.

If you want to submit a new annotation set, we will reply and tell you how to proceed.


Specific questions

I have a list of Entrez IDs; how do I find the annotations for them?

The list of Entrez IDs should be converted to UniProtKB or model organism database IDs, and those IDs used to search the GO database.

PIR has an ID mapping tool to help with the conversion: http://pir.georgetown.edu/pirwww/search/idmapping.shtml

The GO website has a list of taxa and authoritative database groups: http://www.geneontology.org/GO.annotation.species_db.shtml
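If the mapping tool's results are saved as a two-column, tab-delimited file, the conversion itself is a dictionary lookup. The sketch below assumes that file layout (source ID, then target ID, with optional "#" comment lines); the layout, function names and all identifiers are hypothetical, so check the actual export format before relying on it.

```python
import io


def load_mapping(handle):
    """Read 'source_id<TAB>target_id' lines into a dict of lists."""
    mapping = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # no target ID on this line
        mapping.setdefault(parts[0], []).append(parts[1])
    return mapping


def convert_ids(source_ids, mapping):
    """Split IDs into those with at least one mapping and those without."""
    mapped = {sid: mapping[sid] for sid in source_ids if sid in mapping}
    unmapped = [sid for sid in source_ids if sid not in mapping]
    return mapped, unmapped


# Invented mapping content; one source ID maps to two target IDs.
mapping_file = io.StringIO("# source\ttarget\n1001\tACC0001\n1001\tACC0002\n1002\tACC0003\n")
mapping = load_mapping(mapping_file)
mapped, unmapped = convert_ids(["1001", "1002", "9999"], mapping)
print(mapped, unmapped)
```

IDs left in `unmapped` have no entry in the mapping file and need to be resolved another way before searching the GO database.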
