GO FAQ
If you do not find the answer to your question here, you can email the GO helpdesk.
General GO
What is GO?
The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. There are three separate aspects to this effort: first, we write and maintain the ontologies themselves; second, we make cross-links between the ontologies and the genes and gene products in the collaborating databases; and third, we develop tools that facilitate the creation, maintenance and use of ontologies.
The use of GO terms by several collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that you can query them at different levels: for example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases. This structure also allows annotators to assign properties to gene products at different levels, depending on how much is known about a gene product.
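Querying at different levels can be sketched in a few lines of Python. The term names, gene product symbols and the tiny hierarchy below are invented for illustration only; they are not real GO data, and this is not how any GO tool is implemented.

```python
# Toy hierarchy: each term maps to its direct children (hypothetical data).
CHILDREN = {
    "signal transduction": ["receptor tyrosine kinase signaling"],
    "receptor tyrosine kinase signaling": [],
}

# Toy annotations: gene product symbol -> terms it is annotated to.
ANNOTATIONS = {
    "geneA": {"signal transduction"},
    "geneB": {"receptor tyrosine kinase signaling"},
}


def descendants(term):
    """Return the term together with every term below it in the hierarchy."""
    found = {term}
    stack = [term]
    while stack:
        for child in CHILDREN.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def products_for(term):
    """Gene products annotated to the term or to any of its descendants."""
    terms = descendants(term)
    return sorted(gp for gp, annotated in ANNOTATIONS.items() if annotated & terms)
```

Querying the general term returns both toy gene products, while querying the more specific term returns only the one annotated at that level.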
What is an ontology?
Ontologies are 'specifications of a relational vocabulary'. In other words they are sets of defined terms like the sort that you would find in a dictionary, but the terms are given hierarchical relationships to one another. The terms in a given vocabulary are likely to be restricted to those used in a particular field or domain, and in the case of GO, the terms are all biological.
Why do we need GO?
To ask meaningful questions, biologists often need to retrieve and analyse data from disparate sources. For example, if you were searching for new targets for antibiotics, you might want to find all the gene products that are involved in bacterial protein synthesis, but that have significantly different sequences or structures from those in humans. But if one database describes these molecules as being involved in 'translation', whereas another uses the phrase 'protein synthesis', it will be difficult for you - and even harder for a computer - to find functionally equivalent terms.
The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The GO collaborators are developing three ontologies - a word used by computer scientists to mean 'specifications of a relational vocabulary' - that describe biological processes, cellular components and molecular functions in a species-independent manner.
Ontologies provide a vocabulary for representing and communicating knowledge about a topic, and a set of relationships that hold among the terms of the vocabulary. They can be structurally very complex, or relatively simple. Most importantly, ontologies capture domain knowledge in a way that can easily be dealt with by a computer. Because the terms in an ontology and the relationships between the terms are carefully defined, the use of ontologies facilitates making standard annotations, improves computational queries, and can support the construction of inference statements from the information at hand.
Genomic sequencing projects and microarray experiments alike produce electronically generated data flows that require computer-accessible systems to work with the information. As systems that make domain knowledge available to both humans and computers, bio-ontologies such as GO and the many others being created (see the OBO web page for some examples) are essential to the process of extracting biological insight from enormous sets of data.
Which biological domains are supported by GO?
The current ontologies of the GO project are molecular function, biological process, and cellular component. These three areas are considered independent of each other. The ontologies are developed to include all terms falling into these domains without consideration of whether the biological attribute is restricted to certain taxonomic groups. Therefore, biological processes that occur only in plants (e.g. photosynthesis) or mammals (e.g. lactation) are included.
Other biological ontologies are discussed on the OBO web site.
What is beyond the scope of the GO project?
Almost as important as understanding the scope of the GO project is understanding what the GO project is not. The most common misapprehensions are (1) that the GO is a system for naming genes and proteins and (2) that the GO attempts to describe all of biology. The GO neither names genes or gene products, nor attempts to provide structured vocabularies beyond its three domains: molecular function, biological process and cellular component.
GO is not a nomenclature for genes or gene products. The vocabularies describe molecular phenomena (e.g. programmed cell death), not biological objects (e.g. proteins or genes). Sharing gene product names would entail tracking evolutionary histories and reflecting both orthologous and paralogous relationships between gene products. Different research communities have different naming conventions. Different organisms have different numbers of members in gene families. The GO project focuses on the development of vocabularies to describe attributes of biological objects, not on the naming of the objects themselves. This point is particularly important to understand because many genes and gene products are named for their function.
Does the GO ID have any meaning?
The GO IDs are purely unique identifiers; they do not encode any information about a term or its position relative to other terms in the tree.
How do I browse the GO?
The GO Consortium has developed AmiGO for searching and browsing the Gene Ontology and the gene products that member databases have annotated using GO terms. Browsing the GO tree or searching for a term allows you to see term information and the hierarchy for the term, cross-references to external databases, and the complete set of gene product associations for the term and any of its children.
Other tools with GO browsing capabilities can be found on the GO tools page of the GO website.
How do I find GO annotations for 'my' genes?
The GO Consortium has developed AmiGO for searching and browsing the Gene Ontology and the gene products that member databases have annotated using GO terms. Using AmiGO, you can search for one or more gene products and view their GO annotations.
Where can I view or download the complete sets of GO annotations?
As with the vocabularies, the gene product sets (gene association files) from contributing groups are freely available; you can download them from the annotation downloads section of the GO website. The files are in tab-delimited text; the file format is described in the GO annotation guide. Gene association files contain all evidence pertinent to the annotation, including database IDs and gene product names, as well as citation and evidence data.
Why does GO always refer to 'gene products'?
GO uses 'gene products' to refer to any protein or RNA encoded by a gene.
Searching and browsing GO
How do I browse genes from all the different participating databases annotated to a particular term?
The GO Consortium has developed AmiGO for searching and browsing the Gene Ontology and the gene products that member databases have annotated using GO terms. Using AmiGO, you can search for a GO term and view the gene products annotated to it. The results include the GO hierarchy for the term, definition and synonyms for the term, external links, and the complete set of gene product associations for the term and any of its children. AmiGO also allows you to filter your results if you wish to see only a subset of the data.
How do I find manually annotated gene products only, i.e. how do I sort by evidence code?
The GO Consortium has developed AmiGO for searching and browsing the Gene Ontology and the gene products that member databases have annotated using GO terms. Using AmiGO, you can search for one or more gene products and view their GO annotations. The results can be filtered so that only annotations using a user-defined set of evidence codes are shown. At present, AmiGO only uses manual annotations (it excludes all annotations with the evidence code IEA), but it will soon allow all annotation data to be shown.
Where can I view or download the complete sets of GO annotations?
As with the vocabularies, the gene product/GO association sets from contributing groups are available at the GO web site. Tab-delimited files of the associations between gene products and GO terms that are made by the member organizations are available from their individual FTP sites, from the GO FTP site (ftp://ftp.geneontology.org/pub/go/gene-association), or from a link on the Current Annotations table.
The gene association file format is described in the GO annotation guide. These files store IDs for objects (genes/gene products) in the database that contributed the file (e.g. FlyBase IDs, Swiss-Prot accession IDs for proteins) as well as citation and evidence data. There are also files containing Swiss-Prot/TrEMBL protein sequence identifiers for gene products that have been annotated using GO terms; they are available via FTP.
You can also download the annotations in MySQL format from GO Database Downloads. Note, however, that the MySQL database dumps should not be treated as flat files to be parsed directly; they are meant to be loaded into a MySQL database and queried. Note also that the Current Annotations table and the GO database do not agree on what constitutes a single "annotation". The former counts each line in a gene_association file as an annotation, whereas the GO database (and the dump file, if parsed directly) counts a gene product-GO term association referenced by a publication as a single annotation, even if more than one piece of evidence supports it.
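The difference between the two ways of counting can be illustrated with a short Python sketch. The records below are made-up placeholders, reduced to four fields for clarity; real gene association lines carry more columns.

```python
# Each record stands for one line of a gene association file, reduced to
# (gene product, GO ID, reference, evidence code). Placeholder values only.
RECORDS = [
    ("geneA", "GO:0000001", "PMID:111", "IDA"),
    ("geneA", "GO:0000001", "PMID:111", "IMP"),  # same association, second evidence
    ("geneB", "GO:0000002", "PMID:222", "TAS"),
]


def count_by_line(records):
    """Count as the Current Annotations table does: one per file line."""
    return len(records)


def count_by_association(records):
    """Count as the GO database does: one per gene product, term and reference."""
    return len({(product, go_id, ref) for product, go_id, ref, _evidence in records})
```

With these three placeholder lines, the line-based count is 3 and the association-based count is 2, because the first two lines differ only in their evidence code.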
Where can I find GO annotations of proteins and ESTs?
Gene objects in model organism databases typically have multiple nucleotide sequences from the public databases associated with them, including expressed sequence tags (ESTs) and one or more protein sequences. There are two ways to obtain sets of sequences with GO annotations:
- from the model organism databases
- from the annotation sets for transcripts and proteins contributed to the GO by Compugen and SWISS-PROT
Obtaining GO annotations for model organism sequence sets: In the gene association files, the GO terms are associated with an accession ID for a gene or gene product from the contributing data resource. Usually, the association files of the gene to sequenceIDs are also available from the contributing model organism database. For example, the Mouse Genome Informatics FTP site includes the gene association files contributed to the GO, and other reports that include official mouse gene symbols and names and all curated gene : sequence ID associations.
Obtaining GO annotations for transcript and proteins in general: Large transcript and protein sequence data sets are annotated to the GO by Compugen and SWISS-PROT/TrEMBL, respectively. These files can be downloaded direct from the GO web site. Species of origin for the sequence is included in the association files.
How can I get FASTA files of proteins annotated to a particular GO term?
On the GO web site, select the link to the AmiGO browser (which will allow you to search the GO gene associations contributed by all the participating databases) and enter your chosen GO term (e.g. 'mitochondrion') in the Search box. Toggle the 'Terms' button and click on 'Submit Query'. The resulting page will present a list of all Gene Product Associations to the queried term and its children. Note that associations may be filtered according to Species, Data Source, and Evidence Code, or restricted to only those gene products annotated directly to the queried term. Check the genes you require the sequence for and, at the bottom of the page, toggle the option box to 'Get FASTA sequences'. Hit the 'Submit Query' button. If you would like sequences for all of the gene products, click on the 'Select all' option.
How do I find all the human genes that have been annotated with a particular GO term?
GO terms have been associated with a non-redundant set of human proteins described in SWISS-PROT/TrEMBL/InterPro and Ensembl. These annotations are available in the GOA-Human file on the EBI and GO FTP sites.
GOA project data are also accessible from Ensembl and from the EMBL/DDBJ/GenBank nucleotide sequences stored at EMBL-Bank. For more information about browsing GOA project data at EBI, see the EBI's GOA page.
Is it possible to browse the GO database using a GenBank accession number or gi number?
The GO database does not include GenBank accession numbers for annotated genes (or gene products), with the exception of an annotation dataset provided by Compugen, Inc. at ftp://ftp.geneontology.org/pub/go/gene-associations/gene_association.compugen.Genbank.gz and http://www.geneontology.org/doc/Compugen.README
For annotations provided by the GO Annotations at EBI (GOA) project, a file of cross-references to database entries including GenBank/EMBL/DDBJ is available at ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/human.xrefs.gz
For some other annotation sets, there are files containing Swiss-Prot/TrEMBL protein sequence identifiers and model organism database IDs, available from ftp://ftp.geneontology.org/pub/go/gp2protein/
Can I search GO using Boolean operators?
Yes - you can perform this sort of search on the ontologies using the ontology editing tool OBO-Edit, which is developed by the GO Consortium. Full instructions for searching using OBO-Edit are available in the OBO-Edit help menu.
What are the recommended data access policies?
The GO Database server, http://www.godatabase.org, is a shared resource, and thus we require data mining to be performed in a manner that allows others to use this resource at the same time. Any activity that mines the GO Database using AmiGO must be controlled so that only one request is made at a time. Alternatively, you may download and install the database locally. You can also retrieve all the source files that define the data within the database. Details on installing the database locally are available at http://www.godatabase.org/dev/database/
For more information please contact the GO helpdesk
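One simple way to keep to a single request at a time is to issue queries strictly in sequence. The Python sketch below assumes a caller-supplied `fetch` function (hypothetical; it stands for whatever code performs one query), and the pause between requests is an extra courtesy that the policy above does not itself specify.

```python
import time


def fetch_one_at_a_time(queries, fetch, pause_seconds=1.0):
    """Run fetch(query) for each query in turn, never overlapping requests.

    fetch is a caller-supplied function (hypothetical here) that performs a
    single request and returns its result. pause_seconds is an assumed
    courtesy delay between consecutive requests.
    """
    results = []
    for index, query in enumerate(queries):
        if index > 0:
            time.sleep(pause_seconds)  # wait before starting the next request
        results.append(fetch(query))  # blocks until this request has finished
    return results
```

For large-scale mining, installing the database locally remains the better option.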
What is the best way to obtain the GO annotations for a list of UniProt Accession Numbers in batch?
With UniProt accession numbers, you can obtain all GO annotations by parsing the GOA gene association files, which are provided in a simple 15-column tab-delimited format. These files are available from our FTP site at ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
The GOA project offers a number of different files at this site. If you want the entire collection of GO annotations to proteins in UniProtKB, use: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.go_uniprot.gz
Or, if you are only interested in proteins from a particular species, we also provide non-redundant, species-specific files for human, mouse, rat, zebrafish, chicken, cow and Arabidopsis proteins. These files are created using the International Protein Index (IPI), which provides a top-level guide to the main databases that describe the proteomes of higher eukaryotic organisms. For example, the human file is at: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
Further information on the content and format of our gene association files is available from our ReadMe at http://www.ebi.ac.uk/GOA/goaHelp.html
Please contact GOA help for further assistance.
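The batch lookup described above can be sketched in Python. This is a minimal sketch, not an official GOA tool: it assumes that the accession is in the second column and the GO ID in the fifth column of the 15-column file, and that comment lines begin with '!'. Check these assumptions against the ReadMe before relying on them.

```python
def annotations_for(lines, accessions):
    """Collect the GO IDs annotated to each requested accession.

    lines is any iterable of text lines from a gene association file.
    Column positions are assumptions: accession in column 2, GO ID in column 5.
    """
    wanted = set(accessions)
    result = {accession: set() for accession in wanted}
    for line in lines:
        if line.startswith("!"):  # assumed comment/header marker
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 15:  # skip lines that are not full annotation rows
            continue
        accession, go_id = columns[1], columns[4]
        if accession in wanted:
            result[accession].add(go_id)
    return result
```

To read a downloaded file directly, pass a handle opened in text mode, for instance one returned by `gzip.open(path, "rt")` from the Python standard library. Accessions with no annotations in the file come back with an empty set.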
Annotations
What is annotation?
What does it mean to do GO annotation of genes or proteins?
Terms from the Gene Ontology are applied in the annotation of gene products in biological databases. GO annotations are associations made between gene products and the GO terms that describe them. A gene product is an RNA or protein product encoded by a gene. Because a single gene may encode different products with very different attributes, GO recommends associating GO terms with database objects representing gene products rather than genes. If identifiers are not available to distinguish individual gene products, GO terms may be associated with an identifier for a gene; a gene object is associated with all GO terms applicable to any of its products.
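The rule for gene-level identifiers amounts to taking the union of the terms of all products. A minimal Python sketch, using invented gene, product and term names:

```python
# Hypothetical gene with two products that have different attributes.
PRODUCT_TERMS = {
    "gene1-productA": {"term X", "term Y"},
    "gene1-productB": {"term Y", "term Z"},
}
GENE_PRODUCTS = {"gene1": ["gene1-productA", "gene1-productB"]}


def gene_level_terms(gene):
    """Return every term applicable to any product of the given gene."""
    terms = set()
    for product in GENE_PRODUCTS[gene]:
        terms |= PRODUCT_TERMS.get(product, set())
    return terms
```

Here the gene object would carry all three terms, even though neither product has all of them.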
How are genes and gene products associated with GO terms?
A gene product can be annotated to zero or more nodes of each ontology, at any level within each ontology; annotation of a gene product to one ontology is independent of its annotation to other ontologies. Annotations should reflect the normal function, process, or localization (component) of the gene product; an activity or location observed only in a mutant or disease state is therefore not usually included.
The member databases of the GO Consortium use manual and automated methods to annotate genes or gene products using GO terms. Both manual and automated annotations are made according to two principles: first, every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis; second, the annotation must indicate what kind of evidence is found in the cited source to support the association between the gene product and the GO term. GO uses a simple controlled vocabulary to indicate the type of evidence found in the cited reference to support the annotation.
See the GO annotation guide and evidence code documentation for more information.
What criteria are used to annotate genes with GO terms?
A variety of criteria are used for each annotation, including experimental results, sequence similarity and curator judgement.
See "How are genes and gene products associated with GO terms?" in this FAQ and the GO annotation guide for more information.
What is an evidence code?
Every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis. The annotation must indicate what kind of evidence is found in the cited source to support the association between the gene product and the GO term. A simple controlled vocabulary is used to record evidence, and the evidence codes are simply the short (mostly three-letter) codes used to signify the type of evidence cited. More information on the meaning and use of the evidence codes can be found in the GO evidence codes documentation.
Can a gene or gene product be annotated to more than one term from an ontology?
Yes, a gene product can be annotated to zero or more nodes of each ontology, at any level within each ontology.
See "How are genes and gene products associated with GO terms?" in this FAQ and the GO annotation guide for more information.
Can a gene product be annotated to more than one ontology?
Yes, annotators are encouraged to annotate to terms from all three ontologies. Annotation of a gene product to one ontology is independent of its annotation to other ontologies.
See "How are genes and gene products associated with GO terms?" in this FAQ and the GO annotation guide for more information.
Why are some gene products annotated to both a parent term and a child term?
Question:
I was wondering why some gene products in the gene_product table are assigned to terms that are a parent, grandparent, etc. of another term that has been assigned to the same gene product. For example, why is "protein kinase activity" assigned to the gene product with symbol KFMS_HUMAN if "protein tyrosine kinase activity" was already assigned to it?
This is done when there is explicit evidence to support separate annotations; usually it means that there is strong evidence for a more general annotation (parent term) and weaker evidence supporting a more specific annotation (child term).
From the GO annotation guide:
Uncertain knowledge of where a gene product operates should be denoted by annotating it to two nodes, one of which can be a parent of the other. For instance, a yeast gene product known to be in the nucleolus, but also experimentally observed in the nucleus generally, can be annotated to both nucleolus and nucleus in the cell component ontology. Even though annotation to nucleolus alone implies that a gene product is also in the nucleus, annotate to both so as to explicitly indicate that it has been reported in the two locations. The two annotations may have the same or different supporting evidence.
Where can I view or download the complete sets of GO annotations?
As with the vocabularies, the gene product/GO association sets from contributing groups are available at the GO web site. Tab-delimited files of the associations between gene products and GO terms that are made by the member organizations are available from their individual FTP sites, from the GO FTP site (ftp://ftp.geneontology.org/pub/go/gene-association), or from a link on the Current Annotations table.
The gene association file format is described in the GO annotation guide. These files store IDs for objects (genes/gene products) in the database that contributed the file (e.g. FlyBase IDs, Swiss-Prot accession IDs for proteins) as well as citation and evidence data. There are also files containing Swiss-Prot/TrEMBL protein sequence identifiers for gene products that have been annotated using GO terms; they are available via FTP.
You can also download the annotations in MySQL format from http://www.geneontology.org/GO.downloads.database.shtml. Note, however, that the MySQL database dumps should not be treated as flat files to be parsed directly; they are meant to be loaded into a MySQL database and queried.
Note also that the Current Annotations table and the GO database do not agree on what constitutes a single "annotation". The former counts each line in a gene_association file as an annotation, whereas the GO database (and the dump file, if parsed directly) counts a gene product-GO term association referenced by a publication as a single annotation, even if more than one piece of evidence supports it.
Where can I find GO annotations of proteins and ESTs?
Gene objects in model organism databases typically have multiple nucleotide sequences from the public databases associated with them, including expressed sequence tags (ESTs) and one or more protein sequences. There are two ways to obtain sets of sequences with GO annotations: (1) from the model organism databases or (2) from the annotation sets for transcripts and proteins contributed to GO terms by Compugen and Swiss-Prot.
Obtaining GO annotations for model organism sequence sets: In the gene association files, the GO terms are associated with an accession ID for a gene or gene product from the contributing data resource. Usually, the association files of the gene to sequenceIDs are also available from the contributing model organism database. For example, the Mouse Genome Informatics FTP site includes the gene association files contributed to the GO, and other reports that include official mouse gene symbols and names and all curated gene : sequence ID associations.
Obtaining GO annotations for transcript and proteins in general: Large transcript and protein sequence data sets are annotated to GO terms by Compugen and Swiss-Prot/TrEMBL, respectively. These files can be downloaded direct from the GO web site. Species of origin for the sequence is included in the association files.
How do I find all the human genes that have been annotated with a particular GO term?
GO terms have been associated with a non-redundant set of human proteins described in Swiss-Prot/TrEMBL/InterPro and Ensembl. These annotations are available in the GOA Human file on the EBI and GO FTP sites.
GOA project data are also accessible from Ensembl and from the EMBL/DDBJ/GenBank nucleotide sequences stored at EMBL-Bank. For more information about browsing GOA project data at EBI, see the EBI's GOA page (http://www.ebi.ac.uk/GOA/index.html).
How can I get sequences of proteins annotated to a GO term?
The AmiGO browser allows you to search the GO annotations contributed by all the participating databases and retrieve protein sequences for annotated gene products (if available).
To retrieve sequences using AmiGO, go to http://amigo.geneontology.org/cgi-bin/amigo/go.cgi and enter your chosen GO term (e.g. 'mitochondrion') in the 'search for GO terms and associated genes' field. On the page listing the annotated genes, check the genes you require the sequence for and, at the bottom of the page, toggle the option box to 'Get Fasta Sequences'. Hit the 'Submit' button.
If you would like to retrieve sequences for only one species or only one data source then please choose a filter setting using the menus under the heading 'Filter Associations'.
Why can't I retrieve sequences for some annotated proteins?
The AmiGO browser uses the GO database, which is built using submissions from various bioinformatics groups and model organism databases (GO Consortium members). Some of these groups also submit an optional mapping from their IDs to protein sequence database IDs. Not all do, which is why you only get a subset.
Can I sort or filter annotations by evidence code?
I want the most reliable data available, so I want to retrieve only the annotations that were done manually. How can I do this?
Many GO browsers, including AmiGO, allow you to select one or more evidence codes, and retrieve annotations using the selected codes. In AmiGO, use the advanced query and select codes from the Evidence Type list.
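If you are working with downloaded gene association files instead of a browser, the same kind of filtering can be sketched in Python. The sketch assumes that the evidence code sits in the seventh tab-delimited column and that comment lines begin with '!'; see the GO annotation guide for the authoritative file format.

```python
EVIDENCE_COLUMN = 6  # zero-based index of the seventh column (assumed position)


def filter_by_evidence(lines, include=None, exclude=()):
    """Keep annotation lines whose evidence code passes the filter.

    include: if given, only lines with one of these codes are kept.
    exclude: lines with one of these codes are always dropped.
    """
    kept = []
    for line in lines:
        if line.startswith("!"):  # assumed comment/header marker
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) <= EVIDENCE_COLUMN:
            continue
        code = columns[EVIDENCE_COLUMN]
        if include is not None and code not in include:
            continue
        if code in exclude:
            continue
        kept.append(line)
    return kept
```

Calling `filter_by_evidence(lines, exclude={"IEA"})` drops the electronically inferred annotations and keeps the rest.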
What are the advantages and disadvantages of manual annotation?
The most reliable annotations are those made manually by database curators based on primary and review literature. Manual annotations often cite experimental evidence that provides strong support for the association of a GO term with a gene product, and can be done at a very detailed level. The chief disadvantage of manual annotation is that it is labor-intensive, requiring a lot of time and effort from trained biologists.
What are the advantages and disadvantages of automatic annotation?
The main advantage of automatic annotation is simply speed: wholly or partially automated methods facilitate the annotation of much larger sets of known or predicted gene products than can be produced manually. Automation, however, yields more general (less detailed) annotations compared to manual annotation, and automated methods are more error-prone. Annotations made by automated methods are therefore regarded as less reliable than manual ones.
How is annotation quality controlled to ensure consistency between databases?
The accuracy of GO annotations is a high priority for all members of the GO Consortium. Each member organization is responsible for keeping its own annotations accurate and up to date, and for correcting any errors. Users can report errors to the GO mailing list at the GO helpdesk; any comments on annotations will be forwarded to the appropriate contributing group.
The GO Consortium is also looking into possible ways to improve quality assurance further, such as manually reviewing selected annotations and developing tools to automate detection of potentially erroneous annotations.
How often does automatic annotation give results that are consistent with manual annotation?
In general, electronic annotations are rarely incorrect, as they are annotations to very high-level GO terms. For example, the GOA group at EBI reports:
Usually manual annotation simply provides deeper-level terms in GO. In 93% of cases GOA's electronic annotation is in the same GO lineage as the manual annotation. Some users have used our manual annotation to assess the quality of their automatic GO annotation techniques. They have found a few manual annotation errors by Proteome Inc. but no errors (so far) of manual annotation by Swiss-Prot staff have been reported to GOA. A few InterPro2GO errors have been reported but not very many. So, in general, our electronic techniques are very accurate, and are sometimes based on manual annotation. For example, Swiss-Prot keywords are usually manually annotated to Swiss-Prot entries; by using a mapping of Swiss-Prot keywords to GO, GOA inherits the high quality of Swiss-Prot manual annotation.
There has been further investigation into this topic in 2005:
"The quality of electronic annotation has recently been assessed in some detail (Camon et al., 2005). This research found that in the worst case scenario, the generation of electronic annotations using the interpro2go, spkw2go, and ec2go mapping files precisely predicted the correct GO term 60% to 70% of the time, with the remainder of the predictions being to insufficiently specific GO terms. The high precision was found to be due to the basing of electronic annotations on manually curated mapping files. Curators noted that it was more important for database curation to be accurate than to have complete coverage, and the figures above demonstrate that this is the tendency with electronic annotation." Text from (Clark et al., 2005).
How do I annotate ESTs?
To make electronic GO annotations to ESTs, it is usual to BLAST the EST sequences against sequences that have been manually annotated and transfer the annotations from similar sequences, adding the evidence code IEA.
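The transfer step can be sketched in Python, assuming the BLAST results are in the standard 12-column tabular output (query ID in column 1, subject ID in column 2, e-value in column 11). The e-value cutoff and the `subject_terms` lookup are illustrative assumptions, not a recommended protocol.

```python
def transfer_annotations(blast_lines, subject_terms, max_evalue=1e-10):
    """Map each EST to (GO ID, 'IEA') pairs taken from sufficiently similar hits.

    blast_lines: lines of BLAST tabular output (12 tab-delimited columns assumed).
    subject_terms: dict mapping subject sequence ID to its manually assigned GO IDs.
    max_evalue: assumed similarity cutoff; choose a value suited to your data.
    """
    transferred = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 12:
            continue
        query, subject, evalue = columns[0], columns[1], float(columns[10])
        if evalue > max_evalue:
            continue  # hit is not similar enough to transfer annotations
        for go_id in subject_terms.get(subject, ()):
            transferred.setdefault(query, set()).add((go_id, "IEA"))
    return transferred
```

Every transferred term is tagged IEA because no curator has reviewed it; ESTs without a qualifying hit simply do not appear in the result.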
Some useful tools for EST annotation:
The AmiGO browser has a BLAST query feature built in, which you can use to query annotated gene products in the GO database. For large batch queries, you may want to download the file of annotated sequences and use it to run BLAST locally. The file is available from the GO ftp site (ftp://ftp.geneontology.org/pub/go) and is updated regularly.
Another option might be to install the AmiGO code and GO database locally.
The underlying data are in flat files that can be found in these directories on the GO FTP site:
- ftp://ftp.geneontology.org/pub/go/gene_associations (annotated gene products)
- ftp://ftp.geneontology.org/pub/go/gp2protein (UniProt IDs for annotated protein sequences)
There is a README for the gp2protein directory. The format of the files in the /gene_associations directory is described in the GO annotation guide. Please let us know if you have questions about these files.
- You could also try using InterProScan to find protein domains/motifs encoded by the ESTs, and transfer GO terms that have been associated with InterPro entries. See InterPro for more information.
- TIGR's files of GO associations for the Tentative Consensus sequences in TIGR's Gene Index databases are included on the GO web site. The Tentative Consensus sequences are assemblies of ESTs. All of the GO terms were assigned by computer and therefore carry the evidence code IEA (electronic, i.e. computed and not reviewed by a human).
- Several other groups have done automated assignment of GO terms to genes or proteins, including ESTs, and many of them would probably be willing to share their methods and software. The GO Bibliography includes a list of publications on GO annotations.
How do I submit annotations to GO?
Please write to the GO helpdesk with your suggestions.
Suggestions for updating existing annotations will be forwarded to the contributing group.
If you want to submit a new annotation set, we will reply and tell you how to proceed.
Is there a file that shows the numerical identifiers for GO terms and the corresponding annotated genes?
There is a directory of flat files, contributed by several different databases, that associate genes or gene products with GO terms: ftp://ftp.geneontology.org/pub/go/gene-associations. The file format is documented in the GO annotation guide.
[edit] Why is my gene/protein not in the GO database?
GO is a work in progress, so not all genes and proteins have GO terms associated with them yet. To check on the annotation status of a particular gene or protein, please email the GO helpdesk.
[edit] GOA - Human Annotations
[edit] What is GOA?
The GOA project aims to provide high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and International Protein Index (IPI), and is a central dataset for other major multi-species databases, such as Ensembl and NCBI.
GOA has been a member of the GO Consortium <http://www.geneontology.org/> since 2001, and is responsible for the integration and release of GO annotations to the human, chicken and cow proteomes. In 2006 GOA became a central participant in the new GOC Reference Genome Annotation project and is committed to the comprehensive annotation of a set of disease-related proteins in human.
Because of the multi-species nature of the UniProtKB, GOA also assists in the curation of another 120,000 species. This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups (e.g. LifeDB). This effort ensures that the GOA dataset remains a key reference and a comprehensive source of GO annotation for all species.
[edit] Why are InterPro2GO mappings not updated with GOA releases?
GOA is updated in accordance with the latest data released by its core databases (UniProtKB, IPI, InterPro and Ensembl) as well as mappings of SWISS-PROT Keywords, InterPro and Enzyme Commission (EC) terms to GO. Each of GOA's core databases produces its own releases; for example, InterPro has dependencies on the member databases of InterPro. InterPro2GO is updated at regular intervals but not always in keeping with the monthly schedule of GOA releases.
[edit] What methods are used for automatic annotation in GOA?
At present, GOA has five independent methods for automatic annotation.
Four of these methods use mappings of concepts from external database systems that have been manually indexed to equivalent GO terms. The mappings used by GOA are of InterPro signatures, Enzyme Commission numbers, Swiss-Prot keywords and HAMAP families to GO terms. Further information on mapping files can be found at: http://www.geneontology.org/GO.indices.shtml.
In addition, GOA has recently provided a new electronic annotation technique in collaboration with the Ensembl group. Using the gene orthology obtained from the Ensembl Compara pipeline, GO terms from a source species have been projected onto one or more target species. This method has provided over 30,000 annotations to the human, mouse, rat, chicken, dog, bovine and Anopheles gambiae proteomes. Only one-to-one and apparent one-to-one orthologies are used, and only manually annotated GO terms with an evidence type of IDA, IEP, IGI, IMP or IPI are projected.
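The projection rule described above can be sketched roughly as follows. This is not the GOA or Ensembl pipeline: the gene names, the orthology type labels and the choice to tag projected annotations as IEA are all assumptions made for illustration.

```python
# Evidence codes that the text says are eligible for projection.
PROJECTABLE = {"IDA", "IEP", "IGI", "IMP", "IPI"}

# Hypothetical labels for the two accepted orthology types.
ACCEPTED_TYPES = {"ortholog_one2one", "apparent_ortholog_one2one"}


def project_annotations(source_annotations, orthologs):
    """Project GO terms from source genes onto their one-to-one orthologs.

    source_annotations: iterable of (source_gene, go_id, evidence_code)
    orthologs: iterable of (source_gene, target_gene, orthology_type)
    Returns a set of (target_gene, go_id, "IEA") tuples; the IEA tag is an
    assumption, since the text describes the method as electronic.
    """
    target_of = {s: t for s, t, kind in orthologs if kind in ACCEPTED_TYPES}
    projected = set()
    for gene, go_id, evidence in source_annotations:
        if evidence in PROJECTABLE and gene in target_of:
            projected.add((target_of[gene], go_id, "IEA"))
    return projected


# Made-up example data.
source = [
    ("mouse_gene_1", "GO:0005215", "IDA"),   # projected
    ("mouse_gene_1", "GO:0005488", "IEA"),   # not manual, so skipped
    ("mouse_gene_2", "GO:0019825", "IMP"),   # orthology is not one-to-one
]
orthologs = [
    ("mouse_gene_1", "human_gene_1", "ortholog_one2one"),
    ("mouse_gene_2", "human_gene_2", "ortholog_one2many"),
]
projected = project_annotations(source, orthologs)
```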
Further detail on these annotation techniques and their format in the GOA gene association file is described in the GOA readme, available at: http://www.ebi.ac.uk/GOA/goaHelp.html
[edit] Is the GOA-Human association file a subset of the GOA UniProt file? What are the differences?
GOA-Human is not a subset of the GOA UniProt file.
The GOA UniProt gene association file contains all manual and electronic annotations that GOA has assigned to UniProtKB entries. This dataset contains annotations to more than 120,000 different species and is redundant for electronic annotations where two different electronic methods have assigned the same or less granular GO term.
The IPI project (http://www.ebi.ac.uk/IPI/) provides the GOA group with minimally redundant yet maximally complete sets of proteins for a number of species, including the human proteome. These sets are assembled from protein sequence information taken from UniProtKB, RefSeq, Ensembl, TAIR, H-InvDB and Vega.
While entries in the UniProt Knowledgebase (Swiss-Prot and TrEMBL) representing proteins with identical sequences are merged, a low level of semantic redundancy remains (where different sequences represent the same underlying entity, perhaps due to biological variation or sequencing error). Using IPI, these surplus entries can be filtered out to produce a non-redundant data set.
For the GO terms in these IPI files, our aim has been to remove those electronic annotations that were created by the same technique and that predict the same or a less granular GO term.
An example would be for annotations created by the InterPro2GO mapping technique. In the redundant UniProt gene association file, there are three annotations to binding terms for protein P02144:
UniProt P02144 MYG_HUMAN GO:0005488 GOA:interpro IEA InterPro:IPR000971 F IPI00217493 protein taxon:9606 20060125 UniProt
UniProt P02144 MYG_HUMAN GO:0019825 GOA:interpro IEA InterPro:IPR002335 F IPI00217493 protein taxon:9606 20060125 UniProt
UniProt P02144 MYG_HUMAN GO:0020037 GOA:interpro IEA InterPro:IPR012292 F IPI00217493 protein taxon:9606 20060125 UniProt
(GO:0005488 - 'binding', GO:0019825 - 'oxygen binding', GO:0020037 - 'heme binding')
However within the human IPI species-specific file there exist only two of these three:
UniProt P02144 MYG_HUMAN GO:0019825 GOA:interpro IEA InterPro:IPR002335 F Myoglobin IPI00217493 protein taxon:9606 20060223 UniProt
UniProt P02144 MYG_HUMAN GO:0020037 GOA:interpro IEA InterPro:IPR012292 F Myoglobin IPI00217493 protein taxon:9606 20060223 UniProt
(GO:0019825 - 'oxygen binding' and GO:0020037 - 'heme binding')
The GO term for 'binding' has been removed from the human file as it does not provide users with any extra information, as it is a less granular parent to the oxygen and heme binding terms. This can be done because of the 'true path rule' that GO follows.
In the true path rule "the pathway from a child term all the way up to its top-level parent(s) must always be true", so annotating a protein to a term such as 'oxygen binding' automatically implies that the protein would also be correctly annotated to its parent term 'binding'. This is known because the 'binding' GO term is displayed in GO as a parent of 'oxygen binding'.
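The filtering described above can be sketched as removing every term that is an ancestor of another term in the same annotation set. The parent map below is a hand-written simplification that links each term directly to 'binding' (intermediate terms in the real ontology are left out), and the function names are illustrative; this is not GOA's own code.

```python
# Simplified child -> parents map; an assumption for illustration only.
PARENTS = {
    "GO:0019825": {"GO:0005488"},  # oxygen binding -> binding
    "GO:0020037": {"GO:0005488"},  # heme binding -> binding (path collapsed)
    "GO:0005488": set(),           # binding
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def most_granular(terms, parents):
    """Drop terms that are already implied by a more specific term in the set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return terms - implied


redundant = {"GO:0005488", "GO:0019825", "GO:0020037"}
filtered = most_granular(redundant, PARENTS)
```

Because of the true path rule, the dropped parent term can always be recovered from the remaining child terms.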
[edit] Most of the publicly available expression data gives GenBank IDs for the human genes but the GO annotation uses UniProtKB accessions or IPI identifiers. Is there an easy way to map between the two?
The Gene Ontology Consortium provides annotations for gene products, so we reference protein sequences and not DNA sequences. It is not necessarily useful to know just the DNA accession number because so many different coding sequences can be within one DNA sequence. There are at least two ways to find Protein_IDs:
By retrieving the SWISS-PROT file from the EBI's FTP site: The DR line of almost every SWISS-PROT entry contains the following, e.g.:
DR EMBL; AF043736; AAC02090.1; -.
AF043736 is the EMBL/Genbank/DDBJ AC number and AAC02090 is the protein identifier for the coding sequence (CDS) within the EMBL/Genbank/DDBJ entry. These are universal IDs shared by all three of the collaborating nucleotide sequence databases.
Additionally (and redundantly), GenBank identifies its proteins by a second identifier, the GI number. SWISS-PROT does not keep cross references to Genbank GI numbers, but you can map between protein identifiers by retrieving the NCBI's non-redundant protein dataset from ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.Z
You can parse the deflines of that file. If two sequences are identical they are merged and their information is merged into the defline. For example, searching for Q9W4P5 hits the following:
gi|18543319|ref|NP_570080.1| (NM_130724) CG2934 gene product [Drosophila melanogaster]gi|12585516|sp|Q9W4P5|V0D1_DROME Vacuolar ATP synthase subunit d 1 (V-ATPase d subunit 1) (Vacuolar proton pump d subunit 1) (V-ATPase 39 KDa subunit 1)gi|7290447|gb|AAF45902.1| (AE003429) CG2934 gene product [Drosophila melanogaster]gi|17862396|gb|AAL39675.1| (AY069530) LD24653p [Drosophila melanogaster]
From the above you can find the Protein_ID (AAL39675.1), the universal DNA accession for the region that contains this CDS (AY069530), and the GI number.
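A minimal sketch of parsing such a merged defline is shown below, using the example quoted above. The regular expression is an assumption based only on the layout visible in that example (gi number, source database, accession, optional locus name, optional nucleotide accession in parentheses) and may need adjusting for other records.

```python
import re

# The merged defline from the example above, split across lines for readability.
DEFLINE = (
    "gi|18543319|ref|NP_570080.1| (NM_130724) CG2934 gene product [Drosophila melanogaster]"
    "gi|12585516|sp|Q9W4P5|V0D1_DROME Vacuolar ATP synthase subunit d 1 (V-ATPase d subunit 1) "
    "(Vacuolar proton pump d subunit 1) (V-ATPase 39 KDa subunit 1)"
    "gi|7290447|gb|AAF45902.1| (AE003429) CG2934 gene product [Drosophila melanogaster]"
    "gi|17862396|gb|AAL39675.1| (AY069530) LD24653p [Drosophila melanogaster]"
)

# gi|<number>|<database>|<accession>|<locus, may be empty> (<nucleotide accession>)?
ENTRY = re.compile(r"gi\|(\d+)\|(\w+)\|([^|\s]+)\|(\S*)\s*(?:\((\w+)\))?")


def parse_defline(defline):
    """Return one dict per merged entry found in an nr-style defline."""
    entries = []
    for match in ENTRY.finditer(defline):
        gi, database, accession, locus, nucleotide = match.groups()
        entries.append({
            "gi": gi,
            "db": database,
            "accession": accession,
            "locus": locus or None,
            "nucleotide": nucleotide,  # None when no accession follows in parentheses
        })
    return entries


entries = parse_defline(DEFLINE)
# Map the Swiss-Prot accession to the other protein identifiers in the same defline.
others = [e["accession"] for e in entries if e["db"] != "sp"]
```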
By using the EBI's Sequence retrieval system (SRS): You can use SRS to search the EBI's GO annotation (GOA) files or the GO database, which is a mirror of the GO consortium repository.
For example, to search GOA for all proteins that function as transporters (GO:0005215) and that have an experimental evidence code:
- Choose 'extended search' and enter the GO identifier '0005215' in the 'goid' search field.
- In the 'combine searches with' section of the tool bar on the left-hand side of the page, select the 'BUTNOT' option and, in the 'evidence' field, add the GO evidence code IEA (this means 'inferred from electronic annotation'). This creates a query that searches for all proteins that have been linked to the GO term 'transporter' and that were manually curated.
- SRS can link your results to databases that do not contain direct references to each other. For example, in the last search, SWISS-PROT, InterPro and TrEMBL accession numbers will be displayed in the results page but the search can also be extended to show all transporters in the last search that share EMBL/GenBank/DDBJ accession numbers. To do this, on the results page select the 'link' option on the left-hand tool bar, choose 'EMBL' and hit the 'submit link' button.
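The same 'GO term BUTNOT IEA' selection can also be applied to annotation records that have already been downloaded and parsed. The sketch below is not SRS; the record layout and the accessions are made up for illustration.

```python
def manually_curated(annotations, go_id, excluded_evidence=("IEA",)):
    """Return accessions annotated to `go_id` with a non-excluded evidence code.

    annotations: iterable of (accession, go_id, evidence_code) tuples.
    """
    return sorted({
        accession
        for accession, term, evidence in annotations
        if term == go_id and evidence not in excluded_evidence
    })


# Hypothetical records; the accessions are placeholders, not real entries.
records = [
    ("ACC_0001", "GO:0005215", "IDA"),
    ("ACC_0002", "GO:0005215", "IEA"),
    ("ACC_0003", "GO:0005215", "IMP"),
    ("ACC_0003", "GO:0005488", "IEA"),
]
transporters = manually_curated(records, "GO:0005215")
```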
[edit] Is there a mapping of GO IDs to UniGene IDs?
GOA provides a UniGene to UniProtKB mapping (available at: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/gp2protein/gp2protein.unigene.gz). Using either one of GOA's non-redundant species-specific files (Arabidopsis, chicken, cow, human, mouse, rat and zebrafish proteomes) or the GOA UniProt gene association file, you should be able to parse out the appropriate set of GO terms for a set of UniGene IDs.
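Once both files have been parsed, combining them is a simple join on the UniProtKB accession. The sketch below starts from already-parsed pairs, because the exact layout of the mapping file is not described here; the UniGene cluster names are placeholders and the function name is illustrative.

```python
from collections import defaultdict


def go_terms_for_unigene(unigene_to_uniprot, protein_annotations):
    """Join UniGene -> UniProtKB pairs with UniProtKB -> GO pairs.

    unigene_to_uniprot: iterable of (unigene_id, uniprot_accession)
    protein_annotations: iterable of (uniprot_accession, go_id)
    Returns {unigene_id: set of GO IDs}.
    """
    go_by_protein = defaultdict(set)
    for accession, go_id in protein_annotations:
        go_by_protein[accession].add(go_id)

    go_by_cluster = defaultdict(set)
    for cluster, accession in unigene_to_uniprot:
        go_by_cluster[cluster] |= go_by_protein.get(accession, set())
    return dict(go_by_cluster)


# Placeholder cluster IDs; the protein annotations reuse the myoglobin example.
mapping = [("CLUSTER_A", "P02144"), ("CLUSTER_B", "ACC_UNANNOTATED")]
protein_go = [("P02144", "GO:0019825"), ("P02144", "GO:0020037")]
cluster_go = go_terms_for_unigene(mapping, protein_go)
```

Clusters whose proteins have no annotation are kept with an empty set, which makes gaps in coverage easy to spot.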
[edit] How do I associate an EMBL/DDBJ/Genbank nucleotide sequence accession number with the GO ID?
In addition to the gene association files produced by the GOA project, we also provide mapping files between the entries in these (Arabidopsis, chicken, cow, human, mouse, rat and zebrafish) sets and other databases such as the EMBL/GenBank/DDBJ nucleotide sequence databases, HUGO, LocusLink and RefSeq at the NCBI. The readme for these files can be found at: http://www.geneontology.org/doc/goa.README
Also the information to link the protein and nucleotide data exists in almost every UniProtKB entry. The specific format for cross-references from Swiss-Prot or TrEMBL to coding sequences (CDS) in the DDBJ/EMBL/GenBank nucleotide sequence database is in the DR line, e.g.:
DR EMBL; AF043736; AAC02090.1; -.
AF043736 is the EMBL/GenBank/DDBJ accession number.
AAC02090 is the protein-id/Protein Sequence Identifier for the CDS within the EMBL/GenBank/DDBJ entry.
These two are universal IDs shared by all three of the collaborating nucleotide sequence databases.
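Extracting these cross-references from a flat-file entry can be sketched as below. The sketch assumes that each cross-reference sits on its own line starting with 'DR', with fields separated by semicolons as in the example; the second DR line in the sample is invented to show that non-EMBL lines are ignored.

```python
def parse_embl_xrefs(entry_text):
    """Return (nucleotide_accession, protein_id) pairs from 'DR EMBL' lines."""
    xrefs = []
    for line in entry_text.splitlines():
        if not line.startswith("DR"):
            continue
        # Drop the line code and the trailing full stop, then split the fields.
        fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
        if len(fields) >= 3 and fields[0] == "EMBL" and fields[2] != "-":
            xrefs.append((fields[1], fields[2]))
    return xrefs


SAMPLE_ENTRY = "\n".join([
    "DR   EMBL; AF043736; AAC02090.1; -.",
    "DR   InterPro; IPR000971; Globin.",   # illustrative non-EMBL line
])
xrefs = parse_embl_xrefs(SAMPLE_ENTRY)
```

Pairing each protein's accession with the GO IDs from a gene association file would then give nucleotide accession to GO ID associations.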
We have released all GO annotation to SWISS-PROT and TrEMBL, and we are working on adding this GO annotation directly to the EMBL-Bank database; this should be possible by the EMBL Christmas 2002 release. EMBL, DDBJ and GenBank form an international collaboration whose members exchange information on a daily basis. For the time being, the only way to download EMBL/GenBank/DDBJ accession numbers with GO annotation is to use the EBI's sequence retrieval system (SRS: http://srs.ebi.ac.uk/).
Search the GOA database by GO ID or GO evidence code and then use the link option to link to EMBL-Bank database.
GO IDs are also displayed by the Ensembl database for the human genome. For more information read the GOA or Ensembl home pages, http://www.ebi.ac.uk/GOA and http://www.ebi.ac.uk/ensembl .
[edit] GO content
[edit] What is GO content?
GO content refers to the content of the ontologies themselves and the biology underlying them. It includes anything to do with terms and their organisation, definitions, synonyms and the relationships between terms.
[edit] How can I suggest new GO terms?
The GO vocabularies are updated on a regular basis, and suggestions from the community for additional terms or for other improvements are very welcome. You can make and track your suggestions via the Curator Requests Tracker.
This system is very simple to use - please see the instructions on the GO website.
You can also submit your suggestions to the GO helpdesk.
[edit] Is there any way to convert downstream GO terms to a GO slim term?
There is a script, map2slim.pl, that does essentially this. It uses the GO MySQL database and Perl API, so you should familiarize yourself with those. The script is in the directory http://www.fruitfly.org/developers/src/go-dev/apps/query-utils/
Database and API documentation are available:
- http://www.godatabase.org/dev/database/
- http://www.godatabase.org/dev/develop.html
- http://www.fruitfly.org/cgi-bin/wiki/view.pl/GoWeb/GoAPI
Web-based implementations of map2slim are also available; GO Term Mapper is a tool developed at Princeton University which can be used to map terms to their corresponding GO slim term for any species, while the SGD Gene Ontology Slim Mapper does the same for Saccharomyces cerevisiae data.
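The core idea of slim mapping can be illustrated without the database: walk up from a term through its parents and report the slim terms met on the way. This is a simplified sketch and not the map2slim.pl algorithm; the parent map collapses the real ontology paths into direct links, and the one-term slim set is made up.

```python
def map_to_slim(term, parents, slim_terms):
    """Return the slim terms that are `term` itself or one of its ancestors."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim_terms:
            found.add(current)
        stack.extend(parents.get(current, ()))
    return found


# Simplified child -> parents links; intermediate terms are omitted.
PARENTS = {
    "GO:0006007": {"GO:0005975"},  # glucose catabolism -> carbohydrate metabolism
    "GO:0005975": {"GO:0008152"},  # carbohydrate metabolism -> metabolism
    "GO:0008152": {"GO:0008150"},  # metabolism -> biological_process
    "GO:0008150": set(),
}
SLIM = {"GO:0005975"}  # hypothetical one-term slim

slim_hits = map_to_slim("GO:0006007", PARENTS, SLIM)
```

A term with no slim ancestor maps to an empty set, so such terms can be reported separately.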
[edit] Can a term that is listed two places in an ontology file have children in one place but not the other?
No - the term will always have the same children wherever it appears, and however many times it appears.
[edit] Can a term in one ontology have parents in one of the other two ontologies?
No - relationships are not currently created between the different ontologies.
[edit] Why is there no definition for my GO ID?
This is because not all GO terms have definitions yet. Currently over 95% of terms are defined, and eventually all GO terms will have a definition.
If you would like to suggest a definition for an undefined term, please submit it to the requests tracker.
[edit] Why is the term Gene_Ontology now obsolete?
The former root node GO:0003673 Gene_Ontology is now obsolete because it did not represent an actual biological concept. It was originally created because some software -- including AmiGO -- relies on there being a root node, so the developers have now created an artificial node in the MySQL database called "all" that is the root of all possible concepts.
[edit] Where have the 'unknown' terms gone?
Good principles of ontological design state that terms should represent biological entities that actually exist, e.g., functional activities that are catalyzed by enzymes, biological processes that are carried out in cells, specific locations or complexes in cells, etc. To adhere to these principles the Gene Ontology Consortium has removed the terms, "biological process unknown" (GO:0000004), "molecular function unknown" (GO:0005554) and "cellular component unknown" (GO:0008372) from the ontology.
The "unknown" terms violated this principle of sound ontological design because they did not represent actual biological entities but instead represented annotation status. Annotations to "unknown" terms distinguished between genes that were curated when no information was available and genes that were not yet curated (i.e., not annotated). Annotation status is now indicated by annotating to the root nodes, i.e. "biological_process" (GO:0008150), "molecular_function" (GO:0003674), or "cellular_component" (GO:0005575). These annotations continue to signify that a given gene product is expected to have a molecular function, biological process, or cellular component, but that no information was available as of the date of annotation.
Adhering to principles of correct ontology design should allow GO users to take advantage of existing tools and reasoning methods developed by the ontological community.
[edit] How can I calculate the 'levels' of GO terms?
GO terms do not occupy strict fixed levels in the hierarchy. Because GO is a Directed Acyclic Graph (DAG), terms can occupy different levels if different paths are followed through the DAG. This is especially true if one mixes is_a and part_of relations. Thus it is more proper to ask: "what is the maximum depth of such and such a term?" (or minimum, or average).
We do not pre-generate reports showing this. If you genuinely want this information you can perform SQL queries on our database to get it. See this example.
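Separately from the SQL route, the notion of minimum and maximum depth can be illustrated on a toy graph held in memory. The sketch below assumes a child-to-parents mapping, uses made-up term names and draws no distinction between is_a and part_of links.

```python
from functools import lru_cache

# Toy DAG: "c" can be reached from the root directly or via "a" and "b".
PARENTS = {
    "root": (),
    "a": ("root",),
    "b": ("a",),
    "c": ("root", "b"),
}


def depth_range(term, parents):
    """Return (min_depth, max_depth) of `term`, counted in edges from a root."""

    @lru_cache(maxsize=None)
    def walk(current):
        direct_parents = parents.get(current, ())
        if not direct_parents:
            return (0, 0)  # a root has depth zero
        ranges = [walk(p) for p in direct_parents]
        return (1 + min(r[0] for r in ranges), 1 + max(r[1] for r in ranges))

    return walk(term)


c_depths = depth_range("c", PARENTS)  # (1, 3): one edge directly, three via a and b
```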
But you may want to reconsider whether you want this information at all! The (maximum) depth of a term may not be as informative as you think.
A more informative metric would be the information content of the node, based on annotations. See, for example, the work of Alterovitz et al.
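One common way to define information content is the negative log of the fraction of annotated gene products that are annotated to a term or to any of its descendants. The sketch below uses that definition as an assumption; it is not necessarily the measure used by Alterovitz et al., and the gene product names and parent map are made up.

```python
import math


def information_content(direct, parents):
    """Return {term: -log2(fraction of gene products annotated at or below it)}.

    direct: {term: set of gene products annotated directly to that term}
    parents: {term: iterable of parent terms}
    """
    propagated = {}
    for term, products in direct.items():
        # Push each direct annotation up to every ancestor (true path rule).
        stack, seen = [term], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            propagated.setdefault(current, set()).update(products)
            stack.extend(parents.get(current, ()))
    total = len(set().union(*direct.values()))
    return {t: -math.log2(len(p) / total) for t, p in propagated.items()}


PARENTS = {"GO:0019825": ["GO:0005488"], "GO:0020037": ["GO:0005488"]}
DIRECT = {
    "GO:0019825": {"gp1"},
    "GO:0020037": {"gp1", "gp2"},
    "GO:0005488": {"gp3", "gp4"},
}
ic = information_content(DIRECT, PARENTS)
```

In this toy data the root-like term 'binding' covers every gene product and so carries no information, while the rarer 'oxygen binding' scores highest.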
[edit] Mapping other classification systems to GO
[edit] Why are Interpro2go mappings not updated with GOA releases?
GOA is updated in accordance with the latest data released by its core databases (SWISS-PROT, TrEMBL, InterPro, Ensembl) as well as mappings of SWISS-PROT Keywords, InterPro and Enzyme Commission (EC) terms to GO. Each of GOA's core databases produces its own releases; for example, InterPro has dependencies on the member databases of InterPro. InterPro2GO is updated at regular intervals but not always in keeping with the monthly schedule of GOA releases.
[edit] What are mappings?
The mapping files contain concepts from systems external to GO, e.g. Enzyme Commission numbers, SWISS-PROT keywords and TIGR roles, indexed to equivalent GO terms. The mappings are typically made manually; details can be found in the file header. See Mappings to GO for the files available.
[edit] Software and tools
[edit] Where can I find software to allow me to browse the GO terms and annotations?
GO terms and annotations can be browsed using various tools, all of which can be found on the GO software page under the heading Tools for searching and browsing GO. Most GO browsers are web-based and will allow you to view a term and its attributes, such as definitions, synonyms and database references. Some, such as AmiGO and QuickGO, also let you see annotations to each term. The downloadable GO editor OBO-Edit shows ontology information. There is a short description of each browser on the web page to allow you to choose the appropriate tool for your task.
[edit] Where can I find software to allow me to edit the GO terms and annotations?
GO terms and annotations can be edited using various pieces of software, all of which can be found on the GO software page.
- GO terms and the ontology structures: we recommend using OBO-Edit for any serious ontology editing. OBO-Edit was developed by the software group at BDGP specifically for ontology editing and ensures that file syntax remains correct. It can also be used to edit other ontologies in the same format.
- GO annotations can be edited using various database-specific tools; for example, Manatee at TIGR and the EBI's Talisman tool. Please contact the relevant database to find out how their GO annotation is done.
Please note that only authorized GO curators with CVS write access can edit the GO files, and annotations can only be submitted by authorized annotators. If you would like to submit annotations or contribute to the ontologies, have a look at the 'Contribute to GO' section of this FAQ.
[edit] Where can I find more general GO tools?
General GO or GO-related tools can be found on the GO tools page, split into one of a number of categories. There is a short description of each tool on the web page to allow you to choose the appropriate tool for your task.
[edit] Applications of GO
[edit] How is the GO used in genome analysis?
Genome and full-length cDNA sequence projects often include computational (putative) assignments of molecular function based on sequence similarity to annotated genes or sequences. A common tactic now is to use a computational approach to establish some threshold sequence similarity to a SWISS-PROT sequence. Then the GO associations to the SWISS-PROT sequence can be retrieved and associated with the gene model. Under the GO guidelines, the evidence code for this event would be 'inferred from electronic annotation' (IEA).
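The tactic described above can be sketched as a threshold filter followed by a lookup. The hit tuples, the percent-identity measure, the 60% cut-off and all identifiers below are hypothetical; an appropriate similarity measure and threshold depend on the project.

```python
def transfer_annotations(hits, swissprot_go, min_identity=60.0):
    """Assign GO terms to gene models from sufficiently similar SWISS-PROT hits.

    hits: iterable of (gene_model, swissprot_accession, percent_identity)
    swissprot_go: {swissprot_accession: iterable of GO IDs}
    Returns a set of (gene_model, go_id, "IEA") tuples.
    """
    assigned = set()
    for gene_model, accession, identity in hits:
        if identity < min_identity:
            continue  # below the chosen similarity threshold
        for go_id in swissprot_go.get(accession, ()):
            assigned.add((gene_model, go_id, "IEA"))
    return assigned


# Made-up similarity hits against the myoglobin example entry.
hits = [("model_1", "P02144", 87.5), ("model_2", "P02144", 31.0)]
swissprot_go = {"P02144": ["GO:0019825", "GO:0020037"]}
inferred = transfer_annotations(hits, swissprot_go)
```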
One aspect of the use of the GO for annotation of large data sets is the ability to group gene products under some high-level term. For example, while gene products may be precisely annotated as having a role in a particular function in carbohydrate metabolism (e.g., glucose catabolism), in the summary documentation of the data set, all gene products functioning in carbohydrate metabolism could be grouped together as being involved in the more general phenomenon 'carbohydrate metabolism'. Various sets of GO terms have been used to summarize experimental data sets in this way. The expectation is that published sets of high-level GO terms used in genome annotations and publications will be archived at the GO site. Some of these 'GO slims' are already available.
[edit] What are all the possible uses of GO?
It would be impossible to list all the potential applications of GO, but applications for which GO has already been used include the following:
- integrating proteomic information from different organisms;
- assigning functions to protein domains;
- finding functional similarities in genes that are overexpressed or underexpressed in diseases and as we age;
- predicting the likelihood that a particular gene is involved in diseases that haven't yet been mapped to specific genes;
- analysing groups of genes that are co-expressed during development;
- developing automated ways of deriving information about gene function from the literature;
- verifying models of genetic, metabolic and product interaction networks.
For references to these and other studies that have used GO, see the GO Publications.
[edit] How is the GO used in gene expression analysis?
The inclusion of GO annotation in microarray datasets can often reveal aspects of why a particular group of genes share similar expression patterns. Sets of co-expressed genes can encode products that are involved in a common biological process, and may be localized to the same cellular component. In cases where a few uncharacterized genes are co-expressed with well-characterized genes annotated to identical or similar GO process terms, one can infer that the 'unknown' gene's product is likely to act in the same process. Software for manipulating and analyzing microarray gene expression data that incorporates access to GO annotations for genes is becoming available. The Expression Profiler is a web-based set of tools for the clustering and analysis of gene expression data developed by Jaak Vilo at the European Bioinformatics Institute (EBI). One of the tools available in this set is the EP:GO, a tool that lets users search the GO vocabularies and extract genes associated with various GO terms to assist in the interpretation of expression data.
[edit] GO consortium
[edit] Who funds GO?
Direct support for the Gene Ontology Consortium is provided by an R01 grant from the National Human Genome Research Institute (NHGRI) [grant HG02273]; AstraZeneca; Incyte Genomics; the European Union and the UK Medical Research Council.
Participating databases are funded as follows:
- SGD is supported by a P41, National Resources, grant from the NHGRI [grant HG01315].
- MGD is supported by a P41 from the NHGRI [grant HG00330].
- GXD is supported by the National Institute of Child Health and Human Development [grant HD33745].
- FlyBase is supported by a P41 from the NHGRI [grant HG00739] and by the Medical Research Council, London.
- TAIR is supported by the National Science Foundation [grant DBI-9978564].
- WormBase is supported by a P41, National Resources, grant from the NHGRI [grant HG02223].
- RGD is supported by an R01 grant from the NHLBI [grant HL64541].
- DictyBase is supported by an R01 grant from the NIGMS [grant GM064426].
[edit] How do I become a member of the consortium?
The most important criterion for GO Consortium membership is that the members contribute something to the collection of resources that we make available to the public (almost all members contribute annotations; several contribute to the ontologies; a few contribute software). The scientists involved in working with GO in these member groups communicate via the GO mailing list to discuss development issues in the ontologies. If you represent a database that wishes to join the GO Consortium, please write to the mailing list to enquire about the criteria for joining. The current consortium member groups must all agree to inclusion of a new member group, so writing to the mailing list is a good way to reach all the groups and begin the process.
Anyone with a more general interest in the GO can join the gofriends@geneontology.org mailing list to hear about GO and to attend the users meetings.
[edit] Who runs GO?
The GO project began in 1998 as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD). Since then, the GO Consortium has grown to include many databases, including several of the world's major repositories for plant, animal and microbial genomes. See the Gene Ontology Consortium web page for a full list of member organizations.
[edit] How can I contribute to GO?
You can contribute to GO in several different ways. First, you might want to submit your gene or gene product annotations (see 'How do I submit annotations to GO?') for distribution through GO. Or you can make suggestions for new terms or other changes to the ontologies (see 'How can I suggest new GO terms?'). You could also join one of the GO interest groups; these groups work on developing specific areas of the ontologies. If you don't see an interest group that suits you, email the GO helpdesk to suggest a new one.
[edit] Cite or redistribute GO
[edit] I'd like to integrate the GO files into a commercial application. Do I need a licensing agreement?
The Gene Ontology vocabularies and gene product annotations are available to all public and private sector users, with no licensing requirements. We do ask that you cite the GO Consortium: see 'Cite or redistribute GO' at the GO web site. Also, please do not make any changes to the vocabularies; instead, please suggest changes by e-mail to the GO helpdesk.
[edit] How do I cite GO?
The GO database and vocabularies are in the public domain. The annotations provided by member organizations in the Current Annotations table are also in the public domain. There are no restrictions on their use, although third parties are asked to give appropriate acknowledgement to the GO Consortium and to the appropriate member organization(s). To reference the Gene Ontology Consortium, please cite this paper:
Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29
We also recommend that you include the date you scanned the GO information within your paper. The GO ontology, gene_associations, and documentation files have version numbers and dates, which may be used for this purpose. The GO is evolving and changes will occur with time.
[edit] Can I download the GO?
To download the GO please follow the 'downloads' link on the GO web site.
The GO vocabularies, association tools and documentation are freely available and have been placed in the public domain. The GO is copyrighted to protect the integrity of the vocabularies, which means that changes to the GO vocabularies need to be done by GO developers. However, anyone can download the GO and use the ontologies in their annotation or database system. The GO is available in several formats including parsable flatfiles, as tables for a MySQL database and as XML.
[edit] Contribute to GO
[edit] How can I suggest new GO terms?
The GO vocabularies are updated on a regular basis, and suggestions from the community for additional terms or for other improvements are very welcome. You can make and track your suggestions via the Curator Requests Tracker hosted at SourceForge. The system is very simple to use; there are also instructions on using SourceForge available.
You can also send suggestions to the GO helpdesk.
[edit] How can I contribute to GO?
You can contribute to GO in several different ways. First, you might want to submit your gene or gene product annotations (see 'How do I submit annotations to GO?') for distribution through GO. Or you can make suggestions for new terms or other changes to the ontologies (see 'How can I suggest new GO terms?'). You could also join a GO interest group; these groups work on developing specific areas of the ontologies.
[edit] How do I submit annotations to GO?
Write to the GO helpdesk with your suggestions.
[edit] File Formats
The Consortium makes the ontologies and annotations available for download in a range of formats. Please see the GO downloads section if you wish to download any specific file.
[edit] Why are the ontologies initially produced in OBO flat file format instead of XML?
The ontologies are initially produced in the specially designed OBO flat file format. They are converted to XML once a month for the convenience of users who require this facility. Both formats and many others are available in the GO downloads section.
We use the OBO flat file format because it is much more human-readable, and also because the file is much smaller without the XML tags. This makes it much quicker and easier for the curators to handle the file on a day-to-day basis.
[edit] Why won't the RDF-XML file parse using RDF parsers?
The GO RDF-XML format was originally developed some time ago, before the advent of OWL. It has a few unusual features that render it more of a pseudo-rdf format.
The actual RDF is embedded within a <go:go> xml element - this should be stripped out before handing to RDF parsers.
Note that the GO RDF-XML conforms to a DTD, something that is not normally a requirement of RDF. This is because most people parse the file using conventional XML parsers rather than RDF tools.
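Stripping the wrapper element can be sketched with Python's standard XML library, as below. The sample document is a made-up miniature, and the 'go' namespace URI in it is an assumption; use whatever namespace declarations appear in the file you downloaded.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GO_NS = "http://www.geneontology.org/dtds/go.dtd#"  # assumed namespace URI

# Keep readable prefixes when the extracted element is written out again.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("go", GO_NS)

SAMPLE = """<go:go xmlns:go="http://www.geneontology.org/dtds/go.dtd#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0005488">
      <go:accession>GO:0005488</go:accession>
      <go:name>binding</go:name>
    </go:term>
  </rdf:RDF>
</go:go>"""


def extract_rdf(xml_text):
    """Return the rdf:RDF element, whether or not it is wrapped in another element."""
    root = ET.fromstring(xml_text)
    rdf_tag = "{%s}RDF" % RDF_NS
    if root.tag == rdf_tag:
        return root
    rdf = root.find(rdf_tag)
    if rdf is None:
        raise ValueError("no rdf:RDF element found")
    return rdf


# Serialise only the embedded RDF so it can be handed to an RDF parser.
rdf_text = ET.tostring(extract_rdf(SAMPLE), encoding="unicode")
```

For a full-size file, a streaming approach may be preferable to loading the whole document into memory.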
[edit] GO database
This section covers questions about querying and installing the GO database and go-perl. Full documentation on the database can be found at: http://www.geneontology.org/GO.database.shtml
[edit] How do I query the GO database?
For most basic queries on GO, the appropriate interface is AmiGO (http://amigo.geneontology.org). However, there may be certain kinds of queries that are difficult to express through the existing AmiGO interface (or the AmiGO interface may not be convenient for certain bulk queries). If this is the case, a request can be placed to have this functionality added, and/or the database can be queried directly; there are three options for directly querying the GO database: