Latest revision as of 15:43, 28 October 2013

Overview

The Term Enrichment tool can be used to discover what a set of genes may have in common by examining annotations and finding significant shared GO terms. The algorithm employed by the tool attempts to determine whether an observed level of annotation for a group of genes is significant within the context of annotation for all genes within the genome; examples of studies that have used this algorithm are PMID:15492223 and PMID:14561723. AmiGO's Term Enrichment tool, which is based on the GO-TermFinder Perl module by Gavin Sherlock and Shuai Weng at Stanford University, allows users to specify a list of genes, define a background set against which the significance will be calculated, and set the p-value (significance indicator) cut-off.

Term enrichment is a very useful method for analyzing data from large scale experiments, such as gene clusters from microarray expression data. For a more detailed discussion of the algorithm, please see the published material on GO::TermFinder. The p-values returned by this software undergo Bonferroni correction.
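As a rough illustration of what a Bonferroni correction does, the sketch below multiplies each raw p-value by the number of tests and caps the result at 1. The function name and the example values are made up for illustration, and the assumption that the number of tests equals the number of terms passed in is ours; this page does not state how GO::TermFinder counts hypotheses.

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capping at 1.0.

    p_values maps a GO term accession to its uncorrected p-value.
    Assumption: one test per term in the mapping.
    """
    n_tests = len(p_values)
    return {term: min(1.0, p * n_tests) for term, p in p_values.items()}


# Illustrative values only, not real enrichment results.
raw = {"GO:0006412": 0.0004, "GO:0006950": 0.03, "GO:0008150": 0.5}
corrected = bonferroni(raw)
for term, p in corrected.items():
    print(term, p)
```

A corrected value is then compared against the p-value cut-off in the same way as a raw value would be.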

Caution: Please note that by default, this tool uses annotation datasets that include electronically inferred (IEA) data. The results for organisms where a proportion of the annotation coverage is IEA-based will therefore not correspond solely to the annotations made by curators. For more information about what data AmiGO uses, please see the overview page.

Usage

Gene Product List

The user may enter a whitespace-separated list of gene product identifiers. These may be a mix of gene product symbols, synonyms, or accessions. If the list is too large for manual input, the user may instead upload either a plain text file containing identifiers (as listed above) or a gene association file. If AmiGO finds any gene product identifiers that are ambiguous or not found, the user will be informed before the end of the process.
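If you want to tidy a list before submitting it, a minimal sketch such as the following may help. It splits on any whitespace and drops repeated identifiers while keeping the original order. The function name and the sample identifiers are illustrative only; this is not a description of how AmiGO itself parses the input.

```python
def parse_identifiers(text):
    """Split text on any whitespace and keep the first occurrence of each identifier."""
    seen = set()
    ordered = []
    for token in text.split():
        if token not in seen:
            seen.add(token)
            ordered.append(token)
    return ordered


# Spaces, tabs, and newlines are all treated as separators.
sample = "CDC28  ACT1\nYBR160W\tACT1"
identifiers = parse_identifiers(sample)
print(" ".join(identifiers))
```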

Background Set

Where possible, we recommend that users supply their own background set.

Your background set should be the whole gene/gene product set from the experiment being analyzed, i.e., the list from which your gene/gene product list of interest has been derived. This might be all genes on your microarray, for example, or possibly all genes in the genome of your organism. If you do not provide a background set, the database filter selected will be used as the background set, that is, all genes for that database that have GO annotation.

Input

The background set may be input in a very similar way to the gene product list above. The only difference is the addition of an optional database filter: the user must either enter/upload a background set, select a database filter, or do both.

Filtering

If the user enters a background set and selects a database, the inputted background set will be filtered so that only gene products that are found in that database will be used in the calculations. This can help to remove a lot of possible ambiguity in the inputted set. The abbreviations used in the filter are as follows:

Abbreviation	Database name	Species
AspDB	Aspergillus Genome Database	Aspergillus nidulans
CGD	Candida Genome Database	Candida albicans
EcoCyc	EcoCyc and EcoliWiki	Escherichia coli K-12
Ensembl	Ensembl project Genome Databases	Multi-species
FB	FlyBase	Drosophila melanogaster
GR_protein	Gramene	Multi-species, grains
GeneDB_Lmajor	Sanger GeneDB	Leishmania major
GeneDB_Pfalciparum	Sanger GeneDB	Plasmodium falciparum
GeneDB_Spombe	Sanger GeneDB	Schizosaccharomyces pombe
GeneDB_Tbrucei	Sanger GeneDB	Trypanosoma brucei
JCVI_CMR	The J. Craig Venter Institute	Multi-species, bacterial
MGI	Mouse Genome Informatics	Mus musculus
NCBI	National Center for Biotechnology Information	Multi-species
NCBI_GP	NCBI GenPept, proteins	Multi-species
NCBI_NP	NCBI RefSeq, proteins	Multi-species
PAMGO_VMD	Plant-Associated Microbe Gene Ontology (PAMGO) consortium	Multi-species, plant-associated microbes
PDB	Protein Data Bank	Multi-species
PseudoCAP	Pseudomonas aeruginosa Community Annotation Project	Pseudomonas aeruginosa
RGD	Rat Genome Database	Rattus norvegicus
RefSeq	NCBI Reference Sequence	Multi-species
SGD	Saccharomyces Genome Database	Saccharomyces cerevisiae
SGN	SOL Genomics Network	Multi-species, plant
TAIR	The Arabidopsis Information Resource	Arabidopsis thaliana
TIGR_CMR	The Institute for Genomic Research	Multi-species, bacterial
UniProt	Deprecated
UniProtKB	UniProt Protein Knowledge Base	Multi-species
UniProtKB/Swiss-Prot	UniProt reviewed, manually annotated	Multi-species
UniProtKB/TrEMBL	UniProt unreviewed, automatically annotated	Multi-species
WB	WormBase	Caenorhabditis elegans
ZFIN	Zebrafish Information Network	Danio rerio
dictyBase	dictyBase	Dictyostelium discoideum

(For more detail see the GO xref page)

Otherwise, if the user did not enter a background set, the selected database will be used as the background set. If you do not supply a background set, we recommend that you do not use a multi-species database as a filter: all genes from all species would then be used as your background set, which may give unreliable results. We currently do not have a filter directly corresponding to the human genome/proteome, so for human gene lists we recommend that you always supply your own background set.
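The three cases described in this section (background set only, database filter only, or both) can be summarized in a short sketch. The function and argument names are hypothetical, and gene products are represented simply as sets of identifier strings; the real tool also resolves symbols and synonyms, which is not modelled here.

```python
def choose_background(user_background, database_members):
    """Return the set of gene products to use as the background.

    user_background: set of identifiers supplied by the user, or None.
    database_members: set of annotated identifiers in the selected
    database filter, or None if no filter was selected.
    """
    if user_background and database_members:
        # Both given: keep only the user's entries found in the database.
        return user_background & database_members
    if user_background:
        return set(user_background)
    if database_members:
        # No user set: the whole database acts as the background.
        return set(database_members)
    raise ValueError("supply a background set, a database filter, or both")


background = choose_background({"g1", "g2", "g9"}, {"g1", "g2", "g3"})
print(sorted(background))
```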

IEAs

By default, this tool uses annotation datasets that do not include IEA (electronically inferred) data. The results for organisms where a proportion of the annotation coverage is IEA-based will correspond only to the annotations made by curators. If you wish to include IEA data, please check the "use IEAs" box on the form.

Thresholds

The AmiGO interface gives the user the ability to change the maximum p-value and the minimum number of gene products that are used when running the algorithm. To determine whether any GO terms annotate a specified list of genes at a frequency greater than would be expected by chance, GO::TermFinder calculates a p-value using the hypergeometric distribution (PMID:15297299). Please see the published material for a more detailed discussion of what these values mean and how to use them meaningfully.
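For readers who want to see the shape of the calculation, here is a minimal sketch of an upper-tail hypergeometric p-value using only the Python standard library. It is not GO::TermFinder's code; the function and parameter names are ours. The four inputs correspond to the four counts reported in the tab-delimited output: annotated gene products in the sample, sample size, annotated gene products in the background, and background size.

```python
from math import comb


def hypergeom_p_value(k, n, K, N):
    """Probability of seeing at least k annotated gene products in a sample.

    k: gene products in the sample annotated to the term
    n: gene products in the sample
    K: gene products in the background annotated to the term
    N: gene products in the background
    """
    upper = min(n, K)
    # Count the ways to draw i annotated and n - i unannotated gene products.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy numbers: 4 of 10 background genes are annotated; 2 of 3 sampled genes are.
p = hypergeom_p_value(k=2, n=3, K=4, N=10)
print(p)
```

The value computed here is uncorrected; any multiple-testing correction would be applied afterwards.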

Advanced Options

Clicking on Display advanced result options gives advanced users access to additional settings.

Result Types

In addition to the standard results that are returned from this page, the user may also select all results, which will return all results without any kind of threshold filtering (and ignoring any threshold inputs specified above).
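The difference between the standard results and all results can be pictured as a filter that is either applied or skipped. The sketch below is only an illustration: the function name, the dictionary keys, the default values, and the choice of inclusive comparisons are all assumptions, not AmiGO's actual settings.

```python
def select_results(results, max_p=0.01, min_gene_products=2, all_results=False):
    """Apply the threshold filters, or skip them when all_results is set.

    Each result is a dict with hypothetical keys "p_value" and
    "sample_annotated" (annotated gene products in the sample set).
    """
    if all_results:
        # Thresholds are ignored entirely.
        return list(results)
    return [
        r for r in results
        if r["p_value"] <= max_p and r["sample_annotated"] >= min_gene_products
    ]


# Illustrative rows only.
rows = [
    {"acc": "GO:0006412", "p_value": 0.0001, "sample_annotated": 12},
    {"acc": "GO:0006950", "p_value": 0.2, "sample_annotated": 5},
    {"acc": "GO:0007049", "p_value": 0.001, "sample_annotated": 1},
]
standard = select_results(rows)
everything = select_results(rows, all_results=True)
print(len(standard), len(everything))
```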

Results Formats

In addition to the standard HTML page results, the user may instead select a tab-delimited file or an XML file. Please be warned that the XML file is in an unstable internal format and should only really be used by people who prefer parsing XML over other formats.

Tab-delimited columns

The columns of the tab-delimited format are as follows:

acc of the GO term
aspect of the GO term
GO term name
p-value (see thresholds above)
number of gene products in the sample set that are annotated
number of gene products in the sample set
number of gene products in the background set that are annotated
number of gene products in the background set
list of gene products in the sample set annotated to the GO term
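A tab-delimited file with these nine columns could be read along the following lines. The column names are our own labels for the fields listed above. Two further assumptions are made: the file has no header line, and the final column separates gene products with whitespace. Check both against a real download before relying on this sketch.

```python
import csv
import io

# Hypothetical names for the nine columns, in the order listed above.
COLUMNS = [
    "acc", "aspect", "name", "p_value",
    "sample_annotated", "sample_total",
    "background_annotated", "background_total",
    "sample_gene_products",
]
INT_COLUMNS = COLUMNS[4:8]


def read_enrichment_rows(handle):
    """Parse tab-delimited enrichment results into a list of dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        row["p_value"] = float(row["p_value"])
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        # Assumption: gene products are whitespace-separated.
        row["sample_gene_products"] = row["sample_gene_products"].split()
        rows.append(row)
    return rows


# Made-up example line standing in for a downloaded file.
example = "GO:0006412\tP\ttranslation\t1.2e-05\t2\t40\t300\t6000\tRPL1 RPL2\n"
parsed = read_enrichment_rows(io.StringIO(example))
print(parsed[0]["acc"], parsed[0]["p_value"])
```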

Visualization

When you have received results from the Term Enrichment tool, you will be given the option to visualize them within the top results section block. Since the visualization is produced by the Visualize component (see AmiGO Manual: Visualize), the relation coloring conventions can be found there. For the node coloration, the following slide-down table currently applies:

p-value >	color
[base color]	#ffffff
1x10^-1	#eeeeee
1x10^-2	#dddddd
1x10^-3	#cccccc
1x10^-4	#bbbbbb
1x10^-5	#aaaaaa
1x10^-6	#999999
1x10^-7	#888888
1x10^-8	#777777
1x10^-9	#666666
1x10^-10	#555555
1x10^-11	#444444
1x10^-12	#333333
1x10^-13	#222222
1x10^-14	#111111
[otherwise]	#000000
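One way to read the table is to work down the rows and take the color of the first threshold that the p-value exceeds. The sketch below follows that reading, which is our interpretation rather than a documented rule. It also assumes that the base color applies to nodes without a p-value. The function name is hypothetical.

```python
def node_color(p_value):
    """Return a gray-scale hex color for a node, following the table above.

    Assumptions: rows are checked from top to bottom, and the base color
    is used for nodes that have no p-value.
    """
    if p_value is None:
        return "#ffffff"
    for exponent in range(1, 15):
        if p_value > 10 ** -exponent:
            # Exponent 1 gives "e", 2 gives "d", down to 14 giving "1".
            digit = format(15 - exponent, "x")
            return "#" + digit * 6
    return "#000000"


for p in (None, 0.5, 0.05, 5e-14, 1e-20):
    print(p, node_color(p))
```

Under this reading, smaller p-values give darker nodes.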

Limitations

Unfortunately, at this time, the term enrichment tool and slimmer both suffer from timeout and load issues. They are limited in the amount of work that they can accomplish before a timeout event occurs on either the client or the server. Due to these limitations, the current tool is not designed to work on sets beyond a certain size. This size is hard to pinpoint: depending on input type, size, and database warmup, the results available may be very different. If you get a timeout error, or see a phrase like "Query execution was interrupted", you have probably reached the time resource limit.

In the fairly near future we will be moving to a Galaxy-based workflow system that will not have these same limitations in size and time. Until then, you may wish to take a look at other available third-party tools:

http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools

Or look at a tool like Ontologizer:

http://compbio.charite.de/contao/index.php/ontologizer2.html

AmiGO Manual: Term Enrichment: Difference between revisions