Example Queries: Difference between revisions
Line 32: | Line 32: | ||
INNER JOIN term AS ancestory ON (ancestor.id=graph_path.term1_id) | INNER JOIN term AS ancestory ON (ancestor.id=graph_path.term1_id) | ||
WHERE term.name='nucleus'; | WHERE term.name='nucleus'; | ||
For some reason the following works better on mysql v 5.0.26 | |||
SELECT ancestor.*, | |||
graph_path.distance, | |||
graph_path.term1_id AS ancestor_id | |||
FROM | |||
term child, graph_path, term ancestor | |||
WHERE | |||
child.id=graph_path.term2_id | |||
AND ancestor.id=graph_path.term1_id | |||
AND child.name='nucleus'; | |||
===I can't find a GO ID in the database=== | ===I can't find a GO ID in the database=== |
Revision as of 18:42, 17 April 2007
This page will eventually replace the old example queries page on godatabase.org
Many of these queries require mysql5 and the GO-DB view layer;
Basic Queries - on ontology
Find terms by GO ID
Given a GO ID, find out information on this term:
SELECT * FROM term WHERE acc='GO:0005634';
Find terms by (preferred) name
SELECT * FROM term WHERE name='nucleus';
Find terms by synonym/alternate label
SELECT * FROM term INNER JOIN term_synonym ON (term.id=term_synonym.term_id) WHERE term_synonym LIKE 'protein biosynth%';
Find ancestors of the node 'nucleus'
SELECT ancestor.*, graph_path.distance, graph_path.term1_id AS ancestor_id FROM term INNER JOIN graph_path ON (term.id=graph_path.term2_id); INNER JOIN term AS ancestory ON (ancestor.id=graph_path.term1_id) WHERE term.name='nucleus';
For some reason the following works better on mysql v 5.0.26
SELECT ancestor.*, graph_path.distance, graph_path.term1_id AS ancestor_id FROM term child, graph_path, term ancestor WHERE child.id=graph_path.term2_id AND ancestor.id=graph_path.term1_id AND child.name='nucleus';
I can't find a GO ID in the database
There are a number of possibilities
Terms by alternate ID
Example scenario:
A GO term has been merged with another term. The previous ID now becomes a secondary ID. It can be searched like this:
SELECT term.* FROM term INNER JOIN term_synonym ON (term.id=term_synonym.term_id) WHERE acc_synonym='GO:0008151';
This finds GO:0008151 "cellular process", which has GO:0008151 as an alternate ID
Obsolete terms
Obsolete terms are stored in the term table, with is_obsolete set to 1.
To find all obsolete terms:
SELECT * FROM term WHERE is_obsolete=1;
Total number of obsolete terms:
SELECT count(*) FROM term WHERE is_obsolete=1;
Basic Queries - annotations
All genes annotated to 'nucleus' (including transitive indirect annotations)
SELECT term.name AS superterm_name, term.acc AS superterm_acc, term.term_type AS superterm_type, association.*, gene_product.symbol AS gp_symbol, gene_product.dbxref_id AS gp_dbxref_id, gene_product.species_id AS gp_species_id FROM term INNER JOIN graph_path ON (term.id=graph_path.term1_id) INNER JOIN association ON (graph_path.term2_id=association.term_id) INNER JOIN gene_product ON (association.gene_product_id=gene_product.id) WHERE term.name = 'nucleus';
If you have mysql5 you can use a view to make this easier:
SELECT * FROM term_J_association_J_gene_product_via_graph WHERE superterm_name ='nucleus';
Reports / Aggregate Queries
In SQL, queries that perform some operation over a collection of records, such as summation, averages etc are known as aggregate queries. These are useful for generating reports
Total terms in database (non-obsolete)
SELECT count(*) FROM term WHERE is_obsolete=0;
Total terms, grouped by ontology (GO,MF,BP)
SELECT term_type AS ontology, count(*) AS n_terms FROM term WHERE is_obsolete=0 GROUP BY term_type;
note: this will give you a few additional ontologies that are not used directly for gene_product annotation, but are used for bookkeeping by the database
Total gene products in database
SELECT count(*) FROM gene_product;
Total gene products, grouped by DB authority (FlyBase, SGD, UniProt...)
SELECT xref_dbname, count(DISTINCT gene_product.id) FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) GROUP BY xref_dbname;
Total species with at least one annotation
SELECT count(DISTINCT species_id) FROM gene_product;
(at present, the db contains no gene products that have not been annotated, so we do not need to join with the annotation table)
Total species with at least one non-IEA annotation
SELECT count(DISTINCT species_id) FROM gene_product INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) WHERE evidence.code != 'IEA';
NOTE: if you have the godb views installed, this query is simpler:
SELECT count(DISTINCT species_id) FROM association_J_evidence_J_gene_product WHERE code != 'IEA';
Total gene products annotated, grouped by species
This query excludes species that have only a couple of gene products annotated to them, in order to get the results to a manageable size
SELECT species.common_name, species.genus, species.species, count(*) AS n_gps FROM gene_product INNER JOIN species ON (gene_product.species_id=species.id) GROUP BY species.common_name, species.genus, species.species HAVING n_gps > 10;
Here we introduce a HAVING clause - this is because we cannot use a WHERE clause on the results of an aggregate function
Total annotations, grouped by DB authority (FlyBase, SGD, UniProt...), broken down by evidence
SELECT xref_dbname, evidence.code, count(DISTINCT association.id) FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) GROUP BY xref_dbname, evidence.code;
Average distance from root, by DB authority
Warning: this statistic may be misleading; see below
SELECT xref_dbname, avg(distance) AS avg_dist_from_root FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN graph_path ON (graph_path.term2_id=association.term_id) INNER JOIN term AS root ON (graph_path.term1_id=root.id) WHERE is_root=1 AND distance>1 GROUP BY xref_dbname;
This is just meant to illustrate how certain kinds of data mining can be performed. Results may be partially meaningless - depth is measured as the number of links between a term and the root. This is not necessarily a meaningful measure of specificity. Most terms have multiple paths to root; in the above query, global averages are calculated, which means results will be biased in favour of certain kinds of terms (for example, many biosynthetic pathway type terms like "cysteine biosynthesis" are both deep in the graph, and have multiple paths, meaning groups that annotate organisms with a heavy bias to these kind of terms will score higher)
A better approach is to first find the average depth per term, and then average over that on a per annotation basis.
Better yet is to do an analysis based on information content - the above is crude and is for illustrating complex aggregate queries.
All terms, with gene products directly annotated to them
Warning: expensive query Warning: will generate ~20,000 rows
SELECT term.acc, term.name, count(DISTINCT gene_product_id) FROM term INNER JOIN association ON (term.id=association.term_id) GROUP BY term.acc, term.name;
All terms, with gene products directly and transitively annotated to them
Warning: expensive query Warning: will generate ~20,000 rows
As above, but includes transitive annotations; eg if GP is annotated to X, and X is a beneath Y in the graph, GP is transitively annotated to Y.
SELECT term.acc, term.name, count(DISTINCT gene_product_id) FROM term INNER JOIN graph_path ON (term.id=graph_path.term1_id) INNER JOIN association ON (graph_path.term2_id=association.term_id) GROUP BY term.acc, term.name;
Complex Queries
Query by distance to root
- Caveat: Note that GO is structured as a DAG and thus there are typically multiple paths-to-root. Thus a term may have more than one 'depth'
- Caveat: The depth of a node or distance between nodes is not necessarily biologically meaningful. Other more advanced queries are required to get biologially meaningful information-theory based distances between terms
This query finds all terms together with their depths - if a term is found at paths with different depths, it will be listed twice
SELECT DISTINCT term.*, p.distance FROM term INNER JOIN graph_path AS p ON (p.term2_id=term.id) INNER JOIN term AS root ON (p.term1_id=root.id) WHERE root.is_root=1;
You can constrain this by adding constraints to the WHERE clause; eg
AND term.name like 'membrane%';
- Caveat: at the time of writing, the GO database inserts an additional fake 'root' above the 3 actual root nodes of GO. Thus the distance should be taken as being to this fake root and not the actual root for the particular ontology (function, process, component)
An aggregate form of this query allows us to find the maximum distance to root for each term.
SELECT DISTINCT term.acc, term.name, term.term_type, max(p.distance) FROM term INNER JOIN graph_path AS p ON (p.term2_id=term.id) INNER JOIN term AS root ON (p.term1_id=root.id) WHERE root.is_root=1 AND term.is_obsolete=0 GROUP BY term.acc, term.name, term.term_type;
Note that other aggregate functions such as min or avg could be used in place of max in the above.
Query by distance from leaf
Immune response gene products and their associated cell component terms in human
We want genes annotated at-or-below immune reponse, together with their associated directly annotated terms in CC
SELECT term1_acc AS process_acc, term1_name AS process_name, gp_symbol, term2_acc AS cc_acc, term2_name AS cc_name FROM gene_product_with_term_pair_via_graph INNER JOIN species ON (gp_species_id=species.id) WHERE superterm_name='immune response' AND species.common_name='human' AND is_not=0 AND association2_is_not=0 AND term2_type='cellular_component';
Note we exclude negative annotations
Pairs of GO terms associated via IC evidence
The IC evidence code (Inferred by Curator) is typically used in conjunction with a GO ID as evidence (in the 'with' field). This is typically used when a curator makes a cross-ontology inference, eg if a process is known to always take place in the nucleus, then an additional annotation is made to nucleus with the process GO ID in the evidence with column.
SELECT DISTINCT term.acc, term.term_type, term.name, term2.acc AS acc2, term2.term_type AS term_type2, term2.name AS name2 FROM evidence_IC AS ic INNER JOIN evidence_dbxref AS ex ON (ic.id=evidence_id) INNER JOIN dbxref ON (ex.dbxref_id=dbxref.id) INNER JOIN association AS a ON (a.id=ic.association_id) INNER JOIN term ON (a.term_id=term.id) INNER JOIN term AS term2 ON (concat(xref_dbname,':',xref_key) = term2.acc);
(See also the view term_pair_by_IC)
Note the slightly contorted way of getting the second term via a mysql concat; this is because the evidence with field is stored in the dbxref table which splits the bipartite identifier; on the other hand the term table stores this in a single column.