Example Queries

From GO Wiki
Jump to navigation Jump to search

This page will eventually replace the old example queries page on godatabase.org

Many of these queries require mysql5 and the GO-DB view layer;

Basic Queries - on ontology

Find terms by GO ID

Given a GO ID, find out information on this term:

 SELECT * FROM term WHERE acc='GO:0005634';

Find terms by (preferred) name

 SELECT * FROM term WHERE name='nucleus';

Find terms by synonym/alternate label

 SELECT * FROM term INNER JOIN term_synonym ON (term.id=term_synonym.term_id)
 WHERE term_synonym LIKE 'protein biosynth%';

Find ancestors of the node 'nucleus'

 SELECT ancestor.*, 
        graph_path.distance,
        graph_path.term1_id AS ancestor_id
 FROM 
  term
  INNER JOIN graph_path ON (term.id=graph_path.term2_id);
  INNER JOIN term AS ancestory ON (ancestor.id=graph_path.term1_id)
 WHERE term.name='nucleus';

I can't find a GO ID in the database

There are a number of possibilities

Terms by alternate ID

Example scenario:

A GO term has been merged with another term. The previous ID now becomes a secondary ID. It can be searched like this:

 SELECT
   term.*
 FROM
   term
   INNER JOIN term_synonym ON (term.id=term_synonym.term_id)
 WHERE
   acc_synonym='GO:0008151';

This finds GO:0008151 "cellular process", which has GO:0008151 as an alternate ID

Obsolete terms

Obsolete terms are stored in the term table, with is_obsolete set to 1.

To find all obsolete terms:

 SELECT *
 FROM term
 WHERE is_obsolete=1;

Total number of obsolete terms:

 SELECT count(*) FROM term WHERE is_obsolete=1;

Basic Queries - annotations

All genes annotated to 'nucleus' (including transitive indirect annotations)

SELECT
 term.name AS superterm_name,
 term.acc AS superterm_acc,
 term.term_type AS superterm_type,
 association.*,
 gene_product.symbol AS gp_symbol,
 gene_product.dbxref_id AS gp_dbxref_id,
 gene_product.species_id AS gp_species_id
FROM term
 INNER JOIN graph_path ON (term.id=graph_path.term1_id)
 INNER JOIN association ON (graph_path.term2_id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
WHERE
 term.name = 'nucleus';

If you have mysql5 you can use a view to make this easier:

SELECT * FROM term_J_association_J_gene_product_via_graph
WHERE superterm_name ='nucleus';

Reports / Aggregate Queries

In SQL, queries that perform some operation over a collection of records, such as summation, averages etc are known as aggregate queries. These are useful for generating reports

Total terms in database (non-obsolete)

 SELECT count(*)
 FROM   term
 WHERE is_obsolete=0;

Total terms, grouped by ontology (GO,MF,BP)

 SELECT term_type AS ontology,
        count(*) AS n_terms
 FROM   term
 WHERE  is_obsolete=0
 GROUP BY term_type;

note: this will give you a few additional ontologies that are not used directly for gene_product annotation, but are used for bookkeeping by the database

Total gene products in database

SELECT count(*)
FROM   gene_product;

Total gene products, grouped by DB authority (FlyBase, SGD, UniProt...)

SELECT xref_dbname, count(DISTINCT gene_product.id)
FROM   gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
GROUP BY xref_dbname;

Total species with at least one annotation

 SELECT count(DISTINCT species_id)
 FROM   gene_product;

(at present, the db contains no gene products that have not been annotated, so we do not need to join with the annotation table)

Total species with at least one non-IEA annotation

 SELECT count(DISTINCT species_id)
 FROM   gene_product
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
 WHERE
  evidence.code != 'IEA';

NOTE: if you have the godb views installed, this query is simpler:

 SELECT count(DISTINCT species_id)
 FROM  association_J_evidence_J_gene_product
 WHERE code != 'IEA';

Total gene products annotated, grouped by species

This query excludes species that have only a couple of gene products annotated to them, in order to get the results to a manageable size

SELECT species.common_name, species.genus, species.species, count(*) AS n_gps
FROM   gene_product INNER JOIN species ON (gene_product.species_id=species.id)
GROUP BY species.common_name, species.genus, species.species
HAVING n_gps > 10;

Here we introduce a HAVING clause - this is because we cannot use a WHERE clause on the results of an aggregate function

Total annotations, grouped by DB authority (FlyBase, SGD, UniProt...), broken down by evidence

SELECT xref_dbname, evidence.code, count(DISTINCT association.id)
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN evidence ON (association.id=evidence.association_id)
GROUP BY xref_dbname, evidence.code;

Average distance from root, by DB authority

Warning: this statistic may be misleading; see below

SELECT xref_dbname, avg(distance) AS avg_dist_from_root
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN graph_path ON (graph_path.term2_id=association.term_id)
 INNER JOIN term AS root ON (graph_path.term1_id=root.id)
WHERE
 is_root=1 AND distance>1
GROUP BY xref_dbname;

This is just meant to illustrate how certain kinds of data mining can be performed. Results may be partially meaningless - depth is measured as the number of links between a term and the root. This is not necessarily a meaningful measure of specificity. Most terms have multiple paths to root; in the above query, global averages are calculated, which means results will be biased in favour of certain kinds of terms (for example, many biosynthetic pathway type terms like "cysteine biosynthesis" are both deep in the graph, and have multiple paths, meaning groups that annotate organisms with a heavy bias to these kind of terms will score higher)

A better approach is to first find the average depth per term, and then average over that on a per annotation basis.

Better yet is to do an analysis based on information content - the above is crude and is for illustrating complex aggregate queries.


All terms, with gene products directly annotated to them

Warning: expensive query Warning: will generate ~20,000 rows

SELECT term.acc, term.name, count(DISTINCT gene_product_id)
FROM   term INNER JOIN association ON (term.id=association.term_id)
GROUP BY term.acc, term.name;

All terms, with gene products directly and transitively annotated to them

Warning: expensive query Warning: will generate ~20,000 rows

As above, but includes transitive annotations; eg if GP is annotated to X, and X is a beneath Y in the graph, GP is transitively annotated to Y.

SELECT term.acc, term.name, count(DISTINCT gene_product_id)
FROM   term INNER JOIN graph_path ON (term.id=graph_path.term1_id)
            INNER JOIN association ON (graph_path.term2_id=association.term_id)
GROUP BY term.acc, term.name;

Complex Queries

Query by distance to root

  • Caveat: Note that GO is structured as a DAG and thus there are typically multiple paths-to-root. Thus a term may have more than one 'depth'
  • Caveat: The depth of a node or distance between nodes is not necessarily biologically meaningful. Other more advanced queries are required to get biologially meaningful information-theory based distances between terms

This query finds all terms together with their depths - if a term is found at paths with different depths, it will be listed twice

 SELECT DISTINCT
   term.*,
   p.distance
 FROM
   term
   INNER JOIN graph_path AS p ON (p.term2_id=term.id)
   INNER JOIN term AS root ON (p.term1_id=root.id)
 WHERE
   root.is_root=1;

You can constrain this by adding constraints to the WHERE clause; eg

  AND term.name like 'membrane%';
  • Caveat: at the time of writing, the GO database inserts an additional fake 'root' above the 3 actual root nodes of GO. Thus the distance should be taken as being to this fake root and not the actual root for the particular ontology (function, process, component)

An aggregate form of this query allows us to find the maximum distance to root for each term.

 SELECT DISTINCT
   term.acc,
   term.name,
   term.term_type,
   max(p.distance)
 FROM
   term
   INNER JOIN graph_path AS p ON (p.term2_id=term.id)
   INNER JOIN term AS root ON (p.term1_id=root.id)
 WHERE
   root.is_root=1
   AND term.is_obsolete=0
 GROUP BY
   term.acc,
   term.name,
   term.term_type;

Note that other aggregate functions such as min or avg could be used in place of max in the above.


Query by distance from leaf

Immune response gene products and their associated cell component terms in human

We want genes annotated at-or-below immune reponse, together with their associated directly annotated terms in CC

 SELECT
  term1_acc  AS process_acc, 
  term1_name AS process_name,
  gp_symbol, 
  term2_acc  AS cc_acc, 
  term2_name AS cc_name 
 FROM 
  gene_product_with_term_pair_via_graph 
  INNER JOIN species ON (gp_species_id=species.id)
 WHERE 
  superterm_name='immune response' AND 
  species.common_name='human'      AND 
  is_not=0                         AND 
  association2_is_not=0            AND 
  term2_type='cellular_component';

Note we exclude negative annotations

Pairs of GO terms associated via IC evidence

The IC evidence code (Inferred by Curator) is typically used in conjunction with a GO ID as evidence (in the 'with' field). This is typically used when a curator makes a cross-ontology inference, eg if a process is known to always take place in the nucleus, then an additional annotation is made to nucleus with the process GO ID in the evidence with column.

 SELECT DISTINCT
       term.acc,
       term.term_type,
       term.name,
       term2.acc AS acc2,
       term2.term_type AS term_type2,
       term2.name AS name2
 FROM evidence_IC AS ic 
       INNER JOIN evidence_dbxref AS ex ON (ic.id=evidence_id)
       INNER JOIN dbxref ON (ex.dbxref_id=dbxref.id)
       INNER JOIN association AS a ON (a.id=ic.association_id)
       INNER JOIN term ON (a.term_id=term.id) 
       INNER JOIN term AS term2 ON (concat(xref_dbname,':',xref_key) = term2.acc);

(See also the view term_pair_by_IC)

Note the slightly contorted way of getting the second term via a mysql concat; this is because the evidence with field is stored in the dbxref table which splits the bipartite identifier; on the other hand the term table stores this in a single column.

See Also