Example Queries: Difference between revisions

From GO Wiki
Jump to navigation Jump to search
Line 60: Line 60:
   INNER JOIN graph_path ON (term.id=graph_path.term1_id)
   INNER JOIN graph_path ON (term.id=graph_path.term1_id)
   INNER JOIN term AS descendant ON (descendant.id=graph_path.term2_id)
   INNER JOIN term AS descendant ON (descendant.id=graph_path.term2_id)
  WHERE term.name='nucleus' AND distance <> 0
  WHERE term.name='nucleus' AND distance <> 0 ;


===I can't find a GO ID in the database===
===I can't find a GO ID in the database===

Revision as of 00:12, 16 December 2007

This page has examples of various kinds of queries possible on the GO Database. This assumes some basic knowledge of SQL, and the availability of the GO database - either local or via remote queries. A web-based interface to the GO database mirrors at Berkeley BOP and EBI is also available. For more details, see:

This page will eventually replace the old example queries page on godatabase.org

If you have MySQL5 you can use views to implement many of these queries. Example queries will be hyperlinked to the View documentation


Basic Queries - on ontology

Find terms by GO ID

Given a GO ID, find out information on this term:

 SELECT * FROM term WHERE acc='GO:0005634';

Find terms by (preferred) name

 SELECT * FROM term WHERE name='nucleus';

Find terms by synonym/alternate label

 SELECT * FROM term INNER JOIN term_synonym ON (term.id=term_synonym.term_id)
 WHERE term_synonym LIKE 'protein biosynth%';

Find ancestors of the node 'nucleus'

 SELECT DISTINCT
        ancestor.*, 
        graph_path.distance,
        graph_path.term1_id AS ancestor_id
 FROM 
  term
  INNER JOIN graph_path ON (term.id=graph_path.term2_id)
  INNER JOIN term AS ancestor ON (ancestor.id=graph_path.term1_id)
 WHERE term.name='nucleus';

Note that if there are alternate paths of differing length, this query will report a separate row for each term-ancestor-distance combination. To only report ancestors once, remove graph_path.distance from the SELECT clause

For some reason the following works better on mysql v 5.0.26

SELECT ancestor.*, 
       graph_path.distance,
       graph_path.term1_id AS ancestor_id
FROM 
 term child, graph_path, term ancestor
WHERE
child.id=graph_path.term2_id
AND ancestor.id=graph_path.term1_id
AND child.name='nucleus';

Find descendants of the node 'nucleus'

SELECT DISTINCT descendant.acc, descendant.name, descendant.term_type
FROM
 term
 INNER JOIN graph_path ON (term.id=graph_path.term1_id)
 INNER JOIN term AS descendant ON (descendant.id=graph_path.term2_id)
WHERE term.name='nucleus' AND distance <> 0 ;

I can't find a GO ID in the database

There are a number of possibilities

Terms by alternate ID

Example scenario:

A GO term has been merged with another term. The previous ID now becomes a secondary ID. It can be searched like this:

 SELECT
   term.*
 FROM
   term
   INNER JOIN term_synonym ON (term.id=term_synonym.term_id)
 WHERE
   acc_synonym='GO:0008151';

This finds GO:0008151 "cellular process", which has GO:0008151 as an alternate ID

Obsolete terms

Obsolete terms are stored in the term table, with is_obsolete set to 1.

To find all obsolete terms:

 SELECT *
 FROM term
 WHERE is_obsolete=1;

Total number of obsolete terms:

 SELECT count(*) FROM term WHERE is_obsolete=1;

Basic Queries - annotations

All genes in Drosophila annotated to 'nucleus' (including transitive indirect annotations)

SELECT
 term.name AS superterm_name,
 term.acc AS superterm_acc,
 term.term_type AS superterm_type,
 association.*,
 gene_product.symbol AS gp_symbol,
 gene_product.symbol AS gp_full_name,
 dbxref.xref_dbname AS gp_dbname,
 dbxref.xref_key AS gp_acc,
 species.*
FROM term
 INNER JOIN graph_path ON (term.id=graph_path.term1_id)
 INNER JOIN association ON (graph_path.term2_id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
 INNER JOIN species ON (gene_product.species_id=species.id)
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
WHERE
 term.name = 'nucleus'
 AND
 species.genus = 'Drosophila';

If you have mysql5 you can use a view to make this easier:

SELECT * FROM term_J_association_J_gene_product_via_graph
WHERE superterm_name ='nucleus';

Peptide sequences of all genes/gene products annotated to 'pronucleus' (including transitive indirect annotations)

SELECT
 gene_product.symbol AS gp_symbol,
 dbxref.xref_dbname,
 dbxref.xref_key,
 seq.seq
FROM term
 INNER JOIN graph_path ON (term.id=graph_path.term1_id)
 INNER JOIN association ON (graph_path.term2_id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN gene_product_seq ON (gene_product.id=gene_product_seq.gene_product_id)
 INNER JOIN seq ON (gene_product_seq.seq_id=seq.id)
WHERE
 term.name = 'pronucleus'
 AND association.is_not=0;

Annotations by gene product

Querying by gene or protein symbol

Finding gene products in the database with symbol 'BRCA1':

 SELECT
   *
 FROM
   gene_product
   INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
   INNER JOIN species ON (gene_product.species_id=species.id)
 WHERE
   symbol = 'BRCA1';

The search is case-insensitive (a MySQL default), so it will find mouse genes with symbol "Brca1"

Note that this search is probably not sensitive enough - different contributing databases to GO use different naming conventions. We can broaden the search with a text match:

 SELECT
   *
 FROM
   gene_product
   INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
   INNER JOIN species ON (gene_product.species_id=species.id)
 WHERE
   symbol like 'BRCA1%';

We can see that this search includes a record for a gene with symbol BRCA1_HUMAN (this is the convention of the contributing UniProt database)

We can also retrieve the GO terms attached:

 SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   gene_product.symbol like 'BRCA1%';

Querying by gene product ID

Each gene product is uniquely identified by a bipartite database-accession pair referenced in the gene_product.dbxref_id field; see

to retrieve by this field:

SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   dbxref.xref_key = 'FBgn0000490'
 AND
   dbxref.xref_dbname = 'FB';

this retrieves all annotations for Dmel decapentaplegic gene

Querying by multiple gene product IDs

SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   dbxref.xref_key IN 
     ('FBgn0000490','FBgn0015567','FBgn0003470')
 AND
   dbxref.xref_dbname = 'FB';

Querying by gene product (UniProt)

UniProt contribute annotations to both humans and other organisms; prior to building the database, annotations are filtered to avoid redundancy with Model Organism Database (MOD) contributing annotations.

In general you should try and use the unique UniProt ID:

SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   dbxref.xref_key = 'P38398';

Human BRCA1 annotations

Searching by sequence dbxref

A gene product record may be associated with multiple other identifiers through it's sequence record. For example, the mouse gene Tpm2 (identified in the MGI database as MGI:98810) is associated with the protein sequence record P58774 in UniProt.

SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  alt_dbxref.xref_dbname,
  alt_dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN gene_product_seq ON (gene_product.id=gene_product_seq.gene_product_id)
  INNER JOIN seq_dbxref ON (gene_product_seq.seq_id=seq_dbxref.seq_id)
  INNER JOIN dbxref AS alt_dbxref ON (seq_dbxref.dbxref_id=alt_dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   alt_dbxref.xref_key = 'P58774';

This can be the source of some confusion: there was at one point a UniProt-contributed annotation to P58774 (UniProt symbol TPM2_MOUSE). This was superseded by the MGI annotation - but the annotations are still linked to the UniProt record via the seq table.

Dictyostelium annotations with evidence from a kinase domain

SELECT
 gene_product.symbol,
 gene_product.full_name,
 dbxref.xref_dbname,
 dbxref.xref_key,
 species.genus,
 species.species,
 association.is_not,
 evidence.code,
 ex.xref_dbname AS with_dbname,
 ex.xref_key    AS with_acc,
 term.acc,
 term.term_type,
 term.name
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN species ON (gene_product.species_id=species.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN evidence ON (association.id=evidence.association_id)
 INNER JOIN evidence_dbxref ON (evidence.id=evidence_dbxref.evidence_id)
 INNER JOIN dbxref AS ex ON (evidence_dbxref.dbxref_id=ex.id)
 INNER JOIN term ON (association.term_id=term.id)
WHERE
  ex.xref_key='IPR000719'
 AND
  genus='Dictyostelium';

Reports / Aggregate Queries

In SQL, queries that perform some operation over a collection of records, such as summation, averages etc are known as aggregate queries. These are useful for generating reports

Total terms in database (non-obsolete)

 SELECT count(*)
 FROM   term
 WHERE is_obsolete=0;

Total terms, grouped by ontology (GO,MF,BP)

 SELECT term_type AS ontology,
        count(*) AS n_terms
 FROM   term
 WHERE  is_obsolete=0
 GROUP BY term_type;

note: this will give you a few additional ontologies that are not used directly for gene_product annotation, but are used for bookkeeping by the database

Total gene products in database

SELECT count(*)
FROM   gene_product;

Total gene products, grouped by DB authority (FlyBase, SGD, UniProt...)

SELECT xref_dbname, count(DISTINCT gene_product.id)
FROM   gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
GROUP BY xref_dbname;

Total species with at least one annotation

 SELECT count(DISTINCT species_id)
 FROM   gene_product;

(at present, the db contains no gene products that have not been annotated, so we do not need to join with the annotation table)

Total species with at least one non-IEA annotation

 SELECT count(DISTINCT species_id)
 FROM   gene_product
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
 WHERE
  evidence.code != 'IEA';

NOTE: if you have the godb views installed, this query is simpler:

 SELECT count(DISTINCT species_id)
 FROM  association_J_evidence_J_gene_product
 WHERE code != 'IEA';

Total gene products annotated, grouped by species

This query excludes species that have only a couple of gene products annotated to them, in order to get the results to a manageable size

SELECT species.common_name, species.genus, species.species, count(*) AS n_gps
FROM   gene_product INNER JOIN species ON (gene_product.species_id=species.id)
GROUP BY species.common_name, species.genus, species.species
HAVING n_gps > 10;

Here we introduce a HAVING clause - this is because we cannot use a WHERE clause on the results of an aggregate function

Total annotations, grouped by species, broken down by evidence

 SELECT species.ncbi_taxa_id,  species.common_name,  evidence.code,  count(distinct association.id) AS evidence__association_count
 FROM species
  INNER JOIN gene_product ON (gene_product.species_id = species.id)
  INNER JOIN association ON (association.gene_product_id = gene_product.id)
  INNER JOIN evidence ON (evidence.association_id = association.id)
 GROUP BY  species.ncbi_taxa_id,  species.common_name,  evidence.code
 ORDER BY  species.ncbi_taxa_id,  species.common_name,  evidence.code 

Total annotations, grouped by DB authority (FlyBase, SGD, UniProt...), broken down by evidence

SELECT xref_dbname, evidence.code, count(DISTINCT association.id)
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN evidence ON (association.id=evidence.association_id)
GROUP BY xref_dbname, evidence.code;


Total annotations, grouped by DB authority (FlyBase, SGD, UniProt...), broken down by evidence and ontology

SELECT xref_dbname, term.term_type, evidence.code, count(DISTINCT association.id)
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN evidence ON (association.id=evidence.association_id)
 INNER JOIN term ON (association.term_id=term.id)
GROUP BY xref_dbname, term.term_type,evidence.code;

Average distance from root, by DB authority

Warning: this statistic may be misleading; see below

SELECT xref_dbname, avg(distance) AS avg_dist_from_root
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN graph_path ON (graph_path.term2_id=association.term_id)
 INNER JOIN term AS root ON (graph_path.term1_id=root.id)
WHERE
 is_root=1 AND distance>1
GROUP BY xref_dbname;

This is just meant to illustrate how certain kinds of data mining can be performed. Results may be partially meaningless - depth is measured as the number of links between a term and the root. This is not necessarily a meaningful measure of specificity. Most terms have multiple paths to root; in the above query, global averages are calculated, which means results will be biased in favour of certain kinds of terms (for example, many biosynthetic pathway type terms like "cysteine biosynthesis" are both deep in the graph, and have multiple paths, meaning groups that annotate organisms with a heavy bias to these kind of terms will score higher)

A better approach is to first find the average depth per term, and then average over that on a per annotation basis.

Better yet is to do an analysis based on information content - the above is crude and is for illustrating complex aggregate queries.


All terms, with gene products directly annotated to them

Warning: expensive query Warning: will generate ~20,000 rows

SELECT term.acc, term.name, count(DISTINCT gene_product_id)
FROM   term INNER JOIN association ON (term.id=association.term_id)
GROUP BY term.acc, term.name;

All terms, with gene products directly and transitively annotated to them

Warning: expensive query Warning: will generate ~20,000 rows

As above, but includes transitive annotations; eg if GP is annotated to X, and X is a beneath Y in the graph, GP is transitively annotated to Y.

SELECT term.acc, term.name, count(DISTINCT gene_product_id)
FROM   term INNER JOIN graph_path ON (term.id=graph_path.term1_id)
            INNER JOIN association ON (graph_path.term2_id=association.term_id)
GROUP BY term.acc, term.name;

Note that the above information is already pre-calculated in the database, for non-IEA annotations. The info is partitioned by the contributing database (speciesdb), which means it is additive. See:

http://www.geneontology.org/GO.database.schema.shtml#go-optimisations.table.gene-product-count

 SELECT term.acc, term.name, code, SUM(gene_product_count.product_count) AS pc
 FROM   term INNER JOIN gene_product_count ON (term.id = gene_product_count.term_id)
 GROUP BY term.acc, term.name, code;

Complex Queries

Query by distance to root

  • Caveat: Note that GO is structured as a DAG and thus there are typically multiple paths-to-root. Thus a term may have more than one 'depth'
  • Caveat: The depth of a node or distance between nodes is not necessarily biologically meaningful. Other more advanced queries are required to get biologially meaningful information-theory based distances between terms

This query finds all terms together with their depths - if a term is found at paths with different depths, it will be listed twice

 SELECT DISTINCT
   term.*,
   p.distance
 FROM
   term
   INNER JOIN graph_path AS p ON (p.term2_id=term.id)
   INNER JOIN term AS root ON (p.term1_id=root.id)
 WHERE
   root.is_root=1;

You can constrain this by adding constraints to the WHERE clause; eg

  AND term.name like 'membrane%';
  • Caveat: at the time of writing, the GO database inserts an additional fake 'root' above the 3 actual root nodes of GO. Thus the distance should be taken as being to this fake root and not the actual root for the particular ontology (function, process, component)

An aggregate form of this query allows us to find the maximum distance to root for each term.

 SELECT DISTINCT
   term.acc,
   term.name,
   term.term_type,
   max(p.distance)
 FROM
   term
   INNER JOIN graph_path AS p ON (p.term2_id=term.id)
   INNER JOIN term AS root ON (p.term1_id=root.id)
 WHERE
   root.is_root=1
   AND term.is_obsolete=0
 GROUP BY
   term.acc,
   term.name,
   term.term_type;

Note that other aggregate functions such as min or avg could be used in place of max in the above.


Query by distance from leaf

Immune response gene products and their associated cell component terms in human

We want genes annotated at-or-below immune reponse, together with their associated directly annotated terms in CC

 SELECT
  term1_acc  AS process_acc, 
  term1_name AS process_name,
  gp_symbol, 
  term2_acc  AS cc_acc, 
  term2_name AS cc_name 
 FROM 
  gene_product_with_term_pair_via_graph 
  INNER JOIN species ON (gp_species_id=species.id)
 WHERE 
  superterm_name='immune response' AND 
  species.common_name='human'      AND 
  is_not=0                         AND 
  association2_is_not=0            AND 
  term2_type='cellular_component';

Note we exclude negative annotations

Pairs of GO terms associated via IC evidence

The IC evidence code (Inferred by Curator) is typically used in conjunction with a GO ID as evidence (in the 'with' field). This is typically used when a curator makes a cross-ontology inference, eg if a process is known to always take place in the nucleus, then an additional annotation is made to nucleus with the process GO ID in the evidence with column.

 SELECT DISTINCT
       term.acc,
       term.term_type,
       term.name,
       term2.acc AS acc2,
       term2.term_type AS term_type2,
       term2.name AS name2
 FROM evidence_IC AS ic 
       INNER JOIN evidence_dbxref AS ex ON (ic.id=evidence_id)
       INNER JOIN dbxref ON (ex.dbxref_id=dbxref.id)
       INNER JOIN association AS a ON (a.id=ic.association_id)
       INNER JOIN term ON (a.term_id=term.id) 
       INNER JOIN term AS term2 ON (concat(xref_dbname,':',xref_key) = term2.acc);

(See also the view term_pair_by_IC)

Note the slightly contorted way of getting the second term via a mysql concat; this is because the evidence with field is stored in the dbxref table which splits the bipartite identifier; on the other hand the term table stores this in a single column.

Gene products that support other associations

The following query finds associations that are based on an ISS from FlyBase

 SELECT
   px_src.xref_dbname  AS src_db,
   px_src.xref_key     AS src_acc,
   p_src.symbol        AS src_symbol,
   px_der.xref_dbname  AS der_db,
   px_der.xref_key     AS der_acc,
   p_der.symbol        AS der_symbol,
   a_der.is_not        AS der_is_not,
   aqt_der.name        AS der_qual,
   t_der.acc           AS der_term_acc,
   t_der.name          AS der_term_name,
   e_der.code          AS der_code
 FROM
   gene_product AS p_src
   INNER JOIN dbxref AS px_src ON (p_src.dbxref_id=px_src.id)
   INNER JOIN evidence_dbxref AS ex_der ON (ex_der.dbxref_id=p_src.dbxref_id)
   INNER JOIN evidence AS e_der ON (e_der.id=ex_der.evidence_id)
   INNER JOIN association AS a_der ON (a_der.id=e_der.association_id)
   INNER JOIN term AS t_der ON (a_der.term_id=t_der.id)
   INNER JOIN gene_product AS p_der ON (a_der.gene_product_id=p_der.id)
   INNER JOIN dbxref AS px_der ON (p_der.dbxref_id=px_der.id)
   LEFT OUTER JOIN association_qualifier AS aq_der ON (a_der.id=aq_der.association_id)
   LEFT OUTER JOIN term AS aqt_der ON (aqt_der.id=aq_der.term_id)
 WHERE
   px_src.xref_dbname='FB'
  AND
   e_der.code='ISS';


Suspect ISS assignments

An association with an ISS evidence code should be accompanied by the ID of the source gene product from which the annotation is transferred. If an annotation to term A is derived from an annotation to term B in an orthologous gene product then we would expect A to subsume B.

 SELECT
   px_src.xref_dbname,
   px_src.xref_key,
   p_src.symbol,
   px_der.xref_dbname,
   px_der.xref_key,
   p_der.symbol,
   t_der.name,
   a_der.is_not,
   aqt_der.name        AS der_qual,
   e_der.code
 FROM
   gene_product AS p_src
   INNER JOIN dbxref AS px_src ON (p_src.dbxref_id=px_src.id)
   INNER JOIN evidence_dbxref AS ex_der ON (ex_der.dbxref_id=p_src.dbxref_id)
   INNER JOIN evidence AS e_der ON (e_der.id=ex_der.evidence_id)
   INNER JOIN association AS a_der ON (a_der.id=e_der.association_id)
   INNER JOIN term AS t_der ON (a_der.term_id=t_der.id)
   INNER JOIN gene_product AS p_der ON (a_der.gene_product_id=p_der.id)
   INNER JOIN dbxref AS px_der ON (p_der.dbxref_id=px_der.id)
   LEFT OUTER JOIN association_qualifier AS aq_der ON (a_der.id=aq_der.association_id)
   LEFT OUTER JOIN term AS aqt_der ON (aqt_der.id=aq_der.term_id)
 WHERE
   code = 'ISS'
  AND
   NOT EXISTS (SELECT a_src.id
               FROM 
                    association AS a_src
                    INNER JOIN graph_path AS pth ON (pth.term2_id=a_src.term_id)
               WHERE 
                     pth.term1_id=a_der.term_id
                     AND a_src.gene_product_id=p_src.id)

See Also