Example LEAD Queries

From GO Wiki
(Redirected from Example Queries)
Jump to navigation Jump to search

This page has examples of various kinds of queries possible on the GO LEAD Database. This assumes some basic knowledge of SQL, and the availability of the LEAD database - either local or via remote queries. A web-based interface to the GO database mirrors at Berkeley BOP and EBI is also available. For more details, see:

http://www.geneontology.org/GO.database.shtml

If you have MySQL5 you can use views to implement many of these queries. Example queries will be hyperlinked to the View documentation


Basic Queries - on ontology

Find terms by GO ID

Given a GO ID, find out information on this term:

 SELECT * FROM term WHERE acc='GO:0005634';

Find terms by (preferred) name

 SELECT * FROM term WHERE name='nucleus';

Find terms by synonym/alternate label

 SELECT * FROM term INNER JOIN term_synonym ON (term.id=term_synonym.term_id)
 WHERE term_synonym LIKE 'protein biosynth%';

Find ancestors of the node 'nucleus'

 SELECT DISTINCT
        ancestor.*, 
        graph_path.distance,
        graph_path.term1_id AS ancestor_id
 FROM 
  term
  INNER JOIN graph_path ON (term.id=graph_path.term2_id)
  INNER JOIN term AS ancestor ON (ancestor.id=graph_path.term1_id)
 WHERE term.name='nucleus';

Note that if there are alternate paths of differing length, this query will report a separate row for each term-ancestor-distance combination. To only report ancestors once, remove graph_path.distance from the SELECT clause

For some reason the following works better on mysql v 5.0.26

SELECT ancestor.*, 
       graph_path.distance,
       graph_path.term1_id AS ancestor_id
FROM 
 term child, graph_path, term ancestor
WHERE
child.id=graph_path.term2_id
AND ancestor.id=graph_path.term1_id
AND child.name='nucleus';

Find descendants of the node 'nucleus'

SELECT DISTINCT descendant.acc, descendant.name, descendant.term_type
FROM
 term
 INNER JOIN graph_path ON (term.id=graph_path.term1_id)
 INNER JOIN term AS descendant ON (descendant.id=graph_path.term2_id)
WHERE term.name='nucleus' AND distance <> 0;

Find all asserted links in GO

SELECT
 child.term_type AS ontology,
 child.acc AS child_acc,
 child.name AS child_name,
 rel.acc AS rel_acc,
 parent.acc AS parent_acc,
 parent.name AS parent_name
FROM
 term AS child
 INNER JOIN term2term ON (child.id=term2_id)
 INNER JOIN term AS parent ON (parent.id=term1_id)
 INNER JOIN term AS rel ON (rel.id=relationship_type_id)
WHERE parent.acc LIKE 'GO:%';

This will return ALL links in GO, which as of 2008 is of the order of 40k edges (add to the where clause to restrict this)

Note the WHERE clause above is necessary to eliminate artefactual terms that are added to the GO database, such as a fake root node.

You can substitute term2term for graph_path to get the full transitive closure (ie all paths). WARNING - this will return millions of rows.

I can't find a GO ID in the database

There are a number of possibilities

Terms by alternate ID

Example scenario:

A GO term has been merged with another term. The previous ID now becomes a secondary ID. It can be searched like this:

 SELECT
   term.*
 FROM
   term
   INNER JOIN term_synonym ON (term.id=term_synonym.term_id)
 WHERE
   acc_synonym='GO:0008151';

This finds GO:0008151 "cellular process", which has GO:0008151 as an alternate ID

Obsolete terms

Obsolete terms are stored in the term table, with is_obsolete set to 1.

To find all obsolete terms:

 SELECT *
 FROM term
 WHERE is_obsolete=1;

Total number of obsolete terms:

 SELECT count(*) FROM term WHERE is_obsolete=1;

Find maximum tree depth for cellular component

SELECT distance as max from graph_path, term WHERE graph_path.term2_id =
term.id and term.term_type = 'cellular_component' ORDER BY distance desc
limit 1;

Basic Queries - annotations

All genes in Drosophila annotated to 'nucleus' (including transitive indirect annotations)

SELECT
 term.name AS superterm_name,
 term.acc AS superterm_acc,
 term.term_type AS superterm_type,
 association.*,
 gene_product.symbol AS gp_symbol,
 gene_product.symbol AS gp_full_name,
 dbxref.xref_dbname AS gp_dbname,
 dbxref.xref_key AS gp_acc,
 species.genus,
 species.species,
 species.ncbi_taxa_id,
 species.common_name
FROM term
 INNER JOIN graph_path ON (term.id=graph_path.term1_id)
 INNER JOIN association ON (graph_path.term2_id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
 INNER JOIN species ON (gene_product.species_id=species.id)
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
WHERE
 term.name = 'nucleus'
 AND
 species.genus = 'Drosophila';

If you have mysql5 you can use a view to make this easier:

SELECT * FROM term_J_association_J_gene_product_via_graph
WHERE superterm_name ='nucleus';

All genes directly annotated to 'nucleus' (excluding child terms)

SELECT
 association.is_not,
 term.name,
 term.acc,
 term.term_type,
 gene_product.symbol AS gp_symbol,
 gene_product.symbol AS gp_full_name,
 dbxref.xref_dbname AS gp_dbname,
 dbxref.xref_key AS gp_acc,
 species.genus,
 species.species,
 species.common_name,
 species.ncbi_taxa_id,
 association.assocdate,
 db.name AS assigned_by,
 db.fullname
FROM term
 INNER JOIN association ON (term.id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
 INNER JOIN species ON (gene_product.species_id=species.id)
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN db ON (association.source_db_id=db.id)
WHERE
 term.name = 'nucleus';

All genes annotated to 'signal transducer activity' but not 'signal transduction' (including transitive indirect annotations)

SELECT
 gene_product.symbol AS gp_symbol,
 gene_product.symbol AS gp_full_name,
 dbxref.xref_dbname AS gp_dbname,
 dbxref.xref_key AS gp_acc,
 species.genus,
 species.species,
 species.common_name
FROM term
 INNER JOIN graph_path ON (term.id=graph_path.term1_id)
 INNER JOIN association ON (graph_path.term2_id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
 INNER JOIN species ON (gene_product.species_id=species.id)
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
WHERE
 term.name = 'signal transducer activity'
 AND
 is_not=0
 AND
 NOT EXISTS 
  (SELECT *
   FROM association AS a2
   INNER JOIN graph_path AS tc2 ON (tc2.term2_id=a2.term_id)
   INNER JOIN term AS t2 ON (tc2.term1_id=t2.id)
   WHERE a2.gene_product_id = association.gene_product_id
   AND   t2.name = 'signal transduction'
   AND   a2.is_not=0);

Peptide sequences of all genes/gene products annotated to 'pronucleus' (including transitive indirect annotations)

SELECT
 gene_product.symbol AS gp_symbol,
 dbxref.xref_dbname,
 dbxref.xref_key,
 seq.seq
FROM term
 INNER JOIN graph_path ON (term.id=graph_path.term1_id)
 INNER JOIN association ON (graph_path.term2_id=association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN gene_product_seq ON (gene_product.id=gene_product_seq.gene_product_id)
 INNER JOIN seq ON (gene_product_seq.seq_id=seq.id)
WHERE
 term.name = 'pronucleus'
 AND association.is_not=0;

Annotations by gene product

Querying by gene or protein symbol

Finding gene products in the database with symbol 'BRCA1':

 SELECT
   *
 FROM
   gene_product
   INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
   INNER JOIN species ON (gene_product.species_id=species.id)
 WHERE
   symbol = 'BRCA1';

The search is case-insensitive (a MySQL default), so it will find mouse genes with symbol "Brca1"

Note that this search is probably not sensitive enough - different contributing databases to GO use different naming conventions. We can broaden the search with a text match:

 SELECT
   *
 FROM
   gene_product
   INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
   INNER JOIN species ON (gene_product.species_id=species.id)
 WHERE
   symbol like 'BRCA1%';

We can see that this search includes a record for a gene with symbol BRCA1_HUMAN (this is the convention of the contributing UniProt database)

We can also retrieve the GO terms attached:

 SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   gene_product.symbol like 'BRCA1%';

Querying by gene product ID

Each gene product is uniquely identified by a bipartite database-accession pair referenced in the gene_product.dbxref_id field; see

to retrieve by this field:

SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   dbxref.xref_key = 'FBgn0000490'
 AND
   dbxref.xref_dbname = 'FB';

this retrieves all annotations for Dmel decapentaplegic gene

Querying by multiple gene product IDs

SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   dbxref.xref_key IN 
     ('FBgn0000490','FBgn0015567','FBgn0003470')
 AND
   dbxref.xref_dbname = 'FB';

Querying by gene product (UniProt)

UniProt contribute annotations to both humans and other organisms; prior to building the database, annotations are filtered to avoid redundancy with Model Organism Database (MOD) contributing annotations.

In general you should try and use the unique UniProt ID:

SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   dbxref.xref_key = 'P38398';

Human BRCA1 annotations

Searching by sequence dbxref

A gene product record may be associated with multiple other identifiers through its sequence record. For example, the mouse gene Tpm2 (identified in the MGI database as MGI:98810) is associated with the protein sequence record P58774 in UniProt.

SELECT
  gene_product.symbol,
  gene_product.full_name,
  dbxref.xref_dbname,
  dbxref.xref_key,
  alt_dbxref.xref_dbname,
  alt_dbxref.xref_key,
  species.genus,
  species.species,
  association.is_not,
  evidence.code,
  term.acc,
  term.name
 FROM   gene_product
  INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
  INNER JOIN gene_product_seq ON (gene_product.id=gene_product_seq.gene_product_id)
  INNER JOIN seq_dbxref ON (gene_product_seq.seq_id=seq_dbxref.seq_id)
  INNER JOIN dbxref AS alt_dbxref ON (seq_dbxref.dbxref_id=alt_dbxref.id)
  INNER JOIN species ON (gene_product.species_id=species.id)
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
  INNER JOIN term ON (association.term_id=term.id)
 WHERE
   alt_dbxref.xref_key = 'P58775';

This can be the source of some confusion: there was at one point a UniProt-contributed annotation to P58774 (UniProt symbol TPM2_MOUSE). This was superseded by the MGI annotation - but the annotations are still linked to the UniProt record via the seq table.

Querying by annotation evidence (PMID)

select distinct
  term.acc,
  evidence.code,
  gene_product.symbol,
  concat(gdbx.xref_dbname, ':', gdbx.xref_key) as gbx,
  concat(dbxref.xref_dbname, ':', dbxref.xref_key) as dbx
from
  dbxref,
  evidence,
  association,
  gene_product,
  dbxref as gdbx,
  term
where
  dbxref.id = evidence.dbxref_id and
  evidence.association_id = association.id and
  association.gene_product_id = gene_product.id and
  gene_product.dbxref_id = gdbx.id and
  association.term_id = term.id and
  dbxref.xref_dbname = 'PMID' and
  dbxref.xref_key = '18391219'

Dictyostelium annotations with evidence from a kinase domain

SELECT
 gene_product.symbol,
 gene_product.full_name,
 dbxref.xref_dbname,
 dbxref.xref_key,
 species.genus,
 species.species,
 association.is_not,
 evidence.code,
 ex.xref_dbname AS with_dbname,
 ex.xref_key    AS with_acc,
 term.acc,
 term.term_type,
 term.name
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN species ON (gene_product.species_id=species.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN evidence ON (association.id=evidence.association_id)
 INNER JOIN evidence_dbxref ON (evidence.id=evidence_dbxref.evidence_id)
 INNER JOIN dbxref AS ex ON (evidence_dbxref.dbxref_id=ex.id)
 INNER JOIN term ON (association.term_id=term.id)
WHERE
  ex.xref_key='IPR000719'
 AND
  genus='Dictyostelium';

Reports / Aggregate Queries

In SQL, queries that perform some operation over a collection of records, such as summation, averages etc are known as aggregate queries. These are useful for generating reports

Total terms in database (non-obsolete)

 SELECT count(*)
 FROM   term
 WHERE is_obsolete=0;

Total terms, grouped by ontology (GO,MF,BP)

 SELECT term_type AS ontology,
        count(*) AS n_terms
 FROM   term
 WHERE  is_obsolete=0
 GROUP BY term_type;

note: this will give you a few additional ontologies that are not used directly for gene_product annotation, but are used for bookkeeping by the database

Total gene products in database

SELECT count(*)
FROM   gene_product;

Total gene products grouped by type

SELECT type.acc,count(gene_product.id)
FROM   gene_product
 INNER JOIN term AS type ON (gene_product.type_id=type.id)
GROUP BY
 type.acc;

Total gene products, grouped by DB authority (FlyBase, SGD, UniProt...)

SELECT xref_dbname, count(DISTINCT gene_product.id)
FROM   gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
GROUP BY xref_dbname;

Number of gene products directly annotated to a term, grouped by DB authority

SELECT xref_dbname, count(DISTINCT gene_product.id)
FROM gene_product
LEFT JOIN dbxref ON gene_product.dbxref_id=dbxref.id
LEFT JOIN association ON association.gene_product_id=gene_product.id
LEFT JOIN term ON association.term_id=term.id
WHERE term.acc='GO:XXXXXXX'
GROUP BY xref_dbname;

Total species with at least one annotation

 SELECT count(DISTINCT species_id)
 FROM   gene_product;

(at present, the db contains no gene products that have not been annotated, so we do not need to join with the annotation table)

Total species with at least one non-IEA annotation

 SELECT count(DISTINCT species_id)
 FROM   gene_product
  INNER JOIN association ON (gene_product.id=association.gene_product_id)
  INNER JOIN evidence ON (association.id=evidence.association_id)
 WHERE
  evidence.code != 'IEA';

NOTE: if you have the godb views installed, this query is simpler:

 SELECT count(DISTINCT species_id)
 FROM  association_J_evidence_J_gene_product
 WHERE code != 'IEA';

Total gene products annotated, grouped by species

This query excludes species that have only a couple of gene products annotated to them, in order to get the results to a manageable size

SELECT species.common_name, species.genus, species.species, count(*) AS n_gps
FROM   gene_product INNER JOIN species ON (gene_product.species_id=species.id)
GROUP BY species.common_name, species.genus, species.species
HAVING n_gps > 10;

Here we introduce a HAVING clause - this is because we cannot use a WHERE clause on the results of an aggregate function

Total annotations, grouped by species, broken down by evidence

 SELECT species.ncbi_taxa_id,  species.common_name,  evidence.code,  count(distinct association.id) AS evidence__association_count
 FROM species
  INNER JOIN gene_product ON (gene_product.species_id = species.id)
  INNER JOIN association ON (association.gene_product_id = gene_product.id)
  INNER JOIN evidence ON (evidence.association_id = association.id)
 GROUP BY  species.ncbi_taxa_id,  species.common_name,  evidence.code
 ORDER BY  species.ncbi_taxa_id,  species.common_name,  evidence.code;

Total annotations IMP annotations under regulation of biological process

SELECT
  COUNT(DISTINCT association.gene_product_id)
FROM 
 association,
 evidence,
 graph_path,
 term
WHERE 
 association.is_not = 0 AND 
 evidence.code = 'IMP' AND 
 evidence.association_id = association.id AND 
 graph_path.term2_id = association.term_id AND 
 term.id = graph_path.term1_id AND 
 term.acc = 'GO:0050789';

Total non-IEA annotations, grouped by species, where there are > 100 annotations

 SELECT species.ncbi_taxa_id,  species.common_name, count(distinct association.id) AS association_count
 FROM species
  INNER JOIN gene_product ON (gene_product.species_id = species.id)
  INNER JOIN association ON (association.gene_product_id = gene_product.id)
  INNER JOIN evidence ON (evidence.association_id = association.id)
 WHERE code !='IEA'
 GROUP BY  species.ncbi_taxa_id,  species.common_name
 HAVING  count(distinct association.id) > 100;

Total annotations, grouped by DB authority (FlyBase, SGD, UniProt...), broken down by evidence

SELECT xref_dbname, evidence.code, count(DISTINCT association.id)
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN evidence ON (association.id=evidence.association_id)
GROUP BY xref_dbname, evidence.code;


Total annotations, grouped by DB authority (FlyBase, SGD, UniProt...), broken down by evidence and ontology

SELECT xref_dbname, term.term_type, evidence.code, count(DISTINCT association.id)
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN evidence ON (association.id=evidence.association_id)
 INNER JOIN term ON (association.term_id=term.id)
GROUP BY xref_dbname, term.term_type,evidence.code;

Average distance from root, by DB authority

Warning: this statistic may be misleading; see below

SELECT xref_dbname, avg(distance) AS avg_dist_from_root
FROM   gene_product
 INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
 INNER JOIN association ON (gene_product.id=association.gene_product_id)
 INNER JOIN graph_path ON (graph_path.term2_id=association.term_id)
 INNER JOIN term AS root ON (graph_path.term1_id=root.id)
WHERE
 is_root=1 AND distance>1
GROUP BY xref_dbname;

This is just meant to illustrate how certain kinds of data mining can be performed. Results may be partially meaningless - depth is measured as the number of links between a term and the root. This is not necessarily a meaningful measure of specificity. Most terms have multiple paths to root; in the above query, global averages are calculated, which means results will be biased in favour of certain kinds of terms (for example, many biosynthetic pathway type terms like "cysteine biosynthesis" are both deep in the graph, and have multiple paths, meaning groups that annotate organisms with a heavy bias to these kind of terms will score higher)

A better approach is to first find the average depth per term, and then average over that on a per annotation basis.

Better yet is to do an analysis based on information content - the above is crude and is for illustrating complex aggregate queries.


All terms, with gene products directly annotated to them

Warning: expensive query Warning: will generate ~20,000 rows

SELECT term.acc, term.name, count(DISTINCT gene_product_id)
FROM   term INNER JOIN association ON (term.id=association.term_id)
GROUP BY term.acc, term.name;

Filtering by taxon

The GO database uses a nested set model for storing the taxonomic tree. See this tutorial.

All mammals known to the GO database

 SELECT
  s.*
 FROM
  species AS s,
  species AS t
 WHERE
  (s.left_value BETWEEN t.left_value AND t.right_value) AND
  t.common_name='mammals';

Total number of mammalian genes with annotations

 SELECT
  count(DISTINCT g.id)
 FROM
  association AS a,
  gene_product AS g,
  species AS s,
  species AS t
 WHERE
  a.gene_product_id=g.id AND
  g.species_id=s.id AND
  (s.left_value BETWEEN t.left_value AND t.right_value) AND
  t.ncbi_taxa_id=40674;


Complex Queries

Query by distance to root

  • Caveat: Note that GO is structured as a DAG and thus there are typically multiple paths-to-root. Thus a term may have more than one 'depth'
  • Caveat: The depth of a node or distance between nodes is not necessarily biologically meaningful. Other more advanced queries are required to get biologially meaningful information-theory based distances between terms

This query finds all terms together with their depths - if a term is found at paths with different depths, it will be listed twice

 SELECT DISTINCT
   term.*,
   p.distance
 FROM
   term
   INNER JOIN graph_path AS p ON (p.term2_id=term.id)
   INNER JOIN term AS root ON (p.term1_id=root.id)
 WHERE
   root.is_root=1;

You can constrain this by adding constraints to the WHERE clause; eg

  AND term.name like 'membrane%';
  • Caveat: at the time of writing, the GO database inserts an additional fake 'root' above the 3 actual root nodes of GO. Thus the distance should be taken as being to this fake root and not the actual root for the particular ontology (function, process, component)

An aggregate form of this query allows us to find the maximum distance to root for each term.

 SELECT DISTINCT
   term.acc,
   term.name,
   term.term_type,
   max(p.distance)
 FROM
   term
   INNER JOIN graph_path AS p ON (p.term2_id=term.id)
   INNER JOIN term AS root ON (p.term1_id=root.id)
 WHERE
   root.is_root=1
   AND term.is_obsolete=0
 GROUP BY
   term.acc,
   term.name,
   term.term_type;

Note that other aggregate functions such as min or avg could be used in place of max in the above.


Query by distance from leaf

gene products localized to a protein complex and involved in protein binding

SELECT
gene_product.symbol AS gp_symbol,
gene_product.symbol AS gp_full_name,
dbxref.xref_dbname AS gp_dbname,
dbxref.xref_key AS gp_acc,
species.genus,
species.species,
species.common_name
FROM term AS t1
INNER JOIN graph_path AS cl1 ON (t1.id=cl1.term1_id)
INNER JOIN association AS a1 ON (cl1.term2_id=a1.term_id)
INNER JOIN gene_product ON (a1.gene_product_id=gene_product.id)
INNER JOIN species ON (gene_product.species_id=species.id)
INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
INNER JOIN association AS a2 ON (a2.gene_product_id=gene_product.id)
INNER JOIN graph_path AS cl2 ON (a2.term_id=cl2.term2_id)
INNER JOIN term AS t2 ON (cl2.term1_id=t2.id)
WHERE
t1.name = 'protein binding'
AND
t2.name = 'protein complex'
AND
a1.is_not=0
AND
a2.is_not=0
AND
dbxref.xref_dbname='MGI';

Immune response gene products and their associated cell component terms in human

We want genes annotated at-or-below immune reponse, together with their associated directly annotated terms in CC

 SELECT
  term1_acc  AS process_acc, 
  term1_name AS process_name,
  gp_symbol, 
  term2_acc  AS cc_acc, 
  term2_name AS cc_name 
 FROM 
  gene_product_with_term_pair_via_graph 
  INNER JOIN species ON (gp_species_id=species.id)
 WHERE 
  superterm_name='immune response' AND 
  species.common_name='human'      AND 
  is_not=0                         AND 
  association2_is_not=0            AND 
  term2_type='cellular_component';

Note we exclude negative annotations

Pairs of GO terms associated via IC evidence

The IC evidence code (Inferred by Curator) is typically used in conjunction with a GO ID as evidence (in the 'with' field). This is typically used when a curator makes a cross-ontology inference, eg if a process is known to always take place in the nucleus, then an additional annotation is made to nucleus with the process GO ID in the evidence with column.

 SELECT DISTINCT
       term.acc,
       term.term_type,
       term.name,
       term2.acc AS acc2,
       term2.term_type AS term_type2,
       term2.name AS name2
 FROM IC_evidence AS ic 
       INNER JOIN evidence_dbxref AS ex ON (ic.id=evidence_id)
       INNER JOIN dbxref ON (ex.dbxref_id=dbxref.id)
       INNER JOIN association AS a ON (a.id=ic.association_id)
       INNER JOIN term ON (a.term_id=term.id) 
       INNER JOIN term AS term2 ON (concat(xref_dbname,':',xref_key) = term2.acc);

(See also the view term_pair_by_IC)

Note the slightly contorted way of getting the second term via a mysql concat; this is because the evidence with field is stored in the dbxref table which splits the bipartite identifier; on the other hand the term table stores this in a single column.

Gene products that support other associations

The following query finds associations that are based on an ISS from FlyBase

 SELECT
   px_src.xref_dbname  AS src_db,
   px_src.xref_key     AS src_acc,
   p_src.symbol        AS src_symbol,
   px_der.xref_dbname  AS der_db,
   px_der.xref_key     AS der_acc,
   p_der.symbol        AS der_symbol,
   a_der.is_not        AS der_is_not,
   aqt_der.name        AS der_qual,
   t_der.acc           AS der_term_acc,
   t_der.name          AS der_term_name,
   e_der.code          AS der_code
 FROM
   gene_product AS p_src
   INNER JOIN dbxref AS px_src ON (p_src.dbxref_id=px_src.id)
   INNER JOIN evidence_dbxref AS ex_der ON (ex_der.dbxref_id=p_src.dbxref_id)
   INNER JOIN evidence AS e_der ON (e_der.id=ex_der.evidence_id)
   INNER JOIN association AS a_der ON (a_der.id=e_der.association_id)
   INNER JOIN term AS t_der ON (a_der.term_id=t_der.id)
   INNER JOIN gene_product AS p_der ON (a_der.gene_product_id=p_der.id)
   INNER JOIN dbxref AS px_der ON (p_der.dbxref_id=px_der.id)
   LEFT OUTER JOIN association_qualifier AS aq_der ON (a_der.id=aq_der.association_id)
   LEFT OUTER JOIN term AS aqt_der ON (aqt_der.id=aq_der.term_id)
 WHERE
   px_src.xref_dbname='FB'
  AND
   e_der.code='ISS';


Suspect ISS assignments

An association with an ISS evidence code should be accompanied by the ID of the source gene product from which the annotation is transferred. If an annotation to term A is derived from an annotation to term B in an orthologous gene product then we would expect A to subsume B.

 SELECT
   px_src.xref_dbname,
   px_src.xref_key,
   p_src.symbol,
   px_der.xref_dbname,
   px_der.xref_key,
   p_der.symbol,
   t_der.name,
   a_der.is_not,
   aqt_der.name        AS der_qual,
   e_der.code
 FROM
   gene_product AS p_src
   INNER JOIN dbxref AS px_src ON (p_src.dbxref_id=px_src.id)
   INNER JOIN evidence_dbxref AS ex_der ON (ex_der.dbxref_id=p_src.dbxref_id)
   INNER JOIN evidence AS e_der ON (e_der.id=ex_der.evidence_id)
   INNER JOIN association AS a_der ON (a_der.id=e_der.association_id)
   INNER JOIN term AS t_der ON (a_der.term_id=t_der.id)
   INNER JOIN gene_product AS p_der ON (a_der.gene_product_id=p_der.id)
   INNER JOIN dbxref AS px_der ON (p_der.dbxref_id=px_der.id)
   LEFT OUTER JOIN association_qualifier AS aq_der ON (a_der.id=aq_der.association_id)
   LEFT OUTER JOIN term AS aqt_der ON (aqt_der.id=aq_der.term_id)
 WHERE
   code = 'ISS'
  AND
   NOT EXISTS (SELECT a_src.id
               FROM 
                    association AS a_src
                    INNER JOIN graph_path AS pth ON (pth.term2_id=a_src.term_id)
               WHERE 
                     pth.term1_id=a_der.term_id
                     AND a_src.gene_product_id=p_src.id);

Terms uniquely found in a species

 SELECT
  species.ncbi_taxa_id,
  species.genus,
  species.species,
  species.common_name,
  t.acc,
  t.name,
  gpc.product_count AS num_gps_only_found_in_taxon
 FROM
  term AS t
  INNER JOIN gene_product_count AS gpc ON (gpc.term_id=t.id)
  INNER JOIN species ON (species.id=gpc.species_id)
 WHERE NOT EXISTS (SELECT * FROM gene_product_count AS gpc2 WHERE gpc2.term_id=t.id AND gpc2.species_id != gpc.species_id AND gpc2.product_count > 0)
  AND gpc.product_count > 10
 ORDER BY gpc.product_count DESC;

This finds terms that are found only in one taxon. Note that as annotation is never complete, absence of annotation does not mean that instantiation of that term is only possible in that taxon. For example you will probably see C-elegans and "hermaphrodite genitalia development" near the top. This does not mean that there are no other 'naturally' hermaphoroditic species, just that there is a large model organism bias and no other model organisms exhibit this behaviour.

The results are in descending order of number of annotations. The threshold above filters out anything with < 10 genes annotated, to avoid too many results. The assumption is that somewhere below this number the results are meaningless as we expect to see certain distributions by chance (Obviously a detailed statistical model may be best, something like this may be possible in SQL using views; left as an exercise for the reader, email gohelp if you are interested)

Only species (leaf taxon nodes) are counted. This query uses the gene_product_count table, which means that the number given is for the total number of distinct gene product at OR BELOW that node. See the gene_product_count table documentation for details.

Note that the results will include some redundancy. If "fin development" is reported as being uniquely found in zebrafish, then "negative regulation of fin development" may also be found. Removing these redundancies is possible but actually leads to quite complex SQL that will execute less efficiently. This may not strictly be necessary, as ordering by the cumulative gene product count ensures that the more general unique terms are found first.

A possible extension is allow for arbitrary taxonomic groupings: e.g. show terms only found_in mammalia. This is possible using the nested set representation in the species table, left as an exercise for the reader for now.


Terms found almost uniquely in a species

 SELECT
  species.ncbi_taxa_id,
  species.genus,
  species.species,
  species.common_name,
  t.acc,
  t.name,
  gpc.product_count AS num_gps_only_mostly_in_taxon,
  sum(gpc2.product_count) AS sum_gpc
 FROM
  term AS t
  INNER JOIN gene_product_count AS gpc ON (gpc.term_id=t.id)
  INNER JOIN species ON (species.id=gpc.species_id)
  INNER JOIN gene_product_count AS gpc2 ON (gpc2.term_id=t.id)
 WHERE 
  gpc2.species_id IS NOT NULL AND
  gpc.product_count > 10
 GROUP BY
  species.ncbi_taxa_id,
  species.genus,
  species.species,
  species.common_name,
  t.acc,
  t.name,
  gpc.product_count
 HAVING
  sum_gpc -  num_gps_only_mostly_in_taxon BETWEEN 1 AND 3
 ORDER BY gpc.product_count DESC;

You are strongly advised to read the documentation for the previous (similar, faster) query first!

Non-redundant annotations for a gene product

   SELECT
    DISTINCT
    association.is_not,
    term.name,
    term.acc,
    term.term_type,
    gene_product.symbol AS gp_symbol,
    gene_product.symbol AS gp_full_name,
    dbxref.xref_dbname AS gp_dbname,
    dbxref.xref_key AS gp_acc
   FROM term
    INNER JOIN association ON (term.id=association.term_id)
    INNER JOIN gene_product ON (association.gene_product_id=gene_product.id)
    INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id)
   WHERE
    symbol = 'Shh' and dbxref.xref_dbname='MGI'
    AND NOT EXISTS
     (SELECT a.id
      FROM
       association AS a
       INNER JOIN graph_path AS tc ON (a.term_id = tc.term2_id)
      WHERE a.gene_product_id = gene_product.id
       AND tc.term1_id = association.term_id
       AND tc.term1_id != tc.term2_id)
   

This finds annotations to any MGI gene product called 'Shh' for a term t where there does not already exist a separate annotation to the same gene product with any descendant term of t.

Internal Reports

count how many GO terms have the source GOC:BHF

SELECT
 COUNT(DISTINCT tx.term_id)
FROM
 term_dbxref AS tx
 INNER JOIN dbxref AS x ON (x.id = tx.dbxref_id)
WHERE
 tx.is_for_definition = 1 AND
 x.xref_key = 'BHF';

list all GO terms have the source GOC:BHF

SELECT
 t.acc,
 t.term_type,
 t.name
FROM
 term AS t
 INNER JOIN term_dbxref AS tx ON (t.id=tx.term_id)
 INNER JOIN dbxref AS x ON (x.id = tx.dbxref_id)
WHERE
 tx.is_for_definition = 1 AND
 x.xref_key = 'BHF';

all annotations directly to terms with the source GOC:BHF

This is only for direct annotations. order in descending order by date

SELECT
 g.symbol,
 s.genus,
 s.species,
 a.is_not,
 t.acc,
 t.name,
 t.term_type,
 a.assocdate
FROM
 term AS t
 INNER JOIN term_dbxref AS tx ON (t.id=tx.term_id)
 INNER JOIN dbxref AS x ON (x.id = tx.dbxref_id)
 INNER JOIN association AS a ON (a.term_id = t.id)
 INNER JOIN gene_product AS g ON (a.gene_product_id=g.id)
 INNER JOIN species AS s ON (g.species_id=s.id)
WHERE
 tx.is_for_definition = 1 AND
 x.xref_key = 'BHF'
ORDER BY
 a.assocdate DESC;

annotation by year stats for source GOC:BHF

This is only for direct annotations. order in descending order by date

SELECT
 t.acc,
 t.name,
 substring(a.assocdate,1,4),
 count(DISTINCT g.id) AS num_genes
FROM
 term AS t
 INNER JOIN term_dbxref AS tx ON (t.id=tx.term_id)
 INNER JOIN dbxref AS x ON (x.id = tx.dbxref_id)
 INNER JOIN association AS a ON (a.term_id = t.id)
 INNER JOIN gene_product AS g ON (a.gene_product_id=g.id)
WHERE
 tx.is_for_definition = 1 AND
 x.xref_key = 'BHF'
GROUP BY
 t.acc,
 t.name,
 substring(a.assocdate,1,4);


Database update information

What version of the database is this?

 SELECT * FROM instance_data;

What is the date of the most recently created or checked annotation?

 SELECT max(assocdate) FROM association;

What is the timestamp of the ontologies in the database?

 SELECT source_path,FROM_UNIXTIME(source_mtime) AS file_last_modified_time,FROM_UNIXTIME(source_parsetime) AS file_parsed_time
 FROM source_audit
 WHERE source_path like 'gene_ontology%';

What is the timestamp of the annotations in the database?

 SELECT source_path,FROM_UNIXTIME(source_mtime) AS file_last_modified_time,FROM_UNIXTIME(source_parsetime) AS file_parsed_time
 FROM source_audit
 WHERE source_path like 'gene_association%';

How many annotations are there more recent than 2008/01/01?

 SELECT count(*) FROM association WHERE assocdate>20080101;

Note that this queries based on the annotation date, NOT the date of the whole file (see above)

How are annotations distributed by year?

 SELECT substring(assocdate,1,4) AS yr,count(*) AS num_assocs
 FROM association 
 GROUP BY substring(assocdate,1,4);