Example LEAD Queries
This page has examples of various kinds of queries possible on the GO LEAD Database. This assumes some basic knowledge of SQL, and the availability of the LEAD database - either local or via remote queries. A web-based interface to the GO database mirrors at Berkeley BOP and EBI is also available. For more details, see:
http://www.geneontology.org/GO.database.shtml
If you have MySQL5 you can use views to implement many of these queries. Example queries will be hyperlinked to the View documentation
Basic Queries - on ontology
Find terms by GO ID
Given a GO ID, find out information on this term:
SELECT * FROM term WHERE acc='GO:0005634';
Find terms by (preferred) name
SELECT * FROM term WHERE name='nucleus';
Find terms by synonym/alternate label
SELECT * FROM term INNER JOIN term_synonym ON (term.id=term_synonym.term_id) WHERE term_synonym LIKE 'protein biosynth%';
Find ancestors of the node 'nucleus'
SELECT DISTINCT ancestor.*, graph_path.distance, graph_path.term1_id AS ancestor_id FROM term INNER JOIN graph_path ON (term.id=graph_path.term2_id) INNER JOIN term AS ancestor ON (ancestor.id=graph_path.term1_id) WHERE term.name='nucleus';
Note that if there are alternate paths of differing length, this query will report a separate row for each term-ancestor-distance combination. To only report ancestors once, remove graph_path.distance from the SELECT clause
- see also: view
For some reason the following works better on mysql v 5.0.26
SELECT ancestor.*, graph_path.distance, graph_path.term1_id AS ancestor_id FROM term child, graph_path, term ancestor WHERE child.id=graph_path.term2_id AND ancestor.id=graph_path.term1_id AND child.name='nucleus';
Find descendants of the node 'nucleus'
SELECT DISTINCT descendant.acc, descendant.name, descendant.term_type FROM term INNER JOIN graph_path ON (term.id=graph_path.term1_id) INNER JOIN term AS descendant ON (descendant.id=graph_path.term2_id) WHERE term.name='nucleus' AND distance <> 0;
Find all asserted links in GO
SELECT child.term_type AS ontology, child.acc AS child_acc, child.name AS child_name, rel.acc AS rel_acc, parent.acc AS parent_acc, parent.name AS parent_name FROM term AS child INNER JOIN term2term ON (child.id=term2_id) INNER JOIN term AS parent ON (parent.id=term1_id) INNER JOIN term AS rel ON (rel.id=relationship_type_id) WHERE parent.acc LIKE 'GO:%';
This will return ALL links in GO, which as of 2008 is of the order of 40k edges (add to the where clause to restrict this)
Note the WHERE clause above is necessary to eliminate artefactual terms that are added to the GO database, such as a fake root node.
You can substitute term2term for graph_path to get the full transitive closure (ie all paths). WARNING - this will return millions of rows.
I can't find a GO ID in the database
There are a number of possibilities
Terms by alternate ID
Example scenario:
A GO term has been merged with another term. The previous ID now becomes a secondary ID. It can be searched like this:
SELECT term.* FROM term INNER JOIN term_synonym ON (term.id=term_synonym.term_id) WHERE acc_synonym='GO:0008151';
This finds GO:0008151 "cellular process", which has GO:0008151 as an alternate ID
Obsolete terms
Obsolete terms are stored in the term table, with is_obsolete set to 1.
To find all obsolete terms:
SELECT * FROM term WHERE is_obsolete=1;
Total number of obsolete terms:
SELECT count(*) FROM term WHERE is_obsolete=1;
Find maximum tree depth for cellular component
SELECT distance as max from graph_path, term WHERE graph_path.term2_id = term.id and term.term_type = 'cellular_component' ORDER BY distance desc limit 1;
Basic Queries - annotations
All genes in Drosophila annotated to 'nucleus' (including transitive indirect annotations)
SELECT term.name AS superterm_name, term.acc AS superterm_acc, term.term_type AS superterm_type, association.*, gene_product.symbol AS gp_symbol, gene_product.symbol AS gp_full_name, dbxref.xref_dbname AS gp_dbname, dbxref.xref_key AS gp_acc, species.genus, species.species, species.ncbi_taxa_id, species.common_name FROM term INNER JOIN graph_path ON (term.id=graph_path.term1_id) INNER JOIN association ON (graph_path.term2_id=association.term_id) INNER JOIN gene_product ON (association.gene_product_id=gene_product.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) WHERE term.name = 'nucleus' AND species.genus = 'Drosophila';
If you have mysql5 you can use a view to make this easier:
SELECT * FROM term_J_association_J_gene_product_via_graph WHERE superterm_name ='nucleus';
- See view
All genes directly annotated to 'nucleus' (excluding child terms)
SELECT association.is_not, term.name, term.acc, term.term_type, gene_product.symbol AS gp_symbol, gene_product.symbol AS gp_full_name, dbxref.xref_dbname AS gp_dbname, dbxref.xref_key AS gp_acc, species.genus, species.species, species.common_name, species.ncbi_taxa_id, association.assocdate, db.name AS assigned_by, db.fullname FROM term INNER JOIN association ON (term.id=association.term_id) INNER JOIN gene_product ON (association.gene_product_id=gene_product.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN db ON (association.source_db_id=db.id) WHERE term.name = 'nucleus';
All genes annotated to 'signal transducer activity' but not 'signal transduction' (including transitive indirect annotations)
SELECT gene_product.symbol AS gp_symbol, gene_product.symbol AS gp_full_name, dbxref.xref_dbname AS gp_dbname, dbxref.xref_key AS gp_acc, species.genus, species.species, species.common_name FROM term INNER JOIN graph_path ON (term.id=graph_path.term1_id) INNER JOIN association ON (graph_path.term2_id=association.term_id) INNER JOIN gene_product ON (association.gene_product_id=gene_product.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) WHERE term.name = 'signal transducer activity' AND is_not=0 AND NOT EXISTS (SELECT * FROM association AS a2 INNER JOIN graph_path AS tc2 ON (tc2.term2_id=a2.term_id) INNER JOIN term AS t2 ON (tc2.term1_id=t2.id) WHERE a2.gene_product_id = association.gene_product_id AND t2.name = 'signal transduction' AND a2.is_not=0);
Peptide sequences of all genes/gene products annotated to 'pronucleus' (including transitive indirect annotations)
SELECT gene_product.symbol AS gp_symbol, dbxref.xref_dbname, dbxref.xref_key, seq.seq FROM term INNER JOIN graph_path ON (term.id=graph_path.term1_id) INNER JOIN association ON (graph_path.term2_id=association.term_id) INNER JOIN gene_product ON (association.gene_product_id=gene_product.id) INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN gene_product_seq ON (gene_product.id=gene_product_seq.gene_product_id) INNER JOIN seq ON (gene_product_seq.seq_id=seq.id) WHERE term.name = 'pronucleus' AND association.is_not=0;
Annotations by gene product
Querying by gene or protein symbol
Finding gene products in the database with symbol 'BRCA1':
SELECT * FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN species ON (gene_product.species_id=species.id) WHERE symbol = 'BRCA1';
The search is case-insensitive (a MySQL default), so it will find mouse genes with symbol "Brca1"
Note that this search is probably not sensitive enough - different contributing databases to GO use different naming conventions. We can broaden the search with a text match:
SELECT * FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN species ON (gene_product.species_id=species.id) WHERE symbol like 'BRCA1%';
We can see that this search includes a record for a gene with symbol BRCA1_HUMAN (this is the convention of the contributing UniProt database)
We can also retrieve the GO terms attached:
SELECT gene_product.symbol, gene_product.full_name, dbxref.xref_dbname, dbxref.xref_key, species.genus, species.species, association.is_not, evidence.code, term.acc, term.name FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) INNER JOIN term ON (association.term_id=term.id) WHERE gene_product.symbol like 'BRCA1%';
Querying by gene product ID
Each gene product is uniquely identified by a bipartite database-accession pair referenced in the gene_product.dbxref_id field; see
- http://www.geneontology.org/GO.database.schema.shtml#go-associations.table.gene-product
- http://www.geneontology.org/GO.format.annotation.shtml (cols 1 and 2)
to retrieve by this field:
SELECT gene_product.symbol, gene_product.full_name, dbxref.xref_dbname, dbxref.xref_key, species.genus, species.species, association.is_not, evidence.code, term.acc, term.name FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) INNER JOIN term ON (association.term_id=term.id) WHERE dbxref.xref_key = 'FBgn0000490' AND dbxref.xref_dbname = 'FB';
this retrieves all annotations for Dmel decapentaplegic gene
Querying by multiple gene product IDs
SELECT gene_product.symbol, gene_product.full_name, dbxref.xref_dbname, dbxref.xref_key, species.genus, species.species, association.is_not, evidence.code, term.acc, term.name FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) INNER JOIN term ON (association.term_id=term.id) WHERE dbxref.xref_key IN ('FBgn0000490','FBgn0015567','FBgn0003470') AND dbxref.xref_dbname = 'FB';
Querying by gene product (UniProt)
UniProt contribute annotations to both humans and other organisms; prior to building the database, annotations are filtered to avoid redundancy with Model Organism Database (MOD) contributing annotations.
In general you should try and use the unique UniProt ID:
SELECT gene_product.symbol, gene_product.full_name, dbxref.xref_dbname, dbxref.xref_key, species.genus, species.species, association.is_not, evidence.code, term.acc, term.name FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) INNER JOIN term ON (association.term_id=term.id) WHERE dbxref.xref_key = 'P38398';
Human BRCA1 annotations
Searching by sequence dbxref
A gene product record may be associated with multiple other identifiers through its sequence record. For example, the mouse gene Tpm2 (identified in the MGI database as MGI:98810) is associated with the protein sequence record P58774 in UniProt.
SELECT gene_product.symbol, gene_product.full_name, dbxref.xref_dbname, dbxref.xref_key, alt_dbxref.xref_dbname, alt_dbxref.xref_key, species.genus, species.species, association.is_not, evidence.code, term.acc, term.name FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN gene_product_seq ON (gene_product.id=gene_product_seq.gene_product_id) INNER JOIN seq_dbxref ON (gene_product_seq.seq_id=seq_dbxref.seq_id) INNER JOIN dbxref AS alt_dbxref ON (seq_dbxref.dbxref_id=alt_dbxref.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) INNER JOIN term ON (association.term_id=term.id) WHERE alt_dbxref.xref_key = 'P58775';
This can be the source of some confusion: there was at one point a UniProt-contributed annotation to P58774 (UniProt symbol TPM2_MOUSE). This was superseded by the MGI annotation - but the annotations are still linked to the UniProt record via the seq table.
Querying by annotation evidence (PMID)
select distinct term.acc, evidence.code, gene_product.symbol, concat(gdbx.xref_dbname, ':', gdbx.xref_key) as gbx, concat(dbxref.xref_dbname, ':', dbxref.xref_key) as dbx from dbxref, evidence, association, gene_product, dbxref as gdbx, term where dbxref.id = evidence.dbxref_id and evidence.association_id = association.id and association.gene_product_id = gene_product.id and gene_product.dbxref_id = gdbx.id and association.term_id = term.id and dbxref.xref_dbname = 'PMID' and dbxref.xref_key = '18391219'
Dictyostelium annotations with evidence from a kinase domain
SELECT gene_product.symbol, gene_product.full_name, dbxref.xref_dbname, dbxref.xref_key, species.genus, species.species, association.is_not, evidence.code, ex.xref_dbname AS with_dbname, ex.xref_key AS with_acc, term.acc, term.term_type, term.name FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) INNER JOIN evidence_dbxref ON (evidence.id=evidence_dbxref.evidence_id) INNER JOIN dbxref AS ex ON (evidence_dbxref.dbxref_id=ex.id) INNER JOIN term ON (association.term_id=term.id) WHERE ex.xref_key='IPR000719' AND genus='Dictyostelium';
Reports / Aggregate Queries
In SQL, queries that perform some operation over a collection of records, such as summation, averages etc are known as aggregate queries. These are useful for generating reports
Total terms in database (non-obsolete)
SELECT count(*) FROM term WHERE is_obsolete=0;
Total terms, grouped by ontology (GO,MF,BP)
SELECT term_type AS ontology, count(*) AS n_terms FROM term WHERE is_obsolete=0 GROUP BY term_type;
note: this will give you a few additional ontologies that are not used directly for gene_product annotation, but are used for bookkeeping by the database
Total gene products in database
SELECT count(*) FROM gene_product;
Total gene products grouped by type
SELECT type.acc,count(gene_product.id) FROM gene_product INNER JOIN term AS type ON (gene_product.type_id=type.id) GROUP BY type.acc;
Total gene products, grouped by DB authority (FlyBase, SGD, UniProt...)
SELECT xref_dbname, count(DISTINCT gene_product.id) FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) GROUP BY xref_dbname;
- See view
Number of gene products directly annotated to a term, grouped by DB authority
SELECT xref_dbname, count(DISTINCT gene_product.id) FROM gene_product LEFT JOIN dbxref ON gene_product.dbxref_id=dbxref.id LEFT JOIN association ON association.gene_product_id=gene_product.id LEFT JOIN term ON association.term_id=term.id WHERE term.acc='GO:XXXXXXX' GROUP BY xref_dbname;
Total species with at least one annotation
SELECT count(DISTINCT species_id) FROM gene_product;
(at present, the db contains no gene products that have not been annotated, so we do not need to join with the annotation table)
Total species with at least one non-IEA annotation
SELECT count(DISTINCT species_id) FROM gene_product INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) WHERE evidence.code != 'IEA';
NOTE: if you have the godb views installed, this query is simpler:
SELECT count(DISTINCT species_id) FROM association_J_evidence_J_gene_product WHERE code != 'IEA';
Total gene products annotated, grouped by species
This query excludes species that have only a couple of gene products annotated to them, in order to get the results to a manageable size
SELECT species.common_name, species.genus, species.species, count(*) AS n_gps FROM gene_product INNER JOIN species ON (gene_product.species_id=species.id) GROUP BY species.common_name, species.genus, species.species HAVING n_gps > 10;
Here we introduce a HAVING clause - this is because we cannot use a WHERE clause on the results of an aggregate function
Total annotations, grouped by species, broken down by evidence
SELECT species.ncbi_taxa_id, species.common_name, evidence.code, count(distinct association.id) AS evidence__association_count FROM species INNER JOIN gene_product ON (gene_product.species_id = species.id) INNER JOIN association ON (association.gene_product_id = gene_product.id) INNER JOIN evidence ON (evidence.association_id = association.id) GROUP BY species.ncbi_taxa_id, species.common_name, evidence.code ORDER BY species.ncbi_taxa_id, species.common_name, evidence.code;
Total annotations IMP annotations under regulation of biological process
SELECT COUNT(DISTINCT association.gene_product_id) FROM association, evidence, graph_path, term WHERE association.is_not = 0 AND evidence.code = 'IMP' AND evidence.association_id = association.id AND graph_path.term2_id = association.term_id AND term.id = graph_path.term1_id AND term.acc = 'GO:0050789';
Total non-IEA annotations, grouped by species, where there are > 100 annotations
SELECT species.ncbi_taxa_id, species.common_name, count(distinct association.id) AS association_count FROM species INNER JOIN gene_product ON (gene_product.species_id = species.id) INNER JOIN association ON (association.gene_product_id = gene_product.id) INNER JOIN evidence ON (evidence.association_id = association.id) WHERE code !='IEA' GROUP BY species.ncbi_taxa_id, species.common_name HAVING count(distinct association.id) > 100;
Total annotations, grouped by DB authority (FlyBase, SGD, UniProt...), broken down by evidence
SELECT xref_dbname, evidence.code, count(DISTINCT association.id) FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) GROUP BY xref_dbname, evidence.code;
Total annotations, grouped by DB authority (FlyBase, SGD, UniProt...), broken down by evidence and ontology
SELECT xref_dbname, term.term_type, evidence.code, count(DISTINCT association.id) FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN evidence ON (association.id=evidence.association_id) INNER JOIN term ON (association.term_id=term.id) GROUP BY xref_dbname, term.term_type,evidence.code;
Average distance from root, by DB authority
Warning: this statistic may be misleading; see below
SELECT xref_dbname, avg(distance) AS avg_dist_from_root FROM gene_product INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN association ON (gene_product.id=association.gene_product_id) INNER JOIN graph_path ON (graph_path.term2_id=association.term_id) INNER JOIN term AS root ON (graph_path.term1_id=root.id) WHERE is_root=1 AND distance>1 GROUP BY xref_dbname;
This is just meant to illustrate how certain kinds of data mining can be performed. Results may be partially meaningless - depth is measured as the number of links between a term and the root. This is not necessarily a meaningful measure of specificity. Most terms have multiple paths to root; in the above query, global averages are calculated, which means results will be biased in favour of certain kinds of terms (for example, many biosynthetic pathway type terms like "cysteine biosynthesis" are both deep in the graph, and have multiple paths, meaning groups that annotate organisms with a heavy bias to these kind of terms will score higher)
A better approach is to first find the average depth per term, and then average over that on a per annotation basis.
Better yet is to do an analysis based on information content - the above is crude and is for illustrating complex aggregate queries.
All terms, with gene products directly annotated to them
Warning: expensive query Warning: will generate ~20,000 rows
SELECT term.acc, term.name, count(DISTINCT gene_product_id) FROM term INNER JOIN association ON (term.id=association.term_id) GROUP BY term.acc, term.name;
Filtering by taxon
The GO database uses a nested set model for storing the taxonomic tree. See this tutorial.
All mammals known to the GO database
SELECT s.* FROM species AS s, species AS t WHERE (s.left_value BETWEEN t.left_value AND t.right_value) AND t.common_name='mammals';
Total number of mammalian genes with annotations
SELECT count(DISTINCT g.id) FROM association AS a, gene_product AS g, species AS s, species AS t WHERE a.gene_product_id=g.id AND g.species_id=s.id AND (s.left_value BETWEEN t.left_value AND t.right_value) AND t.ncbi_taxa_id=40674;
Complex Queries
Query by distance to root
- Caveat: Note that GO is structured as a DAG and thus there are typically multiple paths-to-root. Thus a term may have more than one 'depth'
- Caveat: The depth of a node or distance between nodes is not necessarily biologically meaningful. Other more advanced queries are required to get biologially meaningful information-theory based distances between terms
This query finds all terms together with their depths - if a term is found at paths with different depths, it will be listed twice
SELECT DISTINCT term.*, p.distance FROM term INNER JOIN graph_path AS p ON (p.term2_id=term.id) INNER JOIN term AS root ON (p.term1_id=root.id) WHERE root.is_root=1;
You can constrain this by adding constraints to the WHERE clause; eg
AND term.name like 'membrane%';
- Caveat: at the time of writing, the GO database inserts an additional fake 'root' above the 3 actual root nodes of GO. Thus the distance should be taken as being to this fake root and not the actual root for the particular ontology (function, process, component)
An aggregate form of this query allows us to find the maximum distance to root for each term.
SELECT DISTINCT term.acc, term.name, term.term_type, max(p.distance) FROM term INNER JOIN graph_path AS p ON (p.term2_id=term.id) INNER JOIN term AS root ON (p.term1_id=root.id) WHERE root.is_root=1 AND term.is_obsolete=0 GROUP BY term.acc, term.name, term.term_type;
Note that other aggregate functions such as min or avg could be used in place of max in the above.
Query by distance from leaf
gene products localized to a protein complex and involved in protein binding
SELECT gene_product.symbol AS gp_symbol, gene_product.symbol AS gp_full_name, dbxref.xref_dbname AS gp_dbname, dbxref.xref_key AS gp_acc, species.genus, species.species, species.common_name FROM term AS t1 INNER JOIN graph_path AS cl1 ON (t1.id=cl1.term1_id) INNER JOIN association AS a1 ON (cl1.term2_id=a1.term_id) INNER JOIN gene_product ON (a1.gene_product_id=gene_product.id) INNER JOIN species ON (gene_product.species_id=species.id) INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) INNER JOIN association AS a2 ON (a2.gene_product_id=gene_product.id) INNER JOIN graph_path AS cl2 ON (a2.term_id=cl2.term2_id) INNER JOIN term AS t2 ON (cl2.term1_id=t2.id) WHERE t1.name = 'protein binding' AND t2.name = 'protein complex' AND a1.is_not=0 AND a2.is_not=0 AND dbxref.xref_dbname='MGI';
Immune response gene products and their associated cell component terms in human
We want genes annotated at-or-below immune reponse, together with their associated directly annotated terms in CC
SELECT term1_acc AS process_acc, term1_name AS process_name, gp_symbol, term2_acc AS cc_acc, term2_name AS cc_name FROM gene_product_with_term_pair_via_graph INNER JOIN species ON (gp_species_id=species.id) WHERE superterm_name='immune response' AND species.common_name='human' AND is_not=0 AND association2_is_not=0 AND term2_type='cellular_component';
Note we exclude negative annotations
Pairs of GO terms associated via IC evidence
The IC evidence code (Inferred by Curator) is typically used in conjunction with a GO ID as evidence (in the 'with' field). This is typically used when a curator makes a cross-ontology inference, eg if a process is known to always take place in the nucleus, then an additional annotation is made to nucleus with the process GO ID in the evidence with column.
SELECT DISTINCT term.acc, term.term_type, term.name, term2.acc AS acc2, term2.term_type AS term_type2, term2.name AS name2 FROM IC_evidence AS ic INNER JOIN evidence_dbxref AS ex ON (ic.id=evidence_id) INNER JOIN dbxref ON (ex.dbxref_id=dbxref.id) INNER JOIN association AS a ON (a.id=ic.association_id) INNER JOIN term ON (a.term_id=term.id) INNER JOIN term AS term2 ON (concat(xref_dbname,':',xref_key) = term2.acc);
(See also the view term_pair_by_IC)
Note the slightly contorted way of getting the second term via a mysql concat; this is because the evidence with field is stored in the dbxref table which splits the bipartite identifier; on the other hand the term table stores this in a single column.
Gene products that support other associations
The following query finds associations that are based on an ISS from FlyBase
SELECT px_src.xref_dbname AS src_db, px_src.xref_key AS src_acc, p_src.symbol AS src_symbol, px_der.xref_dbname AS der_db, px_der.xref_key AS der_acc, p_der.symbol AS der_symbol, a_der.is_not AS der_is_not, aqt_der.name AS der_qual, t_der.acc AS der_term_acc, t_der.name AS der_term_name, e_der.code AS der_code FROM gene_product AS p_src INNER JOIN dbxref AS px_src ON (p_src.dbxref_id=px_src.id) INNER JOIN evidence_dbxref AS ex_der ON (ex_der.dbxref_id=p_src.dbxref_id) INNER JOIN evidence AS e_der ON (e_der.id=ex_der.evidence_id) INNER JOIN association AS a_der ON (a_der.id=e_der.association_id) INNER JOIN term AS t_der ON (a_der.term_id=t_der.id) INNER JOIN gene_product AS p_der ON (a_der.gene_product_id=p_der.id) INNER JOIN dbxref AS px_der ON (p_der.dbxref_id=px_der.id) LEFT OUTER JOIN association_qualifier AS aq_der ON (a_der.id=aq_der.association_id) LEFT OUTER JOIN term AS aqt_der ON (aqt_der.id=aq_der.term_id) WHERE px_src.xref_dbname='FB' AND e_der.code='ISS';
Suspect ISS assignments
An association with an ISS evidence code should be accompanied by the ID of the source gene product from which the annotation is transferred. If an annotation to term A is derived from an annotation to term B in an orthologous gene product then we would expect A to subsume B.
SELECT px_src.xref_dbname, px_src.xref_key, p_src.symbol, px_der.xref_dbname, px_der.xref_key, p_der.symbol, t_der.name, a_der.is_not, aqt_der.name AS der_qual, e_der.code FROM gene_product AS p_src INNER JOIN dbxref AS px_src ON (p_src.dbxref_id=px_src.id) INNER JOIN evidence_dbxref AS ex_der ON (ex_der.dbxref_id=p_src.dbxref_id) INNER JOIN evidence AS e_der ON (e_der.id=ex_der.evidence_id) INNER JOIN association AS a_der ON (a_der.id=e_der.association_id) INNER JOIN term AS t_der ON (a_der.term_id=t_der.id) INNER JOIN gene_product AS p_der ON (a_der.gene_product_id=p_der.id) INNER JOIN dbxref AS px_der ON (p_der.dbxref_id=px_der.id) LEFT OUTER JOIN association_qualifier AS aq_der ON (a_der.id=aq_der.association_id) LEFT OUTER JOIN term AS aqt_der ON (aqt_der.id=aq_der.term_id) WHERE code = 'ISS' AND NOT EXISTS (SELECT a_src.id FROM association AS a_src INNER JOIN graph_path AS pth ON (pth.term2_id=a_src.term_id) WHERE pth.term1_id=a_der.term_id AND a_src.gene_product_id=p_src.id);
Terms uniquely found in a species
SELECT species.ncbi_taxa_id, species.genus, species.species, species.common_name, t.acc, t.name, gpc.product_count AS num_gps_only_found_in_taxon FROM term AS t INNER JOIN gene_product_count AS gpc ON (gpc.term_id=t.id) INNER JOIN species ON (species.id=gpc.species_id) WHERE NOT EXISTS (SELECT * FROM gene_product_count AS gpc2 WHERE gpc2.term_id=t.id AND gpc2.species_id != gpc.species_id AND gpc2.product_count > 0) AND gpc.product_count > 10 ORDER BY gpc.product_count DESC;
This finds terms that are found only in one taxon. Note that as annotation is never complete, absence of annotation does not mean that instantiation of that term is only possible in that taxon. For example you will probably see C-elegans and "hermaphrodite genitalia development" near the top. This does not mean that there are no other 'naturally' hermaphoroditic species, just that there is a large model organism bias and no other model organisms exhibit this behaviour.
The results are in descending order of number of annotations. The threshold above filters out anything with < 10 genes annotated, to avoid too many results. The assumption is that somewhere below this number the results are meaningless as we expect to see certain distributions by chance (Obviously a detailed statistical model may be best, something like this may be possible in SQL using views; left as an exercise for the reader, email gohelp if you are interested)
Only species (leaf taxon nodes) are counted. This query uses the gene_product_count table, which means that the number given is for the total number of distinct gene product at OR BELOW that node. See the gene_product_count table documentation for details.
Note that the results will include some redundancy. If "fin development" is reported as being uniquely found in zebrafish, then "negative regulation of fin development" may also be found. Removing these redundancies is possible but actually leads to quite complex SQL that will execute less efficiently. This may not strictly be necessary, as ordering by the cumulative gene product count ensures that the more general unique terms are found first.
A possible extension is allow for arbitrary taxonomic groupings: e.g. show terms only found_in mammalia. This is possible using the nested set representation in the species table, left as an exercise for the reader for now.
Terms found almost uniquely in a species
SELECT species.ncbi_taxa_id, species.genus, species.species, species.common_name, t.acc, t.name, gpc.product_count AS num_gps_only_mostly_in_taxon, sum(gpc2.product_count) AS sum_gpc FROM term AS t INNER JOIN gene_product_count AS gpc ON (gpc.term_id=t.id) INNER JOIN species ON (species.id=gpc.species_id) INNER JOIN gene_product_count AS gpc2 ON (gpc2.term_id=t.id) WHERE gpc2.species_id IS NOT NULL AND gpc.product_count > 10 GROUP BY species.ncbi_taxa_id, species.genus, species.species, species.common_name, t.acc, t.name, gpc.product_count HAVING sum_gpc - num_gps_only_mostly_in_taxon BETWEEN 1 AND 3 ORDER BY gpc.product_count DESC;
You are strongly advised to read the documentation for the previous (similar, faster) query first!
Non-redundant annotations for a gene product
SELECT DISTINCT association.is_not, term.name, term.acc, term.term_type, gene_product.symbol AS gp_symbol, gene_product.symbol AS gp_full_name, dbxref.xref_dbname AS gp_dbname, dbxref.xref_key AS gp_acc FROM term INNER JOIN association ON (term.id=association.term_id) INNER JOIN gene_product ON (association.gene_product_id=gene_product.id) INNER JOIN dbxref ON (gene_product.dbxref_id=dbxref.id) WHERE symbol = 'Shh' and dbxref.xref_dbname='MGI' AND NOT EXISTS (SELECT a.id FROM association AS a INNER JOIN graph_path AS tc ON (a.term_id = tc.term2_id) WHERE a.gene_product_id = gene_product.id AND tc.term1_id = association.term_id AND tc.term1_id != tc.term2_id)
This finds annotations to any MGI gene product called 'Shh' for a term t where there does not already exist a separate annotation to the same gene product with any descendant term of t.
Internal Reports
count how many GO terms have the source GOC:BHF
SELECT COUNT(DISTINCT tx.term_id) FROM term_dbxref AS tx INNER JOIN dbxref AS x ON (x.id = tx.dbxref_id) WHERE tx.is_for_definition = 1 AND x.xref_key = 'BHF';
list all GO terms have the source GOC:BHF
SELECT t.acc, t.term_type, t.name FROM term AS t INNER JOIN term_dbxref AS tx ON (t.id=tx.term_id) INNER JOIN dbxref AS x ON (x.id = tx.dbxref_id) WHERE tx.is_for_definition = 1 AND x.xref_key = 'BHF';
all annotations directly to terms with the source GOC:BHF
This is only for direct annotations. order in descending order by date
SELECT g.symbol, s.genus, s.species, a.is_not, t.acc, t.name, t.term_type, a.assocdate FROM term AS t INNER JOIN term_dbxref AS tx ON (t.id=tx.term_id) INNER JOIN dbxref AS x ON (x.id = tx.dbxref_id) INNER JOIN association AS a ON (a.term_id = t.id) INNER JOIN gene_product AS g ON (a.gene_product_id=g.id) INNER JOIN species AS s ON (g.species_id=s.id) WHERE tx.is_for_definition = 1 AND x.xref_key = 'BHF' ORDER BY a.assocdate DESC;
annotation by year stats for source GOC:BHF
This is only for direct annotations. order in descending order by date
SELECT t.acc, t.name, substring(a.assocdate,1,4), count(DISTINCT g.id) AS num_genes FROM term AS t INNER JOIN term_dbxref AS tx ON (t.id=tx.term_id) INNER JOIN dbxref AS x ON (x.id = tx.dbxref_id) INNER JOIN association AS a ON (a.term_id = t.id) INNER JOIN gene_product AS g ON (a.gene_product_id=g.id) WHERE tx.is_for_definition = 1 AND x.xref_key = 'BHF' GROUP BY t.acc, t.name, substring(a.assocdate,1,4);
Database update information
What version of the database is this?
SELECT * FROM instance_data;
What is the date of the most recently created or checked annotation?
SELECT max(assocdate) FROM association;
What is the timestamp of the ontologies in the database?
SELECT source_path,FROM_UNIXTIME(source_mtime) AS file_last_modified_time,FROM_UNIXTIME(source_parsetime) AS file_parsed_time FROM source_audit WHERE source_path like 'gene_ontology%';
What is the timestamp of the annotations in the database?
SELECT source_path,FROM_UNIXTIME(source_mtime) AS file_last_modified_time,FROM_UNIXTIME(source_parsetime) AS file_parsed_time FROM source_audit WHERE source_path like 'gene_association%';
How many annotations are there more recent than 2008/01/01?
SELECT count(*) FROM association WHERE assocdate>20080101;
Note that this queries based on the annotation date, NOT the date of the whole file (see above)
How are annotations distributed by year?
SELECT substring(assocdate,1,4) AS yr,count(*) AS num_assocs FROM association GROUP BY substring(assocdate,1,4);