SWUG:Database changes 2007

From GO Wiki
Jump to: navigation, search

Database: New Features

These features have been added in 2007

Documentation

The documentation for the GO database as been revamped. See:

Schema documentation

This replaces the old documentation on godatabase.org (now redirected)

GOOSE

We have created GO Online SQL Environment (GOOSE). This was in response to increased demands for advanced queries that cannot be met by the AmiGO interface, which mostly caters to common queries.

GOOSE allows you to executed arbitrary SQL over any GO database mirror. Expertise in SQL and knowledge of the schema helps; however, GOOSE also includes standard query templates from the example queries page:

This means that GOOSE makes an ideal learning environment for intermediate users

Use of GOOSE will be simplified by the addition of views (see below)

SQL-Views

We have built a large library of SQL views to simplify querying the GO database. Views can be materialized for speed. This will allow us to make future versions of AmiGO faster.

SQL Views can be fairly powerful; for example we have some views for calculating the Information Content of a term based on the number of annotations (see Collaboration_with_MIT_GO-Engineering).

This means we can easily write queries to fetch for example the average information content broken down by database. The following is the IC of all annotations in the refG subset:

 SELECT * FROM avg_annotation_entropy_by_annotation_subset_and_source       ORDER BY avg_entropy;


 reference_genome        WB      9.7708568320593
 reference_genome        RefSeq  10.0933288773331
 reference_genome        ZFIN    10.6886457671244
 reference_genome        MGI     10.9633757349792
 reference_genome        TAIR    11.1110688320685
 reference_genome        UniProt 11.3526212071035
 reference_genome        RGD     11.6885303554913
 reference_genome        DDB     12.154186698224
 reference_genome        FB      12.6074013048053
 reference_genome        SGD     12.801913637563
 reference_genome        GeneDB_Spombe   13.2108204603711

Support for Database xref metadata

aka: GO.xrf_abs

Extensions to the db table

Term subset

GO Slims are now loaded into database

This will allow us to filter-by-subset in AmiGO. This is already being used in the new amigo map2slim interface

gene_product_subset

This allows us to make "slims" of gene products. The most important one will be the reference_genome slim - by tagging these it makes it easier to do analyses and filters on the refG subset.


Support for multi-species interaction Annotations

See GO Annotation Guide

The majority of gene products act within the organism that encoded them. However, sometimes gene products encoded by one organism can act on or in other organisms. For example in obligate parasitic species, almost all of their gene products will be interacting with another organism, their host. Interactions may also be between organisms of the same species: for example, the proteins used by bacteria to adhere to one another to form a biofilm.

For annotating gene products involved in these multi-organism interactions, there is a special set of biological process terms in the interaction between organisms node.

The species in the interaction can be recorded in an annotation by using terms from this node and entering two taxon IDs in the Taxon column. The first taxon ID should be that of the species encoding the gene product, and the second should be the taxon of the other species in the interaction. Where the interaction is between organisms of the same species, both taxon IDs should be the same. The taxon column of the annotation file is described in more detail in the annotation file format section.

This is stored in the database using the following new table:

Currently we lack multi-organism interaction annotation data

Support for association properties

  • Aka structured notes
  • Aka annotation-cross-products

See docs on wiki

Awaiting annotation

Consider/replaced_by tags

Stored in term2term_metadata table

Synonym types

Stored in synonym_category_id

Precomputed gene product counts, by species

Previously the gene_product counts were only pre-computed for the annotation database (FlyBase, UniProt, SGD). Many annotation databases cover >1 species.

counts are now pre-computed for species too:

gene_product_count table

Taxon hierarchy

The species table (which should truly be called the taxon table) has support for taxon hierarchies: species table

We use a nested set model

Later versions of the database will have this populated

Allows us to filter by taxa above the species level - eg kingdom, phylum, .. For example, filter by Viridiplantae