RG: Software

Software for use with the Reference Genome project.

gp2protein files

1. The gp2protein file will be used as is. This tab-delimited file contains a header and two columns of data. Header lines, which are comment lines, begin with a '!' character. The first column is the gene ID (or gene product ID) used by the submitting project; the project's web page for that entry should be accessible using this ID. The second column contains accession numbers, not entry names, for protein sequences at NCBI or UniProtKB. Each data item in the columns is of the form DATABASE_ABBR:GENEID. Here are some examples:

SGD:S000000001	NCBI_NP:NP_009400

dictyBase:DDB0216437	UniProtKB:Q55H43

MGI:MGI:101775	UniProtKB:Q00609

The GENEID can include a ':', as in the MGI example. The DATABASE_ABBR should be the primary abbreviation, not a synonym, as listed in the GO.xrf_abbs file.
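
As a rough illustration of the format described above, the following Python sketch reads gp2protein lines, skips the '!' header lines, and splits each item on its first ':' only, so that IDs such as MGI:MGI:101775 stay intact. The function names (`split_xref`, `parse_gp2protein`) and the header text in the sample are made up for this example and are not part of any GO tool.

```python
def split_xref(item):
    """Split 'DATABASE_ABBR:ID' on the first ':' only (the ID may contain ':')."""
    db, sep, local_id = item.partition(":")
    if not sep or not db or not local_id:
        raise ValueError(f"not of the form DATABASE_ABBR:ID: {item!r}")
    return db, local_id


def parse_gp2protein(lines):
    """Return a list of ((gp_db, gp_id), (seq_db, seq_id)) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Header/comment lines start with '!'; blank lines are ignored.
        if not line.strip() or line.startswith("!"):
            continue
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"expected 2 tab-separated columns: {line!r}")
        pairs.append((split_xref(columns[0]), split_xref(columns[1])))
    return pairs


# The data lines are the examples from this page; the header line is invented.
example = [
    "! hypothetical header line",
    "SGD:S000000001\tNCBI_NP:NP_009400",
    "dictyBase:DDB0216437\tUniProtKB:Q55H43",
    "MGI:MGI:101775\tUniProtKB:Q00609",
]

for gp, seq in parse_gp2protein(example):
    print(gp, seq)
```

The sketch does not check DATABASE_ABBR against GO.xrf_abbs; a real loader would presumably want that check as well.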

2. All genes or gene products known within the organism's genome are to be included. A gene/gp that has not been annotated will still be included in the gp2protein file. If a protein accession number from NCBI or UniProtKB is not known, that gene/gp should not be included; once the sequence is available from the sequence databases, it should be added.

3. If multiple protein sequences are known, for example as the result of alternative splicing, they can all be included. In this case each alternative protein sequence should ideally have a unique ID assigned by the submitting project. However, if the alternative protein sequences have not been annotated, it is acceptable to include just one protein ID. In some cases this might be the longest amino acid sequence, or the canonical or best representative sequence as determined by the submitting project.

4. This file of IDs will be used to include protein sequences within the GO database. At this time, the GO database (both the GO Lite and GO Full versions) provides only proteins associated with non-IEA annotations.

5. The gp2protein files will be used to create FASTA files with a standard defline. The Gene Ontology project will thereby provide a consistent, complete collection of protein sequences in FASTA format. This will be the first time such a set of files has been available for all the model organisms.

6. The deflines will follow this grammar:

Header ::= '>' SeqID SPC ObjID SPC [GeneName] [SystematicName]
SeqID ::= SeqDB ':' LocalID
SeqDB ::= 'RefSeq' | ObjDB
ObjDB ::= 'UniProtKB' | ModDB
ModDB ::= 'FB' | 'SGD' | ...
GeneName ::= '"' AnyText '"'
SystematicName ::= '"' AnyText '"'
SPC ::= ' '

For example:

>NCBI_NP:NP_009400 SGD:S000000001 "TCC3" "YAL001C"

>UniProtKB:Q55H43 dictyBase:DDB0216437 "JC1V2_0_00003"

>UniProtKB:Q00609 MGI:MGI:101775 "Cd80"
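
To make the grammar concrete, here is a minimal Python sketch that assembles a defline from its parts. The function name `make_defline` is hypothetical. Two assumptions go beyond the grammar as written: a single space separates the two quoted names (as in the first example above), and no trailing space is emitted when both optional names are absent.

```python
def make_defline(seq_id, obj_id, gene_name=None, systematic_name=None):
    """Build '>SeqID ObjID ["GeneName"] ["SystematicName"]'.

    Assumption: parts are joined by single spaces, matching the examples
    on this page. If only systematic_name is given, the result cannot be
    told apart from a line carrying only a gene name.
    """
    parts = [">" + seq_id, obj_id]
    for name in (gene_name, systematic_name):
        if name:
            parts.append('"' + name + '"')
    return " ".join(parts)


print(make_defline("NCBI_NP:NP_009400", "SGD:S000000001", "TCC3", "YAL001C"))
print(make_defline("UniProtKB:Q00609", "MGI:MGI:101775", "Cd80"))
```

Called with the values from the examples above, the function reproduces the first and third example lines.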

Status Tracker

Current developmental status

Original Document

The original document describing the tracker: Media:Refgene_Database_V3.ppt. Please let others know about changes to this document.

API and Objects

  • Meta: a general object for general queries.
    • #target_genes -> int
    • species -> list of Species
    • get_gene_products(with_constraint?) -> list of GeneProduct
    • get_targets(with_constraint?) -> list of Target
    • get_orthologs(with_constraint?) -> list of Ortholog
  • Species: as GO, but with extras
    • homologs -> list of Ortholog
    • !homologs -> list of Ortholog
    • #homologs -> int
    • #!homologs -> int
    • #comprehensive -> int
    • common_name -> string
    • ncbi_taxa_id -> int
    • etc...
  • Evidence: as GO
    • etc...
  • Target: an object representing a target for the MODs
    • symbol
    • id
    • etc (and similar to GP)...
  • Ortholog: an object representing a curated gene product
    • status -> bool
    • date_complete -> date || undef
    • references_used -> list of string
    • references_outstanding -> list of string
    • etc (and similar to GP)...
  • GeneProduct: as GO
    • symbol -> string
    • species -> Species
    • full_name -> string
    • etc...
  • Paralog?
  • Xenolog?
  • Report: a set of functions to apply to groups of the above for page generation, automatic sanity checking would be nice.
    •  ???

The Target and Ortholog objects may be subclasses of a YTD superclass. This superclass will need to be able to read from and write to the database, and to hold temporary values in case of validation/sanity issues (or provide similar functionality). It may also be a subclass of GeneProduct.
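
One possible shape for this object model, sketched in Python purely for discussion. The class name `Tracked` and the methods `stage` and `commit` are invented here; the field names follow the list above. Database reads and writes are left out, and only the idea of holding temporary values until they pass validation is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Species:
    common_name: str
    ncbi_taxa_id: int


@dataclass
class GeneProduct:
    symbol: str
    species: Species
    full_name: str = ""


@dataclass
class Tracked(GeneProduct):
    """Hypothetical shared superclass for Target and Ortholog."""
    # Values waiting for validation, kept apart from the real attributes.
    pending: Dict[str, object] = field(default_factory=dict)

    def stage(self, name: str, value: object) -> None:
        """Record a new value without applying it yet."""
        if not hasattr(self, name):
            raise AttributeError(name)
        self.pending[name] = value

    def commit(self, validate: Callable[[str, object], bool] = lambda n, v: True) -> Dict[str, object]:
        """Apply all staged values if every one validates; return the rejects."""
        rejected = {n: v for n, v in self.pending.items() if not validate(n, v)}
        if not rejected:
            for n, v in self.pending.items():
                setattr(self, n, v)
            self.pending.clear()
        return rejected


@dataclass
class Target(Tracked):
    id: str = ""


@dataclass
class Ortholog(Tracked):
    status: bool = False
    date_complete: Optional[str] = None
    references_used: List[str] = field(default_factory=list)
    references_outstanding: List[str] = field(default_factory=list)


mouse = Species(common_name="mouse", ncbi_taxa_id=10090)
ortholog = Ortholog(symbol="Cd80", species=mouse)
ortholog.stage("status", True)   # not applied until commit()
print(ortholog.status, ortholog.commit(), ortholog.status)
```

Making `Tracked` inherit from GeneProduct follows the "maybe also a subclass of GeneProduct" idea; whether that is the right hierarchy is still open.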

Also, we're going to have to spend a bit of time making sure that login/authentication is done properly; the last thing we want is bots coming in and clicking on things.

Architecture

API: http://geneontology.svn.sourceforge.net/viewvc/geneontology/go-dev/gwt/src/org/bbop/client/RefGenomeService.java?view=markup


Refg-arch.png

The API is a RefG-centric facade over the OBDAPI. (Note that the OBDAPI is written for Java 1.5 and thus cannot be used directly in GWT.)

See http://www.berkeleybop.org/obd for docs on the OBDAPI

Parsers

Two parsers will also be necessary for the tool.

  • A target parser that converts a tab-delimited file to a list of Target. The file format looks like:
mod-id ???
  • An ortholog parser that converts a tab-delimited file to a list of Ortholog. The file format looks like:
gene-symbol gene-id reference-p date-complete #-ref-used outstanding-refs
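
A minimal sketch of the ortholog parser in Python, assuming one record per line with exactly the six tab-separated columns named above. The function name `parse_ortholog_rows` is hypothetical and the sample row is invented. Because the meaning and encoding of columns such as reference-p and outstanding-refs are not specified here, every value is kept as an uninterpreted string.

```python
import csv
import io

# Column names as given in the format line above.
ORTHOLOG_COLUMNS = [
    "gene-symbol", "gene-id", "reference-p",
    "date-complete", "#-ref-used", "outstanding-refs",
]


def parse_ortholog_rows(handle):
    """Read tab-delimited ortholog records into a list of dicts (all strings)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for line_number, fields in enumerate(reader, start=1):
        # Skip blank lines.
        if not fields or all(not f.strip() for f in fields):
            continue
        if len(fields) != len(ORTHOLOG_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(ORTHOLOG_COLUMNS)} "
                f"columns, got {len(fields)}"
            )
        rows.append(dict(zip(ORTHOLOG_COLUMNS, fields)))
    return rows


# Invented sample row; the values are placeholders, not real tracker data.
sample = io.StringIO("Cd80\tMGI:MGI:101775\ttrue\t2008-01-15\t3\tPMID:0000000\n")
for row in parse_ortholog_rows(sample):
    print(row["gene-symbol"], row["gene-id"])
```

Converting these string dicts into Ortholog objects would be a second step, once the column semantics are settled.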

Pipeline

Annotation_pipeline

Summary and Graph Views

A beta of the summary and graph views can currently be found here. It uses parts of a newer AmiGO framework written in Perl.

Please be aware that this can be somewhat slow (we'll move to pre-rendering if this isn't moved to a faster machine). Also, please be aware that in some cases your graph will not be immediately visible and that you need to zoom out a bit (graph not properly centered).

Usage and other documentation can be found at: AmiGO_Manual:_RG_Graphical_View

Current development

Some of the numbers and displays are a little opaque--hopefully, we can use these as a jumping-off point for improvement.

There is now also a separate page to keep track of the current integration development with AmiGO:

RG:_Software_:AmiGO

Parsing google spreadsheets

Current pipeline

Mary is in charge of the part of the pipeline that parses the Google spreadsheets, performs additional checks and ID fixes, and deposits them here:

This is incorporated into the weekly and monthly GO database builds; for example, it can be queried in GOOSE.

These are the tables populated:

Can be queried in AmiGO1.6beta:

Siddhartha has also written a Java parser for retrieving information directly from the parser.

This will also be included in the separate homology/tracking database

Future

Once the tracker interface goes live, Google spreadsheets will no longer be parsed: gdocs will be abandoned and the db will be the primary store.