Software Group progress report for 2010

From GO Wiki
Revision as of 17:33, 5 January 2011 by Girlwithglasses

Note there are separate reports for the two supplements


Annotation Support


Production (Stuart, Gail, Ben, Mike C)

  1. Rebuilt goweb-dev (not replaced)
    • Upgraded OS from RH 32-bit -> RH 64-bit
    • Upgraded MySQL from 5.0 -> 5.1
  2. Replaced GO loading machine
    • (old) goad = PowerEdge 1850, 2x2.8GHz Dual Core Intel Xeon, 12GB RAM, 2x300GB disks
    • (new) claret = PowerEdge R610, 2x2.53GHz Quad Core Intel E5540 Xeon, 24GB RAM, 4x500GB disks (64bit OS)
      • successfully tested loading on claret
      • have also done initial testing on load-qfo replacement for loading sequences into godb
  3. Ordered 2 machines to replace GO frontends, goweb1/goweb2
    • These machines are expected to be in production by the end of 2010.
  4. During the move to new machines, the OS, MySQL, and other software were upgraded to the latest versions.
  5. Worked on testing and installing AmiGO 1.8 (which required a 64-bit OS for CLucene-based searching)
  6. Made several upgrades to GAF filtering script
    • update from GAF 1.0 -> 2.0
    • added the following feature_types
      • gene_product
      • polypeptide
    • added the following evidence codes: IMR, IRD
    • added qualifier 'rapid_divergence'
  7. Put in place script
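The GAF filtering changes in item 6 can be pictured with a minimal sketch. This is not the production script: the function names are hypothetical, and the allow-lists are an illustrative subset (the values named above plus a few standard GO values), not the full lists a real filter would carry. It assumes the GAF 2.0 layout of 17 tab-separated columns, with the qualifier in column 4, the evidence code in column 7 and the DB object type in column 12.

```python
# Illustrative subsets only; a real filter would carry the complete lists.
ALLOWED_OBJECT_TYPES = {"gene", "protein", "gene_product", "polypeptide"}
ALLOWED_EVIDENCE_CODES = {"IDA", "IMP", "IEA", "ISS", "IMR", "IRD"}
ALLOWED_QUALIFIERS = {"NOT", "contributes_to", "colocalizes_with",
                      "rapid_divergence"}

GAF2_COLUMN_COUNT = 17  # GAF 2.0 rows have 17 tab-separated columns


def check_gaf2_line(line):
    """Return a list of problems found in one GAF 2.0 annotation line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != GAF2_COLUMN_COUNT:
        return ["expected 17 columns, found %d" % len(cols)]
    problems = []
    # Column 4: zero or more qualifiers separated by '|'
    for qualifier in (q for q in cols[3].split("|") if q):
        if qualifier not in ALLOWED_QUALIFIERS:
            problems.append("unknown qualifier: " + qualifier)
    # Column 7: evidence code
    if cols[6] not in ALLOWED_EVIDENCE_CODES:
        problems.append("unknown evidence code: " + cols[6])
    # Column 12: DB object type (feature type)
    if cols[11] not in ALLOWED_OBJECT_TYPES:
        problems.append("unknown object type: " + cols[11])
    return problems


def filter_gaf(lines):
    """Yield header lines ('!' prefix) and annotation lines with no problems."""
    for line in lines:
        if line.startswith("!") or not check_gaf2_line(line):
            yield line
```

Lines that fail a check are dropped here; a real filter would more likely also log the problems returned by check_gaf2_line for the submitting group.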

Schema Overhaul

See also Database_Enhancement_ARRA_progress_report_for_2010

Schema Overhaul

  1. SQL schema creation into postgres through a java program
  2. generation of TSV files through Schema_Overhaul#OBO_Access_Layer except the following tables (in progress)
    • ontology_imports, ontology_subset, all_only_relationship, never_some_relationship, relation_chain
  3. Schema_Overhaul#OBO_Access_Layer is in progress
  4. loading of TSV files into postgres
  5. incremental update of the GOLD database (in progress)
  6. Command Line Interface is built
  7. Basic Admin servlet interface is built to run the db operations through a web interface (in progress)

(this may belong in a separate report - add here for now anyway)

Annotation QC



  • Overhaul with different layout, some organisational changes, more easily accessed menu navigation, vertical rhythm
  • New documentation for ontology relations, structure
  • Updating of tools list (hampered by lack of new tool submissions)

Workflow Support

  • Created a basic GO_Galaxy_Environment
    • integrated map2slim
    • integrated slim-creator
    • integrated enrichment tools
      • GO TermFinder
      • Ontologizer

MOOSE libraries

  • Ontology slimmer
  • Ontology and annotation slimmer (map2slim)
  • Added algorithms of transitive closure and transitive reduction for use by slimming scripts
  • Started some preliminary support for basic boolean logic (for creating bucket terms)
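For readers unfamiliar with the two graph algorithms mentioned above, here is a minimal sketch of how they support slimming. It is not the MOOSE implementation: the function names are hypothetical, the ontology is assumed to be an acyclic graph given as a dict from each term to its direct parents, and relationship types are ignored.

```python
def transitive_closure(parents):
    """Map each term to the set of all its ancestors (direct or indirect).

    Assumes the graph is acyclic; a cycle would recurse forever.
    """
    closure = {}

    def ancestors(term):
        if term not in closure:
            result = set()
            for parent in parents.get(term, ()):
                result.add(parent)
                result |= ancestors(parent)
            closure[term] = result
        return closure[term]

    for term in parents:
        ancestors(term)
    return closure


def transitive_reduction(parents):
    """Drop every direct edge that is already implied by a longer path."""
    closure = transitive_closure(parents)
    reduced = {}
    for term, direct in parents.items():
        implied = set()
        for parent in direct:
            implied |= closure[parent]  # ancestors reachable via this parent
        reduced[term] = set(direct) - implied
    return reduced


def map_to_slim(term, slim_terms, closure):
    """Return the most specific slim terms among a term and its ancestors."""
    candidates = (closure.get(term, set()) | {term}) & set(slim_terms)
    return {
        c for c in candidates
        # discard c if it is an ancestor of another candidate
        if not any(c in closure.get(other, set())
                   for other in candidates if other != c)
    }
```

The closure answers "which slim terms sit above this term?", and the reduction removes links made redundant once intermediate terms have been dropped from a slimmed ontology.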

Reference Genome Support

RefG in AmiGO

  • Loading Trees (SVEN)
    • Matching IDs
      • Currently ~64% of the IDs are being matched after the new QFO load.
      • Noticed that 1970 proteins are in more than one group (e.g. YEAST|SGD:S000006392|UniProtKB:Q06580). Informed Paul Thomas.
    • Report Pages
      • 2. Concurrent annotation: Code written, not part of load yet.
        • Some group names in the file provided by Pascale Gaudet are not in the group list. Informed Paul Thomas.
      • 8. 'Date comprehensively annotated' for groups that can provide this information: the source of this data has not yet been identified

  • Report to assess the GO annotation status of all PANTHER families and subfamilies based on annotations for all reference genome organism genes in the groups. Currently, the report is generated independently but, as part of the software overhaul, will be integrated with other parts of RefG software.


This software tool is now at version beta29. Many improvements have been made to its speed and functionality. Some improvements remain to be made, but beta29 is a 'working version', in that it can produce valid GAF files that can be loaded by the GO databases and the Model Organism Databases. We now generate reports on the annotation status of PAINT families. Those reports indicate how many species contain homologs in a given family, how many members of each family exist in every species, how many members have experimental annotations associated with them, the date a member of the family was last annotated, etc.
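The per-family figures listed above could be computed roughly as follows. This is a sketch only, not the report code: the input layout (one dict per family member, with a species code and a list of evidence-code/date pairs) and the function name are assumptions. Dates are assumed to be YYYYMMDD strings as in GAF files, so string comparison gives chronological order.

```python
from collections import Counter

# GO experimental evidence codes
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def family_report(members):
    """Summarise the annotation status of one family.

    members: list of dicts with keys
      'species'     -> species code, e.g. 'YEAST'
      'annotations' -> list of (evidence_code, 'YYYYMMDD') tuples
    """
    per_species = Counter(m["species"] for m in members)
    with_experimental = sum(
        1 for m in members
        if any(code in EXPERIMENTAL_CODES for code, _ in m["annotations"])
    )
    dates = [date for m in members for _, date in m["annotations"]]
    return {
        "species_count": len(per_species),
        "members_per_species": dict(per_species),
        "members_with_experimental": with_experimental,
        "last_annotated": max(dates) if dates else None,
    }
```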

Ontology Support


See Compositional_Term_Submission_Tool

  • Created initial prototype, in use by ontology group and annotators
  • Integrated autocomplete
  • Added involved_in template


Transition to OWL

See Transition_to_OWL

Integration of Catalytic (Enzyme) Activity Terms With Other DBs

Aim to represent enzyme reactions in the form "input: CHEBI:nnnnn" and "output: CHEBI:mmmmm".

Put into place a system that will allow automatic syncing of GO to other metabolic pathway resources, such as Reactome, EC, MetaCyc, KEGG, RHEA, etc.

Can then extend this to include pathway data.

Data sources

  • IntEnz XML (data from EC in XML format)
  • MetaCyc metabolic DB flat files
  • KEGG ligand (reactions/pathways) flat files
  • RHEA (chemical reactions using ChEBI terms) XML/biopax data

The Reactome mapping is currently generated manually by Reactome and sent to GO for integration, so it was omitted from this stage.

  • First parse/pass completed:
    • All GO terms with an EC, MetaCyc or KEGG ref examined.
    • All DBs had EC refs, so used these as the basis for an alignment.
    • Pulled out all EC numbers where the databases examined had only ONE reaction noted.
    • Checked reactions for inconsistencies between DBs (submitted quite a number of queries to MetaCyc and IntEnz, which involved either GO or the other database being corrected).
    • If all was well, added full set of xrefs (EC, MetaCyc, KEGG, RHEA) to the GO term, and rewrote the definition to use the ChEBI terms.
    • Results: new xrefs: 1520 RHEA, 1560 KEGG; many EC and MetaCyc refs corrected and updated. New terms also added to cover missing reactions.
    • Many reactions/enzyme terms examined by hand and corrected.
  • Proposed second pass: enzymes which catalyze multiple reactions.
    • Needs resolution of GO policy on multi-activity enzymes.
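The first-pass selection and consistency check described above can be sketched as follows. This is an illustration under stated assumptions, not the code that was run: the function names and data layout are hypothetical, each reaction is modelled as a pair (inputs, outputs) of frozensets of ChEBI IDs, and reaction direction is taken as given (a real comparison would need to handle reactions written in reverse).

```python
def single_reaction_ec_numbers(reactions_by_db):
    """Return EC numbers present in every database with exactly one reaction.

    reactions_by_db: {db_name: {ec_number: [reaction, ...]}}
    """
    shared = set.intersection(
        *(set(ec_map) for ec_map in reactions_by_db.values())
    )
    return sorted(
        ec for ec in shared
        if all(len(ec_map[ec]) == 1 for ec_map in reactions_by_db.values())
    )


def find_inconsistencies(reactions_by_db, ec_numbers):
    """Return {ec_number: {db_name: reaction}} where the databases disagree."""
    inconsistent = {}
    for ec in ec_numbers:
        variants = {db: ec_map[ec][0] for db, ec_map in reactions_by_db.items()}
        if len(set(variants.values())) > 1:
            inconsistent[ec] = variants  # needs a query to the source DBs
    return inconsistent
```

EC numbers that pass both steps are the ones for which the full set of xrefs could be added to the GO term; those returned by find_inconsistencies correspond to the cases queried with MetaCyc and IntEnz.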