Software Group progress report for 2010: Difference between revisions

From GO Wiki
 
(35 intermediate revisions by 9 users not shown)
Line 1: Line 1:
Note there are separate reports for the two supplements
* [[Database_Enhancement_ARRA_progress_report_for_2010]]
* [[CL_ARRA_progress_report_for_2010]]
== Management ==
== Management ==


Line 7: Line 13:
=== Database ===
=== Database ===


=== Production ===
=== Production (Stuart, Gail, Ben, Mike C) ===


BEN TO FILL IN
#Rebuilt goweb-dev (not replaced)
#*Upgraded OS from RH 32-bit -> RH 64bit
#*Upgraded mysql from 5.0 -> 5.1
#Replaced GO loading machine
#* (old) goad = PowerEdge 1850, 2x2.8GHz Dual Core Intel Xeon, 12GB RAM,  2x300GB disks
#* (new) claret = PowerEdge R610, 2x2.53GHz Quad Core Intel E5540 Xeon, 24GB RAM, 4x500GB disks (64bit OS)
#** successfully tested loading on claret
#** have also done initial testing on load-qfo replacement for loading sequences into godb
#Ordered 2 machines to replace GO frontends, goweb1/goweb2
#*These machines are expected to be in production by the end of 2010.
#While moving to the new machines, the OS, MySQL, and other software were upgraded to the latest versions.
#Tested and installed AmiGO 1.8 (requires a 64-bit OS for clucene-based searching)
#Made several upgrades to GAF filtering script
#* update from GAF 1.0 -> 2.0
#* added the following feature_types
#** gene_product
#** polypeptide
#* added the following evidence codes: IMR, IRD
#* added qualifier 'rapid_divergence'
# Put in place filter-paint-associations.pl script


=== Schema Overhaul ===
=== Schema Overhaul ===
See also [[Database_Enhancement_ARRA_progress_report_for_2010]]


[[Schema Overhaul]]
[[Schema Overhaul]]


SHAHID TO FILL IN
#SQL schema creation into postgres through a java program
#generation of TSV files through [[Schema_Overhaul#OBO_Access_Layer]] except the following tables (in progress)
#*ontology_imports, ontology_subset, all_only_relationship, never_some_relationship , relation_chain
#[[Schema_Overhaul#OBO_Access_Layer]] is in progress
#loading of TSV files into postgres
#incremental update of the GOLD database (in progress)
#Command Line Interface is built
#Basic Admin servlet interface is built to run the db operations through a web interface (in progress)


(this may belong in a separate report - add here for now anyway)
(this may belong in a separate report - add here for now anyway)
Line 21: Line 55:
=== Annotation QC ===
=== Annotation QC ===


* v1 of [[Taxon_Constraint_Check_Engine] in production. v2 (java rewrite) in progress
* v1 of [[Taxon_Constraint_Check_Engine]] in production. v2 (java rewrite) in progress
* [[Function_to_Process_Inference_Engine]] in production
* [[Function_to_Process_Inference_Engine]] in production
* [[Annotation_Rule_Engine]]
* [[Annotation_Rule_Engine]]
  AMELIA TO FILL IN
** current and proposed checks now captured in the annotation xml file


=== AmiGO ===
=== AmiGO ===
Line 33: Line 67:
=== Website ===
=== Website ===


AMELIA - ANYTHING HERE?
*Overhaul with different layout, some organisational changes, more easily accessed menu navigation, vertical rhythm
*New documentation for ontology relations, structure
*Updating of tools list (hampered by lack of new tool submissions)


=== Workflow Support ===
=== Workflow Support ===
Line 46: Line 82:
=== MOOSE libraries ===
=== MOOSE libraries ===


AMELIA TO FILL IN
* Ontology slimmer
* Ontology and annotation slimmer (map2slim)
* Added algorithms of transitive closure and transitive reduction for use by slimming scripts
* Started some preliminary support for basic boolean logic (for creating bucket terms)


== Reference Genome Support ==
== Reference Genome Support ==
Line 52: Line 91:
=== RefG in AmiGO ===
=== RefG in AmiGO ===


SVEN TO FILL IN - loading trees, gene info, integrated amigo report pages


SETH TO FILL IN - js phylo views, existing refg pages [[AmiGO_Phylotrees]] [[AmiGO_and_QuickGO_Integration]] SETH/TONY
* Loading Trees (SVEN)
** Matching IDs
*** Currently ~64% of the IDs are being matched after the new QFO load.
*** Noticed that 1970 proteins are in more than one group (e.g. YEAST|SGD:S000006392|UniProtKB:Q06580). Informed Paul Thomas.
** Report Pages
*** 2. Concurrent annotation: Code written, not part of load yet.
**** Some group names in the file provided by Pascale Gaudet are not in the group list. Informed Paul Thomas.
*** 8. 'Date comprehensively annotated' for groups that can provide this information: the source of this data has not yet been located
 
 
* js phylo views: work in progress on dev machine (nothing public yet, rapidly changing alpha versions). See: [[AmiGO_Phylotrees]].
* existing refg pages: to be dropped--no further work
* [[AmiGO_and_QuickGO_Integration]] SETH/TONY


MARY TO FILL IN
* Report to assess the GO annotation status of all PANTHER families and subfamilies based on annotations for all reference genome organism genes in the groups. Currently, the report is generated independently but, as part of the software overhaul, will be integrated with other parts of RefG software.


=== Paint ===
=== Paint ===


SUZI TO FILL IN
This software tool is now at version beta29. Many improvements have been made to the speed and functionality of the software. There are still some improvements to be made, but beta29 is a 'working version', in that it can produce valid GAF files that can be uploaded by the GO databases and the Model Organism Databases. We now generate reports on the annotation status of PAINT families. Those reports indicate how many species contain homologs in a given family, how many members of each family exist in every species, how many members have experimental annotations associated with them, the date a member of the family was last annotated, etc.


== Ontology Support ==
== Ontology Support ==
Line 69: Line 119:


* Created initial prototype, in use by ontology group and annotators
* Created initial prototype, in use by ontology group and annotators
* Integrated autocomplete
* Added involved_in template


=== OBO-Edit ===
=== OBO-Edit ===


See
*[http://wiki.geneontology.org/index.php/OBO-Edit_Release_Timeline Release Tracker]
 
*[http://sourceforge.net/tracker/?limit=25&func=&group_id=36855&atid=418257&assignee=&status=&category=&artgroup=&keyword=&submitter=&artifact_id=&assignee=&status=1&category=&artgroup=&submitter=&keyword=&artifact_id=&submit=Filter Bug tracker highlighting upcoming fixes based on priority]
AMINA TO FILL IN
*2010 fixes, features and updates: [[v2.1 fixes and updates]]


=== Transition to OWL ===
=== Transition to OWL ===


See [[Transition_to_OWL]]
See [[Transition_to_OWL]]
* Initiated plan for ontology support in next cycle [[Software_Group_2010_Future_Plans#Plan]]
* Initiated plan for ontology support in next cycle [[Software_Group_2010_Future_Plans#Plan]]
* first draft of obof1.4 guide http://www.geneontology.org/GO.format.obo-1_4.shtml
* first draft of obof1.4 guide http://www.geneontology.org/GO.format.obo-1_4.shtml
Line 86: Line 136:
* rewritten parser and obo2owl converter, 100% java http://code.google.com/p/oboformat/
* rewritten parser and obo2owl converter, 100% java http://code.google.com/p/oboformat/


===Integration of Catalytic (Enzyme) Activity Terms With Other DBs===
Aim to represent enzyme reactions in the form "input: CHEBI:nnnnn" and "output: CHEBI:mmmmm".
Put into place a system that will allow automatic syncing of GO to other metabolic pathway resources, such as Reactome, EC, MetaCyc, KEGG, RHEA, etc.
Can then extend this to include pathway data.
Data sources
*IntEnz XML (data from EC in XML format)
*MetaCyc metabolic DB flat files
*KEGG ligand (reactions/pathways) flat files
*RHEA (chemical reactions using ChEBI terms) XML/biopax data
Reactome mapping currently generated manually by Reactome and sent to GO for integration, so was omitted from this stage.
*First parse/pass completed:
**All GO terms with an EC, MetaCyc or KEGG ref examined.
**All DBs had EC refs, so used these as the basis for an alignment.
**Pulled out all EC numbers where the databases examined had only ONE reaction noted.
**Checked reactions for inconsistencies between DBs (submitted quite a number of queries to MetaCyc and IntEnz, which resulted in corrections to either GO or the other database).
**If all was well, added full set of xrefs (EC, MetaCyc, KEGG, RHEA) to the GO term, and rewrote the definition to use the ChEBI terms.
**Results: new xrefs: 1520 RHEA, 1560 KEGG; many EC and MetaCyc refs corrected and updated. New terms also added to cover missing reactions.
**Many reactions/enzyme terms examined by hand and corrected.
*Proposed second pass: enzymes which catalyze multiple reactions.
**Needs resolution of GO policy on multi-activity enzymes.




[[Category:Reports]]
[[Category:Reports]]

Latest revision as of 18:33, 5 January 2011

Note there are separate reports for the two supplements

Management

Annotation Support

Database

Production (Stuart, Gail, Ben, Mike C)

  1. Rebuilt goweb-dev (not replaced)
    • Upgraded OS from RH 32-bit -> RH 64bit
    • Upgraded mysql from 5.0 -> 5.1
  2. Replaced GO loading machine
    • (old) goad = PowerEdge 1850, 2x2.8GHz Dual Core Intel Xeon, 12GB RAM, 2x300GB disks
    • (new) claret = PowerEdge R610, 2x2.53GHz Quad Core Intel E5540 Xeon, 24GB RAM, 4x500GB disks (64bit OS)
      • successfully tested loading on claret
      • have also done initial testing on load-qfo replacement for loading sequences into godb
  3. Ordered 2 machines to replace GO frontends, goweb1/goweb2
    • These machines are expected to be in production by the end of 2010.
  4. While moving to the new machines, the OS, MySQL, and other software were upgraded to the latest versions.
  5. Tested and installed AmiGO 1.8 (requires a 64-bit OS for clucene-based searching)
  6. Made several upgrades to GAF filtering script
    • update from GAF 1.0 -> 2.0
    • added the following feature_types
      • gene_product
      • polypeptide
    • added the following evidence codes: IMR, IRD
    • added qualifier 'rapid_divergence'
  7. Put in place filter-paint-associations.pl script
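
The kind of check described in item 6 can be sketched as follows. This is a minimal illustration, not the project's actual filtering script: the function name `filter_gaf_lines` and its parameters are hypothetical, and the allowed sets are passed in by the caller rather than taken from any official list. It assumes the standard GAF 2.0 layout of 17 tab-separated columns, with the qualifier in column 4, the evidence code in column 7 and the DB object type in column 12.

```python
# Hypothetical sketch of a GAF 2.0 line filter; not the real GO filtering script.
# 0-based positions of the columns checked here (GAF 2.0 has 17 columns).
QUALIFIER, EVIDENCE, OBJECT_TYPE = 3, 6, 11
GAF2_COLUMNS = 17


def filter_gaf_lines(lines, allowed_evidence, allowed_types, allowed_qualifiers):
    """Yield header lines and annotation lines that pass all three checks."""
    for line in lines:
        if line.startswith("!"):
            # Comment/header lines are passed through untouched.
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != GAF2_COLUMNS:
            continue  # not a well-formed GAF 2.0 row
        if cols[EVIDENCE] not in allowed_evidence:
            continue
        if cols[OBJECT_TYPE] not in allowed_types:
            continue
        # The qualifier column may be empty or hold several values joined by "|".
        qualifiers = [q for q in cols[QUALIFIER].split("|") if q]
        if any(q not in allowed_qualifiers for q in qualifiers):
            continue
        yield line
```

A caller would supply the full permitted sets; the values named in this report (evidence codes IMR and IRD, feature types gene_product and polypeptide, qualifier rapid_divergence) would simply be added to whatever sets were already accepted.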

Schema Overhaul

See also Database_Enhancement_ARRA_progress_report_for_2010

Schema Overhaul

  1. SQL schema creation into postgres through a java program
  2. generation of TSV files through Schema_Overhaul#OBO_Access_Layer except the following tables (in progress)
    • ontology_imports, ontology_subset, all_only_relationship, never_some_relationship , relation_chain
  3. Schema_Overhaul#OBO_Access_Layer is in progress
  4. loading of TSV files into postgres
  5. incremental update of the GOLD database (in progress)
  6. Command Line Interface is built
  7. Basic Admin servlet interface is built to run the db operations through a web interface (in progress)

(this may belong in a separate report - add here for now anyway)

Annotation QC

AmiGO

Website

  • Overhaul with different layout, some organisational changes, more easily accessed menu navigation, vertical rhythm
  • New documentation for ontology relations, structure
  • Updating of tools list (hampered by lack of new tool submissions)

Workflow Support

  • Created a basic GO_Galaxy_Environment
    • integrated map2slim
    • integrated slim-creator
    • integrated enrichment tools
      • GO TermFinder
      • Ontologizer

MOOSE libraries

  • Ontology slimmer
  • Ontology and annotation slimmer (map2slim)
  • Added algorithms of transitive closure and transitive reduction for use by slimming scripts
  • Started some preliminary support for basic boolean logic (for creating bucket terms)
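
As an illustration of the two algorithms mentioned above, here is a minimal sketch in Python (the MOOSE libraries themselves are not shown in this report, so the function names and the data layout are assumptions). The graph is a dict mapping each term to the set of its direct parents, and it is assumed to be acyclic, as an ontology's is_a hierarchy normally is.

```python
def transitive_closure(edges):
    """Map every node to the set of all its ancestors.

    edges: dict mapping node -> set of direct parents (must be acyclic).
    """
    closure = {}

    def ancestors(node):
        # Memoised depth-first walk up the graph.
        if node not in closure:
            result = set()
            for parent in edges.get(node, ()):
                result.add(parent)
                result |= ancestors(parent)
            closure[node] = result
        return closure[node]

    nodes = set(edges) | {p for parents in edges.values() for p in parents}
    for node in nodes:
        ancestors(node)
    return closure


def transitive_reduction(edges):
    """Drop every direct link that is implied by a longer path."""
    closure = transitive_closure(edges)
    reduced = {}
    for node, parents in edges.items():
        # A direct parent is redundant if it is also an ancestor of
        # another direct parent of the same node.
        reduced[node] = {
            p for p in parents
            if not any(p in closure[q] for q in parents if q != p)
        }
    return reduced
```

For slimming, the closure answers "which slim terms is this term under?", while the reduction removes the redundant links left behind once non-slim terms have been dropped.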

Reference Genome Support

RefG in AmiGO

  • Loading Trees (SVEN)
    • Matching IDs
      • Currently ~64% of the IDs are being matched after the new QFO load.
      • Noticed that 1970 proteins are in more than one group (e.g. YEAST|SGD:S000006392|UniProtKB:Q06580). Informed Paul Thomas.
    • Report Pages
      • 2. Concurrent annotation: Code written, not part of load yet.
        • Some group names in the file provided by Pascale Gaudet are not in the group list. Informed Paul Thomas.
      • 8. 'Date comprehensively annotated' for groups that can provide this information: the source of this data has not yet been located


  • Report to assess the GO annotation status of all PANTHER families and subfamilies based on annotations for all reference genome organism genes in the groups. Currently, the report is generated independently but, as part of the software overhaul, will be integrated with other parts of RefG software.
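
The two checks reported under "Matching IDs" (the share of IDs matched, and proteins that appear in more than one group) can be sketched as below. This is a hypothetical illustration only: the function names and the input shapes (a dict of group name to member IDs, and a collection of known IDs) are assumptions, not the loader's real data structures.

```python
from collections import defaultdict


def matched_fraction(tree_ids, known_ids):
    """Fraction of IDs from the trees that are also present in known_ids."""
    tree_ids = list(tree_ids)
    if not tree_ids:
        return 0.0
    known = set(known_ids)
    return sum(1 for i in tree_ids if i in known) / len(tree_ids)


def proteins_in_multiple_groups(groups):
    """Return {protein: set of groups} for proteins found in more than one group.

    groups: dict mapping group name -> iterable of protein IDs.
    """
    membership = defaultdict(set)
    for group_id, members in groups.items():
        for protein in members:
            membership[protein].add(group_id)
    return {p: gs for p, gs in membership.items() if len(gs) > 1}
```

Run over the loaded families, the second function would list entries such as the YEAST example above together with the groups they belong to.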

Paint

This software tool is now at version beta29. Many improvements have been made to the speed and functionality of the software. There are still some improvements to be made, but beta29 is a 'working version', in that it can produce valid GAF files that can be uploaded by the GO databases and the Model Organism Databases. We now generate reports on the annotation status of PAINT families. Those reports indicate how many species contain homologs in a given family, how many members of each family exist in every species, how many members have experimental annotations associated with them, the date a member of the family was last annotated, etc.

Ontology Support

TermGenie

See Compositional_Term_Submission_Tool

  • Created initial prototype, in use by ontology group and annotators
  • Integrated autocomplete
  • Added involved_in template

OBO-Edit

Transition to OWL

See Transition_to_OWL

Integration of Catalytic (Enzyme) Activity Terms With Other DBs

Aim to represent enzyme reactions in the form "input: CHEBI:nnnnn" and "output: CHEBI:mmmmm".

Put into place a system that will allow automatic syncing of GO to other metabolic pathway resources, such as Reactome, EC, MetaCyc, KEGG, RHEA, etc.

Can then extend this to include pathway data.

Data sources

  • IntEnz XML (data from EC in XML format)
  • MetaCyc metabolic DB flat files
  • KEGG ligand (reactions/pathways) flat files
  • RHEA (chemical reactions using ChEBI terms) XML/biopax data

Reactome mapping currently generated manually by Reactome and sent to GO for integration, so was omitted from this stage.

  • First parse/pass completed:
    • All GO terms with an EC, MetaCyc or KEGG ref examined.
    • All DBs had EC refs, so used these as the basis for an alignment.
    • Pulled out all EC numbers where the databases examined had only ONE reaction noted.
    • Checked reactions for inconsistencies between DBs (submitted quite a number of queries to MetaCyc and IntEnz, which resulted in corrections to either GO or the other database).
    • If all was well, added full set of xrefs (EC, MetaCyc, KEGG, RHEA) to the GO term, and rewrote the definition to use the ChEBI terms.
    • Results: new xrefs: 1520 RHEA, 1560 KEGG; many EC and MetaCyc refs corrected and updated. New terms also added to cover missing reactions.
    • Many reactions/enzyme terms examined by hand and corrected.
  • Proposed second pass: enzymes which catalyze multiple reactions.
    • Needs resolution of GO policy on multi-activity enzymes.
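
The first-pass selection described above (align on EC number, keep only the EC numbers with a single reaction, then look for disagreements) can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure actually used: the function names are hypothetical, an EC number is assumed to qualify only if it appears in every database supplied with exactly one reaction in each, and reactions are compared as plain strings, whereas the real comparison would need normalised ChEBI participants.

```python
def single_reaction_ecs(db_reactions):
    """Select EC numbers that have exactly one reaction in every database.

    db_reactions: dict mapping database name -> dict of EC number -> list of
    reaction strings. Returns {ec: {database name: reaction}}.
    """
    if not db_reactions:
        return {}
    # Only EC numbers present in all databases can be aligned.
    shared = set.intersection(*(set(m) for m in db_reactions.values()))
    aligned = {}
    for ec in sorted(shared):
        if all(len(db_reactions[db][ec]) == 1 for db in db_reactions):
            aligned[ec] = {db: db_reactions[db][ec][0] for db in db_reactions}
    return aligned


def inconsistent_ecs(aligned):
    """EC numbers whose single reaction differs between databases."""
    return [ec for ec, per_db in aligned.items()
            if len(set(per_db.values())) > 1]
```

EC numbers returned by `inconsistent_ecs` correspond to the cases that were queried with the source databases; the rest are the candidates for adding the full set of xrefs.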