2007-09 SAB minutes
- 1 Ontology Development
- 2 Reference Genome
- 3 Annotation Outreach
- 4 User Support and Providing Tools and Resources for Users
David Hill presented the report on the major areas of development over this past year. One of the first major discussion points was how to continue to have content meetings in light of limited resources.
The content meetings are extremely productive in that preliminary work is put into training the scientists about ontology development and distributing proposed ontology changes among the group before the meeting is held. During the face-to-face meetings, there is a lot of discussion about providing a consensus view of the ontology and making the changes. These meetings are 10-15 people and cost approximately $15,000.
Submitting an R13 meeting/conference grant was one suggestion. Another suggestion was active outreach to industry. We could write to heads of bioinformatics and pharma groups and ask which areas they want expanded. In exchange, we would propose that they host a meeting working on that area. Reaching companies through PR and marketing was strongly discouraged. There is an EBI industry group that Michael Ashburner started that might be tapped for contacts.
The work on cross-products was commended because it moves GO further towards being computed upon.
In order to increase our efficiency in structural work, Larry Hunter suggested collaborating more with NCBO and connecting with the greater ontology world. Chris Mungall pointed out that GO is providing resources that allow it to be connected to the greater ontology world, such as publishing in RDF for the semantic web and providing a mapping to OWL. David Botstein cautioned against representing ontologies for the sake of theoretical frameworks; GO should remain grounded in biology and content. There was discussion about the production status of some of the tools provided by NCBO at this point in time.
Larry Hunter asked about quality control metrics. John Day-Richter did a demo of OBO-Edit to show several built-in tools for maintaining quality control on the ontology. OBO-Edit has a reasoner which identifies errors the ontology editor has made. In addition, editors can add their own filters to identify errors, such as disjointness errors. OBO-Edit is used to edit ontologies by GO as well as by other ontology groups, such as the Jackson Laboratory phenotype curators.
Rex Chisholm presented the progress of the Reference Genome group after approximately a year’s worth of work. The focus has been on “comprehensive” annotation because it is possible.
Larry Hunter asked how many papers are linked to a gene. Because the processes for obtaining the literature sets are so different, the individual database groups report the numbers on the Google spreadsheet.
The AZ representative asked if any text mining processes have been incorporated in order to identify appropriate papers. Although MODs have had some collaboration with text mining groups, the papers are all manually reviewed. Larry suggested that the MODs could work with such groups to help identify papers in a common way.
There was significant discussion about how the priorities should be set for the list of genes. Since currently OMIM and the RGD disease portal are being used to help set priorities, there may be fewer genes to annotate for non-mammalian organisms. Simon suggested prioritizing genes that have been identified in the recent genome-wide association studies. Many of these have not been annotated yet in GO. Another suggestion from Larry Hunter was identifying metabolic disease genes.
In order to increase the number of genes annotated, Larry suggested that genes with fewer papers be selected. Rex pointed out the counter-argument that these genes may not be of general interest. However, this could help those doing high-throughput annotations. Judy pointed out that many organisms do provide breadth of annotation using IEA annotation and these data are available to the high-throughput community. In response to the concern about the total number of genes, David Hill pointed out that all the genes addressed in a publication are annotated during the process of curating for the Reference Genome gene. These genes, however, are not tracked but contribute to the overall goal of providing GO annotations based on experimental literature.
There was some discussion about the literature review process used to define “comprehensive” annotation. David Botstein suggested that a review is used for highly studied genes and the primary literature is used for genes with fewer papers. The caveat to this suggestion is that the experimental system is not clearly stated in review articles. Since the goal of the Reference Genome project is to capture experimental data in that organism, David Hill pointed out that the review is often a good place to start in order to identify the relevant publications that can be used for an annotation. Larry suggested we do an experiment to see if we can reach “comprehensive” annotation for a gene using ~25 publications.
Rex then proceeded to describe the need to identify the ortholog in the respective organism because the human gene is the one on the list. Currently, individual curators identify them because they are the best suited to understand how their organism’s genome compares to the other genomes. Tools such as YOGY, INPARANOID, OrthoMCL, TreeFam, and Homologene are used in order to find orthologs and not just domain conservation. The method and ortholog are recorded but curators do not mark when they feel the assertion is wrong. In order to save some time for the curators, Larry suggested that a decision tree that reflects the curators’ decision process could be made into a tool. There was a little discussion about the software needs of the Reference Genome group and how the list of genes, as well as the ortholog calls made by the curators, will be integrated with AmiGO. Integrating the list of genes and an ability to search for Reference Genome genes in AmiGO will be important in publicizing this project.
Other aspects of the Reference Genome project briefly touched upon were curation consistency and how curation drives ontology development. Midori and David reiterated that ontology requests from the Reference Genome project are made high priority and there are very few requests left open.
The discussion on Reference Genomes finished with a discussion of goals for this upcoming year. The majority of the conversation focused on continuing to make progress on the number of genes annotated and the strategy for identifying target genes. Mike Cherry again reminded the advisory board that there are other genes being annotated during this time independent of the reference genome effort.
With regard to identifying target genes, Barry inquired whether we have been communicating with potential users on the side of clinical medicine, especially those working on disease models, in order to help us prioritize which diseases to focus on. In addition, the individual user communities of each of the model organisms can provide a feedback mechanism. Judy remarked that we do have many outlets to take advantage of feedback to help us prioritize.
David Botstein commented that we should refine how we say we use OMIM. Not all genes in OMIM are well characterized and not all diseases in OMIM impact a significant percent of the population. Before we publicize the Reference Genome project, we should identify the total number of genes from OMIM that fit our criteria, whether it be diseases that are well characterized or diseases with the highest number of afflicted people, etc. This may actually be a manageable subset of all OMIM records.
Other suggestions for identifying gene lists were the ENCODE set, key signaling pathways and other biochemical pathways. Not all these suggestions are mutually exclusive so a handful of genes can be picked from all these lists. There was some discussion on how it would be interesting to see if annotations from these other lists produce similar types of results as those from the disease gene list.
Another issue confronting the Reference Genome Project is resources: GOA curators do not get a break because they always have the largest number of genes to curate (since it’s all human) and these genes have the most literature. There was some discussion again about how little effort has been put into parsing the human literature in a sensible way.
Another issue, brought up by Craig in the general discussion period, is how curators keep track of new information after a gene is marked curated. Rex answered that it’s a distributed process; each group has its own method; people track new papers and curate them.
Larry also pointed out the importance of metrics: we need to think about what we are measuring and why. It might make sense to have a subcommittee look into this closely to find out what numbers are needed and how to present them to the world; it may also be worth tracking hours and money spent. Judy suggested presenting this on the wiki or other GO web pages. Rex and Judy pointed out that this was the beginning of the effort; we now have a clearer idea of the important parameters and hope to make a lot of progress next year in capturing the metrics.
Barry pointed out that the Reference Genome effort is a nice framework in which one can measure progress and figure out what and how to improve, but he does not see how it is improving the ontology itself. David Hill pointed out that SourceForge requests generated by this group are ‘high quality’ term requests, i.e. they then get used for annotations. It was agreed that it is difficult to determine how useful a GO term is.
Talk slides: Media:outreach_princeton.ppt
Presented by Jennifer Deegan. The purpose of the annotation outreach group is to get more groups to annotate and see if we can get their annotation in the GO database. Accomplished within the past year:
- Added SOP targeted to new users on the GO website, which contains flowcharts to help do IEA, ISS and manual annotations
- Posted a list of the meetings and conferences the Outreach group has attended on the wiki
- Produced a DAG map of the annotation groups we are in contact with
The group would like feedback from the SAB with respect to
- How to help people with no funding
- How to help people who don’t like the GO (function/process links missing; too many obsoletes)
- How to help people who don’t want to use the full complexity of GO (too many terms; the use of references and evidence codes) and would like to annotate a genome in two days (which is being done by some groups)
The SAB wondered about the validity of annotations done so quickly. Barry asked whether those annotations are ‘fairly correct’: are they incorrect or just high level? If they are generally correct, maybe we can use some of their strategies. Jennifer said that the annotations tend to be incomplete. Mike Cherry pointed out that we cannot really expect much more from those groups without dedicated curators: if the ‘quick and dirty’ approach is sufficient for their purposes, it’s hard to push them to do more; their resources for annotation are limited. Rex mentioned we have a ‘mentors system’ in which experienced curators mentor new groups; however, so far those collaborations have not returned any data to GO, so maybe it’s not worth the investment? Craig NM said that we need to define the qualities required to call an annotation ‘useful’. Larry Hunter added that we need to consider other models for integrating data; one is to use the ‘weak’ annotation these groups provide; another is to hold ‘annotation jamborees’ where experts give input and expert curators then use that data (this has been done, but the GOC is concerned because the annotations resulting from the jamborees rarely get updated). Judy suggested that an alternative to genome-centered jamborees would be to meet with an expert (for example in diabetes) and annotate a set of genes with their help.
David (via Skype) asked whether adding the Function/Process links in the ontology would help curation.
There was also a more general discussion about the funding of sequencing projects without support for annotation. That’s outside the control of the GOC, but we would like to be able to provide data for better IEAs. Larry asked what it costs to curate a genome: if a researcher wants funding to do that, how much money does he/she need? Nobody was able to give a precise figure; there are too many factors (number of papers, etc.). It was also suggested that we might put text on the GO website with approximate numbers to be included in a grant application that a genome group is making to cover sequencing costs. If they had the information readily to hand they might easily include a request for annotation support too. It was further suggested that we might put costs on the GO website for having the GO Consortium do genome annotation in-house for groups that require it. This could be posted as a service that we offer with defined financial costs and benefits.
User Support and Providing Tools and Resources for Users
User Advocacy: Media:GO-User-Advocacy-SAB2007-final.ppt
Presented by Eurie Hong: The purpose of the User advocacy group is to provide communication channels between the GOC and the research community. Tools put in place to that end:
- quarterly newsletter that talks about papers that use GO; gene of the quarter; new software development; sent out to GO friends and GO databases; also MODs highlight newsletter
- GO help: group of curators that answer emails to GO help; either answer or forward to the appropriate group; ranges from 20-140 emails (all emails; queries and responses), lately ~ 70 mails per month; number of email requests increased since we switched to the web form. The emails usually get answered within 24 hours, but the resolution of the problem can take longer. Larry points out we should keep track of the actual number of requests, not just the total number of mails.
- website development
- documentation (FAQ, AmiGO help, minimal (third-party) tool standards to organize the tools page better); workshops (see slides)
- We will develop tutorials with the help of Moodle group (moodle.org)
- We will no longer hold general Users meetings; the audience was so wide-ranging that the meetings were probably not helping anyone. In the future we’ll hold more focused meetings, for example maintaining the MGED meeting.
- We’ll make a set of core slides for presentations to help communicate the core GO ideas better
Suggestions from the SAB:
- Craig: you could also make videos.
- Larry: you should define the communities you are trying to target: you may want to explain/distinguish the ‘new’ GO users better, since this is a really varied group, to help target their needs; then there are also pharmaceutical companies and bioontology people [Outreach group has started doing that, see SOPs]. Barry points out it would also be helpful to know how much the actual annotations are used (GOC thinks this is hard to figure out); Larry suggests tracking papers using GO and defining a set of use cases (in addition to microarrays). Judy notes that we do have some of that data (http://geneontology.org/cgi-bin/biblio.cgi): we have close to 1500 papers, classified in broad categories, about ½ of which are microarrays. David State is concerned that GO usage drives its own development; for example, GO doesn’t do pathology, therefore there are no users working on pathology.
- Barry: one common criticism is that the GO is full of mistakes; critics need to send us suggestions for corrections; we should recruit people who could send corrections. Craig suggests using a wiki-like medium to allow people to make suggestions. Suzi replied that there is a wiki in place that can be used like that.
- The discussion about errors in the GO led to the remark by Larry that people were not really aware of GO’s dynamism; the AZ person pointed out that people were also turned off by too much dynamism and terms becoming obsolete.
Presented by Ben Hitz.
The SGD group at Stanford is responsible for hosting various production aspects of GO. This includes maintenance and hosting of the geneontology.org and godatabase.org web sites, hosting of the AmiGO ontology browsing server, periodic loading, export, and FTP hosting of the GO database, and filtering of gene association files supplied by members of the consortium. Usage of the primary websites has been steady throughout the year, with geneontology.org and godatabase.org receiving 80,000 and 120,000 visits per month, respectively (an increase of ~20% over 2006).
Software & Databases
- AmiGO: Maintenance and support of the production AmiGO web site has been provided by SGD beginning on May 4, 2005. AmiGO was updated with new features in February 2007. We have installed AmiGO on a new development server to allow more efficient testing and deployment to production servers.
- GO Database: Maintenance and support of the GO relational databases has been provided for the entire year. Recently (Sept. 2007), we have reviewed and tested many incremental changes to the go-dev source code that have accumulated over the last 2 years. These changes have been approved and the code has been deployed to the production servers. A new method for bulk loading the associations into the database has been implemented, but not fully tested due to the “drift” in the source code between the most up-to-date CVS version and the version used in production (see above). This code can do a golite (no IEA) load in 12 hr (compared with 20 hr) and a gofull (all IEA) load in 4 days (compared with 11 days). The sequence loading was also improved both in speed and accuracy (more sequences found). We also added a “gp2protein” report for each datasource. This report is emailed to providers of the gp2protein file, indicating which sequences specified in their file could not be loaded into the database. Finally, a bug was fixed which prevented the taxonomy information (downloaded from NCBI) from being updated.
- Gene association filters: All submitted gene association files are filtered for errors in content or syntax before being published to the FTP site or loaded into the relational database. The filtering program is revised and modified continuously to account for changes in standards and format. Most significant changes committed: All new IEA associations must use the WITH column (Jan 2007). PENDING CHANGES:
- Throw an error message for double colons (“::”) in the DB_OBJECT_ID, GOID, REFERENCE, WITH and TAXON ID fields.
- Check for multiple DB_OBJECT_SYMBOLs associated with a DB_OBJECT_ID.
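The filter checks listed above can be illustrated with a minimal sketch. This is not the SGD filtering program; it assumes the tab-delimited, 15-column gene association layout with `!` comment lines, and the function name `filter_associations` and its error messages are hypothetical. It covers the committed IEA/WITH rule and the two pending checks only.

```python
from collections import defaultdict

# Zero-based column positions, ASSUMING the 15-column gene association layout
# (DB, DB_Object_ID, DB_Object_Symbol, Qualifier, GO ID, Reference, Evidence,
# With, Aspect, Name, Synonym, Type, Taxon, Date, Assigned_by).
DB_OBJECT_ID, DB_OBJECT_SYMBOL, EVIDENCE, WITH = 1, 2, 6, 7
DOUBLE_COLON_FIELDS = {
    "DB_OBJECT_ID": 1,
    "GOID": 4,
    "REFERENCE": 5,
    "WITH": 7,
    "TAXON": 12,
}


def filter_associations(lines):
    """Return a list of (line_number, message) problems found in the lines.

    File-level problems (multiple symbols for one ID) use line_number None.
    """
    errors = []
    symbols_by_id = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        # Skip blank lines and "!" comment/header lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            errors.append((line_no, f"expected 15 columns, found {len(cols)}"))
            continue
        # Pending check: double colons in selected fields.
        for name, idx in DOUBLE_COLON_FIELDS.items():
            if "::" in cols[idx]:
                errors.append((line_no, f"double colon in {name} field"))
        # Committed rule (Jan 2007): IEA associations must use the WITH column.
        if cols[EVIDENCE] == "IEA" and not cols[WITH].strip():
            errors.append((line_no, "IEA association has empty WITH column"))
        symbols_by_id[cols[DB_OBJECT_ID]].add(cols[DB_OBJECT_SYMBOL])
    # Pending check: multiple DB_OBJECT_SYMBOLs for one DB_OBJECT_ID.
    for object_id, symbols in sorted(symbols_by_id.items()):
        if len(symbols) > 1:
            joined = ", ".join(sorted(symbols))
            errors.append((None, f"{object_id} has multiple symbols: {joined}"))
    return errors
```

Calling `filter_associations(open(path))` on a gene association file would return an empty list when none of these three problems is present. The real filter also covers other content and syntax checks that are not described in these minutes.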
Three new Linux machines have been installed and configured to handle GO database loading and AmiGO processes, along with a load balancer for distributing AmiGO usage between two machines. These were fully deployed in Feb. 2007. A fourth machine, go-dev, was installed and configured to serve as a server for AmiGO testing and development.
Presented by Chris Mungall.
- Several improvements were made to AmiGO: (see slides)
- search (relevance ranking; search term highlighting),
- annotation analysis (map2slims; term enrichments for gene sets).
- Future plans include
- displaying reference genome genes
- AmiGO architecture overhaul (see slide); goals are to 1. reduce software development time; 2. enhance user experience, 3. integrate external resources better (?)
- extensions to support: other ontologies, etc
- There is also a new (ready to be implemented) web interface for simplifying term requests that adds terms directly to the SourceForge tracker without the user having to go to SourceForge.
- GOOSE: new web-based SQL tool that allows doing more complex queries than the normal or advanced AmiGO search. That site also has a large template of queries people may use.
Future software development includes: 1. improved batch/custom download system; 2. ID mapping with EBI (Dan Barrell); 3. web services; 4. advanced query interface; 5. access to external analyses; 6. reference genome and orthology-centric displays; 7. reference genome curation interface
Suggestions from the SAB:
- Craig recommends more graphics on the GO front page (Google-map-like), which may be more intuitive for users than searching a list of terms.
- Larry wondered how we are going to display the new relationships. Chris answers that it probably doesn’t need to be in the ‘main GO’; we’ll do it incrementally to make the displays user friendly. Larry points out that it is extremely important to display all of this information; therefore it should be one of our priorities.