Collaboration with MIT GO-Engineering
The MIT group has used an algorithm to label GO nodes as too general or too specific, based on the structure of the graph and the amount of annotation to terms (see the manuscript at the bottom of this page). We are collaborating with them to examine their suggestions and modify the GO as needed.
- David Hill (GO)
- Jane Lomax (GO)
- Midori Harris (GO)
- Chris Mungall (GO)
- Judith Blake (GO)
- Mary Dolan (GO)
- Jonathan Liu (MIT)
- Gil Alterovitz (Harvard & MIT)
- Marco Ramoni (Harvard Medical School)
- Ming Xiang (MIT)
- MIT group provides files to GO group for examination (done)
- GO group reviews file and makes notes about items to discuss (March 2)
- GO group conference call (March 12, 2007; 11:00 am PST, 2:00 pm EST, 6:00 pm GMT)
Face-to-Face meeting at MIT (March 16)
GO group will have dinner together. Meet in hotel lobby at 6 pm.
MORNING/OPEN SESSION: Harvard Institutes of Medicine New Research Building Third Floor Rotunda
- 9:00am Opening remarks
- 9:20-9:40am Marco Ramoni, Harvard Medical School
- 9:40-10:00am Judith A. Blake, The Jackson Laboratory/GO Consortium [Gene Ontology Overview and Perspective]
- 10:00-10:20am Michael Xiang, Massachusetts Institute of Technology
- 10:20-10:40am Chris Mungall, Lawrence Berkeley National Laboratory/GO Consortium [Building the Gene Ontology]
- 10:40-11:10am Coffee Break
- 11:10-11:30am David Hill, The Jackson Laboratory/GO Consortium [Annotating Gene Products to the Gene Ontology]
- 11:30-11:50am Gil Alterovitz, Harvard/Massachusetts Institute of Technology
- 11:50-12:10pm Jane Lomax, European Bioinformatics Institute Hinxton/GO Consortium [Recent Improvements to the Gene Ontology]
- 12:10-12:30pm Closing Remarks
AFTERNOON/CLOSED SESSION: Harvard Institutes of Medicine New Research Building (NRB) Room 258, by invitation only
- 12:30-1:30pm Lunch
- 1:30-5:30pm REGO Workshop meeting
- 6:00-7:00pm Dinner
Generously funded in part by NHGRI and i2b2
GO curators addressed the high-level nodes that were deemed too specific for their placement in the ontology. Here is a summary of what we found:
Following analysis of the GO using an engineering, information-theoretic approach, Gil Alterovitz and Marco Ramoni of the Harvard/MIT Division of Health Sciences and Technology invited collaboration with the Gene Ontology Consortium (GOC) to evaluate the hypothesis and recommendations resulting from their work. Several phone conferences followed. Then, members of the GOC team* joined the Alterovitz/Ramoni lab groups in Boston on March 16, 2007, for a working session on the recommendations. As a result of that meeting, the MIT/Harvard group will re-run parts of the analysis based on their improved understanding of the structure of the GO, to determine whether the information-theoretic approach could be used in a more refined way, for example to identify specific problems in the is_a vs. part_of hierarchies contained in GO. Initial results presented by the Boston group provided an opportunity to examine a quantifiable way of determining areas of the ontology that could be improved biologically.
The GOC group continued to work on the recommendations from the first analysis. The Boston meeting provided an opportunity for general evaluation of the information content approach and some specific recommendations. The goal of the work from the GOC perspective is to provide a method and tool for comprehensive evaluation of the GO structure as a metric for setting priorities and measuring progress for GO development projects.
The GO is designed as a directed acyclic graph in which terms are arranged using is_a and part_of relationships to other terms. Ideally, terms would be placed in the graph such that the is_a or part_of parent of a term would represent, from a biological perspective, the term in the hierarchy most closely related to the term being placed. Because of the complexity of the GO graphs, curators can place a term in a position that is logically correct but not optimal based on biological knowledge. For example, a term describing a very specific enzyme activity such as ‘IkappaB kinase activity’ could be made a direct is_a child of ‘protein kinase activity’. Although logically correct, this is not an optimal placement in the graph because ‘protein kinase activity’ has a direct child ‘protein serine/threonine kinase activity’ that more precisely classifies ‘IkappaB kinase activity’. In addition, curators can place a term in the graph and fail to associate existing terms as its children if they are not aware that those terms exist or should be included as children. In the kinase example above, if the existing graph contained two terms, ‘protein kinase activity’ and an is_a child ‘IkappaB kinase activity’, and a term ‘protein serine/threonine kinase activity’ were added to the graph, the parent-child relationship between ‘IkappaB kinase activity’ and ‘protein serine/threonine kinase activity’ might be overlooked. This most often occurs when children are in separate branches of the graph. The suboptimal placement of a term and the failure to identify all possible children of a term are difficult to spot empirically. Results discussed below show that an information-theoretic approach can be used to identify areas of the ontology where these types of errors are likely to have occurred.
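The kinase example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary representation (child term mapped to its direct is_a parents) and the helper `ancestors` are hypothetical and are not part of any GO tool. The sketch shows that under the suboptimal placement the more precise parent is simply absent from the term's ancestry, which is why the problem cannot be seen by inspecting the existing links alone.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # hypothetical representation: child term -> direct is_a parents


def ancestors(term: str, graph: Graph) -> Set[str]:
    """Collect all terms reachable from `term` by following parent links upward."""
    seen: Set[str] = set()
    stack = list(graph.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, ()))
    return seen


PK = "protein kinase activity"
PSTK = "protein serine/threonine kinase activity"
IKK = "IkappaB kinase activity"

# Suboptimal placement: logically correct, but the closest parent is not linked.
suboptimal: Graph = {PK: set(), PSTK: {PK}, IKK: {PK}}

# Improved placement: the term sits under its most closely related parent.
improved: Graph = {PK: set(), PSTK: {PK}, IKK: {PSTK}}

print(PSTK in ancestors(IKK, suboptimal))  # False
print(PSTK in ancestors(IKK, improved))    # True
print(PK in ancestors(IKK, improved))      # True: the general parent is still implied
```

In the improved graph the general parent is still reached transitively, so no logical information is lost by moving the term deeper.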
Too Specific Nodes
GOC curators examined terms from the Biological Process ontology that were identified as ‘too specific’, indicating that the term has too few genes annotated to it with respect to other terms at the same level of the ontology. Terms that are deep in the ontology may be identified as ‘too specific’ for a trivial reason such as lack of annotation, often because curators have not focused much gene-annotation effort in that biological area. Terms that are identified as ‘too specific’ that are shallow in the ontology are more likely to benefit from a change in ontology structure because they take into account more of the entire ‘universe’ of annotations. It also makes more sense to start at a level of the graph closer to the root node since any changes made may propagate down the graph by changing the ‘level’ of existing nodes.
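As a rough sketch of how a term might be flagged as ‘too specific’ relative to its peers, the Python below computes an information content value for each term and compares it with other terms at the same level. Everything here is an assumption made for illustration: the formula IC = -log2(p), the add-one smoothing, the one-standard-deviation threshold, the function names, and the toy counts. The MIT group's actual method is described in their manuscript and may differ.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List


def information_content(gene_count: int, total_genes: int) -> float:
    """IC = -log2(p), where p is the share of annotated genes at the term.

    Add-one smoothing (an assumption) keeps terms with zero annotations finite.
    """
    p = (gene_count + 1) / (total_genes + 1)
    return -math.log2(p)


def too_specific(counts: Dict[str, int], levels: Dict[str, int],
                 total_genes: int, threshold: float = 1.0) -> List[str]:
    """Return terms whose IC exceeds the mean IC of their level by more than
    `threshold` population standard deviations."""
    ic = {t: information_content(c, total_genes) for t, c in counts.items()}
    by_level: Dict[int, List[str]] = {}
    for term, level in levels.items():
        by_level.setdefault(level, []).append(term)

    flagged: List[str] = []
    for terms in by_level.values():
        values = [ic[t] for t in terms]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # all peers identical (or a single term): nothing stands out
        flagged.extend(t for t in terms if (ic[t] - mu) / sigma > threshold)
    return sorted(flagged)


# Toy data, invented for illustration only (not real GO annotation counts).
counts = {"term_a": 5000, "term_b": 4000, "term_c": 6000, "term_d": 40}
levels = {"term_a": 2, "term_b": 2, "term_c": 2, "term_d": 2}

print(too_specific(counts, levels, total_genes=20000))  # ['term_d']
```

A shallow term with very few annotated genes stands out sharply against heavily annotated peers, which matches the intuition that such terms deserve a closer look by curators.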
We identified 11 shallow terms at levels 1 or 2 in the ontology and examined them to determine if their biological context in the graph could be improved. We found that there were three fundamental reasons why terms were identified as being ‘too specific’ using the information-theoretic approach: 1) A term was a general biological type, but lacked annotation; 2) A term was missing children, resulting in fewer annotations being attributed to the term than were logically true; or 3) The term's placement in the ontology could be improved by a rearrangement of the graph, either by placing new terms in the path to the root, or by moving the term to a deeper level.
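The second reason can be illustrated with a small sketch of how annotations are counted up the graph. The representation, the function `propagated_genes`, and all term and gene names below are hypothetical and chosen only for illustration. The sketch assumes that a gene annotated to a child term also counts toward every ancestor of that term.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # hypothetical: child term -> direct parents (is_a or part_of)


def propagated_genes(term: str, graph: Graph, direct: Dict[str, Set[str]]) -> Set[str]:
    """Genes annotated directly to `term` or to any of its descendants."""
    genes = set(direct.get(term, ()))
    for child, parents in graph.items():
        if term in parents:
            genes |= propagated_genes(child, graph, direct)
    return genes


# Toy direct annotations (invented gene identifiers).
direct = {
    "parent_term": {"gene1"},
    "linked_child": {"gene2", "gene3"},
    "unlinked_child": {"gene4", "gene5", "gene6"},
}

# Before curation: one biologically appropriate child is not linked to the parent.
before: Graph = {
    "parent_term": set(),
    "linked_child": {"parent_term"},
    "unlinked_child": set(),
}

# After curation: the missing parent-child link has been added.
after: Graph = {
    "parent_term": set(),
    "linked_child": {"parent_term"},
    "unlinked_child": {"parent_term"},
}

print(len(propagated_genes("parent_term", before, direct)))  # 3
print(len(propagated_genes("parent_term", after, direct)))   # 6
```

Adding the missing link doubles the number of genes attributed to the parent in this toy case without any new annotation being made, which lowers its apparent specificity.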
Summary of results of our analysis:
Of the 11 Biological Process terms we examined, we were able to move 10 terms to a more optimal location in the ontology. Two terms, ‘neurotrophin production’ (GO:0032898) and ‘pilus retraction’ (GO:0043108), showed the largest change in placement with respect to level within the graph. GO curators were able to find or create more specific parents for these two terms at levels 10 and 9, respectively. Other terms that were moved showed a less dramatic change. ‘Induction of autolysin activity in another organism’ was moved from level 2 to level 5 and ‘cell wall peptidoglycan catabolic process in another organism’ was moved from level 2 to level 6. Five terms, ‘forward locomotion’ (GO:0043056), ‘backward locomotion’ (GO:0043057), ‘circumnutation’ (GO:0010031), ‘lymphocyte anergy’ (GO:0002249) and ‘multicellular organism reproduction’ (GO:0032504), were moved down one level from level 2 to level 3. Although this move improved the information content of these nodes with respect to their peers, these terms were still xyz standard deviations away from other terms at level 3. (Mike to add data here, this is just an example of one possible outcome) This modest improvement is likely influenced by two factors. First, as we found with ‘multicellular organism reproduction’ (GO:0032504), there may be terms in the ontology that are biologically related to GO:0032504, but are not yet linked in the ontology. For example, we first added a missing part_of link for ‘reproductive process in a multicellular organism’ (GO:0048609) and then created a new child term ‘reproductive behavior in a multicellular organism’ (GO:0033057) as its direct child. We then grouped appropriate reproductive behaviors under the new term.
This type of rearrangement will effectively add more annotations to the original ‘too specific’ term ‘reproductive process in a multicellular organism’ (GO:0048609). This type of rearrangement would also improve the ‘locomotion’ terms. It is interesting to note that problems with the ‘locomotion’ terms in the GO have previously been identified intuitively, due to inconsistencies arising from differences in the interpretation of terms representing any kind of movement and those representing movement from place to place. The GOC plans to focus on rearrangements in this part of the ontology, and welcomes the additional support for this objective provided by the present quantitative assessment of information content problems in the area of locomotion. Finally, there was one term, ‘pigmentation’ (GO:0043473), whose placement in the graph was not altered. Biologically, pigmentation is an almost universal process resulting from diverse evolutionary pressures such as camouflage and reproduction. In an examination of pigmentation we again found missing children. In particular, ‘pigment cell differentiation’ (GO:0050931) was made a child of ‘pigmentation during development’ (GO:0048066). But more importantly, ‘pigmentation’ serves as an example of a term having high information content due to a lack of annotation. We believe this to be the case with at least 7 of the 11 ‘too specific’ nodes. The most trivial evidence for this is that some nodes, such as ‘callose localization’ (GO:0052545) and ‘neurotrophin production’ (GO:0032898), have no annotations and were clearly created in the ontology for future use. In the case of ‘pigmentation’ (GO:0043473) there is a clear indication that the genes involved in the process are under-annotated. On April 16, 2007, the Mouse Genome Informatics (MGI) resource had 44 genes annotated to the GO term ‘pigmentation’ using the version of the GO created on April 13, 2007.
However, a search at MGI for genes in mice having a ‘pigmentation’ phenotype/defect returns 417 genes. This illustrates that there are many more potential genes that can be annotated to the GO term ‘pigmentation’ than have currently been annotated. Our results show that the use of information content analysis to identify terms in the GO that are ‘too specific’ with respect to their neighbors is a productive process. In this initial study, we have worked at the most general levels of GO to address issues of specificity. We have shown that terms labeled ‘too specific’ via information content approaches can be examined by GO curators and used as guides for improving the relationships and terms in the ontology, creating a better representation of biology. Furthermore, identification of these types of nodes can also target areas of the ontology that lack gene annotation relative to ontology development. By using successive rounds of improvement and analysis and working our way down the graphs, we should be able to improve the GO both biologically and computationally.
*GOC members attending the meeting in Boston were Judy Blake, Mary Dolan, Midori Harris, David Hill, Jane Lomax, and Chris Mungall. Judy Blake and Suzi Lewis participated in follow-up phone calls. David Hill and Jane Lomax conducted most of the evaluation of recommendations of the GO-Engineering analysis.
The MIT group will submit a paper describing the analysis. We hope to further this collaboration by continuing to refine the ontology and by developing more sophisticated methods to calculate information content by taking into account disjoint relationships in the ontology as well as orthology information when counting gene products annotated to terms.
- Improving the Gene Ontology - Abstract Methods
- Re-Engineering Gene Ontology manuscript
- Quantifying the Specificity of GO Terms manuscript
- Quantifying the Specificity of GO Terms supplementary info
- "Information bottleneck" graph
- Spreadsheet column descriptions
- List of "too general" nodes, Oct. 2006
- List of "too specific" nodes, Nov. 2006
- comments on "too specific" list, Nov. 2006
- List of "too general" nodes, Dec. 2006 (is_a-complete GO)
- Suggested changes, 11/26/06
- Suggested changes, 12/10/06
- Responses to 12/10/06 suggestions