GO-MIT meeting notes
REGO (Re-engineering the Gene Ontology) Workshop meeting
Friday, March 16, 2007 (afternoon session)
Harvard Institutes of Medicine New Research Building (NRB) Room 258 (by invitation only)
Participants:
- Gil Alterovitz (Harvard, MIT)
- Judy Blake (GO, MGI)
- Mary Dolan (GO, MGI)
- Midori Harris (GO, EBI)
- David Hill (GO, MGI)
- Jon Liu (MIT)
- Jane Lomax (GO, EBI)
- Chris Mungall (GO, LBNL)
- Marco Ramoni (Harvard)
- Michael Xiang (MIT)
Opening discussion
GO motives:
- How to identify & prioritize areas to improve
- How to evaluate the state of an ontology and measure progress/improvement
Mike - how data were generated
- Look at variation in information content - calculate mean & SD for information content at each level
- Flag nodes off by >1.96 SD (difference from mean for level)
- Found lots more 'too general' than 'too specific' nodes -- this may be an artifact of using longest path, and not distinguishing between is_a & part_of relationships
- longest path has a tendency to place nodes deeper in the graph
- Discussion point: does it make a difference to the results if you use only is_a or only part_of?
- GO people point out that the analysis can treat is_a tree separately; the ontologies are not really "part_of complete"
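A minimal sketch of the flagging step Mike described, for illustration only. The notes do not record how information content was computed, so the sketch takes per-node values as input. The function name, the use of population SD, and the reading that a value below the level mean means 'too general' are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_outliers(level_of, info_content, z_cut=1.96):
    """Flag nodes whose information content is far from the mean for their level.

    level_of:     dict node -> level (e.g. longest-path depth)
    info_content: dict node -> information content in bits (computed elsewhere)
    Assumption: unusually LOW information content is read as 'too general',
    unusually HIGH as 'too specific'.
    """
    by_level = defaultdict(list)
    for node, level in level_of.items():
        by_level[level].append(node)

    flags = {}
    for nodes in by_level.values():
        values = [info_content[n] for n in nodes]
        if len(values) < 2:
            continue  # a mean and SD over one node say nothing
        mu = mean(values)
        sd = pstdev(values)  # population SD; an assumption
        if sd == 0:
            continue  # all nodes on this level are identical
        for n in nodes:
            z = (info_content[n] - mu) / sd
            if z < -z_cut:
                flags[n] = "too general"
            elif z > z_cut:
                flags[n] = "too specific"
    return flags
```

Because the level assignment is an input, the same function can be rerun with levels computed from is_a only or part_of only.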
David - Children of 'anatomical structure development' form just a flat list now; we could impose much more organization.
- - development terms sit at much deeper levels when reached via part_of than via is_a relationships
- (example of 'amygdala development' on screen)
- - would certainly expect to get different result from analysis using is_a only -- that should flag lots of terms, and give us extremely useful information about where to add intermediate terms.
Moment of insight: the two relationship types are essentially different axes of classification within GO; MIT group had been treating them as equivalent
- GO BP terms can be differentiated by type of process, or by level at which it occurs
- could have chosen one or the other, but would have lost lots of info, along with a perfectly natural way to organize
- explains position of 'pigmentation,' 'immune process', etc.
- starting to think of "sub-ontologies" present within process ontology
- want way to tell computer what granularity level applies
- users want way to get from tangled graph to "my" part of ontology, i.e. a specific area of interest
- clever ways to use partitions - e.g. go slims
- partitions don't have bias in representation; see true features of nodes
- Action: Harvard/MIT group to repeat analysis for is_a and part_of separately.
- Action: GO to inform Harvard/MIT group of BP "sub-ontologies," and indicate if any sub-ontologies exist for any of the other branches of GO.
- Action: Harvard/MIT group to consider sub-ontologies separately for partition/entropy analysis.
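To illustrate why repeating the analysis per relationship type could change the results, here is a rough sketch of longest-path level assignment restricted to a chosen set of relationship types. The edge format (child, parent, relation), the function name, and the toy nodes A, B, C are assumptions for illustration; they are not real GO terms or the group's actual code.

```python
def longest_path_levels(edges, relations=("is_a", "part_of")):
    """Return dict node -> longest-path depth, following only the given relations.

    edges: iterable of (child, parent, relation) tuples describing a DAG.
    A node with no parent under the chosen relations gets depth 0.
    """
    parents = {}
    nodes = set()
    for child, parent, rel in edges:
        nodes.update((child, parent))
        if rel in relations:
            parents.setdefault(child, []).append(parent)

    cache = {}

    def depth(node):
        # Memoised recursion; fine for shallow graphs such as toy examples.
        if node not in cache:
            cache[node] = 1 + max(
                (depth(p) for p in parents.get(node, [])), default=-1
            )
        return cache[node]

    return {n: depth(n) for n in nodes}


# Toy graph: C is a direct is_a child of A, but also part_of B.
toy_edges = [
    ("B", "A", "is_a"),
    ("C", "A", "is_a"),
    ("C", "B", "part_of"),
]
mixed = longest_path_levels(toy_edges)
is_a_only = longest_path_levels(toy_edges, relations=("is_a",))
```

In the toy graph, C sits one level deeper when part_of is mixed in than when only is_a is followed, which is the kind of shift discussed above.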
What about when several papers report the same thing?
- Not a problem; analysis only uses 1 or 0, i.e. annotation to a node exists or not (don't count separate instances)
- Action: Harvard/MIT group to deal with annotation column 4 ("NOT"); remember to propagate DOWN for NOT annotations (or don't use them).
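A small sketch of the two points above, assuming a simple in-memory representation rather than any real annotation file parser. Positive annotations are propagated up to ancestors and collapsed to sets (so repeated reports of the same annotation count once), while NOT annotations are propagated down to descendants. The function name and the (gene, term, is_not) tuple layout are hypothetical.

```python
def propagate(parents, annotations):
    """Propagate annotations through a DAG.

    parents:     dict term -> list of parent terms
    annotations: iterable of (gene, term, is_not) tuples
    Returns (positive, negated): dicts term -> set of genes.
    """
    # Invert the parent map so NOT annotations can travel downwards.
    children = {}
    for child, term_parents in parents.items():
        for p in term_parents:
            children.setdefault(p, set()).add(child)

    def closure(start, neighbours):
        # The start term plus everything reachable through `neighbours`.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbours.get(node, ()))
        return seen

    positive, negated = {}, {}
    for gene, term, is_not in annotations:
        if is_not:
            for t in closure(term, children):  # NOT goes DOWN
                negated.setdefault(t, set()).add(gene)
        else:
            for t in closure(term, parents):  # positive goes UP
                positive.setdefault(t, set()).add(gene)
    return positive, negated


def is_annotated(positive, term):
    """Binary indicator: 1 if any gene is annotated to the term, else 0."""
    return 1 if positive.get(term) else 0
```

Using sets means that several papers reporting the same gene-term pair leave the result unchanged.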
Fuzzy sets
- a. fuzzy annotation - e.g. core function, as opposed to other annotated things?
- what does it mean?
- easiest to illustrate for component
- Gil - indicate which feature to focus on? rank annotation importance
- David - but how do you rank? usually depends on context
- Chris - probability of that context arising
- annotation captures a potential to perform function
- supplement with other info, esp. expression (if right other things are present, gene prod does x)
- Judy - look at yeast, compare/contrast with (e.g.) mouse - annotation much nearer complete for yeast; the relevant terms are available, whereas mammal annotation requires more extensive ontology additions
- Possible action item: compare results for mouse-specific and yeast-specific sets
- b. fuzzy term-term relationships
- (no further discussion)
Going through sample suggestions on handout
- pilus retraction
- action item: for pilus biogenesis & pilus retraction, find bacterial expert(s) (e.g. Michelle) to check definitions, and whether pilus retraction should be part_of unidirectional conjugation
- multicellular organism reproduction
- this one would be OK looking only at is_a relationships
- cell wall peptidoglycan ... (GO:0051672)
- pass suggestion on to PAMGO group; looks OK to us
- lymphocyte anergy
- ask Alex!
- neurotrophin production
- use GO:0043524 'negative regulation of neuron apoptosis' (has narrow synonym 'neuron survival')
- make neurotrophin production part_of GO:0043524
- considering adding more general 'production' terms, but wait until we discuss it more -- existing production terms seem tied to experimental observations; also bringing in level of observation
Intervening discussion
Aside on synonyms - Gil suggests using thesaurus (e.g. NCI Thesaurus) to generate synonyms comprehensively; could implement some automation (but would have to do some manual mapping)
- Action item: GO to look into mapping thesaurus to GO terms for use in synonym generation
Aside on technology - Marco suggests a tool that can suggest changes, and rank based on computations using co-occurrence of terms for same set of genes
- circumnutation
- ask Jen & Tanya
When GO curators add a BP term, we ask ourselves what's the closest existing GO term (what kind of process is it?), then ask whether every instance of the process is part of an instance of the proposed part_of parent, or is a type of the proposed is_a parent
[didn't catch a comment from Mike about one of the computational challenges, but it has to do with finding parent terms]
More on uses of NCI thesaurus to get better synonym coverage ...
- links to literature
- helps users
- extend semantics of search (forgiving - tolerates non-exact matches)
- can evaluate how 'good' synonyms are using info linked to thesaurus terms, GO terms
- but Chris notes that GO wouldn't expect correlation between number of "annotations" and synonym scope
- piggyback on their updates
- we have to estimate how big the mapping task would be
- MIT group can explore possible contacts who could help; can spin as an extension of thesaurus itself, advantage to users, etc.
- Action item: GO will go through sample set on handout; consult experts as needed; make changes; prepare report to share with MIT group explaining what changes we made and why (and if different from suggestion, also explain any reasons why we couldn't do exactly as suggested) so Harvard/MIT can learn about altering the computer tool that makes suggestions.
- Action item: Harvard/MIT group to integrate the changes implemented by GO into the re-engineering paper and circulate to GO.
- Action item: GO to let Harvard/MIT know when there's a new version of GO to evaluate against a previous one (and let H/M know which old version to compare to) for p-values of significant improvement
Information bottlenecks and entropy rate
Most dramatic bottleneck is function level 12 to 13
- Mike suggests looking at 'too general' nodes at the bottleneck level
GO asks MIT group if they can organize the list of too general nodes
- project list onto graph; see if problems fall into particular area(s)
- but what does GO mean by 'an area'?
- answer: do problematic nodes have a common parent? or are lots of problem terms on one level?
- one possible reason for bottleneck is graph and annotation history, e.g. children of brain development added much more recently than brain development itself
- Action item: Harvard/MIT to get lists of (a) all nodes and (b) too general nodes for info bottleneck
- Action item: Harvard/MIT group to propose "work groups" based on info bottlenecks
snippet from paper suggesting adding 'positive regulation of carbohydrate-transporting ATPase activity' etc.
- how does that addition change the info content? it introduces children that extend into next level; hypothesis is that some annotations should be transferred from problem level to new children
- bottleneck ameliorated because new child terms open up possibility of transferring information to another level
- Chris - any GO node can potentially be expanded; do info metrics assume 'closed world'?
- usually only makes sense to add child nodes at deep levels
- David - do siblings at bottleneck level have regulation children? model for suggested additions?
- we've differentiated sibs, but not problem terms (have sibling terms with and without child terms)
- leaves should be at similar specificity
- huge job! strategy has been to make new leaves as & when needed for annotation
- info analysis - find bottleneck, and what causes it - as way to identify interest group topics
Found the function bottleneck!
- biggest dip occurs at level 13, so bottleneck occurs at level 12
- one term: cation-transporting ATPase activity (GO:0019829) deviates from avg by >2 SD (7.1 bits)
- child GO:0046961 has lots of annotations
- this is a true info bottleneck in biology -- lots of gene products ARE hydrogen ion-transporting ATPases
- ChEBI might suggest restructuring
- illustrates biology perspective vs. information distribution perspective
- how do we reconcile them?
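The "biggest dip at level 13, so bottleneck at level 12" reasoning can be sketched as below. The notes do not say exactly which per-level quantity dips, so the input is simply some per-level measure supplied by the caller. The function name and the sample numbers are made up for illustration and are not results from the analysis.

```python
def find_bottleneck(measure_by_level):
    """Return (level, drop) for the largest fall between consecutive levels.

    measure_by_level: dict level -> value of some per-level information measure.
    The reported level is the one BEFORE the dip, i.e. where the bottleneck sits.
    Returns None if fewer than two levels are given.
    """
    levels = sorted(measure_by_level)
    best = None
    for prev, cur in zip(levels, levels[1:]):
        drop = measure_by_level[prev] - measure_by_level[cur]
        if best is None or drop > best[1]:
            best = (prev, drop)
    return best


# Illustrative values only.
toy_measure = {11: 5.0, 12: 4.8, 13: 2.0, 14: 1.9}
bottleneck = find_bottleneck(toy_measure)
```

With the toy values, the largest fall is from level 12 to level 13, so level 12 is reported as the bottleneck.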
Bottlenecks result from one level containing a few general nodes and several much more specific nodes
- to fix, (a) move more specific nodes down; (b) split general nodes into more specific ones
Different kind of bias: several siblings on one level, of which one has more annotations than the rest
- entropy rate
- random walk over tree; how many times you end up at a particular node
- if tree is efficient & balanced, get equal probability for each node on a level
- can add weight to an edge, e.g. based on where it's going
- a somewhat orthogonal metric that assesses how balanced the tree is
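One way to picture the random-walk idea is sketched below. It is not necessarily the entropy-rate calculation the Harvard/MIT group used: it is a simpler per-level balance check on a tree, where a top-down walk picks each child with probability proportional to an optional edge weight. The function name, the weight format, and the normalisation to a 0-1 score are assumptions.

```python
import math
from collections import defaultdict, deque


def level_balance(children, root, weight=None):
    """Score how evenly a top-down random walk spreads over each level of a tree.

    children: dict node -> list of child nodes (a tree, not a general DAG)
    weight:   optional dict (parent, child) -> positive edge weight, default 1.0
    Returns dict level -> normalised entropy in [0, 1]; 1.0 means every node
    on that level is reached with equal probability.
    """
    weight = weight or {}
    prob = {root: 1.0}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        total = sum(weight.get((node, k), 1.0) for k in kids)
        for k in kids:
            prob[k] = prob[node] * weight.get((node, k), 1.0) / total
            depth[k] = depth[node] + 1
            queue.append(k)

    by_level = defaultdict(list)
    for node, p in prob.items():
        by_level[depth[node]].append(p)

    scores = {}
    for level, probs in by_level.items():
        mass = sum(probs)  # renormalise: some branches may end above this level
        entropy = -sum((p / mass) * math.log2(p / mass) for p in probs if p > 0)
        max_entropy = math.log2(len(probs))
        scores[level] = 1.0 if max_entropy == 0 else entropy / max_entropy
    return scores


balanced = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
lopsided = {"root": ["a", "b"], "a": ["a1", "a2", "a3"], "b": ["b1"]}
```

For the balanced toy tree every level scores 1.0, while the lopsided tree scores below 1.0 at level 2 because one node receives half of the probability mass.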
Identify where bias is
- where bias results from uneven biological knowledge
- is there any way to compare GO with the literature? text mine Alberts ;)
- our goal isn't zero bias, but we should strive for GO bias to parallel actual community knowledge bias
- also want to track how bias changes over time
- can run analysis on different versions/releases of GO
- predict how much one vs another bit of ontology development would affect bias
- follow trends - what are/were the hot topics
Question (David): in fitness curve - why did component get worse??
- component tends to have high-level terms, each with long list of children
- a good case to see whether results are different using only is_a