GO-MIT meeting notes

From GO Wiki

REGO (Re-engineering the Gene Ontology) Workshop meeting

Friday, March 16, 2007 (afternoon session)

Harvard Institutes of Medicine New Research Building (NRB) Room 258 (by invitation only)

Participants:

  • Gil Alterovitz (Harvard, MIT)
  • Judy Blake (GO, MGI)
  • Mary Dolan (GO, MGI)
  • Midori Harris (GO, EBI)
  • David Hill (GO, MGI)
  • Jon Liu (MIT)
  • Jane Lomax (GO, EBI)
  • Chris Mungall (GO, LBNL)
  • Marco Ramoni (Harvard)
  • Michael Xiang (MIT)


Opening discussion

GO motives:

  • How to identify & prioritize areas to improve
  • How to evaluate the state of an ontology and measure progress/improvement

Mike - how data were generated

Look at variation in information content - calculate mean & SD for information content at each level
Flag nodes off by >1.96 SD (difference from mean for level)
Found many more 'too general' than 'too specific' nodes -- this may be an artifact of using the longest path and not distinguishing between is_a & part_of relationships
(using the longest path has a tendency to place nodes deeper in the graph)
  • Discussion point: does it make a difference to the results if you use only is_a or only part_of?
GO people point out that the analysis can treat is_a tree separately; the ontologies are not really "part_of complete"
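
A minimal sketch of the flagging step described above, for illustration only. The notes do not spell out how the Harvard/MIT group computed information content, so the definition used here (negative log2 of the fraction of gene products annotated to a node) is an assumption, and the names information_content, flag_outliers, level_of and ic_of are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def information_content(n_annotated, n_total):
    """IC in bits: -log2 of the fraction of gene products annotated to a node.

    Assumed definition; the notes do not state the exact formula used.
    """
    return -math.log2(n_annotated / n_total)


def flag_outliers(level_of, ic_of, z_cut=1.96):
    """Flag nodes whose IC is more than z_cut SDs from the mean IC of their level.

    level_of: {node: level}, e.g. longest-path depth from the root
    ic_of:    {node: information content in bits}
    Returns {node: "too general" | "too specific"}.
    """
    by_level = defaultdict(list)
    for node, level in level_of.items():
        by_level[level].append(node)

    flags = {}
    for nodes in by_level.values():
        values = [ic_of[n] for n in nodes]
        if len(values) < 2:
            continue  # no spread to measure on a single-node level
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # all nodes on this level carry the same IC
        for n in nodes:
            z = (ic_of[n] - mu) / sd
            if z < -z_cut:
                flags[n] = "too general"   # much less informative than its peers
            elif z > z_cut:
                flags[n] = "too specific"  # much more informative than its peers
    return flags
```

Because the level assignment is passed in as level_of, the same function can be rerun with levels computed from is_a only, part_of only, or both.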

David - Children of 'anatomical structure development' form just a flat list now; we could impose much more organization.

- development terms are part_of at much deeper levels than they are by is_a relationships
(example of 'amygdala development' on screen)
- would certainly expect to get different result from analysis using is_a only -- that should flag lots of terms, and give us extremely useful information about where to add intermediate terms.

Moment of insight: the two relationship types are essentially different axes of classification within GO; MIT group had been treating them as equivalent

GO BP terms can be differentiated by type of process, or by level at which it occurs
could have chosen one or the other, but would have lost lots of info, along with a perfectly natural way to organize
explains position of 'pigmentation,' 'immune process', etc.
starting to think of "sub-ontologies" present within process ontology
want way to tell computer what granularity level applies
users want way to get from tangled graph to "my" part of ontology, i.e. a specific area of interest
clever ways to use partitions - e.g. go slims
partitions don't have bias in representation; see true features of nodes
  • Action: Harvard/MIT group to repeat analysis for is_a and part_of separately.
  • Action: GO to inform Harvard/MIT group of BP "sub-ontologies," and indicate if any sub-ontologies exist for any of the other branches of GO.
  • Action: Harvard/MIT group to consider sub-ontologies separately for partition/entropy analysis.
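
To illustrate why repeating the analysis per relationship type matters, here is a small sketch that computes longest-path depth while following only a chosen set of relationship types. The toy graph (root, A, B, C) and the function name longest_path_depths are hypothetical and are not taken from GO.

```python
# Hypothetical toy edges: (child, relation, parent). Not real GO terms.
EDGES = [
    ("A", "is_a", "root"),
    ("B", "is_a", "root"),
    ("B", "part_of", "A"),
    ("C", "is_a", "root"),
    ("C", "part_of", "B"),
]


def longest_path_depths(edges, relations):
    """Longest-path depth of every node, following only edges in `relations`.

    Assumes the graph is acyclic. Nodes with no parent under the chosen
    relations get depth 0.
    """
    parents = {}
    for child, rel, parent in edges:
        parents.setdefault(child, [])
        parents.setdefault(parent, [])
        if rel in relations:
            parents[child].append(parent)

    memo = {}

    def depth(node):
        if node not in memo:
            # 1 + deepest parent; a node with no parents ends up at depth 0
            memo[node] = 1 + max((depth(p) for p in parents[node]), default=-1)
        return memo[node]

    return {node: depth(node) for node in parents}


if __name__ == "__main__":
    print(longest_path_depths(EDGES, {"is_a"}))
    print(longest_path_depths(EDGES, {"is_a", "part_of"}))
```

In this toy graph, A, B and C are all direct is_a children of the root, yet C sits three levels down once part_of edges are followed. That mirrors the flat is_a list with deep part_of nesting described for the development terms.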

What about when several papers report the same thing?

Not a problem; analysis only uses 1 or 0, i.e. annotation to a node exists or not (don't count separate instances)
  • Action: Harvard/MIT group to deal with annotation column 4 ("NOT"); remember to propagate DOWN for NOT annotations (or don't use them).
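
A rough sketch of the presence/absence treatment and the direction of propagation, under the assumption that annotations arrive as (gene, term, is_not) tuples and the ontology as a child-to-parents mapping. The function names and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def closure(start, neighbours):
    """All nodes reachable from `start` via `neighbours` (excluding start)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def propagate(annotations, parents):
    """Propagate annotations through the graph as presence/absence sets.

    annotations: iterable of (gene, term, is_not)
    parents:     {term: [parent terms]}
    Positive annotations propagate UP to ancestors; NOT annotations propagate
    DOWN to descendants. Sets collapse repeated reports of the same
    annotation, so each (gene, term) pair counts once.
    """
    children = defaultdict(list)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].append(child)

    positive, negative = set(), set()
    for gene, term, is_not in annotations:
        if is_not:
            targets = {term} | closure(term, children)
            negative.update((gene, t) for t in targets)
        else:
            targets = {term} | closure(term, parents)
            positive.update((gene, t) for t in targets)
    return positive, negative
```

For example, with parents = {"B": ["A"], "C": ["B"]}, two separate reports of ("g1", "B", False) yield a single positive entry for each of B and A, while ("g2", "B", True) yields negative entries for B and C only.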

Fuzzy sets

a. fuzzy annotation - e.g. core function, as opposed to other annotated things?
what does it mean?
easiest to illustrate for component
Gil - indicate which feature to focus on? rank annotation importance
David - but how do you rank? usually depends on context
Chris - probability of that context arising
annotation captures a potential to perform function
supplement with other info, esp. expression (if right other things are present, gene prod does x)
Judy - look at yeast, compare/contrast with (e.g.) mouse - annotation much nearer complete for yeast; the relevant terms are available, whereas mammal annotation requires more extensive ontology additions
  • Possible action item: compare results for mouse-specific and yeast-specific sets
b. fuzzy term-term relationships
(no further discussion)

Going through sample suggestions on handout

- pilus retraction

  • Action item: for pilus biogenesis & pilus retraction, find bacterial expert(s) (e.g. Michelle) to check definitions, and whether pilus retraction should be part_of unidirectional conjugation

- multicellular organism reproduction

this one would be OK looking only at is_a relationships

- cell wall peptidoglycan ... (GO:0051672)

  • pass suggestion on to PAMGO group; looks OK to us

- lymphocyte anergy

  • ask Alex!

- neurotrophin production

use GO:0043524 'negative regulation of neuron apoptosis' (has narrow synonym 'neuron survival')
make neurotrophin production part_of GO:0043524
considering adding more general 'production' terms, but wait until we discuss it more -- existing production terms seem tied to experimental observations; also bringing in level of observation

Intervening discussion

Aside on synonyms - Gil suggests using thesaurus (e.g. NCI Thesaurus) to generate synonyms comprehensively; could implement some automation (but would have to do some manual mapping)

  • Action item: GO to look into mapping thesaurus to GO terms for use in synonym generation

Aside on technology - Marco suggests a tool that can suggest changes, and rank them based on computations using co-occurrence of terms for the same set of genes

- circumnutation

ask Jen & Tanya

When GO curators add a BP term, we ask ourselves what the closest existing GO term is (what kind of process is it?), then ask whether every instance of the process is part of an instance of the proposed part_of parent, or a type of the proposed is_a parent

[didn't catch a comment from Mike about one of the computational challenges, but it has to do with finding parent terms]

More on uses of NCI thesaurus to get better synonym coverage ...

links to literature
helps users
extend semantics of search (forgiving - tolerates non-exact matches)
can evaluate how 'good' synonyms are using info linked to thesaurus terms, GO terms
but Chris notes that GO wouldn't expect a correlation between number of "annotations" and synonym scope
piggyback on their updates
we have to estimate how big the mapping task would be
MIT group can explore possible contacts who could help; can spin as an extension of thesaurus itself, advantage to users, etc.
  • Action item: GO will go through sample set on handout; consult experts as needed; make changes; prepare report to share with MIT group explaining what changes we made and why (and if different from suggestion, also explain any reasons why we couldn't do exactly as suggested) so Harvard/MIT can learn about altering the computer tool that makes suggestions.
  • Action item: Harvard/MIT group to integrate the changes implemented by GO into the re-engineering paper and circulate to GO.
  • Action item: GO to let Harvard/MIT know when there's a new version of GO to evaluate against a previous one (and let H/M know which old version to compare to) for p-values of significant improvement

Information bottlenecks and entropy rate

Most dramatic bottleneck is in the function ontology, from level 12 to 13

Mike suggests looking at 'too general' nodes at the bottleneck level

GO asks MIT group if they can organize the list of too general nodes

project list onto graph; see if problems fall into particular area(s)
but what does GO mean by 'an area'?
answer: do problematic nodes have a common parent? or are lots of problem terms on one level?
one possible reason for bottleneck is graph and annotation history, e.g. children of brain development added much more recently than brain development itself
  • Action item: Harvard/MIT to get lists of (a) all nodes and (b) too general nodes for info bottleneck
  • Action item: Harvard/MIT group to propose "work groups" based on info bottlenecks

snippet from paper suggesting adding 'positive regulation of carbohydrate-transporting ATPase activity' etc.

how does that addition change the info content? it introduces children that extend into next level; hypothesis is that some annotations should be transferred from problem level to new children
bottleneck ameliorated because new child terms open up possibility of transferring information to another level
Chris - any GO node can potentially be expanded; do info metrics assume 'closed world'?
usually only makes sense to add child nodes at deep levels
David - do siblings at bottleneck level have regulation children? model for suggested additions?
we've differentiated sibs, but not problem terms (have sibling terms with and without child terms)
leaves should be at similar specificity
huge job! strategy has been to make new leaves as & when needed for annotation
info analysis - find bottleneck, and what causes it - as way to identify interest group topics

Found the function bottleneck!

biggest dip occurs at level 13, so bottleneck occurs at level 12
one term: cation-transporting ATPase activity (GO:0019829) deviates from avg by >2 SD (7.1 bits)
child GO:0046961 has lots of annotations
this is a true info bottleneck in biology -- lots of gene products ARE hydrogen ion-transporting ATPases
ChEBI might suggest restructuring
illustrates biology perspective vs. information distribution perspective
how do we reconcile them?

A bottleneck results from one level containing a few general nodes and several much more specific nodes

to fix, (a) move more specific nodes down; (b) split general nodes into more specific ones

Different kind of bias: several siblings on one level, of which one has more annotations than the rest

entropy rate
random walk over tree; how many times you end up at a particular node
if tree is efficient & balanced, get equal probability for each node on a level
can add weight to an edge, e.g. based on where it's going
a somewhat orthogonal metric that assesses how balanced the tree is
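
The notes do not record how the entropy rate was actually computed, so the following is only a simplified proxy for the balance idea: a top-down random walk from the root picks a child at each step (optionally by edge weight), and for each level the entropy of the arrival distribution is compared with the maximum possible for that number of nodes. The function names, the weight callback and the toy trees are all hypothetical.

```python
import math
from collections import deque


def arrival_probabilities(children, root, weight=lambda parent, child: 1.0):
    """Probability that a top-down random walk from `root` reaches each node.

    children: {node: [child nodes]} describing a tree (one parent per node).
    weight:   edge-weight callback; weights are assumed to be positive.
    Returns (prob, level) dictionaries keyed by node.
    """
    prob, level = {root: 1.0}, {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        total = sum(weight(node, k) for k in kids)
        for k in kids:
            prob[k] = prob[node] * weight(node, k) / total
            level[k] = level[node] + 1
            queue.append(k)
    return prob, level


def level_balance(prob, level):
    """Per level: entropy of arrival probabilities divided by log2(#nodes).

    A value of 1.0 means every node on the level is equally likely to be
    reached. Levels with a single node are skipped.
    """
    by_level = {}
    for node, lvl in level.items():
        by_level.setdefault(lvl, []).append(prob[node])

    balance = {}
    for lvl, probs in by_level.items():
        if len(probs) < 2:
            continue
        total = sum(probs)  # renormalise: walks that stopped at shallower leaves drop out
        entropy = -sum((p / total) * math.log2(p / total) for p in probs if p > 0)
        balance[lvl] = entropy / math.log2(len(probs))
    return balance
```

On a tree where every node has the same number of children, each level scores 1.0. A tree whose root has one child with a single child and another with three children scores below 1.0 at the second level, since half of all walks arrive at the single grandchild.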

Identify where bias is

where bias results from uneven biological knowledge
is there any way to compare GO with the literature? text mine Alberts ;)
our goal isn't zero bias, but we should strive for GO bias to parallel actual community knowledge bias
also want to track how bias changes over time
can run analysis on different versions/releases of GO
predict how much one vs another bit of ontology development would affect bias
follow trends - what are/were the hot topics

Question (David): in fitness curve - why did component get worse??

component tends to have high-level terms, each with long list of children
a good case to see whether results are different using only is_a