Minutes: annotation approaches, Stanford 2012

Monday morning (Feb. 27th) session

Case studies of state-of-the-art annotation approaches

Suzi: today's morning session focuses on:

what tools are available, where are the tools insufficient, what audience do they target, what are the needs for such audiences? Need to get a better picture: where are the gaps, what are the common elements for annotation tools;
annotation approaches.

Jim Hu's talk on CACAO

[His slides are very clear and comprehensive, and can be found on the wiki] Basic problems:

getting people involved;
quality of the work being done.

The key point is NOT to expect students to be full-blown curators.

This type of annotation has been growing.

It takes time to review the work, so a multiple-challenge system was implemented, to be able to begin new "challenges" while previous ones are still being judged.

Need to recruit more people.

Some problems were dealt with by banning students from using several terms and evidence codes.

An important point to consider for the GOC: do we want students to submit new term requests, or not?

UniProt IDs and taxonomy: sometimes it's impossible to figure out what UniProt ID to annotate.

We may not be experts in some of the areas that we're reviewing, so some things need to be reviewed externally.

This was found to be useful because, for instance, we may not feel confident on the eukaryotic stuff, so we report the suggested annotations, attach brief statements and paper figures to show why students made some assertions, and have this checked externally (e.g. by Alex Mitchell at InterPro and by Emily).

Rama: why would you want your work to be checked again?

Brenley: because we don't feel confident in annotating e.g. mammals.

Emily: The external person then does not need to read everything, but can focus on the paper figure/statements that are attached to the report, so their job is faster.

Jim: There's a lot of pressure from having to grade everything.

Suzi: a "second pair of eyes" is a mechanism of QC.

Ruth: some attached text helps the expert curator to review students' work more quickly; you still need to review students' work.

Paul T: with this system, you need to be able to rely on a smaller court that says "this amount of work doesn't need to be reviewed"; it would be interesting to consider that too, i.e., to evaluate the work of the smaller court.

Brenley: students are much harder on each other than she'd be, and prone to mark things as unacceptable that could instead be changed a bit and would then be ok.

Judi: looking at efficiency, at allocating funding... what would the points be to take forward? To extend these teaching experiences or instead fund curators more?

Jim: the students are doing more than we do.

Paul T: ballpark figure?

Brenley: they're generating about 750 annotations in a semester (2 months).

Peter: managing students; this is a model worth thinking about (how about involving colleagues?).

Pascale: students could also go and explain GO.

Judi: if GO had tools able to put in annotations, it would become another mechanism to support community annotation efforts.

Val Wood's presentation: CANTO, the pombe community annotation tool

The community was really positive about this approach.

Comprehensive phenotypes: we hope to get these back from the community.

There is a test version available for all, and one specific for GO.

Emily: items nice to include?

Val: developer still working on stuff.

Emily: when can it be made available?

Val: needs a few tweaks; also, where to put it?

Emily: would be nice to be able to visibly promote it, then have it tested by expert curators.

Val: also, we've got annotation extensions working; feedback?

Paul T: what do you see as an active curator's role?

Val: review, QC, consistency... new term requests; we're going to get many GO term and phenotype requests. Based on the pilot, it's cut curators' work down by 90% and made them more productive; you get better annotations. Also, we have done 2 undergraduate practicals, and students were even better than us curators at finding GO terms in a paper, because of using the tool!

Suzi: 2 points: integration; coupling between ontology development and annotation. Smoother integration between these two by using the new tool.

Claire's talk - UniProt annotation pipeline

What does UniProt actually do?

40 curators - but this number will grow.

Have been involved with GO for a long time.

UniProtKB files: 2 different ones, ask Emily for details, will become available with next release.

InterPro not applied in a taxonomic way; ask Prudence for details; it's being filtered.

Unirules: doing more and more automatic annotations; but very taxonomy-specific; possibly could move to this approach in the future.

UniPathway: collaboration with a French university; Yasmin works on this along with Anne Morgat.

Key point: when we standardize, we investigate what this means for GO.

Rolf: to clarify: Anne Morgat and SIB are involved in a project called Microme on microbial metabolism; not a long-term viable model?

Peter: GO hasn't yet touched "other" pathways, such as bacterial and some plant metabolism.

Claire: we're asking for new terms in GO based on this, and using them!

Paul T: how incomplete are the GO terms for this part?

Yasmin: about half of them are in place already.

David: 290 missing GO terms are not that many to add.

Emily: as soon as these terms are available, we'll be able to generate annotations for them immediately.

Judi: real benefit from this.

Suzi: would be really interesting to run some of the QC checks.

Harold's talk: MGI extended annotation

Can annotate to several different ontologies and fields including cell, anatomy, target.

Annotations are grouped together as 'stanzas' to make a sentence/story.

Mixes in vitro with in vivo data (e.g. cell lines), which is problematic. This data predates column 16 and they are trying to retrofit, but a layer of manual curation is required for relating annotations. Some inconsistency of use. Some proportion could be safely migrated automatically; the remainder could be worked through manually, starting with the terms with the most annotations.
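The triage suggested here could be sketched roughly as follows. This is only an illustration, not MGI's actual procedure: the record fields (`term`, `safe_to_migrate`) and the sample data are hypothetical, how an annotation is judged safe for automatic migration is left as an externally supplied flag, and "most annotations" is assumed to mean the annotations still awaiting manual review.

```python
from collections import Counter


def triage(annotations):
    """Split annotations into an automatic batch and a manual work queue.

    Each annotation is a dict with two hypothetical keys:
      'term'            - ID of the term the annotation points to
      'safe_to_migrate' - assumed flag: True if automatic migration is safe
    """
    automatic = [a for a in annotations if a["safe_to_migrate"]]
    manual = [a for a in annotations if not a["safe_to_migrate"]]

    # Count the annotations left for manual review per term, then order the
    # terms so that those with the most annotations come first.
    counts = Counter(a["term"] for a in manual)
    manual_queue = [term for term, _ in counts.most_common()]
    return automatic, manual_queue, counts


# Made-up sample data, for illustration only.
sample = [
    {"term": "TERM:A", "safe_to_migrate": True},
    {"term": "TERM:A", "safe_to_migrate": False},
    {"term": "TERM:B", "safe_to_migrate": False},
    {"term": "TERM:B", "safe_to_migrate": False},
    {"term": "TERM:C", "safe_to_migrate": True},
]

automatic, manual_queue, counts = triage(sample)
print(len(automatic), manual_queue)
```

Ordering the manual queue by annotation count means each reviewed term clears as many outstanding annotations as possible early on.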

Ontology Development in line with Annotation

Paola's talk - overview of lessons from apoptosis project

Terms not finished yet - annotation can't really start until terms are finished.

Project initiated by APO-SYS consortium. Content meeting in July. Pablo started in September - he is a dedicated curator and annotator, and up to date with the literature. This was very helpful. He also developed a set of guidelines for annotations to help annotators choose correct terms. Also adding comments to terms to aid annotation.

Unlike say cardiac conduction, there were already a lot of legacy apoptosis terms with many annotations. This has to be considered when making changes to ontologies - i.e. how annotations will be moved, whether re-annotation is required.

Top-level terms that needed reorganising. Using the Reactome model for the reorganisation.

Workflow for ontology development. Essentially an iterative process with experts.

One problem was that the literature was often inconclusive, so expert input was required.

Pablo has started curation already; this has helped figure out where terms are required.

Annotator input is required for 'problematic papers' where authors only refer to the assay. Difficult to decide which term to use, protein complexes = how specific to add? Regulation - direct or indirect?

IMPORTANT: Is the author simply describing the assay, or actually interpreting the underlying biological process? e.g. regulation of caspase activity vs. caspase regulator?

Ontology development - inherently slow? Should we have these big projects, which are slow, or just make terms responsively? Handling individual requests is slow and ends up with a poor structure.

Lots of reading required for a good ontology structure.

Paul S: Developments done at two levels - existing bad terms with annotations, or entirely new terms. For former case, is there a point (80%?) at which annotators should just start annotating? 2 months?

Paola: we estimated that the wait should be 6 months

Pascale: 6 months is too long for a core process like apoptosis

Jane: could we mark terms/branches as complete as we go along? Paola has been using a dbxref.

David: it's a trade-off - annotators usually want the most specific term, not the top-level terms.

Paul: incremental releases of some sort required. Annotation reuse, migrating existing annotations.

Emily: Might be worth flagging doomed terms as not to be used in annotation.

Karen: Expert input - with apoptosis, interest waned after the first couple of meetings/emails. Useful to have a friendly expert.

Suzi: Interaction with Reactome - could we have joint meetings with experts?

Val: Edit first, curate later. Not always possible; can we add specific terms, e.g. under root, while development is continuing?

Rama's talk - transcription ontology and annotation

Reasons for overhaul - badly defined terms, badly placed, missing parents, ambiguous terms.

The aim was to provide HOW the function acts within the process.

Resulted in some very long terms. Often with the complex terms the information required to annotate to them is not entailed in one paper. SGD invented a new evidence code for concatenating information from different papers so these terms could be used.

Annotators found these terms difficult to interpret and use (e.g. long version of 'repressor')

Pascale: if we can't understand these terms, how do we expect our users to?

Suzi/David: it may be that these terms are easier to understand by walking down the tree, interpreting is easier in context of parents.

Val: will the very long terms eventually be replaced by something more intuitive to biologists?

David: we don't know; when you try to reduce the complexity of term names you introduce ambiguity.

Val: the correct specific phrases don't exist in the literature - perhaps if the community standardises its nomenclature, we could replace our complex text strings.

Paul S: perhaps having examples is the answer?

David: Karen always explained these terms with examples, but we didn't include these in the definitions.

Ben: need a better interface 'it looks like you're trying to annotate a transcription factor, did you mean protein or DNA binding?'

Ruth: are there any transcription factors that don't bind proteins?

Karen: yes (RNA example).

Karen: you don't need to have a long-standing relationship with experts - we often asked individual questions and got a good response. At SGD, only annotations to terms that became obsolete were re-annotated. There are a lot more TFs that could be re-annotated to more specific terms.

Eva: Is there a place for context-specific synonyms to display to the users?

Paul: is there a way that we can simplify the process of finding the correct terms?

Paul S: better tools for suggesting terms - e.g. 'also used' suggestions, etc.

Suzi: rather than showing existing annotations, which would cause bias, we should put examples in the definitions.

David: trade-off between complexity/specificity and interpretability

Karen: we have added comments to terms, e.g. if the author doesn't mention which polymerase they are working on, it is almost certainly POL II.

Ruth: it would be helpful to have precise comments on, e.g., how far a promoter can be from the start codon and still count as proximal.

Mike's summary

source of information
- can be difficult to interpret
time
- of creating structure
- of release
- definitions
experts
- lose attention/commitment
terms
- complexity - too complex for curators to use
- too complex for one paper
- consistency
- training annotators to use