Common Annotation Framework Specification

Working Meeting, Caltech, June 4th-8th, 2012

Agenda for CAF/CAT Meeting

Logistics for CAF/CAT Meeting

Software Group Call, May 23, 2012

Time:

08:00 California time / 10:00 Central time / 11:00 Eastern time / 16:00 UK time / 17:00 Central European time.

Phone numbers to call:

Toll-free USA number 1-866-953-9688 (US Toll number 1-212-548-2460 in case of problems with 866 number)
Toll-free UK 0808 238 6001 (toll number: 646 834-9311)
Toll-free Switzerland 0800 562 830 (toll number: 646 834-9311)

Participant Pin: (801-561)

Common Annotation Framework Specifications

User Interface

General considerations

The curation interface should be modular: each module displays a different type of information, and curators can customize the display, e.g. by moving, resizing, or closing modules.

Identifiers

What identifiers can be used for querying and annotation?
What will be the source of the identifiers?
How often will the list of valid identifiers be updated?

Internal Modules

Curator login
- Curator login would allow for a database/project-specific curation interface and indicate the curator's level of training
- For example, when a curator logs in as a WormBase curator, the tool would "load" WormBase gene IDs, Variation IDs, Cell type IDs, Anatomy term IDs, etc.; the GO version would be taxon-restricted for nematodes
- Should curation entities be accessed on-the-fly from project databases or stored in the GO database/central repository?
- Will curators ever annotate species outside the project for which they are primarily responsible? This would affect which IDs are loaded.
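One way to picture the login behaviour described above is a per-project profile that the tool looks up when a curator signs in. The sketch below is only an illustration: `ProjectProfile`, `CuratorSession`, `PROFILES`, and `login` are hypothetical names, and the single registry entry simply restates the WormBase example from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectProfile:
    """What the tool would load for one project (hypothetical structure)."""
    name: str
    entity_types: tuple        # kinds of identifiers to load at login
    go_taxon_restriction: str  # label for the taxon-restricted GO view


@dataclass(frozen=True)
class CuratorSession:
    curator: str
    training_level: str        # free-text label; no scale is defined in the spec
    profile: ProjectProfile


# Hypothetical registry; the only entry mirrors the WormBase example in the text.
PROFILES = {
    "WormBase": ProjectProfile(
        name="WormBase",
        entity_types=("gene", "variation", "cell type", "anatomy term"),
        go_taxon_restriction="nematodes",
    ),
}


def login(curator: str, project: str, training_level: str) -> CuratorSession:
    """Open a session whose identifier sets depend on the curator's project."""
    if project not in PROFILES:
        raise ValueError(f"no curation profile configured for {project!r}")
    return CuratorSession(curator, training_level, PROFILES[project])
```

If curators do annotate outside their primary project, a session could hold several profiles instead of one; that question is left open here, as in the list above.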
Papers
- Allows curator to enter a paper identifier (e.g. PMID, MOD identifier, Agricola identifier, etc.) and retrieve bibliographic information
- Additional paper-related information that might be useful: gene names/IDs, variation names/IDs - curation for a specific paper could be restricted to a list of gene names or entities present within that paper. Drawback: new gene names may not be recognized or approved for use.
Basic form functions
- Query - paper, entity, GO term, project, comments (basically query any annotation field)
  - Display 'metadata' for a given query object, e.g. paper bibliographic information
  - Query of a gene product would show all annotations - manual, inter-ontology inferences, IEAs, PAINT
- Annotate - new, duplicate, edit, delete
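The query and duplicate functions listed above can be sketched over a plain in-memory list of annotations. This is a minimal illustration, not a proposed design: the field names, the `query` and `duplicate` helpers, and the placeholder identifiers are all assumptions made for the example.

```python
def query(annotations, **criteria):
    """Return every annotation whose fields equal all of the given criteria."""
    return [
        a for a in annotations
        if all(a.get(field) == value for field, value in criteria.items())
    ]


def duplicate(annotation, **changes):
    """Copy an annotation, overriding selected fields; the original is untouched."""
    return {**annotation, **changes}


# Placeholder records for illustration only (not real annotations).
annotations = [
    {"db_object_id": "GENE:1", "go_id": "GO:0000001",
     "reference": "PMID:1", "evidence_code": "IDA"},
    {"db_object_id": "GENE:1", "go_id": "GO:0000002",
     "reference": "GO_REF:1", "evidence_code": "IEA"},
    {"db_object_id": "GENE:2", "go_id": "GO:0000001",
     "reference": "PMID:1", "evidence_code": "IMP"},
]

by_gene = query(annotations, db_object_id="GENE:1")    # manual and IEA together
by_paper = query(annotations, reference="PMID:1")      # everything from one paper
copy = duplicate(annotations[0], go_id="GO:0000003")   # same gene and paper, new term
```

Because `query` accepts any field name, it covers the "query any annotation field" requirement without a separate function per field.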

Annotation in the context of GAF 2.0
- 1 DB - Automatically populated based upon curation entities file loaded
- 2 DB Object ID - Autocomplete; Source: participating consortium groups; Frequency of DB Object ID Updates
- 3 DB Object Symbol - Autocomplete if search, otherwise autopopulate; Source: participating consortium groups; Frequency of DB Object Symbol Updates
- 4 Qualifier - Drop-down menu; will need Point-of-Entry and Legacy QC checking
- 5 GO ID - Autocomplete from latest ontology files; also display GO term
- 6 DB:Reference (|DB:Reference) - Bibliographic data from PubMed (additional data from MODs? e.g. genes studied)
- 7 Evidence Code - Drop-down menu; Source: evidence code ontology; will need Point-of-Entry and Legacy QC checking; dependencies
- 8 With (or) From - Identifier Sources: may be project dependent; will need Point-of-Entry and Legacy QC checking; dependencies
- 9 Aspect - Automatically populated based upon GO term selection
- 10 DB Object Name - Source: Project/MOD identifiers
- 11 DB Object Synonym - Source: Project/MOD identifiers
- 12 DB Object Type - drop-down or prepopulate based upon identifier used?
- 13 Taxon (|Taxon) - drop-down or prepopulate based upon identifier used?
- 14 Date - Prepopulate
- 15 Assigned by - Prepopulate based upon curator login
- 16 Annotation Extension - Source: Project/MOD identifiers as well as External identifiers (e.g. ChEBI IDs); will need Point-of-Entry and Legacy QC checking; dependencies
- 17 Gene Product Form ID - Source: Project/MOD identifiers; will need Point-of-Entry and Legacy QC checking; dependencies
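The 17 columns above map directly onto a tab-delimited record. The following sketch shows one possible in-memory representation and the automatic population of the Aspect column from the GO term's namespace. The field names, `parse_gaf_line`, and `aspect_for` are illustrative assumptions, and the strict 17-column check is a choice made for this example.

```python
# Column order follows the GAF 2.0 list above; the key names are assumptions.
GAF_COLUMNS = (
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
)

# Column 9 can be derived from the namespace of the selected GO term.
ASPECT_BY_NAMESPACE = {
    "biological_process": "P",
    "molecular_function": "F",
    "cellular_component": "C",
}


def parse_gaf_line(line):
    """Split one tab-delimited GAF line into a dict keyed by column name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(GAF_COLUMNS):
        raise ValueError(
            f"expected {len(GAF_COLUMNS)} columns, got {len(values)}"
        )
    return dict(zip(GAF_COLUMNS, values))


def aspect_for(namespace):
    """Return the one-letter Aspect code for a GO namespace."""
    return ASPECT_BY_NAMESPACE[namespace]
```

Empty optional columns stay as empty strings, so a parsed record can be written back out without losing column positions.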

Annotation Expressivity - LEGO-style annotations
- Multiple Column 16 entries

Annotation quality control checks
- Run error reports in real time and flag the affected annotation lines
- Point-of-entry QC checks
  - Required fields are complete
  - Valid identifiers used
  - Reciprocal annotations made when necessary (IGI, IPI)
  - Dependencies are correct, e.g. correct relation used with GO term and Column 16
  - Other existing hard and soft QC checks
- Legacy annotation QC checks
  - Check for point-of-entry errors in annotations made prior to the CAF tool
  - Obsolete GO terms

Curator Comments
- Controlled vocabulary, topics
Annotation Complete as of YYYY-MM-DD

External Modules

PANTHER Families and PAINT tool (?)
- Curators would like the ability to view PANTHER families and PAINT annotations
- PAINT annotation - stand-alone or part of the CAF tool?
Ontology development - links to SourceForge, TermGenie
On-line help (User's Guide)
- Link out to GO Nuts?
Text mining: On-the-fly data-type flagging, entity recognition, CC and MF text mining results
- Paper viewer that allows for mark-up of full text, association of text in paper with a specific instance of a GO annotation
- Pre-populating the curation form based upon text-mining results. Options: populate automatically based upon mark-up; curator clicks on an entity in the paper and selects a field to populate; drag and drop from the paper to a data entry field

Curation Input

GAF 2.0
GO Database
Other formats?

Curation Output

GAF 2.0
Other formats?
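For the GAF 2.0 output, a minimal writer might look like the sketch below. It assumes annotations are held as dicts keyed by column name (the key names and `to_gaf` are hypothetical), emits the `!gaf-version: 2.0` header line, and prepopulates the Date column in YYYYMMDD form when it is empty.

```python
from datetime import date

# Column order of GAF 2.0; the key names are assumptions for this sketch.
GAF_COLUMNS = (
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
)


def to_gaf(annotations, today=None):
    """Serialize annotation dicts to GAF 2.0 text, one line per annotation."""
    stamp = (today or date.today()).strftime("%Y%m%d")
    lines = ["!gaf-version: 2.0"]
    for annotation in annotations:
        row = dict(annotation)
        if not row.get("date"):
            row["date"] = stamp  # Column 14 is prepopulated when missing
        lines.append("\t".join(row.get(column, "") for column in GAF_COLUMNS))
    return "\n".join(lines) + "\n"
```

Passing `today` explicitly keeps the output reproducible, which helps when comparing exported files between runs.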

Questions for Each MOD/Participating Database - November 2011 - February 2012

Accessible to all curators in the GO Consortium

What type of curation tool do you use (i.e., web-based, spreadsheet)?

Did you develop this tool in-house or are you using a tool developed by another group?

How many curators use this tool?

Annotation Format - GAF/GPAD or LEGO?

What annotation format does your tool support? (Make this multiple choice?)

Integration with GOLD?

What database do you use for storing annotations?

Allows curators to supply annotation data in line with the most expressive GO annotation format agreed by the GO Consortium

Do you use all columns of the current GAF (i.e. 2.0) for your annotation? If not, why?

Carries out all approved annotation quality checking on-the-fly to fully support curators

What quality control checks are you running before submission (i.e. Mike's checking script, valid organism-specific identifiers, etc.)?

Integrated with inferred annotation pipeline

Are you able to include inferred annotations, such as PAINT, F-P links, etc.?

Text mining curation support

Do you use any text mining tools for GO curation? If so, which ones and how do you incorporate them into your curation pipeline?

Features to support the curator: inclusion of Reference Genome targets, annotation dispute handling

Are you able to indicate information such as Reference Genome targets? How does your tool handle annotation disputes?

Other features to ask about:

To what entities do you annotate: genes, proteins, transcripts, complexes?

Ease of duplicating information for making multiple annotations to one entity

How often does your tool update the ontology, the evidence codes, database identifiers?

Autocomplete for gene names, publication IDs, GO terms, evidence codes?

Please provide some documentation or a demonstration of your tool; screenshots or YouTube videos would be great.

If using Column 16, what other ontologies/vocabularies do you use?