Manager Call 2016-06-1

From GO Wiki


Identifier Space in GO Annotations

  • In response to the May 18th call's discussion on gene and gene product identifier space (see minutes), I've put together a spreadsheet that documents our current practice for GAF and GPAD with respect to:
    • Annotated Entity IDs
    • With/From Entity IDs (note only for gene and gene product)
    • Annotation Extension Entity IDs (note only for gene and gene product)
    • Annotation Isoform Entity IDs
  • Then, for the purposes of discussion, I also added two other possible approaches:
    • Gene IDs only
    • Broad range of gene, transcript, protein, protein complex entity IDs
  • At the top of the spreadsheet are three general questions that we need to consider - there may be more; please add if needed
  • The plan was to review the different approaches, debate the pros and cons and then either get more feedback or finalize the proposal for presentation on an annotation or all-hands call

Review action items from Geneva meeting, and add items to Trello if necessary

Periodic review of the Trello board


Attendees: Chris, David H, Kimberly, Moni, Paola, Paul T.

Regrets: Moni Munoz-Torres (Teaching 9th & 10th graders about research and the scientific method from 7:00AM - 9:30AM PDT).

Agenda: Paola; Minutes: Kimberly

Identifier Space in GAF and GPAD

  • We discussed different options for what to use as gene and gene product identifiers in GAF and GPAD.
  • Much of the discussion centered on the costs and benefits, for curators and users, of annotating to gene or gene-centric protein identifiers versus more specific, granular identifiers such as UniProtKB protein isoform IDs or PRO IDs for modified forms of proteins.
  • There is currently an important distinction between GAF and GPAD in that GPAD specs indicate that Column 2 can use the more granular identifier, e.g. P34187-1, while in GAF Column 2 uses canonical identifiers for gene, protein, ncRNA, or protein complex.
  • Curators may want to capture the most granular information possible, but what use cases do we have for that more granular information?
  • Enrichment analysis still seems to be the more common use case for GO annotations, and for that most users still use just gene or gene-centric annotations.
  • Possible proposal:
    • GAF:
      • Column 2 - would stay as is using identifiers for genes, gene-centric protein set, ncRNAs, and protein complexes
      • With/From and Annotation Extensions: curators could use whatever identifiers they want, but their annotation group must provide a Gene Product Information (GPI) file that would allow users to map those identifiers to a canonical, or parent, ID as for Column 2
    • GPAD:
      • Column 2 - would stay as is with ability to use most granular identifier
      • With/From and Annotation Extensions: same as above
  • LEGO models would essentially use a GPAD approach, as curators can use the most granular ID to indicate enabled_by entities
  • Question 1: Should GO provide a digested GAF that contains only canonical IDs in all columns of the GAF (except Column 17)?
  • Question 2: How should mapping of protein complex members be handled? We probably want a mechanism for mapping between gene or gene-centric protein IDs and protein complexes, and then for automatically unfolding annotations to each member of a complex, using the contributes_to qualifier for MF.
  • Question 3: Do we need to establish guidelines for curators who want to use more granular identifiers in AEs to make sure the IDs used correspond to the correct entity given the GO term used? How much effort should be put into this?
  • Question 4: If we allow any gene or gene-centric identifier in AEs or With/From, what effect does this have on error checking?
  • AI: Determine whether there are currently use cases for the more granular gene or gene product information in AEs and With/From. Consult with Val and Ruth on this.
  • AI: More generally, look for examples of AE usage in the literature.
  • AI: Need to check with Sandra Orchard about how protein complex mappings to gene or gcrp IDs are currently handled.
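
To make the possible proposal and Question 1 more concrete, here is a minimal sketch of how a "digested" annotation could be produced from a parent mapping of the kind a GPI file would supply. It is an illustration under assumptions, not an agreed design: the dictionary field names (object_id, with_from, extensions, form_id), the PARENT_OF table, and both function names are hypothetical, and real GAF/GPI parsing is left out. The isoform ID is the P34187-1 example mentioned above.

```python
# Hypothetical stand-in for the parent mapping a GPI file would provide:
# granular ID -> canonical (parent) ID. Illustrative data only.
PARENT_OF = {
    "UniProtKB:P34187-1": "UniProtKB:P34187",
}


def to_canonical(entity_id, parent_of):
    """Return the mapped canonical ID, or the ID unchanged if none is mapped."""
    return parent_of.get(entity_id, entity_id)


def digest_annotation(annotation, parent_of):
    """Return a copy of a simplified annotation record with canonical IDs.

    With/From IDs and annotation extension targets are mapped to their
    parents. The isoform field (the Column 17 equivalent) is left untouched.
    """
    digested = dict(annotation)
    digested["object_id"] = to_canonical(annotation["object_id"], parent_of)
    digested["with_from"] = [
        to_canonical(entity_id, parent_of)
        for entity_id in annotation["with_from"]
    ]
    digested["extensions"] = [
        (relation, to_canonical(entity_id, parent_of))
        for relation, entity_id in annotation["extensions"]
    ]
    return digested


# Simplified record; field names are assumptions, not GAF column names.
example = {
    "object_id": "UniProtKB:P34187",
    "with_from": ["UniProtKB:P34187-1"],
    "extensions": [("has_input", "UniProtKB:P34187-1")],
    "form_id": "UniProtKB:P34187-1",
}

digested_example = digest_annotation(example, PARENT_OF)
```

In this sketch an ID with no entry in the mapping passes through unchanged; whether that case should instead be flagged as an error is part of what Question 4 asks.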