Annotation Conf. Call 2015-10-13

From GO Wiki
Revision as of 11:26, 3 November 2015 by Vanaukenk (talk | contribs) (Use MOD identifiers in col-16 (DavidH))


Welcome Alex (MelanieC)

Use MOD identifiers in col-16 (DavidH)

I'd like to propose that whenever we refer to a gene object from one of the groups that contributes annotations to the GOC, we use that group's identifier in column 16 values. Currently, groups refer to gene objects in several different ways: by NCBI_Gene identifiers (e.g., Onecut3), ENSEMBL identifiers (e.g., Arntl), or MGI identifiers (e.g., Dlx6os1).

DoNot Annotate annotation property (MelanieC)

I would like, if possible, to discuss an old ticket on the GOA Jira, "do not automatically annotate". I believe this was initially suggested by Rachael/Ruth and discussed in Texas. The use case was preventing some electronic pipelines, such as the Ensembl one, from automatically projecting annotations that we think could be valid only if manually reviewed. Ruth's example was human behaviour, which can't be inferred from the same behaviour in mice but could nonetheless be valid if manually annotated based on another existing paper.

Current status:
  • implementing a "do not automatically annotate" tag is not trivial
  • the only use case identified is the human behaviour one described above
  • we (GOA) are taking over the Ensembl pipeline

I would like to know if other curators can foresee a use case for "do not automatically annotate", in which case it may be worth considering adding it. Otherwise, I would like to suggest we don't.

Curation consistency exercise

ZFIN's turn to pick a paper.


Use MOD identifiers in col-16 (DavidH)

  • The original issue involved use of gene identifiers in Annotation Extensions, but this led to a more general discussion of entity ID space as well as the semantics of ID use in the With/From and Annotation Extension fields.
  • David had requested that when curators outside of MGI need to refer to a mouse gene they use MGI gene IDs, as opposed to another gene ID, e.g., ENSEMBL or NCBI_Gene.
  • For Protein2GO users, this would mean having Tony add the MGI database prefix as a database option in the tool, so curators can enter MGI gene IDs.
  • This seemed reasonable to people, but led to a broader discussion of how we use ID space in GO annotations.
  • For the subject of an annotation (i.e., Column 2 in the GAF), we generally use the following types of IDs in our annotation files:
    • UniProtKB protein accessions
    • MOD gene IDs
    • RNACentral IDs for ncRNAs
    • IntAct complex IDs
  • A few years ago, the GOC agreed that reference UniProtKB accessions and MOD/ENSEMBL gene IDs would effectively be interchangeable: an annotation to a gene would mean that one or more products of that gene are associated with that annotation, and an annotation to a reference UniProtKB accession would mean that one or more proteins associated with that accession are associated with the annotation.
  • If isoform-specific information was relevant, curators could add the isoform-specific ID or accession to Column 17.
  • In annotation extensions, however, are we being more specific wrt the domain and range of values? That is, depending on the GO term, curators might enter a gene ID if the object of the annotation extension is a gene, a transcript ID if it is a transcript, and a protein ID if it is a protein.
  • The result of this, though, is that we might then be using different semantics for the subject and object of the annotation; do we need to be consistent?
  • Several points to consider:
    • Existing annotation constraints (i.e., domain and range) for annotation extensions that allow only certain ID spaces
    • Isoform-specific information for Annotation Extensions - how should this be captured?
    • Human understandable vs machine readable Annotation Extensions
      • Humans may be able to infer that if the object of an mRNA binding annotation is a gene ID, then the in vivo object is the mRNA transcript of that gene, but how would a machine know that? Are there cases where this might not be clear or intuitive, even to a human?
    • Besides the semantic issues, what are possible consumer issues for having different types of IDs in our files?
  • ACTION ITEM: Review our use of ID space as the subject and object of annotations and then document for curators what is going to be standard GOC practice. (Note that we should also review With/From column IDs, but Rachael had collated what was used in the With/From a year or so ago and we can refer to that for this discussion.)
    • Examples of Annotations to ncRNAs (miRNAs) with Extensions from Rachael (note the different types of GO terms and specific annotation extension relations):
      • Object: human miR-21 (RNACentral:URS000039ED8D_9606); GO term: gene silencing by miRNA (GO:0035195); Annotation Extension: regulates_expression_of human SPRY2 (Ensembl:ENSG00000136158)
      • Object: human miR-21 (RNACentral:URS000039ED8D_9606); GO term: mRNA binding involved in posttranscriptional gene silencing (GO:1903231); Annotation Extension: has_direct_input human SPRY2 (Ensembl:ENSG00000136158)
      • Object: mouse miR-21a (RNACentral:URS000039ED8D_10090); GO term: negative regulation of translation involved in gene silencing by miRNA (GO:0035278); Annotation Extension: regulates_translation_of mouse Tgfbr3 (Ensembl:ENSMUSG00000029287)
    • Other comments from Rachael:
      • I'm using RNAcentral Ids as the DB_Object in column 2 and in Col. 16 I'm currently using Ensembl gene Ids (I did start out using Ensembl transcript Ids for the specific mRNA targeted, but somewhere along the line David OS advised using gene Ids for regulates_expression_of, regulates_transcription_of and regulates_translation_of).
      • However, coming back to Melanie's point about what identifiers users would expect to see, I think in the case of miRNA targets it would be either gene or transcript Ids that people would expect to see. Looking at some of the miRNA target prediction databases, they mostly use NCBI transcript Ids, sometimes NCBI gene Ids or Ensembl gene Ids, for the targets.
    • Examples of annotations requiring specific isoforms in annotation extensions
      • A specific transcription factor isoform, FOS-1A, regulates expression of one of three mig-10 transcripts, mig-10b, during anchor cell invasion in C. elegans
      • Column 17 would capture the isoform for FOS-1A, but how would we capture the isoform of mig-10 unless transcript IDs were used?
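
As a starting point for the ID space review, the sketch below shows one way to tally which database prefixes are used with which annotation extension relations in column 16 values. It assumes the usual relation(DB:ID) form, with "," joining extensions and "|" separating alternatives. The PREFIX_KIND mapping and the function names are illustrative assumptions, not agreed GOC practice.

```python
import re
from collections import Counter

# Illustrative assumption: which kind of entity each ID prefix denotes.
# This is not an agreed GOC list; extend or correct it as needed.
PREFIX_KIND = {
    "MGI": "gene",
    "Ensembl": "gene",
    "NCBI_Gene": "gene",
    "UniProtKB": "protein",
    "RNACentral": "ncRNA",
    "IntAct": "complex",
}

# relation(PREFIX:LOCAL_ID); the local ID may itself contain ':' (e.g. MGI:MGI:97490)
EXTENSION_RE = re.compile(r"^\s*(\w+)\(([^():]+):([^()]+)\)\s*$")


def parse_extensions(col16):
    """Split one column-16 value into (relation, prefix, local_id) triples."""
    triples = []
    for group in col16.split("|"):      # '|' separates alternative extension sets
        for item in group.split(","):   # ',' joins extensions within one set
            if not item.strip():
                continue
            match = EXTENSION_RE.match(item)
            if match is None:
                raise ValueError(f"unparseable extension: {item!r}")
            triples.append(match.groups())
    return triples


def prefix_usage(col16_values):
    """Count how often each (relation, prefix) pair occurs across many values."""
    counts = Counter()
    for value in col16_values:
        for relation, prefix, _local_id in parse_extensions(value):
            counts[(relation, prefix)] += 1
    return counts


example = "regulates_expression_of(Ensembl:ENSG00000136158)"
for relation, prefix, local_id in parse_extensions(example):
    print(relation, prefix, PREFIX_KIND.get(prefix, "unknown"))
```

Running prefix_usage over the column 16 values of a real annotation file would show, per relation, whether gene, transcript, or protein ID spaces are actually in use, which is the information needed before documenting a standard.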

DoNot Annotate annotation property (MelanieC)

  • GOA is taking over the ENSEMBL GO pipeline
  • One outstanding issue wrt the ENSEMBL pipeline is the automatic transfer of annotations amongst orthologs that could be overstepping what would be considered a reasonable inference
  • An example of this is transferring behavior-related annotations from mouse to human
  • Are there other examples that would justify the work needed to implement a 'Do Not Automatically Annotate/Transfer' tag for some GO terms?
  • We don't want to have separate terms in the ontology for human behavior, mouse behavior, etc.
  • ACTION ITEM: If you have other examples of annotations that should not automatically be transferred, please share them so the cost/benefit of having a 'Do Not Automatically Annotate/Transfer' type of tag can be evaluated.
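
To make the cost/benefit discussion concrete, here is a minimal sketch of how such a tag might act as a filter in an ortholog projection step. It is not the Ensembl or GOA pipeline: the Annotation class, the project_to_ortholog function, the tagged-term set, and the gene IDs are all hypothetical. A real implementation would presumably also need to block descendants of a tagged term, which this sketch does not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_id: str
    go_id: str
    evidence: str


# Hypothetical set of GO terms carrying a 'do not automatically transfer' tag.
# GO:0007610 is 'behavior'; descendants are NOT handled in this sketch.
DO_NOT_AUTO_TRANSFER = frozenset({"GO:0007610"})


def project_to_ortholog(source_annotations, ortholog_of, blocked=DO_NOT_AUTO_TRANSFER):
    """Project annotations onto orthologs, skipping tagged GO terms.

    ortholog_of maps a source gene ID to its target-species ortholog ID.
    Projected annotations are given the electronic evidence code IEA.
    """
    projected = []
    for ann in source_annotations:
        target = ortholog_of.get(ann.gene_id)
        if target is None:
            continue  # no ortholog known, nothing to project
        if ann.go_id in blocked:
            continue  # tagged term: leave to manual curation
        projected.append(Annotation(target, ann.go_id, "IEA"))
    return projected


# Made-up gene IDs, for illustration only.
mouse_annotations = [
    Annotation("MGI:MGI:0000001", "GO:0007610", "IMP"),  # behavior, should be skipped
    Annotation("MGI:MGI:0000001", "GO:0006355", "IDA"),
]
orthologs = {"MGI:MGI:0000001": "HGNC:0000001"}
print(project_to_ortholog(mouse_annotations, orthologs))
```

The filter itself is simple; the non-trivial part noted in the agenda is deciding where the tag lives and how it propagates through the ontology.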

Curation consistency exercise

  • Will be the topic of the 2015-10-27 annotation call.
  • ZFIN is next on the list for choosing a paper.
  • ACTION ITEM: Sabrina will send around ZFIN's selection.