Difference between revisions of "Annotation Conf. Call 2015-10-13"

From GO Wiki

Revision as of 04:55, 14 October 2015


Welcome Alex (MelanieC)

Use MOD identifiers in col-16 (DavidH)

I'd like to propose that whenever we refer to a gene object from one of the groups that contributes annotations to the GOC, we use that group's identifier in column 16 values. Currently, groups refer to gene objects in several different ways: sometimes by NCBI_Gene identifiers (Onecut3), sometimes by ENSEMBL identifiers (Arntl), and sometimes by MGI identifiers (Dlx6os1).
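To see how mixed the ID space in column 16 currently is, one could tally the database prefixes used there. The following is only a minimal sketch: it assumes column 16 values follow the `relation(DB:ID)` pattern, with "," joining parts of one extension and "|" separating independent extensions. The function names and all identifiers in the example are made up for illustration.

```python
import re
from collections import Counter

# Assumed column 16 syntax: relation(DB:ID). The prefix is everything before
# the first colon inside the parentheses.
EXTENSION_RE = re.compile(
    r"(?P<relation>[A-Za-z_]+)\((?P<prefix>[^:()]+):(?P<local_id>[^()]+)\)"
)

def parse_extensions(col16):
    """Return (relation, prefix, local_id) tuples found in one column 16 value."""
    return [
        (m["relation"], m["prefix"], m["local_id"])
        for m in EXTENSION_RE.finditer(col16)
    ]

def prefix_counts(col16_values):
    """Count how often each database prefix appears across column 16 values."""
    counts = Counter()
    for value in col16_values:
        for _relation, prefix, _local_id in parse_extensions(value):
            counts[prefix] += 1
    return counts

# Made-up identifiers, for illustration only.
examples = [
    "has_regulation_target(MGI:MGI:0000001)",
    "has_regulation_target(ENSEMBL:ENSMUSG00000000001),occurs_in(CL:0000000)",
    "has_regulation_target(NCBI_Gene:000001)|has_regulation_target(MGI:MGI:0000002)",
]
counts = prefix_counts(examples)
print(counts)
```

A tally like this would only show which prefixes are in use. Deciding whether an ENSEMBL or NCBI_Gene entry should be replaced by a MOD identifier would still need a cross-reference lookup, which is not sketched here.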

DoNot Annotate annotation property (MelanieC)

If possible, I would like to discuss an old ticket on the GOA Jira, "do not automatically annotate". I believe this was initially suggested by Rachael/Ruth and discussed in Texas. The use case was preventing some electronic pipelines, such as the Ensembl one, from automatically projecting annotations that we think could be valid only if manually reviewed. Ruth's example was human behaviour, which can't be inferred from the same behaviour in mice but could nonetheless be valid if manually annotated based on another existing paper.

Current status:
  • implementing a "do not automatically annotate" property is not trivial
  • the only use case identified is the human behaviour one described above
  • we (GOA) are taking over the Ensembl pipeline

I would like to know if other curators can foresee a use case for "do not automatically annotate", in which case it may be worth considering adding it. Otherwise, I would like to suggest we don't.

Curation consistency exercise

ZFIN's turn to pick a paper.


  • The original issue involved use of gene identifiers in Annotation Extensions, but this led to a more general discussion of entity ID space as well as the semantics of ID use in the With/From and Annotation Extension fields.
  • David had requested that when curators outside of MGI need to refer to a mouse gene they use MGI gene IDs, as opposed to another gene ID, e.g., ENSEMBL or NCBI_Gene.
  • For Protein2GO users, this would mean having Tony add the MGI database prefix as a database option in the tool, so curators can enter MGI gene IDs.
  • This seemed reasonable to people, but led to a broader discussion of how we use ID space in GO annotations.
  • For the subject of an annotation (i.e., Column 2 in the GAF), we generally use the following types of IDs in our annotation files:
    • UniProtKB protein accessions
    • MOD gene IDs
    • RNACentral IDs for ncRNAs
    • IntAct complex IDs
  • A few years ago, the GOC agreed that reference UniProtKB accessions and MOD/ENSEMBL gene IDs would effectively be interchangeable. That is, an annotation to a gene would mean that one or more products of that gene are associated with that annotation, and an annotation to a reference UniProtKB accession would mean that one or more proteins associated with that accession are associated with the annotation.
  • If isoform-specific information was relevant, curators could add the isoform-specific ID or accession to Column 17.
  • In annotation extensions, however, we are being more specific with respect to the domain and range of values. Depending on the GO term, if the object of the annotation extension is a gene, curators might use a gene ID; if the object is a transcript, they might use a transcript ID; and if the object is a protein, they might use a protein ID.
  • The result of this, though, is that we are now using different semantics for the subject and object of the annotation; do we need to be consistent?
  • Several points to consider:
    • Existing annotation constraints (i.e., domain and range) for annotation extensions that only allow for certain ID spaces
    • Isoform-specific information for Annotation Extensions - how should this be captured?
    • Human understandable vs machine readable Annotation Extensions
      • Humans may be able to infer that if the object of an mRNA binding annotation is a gene ID, then the actual object is the mRNA transcript of that gene, but how would a machine know that? Are there cases where this might not be clear or intuitive, even to a human?
  • ACTION ITEM: Review our use of ID space as the subject and object of annotations and annotation extensions, respectively, and then document for curators what is going to be standard GOC practice.
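
The subject/object question raised above can be illustrated with a small sketch. It assumes IDs are written as `DB:ID` (i.e., GAF Columns 1 and 2 joined) and that a database prefix is enough to guess the entity type. The prefix table, the function names, and the identifiers are illustrative assumptions, not agreed GOC practice.

```python
# Illustrative prefix-to-entity-type table (an assumption, not a GOC standard).
ENTITY_TYPE_BY_PREFIX = {
    "UniProtKB": "protein",
    "RNAcentral": "ncRNA",
    "IntAct": "complex",
    "MGI": "gene",
    "ZFIN": "gene",
    "SGD": "gene",
    "ENSEMBL": "gene",
    "NCBI_Gene": "gene",
}

def entity_type(curie):
    """Guess the entity type from the database prefix of a DB:ID string."""
    prefix = curie.split(":", 1)[0]
    return ENTITY_TYPE_BY_PREFIX.get(prefix, "unknown")

def needs_review(object_id, expected_type):
    """True when the object's guessed type differs from the type the GO term implies."""
    return entity_type(object_id) != expected_type

# mRNA binding example from the discussion: the term implies a transcript,
# but the extension object is a gene ID (made-up identifier).
print(entity_type("MGI:MGI:0000001"))
print(needs_review("MGI:MGI:0000001", "transcript"))
```

A human reader may infer that the gene ID stands for that gene's transcript, but a check like this only sees a gene where a transcript was expected, which is the machine-readability concern noted in the points above.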