Annotation method

From GO Wiki

Flagging annotations to indicate whether or not they were reviewed by a curator.

New column in GAF: Annotation Method

This idea was brought up at the Montreal GOC meeting. See the minutes of the Montreal meeting: Separating Annotation Method from Ev.Code

Background information - http://gocwiki.geneontology.org/index.php/Electronically_curated_flag

Most of what follows repeats material already on that wiki page (all credit to Kara).

The problem

We have some evidence (experimental or computational) for a GO annotation, but we also want to indicate whether or not a curator looked at/reviewed the evidence. Our current system is a mixed bag, with IEA being the oddball. IEA indicates not what the evidence/experiment is, but rather how the annotation was made, while the rest of the evidence codes indicate the type of evidence that supports the GO annotation.

Here are some examples that show the limitations of our current system and why this is a problem:

Case 1) SGD has several sets of computationally predicted GO annotations (PMID = 18811975) that are not solely sequence or structure based. These studies combine several interaction and expression data sets to predict functions for genes. The RCA evidence code is appropriate here, but the documentation says that the annotations must all be manually reviewed by a curator before this evidence code can be used. There are several hundred annotations of this kind, and reviewing them manually is not a good use of our time. By contrast, some RCA-type studies propose an MF/BP for just a handful of genes, and we are able to review those manually.
We cannot use IEA because the 'with' column is mandatory for this evidence code and these predictions don't have any information for the 'with' column.

Case 2) There are a few large-scale localization data sets available for yeast. The authors GFP-tag the proteins and visualize the localization. These studies provide localization for hundreds of proteins, and again it is not straightforward to manually review each image to confirm the localization. In addition, the authors of these studies typically spend hundreds of hours scoring duplicates/triplicates of these images, so it doesn't make sense for us to review them again. We can review the method used in the study to make sure the experimental system is sound and has no caveats, but we can't review all the data. In this case, although it is an experimental study and IDA would be the evidence code, we are unable to review each image, so we would like to separate these annotations from annotations made from classical genetics studies, where a curator can review the data shown in the paper.

Case 3) Inferences made from MF-BP inter-ontology links. Tanya/David have created inter-ontology F-P part_of links, and Chris has a script to make inferences (annotations) based on these links. Currently they all have IC as the evidence code. These annotations can't be IEA because they don't have 'with' column data. If there are many such annotations, it might not be possible for groups to review each one, and if they are not reviewed, they should be flagged somehow.

Comment: I think this illustrates why a binary reviewed/not-reviewed flag is too crude here. Inferences from F-P links are correct by definition, and unreviewed ontology-based inferences should not be treated in the same way as less reliable links based on statistical predictive methods or heuristic rules --Chris.

Case 4) Inferences made from PAINT. Inferences made using the PAINT tool are based on sequence/structure similarity, and currently the plan is to load them with the ISS evidence code. However, these differ from sequence similarities published in papers, and as in Case 3, it might not be possible for groups to review each annotation.

For all the cases mentioned above, some groups have the resources to review the inferences and some don't. I believe we should have the option to indicate to our users whether the annotations were bulk loaded or reviewed, which brings me to the following proposal.

Solution: New column

  • The evidence code should not be overloaded; it should indicate only the evidence on which the annotation is based.
  • Add a new column to indicate the method of annotation. It could be something as simple as:

ColumnName: Curator/Manually reviewed
Values: Y/N

  • Remove IEA as an evidence code. Everything that is currently IEA would be given the method 'automated/computational' and then an evidence code as appropriate (mostly a flavor of ISS, I would assume). There can be a rule that all 'automated' annotations that are a flavor of ISS must have a 'with' value.
  • With this system, an RCA type annotation can be manual or automated depending on whether the curator has reviewed it or not.
  • Annotations with IDA evidence code (like in the large scale localization study) can be automated or manual depending on whether the curator has reviewed it or not.
  • All annotations based on Swiss-Prot keyword mapping would be NAS + automated/computational.

And so on.

So any evidence code can be used in conjunction with the Annotation Method.
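
The proposal above can be sketched as a small record check. This is only an illustration under stated assumptions, not an agreed GAF specification: the field names, the `check` function, and the reading of "a flavor of ISS" as ISS plus its sub-codes (ISO, ISA, ISM) are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: "a flavor of ISS" is read here as ISS plus its sub-codes.
ISS_FLAVORS = {"ISS", "ISO", "ISA", "ISM"}


@dataclass
class Annotation:
    gene_product: str
    go_term: str
    evidence_code: str
    curator_reviewed: str  # proposed new column: "Y" or "N"
    with_value: str = ""   # GAF 'with' column; empty when absent


def check(annotation):
    """Return a list of rule violations (empty if the record is acceptable)."""
    problems = []
    if annotation.curator_reviewed not in ("Y", "N"):
        problems.append("Curator Reviewed must be Y or N")
    if annotation.evidence_code == "IEA":
        # Under the proposal IEA is retired as an evidence code.
        problems.append("IEA is removed; use an evidence code plus Curator Reviewed = N")
    if (annotation.curator_reviewed == "N"
            and annotation.evidence_code in ISS_FLAVORS
            and not annotation.with_value):
        problems.append("automated ISS-type annotations must have a 'with' value")
    return problems


# Hypothetical records, for illustration only.
bulk_rca = Annotation("geneX", "some BP term", "RCA", "N")
bulk_iss = Annotation("geneY", "some MF term", "ISS", "N")
print(check(bulk_rca))  # no violations: RCA may be unreviewed under the proposal
print(check(bulk_iss))  # one violation: missing 'with' value
```

The point of the sketch is that the evidence code and the review flag are checked independently, so any evidence code can be paired with either flag value.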

Here is how this new column will work for the examples presented earlier.

  • Case 1: RCA predictions PMID: 18811975

This paper predicts functions/processes for several genes using a combination of methods/studies. Depending on whether the curator reviewed them or not, these will be:
RCA + Curator Reviewed-'N' or
RCA + Curator Reviewed-'Y'

  • Case 2: Large-scale localization study (Huh et al PMID: 14562095, Sickmann et al PMID: 14576278)

The evidence code for this data set would be IDA, since a GFP tag was used to see the localization. Depending on whether the curator reviewed each image/localization or not, data from this paper can be annotated with:
IDA + Curator Reviewed-'N' or
IDA + Curator Reviewed-'Y'

  • Case 3: Inferences made from MF-BP inter-ontology links. These are the annotations derived from the inter-ontology links. The evidence code itself is debatable for these annotations, but one proposal is to go with the evidence code used for the original annotation on which the inference is based. For example, if gpA is annotated to kinase activity with IDA from PMID: xxxx, then gpA would get a protein phosphorylation annotation in BP, also with IDA. The advantage of the new column is that it highlights whether the inferred annotation was made automatically or by a curator.

These can be:
IDA/IMP/IGI/... + Curator Reviewed-'N' or
IDA/IMP/IGI/... + Curator Reviewed-'Y'
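
A minimal sketch of the Case 3 proposal, assuming the inferred annotation simply inherits the evidence code and reference of the original MF annotation and starts out unreviewed. This is not Chris's script; the link table, the tuple layout, and the function name `infer_bp` are hypothetical, and term names stand in for GO IDs.

```python
# Hypothetical MF -> BP part_of link table (term names used instead of GO IDs).
F_P_LINKS = {"kinase activity": "protein phosphorylation"}


def infer_bp(annotations, links):
    """Derive BP annotations from MF annotations via inter-ontology links.

    Each input annotation is (gene_product, term, evidence_code, reference).
    The inferred annotation keeps the original evidence code and reference,
    and is flagged Curator Reviewed = 'N' until a curator looks at it.
    """
    inferred = []
    for gene_product, term, evidence_code, reference in annotations:
        bp_term = links.get(term)
        if bp_term is not None:
            inferred.append({
                "gene_product": gene_product,
                "term": bp_term,
                "evidence_code": evidence_code,
                "reference": reference,
                "curator_reviewed": "N",
            })
    return inferred


# The example from the text: gpA, kinase activity, IDA, PMID: xxxx (placeholder).
for row in infer_bp([("gpA", "kinase activity", "IDA", "PMID:xxxx")], F_P_LINKS):
    print(row)
```

A group that reviews the inferred annotation would then change the flag to 'Y' without touching the evidence code.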

  • Case 4: Inferences made from PAINT

As in Case 3, some groups might be able to review each of these inferences and some might bulk load them. With the new column, these would be:
ISS + Curator Reviewed-'N' or
ISS + Curator Reviewed-'Y'


  • What happens to current IEAs?

Current IEA sources include the InterProToGO mapping, the Swiss-Prot keyword mapping, the EC number mapping (and more).
InterProToGO mapping: all these are currently IEAs. In the new system they will be a flavor of ISS + Curator Reviewed-'N'
Swiss-Prot keyword mapping: all these will be NAS + Curator Reviewed-'N'
EC number mapping: a flavor of ISS + Curator Reviewed-'N'
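
The reassignment of current IEAs could be expressed as a simple lookup table. The sketch below is illustrative only: the source keys and the function name `migrate_iea` are hypothetical, and plain "ISS" stands in for whichever flavor of ISS is eventually chosen. Sources not listed in the text raise an error rather than being guessed.

```python
# Hypothetical table: current IEA source -> (new evidence code, Curator Reviewed).
# "ISS" is a placeholder for "a flavor of ISS"; the exact sub-code is not decided here.
IEA_MIGRATION = {
    "InterProToGO": ("ISS", "N"),
    "SwissProtKeyword": ("NAS", "N"),
    "EC": ("ISS", "N"),
}


def migrate_iea(source):
    """Return the (evidence_code, curator_reviewed) pair for a current IEA source."""
    try:
        return IEA_MIGRATION[source]
    except KeyError:
        # The text mentions "(and more)" without giving rules, so do not guess.
        raise ValueError(f"no migration rule defined for IEA source {source!r}") from None


for source in IEA_MIGRATION:
    print(source, "->", migrate_iea(source))
```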

Alternate Proposal

Extending_evidence_codes