Extending evidence codes

Background

The core evidence codes in GO have remained relatively static throughout the existence of GO. There is a separate Evidence Code Ontology (ECO), but this is ignored by virtually all groups and all tools.

In the early days of GO, annotations were made either from non-high-throughput experimental data or from sequence similarity, and the codes reflected this. Things have changed, but the GO evidence codes have changed only in a piecemeal fashion. The system needs an overhaul.

Some of the problems are analyzed in Annotation_method.

Axes

One of the problems with the existing evidence codes is that different axes of classification are mixed in a confusing fashion. "IEA" means an unreviewed computational method, and lumps together keyword matching, transference from InterPro, BLAST-based methods, and unreviewed integrated computational analyses. "RCA" means both that the annotation is reviewed and that the inference is from a mix of high-throughput experimental data and computational analysis.

Reviewed by a GO curator?
High throughput?
Computational vs experimental?
Integrative?

Cross-Products

 {reviewed, not-reviewed} x {computational, experimental, mixed-computational/experimental}
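A minimal sketch of what materializing this cross-product would produce, using only the two axes listed above. The function name and the way labels are joined are illustrative assumptions, not an agreed naming scheme.

```python
from itertools import product

# The two axes from the cross-product above.
REVIEW_STATUS = ["reviewed", "not-reviewed"]
METHOD = ["computational", "experimental", "mixed-computational/experimental"]


def precomposed_terms(review_status=REVIEW_STATUS, method=METHOD):
    """Return one precomposed label per (review status, method) pair.

    The label format is a placeholder; real terms would need curated names.
    """
    return [f"{r} {m}" for r, m in product(review_status, method)]


terms = precomposed_terms()
for term in terms:
    print(term)
```

Two values on one axis and three on the other give six precomposed terms. Adding the high-throughput and integrative axes would multiply this further, which is the trade-off against keeping each axis in its own column.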

Email Summary

Kara has described some problems with the existing evidence codes, and proposed a solution involving the creation of a new column for indicating whether the annotation has been reviewed or not.

I think the analysis of the problem is spot-on. Multiple axes of classification have become entangled in ways that are confusing and limit our ability to clearly state what should be simple statements describing how we arrived at the inference. The solution is to disentangle these axes.

The actual implementation proposed by Kara involves a new column for reviewed/not-reviewed. In fact the solution need not be implemented in exactly this way - we could instead materialize the cross-product of {reviewed, not-reviewed} with the various different experiment types. This was discussed at the Montreal meeting.

Anyway, I think this is a trivial implementation issue, and I bring it up only so that we do not get bogged down in separate discussions about GAF formats and so on.

I have a few comments about the proposal.

Evidence Codes are Legacy

My first comment is that this would be much easier if we could scrap the existing codes and start from scratch, and ditch the pointless requirement that the codes must be 2 or 3 letters. This isn't the 1970s, we can spare a few more characters, come on.

The problem can be seen in the proposed way of annotating PMID:18811975:

RCA + Curator Reviewed-'N' or
RCA + Curator Reviewed- 'Y'

The "R" in RCA stands for "reviewed". So here we have either "reviewed reviewed computational analysis" or "non-reviewed reviewed computational analysis". This is obviously gibberish. This criticism stands whether or not we have a separate column for review status, or whether we incorporate this into pre-composed cross-product terms such as "R-RCA" and "NR-RCA". It's not a problem with Kara's proposal, it's a problem with our legacy evidence code system.

Karen proposed renaming "RCA" to something like "Integrated Computational Analysis", but this was never incorporated and RCA stuck.

We would not have this problem if we used numeric IDs in the evidence column. We would then be able to modify labels and synonyms just as we do with ontology terms. We have a dozen years of experience in managing ontologies, it might be an idea to use this experience here. I think it's imperative we obsolete the existing evidence codes and switch to using ECO IDs. Otherwise we will just carry on getting bogged down in the same discussions.
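To illustrate the point about numeric IDs, here is a small sketch in which annotations store only a stable identifier, so a label can be changed without touching any annotation. The class, the identifier "ECO:PLACEHOLDER_1", and the gene product and GO values are all made up for illustration; they are not real ECO or GO IDs.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceTerm:
    term_id: str                 # stable identifier; never edited
    label: str                   # human-readable name; free to change
    synonyms: list = field(default_factory=list)

    def rename(self, new_label):
        """Change the label, keeping the old one as a synonym."""
        if self.label not in self.synonyms:
            self.synonyms.append(self.label)
        self.label = new_label


# Hypothetical term: the ID is a placeholder, not a real ECO identifier.
term = EvidenceTerm("ECO:PLACEHOLDER_1", "reviewed computational analysis", ["RCA"])

# An annotation refers to the term by ID only (values are placeholders).
annotation = {
    "gene_product": "geneX",
    "go_id": "GO:PLACEHOLDER",
    "evidence": term.term_id,
}

# Renaming the term leaves the annotation untouched.
term.rename("integrated computational analysis")
print(annotation["evidence"], "->", term.label, term.synonyms)
```

With a three-letter code such as "RCA" in the evidence column, the same rename would mean either rewriting every annotation or living with a code whose letters no longer match its meaning.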

Reviewed vs non-reviewed may be too simplistic a distinction

The second comment is that reviewed vs not-reviewed may be too crude a distinction, one that would end up lumping together unreliable annotations with reliable annotations. This can be seen in cases 3 and 4.

Case 3 involves annotations inferred from MF->BP links. Whilst it would be useful for internal tracking purposes to know whether these are reviewed or not, unreviewed annotations of this sort should not be lumped in with unreviewed BLAST-based predictions. I also think it's pointless for curators to review these on a per-annotation basis. There are too many; the review should be focused on the links in the ontology itself.

Case 4 involves annotations from Paint. Here Paint-based inferences would get an ISS+N code if they were bulk loaded without review. But this misses the fact that an expert curator was involved in making that inference (just not a curator from the specific MOD). I think here we need to be careful about conflating the notion of reviewed vs non-reviewed with the notion of automated vs manual.

What happens to IEAs

The 3rd comment is on the proposal for the mapping of legacy IEAs:

What happens to current IEAs?

 There are InterproToGO mapping, SwissProtKeyword mapping, EC# mapping (and more).
 InterproToGOMapping- all these are currently IEAs. In this new system they will be a flavor of ISS + Curator Reviewed-'N'

(minor correction: this should be ISM+N)

 SwissProtKeyword mapping- All these will be NAS + Curator Reviewed-'N'
 EC # mapping- flavor of ISS + Curator Reviewed-'N'

Here the disentangling doesn't go far enough - Kara is trying too hard to work within the confines of the existing broken system.

We're still mixing apples and oranges here. Should EC mappings really get an ISS?

With all 3 cases above, there are actually (at least) two inference steps. The first step may be computational or experimental (and may be recorded or not recorded) and the second step is a database-to-GO mapping. Ideally this would be described with a composite evidence description.
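One way to picture such a composite evidence description is as an ordered chain of inference steps plus a review flag. The sketch below is only an illustration under that assumption; the class names, field names, and output format are invented here and are not part of any GAF or ECO specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class InferenceStep:
    kind: str                      # e.g. "database-to-GO mapping"
    source: Optional[str] = None   # None when the step was not recorded


@dataclass(frozen=True)
class CompositeEvidence:
    steps: Tuple[InferenceStep, ...]
    reviewed: bool

    def describe(self):
        """Render the chain of steps followed by the review status."""
        parts = [f"{s.kind} ({s.source or 'not recorded'})" for s in self.steps]
        status = "reviewed" if self.reviewed else "not-reviewed"
        return " -> ".join(parts) + f" [{status}]"


# EC# mapping example: the first step is unknown, the second is the mapping.
ec_evidence = CompositeEvidence(
    steps=(
        InferenceStep("experimental or computational"),
        InferenceStep("database-to-GO mapping", "EC# mapping"),
    ),
    reviewed=False,
)
print(ec_evidence.describe())
```

Keeping the steps separate avoids forcing the whole chain into a single code such as ISS, which describes at most one of the two steps.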

If we were devising the inference ontology from scratch, we might start with one orthogonal axis describing the computational method (regardless of reviewed vs non-reviewed and other separate concerns):

computational method
- sequence-based computational method
- integrative computational method
- text-based computational method

(ultimately we would import this from OBI but let's not worry about that for now)

This would give us a more coherent basis on which to compose combinatorial codes (or to use in separate columns in the GAF). If we had to retrofit this to the existing system, then ITM might be the best 3 letters to use for the last one, the text-based computational method.

I'm not holding my breath for a complete overhaul - though I think it's well overdue. I think patching it with Kara's proposal is definitely better than doing nothing. The immediate practical question is whether to opt for an additional column, or to precompose the cross-product, retrofitting some of the existing codes in.

An example of what I mean by retrofit would be to introduce a new superclass "ICA" for "integrated computational analysis" (or "IICA"), and introduce two subclasses for reviewed vs non-reviewed. Thus "RCA" could retain its existing meaning, and we would have a new code for "non-reviewed integrated analysis":

ICA
RCA = R x ICA
NRCA = NR x ICA

In ECO these would have sensible names and numeric IDs

integrated computational analysis
- reviewed integrated computational analysis
- non-reviewed integrated computational analysis
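The retrofit above can be sketched as a tiny hierarchy in which RCA and NRCA are subclasses of ICA, each defined as a cross-product of a review status and ICA. The dictionaries and helper functions below are illustrative assumptions; they are not how ECO or any GO tool actually stores these terms.

```python
# Parent links for the proposed codes (ICA is the new superclass).
PARENT = {"ICA": None, "RCA": "ICA", "NRCA": "ICA"}

# Cross-product definitions: code -> (review status, base class).
DEFINITION = {
    "RCA": ("reviewed", "ICA"),
    "NRCA": ("non-reviewed", "ICA"),
}

LABEL = {
    "ICA": "integrated computational analysis",
    "RCA": "reviewed integrated computational analysis",
    "NRCA": "non-reviewed integrated computational analysis",
}


def is_a(code, ancestor):
    """True if code equals ancestor or descends from it via PARENT links."""
    while code is not None:
        if code == ancestor:
            return True
        code = PARENT.get(code)
    return False


def codes_under(ancestor):
    """All codes that are the ancestor itself or one of its subclasses."""
    return sorted(c for c in PARENT if is_a(c, ancestor))


for code in codes_under("ICA"):
    print(code, "=", LABEL[code])
```

A query for "ICA" then picks up both reviewed and non-reviewed annotations, while existing uses of "RCA" keep their current meaning.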