Annotation consistency: IEA, ISS, IC Usage Discussion

From GO Wiki

Tanya Berardini, Emily Dimmer, Pascale Gaudet, Chris Mungall, Kimberly VanAuken, David Hill, Val Wood


Summary

  • Many IEA annotations are quite reliable and should be displayed in the absence of other information, including those from:
  1. InterPro2GO
  2. SwissProt Keyword 2GO
  3. EC2GO
  4. HAMAP2GO
  5. UniProt subcellular location 2GO
  6. Ensembl Compara - projection of annotation from ortholog data.
  • This would solve the problem to some extent where genes are incompletely annotated.
  • In some cases (InterPro2GO and Ensembl Compara methods) it may also be useful to convert IEA to ISS annotations after a curator has reviewed the data. However this should not happen for other mappings - the ISS code would have no real meaning as often there is no element of 'inference from sequence or structural similarity' used in creating a set of IEA annotations.
  • ISS annotations by other means can also complement a set of IEA annotations, for example, when a mapping provides annotations only to high-level terms.
  • It would be nice to have IEA annotations displayed in AmiGO, especially when there are no other annotations available. Could filtering of annotations be made possible?
  • Alternatively, the user could have the option to view IEA annotations.
  • It would be nice to have some alert that would tell us an ISS annotation can be made. That would be: when a gene is annotated to the root (or not annotated at all), and there is an experimental annotation in another species, and the genes are known orthologs. Then a message would be sent to the relevant database to create an ISS annotation to that gene.

-- With regard to this last recommendation, I'd like to suggest a slight wording change in the last sentence, from 'create an ISS annotation to that gene' to 'review and possibly create, an ISS annotation to that gene'. I spent some time last month reviewing some of our 'missing' ISS annotations from the reference genome project and found that in some cases I actually couldn't make the same annotation or was more comfortable making an annotation to a higher level term. --Kimberly
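
As a rough illustration of the alert rule in the last bullet, with Kimberly's 'review and possibly create' wording in mind, here is a minimal Python sketch. It is not an existing GO tool: the Annotation record, the iss_review_alerts function, the set of evidence codes counted as experimental, and the ortholog table are all assumptions made for this example. It only lists candidates for a curator to review and creates no annotations itself.

```python
from dataclasses import dataclass

# Evidence codes counted as experimental (an assumption for this sketch).
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI", "IEP"}

# Root terms: molecular_function, biological_process, cellular_component.
ROOT_TERMS = {"GO:0003674", "GO:0008150", "GO:0005575"}


@dataclass(frozen=True)
class Annotation:
    gene: str      # gene identifier (format is hypothetical)
    term: str      # GO term ID
    aspect: str    # "P", "F" or "C"
    evidence: str  # evidence code, e.g. "IDA", "IEA", "ND"


def iss_review_alerts(annotations, orthologs):
    """List (gene, aspect, ortholog, term, evidence) tuples for curator review.

    orthologs maps a gene to the set of genes known to be its orthologs.
    """
    by_gene_aspect = {}
    for a in annotations:
        by_gene_aspect.setdefault((a.gene, a.aspect), []).append(a)

    alerts = []
    for gene in sorted(orthologs):
        for aspect in ("P", "F", "C"):
            own = by_gene_aspect.get((gene, aspect), [])
            # Alert only if the gene is annotated to the root in this aspect
            # or has no annotation in this aspect at all.
            if own and not any(a.term in ROOT_TERMS for a in own):
                continue
            for partner in sorted(orthologs[gene]):
                for a in by_gene_aspect.get((partner, aspect), []):
                    if a.evidence in EXPERIMENTAL and a.term not in ROOT_TERMS:
                        alerts.append((gene, aspect, partner, a.term, a.evidence))
    return alerts
```

Each tuple names the gene to review, the ontology aspect, the ortholog that carries the experimental annotation, and that annotation's term and evidence code. The curator can still decide to annotate to a higher-level term, or not to annotate at all.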


Email exchange Nov 8-12, 2007

Hello,

I am going through the action items for next week's reference genome conference call. At the reference genome meeting we all eagerly volunteered to come up with some guidelines as to how to use ISS, IEA and IC (too much coffee??). Here are the major points I remember:

  • IC: came up during the example of a translation initiation factor annotated to a good function and component but had a root annotation for process. It was suggested maybe that could have got an IC annotation. (gene is mouse Eif2b2)
  • related is ISS and IEAs: David pointed out this gene probably had a good IEA annotation from InterPro. The question was how to address this: dicty and pombe would make this an ISS annotation, to avoid the root annotation. The problem is, InterPro domains and mappings can change so maybe that's not such a good practice after all. However, since IEAs are not displayed in AmiGO (and are perhaps excluded from certain studies as poorly reliable), some valuable information is not used.

Was that your recollection as well? Can we make a plan to discuss that at some point and come up with suggestions? (It doesn't have to be before the call next Tuesday; but I wanted to write down the important discussion points).

Cheers, Pascale


Hi,

My memory of the Reference Genome discussion for IEA was that as many of the IEA methods used have increased in quality over the last 5 years, some groups are becoming more accepting of the data they provide. I thought that Judy's suggestion was that a group should review these methods and decide which sets of data should be displayed in AmiGO (e.g. if an annotation is supported by multiple independent IEA methods).

If that is a true memory it might be handy to draw up a list of the IEA methods that we want to discuss. From the GOA perspective, I would like to include:

  1. InterPro2GO
  2. SwissProt Keyword 2GO
  3. EC2GO
  4. HAMAP2GO
  5. UniProt subcellular location 2GO
  6. Ensembl Compara - projection of annotation from ortholog data.

Cheers, Emily


Hi,

Thanks for getting the ball rolling on this one, Pascale.

My recollection of this discussion was that it stemmed from a concern about annotation consistency and that Suzi was concerned that there seemed to be lots of 'missing' ISS annotations in the reference genome work. One response to this was that for many gene products, the IEA annotations were actually providing sufficient information and that perhaps IEAs suffer from past negative perceptions that are no longer accurate given that substantial feedback has greatly improved some of the mappings.

Just speaking for myself here, when curating I often make the very pragmatic decision to focus on getting as many experimental annotations in as possible and then, time permitting, go back and try to fill in ISS annotations where we don't have experimental data. In the meantime, though, I do look at our existing IEA annotations and find that many of them are just fine and, if included in AmiGO, would help plug some perceived annotation holes.

I agree with Emily that it would be worthwhile to look at the various IEA methods and come up with some metrics for evaluating their accuracy. I'm not sure what these could be, but (thinking out loud) perhaps there's a way to determine what percentage of these mappings are supported by experimental data in any organism.

Also, with the proposed changes to the ISS branch of the evidence codes, is it worth considering promoting some of these mappings to one of the new IS* evidence codes, if the method ultimately stems from sequence analysis?

Emily, does some of the BioCreAtIvE work speak to this subject as well? I seem to recall Evelyn talking about this issue at past GO meetings and commenting that the BioCreAtIvE work provided support for the idea that some electronic annotations were actually of high quality.

Cheers, Kimberly
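
One possible reading of Kimberly's 'thinking out loud' metric above (the percentage of mappings supported by experimental data in any organism) is sketched below in Python. Everything here is an assumption for illustration: the supported_fraction function, the tuple layouts, and the rule that a mapping counts as supported when at least one gene it annotates has an experimental annotation to the same term or to a more specific one.

```python
from collections import defaultdict


def supported_fraction(iea, experimental, ancestors):
    """Fraction of mappings per IEA method that have experimental support.

    iea:          iterable of (method, mapping_id, gene, term) tuples
    experimental: set of (gene, term) pairs backed by experimental evidence
    ancestors:    dict mapping a term to the set of its ancestor terms
    """
    # Every term implied by a gene's experimental annotations:
    # the annotated term itself plus all of its ancestors.
    implied = defaultdict(set)
    for gene, term in experimental:
        implied[gene].add(term)
        implied[gene].update(ancestors.get(term, ()))

    # method -> {(mapping_id, term): True if any annotated gene supports it}
    support = defaultdict(dict)
    for method, mapping_id, gene, term in iea:
        key = (mapping_id, term)
        hit = term in implied.get(gene, ())
        support[method][key] = support[method].get(key, False) or hit

    return {m: sum(d.values()) / len(d) for m, d in support.items()}
```

A mapping with no supporting gene is not necessarily wrong under this measure; the proteins it covers may simply have no experimental annotations yet.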


I agree with everything Kimberly says, from the strategy of the curation to the visibility of IEA annotations. We also strive for experiments and only when we have nothing else do we search out ISS annotations. Finding experimental evidence in another organism to make an ISS with is very time-consuming. For many of the genes where the ISS seems 'obvious', we have IEA annotations to the appropriate terms.

David


I'll reply to points that Kimberly and David raised all in one email:

1. From Kimberly:

"Also, with the proposed changes to the ISS branch of the evidence codes, is it worth considering promoting some of these [IEA ]mappings to one of the new IS* evidence codes, if the method ultimately stems from sequence analysis?"

Doesn't this sound eerily like the THMM (or whatever that acronym is) discussion that's been swirling around the mailing list for ages? Is it IEA? Is it ISS? From what Emily has said, it seems like _some_ of the (currently) IEA methods might be in this same type of situation - the mappings are heavily curated.


2. From David: "Finding experimental evidence in another organism to make an ISS with is very time-consuming. "

While it might be difficult to find this type of information for many of the genes that we encounter day-to-day, for the RefGenomes genes in particular, being able to look at all of the ortholog annotations through Mary's graph shortcuts this time-consuming process. Therefore I think that Pascale's point of 'why not make the ISS annotation?' is well taken. I think we should recommend that this be done in cases where the alternative is to have a 'root' annotation or an IEA one.

3. My own point: It seemed like one solution would be to display the IEA annotations in AmiGO so that the 'holes that are not actually holes but IEA annotations that are invisible' would not be so obvious. I'm sure that the AmiGO working group has this in their sights so it may already be in the works.

Tanya


It would be nice to have some type of alert that let us know when a possible ISS annotation could be made to a gene where we have an annotation to the root.

What about 'promoting' IEA annotations when there is no other info available? We do this now on our web site. If we have a manual annotation to the root and there is an IEA annotation, the display of the root annotation gets suppressed. Is there a way we can keep a zillion IEA annotations that are redundant with manual annotations from being displayed? This is just an idea I'm formulating. It would be nice if we could provide some type of best set of annotations to a user.

David
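
The two display rules David describes above (hide a manual root annotation when an IEA annotation exists, and hide IEA annotations that are redundant with manual ones) could be sketched as follows. This is a hypothetical Python illustration, not the behaviour of AmiGO or of any MOD web site; the best_display_set function, the tuple layout, and the definition of 'redundant' as 'same term as, or ancestor of, a manually annotated term' are assumptions.

```python
# Root terms: molecular_function, biological_process, cellular_component.
ROOT_TERMS = {"GO:0003674", "GO:0008150", "GO:0005575"}


def best_display_set(annotations, ancestors):
    """Pick the annotations to display for a single gene.

    annotations: list of (term, aspect, evidence) tuples for one gene
    ancestors:   dict mapping a term to the set of its ancestor terms
    """
    manual = [a for a in annotations if a[2] != "IEA"]
    iea = [a for a in annotations if a[2] == "IEA"]

    # Terms already covered by informative manual annotations:
    # each manually annotated term plus all of its ancestors.
    covered = set()
    for term, _aspect, _evidence in manual:
        if term not in ROOT_TERMS:
            covered.add(term)
            covered.update(ancestors.get(term, ()))

    shown = []
    for term, aspect, evidence in manual:
        has_iea = any(a[1] == aspect and a[0] not in ROOT_TERMS for a in iea)
        # Rule 1: suppress a root annotation if an IEA exists in that aspect.
        if term in ROOT_TERMS and has_iea:
            continue
        shown.append((term, aspect, evidence))
    for term, aspect, evidence in iea:
        # Rule 2: suppress IEAs that add nothing beyond a manual annotation.
        if term in covered:
            continue
        shown.append((term, aspect, evidence))
    return shown
```

An IEA annotation to a more specific term than the manual one is kept, since it carries information the manual annotation does not.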


An alert would be awesome! We wouldn't even have to look at the graphs.

(Now I'll start dreaming, beware.)

Dare I even suggest the 'auto-generation' of an ISS annotation/s with the appropriate evidence_with identifiers already filled in that could then be approved or not? It could be a file that's generated in the usual gene_assoc format (kind of like what MGI feeds to GOA?). Hmmmm, how to integrate it easily into the member dbs though? Does anyone have a way to slurp up outside annotations in the gene_assoc format into their MOD? I think we might because we integrated TIGR's Arabidopsis annotations into TAIR way back when.

</dream>

Tanya


I think this would be really cool. I know that when we do large-scale stuff, we can load annotations in a tab-delimited format. We slurp up the GOA mouse annotations, so I know we could do it. We need to have a way to be sure that the auto-generated ISS annotations remain in sync with the original experiment-based annotations. If those annotations went away or changed, the ISSs would have to change as well. This presumably would be no extra work on the part of the MOD curators.

David
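
To make the last two messages concrete, here is a minimal Python sketch of proposing an ISS line with the 'with' identifier filled in, and of checking later that it is still in sync with its source annotation. The function names and the five-column tab-delimited layout are assumptions for this example; the layout is a simplified stand-in, not the full gene_association file format.

```python
def propose_iss(target_gene, source_gene, term, aspect):
    """Build a tab-delimited ISS proposal for a curator to approve or reject.

    Columns (simplified): gene, GO term, evidence code, with, aspect.
    """
    return "\t".join([target_gene, term, "ISS", source_gene, aspect])


def stale_iss(iss_lines, experimental):
    """Return ISS lines whose source experimental annotation has gone away.

    experimental: set of (gene, term) pairs that currently have an
    experiment-based annotation.
    """
    stale = []
    for line in iss_lines:
        _target, term, evidence, with_id, _aspect = line.split("\t")
        # The ISS is in sync only while the gene named in the 'with' column
        # still carries an experimental annotation to the same term.
        if evidence == "ISS" and (with_id, term) not in experimental:
            stale.append(line)
    return stale
```

The exact-term comparison is deliberately strict. If a curator approves the annotation at a higher-level term than the source, the check would need to consult the ontology rather than compare term IDs directly.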


Hi,

Sorry to be replying late... and for the length of this e-mail!

1. Did anyone else have any other IEA methods that they would want included in this discussion? Or are the six I mentioned the main ones? Perhaps it would be worth an e-mail to the GO list to double-check this is true?

2. Kimberly - yes, as part of the BioCreative competition we did evaluate our electronic methods (see PMID: 15960829) and found that they all had a high correctness - I've included the relevant paragraph from the paper at the bottom of this e-mail.

3. I do believe that the electronic methods have improved substantially over the years. This is due to the feedback that we get - but this is an on-going effort with particular importance for the InterPro2GO mappings.

4. Many of the mapping methods (Swiss-Prot Keyword/location/HAMAP) are very manual, as quite often the external vocabulary term has been applied manually to a record by a Swiss-Prot curator (InterPro2GO does NOT fall into this category). This is especially true for HAMAP2GO annotations, and one day - when we have all the information we need from Swiss-Prot - I can imagine/hope that this method will change to having a manual evidence code.

5. For InterPro2GO and the Compara IEA annotations I can see that 'upgrading' them to ISS could be helpful for MOD curators - as this action stabilizes an annotation (all IEA mappings/methods are deleted and re-run monthly by GOA, so if you make an 'ISS' statement, you can then ensure that you 'keep' the information in the unlikely event that an IEA annotation might one month disappear). In addition, I can see that it would be valid to upgrade to 'ISS' for these aforementioned methods as they do both have a sequence analysis technique behind them.

But I'm not so sure that it would be right to 'upgrade' Swiss-Prot keyword/location/HAMAP or EC2GO mappings in this way... as the origin of this IEA annotation will often not be sequence similarity, but the effect of a Swiss-Prot curator's manual annotation directly onto a protein's record. The evidence trail then just becomes a bit too tangled in my mind to have external curators directly convert these to ISS annotations.

6. I am a little worried about the way that ISS seems to be talked about at the moment. We recently restricted the creation of certain ISS annotations so that they could only be made from source annotations which had a manual evidence code. This effort was to ensure that 'ISS's' were high-quality, highly-curated statements, which users could trust. I liked this change - and feel it would be a pity if we started to speed up the process of making ISS statements again - so at the moment I am not 100% behind the suggestion of making a file to semi-automate this process. Why not leave them as IEA if you're only having a quick review of them - just let us know which ones are incorrect, and instead spend time on making the granular annotations with experimental evidence codes?


7. In the UniProt gene association file we display all sets of the IEA annotations created by the different methods - as this then gives the user a complete set of information and there might be interest in the annotations created by different IEA methods. But this display is very redundant and only appropriate for people happy to play with large datasets.

When a user is looking at annotations through a browser, however, I feel they should be presented with a more user-friendly, condensed view; therefore I agree with David's suggestion of filtering annotations based on evidence code or method. However, could there be an optional expanded view in AmiGO for anyone wanting all the annotation detail? For those users that could cope - I feel it would be interesting to see that curators have assigned 'ND' annotations for certain nodes at a certain date, but that there are also some IEA methods which point to a possible function....

8. But especially I think AmiGO does need to show IEA annotations for all of the non-model organisms - which will never have much (if any) manual curation time given to them nor much direct experimental evidence. Improvements to IEA methods are particularly important for these kinds of users.


Cheers, Emily

BioCreative paper excerpts:

In agreement with GOA release statistics, InterPro2GO (635 annotations) provided the most GO coverage of the test set followed by SPKW2GO (385 annotations) and EC2GO (27 annotations), data not shown. Because the GO function terms predicted by the EC2GO mappings were quite deep/final node GO terms, it was not surprising that 67% of the time they exactly matched the manual GO annotation (Table 6). The InterPro2GO (43%) and SPKW2GO (44%) mappings, however, were more likely to predict a higher level/ less granular term than those chosen manually. Given that this was an automatic evaluation, the precision of electronic GO term predictions was calculated based on new or more granular GO terms being either correct or incorrect. As a result, a precision range is presented for each electronic strategy. In the worst case scenario, InterPro2GO, SPKW2GO and EC2GO precisely predict the correct GO term 60 to 70% of the time. On the other hand, all strategies were capable of up to 100% precision. The reason for this level of accuracy is because these electronic strategies rely on a manual mapping step based on quite high level GO terms.

...

To further evaluate how precise our electronic strategies were, we manually evaluated a random set of 44 proteins that had both electronic and manual GO annotation. This time, we verified whether the GO predictions were correct or incorrect. There was little difference in the precision of each strategy and our electronic annotation was between 91–100% precise.

Emily


Hello again,

I wonder if that makes up some sort of proposal?

1. Mappings would be left as IEA (or one of the new IS* codes)

2. Warnings would be sent for genes annotated to the root where there are experimental annotations in other organisms

3. AmiGO would display IEAs-- at least for reference genomes initially?

Pascale


Now for something completely different,

did we want to say anything about IC usage? This was the one area that we didn't explore beyond Pascale's initial email.

Here's the relevant bit:

  • IC: came up during the example of a translation initiation factor annotated to a good function and component but had a root annotation for process. It was suggested maybe that could have got an IC annotation. (gene is mouse Eif2b2)


What if there was an IEA annotation for 'translation'? (That is the case for this gene.)

http://www.informatics.jax.org/javawi2/servlet/WIFetch?page=markerGO&key=72558

Would it still make sense to make an IC just so that the annotation will show up in AmiGO? I don't think so. Seems like we should be able to resolve that type of situation by going through the 'AmiGO displays (subset of) IEA annotations' route.

Tanya


I don't think you should be able to infer anything stronger than IEA from an IEA

If it's really important to capture these then we could have as an is_a child of IEA, "inferred from IEA and curator"...

Chris


Hi Chris,

Maybe I didn't phrase this right. I wasn't talking about making an IC annotation from an IEA.

I was referring to the necessity of making an IC annotation to something like 'translation' based on an experimental annotation to 'transcription factor activity' when there was an existing IEA annotation to 'translation' already present but not visible in AmiGO.

Tanya


And I did mean 'transcription' not 'translation.'



Hi Tanya,

I think there is value in adding this type of manual annotation:

1. Taking the GOA pipelines into account - our IEA pipelines are re-run each month, and there is a small possibility that in a future release some parameters might be changed, resulting in some IEA 'transcription' annotations disappearing (e.g. if the annotation originated from the InterPro2GO mapping and, say, the GeneDB group found that the IEA 'transcription' annotation was not correct for their S. pombe proteins - the InterPro2GO mapping would then be modified, and the Arabidopsis protein might lose an annotation that for them was correct). If you make a manual annotation, then you will ensure it stays.

2. If you are already looking at the annotations assigned to this gene product, making the IC annotation does indicate to users that a curator has come to the same conclusion as the IEA prediction, which is nice for everyone.

Emily
