Email discussions

From GO Wiki


I agree with everything Michelle said below about what ISS is. I also think that at the recent GO meeting, the group agreed that any method where all the evidence is sequence based and where the final annotations are curator approved should be ISS. By that criterion, I agree with Michelle that what David and Harold have described fits into the ISS category.

There hasn't been any discussion of the papers Rama suggested for the RCA discussion, but if you read them you'll notice that these analyses took multiple types of experimental evidence, e.g. two-hybrid screens, genetic interactions, etc., and did a computational analysis of that type of data. The original/existing RCA documentation stated that RCA should not include sequence data at all; however, at the St. Croix meeting, I believe we agreed that an RCA analysis could include sequence info as one of several types of data. If all the data is sequence based, though, then I think it should be ISS, based on the agreement that any curator-approved annotation based purely on sequence-based data should be ISS.

As for the with column, if these annotations are old, they are actually exempt from the requirement to fill the with column. Mike's filtering script, which removes ISS annotations, has a date criterion and only requires the with column to be filled if the annotation date is later than the agreed-upon date. Alternatively, if you still have the source data, perhaps you could fill the with column with all the things that were considered for the annotation. Karen
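The date-based exemption described above can be sketched roughly as follows. This is only an illustration and not Mike's actual script: the cutoff date, the function name, and the argument layout are all assumptions made for the example, since the thread does not give them.

```python
from datetime import date

# Hypothetical cutoff: the thread does not state the agreed-upon date.
WITH_REQUIRED_AFTER = date(2007, 1, 1)


def keep_annotation(evidence_code, with_field, annotation_date,
                    cutoff=WITH_REQUIRED_AFTER):
    """Return True if an annotation survives the filter.

    Only ISS annotations are checked. An ISS annotation dated on or
    before the cutoff is exempt; a later one needs a non-empty with
    column.
    """
    if evidence_code != "ISS":
        return True                    # the rule only concerns ISS
    if annotation_date <= cutoff:
        return True                    # old annotations are exempt
    return bool(with_field.strip())    # newer ISS needs a filled with column


# Usage: an old ISS with an empty with column is kept, a new one is not.
old_ok = keep_annotation("ISS", "", date(2005, 3, 1))
new_ok = keep_annotation("ISS", "", date(2007, 6, 1))
```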

>On Tue, 20 Mar 2007, Michelle Gwinn Giglio wrote: A few comments on what ISS is. Harold made this statement: "The point of the ISS is that the curator checked that the object pointed to has experimental evidence behind it." That is only one aspect of what ISS is - it is much more than that. David said: "But based on the hits, curators made their judgments.....lots of things went into the judgements that were made, most importantly was knowledge of the field." In fact this paraphrased quote from David (taking out the middle part which I don't agree with) is exactly what ISS is. That's the whole point of differentiating it from sequence-based IEA annotations. Making annotations based on sequence similarity is the process of taking search results of many types (TIGR runs at least 7 types of searches - only one of which is pairwise Blast-based) and then assessing what those search results are telling you. Some of the results will be quite strong evidence, some will be weak evidence. Sometimes the combination of several search results which individually don't tell you anything will together add up to something much more. How one makes this determination is through experience with evaluating search results (knowing what to look for in an alignment, understanding how HMMs work and what they can tell you, etc.) and knowledge of the organism and field of function/process in question. That is the whole point of human review of sequence-based evidence and it is often not an easy task. It is not trivial. In fact, in my experience, it is more challenging than doing literature curation, where one need only record in GO annotations what was done by the authors. ISS requires a constant series of judgement calls for virtually every gene - and the annotator is the one making the judgements.

At TIGR, we have a way of telling which of our match proteins are experimentally characterized and which are not (to a certain extent), and we could then automatically put ISS on everything that had a match above a certain cutoff to one of those proteins. But that would be very wrong and would not be a true ISS annotation in my book - it would be a high quality IEA and nothing more. The process of ISS is exactly the human element of deciding whether the info you are seeing makes sense in the context of your organism and in the context of your knowledge of the system.


> Harold Drabkin wrote: We are not commenting on what is in the With field for ISS annotations. We are saying that the Fantom annotations are RCA because not all uses of blasts, etc. are ISS evidence. Yes, you can have certain things in the WITH field for ISSs, but some of these can also be in the WITH field of RCA (or even IEA for that matter). The point of the ISS is that the curator checked that the object pointed to has experimental evidence behind it. In an IEA nothing was checked, and for RCA, an "expert" was asked "does this seem reasonable, yes or no". Am I correct in understanding that the Cambridge GO meeting designated that using blast, etc. be equated always and only with ISS?

> Karen Christie wrote: Yes, for ISS, it states that "To be listed in the with field, a gene product must be experimentally characterized, i.e. it should be possible to annotate that gene product using one of the GO experimental evidence codes." I believe that we have agreed that when a specific gene product is listed in the with field for ISS, it must be experimentally characterized. However, we also just agreed at the 2007 Cambridge GO meeting that many other things are acceptable, including Pfams, Prosite, TIGRFAMS, CBS, COG, PANTHER, and also HMMs.

That however, is not always the case for an RCA or even an IEA using blastp or blastn or a domain search. Fantom did not confirm the experimental justification for the transfer of annotation based on the blasts, etc (that is, did not check that the thing they were comparing to the mouse cDNAs had experimental literature behind it).

All of the types of evidence that you mention (clustering analysis of individual clones, BLASTP and BLASTN analyses, domain mappings, and a hydrophobicity study designed to predict membrane-bound and secreted proteins) seem like sequence-based evidence to me, so given the agreement at the recent GO meeting that ISS should be used when the evidence is purely sequence based, this looks like ISS to me.

Below is an excerpt of the current version of the new draft of the ISS documentation.

  1. Sequence similarity, as determined by a pairwise or multiple alignment analysis, with experimentally characterized gene products (protein or RNA). The "with" field should be populated with the accession number of the matching gene product(s). To be listed in the with field, a gene product must be experimentally characterized, i.e. it should be possible to annotate that gene product using one of the GO experimental evidence codes: IDA, IMP, IGI, IPI, or IEP to a homologous sequence. The accession number may come from any publicly available database as long as the abbreviation is listed in the GO.xrf_abbs file. ADD HYPERLINK FOR THIS FILE
  2. RNA prediction methods (e.g., Rfam, tRNAscan, etc.). The "with" field should be populated with the appropriate accession number when available; however, the "with" field will be blank for tools like tRNAscan or for methods for calling snoRNAs, etc., where there is no name or ID for the external entity. (NOTE: this will need to be revised with respect to the fact that the with field must always be filled, but that the name of the method is allowed.)
  3. Predicted protein features (e.g., transmembrane regions, signal sequence, etc.). If an accession number or name exists for the HMM or other item used in the comparison, it should be placed in the "with" field (e.g., CBS:TMHMM).
  4. Statistically significant matches to recognized functional domains or protein families, as determined by tools such as InterPro, Pfam, SMART, TIGRFAMs, etc. The with field should be filled with the accession number or name, when available, of the domain or HMM (the name of the method could alternatively be included if these values are not available, as the with column must not be null for ISS annotations). Sequence based objects which are acceptable in this category include Pfams, Prosite, TIGRFAMS, CBS, COG, PANTHER.
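The "with" field rules in the draft excerpt above could be checked along these lines. This is a minimal sketch under stated assumptions, not an official GO tool: the `DB:identifier` layout of the value, the exact spelling of the database prefixes, the function name, and the `evidence_by_accession` lookup are all hypothetical.

```python
# Experimental evidence codes listed in the draft.
EXPERIMENTAL_CODES = {"IDA", "IMP", "IGI", "IPI", "IEP"}

# Sequence-based objects the draft names as acceptable. The exact
# prefix spellings used here are an assumption.
SEQUENCE_OBJECT_DBS = {"Pfam", "Prosite", "TIGRFAMS", "CBS", "COG", "PANTHER"}


def iss_with_is_acceptable(with_value, evidence_by_accession):
    """Check one "with" value of an ISS annotation.

    with_value: a string such as "CBS:TMHMM" (assumed DB:identifier layout).
    evidence_by_accession: hypothetical lookup from a gene product
        accession to the set of evidence codes already used for it.
    """
    db, sep, identifier = with_value.partition(":")
    if not sep or not db or not identifier:
        return False          # the with column must not be empty or malformed
    if db in SEQUENCE_OBJECT_DBS:
        return True           # a domain, family, HMM, or feature predictor
    # Otherwise treat the value as a gene product, which must be
    # experimentally characterized.
    codes = evidence_by_accession.get(with_value, set())
    return bool(codes & EXPERIMENTAL_CODES)
```

For example, `iss_with_is_acceptable("CBS:TMHMM", {})` returns `True`, while a gene product accession known only from IEA annotations returns `False`.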

> On Tue, 20 Mar 2007, David Hill wrote

For the Fantom project, curators were presented with a graphical curation interface that consisted of clustering analysis of individual clones, BLASTP and BLASTN analyses, domain mappings, and the results of a hydrophobicity study designed to predict membrane-bound and secreted proteins. They were also given a set of suggested GO terms based on all of these analyses. Curators were asked to try to identify the gene that the clone was associated with, and then they were asked to either accept or reject the GO terms that were associated with the clone. Curators were assigned clones based on the predicted GO terms and their areas of expertise. If a curator accepted a GO term, then the GO term that was assigned to the clone was imported into MGI. If that GO term assignment had originally been based on some type of sequence comparison or keyword mapping, then the target object was supplied. The curator made the judgment using all of the data, his or her own expertise, and the Fantom interface. If this isn't a reviewed computational analysis, then I'm not sure what is. David

>Karen Christie wrote: We haven't put discussion of the definition of RCA back onto the table yet, but from what I understand of what you've said they've done, the Riken stuff is not what the RCA code was created to cover. RCA was initially intended to cover things which were a combinatorial analysis of multiple datatypes, e.g. two-hybrid protein-protein interaction datasets + synthetic genetic interactions, and things like that. We agreed at the St. Croix meeting that one of the datatypes for RCA could be sequence, if it was combined with other data types such as what I've listed above. For a purely sequence-based analysis, RCA would not be appropriate. Karen

> On Mon, 19 Mar 2007, Michelle Gwinn Giglio wrote: Well, the RCA vs IEA question I guess is under debate now, and I don't have a strong feeling one way or the other. As for your IEA being computational enough - IEAs don't have to be computational in the sense of computing scores, alignments, whatever - it just needs to have been done without human review - so if you do that through mapping files or whatever it doesn't matter - just that it was done by some automatic method. ISS is computational because the way one asserts that annotations should be passed from one item to another is by assessing how similar the two items are based on some kind of computational measure of sequence similarity, be that pairwise alignment or match to HMM or whatever. It is not an experimental code since the experiment has not been done on the gene product you are annotating. The step that is the evidence for the annotation is computational. Michelle


Harold Drabkin wrote:
No, it fits RCA since whatever they did they had to look over to finally 
get something to point to one of the translation tables. The original point 
I was trying to make was whether MGI's "IEA" is computational enough or not to 
be called computational, since all it does is look through a record for a seq_id. 
If the id is in MGI, then the record is attached to that object, and the record's
keywords, EC and IP domains are used to map a GO term from the three translation 
tables. I also was a little iffy about ISS being a computational method rather
than an experimental method (or something in between), since we require an 
experiment on one end of the trio (A ISS B).

> Michelle Gwinn Giglio wrote: Hi Harold, OK - so MGI has a narrower definition of ISS than the wider GO - that is of course fine. And the way you have described the Riken process sounds like an IEA to me, since there appears to be no manual review of the results - therefore IEA. However, for the wider GO, it is ok to use, for example, a Pfam match in the "with" field for ISS, assuming that one has reviewed the match to the Pfam and found it sufficient evidence for an annotation.

> Harold Drabkin wrote: No, because the way we use ISS, there MUST be an experiment; Riken used a variety of methods involving blasts, domain comparisons, Pfam, etc. to come up with either an InterPro domain OR a UniProt ID that they then could use the translation tables on to get a GO term. We strictly use an ISS where the reference is to a paper with an experiment that does the sequence comparison, OR in the case of our "by orthology", the PMID for the experiment reference is in our notes field (which GOA mines: e.g., a mouse ISS to human with a certain experiment reference is an IDA for human). At the time they were ISS, the IP or UniProt ID was in the WITH field. Then we made them RCAs because the ISS was not how we usually use ISS; now that IEAs can in fact contain something in the WITH field, we will at some point migrate back the information. The way the Riken people did the assignment is NOT in line with MGI's policy for ISS to something that has an experiment that one could get an experimental evidence code from.

> Michelle Gwinn Giglio wrote: One last thing on the Riken stuff - I agree completely with Karen that the GOC decided that ISS could be used for things in "with" other than experimentally characterized proteins (like HMMs, PROSITE motifs, etc.), but I think that we also decided that if one uses a protein in the "with" field (and one is therefore making assertions based on a pairwise alignment), the protein in "with" should be experimentally characterized. So if the Riken people are using things like Pfam, Prosite, etc. for their domain hits, those can be ISS. If they are using matches to proteins, one has to see if the match proteins are characterized or not before using ISS. Right?

>Karen Christie wrote: On Sun, 18 Mar 2007, Harold Drabkin wrote - In this case, there would be an actual figure with a comparison. It is literally a sequence comparison; most likely the figure was made by taking the output of a blast x2 and perhaps using "pretty" to make it presentable. But perhaps we should word the definition in a way that implies a more "hands on" intervention by attaching an experiment done on one of the two being compared. It is certainly different than a blast being done, then using translation tables, the annotations of one are transferred to the other. For a while, we were calling our Rikens ISS, but we then changed them to RCA, since the Riken group wasn't doing experiments, but merely transferring the annotations from another protein to the mouse protein in question based on whatever domain or protein had reasonable comparison.-

So, since the authors of the paper used BLAST, or some other method to do the comparison, there is still a computational method involved in the ISS. We use experimental evidence codes to describe things that the author did, so I don't think it's a problem to say that the method is computational even if the computation was done by the authors of the paper. After all, the evidence code is only describing the method used. It's the reference that says who used the method. About the Riken annotations, please do remember that the Evidence Code Committee (ECC) agreed to recommend overturning the Annotation Camp decision to use RCA for sequence similarity comparisons where you could not put an experimentally characterized ortholog into the with column. The ECC recommended that the evidence codes should be purely a way to designate the basis of the annotation, and should not have any quality implications, including the specific recommendation that ALL methods that are based purely on sequence similarity comparisons should be in the ISS code. These recommendations were ratified at the recent GOC meeting in Cambridge.

The RCA code was sent back to committee to resolve exactly what it should be for, but any purely sequence similarity based method should be ISS. I will say that the original intent for RCA was for methods that did computational analyses of multiple types of data, but we'll have to discuss it. However, we agreed to work on updating the documentation for the stuff we agreed upon before revisiting the two codes sent back to committee. Karen
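One way to summarize the distinctions drawn in this thread is the small decision rule below. It is a hedged sketch only: the function name and the data-type labels are assumptions for illustration, it applies only to computationally derived annotations, and the RCA branch reflects the original intent described above, which was still under committee discussion.

```python
def suggest_evidence_code(data_types, curator_reviewed):
    """Suggest an evidence code for a computationally derived annotation.

    data_types: iterable of labels for the kinds of data analysed,
        e.g. {"sequence"} or {"two-hybrid", "genetic interaction"}.
    curator_reviewed: True if a curator approved the final annotation.
    Returns "IEA", "ISS", "RCA", or None when the thread gives no rule.
    """
    kinds = set(data_types)
    if not kinds:
        raise ValueError("at least one data type is required")
    if not curator_reviewed:
        return "IEA"    # produced automatically, no human review
    if kinds == {"sequence"}:
        return "ISS"    # purely sequence based and curator approved
    if len(kinds) > 1:
        return "RCA"    # reviewed analysis combining several data types
    return None         # single non-sequence data type: not covered here
```

Under this reading, the Fantom case (several sequence-based analyses, each accepted or rejected by a curator) would come out as `suggest_evidence_code({"sequence"}, True)`, which gives `"ISS"`.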

-----Original Message-----

> From: Karen Christie [1] Sent: Friday,

> On Thu, 15 Mar 2007, Harold Drabkin wrote
Two points
1. ISS isn't only just computationally used. If I have a paper that 
clones the human and mouse genes, then does the experiment on the human 
version, I would use ISS, with the ref., and point to the human seqid.
But I don't consider it computational; it's a manual assignment.

How did the mouse/human orthologs get called though? Was there a computational stage to determine best hits, synteny etc.? We've already said that ISS requires human judgement to use, but from what I'm aware of (vastly enhanced by the TIGR eukaryotic analysis course I just attended) it seems that the methods for determining orthologs generally involve computational steps to determine the best hits or orthologs.

2. I wouldn't necessarily call IEA a non-curator "approved" code; just
no intervention. For example, in the case of our IEAs, we get the GO 
assignments from mapping of UniProt Keywords, EC number, and domains
to the 3 translation tables. These UniProt records are only loaded
if they have a nucleic acid seq_ID in them that matches something at MGI
(a literal string match; no sequence scanning). We don't look at it.
It just loads every night.
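The nightly load described in point 2 might look roughly like the sketch below. It is not MGI's actual loader: the record layout, the function name, and the three mapping dictionaries are hypothetical stand-ins for the UniProt records and the keyword, EC, and domain translation tables.

```python
def load_iea_annotations(records, mgi_seq_ids, keyword2go, ec2go, domain2go):
    """Build IEA annotations with no curator review.

    records: list of dicts with "seq_ids", "keywords", "ec_numbers",
        and "domains" (assumed layout of a parsed UniProt record).
    mgi_seq_ids: dict from sequence ID string to an MGI object ID.
    keyword2go, ec2go, domain2go: dicts from a key to a list of GO terms,
        standing in for the three translation tables.
    """
    annotations = []
    for rec in records:
        # Literal string match on sequence IDs; no sequence comparison.
        matched = [s for s in rec.get("seq_ids", []) if s in mgi_seq_ids]
        if not matched:
            continue                  # record is not loaded at all
        go_terms = set()
        for kw in rec.get("keywords", []):
            go_terms.update(keyword2go.get(kw, []))
        for ec in rec.get("ec_numbers", []):
            go_terms.update(ec2go.get(ec, []))
        for dom in rec.get("domains", []):
            go_terms.update(domain2go.get(dom, []))
        for seq_id in matched:
            for term in sorted(go_terms):
                annotations.append((mgi_seq_ids[seq_id], term, "IEA"))
    return annotations
```

Nothing in this flow inspects an alignment or asks a curator to confirm a term, which is the sense in which the result is IEA even though the translation tables themselves may be manually curated.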

However, I think the keywords in a UniProt record are assigned manually, aren't they? Ditto for the EC #s. Aren't the translation tables themselves highly manually curated (thank you Val)?

I'm open to improvements in the phrasing of this category. Also, in saying that IEA means that a curator did not approve the assignment, I only meant that a curator did not approve the GO assignment, not that the underlying basis of the assignment is poor. I think that the IEA documentation should include some statements to the effect that IEA methods are often quite accurate, but sometimes only allow annotation