Annotation Conf. Call, March 27, 2012
- Developing new documentation for IKR, continued from the 24th Jan GOC annotation call
Changes made since the last discussion:
1. Requirement for phylogenetic analysis removed.
- Removed from the new IKR draft documentation, along with a definition change to refer to the "lack of" key residues rather than "loss of" key residues.
- SourceForge ECO request, asking for mapping to an alternative ECO ID that does not require phylogenetic evidence as the basis for determination of loss of key residues (cf. ECO:0000320, phylogenetic determination of loss of key residues used in manual assertion).
2. Requirement for a 'with/from' identifier when using the GO_REF added to the draft documentation:
'Where an IKR annotation statement is made using the GO_REF, inclusion of an identifier in the 'with/from' column of the annotation format, that highlights to the user the lacking residues (e.g. an alignment or rule identifier) is strongly recommended.'
Continue Discussion on Redundant set of annotations
Recap on previous discussion
1. The discussion was focused around the ideal contents of the annotation file (GAF) that represents all GO Consortium annotations for a given taxon (species-owner groups). There was little discussion regarding annotation web display.
2. A unique combination of 'GOID + genePID + evidence + with/from + reference' constitutes a non-redundant annotation.
3. Two annotations that are the same in the above fields but differ in the Assigned_by column are considered redundant.
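The redundancy rule recapped above can be sketched in code. This is a minimal illustration, not the GOC's filtering script: it assumes GAF 2.0 column order, treats "genePID" as DB plus DB Object ID, and uses made-up rows built by a hypothetical `make_row` helper. Columns 16 and 17 are deliberately left out of the key, since their role was still an open question on this call.

```python
# GAF 2.0 column positions (0-based). Treating "genePID" as DB + DB Object ID
# is an assumption made for this sketch.
DB, DB_OBJECT_ID, GO_ID, REFERENCE, EVIDENCE, WITH_FROM = 0, 1, 4, 5, 6, 7
ASSIGNED_BY = 14


def redundancy_key(row):
    """Fields that define a non-redundant annotation; Assigned_by is excluded."""
    gene_product = f"{row[DB]}:{row[DB_OBJECT_ID]}"
    return (row[GO_ID], gene_product, row[EVIDENCE], row[WITH_FROM], row[REFERENCE])


def drop_redundant(rows):
    """Keep the first row seen for each key, preserving input order."""
    seen = set()
    kept = []
    for row in rows:
        key = redundancy_key(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def make_row(go_id, assigned_by):
    """Build a hypothetical 17-column GAF row; every value is a placeholder."""
    row = [""] * 17
    row[DB], row[DB_OBJECT_ID] = "ExampleDB", "GP0001"
    row[GO_ID], row[REFERENCE] = go_id, "PMID:0000000"
    row[EVIDENCE], row[WITH_FROM] = "IDA", ""
    row[ASSIGNED_BY] = assigned_by
    return row


rows = [
    make_row("GO:0005215", "GroupA"),
    make_row("GO:0005215", "GroupB"),  # differs from the first row only in Assigned_by
    make_row("GO:0032217", "GroupA"),
]
unique_rows = drop_redundant(rows)
print(len(unique_rows))
```

Keeping the first row per key is an arbitrary choice here; which group's copy should survive was not decided in the discussion.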
1. How do col-16 and col-17 fit into the definition of a redundant set?
2. How much filtering should be applied to IEA predictions that are made to both parent and child terms?
InterPro perspective (Alex Mitchell):
1. There may be a good case for keeping annotation predictions to both parent and child terms in the GAFs, since returning multiple GO terms through matching several InterPro entries in a hierarchy is extra evidence for confidence in a match (i.e., if you're hitting a parent + child + grandchild signature and getting increasingly specific GO terms as a result, that's stronger evidence that the most specific GO term is correctly assigned than if you get a hit to a grandchild signature and a specific GO term alone).
2. My other concern regards protein families, and stems from the fact that our GO mappings aren't exhaustive. Take as an example the InterPro entry for the riboflavin ECF transporter S component RibU (typical family member: E5QVT2). The proteins mediate riboflavin uptake, so the InterPro entry is mapped to GO term GO:0032217 (riboflavin transporter activity). There is also a strong suggestion that they may transport FMN and roseoflavin too, but probably not enough evidence for an InterPro curator to give the entry GO mappings relating to those functions. At the same time there is a more general family entry in InterPro that picks up ECF transporter S components as a whole (i.e., it doesn't discriminate between those that bind different substrates). This has the GO term GO:0005215 (transporter activity) mapped to it.
What this means is, if I put E5QVT2 through InterProScan, I'd get the GO terms GO:0005215 (transporter activity) + GO:0032217 (riboflavin transporter activity). If we remove parent terms because we consider them redundant, I'd just get GO:0032217 (riboflavin transporter activity).
The first result looks to me more in line with what the protein does (it has transporter activity, including riboflavin transporter activity), whereas the second result looks to me like this is a riboflavin transporter, full stop.
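The effect of parent-term filtering on the E5QVT2 example can be illustrated with a short sketch. It assumes a hand-written ancestor map (`ANCESTORS`, hypothetical) rather than one derived from the ontology, and the function name `most_specific` is invented for illustration; it is not how InterProScan or any GOC pipeline is known to work.

```python
# Hand-written ancestor map for illustration only; a real pipeline would
# derive ancestors from the ontology rather than list them by hand.
ANCESTORS = {
    # riboflavin transporter activity -> transporter activity
    "GO:0032217": {"GO:0005215"},
}


def most_specific(terms, ancestors):
    """Drop every term that is an ancestor of another term in the set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors.get(term, set())
    return terms - implied


predicted = {"GO:0005215", "GO:0032217"}  # both terms returned for E5QVT2
filtered = most_specific(predicted, ANCESTORS)
print(sorted(predicted))  # unfiltered result: parent and child term
print(sorted(filtered))   # filtered result: child term only
```

The filtered set loses the fact that two independent InterPro entries supported the prediction, which is the concern raised above.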
Present: Rama, Karen and Jodi (MGI), Stan (RGD), Li, Mary and Judy (MGI), Kimberly (WormBase), Midori (PomBase), Prudence, Yasmin, Emily (UniProt-GOA), Jane and Becky (GOed), Paul T (USC), Lakshmi (AgBase)
GOC-agreed QC checks implemented
Rama: A number of GOC-agreed QC checks have been added to Mike's checking script, which will go into production this weekend. Many annotations to 'binding', and annotations where incorrect references are applied, are being filtered. Curators are encouraged to review the results of this check.
1. It was recommended that the documentation state that a value in the 'with/from' field is mandatory where a GO_REF is applied for this annotation.
2. QC checks should be applied to annotations using this evidence code to ensure that the 'NOT' qualifier is always present, and that a 'with/from' value is always included when a GO_REF is applied as the reference.
3. UniProt will be able to supply IKR-evidenced NOT annotations using a GO_REF and a UniProt UniRule identifier in the 'with' field. Emily to supply a count of the annotations that will be generated from UniProt UniRule data, and to supply an annotation example.
4. Other changes made to this code were agreed upon. Emily to circulate the IKR draft documentation to the GO list and implement the change on the GO website after 2 weeks unless there are objections.
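The two IKR checks agreed above could look roughly like the following. This is a sketch under stated assumptions, not the code in Mike's checking script: it assumes GAF 2.0 column order, the function name `check_ikr` is invented, and the GO_REF and UniRule identifiers in the sample rows are placeholders rather than real records.

```python
# GAF 2.0 column positions (0-based) used by the checks.
QUALIFIER, REFERENCE, EVIDENCE, WITH_FROM = 3, 5, 6, 7


def check_ikr(row):
    """Return a list of problems found in one annotation row (empty if none)."""
    problems = []
    if row[EVIDENCE] != "IKR":
        return problems  # the checks only apply to IKR-evidenced annotations
    qualifiers = row[QUALIFIER].split("|") if row[QUALIFIER] else []
    if "NOT" not in qualifiers:
        problems.append("IKR annotation lacks the NOT qualifier")
    uses_go_ref = any(ref.startswith("GO_REF:") for ref in row[REFERENCE].split("|"))
    if uses_go_ref and not row[WITH_FROM].strip():
        problems.append("GO_REF reference without a with/from value")
    return problems


def make_row(qualifier, reference, with_from):
    """Build a hypothetical 17-column IKR row; identifiers are placeholders."""
    row = [""] * 17
    row[QUALIFIER], row[REFERENCE] = qualifier, reference
    row[EVIDENCE], row[WITH_FROM] = "IKR", with_from
    return row


good = make_row("NOT", "GO_REF:0000000", "UniRule:UR000000000")
bad = make_row("", "GO_REF:0000000", "")
print(check_ikr(good))
print(check_ikr(bad))
```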
1. Column 17, the gene_product_form_id field in GAFs, supplies information on the precise gene product known to carry out a function or to be found at a location. It should therefore be considered a unique and important part of an annotation. Where two annotations differ solely in that one contains a value in column 17 and the other does not, groups should discuss which provides the more accurate representation of the data in the reference. It is expected that the annotation with column 17 filled will provide the better representation of the data.
2. Column 16, the annotation_extension field, also contains valuable data. The GO Consortium is encouraging greater expressivity in its annotation, so column 16 data should be included in the authoritative annotation set for any species. If a gene product localizes to the nucleus in both endothelial and epithelial cells, then col-16 holds both cell types (endothelial and epithelial cell), piped together in a single annotation row. There can be many combinations of relationship-value pairs in col-16. Further discussion is needed on the state of the column 16 format, as it is part of a released set of annotations. Improved documentation needs to be added to the GOC website.
Emily to send list of relationship-value pairs used by GO annotation groups and draft col.16 documentation. The next GO annotation call on the 10th of April may well focus on the column 16 dataset.
Paul T: we need a working group to discuss how GO annotation expressivity can be improved, moving on from column 16.
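As a rough illustration of the relationship-value pairs described for column 16, the sketch below splits a column 16 string into its pairs. It is an assumption-laden example: the relation name `occurs_in` and the Cell Ontology identifiers are chosen for illustration and do not come from these minutes, and splitting on commas within a piped entry follows our reading of the GAF 2.0 format rather than anything agreed on this call.

```python
import re

# One relationship-value pair, e.g. occurs_in(CL:0000115).
PAIR = re.compile(r"^\s*([A-Za-z_]\w*)\((.+)\)\s*$")


def parse_extension(column_16):
    """Split column 16 into pipe-separated groups of (relation, value) pairs."""
    groups = []
    if not column_16.strip():
        return groups
    for part in column_16.split("|"):
        pairs = []
        for item in part.split(","):  # comma-joined pairs stay in one group
            match = PAIR.match(item)
            if match is None:
                raise ValueError(f"not a relationship-value pair: {item!r}")
            pairs.append((match.group(1), match.group(2)))
        groups.append(pairs)
    return groups


# Identifiers intended as endothelial and epithelial cell; verify against
# the Cell Ontology before relying on them.
example = "occurs_in(CL:0000115)|occurs_in(CL:0000066)"
parsed = parse_extension(example)
print(parsed)
```

A listing of the pairs actually in use, as Emily's action item proposes, could be produced by running a parser of this kind over each group's GAF.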
3. Annotation IEA Filtering based on the provision of associations to both parent and child terms
Alex Mitchell (InterPro) discussed the benefits of not filtering annotations provided by InterPro2GO to parent and child terms (see above). Different InterPro IDs map to terms at different levels, and different InterPro IDs can map to the same term (as shown for moeA5: http://www.ebi.ac.uk/QuickGO/GProtein?ac=A0A000). In this case, moeA5 is annotated to GO:0030170 using different InterPro IDs. Alex mentioned that one protein can belong to two rules or two models; these are independent pieces of evidence. There are relationships between the InterPro models.
Paul T.: However, we need to evaluate what our users want to see. What do our users expect from our resource? Could the family/subfamily relationships provided by some InterPro member databases, and those created by InterPro, be used to filter out some of the redundant GO annotations? Action: Alex to investigate.
Kimberly: Should we move away from displaying a historical view of how annotations change over time (i.e. move away from every paper being annotated)?
Emily: We are discussing the GO annotation set that goes into the GOC database, so we may want to be far more conservative in the filtering applied to it than when deciding upon the optimal web display of annotation data.