Annotation Conf. Call, February 14, 2012
Latest revision as of 16:30, 9 April 2014
Agenda for Annotation Call
- More evidence codes - new Evidence code for Inferences based on Ontology links (http://gocwiki.geneontology.org/index.php/Evidence_for_Inferences_based_on_Ontology_links) (Rama)
- Update on protein binding obsoletions (Jane)
- Update on communication mechanisms for changes to the GO taxon file (Jane)
- Can we have a quick review of the preferred mechanism right now for feedback on PAINT annotations? (Kimberly)
- new QC checks (Amelia) - see below
- col 17 entry ID hierarchy - see below
Suggested QC Checks
Remove redundant GP info
The GP synonyms column must not contain information from other columns (GP symbol, GP name, DB object ID) as this info is redundant
e.g. incorrect:
1 DB | 2 DB object ID | 3 DB object symbol | ... | 10 DB object name | 11 DB object synonym | 12 DB object type |
---|---|---|---|---|---|---|
PomBase | SPCC1884.02 | nic1 | ... | NiCoT heavy metal ion transporter Nic1 | SPCC1884.02\|nic1\|SPCC757.01 | gene |
correct:
1 DB | 2 DB object ID | 3 DB object symbol | ... | 10 DB object name | 11 DB object synonym | 12 DB object type |
---|---|---|---|---|---|---|
PomBase | SPCC1884.02 | nic1 | ... | NiCoT heavy metal ion transporter Nic1 | SPCC757.01 | gene |
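The proposed check could be sketched roughly as follows. This is only an illustration, not an agreed implementation: the function name `redundant_synonyms` and the list-of-columns row layout are assumptions made here, and the example row leaves columns 4-9 empty for brevity.

```python
def redundant_synonyms(row):
    """Return the col-11 synonyms that repeat col 2, 3, or 10 of a GAF row.

    `row` is a list of column values for one GAF line (index 0 = column 1).
    """
    object_id, symbol, name = row[1], row[2], row[9]
    # Column 11 holds zero or more synonyms separated by pipes.
    synonyms = [s for s in row[10].split("|") if s]
    repeated = {object_id, symbol, name}
    return [s for s in synonyms if s in repeated]


# The PomBase example rows; columns 4-9 are placeholders (empty strings).
bad = ["PomBase", "SPCC1884.02", "nic1", "", "", "", "", "", "",
       "NiCoT heavy metal ion transporter Nic1",
       "SPCC1884.02|nic1|SPCC757.01", "gene"]
good = bad[:10] + ["SPCC757.01", "gene"]

print(redundant_synonyms(bad))   # the two repeated values
print(redundant_synonyms(good))  # an empty list
```

A check like this only reports the repeated values; whether they should be removed automatically depends on whether a group relies on the redundancy.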
Col 17 ID format
Only one ID is allowed in col 17, and that ID should be formatted correctly and be from a database listed in GO.xrf_abbs.
See the docs on col 17 (http://www.geneontology.org/GO.format.gaf-2_0.shtml#gene_product_form_id) for a refresher on col 17 contents.
Where spliceforms exist, they must always have the same parent GP ID - unless you can think of any case in which this would not happen?
e.g. incorrect
1 DB | 2 DB object ID | ... | 17 gene product form ID |
---|---|---|---|
MGI | MGI:123456 | ... | UniProt:P0217K-3 |
MGI | MGI:654321 | ... | UniProt:P0217K-3 |
Correct:
1 DB | 2 DB object ID | ... | 17 gene product form ID |
---|---|---|---|
MGI | MGI:123456 | ... | UniProt:P0217K-3 |
MGI | MGI:123456 | ... | UniProt:P0217K-3 |
Possible exception: Some MODs assign protein identifiers based upon amino acid sequence. Thus, if column 2 contains a gene ID, a given protein isoform ID in column 17 could be associated with more than one gene ID. Groups could check their current annotations to see whether this is the case for any existing annotations or whether there are actually errors in their entries for either column 2 or column 17. If gene and protein identifiers from a database can easily be distinguished, then that might help determine when the QC check should be applied. --kimberly
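A rough sketch of how the col 17 checks might look, under stated assumptions: `KNOWN_DBS` is a hypothetical subset standing in for the full GO.xrf_abbs list, both function names are invented here, and `P12345` is a placeholder accession. Because of the possible exception above, shared col 17 IDs are reported for review rather than treated as errors.

```python
from collections import defaultdict

# Hypothetical subset of database abbreviations; a real check would load
# the full list from GO.xrf_abbs.
KNOWN_DBS = {"UniProtKB", "ENSEMBL", "VEGA", "RefSeq", "PR", "TAIR"}


def check_col17_format(form_id):
    """Return a list of problems found in a single column 17 value."""
    problems = []
    if "|" in form_id:
        problems.append("more than one ID")
    db, sep, local_id = form_id.partition(":")
    if not sep or not local_id:
        problems.append("not in DB:ID form")
    elif db not in KNOWN_DBS:
        problems.append("unknown database " + db)
    return problems


def shared_form_ids(rows):
    """Map each column 17 ID to its column 2 IDs when it has more than one.

    `rows` is an iterable of (db_object_id, form_id) pairs.
    """
    parents = defaultdict(set)
    for object_id, form_id in rows:
        if form_id:  # column 17 is optional, so skip empty values
            parents[form_id].add(object_id)
    return {f: sorted(p) for f, p in parents.items() if len(p) > 1}


# The "incorrect" example: one spliceform ID under two parent IDs.
rows = [("MGI:123456", "UniProt:P0217K-3"),
        ("MGI:654321", "UniProt:P0217K-3")]
print(shared_form_ids(rows))
print(check_col17_format("uniProtKB:P12345"))
```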
Col 17 ID Hierarchy
Identifiers in column 17 come from a range of databases; propose creating a list of preferred databases from which the IDs are taken.
e.g. if the hierarchy were UniProtKB > VEGA > ENSEMBL
If a UniProtKB ID exists, use that; else if a VEGA ID exists, use that; else if an ENSEMBL ID exists, use that; else PANIC!
Different object types (protein, mRNA, etc.) may need to have different hierarchies.
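The fallback logic could be sketched like this. It assumes the example hierarchy given above (the real order was not decided), and the function name and the placeholder IDs are made up for illustration.

```python
# Example hierarchy from the proposal; the actual order is still open.
HIERARCHY = ["UniProtKB", "VEGA", "ENSEMBL"]


def pick_preferred_id(candidate_ids, hierarchy=HIERARCHY):
    """Return the candidate ID whose database ranks highest in `hierarchy`."""
    by_db = {}
    for cid in candidate_ids:
        db = cid.split(":", 1)[0]
        by_db.setdefault(db, cid)  # keep the first ID seen per database
    for db in hierarchy:
        if db in by_db:
            return by_db[db]
    # The "else PANIC!" branch.
    raise LookupError("no ID from a preferred database: %r" % (candidate_ids,))


# Placeholder IDs, not real accessions.
print(pick_preferred_id(["ENSEMBL:example1", "VEGA:example2"]))
```

Passing a different `hierarchy` list per object type (protein, mRNA, etc.) would be one way to cover the case where the types need different orders.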
DBs used so far:
Database | GP form types | # distinct IDs | Assigned by |
---|---|---|---|
ENSEMBL | protein | 2464 | BHF-UCL, DFLAT, GOC, HGNC, IntAct, MGI, RGD, RefGenome, UniProtKB |
PR | protein | 3 | MGI |
protein_id | protein | 31 | MGI |
Protein_id [capitalization error] | protein | 1 | MGI |
RefSeq | gene, protein | 3215 | BHF-UCL, GOC, IntAct, MGI, RGD, RefGenome, UniProtKB |
TAIR | RNA, gene_product, miRNA, protein, rRNA, snRNA, snoRNA, tRNA | 45992 | GOC, IntAct, RefGenome, TAIR, TIGR, UniProtKB |
UniProtKB | protein | 4601 | BHF-UCL, DFLAT, GOC, HGNC, IntAct, MGI, PINC, RGD, RefGenome, Roslin_Institute, UniProtKB |
UniPRotKB [capitalization error] | protein | 1 | MGI |
uniProtKB [capitalization error] | protein | 2 | MGI |
VEGA | protein | 13706 | BHF-UCL, DFLAT, GOC, HGNC, IntAct, MGI, PINC, RGD, RefGenome, Roslin_Institute, UniProtKB |
WB | gene | 4 | WB |
WP | gene | 6 | WB |
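Capitalization errors like the ones flagged in the table could be found automatically. A minimal sketch, assuming the database prefixes have already been extracted from col 17 (the function name is hypothetical):

```python
from collections import defaultdict


def case_variants(prefixes):
    """Group database prefixes that differ only by letter case."""
    groups = defaultdict(set)
    for prefix in prefixes:
        groups[prefix.lower()].add(prefix)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Prefixes as they appear in the table above.
seen = ["ENSEMBL", "PR", "protein_id", "Protein_id", "RefSeq", "TAIR",
        "UniProtKB", "UniPRotKB", "uniProtKB", "VEGA", "WB", "WP"]
for group in case_variants(seen):
    print(group)
```

This only groups the variants; deciding which spelling is the correct one would still need a lookup against GO.xrf_abbs.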
Minutes
Present:
SGD: Rama, Karen, Julie, Cindy, Jodi
WormBase: Kimberly
TAIR: Donghui
MGI: Li, Mary
PomBase: Midori, Val
EBI: Jane, Tony (GOA)
IGS: Marcus
NextProt: Pascale
AgBase: Lakshmi
New Evidence Code for Inferences
- Everybody agreed that inferring annotation in CC based on the BP-IMP annotations can be a problem. IMP can indicate a downstream effect too. BP annotation with IDA may not be good for propagation to CC either.
- Perhaps we should draw rules on when these annotations should be inferred. For example have a rule that process annotations with IMP should not be used for inferring CC annotations.
- Marcus brought up a good point that there are two parts to this inference: the first is the primary evidence and the second is how the annotation was asserted. The evidence code is not the place to indicate the process by which the annotation was made, yet here we are trying to come up with an ev. code to represent exactly that.
- We will survey the inferences to see how many of them make sense and which evidence codes make sense to propagate. If none of those work, then we will seriously consider the new ev. code or assertion method route.
- Action item: Request Chris to put these inferences in GOCVS so everybody can see them
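As an illustration only, the kind of rule floated above could be expressed as a small filter. Nothing here was agreed on the call: the `BLOCKED` set encodes just the one example rule (process annotations with IMP are not used to infer CC annotations), and the function name is an assumption.

```python
# Assumed rule set based on the example discussed: biological process (P)
# annotations with IMP evidence are not used to infer cellular component
# (C) annotations. This is a proposal, not agreed policy.
BLOCKED = {("P", "IMP")}


def may_infer_cc(aspect, evidence_code, blocked=BLOCKED):
    """Return True if an annotation may be used to infer a CC annotation."""
    return (aspect, evidence_code) not in blocked


print(may_infer_cc("P", "IMP"))
print(may_infer_cc("P", "IDA"))
# If the survey shows IDA is also unsuitable, the rule set can be extended:
print(may_infer_cc("P", "IDA", blocked=BLOCKED | {("P", "IDA")}))
```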
Protein binding and Taxon constraints (Jane)
- Jane has listed the protein binding terms that are going to be obsoleted (and the affected annotations). Please review them and holler (on the go-discuss list) if you have any questions.
- Moving forward, to communicate the addition of taxon constraints to the ontologies, ontology developers will send out an email, similar to the obsoletion emails, whenever new constraints are added.
Feedback on PAINT (Kimberly)
Pascale suggested that we write to the ref.genome mailing list with any feedback. Hopefully this pipeline will get sorted out during the upcoming GOC meeting.
QC checks (Amelia)
- Checks for column 11: SGD has a reason for the redundant information in col-11. SGD's GAF reports a standard gene name like PHO5 in col-2 and the corresponding ORF name/SGDID in col-11; when there is no standard name for a gene (there are a lot of uncharacterized genes), the systematic name is given in both col-2 and col-3. This way one can retrieve all the systematic names from col-11 if one wishes to. This will be discussed further.
- Checks for Col 17: Kimberly mentioned that a col-17 ID does not always have to have the same parent GP ID in col-2.
  - Possible exception: Some MODs assign protein identifiers based upon amino acid sequence. Thus, if column 2 contains a gene ID, a given protein isoform ID in column 17 could be associated with more than one gene ID. Groups could check their current annotations to see whether this is the case for any existing annotations or whether there are actually errors in their entries for either column 2 or column 17. If gene and protein identifiers from a database can easily be distinguished, then that might help determine when the QC check should be applied. --kimberly
- These checks are still in the proposal stage and will be discussed again at a later stage.
NO CALL ON FEB 28th
We won't have a call on Feb 28th since most of us will be flying back from the GOC meeting. The next call will be on March 13th.