Annotation Conf. Call, February 14, 2012

Agenda for Annotation Call

More evidence codes - new Evidence code for Inferences based on Ontology links (http://gocwiki.geneontology.org/index.php/Evidence_for_Inferences_based_on_Ontology_links) (Rama)

Update on protein binding obsoletions (Jane)

Update on communication mechanisms for changes to the GO taxon file. (Jane)

Can we have a quick review of the currently preferred mechanism for feedback on PAINT annotations? (Kimberly)

new QC checks (Amelia) - see below

col 17 entry ID hierarchy - see below

Suggested QC Checks

Remove redundant GP info

The GP synonyms column must not contain information from other columns (GP symbol, GP name, DB object ID) as this info is redundant

e.g. incorrect:

1 DB	2 DB object ID	3 DB object symbol	...	10 DB object name	11 DB object synonym	12 DB object type
PomBase	SPCC1884.02	nic1	...	NiCoT heavy metal ion transporter Nic1	SPCC1884.02|nic1|SPCC757.01	gene

correct:

1 DB	2 DB object ID	3 DB object symbol	...	10 DB object name	11 DB object synonym	12 DB object type
PomBase	SPCC1884.02	nic1	...	NiCoT heavy metal ion transporter Nic1	SPCC757.01	gene
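A minimal sketch of how this check could be run over a GAF line, assuming tab-separated columns and pipe-separated synonyms in col 11. The function name `redundant_synonyms` is made up for illustration and is not part of any existing GO tool.

```python
def redundant_synonyms(gaf_line):
    """Return the col 11 synonyms that merely repeat col 2, col 3 or col 10."""
    cols = gaf_line.rstrip("\n").split("\t")
    # GAF columns are numbered from 1 in the docs; Python lists start at 0.
    object_id, symbol, name = cols[1], cols[2], cols[9]
    synonyms = [s.strip() for s in cols[10].split("|") if s.strip()]
    already_present = {object_id, symbol, name}
    return [s for s in synonyms if s in already_present]


# Build a 17-column line from the "incorrect" example; elided columns stay empty.
fields = [""] * 17
fields[0] = "PomBase"
fields[1] = "SPCC1884.02"
fields[2] = "nic1"
fields[9] = "NiCoT heavy metal ion transporter Nic1"
fields[10] = "SPCC1884.02|nic1|SPCC757.01"
print(redundant_synonyms("\t".join(fields)))
```

The comparison here is exact and case-sensitive; whether case-insensitive matches should also count as redundant is left open.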

Col 17 ID format

Only one ID is allowed in col 17, and that ID should be formatted correctly and be from a database listed in GO.xrf_abbs.
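One way the format check might look, as a rough sketch. `KNOWN_DBS` is an illustrative hand-typed subset, not the real contents of GO.xrf_abbs (a real check would load that file), and the treatment of `|` and `,` as multi-ID separators is an assumption.

```python
# Illustrative subset only; a real check would read the abbreviations from GO.xrf_abbs.
KNOWN_DBS = {"UniProtKB", "ENSEMBL", "VEGA", "RefSeq", "TAIR", "PR", "WB"}


def check_col17(value, known_dbs=KNOWN_DBS):
    """Return a list of problems found in a col 17 value; empty list means OK."""
    if value == "":
        return []  # col 17 is optional
    if "|" in value or "," in value:
        return ["more than one ID"]
    # Split on the first colon only, so IDs such as MGI:MGI:123456 survive.
    db, sep, local_id = value.partition(":")
    if not sep or not db or not local_id:
        return ["not in DB:ID form"]
    if db not in known_dbs:
        return [f"unknown database prefix {db!r}"]
    return []


print(check_col17("UniProtKB:P0217K-3"))
print(check_col17("uniProtKB:P0217K-3"))
```

Because the prefix comparison is case-sensitive, miscapitalized prefixes are reported as unknown databases.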

Col 17 entities should always be related to the same col 2 entry

See the docs on col 17 (http://www.geneontology.org/GO.format.gaf-2_0.shtml#gene_product_form_id) for a refresher on col 17 contents.

Where spliceforms exist, a given spliceform ID must always have the same parent GP ID (col 2). Can anyone think of a case in which this would not hold?

e.g. incorrect:

1 DB	2 DB object ID	...	17 gene product form ID
MGI	MGI:123456	...	UniProt:P0217K-3
MGI	MGI:654321	...	UniProt:P0217K-3

correct:

1 DB	2 DB object ID	...	17 gene product form ID
MGI	MGI:123456	...	UniProt:P0217K-3
MGI	MGI:123456	...	UniProt:P0217K-3

Possible exception: Some MODs assign protein identifiers based on amino acid sequence. If column 2 contains a gene ID, a given protein isoform ID in column 17 could therefore be associated with more than one gene ID. Groups could check their current annotations to see whether this is the case for any existing annotations, or whether there are actually errors in their entries for column 2 or column 17. If gene and protein identifiers from a database can easily be distinguished, that might help determine when the QC check should be applied. --kimberly
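A sketch of how groups might survey their files for this situation, assuming the (col 2, col 17) pairs have already been pulled out of each annotation line. It only reports col 17 IDs seen with more than one parent, leaving a curator to decide whether each hit is an error or one of the sequence-based exceptions described above. The function name is hypothetical.

```python
from collections import defaultdict


def forms_with_multiple_parents(rows):
    """rows: iterable of (col2_object_id, col17_form_id) pairs.

    Returns {form_id: sorted parent IDs} for every form ID that appears
    with more than one col 2 ID.
    """
    parents = defaultdict(set)
    for object_id, form_id in rows:
        if form_id:  # skip annotations with an empty col 17
            parents[form_id].add(object_id)
    return {f: sorted(p) for f, p in parents.items() if len(p) > 1}


incorrect = [("MGI:123456", "UniProt:P0217K-3"),
             ("MGI:654321", "UniProt:P0217K-3")]
print(forms_with_multiple_parents(incorrect))
```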

Col 17 ID Hierarchy

Identifiers in column 17 come from a range of databases. Proposal: create an ordered list of preferred databases from which the IDs are taken.

e.g. if the hierarchy were UniProtKB > VEGA > ENSEMBL

If UniProtKB ID exists, use that
else if VEGA ID exists, use that
else if ENSEMBL ID exists, use that
else PANIC!

Different object types (protein, mRNA, etc.) may need to have different hierarchies.
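The fallback logic above could be sketched as follows. Only the protein ordering comes from the example hierarchy; the `HIERARCHY` table, the per-type lookup and the function name are assumptions for illustration, and no ordering has actually been agreed.

```python
# Example ordering from the text; orderings for other object types are undecided.
HIERARCHY = {"protein": ["UniProtKB", "VEGA", "ENSEMBL"]}


def pick_preferred_id(candidates, object_type="protein", hierarchy=HIERARCHY):
    """Pick one 'DB:ID' string from candidates according to the hierarchy."""
    order = hierarchy.get(object_type)
    if order is None:
        raise ValueError(f"no hierarchy defined for {object_type!r}")
    by_db = {}
    for candidate in candidates:
        db = candidate.partition(":")[0]
        by_db.setdefault(db, candidate)  # keep the first ID seen per database
    for db in order:
        if db in by_db:
            return by_db[db]
    # The "else PANIC!" branch: nothing from a preferred database.
    raise LookupError("no ID from a preferred database")


# Placeholder IDs, not real accessions.
print(pick_preferred_id(["ENSEMBL:X1", "VEGA:Y2"]))
```

Keeping the orderings in a dictionary keyed by object type means a separate hierarchy for mRNA or other types can be added without touching the selection logic.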

DBs used so far:

Database	GP form types	# distinct IDs	Assigned by
ENSEMBL	protein	2464	BHF-UCL DFLAT GOC HGNC IntAct MGI RGD RefGenome UniProtKB
PR	protein	3	MGI
protein_id	protein	31	MGI
Protein_id [capitalization error]	protein	1	MGI
RefSeq	gene, protein	3215	BHF-UCL GOC IntAct MGI RGD RefGenome UniProtKB
TAIR	RNA, gene_product, miRNA, protein, rRNA, snRNA, snoRNA, tRNA	45992	GOC IntAct RefGenome TAIR TIGR UniProtKB
UniProtKB	protein	4601	BHF-UCL DFLAT GOC HGNC IntAct MGI PINC RGD RefGenome Roslin_Institute UniProtKB
UniPRotKB [capitalization error]	protein	1	MGI
uniProtKB [capitalization error]	protein	2	MGI
VEGA	protein	13706	BHF-UCL DFLAT GOC HGNC IntAct MGI PINC RGD RefGenome Roslin_Institute UniProtKB
WB	gene	4	WB
WP	gene	6	WB
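The capitalization errors flagged in the table could be caught automatically. A minimal sketch, assuming a list of canonical prefixes is available (for example from GO.xrf_abbs); the function name and the canonical list used here are illustrative only.

```python
def case_variants(prefixes, canonical):
    """Map each miscapitalized prefix to the canonical spelling it matches."""
    by_lower = {c.lower(): c for c in canonical}
    fixes = {}
    for prefix in prefixes:
        target = by_lower.get(prefix.lower())
        if target is not None and target != prefix:
            fixes[prefix] = target
    return fixes


seen = ["UniProtKB", "UniPRotKB", "uniProtKB", "protein_id", "Protein_id", "VEGA"]
print(case_variants(seen, ["UniProtKB", "protein_id", "VEGA"]))
```

Prefixes that match no canonical entry even when case is ignored are left out of the result; they would need a separate unknown-database check.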