Annotation Quality Control Checks
Annotation quality control checks are used to ensure that the annotations submitted by Consortium members are of a good standard. The checks imposed on annotations include both format checks on the raw data and tests of the quality of the data within the file.
All checks that have been proposed or discussed can be found on the GO website as a web page or as an XML file; the schema is available in RELAX-NG format. This wiki page should only be used for ideas for proposed new checks.
If you have any concerns regarding these Quality Control Checks, please email Rama Balakrishnan or Emily Dimmer.
Hard vs Soft QC
The QC checks are categorised as either "hard" or "soft", representing incorrect annotations and annotations to be checked respectively.
- Hard Quality Control Checks: these refer to checks on annotations that are incorrect and should be removed. These checks will be added to the current filtering script and run on the GAF files in the Submissions directory. Any offending annotations will be filtered out before they are loaded into the database, and the filtered GAFs will be checked into the main gene_associations directory. Email will be sent to the submitting groups about annotations that were filtered.
- Soft Quality Control Checks: these refer to annotations that need to be reviewed for consistency. These annotations will not be filtered out of the MOD GAF file. Annotations that fall under this check will be dumped out, and annotating groups will be alerted to review them.
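The hard/soft split described above can be sketched in Python as follows. This is only an illustration, not the Consortium's filtering script: the function name `partition_annotations` and the convention that a check returns True when a row fails are assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

Row = List[str]                # one GAF line, already split on tabs
Check = Callable[[Row], bool]  # assumption: returns True when the row FAILS


def partition_annotations(
    rows: Sequence[Row],
    hard_checks: Sequence[Check],
    soft_checks: Sequence[Check],
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows into (kept, filtered, to_review).

    Rows failing any hard check are filtered out of the GAF.
    Rows failing only a soft check stay in the GAF but are also
    reported so the annotating group can review them.
    """
    kept: List[Row] = []
    filtered: List[Row] = []
    to_review: List[Row] = []
    for row in rows:
        if any(check(row) for check in hard_checks):
            filtered.append(row)
            continue
        kept.append(row)
        if any(check(row) for check in soft_checks):
            to_review.append(row)
    return kept, filtered, to_review
```

The `filtered` list would feed the email sent to submitting groups, and `to_review` the soft QC report.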
The SQL included on these pages can be used to directly query the GO database. Copy and paste the SQL into the text box located on the AmiGO Goose page.
Proposed Hard QC checks
All IC annotations should include a GO ID in column 8 (with)
This is in line with GO Consortium evidence code usage recommendations:
'Usage of the With/From Column for IC
Note that the with/from field must always be filled in with a GO ID when using this evidence code.'
From: the IC description on the GO guide to Evidence Codes
Existing Annotations:
10 EWIKI
990 Gramene
2 JCVI
1 MGI
2 RGD
9 TAIR
29 WB
Edimmer 12:12, 15 March 2011 (UTC)
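A minimal Python sketch of the IC check, assuming each GAF line has been split on tabs so that column 7 (evidence code) is at index 6 and column 8 (with) is at index 7, and that multiple 'with' entries are pipe-separated. The function name `ic_missing_go_id` is hypothetical.

```python
import re

GO_ID = re.compile(r"^GO:\d{7}$")
EVIDENCE_COL = 6  # GAF column 7, zero-based index
WITH_COL = 7      # GAF column 8, zero-based index


def ic_missing_go_id(row):
    """Return True when an IC annotation has no GO ID in the with column."""
    if row[EVIDENCE_COL] != "IC":
        return False
    # Entries in the with/from field are separated by pipes.
    with_ids = [value.strip() for value in row[WITH_COL].split("|")]
    return not any(GO_ID.match(value) for value in with_ids)
```

Under this sketch, an IC row whose 'with' column holds only non-GO identifiers is flagged in the same way as one whose 'with' column is empty.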
All IPI annotations should include a nucleotide/protein/chemical identifier in column 8 (with)
This is in line with GO Consortium evidence code usage recommendations:
'We strongly recommend making an entry in the with/from column when using this evidence code to include an identifier for the other protein or other macromolecule or other chemical involved in the interaction. When multiple entries are placed in the with/from field, they are separated by pipes. Consider using IDA when no identifier can be entered in the with/from column.'
From: the IPI description on the GO guide to Evidence Codes
N.B. It is possible that Biological Process annotations from 'guilt by association' evidence might appropriately use IPI without having an appropriate ID in the 'with' column.
Annotations: 11 ASPGD
3 IPI C 8 IPI F
57 CGD
13 IPI C 9 IPI F 24 IPI P
19 DictyBase
6 IPI C 6 IPI F 7 IPI P
59 Ecocyc
8 IPI C 44 IPI F 7 IPI P
14 FlyBase (Fixed - will be visible in FB2011_05, May Release)
3 IPI C 7 IPI F 4 IPI P
1 Gramene
1 IPI P
54 MGI
4 IPI C 40 IPI F 10 IPI P
5 RGD
1 IPI C 2 IPI F 2 IPI P
SGD:
590 IPI C 392 IPI F 673 IPI P
TAIR:
96 IPI C 125 IPI F 46 IPI P
WB:
4 IPI C 2 IPI F 1 IPI P
ZFIN:
1 IPI C 4 IPI F 2 IPI P
Pascale and Edimmer 12:12, 15 March 2011 (UTC)
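Tallies like those above could be produced with a short Python sketch such as the following. It assumes tab-split GAF rows, groups by column 15 (Assigned By, index 14) and column 9 (aspect, index 8), and treats a blank 'with' column as missing; both the grouping choice and the name `tally_ipi_without_with` are assumptions, not the method actually used to generate these counts.

```python
from collections import Counter

EVIDENCE_COL = 6      # GAF column 7
WITH_COL = 7          # GAF column 8
ASPECT_COL = 8        # GAF column 9: C, F or P
ASSIGNED_BY_COL = 14  # GAF column 15


def tally_ipi_without_with(rows):
    """Count IPI annotations with an empty with column, per (group, aspect)."""
    counts = Counter()
    for row in rows:
        if row[EVIDENCE_COL] == "IPI" and not row[WITH_COL].strip():
            counts[(row[ASSIGNED_BY_COL], row[ASPECT_COL])] += 1
    return counts
```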
No IDA annotation should include an identifier in column 8 (with)
This is in line with GO Consortium evidence code usage recommendations:
'Use IDA only when no identifier can be placed in the with/from column; when there is an appropriate ID for the with/from column, use IPI.'
Annotations:
Gramene:
1 IDA UniProtKB:P38992 F 2 IDA UniProtKB:P38992 P
All identifiers in the GAFs must use the correct DB abbreviation
All IDs in the GAFs must use the primary DB abbreviations found in the GO.xrf_abbs file, not the synonyms. For example, identifiers for UniProt entities must be in the form
UniProtKB:P12345
and not
UniProt:P12345
or
Uniprot:P12345
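As a rough Python sketch of this check, the function below suggests the primary form for an identifier that uses a synonym prefix. The mapping holds only the UniProt example given above and is hand-written for illustration; a real check would build it from the GO.xrf_abbs file. The names `SYNONYM_TO_PRIMARY` and `suggest_primary_prefix` are hypothetical.

```python
# Illustrative mapping only, taken from the UniProt example above.
SYNONYM_TO_PRIMARY = {
    "UniProt": "UniProtKB",
    "Uniprot": "UniProtKB",
}


def suggest_primary_prefix(identifier):
    """Return the identifier rewritten with its primary DB abbreviation.

    Returns None when the prefix is not a known synonym, i.e. when no
    change is suggested.
    """
    prefix, separator, local_id = identifier.partition(":")
    if not separator:
        return None  # no "DB:" prefix at all
    primary = SYNONYM_TO_PRIMARY.get(prefix)
    if primary is None:
        return None
    return f"{primary}:{local_id}"
```

Splitting on the first colon only keeps local IDs that themselves contain a colon intact.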
Proposed Soft QC checks
There are 2 classes of annotations that the MODs need to be reminded of.
- The first set includes those annotations that are already in the MOD GAFs but do not meet standards and need evaluation (details below).
- The second set includes fresh/new annotations that need to be absorbed (PAINT and MF-BP GAFs). We need a system to send an email or reminder to MODs about these files.
Annotations to high-level 'response to' terms
High-level 'response to' terms should not be used directly for annotation, unless further information has been added into column 16 to direct users to the specific type of abiotic/biotic stimulus. This includes the following terms:
- GO:0050896 : response to stimulus
- GO:0051716 : cellular response to stimulus
- GO:0009628 : response to abiotic stimulus
- GO:0009607 : response to biotic stimulus
- GO:0042221 : response to chemical stimulus
- GO:0009719 : response to endogenous stimulus
- GO:0009605 : response to external stimulus
- GO:0006950 : response to stress
- GO:0048585 : negative regulation of response to stimulus
- GO:0048584 : positive regulation of response to stimulus
- GO:0048583 : regulation of response to stimulus
All of these terms now have the following text in the comments stanza: 'Note that this term is in the subset of terms that should not be used for direct gene product annotation. Annotations to this term will be removed during annotation QC.'
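A minimal Python sketch of this soft check, assuming tab-split GAF rows with the GO ID in column 5 (index 4) and the extra information in column 16 (index 15). The function name `needs_response_term_review` is hypothetical, and the sketch only tests whether column 16 is non-empty, not whether its content actually names a stimulus.

```python
HIGH_LEVEL_RESPONSE_TERMS = {
    "GO:0050896",  # response to stimulus
    "GO:0051716",  # cellular response to stimulus
    "GO:0009628",  # response to abiotic stimulus
    "GO:0009607",  # response to biotic stimulus
    "GO:0042221",  # response to chemical stimulus
    "GO:0009719",  # response to endogenous stimulus
    "GO:0009605",  # response to external stimulus
    "GO:0006950",  # response to stress
    "GO:0048585",  # negative regulation of response to stimulus
    "GO:0048584",  # positive regulation of response to stimulus
    "GO:0048583",  # regulation of response to stimulus
}
GO_ID_COL = 4       # GAF column 5
EXTENSION_COL = 15  # GAF column 16


def needs_response_term_review(row):
    """Return True for a direct annotation to a listed term with empty column 16."""
    if row[GO_ID_COL] not in HIGH_LEVEL_RESPONSE_TERMS:
        return False
    # Older files may have fewer columns; treat a missing column 16 as empty.
    extension = row[EXTENSION_COL] if len(row) > EXTENSION_COL else ""
    return not extension.strip()
```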
Direct Annotations:
- GO:0050896 : response to stimulus
7,604 IEA annotations from GOA (via a SPKW2GO mapping using KW-0716)
- GO:0051716 : cellular response to stimulus
23 IEA ENSEMBL
26 manual UniProtKB
- GO:0009628 : response to abiotic stimulus
1 IDA MGI
1 IEA ENSEMBL
1 IEP RGD
6 IEP TAIR
1 IEP UniProtKB
2 IMP AspGD
2 IMP FlyBase (now removed)
8 IMP TAIR
1 ISO RGD
1 TAS TIGR
- GO:0009607 : response to biotic stimulus
10 IEA Ensembl
1967 IEA InterPro
827 IEA UniProtKB (SPKW2GO)
2 IDA GR
3 IDA RGD
3 IEP TAIR
2 IMP GR
1 IMP TAIR
1 ISS GR
2 TAS TAIR
- GO:0042221 : response to chemical stimulus
1 IDA BHF-UCL
10 (manual) UniProtKB
1 IDA EcoliWiki
14 IDA MGI
8 IDA RGD
68 IDA ZFIN
208 IEA ENSEMBL
1 IEP CGD
12 IEP RGD
2 IGI EcoliWiki
2 IGI TAIR
7 IGI ZFIN
23 IMP FlyBase (all now removed or replaced with more specific terms)
3 IMP MGI
2 IMP SGD
4 IMP WB
6 IMP ZFIN
1 ISO MGI
19 ISO RGD
2 ISS UniProtKB
1 ISS WB
1 TAS RGD
- GO:0009719 : response to endogenous stimulus
no annotations
- GO:0009605 : response to external stimulus
1 ISS AgBase
539 IEA ENSEMBL
1 IEA InterPro
2 IEP TIGR
1 IEP WB
1 IMP FlyBase (removed)
3 IMP MGI
6 ISO RGD
19 (manual) UniProtKB
- GO:0006950 : response to stress
40412 IEA UniProtKB (SPKW2GO)
5 (ISS, TAS, IDA) BHF-UCL
1 ISS AgBase
5 IDA FlyBase (removed or replaced)
8 IDA MGI
6 IDA RGD
16 IDA SGD
14 IDA UniProtKB
1 IDA WB
13 IDA ZFIN
19 IEA AspGD
14 IEA CGD
14 IEA ENSEMBL
4 IEA FlyBase
11 IEA GOA
43 IEA GR
40 IEA InterPro
47 IEA TAIR
241 IEA UniProtKB
6 IEA WB
54 IEA ZFIN
1 IEP CGD
6 IEP EcoCyc
2 IEP EcoliWiki
1 IEP FlyBase (removed)
55 IEP RGD
11 IEP TAIR
1 IEP TIGR
3 IGI CGD
1 IGI EcoCyc
13 IGI SGD
2 IGI TAIR
1 IGI WB
9 IMP CGD
5 IMP EcoCyc
2 IMP EcoliWiki
7 IMP FlyBase (removed)
9 IMP MGI
9 IMP RGD
34 IMP SGD
26 IMP TAIR
4 IMP UniProtKB
15 IMP WB
2 IPI SGD
1 IPI TIGR
39 ISO RGD
2 ISS CGD
3 ISS dictyBase
7 ISS FlyBase (removed)
1 ISS GR
5 ISS JCVI
1 ISS MGI
1 ISS PAMGO_VMD
1 ISS SGD
43 ISS TIGR
18 ISS UniProtKB
4 ISS WB
3 NAS CGD
8 NAS FlyBase (removed)
1 NAS RGD
1 NAS UniProtKB
3 RCA GR
5 RCA PseudoCAP
1 TAS FlyBase
1 TAS GR
8 TAS MGI
3 TAS RGD
6 TAS SGD
1 TAS TAIR
10 TAS TIGR
1 TAS WB
- GO:0048585 : negative regulation of response to stimulus
1 IDA UniProtKB
- GO:0048584 : positive regulation of response to stimulus
no annotations.
- GO:0048583 : regulation of response to stimulus
5 RCA bioPIXIE_MEFIT (from SGD)
Suggested by the 'Response to' working group, agreed at the Geneva 2010 GO annotation Camp.
Annotations to high-level transcription factor activity terms
Annotations to the following two high-level Molecular Function (MF) terms should be avoided.
- GO:0001071 - nucleic acid binding transcription factor activity
- GO:0000988 - protein binding transcription factor activity
All gene/protein/chemical identifiers used in GO annotations should conform to RegExps supplied in the GO.xrf_abbs file
e.g. excerpt from the GO.xrf_abbs file:
abbreviation: GO
database: Gene Ontology Database
object: Identifier
example_id: GO:0004352
local_id_syntax: ^\d{7}$
-- The RegExps present in some entries in the GO.xrf_abbs file could be used to help ensure that identifiers included in annotation files conform to the expected format.
-- UniProtKB-GOA is currently contacting groups whose 'with' field IDs do not conform to these RegExps to check whether they are happy with the proposed format, and we are adding to or modifying the RegExp suggestions below in response to the feedback we receive.
-- Groups should be encouraged to use 'UniProtKB' instead of 'UniProt'.
-- Identifiers with the following prefixes are obsolete in the GO.xrf_abbs file and should be replaced with 'RefSeq':
- NCBI_NM
- NCBI_NP
- RefSeq_NA
- RefSeq_Prot
Identifier RegExps to add to GO.xrf_abbs:
(Please note this list is currently incomplete)
(UniProt(?:KB)?):([A-Z][0-9][A-Z0-9]{3}[0-9]((-([0-9]+)|:PRO_[0-9]{10}))?)
(CGD):((CAL|CAF)[0-9]{7})
(SGD):(S[0-9]{9})
(dictyBase):(DDB_G[0-9]{7})
(FB):(FBgn[0-9]{7})
(GeneDB_Spombe):(SP[A-Z0-9]+\.[A-Z0-9]+)
(GeneDB_Pfalciparum):(SP[A-Z0-9]+\.[A-Z0-9]+)
(AGI_LocusCode):(AT[0-9]G[0-9]{5}(\.[0-9]{1})?)
(TAIR):(gene:[0-9]{7,})
(TAIR):(locus:AT[0-9]G[0-9]{5})
(WB):(WBGene[0-9]{8})
(WB):(WBVar[0-9]{8})
(WB):(WP:CE[0-9]{5})
(ZFIN):(ZDB-GENE-[0-9]{6}-[0-9]+)
(ZFIN):(ZDB-GENO-[0-9]{6}-[0-9]+)
(ZFIN):(ZDB-MRPHLNO-[0-9]{6}-[0-9]+)
(CHEBI):([0-9]{5})
(JCVI_GenProp):(GenProp[0-9]{4})
(RGD):([0-9]{4,7})
(PubChem_Compound):([0-9]+)
(MGI):(MGI:[0-9]{5,})
(protein_id):([A-Z]{3}[0-9]{5})
(RefSeq):([A-Z]{2}_[0-9]{4,10}(\.[0-9]+)?)
(NCBI_gi):([0-9]{6,})
(PDB):([A-Za-z0-9]{4})
(ENSEMBL):(ENS[A-Z0-9]{10,17})
(GR):([A-Z][0-9][A-Z0-9]{3}[0-9])
(GR_PROTEIN):([A-Z][0-9][A-Z0-9]{3}[0-9])
(EcoliWiki):([A-Za-z]{3,4})
(ECK):(ECK[0-9]{4})
(EcoCyc):(EG[0-9]{5})
(ECOGENE):(EG[0-9]{5})
(EchoBASE):(EB[0-9]{4})
(PubChem_Substance):([0-9]{4,})
(PIR):([A-Z]{1}[0-9]{5})
(KEGG_LIGAND):([A-Z]{1}[0-9]{3,})
(EMBL):([A-Z]{1}[0-9]{5})
(EMBL):([A-Z]{2}[0-9]{6})
(EMBL):([A-Z]{4}[0-9]{8,9})
(MaizeGDB_Locus):([A-Za-z][A-Za-z0-9]*)
(NCBI_GP):([A-Z]{3}[0-9]{5}(\.[0-9]+)?)
(GenBank|GB):(([A-Z]{1}[0-9]{5})|([A-Z]{2}[0-9]{6})|([A-Z]{4}[0-9]{8,9}))
Edimmer 13:28, 15 March 2011 (UTC)
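The RegExps listed above could be applied to identifiers along the following lines. This Python sketch copies a handful of the patterns verbatim and assumes that the whole identifier, prefix included, must match, hence `fullmatch`. The names `ID_PATTERNS` and `conforms` are hypothetical.

```python
import re

# A few of the proposed patterns, copied from the list above.
ID_PATTERNS = [
    re.compile(r"(UniProt(?:KB)?):([A-Z][0-9][A-Z0-9]{3}[0-9]((-([0-9]+)|:PRO_[0-9]{10}))?)"),
    re.compile(r"(SGD):(S[0-9]{9})"),
    re.compile(r"(FB):(FBgn[0-9]{7})"),
    re.compile(r"(MGI):(MGI:[0-9]{5,})"),
    re.compile(r"(ZFIN):(ZDB-GENE-[0-9]{6}-[0-9]+)"),
]


def conforms(identifier):
    """Return True when the identifier fully matches one of the patterns."""
    return any(pattern.fullmatch(identifier) for pattern in ID_PATTERNS)


def nonconforming_with_ids(with_field):
    """Return the pipe-separated with entries that match no pattern."""
    entries = [entry.strip() for entry in with_field.split("|") if entry.strip()]
    return [entry for entry in entries if not conforms(entry)]
```

Because only some patterns are loaded here, an identifier from a database not in `ID_PATTERNS` is reported as non-conforming; a fuller version would distinguish an unknown prefix from a malformed local ID.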
Annotations to protein oligomerization (GO:0051259)
Discussed on SF
Review of NOT-qualified annotations, when positive annotations exist to child terms
- NOT-qualified annotations also imply NOT for all child terms, therefore NOT-qualified annotations should provide as much specificity as possible.
- when an annotation made to a granular GO term is contradicted by a NOT-qualified annotation to a parent term, curators need to be alerted and review of the annotation set should be carried out.
Example: POLA1 was annotated to "double-strand break repair via nonhomologous end joining", but also had a NOT-qualified annotation to the parent term "DNA repair".
The NOT annotation is based on UV/X-ray damage, which causes lesions that are repaired by nucleotide excision repair. See PMID:3335506
Outcome: The annotation to NOT 'DNA repair' should have been made to a more granular term, describing nucleotide-excision repair.
(Paul Thomas)
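The review described above amounts to looking for a positive annotation whose term has an ancestor carrying a NOT-qualified annotation for the same gene product. A minimal Python sketch follows; the `(gene, term, is_not)` tuple layout, the `parents` mapping, and the function names are assumptions for illustration, and a real check would take parentage from the ontology itself.

```python
def ancestors(term, parents):
    """Return every ancestor of a term, following parent links transitively."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def find_not_conflicts(annotations, parents):
    """Find positive annotations contradicted by a NOT on an ancestor term.

    annotations: list of (gene, term, is_not) tuples.
    Returns a list of (gene, positive_term, negated_ancestor) tuples.
    """
    negated = {}
    for gene, term, is_not in annotations:
        if is_not:
            negated.setdefault(gene, set()).add(term)

    conflicts = []
    for gene, term, is_not in annotations:
        if is_not:
            continue
        hits = ancestors(term, parents) & negated.get(gene, set())
        for ancestor in sorted(hits):
            conflicts.append((gene, term, ancestor))
    return conflicts
```

Applied to the POLA1 example, with a toy parent mapping written by hand for this illustration:

```python
NHEJ = "double-strand break repair via nonhomologous end joining"
toy_parents = {
    NHEJ: {"double-strand break repair"},
    "double-strand break repair": {"DNA repair"},
}
pola1 = [("POLA1", NHEJ, False), ("POLA1", "DNA repair", True)]
conflicts = find_not_conflicts(pola1, toy_parents)
```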
Taxon triggers
Annotations that need to be absorbed by MODs
- MF-BP inter-ontology inferences
- PAINT