Annotation Quality Control Checks

From GO Wiki

Annotation quality control checks are used to ensure that the annotations submitted by Consortium members are of a good standard. The checks imposed on annotations include both format checks on the raw data and tests of the quality of the data within the file.


All checks that have been proposed or discussed can be found on the GO website as a web page or as an XML file; the schema is available in RELAX-NG format. This wiki page should only be used for ideas for proposed new checks.


If you have any concerns regarding these Quality Control Checks, please email Rama Balakrishnan or Emily Dimmer.


Hard vs Soft QC

The QC checks are categorised as either "hard" or "soft", representing incorrect annotations and annotations to be checked respectively.

  • Hard QC: checks that identify annotations that are incorrect and should be removed. These checks will be added to the current filtering script and run on the GAF files in the Submissions directory. Offending annotations will be filtered out before the files are loaded into the database, and the filtered GAFs will be checked into the main gene_associations directory. Email will be sent to the submitting groups about annotations that were filtered.
  • Soft QC: checks that identify annotations that need to be reviewed for consistency. These annotations will not be filtered out of the MOD GAF file. Instead, annotations that fall under these checks will be dumped out and the annotating groups will be alerted to review them.
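The difference between the two categories can be sketched in a few lines of Python. This is only an illustration and not the Consortium's filtering script: the function name run_qc, the Row and Check types, and the idea of passing checks in as functions are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

Row = List[str]                # one GAF line, split on tabs
Check = Callable[[Row], bool]  # returns True when the row violates the check


def run_qc(
    rows: Sequence[Row],
    hard_checks: Sequence[Check],
    soft_checks: Sequence[Check],
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows into (kept, filtered, flagged).

    filtered: rows that fail a hard check and are dropped from the GAF.
    flagged:  rows that stay in the GAF but are reported for review.
    """
    kept: List[Row] = []
    filtered: List[Row] = []
    flagged: List[Row] = []
    for row in rows:
        if any(check(row) for check in hard_checks):
            filtered.append(row)  # hard failure: removed from the file
            continue
        kept.append(row)
        if any(check(row) for check in soft_checks):
            flagged.append(row)   # soft failure: kept, but listed for review
    return kept, filtered, flagged
```

A row that fails a hard check is never tested against the soft checks, since it will not appear in the filtered file anyway.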

The SQL included on these pages can be used to directly query the GO database. Copy and paste the SQL into the text box located on the AmiGO Goose page:

http://berkeleybop.org/goose

Proposed Hard QC checks

All IC annotations should include a GO ID in column 8 (with)

This is in line with GO Consortium evidence code usage recommendations:

'Usage of the With/From Column for IC

Note that the with/from field must always be filled in with a GO ID when using this evidence code.'

From: the IC description on the GO guide to Evidence Codes
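A minimal sketch of this check, assuming a tab-delimited GAF row held as a Python list, with the evidence code in column 7 and the with/from field in column 8. The function name is hypothetical, and treating the check as passed when at least one pipe-separated entry is a GO ID is an interpretation of the rule, not something stated in the guide.

```python
import re
from typing import List

EVIDENCE_COL = 6  # column 7 (evidence code), zero-based
WITH_COL = 7      # column 8 (with/from), zero-based
GO_ID = re.compile(r"GO:\d{7}")


def ic_lacks_go_id(row: List[str]) -> bool:
    """Return True for an IC annotation whose with/from field has no GO ID."""
    if row[EVIDENCE_COL] != "IC":
        return False
    # Multiple with/from entries are separated by pipes.
    entries = [entry.strip() for entry in row[WITH_COL].split("|")]
    return not any(GO_ID.fullmatch(entry) for entry in entries)
```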

Existing Annotations:

10 EWIKI

990 Gramene

2 JCVI

1 MGI

2 RGD

9 TAIR

29 WB



Edimmer 12:12, 15 March 2011 (UTC)

All IPI annotations should include a nucleotide/protein/chemical identifier in column 8 (with)

This is in line with GO Consortium evidence code usage recommendations:

'We strongly recommend making an entry in the with/from column when using this evidence code to include an identifier for the other protein or other macromolecule or other chemical involved in the interaction. When multiple entries are placed in the with/from field, they are separated by pipes. Consider using IDA when no identifier can be entered in the with/from column. '

From: the IPI description on the GO guide to Evidence Codes


N.B. Could Biological Process annotations made from 'guilt by association' evidence appropriately use IPI without an identifier in the 'with' column?
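Counts of IPI annotations with an empty with/from field, grouped by contributing database and ontology aspect, could be produced along the following lines. This is a sketch under the assumption that each GAF row is a list of column values, with the database in column 1, evidence code in column 7, with/from in column 8 and aspect in column 9; the function name is made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

DB_COL, EVIDENCE_COL, WITH_COL, ASPECT_COL = 0, 6, 7, 8  # zero-based


def tally_ipi_without_with(rows: Iterable[List[str]]) -> Counter:
    """Count IPI rows with an empty with/from field, keyed by (DB, aspect)."""
    counts: Counter = Counter()
    for row in rows:
        if row[EVIDENCE_COL] == "IPI" and not row[WITH_COL].strip():
            key: Tuple[str, str] = (row[DB_COL], row[ASPECT_COL])
            counts[key] += 1
    return counts
```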

Annotations:

11 ASPGD

3 IPI     C
8 IPI     F

57 CGD

13 IPI     C
9 IPI     F
24 IPI     P

19 DictyBase

6 IPI     C
6 IPI     F
7 IPI     P

59 Ecocyc

8 IPI     C
44 IPI     F
7 IPI     P

14 FlyBase (Fixed - will be visible in FB2011_05, May Release)

3 IPI     C
7 IPI     F
4 IPI     P

1 Gramene

1 IPI     P

54 MGI

4 IPI     C
40 IPI     F
10 IPI     P

5 RGD

1 IPI     C
2 IPI     F
2 IPI     P

SGD:

590 IPI     C
392 IPI     F
673 IPI     P

TAIR:

96 IPI     C
125 IPI     F
46 IPI     P

WB:

4 IPI     C
2 IPI     F
1 IPI     P

ZFIN:

1 IPI    C
4 IPI     F
2 IPI     P





Pascale and Edimmer 12:12, 15 March 2011 (UTC)

IDA annotations should not include an identifier in column 8 (with)

This is in line with GO Consortium evidence code usage recommendations:

'Use IDA only when no identifier can be placed in the with/from column; when there is an appropriate ID for the with/from column, use IPI. '


Annotations:

Gramene:

1 IDA   UniProtKB:P38992    F
2 IDA   UniProtKB:P38992    P


All identifiers in the GAFs must use the correct DB abbreviation

All IDs in the GAFs must use the primary DB abbreviations found in the GO.xrf_abbs file, not the synonyms. For example, identifiers for UniProt entities must be in the form

 UniProtKB:P12345

and not

 UniProt:P12345

or

 Uniprot:P12345
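As a rough illustration, a check of this kind could rewrite a known synonym to its primary abbreviation. The mapping below contains only the two UniProt variants mentioned above and is an assumption for the example; a real check would build the mapping from the GO.xrf_abbs file. The function name is hypothetical.

```python
# Illustrative only: a real mapping would be read from GO.xrf_abbs.
SYNONYM_TO_PRIMARY = {
    "UniProt": "UniProtKB",
    "Uniprot": "UniProtKB",
}


def normalise_prefix(identifier: str) -> str:
    """Rewrite a known synonym prefix to its primary DB abbreviation."""
    prefix, sep, local_id = identifier.partition(":")
    if not sep:
        return identifier  # no prefix present, leave untouched
    return SYNONYM_TO_PRIMARY.get(prefix, prefix) + ":" + local_id
```

An identifier whose normalised form differs from the submitted form is one that breaks this rule.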


Proposed Soft QC checks

There are 2 classes of annotations that the MODs need to be reminded of.

  • The first set includes annotations that are already in the MOD GAFs but do not meet standards and need evaluation (details below).
  • The second set includes new annotations that need to be absorbed (PAINT and MF-BP GAFs). We need a system to send an email or reminder to MODs about these files.

Annotations to high-level 'response to' terms

High-level 'response to' terms should not be used directly for annotation unless further information has been added in column 16 to direct users to the specific type of abiotic/biotic stimulus. This includes the following terms:

   * GO:0050896 : response to stimulus
   * GO:0051716 : cellular response to stimulus
   * GO:0009628 : response to abiotic stimulus
   * GO:0009607 : response to biotic stimulus
   * GO:0042221 : response to chemical stimulus
   * GO:0009719 : response to endogenous stimulus
   * GO:0009605 : response to external stimulus
   * GO:0006950 : response to stress
   * GO:0048585 : negative regulation of response to stimulus
   * GO:0048584 : positive regulation of response to stimulus
   * GO:0048583 : regulation of response to stimulus 

All of these terms now have the following text in the comments stanza: 'Note that this term is in the subset of terms that should not be used for direct gene product annotation. Annotations to this term will be removed during annotation QC.'
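A soft check for these terms might look like the sketch below. It assumes each GAF row is a list of column values with the GO ID in column 5 and the additional information in column 16, and that rows with fewer than 16 columns have no such information. The names are hypothetical.

```python
from typing import List

GO_ID_COL = 4        # column 5 (GO ID), zero-based
EXTENSION_COL = 15   # column 16, zero-based

HIGH_LEVEL_RESPONSE_TERMS = {
    "GO:0050896",  # response to stimulus
    "GO:0051716",  # cellular response to stimulus
    "GO:0009628",  # response to abiotic stimulus
    "GO:0009607",  # response to biotic stimulus
    "GO:0042221",  # response to chemical stimulus
    "GO:0009719",  # response to endogenous stimulus
    "GO:0009605",  # response to external stimulus
    "GO:0006950",  # response to stress
    "GO:0048585",  # negative regulation of response to stimulus
    "GO:0048584",  # positive regulation of response to stimulus
    "GO:0048583",  # regulation of response to stimulus
}


def needs_response_review(row: List[str]) -> bool:
    """Flag direct annotations to a listed term that have nothing in column 16."""
    if row[GO_ID_COL] not in HIGH_LEVEL_RESPONSE_TERMS:
        return False
    extension = row[EXTENSION_COL] if len(row) > EXTENSION_COL else ""
    return not extension.strip()
```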

Direct Annotations:

  • GO:0050896 : response to stimulus

7,604 IEA annotations from GOA (via a SPKW2GO mapping using KW-0716)

  • GO:0051716 : cellular response to stimulus
     23 IEA     ENSEMBL
     26 manual  UniProtKB


  • GO:0009628 : response to abiotic stimulus
     1 IDA     MGI
     1 IEA     ENSEMBL
     1 IEP     RGD
     6 IEP     TAIR
     1 IEP     UniProtKB
     2 IMP     AspGD
     2 IMP     FlyBase (now removed)
     8 IMP     TAIR
     1 ISO     RGD
     1 TAS     TIGR
  • GO:0009607 : response to biotic stimulus
    10 IEA Ensembl
  1967 IEA InterPro
   827 IEA UniProtKB (SPKW2GO)
     2 IDA     GR
     3 IDA     RGD
     3 IEP     TAIR
     2 IMP     GR
     1 IMP     TAIR
     1 ISS     GR
     2 TAS     TAIR
  • GO:0042221 : response to chemical stimulus
     1 IDA     BHF-UCL
     10 (manual)UniProtKB
     1 IDA     EcoliWiki
    14 IDA     MGI
     8 IDA     RGD
    68 IDA     ZFIN
   208 IEA     ENSEMBL
     1 IEP     CGD
    12 IEP     RGD
     2 IGI     EcoliWiki
     2 IGI     TAIR
     7 IGI     ZFIN
     23 IMP     FlyBase (all now removed or replaced with more specific terms)
     3 IMP     MGI
     2 IMP     SGD
     4 IMP     WB
     6 IMP     ZFIN
     1 ISO     MGI
    19 ISO     RGD
     2 ISS     UniProtKB
     1 ISS     WB
     1 TAS     RGD


  • GO:0009719 : response to endogenous stimulus

no annotations

  • GO:0009605 : response to external stimulus
      1 ISS    AgBase
     539 IEA   ENSEMBL
     1 IEA     InterPro
     2 IEP     TIGR
     1 IEP     WB
     1 IMP     FlyBase (removed)
     3 IMP     MGI
     6 ISO     RGD
    19 (manual)UniProtKB


  • GO:0006950 : response to stress
 40412 IEA     UniProtKB (SPKW2GO)
     5 (ISS,TAS, IDA) BHF-UCL
     1 ISS     AgBase
     5 IDA     FlyBase (removed or replaced)
     8 IDA     MGI
     6 IDA     RGD
    16 IDA     SGD
    14 IDA     UniProtKB
     1 IDA     WB
    13 IDA     ZFIN
    19 IEA     AspGD
    14 IEA     CGD
    14 IEA     ENSEMBL
     4 IEA     FlyBase
    11 IEA     GOA
    43 IEA     GR
    40 IEA     InterPro
    47 IEA     TAIR
   241 IEA     UniProtKB
     6 IEA     WB
    54 IEA     ZFIN
     1 IEP     CGD
     6 IEP     EcoCyc
     2 IEP     EcoliWiki
     1 IEP     FlyBase (removed)
    55 IEP     RGD
    11 IEP     TAIR
     1 IEP     TIGR
     3 IGI     CGD
     1 IGI     EcoCyc
    13 IGI     SGD
     2 IGI     TAIR
     1 IGI     WB
     9 IMP     CGD
     5 IMP     EcoCyc
     2 IMP     EcoliWiki
     7 IMP     FlyBase (removed)
     9 IMP     MGI
     9 IMP     RGD
    34 IMP     SGD
    26 IMP     TAIR
     4 IMP     UniProtKB
    15 IMP     WB
     2 IPI     SGD
     1 IPI     TIGR
    39 ISO     RGD
     2 ISS     CGD
     3 ISS     dictyBase
     7 ISS     FlyBase (removed)
     1 ISS     GR
     5 ISS     JCVI
     1 ISS     MGI
     1 ISS     PAMGO_VMD
     1 ISS     SGD
    43 ISS     TIGR
    18 ISS     UniProtKB
     4 ISS     WB
     3 NAS     CGD
     8 NAS     FlyBase (removed)
     1 NAS     RGD
     1 NAS     UniProtKB
     3 RCA     GR
     5 RCA     PseudoCAP
     1 TAS     FlyBase
     1 TAS     GR
     8 TAS     MGI
     3 TAS     RGD
     6 TAS     SGD
     1 TAS     TAIR
    10 TAS     TIGR
     1 TAS     WB
  • GO:0048585 : negative regulation of response to stimulus

1 IDA UniProtKB

  • GO:0048584 : positive regulation of response to stimulus

no annotations.

  • GO:0048583 : regulation of response to stimulus

5 RCA bioPIXIE_MEFIT (from SGD)


Suggested by the 'Response to' working group, agreed at the Geneva 2010 GO annotation Camp.

Annotations to high level transcription factor activity terms

Annotations to the following two high level MF terms should be avoided.

  • GO:0001071 - nucleic acid binding transcription factor activity
  • GO:0000988 - protein binding transcription factor activity

All gene/protein/chemical identifiers used in GO annotations should conform to RegExps supplied in the GO.xrf_abbs file

e.g. excerpt from the GO.xrf_abbs file:

abbreviation: GO

database: Gene Ontology Database

object: Identifier

example_id: GO:0004352

local_id_syntax: ^\d{7}$

-- the RegExp present in some entries in the GO.xrf_abbs file could be used to help ensure that identifiers included in annotation files conform to the expected format.
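To illustrate, the sketch below reads the GO entry quoted above and applies its local_id_syntax to the part of an identifier after the prefix. The parsing is deliberately naive (one 'key: value' pair per line) and the function names are assumptions made for the example.

```python
import re
from typing import Dict

STANZA = r"""
abbreviation: GO
database: Gene Ontology Database
object: Identifier
example_id: GO:0004352
local_id_syntax: ^\d{7}$
"""


def parse_stanza(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines, splitting on the first colon only."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields


def local_id_is_valid(fields: Dict[str, str], identifier: str) -> bool:
    """Check the prefix, then test the local ID against local_id_syntax."""
    prefix, _, local_id = identifier.partition(":")
    if prefix != fields["abbreviation"]:
        return False
    return re.fullmatch(fields["local_id_syntax"], local_id) is not None
```

For example, local_id_is_valid(parse_stanza(STANZA), "GO:0004352") accepts the example_id from the entry, while a six-digit ID such as "GO:51259" is rejected.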

-- UniProtKB-GOA is currently contacting groups whose 'with' field ids do not conform to these RegExps to check if they're happy with the proposed format, and we are adding/modifying the below RegExp suggestions in response to feedback we're receiving.

-- Groups should be encouraged to use 'UniProtKB' instead of 'UniProt'

-- Identifiers with the following prefixes are obsolete in the GO.xrf_abbs file and should be replaced with 'RefSeq':

  • NCBI_NM
  • NCBI_NP
  • RefSeq_NA
  • RefSeq_Prot

Identifier RegExps to add to GO.xrf_abbs:

(Please note this list is currently incomplete)


(UniProt(?:KB)?):([A-Z][0-9][A-Z0-9]{3}[0-9]((-([0-9]+)|:PRO_[0-9]{10}))?)

(CGD):((CAL|CAF)[0-9]{7})

(SGD):(S[0-9]{9})

(dictyBase):(DDB_G[0-9]{7})

(FB):(FBgn[0-9]{7})

(GeneDB_Spombe):(SP[A-Z0-9]+\.[A-Z0-9]+)

(GeneDB_Pfalciparum):(SP[A-Z0-9]+\.[A-Z0-9]+)

(AGI_LocusCode):(AT[0-9]G[0-9]{5}(\.[0-9]{1})?)

(TAIR):(gene:[0-9]{7,})

(TAIR):(locus:AT[0-9]G[0-9]{5})

(WB):(WBGene[0-9]{8})

(WB):(WBVar[0-9]{8})

(WB):(WP:CE[0-9]{5})

(ZFIN):(ZDB-GENE-[0-9]{6}-[0-9]+)

(ZFIN):(ZDB-GENO-[0-9]{6}-[0-9]+)

(ZFIN):(ZDB-MRPHLNO-[0-9]{6}-[0-9]+)

(CHEBI):([0-9]{5})

(JCVI_GenProp):(GenProp[0-9]{4})

(RGD):([0-9]{4,7})

(PubChem_Compound):([0-9]+)

(MGI):(MGI:[0-9]{5,})

(protein_id):([A-Z]{3}[0-9]{5})

(RefSeq):([A-Z]{2}_[0-9]{4,10}(\.[0-9]+)?)

(NCBI_gi):([0-9]{6,})

(PDB):([A-Za-z0-9]{4})

(ENSEMBL):(ENS[A-Z0-9]{10,17})

(GR):([A-Z][0-9][A-Z0-9]{3}[0-9])

(GR_PROTEIN):([A-Z][0-9][A-Z0-9]{3}[0-9])

(EcoliWiki):([A-Za-z]{3,4})

(ECK):(ECK[0-9]{4})

(EcoCyc):(EG[0-9]{5})

(ECOGENE):(EG[0-9]{5})

(EchoBASE):(EB[0-9]{4})

(PubChem_Substance):([0-9]{4,})

(PIR):([A-Z]{1}[0-9]{5})

(KEGG_LIGAND):([A-Z]{1}[0-9]{3,})

(EMBL):([A-Z]{1}[0-9]{5})

(EMBL):([A-Z]{2}[0-9]{6})

(EMBL):([A-Z]{4}[0-9]{8,9})

(MaizeGDB_Locus):([A-Za-z][A-Za-z0-9]*)

(NCBI_GP):([A-Z]{3}[0-9]{5}(\.[0-9]+)?)

(GenBank|GB):(([A-Z]{1}[0-9]{5})|([A-Z]{2}[0-9]{6})|([A-Z]{4}[0-9]{8,9}))



Edimmer 13:28, 15 March 2011 (UTC)
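A sketch of how patterns like these might be applied to the identifiers in a 'with' field. Only a handful of the proposed RegExps are copied in, the function name is hypothetical, and the choice to require a match on the whole identifier (rather than a partial match) is an assumption. Because some databases have more than one pattern, an identifier passes if any pattern matches.

```python
import re

# A small subset of the proposed patterns, copied as written in the list.
PATTERNS = [
    re.compile(pattern)
    for pattern in (
        r"(SGD):(S[0-9]{9})",
        r"(FB):(FBgn[0-9]{7})",
        r"(WB):(WBGene[0-9]{8})",
        r"(WB):(WBVar[0-9]{8})",
        r"(MGI):(MGI:[0-9]{5,})",
    )
]


def matches_known_format(identifier: str) -> bool:
    """Return True if the whole identifier matches at least one pattern."""
    return any(pattern.fullmatch(identifier) for pattern in PATTERNS)
```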

Annotations to protein oligomerization (GO:0051259)

Discussed on SF

Review of NOT-qualified annotations, when positive annotations exist to child terms

  • NOT-qualified annotations also imply NOT for all child terms; therefore NOT-qualified annotations should be as specific as possible.
  • When an annotation made to a granular GO term is contradicted by a NOT-qualified annotation to a parent term, curators need to be alerted and the annotation set should be reviewed.

Example: POLA1 was annotated to "double-strand break repair via nonhomologous end joining", but also had a NOT-qualified annotation to the parent term "DNA repair".

The NOT annotation is based on UV/X-ray damage, which causes lesions that are repaired by nucleotide excision repair. See PMID:3335506

Outcome: The annotation to NOT 'DNA repair' should have been made to a more granular term, describing nucleotide-excision repair.

(Paul Thomas)
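The review described above could be triggered by a check along these lines. It is a sketch only: annotations are reduced to (gene, term, is_not) tuples, terms are referred to by name, and the parent map is a small hand-written stand-in for the ontology graph, so both the data structures and the function names are assumptions.

```python
from typing import Dict, List, Set, Tuple

Annotation = Tuple[str, str, bool]  # (gene, term, is NOT-qualified)

# Hypothetical fragment of the ontology: term -> direct parents.
PARENTS: Dict[str, Set[str]] = {
    "double-strand break repair via nonhomologous end joining": {
        "double-strand break repair"
    },
    "double-strand break repair": {"DNA repair"},
    "nucleotide-excision repair": {"DNA repair"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect every term reachable by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def find_contradictions(
    annotations: List[Annotation], parents: Dict[str, Set[str]]
) -> List[Tuple[str, str, str]]:
    """List (gene, positive term, negated ancestor) triples for review."""
    negated: Dict[str, Set[str]] = {}
    for gene, term, is_not in annotations:
        if is_not:
            negated.setdefault(gene, set()).add(term)
    conflicts = []
    for gene, term, is_not in annotations:
        if is_not:
            continue
        hits = ancestors(term, parents) & negated.get(gene, set())
        for hit in sorted(hits):
            conflicts.append((gene, term, hit))
    return conflicts
```

With the POLA1 example, the positive annotation to the end-joining term and the NOT annotation to 'DNA repair' are reported together as one triple for a curator to examine.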

Taxon triggers

Annotations that need to be absorbed by MODs

  • MF-BP inter-ontology inferences
  • PAINT