Survey Handling PAINT-generated GAF (Archived)

From GO Wiki

Revision as of 14:03, 13 January 2010

This survey was done to assess how every participating database was handling GAF files and to see how the PAINT annotations would be integrated.

November 2009

database | ZFIN | MGI | FlyBase | TAIR | RGD | dictybase | GOA | EcoWiki | Wormbase | GeneDB (S. pombe) | SGD | AgBase
contact person | Doug | Harold | Susan | Tanya | Stan | Pascale | Emily | Daniel Renfro & Brenley McIntosh | Ranjana | Val Wood | Julie | Fiona
Do you currently upload GAF files from external sources, such as from GOA? | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes
Frequency of incorporation of external GAF files? | Monthly (or more often) | Monthly (or more often) | ad hoc | ad hoc | Monthly (or more often) | Not applicable | Monthly (or more often) | Monthly (or more often) | Monthly (or more often) | Monthly (or more often) | Monthly (or more often) | Quarterly

Do you expect your database to run manual and/or automated verifications (redundancies, quality of annotations) before integrating the GAFs that will be provided by ref genome?

Yes. We may like to accept only ISS annotations from PAINT that are not trumped by experimentally supported annotations to more granular terms, though we do not currently have the facilities in our database to do this type of check. Soon we should, though.

Yes. Redundancy only for automated; depending on the size of the GAF, some manual inspection may initially be done, but at some point we will need to feel confident enough to forgo this, as we cannot afford the curator time.

Yes. Manual verification of quality (in the first instance at least). It is less of a priority to eliminate redundancy for the new PANTHER annotations. I plan to add these even if we have an existing ISS for the same term; the PAINT annotations are arguably based on stronger evidence than most of our existing pairwise annotations. There is an argument for excluding them if we already have experimental evidence for the same term, but since they wouldn't be wrong there isn't much point. For other GAFs we do not load new annotations with NAS or TAS evidence codes, or annotations that are completely redundant with existing annotations (i.e. same term, same gene, same pub, same evidence). If we find conflicting annotations for the same publication then we try to resolve them with the other source.

Yes. We remove IEA annotations, since we do our own IEA analyses. We also check for redundancies by comparing IDs for the object annotated (UniProt ID vs. TAIR ID using a mapping file), the GO ID, evidence code and reference (based on PMID). Only non-redundant, non-IEA and mapped annotations are attached to the GAF file that we check into the GO CVS.

Not applicable

Yes. We will want to put in place a script that removes redundant annotations, and have the possibility to verify the PAINT annotations (the current visualization tools such as GO-nuts should be enough).

Yes. Manual checks of the correctness of the GO terms applied in the annotation set; automatic checks to ensure that primary UniProtKB accessions are used. We are not intending that redundant annotations will be removed from our files, as our other electronic annotation pipelines already generate redundant GO annotations. The web-based display of annotations will be filtered, however.

Yes. Redundancies. Check for validity to prokaryotic organisms (i.e. no mitochondrial terms).

Yes. We remove redundant and electronic annotations.

Yes. I plan to filter the ref genome annotations against existing non-IEA annotation. After import I will run quality control checks to evaluate the remaining annotations (I expect that most will already be annotated).

No. At this point we do not plan to individually review the RefGenome annotations, but instead will incorporate them all without review and display them in the computational section of our GO pages. This is also how we treat other predicted annotations such as the GOA IEAs and other tool-based predictions. As the number of PAINT annotations accumulates and we have a sense of how many and what kind of annotations we will actually be dealing with on a monthly basis, SGD plans to revisit the question of whether or not to review the PAINT annotations. For our review process of a subset of the GOA annotations, please see the comments under question 5.

Not applicable


Do you have a script that loads these annotations into your MOD?

Actually, we do not load GOA annotations into ZFIN. You won't find them in our database. We append those annotations to the end of our GAF each week, so they appear in AmiGO but not in ZFIN. We hope to rectify this at the same time we prepare scripts to load PAINT annotations into ZFIN.

Yes. A script does this; the regime is complicated because their annotation object is always a UniProt ID, and ours is a gene. GOA is loaded as follows:
1. Does the UniProt ID exist in MGI?
   1. If yes, proceed to 2.
   2. If no, append to GAF (note: direct annotation to isoforms (Q12345-1) is not supported at this time).
2. Does the annotation have a PMID?
   1. If no, do not load; append to GAF.
   2. If yes, does it match one in MGI?
      1. If no, cannot load; put to QC to get the PMID into the db.
      2. If yes, proceed to next.
3. Does the annotation duplicate one already existing?
   1. If not, add, but NOT if the evidence code is IEP (MGI does not use IEP); append to GAF.
   2. If yes (match needs to be GO_ID, PMID, evidence code), skip; do not append to GAF.
No IEAs are loaded because the method used is identical to MGI's (UniProt Keyword and InterPro domain mapping).

No. We are currently working on a script to automate this system. The plan is that the ref genome annotations will be loaded via this script. Previously we have been sent notification of new annotations from UniProtKB and these have been added via our standard pipeline after manual checks.

No. Right now, we do not load GOA annotations into our database; we append their annotations to ours and check the combined GAF file into the GO CVS.

No

Yes. The GOA pipeline imports the GOA_RAT file from the GOA website and uses the UniProtKB or RefSeq protein ID to match the incoming annotation to an RGD gene record. Once a match has been made, it checks to make sure the annotation isn't already in our db and loads the annotation into the correct table using the matched RGD ID as the DB_ID. Annotations from the file that would be duplicates in our database (i.e. one gene corresponds to multiple proteins each containing the same annotation) are stored and concatenated onto the end of the outgoing gene_association.rgd file as is. For annotations that are loaded into RGD, the DB field (column 1) becomes RGD in the outgoing GAF but the Assigned_by field (column 15) goes into our db and is exported from our database unchanged (so annotations that come from UniProtKB still have an assigned-by designation of UniProtKB in the gene_association.rgd file).

Yes. GAFs are automatically loaded into the GOA Oracle database; gp2protein files are used to map MOD identifiers to UniProtKB accessions.

Yes. Perl scripts prepare the data, which is then loaded by a PHP script.

No. Currently we are not loading any external annotations into our database.

Yes. We have scripts to load GAFs, but they need to have the database and database identifier (columns 1 and 2) specified as GeneDB and the GeneDB identifier.

Yes. We have different pipelines for the GOA IEAs versus other GOA annotations.
*IEAs (~once a month, tens of thousands)
We have a script that retrieves the gene_association.goa_uniprot.gz file and filters it to get only IEA annotations for the taxon ID 4932 (S. cerevisiae). The output file is still in gene association file format. We then compare that file to the file generated from the previous month's release and remove from SGD all the annotations that do not appear in the current release. We then run a loading script on the current file that takes the most recent IEA file and either adds new annotations to the database or updates the date associated with any annotations already existing in the database. Our loading script can accept either the SGDID or the UniProt ID in column 2, converting all UniProt IDs into the SGDID. Also, our script is able to accept column 1 from the incoming file as 'UniProtKB' but this gets converted to 'SGD' in the gene_association files we submit.
*non-IEA annotations (~once a quarter, a few dozen)
We have a script that retrieves the gene_association.goa_uniprot.gz file and filters for the following criteria:
1. only retrieve annotations with taxon 4932 (S. cerevisiae)
2. NO annotations from SGD
3. NO IEA annotations
4. NO annotations by IntAct using IPI evidence to protein binding (GO:0005515)
A second script is run on this output file, comparing it to the previous quarter's file to retrieve only new annotations made in the last quarter. These new annotations are divided amongst a couple of SGD curators who go through and review each annotation to see if it meets the standards for the core manual set of SGD annotations and is correct (evidence code, term choice). During our review process we mark the reviewed annotation in one of three ways: Y if we accept the GOA annotation, N if we do not accept their annotation, and R if we want GOA to review their annotation. We also try to comment on why the annotation was classified as Y/N/R.
-Y: if their annotation is completely correct and we don't have that annotation at SGD. We can accept their annotation even if we have an annotation to the same term but from a different paper.
-N: if we have the exact same annotation, or if the annotation cannot be made from the paper they chose (not enough evidence, gene name not mentioned in paper, etc.), or if we have an annotation from the same paper but to a more granular term, or the reference is not a published journal article, or the evidence code is not one of the experimental ones.
-R: if we think they almost have the right annotation but a different evidence code fits the experiment, if we think they have a typo in the PMID, etc.
We send a file with the N and R annotations back to Emily at GOA. For the Y annotations we manually put them into a GAF file and load them into SGD using the same loading script as we use for the IEAs.

Yes

Are those external annotations displayed on your web pages for GO annotations? | No | Yes | Yes | No | Yes | No | Yes | Yes | No | Yes | Yes | Yes

Do you display the original source of the annotation (column 15)?

No. Currently, all the annotations found in ZFIN originated at ZFIN, so we do not show the source. However, if we started loading externally provided annotations into our database, we would make every attempt to make the source information available. The references are provided for each GO annotation as a direct link to the originating pub record in our database or an internal pub describing the annotation method. For the PAINT annotations we could link to a pub in the GO pub set or a PMID or an internal reference describing the PAINT annotation process.

No. We display only Aspect, GO term (with link to browser), evidence qualifier, evidence, with, and reference. Column 15 that we give to GO does display the proper source.

Yes. The original source is only displayed if it is external. References are displayed in an abbreviated (first author, year) format; this is hyperlinked to an internal reference report page which displays PubMed IDs where available (these are also linked out to PubMed).

No. Annotations from GOA are not displayed on TAIR's web pages at the moment.

Yes. For annotations that are loaded into RGD, the DB field (column 1) becomes RGD in the outgoing GAF but the Assigned_by field (column 15) goes into our db and is exported from our database unchanged (so annotations that come from UniProtKB still have an assigned-by designation of UniProtKB in the gene_association.rgd file). The reference column has an RGD ID which identifies the GOA pipeline as the source of the information.

Yes, we would/will do.

Yes. External GO annotations are mapped to UniProtKB accessions. Therefore columns 1 and 2 will differ from those of the external MOD. In addition, GOA standardizes the gene symbol, protein name and synonyms displayed in columns 3, 10 and 11, using UniProtKB data. Where a MOD pipes together an internal reference and PubMed identifier in the reference column, GOA only displays the PubMed identifier. Currently only the external annotations which apply a PubMed identifier are integrated, although we are intending to increase the scope of integrations and in future include other sources of external annotation.

No. In table format, we display (as the column headers): Qualifier, GO ID, GO term name, Reference(s), Evidence Code, with/from, Aspect, Notes & Status. The Status column says "complete" if a valid GO ID, reference and evidence code are provided.

No

No. We do not supply a source for any external mappings; the reference is provided through the IEA (GO_REF:0000002) with InterPro:IPR001394 (hyperlinked), IEA (GO_REF:0000004) with SP_KW:KW-0788 (hyperlinked), etc.

Yes. Annotations are displayed with a "Last update date" listed, followed by the information: GO ID GO Term Name Evidence Qualifier Assigned By

Do the annotations from the external GAFs appear in your GAF file that gets submitted to the GO database? | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes
What appears in your GAF file as the annotation source for annotations originally coming from external sources? | The original contributor | The original contributor | The original contributor | The original contributor | The original contributor | The original contributor | The original contributor | The original contributor | The original contributor | The original contributor | The original contributor | The original contributor
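
The non-IEA filtering step described in SGD's response to the loading-script question (taxon 4932 only, nothing assigned by SGD, no IEA, no IntAct IPI annotations to GO:0005515, then keep only rows absent from the previous quarter's file) could be sketched roughly as follows. This is an illustration only, not SGD's actual script: the function names are hypothetical, the column positions assume the standard 15-column tab-delimited GAF layout, the taxon test assumes the first entry of column 13 is written as `taxon:4932`, and "new" is assumed to mean a line that does not appear verbatim in the previous file.

```python
from typing import Iterable, Iterator, List

# 0-based positions of the GAF columns used here (GAF columns are 1-based):
# column 5 = GO ID, column 7 = evidence code, column 13 = taxon,
# column 15 = Assigned_by.
GO_ID, EVIDENCE, TAXON, ASSIGNED_BY = 4, 6, 12, 14


def keep_for_review(cols: List[str]) -> bool:
    """Apply the four criteria from the survey response to one GAF row."""
    # 1. Only S. cerevisiae (assumption: first taxon entry is "taxon:4932").
    if cols[TAXON].split("|")[0] != "taxon:4932":
        return False
    # 2. No annotations that came from SGD in the first place.
    if cols[ASSIGNED_BY] == "SGD":
        return False
    # 3. No IEA annotations.
    if cols[EVIDENCE] == "IEA":
        return False
    # 4. No IntAct annotations using IPI evidence to protein binding.
    if (cols[ASSIGNED_BY] == "IntAct" and cols[EVIDENCE] == "IPI"
            and cols[GO_ID] == "GO:0005515"):
        return False
    return True


def filter_gaf(lines: Iterable[str]) -> Iterator[str]:
    """Yield the GAF rows (without trailing newline) that pass the filter."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip header/comment lines (they start with "!") and blank lines.
        if not line or line.startswith("!"):
            continue
        cols = line.split("\t")
        if len(cols) >= 15 and keep_for_review(cols):
            yield line


def new_since(current: Iterable[str], previous: Iterable[str]) -> List[str]:
    """Return rows in `current` that do not appear verbatim in `previous`.

    Assumption: a verbatim line comparison is enough to detect new rows;
    a real pipeline might compare a subset of columns instead.
    """
    seen = set(previous)
    return [row for row in current if row not in seen]
```

In use, `filter_gaf` would be run over the lines of the current and the previous quarter's files, and `new_since` applied to the two filtered results to obtain the rows to hand to curators.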