https://wiki.geneontology.org/api.php?action=feedcontributions&user=Girlwithglasses&feedformat=atomGO Wiki - User contributions [en]2024-03-29T13:17:44ZUser contributionsMediaWiki 1.40.0https://wiki.geneontology.org/index.php?title=AmiGO_Manual:_Bar_Chart&diff=41950AmiGO Manual: Bar Chart2012-08-26T15:49:02Z<p>Girlwithglasses: Pie ==> Bar</p>
<hr />
<div>=Generating a Bar Chart=<br />
<br />
[[Image:Tb1.png]]<br />
<br />
If a GO term is fully expanded (i.e. all of its child GO terms are displayed), a bar chart icon will appear at the end of the line. Clicking on this icon will generate a bar chart showing the distribution of genes and gene products associated with that GO term and its children.<br />
<br />
=Understanding Bar Charts=<br />
<br />
[[Image:Bcr1.png]]<br />
<br />
The bar chart shows the distribution of all gene products associated with that GO term and its child GO terms. The bar chart is drawn on the top half of the page, with the total number of gene products annotated to each GO term listed in a table below. The GO term name is a hyperlink that directs you to the [[AmiGO_Manual:_Term_Details | term details]] page for that GO term.<br />
<br />
Please note that the gene product counts may sum to more than the total, and the percentages to more than 100%, because a gene product may be annotated to multiple child GO terms.<br />
<br />
<br />
[[Category:AmiGO_Manual]]<br />
[[Category:AmiGO]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Curator_Guide:_Enzymes_and_Reactions&diff=39570Curator Guide: Enzymes and Reactions2012-02-29T20:35:33Z<p>Girlwithglasses: /* Example 4: updating EC:1.5.3.11, a transferred EC entry */</p>
<hr />
<div>This guide is for editors adding molecular function terms to represent enzyme reactions.<br />
<br />
There are five websites that are particularly useful when adding reaction terms. These are:<br />
<br />
*[http://www.chem.qmul.ac.uk/iubmb/enzyme/index.html EC enzyme nomenclature]<br />
*[http://www.ebi.ac.uk/intenz IntEnz]<br />
*[http://www.genome.jp/kegg KEGG]<br />
*[http://www.metacyc.org MetaCyc]<br />
*[http://www.ebi.ac.uk/rhea Rhea]<br />
<br />
For chemical names, consult [http://www.ebi.ac.uk/chebi ChEBI]. Rhea is particularly useful because it gives EC reactions using ChEBI chemical names.<br />
<br />
<br />
==General Rules and Things of Note==<br />
<br />
<br />
===Enzyme Commission===<br />
<br />
The Enzyme Commission names and categorises enzymes, i.e. physical entities, whereas GO is interested in the various reactions that the enzyme performs. In the same way that a gene product may participate in a number of different processes, it may catalyse a number of different reactions; the ontology should contain each reaction, and the job of the annotator is to mark which reactions a certain gene product catalyses. A single enzyme may perform a number of different reactions, and it is also possible for several different EC enzymes to perform the same reaction.<br />
<br />
''This means that there is not a 1:1 correspondence between EC numbers and GO reaction terms.''<br />
<br />
There are a number of websites that mirror the EC data; my favourite is IntEnz as it shows the reactions from RHEA, so you are saved the trouble of having to find out what the ChEBI names for the reaction participants are.<br />
<br />
<br />
===MetaCyc===<br />
<br />
At present, MetaCyc reactions are associated with one EC number, so if two different EC enzymes catalyse the same reaction, there will be two MetaCyc reactions, one for each EC number.<br />
<br />
<br />
===KEGG===<br />
<br />
KEGG makes reactions independent of the EC number; you can look up an EC number and see the reactions that the enzyme performs (e.g. [http://www.genome.jp/dbget-bin/www_bget?ec:1.1.1.21 EC:1.1.1.21]), or you can look up a reaction and see which EC enzymes perform that reaction (e.g. [http://www.genome.jp/dbget-bin/www_bget?rn:R01036 R01036]). Nifty!<br />
<br />
<br />
===Reactome===<br />
<br />
Reactome currently provide mappings of their terms to GO terms, so they do the work for us!<br />
<br />
<br />
===Precise vs. Imprecise EC Numbers===<br />
<br />
GO has terms that represent the categories used by EC. These have EC xrefs of the form EC:n, EC:n.n and EC:n.n.n (where n is a number).<br />
<br />
For reactions where the enzyme has not yet been added to EC, but it can be put into one of the EC categories, the xref should be of the form EC:n.n.n.-, i.e. ending with a dash.<br />
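The xref forms described above can be told apart mechanically. This is a minimal sketch, assuming xrefs are plain strings with an `EC:` prefix; `classify_ec_xref` and its return labels are hypothetical names for illustration, not part of any GO tooling.

```python
import re

# Nested optional groups: a later position is only allowed if the earlier
# ones are present, and only the fourth position may be a dash.
_EC_PATTERN = re.compile(r"^EC:(\d+)(?:\.(\d+)(?:\.(\d+)(?:\.(\d+|-))?)?)?$")


def classify_ec_xref(xref: str) -> str:
    """Label an EC xref as 'category', 'imprecise', 'precise' or 'invalid'."""
    match = _EC_PATTERN.match(xref)
    if match is None:
        return "invalid"
    fourth = match.group(4)
    if fourth == "-":
        return "imprecise"  # EC:n.n.n.- : enzyme not yet added to EC
    if fourth is not None:
        return "precise"    # EC:n.n.n.n : a full EC number
    return "category"       # EC:n, EC:n.n or EC:n.n.n


for example in ("EC:4.2.3", "EC:2.7.1.-", "EC:4.2.3.39"):
    print(example, classify_ec_xref(example))
```

The labels are only a convenience for this sketch; the guide itself only distinguishes category xrefs from the dash-terminated form.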
<br />
<br />
===NAD(P)===<br />
<br />
According to the Enzyme Commission, NAD(P) means that the reaction occurs with NAD '''and''' with NADP; e.g.<br />
<br />
alditol + NAD(P)+ = aldose + NAD(P)H + H+<br />
<br />
means that the enzyme performs<br />
<br />
alditol + NAD+ = aldose + NADH + H+<br />
<br />
'''AND'''<br />
<br />
alditol + NADP+ = aldose + NADPH + H+<br />
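The expansion above is a plain text substitution, so it can be sketched in a few lines. The function name `expand_nadp` is hypothetical, and the sketch assumes the reaction is written with the literal token `NAD(P)`, as in the EC example.

```python
def expand_nadp(reaction: str) -> list[str]:
    """Expand an EC-style NAD(P) reaction into its NAD and NADP forms."""
    if "NAD(P)" not in reaction:
        # Nothing to expand; return the reaction unchanged.
        return [reaction]
    return [
        reaction.replace("NAD(P)", "NAD"),
        reaction.replace("NAD(P)", "NADP"),
    ]


for line in expand_nadp("alditol + NAD(P)+ = aldose + NAD(P)H + H+"):
    print(line)
```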
<br />
<br />
===One EC number, multiple reactions===<br />
<br />
There are a number of cases where an enzyme can catalyse a set of reactions. These may or may not be specified by EC, but KEGG and MetaCyc will often show additional reactions. Similarly, different EC enzymes will often catalyse the same reaction. A good example of this overlap is found in EC:1.5.3.13, EC:1.5.3.14, EC:1.5.3.15, EC:1.5.3.16 and EC:1.5.3.17. Looking at IntEnz, there are four reactions for [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=1.5.3.17&status=OK EC:1.5.3.17]; if we then look at [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=1.5.3.16&status=OK EC:1.5.3.16], we can see that one of the reactions from EC:1.5.3.17 can be catalysed by this enzyme, too. KEGG shows this data more clearly: when [http://www.genome.jp/dbget-bin/www_bget?reaction+R03899+R09074+R09076+R09077 viewing all the reactions for EC:1.5.3.17] (click 'Show all' on the enzyme data page), each reaction is listed with the EC numbers of the enzymes that can catalyse it. MetaCyc also lists a [http://metacyc.org/META/NEW-IMAGE?type=NIL&object=EC-1.5.3&redirect=T number of reactions for each EC number].<br />
<br />
==Example 1: epi-cedrol synthase==<br />
<br />
Add a term for EC 4.2.3.39, epi-cedrol synthase<br />
<br />
*Check the reaction does not exist in GO by searching on the name, EC number and the reactants. I searched for 'epicedrol' and 'epi-cedrol'.<br />
<br />
*Look up the reaction in EC (using IntEnz), MetaCyc and KEGG.<br />
<br />
*IntEnz: [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=4.2.3.39 4.2.3.39]<br />
2-trans,6-trans-farnesyl diphosphate + H2O <=> epi-cedrol + diphosphate<br />
*MetaCyc: [http://biocyc.org/META/NEW-IMAGE?type=REACTION&object=RXN-10004 RXN-10004]<br />
(2E,6E)-farnesyl diphosphate + H2O <=> 8-epi-cedrol + diphosphate<br />
*KEGG: [http://www.genome.jp/dbget-bin/www_bget?ec:4.2.3.39 EC:4.2.3.39] is connected to one reaction, [http://www.genome.jp/dbget-bin/www_bget?rn:R09140 R09140]<br />
trans,trans-Farnesyl diphosphate + H2O <=> 8-epi-Cedrol + Diphosphate<br />
<br />
Check against the [http://www.ebi.ac.uk/rhea/reaction.xhtml?id=26118 RHEA reaction, RHEA:26118] (linked from IntEnz) so that we can be sure we're using the correct nomenclature.<br />
<br />
Names and synonyms: KEGG and EC both give us "(2E,6E)-farnesyl-diphosphate diphosphate-lyase (8-epi-cedrol-forming)", which is the systematic name, according to EC. We also have "8-epicedrol synthase" and "epicedrol synthase".<br />
<br />
Parentage: find the GO term for the category EC:4.2.3; if any of the children are relevant, use them as the parent.<br />
<br />
name: epi-cedrol synthase activity<br />
def: "Catalysis of the reaction: 2-trans,6-trans-farnesyl diphosphate + H2O = epi-cedrol + diphosphate." [RHEA:26118]<br />
synonym: "(2E,6E)-farnesyl-diphosphate diphosphate-lyase (8-epi-cedrol-forming) activity" EXACT systematic_synonym [EC:4.2.3.39]<br />
synonym: "8-epicedrol synthase activity" EXACT []<br />
synonym: "epicedrol synthase activity" EXACT []<br />
xref: EC:4.2.3.39<br />
xref: MetaCyc:RXN-10004<br />
xref: KEGG:R09140<br />
xref: RHEA:26118<br />
is_a: GO:0016838 ! carbon-oxygen lyase activity, acting on phosphates<br />
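The first step of this example, checking that the term does not already exist, can be roughly automated against a local copy of the ontology. This is a minimal sketch, assuming the ontology is available as OBO-format text; `parse_stanzas`, `find_terms` and the `EX:` identifiers in the sample are hypothetical, and the matching is plain substring search, so a query such as `EC:4.2.3.3` would also hit `EC:4.2.3.39`.

```python
def parse_stanzas(obo_text):
    """Collect [Term] stanzas as dicts mapping each tag to a list of values."""
    stanzas = []
    current = None
    for raw_line in obo_text.splitlines():
        line = raw_line.strip()
        if line == "[Term]":
            current = {}
            stanzas.append(current)
        elif line.startswith("["):
            current = None  # skip other stanza types such as [Typedef]
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current.setdefault(tag, []).append(value)
    return stanzas


def find_terms(stanzas, queries, tags=("name", "synonym", "def", "xref")):
    """Return stanzas where any query occurs (case-insensitively) in the given tags."""
    lowered = [query.lower() for query in queries]
    hits = []
    for stanza in stanzas:
        haystack = " ".join(v for t in tags for v in stanza.get(t, [])).lower()
        if any(query in haystack for query in lowered):
            hits.append(stanza)
    return hits


# Made-up sample data; the EX: identifiers are placeholders, not GO IDs.
SAMPLE = """
[Term]
id: EX:0000001
name: example lyase activity
xref: EC:4.2.3.1

[Term]
id: EX:0000002
name: epi-cedrol synthase activity
synonym: "epicedrol synthase activity" EXACT []
xref: EC:4.2.3.39
"""

matches = find_terms(parse_stanzas(SAMPLE), ["epicedrol", "epi-cedrol", "EC:4.2.3.39"])
print([stanza["id"][0] for stanza in matches])
```

An empty result only means that none of the search strings matched; it is still worth searching on the reactants by hand, since chemical names vary between databases.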
<br />
<br />
==Example 2: farnesol kinase==<br />
<br />
From SourceForge:<br />
<br />
definition: farnesol + an NTP = farnesol phosphate + an NDP<br />
EC: 2.7.1.-<br />
One example of a more specific case of this is: MetaCyc RXN-11625<br />
<br />
PMID 21395888<br />
PMID 10557276<br />
<br />
NARROW synonym: trans,trans-farnesol kinase<br />
NARROW synonym: 2-trans, 6-trans-farnesol kinase<br />
<br />
*Look up the MetaCyc reaction. It's<br />
<br />
2-trans,-6-trans-farnesol + CTP = 2-trans,-6-trans-farnesyl monophosphate + CDP + H+<br />
<br />
*Search GO, EC, KEGG and RHEA for farnesol. No results for reactions of a similar form.<br />
*Checking the literature references, it is not clear whether the farnesol reactions are limited to the 2-trans,6-trans isomer, so we'll refer to 'farnesol' in the reaction.<br />
*ChEBI searches for farnesol phosphates turn up a blank; however, "farnesyl phosphate" is a parent term for "farnesyl diphosphate" so we should use the name "farnesyl monophosphate" instead of "farnesol phosphate" to refer to the reaction product.<br />
*Parentage: MetaCyc gives an EC ref of 2.7.1.- for RXN-11625; this corresponds to GO:0016773. We can have a look at the ChEBI hierarchy for "farnesyl phosphate" to get some hints as to whether there may be any generic terms under GO:0016773, but there don't seem to be any. (N.B. a 'prenol kinase' term was later added, which would be a more appropriate parent.)<br />
*Reaction equation: NTP and NDP are referred to in ChEBI as nucleoside triphosphate and nucleoside diphosphate.<br />
<br />
name: farnesol kinase activity<br />
def: "Catalysis of the reaction: farnesol + nucleoside triphosphate = farnesyl monophosphate + nucleoside diphosphate." [MetaCyc:RXN-11625]<br />
synonym: "trans,trans-farnesol kinase activity" NARROW<br />
xref: EC:2.7.1.-<br />
is_a: GO:0016773 ! phosphotransferase activity, alcohol group as acceptor<br />
<br />
<br />
*Add the MetaCyc reaction cited as a child of this new term. I gave it the name "2-trans,-6-trans-farnesol kinase activity" to reflect the specific substrate.<br />
<br />
<br />
==Example 3: phosphomethylethanolamine N-methyltransferase activity==<br />
<br />
From SourceForge:<br />
<br />
Def: Catalysis of the reaction: phosphomethylethanolamine (PMEA) + AdoMet -> phosphodimethylethanolamine<br />
Ref: GOC:tb<br />
PMID 20650897<br />
<br />
<br />
Searching for the enzyme name brings up no results in GO, EC, MetaCyc and KEGG, so let's look up the reaction instead.<br />
<br />
Look up all three compounds mentioned in the request in both MetaCyc and KEGG.<br />
<br />
*KEGG contains [http://www.genome.jp/dbget-bin/www_bget?cpd:C13482 Phosphodimethylethanolamine]<br />
*MetaCyc contains [http://biocyc.org/META/NEW-IMAGE?type=COMPOUND&object=S-ADENOSYLMETHIONINE AdoMet]<br />
<br />
Check the reactions for these compounds.<br />
<br />
*KEGG: [http://www.genome.jp/dbget-bin/www_bget?rn:R06868 R06868] looks like a match:<br />
<br />
S-Adenosyl-L-methionine + N-Methylethanolamine phosphate <=><br />
S-Adenosyl-L-homocysteine + Phosphodimethylethanolamine<br />
<br />
*MetaCyc: [http://biocyc.org/META/NEW-IMAGE?type=REACTION&object=RXN-5642 RXN-5642] looks like a match:<br />
<br />
N-methylethanolamine phosphate + S-adenosyl-L-methionine <=><br />
N-dimethylethanolamine phosphate + S-adenosyl-L-homocysteine + H+<br />
<br />
*Check that N-dimethylethanolamine phosphate (from the MetaCyc reaction) is also known as phosphodimethylethanolamine<br />
**phosphodimethylethanolamine is a synonym on the MetaCyc compound page; the KEGG compound ID [http://www.genome.jp/dbget-bin/www_bget?cpd:C13482 C13482] matches that in the KEGG reaction<br />
**If in doubt, search for the compound in ChEBI and check the synonyms.<br />
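The comparison above, where the same reaction appears in KEGG and MetaCyc with different participant order, capitalisation and compound names, can be sketched as a set comparison. This is a rough illustration only: `parse_reaction` is a hypothetical helper, the synonym table has to be filled in by hand from the compound pages, and dropping `H+` before comparing is an assumption made here because the two databases differ on whether they show the proton.

```python
from typing import Optional


def parse_reaction(equation, synonyms: Optional[dict] = None, ignore=("h+",)):
    """Return (left, right) as frozensets of normalised participant names."""
    synonyms = synonyms or {}

    def side(text):
        names = set()
        # Collapse whitespace so equations wrapped over two lines still split cleanly.
        for part in " ".join(text.split()).split(" + "):
            name = part.strip().lower()
            name = synonyms.get(name, name)  # map to one preferred name
            if name and name not in ignore:
                names.add(name)
        return frozenset(names)

    left, right = equation.split("<=>")
    return side(left), side(right)


KEGG = (
    "S-Adenosyl-L-methionine + N-Methylethanolamine phosphate <=> "
    "S-Adenosyl-L-homocysteine + Phosphodimethylethanolamine"
)
METACYC = (
    "N-methylethanolamine phosphate + S-adenosyl-L-methionine <=> "
    "N-dimethylethanolamine phosphate + S-adenosyl-L-homocysteine + H+"
)
# Synonym confirmed on the MetaCyc compound page, as described above.
SYNONYMS = {"n-dimethylethanolamine phosphate": "phosphodimethylethanolamine"}

print(parse_reaction(KEGG) == parse_reaction(METACYC, synonyms=SYNONYMS))
```

The sketch does not handle a reaction written in the opposite direction; comparing the pair of sets in both orders would cover that case.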
<br />
*MetaCyc states that the reaction is one of three catalysed by EC:2.1.1.103, so go to IntEnz and look up [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=2.1.1.103 2.1.1.103]. Although the comments mention subsequent reactions, the reaction list doesn't include them, so we will use the more generic EC:2.1.1.- as a reference.<br />
<br />
*Get the ChEBI names for the substances and generate a balanced equation. Check to see if the reaction is in Rhea. I looked at the [http://www.ebi.ac.uk/chebi/displayAutoXrefs.do?chebiId=CHEBI:57781 automatic xrefs for N-methylethanolamine phosphate in ChEBI] and clicked on the Rhea xrefs. [http://www.ebi.ac.uk/rhea/reaction.xhtml?id=RHEA:25322 RHEA:25322] is a match! Checking the xrefs for the Rhea reaction, they match the reactions in KEGG and MetaCyc that we found earlier.<br />
<br />
*Term name: a quick Google search reveals that 'phosphomethylethanolamine N-methyltransferase' appears to be the most common name for this term.<br />
*Synonyms: added the KEGG name for the reaction as an exact synonym with the scope set as 'systematic synonym'; also added a synonym using the ChEBI name for the chemical instead of phosphomethylethanolamine.<br />
*Term parentage: this term can go under N-methyltransferase activity.<br />
<br />
name: phosphomethylethanolamine N-methyltransferase activity<br />
def: "Catalysis of the reaction: N-methylethanolamine phosphate + S-adenosyl-L-methionine = N,N-dimethylethanolamine phosphate + S-adenosyl-L-homocysteine + H(+)." [RHEA:25322, KEGG:R06868, MetaCyc:RXN-5642]<br />
synonym: "N-methylethanolamine phosphate N-methyltransferase activity" EXACT<br />
synonym: "S-adenosyl-L-methionine:methylethanolamine phosphate N-methyltransferase activity" EXACT systematic_synonym [KEGG:R06868]<br />
xref: EC:2.1.1.-<br />
xref: KEGG:R06868<br />
xref: MetaCyc:RXN-5642<br />
xref: RHEA:25322<br />
is_a: GO:0008170 ! N-methyltransferase activity<br />
<br />
<br />
==Example 4: updating EC:1.5.3.11, a transferred EC entry==<br />
<br />
From EC:<br />
<br />
Transferred entry: polyamine oxidase. Now included with EC 1.5.3.13 N1-acetylpolyamine oxidase,<br />
EC 1.5.3.14 polyamine oxidase (propane-1,3-diamine-forming), EC 1.5.3.15 N8-acetylspermidine<br />
oxidase (propane-1,3-diamine-forming), EC 1.5.3.16 spermine oxidase and EC 1.5.3.17 non-specific<br />
polyamine oxidase<br />
<br />
This is a tricky entry as there is a lot of overlap between the reactions that each enzyme catalyses. The best way to handle it is to copy out all the reactions (either from IntEnz or KEGG) and then see which are duplicated. E.g.<br />
<br />
EC:1.5.3.13:<br />
[RHEA:25815] N1-acetylspermidine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + putrescine <br />
[RHEA:25803] N1-acetylspermine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + spermidine <br />
[RHEA:25871] N1,N12-diacetylspermine + H2O + O2 <=> 3-acetamidopropanal + N1-acetylspermidine + H2O2 <br />
[RHEA:25811] H2O + O2 + spermidine <=> 3-aminopropanal + H2O2 + putrescine <br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
<br />
EC:1.5.3.16:<br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
<br />
EC:1.5.3.17:<br />
[RHEA:25803] N1-acetylspermine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + spermidine <br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
[RHEA:25811] H2O + O2 + spermidine <=> 3-aminopropanal + H2O2 + putrescine <br />
[RHEA:25815] N1-acetylspermidine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + putrescine <br />
<br />
From these lists, we can see that RHEA:25807 will have EC refs 1.5.3.13, 1.5.3.16 and 1.5.3.17; RHEA:25815 will have EC refs 1.5.3.13 and 1.5.3.17; and so on. The KEGG reaction display makes it easier to check which reactions are linked with which EC numbers once you have figured out the correspondence between RHEA IDs and KEGG IDs. KEGG also provides names for the reactions; there was one case where a reaction name clashed with an existing GO MF term, so I made the new term name more specific whilst keeping to the nomenclature conventions used by the other terms.<br />
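Working out which EC refs each reaction should carry is an inversion of the lists above, which is easy to get wrong by eye. A minimal sketch, using only the EC numbers and RHEA IDs copied out above; the names `EC_TO_RHEA` and `invert` are made up for this illustration.

```python
from collections import defaultdict

# Reactions per EC number, copied from the lists above.
EC_TO_RHEA = {
    "EC:1.5.3.13": ["RHEA:25815", "RHEA:25803", "RHEA:25871", "RHEA:25811", "RHEA:25807"],
    "EC:1.5.3.16": ["RHEA:25807"],
    "EC:1.5.3.17": ["RHEA:25803", "RHEA:25807", "RHEA:25811", "RHEA:25815"],
}


def invert(ec_to_reactions):
    """Turn an EC -> reactions mapping into a reaction -> sorted EC list mapping."""
    reaction_to_ecs = defaultdict(list)
    for ec, reactions in ec_to_reactions.items():
        for reaction in reactions:
            reaction_to_ecs[reaction].append(ec)
    return {reaction: sorted(ecs) for reaction, ecs in reaction_to_ecs.items()}


for reaction, ecs in sorted(invert(EC_TO_RHEA).items()):
    print(reaction, ", ".join(ecs))
```

Each line of the output corresponds to one candidate GO term and the set of EC xrefs it would carry.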
<br />
There ended up being a lot of new terms created; here's a sample:<br />
<br />
name: spermine:oxygen oxidoreductase (spermidine-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + O(2) + spermine = 3-aminopropanal + H(2)O(2) + spermidine." [RHEA:25807]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.16<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09076<br />
xref: MetaCyc:1.5.3.17-RXN<br />
xref: MetaCyc:RXN-9015<br />
xref: RHEA:25807 "H(2)O + O(2) + spermine = 3-aminopropanal + H(2)O(2) + spermidine"<br />
<br />
name: spermidine:oxygen oxidoreductase (3-aminopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + O(2) + spermidine = 3-aminopropanal + H(2)O(2) + putrescine." [RHEA:25811]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09077<br />
xref: MetaCyc:RXN-10461<br />
xref: MetaCyc:RXN-12089<br />
xref: RHEA:25811 "H(2)O + O(2) + spermidine = 3-aminopropanal + H(2)O(2) + putrescine"<br />
<br />
name: N1-acetylspermine:oxygen oxidoreductase (3-acetamidopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + N(1)-acetylspermine + O(2) = 3-acetamidopropanal + H(2)O(2) + spermidine." [RHEA:25803]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R03899<br />
xref: MetaCyc:RXN-12090<br />
xref: MetaCyc:RXN-9940<br />
xref: RHEA:25803 "H(2)O + N(1)-acetylspermine + O(2) = 3-acetamidopropanal + H(2)O(2) + spermidine"<br />
<br />
name: N1-acetylspermidine:oxygen oxidoreductase (3-acetamidopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + N(1)-acetylspermidine + O(2) = 3-acetamidopropanal + H(2)O(2) + putrescine." [RHEA:25815]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09074<br />
xref: MetaCyc:RXN-12091<br />
xref: MetaCyc:RXN-9942<br />
xref: RHEA:25815 "H(2)O + N(1)-acetylspermidine + O(2) = 3-acetamidopropanal + H(2)O(2) + putrescine"<br />
<br />
There were also extra reactions in KEGG and MetaCyc that weren't in the EC listings; whether you add these or not depends on whether the person requesting the terms has asked for them and/or whether you want to add them.</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Curator_Guide:_Enzymes_and_Reactions&diff=39569Curator Guide: Enzymes and Reactions2012-02-29T20:34:44Z<p>Girlwithglasses: /* Example 4: updating EC:1.5.3.11, a transferred EC entry */</p>
<hr />
<div>This guide is for editors adding molecular function terms to represent enzyme reactions.<br />
<br />
There are five websites that are particularly useful when adding reaction terms. These are:<br />
<br />
*[http://www.chem.qmul.ac.uk/iubmb/enzyme/index.html EC enzyme nomenclature]<br />
*[http://www.ebi.ac.uk/intenz IntEnz]<br />
*[http://www.genome.jp/kegg KEGG]<br />
*[http://www.metacyc.org MetaCyc]<br />
*[http://www.ebi.ac.uk/rhea Rhea]<br />
<br />
For chemical names, one should consult [http://www.ebi.ac.uk/chebi ChEBI]. RHEA is particularly useful because it gives EC reactions using ChEBI chemicals.<br />
<br />
<br />
==General Rules and Things of Note==<br />
<br />
<br />
===Enzyme Commission===<br />
<br />
The Enzyme Commission names and categorises enzymes, i.e. physical entities, whereas GO is interested in the various reactions that the enzyme performs. In the same way that a gene product may participate in a number of different processes, it may catalyse a number of different reactions; the ontology should contain each reaction, and the job of the annotator is to mark which reactions a certain gene product catalyses. A single enzyme may perform a number of different reactions, and it is also possible for several different EC enzymes to perform the same reaction.<br />
<br />
''This means that there is not a 1:1 correspondence between EC numbers and GO reaction terms.''<br />
<br />
There are a number of websites that mirror the EC data; my favourite is IntEnz as it shows the reactions from RHEA, so you are saved the trouble of having to find out what the ChEBI names for the reaction participants are.<br />
<br />
<br />
===MetaCyc===<br />
<br />
At present, MetaCyc reactions are associated with one EC number, so if two different EC enzymes catalyse the same reaction, there will be two MetaCyc reactions, one for each EC number.<br />
<br />
<br />
===KEGG===<br />
<br />
KEGG makes reactions independent of the EC number; you can look up an EC number and see the reactions that the enzyme performs (e.g. [http://www.genome.jp/dbget-bin/www_bget?ec:1.1.1.21 EC:1.1.1.21]), or you can look up a reaction and see which EC enzymes perform that reaction (e.g. [http://www.genome.jp/dbget-bin/www_bget?rn:R01036 R01036]). Nifty!<br />
<br />
<br />
===Reactome===<br />
<br />
Reactome currently provide mappings of their terms to GO terms, so they do the work for us!<br />
<br />
<br />
===Precise vs. Imprecise EC Numbers===<br />
<br />
GO has terms that represent the categories used by EC. These have EC xrefs of the form EC:n, EC:n.n and EC:n.n.n (where n is a number).<br />
<br />
For reactions where the enzyme has not yet been added to EC, but it can be put into one of the EC categories, the xref should be of the form EC:n.n.n.-, i.e. ending with a dash.<br />
<br />
<br />
===NAD(P)===<br />
<br />
According to the Enzyme Commission, NAD(P) means that the reaction occurs with NAD '''and''' with NADP; e.g.<br />
<br />
alditol + NAD(P)+ = aldose + NAD(P)H + H+<br />
<br />
means that the enzyme performs<br />
<br />
alditol + NAD+ = aldose + NADH + H+<br />
<br />
'''AND'''<br />
<br />
alditol + NADP+ = aldose + NADPH + H+<br />
<br />
<br />
===One EC number, multiple reactions===<br />
<br />
There are a number of cases where an enzyme can catalyse a set of reactions. These may or may not be specified by EC, but KEGG and MetaCyc will often show additional reactions. Similarly, there are often different EC enzymes that will catalyse the same reaction. A good example of this overlap is found in EC:1.5.3.13, 14, 15, 16, and 17. Looking at IntEnz, there are four reactions for [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=1.5.3.17&status=OK EC:1.5.3.17]; if we then look at [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=1.5.3.16&status=OK EC:1.5.3.16], we can see that one of the reactions from EC:1.5.3.17 can be catalysed by this enzyme, too. KEGG shows this data more clearly; [http://www.genome.jp/dbget-bin/www_bget?reaction+R03899+R09074+R09076+R09077 viewing all the reactions for EC:1.5.3.17] (click 'Show all' on the enzyme data page), each reaction has the EC numbers of enzymes that can catalyse it listed. MetaCyc also lists a [http://metacyc.org/META/NEW-IMAGE?type=NIL&object=EC-1.5.3&redirect=T number of reactions for each EC number].<br />
<br />
==Example 1: epi-cedrol synthase==<br />
<br />
Add a term for EC 4.2.3.39, epi-cedrol synthase<br />
<br />
*Check the reaction does not exist in GO by searching on the name, EC number and the reactants. I searched for 'epicedrol' and 'epi-cedrol'.<br />
<br />
*Look up the reaction in EC (using IntEnz), MetaCyc and KEGG.<br />
<br />
*IntEnz: [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=4.2.3.39 4.2.3.39]<br />
2-trans,6-trans-farnesyl diphosphate + H2O <=> epi-cedrol + diphosphate<br />
*MetaCyc: [http://biocyc.org/META/NEW-IMAGE?type=REACTION&object=RXN-10004 RXN-10004]<br />
(2E,6E)-farnesyl diphosphate + H2O <=> 8-epi-cedrol + diphosphate<br />
*KEGG: [http://www.genome.jp/dbget-bin/www_bget?ec:4.2.3.39 EC:4.2.3.39] is connected to one reaction, [http://www.genome.jp/dbget-bin/www_bget?rn:R09140 R09140]<br />
trans,trans-Farnesyl diphosphate + H2O <=> 8-epi-Cedrol + Diphosphate<br />
<br />
Check against the [http://www.ebi.ac.uk/rhea/reaction.xhtml?id=26118 RHEA reaction, RHEA:26118] (linked from IntEnz) so that we can be sure we're using the correct nomenclature.<br />
<br />
Names and synonyms: KEGG and EC both give us "(2E,6E)-farnesyl-diphosphate diphosphate-lyase (8-epi-cedrol-forming)", which is the systematic name, according to EC. We also have "8-epicedrol synthase" and "epicedrol synthase".<br />
<br />
Parentage: find the GO term for the category EC:4.2.3; if any of the children are relevant, use them as the parent.<br />
<br />
name: epi-cedrol synthase activity<br />
def: "Catalysis of the reaction: 2-trans,6-trans-farnesyl diphosphate + H2O = epi-cedrol + diphosphate." [RHEA:26118]<br />
synonym: "(2E,6E)-farnesyl-diphosphate diphosphate-lyase (8-epi-cedrol-forming) activity" EXACT systematic_synonym [EC:4.2.3.39]<br />
synonym: "8-epicedrol synthase activity" EXACT []<br />
synonym: "epicedrol synthase activity" EXACT []<br />
xref: EC:4.2.3.39<br />
xref: MetaCyc:RXN-10004<br />
xref: KEGG:R09140<br />
xref: RHEA:26118<br />
is_a: GO:0016838 ! carbon-oxygen lyase activity, acting on phosphates<br />
<br />
<br />
==Example 2: farnesol kinase==<br />
<br />
From SourceForge:<br />
<br />
definition: farnesol + an NTP = farnesol phosphate + an NDP<br />
EC: 2.7.1.-<br />
One example of a more specific case of this is: MetaCyc RXN-11625<br />
<br />
PMID 21395888<br />
PMID 10557276<br />
<br />
NARROW synonym: trans,trans-farnesol kinase<br />
NARROW synonym: 2-trans, 6-trans-farnesol kinase<br />
<br />
*Look up the MetaCyc reaction. It's<br />
<br />
2-trans,-6-trans-farnesol + CTP = 2-trans,-6-trans-farnesyl monophosphate + CDP + H+<br />
<br />
*Search GO, EC, KEGG and RHEA for farnesol. No results for reactions of a similar form.<br />
*Checking the literature references, it is not clear whether the farnesol reactions are limited to the 2-trans,6-trans isomer, so we'll refer to 'farnesol' in the reaction.<br />
*ChEBI searches for farnesol phosphates turn up a blank; however, "farnesyl phosphate" is a parent term for "farnesyl diphosphate" so we should use the name "farnesyl monophosphate" instead of "farnesol phosphate" to refer to the reaction product.<br />
*Parentage: MetaCyc gives an EC ref of 2.7.1.- for RXN-11625; this corresponds to GO:0016773. We can have a look at the ChEBI hierarchy for "farnesyl phosphate" to get some hints as to whether there may be any generic terms under GO:0016773, but there don't seem to be any. (N.b. a 'prenol kinase' term was later added which would be a more appropriate parent)<br />
*Reaction equation: NTP and NDP are referred to in ChEBI as nucleoside triphosphate and nucleoside diphosphate.<br />
<br />
name: farnesol kinase activity<br />
def: "Catalysis of the reaction: farnesol + nucleoside triphosphate = farnesyl monophosphate + nucleoside diphosphate." [MetaCyc:RXN-11625]<br />
synonym: "trans,trans-farnesol kinase activity" NARROW<br />
xrefs: EC:2.7.1.-<br />
is_a: GO:0016773 ! phosphotransferase activity, alcohol group as acceptor<br />
<br />
<br />
*Add the MetaCyc reaction cited as a child of this new term. I gave it the name "2-trans,-6-trans-farnesol kinase activity" to reflect the specific substrate.<br />
<br />
<br />
==Example 3: phosphomethylethanolamine N-methyltransferase activity==<br />
<br />
From SourceForge:<br />
<br />
Def: Catalysis of the reaction: phosphomethylethanolamine (PMEA) + AdoMet -> phosphodimethylethanolamine<br />
Ref: GOC:tb<br />
PMID 20650897<br />
<br />
<br />
Searching for the enzyme name brings up no results in GO, EC, MetaCyc and KEGG, so let's look up the reaction instead.<br />
<br />
Look up all three compounds mentioned in MetaCyc and KEGG.<br />
<br />
*KEGG contains [http://www.genome.jp/dbget-bin/www_bget?cpd:C13482 Phosphodimethylethanolamine]<br />
*MetaCyc contains [http://biocyc.org/META/NEW-IMAGE?type=COMPOUND&object=S-ADENOSYLMETHIONINE AdoMet]<br />
<br />
Check the reactions for these compounds.<br />
<br />
*KEGG: [http://www.genome.jp/dbget-bin/www_bget?rn:R06868 R06868] looks like a match:<br />
<br />
S-Adenosyl-L-methionine + N-Methylethanolamine phosphate <=><br />
S-Adenosyl-L-homocysteine + Phosphodimethylethanolamine<br />
<br />
*MetaCyc: [http://biocyc.org/META/NEW-IMAGE?type=REACTION&object=RXN-5642 RXN-5642] looks like a match:<br />
<br />
N-methylethanolamine phosphate + S-adenosyl-L-methionine <=><br />
N-dimethylethanolamine phosphate + S-adenosyl-L-homocysteine + H+<br />
<br />
*Check that N-dimethylethanolamine phosphate (from the MetaCyc reaction) is also known as phosphodimethylethanolamine<br />
**phosphodimethylethanolamine is a synonym on the MetaCyc compound page; the KEGG compound ID [http://www.genome.jp/dbget-bin/www_bget?cpd:C13482 C13482] matches that in the KEGG reaction<br />
**If in doubt, search for the compound in ChEBI and check the synonyms.<br />
<br />
*MetaCyc states that the reaction is one of three catalysed by EC:2.1.1.103, so go to IntEnz and look up [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=2.1.1.103 2.1.1.103]. Although the comments mention subsequent reactions, the reaction list doesn't, so we will use the more generic EC:2.1.1.- as a reference.<br />
<br />
*Get the ChEBI names for the substances and generate a balanced equation. Check to see if the reaction is in Rhea. I looked at the [http://www.ebi.ac.uk/chebi/displayAutoXrefs.do?chebiId=CHEBI:57781 automatic xrefs for N-methylethanolamine phosphate in ChEBI] and clicked on the Rhea xrefs. [http://www.ebi.ac.uk/rhea/reaction.xhtml?id=RHEA:25322 RHEA:25322] is a match! Checking the xrefs for the Rhea reaction, they match the reactions in KEGG and MetaCyc that we found earlier.<br />
<br />
*Term name: a quick Google search reveals that 'phosphomethylethanolamine N-methyltransferase' appears to be the most common name for this term.<br />
*Synonyms: added the KEGG name for the reaction as an exact synonym with the scope set as 'systematic synonym'; also added a synonym using the ChEBI name for the chemical instead of phosphomethylethanolamine.<br />
*Term parentage: this term can go under N-methyltransferase activity.<br />
<br />
name: phosphomethylethanolamine N-methyltransferase activity<br />
def: "Catalysis of the reaction: N-methylethanolamine phosphate + S-adenosyl-L-methionine = N,N-dimethylethanolamine phosphate + S-adenosyl-L-homocysteine + H(+)." [RHEA:25322, KEGG:R06868, MetaCyc:RXN-5642]<br />
synonym: "N-methylethanolamine phosphate N-methyltransferase activity" EXACT<br />
synonym: "S-adenosyl-L-methionine:methylethanolamine phosphate N-methyltransferase activity" EXACT systematic_synonym [KEGG:R06868]<br />
xref: EC:2.1.1.-<br />
xref: KEGG:R06868<br />
xref: MetaCyc:RXN-5642<br />
xref: RHEA:25322<br />
is_a: GO:0008170 ! N-methyltransferase activity<br />
<br />
<br />
==Example 4: updating EC:1.5.3.11, a transferred EC entry==<br />
<br />
From EC:<br />
<br />
Transferred entry: polyamine oxidase. Now included with EC 1.5.3.13 N1-acetylpolyamine oxidase,<br />
EC 1.5.3.14 polyamine oxidase (propane-1,3-diamine-forming), EC 1.5.3.15 N8-acetylspermidine<br />
oxidase (propane-1,3-diamine-forming), EC 1.5.3.16 spermine oxidase and EC 1.5.3.17 non-specific<br />
polyamine oxidase<br />
<br />
This is a tricky entry as there is a lot of overlap between the reactions that each enzyme catalyses. The best way to handle it is to copy out all the reactions (either from IntEnz or KEGG) and then see which are duplicated. E.g.<br />
<br />
EC:1.5.3.13:<br />
[RHEA:25815] N1-acetylspermidine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + putrescine <br />
[RHEA:25803] N1-acetylspermine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + spermidine <br />
[RHEA:25871] N1,N12-diacetylspermine + H2O + O2 <=> 3-acetamidopropanal + N1-acetylspermidine + H2O2 <br />
[RHEA:25811] H2O + O2 + spermidine <=> 3-aminopropanal + H2O2 + putrescine <br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
<br />
EC:1.5.3.16:<br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
<br />
EC:1.5.3.17 <br />
[RHEA:25803] N1-acetylspermine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + spermidine <br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
[RHEA:25811] H2O + O2 + spermidine <=> 3-aminopropanal + H2O2 + putrescine <br />
[RHEA:25815] N1-acetylspermidine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + putrescine <br />
<br />
From these lists, we can see that RHEA:25807 will have EC refs 1.5.3.13, 1.5.3.16 and 1.5.3.17; RHEA:25815 will have EC refs 1.5.3.13 and 1.5.3.17; and so on. The KEGG reaction display makes it easier to check which reactions are linked with which EC numbers once you have figured out the correspondence between RHEA IDs and KEGG IDs. KEGG also provides names for the reactions; there was one case where a reaction name clashed with an existing GO MF term, so I made the new term name more specific whilst keeping to the nomenclature conventions used by the other terms.<br />
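<br />
The cross-check described above can also be done mechanically. The following is only a minimal sketch in Python: the dictionary is typed in by hand from the three EC listings above, and the names <code>ec_to_rhea</code> and <code>invert</code> are made up for illustration; they are not part of any GO, Rhea or KEGG tool.<br />

```python
from collections import defaultdict

# Reaction lists copied by hand from the EC listings above
# (EC number -> Rhea IDs). Hypothetical variable name.
ec_to_rhea = {
    "1.5.3.13": ["RHEA:25815", "RHEA:25803", "RHEA:25871", "RHEA:25811", "RHEA:25807"],
    "1.5.3.16": ["RHEA:25807"],
    "1.5.3.17": ["RHEA:25803", "RHEA:25807", "RHEA:25811", "RHEA:25815"],
}


def invert(ec_map):
    """Return a dict mapping each reaction ID to the sorted EC numbers that list it."""
    rhea_to_ec = defaultdict(set)
    for ec, reactions in ec_map.items():
        for rhea_id in reactions:
            rhea_to_ec[rhea_id].add(ec)
    return {rhea_id: sorted(ecs) for rhea_id, ecs in rhea_to_ec.items()}


rhea_to_ec = invert(ec_to_rhea)
for rhea_id in sorted(rhea_to_ec):
    # Each line shows the EC xrefs that the GO term for this reaction needs.
    print(rhea_id, "->", ", ".join("EC:" + ec for ec in rhea_to_ec[rhea_id]))
```

Each reaction ID then carries the list of EC xrefs that its GO term should have; a reaction listed under only one EC number, such as RHEA:25871, ends up with a single xref.<br />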
<br />
There ended up being a lot of new terms created; here's a sample:<br />
<br />
name: spermine:oxygen oxidoreductase (spermidine-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + O(2) + spermine = 3-aminopropanal + H(2)O(2) + spermidine." [RHEA:25807]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.16<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09076<br />
xref: MetaCyc:1.5.3.17-RXN<br />
xref: MetaCyc:RXN-9015<br />
xref: RHEA:25807 "H(2)O + O(2) + spermine = 3-aminopropanal + H(2)O(2) + spermidine"<br />
<br />
name: spermidine:oxygen oxidoreductase (3-aminopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + O(2) + spermidine = 3-aminopropanal + H(2)O(2) + putrescine." [RHEA:25811]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09077<br />
xref: MetaCyc:RXN-10461<br />
xref: MetaCyc:RXN-12089<br />
xref: RHEA:25811 "H(2)O + O(2) + spermidine = 3-aminopropanal + H(2)O(2) + putrescine"<br />
<br />
name: N1-acetylspermine:oxygen oxidoreductase (3-acetamidopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + N(1)-acetylspermine + O(2) = 3-acetamidopropanal + H(2)O(2) + spermidine." [RHEA:25803]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R03899<br />
xref: MetaCyc:RXN-12090<br />
xref: MetaCyc:RXN-9940<br />
xref: RHEA:25803 "H(2)O + N(1)-acetylspermine + O(2) = 3-acetamidopropanal + H(2)O(2) + spermidine"<br />
<br />
name: N1-acetylspermidine:oxygen oxidoreductase (3-acetamidopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + N(1)-acetylspermidine + O(2) = 3-acetamidopropanal + H(2)O(2) + putrescine." [RHEA:25815]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09074<br />
xref: MetaCyc:RXN-12091<br />
xref: MetaCyc:RXN-9942<br />
xref: RHEA:25815<br />
<br />
There were also extra reactions in KEGG and MetaCyc that weren't in the EC listings; whether you add these or not depends on whether the person requesting the terms has asked for them and/or whether you want to add them.</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Curator_Guide:_Enzymes_and_Reactions&diff=39568Curator Guide: Enzymes and Reactions2012-02-29T20:34:10Z<p>Girlwithglasses: /* One EC number, multiple reactions */</p>
<hr />
<div>This guide is for editors adding molecular function terms to represent enzyme reactions.<br />
<br />
There are five websites that are particularly useful when adding reaction terms. These are:<br />
<br />
*[http://www.chem.qmul.ac.uk/iubmb/enzyme/index.html EC enzyme nomenclature]<br />
*[http://www.ebi.ac.uk/intenz IntEnz]<br />
*[http://www.genome.jp/kegg KEGG]<br />
*[http://www.metacyc.org MetaCyc]<br />
*[http://www.ebi.ac.uk/rhea Rhea]<br />
<br />
For chemical names, one should consult [http://www.ebi.ac.uk/chebi ChEBI]. RHEA is particularly useful because it gives EC reactions using ChEBI chemicals.<br />
<br />
<br />
==General Rules and Things of Note==<br />
<br />
<br />
===Enzyme Commission===<br />
<br />
The Enzyme Commission names and categorises enzymes, i.e. physical entities, whereas GO is interested in the various reactions that the enzyme performs. In the same way that a gene product may participate in a number of different processes, it may catalyse a number of different reactions; the ontology should contain each reaction, and the job of the annotator is to mark which reactions a certain gene product catalyses. A single enzyme may perform a number of different reactions, and it is also possible for several different EC enzymes to perform the same reaction.<br />
<br />
''This means that there is not a 1:1 correspondence between EC numbers and GO reaction terms.''<br />
<br />
There are a number of websites that mirror the EC data; my favourite is IntEnz as it shows the reactions from RHEA, so you are saved the trouble of having to find out what the ChEBI names for the reaction participants are.<br />
<br />
<br />
===MetaCyc===<br />
<br />
At present, MetaCyc reactions are associated with one EC number, so if two different EC enzymes catalyse the same reaction, there will be two MetaCyc reactions, one for each EC number.<br />
<br />
<br />
===KEGG===<br />
<br />
KEGG makes reactions independent of the EC number; you can look up an EC number and see the reactions that the enzyme performs (e.g. [http://www.genome.jp/dbget-bin/www_bget?ec:1.1.1.21 EC:1.1.1.21]), or you can look up a reaction and see which EC enzymes perform that reaction (e.g. [http://www.genome.jp/dbget-bin/www_bget?rn:R01036 R01036]). Nifty!<br />
<br />
<br />
===Reactome===<br />
<br />
Reactome currently provide mappings of their terms to GO terms, so they do the work for us!<br />
<br />
<br />
===Precise vs. Imprecise EC Numbers===<br />
<br />
GO has terms that represent the categories used by EC. These have EC xrefs of the form EC:n, EC:n.n and EC:n.n.n (where n is a number).<br />
<br />
For reactions where the enzyme has not yet been added to EC, but it can be put into one of the EC categories, the xref should be of the form EC:n.n.n.-, i.e. ending with a dash.<br />
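<br />
As a rough illustration of the distinction, the sketch below sorts an EC xref into one of the forms described above. It is an assumption-laden example in Python, not part of any GO tooling: the function name <code>classify_ec_xref</code> and the labels it returns are invented here, and a full four-number xref is treated as a specific enzyme entry.<br />

```python
def classify_ec_xref(xref):
    """Classify an EC xref string by its form (hypothetical helper).

    Returns "category" for EC:n, EC:n.n and EC:n.n.n,
    "unassigned" for EC:n.n.n.- (enzyme not yet added to EC),
    "specific" for a full EC:n.n.n.n number, and "invalid" otherwise.
    """
    if not xref.startswith("EC:"):
        return "invalid"
    parts = xref[3:].split(".")
    if not 1 <= len(parts) <= 4:
        return "invalid"
    if len(parts) == 4 and parts[3] == "-":
        numeric = parts[:3]
        kind = "unassigned"
    else:
        numeric = parts
        kind = "category" if len(parts) < 4 else "specific"
    # Every remaining field must be a plain number.
    if not all(part.isdigit() for part in numeric):
        return "invalid"
    return kind


for example in ["EC:4", "EC:4.2.3", "EC:2.7.1.-", "EC:4.2.3.39"]:
    print(example, classify_ec_xref(example))
```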
<br />
<br />
===NAD(P)===<br />
<br />
According to the Enzyme Commission, NAD(P) means that the reaction occurs with NAD '''and''' with NADP; e.g.<br />
<br />
alditol + NAD(P)+ = aldose + NAD(P)H + H+<br />
<br />
means that the enzyme performs<br />
<br />
alditol + NAD+ = aldose + NADH + H+<br />
<br />
'''AND'''<br />
<br />
alditol + NADP+ = aldose + NADPH + H+<br />
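<br />
The expansion above is a plain text substitution, which the following Python sketch makes explicit. The function name <code>expand_nad_p</code> is hypothetical, and the sketch assumes the cofactor is always written exactly as "NAD(P)" in the equation.<br />

```python
def expand_nad_p(equation):
    """Split an equation written with NAD(P) into its NAD and NADP forms.

    Equations that do not mention NAD(P) are returned unchanged.
    """
    if "NAD(P)" not in equation:
        return [equation]
    return [
        equation.replace("NAD(P)", "NAD"),
        equation.replace("NAD(P)", "NADP"),
    ]


for reaction in expand_nad_p("alditol + NAD(P)+ = aldose + NAD(P)H + H+"):
    print(reaction)
```

Both resulting reactions are catalysed by the enzyme, so both need to be considered when checking whether GO terms already exist.<br />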
<br />
<br />
===One EC number, multiple reactions===<br />
<br />
There are a number of cases where an enzyme can catalyse a set of reactions. These may or may not be specified by EC, but KEGG and MetaCyc will often show additional reactions. Similarly, there are often different EC enzymes that will catalyse the same reaction. A good example of this overlap is found in EC:1.5.3.13 to EC:1.5.3.17. IntEnz lists four reactions for [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=1.5.3.17&status=OK EC:1.5.3.17]; looking at [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=1.5.3.16&status=OK EC:1.5.3.16], we can see that this enzyme also catalyses one of those reactions. KEGG shows this data more clearly: when [http://www.genome.jp/dbget-bin/www_bget?reaction+R03899+R09074+R09076+R09077 viewing all the reactions for EC:1.5.3.17] (click 'Show all' on the enzyme data page), each reaction is listed with the EC numbers of the enzymes that can catalyse it. MetaCyc also lists a [http://metacyc.org/META/NEW-IMAGE?type=NIL&object=EC-1.5.3&redirect=T number of reactions for each EC number].<br />
<br />
==Example 1: epi-cedrol synthase==<br />
<br />
Add a term for EC 4.2.3.39, epi-cedrol synthase<br />
<br />
*Check the reaction does not exist in GO by searching on the name, EC number and the reactants. I searched for 'epicedrol' and 'epi-cedrol'.<br />
<br />
*Look up the reaction in EC (using IntEnz), MetaCyc and KEGG.<br />
<br />
*IntEnz: [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=4.2.3.39 4.2.3.39]<br />
2-trans,6-trans-farnesyl diphosphate + H2O <=> epi-cedrol + diphosphate<br />
*MetaCyc: [http://biocyc.org/META/NEW-IMAGE?type=REACTION&object=RXN-10004 RXN-10004]<br />
(2E,6E)-farnesyl diphosphate + H2O <=> 8-epi-cedrol + diphosphate<br />
*KEGG: [http://www.genome.jp/dbget-bin/www_bget?ec:4.2.3.39 EC:4.2.3.39] is connected to one reaction, [http://www.genome.jp/dbget-bin/www_bget?rn:R09140 R09140]<br />
trans,trans-Farnesyl diphosphate + H2O <=> 8-epi-Cedrol + Diphosphate<br />
<br />
Check against the [http://www.ebi.ac.uk/rhea/reaction.xhtml?id=26118 RHEA reaction, RHEA:26118] (linked from IntEnz) so that we can be sure we're using the correct nomenclature.<br />
<br />
Names and synonyms: KEGG and EC both give us "(2E,6E)-farnesyl-diphosphate diphosphate-lyase (8-epi-cedrol-forming)", which is the systematic name, according to EC. We also have "8-epicedrol synthase" and "epicedrol synthase".<br />
<br />
Parentage: find the GO term for the category EC:4.2.3; if any of the children are relevant, use them as the parent.<br />
<br />
name: epi-cedrol synthase activity<br />
def: "Catalysis of the reaction: 2-trans,6-trans-farnesyl diphosphate + H2O = epi-cedrol + diphosphate." [RHEA:26118]<br />
synonym: "(2E,6E)-farnesyl-diphosphate diphosphate-lyase (8-epi-cedrol-forming) activity" EXACT systematic_synonym [EC:4.2.3.39]<br />
synonym: "8-epicedrol synthase activity" EXACT []<br />
synonym: "epicedrol synthase activity" EXACT []<br />
xref: EC:4.2.3.39<br />
xref: MetaCyc:RXN-10004<br />
xref: KEGG:R09140<br />
xref: RHEA:26118<br />
is_a: GO:0016838 ! carbon-oxygen lyase activity, acting on phosphates<br />
<br />
<br />
==Example 2: farnesol kinase==<br />
<br />
From SourceForge:<br />
<br />
definition: farnesol + an NTP = farnesol phosphate + an NDP<br />
EC: 2.7.1.-<br />
One example of a more specific case of this is: MetaCyc RXN-11625<br />
<br />
PMID 21395888<br />
PMID 10557276<br />
<br />
NARROW synonym: trans,trans-farnesol kinase<br />
NARROW synonym: 2-trans, 6-trans-farnesol kinase<br />
<br />
*Look up the MetaCyc reaction. It's<br />
<br />
2-trans,-6-trans-farnesol + CTP = 2-trans,-6-trans-farnesyl monophosphate + CDP + H+<br />
<br />
*Search GO, EC, KEGG and RHEA for farnesol. No results for reactions of a similar form.<br />
*Checking the literature references, it is not clear whether the farnesol reactions are limited to the 2-trans,6-trans isomer, so we'll refer to 'farnesol' in the reaction.<br />
*ChEBI searches for farnesol phosphates turn up a blank; however, "farnesyl phosphate" is a parent term for "farnesyl diphosphate" so we should use the name "farnesyl monophosphate" instead of "farnesol phosphate" to refer to the reaction product.<br />
*Parentage: MetaCyc gives an EC ref of 2.7.1.- for RXN-11625; this corresponds to GO:0016773. We can have a look at the ChEBI hierarchy for "farnesyl phosphate" to get some hints as to whether there may be any generic terms under GO:0016773, but there don't seem to be any. (N.B. a 'prenol kinase' term was later added, which would be a more appropriate parent.)<br />
*Reaction equation: NTP and NDP are referred to in ChEBI as nucleoside triphosphate and nucleoside diphosphate.<br />
<br />
name: farnesol kinase activity<br />
def: "Catalysis of the reaction: farnesol + nucleoside triphosphate = farnesyl monophosphate + nucleoside diphosphate." [MetaCyc:RXN-11625]<br />
synonym: "trans,trans-farnesol kinase activity" NARROW<br />
xref: EC:2.7.1.-<br />
is_a: GO:0016773 ! phosphotransferase activity, alcohol group as acceptor<br />
<br />
<br />
*Add the MetaCyc reaction cited as a child of this new term. I gave it the name "2-trans,-6-trans-farnesol kinase activity" to reflect the specific substrate.<br />
<br />
<br />
==Example 3: phosphomethylethanolamine N-methyltransferase activity==<br />
<br />
From SourceForge:<br />
<br />
Def: Catalysis of the reaction: phosphomethylethanolamine (PMEA) + AdoMet -> phosphodimethylethanolamine<br />
Ref: GOC:tb<br />
PMID 20650897<br />
<br />
<br />
Searching for the enzyme name brings up no results in GO, EC, MetaCyc and KEGG, so let's look up the reaction instead.<br />
<br />
Look up all three compounds mentioned in MetaCyc and KEGG.<br />
<br />
*KEGG contains [http://www.genome.jp/dbget-bin/www_bget?cpd:C13482 Phosphodimethylethanolamine]<br />
*MetaCyc contains [http://biocyc.org/META/NEW-IMAGE?type=COMPOUND&object=S-ADENOSYLMETHIONINE AdoMet]<br />
<br />
Check the reactions for these compounds.<br />
<br />
*KEGG: [http://www.genome.jp/dbget-bin/www_bget?rn:R06868 R06868] looks like a match:<br />
<br />
S-Adenosyl-L-methionine + N-Methylethanolamine phosphate <=><br />
S-Adenosyl-L-homocysteine + Phosphodimethylethanolamine<br />
<br />
*MetaCyc: [http://biocyc.org/META/NEW-IMAGE?type=REACTION&object=RXN-5642 RXN-5642] looks like a match:<br />
<br />
N-methylethanolamine phosphate + S-adenosyl-L-methionine <=><br />
N-dimethylethanolamine phosphate + S-adenosyl-L-homocysteine + H+<br />
<br />
*Check that N-dimethylethanolamine phosphate (from the MetaCyc reaction) is also known as phosphodimethylethanolamine<br />
**phosphodimethylethanolamine is a synonym on the MetaCyc compound page; the KEGG compound ID [http://www.genome.jp/dbget-bin/www_bget?cpd:C13482 C13482] matches that in the KEGG reaction<br />
**If in doubt, search for the compound in ChEBI and check the synonyms.<br />
<br />
*MetaCyc states that the reaction is one of three catalysed by EC:2.1.1.103, so go to IntEnz and look up [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=2.1.1.103 2.1.1.103]. Although the comments mention subsequent reactions, the reaction list doesn't, so we will use the more generic EC:2.1.1.- as a reference.<br />
<br />
*Get the ChEBI names for the substances and generate a balanced equation. Check to see if the reaction is in Rhea. I looked at the [http://www.ebi.ac.uk/chebi/displayAutoXrefs.do?chebiId=CHEBI:57781 automatic xrefs for N-methylethanolamine phosphate in ChEBI] and clicked on the Rhea xrefs. [http://www.ebi.ac.uk/rhea/reaction.xhtml?id=RHEA:25322 RHEA:25322] is a match! Checking the xrefs for the Rhea reaction, they match the reactions in KEGG and MetaCyc that we found earlier.<br />
<br />
*Term name: a quick Google search reveals that 'phosphomethylethanolamine N-methyltransferase' appears to be the most common name for this term.<br />
*Synonyms: added the KEGG name for the reaction as an exact synonym with the scope set as 'systematic synonym'; also added a synonym using the ChEBI name for the chemical instead of phosphomethylethanolamine.<br />
*Term parentage: this term can go under N-methyltransferase activity.<br />
<br />
name: phosphomethylethanolamine N-methyltransferase activity<br />
def: "Catalysis of the reaction: N-methylethanolamine phosphate + S-adenosyl-L-methionine = N,N-dimethylethanolamine phosphate + S-adenosyl-L-homocysteine + H(+)." [RHEA:25322, KEGG:R06868, MetaCyc:RXN-5642]<br />
synonym: "N-methylethanolamine phosphate N-methyltransferase activity" EXACT<br />
synonym: "S-adenosyl-L-methionine:methylethanolamine phosphate N-methyltransferase activity" EXACT systematic_synonym [KEGG:R06868]<br />
xref: EC:2.1.1.-<br />
xref: KEGG:R06868<br />
xref: MetaCyc:RXN-5642<br />
xref: RHEA:25322<br />
is_a: GO:0008170 ! N-methyltransferase activity<br />
<br />
<br />
==Example 4: updating EC:1.5.3.11, a transferred EC entry==<br />
<br />
From EC:<br />
<br />
Transferred entry: polyamine oxidase. Now included with EC 1.5.3.13 N1-acetylpolyamine oxidase, EC 1.5.3.14 polyamine oxidase (propane-1,3-diamine-forming), EC 1.5.3.15 N8-acetylspermidine oxidase (propane-1,3-diamine-forming), EC 1.5.3.16 spermine oxidase and EC 1.5.3.17 non-specific polyamine oxidase<br />
<br />
This is a tricky entry as there is a lot of overlap between the reactions that each enzyme catalyses. The best way to handle it is to copy out all the reactions (either from IntEnz or KEGG) and then see which are duplicated. E.g.<br />
<br />
EC:1.5.3.13:<br />
[RHEA:25815] N1-acetylspermidine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + putrescine <br />
[RHEA:25803] N1-acetylspermine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + spermidine <br />
[RHEA:25871] N1,N12-diacetylspermine + H2O + O2 <=> 3-acetamidopropanal + N1-acetylspermidine + H2O2 <br />
[RHEA:25811] H2O + O2 + spermidine <=> 3-aminopropanal + H2O2 + putrescine <br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
<br />
EC:1.5.3.16:<br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
<br />
EC:1.5.3.17:<br />
[RHEA:25803] N1-acetylspermine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + spermidine <br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
[RHEA:25811] H2O + O2 + spermidine <=> 3-aminopropanal + H2O2 + putrescine <br />
[RHEA:25815] N1-acetylspermidine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + putrescine <br />
<br />
From these lists, we can see that RHEA:25807 will have EC refs 1.5.3.13, 1.5.3.16 and 1.5.3.17; RHEA:25815 will have EC refs 1.5.3.13 and 1.5.3.17; and so on. The KEGG reaction display makes it easier to check which reactions are linked with which EC numbers once you have figured out the correspondence between RHEA IDs and KEGG IDs. KEGG also provides names for the reactions; there was one case where a reaction name clashed with an existing GO MF term, so I made the new term name more specific whilst keeping to the nomenclature conventions used by the other terms.<br />
<br />
There ended up being a lot of new terms created; here's a sample:<br />
<br />
name: spermine:oxygen oxidoreductase (spermidine-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + O(2) + spermine = 3-aminopropanal + H(2)O(2) + spermidine." [RHEA:25807]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.16<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09076<br />
xref: MetaCyc:1.5.3.17-RXN<br />
xref: MetaCyc:RXN-9015<br />
xref: RHEA:25807 "H(2)O + O(2) + spermine = 3-aminopropanal + H(2)O(2) + spermidine"<br />
<br />
name: spermidine:oxygen oxidoreductase (3-aminopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + O(2) + spermidine = 3-aminopropanal + H(2)O(2) + putrescine." [RHEA:25811]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09077<br />
xref: MetaCyc:RXN-10461<br />
xref: MetaCyc:RXN-12089<br />
xref: RHEA:25811 "H(2)O + O(2) + spermidine = 3-aminopropanal + H(2)O(2) + putrescine"<br />
<br />
name: N1-acetylspermine:oxygen oxidoreductase (3-acetamidopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + N(1)-acetylspermine + O(2) = 3-acetamidopropanal + H(2)O(2) + spermidine." [RHEA:25803]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R03899<br />
xref: MetaCyc:RXN-12090<br />
xref: MetaCyc:RXN-9940<br />
xref: RHEA:25803 "H(2)O + N(1)-acetylspermine + O(2) = 3-acetamidopropanal + H(2)O(2) + spermidine"<br />
<br />
name: N1-acetylspermidine:oxygen oxidoreductase (3-acetamidopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + N(1)-acetylspermidine + O(2) = 3-acetamidopropanal + H(2)O(2) + putrescine." [RHEA:25815]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09074<br />
xref: MetaCyc:RXN-12091<br />
xref: MetaCyc:RXN-9942<br />
xref: RHEA:25815<br />
<br />
There were also extra reactions in KEGG and MetaCyc that weren't in the EC listings; whether you add these or not depends on whether the person requesting the terms has asked for them and/or whether you want to add them.</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Curator_Guide:_Enzymes_and_Reactions&diff=39567Curator Guide: Enzymes and Reactions2012-02-29T20:31:42Z<p>Girlwithglasses: </p>
<hr />
<div>This guide is for editors adding molecular function terms to represent enzyme reactions.<br />
<br />
There are five websites that are particularly useful when adding reaction terms. These are:<br />
<br />
*[http://www.chem.qmul.ac.uk/iubmb/enzyme/index.html EC enzyme nomenclature]<br />
*[http://www.ebi.ac.uk/intenz IntEnz]<br />
*[http://www.genome.jp/kegg KEGG]<br />
*[http://www.metacyc.org MetaCyc]<br />
*[http://www.ebi.ac.uk/rhea Rhea]<br />
<br />
For chemical names, one should consult [http://www.ebi.ac.uk/chebi ChEBI]. RHEA is particularly useful because it gives EC reactions using ChEBI chemicals.<br />
<br />
<br />
==General Rules and Things of Note==<br />
<br />
<br />
===Enzyme Commission===<br />
<br />
The Enzyme Commission names and categorises enzymes, i.e. physical entities, whereas GO is interested in the various reactions that the enzyme performs. In the same way that a gene product may participate in a number of different processes, it may catalyse a number of different reactions; the ontology should contain each reaction, and the job of the annotator is to mark which reactions a certain gene product catalyses. A single enzyme may perform a number of different reactions, and it is also possible for several different EC enzymes to perform the same reaction.<br />
<br />
''This means that there is not a 1:1 correspondence between EC numbers and GO reaction terms.''<br />
<br />
There are a number of websites that mirror the EC data; my favourite is IntEnz as it shows the reactions from RHEA, so you are saved the trouble of having to find out what the ChEBI names for the reaction participants are.<br />
<br />
<br />
===MetaCyc===<br />
<br />
At present, MetaCyc reactions are associated with one EC number, so if two different EC enzymes catalyse the same reaction, there will be two MetaCyc reactions, one for each EC number.<br />
<br />
<br />
===KEGG===<br />
<br />
KEGG makes reactions independent of the EC number; you can look up an EC number and see the reactions that the enzyme performs (e.g. [http://www.genome.jp/dbget-bin/www_bget?ec:1.1.1.21 EC:1.1.1.21]), or you can look up a reaction and see which EC enzymes perform that reaction (e.g. [http://www.genome.jp/dbget-bin/www_bget?rn:R01036 R01036]). Nifty!<br />
<br />
<br />
===Reactome===<br />
<br />
Reactome currently provide mappings of their terms to GO terms, so they do the work for us!<br />
<br />
<br />
===Precise vs. Imprecise EC Numbers===<br />
<br />
GO has terms that represent the categories used by EC. These have EC xrefs of the form EC:n, EC:n.n and EC:n.n.n (where n is a number).<br />
<br />
For reactions where the enzyme has not yet been added to EC, but it can be put into one of the EC categories, the xref should be of the form EC:n.n.n.-, i.e. ending with a dash.<br />
<br />
<br />
===NAD(P)===<br />
<br />
According to the Enzyme Commission, NAD(P) means that the reaction occurs with NAD '''and''' with NADP; e.g.<br />
<br />
alditol + NAD(P)+ = aldose + NAD(P)H + H+<br />
<br />
means that the enzyme performs<br />
<br />
alditol + NAD+ = aldose + NADH + H+<br />
<br />
'''AND'''<br />
<br />
alditol + NADP+ = aldose + NADPH + H+<br />
<br />
<br />
===One EC number, multiple reactions===<br />
<br />
There are a number of cases where an enzyme can catalyse a set of reactions. These may or may not be specified by EC, but KEGG and MetaCyc will often show additional reactions. Similarly, there are often different EC enzymes that will catalyse the same reaction. A good example of this overlap is found in EC:1.5.3.13 to EC:1.5.3.17. IntEnz lists four reactions for EC:1.5.3.17 ([http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=1.5.3.17&status=OK]); looking at EC:1.5.3.16 ([http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=1.5.3.16&status=OK]), we can see that this enzyme also catalyses one of those reactions. KEGG shows this data more clearly: when viewing all the reactions for EC:1.5.3.17 ([http://www.genome.jp/dbget-bin/www_bget?reaction+R03899+R09074+R09076+R09077]; click 'Show all' on the enzyme data page), each reaction is listed with the EC numbers of the enzymes that can catalyse it. MetaCyc also lists a number of reactions for each EC number: [http://metacyc.org/META/NEW-IMAGE?type=NIL&object=EC-1.5.3&redirect=T].<br />
<br />
==Example 1: epi-cedrol synthase==<br />
<br />
Add a term for EC 4.2.3.39, epi-cedrol synthase<br />
<br />
*Check the reaction does not exist in GO by searching on the name, EC number and the reactants. I searched for 'epicedrol' and 'epi-cedrol'.<br />
<br />
*Look up the reaction in EC (using IntEnz), MetaCyc and KEGG.<br />
<br />
*IntEnz: [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=4.2.3.39 4.2.3.39]<br />
2-trans,6-trans-farnesyl diphosphate + H2O <=> epi-cedrol + diphosphate<br />
*MetaCyc: [http://biocyc.org/META/NEW-IMAGE?type=REACTION&object=RXN-10004 RXN-10004]<br />
(2E,6E)-farnesyl diphosphate + H2O <=> 8-epi-cedrol + diphosphate<br />
*KEGG: [http://www.genome.jp/dbget-bin/www_bget?ec:4.2.3.39 EC:4.2.3.39] is connected to one reaction, [http://www.genome.jp/dbget-bin/www_bget?rn:R09140 R09140]<br />
trans,trans-Farnesyl diphosphate + H2O <=> 8-epi-Cedrol + Diphosphate<br />
<br />
Check against the [http://www.ebi.ac.uk/rhea/reaction.xhtml?id=26118 RHEA reaction, RHEA:26118] (linked from IntEnz) so that we can be sure we're using the correct nomenclature.<br />
<br />
Names and synonyms: KEGG and EC both give us "(2E,6E)-farnesyl-diphosphate diphosphate-lyase (8-epi-cedrol-forming)", which is the systematic name, according to EC. We also have "8-epicedrol synthase" and "epicedrol synthase".<br />
<br />
Parentage: find the GO term for the category EC:4.2.3; if any of the children are relevant, use them as the parent.<br />
<br />
name: epi-cedrol synthase activity<br />
def: "Catalysis of the reaction: 2-trans,6-trans-farnesyl diphosphate + H2O = epi-cedrol + diphosphate." [RHEA:26118]<br />
synonym: "(2E,6E)-farnesyl-diphosphate diphosphate-lyase (8-epi-cedrol-forming) activity" EXACT systematic_synonym [EC:4.2.3.39]<br />
synonym: "8-epicedrol synthase activity" EXACT []<br />
synonym: "epicedrol synthase activity" EXACT []<br />
xref: EC:4.2.3.39<br />
xref: MetaCyc:RXN-10004<br />
xref: KEGG:R09140<br />
xref: RHEA:26118<br />
is_a: GO:0016838 ! carbon-oxygen lyase activity, acting on phosphates<br />
<br />
<br />
==Example 2: farnesol kinase==<br />
<br />
From SourceForge:<br />
<br />
definition: farnesol + an NTP = farnesol phosphate + an NDP<br />
EC: 2.7.1.-<br />
One example of a more specific case of this is: MetaCyc RXN-11625<br />
<br />
PMID 21395888<br />
PMID 10557276<br />
<br />
NARROW synonym: trans,trans-farnesol kinase<br />
NARROW synonym: 2-trans, 6-trans-farnesol kinase<br />
<br />
*Look up the MetaCyc reaction. It's<br />
<br />
2-trans,-6-trans-farnesol + CTP = 2-trans,-6-trans-farnesyl monophosphate + CDP + H+<br />
<br />
*Search GO, EC, KEGG and RHEA for farnesol. No results for reactions of a similar form.<br />
*Checking the literature references, it is not clear whether the farnesol reactions are limited to the 2-trans,6-trans isomer, so we'll refer to 'farnesol' in the reaction.<br />
*ChEBI searches for farnesol phosphates turn up a blank; however, "farnesyl phosphate" is a parent term for "farnesyl diphosphate" so we should use the name "farnesyl monophosphate" instead of "farnesol phosphate" to refer to the reaction product.<br />
*Parentage: MetaCyc gives an EC ref of 2.7.1.- for RXN-11625; this corresponds to GO:0016773. We can have a look at the ChEBI hierarchy for "farnesyl phosphate" to get some hints as to whether there may be any generic terms under GO:0016773, but there don't seem to be any. (N.B. a 'prenol kinase' term was later added, which would be a more appropriate parent.)<br />
*Reaction equation: NTP and NDP are referred to in ChEBI as nucleoside triphosphate and nucleoside diphosphate.<br />
<br />
name: farnesol kinase activity<br />
def: "Catalysis of the reaction: farnesol + nucleoside triphosphate = farnesyl monophosphate + nucleoside diphosphate." [MetaCyc:RXN-11625]<br />
synonym: "trans,trans-farnesol kinase activity" NARROW<br />
xref: EC:2.7.1.-<br />
is_a: GO:0016773 ! phosphotransferase activity, alcohol group as acceptor<br />
<br />
<br />
*Add the MetaCyc reaction cited as a child of this new term. I gave it the name "2-trans,-6-trans-farnesol kinase activity" to reflect the specific substrate.<br />
<br />
<br />
==Example 3: phosphomethylethanolamine N-methyltransferase activity==<br />
<br />
From SourceForge:<br />
<br />
Def: Catalysis of the reaction: phosphomethylethanolamine (PMEA) + AdoMet -> phosphodimethylethanolamine<br />
Ref: GOC:tb<br />
PMID 20650897<br />
<br />
<br />
Searching for the enzyme name brings up no results in GO, EC, MetaCyc and KEGG, so let's look up the reaction instead.<br />
<br />
Look up all three compounds mentioned in MetaCyc and KEGG.<br />
<br />
*KEGG contains [http://www.genome.jp/dbget-bin/www_bget?cpd:C13482 Phosphodimethylethanolamine]<br />
*MetaCyc contains [http://biocyc.org/META/NEW-IMAGE?type=COMPOUND&object=S-ADENOSYLMETHIONINE AdoMet]<br />
<br />
Check the reactions for these compounds.<br />
<br />
*KEGG: [http://www.genome.jp/dbget-bin/www_bget?rn:R06868 R06868] looks like a match:<br />
<br />
S-Adenosyl-L-methionine + N-Methylethanolamine phosphate <=><br />
S-Adenosyl-L-homocysteine + Phosphodimethylethanolamine<br />
<br />
*MetaCyc: [http://biocyc.org/META/NEW-IMAGE?type=REACTION&object=RXN-5642 RXN-5642] looks like a match:<br />
<br />
N-methylethanolamine phosphate + S-adenosyl-L-methionine <=><br />
N-dimethylethanolamine phosphate + S-adenosyl-L-homocysteine + H+<br />
<br />
*Check that N-dimethylethanolamine phosphate (from the MetaCyc reaction) is also known as phosphodimethylethanolamine<br />
**phosphodimethylethanolamine is a synonym on the MetaCyc compound page; the KEGG compound ID [http://www.genome.jp/dbget-bin/www_bget?cpd:C13482 C13482] matches that in the KEGG reaction<br />
**If in doubt, search for the compound in ChEBI and check the synonyms.<br />
<br />
*MetaCyc states that the reaction is one of three catalysed by EC:2.1.1.103, so go to IntEnz and look up [http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=2.1.1.103 2.1.1.103]. Although the comments mention subsequent reactions, the reaction list doesn't, so we will use the more generic EC:2.1.1.- as a reference.<br />
<br />
*Get the ChEBI names for the substances and generate a balanced equation. Check to see if the reaction is in Rhea. I looked at the [http://www.ebi.ac.uk/chebi/displayAutoXrefs.do?chebiId=CHEBI:57781 automatic xrefs for N-methylethanolamine phosphate in ChEBI] and clicked on the Rhea xrefs. [http://www.ebi.ac.uk/rhea/reaction.xhtml?id=RHEA:25322 RHEA:25322] is a match! Checking the xrefs for the Rhea reaction, they match the reactions in KEGG and MetaCyc that we found earlier.<br />
<br />
*Term name: a quick Google search reveals that 'phosphomethylethanolamine N-methyltransferase' appears to be the most common name for this term.<br />
*Synonyms: added the KEGG name for the reaction as an exact synonym with the scope set as 'systematic synonym'; also added a synonym using the ChEBI name for the chemical instead of phosphomethylethanolamine.<br />
*Term parentage: this term can go under N-methyltransferase activity.<br />
<br />
name: phosphomethylethanolamine N-methyltransferase activity<br />
def: "Catalysis of the reaction: N-methylethanolamine phosphate + S-adenosyl-L-methionine = N,N-dimethylethanolamine phosphate + S-adenosyl-L-homocysteine + H(+)." [RHEA:25322, KEGG:R06868, MetaCyc:RXN-5642]<br />
synonym: "N-methylethanolamine phosphate N-methyltransferase activity" EXACT<br />
synonym: "S-adenosyl-L-methionine:methylethanolamine phosphate N-methyltransferase activity" EXACT systematic_synonym [KEGG:R06868]<br />
xref: EC:2.1.1.-<br />
xref: KEGG:R06868<br />
xref: MetaCyc:RXN-5642<br />
xref: RHEA:25322<br />
is_a: GO:0008170 ! N-methyltransferase activity<br />
<br />
<br />
==Example 4: updating EC:1.5.3.11, a transferred EC entry==<br />
<br />
From EC:<br />
<br />
Transferred entry: polyamine oxidase. Now included with EC 1.5.3.13 N1-acetylpolyamine oxidase, EC 1.5.3.14 polyamine oxidase (propane-1,3-diamine-forming), EC 1.5.3.15 N8-acetylspermidine oxidase (propane-1,3-diamine-forming), EC 1.5.3.16 spermine oxidase and EC 1.5.3.17 non-specific polyamine oxidase<br />
<br />
This is a tricky entry as there is a lot of overlap between the reactions that each enzyme catalyses. The best way to handle it is to copy out all the reactions (either from IntEnz or KEGG) and then see which are duplicated. E.g.<br />
<br />
EC:1.5.3.13:<br />
[RHEA:25815] N1-acetylspermidine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + putrescine <br />
[RHEA:25803] N1-acetylspermine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + spermidine <br />
[RHEA:25871] N1,N12-diacetylspermine + H2O + O2 <=> 3-acetamidopropanal + N1-acetylspermidine + H2O2 <br />
[RHEA:25811] H2O + O2 + spermidine <=> 3-aminopropanal + H2O2 + putrescine <br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
<br />
EC:1.5.3.16:<br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
<br />
 EC:1.5.3.17:<br />
[RHEA:25803] N1-acetylspermine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + spermidine <br />
[RHEA:25807] H2O + O2 + spermine <=> 3-aminopropanal + H2O2 + spermidine <br />
[RHEA:25811] H2O + O2 + spermidine <=> 3-aminopropanal + H2O2 + putrescine <br />
[RHEA:25815] N1-acetylspermidine + H2O + O2 <=> 3-acetamidopropanal + H2O2 + putrescine <br />
<br />
From these lists, we can see that RHEA:25807 will have EC refs 1.5.3.13, 1.5.3.16 and 1.5.3.17; RHEA:25815 will have EC refs 1.5.3.13 and 1.5.3.17; and so on. The KEGG reaction display makes it easier to check which reactions are linked with which EC numbers once you have figured out the correspondence between RHEA IDs and KEGG IDs. KEGG also provides names for the reactions; there was one case where a reaction name clashed with an existing GO MF term, so I made the new term name more specific whilst keeping to the nomenclature conventions used by the other terms.<br />
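<br />
The cross-referencing step can be sketched in a few lines of Python. This is only an illustration: the dictionary is copied by hand from the lists above, and <code>ec_refs_by_reaction</code> is a made-up helper name, not part of any GO tool.<br />
<br />
```python
from collections import defaultdict

# Reactions per EC number, copied by hand from the lists above.
REACTIONS_BY_EC = {
    "EC:1.5.3.13": ["RHEA:25815", "RHEA:25803", "RHEA:25871", "RHEA:25811", "RHEA:25807"],
    "EC:1.5.3.16": ["RHEA:25807"],
    "EC:1.5.3.17": ["RHEA:25803", "RHEA:25807", "RHEA:25811", "RHEA:25815"],
}


def ec_refs_by_reaction(reactions_by_ec):
    """Invert the EC -> reactions mapping into reaction -> sorted list of EC refs."""
    ec_refs = defaultdict(set)
    for ec, reactions in reactions_by_ec.items():
        for rhea_id in reactions:
            ec_refs[rhea_id].add(ec)
    return {rhea_id: sorted(ecs) for rhea_id, ecs in ec_refs.items()}


if __name__ == "__main__":
    # Print one line per reaction with the EC xrefs it should carry.
    for rhea_id, ecs in sorted(ec_refs_by_reaction(REACTIONS_BY_EC).items()):
        print(rhea_id, ", ".join(ecs))
```
<br />
Each reaction that appears under more than one EC number ends up with several EC xrefs, which is what the sample terms show.<br />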
<br />
There ended up being a lot of new terms created; here's a sample:<br />
<br />
name: spermine:oxygen oxidoreductase (spermidine-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + O(2) + spermine = 3-aminopropanal + H(2)O(2) + spermidine." [RHEA:25807]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.16<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09076<br />
xref: MetaCyc:1.5.3.17-RXN<br />
xref: MetaCyc:RXN-9015<br />
xref: RHEA:25807 "H(2)O + O(2) + spermine = 3-aminopropanal + H(2)O(2) + spermidine"<br />
<br />
name: spermidine:oxygen oxidoreductase (3-aminopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + O(2) + spermidine = 3-aminopropanal + H(2)O(2) + putrescine." [RHEA:25811]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09077<br />
xref: MetaCyc:RXN-10461<br />
xref: MetaCyc:RXN-12089<br />
xref: RHEA:25811 "H(2)O + O(2) + spermidine = 3-aminopropanal + H(2)O(2) + putrescine"<br />
<br />
name: N1-acetylspermine:oxygen oxidoreductase (3-acetamidopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + N(1)-acetylspermine + O(2) = 3-acetamidopropanal + H(2)O(2) + spermidine." [RHEA:25803]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R03899<br />
xref: MetaCyc:RXN-12090<br />
xref: MetaCyc:RXN-9940<br />
xref: RHEA:25803 "H(2)O + N(1)-acetylspermine + O(2) = 3-acetamidopropanal + H(2)O(2) + spermidine"<br />
<br />
name: N1-acetylspermidine:oxygen oxidoreductase (3-acetamidopropanal-forming) activity<br />
def: "Catalysis of the reaction: H(2)O + N(1)-acetylspermidine + O(2) = 3-acetamidopropanal + H(2)O(2) + putrescine." [RHEA:25815]<br />
xref: EC:1.5.3.13<br />
xref: EC:1.5.3.17<br />
xref: KEGG:R09074<br />
xref: MetaCyc:RXN-12091<br />
xref: MetaCyc:RXN-9942<br />
xref: RHEA:25815<br />
<br />
There were also extra reactions in KEGG and MetaCyc that weren't in the EC listings; whether you add these or not depends on whether the person requesting the terms has asked for them and/or whether you want to add them.</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Annotation_Conf._Call,_February_14,_2012&diff=39274Annotation Conf. Call, February 14, 20122012-02-14T05:00:19Z<p>Girlwithglasses: /* Col 17 ID Hierarchy */</p>
<hr />
<div>==Agenda for Annotation Call==<br />
<br />
* More evidence codes - new Evidence code for Inferences based on Ontology links (http://gocwiki.geneontology.org/index.php/Evidence_for_Inferences_based_on_Ontology_links) (Rama) <br />
<br />
* Update on [http://wiki.geneontology.org/index.php/Protein_Binding_clean_up protein binding obsoletions] (Jane)<br />
<br />
* Update on communication mechanisms for changes to the GO taxon file. (Jane)<br />
<br />
* Can we have a quick review of the currently preferred mechanism for feedback on PAINT annotations? (Kimberly)<br />
<br />
* New QC checks (Amelia) - see below<br />
<br />
* Col 17 entry ID hierarchy - see below<br />
<br />
==Suggested QC Checks==<br />
<br />
===Remove redundant GP info===<br />
<br />
The GP synonym column (col 11) must not repeat information held in other columns (GP symbol, GP name, DB object ID), as this information is redundant.<br />
<br />
e.g. incorrect:<br />
<br />
{| border="1" cellspacing="0" cellpadding="5"<br />
! 1<br>DB<br />
! 2<br>DB object ID<br />
! 3<br>DB object symbol<br />
! ...<br />
! 10<br>DB object name<br />
! 11<br>DB object synonym<br />
! 12<br>DB object type<br />
|- <br />
| PomBase<br />
| SPCC1884.02<br />
| nic1<br />
| ...<br />
| NiCoT heavy metal ion transporter Nic1<br />
| SPCC1884.02 &#124; nic1 &#124; SPCC757.01<br />
| gene<br />
|}<br />
<br />
<br />
correct:<br />
<br />
{| border="1" cellspacing="0" cellpadding="5"<br />
! 1<br>DB<br />
! 2<br>DB object ID<br />
! 3<br>DB object symbol<br />
! ...<br />
! 10<br>DB object name<br />
! 11<br>DB object synonym<br />
! 12<br>DB object type<br />
|- <br />
| PomBase<br />
| SPCC1884.02<br />
| nic1<br />
| ...<br />
| NiCoT heavy metal ion transporter Nic1<br />
| SPCC757.01<br />
| gene<br />
|}<br />
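<br />
A minimal sketch of how this check might be automated, assuming each annotation is already split into its 17 GAF columns. The function name is hypothetical, and the exact, case-sensitive string comparison is an assumption rather than an agreed rule.<br />
<br />
```python
def strip_redundant_synonyms(gaf_row):
    """Return a copy of a 17-column GAF row with redundant col 11 entries removed."""
    cols = list(gaf_row)
    # Cols 2, 3 and 10 (DB object ID, symbol, name) are list indices 1, 2 and 9.
    redundant = {cols[1], cols[2], cols[9]}
    # Col 11 (index 10) holds pipe-separated synonyms.
    synonyms = [s.strip() for s in cols[10].split("|") if s.strip()]
    cols[10] = "|".join(s for s in synonyms if s not in redundant)
    return cols


if __name__ == "__main__":
    # The "incorrect" example above; columns not shown in the table are left empty.
    row = [""] * 17
    row[0], row[1], row[2] = "PomBase", "SPCC1884.02", "nic1"
    row[9] = "NiCoT heavy metal ion transporter Nic1"
    row[10] = "SPCC1884.02|nic1|SPCC757.01"
    row[11] = "gene"
    print(strip_redundant_synonyms(row)[10])
```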
<br />
<br />
===Col 17 ID format===<br />
<br />
Only one ID is allowed in col 17, and that ID should be formatted correctly and be from a database listed in GO.xrf_abbs.<br />
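<br />
As a rough illustration, the check could look like the sketch below. <code>KNOWN_DBS</code> is a small hand-picked stand-in for the abbreviations in GO.xrf_abbs (a real check would load that file), and treating a pipe as the sign of multiple IDs is an assumption.<br />
<br />
```python
import re

# Illustrative stand-in only; a real check would load the abbreviations from GO.xrf_abbs.
KNOWN_DBS = {"UniProtKB", "ENSEMBL", "VEGA", "RefSeq", "TAIR", "PR", "WB"}

# DB abbreviation, a colon, then a non-empty local ID with no whitespace.
ID_PATTERN = re.compile(r"^([^:\s|]+):(\S+)$")


def check_col17(value, known_dbs=KNOWN_DBS):
    """Return a list of problems found in a col 17 value (empty list if it is fine)."""
    if value == "":
        return []  # col 17 is optional
    if "|" in value:
        return ["more than one ID"]
    match = ID_PATTERN.match(value)
    if match is None:
        return ["not in DB:ID format"]
    if match.group(1) not in known_dbs:
        return ["unknown database abbreviation: " + match.group(1)]
    return []
```
<br />
Because the comparison is case-sensitive, a prefix such as <code>uniProtKB</code> is reported as unknown rather than silently accepted.<br />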
<br />
<br />
===Col 17 entities should always be related to the same col 2 entry===<br />
<br />
See the [http://www.geneontology.org/GO.format.gaf-2_0.shtml#gene_product_form_id docs on col 17] for a refresher on col 17 contents.<br />
<br />
Where spliceforms exist, a given spliceform ID must always have the same parent GP ID (col 2), unless anyone can think of a case in which this would not hold?<br />
<br />
e.g. incorrect:<br />
<br />
{| border="1" cellspacing="0" cellpadding="5"<br />
! 1<br>DB<br />
! 2<br>DB object ID<br />
! ...<br />
! 17<br>gene product form ID<br />
|- <br />
| MGI<br />
| MGI:123456<br />
| ...<br />
| UniProt:P0217K-3<br />
|-<br />
| MGI<br />
| MGI:654321<br />
| ...<br />
| UniProt:P0217K-3<br />
|}<br />
<br />
Correct:<br />
<br />
{| border="1" cellspacing="0" cellpadding="5"<br />
! 1<br>DB<br />
! 2<br>DB object ID<br />
! ...<br />
! 17<br>gene product form ID<br />
|- <br />
| MGI<br />
| MGI:123456<br />
| ...<br />
| UniProt:P0217K-3<br />
|-<br />
| MGI<br />
| MGI:123456<br />
| ...<br />
| UniProt:P0217K-3<br />
|}<br />
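<br />
One possible way to run this check over a set of annotations is sketched below, assuming each row is a list of the 17 GAF columns; <code>find_inconsistent_forms</code> is a hypothetical name.<br />
<br />
```python
from collections import defaultdict


def find_inconsistent_forms(rows):
    """Return the col 17 IDs that are attached to more than one col 1/col 2 object."""
    parents = defaultdict(set)
    for row in rows:
        form_id = row[16]  # col 17, gene product form ID
        if form_id:
            parents[form_id].add((row[0], row[1]))  # col 1 DB, col 2 DB object ID
    return {form: sorted(ids) for form, ids in parents.items() if len(ids) > 1}
```
<br />
Run on the "incorrect" example above, this would report the form ID together with both MGI IDs; on the "correct" example it returns an empty dictionary.<br />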
<br />
<br />
==Col 17 ID Hierarchy==<br />
<br />
Identifiers in column 17 come from a range of databases; we propose creating a ranked list of preferred databases from which the IDs should be taken.<br />
<br />
e.g. if the hierarchy were UniProtKB > VEGA > ENSEMBL<br />
<br />
If UniProtKB ID exists, use that<br />
else if VEGA ID exists, use that<br />
else if ENSEMBL ID exists, use that<br />
else PANIC!<br />
<br />
Different object types (protein, mRNA, etc.) may need to have different hierarchies.<br />
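<br />
The fallback logic above, extended with one hierarchy per object type, might be sketched as follows. The protein ordering is just the example given above, not an agreed hierarchy, and the function and table names are hypothetical.<br />
<br />
```python
# Example preference order from the text above; any other per-type entries would be assumptions.
HIERARCHY_BY_TYPE = {
    "protein": ["UniProtKB", "VEGA", "ENSEMBL"],
}


def pick_preferred_id(candidate_ids, object_type="protein", hierarchies=HIERARCHY_BY_TYPE):
    """Return the candidate ID from the most preferred database, or raise if none match."""
    by_db = {}
    for candidate in candidate_ids:
        db, _, _ = candidate.partition(":")
        by_db.setdefault(db, candidate)  # keep the first ID seen for each database
    for db in hierarchies.get(object_type, []):
        if db in by_db:
            return by_db[db]
    # The "else PANIC!" branch: nothing from a preferred database was found.
    raise LookupError("no ID from a preferred database for type %r" % object_type)
```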
<br />
DBs used so far:<br />
<br />
{| border="1" cellspacing="0" cellpadding="5" align="left"<br />
! Database<br />
! GP form types<br />
! # distinct IDs<br />
! Assigned by<br />
|- <br />
| ENSEMBL<br />
| protein<br />
| 2464<br />
|<br />
BHF-UCL<br />
DFLAT<br />
GOC<br />
HGNC<br />
IntAct<br />
MGI<br />
RGD<br />
RefGenome<br />
UniProtKB<br />
|-<br />
| PR<br />
| protein<br />
| 3<br />
|MGI<br />
|-<br />
| protein_id<br />
| protein<br />
| 31<br />
|MGI<br />
|-<br />
| Protein_id [capitalization error]<br />
| protein<br />
| 1<br />
|MGI<br />
|-<br />
| RefSeq<br />
| gene, protein<br />
| 3215<br />
|<br />
BHF-UCL<br />
GOC<br />
IntAct<br />
MGI<br />
RGD<br />
RefGenome<br />
UniProtKB<br />
|-<br />
| TAIR<br />
| RNA, gene_product, miRNA, protein, rRNA, snRNA, snoRNA, tRNA<br />
| 45992<br />
|<br />
GOC<br />
IntAct<br />
RefGenome<br />
TAIR<br />
TIGR<br />
UniProtKB<br />
|-<br />
| UniProtKB<br />
| protein<br />
| 4601<br />
|<br />
BHF-UCL<br />
DFLAT<br />
GOC<br />
HGNC<br />
IntAct<br />
MGI<br />
PINC<br />
RGD<br />
RefGenome<br />
Roslin_Institute<br />
UniProtKB<br />
|-<br />
| UniPRotKB [capitalization error]<br />
| protein<br />
| 1<br />
|MGI<br />
|-<br />
| uniProtKB [capitalization error]<br />
| protein<br />
| 2<br />
|MGI<br />
|-<br />
| VEGA<br />
| protein<br />
| 13706<br />
|<br />
BHF-UCL<br />
DFLAT<br />
GOC<br />
HGNC<br />
IntAct<br />
MGI<br />
PINC<br />
RGD<br />
RefGenome<br />
Roslin_Institute<br />
UniProtKB<br />
|-<br />
| WB<br />
| gene<br />
| 4<br />
|WB<br />
|-<br />
| WP<br />
| gene<br />
| 6<br />
|WB<br />
|}</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Hinxton_OBO-Edit/Protege_4_workshop_Jan_2012&diff=38907Hinxton OBO-Edit/Protege 4 workshop Jan 20122012-01-11T01:54:57Z<p>Girlwithglasses: /* Software and downloads Required */</p>
<hr />
<div>==EBI, Courtyard Room. January 30-31 2012==<br />
<br />
We're looking at running a fairly small, informal training workshop on using Protege 4 for ontology development, with particular attention to OWL/OBO interconversion and to editing in Protege and OBO-Edit simultaneously. It'll be very much a hands-on thing, largely for the benefit of the Gene Ontology developers, but we'd be happy for others to sit in if they are interested.<br />
<br />
===Tentative Schedule===<br />
<br />
====Day 1====<br />
<br />
* Overview and objectives (Chris)<br />
* An introduction to OWL (Chris)<br />
* [http://www.slideshare.net/dosumis/from-obo-to-owl-and-back-building-scalable-ontologies obo to owl and back] (David OS)<br />
** OBO-OWL cheat sheet<br />
** Anatomy ontology examples<br />
** Automating multiple classification<br />
** Automatic error detection<br />
** OBO-Edit Guide (SKIP THIS - read in advance if you are interested)<br />
** Quick guide to Protege <br />
** Introducing the tutorial ontology<br />
* Converting obo to owl: Basic features of [[Oort]] (Chris)<br />
* Demo of OBO-Edit/P4 dual-editing (David OS)<br />
** Note: we hope to have a 2.1.1b that makes this even easier<br />
* P4 tutorial (David OS? Simon and James?)<br />
<br />
====Day 2====<br />
<br />
* [[Relations]] Ontology (Chris)<br />
** Summary of current status<br />
** Macro relations (advanced - see how we're doing for time)<br />
* Working with multiple ontologies<br />
** [[Ontology extensions]] (aka cross-products)<br />
** An introduction to owl:imports<br />
*** Working with owl:imports in P4 (David)<br />
*** Limitation of imports in OE (Chris)<br />
** Imports vs merging in subsets<br />
*** Extracting ontology subsets: Oort guide, part 2 (Chris/David)<br />
** Case studies:<br />
*** Drosophila anatomy ontology (David)<br />
*** Cell ontology (Chris)<br />
*** Protein ontology (Chris)<br />
** Using CL in GO - [[XP:biological_process_xp_cell|bp-xp-cl logical definitions]]<br />
** Using CHEBI/GOCHE in GO - [[XP:biological_process_xp_chebi|bp-xp-chebi logical definitions]]<br />
** Managing dependencies<br />
*** Ontology "builds" and The OBO Jenkins environment<br />
* Hands on ontology editing (everyone)<br />
** Split into groups?<br />
* Advanced topics (if we have time)<br />
** Gene associations in OWL (Chris)<br />
* Discussion<br />
** GO editors requirements for 2012 (Editors)<br />
*** OBO-Edit<br />
*** Protege 4<br />
*** TermGenie<br />
*** GO-Jenkins<br />
<br />
===Preliminary list of participants:===<br />
<br />
Chris Mungall (Berkeley) (V)<br />
<br />
Tanya Berardini (TAIR)<br />
<br />
Judith Blake (GO, MGI)<br />
<br />
Karen Eilbeck (SO, Univ. of Utah) (Vegan)<br />
<br />
Rebecca Foulger (GO, EBI)<br />
<br />
Midori Harris (PomBase, University of Cambridge) (Non-V)<br />
<br />
David Hill (MGI)<br />
<br />
Harold Drabkin (MGI)<br />
<br />
Jane Lomax (GO, EBI)<br />
<br />
David Osumi-Sutherland (FlyBase, University of Cambridge)<br />
<br />
Karen Christie (SGD)<br />
<br />
Paola Roncaglia (GO, EBI)<br />
<br />
James Malone (EBI)<br />
<br />
Simon Jupp (EBI)<br />
<br />
John Ison (EBI)<br />
<br />
Marcus Ennis (CHEBI, EBI)<br />
<br />
Emily Dimmer (GOA)<br />
<br />
Marta Costa (Virtual FlyBrain)<br />
<br />
Michael Schroeder (Technische Universität Dresden)<br />
<br />
Thomas Wächter (Technische Universität Dresden)<br />
<br />
===Useful links:===<br />
<br />
*[http://code.google.com/p/oboformat/ OBO spec and oboformat pages]<br />
*[http://www.slideshare.net/dosumis/from-obo-to-owl-and-back-building-scalable-ontologies David OS's slides]<br />
<br />
===Software and downloads Required===<br />
<br />
You should have the following installed on your laptop ''prior'' to the workshop:<br />
<br />
* [http://protege.stanford.edu/download/download.html Protege4] 4.1 recommended. You could also install 4.2alpha side by side<br />
** Plugins (some of these may be distributed with P4 already):<br />
*** ELK<br />
* OBO-Edit 2.1<br />
** Note: we may have a 2.1.1beta ready to try<br />
* The [http://code.google.com/p/owltools/wiki/OBOReleaseManagerGUIDocumentation Oort GUI]<br />
* An svn client; see [http://subversion.apache.org/ Apache Subversion] for subversion software<br />
<br />
==== Downloads ====<br />
<br />
You should check out the tutorial files, which are arranged in separate directories:<br />
<br />
svn co https://oboformat.googlecode.com/svn/docs/tutorial<br />
<br />
When you get to Hinxton, make sure you're up to date:<br />
<br />
cd tutorial<br />
svn update<br />
<br />
Note that if you cannot use svn, you can still download the files individually [https://oboformat.googlecode.com/svn/docs/tutorial click here to navigate] - however, it's '''strongly''' recommended you get a working svn client installed before the meeting.<br />
<br />
SVN clients:<br />
<br />
* Command line (Linux/Mac): the standard client is just called "svn" - type this on the command line to see if you have it<br />
* Windows users: TortoiseSVN recommended<br />
* Mac users (who prefer GUIs): Smart SVN pro<br />
<br />
==== Reading List ====<br />
<br />
Read, or refresh your memory of, the following before the workshop:<br />
<br />
* Cross-Product Extensions of the Gene Ontology [http://dx.doi.org/10.1016/j.jbi.2010.02.002 Journal of Biomedical Informatics] 2010. Christopher J. Mungall and Michael Bada and Tanya Z. Berardini and Jennifer Deegan and Amelia Ireland and Midori A. Harris and David P. Hill and Jane Lomax<br />
** Note: the formatting is a little screwed up in the html version - [http://www.sciencedirect.com/science?_ob=MiamiImageURL&_cid=272371&_pii=S1532046410000171&_check=y&_origin=&_coverDate=28-Feb-2011&view=c&wchp=dGLzVlS-zSkzV&md5=bcbfb2143c039c35af75b5c4b67b8866/1-s2.0-S1532046410000171-main.pdf download the pdf]<br />
** Familiarity with this will help for the session on working with multiple ontologies<br />
<br />
Additional material:<br />
<br />
* [http://www.w3.org/TR/owl2-primer/ OWL2-primer] (ADVANCED)<br />
** ''Note:'' if you read this you should select "manchester syntax" as the display option (software developers choose "functional syntax")<br />
** You can ignore: 4.7, 4.8 (datatypes), 5.4, 6.3 (keys), all of section 7 (datatypes), section 9<br />
<br />
===Meeting Logistics===<br />
<br />
====Accommodation====<br />
<br />
{| {{Prettytable}} class='sortable'<br />
|-<br />
! Name<br />
! Accommodation<br />
! 29th<br />
! 30th<br />
! 31st<br />
! Vegetarian (Y/N)<br />
|-<br />
|Chris Mungall<br />
|Campus<br />
| Y<br />
| Y<br />
| -<br />
| Y<br />
|-<br />
|Judith Blake<br />
|Campus<br />
| Y<br />
| Y<br />
| -<br />
|-<br />
|David Hill<br />
|Campus<br />
| -<br />
| Y<br />
| -<br />
|-<br />
|Tanya Berardini<br />
|Campus<br />
| - <br />
| Y<br />
| -<br />
|-<br />
|Harold Drabkin<br />
|Campus<br />
| Y<br />
| Y<br />
| Y<br />
|-<br />
|Karen Christie<br />
|Campus<br />
| Y<br />
| Y<br />
| Y<br />
|-<br />
|Karen Eilbeck<br />
|Campus<br />
| Y<br />
| Y<br />
| Y<br />
|}
<br />
<br />
====Food====<br />
<br />
Monday 30th, Dinner:<br />
* http://www.al-casbah.com/ (TBC)<br />
<br />
[[Category:Ontology]]<br />
[[Category:Meetings]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=38109Gene Product Association Data (GPAD) Format (Archived)2011-11-07T20:53:29Z<p>Girlwithglasses: /* File Header */</p>
<hr />
<div>GPAD is an alternative means of exchanging annotations. The format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation XP (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology] and by standardizing the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
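As an illustration of how a parser might treat these header lines, here is a minimal sketch that reads the version declaration and collects the remaining comment lines. The function name is hypothetical, and reading only the leading block of "!" lines is a simplifying assumption.<br />
<br />
```python
def read_gaf_header(lines):
    """Return (version, comments) taken from the leading '!' lines of a GAF file."""
    version = None
    comments = []
    for line in lines:
        if not line.startswith("!"):
            break  # first data line: the header is over
        text = line[1:].strip()
        if text.lower().startswith("gaf-version:"):
            version = text.split(":", 1)[1].strip()
        elif text:
            comments.append(text)  # keep non-empty comment text, drop bare "!" lines
    return version, comments
```
<br />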
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border="1" cellpadding="5"<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
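<br />
The column table above can be read as a small validation rule set. The sketch below assumes rows are tab-separated and that trailing optional columns may be omitted; the key names in <code>GAF_COLUMNS</code> and the function <code>parse_gaf_line</code> are illustrative choices, not an official API.<br />

```python
from typing import Dict, List

# Column names in GAF 2.0 order, following the table above.
GAF_COLUMNS: List[str] = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence_code", "With_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Date", "Assigned_by", "Annotation_XP", "Spliceform",
]

# 1-based numbers of the columns marked "required" in the table.
REQUIRED = {1, 2, 3, 5, 6, 7, 9, 12, 13, 14, 15}


def parse_gaf_line(line: str) -> Dict[str, str]:
    """Split one tab-separated GAF 2.0 row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Pad rows that leave out trailing optional columns (assumption).
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected 17 columns, got {len(fields)}")
    for number in sorted(REQUIRED):
        if not fields[number - 1]:
            name = GAF_COLUMNS[number - 1]
            raise ValueError(f"required column {number} ({name}) is empty")
    return dict(zip(GAF_COLUMNS, fields))


# First row of the example table further down this page.
row = parse_gaf_line(
    "SGD\tS000000296\tPHO3\t\tGO:0003993\tSGD_REF:S000047763\tIMP\t\tF\t"
    "acid phosphatase\tYBR092C\tgene\ttaxon:4932\t20010118\tSGD\t\t"
)
```
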
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cellpadding=5 cellspacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || (NOT or integral_to)? (other_organism or colocalizes_with or contributes_to)? annotation_relation<br />
|- <br />
| GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || NCBI taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
Note II: Unlike GAF 2.0, there is no extra column for spliceforms; the spliceform ID goes directly in the DB_Object_ID. The relation between the spliceform ID and the canonical form is held in the GPI file.<br />
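<br />
The two notes above can be sketched as a row conversion. This is only an illustration under stated assumptions: the input is a dict with hypothetical key names, multiple qualifiers and taxa are pipe-separated, the interacting taxon is written without the "taxon:" prefix (as in the example tables on this page), and the evidence code is passed through unchanged rather than translated to ECO.<br />

```python
from typing import Dict, List


def gaf_to_gpad(gaf: Dict[str, str]) -> List[str]:
    """Build the 11 proposed GPAD fields from one parsed GAF 2.0 row."""
    qualifiers = [q for q in gaf.get("Qualifier", "").split("|") if q]
    go_id = gaf["GO_ID"]
    # Note: NOT moves out of the qualifier column and onto the GO ID.
    if "NOT" in qualifiers:
        qualifiers.remove("NOT")
        go_id = "NOT " + go_id
    # Note II: a spliceform ID, when present, replaces the object ID.
    object_id = gaf.get("Spliceform") or gaf["DB_Object_ID"]
    # A second taxon in GAF column 13 becomes the interacting taxon ID.
    taxa = gaf["Taxon"].split("|")
    interacting = taxa[1].replace("taxon:", "") if len(taxa) > 1 else ""
    return [
        gaf["DB"], object_id, "|".join(qualifiers), go_id,
        gaf["DB_Reference"], gaf["Evidence_code"], gaf.get("With_From", ""),
        interacting, gaf["Date"], gaf["Assigned_by"],
        gaf.get("Annotation_XP", ""),
    ]


# The negated RCL1 annotation from the example on this page.
example = {
    "DB": "SGD", "DB_Object_ID": "S000005370", "Qualifier": "NOT",
    "GO_ID": "GO:0003963", "DB_Reference": "SGD_REF:S000039255",
    "Evidence_code": "IDA", "With_From": "", "Taxon": "taxon:4932",
    "Date": "20020530", "Assigned_by": "SGD", "Annotation_XP": "",
    "Spliceform": "",
}
gpad = gaf_to_gpad(example)
```
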
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
===File Body===<br />
<br />
{| border=1 cellpadding=5 style="color:blue"<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cellpadding=5 style="color:blue"<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
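<br />
As a rough illustration of the nine GPI columns and the pipe-separated xrefs, the sketch below parses one row into a record. It assumes tab-separated rows in the column order given in the header line above, and that synonyms are pipe-separated in the same way as xrefs; the class <code>GeneProduct</code> and the function <code>parse_gpi_line</code> are hypothetical names.<br />

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneProduct:
    db: str
    object_id: str
    object_type: str
    taxon: str
    symbol: str
    name: str = ""
    synonyms: List[str] = field(default_factory=list)
    parent_id: str = ""  # blank unless the gene product is an isoform
    xrefs: List[str] = field(default_factory=list)


def _split_pipes(value: str) -> List[str]:
    """Split a pipe-separated column, dropping empty entries."""
    return [item for item in value.split("|") if item]


def parse_gpi_line(line: str) -> GeneProduct:
    """Parse one tab-separated GPI row; missing trailing columns are blank."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (9 - len(fields))
    (db, object_id, object_type, taxon, symbol,
     name, synonyms, parent, xrefs) = fields[:9]
    return GeneProduct(db, object_id, object_type, taxon, symbol, name,
                       _split_pipes(synonyms), parent, _split_pipes(xrefs))


# Isoform 2 of Angiomotin, as in the example further down this page.
isoform = parse_gpi_line(
    "UniProtKB\tQ4VCS5-2\tprotein\t9606\tAMOT_HUMAN\t"
    "Isoform 2 of Angiomotin\t\tUniProtKB:Q4VCS5\t"
)
```
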
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following example, taken from http://geneontology.org/GO.annotation.fields.shtml, shows the current GAF file structure (shaded background: annotation data; blue text: gene product data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || 745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || 2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
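<br />
Comparing the two example tables row by row shows each evidence code replaced by an ECO identifier. The lookup below is read off those example rows only; it has not been checked against ECO itself, so a real converter should take its mapping from ECO rather than from this sketch. The names <code>EXAMPLE_ECO</code> and <code>to_eco</code> are illustrative.<br />

```python
from typing import Mapping

# Evidence code -> ECO ID, copied from the example rows on this page only.
EXAMPLE_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def to_eco(code: str, mapping: Mapping[str, str] = EXAMPLE_ECO) -> str:
    """Translate a GAF evidence code, failing loudly on unknown codes."""
    try:
        return mapping[code]
    except KeyError:
        raise ValueError(f"no ECO mapping for evidence code {code!r}") from None


converted = [to_eco(code) for code in ("IMP", "TAS", "IDA")]
```
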
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
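<br />
A sketch of the "split" direction, under simplifying assumptions: each input row is a list of 17 GAF 2.0 fields, taxa in column 13 are pipe-separated, evidence codes are left as they are, and spliceforms (column 17) are ignored. The function name <code>split_gaf</code> is hypothetical.<br />

```python
from typing import Dict, Iterable, List, Tuple


def split_gaf(rows: Iterable[List[str]]) -> Tuple[List[List[str]], List[List[str]]]:
    """Split GAF rows into annotation rows and unique gene product rows."""
    annotations: List[List[str]] = []
    products: Dict[Tuple[str, str], List[str]] = {}
    for f in rows:
        taxa = f[12].split("|")
        extra_taxon = taxa[1] if len(taxa) > 1 else ""
        # Annotation columns in the proposed GPAD order.
        annotations.append([f[0], f[1], f[3], f[4], f[5], f[6], f[7],
                            extra_taxon, f[13], f[14], f[15]])
        # One gene product row per (DB, ID), in the proposed GPI order;
        # Parent GP ID and xrefs are left blank in this sketch.
        products.setdefault(
            (f[0], f[1]),
            [f[0], f[1], f[11], taxa[0], f[2], f[9], f[10], "", ""],
        )
    return annotations, list(products.values())


# The two PHO3 rows from the example above.
pho3_rows = [
    ["SGD", "S000000296", "PHO3", "", "GO:0003993", "SGD_REF:S000047763",
     "IMP", "", "F", "acid phosphatase", "YBR092C", "gene", "taxon:4932",
     "20010118", "SGD", "", ""],
    ["SGD", "S000000296", "PHO3", "", "GO:0006796", "SGD_REF:S000047115",
     "TAS", "", "P", "acid phosphatase", "YBR092C", "gene", "taxon:4932",
     "20041220", "SGD", "", ""],
]
annotations, products = split_gaf(pho3_rows)
```

Because the gene product columns repeat on every GAF row, the two annotations here collapse to a single gene product row.<br />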
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
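<br />
A conversion script for users could look roughly like the sketch below, which rebuilds GAF 2.0 rows from the two new files. One point worth noting: Aspect (GAF column 9) appears in neither proposed file, so this sketch takes a caller-supplied lookup from GO ID to aspect. The function <code>merge_to_gaf</code> and the <code>aspect_of</code> parameter are hypothetical, rows are assumed to be lists in the proposed column order, and spliceform handling is omitted.<br />

```python
from typing import List, Mapping, Sequence


def merge_to_gaf(annotations: Sequence[Sequence[str]],
                 products: Sequence[Sequence[str]],
                 aspect_of: Mapping[str, str]) -> List[List[str]]:
    """Join annotation rows to gene product rows on (DB, DB_Object_ID)."""
    by_id = {(p[0], p[1]): p for p in products}
    rows: List[List[str]] = []
    for a in annotations:
        (db, object_id, qualifier, go_id, ref, evidence,
         with_from, extra_taxon, date, assigned_by, xp) = a
        (_, _, object_type, taxon, symbol,
         name, synonyms, _parent, _xrefs) = by_id[(db, object_id)]
        # Undo the proposed "NOT GO:..." convention for the GAF qualifier.
        if go_id.startswith("NOT "):
            go_id = go_id[len("NOT "):]
            qualifier = "NOT" if not qualifier else "NOT|" + qualifier
        if extra_taxon:
            taxon = taxon + "|" + extra_taxon
        rows.append([db, object_id, symbol, qualifier, go_id, ref, evidence,
                     with_from, aspect_of[go_id], name, synonyms, object_type,
                     taxon, date, assigned_by, xp, ""])
    return rows


merged = merge_to_gaf(
    [["SGD", "S000005370", "", "NOT GO:0003963", "SGD_REF:S000039255",
      "IDA", "", "", "20020530", "SGD", ""]],
    [["SGD", "S000005370", "gene", "taxon:4932", "RCL1",
      "aminodeoxychorismate synthase", "YOL010W", "", ""]],
    {"GO:0003963": "F"},
)
```
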
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format further to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have the value Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate whether a protein has been prioritized by an annotation project, such as Reference Genomes or the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes Google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no mechanism for exporting it)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010, GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF 2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation XP || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
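<br />
A short sketch of reading a gzipped gp_association file into records keyed by the twelve column names above. It assumes tab-separated rows and "!"-prefixed comment lines, as in the formats described earlier on this page; the readme linked above remains the authority on the actual layout. The function name, the key names and the in-memory sample data are illustrative only.<br />

```python
import gzip
import io
from typing import Dict, Iterator, TextIO

# Column names follow the gp_association table above.
GP_ASSOCIATION_COLUMNS = [
    "DB", "DB_Object_ID", "Qualifier", "GO_ID", "DB_Reference",
    "Evidence_code", "With", "Extra_taxon_ID", "Date", "Assigned_by",
    "Annotation_XP", "Spliceform_ID",
]


def read_gp_association(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, skipping blank and '!' comment lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        fields += [""] * (len(GP_ASSOCIATION_COLUMNS) - len(fields))
        yield dict(zip(GP_ASSOCIATION_COLUMNS, fields))


# Stand-in for a downloaded .gz file: one comment line and one data row.
sample = gzip.compress(
    b"! example comment line\n"
    b"UniProtKB\tQ4VCS5\t\tGO:0031410\tPMID:11257124\tECO:0000002"
    b"\t\t\t20051207\tUniProtKB\t\t\n"
)
with gzip.open(io.BytesIO(sample), "rt", encoding="utf-8") as handle:
    records = list(read_gp_association(handle))
```

For a file on disk, the same function could be given the handle returned by <code>gzip.open(path, "rt")</code>.<br />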
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=2012_Timeline&diff=381082012 Timeline2011-11-07T18:45:40Z<p>Girlwithglasses: </p>
<hr />
<div>[[2011_Timeline]]<br />
<br><br />
<br />
{|border="1" cellspacing="0" cellpadding="5" align="center"<br />
!Project<br />
!Personnel<br />
!January <br />
!February<br />
!March<br />
!April<br />
!May<br />
!June<br />
!July<br />
!August<br />
!September<br />
!October<br />
!November<br />
!December<br />
|-<br />
|[[Chemical terms in GO]]. Phase I: metabolism, binding, transport<br />
|Harold, David, Tanya, Jane, Chris, Becky, Paola<br />
|Fix GO/goche misalignments - add CHEBI xps to GO<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Chemical terms in GO]]. Phase II: improve enzyme functions, align with external databases inc. Rhea, MF-BP links for enzymes/metabolic processes<br />
|Amelia, Chris, David, Jane, Tanya, Becky, Paola<br />
|<br />
*Manual: Identify MF reaction terms that can't be expressed using has_input: CHEBI:XXX / has_output: CHEBI:XXX; manually research and review<br />
*Automated: set up or augment script to produce has_input/has_output data for all rxns with RHEA xrefs (RHEA = EC rxns with CHEBI terms)<br />
*Automated: create system to check GO rxns are up to date wrt RHEA, EC, MetaCyc data. Alert on clashes between sources.<br />
|<br />
*Manual: continue review of non-conformant MF rxn terms<br />
*Automated: continue development of system to check GO rxns are up to date wrt RHEA, EC, MetaCyc data. Look at incorporating KEGG and UM-BBD data (no bulk downloads available)<br />
|<br />
*Manual: continue review of non-conformant MF rxn terms<br />
*Automated: continue development of GO rxn update system.<br />
*Automated: pull in MetaCyc pathway data - how to fit with GO BP?<br />
|<br />
*Manual: continue review of non-conformant MF rxn terms<br />
*Automated: look at other p'way databases (esp. those with BioPAX output) - fit with GO BP?<br />
|<br />
*Manual: continue review of non-conformant MF rxn terms<br />
|<br />
*Manual: continue review of non-conformant MF rxn terms<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Chemical terms in GO]]. Phase III: biological roles<br />
|Chris, David, Jane, Tanya, Becky, Paola<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[XP:biological_process_xp_cell|Cell ontology cross-products]]<br />
|Paola, Chris, Jane, Alex<br />
|<br />
|Review and improve biological_process_xp_cell<br />
|Make TG templates<br />
|Incorporate cell xps into GO<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Cross_Product_Guide|Internal Cross Products]]<br />
|David, Tanya, Jane, Chris, Paola, Becky<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Neuro_Behaviour_Ontology_(NBO)-GO_alignment|Behaviour: GO/NBO integration]]<br />
|Jane, Chris, George, Janna, David OS<br />
|Survey ontology consumers to examine the implications of including non-GO ids in GO <br />
|Generate file of mismatches between GO and NBO<br />
|Fix GO (and NBO) so they align<br />
|<br />
|<br />
|<br />
|Replace GO ids with equivalent NBO ids and definitions<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Signaling |Signaling overhaul]]<br />
|Becky, Alex, Sandra, Peter, Ruth and others<br />
|<br />
|Representing GPCR and second messenger signaling<br />
|<br />
|Ligand-gated ion channels that signal <br />
|<br />
|<br />
|Define intracellular signaling start and stops<br />
|<br />
|Ligand-mediated signaling pathways and receptor-mediated signaling pathways<br />
|<br />
|<br />
|<br />
|-<br />
|[[Neurobiology Project|Neurological processes and components]]<br />
|Jane, Chris, David and Paola<br />
|<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|Cardiac Conduction Ontology Development<br />
|Ruth, Stan, Doug, David, Tanya, Becky, Paola and community experts<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|Apoptosis<br />
|Emily, Becky, Paola, Pablo and community experts<br />
|Revisiting second-level terms: feedback from experts<br />
|Revisiting existing annotations to second-level terms<br />
|<br />
|Final overall structure: sanity checks and completion<br />
|<br />
|Write paper<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Virus_terms|Viral processes]]<br />
|Jane, Philippe Le Mercier (SIB) and community experts<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|}<br />
<br />
[[Category:Ontology]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=2011_UCL_Meeting_Agenda&diff=380722011 UCL Meeting Agenda2011-11-07T09:16:05Z<p>Girlwithglasses: /* Software (requested time: 3 hours) */</p>
<hr />
<div>back to [http://wiki.geneontology.org/index.php/2011_UCL_Meeting_Logistics meeting logistics page]<br />
<br />
= Venue =<br />
'''Main meeting Monday to Wednesday'''<br />
<br />
Wilkins Haldane Room ([http://crf.casa.ucl.ac.uk/screenRoute.aspx?s=1019&d=187&w=False])<br />
<br />
<br />
'''GO managers and PIs'''<br />
<br />
Tuesday 1pm to 2:30pm Wilkins JBR Meeting Room ([http://crf.casa.ucl.ac.uk/screenRoute.aspx?s=1019&d=189&w=False])<br />
<br />
<br />
'''GO curators and ChEBI'''<br />
<br />
Wednesday 10:30am to 1:30pm Cruciform Foyer Seminar Room 3 ([http://crf.casa.ucl.ac.uk/screenRoute.aspx?s=1019&d=62&w=False])<br />
<br />
=Monday=<br />
<br />
'''Morning'''<br />
Coffee from 9.00am<br />
<br />
Morning session to start at 9:30am<br />
<br />
Philippa Talmud to welcome the GO Consortium to UCL.<br />
<br />
PIs to kick off meeting...?<br />
<br />
== Project Management==<br />
* Overall project management under new grant goals<br />
* Manage documentation<br />
* Manage staff<br />
* Discussion on appropriate software tools to help manage and report progress on individual project milestones<br />
** Jira demo [Tony]<br />
* Discussion on a replacement for the Webex screen sharing tool<br />
* Communication between managers projects<br />
<br />
==Software (requested time: 3 hours)==<br />
<br />
# Overview of software group activities and [https://sourceforge.net/apps/trac/geneontology/roadmap Roadmap] [Chris]<br />
# Personnel Updates [Chris]<br />
<br />
* [[Migration of GO to SVN (Proposal)]] Chris (15 mins)<br />
<br />
** New architecture (30 mins)<br />
* [https://sourceforge.net/apps/trac/geneontology/milestone/AmiGO2%20alpha AmiGO 2 alpha] ~~ Seth<br />
** [http://amigo2.berkeleybop.org/cgi-bin/amigo2/amigo amigo2 walkthrough demo]<br />
** Example queries using c16<br />
*** CL examples (mouse annotations)<br />
*** POMBE examples<br />
*** Treatment of evidence types<br />
** overview of architecture, use of Solr. Demo of Solr API<br />
** opportunities for integration with QuickGO ~~ Tony, Chris<br />
<br />
* OBO-Edit ~~ Heiko/Chris (20 mins)<br />
** 2.1 Release<br />
** 2.2 Plans : mostly maintenance, a few key features, integration of obo2owl code<br />
** Beyond 2.2: should there be a 3.0? Protege 4 plugin? Web-based tool? Lego?<br />
<br />
* [https://sourceforge.net/apps/trac/geneontology/milestone/obo%20to%20owl%20roundtripping obo2owl roundtripping] ~~ Heiko/Chris (20 mins)<br />
** Using [[Ontology Release Files Proposal|OORT]] to produce releases ~~ Chris (!keep this item here - Jane)<br />
<br />
* [https://sourceforge.net/apps/trac/geneontology/milestone/TG1%20beta TermGenie 1 beta] ~~ Heiko (25 mins)<br />
** Retiring TG0<br />
** Extensions - additional templates<br />
** Integration with ontology roadmap<br />
** Use of TG as SF replacement ("freeform" term creation)<br />
<br />
* GO Galaxy ~~ Chris (for Amelia) (5 mins)<br />
** http://galaxy.berkeleybop.org:8080/<br />
** Use by Phenoscape for obo-diffs<br />
** Wrapping multiple term enrichment tools<br />
<br />
* Presentation on GOMine (Stanford group) (Rama)<br />
** New tool from Stanford group, accessing GO data like GOOSE (http://goad.stanford.edu:8080/gomine/begin.do)<br />
<br />
'''Annotation Software'''<br />
<br />
* Annotation Tools<br />
** PAINT report ~~ Suzi, PaulT<br />
<br />
'''GOC dinner Monday 6:30pm: Sardo 47 Grafton Way http://www.sardo-restaurant.com/index.html'''<br />
<br />
== Annotation (requested time: 5 hours) ==<br />
<br />
==== Annotation Projects====<br />
<br />
* Annotation [http://wiki.geneontology.org/index.php/Annotation_Advocacy_Roadmap_2011 Roadmap] overview (Rama)<br />
<br />
* Annotation Progress Report [Rama and Emily] <br />
to include: <br />
** GO Consortium outreach activities and integration of external data sources into primary species GAFs.<br />
** Development of GO annotation guidelines, <br />
**[http://wiki.geneontology.org/index.php/Ideas_for_GOC_community_curation_tool Community Annotation Tool]<br />
**[http://wiki.geneontology.org/index.php?title=Evidence_Code_Ontology_(ECO) ECO codes]<br />
**Annotation QCs, including taxon constraints.<br />
<br />
* Progress of CACAO biocurator training on GONUTS [http://gowiki.tamu.edu/wiki/index.php/Category:CACAO_Fall_2011]<br />
[Jim 10 mins]<br />
<br />
==Tuesday==<br />
<br />
Separate lunchtime meeting for GO managers and PIs.<br />
Wilkins JBR Meeting Room 1pm to 2:30pm<br />
<br />
===Annotation discussions (cont.)===<br />
<br />
* Annotation Extension Field (column 16)<br />
** Development of the column 16 format (Val)<br />
** Discussion: <br />
*** Rules for transferring column 16 via ISS and via F->P inferences.<br />
<br />
===Annotation Proposals===<br />
<br />
*GO evidence codes to ECO identifiers<br />
<br />
* Specific use proposed for explicitly defining a gp to GO term relationship. [http://wiki.geneontology.org/index.php/Relationships_between_annotation_objects_and_ontology_terms see wiki]<br />
<br />
* [https://sourceforge.net/apps/trac/geneontology/milestone/GPAD%20Specification%20and%20tools GPAD specification and tools] ~~ Amelia<br />
<br />
* [https://sourceforge.net/apps/trac/geneontology/milestone/GOLD%20beta GOLD beta] ~~ Chris - on behalf of Shahid<br />
** GAF services<br />
*** Annotation file QC<br />
*** Inference of new annotations<br />
**** Use of logical definitions to suggest deeper annotations<br />
**** Cross-ontology inference using [[part_of]] and [[occurs_in]]<br />
<br />
<br />
====Longer-term Annotation Projects====<br />
* [[LEGO_Model_Draft_Specification|Re-development of the Annotation Model]] ~~ Amelia, Chris, PaulT [15 min talk + 20 min discussion]<br />
<br />
*Common Annotation Framework (Aim 4 of the Grant) - a centralized curation system to help curators efficiently capture annotations from the literature<br />
**Brainstorming session on GO curation tool features - what features are essential, what features would be great to have, what do curators ''not'' want<br />
<br />
==Ontology (requested: 3 hours)==<br />
<br />
* [[2011_Timeline|Roadmap]]/project overview [Jane/David]<br />
<br />
* Ontology editing workflow<br />
<br />
** Dual mode Obo-Edit and Protege editing ~~ Chris<br />
** Introduction to fast OWL reasoners, what they can do for the ontology group ~~ Chris<br />
** MIREOTing subsets of external ontologies into editors version ~~ Chris<br />
** Report on how the GO editors are using SourceForge: triage, jamborees etc ~~ Jane<br />
<br />
* Cross-products<br />
** Roadmap for new logical definitions and relationships - [[:Category:Cross_Products]]<br />
<br />
* Greater use of ontology constraints (disjointness, taxonomic constraints) ~~ Chris<br />
**[[Cellular_component_disjoint_classes#Spatial_disjointness:_Disconnected_from|Proposal to introduce new disjointness axioms into GO CC]] ~~ Chris<br />
<br />
* Aligning with external ontologies<br />
** Update on cell ontology ~~ Chris (NB: would be good to do this by email before the meeting, to get an update on CL funding and workflow).<br />
*** Integration with Uberon<br />
*** Use in FANTOM5<br />
*** Neuro-cell work<br />
** Update on GOCHE ~~ David<br />
** Behaviour alignment and NBO - a proposal [Jane]<br />
<br />
*Content Projects<br />
** Update on signaling ~~ Rebecca<br />
** Update on [[Virus_terms|viruses]] ~~ Jane<br />
** Update on [[Transcription|transcription]] ~~ Karen (please make this a late time slot so Karen can call in: 3pm UK time at the earliest)<br />
** Update on GO apoptosis project (Paola and Pablo)<br />
** Update on [[GO_slim_overhaul|generic GO slim]] (was this done at LA?) [Val]<br />
*** Processes and components<br />
*** Functions<br />
<br />
* Protein Binding (covering both ontology and annotation issues)<br />
** Protein binding data now being captured fully by IMeX Consortium collaborators (Sandra Orchard)<br />
** Level of detail required under GO:0005515 protein binding (Jane)<br />
<br />
* Relations<br />
** Update on RO and BFO ~~ Chris<br />
** GO subset of RO<br />
** Coordinating between annotation group and ontology group on GO subset of new RO<br />
<br />
==Wednesday==<br />
<br />
===Meeting Wrap-Up===<br />
** Action Items<br />
** Dates for next meeting<br />
<br />
====Annotation Discussion====<br />
** Progress and Direction of the Reference Genome Project (Pascale)<br />
** Ways in which PAINT annotation can be carried out to quickly generate annotations as a first draft annotation set (raised on the annotation call 25/10/2011)<br />
** suitable projects to engage all groups<br />
<br />
<br />
====Ontology editors discussion with ChEBI group (separate room booked)====</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=2011_UCL_Meeting_Logistics&diff=380682011 UCL Meeting Logistics2011-11-06T15:26:10Z<p>Girlwithglasses: /* Remote Attendees */</p>
<hr />
<div>Return to [[Consortium_Meetings]] page<br />
[[Category:Meetings]]<br />
<br />
=Questions=<br />
Please contact Ruth (r.lovering@ucl.ac.uk) if you have any questions.<br />
<br />
=Dates=<br />
The GO meeting will begin at 9am on Monday 7th November, 2011 and end at 1pm on Wednesday 9th November, 2011, and will be held at University College London.<br />
<br />
=Venue=<br />
'''Main meeting Monday to Wednesday'''<br />
<br />
Wilkins Haldane Room ([http://crf.casa.ucl.ac.uk/screenRoute.aspx?s=1019&d=187&w=False])<br />
Wilkins Building, ([http://maps.google.co.uk/maps?f=q&source=embed&hl=en&geocode=&q=WILKINS+BUILDING@51.524699,-0.13366&ie=UTF8&ll=51.524699,-0.13366&z=16])<br/><br />
University College London,<br/><br />
Gower Street,<br/><br />
London. WC1E 6BT<br />
<br />
'''GO managers and PIs'''<br />
<br />
Tuesday 1pm to 2:30pm<br />
<br />
Wilkins JBR Meeting Room ([http://crf.casa.ucl.ac.uk/screenRoute.aspx?s=1019&d=189&w=False])<br />
<br />
'''GO curators and ChEBI'''<br />
<br />
Wednesday 10:30am to 1:30pm<br />
<br />
Cruciform Foyer Seminar Room 3 ([http://crf.casa.ucl.ac.uk/screenRoute.aspx?s=1019&d=62&w=False])<br />
<br />
<br />
== GOC dinner ==<br />
<br />
Monday 6:30pm<br />
<br />
Sardo <br />
<br />
47 Grafton Way<br />
<br />
http://www.sardo-restaurant.com/index.html<br />
<br />
Menu: based on the Christmas menu http://www.sardo-restaurant.com/setmenu.html<br />
Soup and Main fish course option varies, depending on seasonal supply.<br />
Additional starter: Carpaccio di Manzo (slices of beef)<br />
Main vegetarian option will be Fregola con asparagi e ricotta, (pasta with asparagus and ricotta, not the Christmas option)<br />
<br />
Vegan and gluten-free options are available from the main menu, including a gluten-free pasta option. It may be possible to have other vegetarian options too.<br />
<br />
http://www.sardo-restaurant.com/menu.html<br />
<br />
=Registration=<br />
Register by entering your name in the [http://wiki.geneontology.org/index.php/2011_UCL_Meeting_Logistics#Attendees Attendees section] below.<br />
<br />
*Registration fee to be decided (depending on number of participants). Aiming for £140, to include lunches and coffee breaks and 1 evening dinner. This will be payable by credit card the first day of the meeting. Please use the Attendees table below to register for the meeting.<br />
<br />
=Lodging Information=<br />
Please make your own hotel reservations (see table below). The most reasonable hotels are Ibis and Premier Inn. <br />
<br />
Early booking is recommended, as prices are likely to increase closer to the time; some sites offer free cancellation options. <br />
It might be worth comparing prices booked directly with the hotel against online sites, e.g. [http://www.booking.com/ booking.com] (filter on Bloomsbury) or [http://www.tripadvisor.co.uk/ Tripadvisor.co.uk]. <br />
<br />
All hotels below are within a 10-15 min walk of the meeting room. In general, London is a pretty safe area, although the area about 20 min east of UCL (around King's Cross) is a red light district, so I would suggest you avoid it. None of the hotels below are in the red light area! Obviously there are many more hotels you could try.<br />
<br />
If anyone has stayed in these hotels please comment on them below.<br />
<br />
----<br />
{| {{Prettytable}} class='sortable'<br />
|-<br />
! Hotel Name<br />
! Cost for 3 nights<br />
! Walking time to UCL<br />
! Address <br />
! Phone number<br />
! Gym (Y/N)<br />
! Internet cost<br />
! Star rating<br />
! Noisy (Y/N)<br />
|-<br />
|[http://www.micentre.com/ MIC] fully booked?<br />
|£225<br />
|5 min<br />
|81 - 103 Euston Street, London NW1 2EZ<br />
|0207 380 0001<br />
|N<br />
|?<br />
|3<br />
|N, shouldn't be<br />
|-<br />
|[http://www.premierinn.com/en/hotel/LONEUS/london-euston Premier Inn]<br />
|£365<br />
|9 min<br />
|1 Dukes Road,London, WC1H 9PJ<br />
|0870 850 5115<br />
|N<br />
|?<br />
|3<br />
|N<br />
|-<br />
|[http://www.ibishotel.com/gb/hotel-0921-ibis-london-euston-st-pancras/index.shtml Ibis]<br />
|£335<br />
|7 min<br />
|3 Cardington Street NW1 2LW<br />
|0207 3887777<br />
|N<br />
|?<br />
|2<br />
|N&Y room dependent<br />
|-<br />
|[http://www.novotel.com/gb/hotel-5309-novotel-london-st-pancras/index.shtml Novotel]<br />
|£440 (now £520)<br />
|10 min<br />
|100 - 110 Euston Road NW1 2AJ<br />
|0207 6669000<br />
|Y<br />
|?<br />
|4<br />
|N<br />
|-<br />
|[http://www.holidayinn.com/hotels/gb/en/london/lonbl/hoteldetail Holiday Inn]<br />
|£512<br />
|11 min<br />
|Coram Street, London, WC1N 1HT<br />
|0871 9429222<br />
|Y<br />
|£5/hour<br />
|4<br />
|N<br />
|-<br />
|[http://www.londonrussellhotel.co.uk/ The Hotel Russell]<br />
|£493<br />
|12 min<br />
|1-8 Russell Square, Bloomsbury, WC1B 5BE<br />
|0870 850 5115<br />
|off site £5<br />
|£15/day<br />
|4<br />
|Y<br />
|-<br />
|[http://centrallondon.stgiles.com/default.aspx?pg=suites St Giles Hotel & Leisure Club]<br />
|£337<br />
|15 min<br />
|Bedford Avenue, Bloomsbury, WC1B 3GH<br />
|0207 300 3000<br />
|yes and pool<br />
|£10/day<br />
|3<br />
|? should be OK<br />
|-<br />
|The White Hall Hotel<br />
|£450**, only executive rooms now free £558**<br />
|12 min<br />
|2-5 Montague Street, WC1B 5BU <br />
|020 7233 7888<br />
|N<br />
|?<br />
|4<br />
|? should be OK<br />
|}<br />
<br />
(**) If you are interested in The White Hall Hotel let me know and I will book these via UCL, to get this discount price (Ruth r.lovering@ucl.ac.uk).<br />
----<br />
<br />
=Maps and Transportation=<br />
<br />
== Airports ==<br />
<br />
You can get to central London from any of the following airports. Taxis will be more expensive than trains, and all airports have good train services available. Once at a central London train station, either get a taxi to your hotel or the meeting, or get on the tube (see information below).<br />
<br />
* Heathrow [http://www.heathrowairport.com/portal/controller/dispatcher.jsp?CiID=759c9b25f9599110VgnVCM10000036821c0a____&ChID=4504abfa784d3110VgnVCM10000036821c0a____&Ct=B2C_CT_GENERAL&CtID=448c6a4c7f1b0010VgnVCM200000357e120a____&ChPath=Home^Heathrow^General^To+and+from+Heathrow^Trains&search=true link to train information ]<br />
<br />
* Gatwick [http://www.gatwickairport.com/transport/gatwick-express/ link to train information ]<br />
<br />
* Stansted [http://www.stanstedairport.com/portal/page/Stansted%5EGeneral%5ETo+and+from+Stansted%5ETrains/a024e37e8077d110VgnVCM10000036821c0a____/448c6a4c7f1b0010VgnVCM200000357e120a____/ link to train information ]<br />
<br />
== Airport to Central London ==<br />
<br />
The meeting is in central London and is walkable from Euston, Euston Square, Warren Street, Goodge Street tube stations. Note Tottenham Court Road Tube station is closed.<br />
<br />
There is a good [http://www.tfl.gov.uk/gettingaround/1106.aspx journey planner site] <br />
<br />
=== From Heathrow: ===<br />
<br />
* Either: catch the Heathrow Express train from Heathrow to Paddington, then take a taxi to your hotel/UCL, or take the tube from Paddington to Euston Square (eastbound on the Hammersmith & City Line towards Baker Street or the Circle Line towards Kings Cross St Pancras). <br />
Journey time around 1 hour. Cost around £38 return, £21 single, plus tube fare £4 each way (taxi cost unknown).<br />
<br />
* Or take the Piccadilly Line from Heathrow to Green Park, change onto the Victoria Line (heading north), and get out at either Warren Street (meeting) or Euston (depending on hotel). <br />
Journey time around 1 hour 10 min. Cost around £5 single (Heathrow is zone 6).<br />
<br />
* Or taxi, there are lots of online sites suggesting the cost is around £44, each way, and will take about 40 min (but this will be traffic dependent).<br />
<br />
=== From Gatwick: ===<br />
<br />
* Either: catch the Gatwick Express train from Gatwick to Victoria, then take a taxi to your hotel/UCL, or take the tube from Victoria to Warren Street (northbound on the Victoria Line). <br />
Journey time around 1 hour. Cost around £30 return, £17 single, plus tube fare £4 each way (taxi cost unknown).<br />
<br />
* Or taxi, there are lots of online sites suggesting the cost is around £60, each way, and will take about 1 hour (but this will be traffic dependent).<br />
<br />
=== From Stansted: ===<br />
<br />
* Either: catch the Stansted Express train from Stansted to Liverpool Street, then take a taxi to your hotel/UCL, or take the tube from Liverpool Street to Euston Square (westbound on the Hammersmith & City Line/Metropolitan Line, or the Circle Line towards Kings Cross St Pancras). <br />
Journey time around 1 hour. Cost around £27 return, £20 single, plus tube fare £4 each way (taxi cost unknown).<br />
<br />
* Or taxi, there are lots of online sites suggesting the cost is around £45, each way, and will take about 1 hour (but this will be traffic dependent).<br />
<br />
=Meeting Agenda= <br />
<br />
The agenda can be found here: [[2011_UCL_Meeting_Agenda]]<br />
<br />
=Attendees=<br />
<br />
{| {{Prettytable}} class='sortable'<br />
|-<br />
! Name<br />
! Organization<br />
! Arrival Date/Time at Airport <br />
! Departure Date/Time from Airport<br />
! Hotel booked?<br />
! Vegetarian (Y/N)<br />
|-<br />
|Ruth Lovering <br />
|BHF-UCL<br />
|N/A<br />
|N/A<br />
|Travel from home and St Giles Monday night<br />
|N<br />
|-<br />
|Peter D'Eustachio <br />
|NYU - Reactome<br />
|6 Nov 9:25 AM LHR<br />
|12 Nov 10:25 AM LHR<br />
|Regency House 71 Gower St<br />
|N<br />
|-<br />
|Varsha Khodiyar <br />
|BHF-UCL<br />
|N/A<br />
|N/A<br />
|Travel from home and St Giles Monday night<br />
|N<br />
|-<br />
|Emily Dimmer<br />
|UniProtKB<br />
|N/A<br />
|N/A<br />
|Travel from home and Ibis Monday night<br />
|N<br />
|-<br />
|Tony Sawford<br />
|UniProtKB<br />
|N/A<br />
|N/A<br />
|Travel from home and Ibis Monday night<br />
|N<br />
|-<br />
|Jane Lomax<br />
|GO-EBI<br />
|N/A<br />
|N/A<br />
|Travel from home<br />
|N<br />
|-<br />
|Seth Carbon<br />
|BBOP-LBNL<br />
|Sun, Nov 6, 11:25am<br />
|N/A<br />
|yes<br />
|N<br />
|-<br />
|Chris Mungall<br />
|BBOP-LBNL<br />
|N/A<br />
|N/A<br />
|not yet<br />
|Y<br />
|-<br />
|Suzanna Lewis<br />
|BBOP-LBNL<br />
|N/A<br />
|N/A<br />
|yes<br />
|N<br />
|-<br />
|Paola Roncaglia<br />
|GO-EBI<br />
|N/A<br />
|N/A<br />
|Travel from home & MIC Monday night<br />
|N<br />
|-<br />
|Mike Cherry<br />
|Stanford<br />
|Nov 7, LHR, 6:50AM<br />
|Nov 10, LHR, 1:35PM<br />
|The White Hall Hotel <br />
|N<br />
|-<br />
|Judy Blake<br />
|The Jackson Laboratory<br />
|Nov 6, LHR, 9:25 AM<br />
|Nov 10, LHR, 12:05 PM<br />
|don't know yet.<br />
|N<br />
|-<br />
|Brenley McIntosh <br />
|Tx A&M - EcoliWiki<br />
|?<br />
|?<br />
|not yet<br />
|N<br />
|-<br />
|Val Wood<br />
|PomBase<br />
|n/a<br />
|n/a<br />
|TBD may stay one night<br />
|N<br />
|-<br />
|Yasmin Alam-Faruque<br />
|UniProtKB-GOA<br />
|n/a<br />
|n/a<br />
|Travel from home and Ibis Monday night<br />
|Y<br />
|-<br />
|Rebecca Foulger<br />
|GO-EBI<br />
|N/A<br />
|N/A<br />
|Travel from home & MIC on Monday night<br />
|N<br />
|-<br />
|Rama Balakrishnan<br />
|SGD, Stanford<br />
|n/a<br />
|n/a<br />
|not yet<br />
|Y (vegan if possible)<br />
|-<br />
|Stan Laulederkind<br />
|RGD<br />
|Nov 6<br />
|Nov 11<br />
|not yet<br />
|N<br />
|-<br />
|David Hill <br />
|The Jackson Laboratory<br />
|?<br />
|?<br />
|not yet<br />
|N<br />
|-<br />
|Kimberly Van Auken<br />
|WormBase, Caltech<br />
|Nov 6, LHR, 6:35am<br />
|Nov 10, LHR, 10:30am<br />
|Premier Inn<br />
|Yes, please<br />
|-<br />
|Julie Park<br />
|SGD, Stanford<br />
|N/A<br />
|Nov 10, LHR, 1:35PM<br />
|Premier Inn<br />
|N<br />
|-<br />
|Midori Harris<br />
|PomBase, Cambridge<br />
|N/A<br />
|N/A<br />
|TBD - may stay one night<br />
|N (but don't want meat every meal)<br />
|-<br />
|Tanya Berardini<br />
|TAIR<br />
|?<br />
|?<br />
|not yet<br />
|N<br />
|-<br />
|Claire O'Donovan<br />
|UniProtKB<br />
|N/A<br />
|N/A<br />
|Travel from home and Ibis Monday night<br />
|gluten-free<br />
|-<br />
|Susan Tweedie<br />
|FlyBase<br />
|N/A<br />
|N/A<br />
|not yet<br />
|N<br />
|-<br />
|Prudence Mutowo <br />
|GOA-UniProtKB<br />
|N/A<br />
|N/A<br />
|Travel from home and Ibis Monday night<br />
|no rice<br />
|-<br />
|Donghui Li<br />
|TAIR<br />
|Nov 6 LHR 11:25AM<br />
|Nov 10 LHR 01:35PM<br />
|MIC<br />
|N<br />
|-<br />
|Rachael Huntley <br />
|GOA-UniProtKB<br />
|N/A<br />
|N/A<br />
|Travel from home and Ibis Monday night<br />
|Y<br />
|-<br />
|Heiko Dietze<br />
|BBOP-LBNL<br />
|Nov 6, 11:25 am LHR<br />
|Nov 10, 1:35 pm LHR<br />
|Holiday Inn<br />
|N<br />
|-<br />
|Antonia Lock<br />
|Pombase UCL<br />
|N/A<br />
|N/A<br />
|TBD may stay one night<br />
|N<br />
|-<br />
|Alex Michell<br />
|InterPro<br />
|N/A<br />
|N/A<br />
|TBD may stay one night<br />
|N<br />
|-<br />
|Amaia Sangrador<br />
|InterPro<br />
|N/A<br />
|N/A<br />
|TBD may stay one night<br />
|N<br />
|-<br />
|Pascale Gaudet <br />
|dictyBase - RefGenome - neXtProt<br />
|11/6 830 AM<br />
|11/9 1930 PM<br />
|St. Giles<br />
|?<br />
|-<br />
|Nick Brown <br />
|FlyBase<br />
|attending Wednesday only<br />
|N/A<br />
|N/A<br />
|?<br />
|-<br />
|}<br />
<br />
=Invited attendees =<br />
<br />
{| {{Prettytable}} class='sortable'<br />
|-<br />
! Name<br />
! Organization<br />
! Date presenting<br />
! Arrival Date/Time at Airport <br />
! Departure Date/Time from Airport<br />
! Hotel booked?<br />
! Vegetarian (Y/N)<br />
|-<br />
|Sandra Orchard<br />
|IntAct<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|Philippa Talmud<br />
|BHF-UCL<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
| Pablo Millan<br />
|IntAct<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|}<br />
<br />
=Remote Attendees=<br />
<br />
{| {{Prettytable}} class='sortable'<br />
|-<br />
! Name<br />
! Organization<br />
! email (needed to set up your remote access)<br />
! Time Zone<br />
|-<br />
|Doug Howe<br />
|ZFIN<br />
|dhowe@zfin.org<br />
|PDT (UTC -7)<br />
|-<br />
|Harold Drabkin<br />
|MGI<br />
|harold.drabkin@jax.org<br />
|EST<br />
|-<br />
|Mary Dolan<br />
|MGI<br />
|mdolan@informatics.jax.org<br />
|EST<br />
|-<br />
|Li Ni<br />
|MGI<br />
|ln@informatics.jax.org<br />
|EST<br />
|-<br />
|Alexander Diehl<br />
|University at Buffalo<br />
|addiehl@buffalo.edu<br />
|EST<br />
|-<br />
|Kim Rutherford<br />
|PomBase/University of Cambridge<br />
|kmr44@cam.ac.uk<br />
|NZDT (UTC+13)<br />
|-<br />
|Paul Thomas<br />
|USC<br />
|pdthomas@usc.edu<br />
|PST<br />
|-<br />
|Amelia Ireland<br />
|BBOP<br />
|aireland@lbl.gov<br />
|PST<br />
|}<br />
<br />
<br />
<br />
----<br />
Return to [[Consortium_Meetings]] page<br />
[[Category:Meetings]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=2012_Timeline&diff=379382012 Timeline2011-11-02T18:44:07Z<p>Girlwithglasses: </p>
<hr />
<div>[[2011_Timeline]]<br />
<br><br />
<br />
{|border="1" cellspacing="0" cellpadding="5" align="center"<br />
!Project<br />
!Personnel<br />
!January <br />
!February<br />
!March<br />
!April<br />
!May<br />
!June<br />
!July<br />
!August<br />
!September<br />
!October<br />
!November<br />
!December<br />
|-<br />
|[[Chemical terms in GO]]. Phase I: metabolism, binding, transport<br />
|Harold, David, Tanya, Jane, Chris, Becky, Paola<br />
|Fix GO/goche misalignments - add CHEBI xps to GO<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Chemical terms in GO]]. Phase II: improve enzyme functions, align with external databases inc. Rhea, MF-BP links for enzymes/metabolic processes<br />
|Amelia, Chris, David, Jane, Tanya, Becky, Paola<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Chemical terms in GO]]. Phase III: biological roles<br />
|Chris, David, Jane, Tanya, Becky, Paola<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[XP:biological_process_xp_cell|Cell ontology cross-products]]<br />
|Paola, Chris, Jane, Alex<br />
|<br />
|Review and improve biological_process_xp_cell<br />
|Make TG templates<br />
|Incorporate cell xps into GO<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Cross_Product_Guide|Internal Cross Products]]<br />
|David, Tanya, Jane, Chris, Paola, Becky<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Neuro_Behaviour_Ontology_(NBO)-GO_alignment|Behaviour: GO/NBO integration]]<br />
|Jane, Chris, George, Janna, David OS<br />
|Survey ontology consumers to examine the implications of including non-GO ids in GO <br />
|Generate file of mismatches between GO and NBO<br />
|Fix GO (and NBO) so they align<br />
|<br />
|<br />
|<br />
|Replace GO ids with equivalent NBO ids and definitions<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Signaling |Signaling overhaul]]<br />
|Becky, Alex, Sandra, Peter, Ruth and others<br />
|<br />
|Representing GPCR and second messenger signaling<br />
|<br />
|Ligand-gated ion channels that signal <br />
|<br />
|<br />
|Define intracellular signaling start and stops<br />
|<br />
|Ligand-mediated signaling pathways and receptor-mediated signaling pathways<br />
|<br />
|<br />
|<br />
|-<br />
|[[Neurobiology Project|Neurological processes and components]]<br />
|Jane, Chris, David and Paola<br />
|<br />
| <br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|Cardiac Conduction Ontology Development<br />
|Ruth, Stan, Doug, David, Tanya, Becky, Paola and community experts<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|Apoptosis<br />
|Emily, Becky, Paola, Pablo and community experts<br />
|Revisiting second-level terms: feedback from experts<br />
|Revisiting existing annotations to second-level terms<br />
|<br />
|Final overall structure: sanity checks and completion<br />
|<br />
|Write paper<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|-<br />
|[[Virus_terms|Viral processes]]<br />
|Jane, Philippe Le Mercier (SIB) and community experts<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
<br />
[[Category:Ontology]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37846Gene Product Association Data (GPAD) Format (Archived)2011-11-01T15:20:54Z<p>Girlwithglasses: /* File Body */</p>
<hr />
<div>An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation XP (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology] and standardization of the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
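<br />
As a rough illustration of how simple the conversion could be, the sketch below splits parsed GAF rows into a de-duplicated gene product table and a list of annotation records. It follows the column classification given in the File Body table for columns 1 to 16 (columns 1, 2 and 13 go to both outputs) and leaves column 17 out. The function name <code>split_rows</code>, the output layout and the sample values are hypothetical and are not part of any GO specification.<br />
<br />
```python
from typing import Dict, Iterable, List, Sequence, Tuple

# 1-based GAF column numbers, taken from the File Body table.
# Columns 1, 2 and 13 are needed by both outputs.
GP_COLS = [1, 2, 3, 10, 11, 12, 13]
ANNOT_COLS = [1, 2, 4, 5, 6, 7, 8, 9, 13, 14, 15, 16]

Key = Tuple[str, str]


def split_rows(
    rows: Iterable[Sequence[str]],
) -> Tuple[Dict[Key, Tuple[str, ...]], List[Tuple[str, ...]]]:
    """Split GAF rows into gene product records and annotation records.

    Gene product records are keyed on (DB, DB_Object_ID), so the repeated
    gene product data is stored once however many annotations there are.
    """
    gene_products: Dict[Key, Tuple[str, ...]] = {}
    annotations: List[Tuple[str, ...]] = []
    for row in rows:
        # Pad short rows so optional trailing columns can be indexed safely.
        cells = list(row) + [""] * (16 - len(row))
        key = (cells[0], cells[1])
        # Keep the first gene product record seen for each key.
        gene_products.setdefault(key, tuple(cells[c - 1] for c in GP_COLS))
        annotations.append(tuple(cells[c - 1] for c in ANNOT_COLS))
    return gene_products, annotations


# Hypothetical sample data: one gene product with two annotations.
row1 = ["ExampleDB", "EX:0001", "exA", "", "GO:0005515", "PMID:1", "IPI", "",
        "F", "example protein A", "", "protein", "taxon:4896", "20110101",
        "ExampleDB", ""]
row2 = ["ExampleDB", "EX:0001", "exA", "", "GO:0005634", "PMID:2", "IDA", "",
        "C", "example protein A", "", "protein", "taxon:4896", "20110102",
        "ExampleDB", ""]

gene_products, annotations = split_rows([row1, row2])
```
<br />
Keying the gene product table on DB and DB_Object_ID is one possible design choice; it is what lets a gene product with no annotations be listed in the gene product file on its own.<br />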
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
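A parser can separate the header from the body by checking the first character of each line. The following is a minimal Python sketch rather than part of any GO tooling; the function name <code>split_header</code> and the decision to keep only <code>!key: value</code> lines as metadata are assumptions made for illustration.<br />

```python
from typing import Dict, List, Tuple


def split_header(lines: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Separate '!' header lines from data lines.

    Returns (metadata, body). metadata holds the 'key: value' pairs found in
    header lines; body holds every non-empty line that does not start with '!'.
    """
    metadata: Dict[str, str] = {}
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("!"):
            key, sep, value = line[1:].partition(":")
            # Keep only lines shaped like '!key: value'. Free-text comments
            # ('! The above ...') and bare '!' lines are ignored.
            if sep and key and not key[0].isspace():
                metadata[key.strip()] = value.strip()
        elif line.strip():
            body.append(line)
    return metadata, body
```

Applied to the header shown above, <code>metadata["gaf-version"]</code> would hold <code>2.0</code>, which a loader could check before reading the body.<br />
<br />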
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
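The column table above maps directly onto a tab-separated record. The sketch below is one possible way to read a single GAF 2.0 line into named fields. The Python names in <code>GAF_COLUMNS</code> are illustrative labels rather than official identifiers, and the pipe separator for multi-valued columns is assumed from the taxon values in the example further down this page.<br />

```python
from typing import Dict, List

# Illustrative field names, in the column order given in the table above.
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence_code", "With_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Date", "Assigned_by", "Annotation_XP", "Spliceform",
]


def parse_gaf_line(line: str) -> Dict[str, str]:
    """Split one tab-separated GAF line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) > len(GAF_COLUMNS):
        raise ValueError(f"expected at most 17 columns, got {len(fields)}")
    # Trailing optional columns may be missing; pad them with empty strings.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    return dict(zip(GAF_COLUMNS, fields))


def multi(value: str) -> List[str]:
    """Split a multi-valued column (assumed pipe-separated) into its parts."""
    return [part for part in value.split("|") if part]
```
<br />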
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || (NOT or integral_to)? (other_organism or colocalizes_with or contributes_to)? annotation_relation<br />
|- <br />
| GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
Note II: Unlike GAF 2.0, there is no extra column for spliceforms; the spliceform ID goes directly in the DB_Object_ID. The relation between the spliceform ID and the canonical form is held in the GPI file.<br />
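<br />
The two notes above, together with the column table, can be sketched as a conversion from one parsed GAF record to one GPAD row. This is a hedged illustration only: the dict keys, the function name <code>gaf_to_gpad</code> and the caller-supplied <code>eco_ids</code> mapping (GAF evidence code to ECO ID) are assumptions, as is the pipe separator for qualifiers and taxa.<br />

```python
from typing import Dict, List


def gaf_to_gpad(gaf: Dict[str, str], eco_ids: Dict[str, str]) -> List[str]:
    """Build the 11 GPAD columns from one GAF record (hypothetical key names)."""
    qualifiers = [q for q in gaf["Qualifier"].split("|") if q]
    # Note: NOT moves out of the qualifier column and into the GO ID column.
    negated = "NOT" in qualifiers
    remaining = [q for q in qualifiers if q != "NOT"]
    go_id = ("NOT " if negated else "") + gaf["GO_ID"]
    # Note II: a spliceform ID, when present, replaces the DB_Object_ID.
    object_id = gaf.get("Spliceform") or gaf["DB_Object_ID"]
    # GAF column 13 may hold a second taxon for multi-organism processes.
    taxa = gaf["Taxon"].split("|")
    interacting_taxon = taxa[1].split(":", 1)[-1] if len(taxa) > 1 else ""
    return [
        gaf["DB"],
        object_id,
        "|".join(remaining),
        go_id,
        gaf["DB_Reference"],
        eco_ids[gaf["Evidence_code"]],  # raises KeyError for unmapped codes
        gaf["With_From"],
        interacting_taxon,
        gaf["Date"],
        gaf["Assigned_by"],
        gaf["Annotation_XP"],
    ]
```

Failing loudly on an unmapped evidence code is a deliberate choice here, so that annotations are not silently written with a blank evidence column.<br />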
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
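As a small illustration of the header layout described above, the following Python sketch assembles GPI header lines with the version declaration first and the tab-separated column list last. The function name <code>gpi_header</code> and the <code>metadata</code> argument are hypothetical; the column names are those listed above.<br />

```python
from typing import Dict, List

GPI_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Type", "Taxon", "DB_Object_Symbol",
    "DB_Object_Name", "DB_Object_Synonym(s)", "Parent_GP_ID", "DB_Object_Xrefs",
]


def gpi_header(metadata: Dict[str, str]) -> List[str]:
    """Return GPI header lines: version first, column list last."""
    lines = ["!gpi-version: 1.0"]
    # Each metadata entry becomes a '!key: value' comment line.
    lines += [f"!{key}: {value}" for key, value in metadata.items()]
    # The final header line names the columns, separated by tabs.
    lines.append("!" + "\t".join(GPI_COLUMNS))
    return lines
```
<br />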
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
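<br />
The two tables above can be read as a recipe for deriving GPI entries from existing GAF records: a record with no spliceform yields an entry for the object in column 2, while a record with a spliceform yields an entry for the spliceform whose parent is the object in column 2. The Python sketch below illustrates this under stated assumptions: the dict keys and the function name <code>collect_gpi</code> are hypothetical, the taxon is written without its <code>taxon:</code> prefix (as in the example table on this page), and xrefs are supplied by the caller.<br />

```python
from typing import Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, str]  # (DB, object ID)


def collect_gpi(gaf_rows: Iterable[Dict[str, str]],
                xrefs: Optional[Dict[Key, List[str]]] = None) -> List[List[str]]:
    """Return one 9-column GPI row per distinct gene product or spliceform."""
    xrefs = xrefs or {}
    entries: Dict[Key, List[str]] = {}
    for gaf in gaf_rows:
        spliceform = gaf.get("Spliceform", "")
        object_id = spliceform or gaf["DB_Object_ID"]
        # Only spliceform entries carry a parent, written here as DB:ID.
        parent = f'{gaf["DB"]}:{gaf["DB_Object_ID"]}' if spliceform else ""
        # Keep the organism's own taxon only, without the 'taxon:' prefix.
        taxon = gaf["Taxon"].split("|")[0].split(":", 1)[-1]
        key = (gaf["DB"], object_id)
        # Gene product data is stated once, however many annotations exist.
        entries.setdefault(key, [
            gaf["DB"], object_id, gaf["DB_Object_Type"], taxon,
            gaf["DB_Object_Symbol"], gaf["DB_Object_Name"],
            gaf["DB_Object_Synonym"], parent,
            "|".join(xrefs.get(key, [])),  # multiple xrefs are pipe-separated
        ])
    return list(entries.values())
```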
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Current GAF 2.0 Format===<br />
<br />
The following example, taken from the page http://geneontology.org/GO.annotation.fields.shtml, shows the current GAF file structure (shaded background: annotation data; blue text: gene product data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || 745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || 2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
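Comparing the association table above with the GAF example shows which ECO ID each GAF evidence code was replaced by. The Python sketch below records only the pairs that appear in those two tables; it is not a complete or authoritative mapping, the names <code>EXAMPLE_ECO_IDS</code> and <code>eco_id_for</code> are hypothetical, and the IDs should be checked against the Evidence Code Ontology itself before use.<br />

```python
from typing import Dict

# Pairs read off the example rows on this page (GAF code -> ECO ID).
EXAMPLE_ECO_IDS: Dict[str, str] = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def eco_id_for(evidence_code: str) -> str:
    """Look up the ECO ID used in the example for a GAF evidence code."""
    try:
        return EXAMPLE_ECO_IDS[evidence_code]
    except KeyError:
        # Codes absent from the example are rejected rather than guessed.
        raise ValueError(f"no ECO ID in the example for {evidence_code!r}") from None
```
<br />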
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
<br />
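The standalone script mentioned above could, in one direction, join each GPAD row to its GPI entry to rebuild a 17-column GAF row. The Python sketch below is an illustration under stated assumptions, not an existing GO tool. The function name <code>merge_to_gaf</code> is hypothetical. Because Aspect (GAF column 9) appears in neither proposed file, the sketch assumes a caller-supplied lookup <code>aspect_of</code> derived from the ontology, and a lookup <code>gaf_code_of</code> from ECO IDs back to GAF evidence codes.<br />

```python
from typing import Dict, List, Tuple


def merge_to_gaf(gpad: List[str],
                 gpi_index: Dict[Tuple[str, str], List[str]],
                 aspect_of: Dict[str, str],
                 gaf_code_of: Dict[str, str]) -> List[str]:
    """Join one 11-column GPAD row with its 9-column GPI row into GAF order."""
    (db, object_id, qualifier, go_id, refs, eco,
     with_from, other_taxon, date, assigned_by, xp) = gpad
    (_, _, obj_type, taxon, symbol, name,
     synonyms, parent, _xrefs) = gpi_index[(db, object_id)]

    qualifiers = [q for q in qualifier.split("|") if q]
    if go_id.startswith("NOT "):
        # Undo the move of NOT into the GO ID column.
        go_id = go_id[len("NOT "):]
        qualifiers.insert(0, "NOT")
    if parent:
        # A spliceform: the parent goes in column 2, the spliceform in column 17.
        gaf_object_id, spliceform = parent.split(":", 1)[-1], object_id
    else:
        gaf_object_id, spliceform = object_id, ""
    taxa = ["taxon:" + taxon] + (["taxon:" + other_taxon] if other_taxon else [])
    return [db, gaf_object_id, symbol, "|".join(qualifiers), go_id, refs,
            gaf_code_of[eco], with_from, aspect_of[go_id], name, synonyms,
            obj_type, "|".join(taxa), date, assigned_by, xp, spliceform]
```
<br />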
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation XP || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
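As an illustration of the gp_information layout summarized above, the Python sketch below reads one tab-separated line into named fields. It is a sketch only, and the readme files linked above remain the authoritative description. The function name <code>parse_gp_information_line</code> is hypothetical, and splitting the synonym and target-set columns on the pipe character is an assumption based on the example content shown in the table.<br />

```python
from typing import Dict, List, Union

GP_INFORMATION_COLUMNS = [
    "DB", "DB_Subset", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Annotation_Target_Set", "Annotation_Completed", "Parent_Object_ID",
]
# Columns whose example content in the table above is pipe-separated.
PIPE_SEPARATED = {"DB_Object_Synonym", "Annotation_Target_Set"}


def parse_gp_information_line(line: str) -> Dict[str, Union[str, List[str]]]:
    """Split one gp_information line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Pad missing trailing optional columns with empty strings.
    fields += [""] * (len(GP_INFORMATION_COLUMNS) - len(fields))
    record: Dict[str, Union[str, List[str]]] = {}
    for name, value in zip(GP_INFORMATION_COLUMNS, fields):
        if name in PIPE_SEPARATED:
            record[name] = [part for part in value.split("|") if part]
        else:
            record[name] = value
    return record
```
<br />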
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37845Gene Product Association Data (GPAD) Format (Archived)2011-11-01T15:20:00Z<p>Girlwithglasses: /* File Body */</p>
<hr />
<div>An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the annotation data: term IDs, evidence codes, references, annotation XP (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology] and by standardizing the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data could provide it in either format, and data consumers could download it in either GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || (NOT or integral_to)? (other_organism\|colocalizes_with\|contributes_to)? annotation_relation<br />
|- <br />
| GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
Note II: Unlike GAF 2.0, there is no extra column for spliceforms; the spliceform ID goes directly in the DB_Object_ID. The relation between the spliceform ID and the canonical form is held in the GPI file.<br />
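<br />
The column table and the two notes above could be read by a parser along the following lines. This is a sketch under stated assumptions: the field names and the function <code>parse_gpad_row</code> are made up for illustration, columns are assumed to be tab-separated, and multi-valued columns are assumed to be pipe-separated as in GAF.<br />
<br />
```python
# Column order follows the File Body table above (11 columns).
GPAD_FIELDS = [
    "db", "db_object_id", "qualifier", "go_id", "references", "evidence",
    "with_from", "interacting_taxon", "date", "assigned_by", "annotation_xp",
]
REQUIRED = {"db", "db_object_id", "go_id", "references", "evidence",
            "date", "assigned_by"}
# Assumption: columns with cardinality "or greater" use "|" between values.
MULTI_VALUED = {"qualifier", "references", "with_from", "annotation_xp"}


def parse_gpad_row(line):
    """Parse one tab-separated GPAD row into a dict."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(GPAD_FIELDS):
        raise ValueError(f"expected {len(GPAD_FIELDS)} columns, got {len(values)}")
    row = dict(zip(GPAD_FIELDS, values))
    missing = sorted(name for name in REQUIRED if not row[name])
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    # Per the note above, NOT travels with the GO ID, e.g. "NOT GO:0003963".
    row["negated"] = row["go_id"].startswith("NOT ")
    if row["negated"]:
        row["go_id"] = row["go_id"][len("NOT "):].strip()
    for name in MULTI_VALUED:
        row[name] = [v for v in row[name].split("|") if v]
    return row


example = "\t".join([
    "SGD", "S000005370", "", "NOT GO:0003963", "SGD_REF:S000039255",
    "ECO:0000002", "", "", "20020530", "SGD", "",
])
row = parse_gpad_row(example)
```
<br />
Keeping the negation as a separate boolean means downstream code cannot accidentally treat a negated annotation as a positive one simply by ignoring a qualifier column.<br />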
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
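If the final header line lists the columns as suggested above, a reader can use it to label each row. The sketch below is illustrative only: <code>read_gpi</code> and <code>DEFAULT_COLUMNS</code> are hypothetical names, and falling back to the column order given above when no column line is present is an assumption, not part of the proposal.<br />
<br />
```python
# Column order from the suggested header line above.
DEFAULT_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Type", "Taxon", "DB_Object_Symbol",
    "DB_Object_Name", "DB_Object_Synonym(s)", "Parent_GP_ID", "DB_Object_Xrefs",
]


def read_gpi(lines):
    """Return (columns, records), where each record is a dict keyed by column."""
    last_header = ""
    columns = None
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("!"):
            last_header = line[1:]
            continue
        if not line:
            continue
        if columns is None:
            # Use the final header line only if it looks like a column list.
            names = last_header.split("\t")
            columns = names if len(names) > 1 else DEFAULT_COLUMNS
        values = line.split("\t")
        values += [""] * (len(columns) - len(values))  # pad short rows
        records.append(dict(zip(columns, values)))
    return columns or DEFAULT_COLUMNS, records


sample = [
    "!gpi-version: 1.0\n",
    "!DB\tDB_Object_ID\tDB_Object_Type\tTaxon\tDB_Object_Symbol\t"
    "DB_Object_Name\tDB_Object_Synonym(s)\tParent_GP_ID\tDB_Object_Xrefs\n",
    "SGD\tS000000296\tgene\t4932\tPHO3\tacid phosphatase\tYBR092C\t\tUniProt:NE92D8\n",
]
columns, records = read_gpi(sample)
```
<br />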
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
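<br />
The Parent GP ID and xrefs columns described above could be consumed roughly as follows. This is a sketch, not an agreed implementation: the helper names are hypothetical, and the <code>DB:ID</code> form of the keys is assumed from the example tables in this proposal.<br />
<br />
```python
def split_xrefs(cell):
    """Split a pipe-separated xrefs cell, ignoring empty entries."""
    return [x for x in cell.split("|") if x]


def canonical_form(object_id, parent_of):
    """Follow Parent GP ID links until an entry without a parent is reached."""
    seen = {object_id}
    while parent_of.get(object_id):
        object_id = parent_of[object_id]
        if object_id in seen:
            raise ValueError("cyclic Parent GP ID chain at " + object_id)
        seen.add(object_id)
    return object_id


# Parent GP ID column, keyed by "DB:DB_Object_ID"; an empty string means no parent.
parent_of = {
    "UniProtKB:Q4VCS5": "",
    "UniProtKB:Q4VCS5-1": "UniProtKB:Q4VCS5",
    "UniProtKB:Q4VCS5-2": "UniProtKB:Q4VCS5",
}
canonical = canonical_form("UniProtKB:Q4VCS5-1", parent_of)
```
<br />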
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || 745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || 2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
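<br />
As a rough illustration of such a script, the function below joins one annotation row with its gene product record to rebuild a 17-column GAF row. It is a sketch under several assumptions: <code>gpad_to_gaf</code> and its arguments are hypothetical, the Aspect column (which appears in neither new file) is supplied by a caller-provided lookup, and the ECO evidence ID is passed through unchanged because no ECO-to-GAF code mapping is defined here.<br />
<br />
```python
def gpad_to_gaf(gpad, gpi_by_id, aspect_of):
    """Rebuild a GAF 2.0 row (list of 17 strings).

    gpad      -- list of the 11 annotation columns
    gpi_by_id -- {(DB, DB_Object_ID): dict of gene product columns}
    aspect_of -- {GO ID: "F", "P" or "C"} (assumed to come from the ontology)
    """
    (db, obj_id, qualifier, go_id, refs, evidence,
     with_from, interacting_taxon, date, assigned_by, xp) = gpad
    gp = gpi_by_id[(db, obj_id)]

    # Move NOT from the GO ID column back into the qualifier column.
    if go_id.startswith("NOT "):
        go_id = go_id[len("NOT "):].strip()
        qualifier = "|".join(q for q in ("NOT", qualifier) if q)

    # GAF column 13 holds the GP taxon plus the optional interacting taxon.
    taxon = "taxon:" + gp["Taxon"]
    if interacting_taxon:
        taxon += "|taxon:" + interacting_taxon

    # A spliceform entry goes to column 17; its parent goes to column 2.
    parent = gp.get("Parent_GP_ID", "")
    if parent:
        gaf_id, spliceform = parent.split(":", 1)[-1], obj_id
    else:
        gaf_id, spliceform = obj_id, ""

    return [
        db, gaf_id, gp["DB_Object_Symbol"], qualifier, go_id, refs, evidence,
        with_from, aspect_of[go_id], gp.get("DB_Object_Name", ""),
        gp.get("DB_Object_Synonym(s)", ""), gp["DB_Object_Type"], taxon,
        date, assigned_by, xp, spliceform,
    ]


gpi_by_id = {
    ("SGD", "S000005370"): {
        "DB_Object_Type": "gene", "Taxon": "4932", "DB_Object_Symbol": "RCL1",
        "DB_Object_Name": "aminodeoxychorismate synthase",
        "DB_Object_Synonym(s)": "YOL010W", "Parent_GP_ID": "",
    },
}
gpad = ["SGD", "S000005370", "", "GO:0006406", "SGD_REF:S000069956",
        "ECO:0000305", "GO:0000346", "745953", "20030221", "SGD", ""]
gaf = gpad_to_gaf(gpad, gpi_by_id, {"GO:0006406": "P"})
```
<br />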
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation XP || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT<nowiki>|</nowiki>KIAA1071<nowiki>|</nowiki>IPI:IPI00163085<nowiki>|</nowiki>IPI:IPI00644547<nowiki>|</nowiki>UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL<nowiki>|</nowiki>KRUK<nowiki>|</nowiki>Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
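A row of the gp_information file summarized above could be read as follows. This is only a sketch: the function name is hypothetical, the column names are taken from the table above, and tab-separated columns with pipe-separated multi-valued cells are assumed from the example content; the readme files linked above remain the authoritative description.<br />
<br />
```python
# Column names in file order, as listed in the gp_information table above.
GP_INFORMATION_COLUMNS = [
    "DB", "DB_Subset", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym(s)", "DB_Object_Type", "Taxon",
    "Annotation_Target_Set", "Annotation_Completed", "Parent_Object_ID",
]
# Columns with cardinality "0 or greater"; assumed to be pipe-separated.
MULTI_VALUED = {"DB_Object_Synonym(s)", "Annotation_Target_Set"}


def parse_gp_information(line):
    """Parse one tab-separated gp_information row into a dict."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(GP_INFORMATION_COLUMNS):
        raise ValueError(
            f"expected {len(GP_INFORMATION_COLUMNS)} columns, got {len(values)}"
        )
    record = {}
    for name, value in zip(GP_INFORMATION_COLUMNS, values):
        if name in MULTI_VALUED:
            record[name] = [v for v in value.split("|") if v]
        else:
            record[name] = value
    return record


example = "\t".join([
    "UniProtKB", "Swiss-Prot", "Q4VCS5", "AMOT", "Angiomotin",
    "AMOT|KIAA1071|IPI:IPI00163085", "protein", "taxon:9606",
    "BHF-UCL|KRUK", "", "",
])
record = parse_gp_information(example)
```
<br />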
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37843Gene Product Association Data (GPAD) Format (Archived)2011-11-01T15:18:41Z<p>Girlwithglasses: /* File Body */</p>
<hr />
<div>An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation XP (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
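<br />
To illustrate why the conversion is expected to be simple, the sketch below splits one GAF 2.0 row into an annotation row and a gene product row. The function name <code>split_gaf_row</code> is hypothetical, and two simplifications are assumed: the GAF evidence code is copied as-is rather than mapped to an ECO ID, and the xrefs column of the gene product row is left empty because GAF does not carry that data.<br />
<br />
```python
def split_gaf_row(gaf):
    """Split a 17-column GAF 2.0 row into (annotation_row, gene_product_row)."""
    (db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from, _aspect,
     name, synonyms, obj_type, taxon, date, assigned_by, xp, spliceform) = gaf

    # GAF column 13 may hold a second taxon for multi-organism processes.
    taxa = [t.replace("taxon:", "") for t in taxon.split("|")]
    own_taxon = taxa[0]
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    # NOT moves from the qualifier column to the GO ID column.
    qualifiers = [q for q in qualifier.split("|") if q]
    if "NOT" in qualifiers:
        qualifiers.remove("NOT")
        go_id = "NOT " + go_id

    # A spliceform becomes the annotated object; the old column 2 is its parent.
    annotated_id = spliceform or obj_id
    parent = f"{db}:{obj_id}" if spliceform else ""

    annotation = [db, annotated_id, "|".join(qualifiers), go_id, refs, evidence,
                  with_from, interacting_taxon, date, assigned_by, xp]
    gene_product = [db, annotated_id, obj_type, own_taxon, symbol, name,
                    synonyms, parent, ""]
    return annotation, gene_product


gaf_row = ["SGD", "S000005370", "RCL1", "NOT", "GO:0003963",
           "SGD_REF:S000039255", "IDA", "", "F",
           "aminodeoxychorismate synthase", "YOL010W", "gene",
           "taxon:4932", "20020530", "SGD", "", ""]
annotation, gene_product = split_gaf_row(gaf_row)
```
<br />
Because every annotation to the same object yields an identical gene product row, a real converter would deduplicate those rows before writing the gene product file.<br />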
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
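Since all three formats discussed on this page declare themselves on the first line, a consumer can decide how to read a file before looking at any data. The following is a minimal sketch; the names <code>detect_format</code> and <code>is_comment</code> are hypothetical.<br />
<br />
```python
import re

# Matches "!gaf-version: 2.0", "!gpad-version: 1.0" and "!gpi-version: 1.0".
VERSION_LINE = re.compile(r"^!(gaf|gpad|gpi)-version:\s*(\S+)")


def detect_format(first_line):
    """Return (format, version) from the declaration on the first line."""
    match = VERSION_LINE.match(first_line)
    if match is None:
        raise ValueError("not a format declaration: " + repr(first_line))
    return match.group(1), match.group(2)


def is_comment(line):
    """Lines starting with an exclamation mark are ignored by parsers."""
    return line.startswith("!")


file_format, file_version = detect_format("!gaf-version: 2.0\n")
```
<br />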
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || (NOT or integral_to)? (other_organism<nowiki>|</nowiki>colocalizes_with<nowiki>|</nowiki>contributes_to)? annotation_relation<br />
|- <br />
| GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
Note II: Unlike GAF 2.0, there is no extra column for spliceforms; the spliceform ID goes directly in the DB_Object_ID. The relation between the spliceform ID and the canonical form is held in the GPI file.<br />
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
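As a rough illustration of the pipe-separated xrefs column, the sketch below groups the entries by database prefix so that a UniProt or NCBI ID can be picked out. The function name <code>group_xrefs</code> and the second xref in the usage note (<code>NCBI_GP:EXAMPLE1</code>) are hypothetical; only <code>UniProt:NE92D8</code> comes from the example table on this page.

```python
def group_xrefs(xrefs_column):
    """Group a pipe-separated xrefs column by database prefix.

    Assumes each entry looks like "DB:local_id"; entries without a
    colon end up with an empty local ID.
    """
    grouped = {}
    for xref in xrefs_column.split("|"):
        if not xref:
            continue  # skip empty entries, e.g. from an empty column
        db, _, local_id = xref.partition(":")
        grouped.setdefault(db, []).append(local_id)
    return grouped
```

For instance, <code>group_xrefs("UniProt:NE92D8|NCBI_GP:EXAMPLE1")</code> would give one list of IDs per database, and an empty column gives an empty dictionary.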
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || 745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || 2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
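Two of the differences between the old and new example rows are mechanical: the evidence code is replaced by an ECO ID, and the second taxon of GAF column 13 becomes the interacting taxon. A minimal sketch of both follows. The dictionary only lists the pairs that can be read off the two example tables on this page and should not be taken as an authoritative mapping; the names <code>EXAMPLE_ECO</code> and <code>split_taxon</code> are made up for illustration.

```python
# Evidence code pairs read off the example tables on this page.
# Illustrative only: this is not an authoritative GO-to-ECO mapping.
EXAMPLE_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def strip_taxon_prefix(value):
    """Turn "taxon:4932" into "4932"; leave other values untouched."""
    prefix = "taxon:"
    return value[len(prefix):] if value.startswith(prefix) else value


def split_taxon(gaf_taxon):
    """Split GAF column 13 into (gene product taxon, interacting taxon)."""
    ids = [strip_taxon_prefix(t) for t in gaf_taxon.split("|") if t]
    own = ids[0] if ids else ""
    interacting = ids[1] if len(ids) > 1 else ""
    return own, interacting
```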
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
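A sketch of the "two files into one" direction is given below, assuming rows have already been split into lists of strings in the column order of the two example tables above (12 association fields, 9 gene product fields). The function <code>merge_to_gaf</code> and its <code>aspect_of</code> parameter are hypothetical: the Aspect column is in neither new file, so it would have to be looked up from the ontology, and here it is simply supplied by the caller. Evidence codes are passed through unchanged as ECO IDs.

```python
def merge_to_gaf(assoc_rows, gp_rows, aspect_of=None):
    """Rebuild 17-column GAF-style rows from the two proposed files.

    assoc_rows: 12-field lists, ordered as in the association example.
    gp_rows:    9-field lists, ordered as in the gene product example.
    aspect_of:  optional dict mapping GO ID to aspect letter (assumed to
                come from elsewhere, e.g. the ontology file).
    """
    aspect_of = aspect_of or {}
    # Index gene product rows by (DB, DB_Object_ID) for the join.
    gp_index = {(row[0], row[1]): row for row in gp_rows}
    merged = []
    for assoc in assoc_rows:
        (db, obj_id, qualifier, go_id, refs, evidence, with_from,
         extra_taxon, date, assigned_by, xp, spliceform) = assoc
        (_, _, obj_type, taxon, symbol, name, synonyms,
         _parent, _xrefs) = gp_index[(db, obj_id)]
        # The example keeps NOT in the GO ID column; move it back.
        if go_id.startswith("NOT "):
            go_id = go_id[len("NOT "):]
            qualifier = "NOT|" + qualifier if qualifier else "NOT"
        taxon_col = "taxon:" + taxon
        if extra_taxon:
            taxon_col += "|taxon:" + extra_taxon
        merged.append([
            db, obj_id, symbol, qualifier, go_id, refs, evidence,
            with_from, aspect_of.get(go_id, ""), name, synonyms,
            obj_type, taxon_col, date, assigned_by, xp, spliceform,
        ])
    return merged
```

The join key is the pair (DB, DB_Object_ID), so every annotated object must have an entry in the gene product file; a missing entry raises a <code>KeyError</code> in this sketch rather than being silently skipped.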
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation XP || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
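The column summary above could be turned into a small reader along the following lines. This is only a sketch: it assumes the file is tab-delimited and that the multi-valued columns (synonyms, target sets) are pipe-separated, as the example content suggests; the readme files linked above are the definitive description. The names <code>GpInfo</code> and <code>parse_gp_information</code> are made up here.

```python
from collections import namedtuple

# Field order follows the gp_information column table above.
GpInfo = namedtuple("GpInfo", [
    "db", "db_subset", "db_object_id", "db_object_symbol",
    "db_object_name", "db_object_synonyms", "db_object_type", "taxon",
    "annotation_target_set", "annotation_completed", "parent_object_id",
])

# Columns with cardinality "0 or greater", assumed pipe-separated.
LIST_FIELDS = {"db_object_synonyms", "annotation_target_set"}


def parse_gp_information(line):
    """Parse one tab-delimited gp_information line into a GpInfo tuple."""
    values = line.rstrip("\n").split("\t")
    # Pad short lines so that trailing optional columns may be omitted.
    values += [""] * (len(GpInfo._fields) - len(values))
    parsed = []
    for name, value in zip(GpInfo._fields, values):
        if name in LIST_FIELDS:
            parsed.append([v for v in value.split("|") if v])
        else:
            parsed.append(value)
    return GpInfo(*parsed)
```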
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37842Gene Product Association Data (GPAD) Format (Archived)2011-11-01T15:17:16Z<p>Girlwithglasses: /* File Body */</p>
<hr />
<div>An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation XP (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
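The header convention described above (a version declaration followed by lines that start with an exclamation mark) can be handled with a few lines of code. The following is a minimal sketch, not an official parser; the function name <code>read_header</code> is an assumption, and the leading space on the example lines above is taken to be wiki formatting rather than part of the file.

```python
def read_header(lines, version_prefix="!gaf-version:"):
    """Return (version, data_lines) for an iterable of file lines.

    Lines starting with "!" are treated as comments and skipped, except
    that the version declaration is recorded. Blank lines are dropped.
    """
    version = None
    data_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("!"):
            if line.startswith(version_prefix):
                version = line[len(version_prefix):].strip()
            continue
        if line:
            data_lines.append(line)
    return version, data_lines
```

Passing a different <code>version_prefix</code> would let the same function read any other header that follows the same exclamation-mark convention.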
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || (NOT or integral_to)? (other_organism|colocs|contributes_to)? annotation_relation<br />
|- <br />
| GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
Note II: Unlike GAF 2.0, there is no extra column for spliceforms; the spliceform ID goes directly in the DB_Object_ID. The relation between the spliceform ID and the canonical form is held in the GPI file.<br />
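The table and the two notes above amount to a row-by-row transformation, which might be sketched as follows. The function name <code>gaf_to_gpad</code> is hypothetical, the input is assumed to be one GAF 2.0 row already split into 17 strings, and the evidence code is passed through unchanged (converting it to an ECO ID would need a separate mapping that is not given here).

```python
def gaf_to_gpad(gaf):
    """Convert one 17-field GAF 2.0 row into the 11 proposed GPAD fields."""
    (db, obj_id, _symbol, qualifier, go_id, refs, evidence, with_from,
     _aspect, _name, _synonyms, _obj_type, taxon, date, assigned_by,
     xp, spliceform) = gaf

    # Note: NOT moves out of the qualifier column and into the GO ID column.
    qualifiers = [q for q in qualifier.split("|") if q]
    if "NOT" in qualifiers:
        qualifiers.remove("NOT")
        go_id = "NOT " + go_id

    # Note II: a spliceform ID replaces the canonical DB_Object_ID.
    if spliceform:
        obj_id = spliceform

    # A second entry in GAF column 13 is the interacting taxon.
    taxa = taxon.split("|")
    interacting = taxa[1].replace("taxon:", "", 1) if len(taxa) > 1 else ""

    return [db, obj_id, "|".join(qualifiers), go_id, refs, evidence,
            with_from, interacting, date, assigned_by, xp]
```

The gene product columns (symbol, name, synonyms, type, own taxon) are deliberately discarded here, since they belong in the GPI file.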
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
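If the final header line does carry the tab-separated column names as suggested, a reader can use it to label the fields instead of hard-coding their order. A minimal sketch, assuming real tabs between the names (they appear as spaces in the listing above) and using the made-up function name <code>columns_from_header</code>:

```python
def columns_from_header(header_lines):
    """Return the column names from the last "!" line of a header.

    Returns None when the last comment line does not look like a
    tab-separated column list (for example a lone "!").
    """
    comments = [l.rstrip("\n") for l in header_lines if l.startswith("!")]
    if not comments:
        return None
    names = comments[-1][1:].split("\t")  # drop the leading "!"
    if len(names) < 2:
        return None
    return names
```

A data line could then be labelled with <code>dict(zip(names, line.split("\t")))</code>, which keeps the reader working if columns are later added to the end of the format.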
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
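The two tables above differ only in where the object ID and parent ID come from, so a single function can produce either kind of entry from a GAF 2.0 row. The sketch below assumes the row is a list of 17 strings; the function name <code>gpi_row_from_gaf</code> is hypothetical, and writing the parent as <code>DB:ID</code> and the taxon as a bare number are choices made to match the example data on this page rather than something the tables specify. The xrefs column is left empty because GAF carries no such data.

```python
def gpi_row_from_gaf(gaf):
    """Build the 9-field GPI row implied by one 17-field GAF 2.0 row."""
    db, obj_id, symbol = gaf[0], gaf[1], gaf[2]
    name, synonyms, obj_type = gaf[9], gaf[10], gaf[11]
    # Only the first taxon describes the gene product itself.
    taxon = gaf[12].split("|")[0].replace("taxon:", "", 1)
    spliceform = gaf[16]

    if spliceform:
        # Spliceform entry: ID from col 17, parent from col 2.
        # GAF col 12 already describes the spliceform when col 17 is set.
        return [db, spliceform, obj_type, taxon, symbol, name, synonyms,
                db + ":" + obj_id, ""]
    # Canonical entry: ID from col 2, no parent.
    return [db, obj_id, obj_type, taxon, symbol, name, synonyms, "", ""]
```

Because every annotation row repeats the same gene product data, the rows returned for a whole GAF file would need de-duplicating (for example by collecting them in a dictionary keyed on DB and object ID) before being written out.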
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || 745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || 2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
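
The evidence codes in the two example tables can be paired up row by row. The Python sketch below records only those pairings; the names `EXAMPLE_EVIDENCE_TO_ECO` and `to_eco` are hypothetical, and the IDs are read off the example above rather than taken from an authoritative mapping, so they should be checked against the Evidence Code Ontology before any real use.

```python
# Three-letter evidence code -> ECO ID, exactly as paired in the example
# tables above (old format row vs. proposed format row). Illustrative only.
EXAMPLE_EVIDENCE_TO_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def to_eco(code, mapping=EXAMPLE_EVIDENCE_TO_ECO):
    """Return the ECO ID for a three-letter code, or fail loudly if unknown."""
    try:
        return mapping[code]
    except KeyError:
        raise ValueError(f"no ECO ID known for evidence code {code!r}") from None
```

Failing on an unknown code, rather than passing it through, keeps old-style codes from leaking silently into a file that is supposed to contain only ECO IDs.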
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
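
As a rough illustration of the merge direction, the Python sketch below rebuilds a 17-column GAF-like row from one association row and the matching gene product row, joined on (DB, DB_Object_ID). The column orders follow the two example tables above. The function name `merge_to_gaf` and the `aspect_of` lookup are assumptions: neither new file carries the Aspect column, so it has to come from somewhere else, such as the ontology. Evidence codes are passed through unchanged, so mapping ECO IDs back to three-letter codes is left out.

```python
def merge_to_gaf(assoc_row, gp_index, aspect_of):
    """Rebuild one GAF-like row from an association row and its gene product row.

    assoc_row: the 12 association columns, in the order of the example table.
    gp_index:  {(DB, DB_Object_ID): 9-column gene product row}.
    aspect_of: {GO ID: "F" | "P" | "C"} (hypothetical lookup, not in either file).
    """
    (db, obj_id, qualifier, go_id, refs, evidence, with_from,
     interacting_taxon, date, assigned_by, xp, spliceform) = assoc_row
    _, _, obj_type, taxon, symbol, name, synonyms, _parent, _xrefs = gp_index[(db, obj_id)]

    # In the proposed format a negated annotation is written "NOT GO:nnnnnnn".
    if go_id.startswith("NOT "):
        go_id = go_id[len("NOT "):]
        qualifier = "NOT|" + qualifier if qualifier else "NOT"

    # GAF column 13 holds the organism taxon plus an optional interacting taxon.
    taxon_col = "taxon:" + taxon
    if interacting_taxon:
        taxon_col += "|taxon:" + interacting_taxon

    return [db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from,
            aspect_of[go_id], name, synonyms, obj_type, taxon_col,
            date, assigned_by, xp, spliceform]
```

Building `gp_index` once per gene product file and reusing it for every association row keeps the merge a single pass over the annotation data.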
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation XP || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
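
The column table above can be read as a projection from a GAF 2.0 row. The Python sketch below encodes that table as data and applies it. The names `GP_ASSOCIATION_FROM_GAF` and `gaf_to_gp_association` are hypothetical. Keeping only the second, pipe-separated taxon for "Extra taxon ID" is an assumption, and the sketch does not translate evidence codes or qualifiers. The readme files linked above remain the authority on the actual content of each column.

```python
# (gp_association column name, 1-based GAF 2.0 column), as listed in the table above.
GP_ASSOCIATION_FROM_GAF = [
    ("DB", 1), ("DB_Object_ID", 2), ("Qualifier", 4), ("GO ID", 5),
    ("DB:Reference(s)", 6), ("Evidence code", 7), ("With", 8),
    ("Extra taxon ID", 13), ("Date", 14), ("Assigned_by", 15),
    ("Annotation XP", 16), ("Spliceform ID", 17),
]


def gaf_to_gp_association(gaf_fields):
    """Project the 17 fields of a GAF 2.0 row onto the 12 gp_association columns."""
    if len(gaf_fields) != 17:
        raise ValueError(f"expected 17 GAF columns, got {len(gaf_fields)}")
    row = []
    for name, gaf_col in GP_ASSOCIATION_FROM_GAF:
        value = gaf_fields[gaf_col - 1]
        if name == "Extra taxon ID":
            # Assumption: GAF column 13 is "organism|interacting organism";
            # only the part after the pipe (if any) is kept here.
            _, _, value = value.partition("|")
        row.append(value)
    return row
```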
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
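
A reader for one gp_information line might look like the Python sketch below. It assumes the file is tab-separated with the eleven columns in the order given above, and that multi-valued columns are pipe-separated as in the example content. The `GpInfo` class and the function names are hypothetical, not part of any GOA tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpInfo:
    db: str
    db_subset: str
    object_id: str
    symbol: str
    name: str
    synonyms: List[str]
    object_type: str
    taxon: str
    target_sets: List[str]
    annotation_completed: str
    parent_object_id: str


def split_multi(value):
    """Split a pipe-separated column; an empty column gives an empty list."""
    return value.split("|") if value else []


def parse_gp_information(line):
    """Parse one data line (assumed tab-separated, 11 columns) into a GpInfo."""
    f = line.rstrip("\n").split("\t")
    f += [""] * (11 - len(f))  # tolerate trailing empty columns being dropped
    return GpInfo(f[0], f[1], f[2], f[3], f[4], split_multi(f[5]),
                  f[6], f[7], split_multi(f[8]), f[9], f[10])
```

Splitting the "0 or greater" columns into lists at parse time means callers never have to remember which fields are pipe-separated.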
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37841Gene Product Association Data (GPAD) Format (Archived)2011-11-01T15:16:22Z<p>Girlwithglasses: /* File Body */</p>
<hr />
<div>The Gene Product Association Data (GPAD) format is an alternative means of exchanging annotations. It is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the annotation data: term IDs, evidence codes, references, annotation XP (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
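
The header rules above (a version declaration first, every other header line prefixed with an exclamation mark) can be handled with a few lines of Python. This is a minimal sketch, assuming the file is read line by line; the function name `read_header` is hypothetical.

```python
def read_header(lines):
    """Return (format_name, version, data_lines) for an annotation file.

    Lines starting with "!" are header or comment lines and are skipped,
    except that the first "<name>-version: <n>" declaration is recorded.
    """
    format_name = version = None
    data = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("!"):
            key, sep, value = line[1:].strip().partition(":")
            # Matches e.g. "gaf-version: 2.0" but not "CVS Version: ...".
            if sep and key.endswith("-version") and version is None:
                format_name = key[:-len("-version")]
                version = value.strip()
            continue
        if line:
            data.append(line)
    return format_name, version, data
```

Returning the format name alongside the version lets one reader tell a GAF file from any other file that declares its version in the same `!name-version:` style.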
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cellpadding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cellpadding=5 cellspacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || (NOT|integral_to)? (other_organism|colocs|contributes_to)? annotation_relation<br />
|- <br />
| GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
Note II: Unlike GAF 2.0, there is no extra column for spliceforms; the spliceform ID goes directly in the DB_Object_ID. The relation between the spliceform ID and the canonical form is held in the GPI file.<br />
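
The two notes above amount to a small transformation of fields taken from a GAF 2.0 row. The Python sketch below shows one way it could be done. The function name `apply_gpad_notes` is hypothetical, and the `NOT GO:nnnnnnn` spelling follows the worked example on this page, not a finalized specification.

```python
def apply_gpad_notes(db_object_id, qualifiers, go_id, spliceform_id=""):
    """Apply the NOT and spliceform notes to fields from a GAF 2.0 row.

    qualifiers is the list of values from GAF column 4.
    Returns (db_object_id, remaining_qualifiers, go_id).
    """
    remaining = [q for q in qualifiers if q != "NOT"]
    if len(remaining) != len(qualifiers):
        # The negation travels with the term, so it cannot be overlooked.
        go_id = "NOT " + go_id
    if spliceform_id:
        # No separate spliceform column: the isoform ID is the annotated object.
        db_object_id = spliceform_id
    return db_object_id, remaining, go_id
```

The link from the spliceform back to its canonical form is not lost by this step, since it is recorded in the gene product file.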
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
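
Ending the header with a tab-separated column line lets a reader pick up the field names from the file itself. The Python sketch below assumes that convention is followed and that data lines are tab-separated; the function name `read_gpi` is hypothetical.

```python
import csv


def read_gpi(handle):
    """Read a gene product information file into a list of dicts.

    Column names come from the last tab-separated "!" line in the header.
    """
    columns = None
    data_lines = []
    for line in handle:
        if line.startswith("!"):
            # Ordinary header lines contain no tabs; the column line does.
            if "\t" in line:
                columns = line[1:].rstrip("\n").split("\t")
            continue
        if line.strip():
            data_lines.append(line)
    if columns is None:
        raise ValueError("no tab-separated column line found in the header")
    reader = csv.DictReader(data_lines, fieldnames=columns,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)
```

Reading names from the header means a file with reordered or additional columns still parses, which is presumably why the column line is strongly suggested.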
<br />
===File Body===<br />
<br />
{| border=1 cellpadding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cellpadding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || 745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || 2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
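<br />
As a rough illustration of the "two files in, one file out" direction, the sketch below rebuilds a 17-column GAF-style row from one association row and a lookup of gene product records. It assumes the column orders of the two example tables above; the function name <code>merge_to_gaf</code> and the lookup structure are hypothetical. Aspect (GAF column 9) is not present in either proposed file, so it is left blank here, and the evidence code is passed through unchanged.<br />

```python
def merge_to_gaf(assoc_row, gp_records):
    """Rebuild a 17-column GAF-style row from one association row.

    assoc_row:  list of 12 strings in the association column order above.
    gp_records: dict mapping (DB, DB_Object_ID) to a 9-item list in the
                gene product column order above (hypothetical structure).
    """
    (db, obj_id, qualifier, go_id, refs, evidence, with_from,
     interacting_taxon, date, assigned_by, annotation_xp,
     spliceform) = assoc_row
    (_, _, obj_type, taxon, symbol, name, synonyms,
     _parent, _xrefs) = gp_records[(db, obj_id)]

    # The proposal writes negation as a "NOT " prefix on the GO ID;
    # GAF keeps NOT in the Qualifier column instead.
    if go_id.startswith("NOT "):
        go_id = go_id[len("NOT "):]
        qualifier = "|".join(q for q in ("NOT", qualifier) if q)

    # GAF column 13 holds the organism taxon, optionally followed by the
    # interacting taxon, each with a "taxon:" prefix.
    taxa = ["taxon:" + taxon]
    if interacting_taxon:
        taxa.append("taxon:" + interacting_taxon)

    aspect = ""  # assumption: would need an ontology lookup, left blank
    return [db, obj_id, symbol, qualifier, go_id, refs, evidence,
            with_from, aspect, name, synonyms, obj_type, "|".join(taxa),
            date, assigned_by, annotation_xp, spliceform]
```

A real converter would also need to fill in the aspect from the ontology and map ECO IDs back to GAF evidence codes; neither is attempted here.<br />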
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store this information, but there is currently no mechanism for exporting it)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation XP || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
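<br />
The two column summaries above can be written down as small specifications and used to split a line into named fields. The following is a minimal sketch, not the GOA parser: it assumes the files are tab-separated and that multi-valued columns are pipe-separated (as in the synonym example above), and the names <code>GP_ASSOCIATION</code>, <code>GP_INFORMATION</code> and <code>parse_line</code> are made up for illustration.<br />

```python
# (column name, required?, multi-valued?) in file order, per the tables above.
GP_ASSOCIATION = [
    ("DB", True, False),
    ("DB_Object_ID", True, False),
    ("Qualifier", False, True),
    ("GO_ID", True, False),
    ("DB_Reference", True, True),
    ("Evidence_code", True, False),
    ("With", False, True),
    ("Extra_taxon_ID", False, False),
    ("Date", True, False),
    ("Assigned_by", True, False),
    ("Annotation_XP", False, True),
    ("Spliceform_ID", False, False),
]

GP_INFORMATION = [
    ("DB", True, False),
    ("DB_Subset", False, False),
    ("DB_Object_ID", True, False),
    ("DB_Object_Symbol", True, False),
    ("DB_Object_Name", False, False),
    ("DB_Object_Synonym", False, True),
    ("DB_Object_Type", True, False),
    ("Taxon", True, False),
    ("Annotation_Target_Set", False, True),
    ("Annotation_Completed", False, False),
    ("Parent_Object_ID", False, False),
]


def parse_line(line, spec):
    """Split one data line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(spec):
        raise ValueError(f"expected {len(spec)} columns, got {len(fields)}")
    record = {}
    for (name, required, multi), value in zip(spec, fields):
        if required and not value:
            raise ValueError(f"required column {name} is empty")
        # Assumption: multi-valued columns use "|" as the separator.
        record[name] = [v for v in value.split("|") if v] if multi else value
    return record
```

Calling <code>parse_line(line, GP_INFORMATION)</code> on a data line returns the synonyms and target sets as lists and every other column as a plain string.<br />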
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37839Gene Product Association Data (GPAD) Format (Archived)2011-11-01T14:50:45Z<p>Girlwithglasses: /* File Body */</p>
<hr />
<div>An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation XP (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
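<br />
A parser can treat the header as described above: read the version declaration, keep or discard the other lines starting with an exclamation mark, and pass everything else on as data. A minimal sketch, with the hypothetical function name <code>read_header</code>, assuming the lines in a real file start directly with <code>!</code> (the leading space in the listing above is only wiki formatting):<br />

```python
def read_header(lines):
    """Separate '!' header lines from data lines.

    Returns (version, comments, data_lines); version is None if no
    '!gaf-version:' line is found.
    """
    version = None
    comments = []
    data_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("!"):
            if line.startswith("!gaf-version:") and version is None:
                # Text after the first colon is the format version.
                version = line.split(":", 1)[1].strip()
            else:
                comments.append(line[1:].strip())
        elif line:
            data_lines.append(line)
    return version, comments, data_lines
```

The same approach would work for the proposed formats by checking for a different version prefix.<br />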
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| 0 or greater<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
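<br />
The colour coding in the table above amounts to assigning each GAF column a role: annotation data, gene product data, or both. The sketch below encodes those roles and partitions one row accordingly; the role labels and the name <code>partition_gaf_row</code> are illustrative only.<br />

```python
# Role of each GAF 2.0 column, read off the shading in the table above:
# "annotation" = shaded, "gp" = blue text, "both" = blue text on shading.
GAF_COLUMNS = [
    ("DB", "both"),
    ("DB_Object_ID", "both"),
    ("DB_Object_Symbol", "gp"),
    ("Qualifier", "annotation"),
    ("GO_ID", "annotation"),
    ("DB_Reference", "annotation"),
    ("Evidence_code", "annotation"),
    ("With_or_From", "annotation"),
    ("Aspect", "annotation"),
    ("DB_Object_Name", "gp"),
    ("DB_Object_Synonym", "gp"),
    ("DB_Object_Type", "gp"),
    ("Taxon", "both"),
    ("Date", "annotation"),
    ("Assigned_by", "annotation"),
    ("Annotation_cross_products", "annotation"),
    ("Spliceform", "both"),
]


def partition_gaf_row(fields):
    """Split a 17-item GAF row into (annotation, gene_product) dicts."""
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected 17 columns, got {len(fields)}")
    annotation, gene_product = {}, {}
    for (name, role), value in zip(GAF_COLUMNS, fields):
        if role in ("annotation", "both"):
            annotation[name] = value
        if role in ("gp", "both"):
            gene_product[name] = value
    return annotation, gene_product
```

Columns marked "both" appear in each dict because they are needed to link an annotation to its gene product.<br />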
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
Note II: Unlike GAF 2.0, there is no extra column for spliceforms; the spliceform ID goes directly in the DB_Object_ID. The relation between the spliceform ID and the canonical form is held in the GPI file.<br />
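<br />
The two notes above, together with the column table, suggest a row-by-row conversion from GAF 2.0. The following is only a sketch under stated assumptions: <code>gaf_to_gpad</code> is a hypothetical name, the spliceform value in col 17 is assumed to be a bare ID (as in the examples on this page), and the evidence code is copied unchanged rather than mapped to an ECO ID.<br />

```python
def gaf_to_gpad(gaf):
    """Convert a 17-item GAF 2.0 row into the 11 proposed GPAD columns."""
    # Move NOT out of the Qualifier column and onto the GO ID.
    qualifiers = [q for q in gaf[3].split("|") if q]
    negated = "NOT" in qualifiers
    qualifiers = [q for q in qualifiers if q != "NOT"]
    go_id = ("NOT " if negated else "") + gaf[4]

    # A spliceform ID (col 17), when present, is used as the object ID.
    object_id = gaf[16] or gaf[1]

    # Col 13 may hold two taxa; the second one is the interacting taxon.
    # The "taxon:" prefix is dropped, following the example tables.
    taxa = gaf[12].split("|")
    interacting = taxa[1].split(":", 1)[-1] if len(taxa) > 1 else ""

    return [
        gaf[0],                # DB
        object_id,             # DB_Object_ID
        "|".join(qualifiers),  # Qualifier, without NOT
        go_id,                 # (NOT) GO ID
        gaf[5],                # DB:Reference(s)
        gaf[6],                # Evidence code (not mapped to ECO here)
        gaf[7],                # With (or) From
        interacting,           # Interacting taxon ID
        gaf[13],               # Date
        gaf[14],               # Assigned_by
        gaf[15],               # Annotation XP
    ]
```

Aspect (col 9) and the gene product columns are simply dropped; the latter belong in the gene product file.<br />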
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
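<br />
The two tables above can be read as a recipe for deriving gene product records from existing GAF 2.0 rows. The sketch below follows that reading. It assumes bare taxon IDs and a <code>DB:ID</code> form for the parent, as in the example tables on this page; it leaves the xrefs column empty because GAF carries no gp2protein data; and it copies the name and symbol from the GAF row even for spliceforms. The names <code>gaf_to_gpi</code> and <code>collect_gpi</code> are hypothetical.<br />

```python
def gaf_to_gpi(gaf):
    """Build the 9 proposed GPI columns for the object annotated in a GAF row."""
    db, canonical_id, spliceform = gaf[0], gaf[1], gaf[16]
    # The first taxon in col 13 is the organism; drop the "taxon:" prefix.
    taxon = gaf[12].split("|")[0].split(":", 1)[-1]
    if spliceform:
        # Spliceform entry: ID from col 17, parent from cols 1 and 2.
        object_id, parent = spliceform, db + ":" + canonical_id
    else:
        object_id, parent = canonical_id, ""
    xrefs = ""  # assumption: would come from gp2protein, not from GAF
    return [db, object_id, gaf[11], taxon, gaf[2], gaf[9], gaf[10],
            parent, xrefs]


def collect_gpi(gaf_rows):
    """Return one GPI record per distinct (DB, object ID), in first-seen order."""
    records = {}
    for gaf in gaf_rows:
        record = gaf_to_gpi(gaf)
        records.setdefault((record[0], record[1]), record)
    return list(records.values())
```

Because each gene product is written once, however many annotations it has, this is also where the reduction in redundancy comes from.<br />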
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Current GAF 2.0 format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded background: annotation data; blue text: gene product data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || 745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || 2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
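<br />
Comparing the evidence column of the GAF example with the association table above gives the following correspondence between GAF evidence codes and ECO IDs. The mapping in this sketch is read directly off those two example tables and has not been checked against any ECO release; a real converter should take its mapping from the Evidence Code Ontology itself. The names <code>EXAMPLE_EVIDENCE_TO_ECO</code> and <code>to_eco</code> are made up for illustration.<br />

```python
# Evidence code -> ECO ID, exactly as paired in the example tables on this
# page. Assumption: illustrative only, not an authoritative ECO mapping.
EXAMPLE_EVIDENCE_TO_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def to_eco(evidence_code):
    """Look up the ECO ID used in the example for a GAF evidence code."""
    try:
        return EXAMPLE_EVIDENCE_TO_ECO[evidence_code]
    except KeyError:
        raise ValueError(
            f"no ECO ID listed in the example for {evidence_code!r}"
        ) from None
```

Codes that do not occur in the example, such as IEA, raise an error rather than being guessed.<br />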
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (annotation data all groups now store such information, but there is no current export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation XP || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37746Gene Product Association Data (GPAD) Format (Archived)2011-10-26T22:35:56Z<p>Girlwithglasses: /* Proposed new format */</p>
<hr />
<div>The Gene Product Association Data (GPAD) format is an alternative means of exchanging annotations. It is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products, the objects being annotated, from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation XP (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
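A minimal sketch of how a parser might honour this header convention, assuming one record per line. The function name split_header is hypothetical and not part of any GO tool. It also recognises the gpad-version and gpi-version declarations proposed later on this page, since they follow the same pattern:<br />

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches declarations such as "!gaf-version: 2.0" or "!gpad-version: 1.0".
VERSION_RE = re.compile(r"^!\s*(gaf|gpad|gpi)-version:\s*(\S+)", re.IGNORECASE)


def split_header(
    lines: Iterable[str],
) -> Tuple[Optional[Tuple[str, str]], List[str], List[str]]:
    """Return ((format, version) or None, comment lines, data lines)."""
    declared = None
    comments: List[str] = []
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("!"):
            # Comment or header line: keep it, but never treat it as data.
            comments.append(line)
            match = VERSION_RE.match(line)
            if match and declared is None:
                declared = (match.group(1).lower(), match.group(2))
        else:
            data.append(line)
    return declared, comments, data
```

Returning the comment lines rather than discarding them lets a converter copy the project name, URL and contact details into the files it writes.<br />
<br />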
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| 0 or greater
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1
|}<br />
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
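As an illustration of the negation rule, the following sketch reads one GPAD line and separates the NOT prefix from the GO ID. It assumes tab-separated columns in the order of the table above and pipe-separated values in multi-valued columns, as in GAF. The names GpadRow and parse_gpad_line are hypothetical:<br />

```python
from dataclasses import dataclass
from typing import List

GPAD_COLUMNS = 11  # number of columns in the proposed File Body table


@dataclass
class GpadRow:
    db: str
    db_object_id: str
    qualifiers: List[str]
    negated: bool
    go_id: str
    references: List[str]
    evidence: str
    with_from: List[str]
    interacting_taxon: str
    date: str
    assigned_by: str
    annotation_xp: List[str]


def _multi(value: str) -> List[str]:
    """Split a pipe-separated field, dropping empty entries."""
    return [item for item in value.split("|") if item]


def parse_gpad_line(line: str) -> GpadRow:
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) > GPAD_COLUMNS:
        raise ValueError(f"expected at most {GPAD_COLUMNS} columns, got {len(fields)}")
    fields += [""] * (GPAD_COLUMNS - len(fields))  # pad missing trailing columns

    # Negation travels with the GO ID, e.g. "NOT GO:0003963".
    go_field = fields[3].strip()
    negated = go_field.startswith("NOT ")
    go_id = go_field[4:].strip() if negated else go_field

    return GpadRow(
        db=fields[0],
        db_object_id=fields[1],
        qualifiers=_multi(fields[2]),
        negated=negated,
        go_id=go_id,
        references=_multi(fields[4]),
        evidence=fields[5],
        with_from=_multi(fields[6]),
        interacting_taxon=fields[7],
        date=fields[8],
        assigned_by=fields[9],
        annotation_xp=_multi(fields[10]),
    )
```

Because the GO ID cannot be read without passing through the negation check, a consumer that uses a parser of this shape cannot silently drop a NOT.<br />
<br />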
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
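One reason to end the header with a column list is that readers can then map values to names without hard-coding the column order. A rough sketch under that assumption follows; read_gpi is a hypothetical name, and the code relies on the last "!" line before the data being the tab-separated column list:<br />

```python
from typing import Dict, Iterable, List


def read_gpi(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read GPI lines into dictionaries keyed by the header's column names."""
    columns: List[str] = []
    records: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("!"):
            # Keep overwriting until data starts, so the final header line wins.
            if not records:
                columns = line[1:].split("\t")
            continue
        if len(columns) < 2:
            raise ValueError("header does not end with a tab-separated column list")
        values = line.split("\t")
        values += [""] * (len(columns) - len(values))  # pad missing trailing columns
        records.append(dict(zip(columns, values)))
    return records
```

A file whose header ends with a bare "!" line, like the GAF header shown earlier, is rejected rather than guessed at.<br />
<br />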
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Current GAF 2.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || 745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || 2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
<br />
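The split described above could be sketched roughly as follows. This is only an illustration, assuming tab-separated GAF 2.0 lines with pipe-separated multi-value fields. The function name split_gaf is hypothetical, the evidence code mapping covers only the pairs shown in the example tables on this page, and column 17 (spliceform) is ignored for brevity:<br />

```python
from typing import Dict, Iterable, List, Tuple

# GAF evidence code -> ECO ID, limited to the pairs in this page's example tables.
ECO_BY_GAF_CODE = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def _strip_taxon(value: str) -> str:
    """Turn 'taxon:4932' into '4932', as in the example tables."""
    return value[len("taxon:"):] if value.startswith("taxon:") else value


def split_gaf(lines: Iterable[str]) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (annotation rows, gene product rows) built from GAF 2.0 lines."""
    gpad_rows: List[List[str]] = []
    gpi_by_id: Dict[Tuple[str, str], List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        col = line.split("\t")
        col += [""] * (17 - len(col))  # pad missing trailing columns

        # Move NOT out of the qualifier column and onto the GO ID.
        qualifiers = [q for q in col[3].split("|") if q]
        negated = "NOT" in qualifiers
        qualifiers = [q for q in qualifiers if q != "NOT"]
        go_id = "NOT " + col[4] if negated else col[4]

        # GAF column 13 holds the organism's taxon, optionally followed by
        # the interacting organism's taxon.
        taxa = [_strip_taxon(t) for t in col[12].split("|") if t]
        own_taxon = taxa[0] if taxa else ""
        interacting = taxa[1] if len(taxa) > 1 else ""

        gpad_rows.append([
            col[0], col[1], "|".join(qualifiers), go_id, col[5],
            ECO_BY_GAF_CODE.get(col[6], col[6]),  # unmapped codes pass through
            col[7], interacting, col[13], col[14], col[15],
        ])
        # Gene product data is stated once per (DB, DB_Object_ID).
        gpi_by_id.setdefault(
            (col[0], col[1]),
            [col[0], col[1], col[11], own_taxon, col[2], col[9], col[10], "", ""],
        )
    return gpad_rows, list(gpi_by_id.values())
```

The reverse direction would join each annotation row to its gene product row on DB and DB_Object_ID, and would need the Aspect column (GAF col 9) to be looked up from the ontology, since neither proposed file carries it.<br />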
<br />
===GO Database===<br />
<br />
The database loading scripts would need some modification. Because the new formats mirror the database structure more closely, the change should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, submitting groups will need to write two: a file containing gene product data and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (annotation data all groups now store such information, but there is no current export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation XP || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37745Gene Product Association Data (GPAD) Format (Archived)2011-10-26T21:19:16Z<p>Girlwithglasses: </p>
<hr />
<div>The Gene Product Association Data (GPAD) format is an alternative means of exchanging annotations. It is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products, the objects being annotated, from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation XP (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary for object types has not yet been introduced.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
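The header convention above can be sketched as a small parser. This is only an illustration: the function name <code>split_header</code> and the regular expression are assumptions, not part of any GOC tool. It records the version declaration, keeps the other <code>!</code> lines as header text, and treats everything else as data.<br />
<br />
```python
import re

# Matches a version declaration such as "!gaf-version: 2.0".
# The pattern is an assumption based on the header examples on this page.
VERSION_RE = re.compile(r"^!([a-z]+)-version:\s*(\S+)")


def split_header(lines):
    """Return (file_format, version, header_lines, data_lines)."""
    file_format = version = None
    header, data = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("!"):
            # Comment line: ignored by parsers, except that we note the
            # format version when this line declares it.
            match = VERSION_RE.match(line)
            if match:
                file_format, version = match.groups()
            header.append(line)
        elif line:
            data.append(line)
    return file_format, version, header, data


example = [
    "!gaf-version: 2.0",
    "!Project_name: Schizosaccharomyces pombe GeneDB",
    "!",
    "SGD\tS000000296\tPHO3",
]
print(split_header(example)[:2])
```
<br />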
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
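As a rough illustration of the table above, a tab-separated GAF 2.0 line can be read into named fields. The field names in <code>GAF_COLUMNS</code> and the function <code>parse_gaf_line</code> are hypothetical labels chosen for this sketch; multi-valued columns are assumed to be pipe-separated.<br />
<br />
```python
# Field names follow the column table above; the exact spellings are
# illustrative and not an official vocabulary.
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence_code", "With_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Date", "Assigned_by", "Annotation_XP", "Spliceform",
]


def parse_gaf_line(line):
    """Map one tab-separated GAF 2.0 line onto the 17 column names."""
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so the optional trailing columns (16, 17) may be absent.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    return dict(zip(GAF_COLUMNS, fields))


row = "\t".join([
    "SGD", "S000005370", "RCL1", "", "GO:0006406", "SGD_REF:S000069956",
    "IC", "GO:0000346", "P", "aminodeoxychorismate synthase", "YOL010W",
    "gene", "taxon:4932|taxon:745953", "20030221", "SGD",
])
record = parse_gaf_line(row)
print(record["Taxon"].split("|"))
```
<br />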
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
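<br />
A minimal sketch of that move, assuming qualifiers are pipe-separated as in GAF; the function name <code>move_not</code> is hypothetical:<br />
<br />
```python
def move_not(qualifier, go_id):
    """Move a NOT qualifier from the Qualifier column onto the GO ID.

    Returns the (qualifier, go_id) pair as it would appear in the
    proposed annotation file.
    """
    qualifiers = [q for q in qualifier.split("|") if q]
    if "NOT" in qualifiers:
        qualifiers.remove("NOT")
        go_id = "NOT " + go_id
    return "|".join(qualifiers), go_id


print(move_not("NOT", "GO:0003963"))
print(move_not("", "GO:0006406"))
```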
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
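If the final header line does list the column names, a reader can use it to label each record. The sketch below assumes that convention is followed and that the names really are tab-separated in the file; <code>read_gpi</code> is a hypothetical helper, not an existing GOC function.<br />
<br />
```python
def read_gpi(lines):
    """Return the data rows of a GPI file as dicts keyed by column name."""
    columns = None
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("!"):
            # The last header line containing tabs is taken to be the
            # column list, as suggested above.
            if "\t" in line:
                columns = line[1:].split("\t")
            continue
        if line and columns:
            records.append(dict(zip(columns, line.split("\t"))))
    return records


gpi_example = [
    "!gpi-version: 1.0",
    "!DB\tDB_Object_ID\tDB_Object_Type\tTaxon\tDB_Object_Symbol",
    "SGD\tS000000296\tgene\t4932\tPHO3",
]
gpi_records = read_gpi(gpi_example)
print(gpi_records[0]["DB_Object_Symbol"])
```
<br />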
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 2.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded background: annotation data; blue text: gene product data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
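<br />
The GAF-to-two-files direction can be sketched as follows. This is a minimal illustration under stated assumptions, not the GOC's conversion script: <code>split_gaf</code> is a hypothetical name, column positions follow the GAF 2.0 table on this page, evidence codes are passed through unchanged rather than mapped to ECO IDs, and spliceform IDs (col 17) are ignored.<br />
<br />
```python
def split_gaf(gaf_lines):
    """Split GAF 2.0 lines into (annotation_rows, gene_product_rows).

    Both results are lists of tab-separated strings. Each gene product is
    written once, however many annotations it has.
    """
    gpad_rows = []
    gpi_rows = {}  # keyed by (DB, DB_Object_ID); dicts keep insertion order
    for raw in gaf_lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        col = line.split("\t")
        col += [""] * (17 - len(col))  # pad optional trailing columns

        # GAF col 13 holds one or two taxa; the second is the interacting one.
        taxa = col[12].split("|")
        interacting_taxon = taxa[1] if len(taxa) > 1 else ""

        # NOT moves from the Qualifier column to the GO ID column.
        qualifiers = [q for q in col[3].split("|") if q]
        go_id = col[4]
        if "NOT" in qualifiers:
            qualifiers.remove("NOT")
            go_id = "NOT " + go_id

        gpad_rows.append("\t".join([
            col[0], col[1], "|".join(qualifiers), go_id, col[5], col[6],
            col[7], interacting_taxon, col[13], col[14], col[15],
        ]))
        # Parent GP ID and xrefs are left blank in this sketch.
        gpi_rows.setdefault((col[0], col[1]), "\t".join([
            col[0], col[1], col[11], taxa[0], col[2], col[9], col[10], "", "",
        ]))
    return gpad_rows, list(gpi_rows.values())


gaf_example = [
    "!gaf-version: 2.0",
    "\t".join(["SGD", "S000000296", "PHO3", "", "GO:0003993",
               "SGD_REF:S000047763", "IMP", "", "F", "acid phosphatase",
               "YBR092C", "gene", "taxon:4932", "20010118", "SGD"]),
    "\t".join(["SGD", "S000000296", "PHO3", "", "GO:0006796",
               "SGD_REF:S000047115", "TAS", "", "P", "acid phosphatase",
               "YBR092C", "gene", "taxon:4932", "20041220", "SGD"]),
]
gpad, gpi = split_gaf(gaf_example)
print(len(gpad), len(gpi))
```
<br />
In this example the two annotations to PHO3 yield two annotation rows but a single gene product row, which is the redundancy reduction the proposal is after.<br />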
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate whether a protein has been prioritized by an annotation project, such as Reference Genomes or the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes Google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store this information, but there is currently no mechanism for exporting it)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation XP || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37742Gene Product Association Data (GPAD) Format (Archived)2011-10-26T18:28:13Z<p>Girlwithglasses: /* File Header */</p>
<hr />
<div><br />
An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. Gene product data (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from annotation data: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary for object types has not yet been introduced.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data except the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
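<br />
To illustrate the column mapping in the table above, here is a hedged Python sketch that turns one GAF 2.0 row into the eleven proposed GPAD columns, moving 'NOT' from the qualifier column to the GO ID column. The function name <code>gaf_to_gpad</code> and the optional <code>eco_map</code> argument are hypothetical. The sketch assumes that multiple values within a column are pipe-separated and that the second taxon in GAF column 13 is the interacting taxon; evidence codes are left unchanged unless the caller supplies a mapping to ECO IDs.<br />
<br />
```python
def gaf_to_gpad(gaf_row, eco_map=None):
    """Convert one tab-separated GAF 2.0 row into a list of 11 GPAD values."""
    cols = gaf_row.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))  # pad rows that omit trailing columns

    # Column 4: drop NOT from the qualifiers and remember that it was there.
    qualifiers = [q for q in cols[3].split("|") if q]
    negated = "NOT" in qualifiers
    qualifiers = [q for q in qualifiers if q != "NOT"]

    # Column 5: prefix the GO ID with NOT for negated associations.
    go_id = ("NOT " if negated else "") + cols[4]

    # Column 7: translate the evidence code only if a mapping was supplied.
    evidence = (eco_map or {}).get(cols[6], cols[6])

    # Column 13: the first taxon describes the gene product (it belongs in
    # the GPI file); a second taxon, if present, is the interacting organism.
    taxa = cols[12].split("|")
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    return [
        cols[0],               # DB
        cols[1],               # DB_Object_ID
        "|".join(qualifiers),  # Qualifier, without NOT
        go_id,                 # (NOT) GO ID
        cols[5],               # DB:Reference(s)
        evidence,              # Evidence code
        cols[7],               # With (or) From
        interacting_taxon,     # Interacting taxon ID
        cols[13],              # Date
        cols[14],              # Assigned_by
        cols[15],              # Annotation XP
    ]
```
<br />
Because the negation travels with the GO ID, a consumer that ignores the qualifier column can no longer mistake a negated annotation for a positive one.<br />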
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
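<br />
As a rough illustration of how a GPI row could be read, the Python sketch below maps the nine tab-separated values onto the column names suggested for the header line and splits the pipe-separated xrefs into a list. The name <code>parse_gpi_line</code> is hypothetical, and splitting synonyms on pipes is an assumption carried over from GAF practice rather than something this proposal states.<br />
<br />
```python
GPI_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Type", "Taxon", "DB_Object_Symbol",
    "DB_Object_Name", "DB_Object_Synonym(s)", "Parent_GP_ID", "DB_Object_Xrefs",
]

# Columns that may hold several values. Xrefs are pipe-separated according to
# this page; treating synonyms the same way is an assumption.
MULTI_VALUED = ("DB_Object_Synonym(s)", "DB_Object_Xrefs")


def parse_gpi_line(line):
    """Return a dict for one GPI data row, or None for a '!' header line."""
    if line.startswith("!"):
        return None
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GPI_COLUMNS) - len(values))  # pad short rows
    record = dict(zip(GPI_COLUMNS, values))
    for name in MULTI_VALUED:
        record[name] = [v for v in record[name].split("|") if v]
    return record
```
<br />
An empty Parent_GP_ID stays an empty string, which distinguishes a top-level gene product from a spliceform entry.<br />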
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
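<br />
A hedged sketch of one direction of such a script, in Python: it joins association rows with gene product rows to rebuild GAF-style 17-column rows. The function name <code>merge_to_gaf</code> is hypothetical, and rows are assumed to be lists of values already split on tabs, in the column order of the proposed formats. Two things are deliberately left out: the Aspect column (9) stays blank because it would have to be looked up from the ontology, and spliceform entries are not mapped back to column 17. Evidence codes are copied as given.<br />
<br />
```python
def merge_to_gaf(gpad_rows, gpi_rows):
    """Join 11-column GPAD rows with 9-column GPI rows into 17-column rows."""
    # Index gene products by (DB, DB_Object_ID), the shared key of both files.
    products = {(gp[0], gp[1]): gp for gp in gpi_rows}

    merged = []
    for a in gpad_rows:
        gp = products[(a[0], a[1])]  # KeyError if the product is not listed

        # Move a leading NOT from the GO ID back into the qualifier column.
        negated = a[3].startswith("NOT ")
        go_id = a[3][len("NOT "):] if negated else a[3]
        qualifiers = [q for q in a[2].split("|") if q]
        if negated:
            qualifiers.insert(0, "NOT")

        # GAF column 13 holds the product's taxon, then any interacting taxon.
        taxon = gp[3] if gp[3].startswith("taxon:") else "taxon:" + gp[3]
        if a[7]:
            taxon += "|" + a[7]

        merged.append([
            a[0], a[1], gp[4],     # 1 DB, 2 DB_Object_ID, 3 symbol
            "|".join(qualifiers),  # 4 Qualifier
            go_id,                 # 5 GO ID
            a[4], a[5], a[6],      # 6 reference, 7 evidence, 8 with/from
            "",                    # 9 Aspect: needs an ontology lookup
            gp[5], gp[6], gp[2],   # 10 name, 11 synonyms, 12 type
            taxon,                 # 13 taxon
            a[8], a[9], a[10],     # 14 date, 15 assigned_by, 16 annotation XP
            "",                    # 17 Spliceform: not handled in this sketch
        ])
    return merged
```
<br />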
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
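<br />
The "required?" and "cardinality" columns of this table lend themselves to a simple row check. The Python sketch below is an illustration only, not part of the GOA tooling; the name <code>check_gp_association_row</code> is hypothetical, and it assumes that multiple values within a column are pipe-separated.<br />
<br />
```python
# (name, minimum values, maximum values or None for "or greater"),
# transcribed from the gp_association table above.
GP_ASSOCIATION = [
    ("DB", 1, 1),
    ("DB_Object_ID", 1, 1),
    ("Qualifier", 0, None),
    ("GO ID", 1, 1),
    ("DB:Reference(s)", 1, None),
    ("Evidence code", 1, 1),
    ("With", 0, None),
    ("Extra taxon ID", 0, 1),
    ("Date", 1, 1),
    ("Assigned_by", 1, 1),
    ("Annotation Extension", 0, None),
    ("Spliceform ID", 0, 1),
]


def check_gp_association_row(line):
    """Return a list of problems found in one tab-separated row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GP_ASSOCIATION):
        return [f"expected {len(GP_ASSOCIATION)} columns, found {len(fields)}"]
    problems = []
    for (name, low, high), field in zip(GP_ASSOCIATION, fields):
        count = len([v for v in field.split("|") if v])
        if count < low:
            problems.append(f"{name}: required but empty")
        elif high is not None and count > high:
            problems.append(f"{name}: at most {high} value(s) allowed, found {count}")
    return problems
```
<br />
An empty list means the row satisfies the column count and cardinalities; the check says nothing about whether the IDs themselves are valid.<br />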
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37741Gene Product Association Data (GPAD) Format (Archived)2011-10-26T18:27:47Z<p>Girlwithglasses: /* File Header */</p>
<hr />
<div><br />
An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology] and by standardizing the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
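<br />
Once unannotated gene products are listed in their own file, the proportion of annotated products becomes a simple set calculation. The Python sketch below is a hypothetical illustration: the name <code>annotated_fraction</code> is made up, rows are assumed to be lists of column values with DB and DB_Object_ID in the first two positions of both files, and negated ('NOT') annotations are counted like any other.<br />
<br />
```python
def annotated_fraction(gpi_rows, gpad_rows):
    """Fraction of gene products in the GPI data with at least one annotation."""
    products = {(row[0], row[1]) for row in gpi_rows}
    # Ignore annotations whose object is not listed in the GPI data.
    annotated = {(row[0], row[1]) for row in gpad_rows} & products
    return len(annotated) / len(products) if products else 0.0
```
<br />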
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data except the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark.<br />
<br />
It is strongly suggested that the final line of the file header is a tab-separated list of the column contents, as follows:<br />
<br />
!DB DB_Object_ID DB_Object_Type Taxon DB_Object_Symbol DB_Object_Name DB_Object_Synonym(s) Parent_GP_ID DB_Object_Xrefs<br />
<br />
An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
!DB DB Object ID DB Object Type Taxon DB Object Symbol DB Object Name DB Object Synonym(s) Parent GP ID Xrefs<br />
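<br />
A header like the one above could be read with a few lines of code. The following is only a sketch: the function name is hypothetical, and it assumes that the column list is the one header line that contains tab characters.<br />
<br />
```python
from typing import Iterable, List, Optional, Tuple


def parse_gpi_header(lines: Iterable[str]) -> Tuple[Optional[str], Optional[List[str]], List[str]]:
    """Return (version, column_names, remarks) from the leading '!' lines.

    Hypothetical helper: treats a header line containing tabs as the column list.
    """
    version = None
    columns = None
    remarks = []
    for line in lines:
        if not line.startswith("!"):
            break  # first non-header line: the file body starts here
        text = line[1:].rstrip("\n")
        if text.startswith("gpi-version:"):
            version = text.split(":", 1)[1].strip()
        elif "\t" in text:
            columns = text.split("\t")
        elif text.strip():
            remarks.append(text)
    return version, columns, remarks
```
<br />
Reading stops at the first line that does not begin with an exclamation mark, so the same iterator can then be handed on to a parser for the file body.<br />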
<br />
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
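<br />
As an illustration, one line of the file body could be turned into a record as sketched below. The field names are hypothetical shorthand for the nine columns in the tables above, and only the xrefs column is split on pipes, since that is the only separator this proposal states.<br />
<br />
```python
from typing import Dict, List, Union

# Hypothetical field names, in the column order of the tables above.
GPI_FIELDS = ("db", "db_object_id", "db_object_type", "taxon", "symbol",
              "name", "synonyms", "parent_gp_id", "xrefs")


def parse_gpi_row(line: str) -> Dict[str, Union[str, List[str]]]:
    """Parse one tab-separated GPI line into a dict keyed by GPI_FIELDS."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GPI_FIELDS) - len(values))  # pad omitted trailing columns
    record: Dict[str, Union[str, List[str]]] = dict(zip(GPI_FIELDS, values))
    # Multiple xrefs are pipe-separated; an empty column gives an empty list.
    record["xrefs"] = [x for x in values[8].split("|") if x]
    return record
```
<br />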
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 2.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
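<br />
A rough sketch of the GAF-to-two-files direction is given below. It is not an agreed implementation: the function names are hypothetical, evidence codes are passed through unchanged rather than mapped to ECO IDs, and the handling of col 17 (a spliceform gets its own gene product entry with the col 2 ID as its parent) is one reading of the tables above.<br />
<br />
```python
from typing import Dict, Iterable, List, Tuple


def split_gaf_row(line: str) -> Tuple[List[str], List[str]]:
    """Split one 17-column GAF line into a (gpad_row, gpi_row) pair."""
    col = line.rstrip("\n").split("\t")
    col += [""] * (17 - len(col))  # tolerate omitted trailing columns
    (db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from, _aspect,
     name, synonyms, obj_type, taxon, date, assigned_by, xp, spliceform) = col[:17]

    # NOT moves from the qualifier column into the GO ID column.
    qualifiers = [q for q in qualifier.split("|") if q]
    go_field = "NOT " + go_id if "NOT" in qualifiers else go_id
    other_qualifiers = "|".join(q for q in qualifiers if q != "NOT")

    # The first taxon belongs to the gene product; a second one is the interacting taxon.
    own_taxon, _, interacting_taxon = taxon.partition("|")

    gpad_row = [db, obj_id, other_qualifiers, go_field, refs, evidence,
                with_from, interacting_taxon, date, assigned_by, xp, spliceform]

    gp_id = spliceform or obj_id
    parent = "{}:{}".format(db, obj_id) if spliceform else ""
    gpi_row = [db, gp_id, obj_type, own_taxon.replace("taxon:", ""), symbol,
               name, synonyms, parent, ""]
    return gpad_row, gpi_row


def split_gaf(lines: Iterable[str]) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (annotation rows, de-duplicated gene product rows)."""
    gpad_rows: List[List[str]] = []
    gpi_rows: Dict[Tuple[str, str], List[str]] = {}
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header lines and blank lines
        gpad_row, gpi_row = split_gaf_row(line)
        gpad_rows.append(gpad_row)
        # Each gene product is stated once, however many annotations it has.
        gpi_rows.setdefault((gpi_row[0], gpi_row[1]), gpi_row)
    return gpad_rows, list(gpi_rows.values())
```
<br />
The Aspect column (col 9) is simply dropped here, since neither proposed file has a column for it.<br />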
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely, so the change should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, groups will need to write two files: one containing gene product data and one containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
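<br />
A conversion script for users who still need GAF could rebuild each GAF line from an annotation row and the matching gene product row. The sketch below makes several assumptions: the function name is hypothetical, the annotation row has the twelve columns of the example table (including the spliceform ID), taxon IDs in the gene product file are bare numbers, and the GO aspect is supplied by the caller as a dict, because the proposed annotation file has no Aspect column.<br />
<br />
```python
from typing import Dict, List, Tuple


def gpad_to_gaf_row(gpad_row: List[str],
                    gpi_index: Dict[Tuple[str, str], List[str]],
                    aspect_of: Dict[str, str]) -> List[str]:
    """Rebuild a 17-column GAF row from one annotation row plus gene product data."""
    (db, obj_id, qualifier, go_field, refs, evidence, with_from,
     interacting_taxon, date, assigned_by, xp, spliceform) = gpad_row

    # Undo the "NOT GO:..." convention of the proposed format.
    if go_field.startswith("NOT "):
        go_id = go_field[len("NOT "):]
        qualifier = "NOT|" + qualifier if qualifier else "NOT"
    else:
        go_id = go_field

    # Gene product details come from the spliceform entry when one is given.
    (_, _, obj_type, taxon, symbol, name,
     synonyms, _, _) = gpi_index[(db, spliceform or obj_id)]

    gaf_taxon = "taxon:" + taxon
    if interacting_taxon:
        gaf_taxon += "|" + interacting_taxon

    return [db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from,
            aspect_of.get(go_id, ""), name, synonyms, obj_type, gaf_taxon,
            date, assigned_by, xp, spliceform]
```
<br />
In practice the aspect lookup would have to be built from the ontology file; here it is left blank for any GO ID that is missing from the dict.<br />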
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format further to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
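<br />
The table above translates fairly directly into a small reader. The following sketch uses hypothetical field names for the twelve columns and checks only that the required columns are non-empty; the readme linked above remains the authoritative description.<br />
<br />
```python
from typing import Dict

# (hypothetical field name, required?) in the column order of the table above.
GP_ASSOCIATION_COLUMNS = [
    ("db", True), ("db_object_id", True), ("qualifier", False),
    ("go_id", True), ("references", True), ("evidence_code", True),
    ("with", False), ("extra_taxon_id", False), ("date", True),
    ("assigned_by", True), ("annotation_extension", False),
    ("spliceform_id", False),
]


def parse_gp_association(line: str) -> Dict[str, str]:
    """Parse one tab-separated gp_association line, checking required columns."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GP_ASSOCIATION_COLUMNS) - len(values))  # pad short lines
    record = {}
    for (name, required), value in zip(GP_ASSOCIATION_COLUMNS, values):
        if required and not value:
            raise ValueError("missing required column: " + name)
        record[name] = value
    return record
```
<br />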
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37740Gene Product Association Data (GPAD) Format (Archived)2011-10-26T18:02:38Z<p>Girlwithglasses: /* Proposed new format */</p>
<hr />
<div><br />
An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products--symbol, name, synonyms, taxon--can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
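<br />
To illustrate the first point, once unannotated gene products are listed in a gene product file, the annotated proportion is a simple set calculation. The function name and the IDs in the usage note are made up for the example.<br />
<br />
```python
from typing import Iterable


def annotated_fraction(gene_product_ids: Iterable[str],
                       annotated_ids: Iterable[str]) -> float:
    """Fraction of listed gene products that have at least one annotation."""
    all_ids = set(gene_product_ids)
    if not all_ids:
        return 0.0
    # Repeated IDs in the annotation file count once; unknown IDs are ignored.
    return len(all_ids & set(annotated_ids)) / len(all_ids)
```
<br />
For instance, with four gene products listed and annotations to two of them, <code>annotated_fraction(["GP1", "GP2", "GP3", "GP4"], ["GP1", "GP1", "GP3"])</code> gives 0.5.<br />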
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. Questions about how, and by whom, these IDs would be maintained need to be answered before such IDs are introduced.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 2.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
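<br />
Comparing the evidence columns of the old and new example tables gives the code-to-ECO pairing used here. A minimal sketch of that lookup follows; the dictionary holds only the six pairs that appear in the examples on this page, and the names <code>GO_TO_ECO</code> and <code>to_eco</code> are hypothetical, not part of any GO tool.<br />
<br />
```python
# GO evidence code -> ECO ID, exactly as paired in the example tables above.
# This is not a complete mapping; codes outside the examples are rejected.
GO_TO_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def to_eco(evidence_code):
    """Return the ECO ID for a GO evidence code, or raise ValueError."""
    try:
        return GO_TO_ECO[evidence_code]
    except KeyError:
        raise ValueError(f"no ECO mapping known for {evidence_code!r}") from None
```
<br />
A converter would need a complete, maintained mapping rather than this excerpt.<br />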
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
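<br />
As a rough illustration of the split, the sketch below turns one GAF row into an association row and a gene product row, using the column order from the proposed file body tables. It assumes tab-separated input, leaves the evidence code untouched (no ECO conversion), and ignores column 17; the function name <code>split_gaf_line</code> is hypothetical.<br />
<br />
```python
def split_gaf_line(line):
    """Split one tab-separated GAF 2.0 row into (gpad_row, gpi_row) lists."""
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))  # pad missing optional trailing columns
    (db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from,
     _aspect, name, synonyms, obj_type, taxon, date, assigned_by,
     extension, _spliceform) = cols[:17]

    # Column 13 may hold two pipe-separated taxa; the second is the
    # interacting taxon for multi-organism processes.
    taxa = taxon.split("|")
    own_taxon = taxa[0]
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    # Move NOT out of the qualifier column and onto the GO ID.
    qualifiers = [q for q in qualifier.split("|") if q]
    if "NOT" in qualifiers:
        qualifiers.remove("NOT")
        go_id = "NOT " + go_id

    gpad_row = [db, obj_id, "|".join(qualifiers), go_id, refs, evidence,
                with_from, interacting_taxon, date, assigned_by, extension]
    # Parent GP ID and xrefs are not present in GAF, so they stay empty.
    gpi_row = [db, obj_id, obj_type, own_taxon, symbol, name, synonyms, "", ""]
    return gpad_row, gpi_row
```
<br />
Because every annotation to a gene product yields the same gene product row, a real script would write each distinct row only once.<br />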
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
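<br />
A conversion script for data consumers might look something like the sketch below, which rebuilds a 17-column GAF row from one association row and the matching gene product row. It is only an outline under several assumptions: rows are already split into lists in the column order of the proposed formats, the aspect (GAF column 9) is supplied by the caller through a hypothetical <code>aspect_of</code> lookup because neither new file carries it, and the evidence code is passed through unchanged.<br />
<br />
```python
def join_to_gaf(gpad_row, gpi_by_key, aspect_of):
    """Rebuild a GAF 2.0 row (list of 17 strings) from GPAD and GPI rows.

    gpi_by_key maps (DB, DB_Object_ID) to a 9-column gene product row;
    aspect_of maps a GO ID to its aspect letter (F, P or C).
    """
    (db, obj_id, qualifier, go_field, refs, evidence, with_from,
     interacting_taxon, date, assigned_by, extension) = gpad_row
    (_db, _id, obj_type, taxon, symbol, name, synonyms,
     _parent, _xrefs) = gpi_by_key[(db, obj_id)]

    # Move a NOT prefix on the GO ID back into the qualifier column.
    negated = go_field.startswith("NOT ")
    go_id = go_field[4:] if negated else go_field
    qualifiers = (["NOT"] if negated else []) + [q for q in qualifier.split("|") if q]

    # GAF column 13 holds both taxa, pipe-separated, when there are two.
    taxa = taxon + "|" + interacting_taxon if interacting_taxon else taxon

    return [db, obj_id, symbol, "|".join(qualifiers), go_id, refs, evidence,
            with_from, aspect_of.get(go_id, ""), name, synonyms, obj_type,
            taxa, date, assigned_by, extension, ""]
```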
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate whether a protein has been prioritized by an annotation project, such as Reference Genomes or the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes Google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store this information, but there is currently no export mechanism for it)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
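<br />
The column table above maps directly onto a record type. The following sketch reads one line into a named tuple, assuming tab-separated columns and pipe-separated values in the columns with cardinality "0 or greater" or "1 or greater"; the readme files remain the authority on delimiters, and the names <code>GpAssociation</code> and <code>parse_gp_association</code> are hypothetical.<br />
<br />
```python
from collections import namedtuple

# Field order follows columns 01-12 of the table above.
GpAssociation = namedtuple("GpAssociation", [
    "db", "db_object_id", "qualifier", "go_id", "references", "evidence_code",
    "with_", "extra_taxon_id", "date", "assigned_by",
    "annotation_extension", "spliceform_id",
])

# Columns whose cardinality allows more than one value (assumed pipe-separated).
MULTI_VALUED = {"qualifier", "references", "with_", "annotation_extension"}


def parse_gp_association(line):
    """Parse one tab-separated gp_association line into a GpAssociation."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GpAssociation._fields) - len(values))
    record = {}
    for field, value in zip(GpAssociation._fields, values):
        if field in MULTI_VALUED:
            record[field] = value.split("|") if value else []
        else:
            record[field] = value
    return GpAssociation(**record)
```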
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
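<br />
The "required?" column above lends itself to a simple completeness check. The sketch below lists the required gp_information columns that are empty in a given line; it assumes tab-separated input, and <code>GPI_COLUMNS</code> and <code>missing_required</code> are hypothetical names.<br />
<br />
```python
# (column name, required?) in the order given by the table above.
GPI_COLUMNS = [
    ("DB", True), ("DB_Subset", False), ("DB_Object_ID", True),
    ("DB_Object_Symbol", True), ("DB_Object_Name", False),
    ("DB_Object_Synonym(s)", False), ("DB_Object_Type", True),
    ("Taxon", True), ("Annotation_Target_Set", False),
    ("Annotation_Completed", False), ("Parent_Object_ID", False),
]


def missing_required(line):
    """Return the names of required columns that are empty in this line."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GPI_COLUMNS) - len(values))
    return [name for (name, required), value in zip(GPI_COLUMNS, values)
            if required and not value]
```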
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37739Gene Product Association Data (GPAD) Format (Archived)2011-10-26T18:00:50Z<p>Girlwithglasses: /* Proposed file format */</p>
<hr />
<div><br />
An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. Gene product data (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from annotation data: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology] and standardization of the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
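A parser can treat the header as the run of leading lines that start with an exclamation mark. The sketch below collects them and picks out the version declaration; it assumes the declaration has the form <code>!NAME-version: VALUE</code> as in the example above, and the function name <code>read_header</code> is hypothetical.<br />
<br />
```python
def read_header(lines):
    """Return (format_name, version, comments) from the leading '!' lines."""
    fmt = version = None
    comments = []
    for line in lines:
        if not line.startswith("!"):
            break  # the header ends at the first data line
        text = line[1:].strip()
        if fmt is None and "-version:" in text:
            # e.g. "gaf-version: 2.0" -> ("gaf", "2.0")
            name, _, value = text.partition("-version:")
            fmt, version = name.strip(), value.strip()
        else:
            comments.append(text)
    return fmt, version, comments
```
<br />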
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
<br />
==Proposed Gene Product Association Data (GPAD) file format ==<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
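<br />
On the reading side, a consumer would split the combined field back into a negation flag and a GO ID. A minimal sketch, assuming a single space separates <code>NOT</code> from the ID as in the examples on this page (the function name <code>parse_go_id_field</code> is hypothetical):<br />
<br />
```python
def parse_go_id_field(field):
    """Return (is_negated, go_id) for a field such as 'NOT GO:0003963'."""
    parts = field.split()
    if len(parts) == 2 and parts[0] == "NOT":
        return True, parts[1]
    if len(parts) == 1:
        return False, parts[0]
    # Anything else (empty field, extra tokens) is treated as malformed.
    raise ValueError(f"unexpected GO ID field: {field!r}")
```
<br />
Since the flag is extracted in the same step as the ID, code that reads the term cannot overlook the negation.<br />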
<br />
===Additional Data===<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
<br />
<br />
==Proposed Gene Product Information (GPI) file format ==<br />
<br />
Gene product data is stored separately from annotation data.<br />
<br />
===File Header===<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
===File Body===<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
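<br />
Reading the xrefs column then comes down to splitting on the pipe character. The sketch below assumes each entry has the form <code>DB:accession</code>, as the identifiers elsewhere on this page do; <code>parse_xrefs</code> is a hypothetical helper name.<br />
<br />
```python
def parse_xrefs(field):
    """Split a pipe-separated xrefs field into (db, accession) pairs."""
    xrefs = []
    for entry in field.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # an empty column yields no xrefs
        db, sep, accession = entry.partition(":")
        if not sep or not accession:
            raise ValueError(f"xref is not of the form DB:accession: {entry!r}")
        xrefs.append((db, accession))
    return xrefs
```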
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || NOT || GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store this information, but there is currently no mechanism for exporting it)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
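The column summary above can be turned into a small reader. The following is a minimal sketch, assuming the file is tab-separated like GAF and that every data line carries all twelve columns (empty strings for unused optional ones); the function name <code>parse_gp_association_line</code> is hypothetical and is not part of any GOA tooling.<br />
<br />
```python
from typing import Dict

# Column names in file order, copied from the summary table above.
GP_ASSOCIATION_COLUMNS = [
    "DB", "DB_Object_ID", "Qualifier", "GO ID", "DB:Reference(s)",
    "Evidence code", "With", "Extra taxon ID", "Date", "Assigned_by",
    "Annotation Extension", "Spliceform ID",
]


def parse_gp_association_line(line: str) -> Dict[str, str]:
    """Map one data line onto the twelve column names.

    Assumption: columns are tab-separated and none are omitted.
    """
    # Strip only the newline so that trailing empty columns survive.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GP_ASSOCIATION_COLUMNS):
        raise ValueError(
            f"expected {len(GP_ASSOCIATION_COLUMNS)} columns, got {len(fields)}"
        )
    return dict(zip(GP_ASSOCIATION_COLUMNS, fields))


# Values taken from the Angiomotin example elsewhere on this page.
example_line = "\t".join([
    "UniProtKB", "Q4VCS5", "", "GO:0005515", "PMID:16043488", "ECO:0000021",
    "UniProtKB:Q6RHR9-2", "", "20051207", "UniProtKB", "", "Q4VCS5-1",
]) + "\n"
record = parse_gp_association_line(example_line)
```
<br />
Here <code>record["GO ID"]</code> holds the term ID and <code>record["Spliceform ID"]</code> the isoform; the readme files linked above remain the authoritative description of the format.<br />
<br />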
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
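The "required?" column of the gp_information table lends itself to a simple check. This is a sketch only, assuming a row has already been split into its eleven fields; <code>missing_required</code> is a hypothetical helper name, and the check covers presence only, not the content of each field.<br />
<br />
```python
from typing import List

# (column name, required?) pairs in file order, from the table above.
GP_INFORMATION_COLUMNS = [
    ("DB", True), ("DB_Subset", False), ("DB_Object_ID", True),
    ("DB_Object_Symbol", True), ("DB_Object_Name", False),
    ("DB_Object_Synonym(s)", False), ("DB_Object_Type", True),
    ("Taxon", True), ("Annotation_Target_Set", False),
    ("Annotation_Completed", False), ("Parent_Object_ID", False),
]


def missing_required(fields: List[str]) -> List[str]:
    """Return the names of required columns that are empty or absent."""
    # Pad short rows so that absent trailing columns count as empty.
    padded = fields + [""] * (len(GP_INFORMATION_COLUMNS) - len(fields))
    return [
        name
        for (name, required), value in zip(GP_INFORMATION_COLUMNS, padded)
        if required and not value.strip()
    ]


complete_row = [
    "UniProtKB", "Swiss-Prot", "Q4VCS5", "AMOT", "Angiomotin",
    "AMOT|KIAA1071", "protein", "taxon:9606", "", "", "",
]
no_symbol_row = list(complete_row)
no_symbol_row[3] = ""
```
<br />
<code>missing_required(complete_row)</code> returns an empty list, while <code>missing_required(no_symbol_row)</code> reports <code>DB_Object_Symbol</code>.<br />
<br />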
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37738Gene Product Association Data (GPAD) Format (Archived)2011-10-26T17:54:50Z<p>Girlwithglasses: /* File Header */</p>
<hr />
<div><br />
The Gene Product Association Data (GPAD) format is an alternative means of exchanging annotations. It is designed to be more normalized than GAF and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to make annotation more flexible by allowing the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
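To give a feel for the GAF-to-new-format direction, here is a minimal sketch that splits one GAF 2.0 row into an annotation row and a gene product row, using the column orders proposed in the File Body tables on this page. It makes several assumptions: the row has already been split into 17 fields, the first taxon in column 13 belongs to the gene product and any second taxon is the interacting organism, evidence codes are passed through unchanged rather than mapped to ECO IDs, and the spliceform column is ignored. The name <code>split_gaf_row</code> is hypothetical.<br />
<br />
```python
from typing import List, Tuple


def split_gaf_row(gaf: List[str]) -> Tuple[List[str], List[str]]:
    """Split a 17-column GAF 2.0 row into (annotation row, gene product row)."""
    if len(gaf) != 17:
        raise ValueError(f"expected 17 GAF columns, got {len(gaf)}")

    def col(n: int) -> str:
        # GAF columns are numbered from 1 in the documentation.
        return gaf[n - 1]

    # Column 13 holds one taxon, or two (pipe-separated) for
    # multi-organism processes.
    taxa = col(13).split("|")
    own_taxon = taxa[0]
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    annotation = [
        col(1), col(2), col(4), col(5), col(6), col(7), col(8),
        interacting_taxon, col(14), col(15), col(16),
    ]
    # Parent GP ID and xrefs are not present in GAF, so they stay empty.
    gene_product = [
        col(1), col(2), col(12), own_taxon, col(3), col(10), col(11), "", "",
    ]
    return annotation, gene_product


# The RCL1 multi-organism row from the example on this page.
gaf_row = [
    "SGD", "S000005370", "RCL1", "", "GO:0006406", "SGD_REF:S000069956",
    "IC", "GO:0000346", "P", "aminodeoxychorismate synthase", "YOL010W",
    "gene", "taxon:4932|taxon:745953", "20030221", "SGD", "", "",
]
annotation_row, gene_product_row = split_gaf_row(gaf_row)
```
<br />
In a real converter the gene product rows would also need de-duplicating, since every annotation to the same gene product yields an identical gene product row.<br />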
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
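A parser can honour the header convention described above by skipping every line that begins with an exclamation mark and picking out the version declaration on the way. The sketch below assumes tab-separated data lines; <code>read_gaf</code> is a hypothetical name, not an existing GO tool.<br />
<br />
```python
from typing import Iterable, List, Optional, Tuple


def read_gaf(lines: Iterable[str]) -> Tuple[Optional[str], List[List[str]]]:
    """Return (declared gaf-version or None, list of data rows)."""
    version: Optional[str] = None
    rows: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("!"):
            # Header or comment line: only the version declaration is kept.
            if line.startswith("!gaf-version:"):
                version = line.split(":", 1)[1].strip()
            continue
        if line:
            rows.append(line.split("\t"))
    return version, rows


sample = [
    "!gaf-version: 2.0\n",
    "!Project_name: Schizosaccharomyces pombe GeneDB\n",
    "!\n",
    "SGD\tS000000296\tPHO3\t\tGO:0003993\n",
]
version, rows = read_gaf(sample)
```
<br />
The same loop works for the proposed formats if the prefix being matched is changed to <code>!gpad-version:</code> or <code>!gpi-version:</code>.<br />
<br />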
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
<br />
==Proposed file format==<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark. An example of a full file header:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
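The move of NOT from the qualifier column to the GO ID column could be sketched as follows. This is an illustration under stated assumptions: qualifiers are pipe-separated as in GAF, and the negated ID is written as <code>NOT</code>, a space, then the GO ID. The proposal only says "prefixed with NOT", so the exact separator is a guess, and <code>move_not_to_go_id</code> is a hypothetical name.<br />
<br />
```python
from typing import Tuple


def move_not_to_go_id(qualifier: str, go_id: str) -> Tuple[str, str]:
    """Return (qualifier without NOT, GO ID prefixed with NOT if negated)."""
    parts = [q for q in qualifier.split("|") if q]
    negated = any(q.upper() == "NOT" for q in parts)
    # Any other qualifiers (e.g. contributes_to) stay in the qualifier column.
    kept = [q for q in parts if q.upper() != "NOT"]
    # Assumption: a single space separates the NOT prefix from the ID.
    new_go_id = f"NOT {go_id}" if negated else go_id
    return "|".join(kept), new_go_id
```
<br />
For example, <code>move_not_to_go_id("NOT|contributes_to", "GO:0003963")</code> leaves <code>contributes_to</code> as the qualifier and marks the GO ID as negated.<br />
<br />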
====Additional Data====<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
===Gene Product Information===<br />
<br />
Gene product information is stored in a separate file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark:<br />
<br />
 !gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
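Reading a pipe-separated xrefs cell is straightforward. The sketch below assumes each entry has the form <code>DB:accession</code>; <code>split_xrefs</code> is a hypothetical helper, and the sample cell simply combines the two illustrative xrefs used in the example tables on this page.<br />
<br />
```python
from typing import List, Tuple


def split_xrefs(cell: str) -> List[Tuple[str, str]]:
    """Split a pipe-separated xrefs cell into (database, accession) pairs."""
    xrefs: List[Tuple[str, str]] = []
    for entry in cell.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # an empty cell means there are no xrefs
        # Split on the first colon only, in case the accession contains one.
        database, _, accession = entry.partition(":")
        xrefs.append((database, accession))
    return xrefs


pairs = split_xrefs("UniProt:NE92D8|UniProt:JN97D8")
```
<br />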
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || NOT || GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
<br />
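As an illustration of producing one file from the two, the sketch below rebuilds a 17-column GAF-style row from an 11-column annotation row and a lookup of 9-column gene product rows, following the proposed column orders on this page. It is a sketch under assumptions: the lookup is keyed on (DB, DB Object ID), the Aspect column (GAF col 9) is left empty because neither proposed file carries it and it would have to be derived from the ontology, and spliceforms are ignored. The names <code>join_to_gaf</code> and <code>GpKey</code> are hypothetical.<br />
<br />
```python
from typing import Dict, List, Tuple

GpKey = Tuple[str, str]  # (DB, DB Object ID)


def join_to_gaf(annotation: List[str],
                gene_products: Dict[GpKey, List[str]]) -> List[str]:
    """Rebuild a 17-column GAF-style row from the two proposed formats."""
    (db, obj_id, qualifier, go_id, refs, evidence, with_from,
     interacting_taxon, date, assigned_by, extension) = annotation
    (_db, _id, obj_type, taxon, symbol, name, synonyms,
     _parent, _xrefs) = gene_products[(db, obj_id)]

    # GAF col 13 holds the gene product taxon, then any interacting taxon.
    taxon_col = f"{taxon}|{interacting_taxon}" if interacting_taxon else taxon
    aspect = ""  # not present in either file; would come from the ontology
    spliceform = ""  # ignored in this sketch
    return [
        db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from,
        aspect, name, synonyms, obj_type, taxon_col, date, assigned_by,
        extension, spliceform,
    ]


gene_products = {
    ("SGD", "S000005370"): [
        "SGD", "S000005370", "gene", "taxon:4932", "RCL1",
        "aminodeoxychorismate synthase", "YOL010W", "", "",
    ],
}
annotation = [
    "SGD", "S000005370", "", "GO:0006406", "SGD_REF:S000069956",
    "IC", "GO:0000346", "taxon:745953", "20030221", "SGD", "",
]
gaf_row = join_to_gaf(annotation, gene_products)
```
<br />
The empty Aspect column shows one thing such a script cannot recover from the two files alone.<br />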
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37737Gene Product Association Data (GPAD) Format (Archived)2011-10-26T17:54:04Z<p>Girlwithglasses: /* File Header */</p>
<hr />
<div><br />
An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products, the objects being annotated, from the annotation data. The data related to gene products--symbol, name, synonyms, taxon--can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file begins with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. An example of a full file header could look like this:<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
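The header convention above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the function name <code>read_header</code> is hypothetical, and it assumes the version declaration always has the form <code>!name-version: value</code>.<br />
<br />
```python
def read_header(lines):
    """Return (format_name, version, comments) from the leading '!' lines.

    Hypothetical helper: assumes the version line looks like
    '!<name>-version: <value>', as in the examples above.
    """
    format_name = version = None
    comments = []
    for line in lines:
        if not line.startswith("!"):
            break  # first non-comment line: the header is over
        text = line[1:].strip()
        if format_name is None and "-version:" in text:
            name, _, ver = text.partition("-version:")
            format_name, version = name.strip(), ver.strip()
        else:
            comments.append(text)
    return format_name, version, comments


header = [
    "!gaf-version: 2.0",
    "!Project_name: Schizosaccharomyces pombe GeneDB",
    "!",
    "SGD\tS000000296\tPHO3",
]
print(read_header(header))
```
<br />
Stopping at the first line that does not start with an exclamation mark keeps the sketch simple; a stricter parser might also skip comment lines that appear later in the file.<br />
<br />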
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
<br />
==Proposed file format==<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
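<br />
A minimal sketch of that move, assuming that multiple qualifiers in the GAF Qualifier column are pipe-separated and that the prefix is written as <code>NOT</code> followed by a space (the proposal does not fix the separator). The function name <code>move_not_to_go_id</code> is hypothetical.<br />
<br />
```python
def move_not_to_go_id(qualifier, go_id):
    """Return (qualifier, go_id) with any 'NOT' moved onto the GO ID.

    Assumptions (not stated in the proposal): qualifiers are
    pipe-separated, and the negation prefix is 'NOT' plus one space.
    """
    parts = [q for q in qualifier.split("|") if q]
    negated = "NOT" in parts
    remaining = "|".join(q for q in parts if q != "NOT")
    return remaining, ("NOT " + go_id if negated else go_id)


print(move_not_to_go_id("NOT", "GO:0003963"))
print(move_not_to_go_id("NOT|contributes_to", "GO:0003963"))
```
<br />
Other qualifiers stay in the Qualifier column; only the negation changes place.<br />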
<br />
====Additional Data====<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
===Gene Product Information===<br />
<br />
Gene product information is stored in a separate file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark:<br />
<br />
 !gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Current GAF 2.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
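<br />
The evidence codes in the association table are ECO IDs in place of the three-letter GAF codes. The lookup below is a sketch built only from the pairs that appear in the two example tables on this page; it is not an authoritative GAF-to-ECO mapping, and the names <code>EXAMPLE_ECO_IDS</code> and <code>to_eco</code> are hypothetical.<br />
<br />
```python
# GAF evidence code -> ECO ID, taken from the example rows on this page only.
EXAMPLE_ECO_IDS = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def to_eco(code):
    """Translate a GAF evidence code, failing loudly on unknown codes."""
    try:
        return EXAMPLE_ECO_IDS[code]
    except KeyError:
        raise ValueError(f"no ECO ID known for evidence code {code!r}") from None


print(to_eco("IDA"))
```
<br />
Raising an error for an unknown code, rather than passing it through, avoids writing a GAF code into a column that is meant to hold ECO IDs.<br />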
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
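<br />
As a rough sketch of the GAF-splitting direction, the function below rearranges one tab-separated GAF 2.0 row into an association row and a gene product row, following the "old column #" and "GAF 2.0 col #" entries in the proposed file body tables. It is an illustration under stated assumptions: the name <code>split_gaf_row</code> is hypothetical, the first taxon in column 13 is taken to be the gene product's own, and 'NOT' qualifiers, evidence code translation and spliceforms (col 17) are left untouched.<br />
<br />
```python
def split_gaf_row(line):
    """Split one GAF 2.0 row into (gpad_row, gpi_row) column lists.

    Hypothetical helper. Assumes tab-separated input; does not handle
    'NOT' qualifiers, ECO translation or spliceforms (col 17).
    """
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))  # pad missing optional trailing columns

    def col(n):
        return cols[n - 1]  # the tables number columns from 1

    # Column 13 may hold 'taxon:A|taxon:B'; B is the interacting taxon.
    taxon, _, interacting = col(13).partition("|")

    gpad = [col(1), col(2), col(4), col(5), col(6), col(7), col(8),
            interacting, col(14), col(15), col(16)]
    # Parent GP ID and xrefs have no GAF source column, so they stay empty.
    gpi = [col(1), col(2), col(12), taxon, col(3), col(10), col(11), "", ""]
    return gpad, gpi


gaf_row = "\t".join([
    "SGD", "S000005370", "RCL1", "", "GO:0006406", "SGD_REF:S000069956",
    "IC", "GO:0000346", "P", "aminodeoxychorismate synthase", "YOL010W",
    "gene", "taxon:4932|taxon:745953", "20030221", "SGD", "", "",
])
gpad_row, gpi_row = split_gaf_row(gaf_row)
print(gpad_row)
print(gpi_row)
```
<br />
A full converter would also need to write each gene product row only once, since every annotation to the same gene product yields an identical row.<br />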
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
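<br />
The gp_information column list can be turned into a small reader along the following lines. This is a sketch, not the GOA implementation: it assumes the file is tab-separated and that multi-valued columns are pipe-separated (as in the example content above), and the names <code>GP_INFORMATION_COLUMNS</code> and <code>parse_gp_information_line</code> are hypothetical. The sample row is assembled from the "Example content" column purely for illustration, with Swiss-Prot picked arbitrarily for DB_Subset.<br />
<br />
```python
GP_INFORMATION_COLUMNS = [
    "DB", "DB_Subset", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Annotation_Target_Set",
    "Annotation_Completed", "Parent_Object_ID",
]
# Columns whose cardinality is "0 or greater" in the table above.
MULTI_VALUED = {"DB_Object_Synonym", "Annotation_Target_Set"}


def parse_gp_information_line(line):
    """Map one gp_information row onto the column names in the table.

    Assumes tab-separated columns and pipe-separated multiple values.
    """
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GP_INFORMATION_COLUMNS) - len(values))
    record = {}
    for name, value in zip(GP_INFORMATION_COLUMNS, values):
        if name in MULTI_VALUED:
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    return record


sample = "\t".join([
    "UniProtKB", "Swiss-Prot", "Q4VCS5", "AMOT", "Angiomotin",
    "AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN",
    "protein", "taxon:9606", "BHF-UCL|KRUK|Reference Genome", "", "",
])
record = parse_gp_information_line(sample)
print(record["DB_Object_ID"], record["Annotation_Target_Set"])
```
<br />
Empty optional columns come back as an empty string or an empty list, so callers can test them without checking for missing keys.<br />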
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37736Gene Product Association Data (GPAD) Format (Archived)2011-10-26T17:53:14Z<p>Girlwithglasses: /* File Header */</p>
<hr />
<div><br />
An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology] and by standardizing the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file should begin with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. E.g.<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
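The header convention above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the function name <code>read_header</code> is hypothetical, and the rule used to tell <code>!key: value</code> lines from free-text comments (no space directly after the exclamation mark) is an assumption drawn from the example header, not a stated rule.<br />
<br />
```python
def read_header(lines):
    """Separate '!' header lines from data lines.

    Returns (version, metadata, body), where version is a
    (format_name, version_string) tuple or None.
    """
    version = None
    metadata = {}
    body = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("!"):
            text = line[1:]
            # Assumption: "!key: value" lines have no space after "!",
            # whereas free-text comments start with "! ".
            if text and not text[0].isspace() and ":" in text:
                key, value = (part.strip() for part in text.split(":", 1))
                if key.endswith("-version"):
                    version = (key[: -len("-version")], value)
                else:
                    metadata[key] = value
            continue  # parsers ignore every other "!" line
        if line.strip():
            body.append(line)
    return version, metadata, body


if __name__ == "__main__":
    demo = [
        "!gaf-version: 2.0",
        "!Project_name: Schizosaccharomyces pombe GeneDB",
        "! free-text remark",
        "SGD\tS000000296\tPHO3",
    ]
    print(read_header(demo))
```
<br />
Because the version line is read as a generic <code>name-version</code> key, the same sketch would also accept the <code>gpad-version</code> and <code>gpi-version</code> declarations proposed further down this page.<br />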
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
==Proposed file format==<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
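<br />
A minimal sketch of that move, assuming the GAF convention of pipe-separated qualifiers. The function name <code>move_not_to_go_id</code> is hypothetical, and the single space between <code>NOT</code> and the GO ID is an assumption, since the proposal does not fix a separator.<br />
<br />
```python
def move_not_to_go_id(qualifier_field, go_id):
    """Return (qualifier_field, go_id) with any NOT moved onto the GO ID.

    qualifier_field is the raw GAF column 4 value (pipe-separated,
    possibly empty); go_id is the raw GAF column 5 value.
    """
    qualifiers = [q for q in qualifier_field.split("|") if q]
    negated = any(q.upper() == "NOT" for q in qualifiers)
    remaining = [q for q in qualifiers if q.upper() != "NOT"]
    if negated:
        # Assumption: "NOT" and the GO ID are joined by a single space.
        go_id = "NOT " + go_id
    return "|".join(remaining), go_id


if __name__ == "__main__":
    print(move_not_to_go_id("NOT|contributes_to", "GO:0003963"))
    print(move_not_to_go_id("", "GO:0006406"))
```
<br />
Other qualifiers stay in the Qualifier column, so a consumer that only reads the GO ID column can no longer miss the negation.<br />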
<br />
====Additional Data====<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
===Gene Product Information===<br />
<br />
Gene product information is stored in a separate file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark:<br />
<br />
 !gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
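<br />
The two tables above can be read as a recipe for deriving a gene product information row from a 17-column GAF row. The Python sketch below follows that reading. The function name <code>gpi_record</code> is hypothetical; writing the parent as <code>DB:ID</code> follows the example table later on this page rather than a stated rule; and emitting only the spliceform record when column 17 is filled is an assumption based on the note that column 12 then describes the spliceform.<br />
<br />
```python
def gpi_record(gaf_cols):
    """Build one gene product information row from a GAF 2.0 row.

    gaf_cols is a list of the GAF column values in file order.
    Output order follows the table above: DB, DB Object ID,
    DB Object Type, Taxon, Symbol, Name, Synonym(s), Parent GP ID, Xrefs.
    """
    # 1-based lookup so the numbers match the "GAF 2.0 col #" column.
    c = dict(enumerate(gaf_cols, start=1))
    spliceform = c.get(17, "")
    # Column 13 may hold two taxa; the first belongs to the gene product.
    own_taxon = c[13].split("|")[0]
    if spliceform:
        object_id = spliceform
        parent = f"{c[1]}:{c[2]}"  # assumption: parent written as DB:ID
    else:
        object_id = c[2]
        parent = ""
    xrefs = ""  # not derivable from GAF; would come from gp2protein
    return [c[1], object_id, c[12], own_taxon, c[3], c[10], c[11], parent, xrefs]


if __name__ == "__main__":
    row = ["UniProtKB", "Q4VCS5", "AMOT_HUMAN", "", "GO:0043532",
           "PMID:16043488", "IDA", "", "F", "AMOT, KIAA1071: Angiomotin",
           "IPI00163085", "protein", "taxon:9606", "20051207",
           "UniProtKB", "", "Q4VCS5-2"]
    print("\t".join(gpi_record(row)))
```
<br />
The xrefs field is left empty here because a GAF row carries no cross-references; any values would have to be joined in from another source, pipe-separated as described above.<br />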
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 2.0 Format===<br />
<br />
The following example, taken from the page http://geneontology.org/GO.annotation.fields.shtml, shows the current GAF file structure (shaded background: annotation data; blue text: gene product data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || NOT || GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
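The association table above replaces each three-letter evidence code with an Evidence Code Ontology ID. The Python sketch below records only the pairings that appear in the two example tables on this page. The names <code>EXAMPLE_ECO</code> and <code>to_eco</code> are hypothetical, and the mapping should not be taken as an authoritative or complete GO-to-ECO translation.<br />
<br />
```python
# Pairings read off the example tables on this page; illustrative only.
EXAMPLE_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def to_eco(evidence_code):
    """Translate a GAF evidence code using the example pairings above."""
    try:
        return EXAMPLE_ECO[evidence_code]
    except KeyError:
        # Fail loudly rather than guess an ID for an unlisted code.
        raise ValueError(
            f"no ECO ID listed for evidence code {evidence_code!r}"
        ) from None


if __name__ == "__main__":
    for code in ("IMP", "IC", "IPI"):
        print(code, to_eco(code))
```
<br />
Raising an error for unlisted codes keeps a converter from silently writing an empty or invented evidence column.<br />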
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
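<br />
As a rough illustration of such a script, the Python sketch below splits GAF rows into association rows and a de-duplicated set of gene product rows, using the "old column #" mappings from the tables above. The function name <code>split_gaf</code> is hypothetical. For brevity the sketch leaves the qualifier and evidence code untouched and ignores spliceforms (column 17). Note that Aspect (column 9) appears in neither proposed table, so the reverse conversion would presumably need to look the aspect up from the GO ID.<br />
<br />
```python
def split_gaf(rows):
    """Split GAF 2.0 rows into (association_rows, gene_product_rows).

    rows is an iterable of lists of column values in file order.
    Gene product data is kept once per (DB, DB_Object_ID) pair.
    """
    associations = []
    gene_products = {}  # insertion-ordered, keyed by (DB, DB_Object_ID)
    for cols in rows:
        cols = list(cols) + [""] * (17 - len(cols))  # pad short rows
        c = dict(enumerate(cols, start=1))           # 1-based, as in the tables
        taxa = c[13].split("|")
        interacting_taxon = taxa[1] if len(taxa) > 1 else ""
        associations.append([
            c[1], c[2], c[4], c[5], c[6], c[7], c[8],
            interacting_taxon, c[14], c[15], c[16],
        ])
        # First occurrence wins; later rows repeat the same gene product data.
        gene_products.setdefault(
            (c[1], c[2]),
            [c[1], c[2], c[12], taxa[0], c[3], c[10], c[11]],
        )
    return associations, list(gene_products.values())


if __name__ == "__main__":
    demo = [
        ["SGD", "S000000296", "PHO3", "", "GO:0003993", "SGD_REF:S000047763",
         "IMP", "", "F", "acid phosphatase", "YBR092C", "gene",
         "taxon:4932", "20010118", "SGD", "", ""],
        ["SGD", "S000000296", "PHO3", "", "GO:0006796", "SGD_REF:S000047115",
         "TAS", "", "P", "acid phosphatase", "YBR092C", "gene",
         "taxon:4932", "20041220", "SGD", "", ""],
    ]
    assoc, gps = split_gaf(demo)
    print(len(assoc), "association rows,", len(gps), "gene product rows")
```
<br />
Keying the gene product rows on DB and DB_Object_ID is what removes the repetition described under "Why?": each gene product is written once, however many annotations it has.<br />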
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37735Gene Product Association Data (GPAD) Format (Archived)2011-10-26T17:52:37Z<p>Girlwithglasses: /* File Body */</p>
<hr />
<div><br />
An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology] and by standardizing the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file should begin with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. E.g.<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
<br />
==Proposed file format==<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation XP ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
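
The move of NOT described in the note above could be sketched as follows. This is a minimal illustration, not part of the proposal: the function name `move_not_to_go_id` is hypothetical, and both the space between `NOT` and the GO ID and the pipe used to join the remaining qualifiers are assumptions, since the page only says the GO ID is "prefixed with NOT".

```python
def move_not_to_go_id(qualifiers, go_id):
    """Return (qualifier_column, go_id_column) for a GPAD-style row.

    `qualifiers` is the list of values from the GAF Qualifier column (col 4).
    Any NOT is removed from that list and prefixed to the GO ID instead.
    The "NOT <id>" layout and the pipe separator are assumptions.
    """
    remaining = [q for q in qualifiers if q.upper() != "NOT"]
    negated = len(remaining) != len(qualifiers)
    go_column = f"NOT {go_id}" if negated else go_id
    return "|".join(remaining), go_column
```

With this layout a consumer that reads only the GO ID column still sees the negation, which is the point the note makes.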
<br />
====Additional Data====<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
===Gene Product Information===<br />
<br />
Gene product information is stored in a separate file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Current GAF 2.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || NOT || GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
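
A rough sketch of the splitting direction is given below. It assumes each GAF row has already been read into a list of 17 strings in GAF 2.0 column order, and that the second value in the taxon column is the interacting taxon. The function name `split_gaf` is hypothetical. Evidence codes are copied as-is rather than mapped to ECO, and the handling of NOT and of column 17 is left out to keep the example short.

```python
def split_gaf(rows):
    """Split GAF 2.0 rows into association rows and gene product rows.

    Each input row is a list of column values in GAF order (col 1 first).
    Gene product rows are emitted once per (DB, DB_Object_ID) pair.
    """
    associations = []
    products = {}  # keyed by (DB, DB_Object_ID); dicts keep insertion order
    for row in rows:
        row = list(row) + [""] * (17 - len(row))  # pad missing trailing columns
        taxa = row[12].split("|")
        own_taxon = taxa[0]
        interacting_taxon = taxa[1] if len(taxa) > 1 else ""
        # Association columns, in the order of the proposed GPAD table.
        associations.append([
            row[0], row[1], row[3], row[4], row[5], row[6], row[7],
            interacting_taxon, row[13], row[14], row[15],
        ])
        # Gene product columns, in the order of the proposed GPI table;
        # Parent GP ID and xrefs are not present in GAF and stay blank.
        products.setdefault((row[0], row[1]), [
            row[0], row[1], row[11], own_taxon, row[2], row[9], row[10], "", "",
        ])
    return associations, list(products.values())
```

Going the other way is a join on DB and DB_Object_ID. The Aspect column (col 9) is not carried in either new file, so a converter back to GAF would need to look it up from the ontology using the GO ID.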
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would cost the GOC less in processing time and storage space. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
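
The required/cardinality columns of the table above can be turned into a simple row check. The sketch below is illustrative only: the function name `check_gp_association_row` is hypothetical, and it assumes that a row has been split into its 12 fields and that multiple values within a field are pipe-separated. The gp_association_readme remains the authoritative description.

```python
# (column name, required?, may hold more than one value?) taken from the table above
GP_ASSOCIATION_COLUMNS = [
    ("DB", True, False),
    ("DB_Object_ID", True, False),
    ("Qualifier", False, True),
    ("GO ID", True, False),
    ("DB:Reference(s)", True, True),
    ("Evidence code", True, False),
    ("With", False, True),
    ("Extra taxon ID", False, False),
    ("Date", True, False),
    ("Assigned_by", True, False),
    ("Annotation Extension", False, True),
    ("Spliceform ID", False, False),
]


def check_gp_association_row(fields):
    """Return a list of problems found in one gp_association row (empty if none)."""
    expected = len(GP_ASSOCIATION_COLUMNS)
    if len(fields) != expected:
        return [f"expected {expected} columns, found {len(fields)}"]
    problems = []
    for (name, required, multiple), value in zip(GP_ASSOCIATION_COLUMNS, fields):
        values = [v for v in value.split("|") if v]  # pipe separator is an assumption
        if required and not values:
            problems.append(f"{name}: missing required value")
        if not multiple and len(values) > 1:
            problems.append(f"{name}: at most one value allowed")
    return problems
```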
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
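
Because the two files share only DB and DB_Object_ID, a consumer has to join them to recover, say, the symbol for an annotated object. A minimal sketch of that join follows, assuming rows are already split into fields in the column orders of the two tables above. The function names `index_gene_products` and `symbol_for` are hypothetical.

```python
def index_gene_products(gp_information_rows):
    """Index gp_information rows by (DB, DB_Object_ID).

    In gp_information, DB is column 01 and DB_Object_ID is column 03.
    """
    return {(row[0], row[2]): row for row in gp_information_rows}


def symbol_for(association_row, index):
    """Look up DB_Object_Symbol (gp_information column 04) for an association.

    In gp_association, DB is column 01 and DB_Object_ID is column 02.
    Returns None when the gene product is not in the index.
    """
    info = index.get((association_row[0], association_row[1]))
    return info[3] if info else None
```

Building the index once and reusing it for every association row avoids rescanning the gp_information file, which matters for files the size of the UniProt set.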
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37734Gene Product Association Data (GPAD) Format (Archived)2011-10-26T17:46:07Z<p>Girlwithglasses: /* Proposed file format */</p>
<hr />
<div><br />
An alternative means of exchanging annotations. The GPAD format is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file should begin with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. E.g.<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
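
A parser can separate these header lines from the body with a few lines of code. The sketch below is one possible approach, not a reference implementation: the function name `read_header` is hypothetical, and it accepts the `gaf`, `gpad` and `gpi` version declarations mentioned on this page.

```python
import re

# Matches the format declaration, e.g. "!gaf-version: 2.0"
VERSION_LINE = re.compile(r"!(gaf|gpad|gpi)-version:\s*(\S+)")


def read_header(lines):
    """Split lines into (format declaration, comments, body lines).

    The declaration is a (format, version) tuple, or None if absent.
    Lines starting with "!" are comments; empty comment lines are dropped.
    """
    declaration = None
    comments = []
    body = []
    for line in lines:
        line = line.rstrip("\n")
        match = VERSION_LINE.match(line)
        if match and declaration is None:
            declaration = (match.group(1), match.group(2))
        elif line.startswith("!"):
            text = line[1:].strip()
            if text:
                comments.append(text)
        elif line.strip():
            body.append(line)
    return declaration, comments, body
```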
<br />
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
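
To make the column layout concrete, the sketch below reads one body line into a dictionary keyed by the column names in the table above. It assumes the line is tab-delimited, that multi-valued columns are pipe-separated, and that files may omit the two optional trailing columns. The function name `parse_gaf_line` and the underscored key names are illustrative choices, not part of the format.

```python
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence_code", "With_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type", "taxon",
    "Date", "Assigned_by", "Annotation_cross_products", "Spliceform",
]

# Columns whose cardinality in the table is "0 or greater", "1 or greater" or "1 or 2"
MULTI_VALUED = {"Qualifier", "DB_Reference", "With_From", "DB_Object_Synonym", "taxon"}


def parse_gaf_line(line):
    """Parse one tab-delimited GAF body line into a dict of column values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 15:
        raise ValueError(f"expected at least 15 columns, found {len(fields)}")
    fields += [""] * (len(GAF_COLUMNS) - len(fields))  # pad optional cols 16 and 17
    record = {}
    for name, value in zip(GAF_COLUMNS, fields):
        if name in MULTI_VALUED:
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    return record
```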
<br />
==Proposed file format==<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpad-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark:<br />
<br />
!gpad-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation Extension ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
<br />
====Additional Data====<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
===Gene Product Information===<br />
<br />
Gene product information is stored in a separate file.<br />
<br />
====File Header====<br />
<br />
The file starts with a line declaring the file format:<br />
<br />
!gpi-version: 1.0<br />
<br />
Further information or remarks should be prefixed by an exclamation mark:<br />
<br />
!gpi-version: 1.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
====File Body====<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
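<br />
As an illustration of the layout in the tables above, a row of the gene product file could be read as follows. This is a sketch under stated assumptions: the names GeneProduct and parse_gpi_row are hypothetical, the file is assumed to be tab-delimited with the nine columns in the order listed above, and synonyms are assumed to be pipe-separated in the same way as xrefs.<br />
<br />
```python
from typing import List, NamedTuple


class GeneProduct(NamedTuple):
    db: str
    object_id: str
    object_type: str
    taxon: str
    symbol: str
    name: str
    synonyms: List[str]
    parent_id: str      # blank unless the entry is a spliceform
    xrefs: List[str]


def _split_pipes(field):
    """Split a pipe-separated field; an empty field gives an empty list."""
    return [item for item in field.split("|") if item]


def parse_gpi_row(line):
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    db, object_id, object_type, taxon, symbol, name, synonyms, parent, xrefs = cols
    return GeneProduct(db, object_id, object_type, taxon, symbol, name,
                       _split_pipes(synonyms), parent, _split_pipes(xrefs))
```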
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || NOT || GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
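<br />
In the association table above, the three-letter GAF evidence codes are replaced by ECO identifiers. The lookup below simply records the pairs as they appear in this example (for instance IMP in the old table and ECO:0000015 in the new one). It is a sketch only: the names EXAMPLE_ECO and to_eco are hypothetical, and the identifiers have not been checked against a current ECO release.<br />
<br />
```python
# Pairs taken from the example tables on this page, not from ECO itself.
EXAMPLE_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def to_eco(evidence_code):
    """Translate a GAF evidence code using the example mapping above."""
    try:
        return EXAMPLE_ECO[evidence_code]
    except KeyError:
        raise ValueError(f"no example mapping for {evidence_code!r}") from None
```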
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
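<br />
A minimal sketch of the GAF-to-two-files direction is shown below. It is an illustration, not an agreed implementation: the function name split_gaf is hypothetical, the input is assumed to be tab-delimited GAF 2.0 with '!' comment lines, evidence codes and qualifiers are copied unchanged, the Aspect column is dropped because the proposed association table has no such column, and the xrefs column is left blank because GAF carries no such data. Rows with a spliceform in col 17 produce a gene product entry for the spliceform, with col 2 as its parent.<br />
<br />
```python
def split_gaf(lines):
    """Split GAF 2.0 lines into (association rows, gene product rows)."""
    associations = []
    gene_products = {}  # keyed by (DB, object ID) so each entry is stated once
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        cols = line.split("\t")
        cols += [""] * (17 - len(cols))  # pad missing trailing columns
        (db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from,
         _aspect, name, synonyms, obj_type, taxon, date, assigned_by,
         extension, spliceform) = cols[:17]
        # Col 13 may hold two taxa; the second is the interacting organism.
        taxa = taxon.split("|")
        interacting = taxa[1] if len(taxa) > 1 else ""
        associations.append([db, obj_id, qualifier, go_id, refs, evidence,
                             with_from, interacting, date, assigned_by,
                             extension, spliceform])
        if spliceform:
            # Col 12 describes the spliceform when col 17 is present.
            key, parent = (db, spliceform), f"{db}:{obj_id}"
        else:
            key, parent = (db, obj_id), ""
        gene_products.setdefault(
            key, [db, key[1], obj_type, taxa[0], symbol, name, synonyms,
                  parent, ""])
    return associations, list(gene_products.values())
```
<br />
Keeping gene products in a dictionary means that repeated annotations to the same object contribute only one gene product row, which is the redundancy the split is meant to remove.<br />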
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes Google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
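<br />
Given the column list above, a gp_association file could be read into named fields as sketched here. This is an assumption-laden illustration: the names GPA_COLUMNS and read_gp_association are hypothetical, and the file is assumed to be tab-delimited with '!' comment lines; the readme linked above remains the authoritative description.<br />
<br />
```python
import csv
import io

GPA_COLUMNS = [
    "DB", "DB_Object_ID", "Qualifier", "GO ID", "DB:Reference(s)",
    "Evidence code", "With", "Extra taxon ID", "Date", "Assigned_by",
    "Annotation Extension", "Spliceform ID",
]


def read_gp_association(handle):
    """Yield one dict per data line, keyed by the column names above."""
    data = (line for line in handle if not line.startswith("!"))
    reader = csv.DictReader(data, fieldnames=GPA_COLUMNS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Usage with an in-memory example built from the tables on this page.
sample = io.StringIO(
    "!example comment line\n"
    + "\t".join(["UniProtKB", "Q4VCS5", "", "GO:0005515", "PMID:16043488",
                 "ECO:0000021", "UniProtKB:Q6RHR9-2", "", "20051207",
                 "UniProtKB", "", "Q4VCS5-1"])
    + "\n"
)
rows = list(read_gp_association(sample))
print(rows[0]["GO ID"], rows[0]["Spliceform ID"])
```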
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
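<br />
The "GAF column" entries in the two tables indicate how the files can be joined back into GAF-style rows on DB and DB_Object_ID. The sketch below illustrates this under stated assumptions: the names index_gp_information and to_gaf_row are hypothetical, rows are assumed to be lists of strings in the column order shown above, and the Aspect column is left blank because neither file carries it (it would have to be looked up from the ontology).<br />
<br />
```python
def index_gp_information(rows):
    """Map (DB, DB_Object_ID) to the full 11-column gp_information row."""
    return {(row[0], row[2]): row for row in rows}


def to_gaf_row(assoc, gp_index):
    """Combine a 12-column gp_association row with its gene product data."""
    (db, obj_id, qualifier, go_id, refs, evidence, with_, extra_taxon,
     date, assigned_by, extension, spliceform) = assoc
    gp = gp_index[(db, obj_id)]
    symbol, name, synonyms, obj_type, taxon = gp[3], gp[4], gp[5], gp[6], gp[7]
    if extra_taxon:
        # GAF col 13 holds both taxa, pipe-separated.
        taxon = f"{taxon}|{extra_taxon}"
    aspect = ""  # not present in either file
    return [db, obj_id, symbol, qualifier, go_id, refs, evidence, with_,
            aspect, name, synonyms, obj_type, taxon, date, assigned_by,
            extension, spliceform]
```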
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37732Gene Product Association Data (GPAD) Format (Archived)2011-10-26T16:55:11Z<p>Girlwithglasses: /* Current Association File Format */</p>
<hr />
<div><br />
GPAD is an alternative means of exchanging annotations. The format is designed to be more normalized than GAF and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
===File Header===<br />
<br />
The gene association file should begin with a line declaring the format version as follows:<br />
<br />
!gaf-version: 2.0<br />
<br />
Any comments or information should be preceded by an exclamation mark, indicating that parsers should ignore that line. E.g.<br />
<br />
!gaf-version: 2.0<br />
!CVS Version: Revision: 1.134 $<br />
!GOC Validation Date: 08/26/2009 $<br />
!Submission Date: 8/26/2009<br />
!<br />
! The above "Submission Date" is when the annotation project provided<br />
! this file to the Gene Ontology Consortium (GOC). The "GOC Validation<br />
! Date" indicates when this file was last changed as a result of a GOC<br />
! validation and filtering process. The "CVS Version" above is the<br />
! GOC version of this file.<br />
!<br />
!<br />
!Project_name: Schizosaccharomyces pombe GeneDB<br />
!URL: www.genedb.org/genedb/pombe<br />
!Contact Email: val@sanger.ac.uk<br />
!<br />
<br />
<br />
===File Body===<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
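<br />
The "required?" column of this table can be turned into a simple check on a GAF row. The following is a sketch only: the names REQUIRED_COLUMNS and missing_required are hypothetical, the row is assumed to have already been split on tabs, and the check covers presence of required values, not their content.<br />
<br />
```python
# 1-based numbers of the columns marked "required" in the table above.
REQUIRED_COLUMNS = (1, 2, 3, 5, 6, 7, 9, 12, 13, 14, 15)


def missing_required(cols):
    """Return the 1-based numbers of required columns that are empty."""
    padded = list(cols) + [""] * (17 - len(cols))
    return [n for n in REQUIRED_COLUMNS if not padded[n - 1].strip()]


row = ["SGD", "S000000296", "PHO3", "", "GO:0003993", "SGD_REF:S000047763",
       "IMP", "", "F", "acid phosphatase", "YBR092C", "gene", "taxon:4932",
       "20010118", "SGD", "", ""]
print(missing_required(row))
```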
<br />
==Proposed file format==<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation Extension ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
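<br />
A parser reading the combined "(NOT) GO ID" column would need to separate the negation from the identifier. A minimal sketch follows; the function name parse_go_id_field is hypothetical, and it assumes that 'NOT' and the GO ID are separated by whitespace, which the proposal does not specify.<br />
<br />
```python
def parse_go_id_field(field):
    """Split a '(NOT) GO ID' field into (negated, go_id)."""
    parts = field.split()
    if len(parts) == 2 and parts[0] == "NOT":
        return True, parts[1]
    if len(parts) == 1:
        return False, parts[0]
    # Anything else (empty field, extra tokens) is treated as an error
    # so that a negation can never be silently dropped.
    raise ValueError(f"unexpected GO ID field: {field!r}")
```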
<br />
====Additional Data====<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
===Gene Product Information===<br />
<br />
Gene product information would be stored in a separate file.<br />
<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || NOT || GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
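<br />
As a rough illustration of the "two files in, one file out" direction, the sketch below joins association rows with gene product rows on (DB, DB Object ID) and emits GAF-style rows. The column orders follow the tables on this page; the function name <code>merge_to_gaf</code> and the caller-supplied <code>aspect_of</code> lookup are assumptions made for this example (the aspect is not carried in the association file, so it has to come from the ontology). Evidence codes are passed through unchanged, and column 17 is left out for brevity.<br />
<br />
```python
# Hypothetical helper: rebuild GAF-style rows from the two proposed files.
# Column orders follow the tables on this page; names are illustrative only.

def merge_to_gaf(gpad_rows, gpi_rows, aspect_of):
    """Join association rows with gene product rows on (DB, DB Object ID)."""
    # Index gene product information by its two-part key.
    gp_index = {(row[0], row[1]): row for row in gpi_rows}
    gaf_rows = []
    for assoc in gpad_rows:
        (db, obj_id, qualifier, go_id, refs, evidence, with_from,
         interacting_taxon, date, assigned_by, extension) = assoc[:11]
        (_db, _id, obj_type, taxon, symbol, name,
         synonyms) = gp_index[(db, obj_id)][:7]
        # GAF column 13 holds the organism taxon first, then the
        # interacting taxon (if any), separated by a pipe.
        taxon_col = "taxon:" + taxon
        if interacting_taxon:
            taxon_col += "|" + interacting_taxon
        gaf_rows.append([
            db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from,
            aspect_of(go_id), name, synonyms, obj_type, taxon_col, date,
            assigned_by, extension,
        ])
    return gaf_rows
```
<br />
A real converter would also need to translate ECO IDs back to GAF evidence codes and to handle spliceform rows, neither of which is attempted here.<br />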
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; because the new formats mirror the database structure more closely, loading should become simpler rather than harder.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, submitting groups will need to produce two: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However could we extend the proposed format of this file further, to additionally include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is no current export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
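<br />
The column table above can be turned into a small validating parser. This is only a sketch: it assumes the file is tab-delimited and that multi-valued columns are pipe-separated, as in GAF, and the names <code>GP_ASSOCIATION_COLUMNS</code> and <code>parse_gp_association</code> are made up for this example. The gp_association_readme remains the authority on the actual format.<br />
<br />
```python
# Column layout copied from the table above: (name, required, multi-valued).
# Tab delimiters and pipe-separated multi-values are assumptions.
GP_ASSOCIATION_COLUMNS = [
    ("DB", True, False),
    ("DB_Object_ID", True, False),
    ("Qualifier", False, True),
    ("GO ID", True, False),
    ("DB:Reference(s)", True, True),
    ("Evidence code", True, False),
    ("With", False, True),
    ("Extra taxon ID", False, False),
    ("Date", True, False),
    ("Assigned_by", True, False),
    ("Annotation Extension", False, True),
    ("Spliceform ID", False, False),
]


def parse_gp_association(line):
    """Parse one data line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GP_ASSOCIATION_COLUMNS):
        raise ValueError("expected %d columns, got %d"
                         % (len(GP_ASSOCIATION_COLUMNS), len(fields)))
    record = {}
    for (name, required, multi), value in zip(GP_ASSOCIATION_COLUMNS, fields):
        if required and not value:
            raise ValueError("missing required column: " + name)
        # Multi-valued columns become lists; empty columns become [].
        record[name] = [v for v in value.split("|") if v] if multi else value
    return record
```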
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
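<br />
One use of the Parent_Object_ID column is to group isoforms under their parent entry. The sketch below reads gp_information lines into dicts keyed by the column names in the table above and builds a parent-to-children index. It assumes tab-delimited lines and treats lines starting with "!" as comments (as in GAF); both points, and the function names, are assumptions for illustration.<br />
<br />
```python
# Field names copied from the gp_information table above.
GP_INFORMATION_FIELDS = [
    "DB", "DB_Subset", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym(s)", "DB_Object_Type", "Taxon",
    "Annotation_Target_Set", "Annotation_Completed", "Parent_Object_ID",
]


def read_gp_information(lines):
    """Turn tab-delimited lines into a list of dicts, skipping comments."""
    records = []
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # assumed comment convention, borrowed from GAF
        values = line.rstrip("\n").split("\t")
        # Pad rows whose trailing empty columns were dropped.
        values += [""] * (len(GP_INFORMATION_FIELDS) - len(values))
        records.append(dict(zip(GP_INFORMATION_FIELDS, values)))
    return records


def children_by_parent(records):
    """Map each Parent_Object_ID to the DB:ID strings of its children."""
    index = {}
    for rec in records:
        parent = rec["Parent_Object_ID"]
        if parent:
            child = rec["DB"] + ":" + rec["DB_Object_ID"]
            index.setdefault(parent, []).append(child)
    return index
```
<br />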
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37731Gene Product Association Data (GPAD) Format (Archived)2011-10-26T16:47:51Z<p>Girlwithglasses: /* Proposed new format */</p>
<hr />
<div><br />
The GPAD format is an alternative means of exchanging annotations. It is designed to be more normalized than GAF, and is intended to work in conjunction with a separate format for exchanging gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the annotation data: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
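<br />
To make the first point concrete: once unannotated gene products are listed in a gene product file, the annotated proportion can be estimated by comparing the IDs in the two files. The sketch below assumes each row starts with DB and DB Object ID, as in the proposed formats; the function name is hypothetical.<br />
<br />
```python
def annotated_fraction(gene_product_rows, association_rows):
    """Fraction of listed gene products that have at least one annotation.

    Both inputs are sequences of rows whose first two columns are
    DB and DB Object ID.
    """
    all_ids = {(row[0], row[1]) for row in gene_product_rows}
    # Ignore annotations to objects missing from the gene product file.
    annotated = {(row[0], row[1]) for row in association_rows} & all_ids
    return len(annotated) / len(all_ids) if all_ids else 0.0
```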
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
<br />
<br />
==Proposed file format==<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation Extension ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
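<br />
A converter could apply this rule as sketched below. The proposal does not say how the prefix is separated from the GO ID, so the single space used here is an assumption, as is the function name. Qualifiers are treated as pipe-separated, as in GAF.<br />
<br />
```python
def move_not_to_go_id(qualifier, go_id):
    """Return (qualifier, go_id) with any NOT moved onto the GO ID.

    The "NOT " prefix (with a space) is an assumed spelling; the proposal
    only says the GO ID is "prefixed with NOT".
    """
    parts = [q for q in qualifier.split("|") if q]
    negated = "NOT" in parts
    remaining = "|".join(q for q in parts if q != "NOT")
    return remaining, ("NOT " + go_id if negated else go_id)
```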
<br />
====Additional Data====<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
===Gene Product Information===<br />
<br />
Gene product information would be stored in a separate file.<br />
<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
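<br />
For example, a consumer could split the xrefs column into (database, accession) pairs as follows. This is a minimal sketch; the function name is hypothetical, and it assumes each entry has the form DB:accession.<br />
<br />
```python
def parse_xrefs(field):
    """Split a pipe-separated xref column into (database, accession) pairs."""
    pairs = []
    for entry in field.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # an empty column yields no xrefs
        # Split on the first colon only, so accessions may contain colons.
        db, _, accession = entry.partition(":")
        pairs.append((db, accession))
    return pairs
```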
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || ECO:0000015 || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || ECO:0000304 || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || NOT || GO:0003963 || SGD_REF:S000039255 || ECO:0000002 || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || ECO:0000305 || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ECO:0000250 || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || ECO:0000002 || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || ECO:0000021 || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || ECO:0000002 || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
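<br />
The evidence codes in the rows above were translated from the three-letter GAF codes into ECO IDs. The lookup below is read off the example rows on this page only (comparing each GAF row with its counterpart in the new format); it is not an authoritative GAF-to-ECO mapping, and the names used are hypothetical.<br />
<br />
```python
# Mapping inferred from the example rows on this page, NOT an official table.
EXAMPLE_EVIDENCE_TO_ECO = {
    "IMP": "ECO:0000015",
    "TAS": "ECO:0000304",
    "IDA": "ECO:0000002",
    "IC": "ECO:0000305",
    "ISS": "ECO:0000250",
    "IPI": "ECO:0000021",
}


def to_eco(evidence_code):
    """Translate a GAF evidence code using the example mapping above."""
    try:
        return EXAMPLE_EVIDENCE_TO_ECO[evidence_code]
    except KeyError:
        raise ValueError("no ECO ID in the example mapping for: "
                         + evidence_code)
```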
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
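<br />
A minimal sketch of the split direction is given below: each 17-column GAF row is divided into an association row and a gene product row, and each gene product is kept only once. Column orders follow the tables on this page; the function names are hypothetical. The sketch leaves the evidence code untranslated, leaves the Parent GP ID and xref columns empty, and does not create separate gene product entries for spliceforms.<br />
<br />
```python
def split_gaf_row(gaf):
    """Split one 17-column GAF row into (association_row, gene_product_row)."""
    (db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from, _aspect,
     name, synonyms, obj_type, taxon_col, date, assigned_by, extension,
     spliceform) = gaf
    # GAF column 13 may hold "taxon:A|taxon:B"; the second value is the
    # interacting taxon for multi-organism processes.
    taxa = taxon_col.split("|")
    organism = taxa[0].replace("taxon:", "", 1)
    interacting = taxa[1] if len(taxa) > 1 else ""
    association = [db, obj_id, qualifier, go_id, refs, evidence, with_from,
                   interacting, date, assigned_by, extension, spliceform]
    gene_product = [db, obj_id, obj_type, organism, symbol, name, synonyms,
                    "", ""]
    return association, gene_product


def split_gaf(rows):
    """Split many GAF rows, emitting each gene product only once."""
    associations, products = [], {}
    for row in rows:
        assoc, gp = split_gaf_row(row)
        associations.append(assoc)
        products.setdefault((gp[0], gp[1]), gp)
    return associations, list(products.values())
```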
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; because the new formats mirror the database structure more closely, loading should become simpler rather than harder.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, submitting groups will need to produce two: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes Google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store this information, but there is currently no mechanism for exporting it)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Annotation_Conf._Call,_September_13,_2011&diff=37220Annotation Conf. Call, September 13, 20112011-09-20T17:49:18Z<p>Girlwithglasses: </p>
<hr />
<div>==Agenda==<br />
<br />
Present:<br />
SGD: Rama, Julie, Karen, Jodi<br><br />
UK: Emily, Jane, Susan, Val, Midori, Paola<br><br />
WB: Kimberly<br><br />
Rat: Dan<br><br />
TAIR: Donghui<br><br />
Berkeley: Chris, Amelia<br />
<br />
===Unpublished Reference for use with IC evidence code===<br />
<br />
We sent around an abstract for the unpublished reference for use with the IC evidence code and haven't received any comments. Is everybody okay with the text? If there are no objections, we will move this to the GO_REFs page as soon as possible. If you have more examples of IC used in this fashion, please post them on the go-discuss mailing list.<br />
<pre><br />
GO_REF:0000036 Manual annotations that require more than one source of functional data to support the assignment of the associated GO term. <br />
<br />
The Gene Ontology Consortium uses the IC (Inferred by Curator) evidence code when an annotation cannot be supported by any direct evidence, <br />
but can be inferred by GO annotations that have been annotated to the same gene/gene product identifier in conjunction with the curator's <br />
knowledge of biology (supporting GO annotations must not be IC-evidenced). In many cases an IC-evidenced annotation simply applies the same <br />
reference that was used in the supporting GO annotation. The use of IC evidence code in an annotation with reference GO_REF:0000036 signifies <br />
a curator inferred the GO term based on evidence from multiple sources of evidence/GO annotations. The 'with/from' field in these annotations <br />
will therefore supply >1 GO identifier, obtained from the set of supporting GO annotations assigned to the same gene/gene product identifier <br />
which cite publicly-available references.<br />
<br />
Example: (from: http://wiki.geneontology.org/index.php/Transcription_jamboree)<br />
Primary annotations to CUP9:<br />
CUP9 GO:0000122 negative regulation of transcription from RNA polymerase II promoter IMP PMID:9427760<br />
CUP9 GO:0000978 RNA polymerase II core promoter proximal region sequence-specific DNA binding IDA PMID:9427760 <br />
CUP9 GO:0001103 RNA polymerase II repressing transcription factor binding IPI PMID:18708352 CYC8<br />
Composite IC annotation to CUP9:<br />
CUP9 GO:0001133 Sequence-specific transcription regulatory region DNA binding RNA <br />
polymerase II transcription factor recruiting transcription factor <br />
activity IC GO_REF:0000036 GO:0000122|GO:0000978|GO:0001103<br />
</pre><br />
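<br />
The abstract above implies a simple consistency check on composite IC annotations, sketched here in Python. This is only an illustration: the function name <code>check_composite_ic</code> and the exact conditions are assumptions, not an agreed QC rule, and the 'with/from' field is assumed to be pipe-separated as in the CUP9 example.<br />

```python
import re

# A GO identifier is "GO:" followed by seven digits.
GO_ID = re.compile(r"^GO:\d{7}$")


def check_composite_ic(evidence, reference, with_from):
    """Return a list of problems for an annotation citing GO_REF:0000036."""
    problems = []
    if reference != "GO_REF:0000036":
        return problems  # the check only concerns this reference

    if evidence != "IC":
        problems.append("GO_REF:0000036 is intended for IC annotations")

    ids = [value for value in with_from.split("|") if value]
    if len(ids) < 2:
        problems.append("with/from should supply more than one GO identifier")

    not_go = [value for value in ids if not GO_ID.match(value)]
    if not_go:
        problems.append("non-GO identifiers in with/from: " + ", ".join(not_go))
    return problems
```

For the CUP9 composite annotation, <code>check_composite_ic("IC", "GO_REF:0000036", "GO:0000122|GO:0000978|GO:0001103")</code> reports no problems.<br />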
<br />
===Discussion===<br />
The IC unpublished reference is all set to go. Remember to also make the granular supporting annotations. Rama will add it to the GO_REF page and the Documentation page. We will also come up with rules for QC checks. If the annotating groups generate an internal DBID for this reference, they should make sure that ID is listed as an External Accession on the GO_REF page.<br />
<br />
=== Include the term 'GO:0005488; Binding' to QC Rule GO_AR:0000003 ===<br />
<br />
See: http://www.geneontology.org/GO.annotation_qc.shtml<br />
<br />
GO_AR:0000003 Annotations to 'protein binding ; GO:0005515' should be made with IPI, and the interactor should be in the 'with' field.<br />
<br />
* Annotations to the parent term 'binding' should also be included in this rule, as the same poor level of information is conveyed in an annotation such as: gene_product 1 GO:0005488 binding IDA PMID:12345<br />
===Discussion===<br />
* Agreed that the upper-level term should also be subject to that rule. <br />
* What sorts of things can go in the 'with' column for Binding? ChEBI is one example. But if you know exactly what it binds to, then pick one of the appropriate child terms.<br />
* We will email this QC to the GO mailing list<br />
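<br />
The extended rule can be expressed as a small check, sketched here in Python. The function name <code>violates_go_ar_0000003</code> and its argument layout are made up for this example; the real QC scripts may be organised quite differently.<br />

```python
# Terms covered by the rule once 'binding' is added alongside 'protein binding'.
BINDING_TERMS = {"GO:0005515", "GO:0005488"}


def violates_go_ar_0000003(go_id, evidence, with_from):
    """Return True if an annotation breaks the extended GO_AR:0000003 rule.

    Annotations to the covered terms must use IPI and name the
    interactor in the 'with' field.
    """
    if go_id not in BINDING_TERMS:
        return False
    return evidence != "IPI" or not with_from.strip()
```

The example from the agenda item, an IDA annotation to GO:0005488 with an empty 'with' field, would be flagged by this check.<br />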
<br />
===Use of Col-16===<br />
Midori et al. (S. pombe) have done a lot of work capturing data in col-16. Midori will go over the relationships they have used and some examples.<br><br />
Relations (OBO format): http://www.geneontology.org/scratch/xps/go_annotation_extension_relations.obo <br><br />
Annotation examples on the Pombase Wiki page: http://sourceforge.net/apps/trac/pombase/wiki/ListOfRelations <br><br />
Examples are also available in OBO format at: http://www.geneontology.org/scratch/xps/go_annotation_extension_examples.obo<br />
<br />
===Discussion===<br />
* Curators can refer to the pombe wiki page to see how the relationships have been used.<br />
* If you have multiple relationships, separate them with a comma.<br />
* In the future, if you need new relationships, please request them on the Ontology SF tracker.<br />
* How do we view these relationships? Amelia will work on a static page to view these. OBO-Edit can be used too.<br />
* Documentation for col-16 is scattered. Rama will try to clean it up.<br />
* If your group is providing data in col-16, please post a message on the GO list saying so.<br />
<br />
===Protein oligomerization SF item (Amelia, Becky)===<br />
http://sourceforge.net/tracker/?func=detail&aid=3053953&group_id=36855&atid=440764<br />
===Discussion===<br />
This will be an agenda item for the upcoming GOC meeting in the UK.</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37035Gene Product Association Data (GPAD) Format (Archived)2011-09-07T21:20:29Z<p>Girlwithglasses: /* Gene Product Association Data, GPAD */ adding note on annotation IDs</p>
<hr />
<div>Proposal to split the information in the GAF files into two sets, association data and gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology], and to standardize the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 1<br />
|}<br />
<br />
<br />
==Proposed file format==<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation Extension ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
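<br />
A converter would need to carry out this move when reading existing GAF rows. The sketch below shows one way to do it; the function name <code>move_not_to_go_id</code> is hypothetical, and joining the prefix and the GO ID with a single space is an assumption, since the proposal only says that the ID is "prefixed with NOT".<br />

```python
def move_not_to_go_id(qualifier, go_id):
    """Convert a GAF (qualifier, GO ID) pair to the proposed layout.

    GAF qualifiers are pipe-separated; any NOT is removed from the
    qualifier and prefixed to the GO ID instead.
    """
    parts = [q for q in qualifier.split("|") if q]
    negated = any(q.upper() == "NOT" for q in parts)
    kept = [q for q in parts if q.upper() != "NOT"]
    new_go_id = f"NOT {go_id}" if negated else go_id
    return "|".join(kept), new_go_id
```

Other qualifiers, such as contributes_to, stay in the Qualifier column unchanged.<br />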
<br />
====Additional Data====<br />
<br />
The addition of an annotation ID could be useful for a number of reasons, including updating, removing or citing annotations, linking different annotations, and so on. There are questions about how and by whom these IDs would be maintained that would need to be answered before introducing such IDs.<br />
<br />
===Gene Product Information===<br />
<br />
Gene product information would be stored in a separate file.<br />
<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, such as whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See the GOA comment below.<br />
<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071:Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || IMP || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || TAS || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || IDA || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || IDA || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || IDA || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || IDA || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || IDA || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
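<br />
To give an idea of how small such a script could be, here is a minimal Python sketch of the splitting direction. It assumes tab-separated GAF rows with comment lines starting with "!", and it follows the column mapping in the tables above; the function names are made up for this example, and spliceform entries and the relocation of NOT are deliberately not handled.<br />

```python
def split_gaf_line(line):
    """Split one GAF row into (association row, gene product row)."""
    cols = line.rstrip("\r\n").split("\t")
    cols += [""] * (17 - len(cols))  # pad rows that lack the later columns

    # GAF column 13 may hold "taxon|interacting taxon".
    taxa = cols[12].split("|")
    taxon = taxa[0]
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    association = [cols[0], cols[1], cols[3], cols[4], cols[5], cols[6],
                   cols[7], interacting_taxon, cols[13], cols[14], cols[15]]
    # Parent GP ID and xrefs are not available from a GAF row.
    gene_product = (cols[0], cols[1], cols[11], taxon,
                    cols[2], cols[9], cols[10], "", "")
    return association, gene_product


def split_gaf(lines):
    """Return (association rows, unique gene product rows) for GAF lines."""
    associations = []
    products = {}
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and comments
        association, gene_product = split_gaf_line(line)
        associations.append(association)
        # Each gene product is stored once, however often it is annotated.
        products.setdefault(gene_product[:2], gene_product)
    return associations, list(products.values())
```

Keying the gene product rows on DB and DB_Object_ID is what removes the repeated gene product data that motivates the proposal.<br />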
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here; either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes Google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store this information, but there is currently no mechanism for exporting it)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=37034Gene Product Association Data (GPAD) Format (Archived)2011-09-07T21:04:08Z<p>Girlwithglasses: </p>
<hr />
<div>Proposal to split the information in the GAF files into two sets, association data and gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
This proposal separates data on genes and gene products (the objects being annotated) from the annotation data. The data related to gene products (symbol, name, synonyms, taxon) can be submitted, updated and maintained separately from the data concerned with annotation: term IDs, evidence codes, references, annotation extension (col 16), and so on.<br />
<br />
Additionally, we would like to introduce further flexibility in annotation by using the entire suite of evidence codes available in the [http://www.obofoundry.org/cgi-bin/detail.cgi?id=evidence_code Evidence Code Ontology] and by standardizing the types of object (gene product types) that can be annotated. A controlled vocabulary has not yet been introduced for the latter.<br />
<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
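<br />
The amount of repetition described above can be measured directly. The following is a minimal Python sketch, assuming a tab-separated GAF file in which columns 1 and 2 (DB and DB_Object_ID) together identify a gene product; the function name <code>count_redundancy</code> is hypothetical and is not part of any existing GO tool.<br />
<br />
```python
def count_redundancy(gaf_lines):
    """Return (annotation_rows, distinct_gene_products) for GAF lines."""
    rows = 0
    products = set()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and '!' header comments
        cols = line.rstrip("\n").split("\t")
        rows += 1
        products.add((cols[0], cols[1]))  # (DB, DB_Object_ID)
    return rows, len(products)


# Truncated example rows (first five GAF columns only), for illustration.
example = [
    "!gaf-version: 2.0",
    "SGD\tS000000296\tPHO3\t\tGO:0003993",
    "SGD\tS000000296\tPHO3\t\tGO:0006796",
    "SGD\tS000005370\tRCL1\tNOT\tGO:0003963",
]
rows, products = count_redundancy(example)
print(rows, products)  # 3 annotation rows describe only 2 gene products
```
<br />
The difference between the two numbers is the number of rows in which gene product data is restated rather than newly supplied.<br />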
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
==Current Association File Format==<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
<br />
<br />
==Proposed file format==<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
All gene product data barring the ID of the object being annotated is removed from the annotation file.<br />
<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation Extension ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|}<br />
<br />
Note: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
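<br />
As an illustration of this rule, here is a small Python sketch that moves 'NOT' out of a GAF qualifier value and onto the GO ID. It assumes that qualifiers are pipe-separated and that the negated form is written as 'NOT' followed by a space and the GO ID; the proposal does not fix the exact separator, and the function name is hypothetical.<br />
<br />
```python
def move_not_to_go_id(qualifier, go_id):
    """Move a 'NOT' qualifier onto the GO ID column.

    Assumption: the prefix is written as 'NOT ' + GO ID; the exact
    separator is not specified in the proposal.
    """
    qualifiers = [q for q in qualifier.split("|") if q]
    if "NOT" in qualifiers:
        qualifiers = [q for q in qualifiers if q != "NOT"]
        go_id = "NOT " + go_id
    return "|".join(qualifiers), go_id


print(move_not_to_go_id("NOT", "GO:0003963"))
print(move_not_to_go_id("NOT|contributes_to", "GO:0003963"))
print(move_not_to_go_id("", "GO:0003993"))
```
<br />
Other qualifiers stay in the Qualifier column, so a consumer that reads only the GO ID column still sees the negation.<br />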
<br />
<br />
===Gene Product Information===<br />
<br />
Gene product information would be stored in a separate file.<br />
<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex? PRO?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
* should we have a richer way to represent relationships between genes and the various types of gene product and GP complexes?<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+, pipe-separated || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
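<br />
A pipe-separated xref column of this kind could be read as follows. This is only a sketch: it assumes each entry has the form DB:identifier, and the function name <code>parse_xrefs</code> is hypothetical.<br />
<br />
```python
def parse_xrefs(field):
    """Split a pipe-separated xref column into (database, identifier) pairs."""
    xrefs = []
    for entry in field.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # an empty column yields no xrefs
        # Split on the first colon only, so identifiers may contain colons.
        db, _, identifier = entry.partition(":")
        xrefs.append((db, identifier))
    return xrefs


print(parse_xrefs("UniProtKB:P21677|IPI:IPI00163085"))
print(parse_xrefs(""))
```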
<br />
====Additional Data====<br />
<br />
Extra information can be included in the GPI files, including whether annotation is complete for a certain GP, and whether that GP belongs to a set prioritized for annotation. See GOA comment below.<br />
<br />
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || IMP || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || TAS || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || NOT GO:0003963 || SGD_REF:S000039255 || IDA || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || IDA || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || IDA || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || IDA || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || IDA || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
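<br />
A rough Python sketch of the GAF-to-two-files direction is given below. It follows the column mappings in the proposed tables, with several assumptions that are not part of the proposal: input is tab-separated GAF 2.0 text, 'NOT' is written as a 'NOT ' prefix on the GO ID, Aspect (col 9) is dropped because it does not appear in the proposed tables, and the Parent GP ID and xref columns are left empty. Spliceform entries (col 17) are not handled. The function name <code>split_gaf</code> is hypothetical.<br />
<br />
```python
def split_gaf(gaf_text):
    """Split GAF 2.0 text into (association_rows, gene_product_rows)."""
    associations = []
    gene_products = {}  # (DB, DB_Object_ID) -> row, so each product is stated once
    for line in gaf_text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and '!' header comments
        cols = line.split("\t")
        cols += [""] * (17 - len(cols))  # pad rows that omit trailing columns
        (db, obj_id, symbol, qualifier, go_id, refs, evidence, with_from,
         _aspect, name, synonyms, obj_type, taxon, date, assigned_by,
         extension, _spliceform) = cols[:17]

        # Column 13 may hold "own taxon|interacting taxon".
        own_taxon, _, interacting_taxon = taxon.partition("|")

        # 'NOT' moves from the qualifier column to the GO ID column.
        qualifiers = [q for q in qualifier.split("|") if q]
        if "NOT" in qualifiers:
            qualifiers.remove("NOT")
            go_id = "NOT " + go_id

        associations.append([db, obj_id, "|".join(qualifiers), go_id, refs,
                             evidence, with_from, interacting_taxon, date,
                             assigned_by, extension])
        # Gene product data is recorded only the first time it is seen.
        gene_products.setdefault(
            (db, obj_id),
            [db, obj_id, obj_type, own_taxon, symbol, name, synonyms, "", ""])
    return associations, list(gene_products.values())


gaf_text = "\n".join([
    "\t".join(["SGD", "S000000296", "PHO3", "", "GO:0003993",
               "SGD_REF:S000047763", "IMP", "", "F", "acid phosphatase",
               "YBR092C", "gene", "taxon:4932", "20010118", "SGD", "", ""]),
    "\t".join(["SGD", "S000005370", "RCL1", "NOT", "GO:0003963",
               "SGD_REF:S000039255", "IDA", "", "F",
               "aminodeoxychorismate synthase", "YOL010W", "gene",
               "taxon:4932", "20020530", "SGD", "", ""]),
    "\t".join(["SGD", "S000005370", "RCL1", "", "GO:0006406",
               "SGD_REF:S000069956", "IC", "GO:0000346", "P",
               "aminodeoxychorismate synthase", "YOL010W", "gene",
               "taxon:4932|taxon:745953", "20030221", "SGD", "", ""]),
])
associations, gene_products = split_gaf(gaf_text)
print(len(associations), len(gene_products))
```
<br />
Going in the other direction would additionally need a way to recover Aspect from the GO ID, for example by looking it up in the ontology.<br />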
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to using the new format and a period of supplying both old and new files (as was done with the ontology files) is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for it)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
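<br />
As a sketch of how a consumer might read this file, the Python below maps one line onto the twelve column names in the table above. It assumes tab-separated columns and pipe-separated values in the multi-valued columns; the readme linked above is the authoritative description. The internal syntax of the Annotation Extension column is left unparsed, and the names <code>GP_ASSOCIATION_COLUMNS</code>, <code>MULTI_VALUED</code> and <code>parse_gp_association_line</code> are hypothetical.<br />
<br />
```python
GP_ASSOCIATION_COLUMNS = [
    "DB", "DB_Object_ID", "Qualifier", "GO ID", "DB:Reference(s)",
    "Evidence code", "With", "Extra taxon ID", "Date", "Assigned_by",
    "Annotation Extension", "Spliceform ID",
]
# Columns split into lists here (assumed pipe-separated).
MULTI_VALUED = {"Qualifier", "DB:Reference(s)", "With"}


def parse_gp_association_line(line):
    """Return a dict keyed by column name for one gp_association line."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GP_ASSOCIATION_COLUMNS) - len(values))
    record = {}
    for name, value in zip(GP_ASSOCIATION_COLUMNS, values):
        if name in MULTI_VALUED:
            record[name] = [v for v in value.split("|") if v]
        else:
            record[name] = value
    return record


line = "\t".join(["UniProtKB", "Q4VCS5", "", "GO:0005515", "PMID:16043488",
                  "IPI", "UniProtKB:Q6RHR9-2", "", "20051207", "UniProtKB",
                  "", "Q4VCS5-1"])
record = parse_gp_association_line(line)
print(record["GO ID"], record["With"], record["Spliceform ID"])
```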
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 0 or 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
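<br />
The required/optional and cardinality columns above lend themselves to a simple line check. The Python sketch below assumes tab-separated columns and a pipe as the separator for multi-valued columns (as in the example content); it checks only emptiness and value count, not the content of each field. The names <code>GP_INFORMATION_COLUMNS</code> and <code>check_gp_information_line</code> are hypothetical.<br />
<br />
```python
GP_INFORMATION_COLUMNS = [
    # (name, required, may hold several pipe-separated values)
    ("DB", True, False),
    ("DB_Subset", False, False),
    ("DB_Object_ID", True, False),
    ("DB_Object_Symbol", True, False),
    ("DB_Object_Name", False, False),
    ("DB_Object_Synonym(s)", False, True),
    ("DB_Object_Type", True, False),
    ("Taxon", True, False),
    ("Annotation_Target_Set", False, True),
    ("Annotation_Completed", False, False),
    ("Parent_Object_ID", False, False),
]


def check_gp_information_line(line):
    """Return a list of problems found in one gp_information line."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(GP_INFORMATION_COLUMNS) - len(values))
    problems = []
    for (name, required, multi), value in zip(GP_INFORMATION_COLUMNS, values):
        if required and not value:
            problems.append(name + " is required but empty")
        if not multi and "|" in value:
            problems.append(name + " allows at most one value")
    return problems


good = "\t".join(["UniProtKB", "Swiss-Prot", "Q4VCS5", "AMOT", "Angiomotin",
                  "AMOT|KIAA1071", "protein", "taxon:9606", "BHF-UCL|KRUK",
                  "", ""])
bad = "\t".join(["UniProtKB", "", "Q4VCS5", "", "", "", "protein",
                 "taxon:9606|taxon:10090", "", "", ""])
print(check_gp_information_line(good))
print(check_gp_information_line(bad))
```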
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=LEGO_August_23rd_2011&diff=36921LEGO August 23rd 20112011-08-23T15:29:30Z<p>Girlwithglasses: /* Summary of call */</p>
<hr />
<div>'''LEGO call'''<br />
<br />
'''Tuesday 23rd August 2011'''<br />
<br />
Present:<br />
*Chris Mungall<br />
*Jane Lomax<br />
*Emily Dimmer<br />
*Rebecca Foulger<br />
*Tony Sawford<br />
*Pascale Gaudet<br />
*Paul Thomas<br />
*Amelia Ireland<br />
*Rama Balakrishnan<br />
<br />
<br />
==Aim of the call==<br />
*To come up with a broad plan of action for managing the LEGO project: what is in scope for LEGO, what is out of scope, and how to proceed.<br />
<br />
<br />
==Relevant documents== <br />
<br />
Chris went through his proposal, which is designed to be read alongside Paul's white paper.<br />
<br />
*http://wiki.geneontology.org/index.php/LEGO_Model_Draft_Specification<br />
*http://wiki.geneontology.org/index.php/File:Paul%27s_LEGO_white_paper_March_2010.pdf<br />
<br />
<br />
==Summary of call==<br />
<br />
* Need to separate (i) the LEGO idea, and (ii) the implementation of LEGO. Therefore, after a brief discussion on annotation IDs and how they'd be maintained by smaller groups without much manpower, it was decided to leave the annotation IDs for a separate discussion later down the line.<br />
** We'll have to come up with displays to hide the 'ugly/complicated' stuff from users. This will be covered with the implementation discussions.<br />
<br />
* Right now, we need to take a step back and collect the use cases, justifications, what problems we're trying to solve with LEGO, WHY GO should be doing this, and why it's different to a pathway database (Chris and Paul have ideas on this).<br />
<br />
* We'll always need a mixture of pre- and post-composition for annotations and terms, because there's no ideal level of either.<br />
<br />
* '''AI''': Add annotation examples where you'd like to capture additional information (e.g. timing of a process, targets of a process etc) to http://wiki.geneontology.org/index.php/LEGO-style_annotation_ideas<br />
** Jane will add a virus example.<br />
** Need to start capturing relationships, alongside the relationship work that's ongoing for column 16.<br />
<br />
==Other points==<br />
<br />
* We can't currently capture that a GP is a TARGET of a process, where you don't know the exact GP doing the process (e.g. that a protein contains a caspase cleavage site, but you don't know which caspase is cleaving it) (ECD). LEGO will be able to capture this using, for example, PRO IDs for a generic caspase (CJM)<br />
<br />
* Need to make nested statements about which processes are involved in other processes (refer to the NEDD4, RNAPII and response to diagram in Paul's PDF). (PT)<br />
<br />
<br />
<br />
==Remaining Questions==<br />
<br />
* Can any of this information be captured in the current GAF format, so we can move forward with this without waiting for the common annotation framework (CAF)?</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=LEGO-style_annotation_ideas&diff=36897LEGO-style annotation ideas2011-08-23T00:00:12Z<p>Girlwithglasses: /* Build a pathway on the fly */</p>
<hr />
<div>This page is for collecting together thoughts and ideas on what we want from a new annotation system. Given a clean sheet, what would you want to capture? Examples are always useful.<br />
<br />
===What can't be fully captured in current format===<br />
<br />
For annotation developments that could be included in the current GAF format, see full details at: [[Proposed Developments to the GAF annotation format]]<br />
<br />
====Terms from external ontologies====<br />
* Clarification: these can be used as differentia in c16, but not in c5.<br />
* Allowing other ontologies in c5 is not hard in principle, BUT:<br />
* Adding other ontologies (e.g. CL) in c5 would require an explicit relationship type such as expressed_in (next item)<br />
<br />
====Nested class expressions (post-composed terms)====<br />
* c16 allows multiple differentia but not nested class expressions<br />
* the syntax was designed to accommodate this, but at this point it gets quite complex for people and feels like overloading<br />
<br />
====Multiple pieces of evidence for a single assertion==== <br />
<br />
A new annotation format could provide a more detailed, structured format for the evidence supporting an annotation<br />
<br />
* Currently the evidence for an annotation is located in the reference (col.6), evidence (col.7) and with (col.8) fields in the GAF. There are restrictions on the acceptable values and their cardinality in these fields. However, curators would like to make a [[chain of evidence]] that would result in the inclusion of multiple evidence and reference identifiers to support a single annotation. While work-arounds are being discussed on calls, these solutions are not ideal. [[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
* There are sets of annotations that can only be identified via the assigned_by field (for instance the GOC-assigned annotations automatically inferred from MF-BP links). This seems to me to indicate that we need another field to consistently indicate how these annotations are generated. [[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
* A diverse range of values is included in the 'with' field. These values can only be interpreted correctly once users have consulted the evidence code documentation and, in some cases, the cited GO_REFs. Common values are:<br />
** ''protein accessions'': <br />
*** for IPI-evidenced annotations these are binding partners of the annotation object (col.2); curation groups also differ in the format used when many such IDs are listed (1:many or distinct binary interactions)<br />
*** for ISS or IEA-evidenced annotations that use sequence/orthology information, they indicate the orthologous protein from which annotation data was obtained<br />
** ''gene identifiers'': mutated genes<br />
** ''GO IDs'' that support an IC annotation by providing a way of tracing back to primary-evidenced annotations<br />
* IEA annotations include external vocabularies which have supported the annotation prediction (e.g. IPR ids)<br />
09:11, 16 August 2011 (PDT)<br />
<br />
====Optimize the annotation format for viral curators.==== <br />
<br />
The dual taxon requirement for many virus annotations is not ideal - many investigators work in an experimental organism that never acts as the virus's natural host. While it might be of interest to the user that the experimental host context is captured in some cases, perhaps data on known viral hosts should additionally be used/supplied? UniProt has a virus/host list that could be used. However, how is such dual taxon information intended to be used by our users? Should there exist a reciprocal annotation/link to the host protein/process to indicate that they are targeted by viral action? (see below)<br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
====Capture the subject of a GO term's activity.====<br />
Although the target of an activity can now be captured in column 16, how do curators annotate a target of an activity when they do not know the identity of the gene product carrying out the activity (the annotation object)?<br />
<br />
For instance:<br />
<br />
1. PMID:10085113 describes the caspase cleavage site in Atrophin-1; indicating that it is a target of executioner caspases and involved in the execution phase of apoptosis. While caspase 3 is used in the paper to demonstrate this protein is a caspase substrate, it is likely to be the target of other executioner caspases as well.<br />
<br />
Could something be done, though, using a full set of relationships between the ID in col.2 and the GO ID in col.5? Could targets of an annotation that are cited in column 16 be used to automatically generate an annotation with the target in col. 2 along with an appropriate relationship to the GO ID in column 5? <br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
==== Linking Annotations via a unique annotation id ====<br />
<br />
See [[2010_Bar_Harbor_Minutes#Alex.27s_proposal| Alex's proposal from Bar Harbor]] and [[Multiple_term_annotations | multiple term annotations]]<br />
<br />
- this does sound powerful, but I am concerned whether it is possible before all annotations are kept and developed in a single annotation database (CAF), where they can be consistently audited. Building complex annotation lines on the basis of annotation IDs might be problematic when we cannot be sure that all groups are maintaining the annotations and the associated IDs in the same manner.<br />
<br />
- could be useful for different external annotation efforts. For instance, they might like to use an annotation ID to indicate where a specific gp involved in a normal MF/BP is disrupted to become involved in a disease/trait/phenotype?<br />
<br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
==== Capturing further information on subcellular location ====<br />
* When a gene product is active in more than one location, but the curator is not provided with the activity carried out at each location, it would be useful to be able to indicate that the protein moves between locations A and B. Ideally, two GO terms could be annotated in the equivalent of column 5, e.g. cytoplasm and nucleus. I can't see how this can be done with the current format without losing information.<br />
<br />
* Would it be desirable to indicate in an annotation when a gene product is ''predominately'' in location X?<br />
<br />
==== Gene product state information ====<br />
<br />
* Capture specific information about the state or structure of a GP without having to give it a new ID. For example, a GP may be able to perform a reaction in a phosphorylated state but not when unphosphorylated. Different domains could be phosphorylated with different effects on the reactions the GP can perform. The configuration of pores and transporters is very important in whether or not transport occurs.<br />
<br />
==== Uncertain information ====<br />
<br />
* Gene product X performs reaction X or Y<br />
* several gene products involved in process X; perhaps we know the functions involved but don't know which GP does which, or we have two candidates for performing a reaction, but don't know which does it<br />
<br />
==== Build a pathway on the fly ====<br />
<br />
* Take a process like sucrose catabolism; there are a number of different routes by which this can occur - see [http://metacyc.org/META/NEW-IMAGE?object=SUCROSE-DEG this MetaCyc page] for examples. It may not be possible to capture this pathway information in GO due to the strength of the part-of / has-part relations (i.e. must be ALL X have part some Y or ALL Y part of some X). The pathway could instead be created at the annotation stage by specifying the order of the reactions, the components of the cell in which the reactions occur, etc.<br />
<br />
===Future annotation areas===<br />
<br />
*There's likely to be a lot of data coming from metabolomics and metagenomics studies in the next few years, e.g. the Human Microbiome Project, so we might want to consider how to annotate population-level processes<br />
<br />
===Useful links===<br />
<br />
* [[File:Paul's LEGO presentation from Bar Harbor Sept 2010.pdf]]<br />
* [[File:Paul's LEGO white paper March 2010.pdf]]<br />
<br />
=== Technical ===<br />
<br />
* [[LEGO in OWL]]<br />
<br />
===Meetings===<br />
<br />
Aug 23 8am PST</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=LEGO-style_annotation_ideas&diff=36877LEGO-style annotation ideas2011-08-18T23:11:00Z<p>Girlwithglasses: /* What can't be fully captured in current format */</p>
<hr />
<div>This page is for collecting together thoughts and ideas on what we want from a new annotation system. Given a clean sheet, what would you want to capture? Examples are always useful.<br />
<br />
===What can't be fully captured in current format===<br />
<br />
For annotation developments that could be included in the current GAF format, see full details at: [[Proposed Developments to the GAF annotation format]]<br />
<br />
====Terms from external ontologies====<br />
* clarification: can be used as differentia in c16, but not in c5.<br />
* Allowing other ontologies in c5 is not in principle hard - BUT:<br />
* Adding other ontologies (e.g. CL) in c5 would require an explicit relationship type such as expressed_in (next item)<br />
<br />
====Nested class expressions (post-composed terms)====<br />
* c16 allows multiple differentia but not nested class expressions<br />
* the syntax was designed to accommodate this, but at this point it gets quite complex for people and feels like overloading<br />
<br />
====Multiple pieces of evidence for a single assertion==== <br />
<br />
A new annotation format could provide a more detailed, structured format for the evidence supporting an annotation<br />
<br />
* Currently the evidence for an annotation is located in the reference (col.6), evidence (col.7) and with (col.8) fields in the GAF. There are restrictions on the acceptable values and their cardinality in these fields. However, curators would like to make a [[chain of evidence]] that would result in the inclusion of multiple evidence and reference identifiers to support a single annotation. While work-arounds are being discussed on calls, solutions are not ideal.[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
* there are sets of annotations that can only be identified via the assigned_by field (for instance the GOC-assigned annotations automatically inferred from MF-BP links), this to me seems to indicate that we need another field to consistently indicate how these annotations are generated [[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
* a diversity of values is included in the 'with' field. This diversity can only be interpreted correctly once users have consulted the evidence code documentation and in some cases the cited GO_REFs. Common values are:<br />
** ''protein accessions'': <br />
*** for IPI-evidenced annotations these are binding partners of the annotation object (col.2); the format also differs between curation groups when many such IDs are listed (1:many versus distinct binary interactions)<br />
*** for ISS or IEA-evidenced annotations that use sequence/orthology information, they indicate the orthologous protein from which annotation data was obtained<br />
** ''gene identifiers'': mutated genes<br />
** ''GO IDs'' that support an IC annotation by providing a way of tracing back to primary-evidenced annotations<br />
* IEA annotations include external vocabularies which have supported the annotation prediction (e.g. IPR ids)<br />
09:11, 16 August 2011 (PDT)<br />
<br />
====Optimize the annotation format for viral curators.==== <br />
<br />
The dual taxon requirement for many virus annotations is not ideal - many investigators work in an experimental organism that never acts as the virus's natural host. While it might be of interest to the user that the experimental host context is captured in some cases, perhaps data on known viral hosts should additionally be used/supplied? UniProt has a virus/host list that could be used. However, how is such dual taxon information intended to be used by our users? Should there exist a reciprocal annotation/link to the host protein/process to indicate that they are targeted by viral action? (see below)<br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
====Capture the subject of a GO term's activity.====<br />
Although the target of an activity can now be captured in column 16, how do curators annotate a target of an activity when they do not know the identity of the gene product carrying out the activity (the annotation object)?<br />
<br />
For instance:<br />
<br />
1. PMID:10085113 describes the caspase cleavage site in Atrophin-1; indicating that it is a target of executioner caspases and involved in the execution phase of apoptosis. While caspase 3 is used in the paper to demonstrate this protein is a caspase substrate, it is likely to be the target of other executioner caspases as well.<br />
<br />
Could something be done, though, using a full set of relationships between the ID in col.2 and the GO ID in col.5? Could targets of an annotation that are cited in column 16 be used to automatically generate an annotation with the target in col. 2 along with an appropriate relationship to the GO ID in column 5? <br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
==== Linking Annotations via a unique annotation id ====<br />
<br />
See [[2010_Bar_Harbor_Minutes#Alex.27s_proposal| Alex's proposal from Bar Harbor]] and [[Multiple_term_annotations | multiple term annotations]]<br />
<br />
- this does sound powerful, but I am concerned whether it is possible before all annotations are kept and developed in a single annotation database (CAF), where they can be consistently audited. Building complex annotation lines on the basis of annotation IDs might be problematic when we cannot be sure that all groups are maintaining the annotations and the associated IDs in the same manner.<br />
<br />
- could be useful for different external annotation efforts. For instance, they might like to use an annotation ID to indicate where a specific gp involved in a normal MF/BP is disrupted to become involved in a disease/trait/phenotype?<br />
<br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
==== Capturing further information on subcellular location ====<br />
* When a gene product is active in more than one location, but the curator is not provided with the activity carried out at each location, it would be useful to be able to indicate that the protein moves between locations A and B. Ideally, two GO terms could be annotated in the equivalent of column 5, e.g. cytoplasm and nucleus. I can't see how this can be done with the current format without losing information.<br />
<br />
* Would it be desirable to indicate in an annotation when a gene product is ''predominately'' in location X?<br />
<br />
==== Gene product state information ====<br />
<br />
* Capture specific information about the state or structure of a GP without having to give it a new ID. For example, a GP may be able to perform a reaction in a phosphorylated state but not when unphosphorylated. Different domains could be phosphorylated with different effects on the reactions the GP can perform. The configuration of pores and transporters is very important in whether or not transport occurs.<br />
<br />
==== Uncertain information ====<br />
<br />
* Gene product X performs reaction X or Y<br />
* several gene products involved in process X; perhaps we know the functions involved but don't know which GP does which, or we have two candidates for performing a reaction, but don't know which does it<br />
<br />
==== Build a pathway on the fly ====<br />
<br />
* Take a process like sucrose catabolism; there are a number of different routes by which this can occur - see [http://metacyc.org/META/NEW-IMAGE?object=SUCROSE-DEG] for examples. It may not be possible to capture this pathway information in GO due to the strength of the part-of / has-part relations (i.e. must be ALL X have part some Y or ALL Y part of some X). The pathway could instead be created at the annotation stage by specifying the order of the reactions, the components of the cell in which the reactions occur, etc.<br />
<br />
===Future annotation areas===<br />
<br />
*There's likely to be a lot of data coming from metabolomics and metagenomics studies in the next few years, e.g. the Human Microbiome Project, so we might want to consider how to annotate population-level processes<br />
<br />
===Useful links===<br />
<br />
* [[File:Paul's LEGO presentation from Bar Harbor Sept 2010.pdf]]<br />
* [[File:Paul's LEGO white paper March 2010.pdf]]<br />
<br />
=== Technical ===<br />
<br />
* [[LEGO in OWL]]<br />
<br />
===Meetings===<br />
<br />
Aug 23 8am PST</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=LEGO-style_annotation_ideas&diff=36874LEGO-style annotation ideas2011-08-18T17:29:20Z<p>Girlwithglasses: </p>
<hr />
<div>This page is for collecting together thoughts and ideas on what we want from a new annotation system. Given a clean sheet, what would you want to capture? Examples are always useful.<br />
<br />
===What can't be fully captured in current format===<br />
<br />
For annotation developments that could be included in the current GAF format, see full details at: [[Proposed Developments to the GAF annotation format]]<br />
<br />
====Terms from external ontologies====<br />
* clarification: can be used as differentia in c16, but not in c5.<br />
* Allowing other ontologies in c5 is not in principle hard - BUT:<br />
* Adding other ontologies (e.g. CL) in c5 would require an explicit relationship type such as expressed_in (next item)<br />
<br />
====Nested class expressions (post-composed terms)====<br />
* c16 allows multiple differentia but not nested class expressions<br />
* the syntax was designed to accommodate this, but at this point it gets quite complex for people and feels like overloading<br />
<br />
====Multiple pieces of evidence for a single assertion==== <br />
<br />
A new annotation format could provide a more detailed, structured format for the evidence supporting an annotation<br />
<br />
* Currently the evidence for an annotation is located in the reference (col.6), evidence (col.7) and with (col.8) fields in the GAF. There are restrictions on the acceptable values and their cardinality in these fields. However, curators would like to make a [[chain of evidence]] that would result in the inclusion of multiple evidence and reference identifiers to support a single annotation. While work-arounds are being discussed on calls, solutions are not ideal.[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
* there are sets of annotations that can only be identified via the assigned_by field (for instance the GOC-assigned annotations automatically inferred from MF-BP links), this to me seems to indicate that we need another field to consistently indicate how these annotations are generated [[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
* a diversity of values is included in the 'with' field. This diversity can only be interpreted correctly once users have consulted the evidence code documentation and in some cases the cited GO_REFs. Common values are:<br />
** ''protein accessions'': <br />
*** for IPI-evidenced annotations these are binding partners of the annotation object (col.2); the format also differs between curation groups when many such IDs are listed (1:many versus distinct binary interactions)<br />
*** for ISS or IEA-evidenced annotations that use sequence/orthology information, they indicate the orthologous protein from which annotation data was obtained<br />
** ''gene identifiers'': mutated genes<br />
** ''GO IDs'' that support an IC annotation by providing a way of tracing back to primary-evidenced annotations<br />
* IEA annotations include external vocabularies which have supported the annotation prediction (e.g. IPR ids)<br />
09:11, 16 August 2011 (PDT)<br />
<br />
====Optimize the annotation format for viral curators.==== <br />
<br />
The dual taxon requirement for many virus annotations is not ideal - many investigators work in an experimental organism that never acts as the virus's natural host. While it might be of interest to the user that the experimental host context is captured in some cases, perhaps data on known viral hosts should additionally be used/supplied? UniProt has a virus/host list that could be used. However, how is such dual taxon information intended to be used by our users? Should there exist a reciprocal annotation/link to the host protein/process to indicate that they are targeted by viral action? (see below)<br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
====Capture the subject of a GO term's activity.====<br />
Although the target of an activity can now be captured in column 16, how do curators annotate a target of an activity when they do not know the identity of the gene product carrying out the activity (the annotation object)?<br />
<br />
For instance:<br />
<br />
1. PMID:10085113 describes the caspase cleavage site in Atrophin-1; indicating that it is a target of executioner caspases and involved in the execution phase of apoptosis. While caspase 3 is used in the paper to demonstrate this protein is a caspase substrate, it is likely to be the target of other executioner caspases as well.<br />
<br />
Could something be done, though, using a full set of relationships between the ID in col.2 and the GO ID in col.5? Could targets of an annotation that are cited in column 16 be used to automatically generate an annotation with the target in col. 2 along with an appropriate relationship to the GO ID in column 5? <br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
==== Linking Annotations via a unique annotation id ====<br />
<br />
See [[2010_Bar_Harbor_Minutes#Alex.27s_proposal| Alex's proposal from Bar Harbor]] and [[Multiple_term_annotations | multiple term annotations]]<br />
<br />
- this does sound powerful, but I am concerned whether it is possible before all annotations are kept and developed in a single annotation database (CAF), where they can be consistently audited. Building complex annotation lines on the basis of annotation IDs might be problematic when we cannot be sure that all groups are maintaining the annotations and the associated IDs in the same manner.<br />
<br />
- could be useful for different external annotation efforts. For instance, they might like to use an annotation ID to indicate where a specific gp involved in a normal MF/BP is disrupted to become involved in a disease/trait/phenotype?<br />
<br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
==== Capturing further information on subcellular location ====<br />
* When a gene product is active in more than one location, but the curator is not provided with the activity carried out at each location, it would be useful to be able to indicate that the protein moves between locations A and B. Ideally, two GO terms could be annotated in the equivalent of column 5, e.g. cytoplasm and nucleus. I can't see how this can be done with the current format without losing information.<br />
<br />
* Would it be desirable to indicate in an annotation when a gene product is ''predominately'' in location X?<br />
<br />
==== Gene product state information ====<br />
<br />
* Capture specific information about the state or structure of a GP without having to give it a new ID. For example, a GP may be able to perform a reaction in a phosphorylated state but not when unphosphorylated. Different domains could be phosphorylated with different effects on the reactions the GP can perform. The configuration of pores and transporters is very important in whether or not transport occurs.<br />
<br />
==== Uncertain information ====<br />
<br />
* Gene product X performs reaction X or Y<br />
* several gene products involved in process X; perhaps we know the functions involved but don't know which GP does which, or we have two candidates for performing a reaction, but don't know which does it<br />
<br />
===Future annotation areas===<br />
<br />
*There's likely to be a lot of data coming from metabolomics and metagenomics studies in the next few years, e.g. the Human Microbiome Project, so we might want to consider how to annotate population-level processes<br />
<br />
===Useful links===<br />
<br />
* [[File:Paul's LEGO presentation from Bar Harbor Sept 2010.pdf]]<br />
* [[File:Paul's LEGO white paper March 2010.pdf]]<br />
<br />
=== Technical ===<br />
<br />
* [[LEGO in OWL]]<br />
<br />
===Meetings===<br />
<br />
Aug 23 8am PST</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Gene_Product_Association_Data_(GPAD)_Format_(Archived)&diff=36869Gene Product Association Data (GPAD) Format (Archived)2011-08-17T23:08:39Z<p>Girlwithglasses: /* Association Data */</p>
<hr />
<div>Proposal to split the information in the GAF files into two sets, association data and gene product information.<br />
<br />
<br />
==In Brief...==<br />
<br />
===Why?===<br />
<br />
*allow unannotated gene products to be submitted to the GO database<br />
** could be useful in estimating the proportion of a genome that has been annotated<br />
** will also allow users to see that the GP they are searching for ''does'' exist, so they won't spend a long time fruitlessly searching for it [see note below]<br />
*reduce the amount of redundant gene product information in the GAF files<br />
**every annotation to a gene product repeats the same gene product data; this only needs to be stated once. Removing this repeated information and supplying it in a separate file will mean the annotation data files will be smaller, which would certainly be helpful for huge files like the UniProt releases.<br />
<br />
NB: although the gp2protein files may contain IDs of unannotated gene products, '''this data does not go into the GO database, and it is not available in AmiGO'''. There is also no information available about gene product name, synonyms, type or taxon of unannotated gene products in the gp2protein file or in the GAF files.<br />
<br />
<br />
===How?===<br />
<br />
*Converting a GAF file into the GP information and annotation data files, and vice versa, would be simple enough that groups submitting data will be able to provide it in either format, and data consumers will be able to download it in GAF or the proposed formats.<br />
<br />
See [[#Technical_requirements_and_impact_on_existing_software | Technical requirements and impact on existing software ]] for more details.<br />
<br />
<br />
<br />
==Current Association File Format==<br />
<br />
Annotation data has a shaded background, gene product information is in blue text, and data required for both has blue text on a shaded background.<br />
<br />
{| border=1 cell-padding=5<br />
|-<br />
! column<br />
! required?<br />
! contents<br />
! cardinality<br />
|- style="color:blue;background:#ccffff"<br />
| 1<br />
| required<br />
| DB<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 2<br />
| required<br />
| DB_Object_ID<br />
| 1<br />
|- style="color:blue"<br />
| 3<br />
| required<br />
| DB_Object_Symbol<br />
| 1<br />
|- style="background:#ccffff"<br />
| 4<br />
| optional<br />
| Qualifier<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 5<br />
| required<br />
| GO ID<br />
| 1<br />
|- style="background:#ccffff"<br />
| 6<br />
| required<br />
| DB:Reference(s)<br />
| 1 or greater<br />
|- style="background:#ccffff"<br />
| 7<br />
| required<br />
| Evidence code<br />
| 1<br />
|- style="background:#ccffff"<br />
| 8<br />
| optional<br />
| With (or) From<br />
| 0 or greater<br />
|- style="background:#ccffff"<br />
| 9<br />
| required<br />
| Aspect<br />
| 1<br />
|- style="color:blue"<br />
| 10<br />
| optional<br />
| DB_Object_Name<br />
| 0 or 1<br />
|- style="color:blue"<br />
| 11<br />
| optional<br />
| DB_Object_Synonym(s)<br />
| 0 or greater<br />
|- style="color:blue"<br />
| 12<br />
| required<br />
| DB_Object_Type (refers to col 17 if present)<br />
| 1<br />
|- style="color:blue;background:#ccffff"<br />
| 13<br />
| required<br />
| taxon<br />
| 1 or 2 (for multi-org processes)<br />
|- style="background:#ccffff"<br />
| 14<br />
| required<br />
| Date<br />
| 1<br />
|- style="background:#ccffff"<br />
| 15<br />
| required<br />
| Assigned_by<br />
| 1<br />
|- style="background:#ccffff"<br />
| 16<br />
| optional<br />
| Annotation cross products<br />
| ?<br />
|- style="color:blue;background:#ccffff"<br />
| 17<br />
| optional<br />
| Spliceform<br />
| 0 or 1<br />
|}<br />
<br />
==Proposed file format==<br />
<br />
Proposal: remove gene product information from the association data file, leaving just an identifier.<br />
<br />
<br />
===Gene Product Association Data, GPAD===<br />
<br />
new format for storing annotations:<br />
<br />
{| style="background:#ccffff" border=1 cell-padding=5 cell-spacing=10<br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! old column #<br />
! extra info<br />
|- style="color:blue"<br />
| DB || style="color:red" | required || 1 || 1 || must be in xrf_abbs<br />
|- style="color:blue"<br />
| DB_Object_ID || style="color:red" | required || 1 || 2 ||<br />
|- <br />
| Qualifier || optional || 0 or greater || 4 || 'NOT' should not be in this column<br />
|- <br />
| (NOT) GO ID || style="color:red" | required || 1 || 5 || must be extant GO ID, prefixed with NOT for NOT associations<br />
|- <br />
| DB:Reference(s) || style="color:red" | required || 1 or greater || 6 || DB must be in xrf_abbs<br />
|- <br />
| Evidence code || style="color:red" | required || 1 || 7 || from ECO<br />
|- <br />
| With (or) From || optional || 0 or greater || 8 || <br />
|- <br />
| Interacting taxon ID (for multi-organism processes) || optional || 0 or 1 || 13 || ncbi taxon ID<br />
|- <br />
| Date || style="color:red" | required || 1 || 14 || YYYYMMDD<br />
|- <br />
| Assigned_by || style="color:red" | required || 1 || 15 || from xrf_abbs<br />
|- <br />
| Annotation Extension ([[Annotation Cross Products]]) || optional || 0 or greater || 16 || <br />
|- style="color:blue"<br />
| GP Context? || optional || 0 or 1 || 17 (if present) || to be decided<br />
|}<br />
<br />
Note: a transform would need to take place if GAF col 17 is filled in. Further discussion needed to decide where info should go.<br />
<br />
Note 2: NOT would be moved to the GO ID column. This makes it clearer when an annotation is negated and makes it much more difficult for 'NOT' to be ignored.<br />
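<br />
The column mapping in the table above could be sketched roughly as follows. This is only an illustration under several assumptions: the function name <code>gaf_to_gpad</code> is hypothetical, the negation is written as "NOT " plus the GO ID (the exact prefix syntax is not fixed by this proposal), the interacting taxon is taken to be the second pipe-separated entry of GAF col 13, and GAF col 17 is passed through unchanged because the "GP Context" column is still to be decided.<br />

```python
def gaf_to_gpad(gaf_line):
    """Convert one tab-separated GAF line into a list of 12 GPAD-style fields.

    Sketch only: the 'NOT ' prefix syntax and the pass-through of GAF col 17
    are assumptions, not part of an agreed specification.
    """
    cols = gaf_line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))  # pad optional trailing columns if absent

    # GAF col 4: pipe-separated qualifiers; NOT moves to the GO ID column.
    qualifiers = [q for q in cols[3].split("|") if q]
    negated = "NOT" in qualifiers
    qualifiers = [q for q in qualifiers if q != "NOT"]

    go_id = cols[4]
    if negated:
        go_id = "NOT " + go_id  # assumed prefix syntax

    # GAF col 13: the first taxon belongs to the gene product; a second
    # entry, if present, is the interacting organism.
    taxa = cols[12].split("|")
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    return [
        cols[0],               # DB
        cols[1],               # DB_Object_ID
        "|".join(qualifiers),  # Qualifier (without NOT)
        go_id,                 # (NOT) GO ID
        cols[5],               # DB:Reference(s)
        cols[6],               # Evidence code
        cols[7],               # With (or) From
        interacting_taxon,     # Interacting taxon ID
        cols[13],              # Date
        cols[14],              # Assigned_by
        cols[15],              # Annotation Extension
        cols[16],              # GP Context (placeholder: GAF col 17 as-is)
    ]
```

Gene product symbol, name, synonyms, type and aspect (GAF cols 3, 9, 10, 11 and 12) are deliberately dropped here, since under this proposal they would live in the gene product information file or be derivable from the GO ID.<br />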
<br />
===Gene Product Information===<br />
<br />
Gene product information would be stored in a separate file. It would consist of the following pieces of information -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
! extra info<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1 || in xrf_abbs<br />
|-style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 2 || <br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12 || need a controlled vocab (SO + GO complex?)<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13 || <br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3 ||<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10 ||<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11 ||<br />
|-<br />
| Parent GP ID || blank unless GP is an isoform (see next table) || 0 || n/a || protein - list gene; complex component - list complex ID<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+ || n/a || <br />
|}<br />
<br />
<br />
* check PRO for examples<br />
<br />
<br />
Spliceforms (see [[ GAF Col17 GeneProducts ]] for more about spliceforms) would have their own entries in this file, with the data as follows:<br />
<br />
{| border=1 cell-padding=5 style="color:blue" <br />
|-<br />
! contents<br />
! required?<br />
! cardinality<br />
! GAF 2.0 col #<br />
|- style="background:#ccffff"<br />
| DB || style="color:red" | required || 1 || 1<br />
|- style="background:#ccffff"<br />
| DB Object ID || style="color:red" | required || 1 || 17<br />
|-<br />
| DB Object Type || style="color:red" | required || 1 || 12<br />
|-<br />
| Taxon || style="color:red" | required || 1 || 13<br />
|-<br />
| DB Object Symbol || style="color:red" | required || 1 || 3<br />
|-<br />
| DB Object Name || optional || 0 or 1 || 10<br />
|-<br />
| DB Object Synonym(s) || optional || 0 or greater || 11<br />
|-<br />
| Parent GP ID || style="color:red" | required || 1 || 2<br />
|-<br />
| Xrefs in other DBs (e.g. xrefs from gp2protein or the QFO file) || optional || 0+ || n/a<br />
|}<br />
<br />
<br />
Multiple entries in the xrefs col should be pipe-separated.<br />
<br />
Ideally, the gene product files would also include the gp2protein data, so we would have an additional piece of data, an xref to a UniProt or NCBI ID.<br />
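<br />
As a rough illustration of how the gene product file could be derived from existing GAF data, the sketch below collects one record per distinct gene product, so that information repeated on every annotation line is stated only once. The function name <code>gaf_to_gp_info</code> is hypothetical, and the handling of spliceforms (using the GAF col 17 value as the object ID and GAF col 2 as the parent) simply follows the two tables above; the xrefs field is left empty because GAF files do not carry that data.<br />

```python
def gaf_to_gp_info(gaf_lines):
    """Collect one gene product record per distinct (DB, object ID) pair.

    Sketch only: field order follows the gene product tables above, and the
    xrefs column is left blank (it would be pipe-separated when filled in).
    """
    records = {}
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF comment/header lines
        cols = line.rstrip("\n").split("\t")
        cols += [""] * (17 - len(cols))  # pad optional trailing columns

        spliceform = cols[16]
        object_id = spliceform or cols[1]         # spliceforms get own entry
        parent_id = cols[1] if spliceform else ""

        key = (cols[0], object_id)
        if key in records:
            continue  # gene product already recorded from an earlier line

        records[key] = [
            cols[0],                 # DB
            object_id,               # DB Object ID
            cols[11],                # DB Object Type
            cols[12].split("|")[0],  # Taxon (gene product's own taxon only)
            cols[2],                 # DB Object Symbol
            cols[9],                 # DB Object Name
            cols[10],                # DB Object Synonym(s)
            parent_id,               # Parent GP ID
            "",                      # Xrefs in other DBs (not in GAF)
        ]
    return list(records.values())
```

Because the first annotation line seen for a gene product supplies its record, this sketch assumes that every line for the same gene product carries identical gene product data.<br />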
<br />
==Example==<br />
<br />
===Old GAF 1.0 Format===<br />
<br />
The following appears on the page http://geneontology.org/GO.annotation.fields.shtml and is an example of the current GAF file structure (shaded bg: annotation data; blue text: gp data):<br />
<br />
{| cellspacing="2" border="1"<br />
|- style="vertical-align: top"<br />
! 1<br />
DB<br />
! 2<br />
DB Object ID<br />
! 3<br />
DB Object Symbol<br />
! 4<br />
Qualifier<br />
! 5<br />
GO ID<br />
! 6<br />
DB:Reference(s)<br />
! 7<br />
Evidence code<br />
! 8<br />
With (or) From<br />
! 9<br />
Aspect<br />
! 10<br />
DB Object Name<br />
! 11<br />
DB Object Synonym(s)<br />
! 12<br />
DB Object Type<br />
(refers to col 17 if present)<br />
! 13<br />
taxon<br />
! 14<br />
Date<br />
! 15<br />
Assigned by<br />
! 16<br />
Annotation cross products<br />
! 17<br />
Spliceform<br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0003993 || SGD_REF:S000047763 || IMP || || F ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20010118 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || style="background:white;color:blue" | PHO3 || || GO:0006796 || SGD_REF:S000047115 || TAS || || P ||style="background:white;color:blue" | acid phosphatase ||style="background:white;color:blue" | YBR092C ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20041220 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932 || 20020530 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || P ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:745953 || 20030221 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || style="background:white;color:blue" | RCL1 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || F ||style="background:white;color:blue" | aminodeoxychorismate synthase ||style="background:white;color:blue" | YOL010W ||style="background:white;color:blue" | gene ||style="color:blue" | taxon:4932<nowiki>|</nowiki>taxon:2861 || 20030106 || SGD || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0031410 || PMID:11257124 || IDA || || C ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:11257124 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | <br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043116 || PMID:16043488 || IDA || || P ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | snoRNA ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-1<br />
|- style="background:#ccffff"<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || style="background:white;color:blue" | AMOT_HUMAN || || GO:0043532 || PMID:16043488 || IDA || || F ||style="background:white;color:blue" | AMOT, KIAA1071: Angiomotin ||style="background:white;color:blue" | IPI00163085 ||style="background:white;color:blue" | protein ||style="color:blue" | taxon:9606 || 20051207 || UniProtKB || ||style="background:white;color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
===Proposed new format===<br />
<br />
This is how it could look in the proposed new format.<br />
<br />
Association data:<br />
<br />
{| cellspacing="2" border="1" style="background:#ccffff"<br />
|-<br />
! DB<br />
! DB Object ID<br />
! Qualifier<br />
! GO ID<br />
! DB:Reference(s)<br />
! Evidence code<br />
! With (or) From<br />
! Interacting taxon ID (for multi-organism processes)<br />
! Date<br />
! Assigned_by<br />
! Annotation cross products<br />
! Spliceform ID (if applicable)<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0003993 || SGD_REF:S000047763 || IMP || || || 20010118 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000000296 || || GO:0006796 || SGD_REF:S000047115 || TAS || || || 20041220 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || NOT || GO:0003963 || SGD_REF:S000039255 || IDA || || || 20020530 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0006406 || SGD_REF:S000069956 || IC || GO:0000346 || taxon:745953 || 20030221 || SGD || ||<br />
|-<br />
| style="color:blue" | SGD || style="color:blue" | S000005370 || || GO:0046820 || SGD_REF:S000057703 || ISS || CGSC:pabA || taxon:2861 || 20030106 || SGD || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0031410 || PMID:11257124 || IDA || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043532 || PMID:11257124 || IDA || || || 20051207 || UniProtKB || ||<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0043116 || PMID:16043488 || IDA || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5 || || GO:0005515 || PMID:16043488 || IPI || UniProtKB:Q6RHR9-2 || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-1<br />
|-<br />
| style="color:blue" | UniProtKB || style="color:blue" | Q4VCS5-2 || || GO:0043532 || PMID:16043488 || IDA || || || 20051207 || UniProtKB || || style="color:blue" | Q4VCS5-2<br />
|}<br />
<br />
<br />
Gene Product Information (including possible data from gp2protein file) -- see the [[Gene_Product_Data_File_Format | gene product information file format]] for an in-depth view.<br />
<br />
{| cellspacing="2" border="1" style="color:blue"<br />
|-<br />
! DB<br />
! DB_Object_ID<br />
! DB_Object_Type<br />
! Taxon<br />
! DB Object Symbol<br />
! DB Object Name<br />
! DB Object Synonym(s)<br />
! Parent GP ID<br />
! Xrefs in other DBs<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000000296 || gene || 4932 || PHO3 || acid phosphatase || YBR092C || || UniProt:NE92D8<br />
|-<br />
| style="background:#ccffff" |SGD || style="background:#ccffff" |S000005370 || gene || 4932 || RCL1 || aminodeoxychorismate synthase || YOL010W || || UniProt:JN97D8<br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5 || protein || 9606 || AMOT_HUMAN || AMOT, KIAA1071: Angiomotin || KIAA1071 || || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-1 || snoRNA || 9606 || AMOT_HUMAN || Isoform 1 of Angiomotin || || UniProtKB:Q4VCS5 || <br />
|-<br />
| style="background:#ccffff" |UniProtKB || style="background:#ccffff" |Q4VCS5-2 || protein || 9606 || AMOT_HUMAN || Isoform 2 of Angiomotin || || UniProtKB:Q4VCS5 ||<br />
|}<br />
<br />
<br />
<br />
==Technical requirements and impact on existing software==<br />
<br />
<br />
For the most part, the changes suggested are simply a split of existing GAF files into two separate files, with a rearrangement of existing columns. It would be fairly easy to write a standalone script that could take in the two files and produce one from them, or vice versa.<br />
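<br />
As a rough illustration of how small such a script could be, the sketch below splits one 17-column GAF line into an association row and a gene product row, using the column orders shown in the example tables above. The function name <code>split_gaf_line</code> is hypothetical, and the handling of a dual taxon (first ID to the gene product, second ID to the interacting taxon column) is an assumption based on the example rows, not an agreed rule. The Parent GP ID and xref columns are omitted because a GAF file carries no data for them.<br />

```python
def split_gaf_line(line):
    """Split one tab-separated GAF line into (association_row, gene_product_row).

    Hypothetical helper: column positions follow the example tables on this page.
    """
    cols = line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))  # pad files that lack columns 16 and 17

    # GAF column 13 may hold "taxon:A|taxon:B"; assume A describes the gene
    # product and B is the interacting organism.
    taxa = cols[12].split("|")
    primary_taxon = taxa[0]
    interacting_taxon = taxa[1] if len(taxa) > 1 else ""

    association = [
        cols[0],            # DB
        cols[1],            # DB Object ID
        cols[3],            # Qualifier
        cols[4],            # GO ID
        cols[5],            # DB:Reference(s)
        cols[6],            # Evidence code
        cols[7],            # With (or) From
        interacting_taxon,  # Interacting taxon ID
        cols[13],           # Date
        cols[14],           # Assigned_by
        cols[15],           # Annotation cross products
        cols[16],           # Spliceform ID
    ]
    gene_product = [
        cols[0],            # DB
        cols[1],            # DB_Object_ID
        cols[11],           # DB_Object_Type
        primary_taxon,      # Taxon
        cols[2],            # DB Object Symbol
        cols[9],            # DB Object Name
        cols[10],           # DB Object Synonym(s)
    ]
    return association, gene_product
```

Because the same gene product appears on many GAF lines, a real script would also need to de-duplicate the gene product rows before writing them out.<br />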
<br />
<br />
<br />
===GO Database===<br />
<br />
Some modifications of the database loading scripts would need to be made; the new formats mirror the database structure more closely so it should actually be an improvement.<br />
<br />
<br />
===Groups submitting GO data===<br />
<br />
Rather than writing out a single GAF file, two will be required: a file containing gene product data, and a file containing annotations. For databases with a GO database-like schema, where annotation information and gene product data are kept in separate tables, this new format will more closely mirror database structure.<br />
<br />
<br />
===Groups using GO data===<br />
<br />
There are two options here: either we provide files in all formats, or we provide files in one format and supply users with a script to convert the data. The latter would incur less cost in terms of processing time and storage space for the GOC. A gradual transition to the new format, with a period of supplying both old and new files (as was done with the ontology files), is probably the best option.<br />
<br />
<br />
==Any Other Business==<br />
<br />
===What's all this spliceforms / isoforms stuff about?===<br />
<br />
Please see the [[ GAF Col17 GeneProducts | documentation on column 17]] for more details. Although the proposal for col 17 has been accepted by the GO Consortium, it is not clear how many databases are annotating different isoforms and are hence using col 17.<br />
<br />
<br />
[[Category:GAF]] [[Category:Annotation]]<br />
<br />
====Comments====<br />
<br />
GOA: This file is a great idea. We feel that it would greatly benefit our users. However, could we extend the proposed format of this file further to include optional attribute-value pairs that describe:<br />
<br />
<br />
1. DB subset (for GOA this would have values either Swiss-Prot or TrEMBL)<br />
<br />
2. Target set member (to indicate if a protein has been prioritized by an annotation project, such as Reference Genomes, the Renal or Cardiovascular annotation projects). This would mean curators could move away from having to fill in multiple Reference Genomes google spreadsheets.<br />
<br />
3. Annotation Complete: yes/no (all annotation groups now store such information, but there is currently no export mechanism for this data)<br />
<br />
([[User:Edimmer|Edimmer]] 11:27, 26 January 2010 (UTC))<br />
<br />
==The UniProt gp_association and gp_information files==<br />
<br />
Since June 2010 GOA has been supplying the UniProt annotation set in gp_association and gp_information files, in addition to the (GAF2.0 format) gene_association file; they can be found at [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association.goa_uniprot.gz gp_association.goa_uniprot.gz] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz gp_information.goa_uniprot.gz].<br />
<br />
The format of these files is fully documented in [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_association_readme gp_association_readme] and [ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information_readme gp_information_readme], but, in summary, the columns present in each of the files are as follows:<br />
<br />
===gp_association===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
|-<br />
| 01 || DB || required || 1 || 1<br />
|-<br />
| 02 || DB_Object_ID || required || 1 || 2<br />
|-<br />
| 03 || Qualifier || optional || 0 or greater || 4<br />
|-<br />
| 04 || GO ID || required || 1 || 5<br />
|-<br />
| 05 || DB:Reference(s) || required || 1 or greater || 6<br />
|-<br />
| 06 || Evidence code || required || 1 || 7<br />
|-<br />
| 07 || With || optional || 0 or greater || 8<br />
|-<br />
| 08 || Extra taxon ID || optional || 0 or 1 || 13<br />
|-<br />
| 09 || Date || required || 1 || 14<br />
|-<br />
| 10 || Assigned_by || required || 1 || 15<br />
|-<br />
| 11 || Annotation Extension || optional || 0 or greater || 16<br />
|-<br />
| 12 || Spliceform ID || optional || 0 or 1 || 17<br />
|}<br />
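<br />
The required/cardinality columns of this table can be checked mechanically. The following is a minimal sketch, not GOA's own validator: the name <code>check_gp_association_row</code> is hypothetical, and it assumes that multi-valued fields are pipe-separated, which should be confirmed against the readme before relying on it.<br />

```python
# (column name, required?, may hold more than one value?) in file order,
# transcribed from the gp_association table above.
GP_ASSOCIATION_COLUMNS = [
    ("DB", True, False),
    ("DB_Object_ID", True, False),
    ("Qualifier", False, True),
    ("GO ID", True, False),
    ("DB:Reference(s)", True, True),
    ("Evidence code", True, False),
    ("With", False, True),
    ("Extra taxon ID", False, False),
    ("Date", True, False),
    ("Assigned_by", True, False),
    ("Annotation Extension", False, True),
    ("Spliceform ID", False, False),
]


def check_gp_association_row(fields):
    """Return a list of problems found in one already-split row."""
    if len(fields) != len(GP_ASSOCIATION_COLUMNS):
        return ["expected %d columns, found %d"
                % (len(GP_ASSOCIATION_COLUMNS), len(fields))]
    problems = []
    for (name, required, multi), value in zip(GP_ASSOCIATION_COLUMNS, fields):
        # Assumption: "|" separates multiple values within a field.
        values = [v for v in value.split("|") if v]
        if required and not values:
            problems.append(name + " is required")
        if not multi and len(values) > 1:
            problems.append(name + " allows at most one value")
    return problems
```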
<br />
===gp_information===<br />
<br />
{| cellspacing="2" border="1"<br />
|-<br />
! column<br />
! name<br />
! required?<br />
! cardinality<br />
! GAF column<br />
! Example content<br />
|-<br />
| 01 || DB || required || 1 || 1 || UniProtKB<br />
|-<br />
| 02 || DB_Subset || optional || 0 or 1 || - || Swiss-Prot or TrEMBL<br />
|-<br />
| 03 || DB_Object_ID || required || 1 || 2 || Q4VCS5<br />
|-<br />
| 04 || DB_Object_Symbol || required || 1 || 3 || AMOT<br />
|-<br />
| 05 || DB_Object_Name || optional || 0 or 1 || 10 || Angiomotin<br />
|-<br />
| 06 || DB_Object_Synonym(s) || optional || 0 or greater || 11 || AMOT|KIAA1071|IPI:IPI00163085|IPI:IPI00644547|UniProtKB:AMOT_HUMAN<br />
|-<br />
| 07 || DB_Object_Type || required || 1 || 12 || protein<br />
|-<br />
| 08 || Taxon || required || 1 || 13 || taxon:9606<br />
|-<br />
| 09 || Annotation_Target_Set || optional || 0 or greater || - || BHF-UCL|KRUK|Reference Genome<br />
|-<br />
| 10 || Annotation_Completed || optional || 1 || - || timestamp (YYYYMMDD)<br />
|-<br />
| 11 || Parent_Object_ID || optional || 0 or 1 || - || UniProtKB:P21677<br />
|}<br />
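<br />
Users who still need GAF-style rows can rebuild them by joining the two files on DB and DB_Object_ID. The sketch below assumes the column orders in the two tables above. The helper names are hypothetical. Note that GAF column 9 (Aspect) is in neither file, so the sketch takes it from a caller-supplied lookup (<code>aspect_of</code>, an assumption) that would normally be derived from the ontology.<br />

```python
def index_gp_information(rows):
    """Index split gp_information rows by (DB, DB_Object_ID)."""
    return {(row[0], row[2]): row for row in rows}


def to_gaf_row(assoc, gp_index, aspect_of):
    """Combine one gp_association row with its gp_information row.

    aspect_of maps a GO ID to "F", "P" or "C"; it is a caller-supplied
    assumption because neither file stores the aspect.
    """
    info = gp_index[(assoc[0], assoc[1])]
    # GAF column 13 holds the gene product taxon, optionally followed by
    # the extra (interacting) taxon after a pipe.
    taxon = info[7] if not assoc[7] else info[7] + "|" + assoc[7]
    return [
        assoc[0],                     # 1  DB
        assoc[1],                     # 2  DB Object ID
        info[3],                      # 3  DB Object Symbol
        assoc[2],                     # 4  Qualifier
        assoc[3],                     # 5  GO ID
        assoc[4],                     # 6  DB:Reference(s)
        assoc[5],                     # 7  Evidence code
        assoc[6],                     # 8  With (or) From
        aspect_of.get(assoc[3], ""),  # 9  Aspect
        info[4],                      # 10 DB Object Name
        info[5],                      # 11 DB Object Synonym(s)
        info[6],                      # 12 DB Object Type
        taxon,                        # 13 Taxon
        assoc[8],                     # 14 Date
        assoc[9],                     # 15 Assigned by
        assoc[10],                    # 16 Annotation Extension
        assoc[11],                    # 17 Spliceform ID
    ]
```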
<br />
[[Category:Specification]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Proposal_for_integral_to_qualifier_(Archived)&diff=36868Proposal for integral to qualifier (Archived)2011-08-17T23:07:11Z<p>Girlwithglasses: </p>
<hr />
<div>GO annotations indicate that a gene product<br />
<br />
* is part of a cell component<br />
* executes a molecular function<br />
* is an active participant in a biological process.<br />
<br />
Currently annotations are weak in that they only indicate that there is some context in which the gene product is observed to do these things. The addition of a new annotation qualifier '''integral_to''' will allow annotators to make stronger annotations that indicate that a gene product is '''required''' by an organism to carry out some process or constitute a cell component. This will in turn allow for additional inferences, improving the comprehensiveness of query results, term enrichment analyses, and cross-species annotation propagation.<br />
<br />
Formally this can be thought of as making a has_part or has_participant relationship between the gene product and the GO term. This will be transitive with the existing [[has_part]] relationships in GO, allowing us to use them properly.<br />
<br />
An example:<br />
<br />
* NEF3 complex [[has_part]] core TFIIH complex ''(asserted in GO)''<br />
* yeast TFB1 [[integral to]] core TFIIH complex ''(asserted GO annotation)''<br />
&rarr;<br />
* yeast TFB1 [[integral to]] NEF3 complex ''(inferred GO annotation)''<br />
<br />
<br />
== Problem Statement ==<br />
<br />
Information is currently being lost in annotations because an important distinction is not being made. We currently have no way of indicating that a specific organism ''requires'' a gene product in order to carry out some process, or that a specific complex always contains a certain gene product. This means we are losing the ability to use the [[has_part]] relation to make additional inferences.<br />
<br />
One side effect of the current lack of inference is confusion about when to exclude certain annotations. For example, at the 2010 GO annotation camp there was a discussion about worm rpb-2, and whether to exclude IMP annotations to developmental terms. There was a reluctance to remove these IMP annotations, because they were the only experimental data in worm (this is from my memory, TODO: check). If rpb-2 was annotated as being [[integral to]] transcription, and the ontology stated that all development ''requires'' transcription, then we would see that these are just another case of redundant annotations, as we can infer that rpb-2 is involved in ALL development.<br />
<br />
For additional information, see the slides for [http://www.slideshare.net/cmungall/haspart-in-go has_part in GO]. These focus on a cell component example, but the solution presented is also applicable to BP.<br />
<br />
== Proposed Solution ==<br />
<br />
We would allow an additional qualifier '''integral_to'''. This could be mixed with existing qualifiers and the '''NOT''' modifier.<br />
<br />
The formal meaning of the qualifier is specified below. Informally, the use of this qualifier with a gene product '''G''' in a species '''S''' means:<br />
<br />
* for a CC annotation: every instance of the annotated component in '''S''' has a '''G''' as part<br />
** Example: ''every'' core TFIIH complex [[has_part]] ''some'' TFB1 (in yeast)<br />
* for a BP annotation: every instance of the annotated process in '''S''' requires '''G''', otherwise the process cannot be carried out<br />
** Example: <br />
* for a MF annotation: every instance of the annotated molecular function in '''S''' is catalyzed or otherwise executed by a '''G'''<br />
** Example:<br />
<br />
=== Examples ===<br />
<br />
I believe the following annotations could be made ('''TODO''': to be checked by an annotator). Currently these are normal annotations. These could be "promoted" to integral_to annotations.<br />
<br />
* rpb-2 in C.elegans is integral_to transcription<br />
* TFB1 in S.cer is integral_to every core TFIIH complex (GO:0000439)<br />
* MSH2 meiosis example (Pascale/Paul to fill in)<br />
<br />
TODO: more compelling example https://sourceforge.net/tracker/?func=detail&atid=440764&aid=3047074&group_id=36855 cell cycle and DNA replication<br />
<br />
== Annotation propagation behavior (informal description) ==<br />
<br />
===Propagation DOWN the is_a hierarchy ===<br />
<br />
Typically annotations "propagate up". An integral_to annotation propagates down. For example:<br />
<br />
* If rpb-2 is integral to transcription in C. elegans, then it is integral to DNA transcription.<br />
<br />
Note that an integral to annotation also implies a normal annotation. So the full inferences are<br />
<br />
* integral to transcription, and integral to all is_a descendants<br />
* sometimes active in transcription, and therefore sometimes active in all is_a ancestors of transcription<br />
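<br />
The two directions of propagation can be sketched in a few lines. This is an informal illustration only: the function names are hypothetical, and the ontology is assumed to be available as a simple mapping from each term to its direct is_a parents.<br />

```python
def reachable(start, edges):
    """Return every node reachable from start by following edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def propagate_integral_to(term, is_a_parents):
    """Return (integral_terms, normal_terms) implied by one integral_to annotation.

    is_a_parents maps a term to the set of its direct is_a parents.
    """
    # Invert the parent links so that we can also walk downwards.
    is_a_children = {}
    for child, parents in is_a_parents.items():
        for parent in parents:
            is_a_children.setdefault(parent, set()).add(child)

    # integral_to holds for the term and all of its is_a descendants.
    integral_terms = {term} | reachable(term, is_a_children)
    # The implied normal annotation holds for the term and all is_a ancestors.
    normal_terms = {term} | reachable(term, is_a_parents)
    return integral_terms, normal_terms
```

For a toy hierarchy in which "DNA transcription" is_a "transcription" and "transcription" has one is_a parent, an integral_to annotation to "transcription" yields integral_to for "transcription" and "DNA transcription", and a normal annotation for "transcription" and its parent.<br />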
<br />
TODO: diagram<br />
<br />
=== Propagation over has_part ===<br />
<br />
Protein Complex example:<br />
<br />
* If TFB1 is integral to core TFIIH complex (GO:0000439), then it is integral to NEF3 complex and also integral to holo TFIIH complex. This is based on there being two relationships in the ontology:<br />
** [every] NEF3 complex has_part [some] core TFIIH complex<br />
** [every] holo TFIIH complex has_part [some] core TFIIH complex<br />
<br />
Process example:<br />
<br />
here we assume that the ontology contains the links<br />
<br />
* [every] developmental process has_part [some] gene expression<br />
* [every] gene expression has_part [some] transcription<br />
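<br />
Both examples follow the same rule: if a gene product is integral to a part, it is integral to every whole that has that part, directly or through a chain of has_part links. A minimal sketch, with a hypothetical function name and the has_part links supplied as a plain mapping from each whole to its parts:<br />

```python
def infer_integral_to(annotations, has_part):
    """Return the integral_to annotations implied by has_part links.

    annotations: iterable of (gene_product, term) asserted integral_to pairs.
    has_part:    mapping from a whole to the set of its direct parts.
    """
    # Invert the links so that we can climb from a part to its wholes.
    part_to_wholes = {}
    for whole, parts in has_part.items():
        for part in parts:
            part_to_wholes.setdefault(part, set()).add(whole)

    inferred = set()
    for gene_product, term in annotations:
        seen = {term}
        stack = [term]
        while stack:
            node = stack.pop()
            for whole in part_to_wholes.get(node, ()):
                if whole not in seen:
                    seen.add(whole)
                    stack.append(whole)
                    inferred.add((gene_product, whole))
    return inferred
```

With the links listed above, an asserted annotation of TFB1 to core TFIIH complex yields inferred annotations to NEF3 complex and holo TFIIH complex, and rpb-2 integral to transcription yields gene expression and developmental process.<br />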
<br />
TODO: check with developmental biologist, but this seems uncontroversial.<br />
<br />
== Formal Description ==<br />
<br />
TODO: when the annotations and ontologies are expressed in OWL, the correct semantics come for free.<br />
Maybe move this into a separate page...<br />
<br />
== Where do these annotations come from? ==<br />
<br />
Obviously it would be a lot of work to retrospectively go back and strengthen existing annotations. Some of this could come from "common biological knowledge".<br />
<br />
Another source is pathway databases. When a gene product is assigned to a step in Reactome, it can be treated as an integral_to qualified annotation. Care must be taken when mapping from the Reactome ID to the GO term, because the Reactome process may well be more specific than the GO term.<br />
<br />
[[Category:Proposals]]<br />
[[Category:GAF]]<br />
[[Category:Reasoning]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=LEGO-style_annotation_ideas&diff=36867LEGO-style annotation ideas2011-08-17T23:01:56Z<p>Girlwithglasses: /* Linking Annotations via a unique annotation id */</p>
<hr />
<div>This page is for collecting together thoughts and ideas on what we want from a new annotation system. Given a clean sheet, what would you want to capture? Examples are always useful.<br />
<br />
===What can't be fully captured in current format===<br />
<br />
For annotation developments that could be included in the current GAF format, see full details at: [[Proposed Developments to the GAF annotation format]]<br />
<br />
====Terms from external ontologies====<br />
** clarification: can be used as differentia in c16, but not in c5.<br />
** Allowing other onts in c5 is not in principle hard - BUT:<br />
** Adding other ontologies (e.g. CL in c5 would require explicit relationship type like expressed_in) (next item)<br />
<br />
====Nested class expressions (post-composed terms)====<br />
** c16 allows multiple differentia but not nested class expressions<br />
** the syntax was designed to accommodate this, but at this point it gets quite complex for people and feels like overloading<br />
<br />
====Multiple pieces of evidence for a single assertion==== <br />
<br />
A new annotation format could provide a more detailed, structured format for the evidence supporting an annotation<br />
<br />
** Currently the evidence for an annotation is located in the reference (col.6), evidence (col.7) and with (col.8) fields in the GAF. There are restrictions on the acceptable values and their cardinality in these fields. However, curators would like to make a [[chain of evidence]] that would result in the inclusion of multiple evidence and reference identifiers to support a single annotation. While work-arounds are being discussed on calls, solutions are not ideal.[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
** there are sets of annotations that can only be identified via the assigned_by field (for instance the GOC-assigned annotations automatically inferred from MF-BP links), this to me seems to indicate that we need another field to consistently indicate how these annotations are generated [[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
** a diversity of values is included in the 'with' field. This diversity can only be interpreted correctly once users have consulted the evidence code documentation and in some cases the cited GO_REFs. Common values are:<br />
*** ''protein accessions'': <br />
**** for IPI-evidenced annotations these are binding partners to the annotation object (col.2) (there are also differences in the format between curation groups where many such ids are listed- 1:many or distinct binary interactions)<br />
**** for ISS or IEA-evidenced annotations that use sequence/orthology information, they indicate the orthologous protein from which annotation data was obtained<br />
*** ''gene identifiers'': mutated genes<br />
*** ''GO IDs'' that support an IC annotation by providing a way of tracing back to primary-evidenced annotations<br />
* IEA annotations include external vocabularies which have supported the annotation prediction (e.g. IPR ids)<br />
09:11, 16 August 2011 (PDT)<br />
<br />
====Optimize the annotation format for viral curators.==== <br />
<br />
The dual taxon requirement for many virus annotations is not ideal: many investigators use an organism that never acts as the virus's natural host. While it might be of interest to the user that the experimental host context is captured in some cases, perhaps data on known viral hosts should additionally be used or supplied? UniProt has a virus/host list that could be used. However, how is such dual taxon information intended to be used by our users? Should there be a reciprocal annotation/link to the host protein/process to indicate that it is targeted by viral action? (see below)<br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
====Capture the subject of a GO term's activity.====<br />
Although the target of an activity can now be captured in column 16, how do curators annotate a target of an activity when they do not know the identity of the gene product carrying out the activity (the annotation object)?<br />
<br />
For instance:<br />
<br />
1. PMID:10085113 describes the caspase cleavage site in Atrophin-1; indicating that it is a target of executioner caspases and involved in the execution phase of apoptosis. While caspase 3 is used in the paper to demonstrate this protein is a caspase substrate, it is likely to be the target of other executioner caspases as well.<br />
<br />
Could something be done using a full set of relationships between the ID in col. 2 and the GO ID in col. 5? Could targets of an annotation that are cited in column 16 be used to automatically generate an annotation with the target in col. 2, along with an appropriate relationship to the GO ID in column 5?<br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
==== Linking Annotations via a unique annotation id ====<br />
<br />
See [[2010_Bar_Harbor_Minutes#Alex.27s_proposal| Alex's proposal from Bar Harbor]] and [[Multiple_term_annotations | multiple term annotations]]<br />
<br />
- this does sound powerful, but I am concerned whether it is possible before all annotations are kept and developed in one shared annotation database (CAF), where they can be consistently audited. Building complex annotation lines on top of annotation IDs might be problematic when we cannot be sure that all groups maintain the annotations and their associated IDs in the same manner.<br />
<br />
- could be useful for different external annotation efforts. For instance, they might like to use an annotation ID to indicate where a specific gp involved in a normal MF/BP is disrupted to become involved in a disease/trait/phenotype?<br />
<br />
[[User:Edimmer|Edimmer]] 09:11, 16 August 2011 (PDT)<br />
<br />
==== Capturing further information on subcellular location ====<br />
* When a gene product is active in more than one location, but the curator is not told which activity is carried out at each location, it would be useful to be able to indicate that the protein moves between locations A and B. Ideally, two GO terms could be annotated in the equivalent of column 5, e.g. cytoplasm and nucleus. I can't see how this can be done with the current format without losing information.<br />
<br />
* Would it be desirable to indicate in an annotation when a gene product is ''predominantly'' in location X?<br />
<br />
===Future annotation areas===<br />
*There is likely to be a lot of data coming from metabolomics and metagenomics studies in the next few years (e.g. the Human Microbiome Project), so we might want to consider how to annotate population-level processes.<br />
===Useful links===<br />
* [[File:Paul's LEGO presentation from Bar Harbor Sept 2010.pdf]]<br />
* [[File:Paul's LEGO white paper March 2010.pdf]]<br />
<br />
=== Technical ===<br />
<br />
* [[LEGO in OWL]]<br />
<br />
===Meetings===<br />
Aug 23 8am PST</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Guidelines_from_Annotation_Camp&diff=36702Guidelines from Annotation Camp2011-08-04T19:11:19Z<p>Girlwithglasses: /* Downstream Process guidelines */</p>
<hr />
<div>==Downstream Process guidelines==<br />
<br />
===Requesting more specific terms for downstream processes=== <br />
<br />
Quite often it is the case that the most relevant GO term will not exist. It is desirable to request terms which describe the involvement of a process in another process, if that will give more specificity to the annotation.<br />
For example, to describe a gene product's "intent" to change the "state" of the cell:<br />
<br />
• Growth factor BMP2 is instrumental in cardiac cell differentiation<br />
<br />
• Following stimulation with BMP2, large numbers of genes are up/down regulated<br />
<br />
Requesting the new GO term 'BMP signaling involved in cardiac cell differentiation' may be preferable to annotating to the separate terms 'BMP signaling' and 'cardiac cell differentiation' as it will be clear how the gene product is involved in cardiac cell differentiation. i.e. qualify how the gene product is involved in the downstream process in preference to annotating to the downstream process term.<br />
<br />
To assist in the creation of these new terms, the [http://amigo.berkeleybop.org/cgi-bin/amigo/xp_term_request?mode=process AmiGO 'Cross-product Term Request' tool] will be useful, when it has been put into production.<br />
<br />
===Annotating downstream processes for gene products involved in core or specific processes=== <br />
<br />
For small scale experiments, curators should annotate to the experimental evidence in the paper.<br />
<br />
However, curator judgement should be used, taking into account what the curator knows about:<br />
<br />
a) the gene product; does it have a central role causing it to affect multiple processes, or does it have few specific targets?<br />
<br />
b) the quality of the experimental assays performed in the paper; are they fully explained and the evidence supplied convincing? (See separate guidelines for annotation of high-throughput experiments.)<br />
<br />
''Example 1. Gene product involved in core process.''<br />
<br />
'''a) Yeast RNA polymerase II subunit RPB2'''<br />
<br />
• has core function of RNA polymerase activity<br />
<br />
• likely to affect large number of processes unrelated to its function<br />
<br />
• most curators agree should annotate only to 'transcription'<br />
<br />
'''b) Yeast spliceosome'''<br />
<br />
• in S. cerevisiae several genes are components of spliceosome<br />
<br />
• when mutated the strains have defects in translation<br />
<br />
• later evidence confirmed the genes' involvement in mRNA splicing, NOT translation<br />
<br />
• since most splicing in yeast involves ribosomal genes, an effect on translation was seen<br />
<br />
• so annotations to 'translation' were removed from the spliceosome components<br />
<br />
''Example 2. Gene product involved in core and specific process(es).''<br />
<br />
'''S. pombe gene Sre1'''<br />
<br />
• direct transcriptional regulator of genes which have a role in heme and lipid biosynthesis [http://www.ncbi.nlm.nih.gov/sites/entrez/16537923 PMID:16537923]<br />
<br />
• the curator judged this to be important information for this gene product<br />
<br />
• annotations were made to:<br />
<br />
** specific RNA polymerase II transcription factor activity<br />
** regulation of transcription<br />
** positive regulation of heme biosynthesis<br />
** positive regulation of lipid biosynthesis<br />
<br />
• In accordance with Guideline 1 for Downstream Processes, we would recommend that new terms be requested for:<br />
<br />
** Regulation of transcription involved in heme biosynthesis<br />
** Regulation of transcription involved in lipid biosynthesis<br />
<br />
===Annotating downstream processes to poorly characterised gene products=== <br />
<br />
If a gene product has limited experimental literature, such as a newly characterised protein, it is acceptable to annotate to more general 'downstream' process terms that may represent a phenotype.<br />
<br />
As more functional information is published about a gene product, these annotations to potential downstream processes may be removed if they are deemed by the annotating group as indirect, or they may be kept depending on each MOD's strategy.<br />
<br />
Always remove annotations that are incorrect or are from substandard evidence (NAS/TAS/IC) when replaced with better evidence to the same or more-granular term.<br />
<br />
===Annotating downstream processes to gene products in a ligand-receptor signaling pathway=== <br />
<br />
Annotate ligand-receptor signaling pathways as shown in the following diagrams.<br />
<br />
General consideration:<br />
For a signaling pathway the ligand is considered part of the pathway, e.g. the insulin signaling pathway. In this case, a factor which limits/increases the availability of a ligand to a receptor should be annotated as regulating the ligand/receptor pathway.<br />
<br />
N.B. Clarification of the start/end of a signaling pathway by the signaling group will allow us to refine these guidelines<br />
<br />
[[File:Pathway_annotation_diagram2.pdf]]<br />
<br />
===General note on revision of annotation sets===<br />
<br />
''Relevant to gene products with little annotatable evidence''<br />
<br />
When further information about a gene product is obtained, there are two options for the annotation set:<br />
<br />
1. Remove annotations to indirect/downstream processes (or update them to ‘regulation’ terms). This ‘deleted’ information is usually stored in the annotating group’s phenotype database.<br />
<br />
2. Do not remove annotations to indirect/downstream processes, because:<br />
<br />
a) the downstream annotations are supported by good evidence, are wanted as a history of annotation, or help give a complete overview of knowledge about the gene product.<br />
<br />
b) the group does not have the resources to revise annotation sets, or has no alternative place to store the data.<br />
<br />
''It is important to note that MODs that keep these annotations will be a source of downstream process terms to MODs which do not keep these terms, via ISS from orthologs (e.g. PAINT).''<br />
<br />
==Binding guidelines==<br />
<br />
===Using terms that imply binding of substrates===<br />
<br />
As many terms in the Molecular Function ontology implicitly or explicitly imply the binding of a chemical or protein, it is unnecessary to co-annotate a gene product to a term from the binding node of GO to describe the binding of substrates or products that are already adequately captured in the definition of the Molecular Function term. For instance, a protein with enzymatic activity MUST bind all of the substrates and products of the reaction it catalyzes. Similarly, a protein with transporter activity MUST bind the molecules it transports. The curator should try to capture the specifics as much as feasible and avoid redundant annotations. Annotate to a binding term whenever an experiment shows binding, but not catalysis/transport. Curators should use their judgment to decide whether the interaction is physiologically relevant and capture information relevant to the in vivo situation. <br />
<br />
===Choosing more descriptive terms than 'protein binding'===<br />
<br />
Child terms that describe a particular class of protein binding (e.g. GO:0030971:receptor tyrosine kinase binding) should be used in preference to the parent term GO:0005515 protein binding. The IPI evidence code should be used where possible for annotation of all protein-protein interactions and the precise identity of the interacting protein should be captured in the ‘with’ column (8). At present a variety of identifiers can be used in the ‘with’ column (8) or the annotation extension column (16), see [http://www.geneontology.org/GO.format.gaf-2_0.shtml GO Annotation File Format 2.0 Guide].<br />
<br />
===Identifying binding partners using columns 8 and 16===<br />
<br />
When a gene product is being annotated to a binding activity term, the 'with' column (8) and/or the annotation extension column (16) can be used to capture additional information about the identity of the binding partner of the gene product being annotated. To understand when to use column 8, column 16, or both, it is important to remember that entries in column 8 support the evidence used to infer the function, while entries in column 16 modify the GO term used in the GO_ID column (5). The curator also needs to remember that the 'with' column (8) can be used with only a subset of evidence codes: IPI, IC, IEA, IGI, IMP or ISS; column 8 cannot be used with an IDA evidence code, see [http://www.geneontology.org/GO.evidence.shtml evidence code documentation].<br />
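<br />
The evidence-code restriction on column 8 can be expressed as a simple check. The following is a minimal sketch, not an official GO validation tool: the function name and the message text are hypothetical, and the set of allowed codes is taken only from the list in the paragraph above, so any code not named there is treated as disallowed.<br />
<br />
```python
# Evidence codes that the paragraph above lists as compatible with column 8.
# Assumption: codes not named in that list are treated as disallowed.
WITH_ALLOWED_CODES = {"IPI", "IC", "IEA", "IGI", "IMP", "ISS"}


def with_column_problem(evidence_code, with_value):
    """Return a message if column 8 is filled for a disallowed code, else None."""
    if with_value and evidence_code not in WITH_ALLOWED_CODES:
        return f"'with' column (8) must be empty for evidence code {evidence_code}"
    return None


# An IPI annotation may name the interacting protein in column 8.
print(with_column_problem("IPI", "UniProtKB:P68135"))
# An IDA annotation with a filled column 8 is reported.
print(with_column_problem("IDA", "UniProtKB:P68135"))
```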
<br />
''Examples of using the 'with' column (8)''<br />
<br />
The annotation of '''Protein A to a GO binding term with evidence code IPI and Protein B in the 'with' column (8)''' makes the statement that Protein A has the binding activity defined by the GO term and this function was inferred from interaction with Protein B; binding to Protein B isn't necessarily the in vivo function of Protein A.<br />
<br />
1) Column 8 can be used to make annotations based on experiments where the evidence for the function of Protein A binding Protein B in species X is based on binding of protein B from species Y. For [http://dev.biologists.org/content/130/4/693.long example], the C. elegans Unc-115 protein was shown to bind to actin filaments made with actin purified from rabbit skeletal muscle. This would be annotated as GO:0051015:actin filament binding using an IPI evidence code and putting an accession for rabbit skeletal muscle actin, UniProtKB:P68135, in the 'with' column (8). This annotation makes the statement that C. elegans Unc-115 has the molecular function of actin filament binding inferred from experiments using rabbit actin. <br />
<br />
2) Column 8 can be used to indicate that the evidence for binding a small molecule is based on an experiment using an analog. The annotation '''Protein A GO:0005524:ATP binding IPI column 8 ATP-gamma-S''' captures the information that ATP binding activity was inferred from binding of a non-hydrolyzable ATP analog. <br />
<br />
''Examples of using the annotation extension column (16)''<br />
<br />
The annotation of '''Protein A to a GO function term with Protein B and a has_participant relationship in the annotation extension column (16)''' makes the statement that an in vivo target of Protein A is Protein B. This is equivalent to the post-compositional creation of a new child term.<br />
<br />
3) The zebrafish Lnx2b protein (UniProtKB:A4VCF7) was shown to ubiquitinate zebrafish Dharma (UniProtKB:O93236) in [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2759713/?tool=pubmed PMID:19668196]. Therefore Lnx2b can be annotated to GO:0004842:ubiquitin-protein ligase activity, adding has_input UniProtKB:O93236 in the annotation extension column (16). This annotation makes the statement that Dharma is a substrate of the ubiquitin-protein ligase activity of Lnx2b. <br />
<br />
4) The human ABCG1 protein has been annotated to GO:0034041 sterol-transporting ATPase activity with an IDA evidence code. The experiments in the [http://gowiki.tamu.edu/wiki/index.php/PMID:17408620 paper], demonstrate that the target is 7β-hydroxycholesterol; this information can be added to the annotation by including the ChEBI ID for 7β-hydroxycholesterol, CHEBI:42989, in the annotation extension column (16): post-composing the GO term 7β-hydroxycholesterol-transporting ATPase activity. <br />
<br />
The 'with' column (8) and the annotation extension column (16) should be used '''only''' for direct interactions and '''only''' when the binding relationship is not already included in the GO term and/or definition. See [http://wiki.geneontology.org/index.php/Annotation_Cross_Products column 16 documentation] for relationship types to use when adding IDs in the annotation extension column (16).<br />
<br />
===Ontology development for protein binding===<br />
<br />
Future ontology development efforts should be relied upon to improve the searching capability of any user who is specifically interested in gene products carrying out a certain type of substrate/product binding. The GOC is developing 'has_part' relationships that will link function terms to the substrate binding they imply. The existing GO will follow this new format, e.g. transcription factor activity will have a 'has_part' relationship to DNA binding rather than an 'is_a' relationship. Curators should request new 'has_part' relationships (and terms) if these do not exist. <br />
<br />
=='Response to' guidelines==<br />
<br />
1. Updated definition of top-level 'response to' terms, to indicate where the response begins and ends.<br />
<br />
from: A change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a stimulus.<br />
<br />
to:<br />
<br />
Any process that results in a change in state or activity of a cell or organism as the result of a stimulus. The process begins with detection of the stimulus and ends with a change in state or activity or the cell or organism. <br />
<br />
This change was made and released in ontology version 1.1960<br />
<br />
'''Examples:''' <br />
<br />
response to stimulus ; GO:0050896<br />
<br />
Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a stimulus. The process begins with detection of the stimulus and ends with a change in state or activity or the cell or organism<br />
<br />
GO:0051716 cellular response to stimulus<br />
<br />
Any process that results in a change in state or activity of a cell (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a stimulus. The process begins with detection of the stimulus by a cell and ends with a change in state or activity or the cell.<br />
<br />
<br />
2. Advisory quality control check: High level ‘response to’ terms should not directly be used for annotation, unless additional information is supplied in column 16.<br />
<br />
3. Update guidelines: Encourage the use of granular terms for ‘responses’<br />
<br />
4. Update guidelines: Be careful to use IEP when the experiment is observing expression level. Example: PMID:8888624 and the annotation for A. thaliana [http://www.arabidopsis.org/servlets/TairObject?accession=locus:2182783 BIP1], which should use IEP rather than IDA.<br />
<br />
==Use of Regulation Terms==<br />
<br />
===Background===<br />
The GO Consortium recognized quite early on in the development of the Biological Process ontology that there were gene products that participated directly in a process and gene products that regulated a process, positively and/or negatively. But how do curators know to which of these terms they should be annotating, and is it possible, for a given process, to annotate the same gene product to both a parent term and one of its associated regulation terms?<br />
<br />
To begin to address these questions here are some guidelines for annotating, or not, to regulation terms:<br />
<br />
===Guideline 1: Use existing biological knowledge to define the process.=== <br />
<br />
In order to determine whether a gene product participates in a process or regulates that process (or both) curators need to consider the nature of the process. Processes can be considered as ordered assemblies of molecular functions and every process has a beginning, middle, and end. <br />
<br />
Use existing biological knowledge and the paper being curated as guides. Is there a defined pathway, i.e. distinct molecular functions, and have the gene products that perform those functions been identified? Does the gene product being annotated perform one of those functions or a function outside of the process that might start, stop, or change the rate at which the process proceeds? <br />
<br />
In reality, the beginning, middle, and end of some processes will be easier to define than others. For example, signaling pathways, such as MAPK signaling, will be easier to define than broader, organismal-level processes such as embryonic development. Curators should use their judgement, based on the published literature, to guide their annotation.<br />
<br />
'''Example: Atg1'''<br />
<br />
Saccharomyces cerevisiae Atg1 encodes a protein kinase that is involved in [http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0006914 autophagy]: "The process by which cells digest parts of their own cytoplasm; allows for both recycling of macromolecular constituents under conditions of cellular stress and remodeling the intracellular structure for cell differentiation."<br />
<br />
Atg1 activity is critical for the induction of autophagy, specifically for formation of autophagic vacuoles. Should Atg1 be annotated to autophagic vacuole formation or regulation of autophagic vacuole formation? Authors have used language that could lead curators to make annotations to either term.<br />
<br />
In this case, annotators need to consider the sum of what is known about the autophagic pathway and Atg1's role in that pathway. <br />
<br />
Using that knowledge, SGD has annotated Atg1 to the parent process term, autophagic vacuole formation, because once Atg1 is active, the 'go' or 'no go' decision for autophagy has already been made. More upstream genes appear to actually be regulating the autophagic pathway. <br />
<br />
http://gocwiki.geneontology.org/index.php/2010_GO_camp_Use_of_Regulation_issues#Example_2<br />
<br />
===Guideline 2: If you aren’t sure, consider annotating to the parent process term.=== <br />
<br />
If the gene product performs one of the functions, annotate directly to the process. If the gene product regulates then it should be annotated to regulation of that process.<br />
<br />
If you aren't sure what term to use, annotate to the parent process term. As more information about the process becomes available, you may be able to refine your annotations (see Guideline #4 below).<br />
<br />
===Guideline 3: Improve the ontology by defining, wherever possible, the beginning, middle, and end of a process.===<br />
<br />
Wherever possible, include the beginning, middle, and end of a process in the corresponding term definition. This will help annotators choose the appropriate term for their annotations.<br />
<br />
===Guideline 4: Revisit annotations when new knowledge becomes available.===<br />
<br />
GO annotations should reflect the present state of biological knowledge. Therefore, as the understanding of a biological process improves, it may be necessary to revisit and refine existing annotations.<br />
<br />
===Guideline 5: Annotations based on mutant phenotypes should take mechanism into account.===<br />
<br />
Mutant phenotypes are often used to make annotations to regulation terms because they fit the criteria of the term definition, i.e. authors report a change in the frequency, rate, or extent of a process. <br />
<br />
However, in using IMP to correctly make regulation annotations it is important to consider various factors, including: 1) the assay type, 2) nature of the alleles (null vs reduction of function), and 3) molecular identity of the gene product. <br />
<br />
Again, if it isn't clear that a gene product is involved in regulation, then it is better to annotate to the parent process term.<br />
<br />
'''Example: muscle contraction and ''C. elegans'' mutants'''<br />
<br />
In ''C. elegans'', a number of genes can mutate to paralysis or slowed locomotion due to defects in muscle contraction. This includes genes that encode everything from myosin heavy chain to calcium channels to transcription factors. Depending upon the nature of the allele, sometimes the mutant phenotypes for the same gene can lead to both process and regulation terms. In this case, consideration of the process, the nature of the allele (complete or partial loss of function), and the molecular identity of the gene product can guide curators in making the appropriate annotation.<br />
<br />
http://wiki.geneontology.org/images/4/47/Regulation_example.pdf<br />
<br />
===Guideline 6: Some gene products may be annotated to both a process and regulation of that process.===<br />
<br />
Positive and negative feedback loops are an essential part of many signaling pathways.<br />
<br />
If one member of a pathway regulates the activity of a ''different'' member of the pathway, it could be annotated to both the process and regulation of that process.<br />
<br />
When annotating gene products involved in a signaling pathway, however, curators should not annotate gene products that directly activate the next gene product in the pathway to regulation of that pathway.<br />
<br />
For example, MAPKK would not be annotated to positive regulation of MAPKKK cascade just because it phosphorylates and activates MAPK. <br />
<br />
However, gene products that, for example, feed back onto earlier steps in the pathway, may be annotated to both the parent process term and a regulation term.<br />
<br />
'''Example: ERK1/2'''<br />
<br />
ERK1/2 activation requires activity of FRS2alpha which, in turn, is negatively regulated by activated ERK1/2.<br />
<br />
Could ERK1/2 be annotated to both MAPKKK cascade and negative regulation of MAPKKK cascade?<br />
<br />
[http://www.molbiolcell.org/cgi/content/full/21/4/664 Phosphoprotein Enriched in Astrocytes 15 kDa (PEA-15) Reprograms Growth Factor Signaling by Inhibiting Threonine Phosphorylation of Fibroblast Receptor Substrate 2{alpha}]<br />
<br />
Cases where the presence/absence of one of the members of a pathway is limiting should not be annotated to regulation, e.g. if the amount of a receptor on the surface of a cell regulates the process, the receptor should ''not'' be annotated to the regulation term.<br />
<br />
==Protein complexes guidelines==<br />
<br />
1. Long term goal is to annotate complexes; details and requirements need to be clarified.<br />
<br />
2. Guidelines + Quality control check: Avoid annotations to GO: MF by IPI (except for ‘protein binding’ and children) - Error reports will be generated.<br />
<br />
3. Add to the guidelines: Do not make EXP annotations to MF when only the CC is observed.<br />
<br />
==Quality control checks==<br />
<br />
1. Check for co-annotation of a less-granular term with a more-granular term in the same path.<br />
Any action from this check is optional for each group, as it may still be appropriate to keep both annotations. For example, it is acceptable to retain the less-granular annotation if:<br />
<br />
• It has a 'better' evidence code<br />
<br />
• The curator feels it adds weight to the more-granular annotation<br />
<br />
• Both annotations add value, e.g. 'histone methylation' and 'protein amino acid methylation'<br />
<br />
2. Do not use the 'NOT' qualifier with 'protein binding'; GO:0005515. This rule applies only to GO:0005515. Children of this term can be qualified with NOT, because the GO term then supplies further information on the type of binding; e.g. NOT + 'GO:0051529 NFAT4 protein binding' would be fine, as the negative binding statement applies only to the NFAT4 protein.<br />
<br />
3. Annotations to 'protein binding'; GO:0005515, should only be supplied with an evidence code where the interactor can be identified in the 'with' field. This rule applies only to GO:0005515; it is less of a concern for child terms of protein binding, where the type of protein is identified in the GO term name.<br />
<br />
4. Annotations to 'protein binding' should not use the ISS evidence code. This rule applies only to GO:0005515; it is less of a concern for child terms of protein binding, where the type of protein is identified in the GO term name.</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Ontology_meeting_2011-07-06&diff=36352Ontology meeting 2011-07-062011-07-06T15:21:38Z<p>Girlwithglasses: </p>
<hr />
<div>'''Report'''<br />
<br />
Chris will report on status of [[:Category:Internal_Cross_Products|internal cross products]]. We need to make a plan of which sets come next for the timeline.<br />
<br />
'''Discussion notes'''<br />
<br />
* Internal xps. An example of a not-so-straightforward one is in this SF request: [https://sourceforge.net/tracker/?func=detail&aid=3123877&group_id=36855&atid=440764]<br />
* Regulation xps - we've noticed quite a few places where the regulation tree is out-of-sync with the main tree e.g. synaptogenesis. Can the xps be used to check for these inconsistencies?<br />
**This would be in the abduced links report, which we have not been keeping up with. Tanya and David will start looking at this again. Perhaps we can reformat it?<br />
<br />
* Host cc. GOA want to make a mapping between host cc x and cc x. Perhaps we should just instantiate these relationships in the ontology? What relationship though? [Jane]<br />
* Ontology reports/email digests. Where did we get up to with this? Rama asked again this week.<br />
**Example files:<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-med-2011-05-08.html weekly report]<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-med-2011-05-08.txt weekly report, text]<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-2011-05-08.html weekly report, long format]<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-2011-05-08.txt weekly report, long format, text]<br />
** [http://geneontology.org/scratch/ont_diffs/new_term.rss RSS feed for new terms]<br />
** [http://geneontology.org/scratch/ont_diffs/obs_term.rss RSS feed for obsolete terms]<br />
** [http://geneontology.org/scratch/def_diffs/def_diffs-2011-05-29.shtml Def diffs report]<br />
* Annotation xp relations.<br />
** We need to set up a meeting with members of the annotation group so we are coordinated<br />
** I have started adding these relations here:<br />
*** http://www.geneontology.org/scratch/xps/go_annotation_extension_relations.obo<br />
*** http://www.geneontology.org/scratch/xps/go_annotation_extension_examples.obo<br />
* Remember to name wiki pages appropriately - I have moved this one<br />
<br />
'''Task List'''<br />
<br />
<br />
[[Category:Ontology]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Ontology_meeting_2011-07-06&diff=36329Ontology meeting 2011-07-062011-07-06T00:46:41Z<p>Girlwithglasses: </p>
<hr />
<div>'''Report'''<br />
<br />
Chris will report on status of internal cross products. We need to make a plan of which sets come next for the timeline.<br />
<br />
'''Discussion notes'''<br />
<br />
* Internal xps. An example of a not-so-straightforward one is in this SF request: [https://sourceforge.net/tracker/?func=detail&aid=3123877&group_id=36855&atid=440764]<br />
* Regulation xps - we've noticed quite a few places where the regulation tree is out-of-sync with the main tree e.g. synaptogenesis. Can the xps be used to check for these inconsistencies?<br />
**This would be in the abduced links report, which we have not been keeping up with. Tanya and David will start looking at this again. Perhaps we can reformat it?<br />
<br />
* Host cc. GOA want to make a mapping between host cc x and cc x. Perhaps we should just instantiate these relationships in the ontology? What relationship though? [Jane]<br />
* Ontology reports/email digests. Where did we get up to with this? Rama asked again this week.<br />
**Example files:<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-med-2011-05-08.html weekly report]<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-med-2011-05-08.txt weekly report, text]<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-2011-05-08.html weekly report, long format]<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-2011-05-08.txt weekly report, long format, text]<br />
** [http://geneontology.org/scratch/ont_diffs/new_term.rss RSS feed for new terms]<br />
** [http://geneontology.org/scratch/ont_diffs/obs_term.rss RSS feed for obsolete terms]<br />
'''Task List'''<br />
<br />
[[Category:Ontology]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Ontology_meeting_2011-07-06&diff=36328Ontology meeting 2011-07-062011-07-06T00:46:18Z<p>Girlwithglasses: </p>
<hr />
<div>'''Report'''<br />
<br />
Chris will report on status of internal cross products. We need to make a plan of which sets come next for the timeline.<br />
<br />
'''Discussion notes'''<br />
<br />
* Internal xps. An example of a not-so-straightforward one is in this SF request: [https://sourceforge.net/tracker/?func=detail&aid=3123877&group_id=36855&atid=440764]<br />
* Regulation xps - we've noticed quite a few places where the regulation tree is out-of-sync with the main tree e.g. synaptogenesis. Can the xps be used to check for these inconsistencies?<br />
**This would be in the abduced links report, which we have not been keeping up with. Tanya and David will start looking at this again. Perhaps we can reformat it?<br />
<br />
* Host cc. GOA want to make a mapping between host cc x and cc x. Perhaps we should just instantiate these relationships in the ontology? What relationship though? [Jane]<br />
* Ontology reports/email digests. Where did we get up to with this? Rama asked again this week.<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-med-2011-05-08.html weekly report]<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-med-2011-05-08.txt weekly report, text]<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-2011-05-08.html weekly report, long format]<br />
** [http://geneontology.org/scratch/ont_diffs/weekly-diff-2011-05-08.txt weekly report, long format, text]<br />
** [http://geneontology.org/scratch/ont_diffs/new_term.rss RSS feed for new terms]<br />
** [http://geneontology.org/scratch/ont_diffs/obs_term.rss RSS feed for obsolete terms]<br />
'''Task List'''<br />
<br />
[[Category:Ontology]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Relation_composition&diff=36280Relation composition2011-07-01T21:38:43Z<p>Girlwithglasses: /* Updates to relations involving gene products, April 2011 */</p>
<hr />
<div>This page describes the relation composition rules for relations used in GO. See the OBO Edit Reasoner paper on google docs for background. See also [[Transitive_closure]]<br />
<br />
See also the [http://obofoundry.org/ro Relation Ontology] and accompanying paper<br />
<br />
== Simple composition rules ==<br />
<br />
=== rules for is_a and part_of ===<br />
<br />
TODO: fill in examples<br />
<br />
Basic transitivity compositions:<br />
<br />
* [[is_a]] . [[is_a]] &rarr; [[is_a]] ''transitivity of is_a''<br />
* [[part_of]] . [[part_of]] &rarr; [[part_of]] ''transitivity of part_of''<br />
<br />
For example:<br />
<br />
mitosis [[is_a]] cell cycle phase [[is_a]] cell cycle process, <br />
''THEREFORE'' mitosis [[is_a]] cell cycle process<br />
<br />
The following rules arise from the definitions given in the [http://obofoundry.org/ro OBO Relation Ontology]<br />
<br />
* [[is_a]] . [[part_of]] &rarr; [[part_of]] ''transitivity under is_a''<br />
* [[part_of]] . [[is_a]] &rarr; [[part_of]] ''transitivity over is_a''<br />
<br />
For example, starting with:<br />
<br />
mitosis [[part_of]] M phase of mitotic cell cycle [[is_a]] M phase [[is_a]] cell cycle phase [[is_a]] cell cycle process [[part_of]] cell cycle<br />
<br />
We can iteratively reduce this by repeated application of composition rules:<br />
<br />
# mitosis [[part_of]] M phase [[is_a]] cell cycle phase [[is_a]] cell cycle process [[part_of]] cell cycle<br />
# mitosis [[part_of]] M phase [[is_a]] cell cycle process [[part_of]] cell cycle<br />
# mitosis [[part_of]] cell cycle process [[part_of]] cell cycle<br />
# mitosis [[part_of]] cell cycle<br />
<br />
Rules can be applied in any order (e.g. in the second reduction we reduced M phase [[is_a]] cell cycle phase [[is_a]] cell cycle process).<br />
<br />
We can also infer the same link from the following asserted links:<br />
<br />
# mitosis [[part_of]] M phase of mitotic cell cycle [[part_of]] mitotic cell cycle [[is_a]] cell cycle<br />
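<br />
The reductions above can be sketched in a few lines of code. This is an illustrative sketch only, not part of OBO-Edit or any GO tool: the names <code>RULES</code> and <code>reduce_chain</code> are hypothetical, and the rule table contains just the four is_a/part_of compositions listed in this section.<br />
<br />
```python
# The four composition rules stated above for is_a and part_of.
RULES = {
    ("is_a", "is_a"): "is_a",          # transitivity of is_a
    ("part_of", "part_of"): "part_of",  # transitivity of part_of
    ("is_a", "part_of"): "part_of",     # transitivity under is_a
    ("part_of", "is_a"): "part_of",     # transitivity over is_a
}


def reduce_chain(relations):
    """Collapse a chain of relations pairwise, left to right."""
    result = relations[0]
    for rel in relations[1:]:
        result = RULES[(result, rel)]
    return result


# mitosis part_of M phase of mitotic cell cycle is_a M phase is_a
# cell cycle phase is_a cell cycle process part_of cell cycle
first_path = ["part_of", "is_a", "is_a", "is_a", "part_of"]

# mitosis part_of M phase of mitotic cell cycle part_of
# mitotic cell cycle is_a cell cycle
second_path = ["part_of", "part_of", "is_a"]

print(reduce_chain(first_path))
print(reduce_chain(second_path))
```

Both asserted paths reduce to the same inferred link, mitosis part_of cell cycle.<br />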
<br />
=== rules for regulates ===<br />
<br />
With the addition of the regulates relations in GO, the composition rules expand. <br />
<br />
First the standard interaction with ''is_a'':<br />
<br />
* [[is_a]] . R &rarr; R ''transitivity under is_a''<br />
* R . [[is_a]] &rarr; R ''transitivity over is_a''<br />
<br />
In the above R stands for any of: [[regulates]], [[negatively_regulates]], [[positively_regulates]]<br />
<br />
Note that regulates is not itself transitive, but we may wish to include a weaker transitive relation (see below)<br />
<br />
Note that positively and negatively regulates are sub-relations of regulates; i.e.<br />
<br />
* IF: X ''negatively_regulates'' Y<br />
* THEN: X ''regulates'' Y<br />
<br />
The regulates relation is [[transitive over]] part_of; i.e.<br />
<br />
* regulates . [[part_of]] &rarr; [[regulates]] ''transitivity over part_of''<br />
<br />
Slight modification for the negatively and positively regulates relations:<br />
<br />
* negatively_regulates . [[part_of]] &rarr; [[regulates]] <br />
* positively_regulates . [[part_of]] &rarr; [[regulates]] <br />
<br />
Note that this rule is not hard-coded - it is declared in the gene_ontology .obo file, in the stanza for regulates (see the [[transitive_over]] tag)<br />
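<br />
As a rough illustration, the rules of this section and the previous one can be combined into a single composition function. This is a sketch under stated assumptions: the function names are hypothetical, and any pair of relations for which no rule is given above (for example regulates . regulates, since regulates is not transitive) is reported as having no named result.<br />
<br />
```python
REGULATES = {"regulates", "negatively_regulates", "positively_regulates"}


def compose(r1, r2):
    """Compose two relations using the rules listed above; None if no rule applies."""
    if r1 == "is_a":
        return r2            # is_a . R -> R (also covers is_a and part_of)
    if r2 == "is_a":
        return r1            # R . is_a -> R
    if r1 == "part_of" and r2 == "part_of":
        return "part_of"     # transitivity of part_of
    if r1 in REGULATES and r2 == "part_of":
        return "regulates"   # signed relations weaken to plain regulates
    return None              # no rule stated, e.g. regulates . regulates


def implies_regulates(relation):
    """Positively/negatively regulates are sub-relations of regulates."""
    return relation in REGULATES


print(compose("negatively_regulates", "part_of"))
print(compose("is_a", "positively_regulates"))
print(compose("regulates", "regulates"))
```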
<br />
=== rules involving gene products ===<br />
<br />
<br />
Most of the time we talk of the relation between gene products and GO terms informally as one of "annotated_to". As we expand the relations used in GO (for example, between process and function), we need to be more precise. This will allow us to be consistent in giving recommendations for how tools and databases should handle annotations and the graph.<br />
<br />
To formalize annotations we need two further relations, to be defined in RO:<br />
<br />
* has_function_in - between a protein and a MF or BP (as specified in an annotation). Potentially also between a CC and an MF.<br />
* localized_to - between a protein and a CC.<br />
<br />
With the addition of these relations it is simpler to show the compositions in a table:<br />
<br />
* [[Media:go-relation-composition.xls|Initial version of composition table]] (Excel)<br />
* [[Media:go-relation-composition.pdf|Initial version of composition table]] (PDF)<br />
<br />
Here's how it works. If you have two links (annotations or ontology links)<br />
<br />
a ''R1'' b ''R2'' c<br />
<br />
And you want to know the relation (if any) between a and c, look up the composition ''R1''.''R2'' in the table. Row first, then column<br />
<br />
For example, if you have<br />
<br />
* a [[positively_regulates]] b [[part_of]] c<br />
<br />
Look up (R+,P) in the table - the cell value is R+ (i.e. the regulates relations are [[transitive_over]] [[part_of]])<br />
<br />
Composition is recursive, e.g, this:<br />
* a R1 b, b R2 c, c R3 d<br />
can be written as this:<br />
* a ((R1.R2).R3) d<br />
<br />
Which means you look up R1.R2 first, take the result, then plug that in as the row and look under the R3 column.<br />
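<br />
The repeated row-then-column lookup is a left fold over the path. The sketch below assumes a small hypothetical excerpt of the table, using the abbreviations I (is_a), P (part_of) and R (regulates) and only cells that follow from the rules stated earlier on this page; the real table in the linked files has more rows and columns. A missing cell is treated as "no named relation" and propagates through the rest of the path.<br />
<br />
```python
from functools import reduce

# Hypothetical excerpt of the composition table: (row, column) -> cell value.
TABLE = {
    ("I", "I"): "I", ("I", "P"): "P", ("P", "I"): "P", ("P", "P"): "P",
    ("I", "R"): "R", ("R", "I"): "R", ("R", "P"): "R",
}


def lookup(row, col):
    """Look up row . col; an undefined row stays undefined."""
    if row is None:
        return None
    return TABLE.get((row, col))


def compose_path(relations):
    """Evaluate ((R1.R2).R3)... by folding lookups from the left."""
    return reduce(lookup, relations)


print(compose_path(["R", "I", "P"]))  # a regulates b, b is_a c, c part_of d
print(compose_path(["P", "R", "I"]))  # no cell for P.R in this excerpt
```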
<br />
If you get a red X, you know something is wrong (remember we have defined regulates as holding between processes; we can generalize so that we can say a gene product is regulated, though it may be better to introduce a different but similar relation)<br />
<br />
If you get a -/? then you have a legal relation, just one we have so far declined to name. There is nothing to stop us naming for example "indirectly_regulates" (SEE BELOW: David and Tanya have provided these)<br />
<br />
It's important to name the links between gene products and what is denoted by GO terms, this allows us to give consistent coherent explanations of why we propagate certain things up the DAG by default. For example, we don't propagate annotations over [[part_of]] by intuition. It's because ''L''.''P'' &rarr; ''L'' and ''F''.''P'' &rarr; ''F''.<br />
<br />
Say we have a gene product ''p'' directly annotated to ''a''. ''a'' is in BP, so the implicit relation is has_function_in (F). The user queries for ''e'' (a MF)<br />
<br />
If the ontology has:<br />
<br />
* a [[is_a]] b [[part_of]] c [[regulates]] d [[is_a]] e<br />
<br />
(this is post BP->MF links)<br />
<br />
The full path from the gene product to the query term is:<br />
<br />
* p [[has_function_in]] a [[is_a]] b [[part_of]] c [[regulates]] d [[is_a]] e<br />
<br />
Should the tool return p? (Here 'tool' can be generalized to amigo queries, map2slim, enrichment calculation etc.)<br />
<br />
According to the table there is no name for the relation that holds between p and e. The tool should not include p in the results since there is nothing we can say about how p relates to the query. This is in accord with what we have been saying about how tools should work with the regulates relation. However, there may be circumstances where we want to allow this propagation to occur, but not in an ad-hoc fashion.<br />
<br />
If we like, we can name the composition of P.R e.g. "part_of_regulation_of", PR for short. We can also name the composition F.PR - say "functions_as_part_of_regulation_of" or FPR for short (our table starts getting a bit more complex but that's OK). The composition F.I.P.R.I is reduced to FPR.<br />
<br />
This means the tool has a concrete basis for offering the user options for how the gene product is propagated. For example, it could say "no gene products are annotated as *having the function* e. Do you want to extend your search to include products that *function as part of the regulation of* e?"<br />
<br />
Of course, tools could also just offer a checkbox list of relations to propagate over, but this doesn't take into account the fact that certain orderings have different semantics.<br />
<br />
If we name the relations then this makes it easier for people using the table of implied relations in GO: [[Transitive_closure#Calculating_the_transitive_closure:_the_new_way]]<br />
<br />
(of course we won't precompute every gene product to every term, just every meaningful term-term relation. The final composition is done without the table)<br />
<br />
David and Tanya proposed the following extension to the table:<br />
<br />
* A (F) B (R) C = A is a regulator of C<br />
* A (F) B (R+) C = A is a positive regulator of C<br />
* A (F) B (R-) C = A is a negative regulator of C<br />
* A (P) B (F) C = A contributes_to C<br />
* A (P) B (R) C = A (R) C (this assumes that the other parts of B will occur)<br />
* A (P) B (R+) C = A (R+) C (this assumes that the other parts of B will occur)<br />
* A (P) B (R-) C = A (R-) C (this assumes that the other parts of B will occur)<br />
* A (R) B (R) C = A indirectly_regulates C<br />
* A (R) B (R+) C = A indirectly_regulates C<br />
* A (R) B (R-) C = A indirectly_regulates C<br />
* A (R+) B (R) C = A indirectly_regulates C<br />
* A (R+) B (R+) C = A indirectly_positively_regulates C<br />
* A (R+) B (R-) C = A indirectly_negatively_regulates C<br />
* A (R-) B (R) C = A indirectly_regulates C<br />
* A (R-) B (R+) C = A indirectly_negatively_regulates C<br />
* A (R-) B (R-) C = A indirectly_positively_regulates C<br />
* A (L) B (F) C = A may contribute_to C<br />
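<br />
The nine regulates-over-regulates rows in the list above follow the pattern of multiplying signs. The Python sketch below is one way to encode that observation; the numeric encoding and the names <code>SIGN</code>, <code>NAME</code> and <code>indirect_relation</code> are assumptions made for illustration, not part of the proposal.<br />

```python
# Signs for the three regulates relations:
# +1 positive, -1 negative, 0 unspecified direction.
SIGN = {"R": 0, "R+": 1, "R-": -1}

# Relation names as given in the proposed extension.
NAME = {
    0: "indirectly_regulates",
    1: "indirectly_positively_regulates",
    -1: "indirectly_negatively_regulates",
}


def indirect_relation(r1, r2):
    """Name of the relation for A (r1) B (r2) C when both links are regulates relations."""
    # An unspecified link (0) makes the product 0, i.e. plain indirectly_regulates.
    return NAME[SIGN[r1] * SIGN[r2]]


for r1 in ("R", "R+", "R-"):
    for r2 in ("R", "R+", "R-"):
        print(f"A ({r1}) B ({r2}) C = A {indirect_relation(r1, r2)} C")
```

The loop reprints the nine rows, which makes it easy to compare the encoding against the list by eye.<br />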
<br />
<br />
=== Updates to relations involving gene products, April 2011 ===<br />
<br />
--[[User:Girlwithglasses|gwg]] 14:43, 6 April 2011 (PDT)<br />
<br />
This documents the relations used in the GO Moose Perl toolkit.<br />
<br />
<br />
Two relationships are used for connecting annotated entities to GO terms, depending on the GO terms. These are:<br />
<br />
gene product == part of ==> [cellular component]<br />
<br />
and<br />
<br />
gene product == capable of ==> [ molecular function | biological process ]<br />
<br />
<br />
For terms that do not belong to the GO ontologies, the generic relationship ''annotated to'' is used.<br />
<br />
<br />
These are the properties of the relations:<br />
<br />
<protect><!--box uid=bf281ae42314a4847a219b1156b9d7b6.4554.V4d9cde6be0168--><br />
<!--<br />
******************************************************************************************<br />
* <br />
* ** PLEASE DON'T EDIT THIS TABLE DIRECTLY. Use the edit table link under the table. ** <br />
* <br />
****************************************************************************************** --><br />
{| id="V4d9cde6be0168" class=" tableEdit " <br />
|-<br />
!|term 1 ontology!!GP --> term 1!!term 1 --> term 2!!Inferred GP --> term 2<br />
|- <br />
|<br />
cellular component<br />
|<br />
part of<br />
|<br />
part of<br />
|<br />
part of <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
is a<br />
|<br />
capable of <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
part of<br />
|<br />
capable of part of <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
regulates<br />
|<br />
regulator of <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
positively/negatively regulates<br />
|<br />
positive/negative regulator of <br />
|- <br />
|<br />
cellular component<br />
|<br />
part of<br />
|<br />
has part<br />
|<br />
no inference possible <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
has part<br />
|<br />
no inference possible <br />
|- <br />
|<br />
cellular component<br />
|<br />
part of<br />
|<br />
is a<br />
|<br />
part of <br />
|- <br />
|<br />
any<br />
|<br />
annotated to<br />
|<br />
is a<br />
|<br />
annotated to <br />
|- <br />
|<br />
any<br />
|<br />
annotated to<br />
|<br />
part of<br />
|<br />
annotated to <br />
|- <br />
|<br />
any<br />
|<br />
annotated to<br />
|<br />
(positively/negatively) regulates<br />
|<br />
(positive/negative) regulator of <br />
|- <br />
|<br />
any<br />
|<br />
annotated to<br />
|<br />
has part<br />
|<br />
no inference <br />
<br />
|- class="tableEdit_footer" <br />
|<span class="tableEdit_editLink plainlinks">[{{SERVER}}{{SCRIPTPATH}}?title=Special:TableEdit&id=bf281ae42314a4847a219b1156b9d7b6.4554.V4d9cde6be0168&page=4554&pagename={{FULLPAGENAMEE}}&type=0&template= edit table]</span> || || ||<br />
|}<br />
<!--box uid=bf281ae42314a4847a219b1156b9d7b6.4554.V4d9cde6be0168--></protect><br />
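<br />
As a rough illustration, the table can be read as a lookup keyed on the gene product's relation to term 1 and the relation from term 1 to term 2. The Python sketch below is not the GO Moose code (which is written in Perl); the namespace strings and the function names <code>annotation_relation</code> and <code>infer</code> are assumptions.<br />

```python
# (GP --> term 1, term 1 --> term 2) -> inferred GP --> term 2.
# None stands for "no inference possible".
INFERRED = {
    ("part of", "part of"): "part of",
    ("part of", "is a"): "part of",
    ("part of", "has part"): None,
    ("capable of", "is a"): "capable of",
    ("capable of", "part of"): "capable of part of",
    ("capable of", "regulates"): "regulator of",
    ("capable of", "positively regulates"): "positive regulator of",
    ("capable of", "negatively regulates"): "negative regulator of",
    ("capable of", "has part"): None,
    ("annotated to", "is a"): "annotated to",
    ("annotated to", "part of"): "annotated to",
    ("annotated to", "regulates"): "regulator of",
    ("annotated to", "positively regulates"): "positive regulator of",
    ("annotated to", "negatively regulates"): "negative regulator of",
    ("annotated to", "has part"): None,
}


def annotation_relation(namespace):
    """Relation linking a gene product to a directly annotated term."""
    if namespace == "cellular_component":
        return "part of"
    if namespace in ("molecular_function", "biological_process"):
        return "capable of"
    return "annotated to"  # terms outside the GO ontologies


def infer(gp_relation, term_relation):
    """Inferred gene product relation to term 2, or None if the table gives none."""
    return INFERRED.get((gp_relation, term_relation))


print(infer(annotation_relation("biological_process"), "part of"))
```

Combinations the table does not list (for example ''part of'' followed by ''regulates'') also come back as <code>None</code> in this sketch.<br />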
<br />
<br />
<br />
GP -- capable of --> molecular function<br />
GP -- has function in --> biological process<br />
GP -- localizes to --> cellular component<br />
<br />
if GP capable of MF AND MF part of BP ==> GP has function in BP<br />
<br />
integral to - essentially creates a subclass of a complex which is species-specific; saying GP is integral to complex means complex always has part GP in that species<br />
<br />
=== Has_part===<br />
<br />
See [[has_part]] page<br />
<br />
== Example of relation composition ==<br />
<br />
This example assumes that amongst our annotations we have:<br />
<br />
* MGI Bcl2 - (direct/asserted) annotation to '''positive regulation of anti-apoptosis'''<br />
* RGD Apoe - (direct/asserted) annotation to '''anti-apoptosis'''<br />
<br />
(For the sake of the example, we assume that these are the only annotations that were created for these genes. We ignore evidence codes here -- assuming they are trusted annotations)<br />
<br />
According to our formalization of what annotations mean, the annotation corresponds to has_function_in<br />
<br />
We can then apply the composition rules to get the implied links to ''apoptosis''<br />
<br />
* Apoe ''negative_regulator_of'' apoptosis<br />
* Bcl2 ''indirect_negative_regulator_of'' apoptosis<br />
<br />
This page uses oboedit to illustrate the relationships between the gene products and different kinds of process. It may seem odd to view annotations in OE, but according to our formalism the links between proteins and the processes they participate in are not a different kind of beast from the other kinds of links in GO. Still, we'll hopefully have this in AmiGO too shortly.<br />
<br />
You can get the subset of GO (plus annotations in obo format) used to make these screenshots here:<br />
<br />
* [http://www.geneontology.org/scratch/transitive_closure/GO_0045768.obo GO_0045768.obo]<br />
<br />
The full transitive closure is here:<br />
<br />
* [http://www.geneontology.org/scratch/transitive_closure/GO_0045768.linkfile GO_0045768.linkfile]<br />
<br />
[[Image:Bcl2-graph.jpg]]<br />
<br />
[[Image:Bcl2-OEP.jpg]]<br />
<br />
It should also be possible to do queries using the OE2 link search box too - e.g. ask for genes that bear some relation to apoptosis and get back "Bcl2 negative_regulator_of GO:apoptosis". However, the link search doesn't appear to be working properly in conjunction with the reasoner - Amina is working on this.<br />
<br />
== OBO Format ==<br />
<br />
* The '''is_transitive''' tag is the same as an R <- R.R composition<br />
* The '''transitive_over''' tag is the same as an R <- R.R2 composition<br />
* The '''holds_over_chain''' tag allows for arbitrary compositions R <- R1.R2<br />
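<br />
To make the correspondence concrete, here is a small Python sketch that turns the three tags of a single [Typedef] stanza into (R1, R2, result) triples. It is a simplified reader written for this page, not a full OBO parser, and the function name <code>rules_from_typedef</code> is made up for the example.<br />

```python
def rules_from_typedef(stanza):
    """Return (R1, R2, result) composition rules declared by one [Typedef] stanza."""
    rel_id = None
    pending = []
    for raw in stanza.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing "! comment" text
        if ":" not in line:
            continue  # skips the [Typedef] header and blank lines
        tag, value = (part.strip() for part in line.split(":", 1))
        if tag == "id":
            rel_id = value
        elif tag in ("is_transitive", "transitive_over", "holds_over_chain"):
            pending.append((tag, value))

    rules = []
    for tag, value in pending:
        if tag == "is_transitive" and value == "true":
            rules.append((rel_id, rel_id, rel_id))   # R <- R.R
        elif tag == "transitive_over":
            rules.append((rel_id, value, rel_id))    # R <- R.R2
        elif tag == "holds_over_chain":
            first, second = value.split()
            rules.append((first, second, rel_id))    # R <- R1.R2
    return rules


example = """[Typedef]
id: regulates
name: regulates
transitive_over: part_of ! part of
"""
print(rules_from_typedef(example))
```

The triples can then be loaded into whatever composition table a tool uses, so the rules stay declared in the ontology file rather than hard-coded.<br />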
<br />
[[Category:Relations]]<br />
[[Category:Annotation]]</div>Girlwithglasseshttps://wiki.geneontology.org/index.php?title=Relation_composition&diff=36278Relation composition2011-07-01T21:36:40Z<p>Girlwithglasses: /* Updates to relations involving gene products, April 2011 */</p>
<hr />
<div>This page describes the relation composition rules for relations used in GO. See the OBO Edit Reasoner paper on Google Docs for background. See also [[Transitive_closure]].<br />
<br />
See also the [http://obofoundry.org/ro Relation Ontology] and accompanying paper.<br />
<br />
== Simple composition rules ==<br />
<br />
=== rules for is_a and part_of ===<br />
<br />
TODO: fill in examples<br />
<br />
Basic transitivity compositions:<br />
<br />
* [[is_a]] . [[is_a]] &rarr; [[is_a]] ''transitivity of is_a''<br />
* [[part_of]] . [[part_of]] &rarr; [[part_of]] ''transitivity of part_of''<br />
<br />
For example:<br />
<br />
mitosis [[is_a]] cell cycle phase [[is_a]] cell cycle process, <br />
''THEREFORE'' mitosis [[is_a]] cell cycle process<br />
<br />
The following rules arise from the definitions given in the [http://obofoundry.org/ro OBO Relation Ontology]<br />
<br />
* [[is_a]] . [[part_of]] &rarr; [[part_of]] ''transitivity under is_a''<br />
* [[part_of]] . [[is_a]] &rarr; [[part_of]] ''transitivity over is_a''<br />
<br />
For example, starting with:<br />
<br />
mitosis [[part_of]] M phase of mitotic cell cycle [[is_a]] M phase [[is_a]] cell cycle phase [[is_a]] cell cycle process [[part_of]] cell cycle<br />
<br />
We can iteratively reduce this by repeated application of composition rules:<br />
<br />
# mitosis [[part_of]] M phase [[is_a]] cell cycle phase [[is_a]] cell cycle process [[part_of]] cell cycle<br />
# mitosis [[part_of]] M phase [[is_a]] cell cycle process [[part_of]] cell cycle<br />
# mitosis [[part_of]] cell cycle process [[part_of]] cell cycle<br />
# mitosis [[part_of]] cell cycle<br />
<br />
Rules can be applied in any order (e.g. in the second reduction we reduced M phase [[is_a]] cell cycle phase [[is_a]] cell cycle process).<br />
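<br />
The reduction can be sketched in Python using only the four is_a/part_of rules given above. The function names and the <code>pick</code> parameter are invented for this illustration; <code>pick</code> just chooses which adjacent pair of relations to merge next, to show that the order does not change the outcome for this chain.<br />

```python
# The four composition rules for is_a and part_of listed above.
RULES = {
    ("is_a", "is_a"): "is_a",
    ("part_of", "part_of"): "part_of",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
}


def reduce_at(chain, i):
    """Replace relations i and i+1 of the chain by their composition."""
    composed = RULES[(chain[i], chain[i + 1])]
    return chain[:i] + [composed] + chain[i + 2:]


def reduce_chain(chain, pick=lambda n: 0):
    """Reduce a chain to one relation; pick(n) selects the pair index (0..n-2) to merge."""
    chain = list(chain)
    while len(chain) > 1:
        chain = reduce_at(chain, pick(len(chain)))
    return chain[0]


# Relations along the path from mitosis to cell cycle in the example above.
path = ["part_of", "is_a", "is_a", "is_a", "part_of"]
print(reduce_chain(path))                        # always merge the leftmost pair
print(reduce_chain(path, pick=lambda n: n - 2))  # always merge the rightmost pair
```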
<br />
We can also infer the same link from the following asserted links:<br />
<br />
# mitosis [[part_of]] M phase of mitotic cell cycle [[part_of]] mitotic cell cycle [[is_a]] cell cycle<br />
<br />
=== rules for regulates ===<br />
<br />
With the addition of the regulates relations in GO, the composition rules expand. <br />
<br />
First the standard interaction with ''is_a'':<br />
<br />
* [[is_a]] . R &rarr; R ''transitivity under is_a''<br />
* R . [[is_a]] &rarr; R ''transitivity over is_a''<br />
<br />
In the above R stands for any of: [[regulates]], [[negatively_regulates]], [[positively_regulates]]<br />
<br />
Note that regulates is not itself transitive, but we may wish to include a weaker transitive relation (see below)<br />
<br />
Note that positively and negatively regulates are sub-relations of regulates; i.e.<br />
<br />
* IF: X ''negatively_regulates'' Y<br />
* THEN: X ''regulates'' Y<br />
<br />
The regulates relation is [[transitive over]] part_of; i.e.<br />
<br />
* regulates . [[part_of]] &rarr; [[regulates]] ''transitivity over part_of''<br />
<br />
Slight modification for the negatively and positively regulates relations:<br />
<br />
* negatively_regulates . [[part_of]] &rarr; [[regulates]] <br />
* positively_regulates . [[part_of]] &rarr; [[regulates]] <br />
<br />
Note that this rule is not hard-coded - it is declared in the gene_ontology .obo file, in the stanza for regulates (see the [[transitive_over]] tag)<br />
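<br />
The regulates rules in this section can be written down as a small Python sketch. It covers only the rules stated above (interaction with is_a, composition over part_of, and the sub-relation hierarchy); the names <code>compose</code> and <code>implied</code> are assumptions, and a real tool would read the rules from the .obo file rather than hard-code them as done here.<br />

```python
REGULATES = {"regulates", "negatively_regulates", "positively_regulates"}

# Sub-relation hierarchy: the signed relations imply plain regulates.
SUBRELATION_OF = {
    "negatively_regulates": "regulates",
    "positively_regulates": "regulates",
}


def compose(r1, r2):
    """Compose two links using only the rules stated in this section."""
    if r1 == "is_a" and r2 in REGULATES:
        return r2            # is_a . R -> R
    if r1 in REGULATES and r2 == "is_a":
        return r1            # R . is_a -> R
    if r1 in REGULATES and r2 == "part_of":
        return "regulates"   # signed or unsigned: the result is plain regulates
    return None              # e.g. regulates . regulates (not transitive)


def implied(relation):
    """The relation together with every super-relation it implies."""
    result = {relation}
    while relation in SUBRELATION_OF:
        relation = SUBRELATION_OF[relation]
        result.add(relation)
    return result


print(compose("negatively_regulates", "part_of"))
print(sorted(implied("negatively_regulates")))
```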
<br />
=== rules involving gene products ===<br />
<br />
<br />
Most of the time we talk of the relation between gene products and GO terms informally as one of "annotated_to". As we expand the relations used in GO (for example, between process and function), we need to be more precise. This will allow us to be consistent in giving recommendations for how tools and databases should handle annotations and the graph.<br />
<br />
To formalize annotations we need two further relations, to be defined in RO:<br />
<br />
* has_function_in - between a protein and a MF or BP (as specified in an annotation). Potentially also between a CC and an MF.<br />
* localized_to - between a protein and a CC.<br />
<br />
With the addition of these relations it is simpler to show the compositions in a table:<br />
<br />
* [[Media:go-relation-composition.xls|Initial version of composition table]] (Excel)<br />
* [[Media:go-relation-composition.pdf|Initial version of composition table]] (PDF)<br />
<br />
Here's how it works. If you have two links (annotations or ontology links)<br />
<br />
a ''R1'' b ''R2'' c<br />
<br />
And you want to know the relation (if any) between a and c, look up the composition ''R1''.''R2'' in the table: row first, then column.<br />
<br />
For example, if you have<br />
<br />
* a [[positively_regulates]] b [[part_of]] c<br />
<br />
Look up (R+,P) in the table: the cell value is R+ (i.e. the regulates relations are [[transitive_over]] [[part_of]]).<br />
<br />
Composition is recursive; e.g., this:<br />
* a R1 b, b R2 c, c R3 d<br />
can be written as this:<br />
* a ((R1.R2).R3) d<br />
<br />
Which means you look up R1.R2 first, take the result, then plug that in as the row and look under the R3 column.<br />
<br />
If you get a red X, you know something is wrong. (Remember that we have defined regulates as holding between processes; we can generalize so that we can say a gene product is regulated, though it may be better to introduce a different but similar relation.)<br />
<br />
If you get a -/? then you have a legal relation, just one we have so far declined to name. There is nothing to stop us naming, for example, "indirectly_regulates" (SEE BELOW: David and Tanya have provided these).<br />
<br />
It's important to name the links between gene products and what is denoted by GO terms; this allows us to give consistent, coherent explanations of why we propagate certain things up the DAG by default. For example, we don't propagate annotations over [[part_of]] by intuition. It's because ''L''.''P'' &rarr; ''L'' and ''F''.''P'' &rarr; ''F''.<br />
<br />
Say we have a gene product ''p'' directly annotated to ''a''. ''a'' is in BP, so the implicit relation is has_function_in (F). The user queries for ''e'' (a MF)<br />
<br />
If the ontology has:<br />
<br />
* a [[is_a]] b [[part_of]] c [[regulates]] d [[is_a]] e<br />
<br />
(this is post BP->MF links)<br />
<br />
The full path from the gene product to the query term is:<br />
<br />
* p [[has_function_in]] a [[is_a]] b [[part_of]] c [[regulates]] d [[is_a]] e<br />
<br />
Should the tool return p? (Here 'tool' can be generalized to amigo queries, map2slim, enrichment calculation etc.)<br />
<br />
According to the table there is no name for the relation that holds between p and e. The tool should not include p in the results since there is nothing we can say about how p relates to the query. This is in accord with what we have been saying about how tools should work with the regulates relation. However, there may be circumstances where we want to allow this propagation to occur, but not in an ad-hoc fashion.<br />
<br />
If we like, we can name the composition of P.R e.g. "part_of_regulation_of", PR for short. We can also name the composition F.PR - say "functions_as_part_of_regulation_of" or FPR for short (our table starts getting a bit more complex but that's OK). The composition F.I.P.R.I is reduced to FPR.<br />
<br />
This means the tool has a concrete basis for offering the user options for how the gene product is propagated. For example, it could say "no gene products are annotated as *having the function* e. Do you want to extend your search to include products that *function as part of the regulation of* e?"<br />
<br />
Of course, tools could also just offer a checkbox list of relations to propagate over, but this doesn't take into account the fact that certain orderings have different semantics.<br />
<br />
If we name the relations then this makes it easier for people using the table of implied relations in GO: [[Transitive_closure#Calculating_the_transitive_closure:_the_new_way]]<br />
<br />
(of course we won't precompute every gene product to every term, just every meaningful term-term relation. The final composition is done without the table)<br />
<br />
David and Tanya proposed the following extension to the table:<br />
<br />
* A (F) B (R) C = A is a regulator of C<br />
* A (F) B (R+) C = A is a positive regulator of C<br />
* A (F) B (R-) C = A is a negative regulator of C<br />
* A (P) B (F) C = A contributes_to C<br />
* A (P) B (R) C = A (R) C (this assumes that the other parts of B will occur)<br />
* A (P) B (R+) C = A (R+) C (this assumes that the other parts of B will occur)<br />
* A (P) B (R-) C = A (R-) C (this assumes that the other parts of B will occur)<br />
* A (R) B (R) C = A indirectly_regulates C<br />
* A (R) B (R+) C = A indirectly_regulates C<br />
* A (R) B (R-) C = A indirectly_regulates C<br />
* A (R+) B (R) C = A indirectly_regulates C<br />
* A (R+) B (R+) C = A indirectly_positively_regulates C<br />
* A (R+) B (R-) C = A indirectly_negatively_regulates C<br />
* A (R-) B (R) C = A indirectly_regulates C<br />
* A (R-) B (R+) C = A indirectly_negatively_regulates C<br />
* A (R-) B (R-) C = A indirectly_positively_regulates C<br />
* A (L) B (F) C = A may contribute_to C<br />
<br />
<br />
=== Updates to relations involving gene products, April 2011 ===<br />
<br />
--[[User:Girlwithglasses|gwg]] 14:43, 6 April 2011 (PDT)<br />
<br />
This documents the relations used in the GO Moose Perl toolkit.<br />
<br />
<br />
Two relationships are used for connecting annotated entities to GO terms, depending on the GO terms. These are:<br />
<br />
gene product == part of ==> [cellular component]<br />
<br />
and<br />
<br />
gene product == capable of ==> [ molecular function | biological process ]<br />
<br />
<br />
For terms that do not belong to the GO ontologies, the generic relationship ''annotated to'' is used.<br />
<br />
<br />
These are the properties of the relations:<br />
<br />
<protect><!--box uid=bf281ae42314a4847a219b1156b9d7b6.4554.V4d9cde6be0168--><br />
<!--<br />
******************************************************************************************<br />
* <br />
* ** PLEASE DON'T EDIT THIS TABLE DIRECTLY. Use the edit table link under the table. ** <br />
* <br />
****************************************************************************************** --><br />
{| id="V4d9cde6be0168" class=" tableEdit " <br />
|-<br />
!|term 1 ontology!!GP --> term 1!!term 1 --> term 2!!Inferred GP --> term 2<br />
|- <br />
|<br />
cellular component<br />
|<br />
part of<br />
|<br />
part of<br />
|<br />
part of <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
is a<br />
|<br />
capable of <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
part of<br />
|<br />
capable of part of <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
regulates<br />
|<br />
regulator of <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
positively/negatively regulates<br />
|<br />
positive/negative regulator of <br />
|- <br />
|<br />
cellular component<br />
|<br />
part of<br />
|<br />
has part<br />
|<br />
no inference possible <br />
|- <br />
|<br />
function or process<br />
|<br />
capable of<br />
|<br />
has part<br />
|<br />
no inference possible <br />
|- <br />
|<br />
cellular component<br />
|<br />
part of<br />
|<br />
is a<br />
|<br />
part of <br />
|- <br />
|<br />
any<br />
|<br />
annotated to<br />
|<br />
is a<br />
|<br />
annotated to <br />
|- <br />
|<br />
any<br />
|<br />
annotated to<br />
|<br />
part of<br />
|<br />
annotated to <br />
|- <br />
|<br />
any<br />
|<br />
annotated to<br />
|<br />
(positively/negatively) regulates<br />
|<br />
(positive/negative) regulator of <br />
|- <br />
|<br />
any<br />
|<br />
annotated to<br />
|<br />
has part<br />
|<br />
no inference <br />
<br />
|- class="tableEdit_footer" <br />
|<span class="tableEdit_editLink plainlinks">[{{SERVER}}{{SCRIPTPATH}}?title=Special:TableEdit&id=bf281ae42314a4847a219b1156b9d7b6.4554.V4d9cde6be0168&page=4554&pagename={{FULLPAGENAMEE}}&type=0&template= edit table]</span> || || ||<br />
|}<br />
<!--box uid=bf281ae42314a4847a219b1156b9d7b6.4554.V4d9cde6be0168--></protect><br />
<br />
<br />
<br />
GP -- capable of --> molecular function<br />
GP -- has function in --> biological process<br />
GP -- localizes to --> cellular component<br />
<br />
if GP capable of MF AND MF part of BP ==> GP has function in BP<br />
<br />
=== Has_part===<br />
<br />
See [[has_part]] page<br />
<br />
== Example of relation composition ==<br />
<br />
This example assumes that amongst our annotations we have:<br />
<br />
* MGI Bcl2 - (direct/asserted) annotation to '''positive regulation of anti-apoptosis'''<br />
* RGD Apoe - (direct/asserted) annotation to '''anti-apoptosis'''<br />
<br />
(For the sake of the example, we assume that these are the only annotations that were created for these genes. We ignore evidence codes here -- assuming they are trusted annotations)<br />
<br />
According to our formalization of what annotations mean, the annotation corresponds to has_function_in<br />
<br />
We can then apply the composition rules to get the implied links to ''apoptosis''<br />
<br />
* Apoe ''negative_regulator_of'' apoptosis<br />
* Bcl2 ''indirect_negative_regulator_of'' apoptosis<br />
<br />
This page uses oboedit to illustrate the relationships between the gene products and different kinds of process. It may seem odd to view annotations in OE, but according to our formalism the links between proteins and the processes they participate in are not a different kind of beast from the other kinds of links in GO. Still, we'll hopefully have this in AmiGO too shortly.<br />
<br />
You can get the subset of GO (plus annotations in obo format) used to make these screenshots here:<br />
<br />
* [http://www.geneontology.org/scratch/transitive_closure/GO_0045768.obo GO_0045768.obo]<br />
<br />
The full transitive closure is here:<br />
<br />
* [http://www.geneontology.org/scratch/transitive_closure/GO_0045768.linkfile GO_0045768.linkfile]<br />
<br />
[[Image:Bcl2-graph.jpg]]<br />
<br />
[[Image:Bcl2-OEP.jpg]]<br />
<br />
It should also be possible to do queries using the OE2 link search box too - e.g. ask for genes that bear some relation to apoptosis and get back "Bcl2 negative_regulator_of GO:apoptosis". However, the link search doesn't appear to be working properly in conjunction with the reasoner - Amina is working on this.<br />
<br />
== OBO Format ==<br />
<br />
* The '''is_transitive''' tag is the same as an R <- R.R composition<br />
* The '''transitive_over''' tag is the same as an R <- R.R2 composition<br />
* The '''holds_over_chain''' tag allows for arbitrary compositions R <- R1.R2<br />
<br />
[[Category:Relations]]<br />
[[Category:Annotation]]</div>Girlwithglasses