28 SEPT 2010 RefGen Priorities Discussion (Archived)
Pascale, Kara, Mike L, Paul T, Rama
- Main conclusion is that it's possible to annotate families in about 30-120 minutes.
- Action item: we'll annotate xx families between now and December.
- Prioritization: For this exercise, Mike had selected enzymes that had relatively small family size and enough MF annotations. For the next phase, we decided it was better to annotate families at random so as not to be biased when we do the stats on the curation.
- Bug in PAINT: many human annotations are not loaded
Some PAINT features that would be helpful:
- Issue with the .save.gaf and .gaf files: we need the .save.gaf to work in PAINT, but that file contains IRD annotations. Right now the tree curator needs to do SAVE and EXPORT to get the .gaf file. Mike had not done this, and it created confusion among the MODs, who thought the IRDs were real NOT annotations. Two possibilities to deal with the issue:
- A. SAVE function would create both .gaf and .save.gaf files
- B. Suzi's script would also strip the IRDs on the .save.gaf file
- Very long branches should not get annotations (i.e., we wouldn't have to do the NOTs manually)
- Right now it's not possible to block propagation (IRD) to a sequence that already has a positive annotation. This is too stringent: we should get a warning, but still be able to block the propagation.
- Option to show only cellular processes (we first need to decide which branches to exclude)
- Load taxon triggers into PAINT
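Option B above (stripping the IRDs from the .save.gaf file) could look roughly like the sketch below. This is not Suzi's script, only a minimal illustration. It assumes the standard GAF layout, in which lines starting with "!" are header comments and the evidence code is in column 7 of each tab-separated row. The function name `strip_ird` and the sample rows are hypothetical.

```python
import io

EVIDENCE_COL = 6  # GAF column 7 (0-based index 6) holds the evidence code


def strip_ird(lines):
    """Yield GAF lines, dropping annotation rows whose evidence code is IRD."""
    for line in lines:
        if line.startswith("!"):
            # Header/comment lines are passed through untouched.
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > EVIDENCE_COL and cols[EVIDENCE_COL] == "IRD":
            continue  # skip the IRD row
        yield line


# Hypothetical rows (made-up identifiers, truncated to the first 9 columns).
sample = (
    "!gaf-version: 1.0\n"
    "DB\tID1\tSYM1\t\tGO:0004647\tPAINT_REF:1\tIBA\t\tF\n"
    "DB\tID2\tSYM2\tNOT\tGO:0004647\tPAINT_REF:1\tIRD\t\tF\n"
)
cleaned = list(strip_ird(io.StringIO(sample)))
```

In this toy input, `cleaned` keeps the header line and the first annotation row and drops the IRD row.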
Mike annotated :
Mike's annotations are in the CVS repository: http://cvsweb.geneontology.org/cgi-bin/cvsweb.cgi/go/gene-associations/submission/paint/#dirlist
- 65 sequences; 4 with EXP data
- Propagated 6 annotations:
- MF: GO:0000287: magnesium ion binding
- MF: GO:0004647: phosphoserine phosphatase activity
- MF: GO:0016791: phosphatase activity
- CC: GO:0009507: chloroplast
- CC: GO:0005737: cytoplasm
- BP: GO:0006564: L-serine biosynthetic process
- Comparing with InterPro:
- Yeast serB matches IPR004469
- BP: GO:0006564 L-serine biosynthetic process
- MF: GO:0004647 phosphoserine phosphatase activity
InterPro: for the mouse protein Q9QZD4
- IPR006166 : ERCC4 domain : http://www.ebi.ac.uk/interpro/IEntry?ac=IPR006166
- Process GO:0006259 DNA metabolic process
- Function GO:0003677 DNA binding
- GO:0004518 nuclease activity
- GO:0005515 protein binding
- IPR006167: DNA repair protein: http://www.ebi.ac.uk/interpro/IEntry?ac=IPR006167
- Process GO:0006281 DNA repair
- Function GO:0003676 nucleic acid binding
- GO:0004519 endonuclease activity
Mike Description of phylogeny
This family consists of one major clade spanning plants to humans. There are very few duplications.
MF
- S. cerevisiae and Arabidopsis each have annotations to GO:0000014 "single-stranded DNA specific endodeoxyribonuclease activity." The yeast annotation is an IDA from a paper (PMID 8253764) that shows that the RAD1/RAD10 complex has the activity, and it has a CONTRIBUTES_TO qualifier. The Arabidopsis annotation is an IMP from PMID 11027708, showing that uvh1 mutants do not have the activity. From these 2 annotations, it is not possible to determine whether the active site lies in RAD10/uvh1 itself. However, since the plant annotation is not based on a biochemical experiment, it may perhaps be better as a BP annotation than an MF annotation. Therefore, propagate GO:0000014 to AN0 with the CONTRIBUTES_TO qualifier.
- GO:0000014 should has_part GO:0003697 "single-stranded DNA binding," but it doesn't, so propagate GO:0003697 to AN0.
Time to annotate MF: 15 minutes
CC
- Propagate GO:0000110 "nucleotide-excision repair factor 1 complex" to AN0. This covers nucleus, too.
- Do not propagate "spindle pole body" from S. pombe because it's a lone HTP finding.
Time to annotate CC: 3 minutes.
BP
- Propagate the following to AN0:
-- GO:0006296 nucleotide-excision repair, DNA incision, 5'-to lesion
-- GO:0000724 double-strand break repair via homologous recombination
-- GO:0000710 meiotic mismatch repair
-- GO:0000712 resolution of meiotic recombination intermediates
- Propagate GO:0000736 "double-strand break repair via single-strand annealing, removal of nonhomologous ends" to the fungal clade only, as it is a child of "gene conversion at mating-type locus" and all children of "mitotic recombination" here are found in fungi.
Time to annotate BP: 20 minutes.
- MOLECULAR FUNCTION
- 20 min, 1 annotation propagated.
- Only 3 direct EXP:
-- GO:0046982 protein heterodimerization activity
-- GO:0000014 single-stranded DNA specific endodeoxyribonuclease activity
-- GO:0003697 single-stranded DNA binding
1. Propagated to the whole tree: GO:0004520: endodeoxyribonuclease activity. Could perhaps have propagated to the more specific term "GO:0000014: single-stranded DNA specific endodeoxyribonuclease activity", but need more data (one mammalian species would be great).
2. NOT to long branches:
- ORNAN ENSOANG00000025598
- XP_001199502, XP_001186979 (Strongylocentrotus purpuratus - however, reported as partial sequences in GenBank)
- Rat XP_573032 (record removed in GenBank)
- Rat XP_001077837 - looks like it needs to be merged with another sequence (5' only)
Questions for MODs: SGD:
1. Is the gene comprehensively annotated? There seem to be few annotations compared with the volume of literature. Date last reviewed: 2007-10-01.
2. Cerevisiae has more processes, possibly based on more biochemical data. Are those all independent?
-- GO:0000735 removal of nonhomologous ends
-- GO:0006296 nucleotide-excision repair, DNA incision, 5'-to lesion
-- GO:0000736 double-strand break repair via single-strand annealing, removal of nonhomologous ends
- CELLULAR COMPONENT
- Annotation: 15 min
- Again, human annotations don't show up. Missing "GO:0000109 nucleotide-excision repair complex" in PAINT. GOA should have annotated to GO:0000110 , not GO:0000109.
1. Annotated "GO:0000110 : nucleotide-excision repair factor 1 complex" to AN1. (left plants out). We need to know whether the other spp have RAD10 as well, but since yeast and human have them, I am assuming it's conserved.
2. NOT to long branches:
- ORNAN ENSOANG00000025598
- XP_001199502, XP_001186979 (Strongylocentrotus purpuratus - however, reported as partial sequences in GenBank)
- Rat XP_573032 (record removed in GenBank)
- Rat XP_001077837 - looks like it needs to be merged with another sequence (5' only)
-GOA: ERCC4 should be annotated to GO:0000110 , not GO:0000109.
Did not annotate:
- Spindle body: S. pombe HTP. This is the only outlier.
- BIOLOGICAL PROCESS
- Annotation: 30 minutes
- GO:0006289 nucleotide-excision repair: TAIR, pombe, cerevisiae, fly, human (annotations not showing up)
- Annotations are related (recombination, meiosis, recombinational repair, etc.) but all over the GO, probably dependent on experiments or annotators.
1. GO:0006289 nucleotide-excision repair: Propagate to all (except long branches, see MF annotations)
2. Annotate all to GO:0006310: DNA recombination, based on:
- Fly: GO:0007131 reciprocal meiotic recombination
- cerevisiae: GO:0006312 mitotic recombination
- S. pombe: GO:0007534 : gene conversion at mating-type locus (child of GO:0006312 : mitotic recombination )
- A. thaliana: GO:0000724 double-strand break repair via homologous recombination
Outliers:
- Do not propagate GO:0009792 embryo development ending in birth or egg hatching (worm RNAi)
- Do not propagate GO:0016321 female meiosis chromosome segregation (Fly IMP)
- IPR001424 : sodA dicty
- Process GO:0006801 superoxide metabolic process
- GO:0055114 oxidation reduction
- Function GO:0046872 metal ion binding
- MOLECULAR FUNCTION
- EXP annotations: 15 different terms (direct + indirect); 3 direct annotations: GO:0016532 (SOD copper chaperone), GO:0004784 (SOD activity), and GO:0005507 (copper binding)
- Annotated MF in about 120 minutes.
- It took about 30-45 min to review some general literature to get familiar with the family. The fact that two subclades have very different roles made MF a bit harder.
- Needed about 15-30 minutes to look at long branches and missing residues.
- Annotation took about 60 min.
- The CCS subfamily (AN3 and descendants) is strictly a copper chaperone, with no SOD activity
MF: GO:0016532 SOD copper chaperone: Dm, Sc, Sp
- Copper binding consensus sequence MHCXXC (http://www.jbc.org/content/272/38/23469.long) present in Sc CCS1
- There is also a CXC conserved site at the C terminus (PMC2602909)
- This is hard to visualize on the MSA, but it corresponds to position 125 or so: MHCENCV in Sc. Human has CQCSV.
- The M and the two C are conserved in most members of this clade.
- The N-terminal copper binding site is not essential; see data for the fly gene (PMC2602909): "Drosophila CCS lacks the MXCXXC copper-binding motif that is well conserved in CCS molecules from phylogenetically distant taxa. In fact, an inspection of CCS molecules across diverse species reveals that with the exception of Drosophila and mosquito CCS, all CCS molecules identified to date harbor these cysteines (Table 1). Interestingly, we observed that Drosophila CCS is very poor at activating yeast SOD1 compared with the homologous yeast CCS.", and also: "To address whether the yeast SOD1 preference for yeast CCS reflected loss of the conserved MXCXXC cysteines, we tested the effects of a C17S,C20S substitution in yeast CCS. As seen in Fig. 6D, this mutant retains the ability to fully activate yeast SOD1."
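Because the motifs are hard to spot by eye on the MSA, a regular-expression scan over the ungapped sequences may help. The sketch below is only an illustration. It treats X as "any residue", uses the MXCXXC form quoted from PMC2602909, and runs on the two short fragments given in these notes rather than on full sequences. The helper name `find_motif` is hypothetical.

```python
import re

# N-terminal copper-binding motif as quoted in the notes (X = any residue).
MXCXXC = re.compile(r"M.C..C")
# C-terminal CXC site mentioned in the notes.
CXC = re.compile(r"C.C")


def find_motif(pattern, seq):
    """Return 0-based start offsets of non-overlapping matches of pattern in seq."""
    return [m.start() for m in pattern.finditer(seq)]


# Fragments copied from the notes; full-length sequences would be used in practice.
fragments = {"Sc": "MHCENCV", "Hs": "CQCSV"}

hits = {
    name: {"MXCXXC": find_motif(MXCXXC, seq), "CXC": find_motif(CXC, seq)}
    for name, seq in fragments.items()
}
```

On these fragments, the Sc string matches MXCXXC at offset 0 and the human string matches CXC at offset 0. Offsets are relative to the fragment, not to positions in the alignment.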
- INCORRECT ANNOTATIONS: The TAIR annotations are inconsistent with the sequence data, which indicate that the protein is a copper chaperone. Moreover, the data in PMID: 15848163, Fig 2, show that At CCS1 complements Sc CCS1, i.e., it's a chaperone, not a SOD.
MF ANNOTATIONS:
1. GO:0016532 (SOD chaperone) to AN3
2. Bacteria: AN163: not enough information to propagate to other species. The structure of the E. coli enzyme has been solved (PMID: 9405149), but I cannot determine which are the essential residues.
3. Eukaryotic SODs: AN46: propagate GO:0004784 (SOD activity) to all, as this function has been experimentally determined in many species spread across the tree.
- Some sequences look partial, for example AGAP007497, Q5XNS3 (A. gambiae)
- NOT'ed 4 sequences: chicken XP_001232830, bovin Q9TS96, S. purpuratus XP_001177068, A. gambiae (Rapid divergence)
MF: Questions for MODs:
- General: Annotation consistency issue
- RGD annotated to chaperone binding: this protein needs a copper ion, which is provided by a chaperone. Is this a valid annotation? If so, other groups could probably make it.
- TAIR: please review annotation from PMID: 15848163: correct from GO:0004784 (SOD activity) to GO:0016532 (SOD chaperone)
- CELLULAR COMPONENT
- This is a relatively large family, presumably with different clades having different tissue/organelle distribution. Supporting data for annotation transfer is weak.
CC ANNOTATIONS: No annotations possible. Failed to annotate CC in about 30 minutes.
The CCS chaperone seems to be present in several organelles in yeast - not sure this is solid enough to propagate to all. There are also several members with annotations to extracellular space.
CC: Questions for MODs:
- GOA (Human): please provide CC annotations. Previous studies have revealed that SOD1 is a soluble protein localized in the cytoplasm and nuclei of cells (refs 30, 31 in PMID: 9726962).
- GOA (Human): please provide CC annotations for CCS: PMID: 9726962? (There is some immunostaining, but the protein looks rather overexpressed.)
- BIOLOGICAL PROCESS
Time required: about 90 minutes. The large number of outlier/downstream process annotations makes it difficult.
- MGI annotates CCS (the chaperone) to "GO:0051353 positive regulation of oxidoreductase activity" - is this correct? If so, other groups should also annotate.
- This clade otherwise has no BP.
Annotation not propagated:
- Worm Sod-1 and sod-1, Fly SOD and CCS are annotated to 'determination of adult lifespan'. Not comfortable propagating such a process (evidence = IMP).
- Worm Sod-1, Fly Sod and CCS, Sc SOD1, MGI SOD1 are annotated to GO:0007568 'aging'.
- There are also annotations to "GO:0001320 age-dependent response to reactive oxygen species involved in chronological cell aging" (in the same GO branch). I think this is real, since SOD protects from oxidative damage, but I am not sure this can be propagated. Should this also be correlated with a reduced expression in aging cells/organisms?
- SGD SOD1 and Pombe CCS1 are annotated to "GO:0006878 cellular copper ion homeostasis"; do not propagate.
- MGI SOD1 is annotated to several outlier downstream processes such as "GO:0006309 DNA fragmentation involved in apoptotic nuclear change", "GO:0007566 embryo implantation" and "GO:0060047 heart contraction". Do not propagate.
- Rat SOD1 is annotated to several outlier downstream processes such as "GO:0006916 anti-apoptosis" and "GO:0001975 response to amphetamine"; do not propagate.
- ZFish has outlier annotation to "GO:0009410 response to xenobiotic stimulus"; do not propagate.
- Worm SOD1 annotated to outlier developmental processes; do not propagate.
- SGD is annotated to "GO:0031505 fungal-type cell wall organization"; do not propagate.
- Dicty sodC annotated to cytokinesis; do not propagate.
- Do not propagate anything within bacteria since I did not propagate the function.
BP ANNOTATIONS:
1. Response to oxidative stress: propagated to AN46 (eukaryotic SOD)
2. Removal of superoxide radicals: propagated to AN46 (eukaryotic SOD)
- Propagate GO:0016532 "superoxide dismutase copper chaperone activity" to AN3.
- There is an annotation to SOD activity on Arabidopsis CCS, but no other SOD annotations in the CCS clade. There are widespread SOD annotations throughout the rest of the family. Propagate GO:0004784 "superoxide dismutase activity" to AN0.
- E. coli SOD (http://biocyc.org/ECOLI/NEW-IMAGE?type=NIL&object=G6886-MONOMER) has 4 active site histidines that have been identified, at positions 67, 69, 92, and 147, corresponding approximately to positions 320 (for 67 and 69), 348, and 470 in this alignment. The first three of these histidines are conserved throughout this family, except for the fungal and plant CCS proteins, which include the Arabidopsis CCS with SOD activity. The fourth histidine is absent from most of the CCS clade, including plant. So, either there are 3 incorrect annotations (including an IDA) from 2 different papers showing SOD activity on plant CCS, or the plant CCS has acquired SOD activity through a different mechanism. Let's go with the latter explanation for now. Block propagation of SOD activity to the CCS clade by placing an IRD at AN3; plant CCS will still have the positive annotations curated to it. Correction: block propagation at AN4, since PAINT will not allow the IRD at AN3.
- Both GO:0016532 and GO:0004784 should has_part GO:0005507 "copper ion binding," but propagate GO:0005507 to AN0 until this is implemented.
- Also propagate GO:0008270 "zinc ion binding" to AN0. In the absence of contradictory information, allow this to propagate to the CCS clade, but be prepared to change this decision.
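The histidine check described above amounts to two steps: map a residue number in the reference sequence to an alignment column, then see which sequences carry the same residue in that column. The sketch below shows one way to do this. It is not how PAINT does it, and the toy alignment, sequence names, and function names are all hypothetical; the real E. coli positions and alignment columns from the notes are not used here.

```python
def residue_to_column(aligned_seq, residue_pos):
    """Map a 1-based ungapped residue position to a 1-based alignment column."""
    seen = 0
    for col, ch in enumerate(aligned_seq, start=1):
        if ch != "-":
            seen += 1
            if seen == residue_pos:
                return col
    raise ValueError("residue position beyond sequence length")


def conserved_at(alignment, column, residue):
    """Return sorted names of sequences that carry `residue` at the 1-based column."""
    return sorted(name for name, seq in alignment.items() if seq[column - 1] == residue)


# Toy alignment (made up): "ref" plays the role of the reference enzyme.
toy = {
    "ref":  "A-HGH--K",
    "seq1": "AAHGHQ-K",
    "seq2": "A-QGH--K",
}

first_his_col = residue_to_column(toy["ref"], 2)   # 2nd residue of ref is H
second_his_col = residue_to_column(toy["ref"], 4)  # 4th residue of ref is H
with_first_his = conserved_at(toy, first_his_col, "H")
with_second_his = conserved_at(toy, second_his_col, "H")
```

In the toy data, the first histidine falls in column 3 and is missing from `seq2`, while the second falls in column 5 and is present in all three sequences. That is the kind of pattern the notes describe for the fourth histidine in the CCS clade.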
Time to curate MF: 47 minutes
- SOD/mitochondrion: There are annotations to mitochondrion or some child of mitochondrion for mouse, rat, worm, and yeast SOD1. Propagate mitochondrion to AN48 and block propagation to the SOD3 clade.
- The eukaryotic SODs have multiple annotations to "extracellular region" or its children. Propagate to AN46. Similarly, propagate "periplasmic space" from E. coli sodC to the other bacterial proteins.
- Can't really make any good inference for the CCS clade.
Time to annotate CC: 14 minutes
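The "propagate to a node, then block a subclade" pattern used above (and the NOTs on long branches elsewhere in these notes) can be pictured as a tree walk that skips blocked subtrees. The sketch below is not PAINT's implementation. The tree, the node names, and the function `propagate` are hypothetical and chosen only to show the idea.

```python
def propagate(tree, start, blocked=frozenset()):
    """Return the leaves under `start`, skipping any subtree rooted at a blocked node."""
    leaves, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in blocked:
            continue  # an IRD/NOT here stops propagation to everything below
        children = tree.get(node, [])
        if children:
            # Reverse so that children are visited in their listed order.
            stack.extend(reversed(children))
        else:
            leaves.append(node)
    return leaves


# Made-up tree: internal nodes map to their children; leaves have no entry.
toy_tree = {
    "AN_root": ["AN_a", "AN_b"],
    "AN_a": ["leaf1", "leaf2"],
    "AN_b": ["leaf3", "leaf4"],
}

block_clade = propagate(toy_tree, "AN_root", blocked={"AN_b"})
block_leaf = propagate(toy_tree, "AN_root", blocked={"leaf2"})
```

Blocking an internal node removes its whole clade from the result, while blocking a single leaf (for example a long branch) removes only that sequence.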
Start with functions directly related to the MFs of these proteins. Set aside multicellular processes.
CCS clade: Propagate:
- GO:0051341: regulation of oxidoreductase activity
- GO:0015680: intracellular copper ion transport
Propagate to AN0: GO:0019430 : removal of superoxide radicals
That covers most of the cellular processes.
Time to annotate BP: 18 minutes.
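For reference, the per-aspect times recorded in these notes can be tallied per family and per set of notes. The minute values below are copied from the text. The labels are an assumption made for this tally: the notes do not name the families or every annotator explicitly, and no times were recorded for the first family (the serB example).

```python
# Minutes per GO aspect as recorded above; the labels are assumptions.
timings = {
    "ERCC4-domain family, first set of notes": {"MF": 15, "CC": 3, "BP": 20},
    "ERCC4-domain family, second set of notes": {"MF": 20, "CC": 15, "BP": 30},
    "SOD/CCS family, first set of notes": {"MF": 120, "CC": 30, "BP": 90},
    "SOD/CCS family, second set of notes": {"MF": 47, "CC": 14, "BP": 18},
}

# Total curation time per entry, in minutes.
totals = {name: sum(aspects.values()) for name, aspects in timings.items()}

for name, minutes in totals.items():
    print(f"{name}: {minutes} min")
```

The totals come to 38, 65, 240, and 79 minutes respectively.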