Taxon-GO Implementation April 2008 onwards
Revision as of 08:13, 28 July 2008
At the Consortium meetings in Princeton in 2007 and Salt Lake City in 2008 Jennifer presented a proposal and pilot on the system of implementing taxon information. At the Salt Lake City Meeting it was decided to implement the proposal. This page is for recording of progress on that implementation.
The original proposal is not currently archived.
The pilot data is at
- 1 29th April 2008
- 2 1st May 2008
- 3 6th May 2008
- 4 6th May PM
- 5 10th May
- 6 15th May
- 7 16th May
- 8 30th May
- 9 5th June
- 10 18th June
- 11 26th June
- 12 16th July
- 13 17th July
- 14 Adding links en masse
- 15 July 18th
- 16 July 21st
- 17 July 23rd
- 18 24th July
29th April 2008
In starting to implement the links I am using the custom taxon slim that Chris Mungall made from the NCBI taxonomy hierarchy.
This is what he did to make the slim:
I grabbed all species with an annotation in the database, then did a simple filter on the results:
http://wiki.geneontology.org/index.php/Example_Queries#Total_annotations.2C_grouped_by_species.2C_broken_down_by_evidence
grep -v IEA z | cut -f1 | sort -u | perl -npe 's//NCBITaxon:/' > ~/tmp/tax-ids.txt
(there were almost 1000!)
I then used my segmentation tool (part of obol) to slice these IDs and their descendants from the NCBI tax file I publish on the OBO download page. The results are in:
http://www.berkeleybop.org/obol/tmp/ncbitax-slim.obo
There's a bug in my segmenter in that the ranks (genus, order, family) were not included. But this may work to your advantage in that these are stored using generic term properties which people aren't used to yet. It seems like you don't need these anyway.
I aim to reproduce my segmenter functionality in OE. In fact it may be possible to do this right now with filter scripts. In this particular case the segmenter is doing something pretty basic: following all input terms up to the root and writing as .obo
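The "follow all input terms up to the root" step can be sketched roughly as below. This is not Chris's segmenter; the function name `slice_to_root` and the `PARENTS` map are hypothetical, and the lineages are heavily abbreviated for illustration.

```python
from collections import deque

def slice_to_root(seed_ids, parents):
    """Return the seed terms plus every ancestor reachable through `parents`."""
    keep = set()
    queue = deque(seed_ids)
    while queue:
        term = queue.popleft()
        if term in keep:
            continue  # already visited via another path
        keep.add(term)
        queue.extend(parents.get(term, ()))
    return keep

# Hypothetical, heavily abbreviated child -> parents map.
# Real NCBI lineages contain many more intermediate ranks.
PARENTS = {
    "NCBITaxon:9606": ["NCBITaxon:40674"],   # Homo sapiens -> Mammalia
    "NCBITaxon:40674": ["NCBITaxon:2759"],   # Mammalia -> Eukaryota
    "NCBITaxon:6656": ["NCBITaxon:2759"],    # Arthropoda -> Eukaryota
}

slim_ids = slice_to_root(["NCBITaxon:9606"], PARENTS)
print(sorted(slim_ids))
```

Only terms on a path from an annotated species to the root are kept, so unrelated branches (Arthropoda in this toy map) drop out of the slim.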
Q/ Should this slim now be checked into cvs in a non-scratch directory?
Cross Product files
Chris has made files in scratch that show cross products between the go ontology file and the various other ontologies. He has suggested that I should look at the cell type file and categorise the cell types by taxon and then transfer those to the GO file. This will cover far more terms with less work.
The cross product files are at /go/scratch/xps/
I have pulled out the list of cell types to be categorized and it is here:
CL:0000017 ! spermatocyte
CL:0000018 ! spermatid
CL:0000019 ! sperm
CL:0000023 ! oocyte
CL:0000025 ! egg
CL:0000026 ! nurse cell
CL:0000030 ! glioblast
CL:0000031 ! neuroblast
CL:0000034 ! stem cell
CL:0000037 ! hematopoietic stem cell
CL:0000056 ! myoblast
CL:0000057 ! fibroblast
CL:0000062 ! osteoblast
CL:0000066 ! epithelial cell
CL:0000071 ! blood vessel endothelial cell
CL:0000075 ! columnar/cuboidal epithelial cell
CL:0000081 ! blood cell
CL:0000084 ! T cell
CL:0000092 ! osteoclast
CL:0000094 ! granulocyte
CL:0000097 ! mast cell
CL:0000115 ! endothelial cell
CL:0000125 ! glial cell
CL:0000127 ! astrocyte
CL:0000128 ! oligodendrocyte
CL:0000129 ! microglial cell
CL:0000134 ! mesenchymal cell
CL:0000136 ! fat cell
CL:0000138 ! chondrocyte
CL:0000147 ! pigment cell
CL:0000148 ! melanocyte
CL:0000150 ! glandular epithelial cell
CL:0000178 ! Leydig cell
CL:0000187 ! muscle cell
CL:0000188 ! skeletal muscle cell
CL:0000192 ! smooth muscle cell
CL:0000201 ! auditory receptor cell
CL:0000202 ! auditory hair cell
CL:0000210 ! photoreceptor cell
CL:0000216 ! Sertoli cell
CL:0000218 ! Schwann cell
CL:0000221 ! ectodermal cell
CL:0000222 ! mesodermal cell
CL:0000223 ! endodermal cell
CL:0000228 ! multinucleate cell
CL:0000232 ! erythrocyte
CL:0000233 ! platelet
CL:0000235 ! macrophage
CL:0000236 ! B cell
CL:0000248 ! microsporocyte
CL:0000250 ! megaspore
CL:0000252 ! microspore
CL:0000253 ! eurydendroid cell
CL:0000254 ! egg cell
CL:0000262 ! guard mother cell
CL:0000276 ! sclerenchyma cell
CL:0000280 ! generative cell
CL:0000282 ! trichome
CL:0000284 ! companion cell
CL:0000287 ! eye photoreceptor cell
CL:0000288 ! synergid
CL:0000292 ! guard cell
CL:0000294 ! sieve cell
CL:0000295 ! somatotropin secreting cell
CL:0000296 ! vegetative cell
CL:0000299 ! trichoblast
CL:0000300 ! gamete
CL:0000301 ! pole cell
CL:0000312 ! keratinocyte
CL:0000332 ! atrichoblast
CL:0000333 ! neural crest cell
CL:0000362 ! epidermal cell
CL:0000365 ! zygote
CL:0000373 ! histoblast
CL:0000392 ! crystal cell
CL:0000394 ! plasmatocyte
CL:0000396 ! lamellocyte
CL:0000408 ! male gamete
CL:0000430 ! xanthophore
CL:0000431 ! iridophore
CL:0000439 ! prolactin secreting cell
CL:0000442 ! follicular dendritic cell
CL:0000448 ! white fat cell
CL:0000449 ! brown fat cell
CL:0000451 ! dendritic cell
CL:0000453 ! Langerhans cell
CL:0000467 ! adrenocorticotropic hormone secreting cell
CL:0000469 ! ganglion mother cell
CL:0000474 ! pericardial cell
CL:0000476 ! thyroid stimulating hormone secreting cell
CL:0000477 ! follicle cell
CL:0000486 ! garland cell
CL:0000487 ! oenocyte
CL:0000492 ! T-helper cell
CL:0000501 ! granulosa cell
CL:0000522 ! spore
CL:0000537 ! antipodal cell
CL:0000540 ! neuron
CL:0000542 ! lymphocyte
CL:0000545 ! T-helper 1 cell
CL:0000546 ! T-helper 2 cell
CL:0000556 ! megakaryocyte
CL:0000562 ! nucleate erythrocyte
CL:0000563 ! endospore
CL:0000571 ! leucophore
CL:0000573 ! retinal cone cell
CL:0000574 ! erythrophore
CL:0000576 ! monocyte
CL:0000579 ! border follicle cell
CL:0000586 ! germ cell
CL:0000595 ! enucleate erythrocyte
CL:0000598 ! pyramidal cell
CL:0000599 ! conidium
CL:0000604 ! retinal rod cell
CL:0000607 ! ascospore
CL:0000608 ! zygospore
CL:0000609 ! vestibular hair cell
CL:0000615 ! basidiospore
CL:0000616 ! sporangiospore
CL:0000623 ! natural killer cell
CL:0000624 ! CD4-positive, alpha-beta T cell
CL:0000625 ! CD8-positive, alpha-beta T cell
CL:0000644 ! Bergmann glial cell
CL:0000656 ! primary spermatocyte
CL:0000668 ! parenchymal cell
CL:0000674 ! interfollicle cell
CL:0000675 ! female gamete
CL:0000681 ! radial glial cell
CL:0000695 ! Cajal-Retzius cell
CL:0000711 ! cumulus cell
CL:0000716 ! lymph gland crystal cell
CL:0000722 ! cystoblast
CL:0000723 ! somatic stem cell
CL:0000724 ! heterocyst
CL:0000726 ! chlamydospore
CL:0000730 ! leading edge cell
CL:0000731 ! urothelial cell
CL:0000732 ! amoeboid cell
CL:0000733 ! lymph gland plasmatocyte
CL:0000735 ! lymph gland hemocyte
CL:0000737 ! striated muscle cell
CL:0000738 ! leukocyte
CL:0000740 ! retinal ganglion cell
CL:0000746 ! cardiac muscle cell
CL:0000747 ! cyanophore
CL:0000748 ! retinal bipolar neuron
CL:0000762 ! thrombocyte
CL:0000763 ! myeloid cell
CL:0000766 ! myeloid leukocyte
CL:0000767 ! basophil
CL:0000771 ! eosinophil
CL:0000775 ! neutrophil
CL:0000782 ! myeloid dendritic cell
CL:0000784 ! plasmacytoid dendritic cell
CL:0000785 ! mature B cell
CL:0000786 ! plasma cell
CL:0000787 ! memory B cell
CL:0000789 ! alpha-beta T cell
CL:0000792 ! CD4-positive, CD25-positive, alpha-beta regulatory T cell
CL:0000793 ! CD4-positive, alpha-beta intraepithelial T cell
CL:0000794 ! CD8-positive, alpha-beta cytotoxic T cell
CL:0000795 ! CD8-positive, alpha-beta regulatory T cell
CL:0000796 ! CD8 positive, alpha-beta intraepithelial T cell
CL:0000797 ! alpha-beta intraepithelial T cell
CL:0000798 ! gamma-delta T cell
CL:0000801 ! gamma-delta intraepithelial T cell
CL:0000802 ! CD8-positive, gamma-delta intraepithelial T cell
CL:0000803 ! CD4-positive, gamma-delta intraepithelial T cell
CL:0000804 ! immature T cell
CL:0000813 ! memory T cell
CL:0000814 ! NK T cell
CL:0000815 ! regulatory T cell
CL:0000816 ! immature B cell
CL:0000817 ! pre-B cell
CL:0000818 ! transitional stage B cell
CL:0000819 ! B-1 B cell
CL:0000820 ! B-1a B cell
CL:0000821 ! B-1b B cell
CL:0000825 ! natural killer cell progenitor
CL:0000826 ! pro-B cell
CL:0000827 ! pro-T cell
CL:0000837 ! hematopoietic progenitor cell
CL:0000838 ! lymphoid progenitor cell
CL:0000839 ! myeloid progenitor cell
CL:0000842 ! mononuclear cell
CL:0000843 ! follicular B cell
CL:0000844 ! germinal center B cell
CL:0000845 ! marginal zone B cell
CL:0000851 ! neuromast mantle cell
CL:0000852 ! neuromast support cell
CL:0000855 ! hair cell
CL:0000856 ! neuromast hair cell
CL:1000274 ! trophectodermal cell
Taxon-GO file format
This is the proposed file format for the taxon-go links:
|GO term||GO:id||relationship||taxon name||taxon id|
|photosynthesis||GO:0015979||never_in_taxon||Mammalia||Taxonomy ID: 40674|
|male germ-line cyst formation||GO:0048136||never_in_taxon||Mammalia||Taxonomy ID: 40674|
|hemocyte differentiation||GO:0042386||never_outside_taxon||Arthropoda||Taxonomy ID: 6656|
|multicellular organismal process||GO:0032501||never_outside_taxon||Eukaryota||Taxonomy ID: 2759|
|nucleus||GO:0005634||never_outside_taxon||Eukaryota||Taxonomy ID: 2759|
|gametophyte development||GO:0048229||never_in_taxon||Dictyostelium||Taxonomy ID: 5782|
|viral reproduction||GO:0016032||never_outside_taxon||Viruses||Taxonomy ID: 10239|
|compound eye development||GO:0048749||never_in_taxon||Mammalia||Taxonomy ID: 40674|
|lactation||GO:0007595||never_outside_taxon||Mammalia||Taxonomy ID: 40674|
|fat body development||GO:0007503||never_in_taxon||Mammalia||Taxonomy ID: 40674|
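As a rough illustration of how a row in this format might be read, the sketch below assumes the five columns are separated by tabs in the actual file and that the taxon column is written as "Taxonomy ID: NNNN", as in the table above. The `TaxonLink` type and `parse_link` function are hypothetical names, not part of any existing GO tool.

```python
from typing import NamedTuple

class TaxonLink(NamedTuple):
    go_name: str
    go_id: str
    relationship: str
    taxon_name: str
    taxon_id: str  # normalised to the NCBITaxon:NNNN form

def parse_link(line):
    """Parse one tab-delimited row of the proposed taxon-GO format."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    go_name, go_id, relationship, taxon_name, raw_taxon = fields
    # Assumption: the last column looks like "Taxonomy ID: 40674".
    number = raw_taxon.replace("Taxonomy ID:", "").strip()
    return TaxonLink(go_name, go_id, relationship, taxon_name, "NCBITaxon:" + number)

row = "lactation\tGO:0007595\tnever_outside_taxon\tMammalia\tTaxonomy ID: 40674"
link = parse_link(row)
print(link.go_id, link.relationship, link.taxon_id)
```

Rejecting rows with the wrong number of columns early should make hand-editing mistakes in the tab-delimited file easier to spot.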
I am not yet sure how to save a file like this from OBO-Edit after having added links. I will have a go at that.
1st May 2008
I have arranged a meeting with Susan Tweedie and Rebecca Foulger to start labeling the cell type and GO terms by taxon.
6th May 2008
I have cleared away all the old taxon-go-related files from the scratch directory and made a new folder in there called go-taxon. This folder contains a copy of the taxon slim.
I have still not worked out how to save the tab-delimited file of relationships out of OBO-Edit and this is the major obstacle to starting work just now.
Chris has pointed out that I don't need to be able to propagate the links down the graph and have those links actually instantiated, as it is easy to infer them. For working in OBO-Edit I just need to set a render that will show if a term has a taxon link already applied to one of its ancestors.
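The inference Chris describes can be sketched as a walk up the graph that collects any taxon link asserted on the term or on one of its ancestors. This is only an illustration, not how OBO-Edit's reasoner or renderers work; `inherited_links` and the `EX:0000001` child term are hypothetical, while the link on GO:0048749 is taken from the format table earlier on this page.

```python
def inherited_links(term, parents, direct_links):
    """Collect taxon links asserted on `term` or on any of its ancestors."""
    found = set()
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        found.update(direct_links.get(current, ()))
        stack.extend(parents.get(current, ()))
    return found

# Hypothetical child term placed under compound eye development (GO:0048749).
PARENTS = {"EX:0000001": ["GO:0048749"]}
DIRECT = {"GO:0048749": {("never_in_taxon", "NCBITaxon:40674")}}  # Mammalia

print(inherited_links("EX:0000001", PARENTS, DIRECT))
```

The child carries no link of its own, yet the lookup still reports the Mammalia restriction, which is the information a renderer would need to flag the term.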
I have made a tab-delimited file to contain the links between the ontology file and the taxon slim and it is in the go/scratch/go-taxon/ directory
6th May PM
Further progress as described in a mail to Chris:
I made the file of go-taxon relationships in obo and tab-delimited format to test both. When I load with the tab-delimited version in OBO-Edit the relationships between the go terms and the taxon terms don't show up at all and I'm not sure what to do to persuade them. I suppose this format is just not one that OBO-Edit is prepared for. When I load with the obo format go-taxon file the taxon links show in the graph viewer, and the normal links show in the graphviz component. However, when I click on a go term with a taxon link, the graphviz component goes on strike and does not update at all. No idea of why it is being picky about that. The other weird thing is that the application treats the go-taxon file as a separate ontology and it does not seem to realise that this is a relationship between the two other loaded ontologies. I will attach a picture so you can see. With either format I'm not sure that OBO-Edit knows how to save the three files out separately.
The obo version of the relationship file is now working. There was a formatting problem with the taxon ids.
I have now written to Chris to ask how I should represent links where the go term should not be used outside of the combination of two taxa. For example photosynthesis, which should not be used outside of the combination of bacterial and viridiplantae taxa.
I am not currently able to save out the taxon links. The major barrier is that the save ontology panel in oboedit is too big to open out fully in my laptop. I have submitted a bug report but the code looks quite complicated in that part. The setup to let different panels appear or disappear when boxes are checked is quite hard to fix.
The graphviz plugin would also be much easier to use for this if the disjoint relationships were not shown, and I have gone some way to figuring out how to do that. The text file listing the relationships in the graph is set up with a small piece of code in the graphviz component, but I do not yet understand the relationship management methods enough to be able to configure which relationships are included.
I have figured out that it is not possible to save the go-taxon link file out of oboedit. I will need to save the whole ontology file out and write a perl script to extract the taxon links in OBO format, and then another to convert this to tab-delimited format.
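The extraction step could look something like the sketch below. It is written in Python for illustration and is not the Perl script mentioned here. It assumes the links are saved as `relationship:` lines inside `[Term]` stanzas with the taxon name in the trailing `!` comment, and the set of relationship names is an assumption based on the names used on this page.

```python
# Assumed relationship names; adjust to whatever the saved file actually uses.
TAXON_RELS = {"only_in_taxon", "never_in_taxon", "never_outside_taxon"}

def extract_taxon_links(obo_text):
    """Return tab-delimited rows for taxon relationships found in [Term] stanzas."""
    rows = []
    in_term = False
    term_id = name = ""
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            in_term = line == "[Term]"  # skip [Typedef] and other stanzas
            term_id = name = ""
        elif not in_term:
            continue
        elif line.startswith("id: "):
            term_id = line[len("id: "):]
        elif line.startswith("name: "):
            name = line[len("name: "):]
        elif line.startswith("relationship: "):
            value, _, comment = line[len("relationship: "):].partition("!")
            parts = value.split()
            if len(parts) == 2 and parts[0] in TAXON_RELS:
                rows.append("\t".join([name, term_id, parts[0], comment.strip(), parts[1]]))
    return rows

SAMPLE = """\
[Term]
id: GO:0007595
name: lactation
relationship: never_outside_taxon NCBITaxon:40674 ! Mammalia

[Typedef]
id: never_outside_taxon
name: never_outside_taxon
"""

for row in extract_taxon_links(SAMPLE):
    print(row)
```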
Chris and I are discussing how best to deal with situations where an only_in_taxon link should be made to the conjunction of two taxonomic groups.
Perl script in progress.
Perl scripts completed. Checked into cvs.
Participants: Chris Mungall, Jennifer Deegan.
We discussed how to deal with processes that are only seen in a specific group of organisms, where that group can only be expressed as the union of two taxonomic groups. Chris has suggested that, if at all possible, processes should be connected to only one taxon term, and that in many cases this can be done by looking at a child GO term rather than the one that immediately suggests itself. For example:
[i]eye development (this term is hard to connect to taxon, and would require multiple taxa.)
---[i]compound eye development (this term is able to be connected to a single taxon.)
However there are some situations where more than one taxon is needed. For example chloroplast-type photosynthesis happens in organisms with chloroplasts and in cyanobacteria. In order to express this Chris suggests that we should use a union term.
id: Jens_ID_1234
name: viridiplantae or bacteria
union_of: NCBITaxon:2 ; Bacteria
union_of: NCBITaxon:33090 ; Viridiplantae
We didn't manage to work out how to implement this during the meeting but I (Jen) had some thoughts afterwards.
It will be best if we can make all of the links in OBO-Edit so I think that I will add the union terms under a grouping term but not in the taxon slim. (I think I cannot really edit the taxon slim.)
[i]Taxon slim
[i]union terms
---[i]viridiplantae or bacteria
but also linked to the taxon slim.
[i]viridiplantae
---[u]viridiplantae or bacteria
(where u represents the union_of relationship.)
I think it will then be necessary to have these union terms saved out at the bottom of the go-taxon relationship file, as we cannot really make additions to the taxon slim. Once we have made these union terms in OBO-Edit it will be easy enough to make the connections between the GO terms and the union terms.
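To make the intended meaning of a union term concrete, here is a minimal sketch of how an "only in" link to a union term might be evaluated. The `permitted` function and the abbreviated lineages are hypothetical; the union membership follows the Jens_ID_1234 example above.

```python
# Union terms map to the set of taxa they stand for.
UNIONS = {
    "Jens_ID_1234": {"NCBITaxon:2", "NCBITaxon:33090"},  # Bacteria, Viridiplantae
}

def permitted(lineage, only_in_target, unions=UNIONS):
    """True if any taxon in the species lineage satisfies the 'only in' target."""
    # A plain taxon is treated as a union with a single member.
    targets = unions.get(only_in_target, {only_in_target})
    return any(taxon in targets for taxon in lineage)

# Hypothetical, heavily abbreviated lineages (species first, root last).
ARABIDOPSIS = ["NCBITaxon:3702", "NCBITaxon:33090", "NCBITaxon:2759"]
HUMAN = ["NCBITaxon:9606", "NCBITaxon:40674", "NCBITaxon:2759"]

print(permitted(ARABIDOPSIS, "Jens_ID_1234"))
print(permitted(HUMAN, "Jens_ID_1234"))
```

Treating a single taxon as a one-member union keeps the check the same whether a GO term points at a slim term or at a union term.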
Workflow for editing file.
Load these files:
go/ontology/editor/gene_ontology_write.obo
go/scratch/go-taxon/ncbitax-slim.obo
go/scratch/go-taxon/TaxonFile.obo
go/scratch/go-taxon/UnionTerms.obo
Make the connections between terms and then save out the entire dataset as
Then run the scripts:
These will produce the two files
These can be checked into scratch for the next editing session.
New union terms should be made as children of the union terms node and connected to the terms of which they are unions via the union_of relationship.
To avoid having to use the perl scripts every time I save and load I have just saved out all the ontologies into one file and am calling it 'all_files_mid_edit.obo' and checking it into scratch/go-taxon. There are two good ID manager rules in my OBO-Edit config on windows now but the application does not seem to be able to save these out to the ontology file. I will document them for the user guide as they are quite complicated and this feature is previously undocumented.
Later: The ID Manager work is now fully documented at the bottom of the page on how to work the ID Manager component in the OBO-Edit help guide. Here is a copy of the ID preferences file, which is called idprofiles.xml.
I have been working on getting the display in OBO-Edit right so that I can start editing with other curators next week.
This is what the relationships look like in the ontology tree editor:
Below is what the go-taxon links look like in the graphviewer with the reasoner on. Note that not all paths are shown. It is not clear why this is. This is shown when I click on the term 'chloroplast-style photosynthesis' (This is just a term I made up for this work. It is not actually in the GO.)
Below is how the union relationships are shown in the graphviewer if I click on the union terms 'Viridiplantae and Bacteria'.
The full graph is shown in the graphviz component (below) with or without the reasoner, which is very good as the reasoner is currently slowing things down a lot. (Note, this picture has been updated to show the correct logic. The link from viridiplantae and cyanobacteria to union_terms is no longer in the file. However, if it were there it would be shown. To see the image that shows the same graph as the other images, see the previous version in the history log.)
I have started experimenting with adding links in bulk. For example I can search for all the terms that have 'sensu endopterygota' in them and give all these a link that says 'only_in' endopterygota. Then it is quick for the fly curators to look through and check that these are right. However I am wondering whether it makes more sense to label only the top term in a branch that is all to be 'only_in' endopterygota and then have the application infer that it should add the links to all the descendants of the labelled terms. If I was going to do this then it would also make sense to find out how to detect redundant links with the reasoner. For example, have the reasoner look for cases where a term and its ancestor both have only_in endopterygota links and offer to remove the links from the descendant term. I'm not sure that we currently have a policy on whether to hard code all the links or leave it to the reasoner to add the inferred links in repair mode.
For now I think I will just hard code them all.
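If redundant links did need to be found later, the check described above could be sketched as follows. This is not the OBO-Edit reasoner; the function names and the `EX:` term IDs are hypothetical, and links are represented simply as (relationship, taxon name) pairs.

```python
def strict_ancestors(term, parents):
    """All ancestors of `term`, excluding the term itself."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def redundant_links(parents, direct_links):
    """List (term, link) pairs where an ancestor already carries the same link."""
    redundant = []
    for term in sorted(direct_links):
        inherited = set()
        for ancestor in strict_ancestors(term, parents):
            inherited |= set(direct_links.get(ancestor, ()))
        for link in sorted(direct_links[term]):
            if link in inherited:
                redundant.append((term, link))
    return redundant

# Hypothetical two-term branch where both terms were labelled by hand.
PARENTS = {"EX:child": ["EX:parent"]}
LINKS = {
    "EX:parent": {("only_in", "Endopterygota")},
    "EX:child": {("only_in", "Endopterygota")},
}

print(redundant_links(PARENTS, LINKS))
```

Only the descendant's copy is reported, so a curator could choose to drop it while keeping the link on the top term of the branch.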
1) Make some links and fully document pipeline.
2) Show to ontology list cc'd to Judy. (tab-delimited version + OBO editing file)
3) Show at managers' call.
4) Discuss plans for use with volunteer testers in Ensembl and InterPro. (one month from now)
5) Test until error or omission detection rate reaches acceptably low level.
6) Publicise. (Aim for November for best case, or January at the latest.)
I have written two perl scripts that will reformat the file of go-taxon links from OBO to Tab-delimited format and back again. They are called Taxonlinks_TabDel2OBO.pl and Taxonlinks_OBO2TabDel.pl and are available in go/software/utilities/.
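For readers without access to the scripts, the tab-delimited to OBO direction might be sketched as below. This is a Python illustration, not the contents of Taxonlinks_TabDel2OBO.pl. It assumes five tab-separated columns (GO term, GO id, relationship, taxon name, taxon id), a taxon id already in NCBITaxon:NNNN form, and an OBO stanza layout that is a guess rather than the scripts' real output.

```python
def row_to_stanza(row):
    """Turn one tab-delimited taxon link row into an OBO [Term] stanza."""
    go_name, go_id, relationship, taxon_name, taxon_id = row.rstrip("\n").split("\t")
    return "\n".join([
        "[Term]",
        f"id: {go_id}",
        f"name: {go_name}",
        # Assumed layout: the taxon name goes in the trailing '!' comment.
        f"relationship: {relationship} {taxon_id} ! {taxon_name}",
    ])

ROW = "hemocyte differentiation\tGO:0042386\tnever_outside_taxon\tArthropoda\tNCBITaxon:6656"
print(row_to_stanza(ROW))
```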
I also wrote up the entire workflow at http://wiki.geneontology.org/index.php/Taxon_Editing_Workflow
More links have been made from the old sensu terms. So far these cover Endopterygota, Insecta, Arthropoda, Diptera and Drosophila (not yet checked).
The majority of the sensu terms now have taxon links.
Participants: Susan Tweedie, Becky Foulger, Jennifer Deegan: at EBI.
Worked all afternoon with Becky Foulger and Susan Tweedie to check the links between the old insect sensu terms and the taxon terms. Edits were made live, and Becky made the following notes of action items.
1. Look at def of adult chitin-based cuticle development, and see if it can be made less insect-specific. Term name seems to apply to a wider base than insects (arthropods) but def contains references to insects.
2. Remove the sensu insecta text from the definition of GO:48085.
3. GO:30381 widen definition to chorions in arthropods, rather than insects.
4. GO:7487 remove sensu from the definition.
5. consider changing all descendants of terms with 'imaginal-disc derived' in the term name to contain 'imaginal disc-derived' terminology. E.g. antennal morphogenesis, compound eye morphogenesis.
6. decide where to put 'photoreceptor cell differentiation' and its children. Under Eukaryote lineage? In plants, PRs are molecules rather than cells.
7. Heart terms (dorsal vessel) need looking at. Do insects other than flies have a heart proper and an aorta as part of the dorsal vessel?
8. come back to oogenesis terms
9. come back to tracheal system terms.
10. double check dosage compensation terms. Are they limited just to the diptera?
11. GO:7425. Clarify in def that the 80 cells on each side of the embryo refers to Drosophila, but that the term isn't limited to Drosophila.
12. Are instar larvae in organisms other than insects? Tadpoles? crabs it
Jen has labelled instar larvae under arthropoda for now, but it might need revisiting.
Plan to meet again on August 1st in Genetics building in Cambridge. 1pm.
Participants: Jane Lomax, Jennifer Deegan
Worked through many different groups and checked the old sensu terms and their links to taxon. See diff for complete listing as this was a large and diverse piece of work.