Difference between revisions of "Taxon-GO Implementation April 2008 onwards"

From GO Wiki

Revision as of 07:27, 18 July 2008

At the Consortium meetings in Princeton in 2007 and Salt Lake City in 2008, Jennifer presented a proposal and pilot for a system of implementing taxon information. At the Salt Lake City meeting it was decided to implement the proposal. This page records progress on that implementation.

The original proposal is not currently archived.
The pilot data is at

29th April 2008

In starting to implement the links I am using the custom taxon slim that Chris Mungall made from the NCBI taxonomy hierarchy.

Taxonomy Slim

This is what he did to make the slim:

I grabbed all species with an annotation in the database, then did a simple filter on the results:
http://wiki.geneontology.org/index.php/Example_Queries#Total_annotations.2C_grouped_by_species.2C_broken_down_by_evidence

grep -v IEA z | cut -f1 | sort -u | perl -npe 's//NCBITaxon:/' > ~/tmp/tax-ids.txt

(there were almost 1000!)

I then used my segmentation tool (part of obol) to slice these IDs and their descendants from the NCBI tax file I publish on the obo download page. The results are in:


There's a bug in my segmenter in that the ranks (genus, order, family) were not included. But this may work to your advantage in that these are stored using generic term properties which people aren't used to yet. It seems like you don't need these anyway.

I aim to reproduce my segmenter functionality in OE. In fact it may be possible to do this right now with filter scripts. In this particular case the segmenter is doing something pretty basic: following all input terms up to the root and writing the result as .obo.
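The basic operation described in the mail above (follow every input term up to the root, then write the result as .obo) can be sketched as follows. This is only an illustration and not the obol segmenter itself. The function names are made up for the example, and the small hierarchy is a collapsed toy version of the NCBI taxonomy with the intermediate nodes left out.

```python
def slice_with_ancestors(parents, seed_ids):
    """Return the seed terms plus every term on a path from a seed to the root."""
    keep = set()
    stack = list(seed_ids)
    while stack:
        term = stack.pop()
        if term in keep:
            continue
        keep.add(term)
        stack.extend(parents.get(term, ()))
    return keep


def to_obo(names, parents, keep):
    """Write the retained terms as minimal OBO [Term] stanzas."""
    stanzas = []
    for term in sorted(keep):
        lines = ["[Term]", f"id: {term}", f"name: {names[term]}"]
        for parent in parents.get(term, ()):
            if parent in keep:
                lines.append(f"is_a: {parent} ! {names[parent]}")
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas) + "\n"


# Toy hierarchy (assumption): intermediate taxa are omitted for brevity.
names = {
    "NCBITaxon:1": "root",
    "NCBITaxon:2": "Bacteria",
    "NCBITaxon:2759": "Eukaryota",
    "NCBITaxon:6656": "Arthropoda",
    "NCBITaxon:40674": "Mammalia",
}
parents = {
    "NCBITaxon:2": ["NCBITaxon:1"],
    "NCBITaxon:2759": ["NCBITaxon:1"],
    "NCBITaxon:6656": ["NCBITaxon:2759"],
    "NCBITaxon:40674": ["NCBITaxon:2759"],
}

if __name__ == "__main__":
    keep = slice_with_ancestors(parents, ["NCBITaxon:40674"])
    print(to_obo(names, parents, keep))
```

Seeding with Mammalia alone keeps Mammalia, Eukaryota and the root, and drops Bacteria and Arthropoda.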

Q/ Should this slim now be checked into cvs in a non-scratch directory?

Cross Product files

Chris has made files in scratch that show cross products between the GO ontology file and the various other ontologies. He has suggested that I should look at the cell type file, categorise the cell types by taxon, and then transfer those categories to the GO file. This will cover far more terms with less work.

The cross product files are at /go/scratch/xps/

I have pulled out the list of cell types to be categorized and it is here:

CL:0000017 ! spermatocyte
CL:0000018 ! spermatid
CL:0000019 ! sperm
CL:0000023 ! oocyte
CL:0000025 ! egg
CL:0000026 ! nurse cell
CL:0000030 ! glioblast
CL:0000031 ! neuroblast
CL:0000034 ! stem cell
CL:0000037 ! hematopoietic stem cell
CL:0000056 ! myoblast
CL:0000057 ! fibroblast
CL:0000062 ! osteoblast
CL:0000066 ! epithelial cell
CL:0000071 ! blood vessel endothelial cell
CL:0000075 ! columnar/cuboidal epithelial cell
CL:0000081 ! blood cell
CL:0000084 ! T cell
CL:0000092 ! osteoclast
CL:0000094 ! granulocyte
CL:0000097 ! mast cell
CL:0000115 ! endothelial cell
CL:0000125 ! glial cell
CL:0000127 ! astrocyte
CL:0000128 ! oligodendrocyte
CL:0000129 ! microglial cell
CL:0000134 ! mesenchymal cell
CL:0000136 ! fat cell
CL:0000138 ! chondrocyte
CL:0000147 ! pigment cell
CL:0000148 ! melanocyte
CL:0000150 ! glandular epithelial cell
CL:0000178 ! Leydig cell
CL:0000187 ! muscle cell
CL:0000188 ! skeletal muscle cell
CL:0000192 ! smooth muscle cell
CL:0000201 ! auditory receptor cell
CL:0000202 ! auditory hair cell
CL:0000210 ! photoreceptor cell
CL:0000216 ! Sertoli cell
CL:0000218 ! Schwann cell
CL:0000221 ! ectodermal cell
CL:0000222 ! mesodermal cell
CL:0000223 ! endodermal cell
CL:0000228 ! multinucleate cell
CL:0000232 ! erythrocyte
CL:0000233 ! platelet
CL:0000235 ! macrophage
CL:0000236 ! B cell
CL:0000248 ! microsporocyte
CL:0000250 ! megaspore
CL:0000252 ! microspore
CL:0000253 ! eurydendroid cell
CL:0000254 ! egg cell
CL:0000262 ! guard mother cell
CL:0000276 ! sclerenchyma cell
CL:0000280 ! generative cell
CL:0000282 ! trichome
CL:0000284 ! companion cell
CL:0000287 ! eye photoreceptor cell
CL:0000288 ! synergid
CL:0000292 ! guard cell
CL:0000294 ! sieve cell
CL:0000295 ! somatotropin secreting cell
CL:0000296 ! vegetative cell
CL:0000299 ! trichoblast
CL:0000300 ! gamete
CL:0000301 ! pole cell
CL:0000312 ! keratinocyte
CL:0000332 ! atrichoblast
CL:0000333 ! neural crest cell
CL:0000362 ! epidermal cell
CL:0000365 ! zygote
CL:0000373 ! histoblast
CL:0000392 ! crystal cell
CL:0000394 ! plasmatocyte
CL:0000396 ! lamellocyte
CL:0000408 ! male gamete
CL:0000430 ! xanthophore
CL:0000431 ! iridophore
CL:0000439 ! prolactin secreting cell
CL:0000442 ! follicular dendritic cell
CL:0000448 ! white fat cell
CL:0000449 ! brown fat cell
CL:0000451 ! dendritic cell
CL:0000453 ! Langerhans cell
CL:0000467 ! adrenocorticotropic hormone secreting cell
CL:0000469 ! ganglion mother cell
CL:0000474 ! pericardial cell
CL:0000476 ! thyroid stimulating hormone secreting cell
CL:0000477 ! follicle cell
CL:0000486 ! garland cell
CL:0000487 ! oenocyte
CL:0000492 ! T-helper cell
CL:0000501 ! granulosa cell
CL:0000522 ! spore
CL:0000537 ! antipodal cell
CL:0000540 ! neuron
CL:0000542 ! lymphocyte
CL:0000545 ! T-helper 1 cell
CL:0000546 ! T-helper 2 cell
CL:0000556 ! megakaryocyte
CL:0000562 ! nucleate erythrocyte
CL:0000563 ! endospore
CL:0000571 ! leucophore
CL:0000573 ! retinal cone cell
CL:0000574 ! erythrophore
CL:0000576 ! monocyte
CL:0000579 ! border follicle cell
CL:0000586 ! germ cell
CL:0000595 ! enucleate erythrocyte
CL:0000598 ! pyramidal cell
CL:0000599 ! conidium
CL:0000604 ! retinal rod cell
CL:0000607 ! ascospore
CL:0000608 ! zygospore
CL:0000609 ! vestibular hair cell
CL:0000615 ! basidiospore
CL:0000616 ! sporangiospore
CL:0000623 ! natural killer cell
CL:0000624 ! CD4-positive, alpha-beta T cell
CL:0000625 ! CD8-positive, alpha-beta T cell
CL:0000644 ! Bergmann glial cell
CL:0000656 ! primary spermatocyte
CL:0000668 ! parenchymal cell
CL:0000674 ! interfollicle cell
CL:0000675 ! female gamete
CL:0000681 ! radial glial cell
CL:0000695 ! Cajal-Retzius cell
CL:0000711 ! cumulus cell
CL:0000716 ! lymph gland crystal cell
CL:0000722 ! cystoblast
CL:0000723 ! somatic stem cell
CL:0000724 ! heterocyst
CL:0000726 ! chlamydospore
CL:0000730 ! leading edge cell
CL:0000731 ! urothelial cell
CL:0000732 ! amoeboid cell
CL:0000733 ! lymph gland plasmatocyte
CL:0000735 ! lymph gland hemocyte
CL:0000737 ! striated muscle cell
CL:0000738 ! leukocyte
CL:0000740 ! retinal ganglion cell
CL:0000746 ! cardiac muscle cell
CL:0000747 ! cyanophore
CL:0000748 ! retinal bipolar neuron
CL:0000762 ! thrombocyte
CL:0000763 ! myeloid cell
CL:0000766 ! myeloid leukocyte
CL:0000767 ! basophil
CL:0000771 ! eosinophil
CL:0000775 ! neutrophil
CL:0000782 ! myeloid dendritic cell
CL:0000784 ! plasmacytoid dendritic cell
CL:0000785 ! mature B cell
CL:0000786 ! plasma cell
CL:0000787 ! memory B cell
CL:0000789 ! alpha-beta T cell
CL:0000792 ! CD4-positive, CD25-positive, alpha-beta regulatory T cell
CL:0000793 ! CD4-positive, alpha-beta intraepithelial T cell
CL:0000794 ! CD8-positive, alpha-beta cytotoxic T cell
CL:0000795 ! CD8-positive, alpha-beta regulatory T cell
CL:0000796 ! CD8 positive, alpha-beta intraepithelial T cell
CL:0000797 ! alpha-beta intraepithelial T cell
CL:0000798 ! gamma-delta T cell
CL:0000801 ! gamma-delta intraepithelial T cell
CL:0000802 ! CD8-positive, gamma-delta intraepithelial T cell
CL:0000803 ! CD4-positive, gamma-delta intraepithelial T cell
CL:0000804 ! immature T cell
CL:0000813 ! memory T cell
CL:0000814 ! NK T cell
CL:0000815 ! regulatory T cell
CL:0000816 ! immature B cell
CL:0000817 ! pre-B cell
CL:0000818 ! transitional stage B cell
CL:0000819 ! B-1 B cell
CL:0000820 ! B-1a B cell
CL:0000821 ! B-1b B cell
CL:0000825 ! natural killer cell progenitor
CL:0000826 ! pro-B cell
CL:0000827 ! pro-T cell
CL:0000837 ! hematopoietic progenitor cell
CL:0000838 ! lymphoid progenitor cell
CL:0000839 ! myeloid progenitor cell
CL:0000842 ! mononuclear cell
CL:0000843 ! follicular B cell
CL:0000844 ! germinal center B cell
CL:0000845 ! marginal zone B cell
CL:0000851 ! neuromast mantle cell
CL:0000852 ! neuromast support cell
CL:0000855 ! hair cell
CL:0000856 ! neuromast hair cell
CL:1000274 ! trophectodermal cell

Taxon-GO file format

This is the proposed file format for the taxon-go links:

GO term                           GO:id       relationship         taxon name     taxon id
photosynthesis                    GO:0015979  never_in_taxon       Mammalia       Taxonomy ID: 40674
male germ-line cyst formation     GO:0048136  never_in_taxon       Mammalia       Taxonomy ID: 40674
hemocyte differentiation          GO:0042386  never_outside_taxon  Arthropoda     Taxonomy ID: 6656
multicellular organismal process  GO:0032501  never_outside_taxon  Eukaryota      Taxonomy ID: 2759
nucleus                           GO:0005634  never_outside_taxon  Eukaryota      Taxonomy ID: 2759
gametophyte development           GO:0048229  never_in_taxon       Dictyostelium  Taxonomy ID: 5782
viral reproduction                GO:0016032  never_outside_taxon  Viruses        Taxonomy ID: 10239
compound eye development          GO:0048749  never_in_taxon       Mammalia       Taxonomy ID: 40674
lactation                         GO:0007595  never_outside_taxon  Mammalia       Taxonomy ID: 40674
fat body development              GO:0007503  never_in_taxon       Mammalia       Taxonomy ID: 40674
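A minimal sketch of how a file in this format might be read. It assumes that the five columns are separated by tabs in the real file and that the taxon id column always has the form "Taxonomy ID: <number>". The function name and the normalisation to an "NCBITaxon:" prefix are choices made for this example only, and only the two relationships shown in the table are accepted.

```python
import csv
import io
from typing import NamedTuple

# Only the two relationships that appear in the proposed format above.
RELATIONSHIPS = {"never_in_taxon", "never_outside_taxon"}


class TaxonLink(NamedTuple):
    go_name: str
    go_id: str
    relationship: str
    taxon_name: str
    taxon_id: str  # normalised to "NCBITaxon:<number>" (assumption)


def parse_taxon_links(text):
    """Parse tab-delimited taxon-GO links, skipping blank lines and the header."""
    links = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0] == "GO term":
            continue
        if len(row) != 5:
            raise ValueError(f"expected 5 columns, got {len(row)}: {row!r}")
        go_name, go_id, rel, taxon_name, raw_id = (cell.strip() for cell in row)
        if rel not in RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {rel!r}")
        number = raw_id.replace("Taxonomy ID:", "").strip()
        if not number.isdigit():
            raise ValueError(f"bad taxon id: {raw_id!r}")
        links.append(TaxonLink(go_name, go_id, rel, taxon_name, f"NCBITaxon:{number}"))
    return links


SAMPLE = (
    "GO term\tGO:id\trelationship\ttaxon name\ttaxon id\n"
    "photosynthesis\tGO:0015979\tnever_in_taxon\tMammalia\tTaxonomy ID: 40674\n"
    "lactation\tGO:0007595\tnever_outside_taxon\tMammalia\tTaxonomy ID: 40674\n"
)

if __name__ == "__main__":
    for link in parse_taxon_links(SAMPLE):
        print(link)
```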

I am not yet sure how to save a file like this from OBO-Edit after having added links. I will have a go at that.

1st May 2008

I have arranged a meeting with Susan Tweedie and Rebecca Foulger to start labeling the cell type and GO terms by taxon.

6th May 2008

I have cleared away all the old taxon-go-related files from the scratch directory and made a new folder in there called go-taxon. This folder contains a copy of the taxon slim.

I have still not worked out how to save the tab-delimited file of relationships out of OBO-Edit and this is the major obstacle to starting work just now.

Chris has pointed out that I don't need to propagate the links down the graph and have those links actually instantiated, as it is easy to infer them. For working in OBO-Edit I just need to set up a renderer that will show whether a term already has a taxon link applied to one of its ancestors.
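The inference is simple enough to sketch: walk up from a term and collect any taxon link asserted on an ancestor. This is an illustration only, not how OBO-Edit or its reasoner works. The function name and the EX: term IDs are hypothetical.

```python
from collections import deque


def inherited_taxon_links(term, parents, direct_links):
    """Return (ancestor, link) pairs for links asserted on any ancestor of term."""
    found = []
    seen = {term}
    queue = deque(parents.get(term, ()))
    while queue:
        ancestor = queue.popleft()
        if ancestor in seen:
            continue
        seen.add(ancestor)
        for link in direct_links.get(ancestor, ()):
            found.append((ancestor, link))
        queue.extend(parents.get(ancestor, ()))
    return found


# Hypothetical three-term chain: EX:0000003 is_a EX:0000002 is_a EX:0000001.
parents = {
    "EX:0000003": ["EX:0000002"],
    "EX:0000002": ["EX:0000001"],
}
# Only the top term carries a directly asserted link.
direct_links = {
    "EX:0000001": [("never_outside_taxon", "NCBITaxon:2759")],
}

if __name__ == "__main__":
    print(inherited_taxon_links("EX:0000003", parents, direct_links))
```

A renderer would only need to know whether this list is empty for the term being displayed.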

I have made a tab-delimited file to contain the links between the ontology file and the taxon slim and it is in the go/scratch/go-taxon/ directory.

6th May PM

Further progress as described in a mail to Chris:

I made the file of go-taxon relationships in obo and tab-delimited format to test both.

When I load with the tab-delimited version in OBO-Edit the relationships between the GO terms and the taxon terms don't show up at all and I'm not sure what to do to persuade them. I suppose this format is just not one that OBO-Edit is prepared for.

When I load with the obo format go-taxon file the taxon links show in the graph viewer, and the normal links show in the graphviz component. However, when I click on a GO term with a taxon link, the graphviz component goes on strike and does not update at all. No idea why it is being picky about that. The other weird thing is that the application treats the go-taxon file as a separate ontology and it does not seem to realise that this is a relationship between the two other loaded ontologies. I will attach a picture so you can see.

With either format I'm not sure that OBO-Edit knows how to save the three files out separately.

10th May

The obo version of the relationship file is now working. There was a formatting problem with the taxon ids.

15th May

I have now written to Chris to ask how I should represent links where the GO term should not be used outside of the combination of two taxa. For example photosynthesis, which should not be used outside of the combination of Bacteria and Viridiplantae.

I am not currently able to save out the taxon links. The major barrier is that the save ontology panel in OBO-Edit is too big to open out fully on my laptop. I have submitted a bug report but the code looks quite complicated in that part. The setup to let different panels appear or disappear when boxes are checked is quite hard to fix.

The graphviz plugin would also be much easier to use for this if the disjoint relationships were not shown, and I have gone some way to figuring out how to do that. The text file listing the relationships in the graph is set up with a small piece of code in the graphviz component, but I do not yet understand the relationship management methods enough to be able to configure which relationships are included.

16th May

I have figured out that it is not possible to save the go-taxon link file out of OBO-Edit. I will need to save the whole ontology file out and write a Perl script to extract the taxon links in OBO format, and then another to convert this to tab-delimited format.
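The extraction step can be sketched as below. This is not the Perl script mentioned above. It assumes that each taxon link is stored in a [Term] stanza as a line of the form "relationship: never_in_taxon NCBITaxon:40674 ! Mammalia", and this page does not show the actual layout of the saved file. The function names are made up for the example.

```python
# Relationship names taken from this page; the stanza layout is an assumption.
TAXON_RELATIONS = ("never_in_taxon", "never_outside_taxon", "only_in_taxon")


def extract_taxon_links(obo_text):
    """Collect (name, id, relationship, taxon name, taxon id) from [Term] stanzas."""
    rows = []
    term_id = term_name = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            term_id = term_name = None
        elif not in_term:
            continue
        elif line.startswith("id:"):
            term_id = line[len("id:"):].strip()
        elif line.startswith("name:"):
            term_name = line[len("name:"):].strip()
        elif line.startswith("relationship:"):
            body, _, comment = line[len("relationship:"):].partition("!")
            parts = body.split()
            if len(parts) == 2 and parts[0] in TAXON_RELATIONS:
                rows.append(
                    (term_name or "", term_id or "", parts[0], comment.strip(), parts[1])
                )
    return rows


def to_tab_delimited(rows):
    """Render the extracted rows one per line, with tab-separated columns."""
    return "".join("\t".join(row) + "\n" for row in rows)


SAMPLE_OBO = """format-version: 1.2

[Term]
id: GO:0007595
name: lactation
relationship: never_outside_taxon NCBITaxon:40674 ! Mammalia

[Term]
id: GO:0015979
name: photosynthesis
relationship: never_in_taxon NCBITaxon:40674 ! Mammalia

[Typedef]
id: never_in_taxon
name: never_in_taxon
"""

if __name__ == "__main__":
    print(to_tab_delimited(extract_taxon_links(SAMPLE_OBO)), end="")
```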

Chris and I are discussing how best to deal with situations where an only_in_taxon link should be made to the conjunction of two taxonomic groups.

30th May

Perl script in progress.

5th June

Perl scripts completed. Checked into cvs.


18th June

Skype call
Participants: Chris Mungall, Jennifer Deegan.

We discussed how to deal with processes that are only seen in a specific group of organisms, where that group can only be expressed as the union of two taxonomic groups. Chris suggested that, if at all possible, a process should be connected to only one taxon term, and that in many cases this can be done by looking at a child GO term rather than the one that immediately suggests itself. For example:

[i]eye development (this term is hard to connect to taxon, and would require multiple taxa.)
---[i]compound eye development (this term is able to be connected to a single taxon.)

However there are some situations where more than one taxon is needed. For example chloroplast-type photosynthesis happens in organisms with chloroplasts and in cyanobacteria. In order to express this Chris suggests that we should use a union term.

id: Jens_ID_1234
name: Viridiplantae or Bacteria
union_of: NCBITaxon:2 ! Bacteria
union_of: NCBITaxon:33090 ! Viridiplantae
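To illustrate what such a union term means for a check, the sketch below tests whether a taxon is, or descends from, any member of the union. It assumes a simple child-to-parent map for the taxonomy. The function name, the two EX: species placeholders and the collapsed toy lineage are hypothetical.

```python
def taxon_in_union(taxon, union_members, taxon_parents):
    """True if taxon is, or descends from, any member of the union."""
    members = set(union_members)
    seen = set()
    stack = [taxon]
    while stack:
        current = stack.pop()
        if current in members:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(taxon_parents.get(current, ()))
    return False


# The two members of the union term: Bacteria and Viridiplantae.
UNION = ["NCBITaxon:2", "NCBITaxon:33090"]

# Toy lineage (assumption): intermediate taxa omitted, EX: entries are placeholders.
taxon_parents = {
    "EX:some_cyanobacterium": ["NCBITaxon:2"],
    "EX:some_moss": ["NCBITaxon:33090"],
    "NCBITaxon:33090": ["NCBITaxon:2759"],
    "NCBITaxon:40674": ["NCBITaxon:2759"],
}

if __name__ == "__main__":
    for taxon in ("EX:some_cyanobacterium", "EX:some_moss", "NCBITaxon:40674"):
        print(taxon, taxon_in_union(taxon, UNION, taxon_parents))
```

With this toy data the cyanobacterium and the moss fall inside the union and Mammalia does not.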

We didn't manage to work out how to implement this during the meeting but I (Jen) had some thoughts afterwards.

It will be best if we can make all of the links in OBO-Edit so I think that I will add the union terms under a grouping term but not in the taxon slim. (I think I cannot really edit the taxon slim.)

[i]Taxon slim
[i]union terms
---[i]Viridiplantae or Bacteria

but also linked to the taxon slim.

---[u]Viridiplantae or Bacteria

(where u represents the union_of relationship.)

I think it will then be necessary to have these union terms saved out at the bottom of the go-taxon relationship file, as we cannot really make additions to the taxon slim. Once we have made these union terms in OBO-Edit it will be easy enough to make the connections between the GO terms and the union terms.

26th June

Workflow for editing file.

Load these files:


Make the connections between terms and then save out the entire dataset as


Then run the scripts:


These will produce the two files


These can be checked into scratch for the next editing session.

New union terms should be made as children of the union terms node and connected to the terms of which they are unions via the union_of relationship.

16th July

To avoid having to use the Perl scripts every time I save and load, I have saved out all the ontologies into one file called 'all_files_mid_edit.obo' and am checking it into scratch/go-taxon. There are two good ID manager rules in my OBO-Edit config on Windows now, but the application does not seem to be able to save these out to the ontology file. I will document them for the user guide as they are quite complicated and this feature was previously undocumented.

Later: The ID Manager work is now fully documented at the bottom of the page on how to work the ID Manager component in the OBO-Edit help guide. Here is a copy of the ID preferences file, which is called idprofiles.xml.

17th July

I have been working on getting the display in OBO-Edit right so that I can start editing with other curators next week.

This is what the relationships look like in the ontology tree editor:

Never outside.png

Below is what the go-taxon links look like in the graphviewer with the reasoner on. Note that not all paths are shown. It is not clear why this is. This is shown when I click on the term 'chloroplast-style photosynthesis' (This is just a term I made up for this work. It is not actually in the GO.)

Never outside graphview.png

Below is how the union relationships are shown in the graphviewer if I click on the union term 'Viridiplantae and Bacteria'.


The full graph is shown in the graphviz component (below) with or without the reasoner, which is very good as the reasoner is currently slowing things down a lot. (Note: this picture has been updated to show the correct logic. The link from viridiplantae and cyanobacteria to union_terms is no longer in the file. However, if it were there it would be shown. To see the image that shows the same graph as the other images, see the previous version in the history log.)

Full graph graphviz.png

Adding links en masse

I have started experimenting with adding links in bulk. For example, I can search for all the terms that have 'sensu endopterygota' in them and give all of these a link that says 'only_in' endopterygota. Then it is quick for the fly curators to look through and check that these are right. However, I am wondering whether it makes more sense to label only the top term in a branch that is all to be 'only_in' endopterygota, and then have the application infer the links for all the descendants of the labelled terms. If I were going to do this then it would also make sense to find out how to detect redundant links with the reasoner. For example, the reasoner could look for cases where a term and its ancestor both have only_in endopterygota links and offer to remove the link from the descendant term. I'm not sure that we currently have a policy on whether to hard code all the links or leave it to the reasoner to add the inferred links in repair mode.

For now I think I will just hard code them all.
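The bulk labelling and the redundancy check discussed above can be sketched outside OBO-Edit as follows. This is an illustration under stated assumptions, not a description of what the reasoner does. The function names, the EX: term IDs and the term names are hypothetical, and the taxon is given by name because no ID for Endopterygota appears on this page.

```python
def bulk_links(names, phrase, taxon):
    """Propose an only_in link for every term whose name contains phrase."""
    needle = phrase.lower()
    return {term: taxon for term, name in names.items() if needle in name.lower()}


def redundant_links(links, parents):
    """Return terms whose link repeats one already asserted on an ancestor."""
    redundant = []
    for term, taxon in links.items():
        seen = set()
        stack = list(parents.get(term, ()))
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            if links.get(ancestor) == taxon:
                redundant.append(term)
                break
            stack.extend(parents.get(ancestor, ()))
    return sorted(redundant)


# Hypothetical terms: EX:0000003 is_a EX:0000002 is_a EX:0000001.
names = {
    "EX:0000001": "process A (sensu Endopterygota)",
    "EX:0000002": "process B (sensu Endopterygota)",
    "EX:0000003": "process C",
}
parents = {
    "EX:0000002": ["EX:0000001"],
    "EX:0000003": ["EX:0000002"],
}

if __name__ == "__main__":
    links = bulk_links(names, "sensu endopterygota", "Endopterygota")
    print(links)
    print(redundant_links(links, parents))
```

Here the search labels the first two terms, and the second is reported as redundant because its parent already carries the same link.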

July 18th

I have written two Perl scripts that will reformat the file of go-taxon links from OBO to tab-delimited format and back again. They are called Taxonlinks_TabDel2OBO.pl and Taxonlinks_OBO2TabDel.pl and are available in go/software/utilities/.

I also wrote up the entire workflow at http://wiki.geneontology.org/index.php/Taxon_Editing_Workflow. More links have been made from the old sensu terms, currently for Endopterygota and Insecta.