Taxon-GO Implementation April 2008 onwards

From GO Wiki

At the Consortium meetings in Princeton in 2007 and Salt Lake City in 2008, Jennifer presented a proposal and pilot for a system of implementing taxon information. At the Salt Lake City meeting it was decided to implement the proposal. This page records progress on that implementation.

The original proposal is not currently archived.
The pilot data is at

29th April 2008

In starting to implement the links I am using the custom taxon slim that Chris Mungall made from the NCBI taxonomy hierarchy.

Taxonomy Slim

This is what he did to make the slim:

I grabbed all species with an annotation in the database, then did a simple filter on the results: Example_Queries#Total_annotations.2C_grouped_by_species.2C_broken_down_by_evidence

grep -v IEA z | cut -f1 | sort -u | perl -pe 's/^/NCBITaxon:/' > ~/tmp/tax-ids.txt

(there were almost 1000!)

I then used my segmentation tool (part of obol) to slice these IDs and their descendants from the ncbi tax file I publish on the obo download page. The results are in:

There's a bug in my segmenter in that the ranks (genus, order, family) were not included. But this may work to your advantage in that these are stored using generic term properties which people aren't used to yet. It seems like you don't need these anyway.

I aim to reproduce my segmenter functionality in OE. In fact it may be possible to do this right now with filter scripts. In this particular case the segmenter is doing something pretty basic: following all input terms up to the root and writing the result as .obo
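
The basic operation described above (follow each input term up to the root, then write what was collected as .obo) can be sketched as follows. This is a minimal illustration in Python, not the obol segmenter itself; the function names and the toy parent map are assumptions, and the lineage is abbreviated (intermediate ranks are left out).

```python
from collections import deque

# Toy child -> parents map. The IDs are NCBI taxon IDs, but the lineage is
# abbreviated for illustration (intermediate ranks are left out).
PARENTS = {
    "NCBITaxon:7227": ["NCBITaxon:6656"],   # Drosophila melanogaster
    "NCBITaxon:9606": ["NCBITaxon:40674"],  # Homo sapiens
    "NCBITaxon:6656": ["NCBITaxon:2759"],   # Arthropoda
    "NCBITaxon:40674": ["NCBITaxon:2759"],  # Mammalia
    "NCBITaxon:2759": [],                   # Eukaryota
}


def slice_to_root(seed_ids, parents):
    """Return the seed terms plus every ancestor reachable from them."""
    keep = set()
    queue = deque(seed_ids)
    while queue:
        term = queue.popleft()
        if term in keep:
            continue
        keep.add(term)
        queue.extend(parents.get(term, ()))
    return keep


def to_obo(term_ids, parents):
    """Write the kept terms as minimal [Term] stanzas with is_a links."""
    stanzas = []
    for term in sorted(term_ids):
        lines = ["[Term]", f"id: {term}"]
        lines += [f"is_a: {p}" for p in parents.get(term, ()) if p in term_ids]
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas) + "\n"


# Slice out one annotated species and everything above it.
slim = slice_to_root(["NCBITaxon:7227"], PARENTS)
print(to_obo(slim, PARENTS))
```

Only terms on a path from an input ID to the root are kept, so the Mammalia branch is dropped when the only input is Drosophila melanogaster.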

Q/ Should this slim now be checked into cvs in a non-scratch directory?

Cross Product files

Chris has made files in scratch that show cross products between the go ontology file and the various other ontologies. He has suggested that I should look at the cell type file and categorise the cell types by taxon and then transfer those to the GO file. This will cover far more terms with less work.

The cross product files are at /go/scratch/xps/

I have pulled out the list of cell types to be categorized and it is here:

CL:0000017 ! spermatocyte
CL:0000018 ! spermatid
CL:0000019 ! sperm
CL:0000023 ! oocyte
CL:0000025 ! egg
CL:0000026 ! nurse cell
CL:0000030 ! glioblast
CL:0000031 ! neuroblast
CL:0000034 ! stem cell
CL:0000037 ! hematopoietic stem cell
CL:0000056 ! myoblast
CL:0000057 ! fibroblast
CL:0000062 ! osteoblast
CL:0000066 ! epithelial cell
CL:0000071 ! blood vessel endothelial cell
CL:0000075 ! columnar/cuboidal epithelial cell
CL:0000081 ! blood cell
CL:0000084 ! T cell
CL:0000092 ! osteoclast
CL:0000094 ! granulocyte
CL:0000097 ! mast cell
CL:0000115 ! endothelial cell
CL:0000125 ! glial cell
CL:0000127 ! astrocyte
CL:0000128 ! oligodendrocyte
CL:0000129 ! microglial cell
CL:0000134 ! mesenchymal cell
CL:0000136 ! fat cell
CL:0000138 ! chondrocyte
CL:0000147 ! pigment cell
CL:0000148 ! melanocyte
CL:0000150 ! glandular epithelial cell
CL:0000178 ! Leydig cell
CL:0000187 ! muscle cell
CL:0000188 ! skeletal muscle cell
CL:0000192 ! smooth muscle cell
CL:0000201 ! auditory receptor cell
CL:0000202 ! auditory hair cell
CL:0000210 ! photoreceptor cell
CL:0000216 ! Sertoli cell
CL:0000218 ! Schwann cell
CL:0000221 ! ectodermal cell
CL:0000222 ! mesodermal cell
CL:0000223 ! endodermal cell
CL:0000228 ! multinucleate cell
CL:0000232 ! erythrocyte
CL:0000233 ! platelet
CL:0000235 ! macrophage
CL:0000236 ! B cell
CL:0000248 ! microsporocyte
CL:0000250 ! megaspore
CL:0000252 ! microspore
CL:0000253 ! eurydendroid cell
CL:0000254 ! egg cell
CL:0000262 ! guard mother cell
CL:0000276 ! sclerenchyma cell
CL:0000280 ! generative cell
CL:0000282 ! trichome
CL:0000284 ! companion cell
CL:0000287 ! eye photoreceptor cell
CL:0000288 ! synergid
CL:0000292 ! guard cell
CL:0000294 ! sieve cell
CL:0000295 ! somatotropin secreting cell
CL:0000296 ! vegetative cell
CL:0000299 ! trichoblast
CL:0000300 ! gamete
CL:0000301 ! pole cell
CL:0000312 ! keratinocyte
CL:0000332 ! atrichoblast
CL:0000333 ! neural crest cell
CL:0000362 ! epidermal cell
CL:0000365 ! zygote
CL:0000373 ! histoblast
CL:0000392 ! crystal cell
CL:0000394 ! plasmatocyte
CL:0000396 ! lamellocyte
CL:0000408 ! male gamete
CL:0000430 ! xanthophore
CL:0000431 ! iridophore
CL:0000439 ! prolactin secreting cell
CL:0000442 ! follicular dendritic cell
CL:0000448 ! white fat cell
CL:0000449 ! brown fat cell
CL:0000451 ! dendritic cell
CL:0000453 ! Langerhans cell
CL:0000467 ! adrenocorticotropic hormone secreting cell
CL:0000469 ! ganglion mother cell
CL:0000474 ! pericardial cell
CL:0000476 ! thyroid stimulating hormone secreting cell
CL:0000477 ! follicle cell
CL:0000486 ! garland cell
CL:0000487 ! oenocyte
CL:0000492 ! T-helper cell
CL:0000501 ! granulosa cell
CL:0000522 ! spore
CL:0000537 ! antipodal cell
CL:0000540 ! neuron
CL:0000542 ! lymphocyte
CL:0000545 ! T-helper 1 cell
CL:0000546 ! T-helper 2 cell
CL:0000556 ! megakaryocyte
CL:0000562 ! nucleate erythrocyte
CL:0000563 ! endospore
CL:0000571 ! leucophore
CL:0000573 ! retinal cone cell
CL:0000574 ! erythrophore
CL:0000576 ! monocyte
CL:0000579 ! border follicle cell
CL:0000586 ! germ cell
CL:0000595 ! enucleate erythrocyte
CL:0000598 ! pyramidal cell
CL:0000599 ! conidium
CL:0000604 ! retinal rod cell
CL:0000607 ! ascospore
CL:0000608 ! zygospore
CL:0000609 ! vestibular hair cell
CL:0000615 ! basidiospore
CL:0000616 ! sporangiospore
CL:0000623 ! natural killer cell
CL:0000624 ! CD4-positive, alpha-beta T cell
CL:0000625 ! CD8-positive, alpha-beta T cell
CL:0000644 ! Bergmann glial cell
CL:0000656 ! primary spermatocyte
CL:0000668 ! parenchymal cell
CL:0000674 ! interfollicle cell
CL:0000675 ! female gamete
CL:0000681 ! radial glial cell
CL:0000695 ! Cajal-Retzius cell
CL:0000711 ! cumulus cell
CL:0000716 ! lymph gland crystal cell
CL:0000722 ! cystoblast
CL:0000723 ! somatic stem cell
CL:0000724 ! heterocyst
CL:0000726 ! chlamydospore
CL:0000730 ! leading edge cell
CL:0000731 ! urothelial cell
CL:0000732 ! amoeboid cell
CL:0000733 ! lymph gland plasmatocyte
CL:0000735 ! lymph gland hemocyte
CL:0000737 ! striated muscle cell
CL:0000738 ! leukocyte
CL:0000740 ! retinal ganglion cell
CL:0000746 ! cardiac muscle cell
CL:0000747 ! cyanophore
CL:0000748 ! retinal bipolar neuron
CL:0000762 ! thrombocyte
CL:0000763 ! myeloid cell
CL:0000766 ! myeloid leukocyte
CL:0000767 ! basophil
CL:0000771 ! eosinophil
CL:0000775 ! neutrophil
CL:0000782 ! myeloid dendritic cell
CL:0000784 ! plasmacytoid dendritic cell
CL:0000785 ! mature B cell
CL:0000786 ! plasma cell
CL:0000787 ! memory B cell
CL:0000789 ! alpha-beta T cell
CL:0000792 ! CD4-positive, CD25-positive, alpha-beta regulatory T cell
CL:0000793 ! CD4-positive, alpha-beta intraepithelial T cell
CL:0000794 ! CD8-positive, alpha-beta cytotoxic T cell
CL:0000795 ! CD8-positive, alpha-beta regulatory T cell
CL:0000796 ! CD8 positive, alpha-beta intraepithelial T cell
CL:0000797 ! alpha-beta intraepithelial T cell
CL:0000798 ! gamma-delta T cell
CL:0000801 ! gamma-delta intraepithelial T cell
CL:0000802 ! CD8-positive, gamma-delta intraepithelial T cell
CL:0000803 ! CD4-positive, gamma-delta intraepithelial T cell
CL:0000804 ! immature T cell
CL:0000813 ! memory T cell
CL:0000814 ! NK T cell
CL:0000815 ! regulatory T cell
CL:0000816 ! immature B cell
CL:0000817 ! pre-B cell
CL:0000818 ! transitional stage B cell
CL:0000819 ! B-1 B cell
CL:0000820 ! B-1a B cell
CL:0000821 ! B-1b B cell
CL:0000825 ! natural killer cell progenitor
CL:0000826 ! pro-B cell
CL:0000827 ! pro-T cell
CL:0000837 ! hematopoietic progenitor cell
CL:0000838 ! lymphoid progenitor cell
CL:0000839 ! myeloid progenitor cell
CL:0000842 ! mononuclear cell
CL:0000843 ! follicular B cell
CL:0000844 ! germinal center B cell
CL:0000845 ! marginal zone B cell
CL:0000851 ! neuromast mantle cell
CL:0000852 ! neuromast support cell
CL:0000855 ! hair cell
CL:0000856 ! neuromast hair cell
CL:1000274 ! trophectodermal cell

Taxon-GO file format

This is the proposed file format for the taxon-go links:

GO term                           GO:id       relationship         taxon name     taxon id
photosynthesis                    GO:0015979  never_in_taxon       Mammalia       Taxonomy ID: 40674
male germ-line cyst formation     GO:0048136  never_in_taxon       Mammalia       Taxonomy ID: 40674
hemocyte differentiation          GO:0042386  never_outside_taxon  Arthropoda     Taxonomy ID: 6656
multicellular organismal process  GO:0032501  never_outside_taxon  Eukaryota      Taxonomy ID: 2759
nucleus                           GO:0005634  never_outside_taxon  Eukaryota      Taxonomy ID: 2759
gametophyte development           GO:0048229  never_in_taxon       Dictyostelium  Taxonomy ID: 5782
viral reproduction                GO:0016032  never_outside_taxon  Viruses        Taxonomy ID: 10239
compound eye development          GO:0048749  never_in_taxon       Mammalia       Taxonomy ID: 40674
lactation                         GO:0007595  never_outside_taxon  Mammalia       Taxonomy ID: 40674
fat body development              GO:0007503  never_in_taxon       Mammalia       Taxonomy ID: 40674
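
To show how a file in this format might be used for annotation checking, here is a minimal Python sketch. It assumes the five columns are separated by tabs and that the last column keeps the "Taxonomy ID: " prefix; the function names and the abbreviated species lineages are hypothetical and only for illustration.

```python
import csv
import io

# Three rows in the proposed format, assumed to be tab-delimited.
RULES_TSV = (
    "photosynthesis\tGO:0015979\tnever_in_taxon\tMammalia\tTaxonomy ID: 40674\n"
    "lactation\tGO:0007595\tnever_outside_taxon\tMammalia\tTaxonomy ID: 40674\n"
    "nucleus\tGO:0005634\tnever_outside_taxon\tEukaryota\tTaxonomy ID: 2759\n"
)

# Hypothetical, abbreviated lineages: species ID -> itself plus its ancestors.
LINEAGE = {
    "9606": {"9606", "40674", "2759"},  # Homo sapiens
    "7227": {"7227", "6656", "2759"},   # Drosophila melanogaster
}


def load_rules(handle):
    """Read the taxon-GO links into {GO id: [(relationship, taxon id), ...]}."""
    rules = {}
    for _name, go_id, relation, _taxon_name, taxon_field in csv.reader(handle, delimiter="\t"):
        taxon_id = taxon_field.rsplit(" ", 1)[-1]  # "Taxonomy ID: 40674" -> "40674"
        rules.setdefault(go_id, []).append((relation, taxon_id))
    return rules


def violations(go_id, species, rules, lineage):
    """List the links that an annotation of `species` to `go_id` would break."""
    ancestors = lineage[species]
    problems = []
    for relation, taxon_id in rules.get(go_id, []):
        if relation == "never_in_taxon" and taxon_id in ancestors:
            problems.append((relation, taxon_id))
        elif relation == "never_outside_taxon" and taxon_id not in ancestors:
            problems.append((relation, taxon_id))
    return problems


rules = load_rules(io.StringIO(RULES_TSV))
print(violations("GO:0015979", "9606", rules, LINEAGE))  # human annotated to photosynthesis
print(violations("GO:0007595", "7227", rules, LINEAGE))  # fly annotated to lactation
```

A never_in_taxon link is broken when the named taxon appears in the species lineage; a never_outside_taxon link is broken when it does not.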

I am not yet sure how to save a file like this from OBO-Edit after having added links. I will have a go at that.

1st May 2008

I have arranged a meeting with Susan Tweedie and Rebecca Foulger to start labeling the cell type and GO terms by taxon.

6th May 2008

I have cleared away all the old taxon-go-related files from the scratch directory and made a new folder in there called go-taxon. This folder contains a copy of the taxon slim.

I have still not worked out how to save the tab-delimited file of relationships out of OBO-Edit and this is the major obstacle to starting work just now.

Chris has pointed out that I don't need to be able to propagate the links down the graph and have those links actually instantiated, as it is easy to infer them. For working in OBO-Edit I just need to set a renderer that will show if a term has a taxon link already applied to one of its ancestors.
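
The inference Chris describes can be sketched in a few lines of Python: walk from a term up through its parents and collect any taxon link asserted along the way. This is only an illustration and not how OBO-Edit does it; the term IDs (EX:1 and so on) and the function name are hypothetical.

```python
def inherited_taxon_links(term, parents, direct_links):
    """Map each taxon link visible on `term` to the term that asserts it."""
    found = {}
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for link in direct_links.get(current, ()):
            found.setdefault(link, current)
        stack.extend(parents.get(current, ()))
    return found


# Hypothetical three-term chain: EX:3 is a child of EX:2, which is a child of EX:1.
PARENTS = {"EX:3": ["EX:2"], "EX:2": ["EX:1"], "EX:1": [], "EX:9": []}
# Only the top term carries an asserted link.
DIRECT_LINKS = {"EX:1": [("never_outside_taxon", "NCBITaxon:2759")]}

print(inherited_taxon_links("EX:3", PARENTS, DIRECT_LINKS))
```

Because the link is found by walking upwards, nothing has to be stored on EX:2 or EX:3 for a display to show that they are already covered.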

I have made a tab-delimited file to contain the links between the ontology file and the taxon slim and it is in the go/scratch/go-taxon/ directory

6th May PM

Further progress as described in a mail to Chris:

I made the file of go-taxon relationships in obo and tab-delimited format to test both.

When I load with the tab-delimited version in OBO-Edit the relationships between the go terms and the taxon terms don't show up at all and I'm not sure what to do to persuade them. I suppose this format is just not one that OBO-Edit is prepared for.

When I load with the obo format go-taxon file the taxon links show in the graph viewer, and the normal links show in the graphviz component. However, when I click on a go term with a taxon link, the graphviz component goes on strike and does not update at all. No idea of why it is being picky about that.

The other weird thing is that the application treats the go-taxon file as a separate ontology and it does not seem to realise that this is a relationship between the two other loaded ontologies. I will attach a picture so you can see.

With either format I'm not sure that OBO-Edit knows how to save the three files out separately.

10th May

The obo version of the relationship file is now working. There was a formatting problem with the taxon ids.

15th May

I have now written to Chris to ask how I should represent links where the GO term should not be used outside of the combination of two taxa. For example, photosynthesis should not be used outside of the combination of the Bacteria and Viridiplantae taxa.

I am not currently able to save out the taxon links. The major barrier is that the save ontology panel in OBO-Edit is too big to open out fully on my laptop. I have submitted a bug report but the code looks quite complicated in that part. The setup to let different panels appear or disappear when boxes are checked is quite hard to fix.

The graphviz plugin would also be much easier to use for this if the disjoint relationships were not shown, and I have gone some way to figuring out how to do that. The text file listing the relationships in the graph is set up with a small piece of code in the graphviz component, but I do not yet understand the relationship management methods enough to be able to configure which relationships are included.

16th May

I have figured out that it is not possible to save the go-taxon link file out of OBO-Edit. I will need to save the whole ontology file out and write a Perl script to extract the taxon links in OBO format, and then another to convert this to tab-delimited format.
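
As an illustration of the extract-and-convert step, here is a minimal Python sketch (the actual scripts are in Perl and are not reproduced here). It assumes that the taxon links are saved as relationship: lines inside [Term] stanzas with the taxon name in the trailing ! comment; that layout, the function name and the sample stanzas are assumptions made for illustration.

```python
import re

# Relationship names used for taxon links on this page.
TAXON_RELATIONS = {"never_in_taxon", "never_outside_taxon", "only_in_taxon"}

# Hand-written sample, not copied from the GO file.
SAMPLE_OBO = """\
[Term]
id: GO:0015979
name: photosynthesis
relationship: never_in_taxon NCBITaxon:40674 ! Mammalia

[Term]
id: GO:0005634
name: nucleus
relationship: part_of GO:0005622 ! intracellular
relationship: never_outside_taxon NCBITaxon:2759 ! Eukaryota

[Typedef]
id: never_in_taxon
name: never_in_taxon
"""


def obo_to_rows(obo_text):
    """Pull taxon links out of [Term] stanzas as tab-delimited rows."""
    rows = []
    for stanza in re.split(r"\n\s*\n", obo_text.strip()):
        lines = stanza.splitlines()
        if not lines or lines[0].strip() != "[Term]":
            continue  # skip [Typedef] and anything else
        term_id, name, links = "", "", []
        for line in lines[1:]:
            tag, _, value = line.partition(":")
            value = value.strip()
            if tag == "id":
                term_id = value
            elif tag == "name":
                name = value
            elif tag == "relationship":
                body, _, comment = value.partition("!")
                parts = body.split()
                if len(parts) == 2 and parts[0] in TAXON_RELATIONS:
                    links.append((parts[0], parts[1], comment.strip()))
        for relation, taxon_id, taxon_name in links:
            rows.append("\t".join([name, term_id, relation, taxon_name, taxon_id]))
    return rows


for row in obo_to_rows(SAMPLE_OBO):
    print(row)
```

Ordinary relationships such as part_of are ignored, so the output contains only the taxon links, one per line, in the column order of the proposed file format.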

Chris and I are discussing how best to deal with situations where an only_in_taxon link should be made to the conjunction of two taxonomic groups.

30th May

Perl script in progress.

5th June

Perl scripts completed. Checked into cvs.


18th June

Skype call
Participants: Chris Mungall, Jennifer Deegan.

We discussed how to deal with situations where a process is only seen in a specific group of organisms, but where that group can only be expressed as the union of two taxonomic groups. Chris has suggested that, if at all possible, processes should only be connected to one taxon term, and that in many cases this can be done by looking at a child GO term rather than the one that immediately suggests itself. For example:

[i]eye development (this term is hard to connect to taxon, and would require multiple taxa.)
---[i]compound eye development (this term is able to be connected to a single taxon.)

However there are some situations where more than one taxon is needed. For example chloroplast-type photosynthesis happens in organisms with chloroplasts and in cyanobacteria. In order to express this Chris suggests that we should use a union term.

id: Jens_ID_1234
name: Viridiplantae or Bacteria
union_of: NCBITaxon:2 ! Bacteria
union_of: NCBITaxon:33090 ! Viridiplantae
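
One way to read a link to such a union term is that the organism must fall under at least one of the union's members. The Python sketch below illustrates that reading only; how the link would actually be evaluated was not settled at the meeting. The function names and the abbreviated lineages are assumptions.

```python
# The union term from the stanza above, as a set of its member taxa.
UNIONS = {"Jens_ID_1234": {"NCBITaxon:2", "NCBITaxon:33090"}}

# Hypothetical, abbreviated lineages: taxon -> itself plus its ancestors.
LINEAGE = {
    "NCBITaxon:3702": {"NCBITaxon:3702", "NCBITaxon:33090", "NCBITaxon:2759"},  # Arabidopsis thaliana
    "NCBITaxon:1117": {"NCBITaxon:1117", "NCBITaxon:2"},                        # Cyanobacteria
    "NCBITaxon:9606": {"NCBITaxon:9606", "NCBITaxon:40674", "NCBITaxon:2759"},  # Homo sapiens
}


def expand(taxon_id, unions):
    """Return the members of a union term, or the taxon itself if it is not a union."""
    return unions.get(taxon_id, {taxon_id})


def within(taxon_id, organism, lineage, unions):
    """True if `organism` falls under `taxon_id` or under any member of the union."""
    return bool(expand(taxon_id, unions) & lineage[organism])


for organism in LINEAGE:
    print(organism, within("Jens_ID_1234", organism, LINEAGE, UNIONS))
```

With this reading, a never_outside_taxon link to the union term would accept plants and cyanobacteria and reject mammals, while single-taxon links keep working unchanged.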

We didn't manage to work out how to implement this during the meeting but I (Jen) had some thoughts afterwards.

It will be best if we can make all of the links in OBO-Edit so I think that I will add the union terms under a grouping term but not in the taxon slim. (I think I cannot really edit the taxon slim.)

[i]Taxon slim
[i]union terms
---[i]Viridiplantae or Bacteria

but also linked to the taxon slim.

---[u]Viridiplantae or Bacteria

(where u represents the union_of relationship.)

I think it will then be necessary to have these union terms saved out at the bottom of the go-taxon relationship file, as we cannot really make additions to the taxon slim. Once we have made these union terms in OBO-Edit it will be easy enough to make the connections between the GO terms and the union terms.

26th June

Workflow for editing file.

Load these files:


Make the connections between terms and then save out the entire dataset as


Then run the scripts:


These will produce the two files


These can be checked into scratch for the next editing session.

New union terms should be made as children of the union terms node and connected to the terms of which they are unions via the union_of relationship.

16th July

To avoid having to use the Perl scripts every time I save and load I have just saved out all the ontologies into one file and am calling it 'all_files_mid_edit.obo' and checking it into scratch/go-taxon. There are two good ID manager rules in my OBO-Edit config on Windows now but the application does not seem to be able to save these out to the ontology file. I will document them for the user guide as they are quite complicated and this feature was previously undocumented.

Later: The ID Manager work is now fully documented at the bottom of the page on how to work the ID Manager component in the OBO-Edit help guide. Here is a copy of the ID preferences file, which is called idprofiles.xml.

17th July

I have been working on getting the display in OBO-Edit right so that I can start editing with other curators next week.

This is what the relationships look like in the ontology tree editor:

Never outside.png

Below is what the go-taxon links look like in the graphviewer with the reasoner on. Note that not all paths are shown. It is not clear why this is. This is shown when I click on the term 'chloroplast-style photosynthesis' (This is just a term I made up for this work. It is not actually in the GO.)

Never outside graphview.png

Below is how the union relationships are shown in the graphviewer if I click on the union terms 'Viridiplantae and Bacteria'.


The full graph is shown in the graphviz component (below) with or without the reasoner, which is very good as the reasoner is currently slowing things down a lot. (Note: this picture has been updated to show the correct logic. The link from viridiplantae and cyanobacteria to union_terms is no longer in the file. However, if it were there it would be shown. To see the image that shows the same graph as the other images, see the previous version in the history log.)

Full graph graphviz.png

Adding links en masse

I have started experimenting with adding links in bulk. For example I can search for all the terms that have 'sensu endopterygota' in them and give all these a link that says 'only_in' endopterygota. Then it is quick for the fly curators to look through and check that these are right. However I am wondering whether it makes more sense to only label the top term in a branch that is all to be 'only_in' endopterygota and then have the application infer that it should add the links to all the descendants of the labelled terms. If I was going to do this then it would also make sense to find out how to detect redundant links with the reasoner. For example, have the reasoner look for cases where a term and its ancestor both have only_in endopterygota links and offer to remove the links from the descendant term. I'm not sure that we currently have a policy on whether to hard code all the links or leave it to the reasoner to add the inferred links in repair mode.

For now I think I will just hard code them all.
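
The two ideas above (adding a link to every term whose name matches a phrase, and spotting links that repeat one already held by an ancestor) can be sketched in Python as follows. This is not OBO-Edit or reasoner code; the term IDs, names and function names are hypothetical.

```python
def bulk_link(names, phrase, link):
    """Give `link` to every term whose name contains `phrase` (case-insensitive)."""
    wanted = phrase.lower()
    return {term: {link} for term, label in names.items() if wanted in label.lower()}


def ancestors(term, parents):
    """Return every ancestor of `term`, not including the term itself."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def redundant_links(links, parents):
    """List (term, link) pairs where an ancestor already carries the same link."""
    redundant = []
    for term, term_links in links.items():
        inherited = set()
        for ancestor in ancestors(term, parents):
            inherited |= links.get(ancestor, set())
        redundant.extend((term, link) for link in term_links & inherited)
    return sorted(redundant)


# Hypothetical terms: EX:2 and EX:3 are children of EX:1.
NAMES = {
    "EX:1": "example process (sensu Endopterygota)",
    "EX:2": "example subprocess (sensu Endopterygota)",
    "EX:3": "unrelated example process",
}
PARENTS = {"EX:2": ["EX:1"], "EX:3": ["EX:1"]}
LINK = ("only_in", "Endopterygota")

links = bulk_link(NAMES, "sensu endopterygota", LINK)
print(redundant_links(links, PARENTS))
```

Hard coding every link corresponds to keeping everything returned by bulk_link; labelling only the top term of a branch corresponds to dropping the pairs reported by redundant_links.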

July 18th

Possible timeline:

1) Make some links and fully document pipeline.
2) Show to ontology list cc'd to Judy. (tab-delimited version + OBO editing file)
3) Show at managers' call.
4) Discuss plans for use with volunteer testers in Ensembl and InterPro. (one month from now)
5) Test until error or omission detection rate reaches acceptably low level.
6) Publicise. (Aim for November for best case, or January at the latest.)

I have written two perl scripts that will reformat the file of go-taxon links from OBO to Tab-delimited format and back again. They are called and and are available in go/software/utilities/.

I also wrote up the entire workflow in

More links have been made from the old sensu terms. Currently Endopterygota, Insecta, Arthropoda, Diptera, Drosophila (not yet checked).

July 21st

The majority of the sensu terms now have taxon links.

July 23rd

Participants: Susan Tweedie, Becky Foulger, Jennifer Deegan: at EBI.

Worked all afternoon with Becky Foulger and Susan Tweedie to check the links between the old insect sensu terms and the taxon terms. Edits were made live, and Becky made the following notes of action items.

1. Look at def of adult chitin-based cuticle development, and see if it can be made less insect-specific. Term name seems to apply to a wider base than insects (arthropods) but def contains references to insects.

2. Remove the sensu insecta text from the definition of GO:0048085.

3. GO:0030381 widen definition to chorions in arthropods, rather than insects.

4. GO:0007487 remove sensu from the definition.

5. consider changing all descendants of terms with 'imaginal-disc derived' in the term name to contain 'imaginal disc-derived' terminology. Eg. antennal morphogenesis, compound eye morphogenesis.

6. decide where to put 'photoreceptor cell differentiation' and its children. Under Eukaryote lineage?
plants PRs are molecules rather than cells.

7. Heart terms (dorsal vessel) need looking at. Do insects other than flies have a heart proper and an aorta as part of the dorsal vessel?

8. come back to oogenesis terms

9. come back to tracheal system terms.

10. double check dosage compensation terms. Are they limited just to the diptera?

11. GO:0007425. Clarify in def that the 80 cells on each side of the embryo refers to Drosophila, but that the term isn't limited to Drosophila.

12. Are instar larvae in organisms other than insects? Tadpoles? crabs it seems do.
Jen has labelled instar larvae under arthropoda for now, but it might need revisiting.

Plan to meet again on August 1st in Genetics building in Cambridge. 1pm.

24th July

Participants: Jane Lomax, Jennifer Deegan

Worked through many different groups and checked the old sensu terms and their links to taxon. See diff for complete listing as this was a large and diverse piece of work.

1st August

Participants: Becky Foulger, Susan Tweedie and Jennifer Deegan

Where: Met at the Genetics department of the University of Cambridge to continue linking the old Drosophila sensu terms.

Becky made these notes:

Tracheal system

Susan found that velvet worms (Onychophora) have open tracheal systems, so put as Protostomia rather than just Insecta.

Problems in the graph

We need to tweak the definitions of:
regulation of tube size, open tracheal system; GO:0035151
Tracheal tubes undergo highly regulated tube-size increases during development, expanding up to 40 times their initial size by the end of larval life.

In insects, tracheal tubes undergo....(Not sure what this means: Ed)

Same for the children GO:0035158 and GO:0035159.

Old sensu synonyms

Q: what do we do about synonyms? Do we keep the sensu synonyms? Some of the exact synonyms that are sensu insecta may need to be changed to be narrower synonyms.

GO:0035277 : spiracle morphogenesis, open tracheal system
change definition to make it clearer that spiracles aren't just in insects.

chorion-containing eggshell formation ; GO:0007304
ZFIN Doug confirmed that fish have chorion-containing egg envelopes.

Confusing terms

Susan to have a look at these 2 terms:
maternal determination of dorsal/ventral axis, oocyte, germ-line encoded ; GO:0007311
maternal determination of dorsal/ventral axis, oocyte, soma encoded ; GO:0007313
What on earth do they mean ?????????


At the same time, back at the EBI, Midori worked through the Fungal and Saccharomyces terms.

6th August

Midori's feedback after looking through the Fungal and Saccharomyces terms:


I've completed a first pass through all the terms with synonyms containing 'sensu Fungi'. Most of them are done and have the GOC:mtg_taxon def dbxref.


cell aging GO:0007569 - I put cellular organisms; I'm not sure whether anything narrower would be OK (so haven't assigned GOC:mtg_taxon dbxref).
[from Jodi Hirschman: This term was intended to be for broader usage than just Fungi. I worked on the cell aging branch a few years ago with Chandra, and I think the sensu terms may have existed at that point and so were put here as synonyms..]

conjugation with cellular fusion GO:0000747 - Fungi may be correct taxon, but I'm not sure (so no GOC:mtg_taxon dbxref)

These should get same taxon as parent conjugation with cellular fusion GO:0000747

  • adaptation to pheromone during conjugation with cellular fusion GO:0000754
  • cytogamy GO:0000755
  • pheromone-dependent signal transduction during conjugation with cellular fusion GO:0000750
  • response to pheromone during conjugation with cellular fusion GO:0000749

regulation of sexual sporulation GO:0034306 - removed taxon link, but don't know what taxon it should have, other than that it should be the same one assigned to the parent 'sexual sporulation' GO:0034293

secondary cell septum GO:0051077 - Fungi is probably correct taxon, but it won't hurt to check

septin ring GO:0005940 - removed only_in_taxon Fungi because septins are also found in animals, but I don't know for sure that they're not in plants, so haven't used the dbxref

spore germination GO:0009847 - relevant to anything that sporulates; not sure how to convert that to taxa, so just removed Fungi

Let me know if you have questions about my questions, or about what I did.

18th August

I have noticed that germination pore (a fungal term) is a child of pollen wall (a plant term). This needs fixed.

27th August

The managers' group have given permission for the file to be passed to Ensembl for testing.

28th August

I have passed the file to Albert Vilella and he is going to test it on Ensembl data. He points out that his major interest is to distinguish between human/rodent/fish genes and requests that we might look into adding such links.

28th August

I have had a skype call with Michelle and she has requested that I send a list of the Archaea and Bacterial terms to her so she can work offline on collecting the data. Becky and Susan are also awaiting a list for fly terms.

20th September

Spreadsheet received from Michelle Gwinn-Giglio with old prokaryote sensu terms categorized.
Spreadsheet of fly sensu terms is with Susan Tweedie and Becky Foulger.

October 15th

Daniel Barrell and David Binns of GOA are testing the utility of the trigger file for annotation checking.

October 22nd

Results of testing work presented at GO Consortium meeting. Conclusions

11th November

Discussed with Jane and Midori and we think it will be best if we keep the taxon links being processed out of the big file by Perl scripts for the foreseeable future. We will have a lot of changes to file handling with the introduction of cross products in the next year and so it will be best not to try to incorporate taxon link file handling at the same time. Also once cross products are fully organised the systems will be in place to filter out files cleanly and so these same systems can be adapted for the taxon links too.