Taxon IDs and subspecies
This may have been discussed previously, but I could not find guidance.
The general problem(s):
What is the relationship between an annotation and the taxon id assigned to that annotation?
- Does the use of a taxon id assert that the annotation makes a true statement about all subnodes in the taxonomy tree?
- Does the use of a taxon id assert anything about the organism/strain used for the experiment?
The E. coli example:
E. coli illustrates the problem, but it is a general problem for bacterial genomes, if not for all organisms.
NCBI taxonomy uses Taxon ID 562 for the species E. coli. However, NCBI taxonomy also has a large number of additional taxon ids for E. coli strains and families of strains. E. coli strains in this list have a large variety of differences in their genomes (e.g. see this paper). The common core in all E. coli strains is estimated to be about 2000 genes, while the "pangenome" is estimated to be about 18,000-20,000 genes.
This from an expert consulted by Emily expands on the situation:
E. coli has 4 subgroups, A, B1, B2, and D, and the paper confirms these and adds an extra one, E. K12 is in the A group.
If you can identify nodes corresponding to each subgroup, I would think about annotating to these. This is the direction we are likely to move in with Ensembl Genomes - although really it's a call for the E. coli experts.
One idea is that annotations should be made at the most appropriate level, and then GOA should present the appropriate view (i.e. if something is true for E. coli, then it's also true for E. coli K12). The problem is that in attaching an observation to a taxon, an inference is really being performed, i.e. there is a set of measurements on one specific sample, and logically a statement like 'ftsZ regulates cell division in E. coli' is conceptually the same as then using InterPro2GO to project this conclusion to a still wider area of the sequence/taxon space. I don't think it will often be the case that an experimental result is really shown to apply to a specific taxonomic range; rather, the authors will have made a (possibly arbitrary) decision to conclude that this phenomenon exists in this strain, this species or even in a wider grouping. So there is probably no escape from you making your own arbitrary decision about where to annotate.
Note that A, B1, B2, D, and E are not in NCBI taxonomy as of today (3/29/10).
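To make the "true for E. coli, therefore true for E. coli K12" reading concrete, here is a minimal Python sketch of downward projection along the taxonomy tree. It is only an illustration of that one interpretation, not a description of how GOA or any existing tool behaves. The parent map is a hand-written toy covering only taxids named on this page, and the function names lineage and applies_to are made up for the example.

```python
# Toy parent map: child taxid -> parent taxid.
# Only nodes mentioned on this page are included; this is not the full NCBI tree.
PARENT = {
    511145: 83333,  # E. coli str. K-12 substr. MG1655 -> E. coli K-12
    83333: 562,     # E. coli K-12 -> E. coli (species)
    498388: 562,    # E. coli C -> E. coli (species)
}


def lineage(taxid):
    """Return the taxid followed by its ancestors, up to the top of the toy map."""
    path = [taxid]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def applies_to(annotation_taxid, query_taxid):
    """True if an annotation made at annotation_taxid would be projected onto
    query_taxid under the 'true for all subnodes' interpretation."""
    return annotation_taxid in lineage(query_taxid)


print(applies_to(562, 511145))    # species-level annotation, MG1655 query
print(applies_to(511145, 562))    # strain-level annotation, species query
print(applies_to(83333, 498388))  # K-12 annotation, E. coli C query
```

Under this reading an annotation at 562 is projected onto every strain, an annotation at 511145 says nothing about the species as a whole, and an annotation at 83333 does not reach E. coli C. Whether that projection is biologically justified is exactly the open question.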
What does an annotation say about strains?
Because of this genome plasticity, when we make an experimental annotation to a gene from a lab strain, there is a reasonably high probability that the gene is not even present in many other E. coli strains.
- Is this any different from an annotation not applying to a single gene knockout strain?
What does an annotation say about experiments?
NCBI does have several taxon IDs for the strains used as the main E. coli models:
- Escherichia coli K-12 = taxid 83333
- Escherichia coli BW2952
- Escherichia coli LW1655F+
- Escherichia coli NC-7
- Escherichia coli str. K-12 substr. DH10B
- Escherichia coli str. K-12 substr. MG1655 = taxid 511145
- Escherichia coli str. K-12 substr. W3110
- Escherichia coli B = taxid 37761
- Escherichia coli B str. REL606
- Escherichia coli C = taxid 498388
- Escherichia coli ATCC 8739
However, the existing literature annotations do not distinguish which strains were used for a particular experiment, and extracting that information could be difficult in many cases. In general, the strains used are likely to be descendants of these parents, but at least one commonly used lab strain is a hybrid between K-12 and C.
- JH: Use the highest species taxid available (562 for E. coli); rely on other tools to filter the validity of inferences based on presence/absence of genes
- Karp/EcoCyc: Consider the specific question: What taxid should we use when curating GO terms in EcoCyc (E. coli K-12 MG1655)? Our position is the following. EcoCyc is meant to represent information about this specific strain of E. coli. Although it is true that much information in EcoCyc was studied experimentally in other strains, our goal is to represent as accurately as possible what is known about this strain. If information gleaned from another strain turns out to be incorrect with regard to MG1655, we will correct the annotations. We propose to use 511145 as the taxid for all GO terms annotated in EcoCyc.
- Other 1: Use 83333 as a stand-in for "laboratory E. coli"
- Other 2: Get NCBI to add nodes for intermediate groupings A, B1, B2 etc. Figure out whether one of these covers all lab strains.
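As an illustration of the first proposal (annotate at 562 and let other tools filter by gene presence/absence), here is a minimal Python sketch. Everything in it is assumed for the example: the presence/absence table is invented, "geneX" is a placeholder for a non-core gene, and annotations_for_strain is a hypothetical helper rather than an existing tool.

```python
SPECIES_TAXID = 562  # E. coli (species)

# HYPOTHETICAL presence/absence data, keyed by strain taxid.
# "geneX" is a made-up accessory gene; the sets are illustrative only.
GENES_PRESENT = {
    511145: {"ftsZ", "geneX"},  # K-12 substr. MG1655
    498388: {"ftsZ"},           # E. coli C
}

# Annotations as (gene, GO term label, taxid), all made at the species level.
ANNOTATIONS = [
    ("ftsZ", "cell division", SPECIES_TAXID),
    ("geneX", "some process", SPECIES_TAXID),  # placeholder term
]


def annotations_for_strain(annotations, strain_taxid):
    """Keep species-level annotations whose gene is recorded as present
    in the given strain; drop annotations to genes the strain lacks."""
    present = GENES_PRESENT.get(strain_taxid, set())
    return [
        (gene, term, taxid)
        for gene, term, taxid in annotations
        if taxid == SPECIES_TAXID and gene in present
    ]


for strain in (511145, 498388):
    print(strain, annotations_for_strain(ANNOTATIONS, strain))
```

In this sketch the species-level annotation to the accessory gene is retained for MG1655 but dropped for E. coli C. A strain with no presence/absence data gets no annotations at all, which is one possible policy; a real filter would have to decide how to handle unknown strains.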