Reference proteomes files

From GO Wiki

Background

  • Following the Quest for Orthologs meeting in Hinxton in July 2009, a representative group from the orthology algorithm community and from consumers of ortholog prediction data, particularly the GO, agreed to decide upon a set of phylogenetically representative genomes. For each of these genomes, a standard "reference" set of all protein-coding genes would be compiled, and a "canonical" protein sequence would be selected for each gene. Rolf Apweiler at UniProt offered to have his group create and maintain these files; this work is kindly being done by Dan Barrell and Eleanor Stanley.
  • For model organisms in the Reference Genome Project, these gene sets are derived from the gp2protein files generated by each MOD.

Currently available pre-alpha files: not ready for release

  • File:SpeciesList.pdf Master list of over 80 selected species to be included. Not all of these are publicly available yet.
  • documentation and progress
  • ftp directory of files
  • NOTE that for now, these files should contain one entry per gene. We will discuss separately whether to follow up with another file that includes alternative splice forms, etc.

Current status

Human, mouse, rat, chicken and zebrafish proteomes from UniProt are augmented with Ensembl proteins. 51 species are covered, including all Reference Genome species. Plasmodium falciparum will appear in the 3rd release.

Release 3 fixes a bug that produced duplicate entries when UniProt and Ensembl records linked to the same UniProtKB/Swiss-Prot entry (TrEMBL entries were not affected). It also greatly improves the chicken proteome, which is now the best representation available from combining UniProt and Ensembl. A better understanding of the Ensembl pipeline identified a database input (IPI) that was not included in the QfO proteome generation pipeline; with the addition of this extra database identifier, the proteomes are more complete.

Full details of Release 2 are here: http://www.ebi.ac.uk/~dbarrell/qfo/

Release 3 will be ready by the end of March.

Issues and bugs

  • This wiki page should be used to enter issues with the current files that need to be addressed before release, together with an email contact for getting more information about each issue.
    • For MODs, there seems to be a duplication/triplication of database source in the gene ID field, e.g.
      • in the mouse FASTA, the first gene ID field is MGI:MGI:MGI:1918932 (paul.thomas@sri.com)
      • the MGI ID is 'MGI:1918932'. When the ID space 'MGI' is added in the GO files, the combo makes 'MGI:MGI:1918932'. I don't know why the third 'MGI:' is there unless a new ID space 'MGI:' is being added to the FASTA file. (judith.blake@jax.org)
      • Some of the human records have multiple tag names (coming from the gp2protein file according to Eleanor). e.g. Human:UniProtKB:Q69Z06 (david.messina@sbc.su.se)

- fixed
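
To illustrate the repeated ID-space prefix reported above (MGI:MGI:MGI:1918932), here is a minimal Python sketch that collapses leading copies of the same prefix. It is not the fix that was applied to the files. The function name collapse_repeated_prefix and the keep parameter are hypothetical, and because the comments above do not settle whether the intended form is MGI:1918932 or MGI:MGI:1918932, the number of copies to retain is left to the caller.

```python
def collapse_repeated_prefix(gene_id: str, keep: int = 1) -> str:
    """Collapse leading repeats of the first colon-separated token.

    `keep` is how many copies of the prefix to retain (an assumption:
    the wiki comments do not say which form is canonical). The function
    never adds copies that were not already present.
    """
    parts = gene_id.split(":")
    prefix = parts[0]
    # Count leading copies of the prefix, always leaving the final
    # token in place as the local identifier.
    n = 0
    while n < len(parts) - 1 and parts[n] == prefix:
        n += 1
    return ":".join([prefix] * min(keep, n) + parts[n:])


if __name__ == "__main__":
    print(collapse_repeated_prefix("MGI:MGI:MGI:1918932"))
    print(collapse_repeated_prefix("MGI:MGI:MGI:1918932", keep=2))
```

IDs without a repeated prefix, such as UniProtKB:Q69Z06, pass through unchanged.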

    • Some OS (organism) tags are still present, e.g. OS=Mus musculus. (david.messina@sbc.su.se)

- fixed

    • These records have no sequence, just the FASTA header (david.messina@sbc.su.se):
      • UniProtKB/TrEMBL:Q3UQN7|Q3UQN7_MOUSE
      • UniProtKB/TrEMBL:B1ARN8|B1ARN8_MOUSE
      • UniProtKB/TrEMBL:B1ARN7|B1ARN7_MOUSE
      • UniProtKB/Swiss-Prot:Q89UT8|NDVA_BRAJA
      • UniProtKB/TrEMBL:B0XRM7|B0XRM7_ASPFC
      • UniProtKB/TrEMBL:Q870P5|Q870P5_NEUCR
    • In some records, the description string is unlabeled. (david.messina@sbc.su.se) e.g.
UniProtKB/Swiss-Prot:A0AUV4|SMKY_MOUSE Sperm motility kinase Y OS=Mus musculus PE:2 SV=1
when it should be
UniProtKB/Swiss-Prot:A0AUV4|SMKY_MOUSE OS:Mus musculus PE:2 SV:1 Description:Sperm motility kinase Y

- fixed
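
The relabeling requested above (moving the free-text description behind a Description: label and writing every tag with a colon) can be sketched as follows. This is an illustration under stated assumptions, not the code used to generate the files: the function name relabel_header is hypothetical, only the OS, PE and SV tags seen in the example are recognized, and the header is assumed to start with the identifier followed by a space.

```python
import re

# Tags seen in the example header; both "=" and ":" separators occur there.
TAG = re.compile(r"\b(OS|PE|SV)[=:]")


def relabel_header(header: str) -> str:
    """Rewrite 'ID description OS=.. PE:.. SV=..' as
    'ID OS:.. PE:.. SV:.. Description:description'."""
    ident, _, rest = header.partition(" ")
    matches = list(TAG.finditer(rest))
    if not matches:
        return header  # no recognized tags: leave the header alone
    # Anything before the first tag is the unlabeled description.
    description = rest[:matches[0].start()].strip()
    fields = []
    for i, m in enumerate(matches):
        # A tag's value runs up to the start of the next tag (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(rest)
        fields.append(f"{m.group(1)}:{rest[m.end():end].strip()}")
    out = [ident] + fields
    if description:
        out.append(f"Description:{description}")
    return " ".join(out)


if __name__ == "__main__":
    print(relabel_header(
        "UniProtKB/Swiss-Prot:A0AUV4|SMKY_MOUSE "
        "Sperm motility kinase Y OS=Mus musculus PE:2 SV=1"
    ))
```

A header that is already in the target layout comes back unchanged, so the function can be run over a mixed file.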

    • In the files 10090_mus_musculus.fasta, 296543_thalassiosira_pseudonana.fasta, 35128_thalassiosira_pseudonana.fasta, and 9606_homo_sapiens.fasta, there are blank FASTA headers preceding actual non-empty FASTA headers. (ruchira@berkeley.edu)
    • The same sequence appears multiple times in several of the files. 10586 UniProt accessions each appear multiple times among all the files. I have sent a list of these to Dan and Eleanor. (ruchira@berkeley.edu)

- fixed
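
Three of the problems reported above (blank FASTA headers, headers with no sequence, and identifiers that appear more than once) can be detected mechanically. The following Python sketch shows one way to do so; it is not the check that Dan and Eleanor ran. The function name check_fasta is hypothetical, headers are assumed to start with ">", and the first whitespace-delimited token of a header is treated as the record identifier.

```python
from collections import Counter


def check_fasta(lines):
    """Return (blank_header_line_numbers, ids_without_sequence, duplicate_ids).

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    blank_headers = []   # 1-based line numbers of bare ">" lines
    empty_records = []   # identifiers followed by no sequence line
    counts = Counter()   # how often each identifier occurs
    current = None       # identifier of the record being read
    has_seq = False
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current is not None and not has_seq:
                empty_records.append(current)
            tokens = line[1:].split()
            if not tokens:
                blank_headers.append(lineno)
                current = None
            else:
                current = tokens[0]
                counts[current] += 1
            has_seq = False
        elif current is not None:
            has_seq = True
    if current is not None and not has_seq:
        empty_records.append(current)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return blank_headers, empty_records, duplicates
```

To look for duplicates across files, as in the report above, the lines of several files could be chained together (for example with itertools.chain) before being passed in; the reported line numbers then refer to the combined stream.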

    • formatdb does not understand headers with the UniProt identifier included. When these come up in BLAST results, a synthetic identifier like
lcl|1169_QuestForOrthologsNoDuplicates_02
appears instead of the actual identifier. A header like
UniProtKB/TrEMBL:Q6ZTL7|Q6ZTL7_HUMAN cDNA FLJ44537 fis, clone UTERU3005049 OS=Homo sapiens PE:1 SV=1
should instead be either
tr|Q6ZTL7|Q6ZTL7_HUMAN cDNA FLJ44537 fis, clone UTERU3005049 OS=Homo sapiens PE:1 SV=1
or
UniProtKB/TrEMBL:Q6ZTL7 cDNA FLJ44537 fis, clone UTERU3005049 OS=Homo sapiens PE:1 SV=1 (ruchira@berkeley.edu)

- this was a bug - UniProt IDs removed
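
The two alternative header layouts proposed above can be sketched in Python as follows. This is an illustration only; the function names to_uniprot_style and drop_entry_name and the PREFIX table are hypothetical. The sp and tr codes are the database prefixes UniProt uses in its own FASTA headers for Swiss-Prot and TrEMBL. Whether a given formatdb version parses either layout as intended is not verified here.

```python
# Database labels used in these files mapped to UniProt's FASTA prefixes.
PREFIX = {"UniProtKB/Swiss-Prot": "sp", "UniProtKB/TrEMBL": "tr"}


def to_uniprot_style(header: str) -> str:
    """'UniProtKB/TrEMBL:ACC|NAME rest' -> 'tr|ACC|NAME rest'."""
    ident, sep, rest = header.partition(" ")
    db, colon, acc_and_name = ident.partition(":")
    code = PREFIX.get(db.lstrip(">"))
    if not colon or code is None:
        return header  # unrecognized layout: leave untouched
    lead = ">" if db.startswith(">") else ""
    return f"{lead}{code}|{acc_and_name}{sep}{rest}"


def drop_entry_name(header: str) -> str:
    """'UniProtKB/TrEMBL:ACC|NAME rest' -> 'UniProtKB/TrEMBL:ACC rest'."""
    ident, sep, rest = header.partition(" ")
    return f"{ident.partition('|')[0]}{sep}{rest}"


if __name__ == "__main__":
    h = ("UniProtKB/TrEMBL:Q6ZTL7|Q6ZTL7_HUMAN cDNA FLJ44537 fis, "
         "clone UTERU3005049 OS=Homo sapiens PE:1 SV=1")
    print(to_uniprot_style(h))
    print(drop_entry_name(h))
```

The second function corresponds to the resolution noted above, in which the UniProt IDs were removed from the headers.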