Reference Genome December 2010 (Archived)

New release of Reference proteome sets

From Eleanor Stanley, Dec 15, 2010

A new set of reference proteomes is available on our ftp site:


Previous releases are archived in data related folders.

The Reference proteome EBI web page has been updated for the new data:

This data reflects UniProt release 2010_12 and Ensembl 60. From UniProt release 2011_02 (Feb 8th 2011) onwards, these datasets will be produced as part of the UniProt production cycle, so a new set will be available every 4 weeks.

Alan Wilter Sousa da Silva has been working very hard to take over the project since Dan Barrell left. He has climbed a very steep learning curve and achieved a huge amount along the way! All high priority bugs from the previous release have been fixed, leaving a few low priority issues that will be resolved in subsequent releases. If you find new issues/bugs with the data please let me know.

  • 5 new species
  • all redundancies involving UniProt accessions were removed; some redundancies involving only Ensembl gene IDs remain to be fixed
  • removed duplicate gene symbols from FASTA headers
  • removed the OS species and strain tag-value pair from FASTA headers
  • corrected the order of GN and Description in FASTA headers
  • fixed a recurrent issue with duplicate ID spaces
  • fixed descriptions containing colons
  • partially solved the issue of UniProt accessions and/or Ensembl gene IDs that appear in the middle of a FASTA header but have no entry in the gp2protein file
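
The consistency check implied by the last item (identifiers in the FASTA file that have no entry in the gp2protein file, and the reverse) can be sketched roughly as below. This is only an illustration, not the script used to build the release. It assumes, hypothetically, that the accession is the first whitespace-delimited token of each FASTA header, and that the gp2protein file is tab-separated with `!` comment lines and a second column of `;`-separated cross-references prefixed with `UniProtKB:`. The function names are made up for this example.

```python
def fasta_accessions(lines):
    """Collect the first token of every FASTA header line.

    Assumption: that token is the accession (hypothetical header layout).
    """
    accessions = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                accessions.add(tokens[0])
    return accessions


def gp2protein_accessions(lines, prefix="UniProtKB:"):
    """Collect accessions from the second column of a gp2protein-style file.

    Assumption: tab-separated columns, '!' comment lines, and
    ';'-separated cross-references such as 'UniProtKB:P12345'.
    """
    accessions = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        columns = line.split("\t")
        if len(columns) < 2:
            continue  # no protein column on this line
        for xref in columns[1].split(";"):
            xref = xref.strip()
            if xref.startswith(prefix):
                accessions.add(xref[len(prefix):])
    return accessions


def cross_check(fasta_lines, gp2protein_lines):
    """Report accessions present in one file but absent from the other."""
    in_fasta = fasta_accessions(fasta_lines)
    in_gp2protein = gp2protein_accessions(gp2protein_lines)
    return {
        "missing_from_gp2protein": sorted(in_fasta - in_gp2protein),
        "missing_from_fasta": sorted(in_gp2protein - in_fasta),
    }
```

Both functions accept any iterable of lines, so open file handles can be passed directly. Headers that carry the identifier elsewhere would need a different extraction step.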


  • still to fix, besides completing the partial fix above: some UniProt accessions are present in the gp2protein file but missing from the FASTA file

Well, that's all I can remember for now. And of course, besides the known bugs, it is reasonable to expect others that are not yet known. If you spot something, please send it to us; your feedback is invaluable.

Alan Wilter SOUSA da SILVA, UniProt - PANDA, EBI-EMBL