Gp2protein

From GO Wiki
Revision as of 15:25, 6 March 2020 by Pascale (talk | contribs)

For the Reference Genome (RefG) project it is critical that we can build complete protein sequence sets for homology and phylogenetic analysis. In theory there should be a deterministic mapping from the gene ID in column 2 of the GAF, through the gp2protein mappings, to a collection of FASTA files. In practice this is difficult because of some outstanding issues.

The gp2protein files are located in CVS, in the go/gp2protein directory. See the README for more details.

The Reference Genome Annotation Project species are obligated to provide complete protein sets.
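
The intended chain from GAF to gp2protein can be sketched as follows. This is a minimal illustration, not the project's actual tooling. It assumes that comment lines start with "!", that a gp2protein file has two tab-separated columns, and that multiple sequence IDs in the second column are separated by ";". The function names are hypothetical.

```python
def parse_gp2protein(lines):
    """Map each gene product ID (column 1) to its sequence IDs (column 2)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        gene_id, _, seq_field = line.partition("\t")
        # Assumption: several sequence IDs may be joined with ';'
        seq_ids = [s for s in seq_field.split(";") if s]
        mapping.setdefault(gene_id, []).extend(seq_ids)
    return mapping


def gaf_gene_ids(lines):
    """Yield the global ID (column 1 + ':' + column 2) of each GAF line."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        yield f"{cols[0]}:{cols[1]}"


def genes_without_sequence(gaf_lines, gp2protein_lines):
    """Return annotated gene IDs that have no entry in gp2protein."""
    mapping = parse_gp2protein(gp2protein_lines)
    annotated = set(gaf_gene_ids(gaf_lines))
    return sorted(g for g in annotated if not mapping.get(g))
```

Any ID returned by `genes_without_sequence` marks a place where the chain breaks before a FASTA record can be selected.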

Report, Mar 31, 2009

chicken

  • num_entities_annotated: 16333
  • num_genes_with_seqs: 15010
  • Seq_type: RefSeq 9103 60 %
  • Seq_type: UniProtKB 5907 39 %
  • Flag: db_mismatch_in_assocfile: 33169
  • Flag: noseq: 14
  • Flag: db_mismatch_in_gp2protein: 9103


dictybase

  • num_entities_annotated: 7289
  • num_genes_with_seqs: 12295
  • Seq_type: UniProt 11657 94 %
  • Seq_type: NCBI_GP 12295 100 %
  • Flag: noseq: 191


ecocyc

  • num_entities_annotated: 3703
  • num_genes_with_seqs: 4150
  • Seq_type: NULL 1 0 %
  • Seq_type: UniProtKB 4149 99 %
  • Flag: noseq: 20


fb

  • num_entities_annotated: 12524
  • num_genes_with_seqs: 13992
  • Seq_type: protein_id 540 3 %
  • Seq_type: UniprotKB 13451 96 %
  • Seq_type: UniProtKB 4749 33 %
  • Flag: noseq: 293


genedb_spombe

  • num_entities_annotated: 5266
  • num_genes_with_seqs: 4990
  • Seq_type: UniProtKB 4990 100 %
  • Flag: noseq: 44


human

  • num_entities_annotated: 19804
  • num_genes_with_seqs: 40387
  • Seq_type: RefSeq 3361 8 %
  • Seq_type: UniProtKB 37026 91 %
  • Flag: noseq: 476
  • Flag: db_mismatch_in_gp2protein: 3361


mgi

  • num_entities_annotated: 18198
  • num_genes_with_seqs: 29215
  • Seq_type: NCBI 7922 27 %
  • Seq_type: NULL 2980 10 %
  • Seq_type: EMBL 131 0 %
  • Seq_type: UniProtKB 18182 62 %
  • Flag: noseq: 124


rgd

  • num_entities_annotated: 19965
  • num_genes_with_seqs: 27788
  • Seq_type: NCBI_GP 17266 62 %
  • Seq_type: UniProtKB 10522 37 %
  • Flag: db_mismatch_in_assocfile: 13831
  • Flag: noseq: 273


sgd

  • num_entities_annotated: 6348
  • num_genes_with_seqs: 5881
  • Seq_type: NCBI_NP 7 0 %
  • Seq_type: UniProtKB 5874 99 %
  • Flag: noseq: 11


tair

  • num_entities_annotated: 43442
  • num_genes_with_seqs: 59975
  • Seq_type: TAIR 59615 99 %
  • Seq_type: NULL 424 0 %
  • Flag: db_mismatch_in_assocfile: 2496
  • Flag: noseq: 358


wb

  • num_entities_annotated: 17894
  • num_genes_with_seqs: 29790
  • Seq_type: NULL 9648 32 %
  • Seq_type: UniProtKB 20142 67 %
  • Flag: db_mismatch_in_assocfile: 5482
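
Lines of the "Seq_type" kind in the report above could be tallied along these lines. This is a hedged sketch, not the script that produced the report. It assumes each gene is counted once per sequence database prefix, that percentages are taken against num_genes_with_seqs and truncated to an integer (which is consistent with the figures shown, e.g. 9103 of 15010 reported as 60 %), and that an ID without a prefix is labelled NULL. The function name and input shape are hypothetical.

```python
from collections import Counter


def seq_type_report(gene_to_seq_ids):
    """gene_to_seq_ids: dict of gene ID -> list of 'DB:accession' strings."""
    with_seqs = {g: ids for g, ids in gene_to_seq_ids.items() if ids}
    total = len(with_seqs)
    counts = Counter()
    for ids in with_seqs.values():
        # Count each database prefix at most once per gene.
        prefixes = {(sid.split(":", 1)[0] if ":" in sid else "NULL") for sid in ids}
        counts.update(prefixes)
    lines = [f"num_genes_with_seqs: {total}"]
    for prefix, n in sorted(counts.items()):
        pct = n * 100 // total  # truncated, not rounded
        lines.append(f"Seq_type: {prefix} {n} {pct} %")
    return lines
```

Because a gene may carry IDs from more than one database, the percentages need not sum to 100, as in the dictybase entry.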

gp2protein & SwissProt : Apr 1 2009

  • Suzi, Paul met with SwissProt group last week
  • SwissProt/UniProt is interested in making sets available
    • Essentially complete genomes already available in SwissProt:
      • E. coli, yeast, pombe, Arabidopsis, fly, worm, ~human
      • Dicty in progress
      • Mouse (currently ~16K genes covered)
    • Some genomes need inclusion of gene models that currently have no observed mRNA
      • UniProt is working on a pipeline to include Ensembl predicted proteins
      • Mouse, rat, chicken, zebrafish

May 14, 2008: Outstanding Issues

  • gp2protein files from GOC: downloaded 05/14/08
  • Ensembl sequences are from release 49
  • UniProt: release 13
  • Entrez Gene: mapping file downloaded on 05/14/08

The corrected versions (both protein fasta files and gp2protein files) are available from the Panther DB FTP site.

These were the additional steps that were required to build the latest sets:

Human

For human, SwissProt was not complete at the time we generated the file, so we agreed with Emily to use Ensembl to get the list of all human genes, and to use the Ensmart mappings to UniProt to convert to a UniProt identifier whenever possible. For future builds, now that SwissProt is complete, we can use the GOA gp2protein file. -- Paul

Chicken

For chicken, UniProt is very incomplete, so we proposed to Fiona that we could also use Ensembl to get the complete gene list, and Ensmart to map Ensembl proteins to UniProt. However, Fiona preferred Entrez Gene. Entrez Gene maps genes to RefSeq proteins, so we used the mapping file at NCBI to convert these to UniProt identifiers whenever possible. -- Paul
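
The "convert whenever possible" step for chicken might look like the sketch below. It is an illustration only. It assumes one RefSeq protein per gene and assumes that the RefSeq identifier is kept when no UniProt mapping exists, which would match the mix of RefSeq and UniProtKB prefixes reported for chicken. The function name and the dictionary inputs are hypothetical stand-ins for the Entrez Gene and NCBI mapping files.

```python
def to_uniprot_where_possible(gene_to_refseq, refseq_to_uniprot):
    """Return gene -> sequence ID, preferring UniProtKB over RefSeq."""
    result = {}
    for gene, refseq in gene_to_refseq.items():
        uniprot = refseq_to_uniprot.get(refseq)
        if uniprot:
            result[gene] = f"UniProtKB:{uniprot}"
        else:
            # Assumed fallback: keep the RefSeq protein as the sequence ID.
            result[gene] = f"RefSeq:{refseq}"
    return result
```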

Zebrafish

For zebrafish, the gp2protein file only covered about half of the zebrafish gene identifiers. We worked with Doug on how to create a new gp2protein file, which was a multi-step process. The list of genes is from Ensembl, and we used the mapping file at ZFIN to map to ZFIN gene identifiers whenever possible. If the gene was mapped to a ZFIN identifier, we took the UniProt identifier from the ZFIN gp2protein file. Otherwise, we used Ensmart to map to UniProt whenever possible. -- Paul
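
The multi-step zebrafish process described above can be summarised as a two-tier lookup. This is a rough sketch under stated assumptions: the three dictionaries are hypothetical stand-ins for the ZFIN mapping file, the ZFIN gp2protein file and the Ensmart mappings, and a gene with a ZFIN identifier but no entry in the ZFIN gp2protein file is simply left without a sequence ID.

```python
def resolve_zebrafish_gene(ensembl_gene, ensembl_to_zfin, zfin_gp2protein, ensmart_to_uniprot):
    """Return (primary gene ID, UniProt ID or None) for one Ensembl gene."""
    zfin_id = ensembl_to_zfin.get(ensembl_gene)
    if zfin_id is not None:
        # Mapped to ZFIN: take the UniProt ID from the ZFIN gp2protein file.
        return zfin_id, zfin_gp2protein.get(zfin_id)
    # Not mapped to ZFIN: fall back to the Ensmart mapping, if there is one.
    return ensembl_gene, ensmart_to_uniprot.get(ensembl_gene)
```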

Arabidopsis

For Arabidopsis, the gp2protein file listed transcripts, not genes, as the primary identifier. So we talked to Tanya and she sent us a new file that keeps only one entry for each gene (the one with the longest corresponding protein). -- Paul
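
Choosing one entry per gene by longest protein can be sketched as below. This is not the procedure used to build the file Tanya sent, only an illustration of the selection rule. The record layout (gene ID, protein ID, protein length) and the tie-breaking rule (first record seen wins) are assumptions.

```python
def longest_protein_per_gene(records):
    """records: iterable of (gene_id, protein_id, protein_length) tuples."""
    best = {}
    for gene_id, protein_id, length in records:
        current = best.get(gene_id)
        # Keep the first record seen unless a strictly longer protein appears.
        if current is None or length > current[1]:
            best[gene_id] = (protein_id, length)
    return {gene: protein for gene, (protein, _) in best.items()}
```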

Xrefs

The first column of the file should reference a MOD or annotation-contributing database (e.g. FlyBase, UniProtKB). This should be a global ID conforming to GO standards. See the Identifiers page for more details.

The second column should be a global ID referencing a sequence-providing database (e.g. UniProtKB). UniProtKB is the preferred source.
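
A simple check of the two-column rule could look like the following sketch. The regular expression is an assumed shape for a global ID (database prefix, colon, local ID), not an official GO definition, and the ";" separator for multiple sequence IDs is likewise an assumption. The function name is hypothetical.

```python
import re

# Assumed shape of a global ID: a database prefix, a colon, then a local ID.
GLOBAL_ID = re.compile(r"^[A-Za-z][A-Za-z0-9_.-]*:\S+$")


def check_row(line, preferred_db="UniProtKB"):
    """Return a list of problems found in one gp2protein row."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 2:
        return [f"expected 2 tab-separated columns, found {len(cols)}"]
    gene_id, seq_field = cols
    problems = []
    if not GLOBAL_ID.match(gene_id):
        problems.append(f"column 1 is not a DB:ID global ID: {gene_id!r}")
    for seq_id in seq_field.split(";"):
        if not GLOBAL_ID.match(seq_id):
            problems.append(f"column 2 is not a DB:ID global ID: {seq_id!r}")
        elif not seq_id.startswith(preferred_db + ":"):
            problems.append(f"column 2 uses a non-preferred source: {seq_id!r}")
    return problems
```

The prefix comparison is case-sensitive, so a row using "UniprotKB" is reported as a non-preferred source rather than silently accepted.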

MGI IDs update

MGI IDs have been fixed. They are now MGI:MGI:nnnn rather than MGI:nnnn (see Identifiers).
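
For files that still carry the older form, the change amounts to prefixing the identifier once more. A minimal sketch, with a hypothetical function name:

```python
def normalize_mgi_id(identifier):
    """Rewrite MGI:nnnn as MGI:MGI:nnnn; leave everything else untouched."""
    if identifier.startswith("MGI:MGI:"):
        return identifier  # already in the new form
    if identifier.startswith("MGI:"):
        return "MGI:" + identifier
    return identifier
```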

Xrefs status 2008-04-17

We are using a mixed bag of xrefs for proteins:

 gp2protein.PAMGO_Atumefaciens.gz
 NCBI_NP
 
 gp2protein.cgd.gz
 UniProtKB
 
 gp2protein.chicken.gz
 RefSeq
 UniProtKB
 
 gp2protein.dictyBase.gz
 NCBI_GP
 UniProt
 
 gp2protein.fb.gz
 UniProtKB
 UniprotKB
 protein_id
 
 gp2protein.genedb_spombe.gz
 UniProtKB
 
 gp2protein.geneid.gz
 UniProt
 
 gp2protein.gramene.gz
 
 UniProtKB
 
 gp2protein.human.gz
 RefSeq
 UniProtKB
 
 gp2protein.mgi.gz
 NCBI
 SWP
 TR
 
 gp2protein.refseq.gz
 UniProt
 
 gp2protein.rgd.gz
 NCBI_GP
 UniProtKB
 
 gp2protein.sgd.gz
 NCBI_NP
 
 gp2protein.tair.gz
 NCBI_NP
 
 gp2protein.tigr_Aphagocytophilum.gz
 UniProtKB
 
 gp2protein.tigr_Banthracis.gz
 UniProtKB
 
 gp2protein.tigr_Cburnetii.gz
 UniProtKB
 
 gp2protein.tigr_Chydrogenoformans.gz
 UniProtKB
 
 gp2protein.tigr_Cjejuni.gz
 UniProtKB
 
 gp2protein.tigr_Cpsychrerythraea.gz
 UniProtKB
 
 gp2protein.tigr_Dethenogenes.gz
 UniProtKB
 
 gp2protein.tigr_Echaffeensis.gz
 UniProtKB
 
 gp2protein.tigr_Gsulfurreducens.gz
 UniProtKB
 
 gp2protein.tigr_Hneptunium.gz
 UniProtKB
 
 gp2protein.tigr_Lmonocytogenes.gz
 UniProtKB
 
 gp2protein.tigr_Mcapsulatus.gz
 UniProtKB
 
 gp2protein.tigr_Nsennetsu.gz
 UniProtKB
 
 gp2protein.tigr_Pfluorescens.gz
 UniProtKB
 
 gp2protein.tigr_Psyringae.gz
 UniProtKB
 
 gp2protein.tigr_Psyringae_phaseolicola.gz
 UniProtKB
 
 gp2protein.tigr_Soneidensis.gz
 UniProtKB
 
 gp2protein.tigr_Spomeroyi.gz
 UniProtKB
 
 gp2protein.tigr_Vcholerae.gz
 UniProtKB
 
 gp2protein.unigene.gz
 UniProtKB
 
 gp2protein.uniprot.gz
 UniProtKB
 
 gp2protein.wb.gz
 
 UniProtKB
 WB
 
 gp2protein.zfin.gz
 UniProt
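
A listing of this kind can be regenerated with a short script along the lines below. It is a sketch rather than the command that was actually used. It assumes gzip-compressed, tab-separated files with "!" comment lines and ";" between multiple sequence IDs, and it records an ID without a prefix as an empty string.

```python
import gzip


def xref_prefixes(lines):
    """Return the sorted, distinct database prefixes used in column 2."""
    prefixes = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        cols = line.split("\t")
        seq_field = cols[1] if len(cols) > 1 else ""
        for seq_id in seq_field.split(";"):
            prefix = seq_id.split(":", 1)[0] if ":" in seq_id else ""
            prefixes.add(prefix)
    return sorted(prefixes)


def xref_prefixes_in_file(path):
    """Same tally, read from a gzip-compressed gp2protein file."""
    with gzip.open(path, "rt") as handle:
        return xref_prefixes(handle)
```

Differences in spelling or case, such as UniProt, UniprotKB and UniProtKB, show up as separate prefixes, which makes inconsistencies across files easy to spot.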