Gp2protein file

What is a gp2protein file?

A gp2protein file is a tab-delimited file that provides a mapping between database object IDs and protein sequence IDs.

Files contributed by annotation groups are available in the gp2protein directory on the GO Consortium site.

Need for a gp2protein file

  • Used for downloading sequences from UniProt/NCBI. These sequences are used for AmiGO BLAST and for phylogenetic inference (PAINT).
  • The sequence IDs (UniProt or NCBI) can be used in AmiGO to search for annotations.
  • The sequence IDs help with bookkeeping: tracking annotations, removing duplicates, etc.

This file is required from all GO annotation groups, unless they already annotate to the UniProt accession numbers identified in the UniProt Proteome files.

Contents of the File

The readme for the gp2protein file fully specifies the required format, which is also briefly described below.

  • Every group contributing an annotation file should contribute a gp2protein file, unless the annotation group directly annotates to UniProtKB accessions included in the UniProt Reference Proteome files.
  • A model organism database (MOD) should update its gp2protein file with each annotation file release.

Format

The file should contain two tab-separated columns:

  • The first column must contain all protein-encoding gene or protein identifiers available from the contributing annotation database, including those not annotated to GO in the accompanying annotation file.
  • The second column should provide the mapping to the corresponding sequence IDs, which should be UniProtKB accessions. Ideally each database object ID maps to a single reviewed UniProtKB/Swiss-Prot accession; if none exists, UniProtKB/TrEMBL accessions can be used. If no UniProtKB accession is available, an NCBI protein ID can be used (NP_ and XP_ prefixes are allowed).
  • ncRNA IDs should be provided in a separate gp2rna file.
  • Entries with no sequence available, which might represent a classical mutant not yet associated with a genome sequence, should be listed in the gp_unlocalised file.
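
The two-column layout described above can be checked with a short script. The following is a minimal Python sketch, not the GO Consortium's own validator. It makes several assumptions that this page does not state: each row holds one sequence ID, IDs may carry a database prefix such as UniProtKB:, and lines starting with ! are comments. The function names and the EXDB identifiers are hypothetical. The UniProtKB pattern follows the accession format published by UniProt.

```python
import re

# UniProtKB accession format (6 or 10 characters), as published by UniProt.
UNIPROT_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# NCBI protein IDs, restricted to the NP_ and XP_ prefixes named above.
NCBI_RE = re.compile(r"^(?:NP|XP)_\d+(?:\.\d+)?$")


def classify_sequence_id(seq_id):
    """Return 'UniProtKB', 'NCBI', or None for an unrecognised ID."""
    # Assumption: an optional 'DB:' prefix (e.g. 'UniProtKB:P12345') is dropped.
    local_id = seq_id.split(":", 1)[-1]
    if UNIPROT_RE.match(local_id):
        return "UniProtKB"
    if NCBI_RE.match(local_id):
        return "NCBI"
    return None


def parse_gp2protein(lines):
    """Parse lines into (object_id, sequence_id, kind) tuples."""
    records = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Assumption: blank lines and '!' comment lines are skipped.
        if not line.strip() or line.startswith("!"):
            continue
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(
                f"line {number}: expected 2 columns, got {len(columns)}"
            )
        object_id, seq_id = (column.strip() for column in columns)
        records.append((object_id, seq_id, classify_sequence_id(seq_id)))
    return records


# Hypothetical example rows; EXDB is not a real database prefix.
example = [
    "!hypothetical example data\n",
    "EXDB:0001\tUniProtKB:P12345\n",
    "EXDB:0002\tNP_000001.1\n",
]
for record in parse_gp2protein(example):
    print(record)
```

Rows whose third field is None carry an ID that matches neither pattern and are worth checking by hand against the readme before submission.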

gp2protein validation

  • UniProt checks Consortium gp2protein files and supplies a monthly report of any deleted or secondary UniProt accessions included in a group's file, suggesting alternative valid UniProt accessions when possible.
  • All gp2protein files located in the gp2protein directory on the GO Consortium site are checked. The report is emailed to the contact given in the email_report tag of the gene_association.<group>.conf file, located in the GO submission directory.
  • If your group has any concerns regarding this report, please contact: goa@ebi.ac.uk

In addition, the following UniProt files list all current secondary or deleted UniProt accessions:

  • Secondary UniProt accessions
  • Deleted UniProt/Swiss-Prot accessions
  • Deleted UniProt/TrEMBL accessions
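
A group could use these lists to run a local pre-check before the monthly report arrives. The sketch below is one possible approach in Python, not the procedure UniProt itself uses. It assumes the lists have already been loaded into memory: secondary accessions as a dictionary from old accession to suggested replacement, and deleted accessions as a set. The function name, the record layout, and the placeholder accessions (ACC_OLD, ACC_NEW, ACC_GONE, ACC_OK) are all hypothetical.

```python
def check_accessions(mappings, secondary, deleted):
    """Report obsolete accessions found in (object_id, accession) pairs.

    secondary: dict mapping a secondary accession to a suggested replacement.
    deleted:   set of accessions that have been deleted.
    Returns a list of (object_id, accession, status, suggestion) tuples.
    """
    report = []
    for object_id, accession in mappings:
        if accession in deleted:
            # No replacement can be suggested for a deleted accession.
            report.append((object_id, accession, "deleted", None))
        elif accession in secondary:
            report.append(
                (object_id, accession, "secondary", secondary[accession])
            )
    return report


# Placeholder data for illustration only; these are not real accessions.
mappings = [
    ("EXDB:0001", "ACC_OK"),
    ("EXDB:0002", "ACC_OLD"),
    ("EXDB:0003", "ACC_GONE"),
]
secondary = {"ACC_OLD": "ACC_NEW"}
deleted = {"ACC_GONE"}

for entry in check_accessions(mappings, secondary, deleted):
    print(entry)
```

Only rows that need attention appear in the result, so an empty list means no accession in the file is known to be secondary or deleted.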