Instructions for providing FASTA file

From GO Wiki

From Chris:

As discussed in today's refg call, here is a proposal for the new fasta file. I'm open to persuasion on the naming/structure.

The 12 reference genome species will switch to providing fasta files instead of gp2protein files. Non-reference-genome groups can continue to submit gp2protein files.

The database loading pipeline will be extended to allow loading of fasta files. If a fasta file is present for an organism, the gp2protein file will not be used.
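The precedence rule above could be sketched roughly as follows. This is only an illustration, not the actual loading pipeline: the function name `pick_sequence_source`, the `root` checkout directory, and the `gp2protein_name` argument are hypothetical, and the directory layout is assumed to follow the proposal further down this page.

```python
from pathlib import Path
from typing import Optional


def pick_sequence_source(root: Path, taxon_id: int, gp2protein_name: str) -> Optional[Path]:
    """Return the file the loader should read for one organism.

    Assumed layout (taken from this proposal): root/fasta/aaseq.TAXONID.fasta.gz
    and root/gp2protein/<gp2protein_name>. Both names are illustrative.
    """
    fasta = root / "fasta" / f"aaseq.{taxon_id}.fasta.gz"
    if fasta.exists():
        # A fasta file takes precedence; the gp2protein file is ignored.
        return fasta
    gp2protein = root / "gp2protein" / gp2protein_name
    # Fall back to gp2protein only when no fasta file was deposited.
    return gp2protein if gp2protein.exists() else None
```

Returning `None` when neither file exists leaves it to the caller to decide whether a missing sequence source is an error.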

As part of the database release process we currently produce a fasta file of peptides for all annotated genes. This can be customized, e.g. to produce an additional fasta file containing all peptides (annotated + unannotated genes) for refG species.

fasta file naming

There will be one file per species. The file will be named

   aaseq.TAXONID.fasta

where TAXONID is the numeric NCBI taxon ID, e.g. 7227 for Dmel (Drosophila melanogaster).
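A minimal sketch of building and checking such file names is shown below. The helper names `fasta_filename` and `taxon_id_from_filename` are hypothetical, and accepting an optional `.gz` suffix is an assumption based on the compressed names shown in the file location listing.

```python
import re

# aaseq.<digits>.fasta, optionally followed by .gz (assumption: compressed
# copies keep the same base name).
_NAME_RE = re.compile(r"aaseq\.(\d+)\.fasta(\.gz)?")


def fasta_filename(taxon_id: int, gzipped: bool = False) -> str:
    """Build the file name for one species from its numeric taxon ID."""
    name = f"aaseq.{taxon_id}.fasta"
    return name + ".gz" if gzipped else name


def taxon_id_from_filename(name: str) -> int:
    """Extract the taxon ID from a file name, rejecting anything else."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an aaseq fasta file name: {name!r}")
    return int(match.group(1))
```

For example, `fasta_filename(7227)` gives `aaseq.7227.fasta`, and `taxon_id_from_filename("aaseq.10090.fasta.gz")` gives `10090`.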

fasta file location

RefG annotation providers will deposit these files, gzipped, in GO CVS:

   go/
       gp2protein/
           gp2protein.uniprot.gz        <-- partially redundant, but that's OK
           gp2protein.mgi.gz            <-- redundant, but allowed
           ..
       fasta/
           aaseq.7227.fasta.gz
           aaseq.10090.fasta.gz
           ..

sequence content

Each entry contains the amino acid sequence of the longest peptide for an object that is annotated in the gene association file.
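Choosing the longest peptide per annotated object could look something like the sketch below. The input shape (tuples of object ID, sequence ID, sequence) and the function name are assumptions made for illustration. The proposal does not say how ties are broken, so this sketch simply keeps the first peptide seen.

```python
from typing import Dict, Iterable, Tuple


def longest_peptide_per_object(
    peptides: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each object ID to the (seq_id, sequence) of its longest peptide.

    `peptides` yields (obj_id, seq_id, sequence) tuples; this shape is a
    hypothetical one chosen for the example.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for obj_id, seq_id, sequence in peptides:
        current = best.get(obj_id)
        # Strictly longer wins; on a tie the earlier peptide is kept.
        if current is None or len(sequence) > len(current[1]):
            best[obj_id] = (seq_id, sequence)
    return best
```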

header structure

Every fasta header must follow the same structure so that sequences can be matched to records in the database:

   >SEQID OBJID OPTIONAL

All IDs must be of the standard DB:ACCESSION form. Note that a MOD ID must match col1 + ":" + col2 of the gene_association file. This means MGI IDs are of the form MGI:MGI:12345
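As a small illustration of the col1 + ":" + col2 rule, the sketch below derives the object ID from one line of a tab-delimited gene_association file. The function name is hypothetical and the accessions in the usage note are invented.

```python
def obj_id_from_association_line(line: str) -> str:
    """Build the DB:ACCESSION object ID from a gene_association line.

    Column 1 is the database name and column 2 the object ID within it;
    the two are joined with a colon, exactly as the header requires.
    """
    if line.startswith("!"):
        raise ValueError("comment line, not an annotation")
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 2 or not cols[0] or not cols[1]:
        raise ValueError("expected at least two non-empty tab-separated columns")
    return f"{cols[0]}:{cols[1]}"
```

A line beginning `MGI<TAB>MGI:12345<TAB>...` therefore yields `MGI:MGI:12345`, which is why the database prefix appears twice.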

If the SEQID is from UniProt then use an ID of the form UniProt:P12345. If the SEQIDs are MOD-specific, then use these.

OPTIONAL can consist of a list of alternate IDs plus a description in double-quotes.

Examples (accessions invented):

 >UniProt:P12345 FlyBase:FBgn1234567 "blah blah" FlyBase:CG12345
 >ZFIN:ZDB-SEQ:12345 ZFIN:ZDB-GENE-12345

The formal grammar is as follows:

 Header ::= '>' SeqID SPC ObjID (SPC OptionalList)?
 SeqID ::= SeqDB ':' LocalID
 ObjID ::= ObjDB ':' LocalID
 SeqDB ::=  'RefSeq' | ObjDB
 ObjDB ::= 'UniProt' | ModDB
 ModDB ::= 'FlyBase' | 'SGD' | ...
 OptionalList ::= Optional | Optional SPC OptionalList
 Optional ::= '"' AnyText '"' | AnyID
 AnyID ::= AnyDB ':' LocalID
 SPC ::= ' '
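A header parser following this grammar might be sketched as below. It is an illustration rather than the loader's actual code: the names `FastaHeader` and `parse_header` are hypothetical, database names are not checked against a fixed list (the ModDB list above is open-ended), and runs of several spaces are tolerated even though the grammar specifies a single space.

```python
import re
from typing import List, NamedTuple


class FastaHeader(NamedTuple):
    seq_id: str
    obj_id: str
    alt_ids: List[str]
    descriptions: List[str]


# A token is either a double-quoted description or a run of non-space text.
_TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')
# DB:ACCESSION, where the accession may itself contain colons (MGI:MGI:12345).
_ID_RE = re.compile(r'[^\s:"]+:\S+')


def parse_header(line: str) -> FastaHeader:
    """Split a '>SEQID OBJID OPTIONAL' header line into its parts."""
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    ids: List[str] = []
    descriptions: List[str] = []
    for position, match in enumerate(_TOKEN_RE.finditer(line[1:].strip())):
        quoted, bare = match.group(1), match.group(2)
        if quoted is not None:
            if position < 2:
                raise ValueError("SEQID and OBJID must come before any description")
            descriptions.append(quoted)
        elif _ID_RE.fullmatch(bare):
            ids.append(bare)
        else:
            raise ValueError(f"not a DB:ACCESSION identifier: {bare!r}")
    if len(ids) < 2:
        raise ValueError("header needs both a SEQID and an OBJID")
    return FastaHeader(ids[0], ids[1], ids[2:], descriptions)
```

Applied to the first example header, this returns `UniProt:P12345` as the sequence ID, `FlyBase:FBgn1234567` as the object ID, `FlyBase:CG12345` as an alternate ID, and `blah blah` as the description.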
