Instructions for providing FASTA file
As discussed in today's refg call, here is a proposal for the new fasta file. I'm open to persuasion on the naming/structure.
The 12 reference genome species will switch to providing fasta files instead of gp2protein files. Non-reference-genome groups can continue to submit gp2protein files.
The database loading pipeline will be extended to allow loading of fasta files. If a fasta file is present for an organism, the gp2protein file will not be used.
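The precedence rule above can be sketched as follows. This is a minimal illustration, not the actual loader: the function name choose_sequence_source and the idea of passing in a set of available file names are assumptions, and the aaseq.TAXONID.fasta.gz pattern is inferred from the example listing further down this page.

```python
from typing import Optional, Set


def choose_sequence_source(available: Set[str], taxon_id: int,
                           gp2protein_name: str) -> Optional[str]:
    """Return the file a loader would read for one organism.

    Hypothetical helper: if a fasta file is present it takes precedence
    and the gp2protein file is not used.
    """
    fasta_name = f"aaseq.{taxon_id}.fasta.gz"  # name pattern assumed from the listing
    if fasta_name in available:
        return fasta_name
    if gp2protein_name in available:
        return gp2protein_name
    return None  # nothing to load for this organism


# Both files present: the fasta file is chosen.
print(choose_sequence_source({"gp2protein.mgi.gz", "aaseq.10090.fasta.gz"},
                             10090, "gp2protein.mgi.gz"))
# Only gp2protein present: fall back to it.
print(choose_sequence_source({"gp2protein.mgi.gz"}, 10090, "gp2protein.mgi.gz"))
```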
As part of the database release process we currently produce a fasta file of peptides for all annotated genes. This can be customized, e.g. to produce an additional fasta file containing all peptides (annotated + unannotated genes) for refG species.
fasta file naming
There will be one file per species. The file will be named
  aaseq.TAXONID.fasta.gz
where TAXONID is a number; e.g. 7227 for Dmel
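As a rough sketch of the naming convention, the helpers below build a file name from a taxon ID and recover the taxon ID from a file name. The pattern aaseq.TAXONID.fasta.gz is taken from the example listing on this page; the function names are hypothetical.

```python
import re
from typing import Optional

# Pattern assumed from the example names aaseq.7227.fasta.gz and aaseq.10090.fasta.gz
_FASTA_NAME = re.compile(r"^aaseq\.([1-9][0-9]*)\.fasta\.gz$")


def fasta_filename(taxon_id: int) -> str:
    """Build the per-species fasta file name for a numeric taxon ID."""
    if taxon_id <= 0:
        raise ValueError("taxon ID must be a positive number")
    return f"aaseq.{taxon_id}.fasta.gz"


def taxon_from_filename(name: str) -> Optional[int]:
    """Return the taxon ID encoded in a file name, or None if it does not match."""
    match = _FASTA_NAME.match(name)
    return int(match.group(1)) if match else None


print(fasta_filename(7227))                         # aaseq.7227.fasta.gz
print(taxon_from_filename("aaseq.10090.fasta.gz"))  # 10090
print(taxon_from_filename("gp2protein.mgi.gz"))     # None
```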
fasta file location
RefG annotation providers will deposit these in GO CVS:
go/
  gp2protein/
    gp2protein.uniprot.gz   <-- partially redundant, but that's OK
    gp2protein.mgi.gz       <--- redundant, but allowed
    ..
  fasta/
    aaseq.7227.fasta.gz
    aaseq.10090.fasta.gz
    ..
Each record contains the amino acid sequence of the longest peptide for the object that is annotated in the gene association file.
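One way a provider might pick the longest peptide per annotated object is sketched below. The input shape (object ID, sequence ID, sequence triples), the function name, and the tie-breaking rule (first seen wins) are all assumptions made for illustration; the accessions and sequences are invented.

```python
from typing import Dict, Iterable, Tuple


def longest_peptide_per_object(
    peptides: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each object ID to the (sequence ID, sequence) of its longest peptide.

    On a tie in length the peptide seen first is kept (an arbitrary choice).
    """
    best: Dict[str, Tuple[str, str]] = {}
    for obj_id, seq_id, seq in peptides:
        current = best.get(obj_id)
        if current is None or len(seq) > len(current[1]):
            best[obj_id] = (seq_id, seq)
    return best


# Invented records: two peptides for one gene, one for another.
records = [
    ("FlyBase:FBgn1234567", "UniProt:P12345", "MKTAYIAK"),
    ("FlyBase:FBgn1234567", "UniProt:P67890", "MKTAYIAKQRQISFVK"),
    ("FlyBase:FBgn7654321", "UniProt:Q11111", "MSL"),
]
for obj_id, (seq_id, seq) in longest_peptide_per_object(records).items():
    print(obj_id, seq_id, len(seq))
```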
The fasta header lines must follow a consistent structure so that sequences can be matched to records in the database:
>SEQID OBJID OPTIONAL
All IDs must be of the standard DB:ACCESSION form. Note that a MOD ID must match col1 + ":" + col2 in the gene_association file. This means MGI IDs are of the form MGI:MGI:12345.
If the SEQID is from UniProt then use an ID of the form UniProt:P12345. If the SEQIDs are MOD-specific, then use these.
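The col1 + ":" + col2 rule can be illustrated with a short sketch that derives the object ID from one tab-separated gene_association line. The function name is hypothetical, and the sample line is invented and truncated to its first three columns.

```python
def object_id_from_association_line(line: str) -> str:
    """Build the DB:ACCESSION object ID from columns 1 and 2 of a tab-separated line."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 2 or not cols[0] or not cols[1]:
        raise ValueError("expected at least two non-empty tab-separated columns")
    return f"{cols[0]}:{cols[1]}"


# For MGI, column 2 already carries an "MGI:" prefix, hence the doubled prefix.
print(object_id_from_association_line("MGI\tMGI:12345\tSomeSymbol"))   # MGI:MGI:12345
print(object_id_from_association_line("FlyBase\tFBgn1234567\tSomeSymbol"))  # FlyBase:FBgn1234567
```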
OPTIONAL can consist of a list of alternate IDs plus a description in double-quotes.
Examples (accessions invented):
>UniProt:P12345 FlyBase:FBgn1234567 "blah blah" FlyBase:CG12345
The formal grammar is as follows:
  Header       ::= '>' SeqID SPC ObjID SPC OptionalList?
  SeqID        ::= SeqDB ':' LocalID
  SeqDB        ::= 'RefSeq' | ObjDB
  ObjDB        ::= 'UniProt' | ModDB
  ModDB        ::= 'FlyBase' | 'SGD' | ...
  OptionalList ::= Optional | Optional SPC OptionalList
  Optional     ::= '"' AnyText '"' | AnyID
  AnyID        ::= AnyDB ':' LocalID
  SPC          ::= ' '
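A header line following this grammar could be parsed roughly as below. This is a sketch under stated assumptions, not the loader's actual parser: the names FastaHeader and parse_header are hypothetical, ObjID (not defined in the grammar) is assumed to have the same DB:ACCESSION shape as the other IDs, database names are not checked against a fixed list, and whitespace is handled more leniently than the single-space SPC rule.

```python
import re
from typing import List, NamedTuple


class FastaHeader(NamedTuple):
    seq_id: str
    obj_id: str
    alt_ids: List[str]
    descriptions: List[str]


# A token is either a double-quoted string or a run of non-whitespace characters.
_TOKEN = re.compile(r'"[^"]*"|\S+')
# DB:ACCESSION; the accession itself may contain ':' (e.g. MGI:MGI:12345).
_ID = re.compile(r'^[^\s:"]+:\S+$')


def parse_header(line: str) -> FastaHeader:
    """Split a fasta header into SeqID, ObjID, alternate IDs and descriptions."""
    line = line.rstrip("\r\n")
    if not line.startswith(">"):
        raise ValueError("header must start with '>'")
    tokens = _TOKEN.findall(line[1:])
    if len(tokens) < 2:
        raise ValueError("header needs both a SeqID and an ObjID")
    seq_id, obj_id = tokens[0], tokens[1]
    for token in (seq_id, obj_id):
        if not _ID.match(token):
            raise ValueError(f"not a DB:ACCESSION ID: {token!r}")
    alt_ids: List[str] = []
    descriptions: List[str] = []
    for token in tokens[2:]:
        if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
            descriptions.append(token[1:-1])  # strip the surrounding quotes
        elif _ID.match(token):
            alt_ids.append(token)
        else:
            raise ValueError(f"unrecognised optional field: {token!r}")
    return FastaHeader(seq_id, obj_id, alt_ids, descriptions)


header = parse_header('>UniProt:P12345 FlyBase:FBgn1234567 "blah blah" FlyBase:CG12345')
print(header.seq_id, header.obj_id, header.alt_ids, header.descriptions)
```

Keeping the ObjID as a plain string means it can be compared directly with the col1 + ":" + col2 value built from the gene_association file.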