Running P-POD orthology tool on the reference genomes gene set

Input sequences

The current plan is to start with the gp2protein files and use them to generate fasta files. In the future, we will use fasta files produced by the GO loading scripts, once those scripts have been modified to export complete fasta files for the Reference Genome species.

Files were downloaded on Feb. 7 (SGD and TAIR) and Feb. 8 (all others except rat and E. coli):

   * Arabidopsis thaliana: gp2protein.tair.gz
   * Caenorhabditis elegans: gp2protein.wb.gz
   * Danio rerio: gp2protein.zfin.gz
   * Dictyostelium discoideum: gp2protein.dictyBase.gz
   * Drosophila melanogaster: gp2protein.fb.gz
   * Homo sapiens: gp2protein.human.gz
   * Mus musculus: gp2protein.mgi.gz
   * Saccharomyces cerevisiae: gp2protein.sgd.gz
   * Schizosaccharomyces pombe: gp2protein.genedb_spombe.gz
   * Rattus norvegicus: gp2protein.ncbi.rgd.gz (emailed to Kara on Feb. 4)
   * Escherichia coli: fasta file from Jim Hu and Anand Venkatraman
   * Gallus gallus: gp2protein.chicken.gz

All of the input files above and the resulting fasta files can be downloaded here.

Notes:

- The identifiers may come from UniProt, NCBI, or both. Not all databases currently provide complete sets of both, so we need to retrieve sequences from whichever database applies. In the future, it would be great if all data providers supplied both as a service to users. For our purposes, it would also be useful to link to both resources from the web interfaces.

- We have consulted with Ben at SGD, and it seems that we can leverage the scripts that load the GO database from these files to produce the needed fasta files. This would greatly reduce redundant effort, because the existing script does essentially what we need (and much more) and already includes many useful data checks. Ben is currently looking at the code to see what modifications might be necessary. For the first run, we will just parse the fasta file produced here:

/ftp/pub/godatabase/archive/lite/2008-02-03/go_20080203-seqdblite.fasta.gz

Note: this file was incomplete.
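
Because the identifiers may point to either UniProt or NCBI, a first step is to split each gp2protein file by source database. The following is a minimal sketch, not the code actually used. It assumes the files are tab-delimited with a gene product ID in the first column and one or more `;`-separated `PREFIX:accession` cross-references in the second, and that comment lines start with `!`. The function names and the prefixes in the usage example (`UniProtKB`, `NCBI_NP`) are illustrative assumptions.

```python
import gzip
from collections import defaultdict


def parse_gp2protein(lines):
    """Group protein accessions by their database prefix.

    Assumed layout (not confirmed by this page): two tab-separated columns,
    gene product ID then ';'-separated PREFIX:accession cross-references.
    Lines starting with '!' are treated as comments.
    """
    by_source = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < 2 or not fields[1].strip():
            continue  # gene product with no protein mapping
        gene_product = fields[0].strip()
        for xref in fields[1].split(";"):
            xref = xref.strip()
            if ":" not in xref:
                continue  # skip anything that is not PREFIX:accession
            prefix, accession = xref.split(":", 1)
            by_source[prefix].append((gene_product, accession))
    return dict(by_source)


def read_gp2protein(path):
    """Parse a gzipped gp2protein file such as gp2protein.sgd.gz."""
    with gzip.open(path, "rt") as handle:
        return parse_gp2protein(handle)
```

For example, `read_gp2protein("gp2protein.sgd.gz")` would return a dictionary keyed by database prefix, so the UniProt and NCBI accessions can be sent to separate retrieval steps.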

Current plan: Kara will use the gp2protein files above and rewrite the code that retrieves sequences. Sequences were retrieved from NCBI or UniProt as appropriate. This was a slow process because of NCBI's rules about bulk retrieval. John Matese sped things up by getting a local version of the UniProt database working, so retrieving UniProt sequences was faster once that was in place.
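
One common way to stay within bulk-retrieval rules is to request identifiers in small batches with a pause between requests. The sketch below shows only that pattern and is not the retrieval code used here. `fetch_batch` is a hypothetical callable standing in for whatever actually contacts the sequence database, and the default batch size and pause are placeholders, not NCBI's actual limits.

```python
import time


def batched(ids, batch_size):
    """Yield consecutive slices of ids, each at most batch_size long."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def fetch_in_batches(ids, fetch_batch, batch_size=200, pause_seconds=1.0,
                     sleep=time.sleep):
    """Call fetch_batch on each batch, pausing between (not after) batches.

    fetch_batch is a hypothetical function: it takes a list of accessions
    and returns a list of records. batch_size and pause_seconds are
    placeholder values and should be set from the provider's usage policy.
    """
    batches = list(batched(list(ids), batch_size))
    records = []
    for index, batch in enumerate(batches):
        records.extend(fetch_batch(batch))
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return records
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. With a local copy of the database, the same function could be called with `pause_seconds=0`.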

Analysis pipeline

The initial plan is to run all-vs-all BLAST, OrthoMCL, ClustalW, and then PHYLIP, as described here[1]. We will also make the BLAST results available separately. We can also use Jaccard clustering to generate larger families of related sequences, if that is preferable to ortholog identification or useful alongside the ortholog families.
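
To illustrate the Jaccard clustering idea, the sketch below links two sequences when their sets of BLAST hits overlap strongly, then merges linked sequences into families. This is one simple reading of the technique, not the pipeline's actual implementation. The single-linkage merging, the 0.5 threshold, and the input layout (a dictionary mapping each sequence ID to the set of IDs it hits) are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(hits, threshold=0.5):
    """Single-linkage clustering on Jaccard similarity of hit sets.

    hits maps each sequence ID to the set of IDs it matched in BLAST.
    Each sequence is added to its own hit set so that mutual best hits
    share a neighborhood. The threshold is an illustrative assumption.
    """
    neighborhoods = {seq: set(found) | {seq} for seq, found in hits.items()}
    parent = {seq: seq for seq in neighborhoods}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = sorted(neighborhoods)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(neighborhoods[a], neighborhoods[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for seq in ids:
        clusters.setdefault(find(seq), []).append(seq)
    return sorted(clusters.values())
```

The pairwise comparison is quadratic in the number of sequences, so a full multi-genome run would need to compare only sequences that share at least one hit. A lower threshold produces larger, looser families.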

Once we get at least the initial run finished, we will explore alternative methods and combinatorial approaches.