Running P-POD orthology tool on the reference genomes gene set (Retired): Difference between revisions

Revision as of 11:46, 18 January 2008

Input sequences

The current plan is to start with the gp2protein files, which we at Princeton will use to retrieve the actual protein sequences and generate FASTA files. We will use the following files from the GO site:

   * Arabidopsis thaliana: gp2protein.tair.gz
   * Caenorhabditis elegans: gp2protein.wb.gz
   * Danio rerio: gp2protein.zfin.gz
   * Dictyostelium discoideum: gp2protein.dictyBase.gz
   * Drosophila melanogaster: gp2protein.fb.gz
   * Homo sapiens: gp2protein.human.gz
   * Mus musculus: gp2protein.mgi.gz
   * Saccharomyces cerevisiae: gp2protein.sgd.gz
   * Schizosaccharomyces pombe: gp2protein.genedb_spombe.gz
   * Rattus norvegicus: gp2protein file from RGD (pending)
   * Escherichia coli: UniProt file
   * Gallus gallus: UniProt file
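
As a rough illustration of the first step, the sketch below reads a gp2protein file into a mapping from gene product ID to protein sequence IDs. It assumes the usual gp2protein layout (tab-separated, gene product ID in the first column, one or more semicolon-separated protein IDs in the second, comment lines starting with "!"). The function name `parse_gp2protein` and the identifiers in the sample are made up for illustration and are not taken from any of the files listed above.

```python
from collections import defaultdict


def parse_gp2protein(lines):
    """Map each gene product ID to its list of protein sequence IDs.

    Assumes tab-separated lines with the gene product ID in column 1 and
    semicolon-separated protein IDs in column 2 (an assumption about the
    file layout, not something stated in these notes).
    """
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and '!' comment lines.
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no protein ID recorded for this gene product
        gene_product, protein_ids = fields[0], fields[1]
        for pid in protein_ids.split(";"):
            pid = pid.strip()
            if pid:
                mapping[gene_product].append(pid)
    return dict(mapping)


# Hypothetical sample lines; the identifiers are invented.
sample = [
    "!version: example",
    "SGD:S000000001\tUniProtKB:P00001;NCBI_NP:NP_000001",
    "SGD:S000000002\tUniProtKB:P00002",
    "",
]
print(parse_gp2protein(sample))
```

For the real compressed files, the lines could be supplied by `gzip.open(path, "rt")` from the Python standard library.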

Questions still pending:

Will all of these files have both UniProt and NCBI identifiers available? Having both would let us link to both resources from the results.

Related to the above: should we retrieve the sequences from UniProt or NCBI?
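
One way to answer the first question empirically, once the files are in hand, is to count how many gene products carry an identifier from each source. The sketch below assumes the identifiers have already been collected into a dictionary of gene product ID to protein IDs, and that each protein ID has the form `NAMESPACE:accession`. The function name `identifier_coverage`, the namespace labels, and the sample data are hypothetical.

```python
from collections import Counter


def identifier_coverage(mapping):
    """Count gene products that have at least one ID in each namespace.

    The namespace is taken to be the text before the first ':' in each
    protein ID (an assumption about the ID format).
    """
    coverage = Counter()
    for protein_ids in mapping.values():
        # Use a set so a gene product counts once per namespace.
        namespaces = {pid.split(":", 1)[0] for pid in protein_ids}
        coverage.update(namespaces)
    return coverage


# Hypothetical mapping; the identifiers are invented.
example = {
    "SGD:S000000001": ["UniProtKB:P00001", "NCBI_NP:NP_000001"],
    "SGD:S000000002": ["UniProtKB:P00002"],
}
coverage = identifier_coverage(example)
for namespace, count in sorted(coverage.items()):
    print(f"{namespace}: {count} of {len(example)} gene products")
```

A namespace whose count falls well below the total number of gene products would indicate that links to that resource cannot be provided for every entry.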

Analysis pipeline

The initial plan is to run an all-vs-all BLAST, followed by OrthoMCL, ClustalW, and then PHYLIP, as described here [1].
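
The first stage of this plan could look roughly like the sketch below, which only builds the command lines for an all-vs-all protein BLAST and does not run them. It assumes the legacy NCBI BLAST tools (`formatdb` and `blastall`) and a single FASTA file containing the proteins from all organisms. The file names, the function name `blast_all_vs_all_commands`, and the E-value cutoff are illustrative assumptions, not settings taken from the P-POD pipeline. The OrthoMCL, ClustalW, and PHYLIP stages are not sketched here.

```python
def blast_all_vs_all_commands(fasta_path, out_path, evalue="1e-5"):
    """Return the two legacy-BLAST command lines for an all-vs-all search.

    The same FASTA file serves as both database and query, so every
    protein is compared against every other protein.
    """
    # Build a protein BLAST database from the combined FASTA file.
    format_cmd = ["formatdb", "-i", fasta_path, "-p", "T"]
    # Search the file against itself, writing tabular (-m 8) output.
    blast_cmd = [
        "blastall", "-p", "blastp",
        "-d", fasta_path,
        "-i", fasta_path,
        "-e", evalue,  # illustrative cutoff, not a documented setting
        "-m", "8",
        "-o", out_path,
    ]
    return [format_cmd, blast_cmd]


# Hypothetical file names.
for cmd in blast_all_vs_all_commands("all_proteins.fasta", "all_vs_all.tsv"):
    print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` on a machine where the tools are installed.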

[1] http://ortholog.princeton.edu
