Running P-POD orthology tool on the reference genomes gene set (Retired): Difference between revisions






Revision as of 12:05, 18 January 2008

Input sequences

The current plan is to start with the gp2protein files, which we at Princeton will use to retrieve the actual protein sequences and generate FASTA files. We will use the following files from the GO site:

   * Arabidopsis thaliana: gp2protein.tair.gz
    RefSeq (NCBI) identifiers need to be used because coverage for UniProt mappings is not 100%.
   * Caenorhabditis elegans:  gp2protein.wb.gz
   * Danio rerio: gp2protein.zfin.gz
   * Dictyostelium discoideum: gp2protein.dictyBase.gz
   * Drosophila melanogaster: gp2protein.fb.gz
   * Homo sapiens: gp2protein.human.gz
   * Mus musculus: gp2protein.mgi.gz
   * Saccharomyces cerevisiae: gp2protein.sgd.gz
   * Schizosaccharomyces pombe: gp2protein.genedb_spombe.gz
   * Rattus norvegicus: gp2protein file from RGD (pending)
   * Escherichia coli: UniProt file
   * Gallus gallus: UniProt file
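
As a rough illustration of the first step, a gp2protein file can be read into a mapping from gene product identifier to protein identifiers. This is only a sketch under assumptions: it assumes each data line holds a gene product identifier, a tab, and one or more semicolon-separated `DB:accession` protein identifiers, and that lines starting with `!` are comments. The function names, the sample identifiers, and the database prefixes shown are hypothetical and are not taken from the files listed above.

```python
import gzip
import io


def parse_gp2protein(handle):
    """Map gene product id -> list of protein ids from an open text handle.

    Assumed layout (not confirmed by this page): two tab-separated columns,
    semicolon-separated ids in the second column, '!' for comment lines.
    """
    mapping = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # no protein identifier given for this gene product
        gene_id, refs = fields[0], fields[1]
        mapping[gene_id] = [r.strip() for r in refs.split(";") if r.strip()]
    return mapping


def read_gp2protein(path):
    """Open a gzipped gp2protein file (e.g. gp2protein.sgd.gz) and parse it."""
    with gzip.open(path, "rt") as handle:
        return parse_gp2protein(handle)


# Made-up sample data, used only to show the expected shape of the result.
SAMPLE = (
    "! hypothetical example\n"
    "SGD:S000000001\tUniProtKB:P12345;NCBI_NP:NP_000001\n"
    "SGD:S000000002\tUniProtKB:P67890\n"
)
example = parse_gp2protein(io.StringIO(SAMPLE))
```

The resulting identifiers would then be used to fetch sequences from whichever resource is chosen; that retrieval step is not sketched here because the source has not been decided.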

Questions still pending:

Will all of these files have both UniProt and NCBI identifiers available? Having both would allow us to provide useful links to both resources from the results.

Related to the above: should we retrieve the sequences from UniProt or NCBI? Notes on identifiers available for each species are above.
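
One way to answer the identifier question for a given file would be to tally, per database, the fraction of gene products that carry at least one identifier from it. The sketch below assumes the mapping has already been loaded as a dictionary of gene product identifier to a list of `DB:accession` strings; the function name and the prefixes `UniProtKB` and `NCBI_NP` in the example are assumptions made for illustration.

```python
from collections import Counter


def database_coverage(mapping):
    """Return {database prefix: fraction of gene products with an id from it}.

    mapping: gene product id -> list of 'DB:accession' strings (assumed shape).
    """
    counts = Counter()
    for refs in mapping.values():
        # Count each database at most once per gene product.
        for db in {ref.split(":", 1)[0] for ref in refs if ":" in ref}:
            counts[db] += 1
    total = len(mapping)
    if total == 0:
        return {}
    return {db: n / total for db, n in counts.items()}


# Made-up mapping: both entries have a UniProtKB id, only one has an NCBI id.
coverage = database_coverage({
    "SGD:S000000001": ["UniProtKB:P12345", "NCBI_NP:NP_000001"],
    "SGD:S000000002": ["UniProtKB:P67890"],
})
```

A value below 1.0 for a database would indicate that some gene products in that file cannot be linked to it.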

We will download the sequences at the end of January based on whatever files are available and will begin the run then.

Analysis pipeline

The initial plan is to run all-vs-all BLAST, then OrthoMCL, ClustalW, and PHYLIP, as described here[1]. We will also make the BLAST results available separately.

Once at least the initial run is finished, we will explore alternative methods and combinatorial approaches.
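
An all-vs-all comparison needs every species' proteins in a single sequence set, with identifiers that still show which species each sequence came from. The sketch below shows one possible way to prepare that input; the `code|id` header convention, the species codes, and the function names are assumptions for illustration and are not necessarily what the pipeline uses.

```python
def tag_fasta(species_code, fasta_text):
    """Prefix each FASTA header with a species code, e.g. '>sce|YAL001C'."""
    tagged = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Keep only the first token of the header as the sequence id.
            parts = line[1:].split()
            seq_id = parts[0] if parts else "unknown"
            tagged.append(f">{species_code}|{seq_id}")
        elif line.strip():
            tagged.append(line.strip())
    return "\n".join(tagged) + "\n" if tagged else ""


def combine_proteomes(proteomes):
    """proteomes: species code -> FASTA text. Returns one combined FASTA string."""
    return "".join(
        tag_fasta(code, text) for code, text in sorted(proteomes.items())
    )


# Made-up sequences, only to show the shape of the combined file.
combined = combine_proteomes({
    "sce": ">YAL001C some description\nMKV\n",
    "hsa": ">P1\nAAA\n",
})
```

The combined file would serve both as the query set and as the database for the BLAST step, and the species prefix lets later clustering steps tell within-species hits from between-species hits.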