Running P-POD orthology tool on the reference genomes gene set (Retired)

From GO Wiki
Revision as of 12:07, 8 February 2008 by Maria

Input sequences

The current plan is to start with the gp2protein files, which will be used to generate FASTA files. We will use the following files from the GO site; we will download the sequences at the end of January, based on whichever files are available at that point, and begin the run then.

   * Arabidopsis thaliana: gp2protein.tair.gz
    RefSeq (NCBI) identifiers need to be used because the UniProt mappings do not provide 100% coverage.
   * Caenorhabditis elegans: gp2protein.wb.gz
   * Danio rerio: gp2protein.zfin.gz
   * Dictyostelium discoideum: gp2protein.dictyBase.gz
   * Drosophila melanogaster: gp2protein.fb.gz
   * Homo sapiens: gp2protein.human.gz
   * Mus musculus: gp2protein.mgi.gz
   * Saccharomyces cerevisiae: gp2protein.sgd.gz
   * Schizosaccharomyces pombe: gp2protein.genedb_spombe.gz
   * Rattus norvegicus: gp2protein.ncbi.rgd.gz (emailed to Kara on Feb. 4)
   * Escherichia coli: UniProt file
   * Gallus gallus: gp2protein.chicken.gz
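
As a rough illustration of the first step (reading a gp2protein file before any sequences are fetched), the sketch below parses such a file into gene product / sequence identifier pairs. It assumes the usual layout of these files: tab-delimited, gene product identifier in the first column, one or more sequence identifiers in the second column separated by ";", and comment lines starting with "!". That layout and the function name parse_gp2protein are assumptions for illustration, not something this page specifies.

```python
def parse_gp2protein(lines):
    """Return a list of (gene_product_id, [sequence_ids]) pairs.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumed layout (not confirmed by this page): two tab-separated
    columns, ";" between multiple sequence IDs, "!" for comments.
    """
    records = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and "!" comment/header lines.
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            # No sequence identifier column: nothing to retrieve.
            continue
        gene_product_id = fields[0].strip()
        seq_ids = [s.strip() for s in fields[1].split(";") if s.strip()]
        if seq_ids:
            records.append((gene_product_id, seq_ids))
    return records
```

A gzipped file could be read by passing the handle from gzip.open("gp2protein.sgd.gz", "rt") to the function. Lines without a sequence identifier are skipped silently here; a real run would probably want to log them, since they correspond to gene products that will be missing from the FASTA output.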

Notes:

- UniProt and/or NCBI may be the source of the identifiers. Currently, not all databases provide complete sets of both, so we need to retrieve sequences from each database as appropriate. In the future, it would be helpful if all data providers supplied both kinds of identifier as a service to users. For our purposes, it would also be useful to link to both resources from the web interfaces.

- We have consulted with Ben at SGD, and it seems that we can leverage the scripts that load the GO database from these files to produce the needed FASTA files. This would greatly reduce redundant effort, because the existing script does essentially what we need (along with a lot more) and already includes many data checks that would be very useful. Ben is currently looking at the code to see what modifications might be necessary. For the first run, we will just parse the FASTA file produced here:

/ftp/pub/godatabase/archive/lite/2008-02-03/go_20080203-seqdblite.fasta.gz

Note: this file was incomplete.

Current plan: Kara will use the gp2protein files above and rewrite the code to retrieve sequences from the appropriate sequence database for each identifier.
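
Because identifiers may come from either UniProt or NCBI, one way to organize the retrieval is to split the sequence identifiers by their database prefix first, so that each batch can be sent to the matching resource. The sketch below is only an illustration of that idea: the function name group_by_source, the "prefix:accession" convention, and the "unknown" bucket are assumptions, and no actual download is performed.

```python
from collections import defaultdict


def group_by_source(records):
    """Group sequence accessions by database prefix.

    `records` is a list of (gene_product_id, [sequence_ids]) pairs,
    where each sequence ID is assumed to look like "PREFIX:ACCESSION".
    Returns a dict mapping prefix -> sorted list of unique accessions.
    """
    batches = defaultdict(set)
    for _gene_product_id, seq_ids in records:
        for seq_id in seq_ids:
            prefix, sep, accession = seq_id.partition(":")
            if not sep or not accession:
                # No recognizable prefix: keep the raw ID for manual review.
                batches["unknown"].add(seq_id)
            else:
                batches[prefix].add(accession)
    return {source: sorted(accs) for source, accs in batches.items()}
```

Keeping unrecognized identifiers in a separate bucket, rather than dropping them, makes it easier to see which gene products would otherwise be lost from the run.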

Analysis pipeline

The initial plan is to run all-vs-all BLAST, OrthoMCL, ClustalW, and then PHYLIP, as described here[1]. We will also make the BLAST results available separately. We can also do Jaccard clustering to generate larger families of related sequences, if that is preferable to ortholog identification or is useful to have alongside the ortholog families.
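
To make the Jaccard clustering option more concrete, here is a minimal sketch of one common formulation: two sequences are linked when the Jaccard coefficient of their BLAST hit sets meets a threshold, and linked sequences are merged into families by single linkage. This page does not say which implementation or cutoff would be used, so the function names, the input shape, and the 0.5 threshold are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(hits, threshold=0.5):
    """Single-linkage clustering on Jaccard similarity of hit sets.

    `hits` maps each sequence ID to the set of sequence IDs it matched
    in the all-vs-all BLAST (assumed already filtered by E-value).
    Returns a sorted list of clusters, each a sorted list of IDs.
    """
    # Each sequence counts as a member of its own neighbourhood.
    neighbours = {s: set(h) | {s} for s, h in hits.items()}
    parent = {s: s for s in neighbours}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = sorted(neighbours)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(neighbours[a], neighbours[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for s in ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values())
```

The pairwise loop is quadratic in the number of sequences, which is fine for a demonstration but would need to be restricted to pairs that share at least one hit for a full multi-genome run.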

Once at least the initial run is finished, we will explore alternative methods and combinatorial approaches.