Running P-POD orthology tool on the reference genomes gene set (Retired)

Input sequences

The current plan is to start with the gp2protein files, which will be used to generate the FASTA files (a parsing sketch follows the file list below). We will use the following files from the GO site; we will download the sequences at the end of January, based on whatever files are available, and will begin the run then.

   * Arabidopsis thaliana: gp2protein.tair.gz
    RefSeq (NCBI) identifiers need to be used because coverage of the UniProt mappings is not 100%.
   * Caenorhabditis elegans:  gp2protein.wb.gz
   * Danio rerio: gp2protein.zfin.gz
   * Dictyostelium discoideum: gp2protein.dictyBase.gz
   * Drosophila melanogaster: gp2protein.fb.gz
   * Homo sapiens: gp2protein.human.gz
   * Mus musculus: gp2protein.mgi.gz
   * Saccharomyces cerevisiae: gp2protein.sgd.gz
   * Schizosaccharomyces pombe: gp2protein.genedb_spombe.gz
   * Rattus norvegicus:  gp2protein file from RGD (pending)
   * Escherichia coli: UniProt file
   * Gallus gallus: UniProt file
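
As a rough illustration of the preprocessing involved, the sketch below splits a gp2protein file into per-database accession lists that could then be sent to UniProt or NCBI batch retrieval to build the FASTA files. It assumes the standard two-column, tab-separated gp2protein layout (comment lines starting with '!', gene product ID in column 1, semicolon-separated protein IDs in column 2); the file and output names are placeholders, and the real FASTA generation is expected to come from the GO database load scripts mentioned in the notes below.

 #!/usr/bin/env python
 # Sketch only: partition a gp2protein mapping by source database so each
 # accession set can be fetched from the appropriate service.
 import sys
 from collections import defaultdict

 def parse_gp2protein(path):
     """Yield (gene_id, [protein_ids]) pairs from a gp2protein file."""
     with open(path) as handle:
         for line in handle:
             line = line.strip()
             if not line or line.startswith("!"):
                 continue  # skip comment and blank lines
             fields = line.split("\t")
             if len(fields) < 2:
                 continue  # no protein mapping for this gene product
             gene_id, protein_field = fields[0], fields[1]
             yield gene_id, [p.strip() for p in protein_field.split(";") if p.strip()]

 def split_by_source(path):
     """Group protein accessions by their database prefix (e.g. UniProtKB, NCBI)."""
     by_source = defaultdict(set)
     for _gene, proteins in parse_gp2protein(path):
         for prot in proteins:
             if ":" in prot:
                 prefix, accession = prot.split(":", 1)
             else:
                 prefix, accession = "UNKNOWN", prot
             by_source[prefix].add(accession)
     return by_source

 if __name__ == "__main__":
     for source, accessions in sorted(split_by_source(sys.argv[1]).items()):
         out = "%s.accessions.txt" % source.replace("/", "_")
         with open(out, "w") as handle:
             handle.write("\n".join(sorted(accessions)) + "\n")
         print("%s: %d accessions -> %s" % (source, len(accessions), out))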

Notes:

- UniProt and/or NCBI might be the source of the identifiers. Currently, not all databases provide complete sets of both, so we need to retrieve from both databases as appropriate. In the future, it would be great if all the data providers could provide both as a service to users. For our purposes, it would be useful to provide links to both resources from the web interfaces.

- We have consulted with Ben at SGD, and it seems that we can leverage the scripts that load the GO database from these files to produce the needed FASTA files. This would greatly reduce redundant effort, because the existing script does essentially what we need (along with a lot more) and already includes many useful data checks. Ben is currently looking at the code to see what modifications might be necessary.

Analysis pipeline

The initial plan is to run an all-vs.-all BLAST, then OrthoMCL, ClustalW, and finally PHYLIP, as described here [1]. We will also make the BLAST results available separately.
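
For orientation, the sketch below strings these stages together as shell commands driven from Python. The option names refer to the legacy NCBI BLAST (formatdb/blastall) and ClustalW command lines and should be checked against the versions actually installed; the file names, database name, and E-value cutoff are placeholders rather than the project's chosen settings, and the OrthoMCL and PHYLIP steps are only indicated, since each is driven by its own scripts or menus.

 #!/usr/bin/env python
 # Sketch only: the planned pipeline stages, with placeholder names.
 import subprocess

 FASTA = "all_reference_proteomes.fasta"   # combined FASTA file (placeholder)
 DB = "refg"                               # BLAST database name (placeholder)

 def run(cmd):
     print("running: " + " ".join(cmd))
     subprocess.check_call(cmd)

 # 1. Format the combined proteome set as a protein BLAST database.
 run(["formatdb", "-i", FASTA, "-p", "T", "-n", DB])

 # 2. All-vs.-all BLASTP; the tabular output (-m 8) is also what would be
 #    published separately.  The E-value cutoff here is a placeholder.
 run(["blastall", "-p", "blastp", "-d", DB, "-i", FASTA,
      "-e", "1e-5", "-m", "8", "-o", "all_vs_all.m8"])

 # 3. OrthoMCL clustering of the BLAST results: run via the OrthoMCL
 #    driver script according to its own documentation (not shown here).

 # 4. For each resulting cluster, align the members with ClustalW and
 #    write a PHYLIP-format alignment for tree building.
 run(["clustalw", "-INFILE=cluster_0001.fasta", "-ALIGN",
      "-OUTPUT=PHYLIP", "-OUTFILE=cluster_0001.phy"])

 # 5. PHYLIP tree inference (e.g. protdist followed by neighbor) is
 #    menu-driven; it is usually scripted by piping the expected menu
 #    responses to each program's standard input (not shown here).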

Once at least the initial run is finished, we will explore alternative methods and combinatorial approaches.