Running P-POD orthology tool on the reference genomes gene set (Retired): Difference between revisions

(13 intermediate revisions by 4 users not shown)
[[Category:PAINT Archived]]
'''Input sequences'''


The current plan is to start with the gp2protein files, which we at Princeton will use to retrieve the actual protein sequences and generate fasta files.  We will use the following files from the GO site:
The current plan is to start with the gp2protein files, which will be used to generate fasta files.  In the future, we will use fasta files produced by the GO loading scripts after they are modified to export complete fasta files for the Ref. Genome species.
 
Files downloaded on Feb. 7 (SGD and TAIR) and 8 (the rest, except for rat and E. coli):


     * Arabidopsis thaliana: gp2protein.tair.gz
     * Caenorhabditis elegans:  gp2protein.wb.gz
     * Danio rerio: gp2protein.zfin.gz
     * Dictyostelium discoideum: gp2protein.dictyBase.gz
     * Drosophila melanogaster: gp2protein.fb.gz
     * Homo sapiens: gp2protein.human.gz
     * Mus musculus: gp2protein.mgi.gz
     * Saccharomyces cerevisiae: gp2protein.sgd.gz
     * Schizosaccharomyces pombe: gp2protein.genedb_spombe.gz
    * Rattus norvegicus:  gp2protein file from RGD (pending)
    * Escherichia coli:  Uniprot file
    * Gallus gallus:  Uniprot file


Questions still pending:
    * Rattus norvegicus: gp2protein.ncbi.rgd.gz (emailed to Kara on Feb. 4)
 
    * Escherichia coli:  fasta file from Jim Hu and Anand Venkatraman
 
    * Gallus gallus:  gp2protein.chicken.gz
 
All of the input files above and the resulting fasta files can be downloaded [ftp://gen-ftp.princeton.edu/ppod/go_ref_genome/ here].
 
Notes: 
 
- Uniprot and/or NCBI might be the source of the identifiers.  Currently, not all databases provide complete sets of both, so we need to retrieve from both databases as appropriate.  In the future, it would be great if all the data providers could provide both as a service to users.  For our purposes, it would be useful to provide links to both resources from the web interfaces.
 
- We have consulted with Ben at SGD, and it seems as though we can leverage the scripts that load the GO database from these files to produce the needed fasta files.  This would greatly reduce redundant effort, because the existing script does essentially what we need (along with a lot more) and already has lots of data checks and such that would be very useful.  Ben is currently looking at the code to see what modifications might be necessary.  For the first run, we will just parse the fasta files produced here:
 
/ftp/pub/godatabase/archive/lite/2008-02-03/go_20080203-seqdblite.fasta.gz


Will all of these files have both Uniprot and NCBI identifiers available?  This would be great because it would allow us to provide useful links to both resources from the results.
Note: this file was incomplete.


Related to the above:  should we retrieve the sequences from Uniprot or NCBI?
Current plan: Kara will use the gp2protein files above and re-write code to retrieve from the sequence databases as appropriate.  Sequences were retrieved from NCBI or Uniprot as appropriate.  Note that this is a slow process because of rules about bulk retrieval at NCBI.  John Matese sped things up by getting a local version of the Uniprot database working, so retrieving those sequences, once that was implemented, was faster.


'''Analysis pipeline'''


The initial plan is to do all v. all BLAST, OrthoMCL, clustalW, then PHYLIP, as described here[http://ortholog.princeton.edu].  We will also make the BLAST results available separately.
The initial plan is to do all v. all BLAST, OrthoMCL, clustalW, then PHYLIP, as described here[http://ortholog.princeton.edu].  We will also make the BLAST results available separately.  We can also do Jaccard Clustering to generate larger families of related sequences, if that is preferable to ortholog identification and/or is useful to have in conjunction with the ortholog families.


Once we get at least the initial run finished, we will explore alternative methods and combinatorial approaches.

Latest revision as of 11:21, 12 April 2019


Input sequences

The current plan is to start with the gp2protein files, which will be used to generate FASTA files. In the future, we will use FASTA files produced by the GO loading scripts, once those scripts are modified to export complete FASTA files for the Reference Genome species.

Files were downloaded on Feb. 7 (SGD and TAIR) and Feb. 8 (all others, except for rat and E. coli):

   * Arabidopsis thaliana: gp2protein.tair.gz
   * Caenorhabditis elegans: gp2protein.wb.gz
   * Danio rerio: gp2protein.zfin.gz
   * Dictyostelium discoideum: gp2protein.dictyBase.gz
   * Drosophila melanogaster: gp2protein.fb.gz
   * Homo sapiens: gp2protein.human.gz
   * Mus musculus: gp2protein.mgi.gz
   * Saccharomyces cerevisiae: gp2protein.sgd.gz
   * Schizosaccharomyces pombe: gp2protein.genedb_spombe.gz
   * Rattus norvegicus: gp2protein.ncbi.rgd.gz (emailed to Kara on Feb. 4)
   * Escherichia coli: fasta file from Jim Hu and Anand Venkatraman
   * Gallus gallus: gp2protein.chicken.gz

All of the input files above and the resulting FASTA files can be downloaded from ftp://gen-ftp.princeton.edu/ppod/go_ref_genome/.
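As an illustration of how the gp2protein files listed above might be read, here is a minimal Python sketch. It assumes the usual gp2protein layout: comment lines starting with "!", then two tab-separated columns, the gene product ID and one or more protein IDs separated by ";", each with a database prefix. The function name and the sample prefixes are ours, for illustration only; this is not the code used at Princeton.

```python
from collections import defaultdict


def parse_gp2protein(lines):
    """Map each gene product ID to its protein accessions, grouped by prefix.

    Assumed layout (not verified against every provider's file):
    '!' comment lines, then <gene product ID> TAB <ID>[;<ID>...].
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # gene product with no protein identifier
        gene_id, protein_field = fields[0], fields[1]
        by_source = defaultdict(list)
        for protein_id in protein_field.split(";"):
            protein_id = protein_id.strip()
            if not protein_id:
                continue
            prefix, sep, accession = protein_id.partition(":")
            if sep:
                by_source[prefix].append(accession)
            else:
                # No database prefix: keep the raw value for manual review.
                by_source["unknown"].append(protein_id)
        mapping[gene_id] = dict(by_source)
    return mapping
```

Grouping accessions by prefix makes it easy to see which entries would have to be looked up in which sequence database.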

Notes:

- UniProt and/or NCBI might be the source of the identifiers. Currently, not all databases provide complete sets of both, so we need to retrieve from both databases as appropriate. In the future, it would be great if all the data providers could provide both as a service to users. For our purposes, it would be useful to link to both resources from the web interfaces.

- We have consulted with Ben at SGD, and it appears that we can leverage the scripts that load the GO database from these files to produce the needed FASTA files. This would greatly reduce redundant effort, because the existing script already does essentially what we need (along with a lot more) and includes many useful data checks. Ben is currently looking at the code to see what modifications might be necessary. For the first run, we will just parse the FASTA file produced here:

/ftp/pub/godatabase/archive/lite/2008-02-03/go_20080203-seqdblite.fasta.gz

Note: this file was incomplete.
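Parsing a FASTA file such as the one above can be sketched in a few lines of Python. This is a generic reader, not the GO loading script or the P-POD code, and the function names are hypothetical.

```python
import gzip


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None and line:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def read_fasta_gz(path):
    """Same as read_fasta, but for a gzip-compressed file on disk."""
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)
```

Counting the records returned and comparing the IDs against the gp2protein files is one simple way to notice when a FASTA file is missing entries.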

Current plan: Kara will use the gp2protein files above and rewrite the code to retrieve sequences from the sequence databases as appropriate. Sequences were retrieved from NCBI or UniProt, depending on the identifier. Note that this is a slow process because of NCBI's rules about bulk retrieval. John Matese sped things up by getting a local version of the UniProt database working; once that was in place, retrieving those sequences was faster.
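To show why bulk retrieval is slow, here is a hedged Python sketch of batched, throttled fetching. It builds request URLs for NCBI's E-utilities efetch endpoint, which is one public way to get protein FASTA records; this page does not say which interface was actually used. The batch size and delay are placeholder values, not NCBI's stated limits, and the fetch function is passed in so that the sketch makes no network calls on its own.

```python
import time
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(ids, size):
    """Split a list of accessions into consecutive chunks of at most `size`."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def efetch_url(accessions):
    """Build an efetch URL requesting protein records as FASTA text."""
    query = urlencode({
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH_URL}?{query}"


def fetch_all(accessions, fetch, batch_size=200, delay_seconds=1.0,
              sleep=time.sleep):
    """Call fetch(url) once per batch, pausing between requests.

    batch_size and delay_seconds are illustrative assumptions; check the
    provider's current usage guidelines before choosing real values.
    """
    results = []
    for index, batch in enumerate(batches(accessions, batch_size)):
        if index > 0:
            sleep(delay_seconds)  # throttle to respect bulk-retrieval rules
        results.append(fetch(efetch_url(batch)))
    return results
```

With a pause between every request, total time grows with the number of batches, which is why a local copy of a sequence database is so much faster.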

Analysis pipeline

The initial plan is to run all-vs-all BLAST, OrthoMCL, ClustalW, and then PHYLIP, as described at http://ortholog.princeton.edu. We will also make the BLAST results available separately. We can also do Jaccard clustering to generate larger families of related sequences, if that is preferable to ortholog identification and/or is useful to have in conjunction with the ortholog families.
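For readers unfamiliar with Jaccard clustering, the idea can be sketched as follows. Two sequences are linked when the sets of sequences they hit in the all-vs-all BLAST overlap strongly, and linked sequences are merged into families. This page does not describe how P-POD implements it, so the single-linkage merging, the threshold of 0.5, and all names below are assumptions for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(hits, threshold=0.5):
    """Group sequences whose BLAST neighborhoods overlap.

    hits maps each sequence ID to the set of IDs it hits. The threshold
    is a hypothetical value; clusters are formed by single linkage.
    """
    # Each sequence counts as a member of its own neighborhood.
    neighbors = {seq: set(found) | {seq} for seq, found in hits.items()}
    parent = {seq: seq for seq in neighbors}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(neighbors, 2):
        if jaccard(neighbors[a], neighbors[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for seq in neighbors:
        clusters.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in clusters.values())
```

Lowering the threshold merges more sequences into each family, which is how this approach can produce larger groups than strict ortholog identification.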

Once at least the initial run is finished, we will explore alternative methods and combinatorial approaches.