RefGenome11Mar08 Phone Conference (Archived)


Tuesday March 11, 1 PM CDT, 11 AM PDT, 7 PM GMT

Present

Suzi FlyBase
Michael FlyBase
Susan FlyBase
Chris NCBO
Kara Princeton
Rex dictyBase
Pascale dictyBase
Petra dictyBase
Siddhartha dictyBase
Fiona AgBase
Victoria RGD
Tanya TAIR
Dong Hui TAIR
Doug ZFIN
Emily EBI
Judy MGI
Mary MGI
David MGI
Stacia SGD
Kimberly WormBase
Ranjana WormBase

Orthology determination

  • Kara: update:

We launched the all-vs.-all BLAST on Feb. 18. I generated fasta files based on the gp2protein files that everyone provided. I saved everything and put it on an ftp site here:

ftp://gen-ftp.princeton.edu/ppod/go_ref_genome/

with 3 subdirectories:

(1) gp2protein: contains the gp2protein files used to generate the protein fasta files for the analysis

(2) error: contains IDs from the gp2protein files that could not be retrieved from NCBI or UniProt

  • Issue: many proteins were identified by their secondary IDs (especially human!)

[ACTION ITEM]: Chris/Emily: figure out secondary IDs. Maybe a script can be generated to map IDs?

Chris: The GO DB sequence load handles references to secondary UniProtKB IDs in the gp2protein file, which means that the GO DB fasta exports should be complete in this respect.

If useful, it should be possible for us to produce correct gp2protein files as part of the report process during GO DB loads.

In addition, Dan provides a mapping file: ftp://ftp.ebi.ac.uk/pub/contrib/dbarrell/uniprot_secondary_to_primary.dat


(3) fasta: contains the fasta files generated from the gp2protein files
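
The secondary-ID issue noted under (2) could be handled by a small remapping script along the following lines. This is only a sketch, not the GO DB loader's method: it assumes the mapping file holds one tab-separated `secondary<TAB>primary` pair per line, and that gp2protein lines are `MOD_ID<TAB>DB:accession[;DB:accession...]` with `!` marking comments. The function names and the `UniProtKB:` prefix default are illustrative.

```python
def load_secondary_map(lines):
    """Build {secondary_accession: primary_accession} (assumed two-column layout)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        secondary, _, rest = line.partition("\t")
        primary = rest.split("\t")[0].strip()
        if primary:
            mapping[secondary] = primary
    return mapping


def remap_gp2protein(lines, mapping, prefix="UniProtKB:"):
    """Return gp2protein lines with secondary accessions swapped for primary ones."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            out.append(line)  # keep comments and blank lines untouched
            continue
        mod_id, _, protein_field = line.partition("\t")
        fixed = []
        for pid in protein_field.split(";"):
            if pid.startswith(prefix):
                acc = pid[len(prefix):]
                # Unknown accessions are assumed to be primary already.
                pid = prefix + mapping.get(acc, acc)
            fixed.append(pid)
        out.append(mod_id + "\t" + ";".join(fixed))
    return out
```

IDs from other databases (for example NCBI accessions) pass through unchanged, so the output can be diffed directly against the submitted gp2protein file.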

  • The BLAST is just about done (a bit ahead of schedule!), and the next step is to start OrthoMCL.
  • Rough time line, depending on cluster usage:

- We'll be able to view the OrthoMCL families in simple list form (query by a gene name, get a list of orthologous genes back) in about two weeks.

- In a month, we will have phylogenetic trees available, and other handy info, as shown in our current, production version of the web interface:

http://ortholog.princeton.edu/findorthofamily.html

Example query result: http://ortholog.princeton.edu/cgi-bin/family.cgi?geneName=FPR1&organism=S_cerevisiae&pipeline=taxman

For this first run, the same basic features will be available for the Ref. Genome results. The plan is to send the results around to everyone, collect feedback and suggestions, and go from there.

[ACTION ITEM]: Kara, Chris, Mary: Links will be generated to MODs using the dbxref file.


Comments

  • Doug: ZFIN – UniProt IDs go out of sync with the ZFIN IDs over time, and some just disappear. Not sure what the problem is, but he wanted other groups to be aware of that.

Curation tool update

  • Chris: making progress, nothing to demo yet. Should have something by the next meeting.

  • David, Doug and Pascale have laid out requirements in detail
  • Rex - would it be useful to share the requirements more broadly?
  • Suzi - the real opportunity will be at the Ref Genome Meeting
  • Pascale - the requirements are rather basic; check out the PowerPoint files
  • Siddhartha - will send out the file or a URL (done by email; also on the wiki):

http://wiki.geneontology.org/index.php/Image:Refgene_Database_V3.ppt

Annotation Quality Control

email follow up


Judy: 4a. Missing ortholog completeness: sometimes people look at the phylogenetic tree as a sanity check, looking for things that are anomalous. This will help with prioritizing and sorting lists.

Sometimes ortholog calls are a bit odd; in one case it turned out the InParanoid ortholog was based on a domain. We should be able to add and remove orthologs as necessary, but such changes must be tracked so they are not reloaded on the next build.

The curator needs to exclude the gene from the annotation set and tag this information in a tracker. Kara would prefer to have it more downstream.

4b. Outliers: Susan Tweedie didn't see any, just a paucity of information in different species.

4c. ISS: in most cases, ISS transfers are not being done. Big discussion here: generating breadth for the Reference Genome project is very important. We need to require that, once experimental annotation is complete, ISS annotations be added as needed for those genes that have no experimental data.

4d. Ontology anomalies...

Requirement for a report in the web interface: have a quick link to the genes in 'my' species that have no experimental evidence where all the genes in the ortholog set have been marked as 'complete'.

Question of how we handle ISS when there is no other data: we need to refine this. Need clarity and explicitness.

AT THE END OF THE DAY: Add ISS for those genes with no experimental literature. These are to be added by curation.


Doug: My thoughts on what we would want for a report of genes that should be checked for potential ISS annotation.

A link/button on the Ref. Genome Interface for "My genes needing ISS".

This link would launch a report listing all the genes in my organism where:

1. The 'vast majority' (what that is TBD; maybe a parameter curators can set before running the report?) of the genes in the homology set are marked as comprehensively curated, where 'comprehensively curated' means annotated comprehensively for experimentally supported annotations.

2. My gene(s) in the homology set lack experimental annotation in 1 or more GO aspects. (Even if my gene DOES have experimental annotation, there may still be useful alternative experimentally supported annotations in other species that could be applied to my gene by ISS, so this requirement may be dubious?)

Other requirements: A way to mark and datestamp that we have completed the transfer of experimental data by ISS as of a certain date, so the gene no longer shows up in the report. ISS-complete genes may need later review as new experimental annotations are added to the genes in the homology set. This brings up the issue of how to keep all this current. As time goes on we will have an increasingly large re-annotation burden in order to keep things current across all the Ref. Genome genes. It is already impractical from a curation-time point of view, which I suspect is a major contributor to the lack of ISS annotations being completed. Automatic transfer of ISS annotations is the only solution to this, but this is fraught with complexity, as discussed at today's phone conference.
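
Doug's report criteria could be prototyped roughly as below. This is a sketch under several assumptions rather than a design for the curation tool: the `Gene` record, its field names, and the default threshold of 0.8 for the 'vast majority' are all hypothetical, and the fraction is computed over the genes from other species in the homology set. A gene with an ISS-completion datestamp is dropped from the report, as requested under "Other requirements".

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

ASPECTS = ("F", "P", "C")  # molecular function, biological process, cellular component


@dataclass
class Gene:
    gene_id: str
    species: str
    comprehensively_curated: bool = False
    experimental_aspects: set = field(default_factory=set)
    iss_completed_on: Optional[date] = None  # datestamp set once ISS transfer is done


def genes_needing_iss(homology_sets, my_species, min_fraction=0.8):
    """List (gene_id, missing_aspects) for my genes that may need ISS annotation."""
    report = []
    for members in homology_sets:
        others = [g for g in members if g.species != my_species]
        if not others:
            continue
        curated = sum(g.comprehensively_curated for g in others)
        # Criterion 1: enough of the set is comprehensively curated.
        if curated / len(others) < min_fraction:
            continue
        for g in members:
            if g.species != my_species or g.iss_completed_on is not None:
                continue
            # Criterion 2: my gene lacks experimental annotation in some aspect.
            missing = [a for a in ASPECTS if a not in g.experimental_aspects]
            if missing:
                report.append((g.gene_id, missing))
    return report
```

Exposing `min_fraction` as a parameter matches the suggestion that curators set the threshold before running the report. Re-reviewing ISS-complete genes when new experimental annotations arrive would need a comparison of the datestamp against annotation dates, which is not attempted here.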


David: Doug,

This is great. We would also like to have something in place that removes ISS annotation if original annotations are removed. I was thinking after the call, that a good start to this would be to have a table similar to the ones at the very bottom of Mary's graph page where we could mark the boxes where we wanted to create annotations. Of course we'd have to figure out how to get these annotations easily into our databases. Even just a report we could print out with the pertinent data would be a start. It seems to me that the big picture that Mary's stuff gives is the best way to make a reasonable judgment about whether we would want to make an ISS.
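
As a starting point for David's request, a check for ISS annotations whose supporting annotation has disappeared might look like the sketch below. It assumes each annotation is a `(gene_id, go_term, evidence_code, with_id)` tuple, where `with_id` names the gene the ISS was transferred from, and it only recognizes support for exactly the same GO term. A real check would also have to consider ISS annotations made to a less specific term than the source annotation; the tuple layout and function name are illustrative.

```python
# GO experimental evidence codes.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def stale_iss(annotations):
    """Return ISS annotations whose source gene no longer has an
    experimental annotation to the same GO term."""
    annotations = list(annotations)
    supported = {
        (gene, term) for gene, term, evidence, _ in annotations
        if evidence in EXPERIMENTAL
    }
    return [
        a for a in annotations
        if a[2] == "ISS" and (a[3], a[1]) not in supported
    ]
```

The output is a review list for curators rather than an automatic deletion, in keeping with the concern that automatic handling of ISS is fraught with complexity.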


Action items

[ACTION ITEM]: DONE All: please check and comment on the new version of the graphs http://www.geneontology.org/images/RefGenomeGraphs/

[ACTION ITEM]: IN PROGRESS. All: Annotation Quality control: Please pick an ortholog set from the Curation Targets table http://spreadsheets.google.com/ccc?key=pwOksMOra5uq4vIYjPgefPw

Enter your name in Column K, and open a new item in the SF tracker http://sourceforge.net/tracker/?group_id=36855&atid=1040173

Contact Suzi if you need to be added to this tracker.

Review action items

[ACTION ITEM]: (Chris/AmiGO) Look into loading IEAs for reference genome set into AmiGO [in progress]

  • The new loading cycle will incorporate IEAs from everything except GOA/UniProt. Human is loaded separately.

[ACTION ITEM]: (Amelia): Fix the web page that lists the number of annotations so that it gives an estimated number of protein-coding genes; problems: unmapped genes, splice variants, etc. Maybe this should also be on the ref genome page. USE the count from the gp2protein file; then it's all consistent.

In progress. Amelia had some questions: what should be taken as the correct number, the number of unique IDs in the first column [the db that produced the file], or the number in the second column [the UniProt or NCBI ID]? I just checked with Dan and he says that the mapping may not necessarily be one to one.

  • Chris/Judy: that may not be a reliable number anyway. At least for human, the proteome is not well documented.
  • best would be total number of gene predictions.
  • Judy: look at Sue Rhee's recent paper
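
Amelia's question about which column to count can be made concrete with a quick tally over a gp2protein file. The sketch below assumes the same layout as above (tab-separated `MOD_ID<TAB>DB:accession[;DB:accession...]`, with `!` comment lines); the function and key names are illustrative. It reports both counts plus how many entries break a one-to-one mapping in either direction.

```python
from collections import defaultdict


def gp2protein_counts(lines):
    """Count unique IDs per column and the entries that are not one-to-one."""
    by_mod = defaultdict(set)      # column 1 ID -> protein IDs
    by_protein = defaultdict(set)  # column 2 ID -> column 1 IDs
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        mod_id, _, protein_field = line.partition("\t")
        proteins = by_mod[mod_id]  # registers the ID even if column 2 is empty
        for pid in filter(None, protein_field.split(";")):
            proteins.add(pid)
            by_protein[pid].add(mod_id)
    return {
        "column1_ids": len(by_mod),
        "column2_ids": len(by_protein),
        "one_to_many": sum(len(v) > 1 for v in by_mod.values()),
        "many_to_one": sum(len(v) > 1 for v in by_protein.values()),
    }
```

If `one_to_many` and `many_to_one` are both zero the two counts agree and either can be used; otherwise the choice of column changes the reported number, which is the situation Dan describes.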

Ongoing action items

[ACTION ITEM]: Mary: Show the date completed on the index page of the graphs

[ACTION ITEM]: Mary: Distinguish 'not yet annotated' from 'no ortholog'

[ACTION ITEM] (Tanya Berardini, Emily Dimmer, Pascale Gaudet, David Hill, Chris Mungall, Kimberly Van Auken): Write up recommendations for usage of ISS, IEA, IC

[ACTION ITEM]: DONE Chris: generate new report that would show errors that need fixing for the Orthology determination project

[ACTION ITEM]: DONE Chris will provide a date on the ISS outliers query so that we don't always review the same annotations.


[ACTION ITEM]: Mike will set up 'annotation' calls?

[ACTION ITEM]: all: look at Stan's error reports: http://www.geneontology.org/internal-reports/gp2protein/

  • not updated since October

Next conference call

Tuesday April 8, 2008, 10 AM CDT, 8 AM PDT, 4 PM GMT
