RefGenome11Mar08 Phone Conference (Archived)

Revision as of 06:28, 12 March 2008

Tuesday March 11, 1 PM CDT, 11 AM PDT, 7 PM GMT

Present

Suzi Flybase
Michael Flybase
Susan Flybase
Chris NCBO
Kara Princeton
Rex dictyBase
Pascale dictyBase
Petra dictyBase
Siddhartha dictyBase
Fiona AgBase
Victoria RGD
Tanya TAIR
Dong Hui TAIR
Doug ZFIN
Emily EBI
Judy MGI
Mary MGI
David MGI
Stacia SGD
Kimberly WormBase
Ranjana WormBase

Orthology determination

  • Kara: update:

We launched the all-vs.-all BLAST on Feb. 18. I generated fasta files based on the gp2protein files that everyone provided. I saved everything and put it on an ftp site here:

ftp://gen-ftp.princeton.edu/ppod/go_ref_genome/

with the subdirectories:

gp2protein: contains the gp2protein files used to generate the protein fasta files for the analysis

error: contains IDs from the gp2protein files that were unable to be retrieved from NCBI or UniProt

  • Issue: many proteins were identified by their secondary IDs (especially human!)

[ACTION ITEM]: Chris/Emily: figure out secondary IDs. Maybe a script can be generated to map IDs?

Chris: The GO DB sequence load handles references to secondary UniProtKB IDs in the gp2protein file, which means that the GO DB fasta exports should be complete in this respect.

If useful, it should be possible for us to produce correct gp2protein files as part of the report process during GO DB loads.

In addition, Dan provides a mapping file: ftp://ftp.ebi.ac.uk/pub/contrib/dbarrell/uniprot_secondary_to_primary.dat


fasta: contains the fasta files generated from the gp2protein files
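The secondary-ID issue noted above could be handled with a small mapping script along the following lines. This is only a sketch under stated assumptions: it assumes the mapping file has two whitespace-separated columns (secondary accession, then primary accession) and that gp2protein lines are tab-separated with `;`-separated cross-references in the second column. The function names and the accessions in the example are hypothetical, not taken from the real files.

```python
import io

def load_secondary_to_primary(handle):
    """Read an ASSUMED two-column mapping: secondary accession, primary accession."""
    mapping = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # skip lines that do not fit the assumed layout
        mapping[fields[0]] = fields[1]
    return mapping

def normalize_gp2protein(handle, mapping):
    """Yield (mod_id, [xrefs]) with secondary UniProtKB accessions replaced."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and '!' comment lines
        mod_id, _, protein_field = line.partition("\t")
        fixed = []
        for xref in protein_field.split(";"):
            xref = xref.strip()
            if not xref:
                continue
            db, sep, acc = xref.partition(":")
            # Only UniProtKB accessions are remapped; other databases pass through.
            if sep and db == "UniProtKB" and acc in mapping:
                xref = f"{db}:{mapping[acc]}"
            fixed.append(xref)
        yield mod_id, fixed

# Hypothetical example data (not real accessions).
mapping = load_secondary_to_primary(io.StringIO("SEC001 PRI001\n"))
gp2protein = io.StringIO(
    "MOD:gene1\tUniProtKB:SEC001\n"
    "MOD:gene2\tNCBI_NP:NP_000001\n"
)
rows = list(normalize_gp2protein(gp2protein, mapping))
for mod_id, xrefs in rows:
    print(mod_id, ";".join(xrefs), sep="\t")
```

IDs that are absent from the mapping are passed through unchanged, so the script would not hide entries that still need to go into the error directory.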

The BLAST is just about done (a bit ahead of schedule!), and the next step is to start OrthoMCL. Rough time line, depending on cluster usage: We'll be able to view the OrthoMCL families in simple list form (query by a gene name, get a list of orthologous genes back) in about two weeks. In a month, we will have phylogenetic trees available, and other handy info, as shown in our current, production version of the web interface:

http://ortholog.princeton.edu/findorthofamily.html

Example query result: http://ortholog.princeton.edu/cgi-bin/family.cgi?geneName=FPR1&organism=S_cerevisiae&pipeline=taxman

For this first run, the same basic features will be available for the Ref. Genome stuff. The plan is to send the results around to everyone and see what they think, then we'd collect feedback and suggestions and go from there.

What we need from you:

We'd like to link to each MOD (rather than ENSEMBL, which we do in several cases in the current P-POD) from all the protein IDs for curators' convenience.

Links will get generated to MODs. I'm assuming that we should use the IDs in the first column of the gp2protein files, but if that is not the case, let me know.

Curation tool update

  • Chris: making progress, nothing to demo yet. Should have something by the next meeting.

  • David, Doug and Pascale have laid out requirements in detail
  • Rex - would it be useful to share requirements more broadly...
  • Suzi - real opportunity will be at the Ref Genome Meeting
  • Pascale - rather basic; check out the PowerPoint files
  • Siddhartha - will send out a file or URL (done by email; also on the wiki):

http://wiki.geneontology.org/index.php/Image:Refgene_Database_V3.ppt

Annotation Quality Control

4a. Missing orthologs, completeness: sometimes people look at the phylogenetic tree as a sanity check, looking for things that are too anomalous. This will help with prioritizing and sorting lists.

Sometimes ortholog calls are a bit odd; it turned out that the InParanoid ortholog call was based on a domain. We should be able to add and remove orthologs as necessary, but the changes must be tracked so that they are not reloaded on the next build.

The curator needs to exclude the gene from the annotation set and tag this information in a tracker. Kara would prefer to have this happen further downstream.

4b. Outliers: Susan Tweedie didn't see any, just a paucity of information in different species.

4c. ISS: in most cases, no ISS transfers are being done. Big discussion here: generating breadth for the Reference Genome project is very important. Need to require that ISS annotations be added as needed, following the completion of experimental annotation, for those genes that have no experimental data.

4d. Ontology anomalies...

  • Requirement for a report in the web interface: have a quick link to the genes in 'my' species that have no experimental evidence where all the genes in the ortholog set have been marked as 'complete'.
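The report requirement above can be sketched as a simple filter. This is an illustration only: the data structures (`ortholog_sets`, `annotations`), the function name, and the example genes are hypothetical, and treating EXP, IDA, IPI, IMP, IGI and IEP as the experimental evidence codes is an assumption about how "experimental evidence" would be defined for this report.

```python
# ASSUMPTION: these GO evidence codes count as "experimental" for the report.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def genes_needing_iss(ortholog_sets, annotations, my_species):
    """Return (set_id, gene) pairs for genes in my_species that lack
    experimental evidence, restricted to ortholog sets in which every
    member gene has been marked 'complete'.

    ortholog_sets: dict of set_id -> list of dicts with keys
                   'gene', 'species', 'complete' (hypothetical layout)
    annotations:   dict of gene -> set of evidence codes
    """
    report = []
    for set_id, members in ortholog_sets.items():
        # Skip sets where any organism has not finished annotation.
        if not all(m["complete"] for m in members):
            continue
        for m in members:
            if m["species"] != my_species:
                continue
            codes = annotations.get(m["gene"], set())
            if not codes & EXPERIMENTAL_CODES:
                report.append((set_id, m["gene"]))
    return sorted(report)

# Hypothetical example data.
ortholog_sets = {
    "set1": [
        {"gene": "geneA", "species": "fly", "complete": True},
        {"gene": "geneB", "species": "mouse", "complete": True},
    ],
    "set2": [
        {"gene": "geneC", "species": "fly", "complete": False},
        {"gene": "geneD", "species": "mouse", "complete": True},
    ],
}
annotations = {"geneA": {"IEA"}, "geneB": {"IDA"}, "geneC": set()}

candidates = genes_needing_iss(ortholog_sets, annotations, "fly")
print(candidates)
```

In this example only geneA is reported: set2 is skipped because geneC is not yet marked complete, even though geneC also lacks experimental evidence.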

Question of how we handle ISS when there is no other data: we need to refine this. Need clarity and explicitness.

AT THE END OF THE DAY: Add ISS for those genes with no experimental literature. These by curation.

Action items

[ACTION ITEM]: DONE All: please check and comment on the new version of the graphs http://www.geneontology.org/images/RefGenomeGraphs/

[ACTION ITEM]: IN PROGRESS. All: Annotation Quality control: Please pick an ortholog set from the Curation Targets table http://spreadsheets.google.com/ccc?key=pwOksMOra5uq4vIYjPgefPw

Enter your name in Column K, and open a new item in the SF tracker http://sourceforge.net/tracker/?group_id=36855&atid=1040173

Contact Suzi if you need to be added to this tracker.

Review action items

[ACTION ITEM]: (Chris/AmiGO) Look into loading IEAs for reference genome set into AmiGO [in progress]

  • The new loading cycle will incorporate IEAs from everything except GOA/Uniprot. Human is loaded separately.

[ACTION ITEM]: (Amelia): Fix the web page that shows the number of annotations so that it gives an estimated number of protein-coding genes. Problems: unmapped genes, splice variants, etc. Maybe this should also be on the ref genome page. USE the count from the gp2protein file; then it's all consistent.

In progress. Amelia had some questions: what should be taken as the correct number, the number of unique IDs in the first column [the db that produced the file], or the number in the second column [the UniProt or NCBI ID]? I just checked with Dan and he says that the mapping may not necessarily be one to one.

  • Chris/Judy: that may not be a reliable number anyway. At least for human, the proteome is not well documented.
  • best would be total number of gene predictions.
  • Judy: look at Sue Rhee's recent paper
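To see how far the two candidate counts diverge for a given gp2protein file, something like the following could be run. It is a minimal sketch, assuming tab-separated lines with `;`-separated cross-references in the second column and `!` comment lines; the function name and the example IDs are hypothetical.

```python
import io

def count_gp2protein_ids(handle):
    """Return (unique IDs in column 1, unique IDs in column 2)."""
    first, second = set(), set()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and '!' comment lines
        mod_id, _, protein_field = line.partition("\t")
        first.add(mod_id.strip())
        for xref in protein_field.split(";"):
            xref = xref.strip()
            if xref:
                second.add(xref)
    return len(first), len(second)

# Hypothetical example: two database IDs point at the same protein,
# so the mapping is not one to one and the counts differ.
example = io.StringIO(
    "MOD:g1\tUniProtKB:A\n"
    "MOD:g2\tUniProtKB:A\n"
    "MOD:g3\tUniProtKB:B\n"
)
n_first, n_second = count_gp2protein_ids(example)
print(n_first, n_second)
```

Reporting both numbers side by side would make any many-to-one or one-to-many mappings visible before one of them is chosen as the reference count.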

Ongoing action items

[ACTION ITEM]: Mary: Show the date completed on the index page of the graphs

[ACTION ITEM]: Mary: Distinguish 'not yet annotated' from 'no ortholog'

[ACTION ITEM] (Tanya Berardini, Emily Dimmer, Pascale Gaudet, David Hill, Chris Mungall, Kimberly Van Auken): Write up recommendations for usage of ISS, IEA, IC

[ACTION ITEM]: DONE Chris: generate new report that would show errors that need fixing for the Orthology determination project

[ACTION ITEM]: DONE Chris will provide a date on the ISS outliers query so that we don't always review the same annotations.


[ACTION ITEM]: Mike will set up 'annotation' calls?

[ACTION ITEM]: all: look at Stan's error reports: http://www.geneontology.org/internal-reports/gp2protein/

  • not updated since October

Next conference call

Tuesday April 8, 2008, 10 AM CDT, 8 AM PDT, 4 PM GMT

Return to Reference_Genome_Annotation_Project