RefGenome11Mar08 Phone Conference (Archived)


Tuesday March 11, 1 PM CDT, 11 AM PDT, 7 PM GMT


Suzi Flybase
Michael Flybase
Susan Flybase
Chris NCBO
Kara Princeton
Rex dictyBase
Pascale dictyBase
Petra dictyBase
Siddhartha dictyBase
Fiona AgBase
Victoria RGD
Tanya TAIR
Dong Hui TAIR
Emily EBI
Judy MGI
Mary MGI
David MGI
Stacia SGD
Kimberly WormBase
Ranjana WormBase

Orthology determination

  • Kara: update:

We launched the all-vs.-all BLAST on Feb. 18. I generated fasta files based on the gp2protein files that everyone provided. I saved everything and put it on an ftp site here:

with 3 subdirectories:

(1) gp2protein: contains the gp2protein files used to generate the protein fasta files for the analysis

(2) error: contains IDs from the gp2protein files that were unable to be retrieved from NCBI or UniProt

  • Issue: many proteins were identified by their secondary IDs (especially human!)

[ACTION ITEM]: Chris/Emily: figure out secondary IDs. Maybe a script can be generated to map IDs?

Chris: The GO DB sequence load handles references to secondary UniProtKB IDs in the gp2protein file, which means that the GO DB fasta exports should be complete in this respect.

If useful, it should be possible for us to produce correct gp2protein files as part of the report process during GO DB loads.

In addition, Dan provides a mapping file: uniprot_secondary_to_primary.dat

(3) fasta: contains the fasta files generated from the gp2protein files
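The ID-mapping script suggested in the action item under (2) might look roughly like the sketch below. It is only an illustration: the layout assumed for uniprot_secondary_to_primary.dat (two whitespace-separated accessions per line), the ";" separator for multiple protein IDs, and the function names are all assumptions, not a description of the actual files or of the GO DB load.

```python
def load_secondary_map(lines):
    """Build {secondary_accession: primary_accession} from mapping lines.

    ASSUMPTION: each data line holds exactly two whitespace-separated
    accessions, secondary first and primary second. The real layout of
    uniprot_secondary_to_primary.dat may differ.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping


def normalize_gp2protein(lines, secondary_map):
    """Yield (gene_id, protein_ids) with secondary UniProtKB IDs replaced.

    ASSUMPTION: rows are tab-separated, column 1 is the MOD gene ID,
    column 2 holds one or more ";"-separated "DB:accession" protein IDs,
    and lines starting with "!" are comments.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        fixed = []
        for pid in fields[1].split(";"):
            db, _, acc = pid.partition(":")
            if db == "UniProtKB" and acc in secondary_map:
                pid = f"{db}:{secondary_map[acc]}"
            fixed.append(pid)
        yield fields[0], ";".join(fixed)
```

IDs that are absent from the mapping are passed through unchanged, so the output can still be compared against the contents of the error subdirectory.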

  • The BLAST is just about done (a bit ahead of schedule!), and the next step is to start OrthoMCL.
  • Rough time line, depending on cluster usage:

- We'll be able to view the OrthoMCL families in simple list form (query by a gene name, get a list of orthologous genes back) in about two weeks.

- In a month, we will have phylogenetic trees available, and other handy info, as shown in our current, production version of the web interface:

Example query result:

For this first run, the same basic features will be available for the Ref. Genome stuff. The plan is to send the results around to everyone and see what they think, then we'd collect feedback and suggestions and go from there.

[ACTION ITEM]: DONE Kara, Chris, Mary: Links will be generated to MODs using the dbxref file. Mary has sent Kara all the refG MOD link info.


  • Doug (ZFIN): UniProt IDs go out of sync with the ZFIN IDs over time, and some just disappear. He is not sure what the problem is, but he wanted other groups to be aware of it.

Curation tool update

  • Chris...making progress, nothing for demo-ing yet. Should have by next


  • David, Doug and Pascale have laid out requirements in detail
  • Rex - would it be useful to share requirements more broadly...
  • Suzi - real opportunity will be at the Ref Genome Meeting
  • Pascale - rather basic..check out powerpoint files
  • Siddhartha...will send out file or url (done, email, also on the wiki:

Annotation Quality Control

Some curators have started to look at the graphs to address annotation quality assurance problems (some issues are listed at Annotation_QC). We went through two examples:

1. TAZ


Not a lot of data for this gene, fairly straightforward.

Outliers: Susan didn't see any, just a paucity of information in different species.

Question: how far do we go to find orthologs? E.g. a BLAST against chicken suggests there is probably a chicken ortholog, but it is not annotated yet.

ISS- in most cases, ISS transfers are not being done...Big discussion here...generating breadth for the Reference Genome projects is very important. Need to require that ISS be added as needed following the completion of experimental annotation for those genes that have no experimental data.

Another ISS issue is that some curators seem to be uncomfortable drawing the line in terms of what can and cannot be propagated. This needs more discussion [ADD THIS TO MEETING AGENDA]

There was also a problem of mouse making ISS annotations with human sequence but human wasn't itself annotated with the term - turned out the human annotation (made by MGI) had not been picked up by GOA because the human identifier was not protein. [ACTION ITEM?] Does this need to be resolved? How should this 'missing' human GO data make it into Mary's tables?

The review also picked up a minor problem on the graph, which has been addressed via SF.

[ACTION ITEM] All: problems in the graph need to be looked at and fixed!

[ACTION ITEM] David. There was a specific issue with 'heart development' : looking at the graph was not intuitive to Chris and Pascale.

Judy: it is not useful to do orthology yet while the new system is in development; that will come. In instances where they look at ortholog completeness, they look at the tree for a sanity check. Having the tree would be good, so that ‘lost’ orthologs can be called to people’s attention.

Susan: and then we need to distinguish between orthologs that are missing due to lack of information versus truly missing from a species.

Do need to have a mechanism so that we can shortlist the problem areas and bring them to people’s attention. Not only determining orthologs but also GO annotation outliers and ability to make ISS when we decide what species we should transfer from.

2. TNNT2


Just listed groups who listed an ortholog call. Noticed that no orthologs/orthologs as being strange sets. Looks like the yeast InParanoid ortholog is incorrect (based on repetitive domain).

There should be automatically generated ISS based on the tools that we are building, but there should also be other, more general ISS based on motifs, etc. The point is to have the option to use ISS as well as ISO.

But these exercises should be a manual test of ortholog determination. If we find something that seems to be wrong, how do we deal with it? Can we manually correct for false positives or false negatives? This information should be fed back to the algorithm so that the algorithm is improved.

Ultimately this set will be in the public domain, so corrections to the set need to be made in a systematic way, not ad hoc, which would allow errors to creep back in.

Should be able to manually add/remove orthologs so long as it is done in a way that can be tracked.

[ACTION ITEM] : Programmers. Figure out how to manually add/remove orthologs in a way that can be tracked.
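One possible way to make manual corrections trackable is to leave the computed set untouched and keep an append-only log of curator edits that is replayed on top of it. The sketch below is only an illustration of that idea; the class names, fields, and the example IDs in the usage note are hypothetical and do not describe any existing Ref. Genome tool.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class OrthologEdit:
    """One manual correction, kept forever so it can be audited."""
    action: str      # "add" or "remove"
    gene_id: str
    curator: str
    reason: str
    when: date


@dataclass
class OrthologSet:
    computed: frozenset                        # members proposed by the algorithm
    edits: list = field(default_factory=list)  # append-only manual edit log

    def record(self, action, gene_id, curator, reason, when):
        """Append a manual edit; the computed set itself is never modified."""
        if action not in ("add", "remove"):
            raise ValueError(f"unknown action: {action}")
        self.edits.append(OrthologEdit(action, gene_id, curator, reason, when))

    def members(self):
        """Replay the log over the computed set to get the current members."""
        current = set(self.computed)
        for edit in self.edits:
            if edit.action == "add":
                current.add(edit.gene_id)
            else:
                current.discard(edit.gene_id)
        return current
```

For example, `s = OrthologSet(frozenset({"human:TNNT2", "yeast:hypothetical_gene"}))` followed by `s.record("remove", "yeast:hypothetical_gene", "curator1", "hit based on repetitive domain", date(2008, 3, 11))` removes the yeast gene from `s.members()` while `s.computed` and `s.edits` still show what the algorithm proposed and who changed it. The same log could be fed back to whoever maintains the algorithm.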

Susan – the fly gene is almost equally similar to TNNT1, 2 & 3 but is listed only once in the current spreadsheet, as the ortholog of TNNT3 (best score using InParanoid). Not ideal, but leaving it this way until we have generated the new ortholog sets. This raises the issue of how to deal with non-1:1 relationships.

We need to be able to show 1:1, 1:many and many:many relationships. Within the ortholog set, multiples should be included.
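As a rough illustration of storing multiples within one set, an ortholog set can be held as a mapping from species to a list of genes, from which the relationship type between any two species follows directly. The function name and the dictionary layout are assumptions made for this sketch, and "fly_gene" is a placeholder because the minutes do not name the fly gene.

```python
def relationship_type(members, species_a, species_b):
    """Classify the relationship between two species within one ortholog set.

    `members` maps a species name to the list of its genes in the set.
    Returns "1:1", "1:many", "many:many", or None if either species
    has no gene in the set.
    """
    n_a = len(members.get(species_a, []))
    n_b = len(members.get(species_b, []))
    if n_a == 0 or n_b == 0:
        return None
    if n_a == 1 and n_b == 1:
        return "1:1"
    if n_a == 1 or n_b == 1:
        return "1:many"
    return "many:many"


# Illustration only: whether TNNT1/2/3 end up in one set is still open.
troponin_t = {"fly": ["fly_gene"], "human": ["TNNT1", "TNNT2", "TNNT3"]}
print(relationship_type(troponin_t, "fly", "human"))
```

Keeping every member in the set, rather than only the best-scoring pair, avoids the situation described above where the fly gene appears against TNNT3 alone.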

For the Reference Genome Project, need to make sure that once the experimental data annotations are complete, ISSs are made for those genes in the same ortholog set that have no other GO annotation. WE ALL AGREE ON THIS.

Doug – Since we need to capture genes in a species that are a part of the orthologous set that have no other GO annotation, there should be a report to enable people to do this for their species.

There was general agreement that the gene by gene review had been useful and should be continued for the time being.

Emily -- It is a large group and not always easy to discuss things or know that we are approaching annotation in the same way. This exercise is a useful sanity check to ensure that we are all curating consistently and producing a cohesive set of data.

[ACTION ITEM] Chris to look into generating a suitable report.

email follow up

Judy: General observation....

Most of the suggestions are for additional ISS that I see.

We have agreed, now, that ISS should be added after all, or most, groups have completed their experimentally-based annotations.

Then, once we have established a common orthology/homology set, each group should be able to decide from which other organisms they are willing to pick up ISS annotations. This would most efficiently be done in a controlled-automated fashion.

For example, mouse would pick up experimental-code annotations from rat and human by ISS, but probably not from more distantly-related organisms. (BTW, I don't think rat picking up ISS from NAS human annotations is a good idea...NAS is not an experimental code. Maybe Vicky can give us some insight into that.)

Bottom line: I think it is inefficient to code ISS manually unless it is incidentally part of the paper you are curating for your resource.

for discussion...

Also, remember that I sent round the Burgess paper about annotation quality analysis. Some of the global methods they developed might be useful for us to consider. Perhaps we could schedule a RefGenome journal club around that paper.


Requirement for report for web interface...have a quick link to the genes in 'my' species that have no experimental evidence where all the genes in the ortholog set have been marked as 'complete'.

question of how we handle ISS when there is no other data...need to refine this. Need clarity and explicitness.

AT THE END OF THE DAY: Add ISS for those genes with no experimental literature. These by curation.

Doug: My thoughts on what we would want for a report of genes that should be checked for potential ISS annotation.

A link/button on the Ref. Genome Interface for "My genes needing ISS".

This link would launch a report listing all the genes in my organism where:

1. The 'vast majority' (what that is TBD..maybe a parameter curators can set before running the report?) of the genes in the homology set are marked as comprehensively curated....where 'comprehensively curated' means annotated comprehensively for experimentally supported annotations.

2. My gene(s) in the homology set lacks experimental annotation in 1 or more GO aspects. (Even if my gene DOES have experimental annotation, there may still be useful alternative experimentally supported annotations in other species that could be applied to my gene, but this requirement may be dubious?)

Other requirements: A way to mark and datestamp that we have completed the transfer of experimental data by ISS as of a certain date so the gene no longer shows up in the report. ISS-complete genes may need later review as new experimental annotations are added to the genes in the homology set. This brings up the issue of how to keep all this current. As time goes on we will have an increasingly large re-annotation burden in order to keep things current across all the Ref. Genome genes. It is already impractical from a curation-time point of view, which I suspect is a major contributor to the lack of ISS's being completed. Automatic transfer of ISS annotations is the only solution to this, but this is fraught with complexity as discussed at today's phone conf.
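Doug's two report criteria could be prototyped along the following lines. This is a sketch under stated assumptions, not a specification: the input layout, the function name, and the default threshold of 0.8 for the 'vast majority' are all hypothetical, and the completeness fraction is computed over the other species' genes only, which is one possible reading of criterion 1. The evidence codes listed are the GO experimental codes.

```python
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}  # GO experimental codes
ASPECTS = ("P", "F", "C")  # process, function, component


def genes_needing_iss(homology_sets, my_species, min_complete_fraction=0.8):
    """Return [(gene_id, [aspects lacking experimental annotation])].

    ASSUMPTION: each homology set is a list of gene dicts with keys
    "id", "species", "complete" (bool: comprehensively curated) and
    "annotations" (list of (aspect, evidence_code) pairs).
    """
    report = []
    for genes in homology_sets:
        others = [g for g in genes if g["species"] != my_species]
        if not others:
            continue
        # Criterion 1: the 'vast majority' of the set is marked complete.
        fraction = sum(g["complete"] for g in others) / len(others)
        if fraction < min_complete_fraction:
            continue
        # Criterion 2: my gene lacks experimental annotation in some aspect.
        for g in genes:
            if g["species"] != my_species:
                continue
            covered = {a for a, code in g["annotations"] if code in EXPERIMENTAL}
            missing = [a for a in ASPECTS if a not in covered]
            if missing:
                report.append((g["id"], missing))
    return report
```

The datestamp requirement could be layered on top by filtering out genes whose ISS transfer was marked complete after the most recent experimental annotation in the set, but that is left out here.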

David: Doug,

This is great. We would also like to have something in place that removes ISS annotation if original annotations are removed. I was thinking after the call, that a good start to this would be to have a table similar to the ones at the very bottom of Mary's graph page where we could mark the boxes where we wanted to create annotations. Of course we'd have to figure out how to get these annotations easily into our databases. Even just a report we could print out with the pertinent data would be a start. It seems to me that the big picture that Mary's stuff gives is the best way to make a reasonable judgment about whether we would want to make an ISS.

Action items

[ACTION ITEM]: DONE All: please check and comment new version of the graphs

[ACTION ITEM]: IN PROGRESS. All: Annotation Quality control: Please pick an ortholog set from the Curation Targets table

Enter your name in Column K, and open a new item in the SF tracker

Contact Suzi if you need to be added to this tracker.

Review action items

[ACTION ITEM]: (Chris/AmiGO) Look into loading IEAs for reference genome set into AmiGO [in progress]

  • The new loading cycle will incorporate IEAs from everything except GOA/Uniprot. Human is loaded separately.

[ACTION ITEM]: (Amelia): Fix the web page that lists the number of annotations so that it gives an estimated number of protein-coding genes; problems: unmapped genes; splice variants; etc. Maybe this should also be on the ref genome page. USE count from gp2protein file-- then it's all consistent.

in progress. Amelia had some questions: what should be taken as the correct number, the number of unique IDs in the first column [the db that produced the file], or the number in the second column [the UniProt or NCBI ID]? I just checked with Dan and he says that the mapping may not necessarily be one to one.

  • Chris/Judy: that may not be a reliable number anyway. At least for human, the proteome is not well documented.
  • best would be total number of gene predictions.
  • Judy: look at Sue Rhee's recent paper
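Amelia's question about which column to count can be made concrete by counting both and comparing. The sketch below assumes a tab-separated gp2protein file with "!" comment lines and ";"-separated protein IDs in the second column; those format details and the function name are assumptions for illustration.

```python
def count_gp2protein_ids(lines):
    """Return (unique IDs in column 1, unique IDs in column 2)."""
    genes, proteins = set(), set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        genes.add(fields[0])
        # ASSUMPTION: multiple protein IDs in column 2 are ";"-separated.
        proteins.update(p for p in fields[1].split(";") if p)
    return len(genes), len(proteins)
```

If the two counts differ, the mapping is not one to one, as Dan suggested, and neither number alone should be presented as the count of protein-coding genes.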

Ongoing action items

[ACTION ITEM]: Mary: Show the date completed on the index page of the graphs

[ACTION ITEM]: Mary: Distinguish 'not yet annotated' from 'no ortholog' in the graphs

[ACTION ITEM] TO DISCUSS AT 2ND REF GENOMES MEETING(Tanya Berardini, Emily Dimmer, Pascale Gaudet, David Hill, Chris Mungall, Kimberly Van Auken): Write up recommendations for usage of ISS, IEA, IC

[ACTION ITEM]: DONE Chris: generate new report that would show errors that need fixing for the Orthology determination project

[ACTION ITEM]: DONE Chris will provide date on the ISS outliers query so that we don't always review the same annotations.

[ACTION ITEM]: DONE Mike will set up 'annotation' calls?

[ACTION ITEM]: all: look at Stan's error reports:

  • not updated since October

Next conference call

Tuesday April 8, 2008, 10 AM CDT, 8 AM PDT, 4 PM GMT
