RefGenome08Jan08 Phone Conference (Archived)

Time: 1 PM CDT, 11 AM PDT, 7 PM GMT

Present

Chris NCBO
Pascale dictyBase
Petra dictyBase
Rex dictyBase
Emily GOA
Doug ZFIN
David MGI
Tanya TAIR
Donghui TAIR
Stacia SGD
Karen SGD
Fiona AgBase
Ruth UCL
Kimberly WormBase
Victoria RGD
Suzi
Seth BBOP
Judy MGI
Mary MGI

Next Reference Genome Meeting

April 20-21, Salt Lake City

This will be followed by a GO Consortium Meeting on April 22 and 23 in Salt Lake City.

Karen Eilbeck: host

Display of reference genome annotations in AmiGO

Plan for AmiGO

(from Suzi) The next release (1.6) will have these new features (call it the ref genome version)

1. A page for viewing ortholog sets

2. Graphical viewing of annotations (along the lines of what Mary has been producing)

Suggestions

Load IEAs in AmiGO for reference genomes
Filter display of annotations for P, F, C

Orthology determination

Kara set the deadline of Jan. 31 (I think Simon mentioned that rat data would be ready at the end of Jan), at which point, we'll do the analysis with whatever files are available no matter what.

People will be reminded on Jan. 24: "final" call for the new and improved files.

Victoria: do we need a single sequence per gene?

- Judy: single sequence per gene

- UniProt is preferred, although NCBI IDs are acceptable

- Chris says if that's difficult, you can have more than one sequence per gene and they will sort it out (but of course it would be more accurate if the gp2protein file contained the ID to the longest protein product only).

- Judy: Princeton will only take one protein per gene, but that really depends on the structure of your database. Some people cannot select which splice variant to display

- Pascale: what if there are multiple identical genes/proteins?

- David: MGI has a single UniProt ID for calmodulin, which corresponds to 3 genes
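
As a rough illustration of the "longest protein product only" suggestion above, the sketch below keeps one protein per gene. It assumes each database can supply (gene ID, protein ID, sequence length) triples; the function name, the input shape, and the IDs in the test data are hypothetical and are not part of the gp2protein format itself.

```python
def pick_longest_per_gene(candidates):
    """Return {gene_id: protein_id}, keeping the longest protein for each gene.

    `candidates` is an iterable of (gene_id, protein_id, length) triples.
    This input shape is an assumption made for illustration only.
    On a tie in length, the protein seen first is kept.
    """
    best = {}  # gene_id -> (protein_id, length)
    for gene_id, protein_id, length in candidates:
        current = best.get(gene_id)
        if current is None or length > current[1]:
            best[gene_id] = (protein_id, length)
    return {gene_id: protein_id for gene_id, (protein_id, _) in best.items()}


# Made-up example: geneA has two splice variants, geneB has one product.
example = [
    ("geneA", "protA1", 100),
    ("geneA", "protA2", 250),
    ("geneB", "protB1", 80),
]
selected = pick_longest_per_gene(example)
```

Note that this does not address the calmodulin case, where several genes share one protein ID; each of those genes would simply map to the same ID.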

Rex: would be nice to have the orthology set some time before the meeting so we can play with it

Changing workflow?

Proposal:

Each curator would be given a gene to check for all 12 genomes rather than looking at the 20 genes for a single genome.

(did not discuss because Suzi was not able to talk)

Annotation Documentation

IEA,_ISS,_IC_Usage_Discussion

LOADING IEAs?
Chris: last time we tried loading the IEAs it slowed down the system a lot
David: can you only load an IEA if there are not other evidence codes?
Pascale: can we not just load ref genomes IEAs?
Ruth objects: if there are minimal annotations, then it is not so helpful NOT to show the IEAs
Chris: The problem with the volume of IEAs is the non-MOD organisms
Chris: the strategy would be to only load the complete files for the ref genomes
David: genomes in the drop-down menu should be treated the same way; otherwise, you may end up with nothing when you pick a genome
Pascale: but that's already an improvement
Judy: we need to know how many annotations each of the genomes has
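
One possible reading of David's question (load an IEA only when there is no other evidence code) is sketched below. It assumes annotations are available as (gene ID, GO ID, evidence code) triples and applies the rule per gene; whether the rule should instead apply per gene and term was not settled on the call. The function name and the sample data are hypothetical.

```python
def drop_redundant_ieas(annotations):
    """Keep IEA annotations only for genes that have no non-IEA annotation.

    `annotations` is an iterable of (gene_id, go_id, evidence_code) triples.
    Applying the rule per gene is an assumption made for this sketch.
    """
    annotations = list(annotations)  # allow a second pass over the data
    genes_with_other_evidence = {
        gene_id for gene_id, _, evidence in annotations if evidence != "IEA"
    }
    return [
        (gene_id, go_id, evidence)
        for gene_id, go_id, evidence in annotations
        if evidence != "IEA" or gene_id not in genes_with_other_evidence
    ]


# Made-up data: gene1 has an experimental annotation, gene2 has only an IEA.
sample = [
    ("gene1", "GO:0000001", "IDA"),
    ("gene1", "GO:0000002", "IEA"),
    ("gene2", "GO:0000003", "IEA"),
]
kept = drop_redundant_ieas(sample)
```

With this rule, gene2 still shows its IEA, which is in line with Ruth's point that IEAs are most useful for sparsely annotated genes.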

Suzi: the display issue for IEAs for the ref genomes can easily be done
Suzi: So many more issues about annotations. There is no quality control. Some of us should get together and look over the data, to decide whether we are happy with the results we are getting. When this data gets public, we need to be sure we are happy with the data we are displaying. For example: HPRT
[ACTION ITEM] David: let's do a WebEx to go over the issues relating to quality control. Suzi will email the group, and we will set up a meeting with all those who want to participate.

other concerns

Fiona: We should show in AmiGO that there are annotations available even if they are not in AmiGO
Emily: the QuickGO browser will soon allow browsing by taxon

New Action Items

[ACTION ITEM] David: let's do a WebEx to go over the issues relating to quality control. Suzi will email the group, and we will set up a meeting with all those who want to participate.

[ACTION ITEM]: (Chris/AmiGO) Look into loading IEAs for reference genome set into AmiGO

Review Action Items

[ACTION ITEM]: (Amelia): Fix the web page that lists the number of annotations so that it gives an estimated number of protein-coding genes; problems: unmapped genes, splice variants, etc. Maybe this should also be on the ref genome page. Use the count from the gp2protein file; then it's all consistent.

in progress. Amelia had some questions: what should be taken as the correct number, the number of unique IDs in the first column [the db that produced the file], or the number in the second column [the UniProt or NCBI ID]? I just checked with Dan and he says that the mapping may not necessarily be one to one.
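
To make Amelia's question concrete, the sketch below counts unique IDs in each column of a gp2protein-style file and reports protein IDs shared by more than one gene. It assumes tab-separated lines with the database ID in the first column, one or more protein IDs in the second (separated by ";"), and comment lines starting with "!"; these details and the IDs in the sample are assumptions for illustration.

```python
def summarize_gp2protein(lines):
    """Count unique first-column and second-column IDs in gp2protein-style lines.

    Assumed layout (not confirmed by these minutes): tab-separated,
    column 1 = database gene ID, column 2 = protein ID(s) joined by ';',
    lines starting with '!' are comments.
    """
    gene_ids = set()
    protein_to_genes = {}  # protein_id -> set of gene IDs that map to it
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        columns = line.split("\t")
        gene_id = columns[0]
        gene_ids.add(gene_id)
        if len(columns) < 2:
            continue  # gene with no mapped protein
        for protein_id in columns[1].split(";"):
            protein_id = protein_id.strip()
            if protein_id:
                protein_to_genes.setdefault(protein_id, set()).add(gene_id)
    shared = {
        protein_id: sorted(genes)
        for protein_id, genes in protein_to_genes.items()
        if len(genes) > 1
    }
    return {
        "unique_gene_ids": len(gene_ids),
        "unique_protein_ids": len(protein_to_genes),
        "proteins_shared_by_genes": shared,
    }


# Made-up sample: three genes sharing one protein ID (as in the calmodulin case),
# plus one gene mapped to two protein IDs.
sample_lines = [
    "!example file",
    "DB:gene1\tUniProtKB:X00001",
    "DB:gene2\tUniProtKB:X00001",
    "DB:gene3\tUniProtKB:X00001",
    "DB:gene4\tUniProtKB:X00002;UniProtKB:X00003",
]
summary = summarize_gp2protein(sample_lines)
```

In this sample the two columns give different totals (4 gene IDs, 3 protein IDs), which is why the choice of column matters for the estimated gene count.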

[ACTION ITEM]: all: look at Stan's error reports: http://www.geneontology.org/internal-reports/gp2protein/

[ACTION ITEM]: Chris: generate new report that would show errors that need fixing for the Orthology determination project

[ACTION ITEM]: Chris will provide a date on the ISS outliers query so that we don't always review the same annotations.

[ACTION ITEM]: Mary will include IC in the graphs

Would be nice if we had a report describing when genes are 'comprehensively' curated

in progress

[ACTION ITEM]: can Mary show the date completed on the index page? Possibly - she will try

in progress: Email communication: Hi, Pascale. I was in the process of implementing both things when my computer was hit with bad malware, which I am still dealing with.

The IC part was simple and I posted graphs on a temporary site with the IC annotations but realized that some were incomplete because some groups were not putting in EntrezGene ids, which is now the only sure way to connect the orthologs. That led me to contact each group to make corrections and to ask that they include all EntrezGene ids. At that point I was hit with the malware so I've been somewhat stalled. When I can get that resolved I will be able to update the graphs and include the date on the index page.

Mary

[ACTION ITEM]: Discuss at the GOC meeting whether it would be useful to add the 'comprehensively annotated' tag to all genes, somehow? Either in the gene association file or in the database somehow

[ACTION ITEM]: Mike will set up 'annotation' calls?

[ACTION ITEM]: Mike (Pascale): merge the two email lists (reference genome and annotation) into 'annotation'

Should we copy them to the SF tracker?

[ACTION ITEM] (Tanya Berardini, Emily Dimmer, Pascale Gaudet, David Hill, Chris Mungall, Kimberly Van Auken): Write up recommendations for usage of ISS, IEA, IC: Report by next meeting???

Started, IEA,_ISS,_IC_Usage_Discussion

Ongoing action items

[ACTION ITEM] (Val): Provide the list of 207 genes conserved between pombe and human with no annotation/information [DONE but] pombe gene IDs were sent; we need to add them to the 'to do' spreadsheet in the same format

[ACTION ITEM] (Ruth): send the HGNC list of genes with few annotations (potential 'bleeding edge' genes) [ON HOLD] until Varsha is back

[ACTION ITEM]: David will produce some examples of the function-process links.

Not done, but Suzi and Amelia are going to try to mine these from Reactome instead. It would still be interesting to have this information as it should help annotation consistency. Documentation is available Function-Process_Links

[ACTION ITEM]: For orthology determination: Suzi and Karen E will generate a page where all sequences will be available [DONE???] GFF3 for most databases Reference_Genome_sequence_annotation Question: should we add a link to FASTA files there as well?

[ACTION ITEM]: (Judy) contact/meet with people who have made tools for orthology determination on behalf of the GOC to see if they can help us (that possibly includes re-running the analyses using the most recent set of sequences and proper IDs)

Compara, HomoloGene, TreeFam, InParanoid, others?

[ACTION ITEM] (Judy Blake) Contact NCBI/NLM/OMIM to link to reference genome genes

[ACTION ITEM]: Kara, Stacia: run the P-POD analysis over the full ref genome data set? Need computational pipeline with existing resources. Currently takes 3 weeks to do 8 species all v all. Goal was set for February 2008 to include all ref genome sets. [in progress]

[ACTION ITEM]: (developers/software group): consider the potential impact of annotating to different forms of the gene (alternatively spliced, processed, etc). For now we will document how each database deals with those:

[ACTION ITEM]: (all): provide the method you use for capturing the exact gene product being annotated on this page: Variant_annotation [Done]

[ACTION ITEM] (Chris, Mike, Rex): Provide Ref genome reports on a regular basis

[ACTION ITEM] (Donghui): Check which IDs TAIR needs to provide for the reports. [We need to provide the TAIR gene accession ids in the spreadsheet instead of the AGI identifiers. -Tanya]

[ACTION ITEM]: (Chris) generate reports for potential misannotations (ND annotations for completed genes, etc). [DONE] Reference_Genome_Database_Reports We can request different reports. What do we do now?

[ACTION ITEM]: (Pascale) generate list of terms that often have incorrect annotations to check for consistent use of the term In progress, Misused_terms

Next conference call

Tuesday February 12, 10 AM CDT (8 AM PDT, 4 PM BST)