RefGenome11Dec07 Phone Conference (Archived)

Tuesday December 11, 10 AM CDT (8 AM PDT, 4 PM BST)

Present

Emily GOA
Rachael GOA
Petra dictyBase
Pascale dictyBase
Stacia SGD
Mary MGI
Victoria RGD
Donghui TAIR
Tanya TAIR
Fiona AgBase
Susan FlyBase
Siddhartha dictyBase
Chris NCBO
Judy MGI
David MGI
Kimberly WormBase
Amelia GOA

Web presence

Petra, Susan, Amelia, Rex: update [DONE] see http://www.geneontology.org/GO.refgenome.shtml

[ACTION ITEM]: (Amelia): Fix the web page that shows the number of annotations so that it also gives an estimated number of protein-coding genes. Problems: unmapped genes, splice variants, etc. Maybe this should also be on the ref genome page. Use the count from the gp2protein file, so that it is all consistent.
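
A minimal sketch of how a gene count could be taken from a gp2protein file. It assumes the usual layout of that file (tab-separated, gene product ID in the first column, protein ID(s) in the second, comment lines starting with "!"). The function name and the sample IDs are hypothetical, and this is not the script used for the web page.

```python
def count_gene_products(lines):
    """Count distinct gene product IDs (first column) in gp2protein-style lines."""
    ids = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and "!" comment/header lines.
        if not line or line.startswith("!"):
            continue
        gp_id = line.split("\t")[0].strip()
        if gp_id:
            ids.add(gp_id)
    return len(ids)


# Made-up example lines; real files would be read with open(path).
sample = [
    "!version: example",
    "SGD:S000000001\tUniProtKB:P00001",
    "SGD:S000000002\tUniProtKB:P00002;UniProtKB:P00003",
    "SGD:S000000002\tUniProtKB:P00002",
    "",
]
print(count_gene_products(sample))
```

Counting distinct IDs rather than lines keeps the number stable if a gene product is listed more than once. Unmapped genes and splice variants would still need to be handled separately.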

Orthology determination

Kara, Stacia, Chris: update

Last time Chris provided a format for all to submit a FASTA file. We will not do this anymore. The better way to do it is to use the gp2protein file.

Is this okay with everyone? The rat and E. coli files seem to be missing; can GOA provide them?

Emily will provide Victoria with a file that can help RGD produce a gp2protein file
Longest gene products
Kimberly: should gp2protein file only have the longest product?
Chris: no, we can figure that out [how?]
Pascale to Emily: can you also provide the E. coli file?
Emily: talked to Jim, still need to hear back

[ACTION ITEM]: all: look at Stan's error reports: http://www.geneontology.org/internal-reports/gp2protein/

The reports include a list of protein IDs in the gp2protein file that do not map to a gene product ID in the gene association file (file_no_gp.database), and a list of gene product IDs that do not map to a protein ID (gp_no_seq.database), so there are two different files per project. ***I think the latter is what we have to worry about; Chris, please confirm or correct :)
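
For anyone who wants to reproduce the two lists locally, here is a rough sketch of the comparison. It is not Stan's script. It assumes the gp2protein layout (gene product ID, tab, protein IDs separated by ";") and that the first two columns of the gene association file are the database and the object ID. All function names and sample IDs are made up for illustration.

```python
def read_gp2protein(lines):
    """Map gene product ID -> set of protein IDs (the set may be empty)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.split("\t")
        gp_id = cols[0].strip()
        proteins = set()
        if len(cols) > 1:
            proteins = {p.strip() for p in cols[1].split(";") if p.strip()}
        mapping.setdefault(gp_id, set()).update(proteins)
    return mapping


def read_association_ids(lines):
    """Collect DB:ID identifiers from columns 1 and 2 of gene association lines."""
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            ids.add(f"{cols[0]}:{cols[1]}")
    return ids


def mismatch_report(gp2protein_lines, association_lines):
    """Return (entries with no annotated gene product, annotated IDs with no protein)."""
    mapping = read_gp2protein(gp2protein_lines)
    annotated = read_association_ids(association_lines)
    in_file_not_annotated = sorted(set(mapping) - annotated)
    annotated_without_sequence = sorted(gp for gp in annotated if not mapping.get(gp))
    return in_file_not_annotated, annotated_without_sequence


# Made-up example data.
gp2protein_sample = [
    "dictyBase:DDB0000001\tUniProtKB:P00001",
    "dictyBase:DDB0000002\tUniProtKB:P00002",
]
association_sample = [
    "dictyBase\tDDB0000001\tgeneA",
    "dictyBase\tDDB0000003\tgeneC",
]
no_gp, no_seq = mismatch_report(gp2protein_sample, association_sample)
print(no_gp)
print(no_seq)
```

The first list corresponds roughly to the file_no_gp reports and the second to the gp_no_seq reports. The ID prefixes in a real file may need normalising before the two sets can be compared.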

[ACTION ITEM]: Chris: generate new report that would show errors that need fixing for the Orthology determination project

Annotation consistency

1) ISS outliers: Report

We were all to look at this report and see where the errors came from.
Possible reasons for discrepancies
- the target ("with") gene is not yet annotated
- the annotation of the target ("with") gene has changed
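
To make the two reasons above concrete, here is a simplified sketch of an outlier check. The actual query behind the report is not described in these minutes. This version assumes that an ISS annotation counts as an outlier when its "with" gene has no experimental annotation to exactly the same term, and it ignores the ontology graph (parent/child terms). Gene names are hypothetical and the GO IDs are used only as placeholders.

```python
# Experimental evidence codes treated as valid support (an assumption of this sketch).
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI", "IEP"}


def iss_outliers(annotations):
    """annotations: list of (gene, term, evidence, with_gene) tuples."""
    # (gene, term) pairs that are backed by experimental evidence.
    supported = {
        (gene, term) for gene, term, ev, _ in annotations if ev in EXPERIMENTAL
    }
    outliers = []
    for gene, term, ev, with_gene in annotations:
        if ev == "ISS" and (with_gene, term) not in supported:
            outliers.append((gene, term, with_gene))
    return outliers


annotations = [
    ("MGI:geneA", "GO:0000001", "IDA", None),
    ("RGD:geneB", "GO:0000001", "ISS", "MGI:geneA"),  # supported by geneA's IDA
    ("RGD:geneC", "GO:0000002", "ISS", "MGI:geneA"),  # geneA lacks this term
    ("RGD:geneD", "GO:0000001", "ISS", "MGI:geneX"),  # target gene not annotated
]
for row in iss_outliers(annotations):
    print(row)
```

Under these assumptions, geneC illustrates a changed or missing annotation on the target gene, and geneD a target gene that is not yet annotated.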

David: most are not legitimate outliers; MGI makes direct annotations to rat/mouse but those somehow are not in GOA/RGD.
Rachael: we have a similar problem, in that GOA annotations are not showing in GOA/RGD. There are other problems as well, which might relate to the set of data used by the query: annotations are there but the ISS outlier query is picking them up (even when they are displayed in AmiGO), and also when the ISS annotation has been made using a non-RefGenome species.
RGD: ISS annotations are currently made to annotations with RCA, ISS, TAS and ND evidence codes; this is to be changed so that ISS annotations are no longer made to those evidence codes.

[ACTION ITEM]: Chris will provide a date on the ISS outliers query so that we don't always review the same annotations.

2) Usage of ISS, IEA, IC : Tanya Berardini, Emily Dimmer, Pascale Gaudet, David Hill, Chris Mungall, Kimberly Van Auken were to write up recommendations

See IEA,_ISS,_IC_Usage_Discussion

[ACTION ITEM]: Mary will include IC in the graphs

Would be nice if we had a report describing when genes are 'comprehensively' curated

[ACTION ITEM]: can Mary show the date completed on the index page? Possibly - she will try

[ACTION ITEM]: Discuss at the GOC meeting whether it would be useful to add a 'comprehensively annotated' tag to all genes, either in the gene association file or in the database.

[ACTION ITEM]: (Chris) Look into loading IEAs for reference genome set into AmiGO

3) Other issues?

misused terms
Reports: IC, etc

Annotation documentation

A reminder... Reference genome group is responsible for annotation matters in general, including annotation documentation. We haven't had much time to talk about this yet; are there ideas or suggestions?
Another related issue is the annotation questions that go to reference genomes or to annotation email lists: it seems like most of them do not get resolved.
- Should we set up separate meetings to settle those?

[ACTION ITEM]: Mike will set up calls?

[ACTION ITEM]: Mike (Pascale): merge the two email lists (reference genome and annotation) into 'annotation'

- Should we copy them to the SF tracker?

Action items from previous meeting

[ACTION ITEM] all: Plan: everyone will look at the web page draft and Amelia will put it up in ~ 1 week (Nov 21) [DONE] see http://www.geneontology.org/GO.refgenome.shtml

[ACTION ITEM] (Chris) Provide guidelines and a template for the file Kara wants for orthology determination. [DONE], see Instructions for providing FASTA file

[ACTION ITEM] (all) Please send a FASTA file of amino acid sequences, and an explanation of the header lines [REJECTED]: will now use the gp2protein file

[ACTION ITEM] (all) Look at the results of the ISS outliers report and see how good/bad it is; see Report

[ACTION ITEM] (Tanya Berardini, Emily Dimmer, Pascale Gaudet, David Hill, Chris Mungall, Kimberly Van Auken): Write up recommendations for usage of ISS, IEA, IC: Report by next meeting???

Started, IEA,_ISS,_IC_Usage_Discussion

Ongoing action items

[ACTION ITEM] (Val): Provide the list of 207 genes conserved between pombe and human with no annotation/information [DONE but] pombe gene IDs were sent; we need to add them to the 'to do' spreadsheet in the same format

[ACTION ITEM] (Ruth): send the HGNC list of genes with few annotations (potential 'bleeding edge' genes) [ON HOLD] until Varsha is back

[ACTION ITEM]: David will produce some examples of the function-process links.

Not done, but Suzi and Amelia are going to try to mine these from Reactome instead. It would still be interesting to have this information, as it should help annotation consistency. Documentation is available at Function-Process_Links

[ACTION ITEM]: For orthology determination: Suzi and Karen E will generate a page where all sequences will be available [DONE???] GFF3 for most databases, see Reference_Genome_sequence_annotation. Question: should we add a link to FASTA files there as well?

[ACTION ITEM]: (Judy) contact/meet with people who have made tools for orthology determination on behalf of the GOC to see if they can help us (that possibly includes re-running the analyses using the most recent set of sequences and proper IDs)

Compara, HomoloGene, TreeFam, InParanoid, others?

[ACTION ITEM] (Judy Blake) Contact NCBI/NLM/OMIM to link to reference genome genes

[ACTION ITEM]: Kara, Stacia: run the P-POD analysis over the full ref genome data set? Need a computational pipeline with existing resources. It currently takes 3 weeks to do 8 species all vs. all. The goal was set for February 2008 to include all ref genome sets. [in progress]

[ACTION ITEM]: (developers/software group): consider the potential impact of annotating to different forms of the gene (alternatively spliced, processed, etc). For now we will document how each database deals with those:

[ACTION ITEM]: (all): provide the method you use for capturing the exact gene product being annotated on this page: Variant_annotation [almost all done]

[ACTION ITEM] (Chris, Mike, Rex): Provide Ref genome reports on a regular basis

[ACTION ITEM] (Donghui): Check which IDs TAIR needs to provide for the reports. [We need to provide the TAIR gene accession ids in the spreadsheet instead of the AGI identifiers. -Tanya]

[ACTION ITEM]: (Chris) generate reports for potential misannotations (ND annotations for completed genes, etc). [DONE], see Reference_Genome_Database_Reports. We can request different reports. What do we do now?

[ACTION ITEM]: (Pascale) generate a list of terms that often have incorrect annotations, to check for consistent use of each term. [In progress], see Misused_terms