RefGenome13Nov07 Phone Conference (Archived)
Time: 1 PM CDT, 11 AM PDT, 7 PM BST
Present
- David Hill (MGI)
- Pascale (dictyBase)
- Petra (dictyBase)
- Rex (dictyBase)
- Susan (FlyBase)
- Victoria (RGD)
- Emily (UniProt)
- Tanya (TAIR)
- Donghui (TAIR)
- Stacia (SGD)
- Mary (MGI)
- Rachael (UniProt)
- Simon (RGD)
- Kimberly (WormBase)
- Ranjana (WormBase)
- Chris Mungall
- Amelia Ireland
- Suzanna Lewis
Rotation System for selecting genes
- Comments from Tanya and Donghui:
- It was hard to pick human/E. coli genes because we only had Ensembl IDs (from InParanoid), but Pascale improved the list (it now refers to Entrez Gene ID and HGNC)
- People liked the list
- See Procedure_for_selection_of_target_genes; edit as necessary
- Next month (December) is WormBase
Web presence
Amelia (web page), Susan, Rex, Petra (content)
- Petra describes the page: Ref Gen pub draft
- Emily: might be a good idea to make those annotations available
- Rex: yes, I'm sure people would like to download all this data as a single set
- [ACTION ITEM] all: Plan: everyone will look at the draft and Amelia will put it up in ~ 1 week
Orthology determination
1) Kara started to collect sequences from all groups. Questions:
- At the meeting at Princeton, the plan was for each MOD participating in the Ref. Genome project to provide a protein set in FASTA format. Then, as a start, here at Princeton, we'd do an all vs. all BLAST, so that the same BLAST results can be given to the groups developing ortholog identification methods (e.g. InParanoid, OrthoMCL, etc.). The idea is that the results of the methods can be compared and assessed more easily if the same input sequences and BLAST results are used. It'll probably take 3-4 weeks for the computation to be done on our current system.
[ACTION ITEM] (all) Please send a fasta file of amino acid sequences, and an explanation of the header lines
- Can everyone provide the file?
NOT FINISHED
[ACTION ITEM] (Chris) Provide guidelines and a template for the file Kara wants for orthology determination.
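Before sending a protein file, each group may want to run a quick sanity check on it. The following is a minimal sketch only: the actual header layout is to be defined in the guidelines, so the assumption that the first whitespace-delimited word of each header line is the sequence ID, the accepted residue alphabet, and the sample IDs (GENE0001, GENE0002) are all hypothetical.

```python
from typing import Iterator, List, Tuple

# Residue letters accepted by this sketch (an assumption, not a project rule):
# the 20 standard amino acids plus X, B, Z, J, U, O and '*' for a stop.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYXBZJUO*")


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_records(text: str) -> List[str]:
    """Return a list of human-readable problems found in the file."""
    problems: List[str] = []
    seen = set()
    for header, seq in read_fasta(text):
        words = header.split()
        if not words:
            problems.append("record with an empty header line")
            continue
        ident = words[0]  # assumed: the ID is the first word of the header
        if ident in seen:
            problems.append(f"duplicate ID: {ident}")
        seen.add(ident)
        if not seq:
            problems.append(f"{ident}: no sequence")
        elif not set(seq.upper()) <= PROTEIN_ALPHABET:
            problems.append(f"{ident}: non-amino-acid characters")
    return problems


# Hypothetical sample with one bad sequence and one repeated ID.
sample = (
    ">GENE0001 hypothetical example protein\n"
    "MKTAYIAK\n"
    "QRQIS\n"
    ">GENE0002\n"
    "MKT1\n"
    ">GENE0001\n"
    "MAA\n"
)
for problem in check_records(sample):
    print(problem)
```

A check like this does not replace the explanation of the header lines requested above; it only catches files that would fail at the BLAST step.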
2) OMA browser
Doug, Pascale
While talking with Amos Bairoch at the biocuration meeting, Amos informed us of another orthology prediction option that at least I have not heard mentioned before.
OMA browser
This recent pub describes the resource: A Schneider, C Dessimoz and GH Gonnet (2007)
Pros:
- Already includes all the ref. genome species, and a total of 421 species in the most recent release (June '07).
- Is being maintained and updated, next release in Nov. '07
- I have communicated with Christophe Dessimoz to inquire about the use of more recent D.rerio sequences in the next release of OMA and he seemed amiable and willing to field comments and suggestions
Cons:
- At the moment it works primarily with Ensembl gene IDs, Ensembl protein IDs and GenBank protein IDs
Perhaps if we each produced a protein file of our choosing, including MOD gene IDs, he would be able to include that in the next release of OMA? He might even be willing to work on interface improvements that would help us out if we ask nicely.
Thought the group might want to look at this resource so we can discuss it at the next Ref. Genome conf. call.
For your reference, I included some raw data for the OMA group corresponding to the Human ACHE gene. ACHE has been included in the Ref. Genome set for a long time. The Ref Genome tables suggest that there are orthologs found in Mouse, Rat, Worm, Fly and zebrafish. The OMA group below only seems to include Human, mouse, zebrafish, and dicty. I didn't check if the OMA-identified ortholog corresponds to the Ref. Genome specified orthologs or look further into the discrepancies.
### OMA group 13441 ###
BRARE14851 ENSDARG00000031796; CAC19790.1 ENSDARP00000052988
HUMAN11630 ENSG00000087085; AAH26315.1; AAH36813.1; AAH94752.1 ENSP00000350037
MOUSE21200 ENSMUSG00000023328; AAA53521.1; AAH46327.1; AAK28816.1; BAC31228.1; BAC31641.1; BAC32595.1; BAE24373.1; CAA39867.1 ENSMUSP00000024099
XENTR14267 ENSXETG00000017226 ENSXETP00000037518
CANFA05639 ENSCAFG00000014054 ENSCAFP00000020717
CIOIN01178 ENSCING00000004635 ENSCINP00000009596
FUGRU18051 SINFRUG00000120974 SINFRUP00000127687
BOVIN17961 ENSBTAG00000001139; AAC64270.1; AAI23899.1 ENSBTAP00000001512
PANTR09106 ENSPTRG00000019514 ENSPTRP00000033412
DICDI06813 DICDI_4.1323 XP_638122.1
MONDO08985 ENSMODG00000004882 ENSMODP00000006010
MACMU04752 ENSMMUG00000021257 ENSMMUP00000027995
LOXAF09401 ENSLAFG00000007735 ENSLAFP00000006495
DASNO05327 ENSDNOG00000003865 ENSDNOP00000002974
GASAC21097 ENSGACG00000000728 ENSGACP00000000940
OTOGA01101 ENSOGAG00000017244 ENSOGAP00000015447
TUPGB03889 ENSTBEG00000006472 ENSTBEP00000005591
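To follow up on the discrepancy, the raw group listing can be split into entries and compared against the species we expect. This is a rough sketch under stated assumptions: it treats any token of five capital letters followed by five digits as an OMA entry ID and the five letters as the species code, which matches the data above but is not taken from OMA documentation. The codes RATNO, CAEEL and DROME for rat, worm and fly are likewise assumed, and the sample is a four-entry excerpt of the group.

```python
import re
from typing import Dict, List, Set

# Assumed pattern for OMA entry IDs, e.g. HUMAN11630 (inferred from the data).
OMA_ENTRY = re.compile(r"[A-Z]{5}\d{5}")


def parse_oma_group(raw: str) -> Dict[str, List[str]]:
    """Map each OMA entry ID to the cross-reference IDs listed after it."""
    entries: Dict[str, List[str]] = {}
    current = None
    for token in raw.split():
        token = token.rstrip(";")
        if OMA_ENTRY.fullmatch(token):
            current = token
            entries[current] = []
        elif current is not None:
            entries[current].append(token)
    return entries


def species_codes(entries: Dict[str, List[str]]) -> Set[str]:
    """Return the five-letter species prefixes of the entry IDs."""
    return {entry[:5] for entry in entries}


# Excerpt of OMA group 13441 (zebrafish, human, mouse and dicty entries).
raw = (
    "### OMA group 13441 ### "
    "BRARE14851 ENSDARG00000031796; CAC19790.1 ENSDARP00000052988 "
    "HUMAN11630 ENSG00000087085; AAH26315.1; AAH36813.1; AAH94752.1 "
    "ENSP00000350037 "
    "MOUSE21200 ENSMUSG00000023328; AAA53521.1 ENSMUSP00000024099 "
    "DICDI06813 DICDI_4.1323 XP_638122.1"
)

group = parse_oma_group(raw)

# Species the Ref. Genome tables list for ACHE; codes are assumptions.
expected = {"HUMAN", "MOUSE", "RATNO", "CAEEL", "DROME", "BRARE"}
missing = sorted(expected - species_codes(group))
print("Expected species not in the OMA group:", missing)
```

Checking whether the OMA entries are the same genes as the Ref. Genome orthologs would still need an ID mapping from Ensembl and GenBank IDs to MOD gene IDs, which this sketch does not attempt.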
Quality control: outlying ISS annotations
Doug, Chris
Doug raised an important point on the email list:
I was wondering what people think of Ref. Genome annotations by ISS that seem to be off by themselves in a graph? I see these quite often in Mary's graphs, and I always wonder how they got there. Possibilities include:
1. ISS annotations where the thing that was ISS'ed to is now gone (bad bad bad)
2. ISS to genes from a non-ref genome species
3. ISS to genes which would not be considered orthologous to the gene being annotated
Just wondering what others think of these and how we can find cases of #1, which specifically need to be updated.
- Rat has known issues of doing ISS to mouse IEAs; those will be removed
- Chris can make a report : see ISS Outliers report
- Question is how to manage that? Suzi: we'll just start by looking at the lists and see
[ACTION ITEM] (all) look at the results of the ISS Outliers report and see how good/bad it is:
- Go to the link
- Download as excel file
- look at annotations from your database
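When going through the list for your database, it may help to sort each outlier into one of the three possibilities Doug listed. The sketch below is illustrative only and is not how the ISS Outliers report is generated: the record layout, the three lookup sets, and every ID in the sample (MODA:1, MODB:7, OTHER:3, and so on) are hypothetical, and the GO IDs are placeholders.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    gene: str       # annotated gene ID
    go_id: str      # GO term ID
    evidence: str   # evidence code, e.g. "ISS" or "IDA"
    with_id: str    # gene the ISS annotation was inferred from


def classify_iss(
    annotations: Iterable[Annotation],
    current_ids: Set[str],
    ref_genome_ids: Set[str],
    orthologs: Dict[str, Set[str]],
) -> List[Tuple[Annotation, str]]:
    """Label each ISS annotation that matches one of the three cases."""
    flagged = []
    for ann in annotations:
        if ann.evidence != "ISS":
            continue
        if ann.with_id not in current_ids:
            # Case 1: the thing that was ISS'ed to is now gone.
            flagged.append((ann, "ISS target no longer exists"))
        elif ann.with_id not in ref_genome_ids:
            # Case 2: ISS to a gene from a non-ref genome species.
            flagged.append((ann, "ISS target is from a non-ref genome species"))
        elif ann.with_id not in orthologs.get(ann.gene, set()):
            # Case 3: ISS to a gene not considered orthologous.
            flagged.append((ann, "ISS target is not a listed ortholog"))
    return flagged


# Hypothetical data for illustration.
current_ids = {"MODA:1", "MODB:7", "MODB:8", "OTHER:3"}
ref_genome_ids = {"MODA:1", "MODB:7", "MODB:8"}
orthologs = {"MODA:1": {"MODB:7"}}
annotations = [
    Annotation("MODA:1", "GO:0000001", "ISS", "MODB:7"),   # fine
    Annotation("MODA:1", "GO:0000002", "ISS", "MODB:99"),  # case 1
    Annotation("MODA:1", "GO:0000003", "ISS", "OTHER:3"),  # case 2
    Annotation("MODA:1", "GO:0000004", "ISS", "MODB:8"),   # case 3
    Annotation("MODA:1", "GO:0000005", "IDA", ""),         # not ISS, ignored
]

for ann, reason in classify_iss(annotations, current_ids, ref_genome_ids, orthologs):
    print(ann.gene, ann.go_id, ann.with_id, "->", reason)
```

The cases are tested in the order given so that each annotation gets a single label; case 1 is checked first because those are the annotations that specifically need to be updated.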
Quality control: Misused terms
Pascale, David, all
- (Pascale) Started a wiki page Misused terms
Annotation of variants
We need to consider the potential impact of annotating to different forms of the gene (alternatively spliced, processed, etc). For now we will document how each database deals with those:
[ACTION ITEM]: (all): provide the method you use for capturing the exact gene product being annotated on this page: Variant_annotation
Review Action items
[ACTION ITEM] (Jim): Provide the set of conserved genes found by InParanoid that are conserved in all 12 species (660 or so); we might want to prioritize this list by ascending order of number of annotations to target unannotated genes (who can do that?) [DONE] 'Suggestions' spreadsheet (look for the "conserved Hs-Ec" sheet) Pascale also marked which genes we've already curated. Prioritization is not yet done
[ACTION ITEM] (GOA): Convert the Ensembl IDs from the human-E. coli list to UniProt IDs [DONE] (Pascale) I converted to Entrez and HGNC. The UniProt ID was not easy to work with since there was often more than one ID. The table now has human gene names and the two IDs.
[ACTION ITEM] (Val): Provide the list of 207 genes conserved between pombe and human with no annotation/information [DONE but] pombe gene IDs were sent; we need to add them to the 'to do' spreadsheet in the same format
[ACTION ITEM] (Ruth): send the HGNC list of genes with few annotations (potential 'bleeding edge' genes) [ON HOLD] until Varsha is back
[ACTION ITEM]: Emily, David: provide guidelines to submit those annotations to GOA Other taxa annotations. [DONE] see Other_taxa_annotations; also Pascale added a link to this page from our Ref Genome main page under 'Annotation'
[ACTION ITEM] Amelia (web page), Susan, Rex, Petra (content), work on web presence : Report by next meeting
[ACTION ITEM]: David will produce some examples of the function-process links.
Not done, but Suzi and Amelia are going to try to mine these from Reactome instead. It would still be interesting to have this information as it should help annotation consistency. Documentation is available Function-Process_Links
[ACTION ITEM]: For orthology determination: Suzi and Karen E will generate a page where all sequences will be available [DONE???] GFF3 for most databases Reference_Genome_sequence_annotation Question: should we add a link to FASTA files there as well?
[ACTION ITEM]: (Judy) contact/meet with people who have made tools for orthology determination on behalf of the GOC to see if they can help us (that possibly includes re-running the analyses using the most recent set of sequences and proper IDs)
- Compara, Homologene, TreeFam, InParanoid, others?
[ACTION ITEM]: Kara, Stacia: run the P-POD analysis on the full ref genome data set. Need computational pipeline with existing resources. Currently takes 3 weeks to do 8 species all v all. Goal was set for February 2008 to include all ref genome sets.
[ACTION ITEM]: (developers/software group): consider the potential impact of annotating to different forms of the gene (alternatively spliced, processed, etc). For now we will document how each database deals with those:
[ACTION ITEM]: (all): provide the method you use for capturing the exact gene product being annotated on this page: Variant_annotation
[ACTION ITEM] (Chris, Mike, Rex): Provide Ref genome reports on a regular basis
[ACTION ITEM] (Donghui): Check which IDs TAIR needs to provide for the reports. [We need to provide the TAIR gene accession ids in the spreadsheet instead of the AGI identifiers. -Tanya]
[ACTION ITEM] (Pascale, Doug): Provide guidelines for filling the google spreadsheet (IDs, where to put notes, how many orthologs per row (1), etc) [STARTED] Procedure_for_filling_Genome-Specific_spreadsheets
[ACTION ITEM]: (Chris) generate reports for potential misannotations (ND annotations for completed genes, etc). [DONE] Reference_Genome_Database_Reports We can request different reports. What do we do now?
[ACTION ITEM]: (Pascale) generate list of terms that often have incorrect annotations to check for consistent use of the term
In progress, Misused_terms
[ACTION ITEM] (Tanya Berardini, Emily Dimmer, Pascale Gaudet, David Hill, Chris Mungall, Kimberly Van Auken): Write up recommendations for usage of ISS, IEA, IC: Report by next meeting???
[ACTION ITEM] (Judy Blake) Contact NCBI/NLM/OMIM to link to reference genome genes
New Action items
[ACTION ITEM] all: Plan: everyone will look at the web page draft and Amelia will put it up in ~ 1 week (Nov 21)
[ACTION ITEM] (Chris) Provide guidelines and a template for the file Kara wants for orthology determination. Done, see Instructions for providing FASTA file
[ACTION ITEM] (all) Please send a fasta file of amino acid sequences, and an explanation of the header lines. Contact Kara if you have any questions.
[ACTION ITEM] (all) look at the results of the ISS Outliers report and see how good/bad it is:
- Go to the link
- Download as excel file
- look at annotations from your database