Difference between revisions of "Reference Genome Meeting Minutes April 2008"

From GO Wiki
(Software Update)
 
(56 intermediate revisions by 8 users not shown)
Line 1: Line 1:
 +
[[Category:Reference Genome]]
 +
[[SLC GO Reference Genome Project Meeting|Link back to SLC GO Reference Genome Project Meeting page]]
 
==April 20, 2008==
 
==April 20, 2008==
 
 
===Annotation Progress===
 
===Annotation Progress===
 
====Annotation Progress (Mike Cherry)====
 
====Annotation Progress (Mike Cherry)====
*Number of annotated genes per organism by evidence type (overall)
+
*Number of annotated genes per organism by evidence type (overall): Comparing the graphs for Sept 2007 and Apr 2008 shows that the overall size is the same, but IEA is decreasing.
**Compare graphs for Sept 2007 and Apr 2008 - overall size and size the same, but IEA decreasing  
+
[[Image:ReferenceGenomeMetrics-200805.pdf]]
 +
**Discussion:
 +
***What is effort/person?
 +
***X-axis is absolute number of genes, which doesn't reflect differences in genome size.
  
Discussion:
+
*Number of annotated genes per organism by evidence code for Reference Genome project:  the majority of genes have experimental evidence codes
What is effort/person?
+
**Discussion:
X-axis is absolute number of genes, which doesn't reflect differences in genome size.
+
***Graph needs outline that indicates "no ortholog".  This allows a comparison of the genes present or absent in the reference genome genomes.  It will also show which organisms are lagging behind.  
 
+
***Number of annotations as a metric?  would give a different view of the progress, but too variable b/c of differences in depth of knowledge in different organisms, different areas of the ontology.
*Number of annotated genes per organism by evidence code for Reference Genome project
+
***View progress between Sept 2007 and April 2008 as a % change.  Can see that everyone has doubled experimental annotations, although it doesn't show the starting number of annotations.
** majority of genes have experimental evidence codes
+
***Need to discuss which metrics we want to track and why.  Need consistent measures across groups.
 
+
***How annotations change over time lets you see whether groups are still engaged in the process.
*Discussion:
+
***Would be useful to have a display that shows how much is known about these genes.  Some of this information will come from Chris's reports.
**Graph needs outline that indicates "no ortholog".  This allows a comparison of the genes present or absent in the reference genome genomes.  It will also show which organisms are lagging behind.  
+
***Would be interesting to know the number of genes with ND.
**Number of annotations as a metric?  would give a different view of the progress, but too variable b/c of differences in depth of knowledge in different organisms, different areas of the ontology.
 
**View progress between Sept 2007 and April 2008 as a % change.  Can see that everyone has doubled experimental annotations, although it doesn't show the starting number of annotations.
 
**Need to discuss which metrics we want to track and why.  Need consistent measures across groups.
 
**How annotations change over time lets you see whether groups are still engaged in the process.
 
**Would be useful to have a display that shows how much is known about these genes.  Some of this information will come from Chris's reports.
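The point raised in the discussion, that a % change view hides the starting number of annotations, can be illustrated with a minimal sketch that reports both together. The organism names and counts below are made-up placeholders, not project data, and `percent_change` is a hypothetical helper.

```python
def percent_change(before: int, after: int) -> float:
    """Return the percentage change from `before` to `after`."""
    if before == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Hypothetical counts of experimentally annotated genes: (Sept 2007, Apr 2008).
counts = {"organism_A": (50, 100), "organism_B": (400, 800)}

# Keep the baseline next to the percentage so a doubling from 50 is not
# confused with a doubling from 400.
report = {
    name: {"start": start, "end": end, "pct_change": percent_change(start, end)}
    for name, (start, end) in counts.items()
}

for name, row in report.items():
    print(f"{name}: {row['start']} -> {row['end']} ({row['pct_change']:+.0f}%)")
```

Both placeholder organisms show the same +100%, which is exactly why the baseline column is worth keeping in the display.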
 
  
 
====Annotation Progress (Chris Mungall)====
 
====Annotation Progress (Chris Mungall)====
*Metrics:
+
*Metrics
 
 
 
**distance to leaf (shows average number for all genes)
 
**distance to leaf (shows average number for all genes)
 
*** didn't change between Jan 2006 to Sept 2007
 
*** didn't change between Jan 2006 to Sept 2007
 
*** consider breaking down by the 3 ontologies, also show % of length to leaf
 
*** consider breaking down by the 3 ontologies, also show % of length to leaf
 
 
**information content  
 
**information content  
 
*** a quality control measure
 
*** a quality control measure
 
 
**coverage (# of nodes covered per gene)
 
**coverage (# of nodes covered per gene)
***as you look at a gene in more detail it will have more coverage
+
***as there is more information about a gene it will have more coverage
 
*** can there be too much coverage?
 
*** can there be too much coverage?
 +
**publications per gene
 +
**GO terms per gene
 +
**'''For each of these metrics: what is appropriate range for each category? need a sense of the scale, perhaps express X-axis as a % rather than an absolute number.'''
 +
*Reference Genome Reports
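The distance-to-leaf and coverage metrics listed above can be illustrated on a toy graph. This is only a sketch: the term and gene names are invented, distance to leaf is taken as the shortest path from a term to any leaf, and coverage is taken as the annotated terms plus all of their ancestors. Both definitions are assumptions, since the minutes do not spell them out.

```python
from functools import lru_cache

# Toy ontology, child -> parents. Term names are hypothetical.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "a1": ["a"],
    "a2": ["a"],
    "a1x": ["a1"],
}

# Invert the mapping to parent -> children.
CHILDREN = {term: [] for term in PARENTS}
for child, parents in PARENTS.items():
    for parent in parents:
        CHILDREN[parent].append(child)


@lru_cache(maxsize=None)
def distance_to_leaf(term: str) -> int:
    """Shortest number of edges from `term` down to any leaf (assumed definition)."""
    kids = CHILDREN[term]
    if not kids:
        return 0
    return 1 + min(distance_to_leaf(kid) for kid in kids)


def covered_nodes(terms) -> set:
    """Annotated terms plus all of their ancestors (assumed definition of coverage)."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


# Hypothetical annotations: gene -> directly annotated terms.
ANNOTATIONS = {"gene1": ["a1x", "b"], "gene2": ["a"]}

for gene, terms in ANNOTATIONS.items():
    avg = sum(distance_to_leaf(t) for t in terms) / len(terms)
    print(gene, "avg distance to leaf:", avg, "coverage:", len(covered_nodes(terms)))
```

Dividing the coverage count by the total number of terms in the ontology would give the percentage scale suggested for the X-axis.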
 +
 +
====Annotation Progress: Discussion of other ideas for measuring progress====
 +
*A measure that shows progress made in curating the experimental literature for reference genes in reference genomes.  This is an aim of the grant.  Can determine number of publications annotated.
 +
*A measure of time spent (% effort) actually doing experimental annotations. Disagreements: Can't do curation w/o ontology development and vice versa. Worried about trying to parse out too much. How do you separate annotation from time spent considering how you do annotations or assessing quality of annotations?
 +
*A measure of the number of genes that have been comprehensively annotated.
 +
 +
===Annotation Pipeline, Part 1: Generation of protein sets (Suzi Lewis)===
 +
*The issue is determining a procedure to define a coherent set of orthologous proteins. For experimental annotations, want to annotate to isoforms.  But for tree building want longest protein produced from a gene.  So for ortho sets want a unique protein/gene ID for the "canonical" gene/protein.
 +
*'''Proposal from Chris Mungall for the Gene Association File''':  In column 2 will have the ID for the "canonical" gene. Add an additional column (column 17) to hold the ID for the thing that was annotated (protein/gene/transcript).  Column 17 must match column 12 (SO type).
 +
**Background for the proposal
 +
***Currently there is heterogeneity in column 2 of Gene Association files (see Annotation of alternate spliceforms): some groups have protein IDs, some groups have gene IDs, and some groups have a mixture of both.
 +
**Discussion: 
 +
***Add a column that is always for a gene. A gene is a "concept", it's a lumping term that reflects biological reality. It provides the link we want. 
 +
***Alternate proposal from Rex Chisholm: Keep column 2 as it is now, the ID for the thing that was annotated. (In a perfect world, this would be the gene product.) Keep column 12 as it is now, referring to column 2.  Add column 17 for the ID for the "canonical" gene that codes for the product that was annotated.
 +
***Have to look at how any change will affect our users. What do users expect to be in column 2?  they expect canonical ID, but it isn't always the case.
 +
**'''Decision''': Most groups in favor of the proposal of making column 2 the canonical ID and the column 17 the ID for the thing that was annotated.
 +
**What should column 12 refer to?
 +
***'''Decision''':  Column 12 should point to column 17, which means that column 17 must be filled in; it can't be left blank and inferred from column 2.
 +
**Notifying users.  Before change is implemented, should it be discussed with a few users? Need a pushout list to identify users of changes/updates. 
 +
**The header of gene association file should state this file contains annotations for x out of total number of genes estimated in this organism.
 +
*'''Proposal for the gp2protein file''': For every canonical gene ID in the (GAF)(judy:  I think this is meant to be 'for every canonical gene ID in your ''database'') there will be an associated canonical protein ID in the gp2protein file (judy: if such ID exists, field 2 can be NULL).
 +
**Background.  Still need gene to protein association.  This should be in a separate file from the gene association file.  The gp2protein file seems the logical place to have this association.
 +
**Discussion
 +
***What about those cases where gene has been annotated, but there is no known protein sequence associated with it.  Should there be a blank in the gp2protein file? or should the gp2protein file have information that the gene is "uncloned" or codes a functional RNA? Don't want to overload the file (putting non ID information in an ID column).  If needed, should make a separate file or find other ways of dealing with the blanks.  Can generate report that gives type from column 12.
 +
***If gp2protein file has only canonical protein IDs, how do you get information about other protein IDs (column 17)?
 +
***'''syntax for gp2protein file''':  Enter protein accessions as UniProtKB:xxx or NCBI:xxx
 +
 +
*'''Review'''
 +
**Gene Association File (GAF)
 +
#column 2 is canonical gene ID
 +
#column 17 is thing you are annotating (always required)
 +
#column 12 matches column 17 and contains SO ID's
 +
**gp2protein file:
 +
#includes complete gene index (except for pseudogenes and transposons)
 +
#column 1 is canonical gene ID
 +
#column 2 is accession for sequence of longest form of protein from UniProtKB: or NCBI:
 +
**ACTION ITEMS
 +
#update documentation
 +
#write notice of changes to users
 +
#individual data providers make sure that their input matches
 +
#software changes as necessary
 +
#add header to gene association file
 +
#syntax of gp2protein file will be provided by Mike and Chris
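The rules in the review above lend themselves to a simple automated check. The following is a minimal sketch, assuming both files are tab-separated and that columns are numbered from 1 as in these minutes. The function names, the `allow_blank` option, and all IDs in the examples are hypothetical placeholders. The final gp2protein syntax was still to be provided, so this only checks the `UniProtKB:` / `NCBI:` prefixes mentioned in the discussion.

```python
ALLOWED_PREFIXES = ("UniProtKB:", "NCBI:")


def check_gaf_line(line: str) -> list:
    """Return a list of problems found in one gene association file line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 17:
        return [f"expected at least 17 columns, found {len(cols)}"]
    problems = []
    if not cols[1]:
        problems.append("column 2 (canonical gene ID) is empty")
    if not cols[11]:
        problems.append("column 12 (SO type of column 17) is empty")
    if not cols[16]:
        # Decision: column 17 may not be left blank and inferred from column 2.
        problems.append("column 17 (ID of the annotated entity) is empty")
    return problems


def check_gp2protein_line(line: str, allow_blank: bool = False) -> list:
    """Return a list of problems found in one gp2protein line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 2:
        return [f"expected 2 columns, found {len(cols)}"]
    gene_id, accession = cols
    problems = []
    if not gene_id:
        problems.append("column 1 (canonical gene ID) is empty")
    if not accession:
        # Whether a blank is acceptable was left open; hence the option.
        if not allow_blank:
            problems.append("column 2 (protein accession) is empty")
    elif not accession.startswith(ALLOWED_PREFIXES):
        problems.append("column 2 must start with UniProtKB: or NCBI:")
    return problems


# Placeholder GAF line: only the three columns discussed here are filled in.
fields = [""] * 17
fields[1] = "GENE:0001"       # column 2, canonical gene ID (placeholder)
fields[11] = "SO:0000000"     # column 12, SO ID (placeholder)
fields[16] = "PROT:0001-2"    # column 17, annotated isoform (placeholder)
print(check_gaf_line("\t".join(fields)))                      # no problems
print(check_gp2protein_line("GENE:0001\tUniProtKB:P00000"))   # no problems
print(check_gp2protein_line("GENE:0001\tP00000"))             # missing prefix
```

Returning a list of messages rather than raising on the first problem lets a data provider see every mismatch in a file in one pass.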
 +
 +
===Software Update===
 +
 +
====Demo of RefGenome tracker interface (Siddhartha Basu)====
 +
*RefGenome tracker interface  (database to replace current google spreadsheet)
 +
**For programmers:
 +
#add box for taxon id to the "add target" entry box
 +
#add column with MOD id so curators can link to the MOD rather than NCBI
 +
**Time frame?
  
**publications per gene
+
====RefGenome Graphs (Mary Dolan)====
 +
Presentation: review of graphs including some new features [[Image:RefG_graphs.ppt]]
 +
*Suggestion for detailed annotation table: For ISS include the "with" information
 +
[DONE] See for example: [http://proto.informatics.jax.org/prototypes/GOgraphEX/RefGenomeGraphs/7915.html#Annotations]
 +
 
 +
====AmiGO (Seth Carbon)====
 +
*Summary table of genes in the RefGenomes List
 +
**Want feedback on summary table and visual graphical displays.  Are there other types of visual displays that people need?
 +
**AmiGO will gradually be moving to this structure.
 +
*Cross Products
 +
**Example TAZ gene, various annotations related to heart in different organisms, but couldn't see connections in the graph. Working on version of graph display to show these connections. 
 +
 
 +
====Community Annotation at GONUTS (Jim Hu)====
 +
*Demo of "Create New Gene Page"
 +
*Have a webservice that can be used to connect AmiGO and GONUTS by identifying pages in GONUTS that have been annotated by a human being
 +
*Discussion
 +
**two possible types of input:  small, individual annotations; bulk sets of predictions
 +
**could you use it for getting input on IEA annotations?
 +
**will increase input by making it easier for people to provide input
 +
**try connecting Cardio and Immunology pages to GONUTS
 +
 
 +
===Annotation Pipeline, part 2 (Suzi Lewis and Judy Blake)===
 +
*Is there consensus about the steps shown in [[Image:Ref Genome annotation pipeline2008Mar31.pdf]]?
 +
**Need to discuss how the "focal-sets" will be determined
 +
**Step V: Changed to "Curators add/remove proteins to/from the "focal set" based on dialog and agreement"
 +
**What is the purpose of the focal set?  defined at meeting in Princeton, to be able to say that these products have been experimentally annotated across the reference genomes
 +
**Rex: Agrees with procedure but wants to emphasize that it can't be written in stone.  There has to be option for future discussion changes in the procedure.  Suzi: Yes, in light of further knowledge the procedure may change and people involved have to be open-minded and leave behind their preconceived notions.
 +
**Rex: Want to capture both depth and breadth (annotations to as many genes as possible based on exptl. annotations and also ISS). 
 +
**Judy:  Inferential annotations.  how do you transform experimental annotations in one organism to inferential annotations in another organism? what measures are useful? what about large family sets?
 +
**Suzi: QC is something that happens during the entire process, not just at the end of the process. Will be useful to think about QC at each of these steps.
 +
 
 +
 
 +
===Consistency within experimental annotations (Pascale)===
 +
*In the Source Forge tracker for ref genome completion set [http://sourceforge.net/tracker/?atid=1040173&group_id=36855&func=browse], out of 12 sets we've looked at, there were 4 where there was an annotation error (POLA, APOA1, ALAS2, EIF2B2), and at least 10 new SF items opened to fix the ontology. This means that the exercise was useful, and we'll continue it as long as it remains so.
 +
*Examples of common annotation errors
 +
**transient localization vs long-term cellular localization
 +
**secreted protein annotated to secretory process
 +
**IMP evidence code used for results from high-throughput experiments
 +
**GO terms for growth, cell growth, cell cycle, cell proliferation frequently used interchangeably
 +
*Ways to prevent errors
 +
**Wiki page of commonly [[Misused_terms]]
 +
***It would help annotators if this page included more information about how the term was misused.
 +
***Where can this information be put to make it more visible? Need a way of notifying people that something was added to the list.  NOTE: annotators can sign up for email announcements.
 +
***People should be on the lookout for misused terms and add these to the list along with an explanation.
 +
**[ACTION ITEM] Develop SOPs to prevent future errors.
 +
**[ACTION ITEM for programmers] Pop-ups to warn annotators they are about to use a GO term that is commonly misused.
 +
*Ways to improve annotation (besides finding/preventing errors)
 +
*Part terms.
 +
**Annotators need to use part terms.
 +
**[ACTION ITEM] Develop SOPs for part terms to help annotators use them correctly.
 +
**Compare annotations of the gene product in your organism with the annotations in other organisms.
 +
**Amazon shopping cart model for improving annotation. Have pop-ups that say XX% of annotators who used this term, also used GO:xxxx, or annotators who used this term never used GO:xxxx.
 +
**[ACTION ITEM] Have AmiGO show co-occurring terms, similar to the function in QuickGO.
 +
*Ways of finding errors
 +
**[ACTION ITEM] There should be a systematic effort to review earlier annotations.
 +
***consensus
 +
**Suggestions
 +
***collect tricky terms, run reports, and email groups asking them to review their annotations. If annotations are correct, can drop term from the tricky list.
 +
***David: once a focal set is annotated, send one of Mary's graph to someone who has published a lot of papers on that gene/protein and ask her/him if they see anything wrong or missing.
 +
***Develop automatic checks that can be done to identify anomalies to be reviewed.
 +
** Compare annotations of the gene product in your organism with the annotations in other organisms.
 +
** Val: SLIM by SLIM matrix to review intersections of different cellular processes and look for unexpected intersections which may identify possible errors
 +
****try applying to function and component terms
 +
****outline cells that you expect to be empty
 +
****[ACTION ITEM for programmers] Can these matrices be generated automatically from the AmiGO database rather than each refgenome group doing it themselves?
 +
**spot checks have to be built into the process, need to build different ways of looking at quality control
 +
**if a GO term or set of GO terms is routinely misused, this may suggest a problem in how it is defined in the ontology.  Review definitions of these terms and make suggestions to improve them.
 +
**Amazon shopping cart model for improving annotation. Have pop-ups that say XX% of annotators who used this term, also used GO:xxxx, or annotators who used this term never used GO:xxxx.
  
**GO terms per gene
+
a) software for generating comatrix b) buddy annotation c) categories of problems, not just endless list of problems d) regular assessment
  
*General Question: what is appropriate range for each category? need a sense of the scale, perhaps express as a %
+
Also need a regular process for resolving these problems.
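The "shopping cart" suggestion and the comatrix software mentioned above both rest on counting which GO terms are used together. A minimal sketch of that counting follows, assuming annotation sets keyed by gene product (keying by annotator would work the same way). The gene names and term IDs are hypothetical, and the function names are not from any existing tool.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(annotations: dict):
    """Count single-term usage and pairwise co-usage across annotation sets."""
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        term_counts.update(terms)
        # Sort so that each unordered pair is stored under a single key.
        for a, b in combinations(sorted(terms), 2):
            pair_counts[(a, b)] += 1
    return term_counts, pair_counts


def percent_also_used(term, other, term_counts, pair_counts) -> float:
    """Of the sets that use `term`, what percentage also use `other`?"""
    if term_counts[term] == 0:
        return 0.0
    pair = tuple(sorted((term, other)))
    return 100.0 * pair_counts[pair] / term_counts[term]


# Hypothetical annotation sets.
ANNOTATIONS = {
    "gene1": {"GO:A", "GO:B"},
    "gene2": {"GO:A", "GO:B"},
    "gene3": {"GO:A", "GO:C"},
    "gene4": {"GO:B"},
}

term_counts, pair_counts = cooccurrence(ANNOTATIONS)
print(round(percent_also_used("GO:A", "GO:B", term_counts, pair_counts), 1))
print(percent_also_used("GO:B", "GO:C", term_counts, pair_counts))
```

A pair whose percentage is zero corresponds to the "never used together" message, and pairs that are expected to be empty but are not are the candidates for review in a SLIM by SLIM matrix.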
  
*Reference Genome Reports
+
===Using Textpresso (Kimberly Van Auken)===
 +
Textpresso for GO annotation key features:
 +
* search through fulltext
 +
* in addition to keyword searches, have category searches based on groups of related words
 +
Wormbase uses Textpresso
 +
* get PDFs
 +
* convert to text
 +
* marked up by Textpresso
 +
Textpresso for wormbase curation presentation.  Example from looking for P granule annotation.  Most papers were P granule mislocalization in mutant.  Need relevance markup.  Hired a student to go through papers and mark up localization based on antibody staining.  219 pubs, 1400 sentences.  Used curation form that divides sentences.  Student checked yes or no for relevance.  Compute word frequency histograms.  Single words or phrases.  Single words worked well and are more efficient than phrases.  Created Textpresso categories
 +
* Cellular component: Adherens junction, nuclei,
 +
* Verbs: localizes to, accumulates
 +
* Other: ... missed this.
 +
Metrics: precision and recall.  First generation categories 75% precision 40% recall.  Could get 80% of the known annotations, thanks to info redundancy.  Building second generation categories.
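For reference, precision and recall follow directly from counts of correct, spurious, and missed matches. The sketch below uses made-up counts chosen only so that the arithmetic reproduces figures like those quoted above; they are not the counts from the WormBase evaluation.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall) from match counts."""
    if true_pos + false_pos == 0 or true_pos + false_neg == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = true_pos / (true_pos + false_pos)  # share of returned hits that are relevant
    recall = true_pos / (true_pos + false_neg)     # share of relevant sentences that were returned
    return precision, recall


# Hypothetical counts: 30 relevant sentences returned, 10 irrelevant
# sentences returned, 45 relevant sentences missed.
precision, recall = precision_recall(true_pos=30, false_pos=10, false_neg=45)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

Low sentence-level recall can still recover most known annotations when the same fact is stated in several sentences or papers, which is the redundancy effect noted above.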
 +
Curation pipeline.
 +
*Keyword - Ce protein name
 +
*Look for match in 3 categories
 +
Returns matching sentences in documents.  Can browse sentences in context.  Sentences get a score.
 +
Interface - 3 columns: Protein, Textpresso match terms, GO terms from relationship index.
 +
*working on problem of how to identify new associations from those that have been reported before, e.g. commonly used markers
 +
*how much does this increase the efficiency of curation? can't answer right now because still testing
 +
*how does it affect your annotation?  how do you know what you're missing?
 +
*current pipeline is just for cellular component terms, but think it will be amenable to function terms, haven't thought about application to process terms
 +
*editor tool that will let you customize category terms
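The keyword-plus-category search in the curation pipeline can be sketched roughly as below. This is not Textpresso's implementation: it uses plain substring matching, only the two categories whose example terms were recorded in these notes (the third was missed), a hypothetical protein name, and an assumed score equal to the number of matched category terms.

```python
# Category terms taken from the examples in the notes; the third category
# was not recorded, so it is left out of this sketch.
CATEGORIES = {
    "cellular_component": ["adherens junction", "nuclei"],
    "verb": ["localizes to", "accumulates"],
}


def match_sentence(sentence: str, protein: str):
    """Return matched terms per category, or None if the sentence does not qualify.

    A sentence qualifies only if it mentions the protein and contains at
    least one term from every category.
    """
    text = sentence.lower()
    if protein.lower() not in text:
        return None
    hits = {}
    for category, terms in CATEGORIES.items():
        found = [term for term in terms if term in text]
        if not found:
            return None
        hits[category] = found
    return hits


def score(hits: dict) -> int:
    """Assumed scoring: total number of matched category terms."""
    return sum(len(found) for found in hits.values())


sentences = [
    "XYZ-1 localizes to nuclei in early embryos.",  # hypothetical protein name
    "XYZ-1 mutants are sterile.",
]
for sentence in sentences:
    hits = match_sentence(sentence, "XYZ-1")
    print(sentence, "->", hits, score(hits) if hits else None)
```

A real pipeline would also need the relevance markup discussed above, for example to exclude sentences describing mislocalization in mutants.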
  
====Annotation Progress: Discussion of other ideas for measuring progress====
+
===Annotation of isoforms (Harold Drabkin)===
*Measure that shows progress made in curating the experimental literature for reference genes in reference genomes.  This is an aim of the grant.  Can determine number of publications annotated.
 
*Measure of time spent (% effort) actually doing experimental annotations. Disagreements: Can't do curation w/o ontology development and vice versa. Worried about trying to parse out too much. How do you separate annotation from time spent considering how you do annotations or assessing quality of annotations?
 
*Measure of the number of genes that have been comprehensively annotated.
 
  
===Review Annotation Pipeline proposal (Suzi Lewis)===
+
Problem: representation of multiple protein forms of a gene generated by natural variation, alternate splicing, etc.
Step 1: Generation of protein sets
+
This is a problem because the isoforms may have different functions, localization, processes.
  
Step 2: Experimental Annotation
+
MGI database has place for notes that can be mined.
 +
gene product field: annotation was for a specific isoform
  
Step 3: Inferential Annotation
+
different types of information that can be captured:
 +
a) anatomy, specific cell type
 +
b) specific term in evidence code ontology, included transcript & protein ID
 +
c) specific evidence code, cell type, product ID
  
Step 4:  Quality Checks
+
File listing individual isoform annotation contains information that wouldn't show up in GO annotation b/c MGI has
  
====Step 1: Generation of protein sets (excluding functional RNAs)====
+
Challenges
**'''How to define a coherent set'''  For experimental annotations, want to annotate to isoforms. But for tree building want longest protein produced from a gene.  So for ortho sets want a unique protein/gene ID for the "canonical" gene/protein.
+
What genes have isoforms?
 +
do the isoforms have ids?
 +
do the isoforms actually exist
 +
Other: function protein domains/fragments; modifications; both of these related to a single isoform by derivation
  
***Currently there is heterogeneity in column 2 of Gene Association files (see Annotation of alternate spliceforms) 
+
Strategies for back populating
****How does UniProt deal with alternate splice forms? Most of the time, there is a 1:1 correspondence between the canonical protein ID and the gene.  Uniprot uses canonical identifier followed by -1, -2, etc to indicate isoforms. But sometimes isoforms are so different that they are given separate accessions.  In that case, what connects them? have to link out to genomic database.
+
3700 uniprot records with isoforms, look for those that have references that were used for GO
****WormBase uses a mixture of gene and protein IDs in column 2.  Which is used depends upon how the experiments were done.  Is this a problem?  Goal would be converge on one type. 
+
have 1693 markers with more than one NM_ record
****MGI uses canonical MGI IDs in column 2.
+
papers with isoform or alternative splicing in title or abstract
  
***Chris's Proposal: Use canonical ID in column 2.  Add additional column for isoforms; put multiple isoform IDs on one line.
 
Column 2:  Use canonical gene ID.  Gene Index
 
Column 17: ID for the thing that was annotated (protein/gene/transcript). Must match column 12 (SO type). 
 
  
****Discussion:
+
Example: Notch1
 +
full protein, transmembrane binds extracellular ligand
 +
intracellular domain is cleaved to give NICD, which goes to nucleus, binds RBJ-1 and functions as txn co-activator
  
Add a column that is always for a gene. A gene is a "concept", it's a lumping term that reflects biological reality. It provides the link we want. 
+
Focus:
 +
transporters SLCxyz, ABCxyz
 +
interleukins
 +
Others?
  
Rex's proposal:
+
How to represent xx?
    Column 2:  Keep as is, the ID for thing that was annotated.  (ideally would be the gene product)
+
current practice does not allow dual annotation
    Column 12: keep as it is, because it refers to column 2
+
column 17 may solve this problem
    Add Column 17: Canonical ID for the gene that codes for the product that was annotated.   
 
  
 +
===Anything else to add?===
 +
keep protein isoforms and multiple transcript problems separate
  
Have to look at how any change will affect our users. 
+
==April 21, 2008==
    What do users expect to be in column 2?  they expect canonical ID, but it isn't always the case.
+
===Orthology Sets===
 +
Judy reviewed some aspects of multiple orthology sets - a lot of different resources, and linking between them.
 +
*Mammalian orthology sets at MGI
 +
*Homologene -nice but doesn't include all reference genomes
 +
*TreeFam - maximum likelihood-based
 +
*PIR homeomorphic protein superfamilies
  
Most groups in favor of the proposal of making column 2 the canonical ID.
+
why do we need another resource (i.e., Kara's)?  the above tools are not comprehensive, and don't start with gene index.
 +
[discussion regarding these reasons: why can't the above tools incorporate our protein sets?  who do we need to contact/work with?]
 +
*we spend a lot of time determining what genes are/not in an ortholog set - is that the best use of our time?  should we use existing resources?
  
****What should column 12 refer to?
+
Judy showed some slides from Kara.
Point to 17, which means that column 17 must be filled in; it can't be left blank and inferred from column 2.)
+
*Kara had offered to address some of these concerns by doing a PPOD run specifically for the refGenome project: [http://ppod.princeton.edu/cgi-bin/ppod.cgi PPOD refGenome database]
  
****Notifying users
+
===Orthology, Paralogy, and GO Annotation===
Before change is implemented, should it be discussed with a few users?
+
Paul Thomas (SRI)
 +
See slides : [http://geneontology.org/meeting/refgenome/04-2008/pthomas-orthology_paralogy_and_GO_annotation.ppt]
 +
*goal of refGenome project is to identify genes in reference genomes that have same or similar functions, so can do comprehensive curation simultaneously
 +
*ortholog = same gene in different organisms separated only by speciation
 +
**orthologs can have different functions
 +
**paralogs can have same functions
 +
What is an ortholog cluster?
 +
*Algorithms make slices through protein trees based on some combo of evolutionary rates and history of duplications/speciation
 +
**They make arbitrary calls based on calculations; different algorithms will do this differently.
 +
**One must still investigate the experimental biology, and make educated judgments
 +
**It can still be quite useful to have comprehensive curation of related genes, even if they don't technically fall into the ortholog cluster (can be "fruitfully" annotated at same time)
 +
Tree visualization tool for Ref Genomes - in development
 +
*Pre-computed searchable library of gene trees - modifiable based on curator feedback, includes outgroups
 +
*Visualization tools - trees labeled with GO annotations
 +
*Homology annotations supported by tree evidence, available to scientific community
 +
*HMMs to allow other genome projects to infer GO annotations
 +
Now that we have all protein IDs together, Paul can have a run completed by July (parsimony-based trees)
 +
*ACTION ITEM: Paul will grab gp2protein files on May 1st and begin his run
  
Need a pushout list to identify users of changes/updates.
+
===Moving forward===
 +
We will prioritize genes that are present in Kara's data in all 12 reference genomes - there are 153 of these orthogroups
 +
*we will do 20 genes/month starting at top of alphabetical list (by human gene name)
 +
*we will place more focus on experimental literature and less focus on inferential annotations
 +
We will continue to do QC as one/month/curator (at most, if it takes longer that's ok, must balance with other work tasks)
 +
*using refGenome [http://sourceforge.net/tracker/?func=browse&group_id=36855&atid=1040173 sourceforge tracker]
  
 +
==Review old action items==
 +
'''STILL TO DO'''
 +
*DOCUMENTATION:
 +
[ACTION ITEM](Documentation working group!): Document in SOPs
 +
Another factor we have been tracking is when a curator judges that the curation of a gene is ‘comprehensive’, that is, that it accurately represents the biology (irrespective of the number of papers available or read). This appears in the spreadsheets. The guideline is that when there are few papers, all papers should be read; when there are many (a curator can judge what is too many), then a review should be read to find the important primary literature and decide what information needs to be captured. We don’t keep track of whether or not reviews have been read. WormBase uses Textpresso (PMID 15383839), which helps ensure that curators do not overlook information. The ‘comprehensive’ curation status doesn’t get invalidated when a newer paper is published; however, curators may (and are encouraged to) update the date when the newer literature is curated.
  
***Still need gene to protein associations
+
[ACTION ITEM] (Pascale Gaudet, all) - paper -Topics: talk about goals, process, curation priority, ortholog finding, interesting biology (outliers reflect mistakes in annotations or interesting differences in the biology), benefits (improve annotation consistency and ontology quality)
    right now it is a free-floating column 18
 
  
gene association file should be gene association file
+
[ACTION ITEM] ???(Mike Cherry, who else): Organize a reference genome annotation camp, possibly in spring or summer of 2008.
  
gp2protein file should be separate
+
[ACTION ITEM] (Judy Blake) Contact NCBI/NLM/OMIM to link to reference genome genes
  
Proposal:  The header of gene association file should state this file contains annotations for x out of total number of genes estimated in this organism.
+
----
 +
'''In progress'''
  
gp2protein file : For every canonical gene ID there will be an associated canonical protein ID.
+
Graphs (Metrics):
  
What about those cases where gene has been annotated, but there is no known protein sequence associated with it.  Leave blank? or explicitly state "uncloned?"
+
[ACTION ITEM]  '''In progress '''Re-calculate with is_a only paths (Chris)
  
state that no protein has been identified for gene that was identified
+
[ACTION ITEM]''' In progress''', Re-calculate with experimental codes only; generate several versions of the data classified by different evidence codes?
split out functional RNAs that have been identified
 
  
gp2protein:
+
[ACTION ITEM]  '''In progress''' (Chris) Provide such reports on a regular basis
123 AA sequence Accession (UniprotKB:xxx or NCBI:xxx)
 
456 RNA
 
789 uncloned
 
  
Don't want to overload the file (putting non ID information in an ID column).  If needed, should make a separate file or find other ways of dealing with the blanks.  Can generate report that gives type from column 12.
 
  
If gp2protein file has only canonical protein IDs, how do you get information about other protein IDs (column 17)?
+
[ACTION ITEM] (Software group) '''In progress''': Continue working on development of the curation tool
  
review:
+
[ACTION ITEM] (Mary Dolan, Software group) '''In progress''' Continue working on the best way to display annotations graphically.
GAF column 2 is canonical gene ID
 
    column 17 is thing you are annotating (always required)
 
    column 12 matches column 17 and contains SO ID's
 
  
gp2protein file:
 
1) includes complete gene index (except for pseudogenes and transposons)
 
  column 1 is canonical gene ID
 
  column 2 is accession for sequence of longest form of protein from UniProtKB: or NCBI:
 
  
Action items:
+
----
 +
'''DONE:'''
  
1) update documentation
 
  
2) write notice of changes to users
+
[ACTION ITEM]: '''Done'''.  (everybody): We will add categories of genes to annotate in addition to ‘disease genes’. We will choose five genes from each of the following four groups: 1. diseases 2. biochemical/ signaling pathways 3. bleeding edge list:  4. conserved genes/unannotated genes.  This will be done on a rotation basis from all databases.
  
3) individual data providers make sure that their input matches
+
'''THREE NEXT ONES ARE TAKEN CARE OF WITH THE NEW ORTHOLOGS LIST'''
  
4) software changes as necessary
+
[ACTION ITEM] (Val): Provide the list of 207 genes conserved between pombe and human with no annotation/information
  
5) add header to gene association file
+
[ACTION ITEM] (Jim): Provide the set of conserved genes found by InParanoid that are conserved in all 12 species (660 or so); we might want to prioritize this list by ascending order of number of annotations to target unannotated genes (who can do that?) DONE, see 'Suggestions' spreadsheet (look for the "conserved Hs-Ec" sheet)
  
6) syntax of file will be provided by Mike and Chris
+
[ACTION ITEM] (Ruth): send the HGNC list of genes with few annotations
  
===Software Update===
 
  
====Software demo - Reference Genome DB extension of GOdb (Siddhartha Basu)====
+
[ACTION ITEM] '''Done''' Amelia (web page), Susan, Rex, Petra (content), work on web presence
*RefGenome tracker interface (database to replace current google spreadsheet)
 
    Browse: List Target, List Target (from db), List Ortholog
 
    Search: search by id, name, target, taxon
 
    Report:
 
    Curation: add target, import spreadsheet (interim feature)
 
  
**For programmers:
+
[ACTION ITEM] '''In progress ''': contact/meet with people who have made tools for orthology determination to see if they can help us (that possibly includes re-running the analyses using the most recent set of sequences and proper IDs); Compara: Emily? Homologene: Judy? TreeFam? InParanoid? Others?
#add taxon to "add target" box
 
#add column with MOD id so curators can link to the MOD rather than NCBI
 
  
**Time frame?
+
[ACTION ITEM]: '''In progress''', Kara: run the P-POD over the full ref genomes set? analysis on the ref genome data set. Need computational pipeline with existing resources. Currently takes 3 weeks to do 8 species all v all. Goal was set for February 2008 to include all ref genome sets.
  
  
====RefGenome Graphs (Mary Dolan)====
+
[ACTION ITEM] '''Done''' (Chris): pull out ND annotations and report to each group, see Reference_Genome_Database_Reports
*Comparison matrix of GO terms across organism
 
entry indicates that ortholog exists for this gene
 
colored entries indicate experimental annotation
 
parentheses indicate ISS annotation only
 
no ortholog "X"
 
*Graph
 
*Look at annotations directly
 
*PPOD graphs
 
*compare PPOD clusters with MOD calls
 
**Suggestion: For ISS include the "with" information
 
  
====AmiGO (Seth Carbon)====
 
  
 +
[ACTION ITEM] '''Rejected''',  (Pascale Gaudet): Add to RefGenome curation practices SOPs: please enter your unique gene_id in the google spreadsheet (Makes it easier to parse)
  
 +
[ACTION ITEM]: '''Rejected ''' (Pascale Gaudet) Generally, provide guidelines for filling the google spreadsheet (IDs, where to put notes, etc)
  
====xx (Jim Hu)====
+
Counting papers, assessing completeness/comprehensive annotation status

Latest revision as of 07:25, 14 July 2014

Link back to SLC GO Reference Genome Project Meeting page

April 20, 2008

Annotation Progress

Annotation Progress (Mike Cherry)

  • Number of annotated genes per organism by evidence type (overall): Comparing the graphs for Sept 2007 and Apr 2008 shows that the overall size is the same, but IEA is decreasing.

File:ReferenceGenomeMetrics-200805.pdf

    • Discussion:
      • What is effort/person?
      • X-axis is absolute number of genes, which doesn't reflect differences in genome size.
  • Number of annotated genes per organism by evidence code for Reference Genome project: the majority of genes have experimental evidence codes
    • Discussion:
      • Graph needs outline that indicates "no ortholog". This allows a comparison of the genes present or absent in the reference genomes. It will also show which organisms are lagging behind.
      • Number of annotations as a metric? It would give a different view of the progress, but is too variable because of differences in depth of knowledge in different organisms and in different areas of the ontology.
      • View progress between Sept 2007 and April 2008 as a % change. Can see that everyone has doubled experimental annotations, although it doesn't show the starting number of annotations.
      • Need to discuss which metrics we want to track and why. Need consistent measures across groups.
      • How annotations change over time lets you see whether groups are still engaged in the process.
      • Would be useful to have a display that shows how much is known about these genes. Some of this information will come from Chris's reports.
      • Would be interesting to know the number of genes with ND.
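
The two normalizations raised in this discussion (percent change alongside the starting count, and counts relative to genome size) can be sketched as follows. This is a minimal illustration only: the organism labels and all numbers are hypothetical placeholders, not project data.

```python
# Hypothetical snapshots: organism -> (experimentally annotated genes, total genes).
# All numbers are illustrative placeholders, not Reference Genome project figures.
sept_2007 = {"org_A": (100, 5000), "org_B": (400, 20000)}
apr_2008 = {"org_A": (200, 5000), "org_B": (600, 20000)}


def percent_change(before, after):
    """Relative change between two counts, as a percentage."""
    if before == 0:
        raise ValueError("percent change is undefined for a starting count of 0")
    return 100.0 * (after - before) / before


def progress_report(old, new):
    """Per organism: starting count, % change, and fraction of the genome annotated."""
    report = {}
    for organism, (old_count, _) in old.items():
        new_count, genome_size = new[organism]
        report[organism] = {
            "start": old_count,
            "pct_change": percent_change(old_count, new_count),
            "fraction_of_genome": new_count / genome_size,
        }
    return report


report = progress_report(sept_2007, apr_2008)
for organism, row in sorted(report.items()):
    print(organism, row)
```

Reporting the starting count next to the percent change addresses the point that a doubling looks the same whether a group started from 10 annotations or 1000.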

Annotation Progress (Chris Mungall)

  • Metrics
    • distance to leaf (shows average number for all genes)
      • didn't change between Jan 2006 and Sept 2007
      • consider breaking down by the 3 ontologies, also show % of length to leaf
    • information content
      • a quality control measure
    • coverage (# of nodes covered per gene)
      • as there is more information about a gene it will have more coverage
      • can there be too much coverage?
    • publications per gene
    • GO terms per gene
    • For each of these metrics: what is the appropriate range for each category? Need a sense of the scale; perhaps express the X-axis as a % rather than an absolute number.
  • Reference Genome Reports
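
As a rough illustration of three of the metrics listed above, here is a sketch on a toy ontology. The term names, the annotations, and the exact definitions (shortest path for distance to leaf, annotation frequency for information content) are assumptions made for the example; the minutes do not record how the actual reports compute them.

```python
import math

# Toy ontology, child -> parents. Term names are hypothetical.
PARENTS = {"root": [], "a": ["root"], "b": ["root"], "a1": ["a"], "a2": ["a"]}
# Toy annotations, gene -> directly annotated terms.
ANNOTATIONS = {"g1": {"a1"}, "g2": {"a"}, "g3": {"b"}, "g4": {"a2", "b"}}

CHILDREN = {term: [] for term in PARENTS}
for child, parents in PARENTS.items():
    for parent in parents:
        CHILDREN[parent].append(child)


def ancestors(term):
    """The term itself plus every term reachable by following parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def distance_to_leaf(term):
    """Fewest edges from the term down to a term with no children."""
    if not CHILDREN[term]:
        return 0
    return 1 + min(distance_to_leaf(child) for child in CHILDREN[term])


def mean_distance_to_leaf():
    """Average distance to leaf over every (gene, term) annotation."""
    distances = [distance_to_leaf(t) for terms in ANNOTATIONS.values() for t in terms]
    return sum(distances) / len(distances)


def coverage(gene):
    """Number of ontology nodes covered by a gene's annotations and their ancestors."""
    covered = set()
    for term in ANNOTATIONS[gene]:
        covered |= ancestors(term)
    return len(covered)


def information_content(term):
    """-log2 of the fraction of genes annotated to the term or to a descendant."""
    hits = sum(
        1 for terms in ANNOTATIONS.values()
        if any(term in ancestors(t) for t in terms)
    )
    return -math.log2(hits / len(ANNOTATIONS))
```

Splitting `ANNOTATIONS` by ontology before calling these functions would give the per-ontology breakdown suggested above.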

Annotation Progress: Discussion of other ideas for measuring progress

  • A measure that shows progress made in curating the experimental literature for reference genes in reference genomes. This is an aim of the grant. Can determine number of publications annotated.
  • A measure of time spent (% effort) actually doing experimental annotations. Disagreements: Can't do curation w/o ontology development and vice versa. Worried about trying to parse out too much. How do you separate annotation from time spent considering how you do annotations or assessing quality of annotations?
  • A measure of the number of genes that have been comprehensively annotated.

Annotation Pipeline, Part 1: Generation of protein sets (Suzi Lewis)

  • The issue is determining a procedure to define a coherent set of orthologous proteins. For experimental annotations, want to annotate to isoforms. But for tree building want longest protein produced from a gene. So for ortho sets want a unique protein/gene ID for the "canonical" gene/protein.
  • Proposal from Chris Mungall for the Gene Association File: In column 2 will have the ID for the "canonical" gene. Add an additional column (column 17) to hold the ID for the thing that was annotated (protein/gene/transcript). Column 17 must match column 12 (SO type).
    • Background for the proposal
      • Currently there is heterogeneity in column 2 of Gene Association files (see Annotation of alternate spliceforms): some groups have protein IDs, some groups have gene IDs, some groups have a mixture of both.
    • Discussion:
      • Add a column that is always for a gene. A gene is a "concept", it's a lumping term that reflects biological reality. It provides the link we want.
      • Alternate proposal from Rex Chisholm: Keep column 2 as it is now, the ID for the thing that was annotated. (In a perfect world, this would be the gene product.) Keep column 12 as it is now, referring to column 2. Add column 17 for the ID for the "canonical" gene that codes for the product that was annotated.
      • Have to look at how any change will affect our users. What do users expect to be in column 2? They expect a canonical ID, but that isn't always the case.
    • Decision: Most groups in favor of the proposal of making column 2 the canonical ID and column 17 the ID for the thing that was annotated.
    • What should column 12 refer to?
      • Decision: Column 12 should point to column 17, which means that column 17 must be filled in; it can't be left blank and inferred from column 2.
    • Notifying users. Before change is implemented, should it be discussed with a few users? Need a pushout list to notify users of changes/updates.
    • The header of gene association file should state this file contains annotations for x out of total number of genes estimated in this organism.
  • Proposal for the gp2protein file: For every canonical gene ID in the (GAF) (judy: I think this is meant to be 'for every canonical gene ID in your database') there will be an associated canonical protein ID in the gp2protein file (judy: if such ID exists, field 2 can be NULL).
    • Background. Still need gene to protein association. This should be in a separate file from the gene association file. The gp2protein file seems the logical place to have this association.
    • Discussion
      • What about those cases where gene has been annotated, but there is no known protein sequence associated with it. Should there be a blank in the gp2protein file? or should the gp2protein file have information that the gene is "uncloned" or codes a functional RNA? Don't want to overload the file (putting non ID information in an ID column). If needed, should make a separate file or find other ways of dealing with the blanks. Can generate report that gives type from column 12.
      • If gp2protein file has only canonical protein IDs, how do you get information about other protein IDs (column 17)?
      • syntax for gp2protein file: Enter protein accessions as UniProtKB:xxx or NCBI:xxx
  • Review
    • Gene Association File (GAF)
  1. column 2 is canonical gene ID
  2. column 17 is thing you are annotating (always required)
  3. column 12 matches column 17 and contains SO IDs
    • gp2protein file:
  1. includes complete gene index (except for pseudogenes and transposons)
  2. column 1 is canonical gene ID
  3. column 2 is accession for sequence of longest form of protein from UniProtKB: or NCBI:
    • ACTION ITEMS
  1. update documentation
  2. write notice of changes to users
  3. individual data providers make sure that their input matches
  4. software changes as necessary
  5. add header to gene association file
  6. syntax of gp2protein file will be provided by Mike and Chris
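
For data providers checking that their input matches, the rules in this review could be tested with something like the sketch below. It assumes tab-delimited files and checks only what the minutes state (columns 2, 12 and 17 of the GAF, and the `UniProtKB:`/`NCBI:` accession prefixes in gp2protein). The function names are hypothetical, and the final file syntax was still to be provided by Mike and Chris.

```python
import re

# Accession prefixes named in the minutes for the gp2protein file.
PROTEIN_ACCESSION = re.compile(r"^(UniProtKB|NCBI):\S+$")


def check_gaf_row(line):
    """Return a list of problems for one tab-delimited, 17-column annotation row."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 17:
        return [f"expected 17 columns, found {len(columns)}"]
    problems = []
    if not columns[1]:
        problems.append("column 2 (canonical gene ID) is empty")
    if not columns[16]:
        problems.append("column 17 (annotated entity ID) is empty")
    if not columns[11]:
        problems.append("column 12 (SO type of column 17) is empty")
    return problems


def parse_gp2protein_line(line):
    """Split a gp2protein line into (canonical gene ID, protein accession or None)."""
    fields = line.rstrip("\n").split("\t")
    gene_id = fields[0]
    # Field 2 may be empty when no protein ID exists for the gene.
    accession = fields[1] if len(fields) > 1 and fields[1] else None
    if accession is not None and not PROTEIN_ACCESSION.match(accession):
        raise ValueError(f"unexpected accession syntax: {accession!r}")
    return gene_id, accession
```

For example, `parse_gp2protein_line("GENE:0001\tUniProtKB:P00000")` (placeholder IDs) returns the gene ID and the accession, while a line with an empty second field returns `None` for the accession.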

Software Update

Demo of RefGenome tracker interface (Siddhartha Basu)

  • RefGenome tracker interface (database to replace current google spreadsheet)
    • For programmers:
  1. add box for taxon id to the "add target" entry box
  2. add column with MOD id so curators can link to the MOD rather than NCBI
    • Time frame?

RefGenome Graphs (Mary Dolan)

Presentation: review of graphs including some new features File:RefG graphs.ppt

  • Suggestion for detailed annotation table: For ISS include the "with" information

[DONE] See for example: [1]

AmiGO (Seth Carbon)

  • Summary table of genes in the RefGenomes List
    • Want feedback on summary table and visual graphical displays. Are there other types of visual displays that people need?
    • AmiGO will gradually be moving to this structure.
  • Cross Products
    • Example TAZ gene, various annotations related to heart in different organisms, but couldn't see connections in the graph. Working on version of graph display to show these connections.

Community Annotation at GONUTS (Jim Hu)

  • Demo of "Create New Gene Page"
  • Have a webservice that can be used to connect AmiGO and GONUTS by identifying pages in GONUTS that have been annotated by a human being
  • Discussion
    • two possible types of input: small, individual annotations; bulk sets of predictions
    • could you use it for getting input on IEA annotations?
    • will increase input by making it easier for people to provide input
    • try connecting Cardio and Immunology pages to GONUTS

Annotation Pipeline, part 2 (Suzi Lewis and Judy Blake)

  • Is there consensus about the steps shown in File:Ref Genome annotation pipeline2008Mar31.pdf?
    • Need to discuss how the "focal-sets" will be determined
    • Step V: Changed to "Curators add/remove proteins to/from the "focal set" based on dialog and agreement"
    • What is the purpose of the focal set? defined at meeting in Princeton, to be able to say that these products have been experimentally annotated across the reference genomes
    • Rex: Agrees with procedure but wants to emphasize that it can't be written in stone. There has to be an option for future discussion of changes in the procedure. Suzi: Yes, in light of further knowledge the procedure may change and people involved have to be open-minded and leave behind their preconceived notions.
    • Rex: Want to capture both depth and breadth (annotations to as many genes as possible based on exptl. annotations and also ISS).
    • Judy: Inferential annotations. how do you transform experimental annotations in one organism to inferential annotations in another organism? what measures are useful? what about large family sets?
    • Suzi: QC is something that happens during the entire process, not just at the end of the process. Will be useful to think about QC at each of these steps.


Consistency within experimental annotations (Pascale)

  • In the Source Forge tracker for ref genome completion set [2], out of 12 sets we've looked at, there were 4 where there was an annotation error (POLA, APOA1, ALAS2, EIF2B2), and at least 10 new SF items were opened to fix the ontology. This means that the exercise was useful, and we'll continue it as long as it remains so.
  • Examples of common annotation errors
    • transient localization vs long-term cellular localization
    • secreted protein annotated to secretory process
    • IMP evidence code used for results from high-throughput experiments
    • GO terms for growth, cell growth, cell cycle, cell proliferation frequently used interchangeably
  • Ways to prevent errors
    • Wiki page of commonly Misused_terms
      • It would help annotators if this page included more information about how the term was misused.
      • Where can this information be put to make it more visible? Need a way of notifying people that something was added to the list. NOTE: annotators can sign up for email announcements.
      • People should be on the lookout for misused terms and add these to the list along with an explanation.
    • [ACTION ITEM] Develop SOPs to prevent future errors.
    • [ACTION ITEM for programmers] Pop-ups to warn annotators they are about to use a GO term that is commonly misused.
  • Ways to improve annotation (besides finding/preventing errors)
  • Part terms.
    • Annotators need to use part terms.
    • [ACTION ITEM] Develop SOPs for part terms to help annotators use them correctly.
    • Compare annotations of the gene product in your organism with the annotations in other organisms.
    • Amazon shopping cart model for improving annotation. Have pop-ups that say XX% of annotators who used this term, also used GO:xxxx, or annotators who used this term never used GO:xxxx.
    • [ACTION ITEM] Have AmiGO show co-occurring terms, similar to the function in QuickGO.
  • Ways of finding errors
    • [ACTION ITEM] There should be a systematic effort to review earlier annotations.
      • consensus
    • Suggestions
      • collect tricky terms, run reports, and email groups asking them to review their annotations. If annotations are correct, can drop term from the tricky list.
      • David: once a focal set is annotated, send one of Mary's graphs to someone who has published a lot of papers on that gene/protein and ask her/him if they see anything wrong or missing.
      • Develop automatic checks that can be done to identify anomalies to be reviewed.
    • Compare annotations of the gene product in your organism with the annotations in other organisms.
    • Val: SLIM by SLIM matrix to review intersections of different cellular processes and look for unexpected intersections which may identify possible errors
        • try applying to function and component terms
        • outline cells that you expect to be empty
        • [ACTION ITEM for programmers] Can these matrices be generated automatically from the AmiGO database rather than each refgenome group doing it themselves?
    • spot checks have to be built into the process, need to build different ways of looking at quality control
    • if a GO term or set of GO terms is routinely misused, this may suggest a problem in how it is defined in the ontology. Review definitions of these terms and make suggestions to improve them.
    • Amazon shopping cart model for improving annotation. Have pop-ups that say XX% of annotators who used this term, also used GO:xxxx, or annotators who used this term never used GO:xxxx.
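
The "shopping cart" statistics could be derived from term co-occurrence counts. The sketch below counts co-occurrence per gene product rather than per annotator, which is a simplifying assumption, and the term IDs are placeholders rather than real GO IDs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical gene product -> GO term annotations (placeholder IDs).
ANNOTATIONS = {
    "gp1": {"GO:A", "GO:B"},
    "gp2": {"GO:A", "GO:B", "GO:C"},
    "gp3": {"GO:A"},
    "gp4": {"GO:C"},
}


def cooccurrence_counts(annotations):
    """Count how many gene products carry each term and each unordered term pair."""
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        term_counts.update(terms)
        pair_counts.update(combinations(sorted(terms), 2))
    return term_counts, pair_counts


def also_used(term, other, term_counts, pair_counts):
    """Percentage of gene products annotated to `term` that also carry `other`."""
    pair = tuple(sorted((term, other)))
    return 100.0 * pair_counts[pair] / term_counts[term]


term_counts, pair_counts = cooccurrence_counts(ANNOTATIONS)
print(also_used("GO:B", "GO:A", term_counts, pair_counts))
```

A result of 0 for a pair corresponds to the "never used" message; `also_used` raises `ZeroDivisionError` if `term` itself has no annotations, so a caller would need to guard against that.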

a) software for generating comatrix b) buddy annotation c) categories of problems, not just endless list of problems d) regular assessment
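
A minimal sketch of what the comatrix software in (a) might do, following Val's SLIM by SLIM suggestion: build the term-by-term intersection matrix and flag cells that were expected to be empty. The slim terms, gene products and the "expected empty" pair are invented for illustration.

```python
# Hypothetical mapping of gene products to GO slim process terms.
SLIM_ANNOTATIONS = {
    "gp1": {"cell cycle", "transport"},
    "gp2": {"cell cycle"},
    "gp3": {"transport", "translation"},
}
# Term pairs a curator expects never to intersect (an assumption for illustration).
EXPECTED_EMPTY = {frozenset({"transport", "translation"})}


def slim_matrix(annotations):
    """Map each pair of slim terms to the gene products annotated to both."""
    terms = sorted(set().union(*annotations.values()))
    matrix = {}
    for row in terms:
        for col in terms:
            matrix[(row, col)] = sorted(
                gp for gp, slims in annotations.items()
                if row in slims and col in slims
            )
    return terms, matrix


def unexpected_intersections(matrix, expected_empty):
    """Return non-empty cells that were expected to be empty, once per pair."""
    flagged = {}
    for (row, col), gene_products in matrix.items():
        if row < col and frozenset({row, col}) in expected_empty and gene_products:
            flagged[(row, col)] = gene_products
    return flagged


terms, matrix = slim_matrix(SLIM_ANNOTATIONS)
print(unexpected_intersections(matrix, EXPECTED_EMPTY))
```

Each flagged cell lists the gene products to review; whether the annotation or the expectation is wrong remains a curator's call.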

also need regular process for resolving these problems:

Using Textpresso (Kimberly Van Auken)

Textpresso for GO annotation key features:

  • search through fulltext
  • in addition to keyword searches, have category searches based on groups of related words

Wormbase uses Textpresso

  • get PDFs
  • convert to text
  • marked up by Textpresso

Presentation: Textpresso for WormBase curation. Example: looking for P granule annotations. Most papers described P granule mislocalization in mutants, so relevance markup was needed. A student was hired to go through papers and mark up localization based on antibody staining: 219 publications, 1400 sentences. A curation form divided the text into sentences, and the student checked yes or no for relevance. Word frequency histograms were then computed for single words and for phrases; single words worked well and are more efficient than phrases. Created Textpresso categories:

  • Cellular component: Adherens junction, nuclei,
  • Verbs: localizes to, accumulates
  • Other: ... missed this.
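
The word-frequency step that led to these categories might look like the sketch below. The sentences and the selection rule (words seen only in relevant sentences) are assumptions for illustration; the minutes do not record how candidate words were actually chosen from the histograms.

```python
import re
from collections import Counter

# Hypothetical labeled sentences: (text, marked relevant by the curator?).
LABELED = [
    ("Protein X localizes to P granules.", True),
    ("Protein Y accumulates in nuclei.", True),
    ("P granules are mislocalized in the mutant.", False),
]


def tokenize(sentence):
    """Lower-case single-word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def word_histograms(labeled_sentences):
    """Separate word counts for relevant and non-relevant sentences."""
    relevant, other = Counter(), Counter()
    for text, is_relevant in labeled_sentences:
        (relevant if is_relevant else other).update(tokenize(text))
    return relevant, other


def candidate_words(relevant, other):
    """Words seen only in relevant sentences, most frequent first."""
    only_relevant = {w: n for w, n in relevant.items() if w not in other}
    return sorted(only_relevant, key=lambda w: (-only_relevant[w], w))


relevant, other = word_histograms(LABELED)
print(candidate_words(relevant, other))
```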

Metrics: precision and recall. First generation categories 75% precision 40% recall. Could get 80% of the known annotations, thanks to info redundancy. Building second generation categories. Curation pipeline.

  • Keyword - Ce protein name
  • Look for match in 3 categories

Returns matching sentences in documents. Can browse sentences in context. Sentences get a score. Interface - 3 columns: Protein, Textpresso match terms, GO terms from relationship index.
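
The matching step can be sketched as follows. The qualifying rule (protein name plus at least one term from every category) and the scoring (number of category terms matched) are assumptions; the minutes do not say how Textpresso scores sentences. Only the category terms recorded above are used, and the protein name in the usage note is a placeholder.

```python
# Category word lists; only the terms recorded in the minutes are used here.
CATEGORIES = {
    "cellular component": ["adherens junction", "nuclei"],
    "verb": ["localizes to", "accumulates"],
}


def match_sentence(sentence, protein, categories):
    """Return matched terms per category, or None if the sentence does not qualify.

    A sentence qualifies when it mentions the protein name and at least one
    term from every category (an assumed rule for this sketch).
    """
    text = sentence.lower()
    if protein.lower() not in text:
        return None
    matches = {}
    for name, terms in categories.items():
        found = [term for term in terms if term in text]
        if not found:
            return None
        matches[name] = found
    return matches


def score(matches):
    """Assumed scoring: the total number of category terms matched."""
    return sum(len(found) for found in matches.values())
```

Calling `match_sentence("ABC-1 localizes to nuclei.", "ABC-1", CATEGORIES)` returns one matched term in each category. Matching here is plain substring search, so it will also hit terms embedded in longer words.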

  • working on problem of how to distinguish new associations from those that have been reported before, e.g. commonly used markers
  • how much does this increase the efficiency of curation? can't answer right now because still testing
  • how does it affect your annotation? how do you know what you're missing?
  • current pipeline is just for cellular component terms, but think it will be amenable to function terms, haven't thought about application to process terms
  • editor tool that will let you customize category terms
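
The precision and recall figures quoted for the first-generation categories, and the observation that redundancy lets a larger share of known annotations be recovered, can be illustrated with a small constructed example. The sentence IDs below are invented and chosen so the outputs mirror the reported 75%, 40% and 80%; they are not the actual evaluation data.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a set of retrieved items against a gold standard."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def annotation_recall(retrieved, sentence_to_annotation):
    """Fraction of annotations supported by at least one retrieved sentence."""
    all_annotations = set(sentence_to_annotation.values())
    found = {sentence_to_annotation[s] for s in retrieved if s in sentence_to_annotation}
    return len(found) / len(all_annotations) if all_annotations else 0.0


# Constructed example: 15 relevant sentences, three per annotation (5 annotations).
gold_sentences = set(range(15))
sentence_to_annotation = {s: s // 3 for s in gold_sentences}
# Eight sentences returned: six relevant, two (IDs 100 and 101) not relevant.
returned = {0, 1, 3, 4, 6, 9, 100, 101}

precision, recall = precision_recall(returned, gold_sentences)
print(precision, recall, annotation_recall(returned, sentence_to_annotation))
```

Because several sentences can support the same annotation, missing most of the sentences does not necessarily mean missing most of the annotations.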

Annotation of isoforms (Harold Drabkin)

Problem: representation of multiple protein forms of a gene generated by natural variation, alternative splicing, etc. This is a problem because the isoforms may have different functions, localization, processes.

MGI database has place for notes that can be mined. gene product field: annotation was for a specific isoform

different types of information that can be captured: a) anatomy, specific cell type b) specific term in evidence code ontology, included transcript & protein ID c) specific evidence code, cell type, product ID

File listing individual isoform annotation contains information that wouldn't show up in GO annotation b/c MGI has

Challenges:

  • What genes have isoforms?
  • Do the isoforms have IDs?
  • Do the isoforms actually exist?
  • Other: function protein domains/fragments; modifications; both of these related to a single isoform by derivation

Strategies for back populating:

  • 3700 UniProt records with isoforms: look for those that have references that were used for GO
  • have 1693 markers with more than one NM_ record
  • papers with isoform or alternative splicing in title or abstract
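
Finding markers with more than one NM_ record is a simple grouping task. A sketch, assuming the input is a list of (marker, RefSeq accession) pairs; the marker names and accessions are placeholders, and how MGI actually stores these links is not described in the minutes.

```python
from collections import defaultdict

# Hypothetical (marker, RefSeq accession) pairs; IDs are placeholders.
RECORDS = [
    ("marker1", "NM_000001"),
    ("marker1", "NM_000002"),
    ("marker2", "NM_000003"),
    ("marker2", "NR_000004"),
    ("marker3", "NM_000005"),
    ("marker3", "NM_000005"),
]


def markers_with_multiple_transcripts(records, prefix="NM_"):
    """Markers linked to more than one distinct accession with the given prefix."""
    by_marker = defaultdict(set)
    for marker, accession in records:
        if accession.startswith(prefix):
            by_marker[marker].add(accession)
    return sorted(m for m, accs in by_marker.items() if len(accs) > 1)


print(markers_with_multiple_transcripts(RECORDS))
```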


Example: Notch1

  • full protein: transmembrane, binds extracellular ligand
  • intracellular domain is cleaved to give NICD, which goes to nucleus, binds RBJ-1 and functions as txn co-activator

Focus: transporters (SLCxyz, ABCxyz), interleukins. Others?

How to represent xx? Current practice does not allow dual annotation; column 17 may solve this problem.

Anything else to add?

keep protein isoforms and multiple transcript problems separate

April 21, 2008

Orthology Sets

Judy reviewed some aspects of multiple orthology sets - a lot of different resources, and linking between them.

  • Mammalian orthology sets at MGI
  • Homologene -nice but doesn't include all reference genomes
  • TreeFam - maximum likelihood-based
  • PIR homeomorphic protein superfamilies

why do we need another resource (i.e., Kara's)? the above tools are not comprehensive, and don't start with gene index. [discussion regarding these reasons: why can't the above tools incorporate our protein sets? who do we need to contact/work with?]

  • we spend a lot of time determining what genes are/not in an ortholog set - is that the best use of our time? should we use existing resources?

Judy showed some slides from Kara.

  • Kara had offered to address some of these concerns by doing a PPOD run specifically for the refGenome project: PPOD refGenome database

Orthology, Paralogy, and GO Annotation

Paul Thomas (SRI) See slides : [3]

  • goal of refGenome project is to identify genes in reference genomes that have same or similar functions, so can do comprehensive curation simultaneously
  • ortholog = same gene in different organisms separated only by speciation
    • orthologs can have different functions
    • paralogs can have same functions

What is an ortholog cluster?

  • Algorithms make slices through protein trees based on some combo of evolutionary rates and history of duplications/speciation
    • They make arbitrary calls based on calculations; different algorithms will do this differently.
    • One must still investigate the experimental biology, and make educated judgments
    • It can still be quite useful to have comprehensive curation of related genes, even if they don't technically fall into the ortholog cluster (can be "fruitfully" annotated at same time)

Tree visualization tool for Ref Genomes - in development

  • Pre-computed searchable library of gene trees - modifiable based on curator feedback, includes outgroups
  • Visualization tools - trees labeled with GO annotations
  • Homology annotations supported by tree evidence, available to scientific community
  • HMMs to allow other genome projects to infer GO annotations

Now that we have all protein IDs together, Paul can have a run completed by July (parsimony-based trees)

  • ACTION ITEM: Paul will grab gp2protein files on May 1st and begin his run

Moving forward

We will prioritize genes that are present in Kara's data in all 12 reference genomes - there are 153 of these orthogroups

  • we will do 20 genes/month starting at top of alphabetical list (by human gene name)
  • we will place more focus on experimental literature and less focus on inferential annotations
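
The prioritization described here (orthogroups present in all 12 reference genomes, worked through alphabetically by human gene name at 20 per month) can be sketched as below. The input structure and gene names are hypothetical; only the counts 12, 153 and 20 come from the minutes.

```python
def monthly_batches(orthogroups, n_genomes=12, batch_size=20):
    """Batches of human gene names for orthogroups found in every reference genome.

    `orthogroups` maps a human gene name to the set of reference genomes
    (any hashable labels) that have a member in that orthogroup.
    """
    complete = sorted(
        (name for name, genomes in orthogroups.items() if len(genomes) == n_genomes),
        key=str.upper,
    )
    return [complete[i:i + batch_size] for i in range(0, len(complete), batch_size)]


# Placeholder data: 153 orthogroups present in all 12 genomes, plus one incomplete group.
all_genomes = frozenset(range(12))
demo = {f"GENE{i:03d}": all_genomes for i in range(153)}
demo["PARTIAL1"] = frozenset(range(11))

batches = monthly_batches(demo)
print(len(batches), len(batches[-1]))
```

At this rate the 153 orthogroups take eight months, with the last batch holding the remaining 13.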

We will continue to do QC at a rate of one per month per curator (at most; if it takes longer that's OK, as it must be balanced with other work tasks)

Review old action items

STILL TO DO

  • DOCUMENTATION:

[ACTION ITEM](Documentation working group!): Document in SOPs: Another factor we have been tracking is when a curator judges that the curation of a gene is ‘comprehensive’, that is, that it accurately represents the biology (irrespective of the number of papers available or read). This appears in the spreadsheets. The guideline is that when there are few papers, all papers should be read; when there are many (a curator can judge what is too many), then a review should be read to find the important primary literature and decide what information needs to be captured. We don’t keep track of whether or not reviews have been read. Wormbase uses textpresso (PMID 15383839), which helps ensure curators do not overlook information. The ‘comprehensive’ curation status doesn’t get invalidated when a newer paper is published; however, curators may (and are encouraged to) update the date when the newer literature is curated.

[ACTION ITEM] (Pascale Gaudet, all) - paper -Topics: talk about goals, process, curation priority, ortholog finding, interesting biology (outliers reflect mistakes in annotations or interesting differences in the biology), benefits (improve annotation consistency and ontology quality)

[ACTION ITEM] ???(Mike Cherry, who else): Organize a reference genome annotation camp, possibly in spring or summer of 2008.

[ACTION ITEM] (Judy Blake) Contact NCBI/NLM/OMIM to link to reference genome genes


In progress

Graphs (Metrics):

[ACTION ITEM] In progress Re-calculate with is_a only paths (Chris)

[ACTION ITEM] In progress, Re-calculate with experimental codes only; generate several versions of the data classified by different evidence codes?

[ACTION ITEM] In progress (Chris) Provide such reports on a regular basis


[ACTION ITEM] (Software group) In progress: Continue working on development of the curation tool

[ACTION ITEM] (Mary Dolan, Software group) In progress Continue working on the best way to display annotations graphically.



DONE:


[ACTION ITEM]: Done. (everybody): We will add categories of genes to annotate in addition to ‘disease genes’. We will choose five genes from each of the following four groups: 1. diseases 2. biochemical/ signaling pathways 3. bleeding edge list: 4. conserved genes/unannotated genes. This will be done on a rotation basis from all databases.

THREE NEXT ONES ARE TAKEN CARE OF WITH THE NEW ORTHOLOGS LIST

[ACTION ITEM] (Val): Provide the list of 207 genes conserved between pombe and human with no annotation/information

[ACTION ITEM] (Jim): Provide the set of conserved genes found by InParanoid that are conserved in all 12 species (660 or so); we might want to prioritize this list by ascending order of number of annotations to target unannotated genes (who can do that?) DONE, see 'Suggestions' spreadsheet (look for the "conserved Hs-Ec" sheet)

[ACTION ITEM] (Ruth): send the HGNC list of genes with few annotations
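
The ordering suggested for the conserved gene list (ascending number of annotations, so unannotated genes come first) amounts to a single sort. A sketch with hypothetical gene names and counts:

```python
def prioritize(genes, annotation_counts):
    """Order genes by ascending annotation count; ties are broken by name."""
    return sorted(genes, key=lambda gene: (annotation_counts.get(gene, 0), gene))


# Placeholder data: genes missing from the counts are treated as unannotated.
conserved = ["geneC", "geneA", "geneB", "geneD"]
counts = {"geneA": 12, "geneB": 0, "geneC": 3}

print(prioritize(conserved, counts))
```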


[ACTION ITEM] Done Amelia (web page), Susan, Rex, Petra (content), work on web presence

[ACTION ITEM] In progress : contact/meet with people who have made tools for orthology determination to see if they can help us (that possibly includes re-running the analyses using the most recent set of sequences and proper IDs); Compara: Emily? Homologene: Judy? TreeFam, InParanoid, others?

[ACTION ITEM]: In progress, Kara: run the P-POD analysis over the full ref genome data set. Need a computational pipeline with existing resources. Currently takes 3 weeks to do 8 species all v all. Goal was set for February 2008 to include all ref genome sets.


[ACTION ITEM] Done (Chris): pull out ND annotations and report to each group, see Reference_Genome_Database_Reports
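
A report of this kind can be produced directly from gene association files. The sketch below is not the code behind Reference_Genome_Database_Reports; it only assumes the standard tab-delimited layout in which column 2 is the object ID, column 7 the evidence code and column 15 the assigning group.

```python
from collections import defaultdict


def nd_report(gaf_lines):
    """Group object IDs with ND (no biological data) annotations by assigning group."""
    report = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!"):            # comment and header lines start with "!"
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue                        # skip malformed rows in this sketch
        if cols[6] == "ND":                 # column 7: evidence code
            report[cols[14]].add(cols[1])   # column 15: assigned_by; column 2: object ID
    return {group: sorted(ids) for group, ids in report.items()}
```

Each group then receives the sorted list of its own object IDs that carry an ND annotation.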


[ACTION ITEM] Rejected, (Pascale Gaudet): Add to RefGenome curation practices SOPs: please enter your unique gene_id in the google spreadsheet (Makes it easier to parse)

[ACTION ITEM]: Rejected (Pascale Gaudet) Generally, provide guidelines for filling the google spreadsheet (IDs, where to put notes, etc)

Counting papers, assessing completeness/comprehensive annotation status