Difference between revisions of "20th GO Consortium Meeting Minutes"

From GO Wiki
(Mike: Annotation Progress)
 
* if you have suggestions, please send them along
  
== Suzi: Discussion ==
+
 
 +
== Reference Genome Annotation Discussion ==
 
(has slides)
 
  
* uniprot; complete proteome project
+
It was decided that the process for selecting the initial genes to determine the RefGenome sets would not change.  Pascale would continue to provide the gene list at the beginning of each month.
* protein curator; how to efficiently incorporate input from all MODs
 
* how to deliver resulting homol-based annotations to MODs
 
* judy: doesn't like gp to protein files
 
** what's in a name
 
* judy: MODs are building gp to protein, how to work with uniprot?
 
** judy: eventually they will be working together
 
  
 +
It is not necessary to discuss the problems with the tree as a group and feedback can be given to Paul and Kara.  Paul proposed that this discussion could be tacked on at the end of an ejamboree session
  
* kimberly: want reference protein with all the exons?
+
For the gp2protein input, we want to have a reference protein with all the exons (Kimberley)
** yes
 
  
==== ACTION ITEM ====
 
Generate definition for what each of the files needs to be and generate tags
 
(paul and kara)
 
  
 
* judy: multiple reps of protein families, want to take advantage of them, like swissprot
 
 
** we aren't in a hurry
 
  
==== ACTION ITEM ====
 
(Paul and Kara) draft and m.a. will take pass afterwards
 
  
== Suzi: More Discussion ==
+
* Kimberley: How do you handle new literature that comes to the MODs?  Do you revisit?
(has slides)
+
* Rex: As we identify a new paper that might cause us to revise an annotation, it needs to get fed into the system.
 +
* Judy: The group needs to be notified if there is a new function that has been identified.
 +
* Rex: if the change is an outlier, then the group needs to talk about it.
 +
* Kara: In the proposal, reports will be generated.
 +
* Mike: Are there equivalog sets that you will exclude because the function evolves faster than the sequence?
 +
* Paul: There will be a proposal from a pfc and the group will be able to talk about it. And the curator will be able to make a judgement on what level of annotation to transfer. There is something you can say, it may be more specific, it may be less specific, but that's where the biological expertise comes in.
 +
* MA: this is so much more rigorous than what we've done before, and everything is traceable.
 +
* Mike: pfc will define annotation set.  How do you envision MOD interaction on a given set b/c
 +
* Kara: there will be a given set and the MOD curators will make the annotations and the pfc will mediate
 +
* Mike/Kara/Rex: This will be an iterative process b/c some MODs will want to change their annotations based on what other groups have annotated.
 +
* Rex: to make this work, there needs to be a defined time frame so the iterative process can happen.
 +
* Michelle: are bacterial species included?
 +
* Paul: They already are, but horizontal gene transfer does make this more difficult.
 +
* Mike: are you going to take existing annotations to help make trees?
 +
* Judy: Mouse could provide information about duplication, that could be an extension of this work, but this would be a place to start.
 +
* Paul: Sometimes the tree will be wrong because the data is problematic.  Chicken genes are problematic.  But it is traceable where the tree breaks down and can report back to the MOD so they can look at their data
 +
* Judy: look at this side by side with Mary's trees that look at annotations within the GO structure
 +
* MA: The underlying structure of the tree is the species tree? (yes) What happens when the species tree changes?
 +
* Paul: we just have to figure out how that affects the nodes that have been annotated, but I don't think that these are going to be dramatic changes.  We look at how this changes the distribution of annotations for predicted groups. The question is "does the ancestor change"?
 +
 
  
* let's talk about the flow
 
** should we do it as in the slides?
 
*** no complaints
 
* paul: should we try this as part of electronic jamboree? we can try and time it
 
** judy:  a set of genes in a month
 
** pascale: want to change how we do it
 
***
 
** judy: how are we selecting genes ...?
 
*** suzi: we use the trees
 
**** judy: that selection is dependent on exp anno
 
** paul: we wont use current anno to select genes
 
** pascale: disease genes, no anno, highly conserved
 
*** some might be easier (pascale's display)
 
** judy: are we going to continue with what we're doing?
 
*** rotation, etc.
 
** pascale: pascale and kara will define the set
 
*** judy: set the priority sets
 
**** rex: let's just move forward
 
** judy: can we do this next week?
 
*** yes
 
  
==== ACTION ITEM====
 
pascale will provide list at the beginning of each month
 
  
 +
* How can we get better about generating the protein sets?
 +
** work in collaboration with UniProt proteome project?
 +
* uniprot; complete proteome project
 +
* protein curator; how to efficiently incorporate input from all MODs
 +
* how to deliver resulting homol-based annotations to MODs
 +
* judy: doesn't like gp to protein files
 +
** what's in a name
 +
* judy: MODs are building gp to protein, how to work with uniprot?
 +
** judy: eventually they will be working together
 +
* Judy: we're also including functional RNAs, you're calling them gp2protein files
 +
* Paul: This is a legacy name issue
 +
* Judy: given that we are building these files on a MOD by MOD basis, how are we going to interface with UniProt since their effort is the same.
 +
* Judy: the goal is to annotate one gene representative and one protein representative
 +
* Pascale: UniProt is missing the proteomes for mouse, zebrafish, chicken, rat.
 +
* Judy: the goal of the project should be that the UniProt and GO files
 +
* Suzi: We have loaded the gp2protein file, but we need an additional file for spliceforms, and one that has RNAs, etc.
 +
* Kimberley: c elegans sent the longest, but what you really want is all the exons
 +
* Paul: The SwissProt representation is ideal.
 +
* Judy: in terms of isoforms, there are multiple ways of representing ptn family sets, so what are the thoughts on having multiple groups overlaid on these trees?
 +
* Paul: This system is completely extensible.
 +
* Rex: we should be careful about what you're calling protein families, really they are gene trees.
 +
* MA: we should prepare a document for what the GO means and what we should do.  And what protein sets we are going to coordinate from the MODs, what they are going to be used for, what are they going to include
 +
* How to most efficiently incorporate input from all MOD curators?
 +
** proposal--protein family curator
 +
* How are resulting homology-based annotations delivered to MODs?
 +
* Judy: how do we decide which genes get chosen?
 +
** Is it working, do we want to change anything?
 +
* Judy: it would be good to set the priority sets.
 +
* Rex: We've discussed this every time and the priority changes, it doesn't matter how we pick them, just pick them.
 
* suzi: give a gene/focal point, what from every species is the protein that you want to include--that will be the tree stuff
 
 
* suzi: no discussion about the seed; but we may have one about the set.
 
 
* kara: will have much improvement once we start using trees
 
 
* pascale: any problems should bump back to paul and kara, not necessary to discuss as a group
 
* debbie: more fruitful to have call just about inference set after jamboree
+
* Debby: Should we have the discussion about the inferences at the ejamboree?--NO.  The curators might want to go back and do additional annotations based on the discussion, and it would be more fruitful to have a separate call about the inferences.
* kara: want to just hammer things out with pascale and have a concrete system
+
* Kara: We'd like to sit down and create a proposal about how to do the review without making extra work. We want to just hammer things out with Pascale and have a concrete system.
 
* judy: MODs should incorporate new inf instead of GO
 
** rex: as req of being part of RG, need to add infs quickly
+
* Rex: It should be a requirement of the ref gen group that each MOD quickly incorporate the inferential annotations into its GA file.
** m.a.:
 
*** judy: wary of paralog groups
 
**** paul: that's what we're doing
 
***** crosstalk
 
 
**** paul: MODs should check tree, until comfortable
 
 
*** paul: how does PF interface with mods?
 
**** david: not conference, but one-on-one would be better
+
* David: The most efficient way would be for the pfc to contact each of the MOD individually.
***** rex: have them review, and if they have a problem, the onus is on them
+
* Rex: turn it around.  Have the MODs review and if there is a problem, then they get in touch with the pfc in a defined amount of time.
*** suzi: (on diagram)
+
* David: only people who have experimental data should get together and annotate
**** pascale gives example with diagram
+
* Pascale: these are only predictions. We want to be conservative in propagating these functions.
** m.a.: need to be in MOD and GO db; what is route?
+
* tree rebuild will happen every 6 mos or so.
*** paul: we can put the non-MOD inferences in GOA?
+
 
**** yes
+
 
 +
currently do not have complete protein files for mouse, rat, chicken, zebrafish
  
==== ACTION ITEM ====
 
(pascale) picks gene by magic
 
  
==== ACTION ITEM ====
+
Mechanics
set done by kara and paul, complaints to them
+
There were two options proposed for inputting ancestrally inferred annotations:
 +
# These annotations would be provided back to the MODs, and the MODs would incorporate them into their gene association submissions to the GO consortium
 +
# They would be directly inputted into the GO database with a filtering script
  
* trees done every six months
+
Although it was brought up that a downside to the first option would be a delay in incorporation of annotations, people much preferred the first option for the following reasons:
 +
* it is consistent with the current policy of each MOD being the definitive source of annotations for their organisms. 
 +
* most of the MODs also have systems in place to load external annotations (e.g. GOA). 
 +
* it will ensure that annotations remain in sync between the MODs and the GO consortium files
  
==== ACTION ITEM ====
+
Judy asked what the evidence code and source for the ancestrally inferred annotations would be.  RefGenome was suggested as a source and this was considered favorably as it would increase visibility of the project.  We could also version these annotations by the date.  Suzi said the evidence code discussion should wait until there were annotations that could be discussed.
one-on-one MOD discussion (if problem)
 
  
* mechanics of how this all gets into MOD and GO db
 
** suzi: possibilities:
 
*** opt 1: if it goes back to MODs from PF DB
 
**** there may be a delay
 
**** rex: timeline agreement could solve it
 
*** opt 2: done in DB load script
 
*** rex: likes first; if we need police, so be it; opt 2 not good--mods should control their own data
 
**** judy: agrees
 
*** pascale: like opt 2
 
**** crosstalk
 
*** david: opt 1
 
**** nervous about sync delays
 
*** eurie: likes opt 1
 
*** emily: opt 1
 
  
==== ACTION ITEM ====
 
(RG) MODs will fold in changes and pass them on to GO
 
  
* judy: what is evcode, what is source?
 
* judy: this will make versioning easier
 
* rex: ref gen as source
 
** increase visibility--branding!
 
** in twice yearly tree change, how big are they usually?
 
*** paul: not too likely; and an auditable process
 
*** pascale: let's say we have 5% of 10000 trees change, can we identify them?
 
**** paul: since we're just interested in local properties, not a big deal
 
**** pascale: if there is new info in mouse for that tree ...
 
***** need to make sure that annotations are current
 
  
* suzi: evcode discussion should wait until we have something to discuss
+
=== ACTION ITEMS ===
 +
* Draft a document about coordinating the GO gp2protein files with the UniProt proteome project (Paul/Kara).  Look over it and bring it to Amos as a proposal (Michael A.).
 +
* For GO, generate a list of what files are needed (gp2protein, spliceforms, all gene products), define what these files should be, and build a file structure. (Paul/Kara)
 +
* Pascale will provide the seed genes for the annotation set at the first of every month, and the sets are to be decided by Paul and Kara's trees.
 +
* If there are problems with the tree, the MODs will correspond with Paul and Kara.  This will be done on a one-on-one as-needed basis.
 +
* Annotations made by ancestral inference will get fed back to the MODs in gaf format for them to incorporate into their sets to submit to the GO database.  'RefGenome' will become the source of these annotations.
  
* everybody has warm fuzzies about the PF/RG developments
 
  
 
== Mike: Annotation Progress ==
 

Revision as of 12:13, 22 October 2008

Ontology content development

Overview (Midori)

Most of the report is on the wiki. A lot has been accomplished. Highlights include the following:

  • closed more SF items than opened since last meeting (~200)
  • peptidase reorg is finished: After the SLC meeting, MEROPS database curators were contacted and they made recommendations. Those recs have been acted upon and the reorg is finished.

There still are ~200 open items. All those that are more than 6 months old have been assigned, but maybe those should be reviewed to see if the priority should be changed. Also, David mentioned that many of the items are being taken care of in large chunks with ontology changes, such as the "biogenesis and organization" terms. Some are stuck because no consensus can be reached.

The majority of the ontology content section will talk about future work and the links that will be made between function and process ontologies.

ACTION ITEMS

(everybody) SF items that cannot be closed due to lack of consensus should be put onto a wiki page so they can be resolved at an upcoming GOC meeting.  Midori said to email the editors and they'll take care of it.

Theory and examples of function and process links (Harold, Jen)

(get his slides)

Many groups have been working on systems for links trying to see how it works since it does represent biology. Harold showed examples of cross-products using biochemical pathways, defining a start and end, selecting paths, and using common resources. Done manually, the links looked OK but labor intensive. Could it be done automatically?

Problems with doing it automatically:

  • missing DBXREFs
  • too many DBXREFs
  • creates links in the ontology that are "correct", but not always helpful to a given question for a human - like BP "carbohydrate metabolism" being linked to all glycolysis MF annotations.

Moving forward, using the dbxrefs seems to be the way to go but we will have to go in manually to make them more complete.

(from chris' and Jen's talk)

So why should we even bother?

  • It will improve the GO because we need to be specific
  • It will help fill in annotation gaps - such as a link from the MF "kinase activity" to the BP "phosphorylation" - as well as provide ways to suggest new annotations.
  • It will allow better integration of pathway databases with GO.
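
A minimal sketch of how a function-to-process link could fill an annotation gap. Only the "kinase activity" / "phosphorylation" pair comes from the discussion; the gene names and the shape of the link table are hypothetical.

```python
# Hypothetical table of MF -> BP part_of links.
# Only the kinase/phosphorylation pair is taken from the minutes.
FUNCTION_PART_OF_PROCESS = {
    "kinase activity": "phosphorylation",
}


def suggest_process_annotations(annotations, links=FUNCTION_PART_OF_PROCESS):
    """Return (gene, process) pairs implied by a function annotation
    but not already present in the annotation list."""
    existing = set(annotations)
    suggestions = []
    for gene, term in annotations:
        process = links.get(term)
        if process is not None and (gene, process) not in existing:
            suggestions.append((gene, process))
            existing.add((gene, process))
    return suggestions


# Made-up example annotations: geneB already has the process term.
annotations = [
    ("geneA", "kinase activity"),
    ("geneB", "kinase activity"),
    ("geneB", "phosphorylation"),
]
print(suggest_process_annotations(annotations))
```

The output is a list of suggestions for a curator to look at, not annotations to load directly.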

Chris has been trying to use Reactome to make mappings between function and process and has come across the following issues:

  • DBXREFs not necessarily equivalent
  • There are some reactions that always occur in a given process for a particular species and others that do not, and this is more difficult to mine from Reactome.

There are also gotchas from biology because there could be multiple variations for lysine biosynthesis that include mix-and-match reactions and variations of those reactions. A combinatorial explosion.

The proposal to deal with this:

  • When functions and process are closely related, like kinase and phosphorylation, can make a "part_of" annotation.
  • new relationship "sometimes part_of" when automated mappings are brought in which will avoid true path violations.

Function and process link discussion

David asked if every function should be a "part_of" process? In theory, there should be a link between each Molecular Function term to a Biological Process term. And there was general acceptance of this theory.

Eurie asked whether MF enzyme terms are made consistently enough to keep the links easy and consistent. Amelia pointed out another issue: enzyme terms usually cover both the forward and backward reactions, but we need separate terms. Harold also pointed out that we copied from EC, which may mean that two GO terms exist solely on the basis of cofactors. All of these may contribute to issues in creating an automated mapping.

Suzi and Peter discussed that the definitions of pathways in GO and Reactome are different, using apoptosis as an example. There are good examples where the start and end may differ from organism to organism. Ingrid mentioned that John Ingram (an experienced physiologist) said a metabolic pathway should begin and end with a central metabolite. Then pathways can feed into a common point that can then go to a central metabolite. But there is probably less consensus for models that are still being developed. All agree that a discussion needs to occur to work on coming to a common agreement.

Paul pointed out that we were discussing two extremes: an uncurated automated link and curated links. The curated links are the ultimate goal but there could be a compromise in the middle. If the relationship between MF and BP goes through KEGG, that evidence trail is documented. This is something better than "sometimes part_of". The concern with this is that changes other groups make need to be propagated.

There was a discussion about the impact on curation. With these inter-ontology links, you have to take the links into account as a true path rule. In addition, how much evidence do you need to make those annotations? That is why there is the "sometimes_part_of" but what if that pathway doesn't exist in your organism?

ACTION ITEMS

* Add obvious part_of links, like MF "kinase" and BP "phosphorylation"; will be rolled out after regulates is released in Feb 2009
* The sometimes_part_of relationship was agreed to be a good idea. We should try mining pathways for sometimes_part_of relationships using glycolysis, nucleotide metabolism, apoptosis first
* Agree on beginnings, middles, and ends of pathways/processes between Reactome and GO
* Examine impact on annotation priorities and implementations
* Can we source our relationships as well as our term definitions?
** (david: this is about pushing the work onto the ontology developers and not the annotators) 
* assign process to every molecular function.
* deferred: co-annotation 'has function as part of this process'

New relationship type (David)

New relationships will be released to the public in Feb 2009. These are the first cross-ontology links between BP and MF. They will occur between the BP "regulation of catalytic activity" and MF "catalytic activity". Those functions that regulate function terms will get the regulates relationship.

One major consequence is that all groups have to take into account relationships. The BP "negative regulation of kinase activity" is part_of "kinase activity", but the slimming will make them "kinase activity". Need to be careful about this.

Michael was concerned about whether the meaning of "part_of" was being overloaded. David replied that we probably are but practically, it may not matter because the child term really cannot be part of both parents at the same time.

We will have to make sure that GO tools support these links. In addition, we need to make sure that users who develop tools are aware of these changes. Jane emphasized that we couldn't do testing for all tools but the users need to test.

ACTION ITEM

(tanya) Send out function process email again.
(chris) Release examples of relationship usage for software development.

Quality Control (Tanya)

Much of the information is on the wiki. For regulation terms, the reasoner looks at regulation terms and then at corresponding process terms, and checks whether the structures match or relationships are missing. These were all reviewed.

Ontology developers will continue to review Chris' reports - it's becoming part of the process of ontology development since it is part of OBO-edit.

OBO-Edit (Amina)

(has slides)

Priority should be testing and bug fixes. This version doesn't need more features but all the new features need to be tested, tested, tested.

Reports (Jane)

(has slides)

PAMGO

This is an ongoing process. Lots of "regulates" terms.

Organization and biogenesis of cellular components

(has slides)

All "organization & biogenesis" terms will be changed to "organization" with the proposed high level structure:

organization
-[i] assembly
-[i] disassembly
-[i] maintenance
-[i] morphogenesis
biogenesis
-[p] assembly
-[p] part biosynthesis

ACTION ITEM

(ontology dev) Continue work on "organization and biogenesis" terms.  Maybe biogenesis & organization should be switched at the higher level but this is up for discussion.

Signaling (Jen)

(has slides)

Is responding to the signal the same as the reception of the signal? Currently defined as within the realm of reception of the signal?

Future content meeting discussion

The discussion of signaling touches on every species. David pointed out that some of these are huge issues - signaling alone can be roughly categorized into

  • g-protein coupled receptor signaling
  • calcium signaling
  • tyrosine kinase signaling
  • MAP kinase cascade

ACTION ITEMS

Pursue one or two ontology development meetings:
   (Brenley, Kimberley, Candice, Michelle, Jane) Viral processes
   (Pascale, David, ???) GPCR
(?) Pair a couple of GO meetings with major meetings on these topics
(?) Investigate funding sources

Annotation checking by trigger file (Jen)

(has slides)

Of 47,000 errors in the GOA file (0.14% of total), only 5 manual annotations were flagged suggesting that the graph may need improvement. The rest were IEAs.

Trigger file is being used by GOA and MGI for consistency. There was discussion whether this use should be expanded and run against all annotations when the files are submitted. This would address a QC aspect.

Emily said that it can be used as feedback to InterPro (for the InterPro to GO mappings) to update mappings because old mappings are causing problems. Dan also suggested that it could be integrated into QuickGO as a public resource.

ACTION ITEMS

(?) remove sensu synonyms
(Jen) Continue implementation of trigger system
(Dan) Make GOA quickgo checking available to the public
(?) write up for near future news letter

General Annotation Issues

Evidence code ontology (ECO) (Michelle)

Michelle is taking over managing the Evidence Code Ontology.

Goals:

  • Correct inconsistencies in the ECO with GO
    • ECO exists on its own and includes things other than GO; GO pulls from the ECO, using a subset

Mike asked whether ECO is the responsibility of this community. Michael says yes because we started it. We have not used it because we wanted to start out pretty easy. TAIR then wanted a much richer set of evidence codes. Michael and Sue did a mapping. But when TAIR reports to GO, they collapse the evidence codes down. Those arguments are still valid. If curators were faced with more evidence codes, it would take longer. IDA could be expanded to a zillion codes.

Michael thought we should integrate GO evidence codes into the ECO because other people might use them, and that GO should use a subset/slim of ECO (i.e. the ones we use now). Judy agreed and said if an individual MOD wants to make use of the granular codes, they can, but they must be mapped up to the higher-level codes used by the GO.

Pascale asked if we needed to use the more granular terms adopted by GO (ISM, ISO, ISA) or if we could keep to the higher level terms. Keeping to the higher level terms was deemed fine--Suzi pointed out you should annotate to the degree of knowledge available and this might end up being to the more general EV code. Also this allows for not having to retrofit older annotations, as brought up by Harold.

Suzi brought up there could be various slim sets for various projects - AmiGO, Ref Genome, etc. for display purposes in the interfaces.

In response to a question from Eurie, it was stated that the GOC would only set standards for the GOC accepted codes, not all codes. Each database could use more if they wanted, but would have to convert them into the standard set for the GA file.
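
A rough sketch of the "map up" step, assuming each code has a single parent code. The `local:` codes are hypothetical MOD-specific codes invented for illustration; the ISO/ISA/ISM to ISS and IDA to EXP parent links follow the granular vs. higher-level codes discussed here.

```python
# child code -> parent code
PARENT = {
    "local:immunofluorescence": "IDA",  # hypothetical MOD-specific code
    "local:enzyme_assay": "IDA",        # hypothetical MOD-specific code
    "ISO": "ISS",
    "ISA": "ISS",
    "ISM": "ISS",
    "IDA": "EXP",
}


def map_up(code, accepted, parent=PARENT):
    """Walk up the parent links until a code in the accepted set is reached."""
    current = code
    seen = set()
    while current not in accepted:
        if current in seen or current not in parent:
            raise ValueError(f"{code!r} cannot be mapped to an accepted code")
        seen.add(current)
        current = parent[current]
    return current


# A database using granular codes converts them before writing its GA file.
print(map_up("local:immunofluorescence", {"IDA", "ISS", "EXP"}))
# A coarser slim (e.g. for display) collapses further.
print(map_up("ISO", {"ISS", "EXP"}))
```

The same function serves both the GA file conversion and any display slim, since only the accepted set changes.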

Maybe EXP would be better when there is no consensus on which evidence code should be used. This may prevent spinning it. For use of ref genome, maybe have to have additional standards available. There were concerns from Peter about overloading the EXP term and/or losing accuracy. There is a lot of time spent debating which lower code to use and Rex would rather have people agree to EXP and spend more time annotating.

Any evidence code that is in the ECO could be submitted to the GOC for adoption. We could also explore writing software to slim the evidence codes used.

ACTION ITEMS

(suzi) We will use the ECO and create an EV code ontology tracker
(everybody) people can use the ECO in its entirety but they have to map up to the GO set of EV codes.

Separating annotation method from experimental method

In principle, most everyone was supportive of the idea to separate the evidence used to make the annotation and the curation method to make that annotation. There was, however, much discussion on the scope and implementation of this proposal.

Two implementation methods were proposed:

  1. A new column that contains a text describing the annotation process
  2. Creating a cross-product between the evidence code branch of the ECO and a methods branch of the ECO. This ID would be used in the GAF.

Because proposal 2 does not create a new column and is expandable to multiple combinations, it was favored.

Proposed methods were not agreed upon, but words that were used to describe them included

  • curator reviewed
  • not curator reviewed
  • electronic
  • manual
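
A sketch of what proposal 2 could look like: every evidence/method pair gets its own ID, so a single existing column can carry both pieces of information. The `XP:` ID scheme and the choice of evidence codes shown are assumptions made for illustration only; no ID format was agreed.

```python
from itertools import product

EVIDENCE = ["IDA", "ISS", "TAS"]                        # a few existing GO codes
METHODS = ["curator reviewed", "not curator reviewed"]  # wording from the discussion

# Hypothetical IDs: one per (evidence, method) combination.
CROSS_PRODUCTS = {
    f"XP:{number:04d}": pair
    for number, pair in enumerate(product(EVIDENCE, METHODS), start=1)
}


def decode(xp_id):
    """Split a cross-product ID back into its evidence code and method."""
    return CROSS_PRODUCTS[xp_id]


for xp_id, (evidence, method) in CROSS_PRODUCTS.items():
    print(xp_id, evidence, method)
```

Adding a new method only adds rows to the table; the file format itself does not change.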

A direct consequence of this would be that IEA would go away. Emily proposed the following mappings:

  • Interpro2GO -> ISS/ISM, not reviewed
  • keyword mappings -> TAS, not reviewed

IEAs are stripped out after a year. How would those "not curator reviewed" annotations be treated? Since the intent was to remove annotations based on sequences, maybe only those should be removed because they can be easily recomputed. Large-scale/high-volume/high-throughput experiments won't be repeated but still are experimental. We would have to consider exceptions for these.

There was some discussion about what users wanted and how users were taking advantage of the evidence codes. There is a range - some people just strip evidence codes and do not consider them. However, others would take advantage of additional levels of information. Both advisors mentioned that what users wanted was a level of confidence. This, however, would be difficult to do.

It was agreed that multiple issues need to be considered when dealing with this new qualifier of evidence codes:

  1. the experimental evidence
  2. was there judgement involved
  3. was there a review of the data

ACTION ITEM

(Suzi, Michelle, Judy, Pascale, Emily, Eurie) Example cases of annotations and implementation into the ECO

PAMGO (Michelle/Candace)

Candace gives an overview of the new terms. Project is coming to an end - funding is coming to an end. New gene association files have been submitted.

Successes

  • PAMGO terms are used outside of PAMGO: viruses, C. albicans, P. falciparum, T. cruzi, T. brucei.

Issues

  • incorrect uses also.
  • there are a few terms where it is ambiguous whether or not the process is for the host or the virus side.

Future directions

  • fix virus terms
  • add comments
  • adopt more descriptive form for annotations.

Fixes

  • missing taxon ids for dual taxons
  • need a way to capture "acted_upon" annotations

Dual taxon IDs

  • still not displayed in AmiGO due to technical issues

ACTION ITEM

Check your annotations to terms under the 'symbiosis' and 'interaction with host' branches to make sure that there aren't any problems.

Cross products: Column 16 (Tanya)

Initially proposed in Jan 2007.

Reminder: this is extra information to combine multiple terms in a single annotation. These are GOIDs that we don't want to encode links in the ontology. If there is more than 1 localization, they can be piped and several different ontologies can be piped in the same row.

  • You are not restricted to one ontology in column 16
  • Column 16 is optional
  • Column 16 can also be used to identify a target i.e. regulation of transcription (Note that the current documentation states that column 16 is only for external ontologies.) or a chebi ID for a chemical when annotating "response to drug".
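
A small sketch of reading a piped column 16 value, grouping the IDs by the ontology prefix before the colon. The IDs below are made-up placeholders, and the syntax of the "expressive" solution is not covered here.

```python
from collections import defaultdict


def parse_column_16(value):
    """Split a pipe-separated column 16 value and group IDs by ontology prefix."""
    grouped = defaultdict(list)
    for item in (part.strip() for part in value.split("|")):
        if not item:
            continue  # column 16 is optional, so empty values are fine
        prefix, _, _ = item.partition(":")
        grouped[prefix].append(item)
    return dict(grouped)


# Placeholder IDs from two different ontologies in the same row.
print(parse_column_16("CL:0000001|CL:0000002|CHEBI:0000003"))
print(parse_column_16(""))
```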

There were two proposed solutions on the table:

  • simple solution
  • expressive solution

No one had concerns about adding this column and a few people spoke up in favor of the expressive because it would allow for more information and no need to retrofit.

ACTION ITEM

(everybody) go ahead with the implementation of the expressive model in column 16
  * get more examples 
  * get the documentation together

Transitive Relationships in GO

(has slides)

It's wrong to just slim terms without regard to the relationships between them, given the different relationship types. Unless you're careful, you will violate true path rules.

  • The composition of is_a and part_of need to be taken into consideration for true path violations.
  • If you regulate a process, you regulate part of that process, not that whole process.
  • As you add more relationships, you need to create these transitive closures.
  • And as you take these into consideration, the slimming can become more sophisticated.

Judy proposed that we need a tool to help with this.
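
A minimal sketch of what such a tool might do, assuming a composition table for is_a and part_of only. Any pair of relations missing from the table (for example anything involving regulates) is conservatively treated as "no inference", which is an assumption of this sketch rather than a decision from the meeting. Term names are made up.

```python
# (relation so far, next relation) -> combined relation
COMPOSE = {
    ("is_a", "is_a"): "is_a",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
    ("part_of", "part_of"): "part_of",
}

# child -> list of (relation, parent); toy graph with made-up terms
EDGES = {
    "term_a": [("is_a", "term_b")],
    "term_b": [("part_of", "term_c")],
    "term_c": [],
    "term_d": [("regulates", "term_b")],
}


def ancestors(term, edges=EDGES, compose=COMPOSE):
    """Map each ancestor reachable from term to the set of inferred relations."""
    found = {}
    stack = list(edges.get(term, []))
    while stack:
        relation, node = stack.pop()
        if relation in found.setdefault(node, set()):
            continue  # this (relation, node) pair was already explored
        found[node].add(relation)
        for next_relation, parent in edges.get(node, []):
            combined = compose.get((relation, next_relation))
            if combined is not None:  # only follow paths the table allows
                stack.append((combined, parent))
    return found


print(ancestors("term_a"))  # reaches term_c through is_a then part_of
print(ancestors("term_d"))  # stops at term_b: no rule for regulates + part_of
```

Slimming would then count an annotation under a slim term only if the term appears in this closure with a relation the slim accepts.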


Reference Genome

ACTION ITEMS

  • Draft a document about coordinating the GO gp2protein files with the UniProt proteome project (Paul/Kara). Look over it and bring it to Amos as a proposal (Michael A.).
  • For GO, generate a list of what files are needed (gp2protein, spliceforms, all gene products), define what these files should be, and build a file structure. (Paul/Kara)
  • Pascale will provide the seed genes for the annotation set at the first of every month, and the sets are to be decided by Paul and Kara's trees.
  • If there are problems with the tree, the MODs will correspond with Paul and Kara. This will be done on a one-on-one as-needed basis.
  • Annotations made by ancestral inference will get fed back to the MODs in gaf format for them to incorporate into their sets to submit to the GO database. 'RefGenome' will become the source of these annotations.

How curators can use evo trees (Paul)

(has slides)

Paul presented a proposal for the new process. Highlights of new process include the following:

  • Trees will be overlaid with the OrthoMCL "ortholog clusters" to find "equivalogs" - the equivalent gene in all organisms.
  • Finding these groups will allow annotations to be inferred to the shared ancestor protein.
  • Annotations can then be propagated to the extant protein so that annotations are concurrent and consistent in the context of the evolutionary tree.
  • There is an evidence trail that is documented.
  • A strength of this approach is that it allows annotation of organisms for which there are no curators. GOA could take these annotations.
  • Bacterial and archaeal sequences are included, but horizontal gene transfer makes inference harder.
  • The update would occur twice a year.
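The propagation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is a plain parent-to-children mapping, the node names and the GO ID are placeholders, and the real tool presumably lets the curator limit or adjust what is transferred, which this sketch does not model.

```python
def leaves_under(tree, node):
    """Return the extant (leaf) nodes below node in a parent -> children map."""
    children = tree.get(node, [])
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves_under(tree, child))
    return result


def propagate(tree, ancestral_annotations):
    """Copy each ancestral annotation to every leaf below that ancestor.

    Returns {leaf: [(go_id, ancestor_node), ...]}; keeping the ancestor
    node alongside the GO ID preserves the evidence trail.
    """
    inferred = {}
    for node, go_ids in ancestral_annotations.items():
        for leaf in leaves_under(tree, node):
            for go_id in go_ids:
                inferred.setdefault(leaf, []).append((go_id, node))
    return inferred


# Hypothetical tree: anc1 is the shared ancestor of a mouse and a fly gene;
# the yeast gene branches off earlier and is not below anc1.
tree = {
    "root": ["anc1", "yeast_gene"],
    "anc1": ["mouse_gene", "fly_gene"],
}
ancestral = {"anc1": ["GO:0000000"]}  # placeholder GO ID
result = propagate(tree, ancestral)
```

Here the annotation inferred at anc1 reaches mouse_gene and fly_gene but not yeast_gene, and each propagated annotation records the ancestral node it came from.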

Paul showed screenshots of the tool with GO annotations overlaid on the tree. The GO annotations displayed are mapped up the tree. Additional information can also be applied to the tree, such as bootstrap values or InterPro domains. This tool can be updated to reflect current GO annotation and GO tree structure.

The proposal also included a change to the gp2protein files. Most are complete, but where genes were missing they were supplemented with the ENSEMBL or Entrez Gene ID (depending on the group). However, these genes are represented by a single representative protein sequence, so this is not a long-term solution. The proposal was to switch to the Swiss-Prot canonical protein sequence, which is mapped to individual UniProt IDs and has instructions on how to generate all isoforms. There was agreement that two groups (MODs and Swiss-Prot) should not work independently to create the same file and that they should communicate.

There was some concern that we were overloading the purpose of the gp2protein file, and that the GOC still needs additional files to keep track of all gene products in a genome as well as a mapping between genes and IDs for all isoforms.

The annotation process (Kara)

(has slides)

Kara then presented how this pipeline would work. The major points of the new pipeline include the following:

  • a new curator, known as a protein family curator, will suggest proteins based on the tree
  • MOD curators annotate all experimental data to completion
  • the protein family curator mediates/coordinates review of experimental-based annotations
  • the protein family curator also creates inferences of annotations to equivalent genes

There was some initial discussion about how the interaction between the protein family curator and the MOD curator would work at each of the rounds. The process is intended to be iterative and requires feedback at each round. For example, curators may want to adjust experimental-based annotations after seeing another MOD's annotations. But there needs to be a fixed timeline to finish the experimental-based curation in order for propagation of annotation to occur in a timely fashion.

This pipeline can generate reports to help identify outlier annotations and other oddities. Automated checks can be done to alert groups when a MOD has added a new annotation or the equivalog tree has been updated or changed, particularly if a common ancestor has changed.

Kara also presented new features on P-POD:

  • multiple genes can be input in the search
  • tree display is based on Notung with an interactive applet
  • lists publications with functional complementation
  • has links to GO MGI and AmiGO graphs
  • if you have suggestions, please send them along


Reference Genome Annotation Discussion

(has slides)

It was decided that the process for selecting the initial genes to determine the RefGenome sets would not change. Pascale would continue to provide the gene list at the beginning of each month.

It is not necessary to discuss the problems with the tree as a group; feedback can be given to Paul and Kara. Paul proposed that this discussion could be tacked on at the end of an ejamboree session.

For the gp2protein input, we want to have a reference protein with all the exons (Kimberley)


  • judy: multiple reps of protein families, want to take advantage of them, like swissprot
    • completely extensible, can put in as many as you want
  • need to be careful with definitions, maybe we should be better
  • m.a.: we should go slow and prepare a document for this and make sure that we have everybody with us
    • we aren't in a hurry


  • Kimberley: How do you handle new literature that comes to the MODs? Do you revisit?
  • Rex: As we identify a new paper that might cause us to revise an annotation, it needs to get fed into the system.
  • Judy: The group needs to be notified if there is a new function that has been identified.
  • Rex: if the change is an outlier, then the group needs to talk about it.
  • Kara: In the proposal, reports will be generated.
  • Mike: Are there equivalog sets that you will exclude because the function evolves faster than the sequence?
  • Paul: There will be a proposal from a pfc and the group will be able to talk about it. And the curator will be able to make a judgement on what level of annotation to transfer. There is something you can say, it may be more specific, it may be less specific, but that's where the biological expertise comes in.
  • MA: this is so much more rigorous than what we've done before, and everything is traceable.
  • Mike: pfc will define annotation set. How do you envision MOD interaction on a given set b/c
  • Kara: there will be a given set and the MOD curators will make the annotations and the pfc will mediate
  • Mike/Kara/Rex: This will be an iterative process b/c some MODs will want to change their annotations based on what other groups have annotated.
  • Rex: to make this work, there needs to be a defined time frame so the iterative process can happen.
  • Michelle: are bacterial species included?
  • Paul: They already are, but horizontal gene transfer does make this more difficult.
  • Mike: are you going to take existing annotations to help make trees?
  • Judy: Mouse could provide information about duplication, that could be an extension of this work, but this would be a place to start.
  • Paul: Sometimes the tree will be wrong because the data is problematic. Chicken genes are problematic. But it is traceable where the tree breaks down and can report back to the MOD so they can look at their data
  • Judy: look at this side by side with Mary's trees that look at annotations within the GO structure
  • MA: The underlying structure of the tree is the species tree? (yes) What happens when the species tree changes?
  • Paul: we just have to figure out how that affects the nodes that have been annotated, but I don't think that these are going to be dramatic changes. We look at how this changes the distribution of annotations for predicted groups. The question is "does the ancestor change"?



  • How can we get better about generating the protein sets?
    • work in collaboration with UniProt proteome project?
  • uniprot; complete proteome project
  • protein curator; how to efficiently incorporate input from all MODs
  • how to deliver resulting homol-based annotations to MODs
  • judy: doesn't like gp to protein files
    • what's in a name
  • judy: MODs are building gp to protein, how to work with uniprot?
    • judy: eventually they will be working together
  • Judy: we're also including functional RNAs, you're calling them gp2protein files
  • Paul: This is a legacy name issue
  • Judy: given that we are building these files on a MOD by MOD basis, how are we going to interface with UniProt since their effort is the same.
  • Judy: the goal is to annotate one gene representative and one protein representative
  • Pascale: The proteomes missing from UniProt are mouse, zebrafish, chicken, rat.
  • Judy: the goal of the project should be that the UniProt and GO files
  • Suzi: We have loaded the gp2protein file, but we need an additional file for spliceforms, and one that has RNAs, etc.
  • Kimberley: c elegans sent the longest, but what you really want is all the exons
  • Paul: The SwissProt representation is ideal.
  • Judy: in terms of isoforms, there are multiple ways of representing ptn family sets, so what are the thoughts of having multiple groups overlaid on these trees?
  • Paul: This system is completely extensible.
  • Rex: we should be careful about what you're calling protein families, really they are gene trees.
  • MA: we should prepare a document for what the GO means and what we should do. And what protein sets we are going to coordinate from the MODs, what they are going to be used for, what are they going to include
  • How to most efficiently incorporate input from all MOD curators?
    • proposal--protein family curator
  • How are resulting homology-based annotations delivered to MODs?
  • Judy: how do we decide which genes get chosen?
    • Is it working, do we want to change anything?
  • Judy: it would be good to set the priority sets.
  • Rex: We've discussed this every time and the priority changes, it doesn't matter how we pick them, just pick them.
  • suzi: give a gene/focal point, what from every species is the protein that you want to include--that will be the tree stuff
  • suzi: no discussion about the seed; but we may have one about the set.
  • rex: let's just trust paul's trees, good enough
  • kara: will have much improvement once we start using trees
  • pascale: any problems should bump back to paul and kara, not necessary to discuss as a group
  • Debby: Should we have the discussion about the inferences at the ejamboree?--NO. The curators might want to go back and do additional annotations based on the discussion and it would be more fruitful to have a separate call about the inferences.
  • Kara: We'd like to sit down and create a proposal about how to do the review without making extra work. We want to just hammer things out with Pascale and have a concrete system.
  • judy: MODs should incorporate new inf instead of GO
  • Rex: a requirement of the ref gen group is for each MOD to quickly incorporate the inferential annotations into their GA file.
    • paul: MODs should check tree, until comfortable
    • paul: how does PF interface with mods?
  • David: The most efficient way would be for the pfc to contact each of the MOD individually.
  • Rex: turn it around. Have the MODs review and if there is a problem, then they get in touch with the pfc in a defined amount of time.
  • David: only people who have experimental data should get together and annotate
  • Pascale: these are only predictions. We want to be conservative in propagating these functions.
  • tree rebuild will happen every 6 mos or so.


currently do not have complete protein files for mouse, rat, chicken, zebrafish


Mechanics

There were two options proposed for inputting ancestrally inferred annotations:

  1. These annotations would be provided back to the MODs, and the MODs would incorporate them into their gene association submissions to the GO consortium
  2. They would be directly inputted into the GO database with a filtering script

Although it was brought up that a downside to the first option would be a delay in incorporation of annotations, people much preferred the first option for the following reasons:

  • it is consistent with the current policy of each MOD being the definitive source of annotations for their organisms.
  • most of the MODs also have systems in place to load external annotations (e.g. GOA).
  • it will ensure that annotations remain in sync between the MODs and the GO consortium files

Judy asked what the evidence code and source for the ancestrally inferred annotations would be. RefGenome was suggested as a source and this was considered favorably as it would increase visibility of the project. We could also version these annotations by the date. Suzi said the evidence code discussion should wait until there were annotations that could be discussed.
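To make the first option concrete, the sketch below writes one ancestrally inferred annotation as a 17-column tab-separated GAF line that a MOD could load. It assumes that "source" corresponds to the Assigned By column (column 15) and that the date-based versioning goes in the Date column (column 14); the evidence code is left as a caller-supplied placeholder because that discussion was deferred. The function name, the identifiers and the date are all hypothetical.

```python
from datetime import date


def gaf_line(db, obj_id, symbol, go_id, reference, evidence, aspect,
             obj_type, taxon, when, assigned_by="RefGenome"):
    """Build one 17-column GAF line; unused columns are left empty."""
    cols = [""] * 17
    cols[0] = db                          # column 1: DB
    cols[1] = obj_id                      # column 2: DB Object ID
    cols[2] = symbol                      # column 3: DB Object Symbol
    cols[4] = go_id                       # column 5: GO ID
    cols[5] = reference                   # column 6: DB:Reference
    cols[6] = evidence                    # column 7: Evidence Code (undecided)
    cols[8] = aspect                      # column 9: Aspect
    cols[11] = obj_type                   # column 12: DB Object Type
    cols[12] = taxon                      # column 13: Taxon
    cols[13] = when.strftime("%Y%m%d")    # column 14: Date, used as the version
    cols[14] = assigned_by                # column 15: Assigned By (assumed "source")
    return "\t".join(cols)


# Placeholder identifiers only; "TBD" stands in for the undecided evidence code.
example = gaf_line(
    db="MGI",
    obj_id="MGI:0000000",
    symbol="ExampleGene",
    go_id="GO:0000000",
    reference="GO_REF:0000000",
    evidence="TBD",
    aspect="F",
    obj_type="protein",
    taxon="taxon:10090",
    when=date(2008, 1, 15),
)
example_cols = example.split("\t")
```

Because the MODs already load external annotations such as GOA, lines in this shape could go through the same loaders, with the Assigned By value distinguishing them from the MOD's own annotations.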



ACTION ITEMS

  • Draft a document about coordinating the GO gp2protein files with the UniProt proteome project (Paul/Kara). Look over it and bring it to Amos as a proposal (Michael A.).
  • For GO, generate a list of what files are needed (gp2protein, spliceforms, all gene products), define what these files should be, and build a file structure. (Paul/Kara)
  • Pascale will provide the seed genes for the annotation set at the first of every month, and the sets are to be decided by Paul and Kara's trees.
  • If there are problems with the tree, the MODs will correspond with Paul and Kara. This will be done on a one-on-one as-needed basis.
  • Annotations made by ancestral inference will get fed back to the MODs in GAF format for them to incorporate into their sets to submit to the GO database. 'RefGenome' will become the source of these annotations.


Mike: Annotation Progress

(has slides)

Mike showed more graphs since people like them. The bottom line is that progress is being made. The numbers for the Ref Genome genes are a bit off because it is not easy to get the list of genes. Paul has numbers on his FTP site.

ACTION ITEM

(berkeley) automate mike's graph (as in proposal), need to be able to see progress through time

David: RG from ont dev perspective

(has slides)

  • we work through SF
    • RG request are prioritized
      • two flavors
        • new term for RG
        • problem areas in ont
          • slower
  • need annotators to get info about "response to" terms
  • doing signalling now
  • ...argh...
  • please use SF and mark as RG
  • pascale: documentation and anno consis
    • when doing big branch of ont, more discussion with RG group
  • jen: doc for every big reorg, sometimes I don't know about a change
  • debbie: def is sometimes unclear
    • ex: ATP binding
    • pascale: "binding", "regulating", ev--these are always a problem
    • pascale: we need a group to make a proposal to get the defs down flat
    • peter: ont distinguishes binding from catalysis
    • kimberly: we're not consistent about how we use binding
    • rex: documentation is the core issue
      • docs will solve all of the above

ACTION ITEM

come up with a rational plan for documentation and indexing (including rationale and examples)
  • suzi: we can ask the SAB about this as well
  • ingrid: i look at term def
  • how should we do this?
  • seth: why not GONuts and not muck-up the data
  • richard: examples would be very valuable, including counter examples
  • pascale: add more fields that different people could see
  • alex: users should see the information
  • jen: wants one single information resource
  • mike: this curator info shouldn't be private--this is good stuff
  • debbie: seconds mike, and GONuts
  • emily: seconds
    • children as well
  • jen: graph can be weird
    • david: will fill in missing bits
  • jane: docs go bad over time need freshness
  • david: great defs vs. curator judgement--still an art
    • harold: seconds; always constrained by training

...

ACTION ITEM

Final

Look at action items from last meeting

  • push forward not dones
  • carry forward documentation issues

Previous Incomplete Action Items

| Status | Responsible Party | Task | Comments |
| In Progress | Documentation working group | Document annotation SOPs | Another factor we have been tracking is when a curator judges that the curation of a gene is ‘comprehensive’, that is, that it accurately represents the biology (irrespective of the number of papers available or read). This appears in the spreadsheets. The guideline is that when there are few papers, all papers should be read; when there are many (a curator can judge what is too many), then a review should be read to find the important primary literature and decide what information needs to be captured. We don’t keep track of whether or not reviews have been read. WormBase uses Textpresso (PMID 15383839), which helps ensure curators do not overlook information. The ‘comprehensive’ curation status doesn’t get invalidated when a newer paper is published; however, curators may (and are encouraged to) update the date when the newer literature is curated. |
| | Chris Mungall | Re-calculate with is_a only paths | |
| | Chris Mungall | Re-calculate with experimental codes only | generate several versions of the data classified by different evidence codes? |
| | Chris Mungall | Provide such reports on a regular basis | |
| | Judy Blake | Contact NCBI/NLM/OMIM to link to reference genome genes | |
| In progress | Documentation working group | Document Changes to Gene Association File (GAF) | column 2 is canonical gene ID; column 17 is thing you are annotating (always required); column 12 matches column 17 and contains SO IDs; add header to gene association file |
| In progress | Documentation working group | Document Changes to gp2protein file | includes complete gene index (except for pseudogenes and transposons); column 1 is canonical gene ID; column 2 is accession for sequence of longest form of protein from UniProtKB: or NCBI; syntax of gp2protein file will be provided by Mike and Chris |
| In progress | (Jane) | write notice of changes to GAF and gp2protein to users | |
| In progress | MODs + Ben Hitz | make sure that their input matches new GAF and gp2protein requirements | |
| | Seth Carbon | Have AmiGO show co-occurring terms, similar to the function in QuickGO. | |
| | Seth Carbon & Val Wood | SLIM by SLIM matrix | Would be used to review intersections of different cellular processes and look for unexpected intersections which may identify possible errors. Try first applying to function and component terms; outline cells that you expect to be empty. Have these matrices generated automatically from the AmiGO database. |
| | Ben & Mike | Get isoforms into GO database | |
| | MODs & Chris | Consistent use of IMP "with" column | Chris will be talking to individual groups about how they use the with column for IMP. Each MOD group needs to respond to this for Chris. |
| | Michelle | Implement Michelle's proposal | decide whether to put the ‘response to drug’ ID in column 16 or make it a separate IC annotation. Annotate to chemical term ‘response to cocaine’, co-annotate with chemical term for now, then later when available, put GO ID for ‘response to drug’ in column 16 (or separate IC annotation). |
| pending | Midori, David, Chris, Mike Bada | Chemical derivatives and metabolism terms | Need input from Chris and Mike on how much can be automated; possibly also current and near-future state of ChEBI |
| | MODs & Pascale | All groups to check on how they use IGI and update annotations as per Princeton discussion. | |
| | Val | Circulate draft doc on how contributes_to can & can't be used | Will include: "Would this annotation make sense if this subunit was" ... [thought not finished; might be something like "viewed by itself"]. |
| | MODs & Pascale | Check existing annotations for "contributes_to" with IDA | We think contributes_to should only be allowed with IDA. Look into adding to annotation checking script to flag contributes_to. |
| | Jen | Implement rules and software for sanity checking automated annotations (species-based trigger file). | |
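The contributes_to check listed above could be added to an annotation checking script roughly as follows. This is a minimal sketch, assuming tab-separated GAF lines with the qualifier in column 4 and the evidence code in column 7, and assuming the rule ends up being "contributes_to only with IDA" as tentatively stated. The function name and the sample lines are hypothetical.

```python
def flag_contributes_to(gaf_lines):
    """Return GAF lines that use contributes_to with an evidence code other than IDA."""
    flagged = []
    for line in gaf_lines:
        if line.startswith("!"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # too short to hold a qualifier and an evidence code
        qualifiers = cols[3].split("|")   # column 4: Qualifier
        evidence = cols[6]                # column 7: Evidence Code
        if "contributes_to" in qualifiers and evidence != "IDA":
            flagged.append(line)
    return flagged


# Placeholder annotations: only the second one should be flagged.
sample = [
    "!gaf-version: 2.0",
    "DB\tID1\tsym1\tcontributes_to\tGO:0000000\tPMID:0\tIDA",
    "DB\tID2\tsym2\tcontributes_to\tGO:0000000\tPMID:0\tIMP",
    "DB\tID3\tsym3\t\tGO:0000000\tPMID:0\tIMP",
]
flagged = flag_contributes_to(sample)
```

Returning the offending lines, rather than rejecting the file, fits the stated intent of flagging annotations for the MODs to review.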

New Items

| Status | Responsible Party | Task | Comments |
| empty | empty | empty | empty |