GO Consortium Meeting 2007

NO MORE EDITS PLEASE

Downloads

Topics

GO Team and other Status Reports

Monday, January 8th

  1. 8:30 Reference Genomes - Rex&Karen
    • Summaries of the genomes - the display metrics
  2. 9:00 Ontology Content - David&Midori
    • IS_A complete
    • regulates (note: Chris has agreed to a joint ont dev/software perspective on this - mah)
    • Cell Ontology links
    • Collaboration with Jonathan Liu and MIT
  3. 9:30 Ontology & Software - Chris&Ben/Mike
    • Includes OBO-Edit working group report

break

  1. 10:30 Annotation outreach - Jen&Michelle
  2. 11:00 User Advocacy - Eurie&Jane
    • Includes AmiGO working group report
    • Includes 'Hub' report?
  3. 11:30 Operations Summary - Suzi
  4. Publications/Presentations/Tutorials/Posters (handout)

Issues to be addressed, ordered with harder topics first

Monday, January 8th after lunch & Tuesday, January 9th morning

Discussions

  1. GO policy on incorporating GOA annotations into MOD annotations (Evelyn and Mike/Judy?)
    • GO annotations have been stripped out of GOA-UniProt (the all-species file) and all other gene association files by using the taxid stated within the file. This is defined within the GOC documentation at: http://www.geneontology.org/GO.annotation.shtml#script The plan was for each member group to integrate the annotations that were being filtered out. To date this is only happening at MGI. The result is that annotations from GOA based on experimental results are being lost. GOA receives a lot of user questions about how to get complete annotation datasets. The unstripped GOA-UniProt file is available on the EBI and GOC FTP sites (however, for the latter this is not clearly stated in our documentation).
    • GOA now integrates all experimental data from each GOC member on a monthly basis.
    • We have a GO policy on this. Perhaps, if a GOC member cannot integrate the manual annotations from GOA and others, that taxid should not be filtered from the other gene association files?
    • In practice, the MOD groups identified need to be contacted to find out how they are doing in incorporating appropriate annotations in their files.
    • GOA PDB gene association file - should this file be stripped on the GOC site by taxon id? The file is created by GOA and InterPro3D; a special pipeline is used because PDB entries do not map 1:1 to UniProt/other identifiers (Dan).
  2. Prioritize list for next ontology development meetings. Do we need to do these in sequence or in parallel? Many of the same ontology developers are always involved {David, Midori, Jen, Jane}. However, there are cases where others are involved. Some prioritization may come from the GO-engineering collaboration with MIT. At present, the sorted list is as follows.
    1. is_a complete (hopefully done by GOC meeting)
    2. some component of development and physiology of cardiovascular system (May)
    3. muscle development {suggested by Erika and colleagues}
    4. peripheral nervous system {continuation of early CNS work}
    5. DNA repair? (perhaps Eurie could organize this?)
    6. Transport (suggested by Val)
    7. How do we give credit to external contributors to GO (Midori)
  3. Piped data for IPI, need consistency in usage (Evelyn)
    • IGI data allows piped accessions in the 'with' column to capture the fact that two or more genes may be interacting simultaneously. IPI data also allows piped accessions in the 'with' column, but usage differs: some GOC members use the pipe specifically to say that, in a given paper, proteins A, B and C precipitated together or form part of a complex; others, I think, also use it where two separate experiments in the same paper showed that protein A interacted with protein B and with protein C. GOA prefers using it like IGI, for a specific circumstance; otherwise information is lost? Others??
    • Related issue: GOA has decided for the moment not to pipe several protein binding interactions simply because they come from the same paper. We unwrap piped data from MODs because of inconsistency in usage and because this data is not normalised (which causes problems for databases and web services).
    • Karen C adds: I think the same issues apply to IGI, so whatever we do should apply to the with column when used for either IPI or IGI, or perhaps for any use of the with column.
  4. Discussion of 'anatomical processes' such as 'heart pumping' in the process ontology. Should we add terms like this and, if so, how are we going to do it? If we are not, can we express these anatomical processes in another way?
    1. Add these terms and then make non-anatomical processes part_of them. This will create a lot of true path violations if different anatomical structures in different organisms carry out the same process. We would also have to make specific children.
    2. Create a method for 'annotating' anatomical structures from other ontologies with GO biological processes.
  5. Overlap/connections between GO and SO?
    Emily Dimmer submitted a SF item asking if GO would want to have terms in the component ontology to represent situations such as the finding that human myosin 6 coimmunoprecipitates with RNA pol II at the promoters and/or intragenic regions of active genes. After an email discussion between Karen E and Karen C, the question boils down to whether/how to make such a connection between SO and GO.
    • On the one hand, it seems redundant to repeat the terms in both places. In general we are trying to avoid overlap between the ontologies.
    • On the other hand, it seems that SO is used for the annotation of the sequences with respect to what they are, while GO is used for the annotation of gene products with respect to where they are located for component terms. Thus, I don't want to start mixing my annotations of gene products with SO terms as well as GO terms. If we want to be able to annotate these types of sequence locations as places where gene products can be localized, I'd rather do it in a way where there is a term in GO that has some relationship to a term in SO.
    The consensus is that we should discuss this issue at the GOC meeting. The SF item is here: https://sourceforge.net/tracker/index.php?func=detail&aid=1587313&group_id=36855&atid=440764
    This may also help with a question from Michelle about the provirus and viral genome terms: https://sourceforge.net/tracker/index.php?func=detail&aid=1571666&group_id=36855&atid=440764
  6. Do we want all groups to be able to provide structured notes, or do we want to proliferate GO terms for things like cell types? See https://sourceforge.net/tracker/index.php?func=detail&aid=1598448&group_id=36855&atid=440764 and https://sourceforge.net/tracker/index.php?func=detail&aid=1587269&group_id=36855&atid=440764
  7. Change in interpretation of the database identifier in the DB column of association files (Emily). Change suggested so that the combination of the DB (column 1) and DB_Object_ID (column 2) fields provides a globally unique and resolvable identifier, rather than naming the database submitting the file (as currently defined). The ASSIGNED_BY column will still state where the annotation originated.
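
A minimal sketch of how items 3 and 7 above might look in practice. It assumes a 15-column, tab-delimited gene association line with DB in column 1, DB_Object_ID in column 2, the evidence code in column 7, the 'with' field in column 8 and ASSIGNED_BY in column 15. The names parse_line and unwrap, and the EXDB identifiers in the usage note, are hypothetical and for illustration only.

```python
from typing import NamedTuple, Tuple

# Assumed 0-based column positions in a 15-column gene association line.
COL_DB = 0
COL_OBJECT_ID = 1
COL_EVIDENCE = 6
COL_WITH = 7
COL_ASSIGNED_BY = 14


class Annotation(NamedTuple):
    object_id: str             # DB and DB_Object_ID joined as "DB:ID"
    evidence: str              # evidence code, e.g. IPI or IGI
    with_ids: Tuple[str, ...]  # accessions from the piped 'with' column
    assigned_by: str           # where the annotation originated


def parse_line(line: str) -> Annotation:
    """Parse one tab-delimited association line into an Annotation."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:
        raise ValueError("expected at least 15 tab-delimited columns")
    # Item 7: DB plus DB_Object_ID together form the identifier.
    object_id = f"{cols[COL_DB]}:{cols[COL_OBJECT_ID]}"
    # Item 3: the 'with' column may hold several pipe-separated accessions.
    with_ids = tuple(v for v in cols[COL_WITH].split("|") if v)
    return Annotation(object_id, cols[COL_EVIDENCE], with_ids,
                      cols[COL_ASSIGNED_BY])


def unwrap(annotation: Annotation) -> list:
    """Return one annotation per 'with' accession.

    This drops the information that the accessions were piped together,
    which is exactly the ambiguity raised in item 3.
    """
    if not annotation.with_ids:
        return [annotation]
    return [annotation._replace(with_ids=(w,)) for w in annotation.with_ids]
```

For example, a line with DB "EXDB", DB_Object_ID "X0001" and a 'with' value of "EXDB:X0002|EXDB:X0003" would parse to the identifier "EXDB:X0001" and unwrap into two annotations. Whether unwrapping is appropriate depends on which meaning of the pipe the group agrees on.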

Things that have been agreed, just need to do

  1. All MODs should provide a file with all protein sequences. Also the known UniProt or NCBI accessions should be included in the gp2protein file.
    • Each MOD has the goal of annotating all the gene products within their genome of interest. Thus each MOD has a dataset of proteins, even those that have not yet been annotated. This dataset should be provided from the MOD site, and from the GOC site. The dataset should include the UniProt or RefSeq accession if known.
    • The gp2protein file should include all the accession numbers, even the accessions for proteins that have not yet been annotated.
    • The International sequence databases have an ownership system in place that limits who can make changes to the sequence or its annotations. Sometimes the MOD has newer information than is available from GenBank/EMBL/DDBJ because the authors are slow depositing updates. (Mike)
  2. Make our choice for on-line meeting support software (John D-R)
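
As a rough illustration of item 1 above, the sketch below checks that every gene product in a MOD's protein set has an entry in its gp2protein file, including gene products with no annotation yet. The file layout is an assumption: two tab-separated columns (gene product identifier, then protein accessions separated by ';'), with comment lines starting with '!'. The function names and the EXMOD identifiers in the usage note are hypothetical.

```python
def read_gp2protein(lines):
    """Map each gene product ID to its list of protein accessions.

    Assumed layout: "<gene product ID>\t<accession>[;<accession>...]".
    Lines starting with '!' are treated as comments.
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("!"):
            continue
        parts = line.split("\t")
        gene_product = parts[0]
        # A gene product may be listed with no known accession yet.
        if len(parts) > 1 and parts[1]:
            accessions = parts[1].split(";")
        else:
            accessions = []
        mapping[gene_product] = accessions
    return mapping


def missing_gene_products(all_gene_products, mapping):
    """Return gene products from the full MOD set that the file omits."""
    return sorted(set(all_gene_products) - set(mapping))
```

Given a full set of EXMOD:G1, EXMOD:G2 and EXMOD:G3 and a file that lists only the first two, missing_gene_products would report EXMOD:G3 as absent from the file.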

Put in Reports Session

  1. Do we want to have time to update other GOC members on GO-related grants that have been submitted or are to be submitted, or do we leave this info for project reports? (Evelyn)
  2. Perhaps as part of the Outreach group, I would like to discuss the experiences of GOC members with getting feedback from the community on annotations: what works best (wiki, face-to-face chats, e-mail, online forms, etc.)? (Evelyn)
  3. I would like an update on complex GO annotations (nomenclature, when and when not to request a term), GO collaborations with IntAct and ChEBI, etc. (Evelyn)

Need proposals for the new evidence code definitions

  1. Resolution of several Evidence Code issues from Annotation Camp (Karen & Evidence Code documentation committee)
    • What evidence code to use for profile HMM based annotations.(Michelle)
    At the annotation camp a proposal was raised to use RCA for profile HMMs, while Michelle has argued that these should remain ISS. There is agreement that the models used for things like TMHMM and SignalP might better belong as RCA. However, there is disagreement about the HMMs in the TIGRFAM and Pfam sets. The proposal says RCA; others argue it should be ISS.
    • (Note added by Val.) The original proposal was that ISS should only be used when transferring annotations to orthologs. This isn't always practical (or possible), as for some domains (e.g. F-box), we know they all act as substrate-specific adaptors for ubiquitin ligases, but we cannot unambiguously assign them to a characterised ortholog. However, the protein is clearly a family member (judged by assessing the alignment - ISS) and has been named as an F-box by the laboratories studying these proteins (although this is currently unpublished). I could leave this as IEA, but I want to show that this has been manually assessed. This is the only way we can weed out false positives from the electronic mappings (I have reported ~260 so far; see https://sourceforge.net/tracker/?group_id=36855&atid=605890). Also, using our protocols, manual assignment overrides other, possibly less granular, redundant IEAs.
    The same would apply to many proteins with the zf-fungal Zn(2)-Cys(6) binuclear cluster domain. All proteins with this domain are transcription factors, and the annotation is based on the fact that they are members of this family (based on the multiple alignment - ISS). Sometimes the orthologs cannot be unambiguously identified (because of multiple deletions and duplications); for others the S. cerevisiae orthologs are not studied or annotated. However, every single one characterised so far is a transcription factor. I don't see a problem with ISS annotations to the Pfam alignment for the functions which apply to ALL family members. In fact, with an ISS to a multiple alignment (as previously pointed out by Michelle) you can have greater confidence than with an ISS to only a pairwise alignment. I see far more problems with ISS annotations which are not supported by anything in the 'with' column (too many to even provide feedback on). Converting IEA to ISS involves many things (selecting the correct degree of granularity, checking the alignment, checking that all proteins with the domain studied so far have this function, community feedback). But essentially these are ISS, not RCA.
    • (Karen C adds) At the recent Annotation Camp, we also agreed to use RCA for things like tRNA scan and the snoRNAs, but the more I think about it, I really think this is purely sequence based and thus should be given ISS, not RCA. We would also need to resolve what, if anything, could appropriately be put in the with column.
    • Flip side of the issue: What should RCA cover?
    At the Annotation Camp, we proposed to use RCA for a number of purely ISS-based methods where it was difficult/impossible to fill in the with column. Firstly, Michelle Gwinn has objected to disallowing use of ISS for purely sequence based methods. Secondly, RCA was initially proposed for computational methods that combined multiple data types and then performed some analysis that could be used to make predictions for GO terms. At the St. Croix GOC meeting, it was mentioned that the docs currently state that RCA should be for non-sequence based, but that it should probably be expanded to allow inclusion of sequence based data, provided that the computational method was not purely sequence based.
    • Boundary between ISS/RCA/IEA
    Once the above issues on what ISS and RCA should cover are resolved, we may also want to make sure we are clear on what the policy is for promoting an IEA to the appropriate curator-reviewed code. The Annotation Camp minutes note that "There seems to be a lack of clarity on the proposed new boundaries between ISS, RCA, and IEA, particularly RCA and IEA. Even just the above two paragraphs leave me confused as to where one would use IEA versus RCA for an HMM-based method. The group as a whole may need to discuss this further." I'll also add that while the original boundary between IEA and ISS made a statement about curatorial review of that particular annotation, the guidelines for use of RCA stated only that the method have been reviewed and validated, not that each individual annotation be validated by a curator.
    • Clarification of TAS and NAS
    1. TAS - At the Annotation Camp, we agreed to limit use of TAS to situations where you can say "Paper A that I was annotating referred to paper B as the source of this statement". This would exclude the historical usage of TAS for common knowledge statements. Basically, this code would only be for cases where you can go to the paper cited for the annotation and trace the statement to a cited reference. To use TAS, there is no requirement to go to the cited paper and confirm that it contains experimental characterization of the species of interest, because that defeats the purpose of the TAS code. However, recognizing that authors are not always precise with respect to species when citing references, Reference Genomes have agreed to avoid use of this code whenever possible. We should probably add documentation about this issue, recommending that curators track down the cited reference and annotate from it when possible.
    2. NAS - At the Annotation Camp, we agreed that NAS should be used in all cases where the author makes a statement that a curator wants to capture but that cannot be traced to a specific publication. This should apply both to peer-reviewed papers and to information from textbooks.
    NAS and proposed use of with column - An example of when to use NAS and what to put in the WITH column was provided by David H at the 2006 annotation camp as follows: "If I draw the conclusion that a transcription factor is in the nucleus then it is IC; if the author draws that conclusion then it is NAS. The WITH field would contain the GOID for 'transcription factor activity' in each of these cases. Note that this is an expansion of the use of the WITH field for the NAS evidence code."
    • IEP - may be some need to clarify usage of this code (note that this comes from Evidence Code Group discussion, not from Annotation Camp per se, will check with group and add to/remove this particular point as appropriate).
    • ND - (this wasn't part of the annotation camp discussions, just tagging it on the end!) Most of the annotations that were formerly to the 'unknown' terms but are now to the root nodes have the evidence code ND. The use of ND is useful for identifying these annotations, but it seems that there are some 'unknown' annotations that have other evidence codes (e.g. TAS and NAS where an author has stated in a paper that there is no data available). Should we standardize all of these to use ND? There are about 50-60 in total from all groups (Emily and Jane).

Need more detail on the proposal

  1. Response to drug
    Erika Feltrin has a proposal to overhaul the area of the ontology under 'response to drug', and the plan will also affect the 'drug transport' and 'xenobiotic' terms. The ontology working group have held an online content discussion meeting and agreed that this material should be presented to the consortium meeting if time allows.
    For a summary, see http://gocwiki.geneontology.org/index.php/Response_to_drug

No Discussion Needed

  1. Evaluation of project tracking methods
    • Not sure what this would be? This needs more definition. (Mike)
  2. Handling multiple identifiers for gene products and sequences
  3. The issue of using the GO_REF vs extension of the evidence codes to amplify upon the method that is used.
    • (Question from Val) Does this include the proposal for introduction of a code to distinguish HTP experiments discussed at the curation meeting? If not, can it be included?
    • Need more specifics about this item. I do not believe the intent was to discuss HTP but this needs to be stated. (Mike)
  4. Hide comments in AmiGO. There is a conflict between the AmiGO browser as a tool for biologist users and the AmiGO browser as a tool for annotators. The 'comments' are often directed to annotators and can thus be considered either irrelevant or confusing to biologist users. In the case of obsoletes, one should just be directed to suggested terms. Annotators might better use OBO-Edit to see comments. So, should we suppress display of comments on AmiGO?
    • Suggest that this be a topic for the AmiGO working group. (Mike)
  5. GO Consortium Tools (Evelyn, Emily)
    • GOA feels that GOC should not have tools on the GO tool page unless they are maintained, or should at least highlight that fact. We also feel that we should perhaps consider a top 10 GOC-reviewed set of tools that we can recommend and liaise with on a regular basis. GOA can do that independently of GOC if GOC does not want to take such a position. Most users want advice on GO tools, and presenting them with over 100 is not overly helpful. We also need to consider how to modify the next GO users/tool meeting (already discussed on GO management, I think?)
    • This is a resource issue. It would certainly be a good idea to have a small number of selected tools. However, who has the time or wants to take the time to handle this? (Mike)
  6. Since Alex is unable to attend the meeting, perhaps we can arrange a time to have a web conference with him. This will show the group how we have been working from distributed sites and we can get an update on the immunology stuff. I suggest we use whatever technology we have found to be the best by the time of the meeting. Then we can discuss whether it is good enough to buy, etc. [submitted by dph].

New proposals

Tuesday, January 9th after lunch

  1. Protein Family based annotation tool - Suzi
  2. Term history tracking capability - John/Chris/and OBO-Edit group
  3. Incorporation of all gene product sequences and IDs into GO database and fasta files. How are we to accomplish this?
  4. New set of high-level terms for cellular component: fixes the problems of terms not being 'cellular components', allows alignment with CARO - Jane (in collaboration with Melissa)
  5. GO development "training": At the October 11 managers' conference call, David, Midori and Jen proposed an informal training session for ontology development, so that more GO annotators will be able to work directly on the ontologies. We would cover using OBO-Edit and CVS in the GO context. David plans to stay on an extra day to work with the GO editors, and other annotators who want to do ontology development would be welcome.
  6. Future users meetings - Jane and Eurie.

Wednesday, January 10th morning

  1. Unfinished topics from previous afternoon
  2. Summary and wrap-up
  3. Next consortium meeting


Proposed Discussion Topics

  1. 'response to drug' SF 1242405
  2. difference between function and process