InterPro2GO Session October 4th 2011

A face-to-face meeting at the EBI between GOA curators, GO editors and the InterPro curation team to go through the InterPro2GO mapping process, problematic mappings, relationships between GO terms and InterPro domains, and related topics.

Agenda

  1. InterPro to give an overview of the InterPro2GO mapping procedure [1]
  2. Jane to give an overview of the multi-organism process node (GO:0051704) in GO, and how to use the terms for annotation. [2]
  3. Jane to give an overview of the relations that are being developed for GO annotations, and how they'll be used, including the membrane terms. [3]
  4. Problematic areas of InterPro to cover:
    1. When to use particular membrane-associated component mappings (see the recent PAINT paper)
    2. Use of the protein-binding term. 'protein binding ; GO:0005515' and 'binding ; GO:0005488' should only be used for annotation when an identifier is present in the with column (annotations lacking the identifier are stripped out of the GOA files). Are there more specific terms InterPro can use instead?
    3. How to GO map proteins that form complexes (the relationship ontology might help here)
    4. GO mapping proteins that have different functions according to the component they're present in.
    5. Component mappings in general (i.e., should we be mapping terms based on proteins that are *only* found in a particular location, or do we also map proteins that have been observed in that location at some stage?)
    6. Clarification on how GOA use the 'NOT' qualifier - are there implications that we need to be aware of in InterPro?
    7. The idea of using blacklists to prevent erroneous mappings to sequences based on InterPro matches
      1. Relating to blacklists, revisit the protein kinase catalytic domain entry (IPR000719), which maps the terms GO:0006468 protein phosphorylation, GO:0004672 protein kinase activity and GO:0005524 ATP binding to ~100K sequences in UniProt. However, among these are members of the tribbles family, which are pseudokinases. Are there sensible ways to handle this kind of situation without sacrificing large numbers of true positive mappings?

NB: See Minutes for the discussions/resolutions on these points.


Problematic InterPro Mappings

  • IPR000402. ATPase and ATP metabolism terms. [4]
  • IPR000342. Signal transducer activity. [5]

  • IPR024738
    • It represents the Ada1/Tada1 subunit of the SAGA-like complex.
    • The SAGA complex is a transcriptional coactivator (involved in regulation of transcription by RNA polymerase II). Should we map it with 'contributes_to transcription coactivator activity (MF) GO:0003713', following the GO complex annotation guidelines?

DISCUSSION: It was agreed that 'contributes_to' is the correct annotation here. For now InterPro may have to store the 'contributes_to' information internally. Also look at the GO protein complex terms for an appropriate CC term.

  • IPR018767
    • Nucleus export protein Brr6. It is mapped to GO:0016021 integral to membrane. Should it instead be mapped to GO:0005635 nuclear envelope?

DISCUSSION: InterPro should submit an SF item requesting an 'integral to nuclear membrane' term. The more specific 'integral to nuclear inner membrane' and 'integral to nuclear outer membrane' terms already exist.


  • Ribosomal Proteins
    • In the database we have many entries for ribosomal proteins, but perhaps we are not mapping them correctly. Understanding better the relationships between MF and BP could help.
    • Example: IPR000439 Ribosomal protein L15e. At the moment it is mapped as:
    • Process GO:0006412 translation
    • Function GO:0003735 structural constituent of ribosome
    • Component GO:0005840 ribosome

DISCUSSION: It was agreed that the BP 'translation' can be applied to all ribosomal proteins.


  • NOT qualifier
    • I think the pseudokinase TRIBBLES should be a candidate for the NOT qualifier. It matches IPR000719 and IPR017442, both protein kinase domain entries. The presence of a kinase domain is characteristic of TRIBBLES proteins; they have simply lost their catalytic activity, which is why they are called pseudokinases.
    • A different case is IPR000014, which integrates 3 signatures. One of them, the SMART signature SM00091, is causing problems because it hits false positives such as Q9C1W9. It is a signature for a PAS domain (mapped to signal transduction), whereas Q9C1W9 is a DNA ligase. SM00091 hits 44433 proteins in total.


Minutes

Present

  • Jane Lomax (GO)
  • Rebecca Foulger (GO)
  • Emily Dimmer (GOA)
  • Alex Mitchell (InterPro curation co-ordinator)
  • Amaia Sangrador (InterPro curator)
  • Craig (InterPro Bioinformatician)
  • Siew-Yit Yong (InterPro curator)


Changes In InterPro Annotation Policy

  • Amaia gave an overview of InterPro's annotation strategy, which is moving towards domain-specific annotations for domain entries instead of carrying whole-protein mappings over to domains.
  • We discussed the two-tier approach of InterPro: 1) mapping the functions of individual domains, and 2) mapping the processes and functions that the whole protein is involved in. Alex is going to check with Sarah H. to see whether she's in favour of the two-tier approach.
  • InterPro are considering using a 'contributes to/involved in' qualifier to inherit MF and BP terms up to the whole protein level. This would mean current annotations wouldn't be lost: contributes_to could be applied en masse to all existing InterPro domain-based mappings (i.e., InterPro would convert all of its current domain-based mappings to contributes_to, then work through them, adding domain-specific functions).
  • AI: GO and InterPro to come up with a definition for 'contributes_to' before they start using it. Jane and Emily will talk to the other GO managers about this.
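
The en masse conversion discussed above could look roughly like the following. This is a minimal sketch, not InterPro's actual schema: the `Mapping` record, its fields and the function name are hypothetical, and the choice of 'contributes_to' for MF terms and 'involved_in' for BP terms is an assumption made for illustration, pending the agreed definition.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Mapping:
    """Hypothetical record for one InterPro entry to GO term mapping."""
    entry: str                       # InterPro entry accession
    go_id: str                       # GO term identifier
    aspect: str                      # "F" (function), "P" (process) or "C" (component)
    qualifier: Optional[str] = None  # None means a plain, unqualified mapping


# Assumption: MF terms get 'contributes_to', BP terms get 'involved_in',
# CC terms are left unqualified.
QUALIFIER_BY_ASPECT = {"F": "contributes_to", "P": "involved_in"}


def add_whole_protein_qualifiers(mappings: List[Mapping]) -> List[Mapping]:
    """Return a copy of the mappings with qualifiers added where none exists."""
    converted = []
    for m in mappings:
        qualifier = QUALIFIER_BY_ASPECT.get(m.aspect)
        if qualifier is not None and m.qualifier is None:
            converted.append(replace(m, qualifier=qualifier))
        else:
            converted.append(m)
    return converted


# The current IPR000439 mappings quoted earlier on this page.
current = [
    Mapping("IPR000439", "GO:0003735", "F"),
    Mapping("IPR000439", "GO:0006412", "P"),
    Mapping("IPR000439", "GO:0005840", "C"),
]
converted = add_whole_protein_qualifiers(current)
for m in converted:
    print(m.entry, m.go_id, m.qualifier or "-")
```

Curators could then work through the converted list, replacing qualified whole-protein mappings with domain-specific ones where appropriate.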

Multi-organism Processes

  • Jane gave an overview on the multi-organism node in GO.


GO Membrane Terms

  • Jane's presentation gave an overview of the membrane terms in GO. The plan is to change the existing terms so that the integral/intrinsic/peripheral information is captured at the annotation stage. There is no need to worry about this for the moment, as this can be mapped automatically in the future.
  • Documentation on the current membrane terms in GO is here

Diag-membrane.gif


Browsing GO

OBO-Edit is a very nice tool for browsing GO. It can be downloaded here. There are Unix, Mac and Windows versions available.

You can view GO without having to check out the ontology files by doing the following:

  1. Choose the "File -> Load..." menu option
  2. Choose the OBO Flat File Adapter
  3. Click 'Add' to create a new profile
  4. Type http://www.geneontology.org/ontology/gene_ontology_edit.obo into the 'Path or URL' box.
  5. Click 'OK'. (it may take a little time to load)
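
For a scripted look at the same kind of file, the sketch below reads `[Term]` stanzas from OBO flat-file text and lists the direct `is_a` children of a term. It is a simplified reader under stated assumptions: it only handles the `id`, `name` and `is_a` tags, and the three-term sample is hand-written for illustration (the 'hypothetical child process' term and its ID GO:9999999 are made up), not an excerpt from the real ontology.

```python
# Hand-written sample in OBO flat-file style; the last term is hypothetical.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0051704
name: multi-organism process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:9999999
name: hypothetical child process
is_a: GO:0051704 ! multi-organism process
"""


def parse_terms(text):
    """Return {term_id: {"id", "name", "is_a"}} for every [Term] stanza."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Start a new record for [Term]; ignore other stanza types.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, non-term stanzas
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current["id"] = value
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! parent name" comment.
            current["is_a"].append(value.split("!")[0].strip())
    return terms


def children_of(terms, parent_id):
    """Direct is_a children of parent_id, sorted by identifier."""
    return sorted(t["id"] for t in terms.values() if parent_id in t["is_a"])


terms = parse_terms(SAMPLE_OBO)
print(terms["GO:0051704"]["name"])
print(children_of(terms, "GO:0051704"))
```

To try it on the real ontology, the contents of a downloaded OBO file can be passed to `parse_terms` in place of `SAMPLE_OBO`.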

There is a tutorial here about using OBO-Edit. You can ignore the whole editing section, but the tutorial may help you get started on a layout. If you've got any questions about using OBO-Edit for browsing, or have any problems with loading or views, just pop along to the GO office and we can give you a demo.

PROTEIN BINDING

GO pointed out the various terms under 'protein binding ; GO:0005515' and 'protein domain specific binding ; GO:0019904'. InterPro may be able to shift many of their mappings to these more specific terms. The plan is to shift the protein binding terms to be binding to specific families, but at present there isn't a good source of families to use.
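
The rule raised in the agenda, that 'protein binding ; GO:0005515' and 'binding ; GO:0005488' annotations are only kept when the with column holds an identifier, can be sketched as a simple filter. This is an illustration only: the dictionary keys (`protein`, `go_id`, `with`) and the protein names are hypothetical stand-ins, not the GOA file layout.

```python
# Terms that are only meaningful when the 'with' column names a partner.
NEEDS_WITH = {"GO:0005515", "GO:0005488"}


def keep_annotation(annotation):
    """Return False for binding annotations that lack a 'with' identifier."""
    if annotation["go_id"] in NEEDS_WITH:
        return bool(annotation.get("with", "").strip())
    return True


# Hypothetical annotations; PROT_A..PROT_D are placeholder names.
annotations = [
    {"protein": "PROT_A", "go_id": "GO:0005515", "with": "UniProtKB:PROT_B"},
    {"protein": "PROT_C", "go_id": "GO:0005515", "with": ""},
    {"protein": "PROT_D", "go_id": "GO:0005524", "with": ""},
]

kept = [a for a in annotations if keep_annotation(a)]
for a in kept:
    print(a["protein"], a["go_id"])
```

Because an InterPro match cannot supply a binding partner, mappings to these two terms would fall into the filtered-out group, which is the motivation for moving to more specific child terms.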


PROTEIN COMPLEXES

GO pointed out the many terms under 'protein complex ; GO:0043234'. InterPro may be able to use many of these in their CC mappings, and can submit requests for more terms if required. The protein complex terms in GO aren't ideal, as they are basically just a list. Eventually the plan would be for these to be created and maintained by other databases (e.g. PRO, IntAct), but this will be some time off. The GO protein complexes should be species-generic: if InterPro find that any definitions are too specific, just submit an SF item and we'll take a look.


MAPPING PROTEINS THAT HAVE DIFFERENT FUNCTIONS IN DIFFERENT COMPONENTS

At present, this should be recorded in internal notes. The plan is to capture this information using additional columns at the annotation stage, e.g. 'activity X is dependent on/observed when in location Y'.


NOT vs BLACKLISTS

Emily gave an overview of the difference between a blacklist and a NOT annotation. The 'NOT' qualifier is used when there is experimental or sequence evidence that a particular protein does not have a specific activity or is not found in a particular cellular location or is not involved in a specific process. The blacklist would be a list of proteins that should never be associated with a GO term. Tony is currently testing out filtering these out of the UniProt annotation set.
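
The distinction can be sketched in code. In this illustration a blacklist removes predicted (protein, GO term) pairs before they reach the annotation set, whereas a NOT annotation is a separate statement recorded about the protein. The protein names KINASE_1 and TRIBBLES_1 are hypothetical placeholders, the data structures are assumptions, and which terms to blacklist is a curation decision; the two chosen here are only an example.

```python
# GO terms mapped from the protein kinase catalytic domain entry (from the agenda).
ENTRY_TO_GO = {"IPR000719": ["GO:0006468", "GO:0004672", "GO:0005524"]}

# Hypothetical proteins and their InterPro matches.
MATCHES = {"KINASE_1": ["IPR000719"], "TRIBBLES_1": ["IPR000719"]}

# Blacklist: (protein, GO term) pairs that must never be predicted.
# Illustrative choice of terms for the pseudokinase.
BLACKLIST = {
    ("TRIBBLES_1", "GO:0004672"),
    ("TRIBBLES_1", "GO:0006468"),
}

# A NOT annotation is an explicit statement, kept separately from predictions.
NOT_ANNOTATIONS = [("TRIBBLES_1", "NOT", "GO:0004672")]


def predict(matches, entry_to_go, blacklist):
    """Transfer GO terms via InterPro matches, skipping blacklisted pairs."""
    predicted = []
    for protein, entries in sorted(matches.items()):
        for entry in entries:
            for go_id in entry_to_go.get(entry, []):
                if (protein, go_id) not in blacklist:
                    predicted.append((protein, go_id, entry))
    return predicted


predicted = predict(MATCHES, ENTRY_TO_GO, BLACKLIST)
for row in predicted:
    print(*row)
```

With this approach the true positive mappings for other proteins matching IPR000719 are untouched; only the listed pairs are suppressed.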


Useful Links