Extension of Protein2GO to non-UniProtKB Identifiers


Conference Call Agenda

13 December 2013

Google Spreadsheet


What types of entity identifiers might be needed?

  • Proteins not in UniProtKB
  • ncRNAs
    • Examples
      • C. elegans gene lin-4 encodes a miRNA that regulates gene expression during larval development - Currently annotations are made to the WB gene ID
  • Orphan genes
    • Examples
      • C. elegans gene abc-1 is an uncloned locus defined by a variation that results in defective chromosome segregation - Currently annotations are made to the WB gene ID
  • Protein complexes

Knowledge Representation

  • What kind of biological statements do we want to make?
  • Given these statements, what is the appropriate resource for the entity IDs?
  • How will this be represented in the GAFs/GPADs?

Practical Considerations

  • How many of each type?
  • ID stability - if there is churn, can IDs be mapped forward so that they do not go stale?

Overview of representation of complexes in ontology


In attendance: Chris, Fiona, Harold, Judy, Kimberly, Paul S., Petra, Rama, Sandra

Unfortunately, people from the UK were not able to call in to the conference line, so we were missing Rachael, Tony, Claire, Susan, amongst others.

  • Summary of issue: some entities that curators would like to use for GO annotation cannot currently be used with Protein2GO
    • Examples:
      • Proteins that don't have UniProtKB IDs
      • Gene IDs
      • ncRNA IDs
      • Protein Complex IDs
 Response from UniProt: We propose that a common identifier is used for each entity type as follows:
 1. Proteins: UniProt accessions
 2. Functional RNAs: RNAcentral identifiers
 3. Protein Complexes: IntAct complex identifiers
 This would greatly simplify the maintenance of Protein2GO with regard to sanity checking individual identifiers. 
 It will also be clearer for the users to determine which entity type the annotation is referring to. We are willing 
 to assist groups in mapping their identifiers to any of the proposed common identifiers and adding missing protein 
 accessions to UniProt. Protein2GO already has a 'lookup' facility for mapping MOD IDs (currently FlyBase, SGD and 
 WormBase, but can be extended) to UniProt accessions.
 The orphan/unlocalised genes seem to vary widely in their state of characterisation and sequencing, so we are not 
 considering this as a high priority. We would prefer to get the above three categories in place first.
  • There are two parallel issues here wrt curation:
    • What entities are needed and how to use them in Protein2GO
    • How to consistently represent annotations to other entities (e.g. protein complexes) across the consortium
  • Issue #1
    • It seems that there is agreement on which database IDs would be helpful to start:
      • MOD gene IDs
 Response from UniProt: If we can map these to a common identifier for a particular entity type 
 (as listed above), is there any need to use MOD gene IDs?
      • NCBI RefSeq IDs
        • Note that UniProt is willing to add protein IDs for proteins not currently represented
 Response from UniProt: We would like to reiterate that UniProt will, wherever possible, add missing 
 accessions in a timely manner. We are also able to import any sequence from RefSeq. Groups need to 
 communicate the missing proteins directly to us, together with evidence.
      • Protein complex IDs - IntAct, RACE-PRO
        • IntAct is in discussion/development with UniProt about using web services to annotate to protein complex IDs
      • MOD protein IDs
 Response from UniProt: If we can add missing proteins to UniProt, is there any additional need to use MOD protein IDs?
      • RNAcentral
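
The UniProt proposal above amounts to a lookup from entity type to a single identifier resource. A minimal Python sketch of that lookup follows; the dictionary keys and the function name `resource_for` are illustrative assumptions, not part of Protein2GO.

```python
# Sketch only: keys and function name are illustrative, not Protein2GO's.
# Values follow the common identifiers proposed in the UniProt response.
COMMON_ID_RESOURCE = {
    "protein": "UniProt accession",
    "functional RNA": "RNAcentral identifier",
    "protein complex": "IntAct complex identifier",
}


def resource_for(entity_type):
    """Return the proposed identifier resource, or None if none was proposed."""
    return COMMON_ID_RESOURCE.get(entity_type)


# Orphan/unlocalised genes were not given a common identifier in the proposal.
orphan_resource = resource_for("orphan gene")
complex_resource = resource_for("protein complex")
```

Returning None for entity types without a proposed resource mirrors the notes: orphan genes are deferred rather than assigned an identifier source.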

Action Item: Kimberly will follow up with an email to ask Tony what would be most helpful for groups to provide wrt external IDs.

  • Issue #2
    • Representing annotations to entities like protein complexes in the GAFs/GPADs
 Response from UniProt: We are not sure why this is an issue. The protein complex identifier would be in 
 columns 1 and 2 of the GPAD file, and the GPI file would give the DB_Object_Type as 'protein complex'. 
 The GPI file could also contain the component parts of the complex.
    • This issue will also be discussed on the Tuesday GO annotation calls
    • We want to make sure we're consistent across GO about propagating BP, MF, and CC annotations to complex members
    • Annotations to protein complex identifiers - should these be in a separate GAF?
 Response from UniProt: The GAF has column 12, DB_Object_Type, so it should be able to handle different 
 entity types.
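
To illustrate the point about GAF column 12, here is a minimal Python sketch that groups annotation rows from a single GAF by DB_Object_Type, so protein and protein complex annotations can share one file and still be separated downstream. The sample rows, the helper `make_row`, and every identifier in them (including the complex ID and reference) are made-up placeholders; only the tab-separated layout and the column position come from the GAF format.

```python
# Sketch only: sample rows and all identifiers in them are placeholders.
GAF_DB_OBJECT_TYPE = 11  # zero-based index of GAF column 12 (DB_Object_Type)


def make_row(db, object_id, symbol, object_type):
    """Build a 17-column GAF-style row with placeholder annotation values."""
    return "\t".join([
        db, object_id, symbol, "", "GO:0005634", "PMID:0000000", "IDA", "",
        "C", "", "", object_type, "taxon:6239", "20131213", "EXAMPLE", "", "",
    ])


def group_by_object_type(gaf_lines):
    """Group GAF annotation rows by the value in column 12."""
    groups = {}
    for line in gaf_lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= GAF_DB_OBJECT_TYPE:
            continue  # malformed row: too few columns
        groups.setdefault(cols[GAF_DB_OBJECT_TYPE], []).append(cols)
    return groups


SAMPLE_GAF = [
    "!gaf-version: 2.0",
    make_row("UniProtKB", "EXAMPLE_PROTEIN_1", "prot-1", "protein"),
    make_row("IntAct", "EXAMPLE_COMPLEX_1", "cplx-1", "protein complex"),
    make_row("UniProtKB", "EXAMPLE_PROTEIN_2", "prot-2", "protein"),
]

groups = group_by_object_type(SAMPLE_GAF)
```

Whether complex annotations should live in a separate GAF remains the open question in the notes; the sketch only shows that column 12 is enough to tell the entity types apart within one file.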


In attendance: Suzi, Judy, Paul S., Claire, Maria, Tony, Rachael, Chris, Kimberly, Rama, David, Harold

The discussion focused on annotating to ncRNAs, protein complexes, and resolving mapping issues between MOD IDs and UniProtKB IDs.

In preparation for annotating to other entities:

  • ncRNAs
    • Curation tool will use RNAcentral IDs (http://rnacentral.org/)
    • MODs will need to communicate with RNAcentral to establish pipeline for sharing IDs for ncRNAs
  • Protein complexes
    • Several projects curate protein complexes - IntAct, PRO
    • Sandra from IntAct will be presenting a curation pipeline to UniProt re. creating, and annotating to, protein complexes
    • Need to establish a workflow for GO curators to create complex entities and annotations
    • IntAct and PRO would like to work together to share information, entity IDs
  • Resolving ID issues between MODs and UniProtKB IDs
    • Some proteins are not yet represented in UniProt
    • Need to work through the specific cases - Michele M. is doing some of this now; some are easy to resolve, others not so much
    • Other sources, e.g. RefSeq, may be used to fill in missing objects
    • MODs will need to work with UniProt to resolve these issues
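
Working through the MOD-to-UniProtKB cases can start by splitting identifiers into those that already map and those that need follow-up with UniProt. A minimal Python sketch follows; the function name `split_by_mapping`, the mapping table, and all identifiers are hypothetical placeholders, and this is not a description of Protein2GO's own lookup facility.

```python
# Sketch only: the mapping table and all IDs below are hypothetical.
def split_by_mapping(mod_ids, mod_to_uniprot):
    """Separate MOD IDs with a known UniProt accession from those without."""
    mapped = {}
    unmapped = []
    for mod_id in mod_ids:
        accession = mod_to_uniprot.get(mod_id)
        if accession:
            mapped[mod_id] = accession
        else:
            # Candidates to report to UniProt, together with evidence.
            unmapped.append(mod_id)
    return mapped, unmapped


example_table = {"MOD:0001": "ACC_A", "MOD:0002": "ACC_B"}
mapped, unmapped = split_by_mapping(
    ["MOD:0001", "MOD:0002", "MOD:0003"], example_table
)
```

The unmapped list is the part that needs case-by-case work, for example checking whether RefSeq can supply the missing object.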