Extension of Protein2GO to non-UniProtKB Identifiers

From GO Wiki
Revision as of 07:46, 12 December 2013

Conference Call Agenda

Google Spreadsheet

https://docs.google.com/spreadsheet/ccc?key=0Aiei4RvoiQdqdHBFVEcwXzRvcW94V2JOLVFSNjJaTHc&usp=drive_web#gid=0


What types of entity identifiers might be needed?

  • Proteins not in UniProtKB
  • ncRNAs
    • Examples
      • C. elegans gene lin-4 encodes a miRNA that regulates gene expression during larval development - Currently annotations are made to the WB gene ID
  • Orphan genes
    • Examples
      • C. elegans gene abc-1 is an uncloned locus defined by a variation that results in defective chromosome segregation - Currently annotations are made to the WB gene ID
  • Protein complexes

Knowledge Representation

  • What kind of biological statements do we want to make?
  • Given these statements, what is the appropriate resource for the entity IDs?
  • How will this be represented in the GAFs/GPADs?

Practical Considerations

  • How many of each type?
  • ID stability - if there is churn, can IDs be mapped forward, not go stale?

Overview of representation of complexes in ontology

Minutes

In attendance: Chris, Fiona, Harold, Judy, Kimberly, Paul S., Petra, Rama, Sandra

Unfortunately, participants in the UK were unable to dial in to the conference line, so Rachael, Tony, Claire, and Susan, amongst others, were absent.

  • Summary of issue: some entities that curators would like to use for GO annotation cannot currently be used with Protein2GO
    • Examples:
      • Proteins that don't have UniProtKB IDs
      • Gene IDs
      • ncRNA IDs
      • Protein Complex IDs

Response from UniProt: We propose that a common identifier is used for each entity type as follows:

1. Proteins: UniProt accessions

2. Functional RNAs: RNAcentral identifiers

3. Protein Complexes: IntAct complex identifiers

This would greatly simplify the maintenance of Protein2GO with regard to sanity-checking individual identifiers. It would also make it clearer to users which entity type an annotation refers to. We are willing to assist groups in mapping their identifiers to any of the proposed common identifiers and in adding missing protein accessions to UniProt.

The orphan/unlocalised genes seem to vary widely in their state of characterisation and sequencing, so we are not considering this as a high priority. We would prefer to get the above three categories in place first.
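As a rough illustration of the sanity checking mentioned in the response above, the following minimal sketch assigns an identifier to one of the three proposed entity types by its format. The function name `entity_type_for`, the regular expressions, and the sample identifiers are all assumptions made for illustration; they are not taken from Protein2GO, and the patterns only approximate the published identifier formats of each resource.

```python
import re

# Illustrative patterns only. Each is an assumption about the identifier
# format of the resource, not something specified in these minutes.
ID_PATTERNS = {
    # UniProt accession (6 or 10 characters)
    "protein": re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
    ),
    # RNAcentral identifier: URS + 10 hex characters, optional _taxon suffix
    "functional RNA": re.compile(r"^URS[0-9A-F]{10}(?:_\d+)?$"),
    # IntAct complex identifier, assumed here to look like EBI-<digits>
    "protein complex": re.compile(r"^EBI-\d+$"),
}


def entity_type_for(identifier):
    """Return the entity type whose pattern matches, or None if none does."""
    for entity_type, pattern in ID_PATTERNS.items():
        if pattern.match(identifier):
            return entity_type
    return None


# Hypothetical sample identifiers, chosen only to exercise each pattern.
for sample in ["P12345", "URS0000000001", "EBI-1234567", "WBGene00000001"]:
    print(sample, "->", entity_type_for(sample))
```

Under this scheme, an identifier that matches none of the three patterns (such as a MOD gene ID) is flagged as needing a mapping to one of the common identifiers.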

  • There are two parallel issues here wrt curation:
    • What entities are needed and how to use them in Protein2GO
    • How to consistently represent annotations to other entities (e.g. protein complexes) across the consortium
  • Issue #1
    • It seems that there is agreement on which database IDs would be helpful to start:
      • MOD gene IDs

Response from UniProt: If we can map these to a common identifier for a particular entity type (as listed above), is there any need to use these?

      • NCBI RefSeq IDs
        • Note that UniProt is willing to add protein IDs for proteins not currently represented

Response from UniProt: We would like to reiterate that UniProt will, wherever possible, add missing accessions in a timely manner. Groups need to communicate the missing proteins directly to us, together with evidence.

      • Protein complex IDs - IntAct, RACE-PRO
        • IntAct is in discussion/development with UniProt about using web services to annotate to protein complex IDs
      • MOD protein IDs

Response from UniProt: If we can add missing proteins to UniProt, is there any additional need to use these?

      • RNAcentral

Action Item: Kimberly will follow up with an email to ask Tony what would be most helpful for groups to provide wrt external IDs.

  • Issue #2
    • Representing annotations to entities like protein complexes in the GAFs/GPADs

Response from UniProt: We are not sure why this is an issue. The protein complex identifier would be in columns 1 and 2 of the GPAD file, and the GPI file would give its DB_Object_Type as 'protein complex'.

    • This issue will also be discussed on the Tuesday GO annotation calls
    • We want to make sure we're consistent across GO about propagating BP, MF, and CC annotations to complex members
    • Annotations to protein complex identifiers - should these be in a separate GAF?

Response from UniProt: The GAF has column 12, DB_Object_Type, so it should be able to handle different entity types.
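To illustrate the point about column 12, here is a minimal sketch of how a consumer of a single GAF could separate annotations by entity type rather than relying on a separate file. The helper names, the sample rows, and the use of column 2 as the object identifier are assumptions made for illustration; the rows are synthetic and leave every other column blank.

```python
from collections import defaultdict

DB_OBJECT_TYPE_COLUMN = 12  # 1-based, as cited in the response above
DB_OBJECT_ID_COLUMN = 2     # assumed to hold the object identifier


def group_by_object_type(gaf_lines):
    """Group object identifiers by the value found in GAF column 12."""
    groups = defaultdict(list)
    for line in gaf_lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < DB_OBJECT_TYPE_COLUMN:
            continue  # row too short to carry a DB_Object_Type
        object_type = fields[DB_OBJECT_TYPE_COLUMN - 1]
        groups[object_type].append(fields[DB_OBJECT_ID_COLUMN - 1])
    return dict(groups)


def make_row(db, object_id, object_type, n_columns=17):
    """Build a synthetic tab-separated row with only three columns filled."""
    fields = [""] * n_columns
    fields[0], fields[1], fields[11] = db, object_id, object_type
    return "\t".join(fields)


# Hypothetical rows: one protein and one protein complex in the same file.
sample_lines = [
    "!gaf-version: 2.0",
    make_row("UniProtKB", "P12345", "protein"),
    make_row("IntAct", "EBI-1234567", "protein complex"),
]
print(group_by_object_type(sample_lines))
```

In this sketch the entity type travels with each row, so annotations to complexes and to proteins can share one file and still be filtered apart downstream.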