LEGO July 25, 2016

Bluejeans URL

https://bluejeans.com/969313231

Agenda

UK Training Session

  • Doodle poll for dates from late-August through late-October
  • Claire will be back this week; need her input
  • Looks like best times right now are:
    • Tuesday, August 30th - Thursday, September 1st (seems to be the best time overall for poll respondents)
    • Monday, September 5th - Thursday, September 8th
    • Tuesday, September 13th - Wednesday, September 14th **I can't make this -- Chris**
    • Wednesday, September 28th - Thursday, September 29th (the 26th - 28th was one of Claire's initial suggestions, but David H. cannot attend then)

Training Documentation

  • Kimberly drafted the beginnings of a Quick Start guide
  • If the right idea, can flesh out a bit more, finish up, and create a page to link from Noctua homepage?
  • Also use as guide for what videos to make?

Software Updates

NEO Overview and GPI Files

  • Chris to provide an overview of what NEO is and how it's constructed - try again this week?
  • GPI files - examples on Google spreadsheet
  • All entries in the spreadsheet now follow gpi file format 1.2
  • MGI submitted their new gpi file last week
  • Questions, issues still to be sorted out?
    • We have entries for:
      • Genes
      • Proteins
      • Transcripts
      • ncRNAs
      • Protein Complexes
    • Need clarification on this: If groups (MODs, AGR members) have internal IDs for proteins or ncRNAs, should they be including UniProtKB and RNAcentral accessions as well? What are the implications, then, for what entities are available for curators to use in Noctua?
    • What is the purpose of the db_xref column and how will it be used wrt NEO and Noctua?
    • Mapping all IDs in the gpi file back to a GCRP accession? Can this be done, and if so, how? Should this be the default db_xref in each group's gene entry?
    • If groups don't have parent transcript or protein IDs, what ID should be used in Noctua and with what relation?
      • For example, if a curator needs to specify any mRNA transcript of a gene to add context to an MF annotation, should they use:
        • has_input(WB:WBGene00004804) OR has_input_some_product_of (WB:WBGene00004804) OR has_input_some_mRNA_transcript_of (WB:WBGene00004804)
      • Use case for this: WormBase skn-1 gene and protein identifiers in Google spreadsheet; the GCRP accession for SKN-1 is UniProtKB:P34707
 WormBase Proposed gpi:
 DB    DBID            Symbol     Name              Syn.  Type        Taxon       WB Parent ID       db_xref
 WB    WBGene00004804  skn-1      skn-1                   gene        taxon:6239                     UniProtKB:P34707
 WB    T19E7.2a        skn-1      skn-1, isoform a        transcript  taxon:6239  WB:WBGene00004804  ????
 WB    WP:CE27591      SKN-1 (?)  SKN-1, isoform a        protein     taxon:6239  WB:WBGene00004804  UniProtKB:P34707-1
 WB    WP:CE49174      SKN-1 (?)  SKN-1, isoform d        protein     taxon:6239  WB:WBGene00004804  UniProtKB:V6CLA3
 UniProt GCRP gpi (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/WORM/)
 UniProtKB	P34707	skn-1	Protein skinhead-1	SKN1_CAEEL|skn-1|T19E7.2	protein	taxon:6239		 db_subset=Swiss-Prot
  • Should UniProt add MOD gene IDs as db_xrefs for the GCRP gpi file (and also the isoform gpi file)?
  • Next steps - documentation of contents, communication of pipeline to other groups
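
To make the gpi discussion above concrete, here is a minimal Python sketch of reading one tab-separated gpi line into named fields. It is only an illustration: the field names are labels chosen for this sketch, the column order is taken from the header of the proposed WormBase table, and treating db_xref as pipe-separated (as the synonym column is in the UniProt example) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpiEntry:
    # Field names are hypothetical labels for the nine columns in the table header.
    db: str
    object_id: str
    symbol: str
    name: str
    synonyms: List[str]
    object_type: str
    taxon: str
    parent_id: str
    db_xrefs: List[str]


def split_multi(value: str) -> List[str]:
    """Split a pipe-separated field, dropping empty pieces."""
    return [part.strip() for part in value.split("|") if part.strip()]


def parse_gpi_line(line: str) -> GpiEntry:
    """Parse one tab-separated gpi line (comment and header lines not handled)."""
    cols = line.rstrip("\n").split("\t")
    # Pad short lines so missing trailing columns become empty strings.
    cols += [""] * (9 - len(cols))
    return GpiEntry(
        db=cols[0].strip(),
        object_id=cols[1].strip(),
        symbol=cols[2].strip(),
        name=cols[3].strip(),
        synonyms=split_multi(cols[4]),
        object_type=cols[5].strip(),
        taxon=cols[6].strip(),
        parent_id=cols[7].strip(),
        db_xrefs=split_multi(cols[8]),
    )


# The WormBase skn-1 gene line from the proposed gpi table.
example = "\t".join(
    ["WB", "WBGene00004804", "skn-1", "skn-1", "", "gene",
     "taxon:6239", "", "UniProtKB:P34707"]
)
entry = parse_gpi_line(example)
print(entry.object_type, entry.db_xrefs)
```

With entries in this shape, the open questions above (which db_xref to use by default, what to do when a parent ID is missing) become questions about which fields may be empty for each entity type.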

MGI Meeting Follow Up

  • Review the list of software and annotation issues that were discussed at the MGI training session, June 15th-16th.
  • See the Google doc
  • Some specific follow-up:
    • GAF/GPAD output is probably highest priority
      • Remaining issues:
        • How to handle causal chains
        • Multiple evidence = multiple lines in the GAF
    • Using a limited set of relations in Noctua to make it easier for curators to find what they need (github ticket 165)
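
One possible reading of "multiple evidence = multiple lines in the GAF" is sketched below in Python: an annotation carrying several pieces of evidence is flattened into one row per piece of evidence. This is not the Noctua export code. The dictionary keys, the GO ID placeholder, and the reference values are all made up for illustration, and a real GAF line has more columns than are printed here.

```python
def expand_evidence(annotation):
    """Return one flat row per piece of evidence attached to the annotation."""
    # Everything except the evidence list is shared by all output rows.
    shared = {key: value for key, value in annotation.items() if key != "evidence"}
    return [{**shared, **evidence} for evidence in annotation["evidence"]]


# Hypothetical annotation: the GO ID and references are placeholders, not real data.
annotation = {
    "object_id": "WB:WBGene00004804",
    "go_id": "GO:XXXXXXX",
    "evidence": [
        {"evidence_code": "IDA", "reference": "REF:1"},
        {"evidence_code": "IMP", "reference": "REF:2"},
    ],
}

for row in expand_evidence(annotation):
    # Print a tab-separated subset of fields, one line per evidence item.
    print("\t".join([row["object_id"], row["go_id"],
                     row["reference"], row["evidence_code"]]))
```

How causal chains should be handled is a separate open issue and is not addressed by this sketch.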

Minutes

  • On call: Chris, Dan, David H., David O-S., Giulia, Helen, Kimberly, Melanie, Paul T., Ruth, Sabrina, Seth

UK Training Session

  • Looking like last week of August or first week of September will be best
  • Need to get input from Claire and others at UniProt and then we'll make a decision
  • Documentation - ever growing; the more we have, the harder it is to keep track of everything and make sure it's all up-to-date
  • Seth proposed some built-in documentation approaches; he will investigate

Software Updates

  • Seth - small update this week; provides a toe-hold for TextpressoCentral roundtrip

Noctua Entity Ontology (NEO)

  • Chris described what NEO is and how it's currently generated
  • NEO essentially makes every entity a class in a big, flat OWL ontology; Noctua then creates instances of those classes
  • github repository for NEO is here: https://github.com/geneontology/neo
  • Annotation entities right now are derived from existing entries in Columns 2 and 17 of each group's GAF
  • Entity IDs for annotation are still generally gene or protein IDs which are neutral with respect to the exact entity they represent; the specificity comes from the context of the annotation
  • Going forward, however, NEO will be created from the gpi file that each curation group submits
    • The implication is that the entities to which annotations are made will be those that the species groups (e.g., MGI) submit; however, the UI could enable searches on other IDs, as long as the gpi file contains a mapping to the primary annotation ID
  • The gpi file will contain genes, transcripts, ncRNAs, proteins, macromolecular complexes
  • The gpi file name will need to be standardized; we should use the file naming system that UniProt is using right now
  • We examined entries in the mouse gpi file
    • Mouse gene IDs in Column 2 have db_xrefs to the UniProtKB GCRP accession - this provides the needed mapping for Panther and PAINT pipelines, but note that these are not equivalent types of entities - Column 2 is a gene whereas the db_xref is a protein
    • We also looked at lines representing generic protein products of a gene versus lines representing specific protein isoforms - see the examples for mouse GNAS in MGI, PRO, and UniProt
    • We need to agree on how these will be represented - PRO and UniProt have different solutions
    • Does every UniProt entry have at least a -1 isoform accession, even if there is only one isoform?
  • We also need to agree on how curation will work in Noctua so that we keep the semantics very clear
    • Use appropriate relations, e.g. 'has input some product of' while curating?
    • Have Noctua handle populating the appropriate relations behind the scenes based on the type of gene, e.g. protein-coding, and ontology term?
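
The idea discussed above, that a UI could accept searches on other IDs as long as the gpi file maps them to the primary annotation ID, can be sketched in Python as a simple lookup index. This is a hedged illustration, not how Noctua or NEO is implemented: the function names are hypothetical, and the input is assumed to be (primary ID, db_xrefs) pairs already extracted from a gpi file. The example IDs are the WormBase skn-1 entries from the agenda.

```python
from collections import defaultdict


def build_xref_index(entries):
    """Map each db_xref to the primary annotation IDs that list it."""
    index = defaultdict(list)
    for primary_id, xrefs in entries:
        for xref in xrefs:
            index[xref].append(primary_id)
    return dict(index)


def resolve(query_id, index, primary_ids):
    """Return the primary IDs a searched ID should lead to (empty if unknown)."""
    if query_id in primary_ids:
        return [query_id]
    return index.get(query_id, [])


# (primary ID, db_xrefs) pairs taken from the proposed WormBase gpi table.
entries = [
    ("WB:WBGene00004804", ["UniProtKB:P34707"]),
    ("WP:CE49174", ["UniProtKB:V6CLA3"]),
]
index = build_xref_index(entries)
primary_ids = {primary_id for primary_id, _ in entries}

# Searching on the GCRP accession leads back to the gene ID used for annotation.
print(resolve("UniProtKB:P34707", index, primary_ids))
```

Note that such a lookup says nothing about entity type: as pointed out above, the gene in Column 2 and the protein in its db_xref are not equivalent, so the mapping supports search and pipelines rather than asserting identity.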