Non-GOC Contributions Call 2016-04-27
Data Capture Call 2016-04-27
- Discussion of goals and use cases
- Individuals and individual annotations
- GAF-formatted data, ontologies, and other bulk data
- Others? MC: we should discuss our criteria for taking on datasets: is a syntactic check enough? Do we want to run semantic checks, or check the biology manually? We also need to discuss long-term maintenance of datasets: what happens when annotations become stale and submitting groups have disappeared?
- Training - depending on the level of participation that groups/individuals want, how should we handle training in GO best practices? Who will be responsible for training? Should we develop a standard set of training materials, including well annotated papers for new curators to initially practice/train on?
- It might be helpful to develop a flowchart that guides potential contributors through the process and includes questions like, What stable identifiers will/can you use? Are the processes you want to annotate normal biological processes or related to disease pathology? These are all issues that have come up wrt helping potential contributors in the past.
- A proposal for a repository/data format for bulk data.  (Seth)
- "Atomic" resources or "grouping" resources
- Required and optional fields
- Discussion of internal GO mechanisms to handle user data
- Using drupal internally to manage external repository stages
- "Spreadsheet" type collection
- Discussion of timeline to accomplish the above
Present: Judy, Melanie, Ruth, Seth, Kimberly, Pankaj, Moni. Regrets: Claire, Chris, Valerie.
Notes by Moni.
In order to define the users and contributors we are planning to serve, let’s clarify the goals and possible scenarios. We can receive annotations from groups that conduct their own curation, and/or help groups to produce their annotations, filling in gaps where curation hasn’t reached the GO database. Also, we would like to assist people who want us to incorporate their paper into our curation pipeline.
- Leading GO curators towards the individual annotations they’d like to see.
- “One time” Groups who wish to submit a GAF file with annotations.
- Non-GOC Groups who wish to regularly contribute annotations using one of the available tools.
- Contributions to the ontology: capturing ontology edits.
- Offering Training: Kimberly offered the example of a group that wants to annotate lncRNAs - Rama started working with them. They have protein-to-GO accounts. Moni: Training and checking their annotations takes a lot of time. We will improve the available documentation and create the training materials needed to make this easier for other communities. We will also need programmatic QC methods to support that process.
- We should encourage metadata collection at time of publication. Authors could suggest GO terms that would be appropriate for their data. We will have to check them; good to keep in mind.
Melanie: We should also consider how annotations are going to be maintained in the future. Is someone at GOC going to take care of the submitted annotations?
Kimberly: certain GOC members will take responsibility for certain contributed data and their annotations.
Judy: Yes, that is true for some organisms. For other groups, e.g. synapse data, the ontology development is being done by us, and not much annotation is going to be done on that. The experts in Amsterdam and Broad (non-GOC) are going to be in charge of the annotations. We play it by ear, depending on who is available when they come.
Kimberly: Who takes care of old data / annotations? Old doesn’t mean wrong, but needs to be maintained.
Pankaj: Planteome stepped in (NSF funded), and they have gone out to collect the various genomes that have been published (~80), collected data from ~60 of them, including transcriptomes (useful for traits and functions and evolutionary analyses). They now have GO annotations for ~60 genomes + transcriptomes. They run a pipeline for all of them, then make functional assignments. They are planning to do this on a yearly basis. About 2M gene products. ~6-10X annotations coming out of that work. These receive annotations from GO as well as phenotype, etc. Offers suggestion to work with GOC as a conduit to collect and grab majority of the plant community data and feed to GO consortium. Data are always open (goes to SVN and version control) and can also be accessed directly by GOC. In the process of cleaning some of those datasets. As they mature, will do their best to get closer and closer to meet GOC requirements. Suggestion to ask analysis platforms to offer data in GOC-compatible formats, e.g. GAF format.
Judy: Quality checks must be put in place before we can capture all these data and incorporate them into the GO database. Perhaps we could offer a link to a repository where people can place the results of all their analyses. But we would check quality before incorporating them into the GO database. Encourages the use of InterPro2GO for annotations.
Seth: there will be various kinds of data absorbed: no reason why we couldn’t expose different levels of data based on evidence codes. Example: “Default: curated data”, then, flip some switches and now expose other data. We could offer an ecosystem that supports all of these aspects. Seth proposes a system for doing this that also keeps track of how far people have gone to make quality annotations. This is more on the technical side, but there has been some thought on what this could look like.
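As a rough illustration of the "flip some switches" idea, the sketch below filters annotations by evidence code. The record layout (a dict with an `evidence` key), the function name `visible`, and the choice to treat everything except IEA as the default view are assumptions made for illustration, not an agreed GOC policy.

```python
# Hypothetical grouping: IEA (electronic annotation) is hidden by default,
# everything else is treated as "curated". This split is an assumption.
ELECTRONIC = {"IEA"}


def visible(annotations, include_electronic=False):
    """Return the annotations to expose.

    By default only non-electronic annotations are returned; setting
    include_electronic=True "flips the switch" and exposes IEA as well.
    """
    return [
        a for a in annotations
        if include_electronic or a["evidence"] not in ELECTRONIC
    ]


if __name__ == "__main__":
    demo = [
        {"go": "GO:0005515", "evidence": "IDA"},
        {"go": "GO:0005515", "evidence": "IEA"},
    ]
    print(visible(demo))
    print(visible(demo, include_electronic=True))
```

More switches (for example one per evidence-code group) could be added the same way; the point is only that the stored data stay the same and the view changes.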
Kimberly: Would be great to have flowcharts on the GO website to assist users on the steps necessary to produce quality annotations that will pass the GOC test. Outlining the aspects we take for granted - e.g. you need identifiers - etc. This as an end-product. Also some people want to do annotations, but they are not necessarily biological processes: need to clarify this up front. “What you cannot do with GO annotations”.
Seth: basic idea, for people trying to drop off data or make modifications: we propose a directory layout hosted on GitHub as a web-accessible way of doing this, with a metadata file and a very minimal directory layout. Anyone who does this will offer us material our tools can read. They fork the repo, drop in a GAF, change metadata fields, and commit to GH - this will allow us to run integration tests and give them feedback about QC and content right away. This would allow us to distribute what Jenkins does to anyone who clones our repo. Seth set up a quick draft on GitHub. The idea would be to create a way for people to get feedback with little or no required interaction from GOC members, while giving us a way of contacting them. This offers initial feedback without much overhead. This also solves the problem of “where do we put these files”? We don’t put them “anywhere”, rather we create a repository. We would be able to tune this for anyone to come in, get basic QC; also, we would learn about who is trying to do this via ‘call home’ features, and we may choose to contact them or not.
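To make "basic QC" concrete, here is a minimal sketch of a purely syntactic check on a dropped-in GAF. It assumes GAF 2.x (17 tab-separated columns, header/comment lines starting with `!`); the function name `check_gaf_lines` and the particular checks are hypothetical and are not the tests Jenkins actually runs.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")
GAF_COLUMNS = 17  # number of tab-separated columns in GAF 2.x


def check_gaf_lines(lines):
    """Return a list of (line_number, message) for syntactic problems."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) != GAF_COLUMNS:
            problems.append(
                (number, f"expected {GAF_COLUMNS} columns, found {len(cols)}")
            )
            continue
        if not GO_ID.fullmatch(cols[4]):  # column 5: GO ID
            problems.append((number, f"malformed GO ID: {cols[4]!r}"))
        if not cols[6]:  # column 7: evidence code
            problems.append((number, "missing evidence code"))
    return problems
```

A check like this could run on every commit to a contributor's fork, so that feedback arrives without a GOC member being involved; semantic and biological checks would still need to sit on top of it.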
Kimberly: is there an option for someone to go on our website and do this using a form?
Seth: this could be done as well, but the concern is that if we spread out across too many methods of gathering data without a larger framework, the effort may become too confusing. We should prepare that workflow for interested users, with a video training at the beginning so everyone knows how to do it.
Ruth: Cytoscape tool goes off to loads of different resources. PSICQUIC (http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml) site selects which set of annotation or interaction data you want to incorporate in your network. What we suggest could encourage the use of tools to let people choose which datasets they want to include in their file.
Seth: Precisely. We will have the ability to offer anything from highly curated material to “everything”. For example: “I’d like to have an AmiGO instance that only looks at this set of the ontology, or only looks at these fields from a GAF.” And then you’d discard it. Please take a look at the draft I proposed on GH (https://github.com/geneontology/contributor-data-pool). We think this is doable and easy. What needs to be talked about is how to keep track of how good the incoming data are.
And how do we do this?
- Who they are gets them in the door?
- Can they move it from one stage to another?
- Do people advance in their level of curation?
Defining what state data have - we can use a GH repo. If people are interested in exploring standalone data repos, we can begin doing this now.
Judy: Pankaj’s Planteome datasets are a great starting point for testing.
Seth: as part of the larger spec, I’ll work with Justin on the data. Will need input from the biologists about the types of data we will capture. What is the minimum amount of data someone can ‘get away with’?
Suggestion from the room:
- GO Term
- Evidence Code
- Authority (e.g. PubMed ID)
The rest is metadata around the annotation.
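The three-field suggestion above can be sketched as a small record type. The class name `MinimalAnnotation`, the attribute names, and the free-form `metadata` dict are hypothetical; the sketch only mirrors the suggestion from the room and is not a proposed schema.

```python
from dataclasses import dataclass, field


@dataclass
class MinimalAnnotation:
    go_term: str        # e.g. a GO ID
    evidence_code: str  # e.g. "IDA"
    authority: str      # e.g. a PubMed ID
    # Everything else travels as metadata around the annotation.
    metadata: dict = field(default_factory=dict)


def missing_core_fields(annotation):
    """Return the names of core fields that are empty or whitespace-only."""
    core = ("go_term", "evidence_code", "authority")
    return [name for name in core if not getattr(annotation, name).strip()]
```

Keeping the core tiny and pushing the rest into metadata leaves the question of which metadata fields are required open, which matches where the discussion stood.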
Seth: The second order thing that gets picked up with unit tests is also important: OrcidID? email? Stage of annotation? Which of these fields are required? If someone sets up an independent repository of data, what metadata do we need to ensure tractability, licensing, and quality?
Pankaj: we could start with a standard GO annotation workflow, e.g. HMM-based annotation or InterProScan annotation, and get the data output from those tools in GAF format.
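A minimal sketch of what "output in GAF format" could mean for a tool: turning one functional assignment into a 17-column GAF 2.x line. The function name `to_gaf_line`, its parameters, and all values in the usage example are illustrative assumptions; the column order follows the GAF 2.x layout.

```python
def to_gaf_line(db, object_id, symbol, go_id, reference, evidence, aspect,
                object_type, taxon, date, assigned_by, with_from=""):
    """Serialize one annotation as a tab-separated GAF 2.x line (17 columns)."""
    columns = [
        db,           # 1  DB
        object_id,    # 2  DB Object ID
        symbol,       # 3  DB Object Symbol
        "",           # 4  Qualifier (left empty here)
        go_id,        # 5  GO ID
        reference,    # 6  DB:Reference
        evidence,     # 7  Evidence Code
        with_from,    # 8  With (or) From
        aspect,       # 9  Aspect: P, F or C
        "",           # 10 DB Object Name
        "",           # 11 DB Object Synonym
        object_type,  # 12 DB Object Type
        taxon,        # 13 Taxon
        date,         # 14 Date (YYYYMMDD)
        assigned_by,  # 15 Assigned By
        "",           # 16 Annotation Extension
        "",           # 17 Gene Product Form ID
    ]
    return "\t".join(columns)


if __name__ == "__main__":
    # All identifiers below are made-up example values.
    print(to_gaf_line(
        db="ExampleDB", object_id="GENE0001", symbol="gene1",
        go_id="GO:0003677", reference="GO_REF:0000002", evidence="IEA",
        aspect="F", object_type="protein", taxon="taxon:3702",
        date="20160427", assigned_by="ExampleGroup",
        with_from="InterPro:IPR000001",
    ))
```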
- Check data from Planteome with Justin, to test method for consumption.
- Start drafting the flowchart that we will offer users.
- Please take a look at the draft Seth proposed on GitHub (https://github.com/geneontology/contributor-data-pool). We need to discuss how to keep track of the quality of the data we consume.