Data Capture Call 2016-05-10

Just as a reminder from the previous notes, these are the tasks and scenarios we wish to tackle:

Generating documentation to lead GO curators towards the individual annotations they’d like to see. (?)
Assisting “one time” Groups who wish to submit a GAF file with annotations.
Assisting non-GOC Groups who wish to regularly contribute annotations using one of the available tools.
Assisting groups who wish to make contributions to the ontology: capturing ontology edits.
Offering Training: Training non-GOC groups and checking their annotations takes time. We need to improve the available documentation and create the training materials needed to make this process easier for other communities. We will also need programmatic QC methods to support it.
Encourage metadata collection at time of publication: e.g. authors could suggest GO terms that would be appropriate for their data. This probably needs to be done in coordination with other consortia involved in standards.

Agenda

1. Taking into account the draft Seth proposed on GitHub (https://github.com/geneontology/contributor-data-pool):

Discuss how to keep track of the quality of the data we consume:
- Syntactic check ?
- Semantic check ?
- Check the biology manually ?
Discuss maintenance of datasets in the long term: what happens when annotations become stale and submitting groups have disappeared?

2. Flowchart. To improve on the "decision tree" currently available on the GO Website at "Contributing to GO", we should add:

Information about which identifiers can be used
Information to assist contributors in deciding what to do depending on the nature of their data. For example, whether they are contributing annotations for "normal biological processes" or "disease pathology".

3. Report from Seth + Planteome data export experiment: postponed to the next call.

Proposed next steps:

1. Update the 'Ad hoc' contributions page: On this page (http://geneontology.org/contribute-annotations-adhoc), add the form Ruth & Seth designed, with the following functionality: submissions will be exported directly into a GitHub issue via (e.g.) a cron job.
- The form can be seen here: http://bit.ly/go-ann-request-demo
- The table that results from submitting the form in its current style is here: https://docs.google.com/spreadsheets/d/1QhS3lhrkfJjHZ4CF5EIRwcOPc1KmyzX4ZViGXLZnzuY/edit#gid=750930053
- The fields “Assigned to curator” and “Assigned to group” should change. Instead, capture info from GitHub for the user submitting the data. We’ll assign (GOC) curators once it is on GH.
- ToDo: Seth will take a look at how to plug the form into GitHub.
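
As a rough illustration of the export step, the sketch below turns one submitted form row into the JSON body accepted by GitHub's "create an issue" REST endpoint (`POST /repos/{owner}/{repo}/issues`, which takes `title`, `body` and `labels`). The form field names (`gene_product`, `go_term`, `reference`, `comment`), the label and the function name are assumptions made for this example; they are not taken from the actual form. No curator is assigned, matching the plan to assign curators once the request is on GH.

```python
import json


def form_row_to_issue(row, submitter_login):
    """Build a GitHub issue payload from one form submission.

    `row` is a dict with hypothetical field names; the real form's
    columns may differ.
    """
    title = f"Annotation request: {row['gene_product']} ({row['reference']})"
    body_lines = [
        f"Submitted by: @{submitter_login}",
        f"Gene product: {row['gene_product']}",
        f"Suggested GO term: {row['go_term']}",
        f"Reference: {row['reference']}",
        "",
        row.get("comment", ""),
    ]
    # No "assignees" key: curators are assigned later, on GitHub.
    return {
        "title": title,
        "body": "\n".join(body_lines).rstrip(),
        "labels": ["annotation request"],  # hypothetical label
    }


example = form_row_to_issue(
    {
        "gene_product": "UniProtKB:P12345",
        "go_term": "GO:0005515",
        "reference": "PMID:1234567",
        "comment": "Placeholder values for illustration only.",
    },
    submitter_login="example-user",
)
print(json.dumps(example, indent=2))
```

A cron job could read new rows from the form's spreadsheet and send each payload to the issue tracker; authentication and duplicate detection are left out of this sketch.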

Questions:

What is the scale we should consider for the submissions?
What are some examples of one-timer groups?
Currently, people can submit via SVN; syntactic checks are added. Is anyone checking whether the annotations conform to our data requirements? Whether they conform to the biology? Will we be doing all the syntactic clean-up?

Answers, Proposals:

2. Scale:
- one paper with one or two annotations: no GAF; Just GH/form.
- large scale: GAF/GPAD
- ToDo: At the beginning of the flowchart, add a question about whether groups plan to submit a “one-time” dataset or recurring submissions.

3. Examples of groups wanting to submit a “one time” file. Many scientists may want to push their data to GO, but can’t necessarily commit to becoming GO curators.
- For example:
  - newly sequenced genomes
  - groups that made some annotations and don’t want them to get lost so want to push them out
  - groups whose funding has disappeared
  - single paper annotations.
  - From Ruth: e.g. Manuel Mayr wants to avoid losing his data

4. Submitting GAF files: People submitting GAF files will encounter automated checks (syntactic and semantic) that will inspect the data to give a “clean bill of health” for structure. The following rules will apply:
- We would be running these checks for the groups.
- The main GO pipeline loading and AmiGO loading will require human involvement.
- If someone submits something on GH, we can set it up so we can run it on Jenkins as part of that pipeline.
- On the annotation tracker we will add a tag to indicate the group in charge of taking each annotation to completion, for example Noctua, Protein2GO, MGI, etc.
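
To make the "syntactic check" concrete, here is a minimal sketch of what such a check on a single GAF line might look like. It assumes the GAF 2.x layout (17 tab-separated columns, `!` for header lines, a GO ID in column 5, an aspect of P, F or C in column 9). It is not the GO pipeline's actual checker, and the function name and messages are made up for illustration.

```python
import re

GAF_COLUMNS = 17  # GAF 2.x uses 17 tab-separated columns
GO_ID = re.compile(r"^GO:\d{7}$")
ASPECTS = {"P", "F", "C"}


def check_gaf_line(line):
    """Return a list of syntactic problems found in one GAF line."""
    if line.startswith("!"):  # header/comment lines carry no annotation
        return []
    # Strip only the newline so trailing empty columns are preserved.
    cols = line.rstrip("\n").split("\t")
    if len(cols) != GAF_COLUMNS:
        return [f"expected {GAF_COLUMNS} columns, found {len(cols)}"]
    problems = []
    if not cols[0]:
        problems.append("column 1 (DB) is empty")
    if not cols[1]:
        problems.append("column 2 (DB Object ID) is empty")
    if not GO_ID.match(cols[4]):
        problems.append(f"column 5 (GO ID) is malformed: {cols[4]!r}")
    if not cols[5]:
        problems.append("column 6 (DB:Reference) is empty")
    if not cols[6]:
        problems.append("column 7 (Evidence Code) is empty")
    if cols[8] not in ASPECTS:
        problems.append(f"column 9 (Aspect) must be P, F or C: {cols[8]!r}")
    return problems


# Placeholder annotation line, for illustration only.
good = "\t".join([
    "UniProtKB", "P12345", "GENE1", "", "GO:0005515", "PMID:1234567",
    "IPI", "UniProtKB:Q00000", "F", "", "", "protein", "taxon:9606",
    "20160510", "ExampleDB", "", "",
])
print(check_gaf_line(good))
print(check_gaf_line(good.replace("GO:0005515", "GO:515")))
```

Semantic checks (for example, whether the GO ID exists and is not obsolete) would need the ontology itself and are outside this sketch.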

Considerations:

We will have to assign mentors to new groups.
Training new groups is time consuming. It is a long-term commitment.
We must also decide how much time and effort to dedicate to training GO curators and how much to training curation teams outside the GOC.
Guidance & training must take place early: we need to reach new groups at the early stages of their annotation effort and guide them to prevent mistakes (this has happened before!).
If there are just a few annotations, it is likely that we’ll have to do it ourselves to be more effective.
The grant renewal proposal includes an objective to interact and assist other communities.

Proposal:

5. Improving documentation: Improve the training materials by bringing documentation to a centralized repository, collecting documents from around the GOC.
- ToDo: To start, let’s gather materials currently available from Claire, Ruth, Melanie, Moni.

Example decision tree from Ruth:

This is an example of a more detailed decision tree, focused only on getting input from scientists about the accuracy of existing GO annotations and on ensuring that GO describes known biology. The figure was submitted to the GO book (Ruth Lovering), but people are welcome to edit it (Example Decision Tree - Lovering). Because there isn't really room for all the other decision aspects, we might need a general tree that branches out to separate decision trees rather than trying to put everything on one page. This current tree may also have too much information for what we want.
ToDo: Use this tree as a starting point to draw our own.

Seth will take a look at how to plug the form into GitHub.
At the beginning of the flowchart, add a question about whether groups plan to submit a “one-time” dataset or recurring submissions.
Gather training materials currently available from Claire, Ruth, Melanie, Moni.
Use Ruth's tree as a starting point to draw our own.

We will get together again on 2016-05-25 to draw a decision tree together in a shared document. We will create a GOC Data Capture Decision Tree here.