Reference Genome Database Requirements Discussion

From GO Wiki
This is the place to discuss features and requirements for the Reference Genome Database being designed to replace the Google Spreadsheet system currently in use.

(here's one to get us started --chris):

Ensures consistent use of identifiers

Identifiers must unambiguously identify a single entry in a database.

Identifiers should conform to the following syntax:

 DBAuthority : LocalID

DBAuthority should be in the GO xrefs metadata list:



Curators should not be expected to memorise identifiers, so a data entry system should allow them to enter symbols etc. and have these resolved to an ID, e.g. using some automatic lookup mechanism.
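A minimal sketch of how the identifier syntax and the symbol lookup could work together. The authority set, the lookup table and the function names are all assumptions for illustration; the real authority list would come from the GO xrefs metadata, and the IDs shown are made up.

```python
import re

# Illustrative subset only; the real list would be read from the GO xrefs metadata.
KNOWN_AUTHORITIES = {"SGD", "ZFIN", "FB", "MGI"}

# Hypothetical symbol table with made-up IDs; in practice this would be a database query.
SYMBOL_TO_ID = {("ZFIN", "examplegene"): "ZFIN:EXAMPLE-0001"}

# DBAuthority, a colon, then LocalID; whitespace around the colon is tolerated.
ID_PATTERN = re.compile(r"^\s*([^:\s]+)\s*:\s*(\S+)\s*$")


def parse_identifier(text):
    """Split 'DBAuthority:LocalID' and check that the authority is known."""
    match = ID_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not in DBAuthority:LocalID form: {text!r}")
    authority, local_id = match.groups()
    if authority not in KNOWN_AUTHORITIES:
        raise ValueError(f"unknown DBAuthority: {authority!r}")
    return authority, local_id


def resolve(authority, text):
    """Return a full identifier for whatever the curator typed (symbol or ID)."""
    if ":" in text:
        auth, local_id = parse_identifier(text)
        return f"{auth}:{local_id}"
    try:
        return SYMBOL_TO_ID[(authority, text)]
    except KeyError:
        raise ValueError(f"no {authority} entry for symbol {text!r}") from None
```

Because only the first colon separates the authority from the local ID, local IDs that themselves contain a colon would still parse under this sketch.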

Should allow loading of MOD reports

The database should allow MODs to submit their metrics via a tab-delimited file that can be automatically downloaded from their ftp site. The file should contain columns for the Reference gene, the organism's ortholog/orthologs, date genes have been completed and reference counts for total number of papers associated with a gene etc.
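As a rough illustration, a loader for such a file might look like the sketch below. The column names, their order, the "|" separator for multiple orthologs and the ISO date format are all assumptions, since the file layout has not been agreed; the sample rows are made up.

```python
import csv
import io
from datetime import date

# Hypothetical column layout; the actual file format is still to be decided.
COLUMNS = ["reference_gene", "ortholog_ids", "date_completed", "paper_count"]


def load_mod_report(handle):
    """Read a headerless tab-delimited MOD report into a list of dicts."""
    rows = []
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for record in reader:
        completed = record["date_completed"]
        rows.append({
            "reference_gene": record["reference_gene"],
            # Several orthologs are assumed to be separated by "|".
            "ortholog_ids": [o for o in record["ortholog_ids"].split("|") if o],
            # An empty date means curation is not yet complete.
            "date_completed": date.fromisoformat(completed) if completed else None,
            "paper_count": int(record["paper_count"]),
        })
    return rows


# Made-up sample standing in for a file fetched from a MOD ftp site.
sample = "HUMAN:EX-1\tMOD:EX-1|MOD:EX-2\t2006-08-15\t42\nHUMAN:EX-2\t\t\t0\n"
report = load_mod_report(io.StringIO(sample))
```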

In the future we may want to add the capability to detect genes that have been completed but have since had new references associated with them.

References should be compiled in a central location, so once a paper is curated, it is somehow flagged that it has been done.

ZFIN does this for our own local publication database. It is very handy since one paper typically touches on several genes. Completing GO for a paper means it doesn't need to be looked at again if it is associated with another gene on the ref genome list. This would be useful for groups that don't have such a mechanism in-house. MODs could provide a tab-delimited file like geneID | PUB ID | OMIM ID | pub status [curated/not curated]. If the Ref. Genome interface had this data, it would also be possible to show a report of which pubs still need to be examined for a specific ortholog. One possible complication is that a single pub may deal with genes from multiple species. We would need to track curation of this pub separately for each species, I think. -Doug
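One way to picture the per-species tracking is sketched below. It assumes curation status is stored per (species, publication) pair rather than per gene, so that a paper curated once for a species drops off the list for every gene of that species. All names and IDs are made up.

```python
# (species, gene ID, publication ID) associations; values are illustrative only.
ASSOCIATIONS = [
    ("zebrafish", "GENE:A", "PUB:1"),
    ("zebrafish", "GENE:A", "PUB:2"),
    ("zebrafish", "GENE:B", "PUB:1"),
    ("mouse", "GENE:C", "PUB:2"),
]

# Papers already curated, tracked separately for each species.
CURATED = {("zebrafish", "PUB:1")}


def pubs_to_examine(species, gene_id):
    """List publications for this gene that are not yet curated for this species."""
    return sorted(
        pub
        for sp, gene, pub in ASSOCIATIONS
        if sp == species and gene == gene_id and (sp, pub) not in CURATED
    )
```

Here PUB:2 still needs examining for both zebrafish GENE:A and mouse GENE:C, and marking it curated for one species would not affect the other.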

We need to decide whether we will allow individual users to modify the database a record at a time or whether the database should only be populated with files from each MOD.

I think both methods would be useful (Doug)

Should track that no ortholog was found

There should be a mechanism for indicating that a curator has looked and no ortholog could be located as of a certain date. For genomes that are not yet completely sequenced, we will want to revisit these when a new genome build is released. It would be nice to have a free text note field associated as well so we can leave notes regarding the analysis that was performed. -Doug

Should provide reports to focus curation effort

The interface should provide reports that will help focus curator effort. One example might be to provide a facility to search the data for species-specific orthologs where curation is not 'comprehensive'...these are the genes we should be working on. Another example might be to provide a report of genes where no ortholog was determined yet. It would be nice to be able to alter the sort order of the results in such reports by the following parameters: date Human Gene was added to the Ref. Genome set, ortholog ID, OMIM ID, most papers associated, difference between papers associated with a gene and papers read for GO for that gene (biggest difference to smallest)...maybe others... It would also be good to have a report generated by a query on OMIM ID that shows which species have 'comprehensive' annotation done for their ortholog(s)-Doug
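A small sketch of one such report: genes whose curation is not yet 'comprehensive', sorted by the gap between papers associated and papers read for GO, biggest first. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneMetrics:
    ortholog_id: str
    papers_associated: int
    papers_read_for_go: int
    comprehensive: bool

    @property
    def backlog(self):
        """Papers associated with the gene that have not yet been read for GO."""
        return self.papers_associated - self.papers_read_for_go


def curation_worklist(genes):
    """Non-comprehensive genes, largest paper backlog first."""
    pending = [g for g in genes if not g.comprehensive]
    return sorted(pending, key=lambda g: g.backlog, reverse=True)
```

The other sort orders mentioned above (date added, ortholog ID, OMIM ID) would only need a different `key` function.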

Should record that curation is 'comprehensive' as of a certain date

Curators should be able to mark that curation of an ortholog is 'comprehensive' as of a certain date. It would be good to be able to generate a report to look for cases where the 'comprehensive' curation date is getting old. These may need to be reviewed and updated. -Doug
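The "getting old" report could be as simple as the following sketch. The one-year threshold is an arbitrary assumption, and the function name and input shape are hypothetical.

```python
from datetime import date, timedelta


def stale_orthologs(comprehensive_dates, today, max_age_days=365):
    """Return IDs whose 'comprehensive' date is older than max_age_days.

    comprehensive_dates maps ortholog ID -> date marked comprehensive,
    or None if the ortholog has never been marked.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        ortholog_id
        for ortholog_id, marked in comprehensive_dates.items()
        if marked is not None and marked < cutoff
    )
```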

Should allow a 1:many relation between Human gene and MOD ortholog

Perhaps this is self-evident, but it mucks up reports and things pretty nicely...

For SGD, I've also seen cases where we would need a many:1 relation between the Human genes and the MOD ortholog. -Karen

Should record orthology determination method

If the group decides on a single method we can all adhere to for ortholog determination, then this may not be required. I think this is unlikely however, in which case it may be useful to record HOW orthology was determined. Perhaps a set of checkboxes beside the various methods can be used to record what tools/methods supported the orthology call. These would include YOGY, TreeFam, InParanoid, etc. as well as manual BLAST and synteny analysis, and perhaps 'established from literature' for cases where orthology was already recorded by a MOD from a published paper. If we really want to be careful, it might be good to record which build of the databases behind the tools was being used, e.g. which Sanger build, which InParanoid version, etc.? -Doug

Rather than having a tickbox for YOGY, which is a compilation of 4 tools and Val's curated orthology between S. pombe and S. cerevisiae, I'd like to see individual tickboxes for the 4 tools: 1) KOGs, 2) InParanoid, 3) HomoloGene, and 4) OrthoMCL. For cerevisiae, I see different results between the 4 tools. In addition to tickboxes, it might also be nice to have a free text field for notes relating to the orthology call. -Karen

I agree with Karen that we should have tickboxes for each of the methods included in YOGY as well as for TreeFam. For cases where an ortholog is not present, it would be helpful to be able to record that information as well. I've also been recording the results of reciprocal BLAST searches and would like to continue doing so with this curation tool. --Kimberly
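Pulling the suggestions in this section together, a record for one orthology call might look something like the sketch below. The class, its field names and the exact method list are assumptions; the method names are taken from the comments above, and a missing ortholog ID is used to record that no ortholog was found.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Methods mentioned in this discussion; the final list is still open.
METHODS = {
    "KOGs", "InParanoid", "HomoloGene", "OrthoMCL", "TreeFam",
    "reciprocal BLAST", "synteny", "literature",
}


@dataclass
class OrthologyCall:
    human_gene: str
    species: str
    ortholog_id: Optional[str]  # None records "no ortholog found"
    checked_on: date
    # Ticked method -> tool version or database build ("" if not recorded).
    methods: dict = field(default_factory=dict)
    note: str = ""  # free text about the analysis performed

    def __post_init__(self):
        unknown = set(self.methods) - METHODS
        if unknown:
            raise ValueError(f"unknown methods: {sorted(unknown)}")
```

Storing the version alongside each ticked method would cover the point about recording which build of each tool was used.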

Only one index page

Ideally there would be one main page with options for the pages to visit or the output required. All pages should also have links back to the index and to any pages they are linked from, i.e. after editing a page you can't use the back button and don't want to have to keep looking for a bookmarked page (Ruth)

Easy data input

The administrators and curators need an easy way to add 20 genes at a time to a table. The curators need an easy way to view the genes done/to do and to edit as they go.

Would it be possible that, when the admin creates a new gene record, 12 pages are automatically created for that gene in all of the other species?

Then either each gene page could have a table listing a synopsis of the annotation so far achieved in all other species; or just the human gene page would have this data: eg 1 row (unless paralogs) per species column headings: for each species: annotator to contact for gene discussions, in progress/date completed, annotations added.

as well as fields with dropdown choices "paralog, ortholog, ancestral gene" fields for metric data, dropdown choice "annotations added: yes/no" curator assigned to gene etc

Obviously also need option of duplicating page in cases where there are paralogs, duplication would ensure link to initial gene page is maintained. (Ruth)

Administrator output for easy data retrieval

Would it be possible to select different output options, ie html or excel?

The advantage of excel is that people can manipulate the data as they wish, unless a variety of outputs, eg graphs, data collation can be included in the outputs of this tool.

I would suggest that the administrators would appreciate an output table similar to the original Google spreadsheet, with each human gene listed in a separate row (in cases where there are paralogs there will obviously be multiple rows per gene), and with the accession number, the metrics data and the date completed for human and all other species listed in columns.

However, I don't think they will want to view the table as a whole every time they look at it. Especially in a couple of years time when there are 500 genes on the list.

Therefore could there be drop down options: eg having selected "metrics table" and then "edit" or "view", then for view have options "excel" or "html" then next options are: "all data", "by date added to table", "only genes comprehensively annotated in all species", "newest genes", "genes not yet annotated" this would I guess lead to the further option of dates, "2006", "Aug06", "Sept06"..., alphabetically. Perhaps the choices should be decided once people work out what data they want. (Ruth)

Curator output for easy view of data

I think curators would appreciate options (similar to above) for viewing a "spreadsheet". I don't think they will want to look at the whole table every time.

Therefore could there be drop down options: eg having selected "curator table" and then "edit" or "view", (for view have options "excel" or "html") then next options: choice of "species", human, mouse etc; then choice of "genes", "all genes", "by date added to table", "only genes comprehensively annotated in all species", "newest genes", "genes not yet annotated" "genes not yet assigned to curator", "genes assigned to curator...Ruth" this would I guess lead to the further option of dates, "2006", "Aug06", "Sept06"..., alphabetically. Perhaps the choices should be decided once people work out what data they want.

Ideally the species specific spreadsheets would contain all the species specific data available in the individual gene records so that people could edit the spreadsheet rather than use the gene records if they wanted to. (Ruth)

Comments on prototypes

Prototypes for the curation tool can be found here:

Please add your comments & suggestions here

1. Curator Central - adding orthologs: To test this, I made (and destroyed) a new Drosophila ortholog (of HK001?) called Stuff - this worked fine. I then looked in Admin Central and was surprised to see that Stuff was present in the 'listing of target genes' - shouldn't this have disappeared from this listing as I had already 'destroyed' the ortholog in the other table? I was also surprised to see my new Dros ortholog in this table of target genes because I had expected 'target genes' to be the original list of human genes. Are we going to distinguish between 'target genes set' (in human) and 'target genes identified' (orthologs in other species)? If not then we need to be able to view by species. - Susan

2. A related issue applies to the table for 'Listing Curation Status'. I assume we will view 'Listing Curation Status' for one species at a time? For those of us not curating human papers, I'd like to see 2 columns on the left - one for the human gene symbol and one with the ortholog symbol. The current spreadsheets include many symbols for the human gene - I assume the human symbol used in these summary sheets will be something standard, such as the valid HGNC symbol? - Susan

3. Should 'Complete curation' be entered as a date rather than true/false? - Susan

1. Will there be a page like this for each ortholog, or for each organism?

  • Sohel's answer

There will be. I'm guessing for each organism, to mirror the current spreadsheet, but it might be better by ortholog so that we could easily look at the situation in other species as we are annotating.

  • There would be a separate view for each model organism. Only the curators of that organism would be able to see/update their respective curation status. But again this is open for discussion. Additionally, we could have the Ortholog view, such that curators can look at the orthologs in other species.

2. There does not appear to be a column for the annotated species ID? (or a species column, if this is by ortholog)

  • Sohel's answer

This can be added.

3. Would it be possible to display a couple of 'real' reference genome entries so we can see how this would look/link together for the different species?

  • Sohel's answer

I think it is a good idea. We will work on getting some real data

4. If this was by ortholog, perhaps we could also include direct x-links to i) Mary's graphs ii) TreeFam iii) YOGY iv) the UniProt and HUGO entry for the human protein at the top or bottom of the page. Would this be useful?

  • Sohel's answer

We are planning to put these links on the "Add Ortholog" page

Ruth: However, before the meeting, please could a few modifications be made to the current pages so that it is easier to move forwards and backwards between pages? Would it be possible for each page to have a back option and an "index" option, or something along those lines?

Some ideas:

1. Admin Central should probably just become "Add Target Gene". This should then go directly to a form where a target gene can be added (see 3. below).

2. At the bottom of the form there should be links to see each Organism's Table View with the new genes automatically added, and ideally only those can be edited where you have privileges.

3. The "Add Target Gene" form should contain distinct fields, not so much free text: Human Gene symbol (HGNC), Disease Name, other IDs in distinct fields (which one do we need?), Date added to list. Target completion date is fine.

4. On the first page, in addition to "Add Target Gene", there could be separate links to each organism's Table View, such as "Curate D. melanogaster", "Curate D. discoideum" etc.

5. On the Curation Table View (now called 'Listing Curation status'), could the link from the human ID go to a pop-up or a new page that has all the human gene and disease info and allows the curation of the species ortholog to be entered? I think only this one page is needed to enter and edit the annotation, and it should link from the human gene.

Also the headings <>Target Gene/Gene product id/symbol and <>Gene/Gene product are confusing. From what I can see, I think it is more helpful to choose an ortholog by looking at the gene name and the protein ID. However, if only one is on view then I guess it should be a unique identifier. Therefore the <>Target Gene/Gene product id/symbol heading should be protein accession/internal ID and the <>Gene/Gene product heading should be gene symbol. But maybe I am doing this wrong and that is why it keeps crashing!

Karen C

  1. I'd really want to have an ortholog view, where I can see the human target gene, and all the orthologs that have been called for it.
  2. On the Listing Target Genes page:
    1. it would be nice if the target gene symbols were linked to a page where you could get more info about that gene, maybe an external resource or maybe the Reference Genomes page for the target gene.
    2. As currently specified, all target genes are human, so the Organism column seems superfluous.
    3. It might also be nice to put the OMIM IDs and gene names here, since we're targeting human genes by disease first.
    4. Maybe we don't need to make Destroy so easy, it could be a click away in the Edit options (which would also save a column).
    5. If we create a page to view all the orthologs of a target gene, it would be nice to have an Orthologs column here to go to that page quickly.
  3. On the Edit Target Gene page:
    1. Same comment as for the Listing Target Genes page, as currently specified, all target genes are human, so the Organism pulldown menu seems superfluous.
    2. I agree with someone else's previous comment that it would be better to put info in more specific fields, and put less stuff into free text. As an SGD curator, our older free text fields are a nightmarish mishmash of bunches of different types of info and different wordings for the same info.
    3. It would be nice if we stored a bunch of different IDs for each target gene. The people compiling the target list have already been recording the info for a bunch of different things: OMIM ID, Chr location, EntrezGene ID, HGNC ID, RefSeq ID, and UniProtKB Ac. It would be nice to record each of these.
    4. It would also be nice to have a way to store additional gene names. Sometimes the first name on the list from the Google spreadsheet is not the current name, and you have to use another name on the list to search. Sometimes none of the names work and you have to use one of the IDs to find a page to identify the gene name. It's also useful to see the additional names to help correlate older records (that may only use one of the older names) with the current gene name.
  4. On the Add Ortholog page:
    1. It will be very cumbersome if we have to select a target gene by ID in order to add an ortholog for it. Curatorially, this will be much easier if you have an Add ortholog button or option on the Edit Target Gene page. This pulldown will get more and more irritating as more target genes are added.
    2. I'd like to see the gene name, as well as the ID, of the Target gene on this page, because curating by IDs is much harder than curating by gene names. It might be useful to include a little bit more of the Target gene info, especially the OMIM disease info, to help curators have a visual sense that they are on the page they think they are on. But my comment above, about having the Add ortholog option on the Edit Target Gene page instead of selecting it from a pulldown, will help with that too.
  5. On the Listing Curation status page:
    1. We need to have the gene names of the orthologs listed, not just the ID. Curating by ID is much harder than curating by gene name.
    2. Maybe we don't need to make Destroy so easy, it could be a click away in the Edit options (which would also save a column).
    3. I'd also like to have a column listing the Target gene that the ortholog correlates to, and probably also an OMIM disease column.
    4. I agree with Susan that the completed curation should contain a date rather than T/F.

David H

    1. I agree that the human gene should be represented in a standardized way, such as using an HGNC-approved gene symbol and ID.
    2. I also agree that the completed field should be a date rather than true/false.
    3. In the 'Listing Curation Status' table, the species should be indicated since many gene symbols will be the same across different species.
    4. The 'Listing Curation Status' table should be sortable and searchable.

Pascale other requirements for the tool

  1. We need to be able to (i) add more than one ortholog and (ii) indicate that there is no ortholog (we've been capturing this as "no ortholog as of [date]")
  2. We need search ability
  3. We need to be able to run reports; either automatically or have a SQL interface through which we can access the database
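To make the SQL option concrete, here is a minimal sketch using an in-memory SQLite database. The table layout, column names and sample rows are assumptions, not a proposed schema. One row per (human gene, ortholog) pair allows more than one ortholog per target gene, and a NULL ortholog ID stands for "no ortholog as of [date]".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ortholog (
        human_gene  TEXT NOT NULL,
        species     TEXT NOT NULL,
        ortholog_id TEXT,           -- NULL means no ortholog as of checked_on
        checked_on  TEXT NOT NULL   -- ISO date, e.g. 2006-09-01
    )
""")

# Made-up rows: one gene with two orthologs, one gene with none found.
conn.executemany(
    "INSERT INTO ortholog VALUES (?, ?, ?, ?)",
    [
        ("HUMAN:EX-1", "D. rerio", "MOD:EX-1a", "2006-08-15"),
        ("HUMAN:EX-1", "D. rerio", "MOD:EX-1b", "2006-08-15"),
        ("HUMAN:EX-2", "S. cerevisiae", None, "2006-09-01"),
    ],
)


def no_ortholog_report(connection, species):
    """Human genes for which no ortholog was found in this species, oldest first."""
    cursor = connection.execute(
        "SELECT human_gene, checked_on FROM ortholog "
        "WHERE species = ? AND ortholog_id IS NULL ORDER BY checked_on",
        (species,),
    )
    return cursor.fetchall()
```

A report run "automatically" would just be a saved query of this kind executed on a schedule.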

your name comments