Reference Genome Database Requirements Discussion 2007 (Retired)
This is the place to discuss features and requirements for the Reference Genome Database being designed to replace the Google Spreadsheet system currently in use.
(here's one to get us started --chris):
- 1 Ensures consistent use of identifiers
- 2 Should allow loading of MOD reports
- 3 Should track that no ortholog was found
- 4 Should provide reports to focus curation effort
- 5 Should record that orthology is 'comprehensive' as of a certain date
- 6 Should allow a 1:many relation between Human gene and MOD ortholog
- 7 Should record orthology determination method
- 8 Only one index page
- 9 Should provide an output for easy data retrieval
Ensures consistent use of identifiers
Identifiers must unambiguously identify a single entry in a database.
Identifiers should conform to the following syntax:
DBAuthority : LocalID
DBAuthority should be in the GO xrefs metadata list:
Curators should not be expected to memorise identifiers, so the data entry system should let them enter symbols etc. and have these resolved to IDs, e.g. by an automatic lookup mechanism.
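As a rough illustration of the identifier rule and the symbol lookup, here is a minimal Python sketch. The authority set, the symbol table and the IDs in it are all made up for the example; a real system would read the authorities from the GO xrefs metadata and use a proper lookup service.

```python
from typing import NamedTuple

# Illustrative subset only; the real list would be read from the GO xrefs metadata.
KNOWN_AUTHORITIES = {"ZFIN", "SGD", "MGI", "FB", "WB"}

# Hypothetical symbol table with made-up IDs, standing in for an automatic lookup.
SYMBOL_TO_ID = {"geneA": "ZFIN:example-0001", "geneB": "MGI:MGI:0000001"}


class Identifier(NamedTuple):
    authority: str
    local_id: str


def parse_identifier(text: str) -> Identifier:
    """Split 'DBAuthority:LocalID' on the first colon and validate both halves."""
    authority, sep, local_id = text.strip().partition(":")
    if not sep or not authority or not local_id:
        raise ValueError(f"not in DBAuthority:LocalID form: {text!r}")
    if authority not in KNOWN_AUTHORITIES:
        raise ValueError(f"unknown DBAuthority: {authority!r}")
    return Identifier(authority, local_id)


def resolve(entry: str) -> Identifier:
    """Accept a gene symbol or a full identifier and return the identifier."""
    return parse_identifier(SYMBOL_TO_ID.get(entry, entry))
```

Splitting on the first colon only means a LocalID that itself contains a colon is kept intact.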
Should allow loading of MOD reports
The database should allow MODs to submit their metrics via a tab-delimited file that can be automatically downloaded from their FTP site. The file should contain columns for the Reference gene, the organism's ortholog(s), the date each gene was completed, and reference counts (total number of papers associated with a gene, etc.).
In the future we may want to add the capability to detect when genes that have been completed have new references associated with them.
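A loader for such a file might look something like the sketch below. The column names, the pipe-separated ortholog list and the ISO date format are assumptions for the example, not an agreed file format.

```python
import csv
import io
from datetime import date


def load_mod_report(handle):
    """Read a tab-delimited MOD report with a header row (assumed column names)."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        completed = rec["date_completed"].strip()
        rows.append({
            "reference_gene": rec["reference_gene"].strip(),
            # Assumed convention: several orthologs are separated by '|'.
            "orthologs": [o for o in rec["orthologs"].split("|") if o],
            # An empty date means the gene has not been completed yet.
            "date_completed": date.fromisoformat(completed) if completed else None,
            "total_papers": int(rec["total_papers"]),
        })
    return rows


# Made-up sample content; a real file would be fetched from the MOD's FTP site.
sample = (
    "reference_gene\torthologs\tdate_completed\ttotal_papers\n"
    "OMIM:000001\tZFIN:example-0001|ZFIN:example-0002\t2007-03-01\t42\n"
    "OMIM:000002\t\t\t0\n"
)
report = load_mod_report(io.StringIO(sample))
```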
References should be compiled in a central location, so once a paper is curated, it is somehow flagged that it has been done.
ZFIN does this for our own local publication database. It is very handy since one paper typically touches on several genes. Completing GO for a paper means it doesn't need to be looked at again if it is associated with another gene on the ref genome list. This would be useful for groups that don't have such a mechanism in-house. MODs could provide a tab-delimited file like geneID | PUB ID | OMIM ID | pub status [curated/not curated]. If the Ref. Genome interface had this data, it would also be possible to show a report of which pubs still need to be examined for a specific ortholog. One possible complication is that a single pub may deal with genes from multiple species. We would need to track curation of this pub separately for each species, I think. -Doug
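One way the "pubs still to examine" report could be computed is sketched below in Python. It assumes rows of (geneID, PUB ID, OMIM ID, pub status) and, purely for illustration, treats the DBAuthority prefix of the gene ID as standing in for the species, so that a pub is tracked separately per species. All IDs are made up.

```python
from collections import defaultdict


def uncurated_pubs_by_gene(rows):
    """Map each gene to the pubs that still need examining for its species."""
    # A pub counts as done for a species once any row from that MOD says 'curated'.
    curated = {
        (gene.split(":", 1)[0], pub)
        for gene, pub, _omim, status in rows
        if status == "curated"
    }
    pending = defaultdict(set)
    for gene, pub, _omim, _status in rows:
        if (gene.split(":", 1)[0], pub) not in curated:
            pending[gene].add(pub)
    return dict(pending)


rows = [
    ("ZFIN:g1", "PMID:1", "OMIM:1", "curated"),
    ("ZFIN:g2", "PMID:1", "OMIM:2", "not curated"),  # same pub, already done for ZFIN
    ("ZFIN:g2", "PMID:2", "OMIM:2", "not curated"),
    ("MGI:g3", "PMID:1", "OMIM:1", "not curated"),   # same pub, different species
]
todo = uncurated_pubs_by_gene(rows)
```

Here PMID:1 drops out for both ZFIN genes but stays pending for the MGI gene, which is the per-species behaviour described above.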
We need to decide whether we will allow individual users to modify the database a record at a time or whether the database should only be populated with files from each MOD.
I think both methods would be useful (Doug)
Should track that no ortholog was found
There should be a mechanism for indicating that a curator has looked and no ortholog could be located as of a certain date. For genomes that are not yet completely sequenced, we will want to revisit these when a new genome build is released. It would be nice to have a free text note field associated as well so we can leave notes regarding the analysis that was performed. -Doug
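A "looked, found nothing" record could be as small as the following Python sketch. The field names and build labels are hypothetical; the point is only that the date, the genome build and a free-text note are stored together so stale findings can be picked out when a new build appears.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class NoOrthologFinding:
    human_gene: str
    species: str
    checked_on: date
    genome_build: str  # build that was searched (hypothetical label)
    note: str = ""     # free text describing the analysis performed


def needs_revisit(finding, current_builds):
    """True when the species now has a different build than the one searched."""
    current = current_builds.get(finding.species, finding.genome_build)
    return current != finding.genome_build


finding = NoOrthologFinding(
    "OMIM:000001", "zebrafish", date(2007, 1, 15), "build-6",
    note="BLAST and synteny checked; nothing convincing.",
)
```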
Should provide reports to focus curation effort
The interface should provide reports that will help focus curator effort. One example might be to provide a facility to search the data for species-specific orthologs where curation is not 'comprehensive'...these are the genes we should be working on. Another example might be to provide a report of genes where no ortholog has been determined yet. It would be nice to be able to alter the sort order of the results in such reports by the following parameters: date the Human Gene was added to the Ref. Genome set, ortholog ID, OMIM ID, most papers associated, difference between papers associated with a gene and papers read for GO for that gene (biggest difference to smallest)...maybe others... It would also be good to have a report generated by a query on OMIM ID that shows which species have 'comprehensive' annotation done for their ortholog(s). -Doug
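The "biggest difference to smallest" ordering could be sketched like this in Python. The record keys (`comprehensive`, `papers_total`, `papers_read`) are invented for the example.

```python
def curation_backlog(genes):
    """Orthologs not yet 'comprehensive', largest unread-paper gap first."""
    pending = [g for g in genes if not g["comprehensive"]]
    return sorted(
        pending,
        key=lambda g: g["papers_total"] - g["papers_read"],
        reverse=True,
    )


genes = [
    {"ortholog": "ZFIN:a", "comprehensive": False, "papers_total": 40, "papers_read": 35},
    {"ortholog": "ZFIN:b", "comprehensive": True, "papers_total": 12, "papers_read": 12},
    {"ortholog": "ZFIN:c", "comprehensive": False, "papers_total": 90, "papers_read": 10},
]
backlog = [g["ortholog"] for g in curation_backlog(genes)]
```

Other sort orders (date added, ortholog ID, OMIM ID) would just swap the `key` function.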
Should record that orthology is 'comprehensive' as of a certain date
Curators should be able to mark that curation of an ortholog is 'comprehensive' as of a certain date. It would be good to be able to generate a report to look for cases where the 'comprehensive' curation date is getting old. These may need to be reviewed and updated. -Doug
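A report of ageing 'comprehensive' dates might be computed along these lines. This is only a sketch; the one-year threshold is an arbitrary assumption and the IDs are made up.

```python
from datetime import date, timedelta


def stale_comprehensive(marks, today, max_age_days=365):
    """Return (ortholog, date) pairs marked before the cutoff, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    old = [(gene, marked) for gene, marked in marks.items() if marked < cutoff]
    return sorted(old, key=lambda item: item[1])


marks = {
    "ZFIN:a": date(2005, 6, 1),
    "ZFIN:b": date(2007, 1, 1),
    "ZFIN:c": date(2004, 1, 1),
}
stale = stale_comprehensive(marks, today=date(2007, 3, 1))
```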
Should allow a 1:many relation between Human gene and MOD ortholog
Perhaps this is self-evident, but it mucks up reports and things pretty nicely...
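To show where the 1:many relation bites, here is a small Python sketch (names and IDs hypothetical). Flattening to one row per human gene/ortholog pair is convenient for display, but any count taken over those rows has to de-duplicate on the human gene.

```python
def report_rows(orthologs_by_human_gene):
    """One row per (human gene, ortholog); genes without orthologs get None."""
    rows = []
    for human, orthologs in orthologs_by_human_gene.items():
        for ortholog in orthologs or [None]:
            rows.append((human, ortholog))
    return rows


def genes_with_ortholog(rows):
    """Count distinct human genes, not rows, so 1:many genes count once."""
    return len({human for human, ortholog in rows if ortholog is not None})


data = {"OMIM:1": ["ZFIN:a", "ZFIN:b"], "OMIM:2": []}
rows = report_rows(data)
```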
Should record orthology determination method
If the group decides on a single method we can all adhere to for ortholog determination, then this may not be required. I think this is unlikely however, in which case it may be useful to record HOW orthology was determined. Perhaps a set of checkboxes beside the various methods can be used to record what tools/methods supported the orthology call. These would include YOGY, TreeFam, InParanoid, etc. as well as manual BLAST and synteny analysis, and perhaps 'established from literature' for cases where orthology was already recorded by a MOD from a published paper. If we really want to be careful, it might be good to record which build of the databases behind the tools was being used as well, i.e. which Sanger build, which InParanoid version, etc. -Doug
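The checkbox idea could be stored roughly as below. This is a sketch only: the class and field names are invented, and the version strings are placeholders rather than real releases.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class OrthologyCall:
    human_gene: str
    ortholog: str
    called_on: date
    # Ticked method -> version/build of the resource behind it (None if unknown).
    methods: dict = field(default_factory=dict)


def supported_by(call, method):
    """True if the given method was ticked for this orthology call."""
    return method in call.methods


def unversioned(call):
    """Methods that were ticked without recording a build or version."""
    return [name for name, version in call.methods.items() if version is None]


call = OrthologyCall(
    "OMIM:000001", "ZFIN:example-0001", date(2007, 1, 15),
    methods={"InParanoid": "placeholder-version", "manual BLAST": None},
)
```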
Only one index page
Should provide an output for easy data retrieval
Would it be possible to select different output options, e.g. HTML or Excel?
The advantage of Excel is that people can manipulate the data as they wish. Otherwise the tool itself would need to offer a variety of outputs, e.g. graphs and data collation.
I would suggest that the administrators would appreciate an output table similar to the original Google spreadsheet, with each human gene listed in a separate row, and the accession number, metrics data and date completed for human and all other species listed in columns.
However, I don't think they will want to view the table as a whole every time they look at it, especially in a couple of years' time when there are 500 genes on the list.
Could there therefore be drop-down options? For example, having selected "metrics table", choose "edit" or "view"; for "view", choose "Excel" or "HTML"; the next options would be "all data", "by date added to table", "only genes comprehensively annotated in all species", "newest genes" and "genes not yet annotated". This would, I guess, lead to further options for dates ("2006", "Aug06", "Sept06", ...) or alphabetical ordering. Perhaps the choices should be decided once people work out what data they want.
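As a sketch of how one set of rows could feed both output types, the Python below filters rows with a "view" function and renders either an HTML table or CSV text (CSV is used here as a stand-in for Excel output, since Excel opens it directly). The column names and view are assumptions for the example.

```python
import csv
import html
import io


def to_csv(rows, columns):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_html(rows, columns):
    """Render rows as a bare HTML table, escaping every cell."""
    head = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(r[c]))}</td>" for c in columns) + "</tr>"
        for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def render(rows, columns, fmt="html", view=lambda row: True):
    """Apply the selected view (a row filter), then the selected output format."""
    selected = [r for r in rows if view(r)]
    return to_csv(selected, columns) if fmt == "csv" else to_html(selected, columns)


rows = [
    {"human_gene": "OMIM:1", "date_added": "2006-08"},
    {"human_gene": "OMIM:2", "date_added": "2006-09"},
]
columns = ["human_gene", "date_added"]
aug06 = render(rows, columns, fmt="csv", view=lambda r: r["date_added"] == "2006-08")
```

Each drop-down choice such as "newest genes" or "genes not yet annotated" would then just be another `view` function.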