BioCurator Discussion Topics

From GO Wiki

Project Information.

(Please edit below and add information about your project. Add your project if it is not listed below.)

Number of papers annotated per year?

  • CGD
  • CTD (Comparative Toxicogenomics Database) - about 3200 papers per year (chemical-gene interactions, chemical-disease associations, and gene-disease associations)
  • dictyBase
  • DIP - about 1000/year; protein-protein interactions only
  • H-InvDB (H-Invitational Database) - about 7000 paper abstracts per year
  • MaizeGDB (Maize Genetics and Genome Database) - fewer than 100/year; new genes.
  • SGD
  • TAIR
  • RGD - about 6500 papers per year. Because we review all papers for a gene during curation, about 15,000 papers are reviewed per year and data are extracted from about 6500 of them.
  • UCL
  • UniProtKB - about 13000/year
  • WormBase
  • ZFIN - something in the neighborhood of 900 publications curated per year (as of 8/2008)
  • SABIO-RK - about 500/year; kinetic data only
  • GeneDB (S. pombe) - about 500/year published, about 300/year curated
  • QTLdb (Animal QTLdb) - about 50 papers curated into the database per year for the past 4 years
  • PharmGKB - about 750/year curated

What should be in a publication?

chemical name (or CAS #); official gene symbol (or Entrez-Gene ID or sequence ID) prefaced with species (e.g., mouse Hoxd11, human TP53, zebrafish nrf1), mentioned at least once somewhere in the paper; disease name; species used
  • dictyBase
  • DIP
bare minimum:
gene/protein/EST/etc name (as used in the paper), database identifier, species (taxon id) provided for every gene/protein/DNA fragment used in the paper (including controls)
useful (but, IMHO, optional):
functional annotation as on the TAIR/Plant Physiology page
formalized annotation of individual experiments
NOTE: If given a choice between the bare minimum now and useful additional information sometime in the future, I (Lukasz) would take the former, as consistently providing all DB identifiers already saves a LOT of work.
  • H-InvDB (H-Invitational Database) - annotation items: gene/protein names, functional description, and experimental evidence
  • MaizeGDB minimal requirements - in a single paragraph, to avoid time lost digging: new sequence accessions for maize and subspecies; their official gene symbol and any synonyms used in the paper; their allele descriptors or cultivar/subspecies. In addition, the chromosome location with coordinates or, alternatively, flanking markers or position on a sequenced BAC or some such; gene product name(s).
  • SGD
  • TAIR
  • RGD - clear statement of the organism (Rattus norvegicus) in the abstract, and the organism for each gene in papers in which there are multiple genes from multiple organisms; official gene symbol and name, QTL symbol and name, strain symbol and name; RGD ID for rat genes, QTLs, or strains and proper IDs for genes from other organisms; RefSeq IDs and EMBL/GenBank/DDBJ IDs for sequences
  • UCL
  • UniProtKB
Official gene names for model organisms (Nature Genetics forces authors to use them). When present,
ordered locus names such as CG numbers in Drosophila or AGI numbers in Arabidopsis.
For sequences: clear database identifiers (EMBL/Genbank/DDBJ, UniProtKB). When a sequence does not
exist in a database (description of a new isoform), authors should submit the sequence to the
EMBL/GenBank/DDBJ databases, or at least explicitly show the protein or mRNA sequence. For alternative
products, it is becoming increasingly difficult to obtain the precise sequence since authors usually only
provide a cartoon describing the intron-exon structure.
Also, clarification of which specific isoform was used to perform the experiments in the paper when multiple alternative products are known for a given gene.
Species (taxon id) and strain information provided for every protein used in the paper.
  • WormBase
  • ZFIN
In general, extensive use of identifiers and controlled vocabularies (e.g., taxon ID, GO terms)
Identity of the enzymes
Enzyme Name, EC-Number Name, preferably the accepted name from the IUBMB Enzyme List
Organism/species & strain; Sequence accession number; Isoenzyme
Additional information on the enzyme
Tissue/organelle; Localization; Post-translational modification
Assay conditions
Measured reaction as a stoichiometrically balanced equation; Assay temperature; Assay pH; Buffer & concentrations; Metal salt(s) & concentrations; Other assay components; Substrates & concentrations; Enzyme/protein concentration
Assay method (a literature reference may suffice for an established procedure that is used without modification); Type of assay, e.g., continuous or discontinuous, direct or coupled, direction of the assay; Reactant determined; Reaction stoichiometry
Additional information desirable
Total assay mixture ionic strength; free metal cation concentrations; Reaction equilibrium constant
Data necessary for reporting kinetic parameters
Vmax; kcat; kcat/Km; Km; S0.5 (both given as concentrations, e.g., mM); Hill coefficient; How was the given parameter obtained?
Data required for reporting inhibition data
Type of inhibition (competitive, uncompetitive, etc.); KI values with units and how they were determined
Required information for all enzyme functional data
Indication of accuracy, e.g., standard error of the mean, standard deviation, confidence limits, quartiles; Specification whether relative to subunit or oligomeric form
Additional material desirable
Kinetic mechanism e.g. ordered bi-bi; kinetic law (equation)
  • PharmGKB


dbSNP rs numbers
HGNC gene names
drug names
Race and ethnicity of human subjects
MeSH disease descriptions
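
Several of the lists above converge on the same bare minimum for every gene, protein, or DNA fragment in a paper: the name as used in the paper, a database identifier, and the species (taxon id). A minimal sketch of how that minimum could be checked is shown here. The record layout, field names, and the `missing_fields` helper are hypothetical and are not taken from any database's actual submission format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EntityMention:
    """One gene/protein/DNA fragment used in a paper (hypothetical layout)."""
    name_in_paper: str                 # name exactly as the authors wrote it
    database: Optional[str] = None     # e.g. "UniProtKB", "RGD", "EMBL/GenBank/DDBJ"
    identifier: Optional[str] = None   # accession or ID within that database
    taxon_id: Optional[int] = None     # NCBI Taxonomy ID of the source organism


def missing_fields(entity: EntityMention) -> List[str]:
    """Return the bare-minimum items that are absent for this entity."""
    missing = []
    if not entity.name_in_paper.strip():
        missing.append("name_in_paper")
    if not entity.database or not entity.identifier:
        missing.append("database identifier")
    if entity.taxon_id is None:
        missing.append("taxon_id")
    return missing


# Human TP53 with its UniProtKB accession and the human taxon id.
complete = EntityMention("TP53", "UniProtKB", "P04637", 9606)
# A name with nothing else, the situation curators complain about.
incomplete = EntityMention("Trx")

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['database identifier', 'taxon_id']
```

A check of this kind could run for every entity in a manuscript, controls included, so that gaps are flagged before publication rather than discovered during curation.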

Suggested text for a letter to journal editors

(Please edit here and include your thoughts.)

(New version, from Rebecca Nelson (DIP), plus edits from Mike Livstone (SGD).)

Dear ____________

We, the undersigned representatives from the community of biological databases, invite you to join a collaboration to improve the accessibility and organization of biological data.

The tremendous and rapidly increasing volume of biological data would be nearly impenetrable without organization and synthesis. Biological databases, such as those we represent, provide an essential service by organizing, archiving, and unifying these data. With a few exceptions, this organization is achieved through manual curation of the published literature. That is, we read each paper, filter through its contents, and record the information of interest.

Although different databases extract different information, we all rely on clear and unambiguous identification of the genes, macromolecules, metabolites, and chemicals under study. Unfortunately, we regularly encounter papers with missing or ambiguous information (citations), which slows or prevents complete curation and reduces our ability to cover the breadth of known biology.

We therefore suggest a collaboration with (journal/publisher name) to incorporate the direct submission of gene/biomolecule names, identifiers, and source organisms into the editorial review process for each research article you publish. We attach an example block that could be added to an online submission & review form. We feel that requiring this information strikes a good balance between improving the accuracy and completeness of curated data and minimizing the burden on authors and editors.

While such a system will clearly benefit us, it will also benefit (journal/publisher), your authors, and the scientific community. Clear identification of biological components allows for accurate curation and ensures that your authors and publications are cited in our databases, increasing their visibility in the biological community. Because we will be able to curate more articles more quickly, our databases will be able to provide a more complete picture of biology for the scientific community at large.

We look forward to working with you to make this vision a reality.


(Previous version and comments)

  • GenBank is actually either DDBJ/EMBL/GenBank or the INSD (the International Nucleotide Sequence Database).
  • I think the letter will be more effective if it comes from the BioCurator Soc rather than a single database.
  • While I think all the points raised are relevant I think the letter is simply too long. Perhaps we should focus more on the positive rather than the existing problems.
  • I agree that this is a bit long, and also that we should concentrate on the benefits to the researchers and/or journals rather than why we as curators want this (i.e. if we can't curate the data it is lost in the sea of other publications, not that we would necessarily want to word it that way). Also, I would suggest using wording that is stronger than that we are wondering if they would be willing to consider... Perhaps suggest that we are interested in collaborating with them on this. Jrsmith

Here's a rough draft of the idea we (i.e., DIP, hopefully with the support of as many databases as possible) would like to persuade the journal editors to adopt (Lukasz/DIP):

Recent years have seen a rapid increase in the quantity of biological data published in research papers. As the volume of the data increases, it is of utmost importance to organize and combine it in a systematic way. This is one of the primary roles of the numerous biological databases: RCSB, GenBank, UniProt, SwissProt, DIP, IntAct, MINT, SGD (yeast), FlyBase, WormBase, TAIR (Arabidopsis), RGD (rat), Comparative Toxicogenomics Database (CTD), and many others.

With the exception of RCSB and GenBank, where direct data deposition by the authors is imposed by journal editors and/or funding agencies, biological databases generally depend on curators to manually extract individual pieces of information from research papers. This curation is labor-intensive, and curators agree that the major stumbling block to efficient curation of biological literature is incomplete and/or ambiguous information about the identity of the biomolecules and genes studied. Every curator can provide horror stories of tracing the identity of a single protein used in a paper through a chain of 'prepared as described in...' and 'obtained as a gift from...' phrases only to discover at the end of the trail that it is still impossible to identify the protein's species of origin without contacting the authors (1). The problem seems to be universal across every journal and every database with which we have contact.

Over the years, researchers have raised this issue in numerous commentaries, reviews and editorials, mostly without any response. Two recent initiatives, however, seem to suggest that the situation is changing. TAIR has initiated a partnership with Plant Physiology journal (2) aimed at capturing as much functional data as possible with minimal burden imposed on both the journal editorial office and the authors. Similarly, FEBS Letters (3), in collaboration with the MINT database, is recruiting authors to capture protein interaction data.

While we are enthusiastic about these attempts, we realize that the scope of the problem is much broader. Curation efforts of many individual databases would immediately become more efficient if a list of biomolecules and genes, each with a reference to the relevant database, were published as a simple electronic supplement available to every journal reader. We believe it would translate into rapid dissemination of the information from such papers to many diverse databases. As every database references the original data source, the supplement would improve database coverage and increase the visibility of both individual articles and the journals in which they are published.

We (as DIP, but also as a member of the biocurator forum that includes CTD, CGD, dictyBase, DIP, MaizeGDB, SGD, TAIR, RGD, UniProt, WormBase, and ZFIN; and as a member of the IMEx consortium of interaction databases grouping DIP, IntAct, MINT, MPact, and BioGRID) wonder if your journal, XXX, would be willing to implement a policy requiring the authors of accepted papers to prepare, with the help of the database community, an electronic supplement file listing all the biomolecules and genes studied in the manuscript. One possible approach would be to implement a form similar to the one prepared by TAIR for Plant Physiology (view at ) that would produce, as the output, a file to be included as part of the article's electronic supplement.


(1) most recent example from PNAS: Bartsch S, Monnet J, Selbach K, Quigley F, Gray J, von Wettstein D, Reinbothe S, Reinbothe C PNAS 105(12):4933-8 (2008) Three thioredoxin targets in the inner envelope membrane of chloroplasts function in protein import and chlorophyll metabolism.

There's absolutely no way to identify the Trx protein used in experiments described in Fig 1. The result is 6 interactions of this protein are lost for DIP, IntAct, MINT databases; the same holds for any other functional data reported in the paper.

(2) Plant Physiology 146:1022-1023 (2008) Plant Physiology and TAIR Partnership

(3) Superti-Furga G, Wieland F, Cesareni G Finally: The digital, democratic age of scientific abstracts FEBS Letters 582(8),1169
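
The electronic supplement proposed in the draft above (a list of biomolecules and genes, each with a reference to the relevant database) could be as simple as a tab-delimited file. The following is only an illustrative sketch: the column names and the `write_supplement` function are assumptions, and it does not reproduce the TAIR/Plant Physiology form or any format a journal has adopted.

```python
import csv
import io

# Hypothetical column layout; no journal or database mandates these names.
COLUMNS = ["name_in_paper", "database", "identifier", "taxon_id"]


def write_supplement(rows, handle):
    """Write a header line, then one tab-delimited line per biomolecule."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Two human proteins with their UniProtKB accessions and the human taxon id.
rows = [
    {"name_in_paper": "TP53", "database": "UniProtKB",
     "identifier": "P04637", "taxon_id": 9606},
    {"name_in_paper": "BRCA1", "database": "UniProtKB",
     "identifier": "P38398", "taxon_id": 9606},
]

# Write to an in-memory buffer here; a real supplement would go to a file.
buffer = io.StringIO()
write_supplement(rows, buffer)
text = buffer.getvalue()
print(text)
```

A flat file like this is easy for authors to fill in from a web form and easy for any database to parse, which is the property the draft letter is asking for.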