UCL-annotation team Dec 5-March 10

Overview:

The University College London (UCL) based annotation team aims to provide GO annotation for human genes relevant to cardiovascular or Parkinson's disease, and to submit protein-protein interaction data to IntAct. These projects are funded by the British Heart Foundation (BHF) and Parkinson's UK (with the annotation sources BHF-UCL and ParkinsonsUK-UCL, respectively). We have a successful collaboration between several UCL-based research groups, the European Bioinformatics Institute (EBI) and King's College, London. The annotations created by the UCL-based curators are made directly into the GOA database or the IntAct database at the EBI. 4000 human genes have been identified as associated with cardiovascular processes; the priority gene list relevant to Parkinson's disease is being developed. Annotation priorities are agreed on a regular basis in consultation with the Co-Grant holders, the International Scientific Advisory Committee and the UCL-based GO curators. The UCL-annotation team has been a GOC member since July 2008.

1. Staff:

Dr Ruth Lovering, 1 FTE – UCL-based curator, BHF scholarship to February 2014, UCL funding to July 2018
Dr Anna Melidoni, 1 FTE – UCL-based curator, BHF grant to July 2018
Dr Nancy Campbell, 1 FTE – UCL-based curator, BHF grant to July 2018 (on maternity leave)
Dr Milagros Rodríguez-López, 0.8 FTE – UCL-based curator, BHF grant to April 2015
Dr Paul Denny, 0.5 FTE – UCL-based curator and Parkinson's project co-ordinator, Parkinson's UK grant to December 2016
Tony Sawford, 0.35 FTE – EBI-based Software engineer, BHF grant to July 2018, Parkinson's UK grant to December 2016

No funding via GOC NHGRI grant

2. Annotation Progress

Ruth is currently training 2 curators in GO annotation, so the number of annotations for this quarter is very low: 100 GO terms associated with 23 human proteins (7th December 2013 to 15th February 2014).

3. Methods and strategies for annotation

(please note % effort on literature curation vs. computational annotation methods)

a. Literature curation (100%): The aim of this initiative is to provide complete and deep annotation of 300 human proteins per year. This is achieved through both protein-centric and process-centric targeting of proteins to annotate. Process-centric annotation enables the curators to gain a better understanding of the targeted process. Protein-centric annotation is undertaken when annotating proteins on a specific cardiovascular-relevant list, such as a list from a Genome-Wide Association Study. In addition, we annotate proteins following requests from cardiovascular scientists or when they are annotated by attendees of our MSc module or 2-day annotation workshops. The following approaches are taken to achieve this:
• To ensure a rapid improvement in the annotations available for a large number of cardiovascular-associated proteins, the curators spend a maximum of one day researching the literature associated with each protein.
• The protein is marked as ‘complete’ if the curator feels there are no further terms to add.
• If complete annotation cannot be achieved in a day, the protein record is marked as ‘first pass complete’. The intention is to revisit these first pass proteins, ideally with some expert scientist input, in the following year.
• The approved gene symbol (and relevant gene and protein aliases) is used to query a variety of biomedical search engines, including NCBI PubMed, iHOP and GOPubMed, to identify suitable papers for the GO annotation of each target protein (for highly researched genes the search is usually limited to human entries only).
• The curators will usually associate GO terms with all of the human proteins mentioned in each paper read, depending on the experimental evidence available (occasionally GO terms are associated with non-human proteins too).
• Preference is given to experimental evidence codes. However, these are only used when the curator is completely confident of the identity of the protein and the species from which it is derived.
• Reviews are also used to provide an overview of the characteristics of a protein and an insight into the complete set of GO terms required.
• Experimental data relating to model organism proteins may be included in our GO annotation process, through the direct annotation of the model organism protein and the use of the ‘inferred by sequence similarity’ evidence code to transfer the information to the orthologous human protein.
• Author statements are used when experimentally supported literature is unobtainable, whether due to insufficient information about the species the protein is derived from, lack of access to a referenced paper, or simply because the knowledge is considered so well accepted that references are not supplied.
• When possible we associate the chronologically first paper that provides experimental evidence for the characteristic features of a given human protein.
• We aim to capture the knowledge about each protein using a limited number of papers with experimental evidence.
• We do not annotate every relevant paper if doing so would lead to repeated duplication of GO terms associated with the protein.
• GO terms are chosen by querying the GO files with QuickGO or AmiGO.
• Before assigning a GO term, its definition and position within the ontology are checked to ensure its suitability.
• The GO editorial office is contacted, via SourceForge, when a new GO term is required or modifications are needed to an existing GO term.
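The evidence and record-status rules in the list above can be sketched in Python roughly as follows. This is only an illustrative simplification, not a description of any tool used by the team: the names PaperFinding, choose_evidence and record_status are hypothetical, the curator's judgement is reduced to boolean flags, and the evidence categories are labels rather than actual GO evidence codes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class EvidenceCategory(Enum):
    # Labels only; mapping to specific GO evidence codes is left to the curator.
    EXPERIMENTAL = "experimental"
    SEQUENCE_SIMILARITY = "inferred by sequence similarity"
    AUTHOR_STATEMENT = "author statement"


class RecordStatus(Enum):
    COMPLETE = "complete"
    FIRST_PASS_COMPLETE = "first pass complete"


@dataclass
class PaperFinding:
    """Hypothetical summary of what a curator found in one paper."""
    has_experimental_evidence: bool
    protein_identity_confirmed: bool  # curator is sure which protein was assayed
    species_confirmed: bool           # curator is sure which species it came from
    is_human_protein: bool


def choose_evidence(finding: PaperFinding) -> List[Tuple[str, EvidenceCategory]]:
    """Return (annotation target, evidence category) pairs for one finding."""
    confident = finding.protein_identity_confirmed and finding.species_confirmed
    if not (finding.has_experimental_evidence and confident):
        # Assumption: any case without usable experimental support
        # falls back to an author statement.
        return [("human protein", EvidenceCategory.AUTHOR_STATEMENT)]
    if finding.is_human_protein:
        return [("human protein", EvidenceCategory.EXPERIMENTAL)]
    # Model organism data: annotate the assayed protein directly, then
    # transfer the information to the human ortholog by sequence similarity.
    return [
        ("model organism protein", EvidenceCategory.EXPERIMENTAL),
        ("orthologous human protein", EvidenceCategory.SEQUENCE_SIMILARITY),
    ]


def record_status(no_further_terms_to_add: bool) -> RecordStatus:
    """Status of a protein record once the one-day literature limit is reached."""
    if no_further_terms_to_add:
        return RecordStatus.COMPLETE
    return RecordStatus.FIRST_PASS_COMPLETE
```

For example, a mouse experiment with a confidently identified protein would yield two annotation targets under these assumptions: the mouse protein with experimental evidence and the human ortholog with sequence-similarity evidence.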

b. Computational annotation strategies: None used

c. Priorities for annotation:

BHF funded project: Human genes involved in cardiovascular-related processes, as agreed by the International Scientific Advisory Panel. During the past year we have been focusing on the annotation of cardiac conduction associated genes. We are now funded to divide our time equally between annotating proteins and microRNAs with GO terms and capturing protein-protein interactions (PPIs) through submission to IntAct. Currently we are revisiting previously annotated papers so that PPIs already submitted to GO can also be added to IntAct.

Parkinson's UK funded project: Human genes involved in neurological-related processes, as agreed by the grant co-applicants, our International Scientific Advisory Panel and additional expert scientists.

CAFA project: We are assisting the UniProt-GOA team with this project by curating the primary functions and processes of the proteins which are on both the CAFA priority list and the Cardiovascular-priority list. This will help populate these targets with functional annotations, which will assist in the assessment of the CAFA competition.

4. Presentations and Publications

(Tony Sawford’s publications are not listed here as these will be included in the GOA report)

a. Papers with substantial GO content

From zebrafish heart jogging genes to mouse and human orthologs: using Gene Ontology to investigate mammalian heart development. Varsha K Khodiyar, Doug Howe, Philippa J Talmud, Ross Breckenridge, Ruth C Lovering F1000

b. Presentations including Talks and Tutorials and Teaching

UCL group meetings: The UCL GO curators are closely associated with the Cardiovascular Genetics and Molecular Neuroscience groups at UCL, and with the UCL-London-School-Edinburgh-Bristol (UCLEB) consortium of population-based prospective studies, and have given 2 presentations at their group meetings.

Paul Denny: "Focusing the Gene Ontology on Parkinson's Disease-relevant Proteins". Abstract: An introduction to the Gene Ontology (GO) and description of our plans for applying GO annotation to proteins relevant to Parkinson's Disease. 40-minute presentation to Prof. John Hardy’s group at the Institute of Neurology, University College London, on 24th January 2014.

Ruth Lovering: "Review of MSc autism annotation project". 30-minute presentation to Cardiovascular Genetics, Institute of Cardiovascular Science, on 11th February 2014.

The BHF-UCL team teaches a 10-week ‘bioinformatics’ module for Genetics of Human Disease MSc students each year, which runs October to December. By focusing on the review of a GWAS risk-associated SNP the students constructively apply their newly acquired knowledge of a variety of online biological resources, including Ensembl, EntrezGene, IntAct, Cytoscape, UniProt, QuickGO, AmiGO, HCOP and functional analysis tools. In addition, the students learn the importance of including full experimental detail in scientific publications.

c. Poster presentations

none

5. Other Highlights:

A. Ontology Development Contributions:

Between 5th December 2013 and 10th March 2014 the BHF-UCL team made 7 SourceForge requests and the PARL-UCL team made 1 SourceForge request. This brings the total number of GO terms created by the UCL teams to almost 1,800.

B. Annotation Outreach and User Advocacy Efforts:

The BHF-UCL team is teaching a ‘bioinformatics’ module for Genetics of Human Disease MSc students this year. By focusing on the review of a GWAS risk-associated SNP the students constructively apply their newly acquired knowledge of a variety of online biological resources, including Ensembl, EntrezGene, IntAct, Cytoscape, UniProt, QuickGO, AmiGO, HCOP and functional analysis tools. In addition, the students learn the importance of including full experimental detail in scientific publications.

C. Other Highlights: This year the BHF-UCL team circulated a newsletter, in February, by direct email to the International Advisory Committee and to individuals who have expressed an interest in this project; by indirect email, through the mailing lists of several cardiovascular-related societies and the UCL Department of Medicine mailing list; and through our web site.

We have appointed a 0.8 FTE Parkinson's UK funded biocurator, Rebecca Foulger, for 2.5 years, starting on 1st May 2014.

Ruth has been involved in interpreting the functional analysis of a microRNA dataset generated by Dr Anastasia Kalea, Cardiovascular Genetics, UCL.