GO Annotation Standard Operating Procedures

Latest revision as of 13:17, 20 January 2021

 TO BE REVIEWED
 

This page documents some of the standard operating procedures used by members of the GO Consortium during annotation. Please note that these are not necessarily the best or the only ways to carry out annotation; they are simply a guide to how some groups currently annotate. More information on annotation can be found in the GO annotation guide and in the GO annotation conventions. If you have any questions about the guidelines given below, please contact the GO helpdesk.



No single established database?

Some model species research communities do not have an established database group with funding and time to commit to long-term maintenance of their datasets. Such groups can contribute annotations to the central repository via the UniProtKB GO Annotation (UniProtKB-GOA) multispecies annotation group. This is also a possible route for those groups just starting out in annotation who may wish to take up the responsibility for long-term maintenance of their datasets at a later date.

Submit GO annotations

Tell us about your requirements

I represent a small lab working on a biological area of research

In this case, perhaps you have a list of your favorite genes and you wish to annotate them. You have a range of choices depending on what you are trying to achieve.

Please see the range of options below and choose the one that suits you best.

I have a set of ESTs and I would like to attach annotations

If you would ultimately like to send the annotations to the consortium for distribution, it is crucial that your EST clusters keep the same identifiers over each round of re-clustering. One way to do this is to choose one EST per cluster and identify the cluster by that EST. There may be other good approaches that we have not heard about.
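The representative-EST idea can be sketched as follows. This is a minimal illustration, not a GOC tool: the function name `assign_stable_ids`, the `CL` prefix, and the EST accessions are all hypothetical, and the sketch does not settle how to handle clusters that merge or split between rounds (here a merged cluster simply keeps the identifier of its alphabetically first known representative).

```python
def assign_stable_ids(clusters, representative_to_id, next_number=1, prefix="CL"):
    """Give each EST cluster an identifier that survives re-clustering.

    clusters: list of sets of EST accessions from the current round.
    representative_to_id: dict mapping a representative EST accession to the
        cluster identifier it anchored in earlier rounds (updated in place).
    Returns (dict of cluster_id -> set of ESTs, next unused number).
    """
    result = {}
    for members in clusters:
        # Reuse the identifier anchored by a representative already in this cluster.
        known = sorted(est for est in members if est in representative_to_id)
        if known:
            cluster_id = representative_to_id[known[0]]
        else:
            # New cluster: choose a representative deterministically, mint a new ID.
            representative = min(members)
            cluster_id = f"{prefix}{next_number:06d}"
            next_number += 1
            representative_to_id[representative] = cluster_id
        result[cluster_id] = set(members)
    return result, next_number


# Hypothetical accessions, two rounds of clustering.
representatives = {}
round1, counter = assign_stable_ids([{"EST003", "EST001"}, {"EST002"}], representatives)
round2, counter = assign_stable_ids(
    [{"EST001", "EST003", "EST004"}, {"EST002", "EST005"}, {"EST006"}],
    representatives,
    counter,
)
```

In the second round the cluster containing EST001 keeps the identifier it received in the first round even though it has gained a member, which is the property that matters for distributing annotations.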

Many EST clusters have stable identifiers with version updates (e.g. the UniGene database). These stable identifiers can be used for making GO associations.

Once you have your clusters and stable identifiers, follow the IEA directions for making electronic annotations.

You could also run BLASTX, or run gene prediction programs and then BLASTP. Running InterPro on the sequences will find the longest open reading frame.

I have a genome sequence

You will already have assembled the genome sequence and made gene calls. Once you have the CDS sequences or predicted protein sequences, you can follow the instructions on IEA annotation and/or literature annotation. Please see below.

I have a microarray data set

The action you can take depends somewhat on your sequences.

  • Are they cDNAs or oligos?
  • Do they have identifiers? Which kind?
  • How do they relate to the genes? If you know which sequence relates to which characterised gene then it will be easy to transfer annotations over.
  • Do the genes have GO annotations? If they do not have full GO annotation from literature then you may like to apply for funding to annotate the genes yourself, or write to your Model Organism Database to ask them to do so.
  • Can you get more up-to-date annotations than those provided with your tool? You may be seeing only the annotations that come from your proprietary microarray software provider. It is a good idea to ask how often they update their annotations and ontology structure, as these change from day to day, and there may be many more annotations available than you are seeing.

It is most likely that you will want to use mainly electronic annotations, supplemented with some literature annotation for those sequences that are not yet fully annotated.

I have a peptide sequence

  • Do you know which gene it is?
  • Can you map it to a UniProtKB or MOD identifier?
  • Does this identifier have GO annotation?

If it doesn't, you can request that it be annotated (it helps if you provide literature associated with this gene product). If you cannot map it to a UniProtKB or MOD identifier, then you can make your own GO annotation by any of the electronic or ISS methods illustrated below.

Electronic annotation

Electronic annotation produces large amounts of annotation very quickly. Electronic annotations are rarely wrong, but they tend to be less detailed than literature annotations. For example, electronic annotation is likely to tell you which of your genes are transcription factors but unlikely to tell you in great detail what process the gene controls. You may like to use this method if you have a new genome sequence to annotate, or a microarray with many thousands of sequences.

   diag-iea-overview.png

This diagram illustrates some of the main ways of making electronic annotation. It should be read from the top down. The diagram shows sequences from UniProt having electronic GO annotation assigned by several computational methods. All of these methods involve use of mapping files. To learn more visit the guide to information on mappings of GO to other classification systems.

InterPro Mapping

With the InterPro mapping, it is possible to assign electronic GO annotation to your sequences based on InterPro domains and a number of other criteria. For example, if your sequence has a DNA binding domain, then it makes sense to electronically annotate it to the DNA binding function term. For more information on InterPro mapping, please see the information on InterProScan.

UniProt Keyword Mapping

This part of the diagram illustrates how sequences already categorized using the UniProt keyword mapping can have GO annotation automatically applied by transferring via the keyword mapping file.

HAMAP

HAMAP is a system that categorizes sequences based on family or subfamily characteristics and is applied to bacterial, archaeal and plastid-encoded proteins. GO annotation can be automatically applied to such sequences using the mapping file between HAMAP and GO.

Enzyme Commission

The Enzyme Commission database categorizes enzymes by the reactions they catalyze. If your sequences are already categorized by EC number, then you can transfer GO annotations using the mapping file of EC to GO categories.

Other mappings

These are just a few examples of mapping files that can be used to transfer annotations to your sequence objects. Many other mappings are available, and if there is not a mapping file between GO and your current annotation system, we can assist you in making one.
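As a rough sketch of how a mapping file can be used to transfer annotations, the Python below parses lines in the `external ID > GO:term name ; GO:id` layout used by GO mapping files and applies them to sequences that already carry external identifiers. The function names, the sequence identifiers, and the sample lines are illustrative assumptions and are not guaranteed to match the current mapping files; check the real files before relying on the parser.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   InterPro:IPR000003 Retinoid X receptor/HR-2 family > GO:DNA binding ; GO:0003677
#   EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
LINE_RE = re.compile(
    r"^(?P<db>[^:\s]+):(?P<ext_id>\S+).*>\s*GO:.*;\s*(?P<go_id>GO:\d{7})\s*$"
)


def parse_mapping(lines):
    """Return {external identifier: set of GO identifiers} from mapping lines."""
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # '!' marks comment lines
            continue
        match = LINE_RE.match(line)
        if match:
            mapping[f"{match['db']}:{match['ext_id']}"].add(match["go_id"])
    return mapping


def transfer_annotations(sequence_hits, mapping, evidence="IEA"):
    """Turn {sequence: [external ids]} into (sequence, GO id, evidence, source) rows."""
    rows = []
    for seq_id, ext_ids in sequence_hits.items():
        for ext_id in ext_ids:
            for go_id in sorted(mapping.get(ext_id, ())):
                rows.append((seq_id, go_id, evidence, ext_id))
    return rows


# Illustrative sample lines and hypothetical sequence identifiers.
sample = [
    "! example mapping file",
    "InterPro:IPR000003 Retinoid X receptor/HR-2 family > GO:DNA binding ; GO:0003677",
    "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022",
]
mapping = parse_mapping(sample)
rows = transfer_annotations({"seq1": ["InterPro:IPR000003"], "seq2": ["EC:9.9.9.9"]}, mapping)
```

Each output row keeps the external identifier that justified the transfer, so the source of every electronic annotation stays traceable. Sequences whose external identifier has no mapping, like `seq2` here, simply receive no annotation.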

BLAST

You can also make electronic annotations by BLASTing your sequence against manually annotated sequences and transferring the GO annotations across to your sequence. The threshold of similarity in this process is up to you, and depends on your requirements.
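A minimal sketch of this transfer step, assuming BLAST was run with tabular output (`-outfmt 6`, whose default columns are query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The function name, the thresholds, and the identifiers below are hypothetical placeholders; as noted above, the similarity threshold is your own decision.

```python
def transfer_from_blast(blast_lines, subject_annotations,
                        min_identity=40.0, max_evalue=1e-10):
    """Transfer GO ids from annotated subjects to queries that pass the thresholds.

    blast_lines: iterable of tab-separated BLAST hits (outfmt 6 default columns).
    subject_annotations: {subject id: set of GO ids} from manual annotation.
    Returns {query id: set of (GO id, subject id)} so the source is kept.
    """
    transferred = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, evalue = float(fields[2]), float(fields[10])
        if identity < min_identity or evalue > max_evalue:
            continue  # hit too weak under the chosen thresholds
        for go_id in subject_annotations.get(subject, ()):
            transferred.setdefault(query, set()).add((go_id, subject))
    return transferred


# Hypothetical hits: one strong, one weak.
hits = [
    "myseq1\tP12345\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350",
    "myseq2\tQ99999\t25.0\t100\t75\t0\t1\t100\t1\t100\t0.5\t30",
]
annotated = {"P12345": {"GO:0003677"}, "Q99999": {"GO:0005515"}}
result = transfer_from_blast(hits, annotated)
```

Only the strong hit passes, so `myseq1` inherits the annotation of `P12345` and `myseq2` is left unannotated.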

No similar sequences manually annotated?

If your sequence is similar to other sequences that have been well characterized but not yet annotated from the literature, then one option is to carry out the literature annotation yourself and then transfer by electronic methods.

Literature Annotation

Literature annotation involves capturing published information about the exact function of a gene product as GO annotations. To do this, you must read the publications about the gene and record all the relevant information. This annotation is time-consuming but produces very high quality, species-specific annotation, and brings the information about the gene product into a format in which it can be used in high-throughput experiments. This is an extremely worthwhile process in the long term. It may be best carried out by people who know the function of the gene product, and the associated biology, in great detail; for example, experimental scientists who are familiar with the published literature. If you are doing this, then you may like to write and suggest modifications to the ontology structure as well.

Below is a schematic diagram giving an introduction to the steps involved in literature-based GO annotation. If you are interested in carrying out literature-based annotation you can receive full training in the process by attending a GO annotation camp or by working with an individual GO Consortium annotation mentor.

Literature Based Annotation


Sequence-based annotation

  • General principles for sequence IDs
    • You must have stable identifiers for your objects.
    • You must provide information on what the object is, e.g. a protein, nucleotide, EST, etc. It doesn't matter if a nucleotide sequence is a gene, a genome, or an EST as long as it can be identified as such.
    • If a sequence identifier has become obsolete, there must be a mechanism in place for tracking down the replacement.
    • Your database must have an internal rule that object identifiers are never reused.
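These principles can be illustrated with a small in-memory registry. This is only a sketch under stated assumptions: the class `IdentifierRegistry`, its method names, and the example identifiers are hypothetical, and a real database would enforce the same rules with persistent storage.

```python
class IdentifierRegistry:
    """Tracks object identifiers so that they are typed, never reused, and traceable."""

    def __init__(self):
        # identifier -> {"type": ..., "obsolete": bool, "replaced_by": identifier or None}
        self._records = {}

    def register(self, identifier, object_type):
        """Add a new identifier with its object type (protein, nucleotide, EST, ...)."""
        if identifier in self._records:
            # Covers obsolete identifiers too: an identifier is never reused.
            raise ValueError(f"{identifier} has already been used and cannot be reused")
        self._records[identifier] = {
            "type": object_type, "obsolete": False, "replaced_by": None,
        }

    def obsolete(self, identifier, replaced_by=None):
        """Mark an identifier obsolete, optionally recording its replacement."""
        if replaced_by is not None and replaced_by not in self._records:
            raise KeyError(f"replacement {replaced_by} is not registered")
        record = self._records[identifier]
        record["obsolete"] = True
        record["replaced_by"] = replaced_by

    def resolve(self, identifier):
        """Follow replacements to the current identifier; None if there is none."""
        seen = set()
        current = identifier
        while self._records[current]["obsolete"]:
            if current in seen:
                raise ValueError("replacement chain contains a cycle")
            seen.add(current)
            current = self._records[current]["replaced_by"]
            if current is None:
                return None
        return current

    def object_type(self, identifier):
        return self._records[identifier]["type"]


# Hypothetical identifiers.
registry = IdentifierRegistry()
registry.register("EST_A", "EST")
registry.register("EST_B", "EST")
registry.obsolete("EST_A", replaced_by="EST_B")
current = registry.resolve("EST_A")
```

Because obsolete records are kept rather than deleted, an old identifier found in an annotation file can still be traced to its replacement, and an attempt to register it again is rejected.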

Annotation workflow

The following diagram shows the standard operating procedure for sequence-based (ISS evidence code) annotation used in the past at The Institute for Genomic Research (now JCVI).

   Diag-tigr-annotation.png


  Note: the following part was moved from [Tips_to_Produce_High_Quality_Annotations]; we need to see whether any of it is useful.


Annotating new organisms

If you work on a previously unannotated organism, or your research group has a specific research expertise that could be used to produce GO annotations:

  • [Contact the GOC](http://help.geneontology.org/) to discuss the best approach for your annotations and to ensure you are the only group working on your organism. If you are interested in taking ownership of an organism with outdated annotations, we can also help you find the right people to contact.
  • Training of new curators will be arranged, if needed, with an existing GOC mentor.
  • A representative of your group will need to [join GitHub](/docs/how-to-submit-requests/) in order to maintain your group's annotations. Once a representative is designated, the GOC will also generate internal files needed to submit your annotations to GO.

Not enough annotations to justify joining GO?

  • Submit one or just a few manual annotations by adding a new issue on the [GOC GitHub Annotation Tracker](https://github.com/geneontology/go-annotation/issues). Each of your annotations should include at least one key literature reference (PMID) in support of your assertions. Please state whether or not regular updates will be submitted about this annotation.

Automated annotations

If your group is interested in generating a large number of automated/electronic annotations, please be aware that InterPro2GO is the only source of Inferred from Electronic Annotation (IEA) annotations recognized by the GOC. Submit your transcripts or other data to UniProt, and they will automatically generate IEAs from your data. Once your organism is in UniProt, contact the GOC and we will gladly assist in curator training so your group can add manual annotations as well.

Reviewing GO annotations associated with a scientific article

Literature annotation involves capturing published information about the exact function of a gene product as GO annotations. This curation process is time-consuming but produces very high quality, species-specific annotation; the accuracy and uniform format of the annotations allow the information to be used in high-throughput experiments. GO curation may be best carried out by people who know the function of the gene product and the associated biology in great detail, for example, experimental scientists who are familiar with the published literature. If you are an expert in a gene product or a particular field, then you may like to [suggest modifications to the ontology structure](/docs/contributing-to-go-terms/) as well.

Below is a schematic diagram giving an introduction to the steps involved in literature-based GO annotation.

   http://geneontology.org/sites/default/files/public/diag-literature-annot.png

To begin, check whether there are existing annotations to the paper: open a Gene Ontology browser (e.g. [AmiGO](http://amigo.geneontology.org/amigo), [QuickGO](https://www.ebi.ac.uk/QuickGO/)) and enter a PubMed identifier (PMID) for the paper of interest in the 'Search' field.

If GO annotations are listed in the results:

  1. Check whether the paper has been annotated by GO curators.
  2. Click on the PMID and browse annotations associated with the paper.
    • If you agree that the annotations accurately represent the data, you are done!
    • If you think the annotations could be improved, write a new issue on the 'GOC GitHub Annotation Tracker', indicating that these annotations should be reviewed. Include:
      • a PMID
      • the name of the species investigated in the experiment that led to this publication
      • a statement of whether or not regular updates will be submitted about this annotation

If no results are listed using this PMID:

This means the paper has not been annotated by GO curators.

  • Write a new issue on the 'GOC GitHub Annotation Tracker', indicating that this is a new annotation. Include:
    • a PMID
    • the name of the species investigated in the experiment that led to this publication
    • a statement of whether or not regular updates will be submitted about this annotation
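To make sure no required item is forgotten, the text of a tracker issue can be assembled with a small helper like the one below. This is a hypothetical convenience sketch: the function name, the field labels, and the layout are assumptions rather than a format required by the GOC, and the PMID and species shown are placeholders.

```python
def format_tracker_issue(pmid, species, will_update, request="new annotation"):
    """Build issue text containing the items requested for the annotation tracker.

    pmid: PubMed identifier, with or without a leading 'PMID:'.
    species: name of the species investigated in the publication.
    will_update: True if regular updates will be submitted for this annotation.
    request: 'new annotation' or 'review of existing annotations'.
    """
    pmid = str(pmid).strip()
    if pmid.upper().startswith("PMID:"):
        pmid = pmid[5:].strip()
    if not pmid.isdigit():
        raise ValueError("a PMID must consist of digits only")
    if not species.strip():
        raise ValueError("the species name is required")

    updates = "will" if will_update else "will not"
    return "\n".join([
        f"Request: {request}",
        f"Reference: PMID:{pmid}",
        f"Species: {species.strip()}",
        f"Updates: regular updates {updates} be submitted about this annotation",
    ])


# Placeholder values for illustration only.
issue_text = format_tracker_issue("PMID:12345678", "Danio rerio", will_update=False)
```

The resulting text can be pasted into a new issue on the tracker. Rejecting a non-numeric PMID or an empty species name up front avoids submitting an issue that curators cannot act on.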

Reviewing GO annotations for a gene or protein:

To start, check whether there are existing annotations to the gene or protein of interest: open a Gene Ontology browser (e.g. AmiGO, QuickGO) and search for the gene or protein record of interest by entering it in the 'Search' field, then browse the associated annotations and follow the links to see the full list of annotations.