Annotation outreach group meeting 13th July 2007

From GO Wiki

Here are the categories of users that we have thought of but not yet considered in more detail:

1) Sequencing center- Has sequenced and made gene calls- wants to set up an IEA pipeline

2) Sequencing center- Has sequenced a bunch, but has not made gene calls, wants to hand over the sequence

3) Small lab- working on a particular area of biology (for example meiosis, immunology)- has a bunch of genes in a few closely related organisms (these genes have GenBank IDs), wants to give us the information, i.e., provide the papers/list of IDs etc. No resources to learn to annotate and keep the annotations

5) Somebody with a sequenced genome and with IDs and with infrastructure/resources (like CGD)- ready to do complete annotation

a. I am a MOD and I want to support my users by providing GO facilities.

If you want to use GO annotations

I am a microarray user and my array has some GO annotations and I’d like to add more by making my own annotations.

b. I am a microarray user and my array has some GO annotations and I’d like to add more by downloading annotations from a provider.

Some that we considered in more detail

4) Microarray results -

a. I am a microarray user and my array has no GO annotations.

Perhaps the user does not have ids for the genes?
Perhaps they only have ESTs?
Are the ESTs clustered?
Do they have stable ids over reclustering?
If ids are stable then electronic annotations to these sequences may be submitted to the consortium.

Is your software providing all the up to date annotations?

Many users get access to GO via proprietary software that is provided by their array provider. For example, if a person is using an Affymetrix chip then it is likely that they will be getting access to GO via Affymetrix's proprietary software. If a user finds that the genes on the chip are not fully annotated, it would be helpful to check how often the software they are using is updated with the latest annotations. It is possible that annotations are being released more often than the software is being updated. If this is the case then the user can write to Affymetrix and they will talk to us about how they can get updates more frequently.
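One way a user could check how current a set of annotations is would be to look at the date column of a gene_association file and compare it with the date of the annotations shipped with their software. The following is only a rough sketch: it assumes the tab-separated gene_association layout in which column 14 holds the annotation date as YYYYMMDD, and the function names and the sample row are made up for illustration. Note that the latest annotation date only tells you the file is at least that recent.

```python
from datetime import date, datetime


def latest_annotation_date(gaf_lines):
    """Return the most recent annotation date found in gene_association lines.

    Assumes tab-separated lines with the date (YYYYMMDD) in column 14.
    Returns None if no dated annotation line is found.
    """
    latest = None
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 14:
            continue  # not a complete annotation line
        try:
            annotated = datetime.strptime(columns[13], "%Y%m%d").date()
        except ValueError:
            continue  # date column is not in the expected format
        if latest is None or annotated > latest:
            latest = annotated
    return latest


def days_behind(software_date, gaf_lines):
    """Days between the software's annotation date and the newest annotation."""
    latest = latest_annotation_date(gaf_lines)
    if latest is None:
        return None
    return (latest - software_date).days


# Hypothetical example row; the identifiers are placeholders, not real data.
example_row = "\t".join([
    "DB", "ID1", "sym", "", "GO:0003677", "PMID:0", "IDA", "",
    "F", "", "", "protein", "taxon:1", "20070601", "DB",
])
print(days_behind(date(2007, 1, 1), ["!header comment", example_row]))
```

A large positive number suggests the software's annotations lag behind what is available and that it is worth asking the provider about their update schedule.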

If the genes on the array are still not fully annotated then it may be helpful to start doing some manual annotations. Please see the manual annotation flowchart.

If you want to send annotations to the consortium

If you have been doing manual annotation to complete the annotation of your arrayset, then you may wish to send your annotations to the consortium.

If you are making annotations please bear these general rules in mind.

General principles for sequence ids
You must have stable ids for your objects.
You must provide information on what the object is. Protein, nucleotide or whatever. It doesn't matter if a nucleotide sequence is a gene, a genome, or an EST as long as you know whether it is a nucleotide sequence or a protein. (Although the gene_association file column says you can say 'gene', you must say if the sequence id is protein or DNA.)
If a sequence id has become obsolete then you should be able to track down what has replaced it. What is the mechanism for that?
If people are submitting annotations to the GO consortium then they must have an internal rule that their object ids are never reused.
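The principles above (stable ids, a declared object type, traceable replacements for obsolete ids, and no reuse) could be enforced with a very small registry. This is a minimal sketch only, not a recommended implementation: the class name `IdRegistry`, the `LAB` prefix and the id format are all hypothetical, and a real group would keep this in a database rather than in memory.

```python
class IdRegistry:
    """Issues object ids that are never reused and tracks replacements."""

    ALLOWED_TYPES = ("protein", "nucleotide")

    def __init__(self, prefix="LAB"):
        self.prefix = prefix
        self._next_number = 1  # only ever increases, so ids are never reused
        self._records = {}

    def new_id(self, object_type):
        """Issue a fresh id for a protein or nucleotide object."""
        if object_type not in self.ALLOWED_TYPES:
            raise ValueError("object type must be protein or nucleotide")
        object_id = f"{self.prefix}:{self._next_number:07d}"
        self._next_number += 1
        self._records[object_id] = {
            "type": object_type,
            "obsolete": False,
            "replaced_by": None,
        }
        return object_id

    def obsolete(self, object_id, replaced_by=None):
        """Retire an id, optionally recording the id that replaces it."""
        if replaced_by is not None and replaced_by not in self._records:
            raise KeyError(f"unknown replacement id: {replaced_by}")
        record = self._records[object_id]
        record["obsolete"] = True
        record["replaced_by"] = replaced_by

    def current(self, object_id):
        """Follow replacements to a live id; None if retired with no successor."""
        seen = set()
        while True:
            if object_id in seen:
                raise ValueError("replacement chain contains a cycle")
            seen.add(object_id)
            record = self._records[object_id]
            if not record["obsolete"]:
                return object_id
            if record["replaced_by"] is None:
                return None
            object_id = record["replaced_by"]


registry = IdRegistry()
old_id = registry.new_id("protein")
new_id = registry.new_id("protein")
registry.obsolete(old_id, replaced_by=new_id)
print(old_id, "->", registry.current(old_id))
```

Retired ids stay in the registry forever, which is what makes it possible to answer "what replaced this id?" later on.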

Some thoughts on how users and flowcharts could intersect:

small lab working on an area: If you have protein sequence in a text file, maybe from a load of papers you read.
you can blastP and get e-annotation.
you can run through interproscan and get the domains interpro2go annotations.
you can manually annotate them.
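The interpro2go step could look roughly like the sketch below. It assumes that InterProScan has already been run and that its domain hits have been collected into a dictionary, and it assumes mapping lines of the form `InterPro:IPR... description > GO:term name ; GO:id`. The function names, the protein id and the sample mapping line are for illustration only.

```python
def parse_interpro2go(lines):
    """Build a dict of InterPro id -> list of GO ids from interpro2go lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip comments and blank lines
        left, separator, right = line.partition(" > ")
        if not separator:
            continue  # not a mapping line
        interpro_id = left.split()[0].replace("InterPro:", "", 1)
        go_id = right.rsplit(";", 1)[-1].strip()
        mapping.setdefault(interpro_id, []).append(go_id)
    return mapping


def electronic_annotations(domain_hits, mapping):
    """Turn domain hits into (protein id, GO id, evidence, with) tuples.

    domain_hits: dict of protein id -> list of InterPro ids found in it.
    Domains with no GO mapping simply produce no annotation.
    """
    annotations = []
    for protein_id, interpro_ids in domain_hits.items():
        for interpro_id in interpro_ids:
            for go_id in mapping.get(interpro_id, []):
                annotations.append(
                    (protein_id, go_id, "IEA", f"InterPro:{interpro_id}")
                )
    return annotations


# Illustrative mapping line and hypothetical protein id.
sample_mapping = [
    "!example interpro2go content",
    "InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677",
]
hits = {"LAB:0000001": ["IPR000003"]}
for annotation in electronic_annotations(hits, parse_interpro2go(sample_mapping)):
    print("\t".join(annotation))
```

The resulting tuples carry the IEA evidence code and record the InterPro domain that supports each annotation, so they can be told apart from manual annotations later.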

You have a set of ESTs:
You can cluster and apply ids.
Then run blastX, or run a gene prediction programme and then blastP.
you can run interpro and it will try to find the longest orf.
You can also do manual annotations to the predicted proteins.

If you have a genome sequence
you assemble the genome sequence and do gene calls.
you get a cds sequence or a protein sequence or both.
then you can take the cds and do blastX or interproscan.
Or take protein predictions and interproscan or blastP and then manual annotations.

I have a microarray set ...
cdna or oligos?
do they have ids?
How do they relate to the genes?
Do the genes have GO annotations?
Is there a mod for your species that does GO?
Have you talked to them?
Can you get more up-to-date annotations than those provided with your tool?

A peptide sequence.
what gene is it?
Can you map to known genes with ids?
can you retrieve the annotations or make annotations.

Some thoughts on how people perceive GO annotations:

People think that Affymetrix make the annotations and they don't realize that the annotations are made by the consortium.

We could make a list of questions that users could ask of their software provider:

1) When was the last time that you updated the annotations that you provide with your software?
2) If you compare the number of catalytic activities in e.g. the EC v GO, are you using files of the same date?
3) What is the source of your annotations?
4) When were the ontologies updated last?
5) In your analysis, are your blast e-values reasonable?
Users should note:
There are only funded annotation groups for a very few species 
and if you are not working on these species then there will not 
be manual annotations. Groups like Affymetrix do not do 
annotation. When people are assembling their genomes they should 
include electronic GO annotation in their pipeline as standard. 

Quality of research is affected by all the normal quality things like blast stringency.
We could write a paper that is a general discussion of these kinds of caveats rather than a point-by-point list, as the details will always change but the general principles apply. We could address the questions of quality in electronic annotations and how granular/accurate they are.

This paper would be addressing criticisms like an editorial, perhaps in a review journal where they appreciate editorials. Something that talks to people doing arrays. BMC Bioinformatics. It is open access and so heavily cited. BMC Genomics as well.

P.S. are any groups doing IEA from ISS? This should not be allowed.

These questions could be included on the tools pages and the tools questions could be included in the paper.
We should publish these questions as guidance for the users.
This is a bit like a tools review but puts the onus on the users and gives them the opportunity to demand the things they need.