Column 16 discussion 12-12-09
For reference page on column 16 Annotation_Cross-Products, see: http://wiki.geneontology.org/index.php/Annotation_Cross_Products
On the call:
- Chris, Harold, Doug, Rama, Kimberly, Mary, Ruth, Pascale, Jim, Doug, Fiona, Lakshmi, Val, Yasmin, Emily
- All present were happy with the pre-prepared documentation below describing the usage and format of CL identifiers in column 16:
Use of Cell Type as an Annotation_Cross_Product in column 16.
CL identifiers would be included in column 16 for a GO annotation whenever that information is present in a particular paper. No judgment is made as to whether a gene product is involved in a particular process in just a particular cell type or in all cell types. In other words, curators simply annotate all available data in a paper.
Therefore it is incorrect to assume that a gene product used in a GO annotation that has a CL identifier in column 16 is involved in the curated process only in that annotated cell type. Similarly, it would be a mistake to conclude that lack of a CL co-annotation indicates that a given gene product is involved in a process in all cell types where it is found. The only correct interpretation of a GO annotation with a CL co-annotation is that in one particular experiment a given gene product was found to be involved in a particular process in a particular cell type.
Annotation Format of column 16 for Cell Type:
• If CL is used to refine a CC annotation, then the relation (for now) must be part_of
• If CL is used to refine a BP or MF, then the relation (for now) must be occurs_in
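The two rules above lend themselves to a mechanical check. The following is a minimal sketch, assuming the one-letter aspect codes used in GAF files (C, P, F); the lookup table and function name are hypothetical and not part of any GO tooling.

```python
# Hypothetical lookup: the relation a CL identifier must carry (for now),
# keyed by GAF aspect code: C = cellular component,
# P = biological process, F = molecular function.
REQUIRED_CL_RELATION = {
    "C": "part_of",
    "P": "occurs_in",
    "F": "occurs_in",
}


def cl_relation_is_valid(aspect: str, relation: str) -> bool:
    """Return True if `relation` is the one required for a CL refinement."""
    return REQUIRED_CL_RELATION.get(aspect) == relation
```

An unknown aspect code simply fails the check rather than raising an error.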
Simple annotation for Column 16
|Gene (col 2/3)||Term (col 5)||Ref (col 6)||Ext (col 16)|
|TLR4||cell surface (GO:0005887)||PMID:nnn||part_of(CL:0000576)|
|CREB||gluconeogenesis (GO:0006094)||PMID:nnnn||occurs_in(CL:0000182)|
Where a protein has multiple cellular locations, CL identifiers should be separated by a pipe (|):
e.g. part_of(CL:0000127) | part_of(CL:0000236)
N.B. no meaning is attached to the order in which the CL identifiers are listed in column 16.
- Chris: Two annotations cannot differ solely in the contents of column 16, because processing this field will be optional for users. All information should therefore be added to one line, with separate statements in column 16 separated by pipes (|).
- multiple CL identifiers should always be separated by a pipe.
- Cell Type location should not be inferred from investigations that use immortalized cell lines. Such cell lines should be treated as an experimental tool rather than an indication of the biological context of function. As the process of immortalization is known to involve multiple genetic changes a curator should never assume that the studied process is carried out in the equivalent normal cell type. Added subsequently to call: Edimmer 11:03, 14 January 2010 (UTC)
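Chris's rule above, that annotations differing only in column 16 belong on a single line, could be applied when writing a file. This is a minimal sketch under the assumption that each row has already been split into a tuple of its other columns plus the column 16 string; the function name and row layout are hypothetical.

```python
def merge_column_16(rows):
    """Merge rows that are identical outside column 16 into one row.

    Each row is a (core, ext) pair: `core` is a tuple holding every
    other column, `ext` is the column 16 string (possibly empty).
    Distinct column 16 statements are joined with a pipe.
    """
    merged = {}  # dicts keep insertion order, so row order is preserved
    for core, ext in rows:
        statements = merged.setdefault(core, [])
        if ext and ext not in statements:
            statements.append(ext)
    return [(core, " | ".join(statements)) for core, statements in merged.items()]
```

Because no meaning is attached to the order of statements, the sketch simply keeps them in the order they were first seen.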
Kimberly, Doug: WormBase/ZFIN would like to include OBO identifiers from life stage ontologies; it would be informative to represent that a gene product is located in a certain cell type at a particular life stage. This combination of one CL identifier and one LS identifier should be separated by a comma:
|Gene (col 2/3)||Term (col 5)||Ref (col 6)||Ext (col 16)|
|GeneXYZ||cell surface (GO:0005887)||PMID:nnn||part_of(CL:0000576), occurs_in(LS:00000xyz)|
If GeneXYZ was found to be located in two cell types at the same or different life stages, the second piece of cell type/LS information would be added to column 16, but separated from the first CL/LS statement by a pipe, for instance:
part_of(CL:0000576), occurs_in(LS:00000xyz) | part_of(CL:0000578), occurs_in(LS:00000xyz)
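A consumer of column 16 would need to undo this nesting: pipes separate independent statements, and commas join the expressions within one statement. The following is a minimal sketch of such a parser, assuming every expression has the form relation(identifier) as in the examples above; the function name is hypothetical.

```python
import re

# One relation(identifier) expression, e.g. part_of(CL:0000576).
_EXPRESSION = re.compile(r"^\s*(\w+)\(\s*([^()\s]+)\s*\)\s*$")


def parse_column_16(value: str) -> list[list[tuple[str, str]]]:
    """Split a column 16 string into statements of (relation, identifier) pairs.

    Pipes separate independent statements; commas join the expressions
    that belong to one statement.
    """
    statements = []
    for statement in value.split("|"):
        if not statement.strip():
            continue  # tolerate an empty field or a stray pipe
        pairs = []
        for expression in statement.split(","):
            match = _EXPRESSION.match(expression)
            if match is None:
                raise ValueError(f"cannot parse expression: {expression!r}")
            pairs.append((match.group(1), match.group(2)))
        statements.append(pairs)
    return statements
```

For the two-cell-type example above, the result is a list of two statements, each holding one part_of pair and one occurs_in pair.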
Chris and Harold: Curators should note that some external ontologies may not be very quick to respond to term requests, although this should not be an issue with Cell Type nor with many of the MOD-maintained ontologies.
New GO term requests vs. post-composition in Column 16
Doug: use of column 16 is likely to be very popular with ZFIN curators, as they would no longer need to wait for GO to create a term.
Chris: Terms should still be requested, even if column 16 can provide this information. However, it is difficult to provide unified guidelines. Including life stages in a GO term seems to introduce too much specificity, but GO terms describing a specific cell type migration could still be created.
Doug: perhaps GO editors could assess the popularity of a particular term + column 16 combination as an aid to decide if a new term should be created.
Chris: There are different types of 'targets'
- target of enzymatic activity
- target of binding
- target of transcriptional regulation
- targets should be identified either via a UniProtKB accession or a MOD identifier.
- ideally, protein targets should be indicated via a protein identifier, while targets of transcriptional regulation should be gene identifiers. However, it is undecided how strongly we would require groups to choose between protein and gene identifiers.
Pascale: Could there also be direct or indirect targets, e.g. the target of a signaling pathway?
Chris: such downstream targets could be indicated by the relationship type used.
Val: I'm interested in annotating the targets of protein phosphorylation. However, this data would come from multiple PMIDs, so multiple annotation lines would be required, as a single GO annotation cannot cite more than one PMID. There is a concern, however, about introducing large amounts of redundancy in the annotation set if I decide to capture all the targets of a kinase's phosphorylation activity in column 16.
Rama: Perhaps GO is not the best resource to capture such detail; curators need to judge when to stop annotating in GO and when to use other resources, e.g. BioGRID.
Chris: I feel all of these targets should be captured in the context of GO, even if 20/30 targets end up being included. Although users who ignore col. 16 would see redundant annotations, redundancy is fine. However perhaps in future we could think of alternative formats for capturing such data (new GO grant?)
Emily: Redundancy in gene association files seems acceptable, however curators might want to look at ensuring that their web display of their annotations is filtered by default to reduce such redundancies.
Rama and Harold: We would only annotate those low-scale, 'definitive' experiments: not results from high-throughput assays.
Harold: How could specific promoters be identified, for instance if a paper describes the specific interaction between a gene product and a certain promoter? Could genome co-ordinates be captured (although I am aware these could change over time)?
Chris: Could use a gene ID, as the experiment usually infers that the gene product specifically regulates the transcription of the gene normally associated with the promoter.
Rama: Who intends to submit GAF files with data in column 16?
Val: to send Pombe trial file to Chris.
Chris: AmiGO labs in release 1.8 should start to display column 16 data
Discussion on Question 2 submitted by GOA (see below)
Val: Pombe would also want to submit data similar to that described in this example (Question 2).
Rama: What is the difference between annotating the processes that two binding proteins contribute towards, and annotating to protein complexes and then the processes that individual subunits contribute towards?
Emily, Val: This is a binding interaction between proteins from two different cells. We'd be interested in capturing the dependencies of biological processes on interactions.
Emily: Would there be a redundancy in the GAF file format for protein binding annotations? The interactor in a protein binding annotation is currently often captured in the 'with' column. Should this binding partner be again captured in column 16?
Chris: Yes; this redundancy in an annotation is fine.
Emily: Would it be correct to transfer information in column 16 to orthologs when carrying out ISS transfers of experimentally-verified annotations?
General agreement: Yes, should be possible to do, however curators must manually assess the correctness of transferring column 16 data to orthologs each time such ISS annotations are made.
When to use has_output/has_input
- Protein A is a substrate of Protein B
- therefore Protein B has_input Protein A
This mechanism is very similar to the format applied by Reactome. Reactome, however, can also specify the state of a protein (e.g. the phosphorylated state or location), whereas many groups will not be able to capture such detailed information (PRO users will be the exception here).
Therefore, if annotation groups cannot indicate the altered nature of a gene product, they should always apply the relationship has_input as the standard default for targets of biochemical activity (phosphorylation/peptidase/protein stabilization).
For 'targets' of transcriptional regulation, the gene identifier in column 16 should be prefixed with the relationship 'has_output', as the standard default.
Normally the 'has_catalyst' relationship would not be applied in GO annotation.
If we need to capture direct/indirect targets, then child terms of these relationships can be created:
- has_direct_output
- has_indirect_output
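The defaults agreed in this section (has_input for targets of biochemical activity, has_output for targets of transcriptional regulation) can be expressed as a small lookup. This is a minimal sketch; the category names and the function are hypothetical labels chosen for illustration, not GO vocabulary.

```python
# Hypothetical target categories mapped to the default relation.
DEFAULT_RELATION = {
    # e.g. phosphorylation, peptidase activity, protein stabilization
    "biochemical_activity": "has_input",
    "transcriptional_regulation": "has_output",
}


def target_extension(target_kind: str, target_id: str) -> str:
    """Format a column 16 expression using the default relation for the target kind."""
    relation = DEFAULT_RELATION[target_kind]  # KeyError for an unknown kind
    return f"{relation}({target_id})"
```

If Protein A is a substrate of Protein B, the annotation for Protein B would carry `target_extension("biochemical_activity", "ProteinA")`, which yields `has_input(ProteinA)`.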
General Consensus: advice should be given to CHADO users on how to include column 16 into the database.
Chris: Will concentrate on creating the next round of documentation to describe the column 16 format more precisely, especially with regard to 'annotation targets'; advice will then be provided for CHADO.
Questions submitted by the GOA team before the meeting
Pre-meeting notes from Chris follow each question (shown in green on the original page).
1. Is there any timeline for inclusion of additional cross-references to other OBO ontologies/details on annotation targets?
Individual groups should not be restricted, but I suggest that for simplicity we focus efforts on a smaller set of well-understood ontologies.
- Certain anatomical ontologies (although there are myriad issues here we should discuss)
And of course gene products, for which we should use UniProtKB for proteins and I'm not sure what for RNA products. Individual MOD IDs?
By focusing on Cell we make things simpler, as we can say a lot with just two relations: occurs_in, for refining a biological process term to the cell where that process takes place, and part_of, for refining a subcellular localization with the cell. But if this group wants to focus on other areas where they feel additional expressivity is required, we can discuss that.
The original col16 documentation is too ontology-centric and does not pay enough attention to the gene product use cases. Apologies for this, we can discuss this more during the call.
2. In the proposed format for column 16, I can't see how I could indicate the biological processes which are dependent on a certain protein-protein interaction? Perhaps we need a new relationship such as 'required_for', to be used in conjunction with BP terms, which could be added into column 16 for the annotation to the 'protein binding' term? Example: The biological processes GO:0008284, GO:0050870, GO:0022409, occur when Q9P1W8 interacts with Q08722 (PMID:15383453).
The relationship would be "requires" rather than "required_for", because it is the relationship between the process/function in col5 and the entity referenced in col 16.
I would be interested in discussing this particular example and possibly some others as I think the gene product issue is an important one. What does "requires" mean here? Do the two gene products have to come together to form a complex or join a larger complex? Does one participate in signaling?
- Q9P1W8 = SIRP beta2
- Q08722 = CD47
- GO:0008284 ! positive regulation of cell proliferation
- GO:0050870 ! positive regulation of T cell activation
- GO:0022409 ! positive regulation of cell-cell adhesion
Signal-regulatory proteins (SIRPs) are transmembrane glycoproteins belonging to the immunoglobulin (Ig) superfamily that are expressed in the immune and central nervous systems. SIRPalpha binds CD47 and inhibits the function of macrophages, dendritic cells, and granulocytes, whereas SIRPbeta1 is an orphan receptor that activates the same cell types. A recently identified third member of the SIRP family, SIRPbeta2, is as yet uncharacterized in terms of expression, specificity, and function. Here, we show that SIRPbeta2 is expressed on T cells and activated natural killer (NK) cells and, like SIRPalpha, binds CD47, mediating cell-cell adhesion. Consequently, engagement of SIRPbeta2 on T cells by CD47 on antigen-presenting cells results in enhanced antigen-specific T-cell proliferation.
If the exact mechanism is not known then it is safe to simply use has_participant. E.g.
col2 = Q9P1W8 col5 = GO:0022409 col16 = has_participant(Q08722)
(see next question)
3. We were confused on how to use 'has_input' and 'has_output'. There is not much information on these two relationships on this wiki page. In particular could we have examples describing the use of the 'has_output' relationship?
The idea here is that a process or function must have one or more participants - these are physical objects such as ions, molecules, proteins, RNAs, cell components, organs, etc. Participants can play different roles in a process, such as input, output or catalyst. The relation hierarchy is:
All participants must be present at some point during the process. If a participant is present at the beginning of the process and is changed in some way, then it is an input. If it is present at the end and has been changed in some way by the process, then it is an output.
This is very similar to the Reactome model.
The meanings become more specific when paired with a biological process.
- biosynthesis - the output is what is made from simpler parts during the process
- catabolism - the input is what is broken down during the process
However, this is not always clear cut. Consider the case of binding to a protein such as importin. Is this input or output? In the Reactome model, the input would be an importin in the 'unbound' state and the output would be importin in the 'bound' state. But this is harder to state within the confines of the GO model. What we need to do here is work out the core types of function and process we wish to use in compositions and come up with clear guidelines. For the binding case I suggest usage of has_input.
Note that in the majority of cases it is not wrong to state has_participant, any more than it is wrong to annotate higher up the GO DAG; it just doesn't communicate as much.
4. What relationship should be used if we would like to indicate the target of a certain molecular function or biological process? For instance:
The target of protein Q9BRA2's protein-disulfide reductase activity is P63167
Q9BRA2 TXNDC17 GO:0047134 protein-disulfide reductase activity PMID:18579519 IDA F Thioredoxin domain-containing protein 17 TXNDC17|TXNL5|IPI00646689|TXD17_HUMAN protein taxon:9606 20080627 UniProtKB
The definition of GO:0047134 ! protein-disulfide reductase activity is Catalysis of the reaction: protein-dithiol + NAD(P)+ = protein-disulfide + NAD(P)H + H+.
Here we have two inputs and two outputs. Again here I think has_input is appropriate.
5. Should there be annotation guidelines requiring that when a column 16 is used to link a GO MF to a CC term, it needs to be complemented by an annotation using the same PMID for the corresponding CC term?
Yes. We can discuss automatically filling these in as part of the GAF publishing pipeline, but I have a feeling people will be more comfortable with guidelines for what to do at annotation time; in that case, yes, it should be required to make the semi-redundant annotation.
6. Is it correct to assume that when a protein is the target of some transcription activity, we should indicate the target in column 16 in both the annotation to the process terms (e.g. positive regulation of transcription) and molecular function terms (e.g. transcription factor activity)?
If we are making just the F annotation, what is the relation that should be used for TF activity? I see now that the original documentation is too weak on the subject of gene products, and this is clearly an important use case.
Let's leave time to discuss this one during the call
Another question I feel will come up:
7. When do we use col 16 and when do we request a new term?