Difference between revisions of "Column 16 discussion 12-12-09"


Revision as of 05:04, 16 December 2009

For reference page on column 16 Annotation_Cross-Products, see: http://wiki.geneontology.org/index.php/Annotation_Cross_Products


On the call

  • Chris, Harold, Doug, Rama, Kimberly, Mary, Ruth, Pascale, Jim, Doug, Fiona, Lakshmi, Val, Yasmin, Emily

Cell Type

  • All present were happy with the pre-prepared documentation below describing the usage and format of CL identifiers in column 16:

Use of Cell Type as an Annotation_Cross_Product in column 16.

CL identifiers would be included in column 16 for a GO annotation whenever that information is present in a particular paper. No judgment is made as to whether a gene product is involved in a particular process in just a particular cell type or in all cell types. In other words, curators simply annotate all available data in a paper.

Therefore it is incorrect to assume that a gene product used in a GO annotation that has a CL identifier in column 16 is involved in the curated process only in that annotated cell type. Similarly, it would be a mistake to conclude that lack of a CL co-annotation indicates that a given gene product is involved in a process in all cell types where it is found. The only correct interpretation of a GO annotation with a CL co-annotation is that in one particular experiment a given gene product was found to be involved in a particular process in a particular cell line.

Annotation Format of column 16 for Cell Type:

• If CL is used to refine a CC annotation, then the relation (for now) must be part_of

• If CL is used to refine a BP or MF, then the relation (for now) must be occurs_in

Simple annotation for Column 16

Gene (col 2/3)   Term (col 5)                   Ref (col 6)   Ext (col 16)
TLR4             cell surface (GO:0005887)      PMID:nnn      part_of(CL:0000576)
CREB             gluconeogenesis (GO:0006094)   PMID:nnnn     occurs_in(CL:0000182)

Where a protein has multiple cellular locations, CL identifiers should be separated by a pipe (|):

e.g. part_of(CL:0000127) | part_of(CL:0000236)

N.B. no meaning is attached to the order in which the CL identifiers are listed in column 16.

  • Chris: an annotation cannot differ solely in the contents of column 16, as it will be optional for users to process this field. Therefore additional information will be included via the pipe (|) symbol.
  • Multiple CL identifiers should always be separated by a pipe.
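
The relation rule for CL identifiers given above (part_of for CC annotations, occurs_in for BP and MF annotations) could be checked mechanically. The following Python sketch is only an illustration: the function name and the use of the GAF aspect codes C, P and F as keys are assumptions, not part of the agreed format.

```python
# Relation required when a CL identifier refines an annotation, keyed by
# GO aspect code (C = cellular component, P = biological process,
# F = molecular function). The mapping follows the "for now" rule in the
# minutes; the function name is hypothetical.
REQUIRED_CL_RELATION = {
    "C": "part_of",
    "P": "occurs_in",
    "F": "occurs_in",
}


def cl_relation_ok(aspect: str, relation: str) -> bool:
    """Return True if `relation` is the one required for a CL entry on this aspect."""
    return REQUIRED_CL_RELATION.get(aspect) == relation


print(cl_relation_ok("C", "part_of"))    # CC annotation refined with part_of
print(cl_relation_ok("P", "part_of"))    # BP annotation must use occurs_in instead
```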

Other Ontologies

Kimberly, Doug: WormBase/ZFIN would like to use a life stage ontology. Where a curator would like to represent that a gene product is located in a certain cell type at a particular life stage, the two pieces of information would be separated by a comma, e.g. part_of(CL:0000576), occurs_in(LS:00000xyz).

If GeneXYZ was located in two cell types at the same or different life stages, the second piece of information would be separated in column 16 by a pipe, for instance:

part_of(CL:0000576), occurs_in(LS:00000xyz) | part_of(CL:0000578), occurs_in(LS:00000xyz)

Chris and Harold: It should be noted that some external ontologies may not be very quick to respond to term requests, although this should not be an issue with Cell Type, or probably with many of the MOD-maintained ontologies.
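
The comma and pipe conventions for column 16 described in these notes can be read as a two-level structure: pipes separate alternative contexts, and commas combine entries within one context. A minimal Python sketch of a parser under that reading follows; the function name and the exact regular expression are assumptions made for illustration, not an official grammar.

```python
import re

# One entry has the shape relation(ID), e.g. part_of(CL:0000576).
ENTRY = re.compile(r"^\s*(\w+)\(([^()\s]+)\)\s*$")


def parse_column16(value: str):
    """Parse a column 16 string into a list of groups.

    Each pipe-separated group becomes a list of (relation, identifier)
    tuples taken from its comma-separated entries.
    """
    groups = []
    for group in value.split("|"):
        if not group.strip():
            continue  # tolerate an empty field or a stray pipe
        entries = []
        for part in group.split(","):
            match = ENTRY.match(part)
            if match is None:
                raise ValueError(f"not a relation(ID) entry: {part!r}")
            entries.append((match.group(1), match.group(2)))
        groups.append(entries)
    return groups


example = (
    "part_of(CL:0000576), occurs_in(LS:00000xyz) | "
    "part_of(CL:0000578), occurs_in(LS:00000xyz)"
)
for group in parse_column16(example):
    print(group)
```

Because no meaning is attached to the order of pipe-separated groups, a consumer comparing two column 16 values would probably want to compare the groups as sets rather than as lists.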

New GO term requests vs. post-composition in Column 16

Doug: Use of column 16 is likely to be very popular with ZFIN curators, as they would no longer need to wait for GO to create a term.

Chris: Terms should still be requested, although it is difficult to provide unified guidelines. It would appear that including life stages in a GO term introduces too much specificity; however, GO terms describing the migration of a specific cell type should still be created in GO.

Doug: Perhaps GO editors could assess the popularity of a particular term + column 16 combination as an aid in deciding whether a new term should be created.
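
Doug's suggestion of assessing the popularity of a term + column 16 combination amounts to a simple tally. The Python sketch below assumes annotations are available as (gene, term, ref, ext) tuples; the function name and the gene names GeneA to GeneD are hypothetical, and the identifiers are reused from the examples in these notes.

```python
from collections import Counter


def combination_counts(rows):
    """Count how often each (GO term, column 16 value) pair is used.

    Rows with an empty column 16 are ignored.
    """
    return Counter((term, ext) for _gene, term, _ref, ext in rows if ext)


# Hypothetical annotation rows: (gene, GO term, reference, column 16).
rows = [
    ("GeneA", "GO:0005887", "PMID:nnn", "part_of(CL:0000576)"),
    ("GeneB", "GO:0005887", "PMID:nnn", "part_of(CL:0000576)"),
    ("GeneC", "GO:0006094", "PMID:nnnn", "occurs_in(CL:0000182)"),
    ("GeneD", "GO:0006094", "PMID:nnnn", ""),
]

for (term, ext), count in combination_counts(rows).most_common():
    print(count, term, ext)
```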

Annotation Targets

Chris: There are different types of 'targets':

  • target of enzymatic activity
  • target of binding
  • target of transcriptional regulation

  • Targets should be identified either via a UniProtKB accession or a MOD identifier.
  • Ideally, protein targets should be indicated via a protein identifier, while targets of transcriptional regulation could be gene identifiers.
  • Undecided as to how strongly to require groups to choose between protein and gene identifiers.

Pascale: Could also be direct or indirect targets; e.g. the target of a signaling pathway.

Chris: Such downstream targets could be indicated by the relationship type used.

Val: Interested in annotating the targets of protein phosphorylation. However, this data would come from multiple PMIDs, so multiple annotation lines would be required, as a single annotation cannot cite more than one PMID. There is a concern, however, about introducing large amounts of redundancy into the annotation set if I decide to capture all the targets of a kinase's phosphorylation activity.

Rama: Perhaps GO is not the best resource to capture such detail; curators need to judge when to stop annotating in GO and when to use other resources, e.g. BioGRID.

Chris: I feel all of these targets should be captured in the context of GO, even if 20 or 30 targets are included. Although users who ignore col. 16 would see redundant annotations, redundancy is fine. However, perhaps in future we could think of alternative formats for capturing such data (new GO grant?).

Emily: Redundancy in gene association files seems acceptable, however curators might want to look at ensuring that their web display of their annotations reduces such redundancies.
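
One way a display layer might reduce the redundancy Emily describes is to group annotation lines by gene product and GO term, then list each reference with its column 16 target underneath. The following Python sketch assumes rows are simple (gene, term, ref, ext) tuples; the function name, the gene name KinaseX, and the PMID and target placeholders are hypothetical and not taken from any real file.

```python
def collapse_for_display(rows):
    """Group (gene, term, ref, ext) rows by gene and term for display.

    Returns a dict mapping (gene, term) to the list of (ref, ext) pairs,
    in the order the rows were seen.
    """
    grouped = {}
    for gene, term, ref, ext in rows:
        grouped.setdefault((gene, term), []).append((ref, ext))
    return grouped


# Hypothetical kinase annotated to protein kinase activity (GO:0004672),
# with three targets taken from three different papers.
rows = [
    ("KinaseX", "GO:0004672", "PMID:nnn1", "has_input(TARGET_A)"),
    ("KinaseX", "GO:0004672", "PMID:nnn2", "has_input(TARGET_B)"),
    ("KinaseX", "GO:0004672", "PMID:nnn3", "has_input(TARGET_C)"),
]

for (gene, term), evidence in collapse_for_display(rows).items():
    print(gene, term)
    for ref, ext in evidence:
        print("   ", ref, ext)
```

The gene association file itself keeps one line per reference; only the presentation is collapsed.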

Rama and Harold: only annotate those low-scale, 'definitive' experiments: not results from high-throughput assays.

Harold: How could specific promoters be identified, for instance if a paper describes the specific interaction between a gene product and a certain promoter? Could genome coordinates be captured, even though these could change over time?

Chris: The Gene ID could be used, as the experiment is usually inferring that the gene product specifically regulates the transcription of the gene normally associated with the promoter.

Rama: Who intends to submit GAF files with data in column 16?

    See: http://wiki.geneontology.org/index.php/GAF_file_2.0

Val: to send Pombe trial file to Chris.

Chris: AmiGO labs in release 1.8 should start to display column 16 data

Discussion on Question 2 submitted by GOA (see below)

Val: Pombe would also want to submit similar data as described in this example.

Rama: What is the difference between annotating the processes that 2 binding proteins contribute towards, and annotating to protein complexes, and the processes that individual members contribute towards?

Emily, Val: This is a binding interaction between proteins from two different cells. We'd be interested in capturing the dependencies of biological processes on interactions.

Emily: Would there be a redundancy in GAF files for protein binding annotations? The interactor in a protein binding annotation is currently captured in the 'with' column. Should this binding partner be again captured in column 16?

Chris: Yes; this redundancy in an annotation is fine.

Emily: Would it be correct to transfer information in column 16 to orthologs when carrying out ISS transfers of experimentally-verified annotations?

General agreement: Yes, this should be possible; however, curators must manually assess the correctness of transferring column 16 data to orthologs each time such ISS annotations are made.


When to use has_output/has_input

- Protein A is a substrate of Protein B -- Protein B has_input Protein A

This mechanism is very similar to that used by Reactome. However, Reactome can specify the state of a protein (e.g. the phosphorylated state or the location), whereas many groups will not be able to capture such detailed information (PRO users will be the exception here).

Therefore, if annotation groups cannot indicate the altered nature of a gene product, they should always apply the relationship has_input as the standard default for targets of all biochemical activity.

For 'targets' of transcriptional regulation, the gene identifier in column 16 should be prefixed with the relationship 'has_output', as the standard default.

Normally the 'has_catalyst' relationship would not be applied in GO annotation.

If we need to capture direct/indirect targets, then child terms of these relationships can be created:

  • has_direct_output
  • has_indirect_output
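
The default relationships agreed in this section (has_input for targets of biochemical activity, has_output for targets of transcriptional regulation) can be expressed as a small lookup. The Python sketch below is illustrative only: the function name and the two target-kind labels are invented for the example, and P63167 is the target identifier from Question 4 of the pre-meeting questions.

```python
# Default relationship per kind of annotation target, as agreed on the call.
# The keys are hypothetical labels chosen for this sketch.
DEFAULT_TARGET_RELATION = {
    "biochemical_activity": "has_input",
    "transcriptional_regulation": "has_output",
}


def target_extension(target_kind: str, target_id: str) -> str:
    """Build a column 16 entry for a target using the default relationship."""
    relation = DEFAULT_TARGET_RELATION[target_kind]
    return f"{relation}({target_id})"


# Target of Q9BRA2's protein-disulfide reductase activity (Question 4).
print(target_extension("biochemical_activity", "P63167"))
```

Child relationships such as has_direct_output would need additional keys if they are ever created.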

General Consensus: advice should be given to CHADO users.

Chris: Will concentrate on creating the next set of documentation describing the column 16 format, especially with regard to 'annotation targets'; advice will then be provided for CHADO.

Questions submitted by the GOA team before the meeting

Pre-meeting notes from Chris follow each question (shown in green on the original wiki page).

1. Is there any timeline for inclusion of additional cross-references to other OBO ontologies/details on annotation targets?

Individual groups should not be restricted, but I suggest that for simplicity we focus efforts on a smaller set of well-understood ontologies:

  • Cell
  • Certain anatomical ontologies (although there are myriad issues here we should discuss)

And of course gene products, for which we should use UniProtKB for proteins and I'm not sure what for RNA products. Individual MOD IDs?

By focusing on Cell we make things simpler, as we can say a lot with just two relations: occurs_in, for refining a biological process term to the cell where that process takes place, and part_of, for refining a subcellular localization with the cell. But if this group wants to focus on other areas where they feel additional expressivity is required, we can discuss that.

The original col16 documentation is too ontology-centric and does not pay enough attention to the gene product use cases. Apologies for this, we can discuss this more during the call.

2. In the proposed format for column 16, I can't see how I could indicate the biological processes which are dependent on a certain protein-protein interaction? Perhaps we need a new relationship such as 'required_for', to be used in conjunction with BP terms, which could be added into column 16 for the annotation to the 'protein binding' term? Example: The biological processes GO:0008284, GO:0050870, GO:0022409, occur when Q9P1W8 interacts with Q08722 (PMID:15383453).

The relationship would be "requires" rather than "required_for", because it is the relationship between the process/function in col5 and the entity referenced in col 16.

I would be interested in discussing this particular example and possibly some others as I think the gene product issue is an important one. What does "requires" mean here? Do the two gene products have to come together to form a complex or join a larger complex? Does one participate in signaling?

  • Q9P1W8 = SIRP beta2
  • Q08722 = CD47
  • GO:0008284 ! positive regulation of cell proliferation
  • GO:0050870 ! positive regulation of T cell activation
  • GO:0022409 ! positive regulation of cell-cell adhesion

Signal-regulatory proteins (SIRPs) are transmembrane glycoproteins belonging to the immunoglobulin (Ig) superfamily that are expressed in the immune and central nervous systems. SIRPalpha binds CD47 and inhibits the function of macrophages, dendritic cells, and granulocytes, whereas SIRPbeta1 is an orphan receptor that activates the same cell types. A recently identified third member of the SIRP family, SIRPbeta2, is as yet uncharacterized in terms of expression, specificity, and function. Here, we show that SIRPbeta2 is expressed on T cells and activated natural killer (NK) cells and, like SIRPalpha, binds CD47, mediating cell-cell adhesion. Consequently, engagement of SIRPbeta2 on T cells by CD47 on antigen-presenting cells results in enhanced antigen-specific T-cell proliferation.

If the exact mechanism is not known then it is safe to simply use has_participant. E.g.

 col2 = Q9P1W8
 col5 = GO:0022409
 col16 = has_participant(Q08722)

(see next question)

3. We were confused on how to use 'has_input' and 'has_output'. There is not much information on these two relationships on this wiki page. In particular could we have examples describing the use of the 'has_output' relationship?

The idea here is that a process or function must have one or more participants - these are physical objects such as ions, molecules, proteins, RNAs, cell components, organs, etc. Participants can play different roles in a process, such as input, output or catalyst. The relation hierarchy is:

All participants must be present at some point during the process. If a participant is present at the beginning of the process, and it is changed in some way, then it is an input. If it is present at the end, and has been changed in some way by the process, then it is an output.

This is very similar to Reactome.

The meanings become more specific when paired with a biological process.

  • biosynthesis - the output is what is made from simpler parts during the process
  • catabolism - the input is what is broken down during the process

However, this is not always clear cut. Consider the case of binding to a protein such as importin. Is this input or output? In the Reactome model, the input would be an importin in the 'unbound' state and the output would be importin in the 'bound' state. But this is harder to state within the confines of the GO model. What we need to do here is work out the core types of function and process we wish to use in compositions and come up with clear guidelines. For the binding case I suggest usage of has_input.

Note that in the majority of cases it is not wrong to state has_participant, any more than it is wrong to annotate higher up the GO DAG; it just doesn't communicate as much.

4. What relationship should be used if we would like to indicate the target of a certain molecular function or biological process? For instance:

The target of protein Q9BRA2's protein-disulfide reductase activity is P63167

Q9BRA2 TXNDC17 GO:0047134 protein-disulfide reductase activity PMID:18579519 IDA F Thioredoxin domain-containing protein 17 TXNDC17|TXNL5|IPI00646689|TXD17_HUMAN protein taxon:9606 20080627 UniProtKB

The definition of GO:0047134 ! protein-disulfide reductase activity is Catalysis of the reaction: protein-dithiol + NAD(P)+ = protein-disulfide + NAD(P)H + H+.

Here we have two inputs and two outputs. Again here I think has_input is appropriate.

5. Should there be annotation guidelines requiring that when a column 16 is used to link a GO MF to a CC term, it needs to be complemented by an annotation using the same PMID for the corresponding CC term?

Yes. We can discuss automatically filling these in as part of the GAF publishing pipeline, but I have a feeling people will be more comfortable with guidelines for what to do at annotation time, in which case, yes, it should be required to make the semi-redundant annotation.

6. Is it correct to assume that when a protein is the target of some transcription activity, we should indicate the target in column 16 in both the annotation to the process terms (e.g. positive regulation of transcription) and molecular function terms (e.g. transcription factor activity)?

If we ignore col16 for a moment I feel that in these cases groups should just make the F annotation, as we have the inter-ontology part_of links. See [GAF Inference].

If we are making just the F annotation, what is the relation that should be used for TF activity? I see now that the original documentation is too weak on the subject of gene products, and this is clearly an important use case.

Let's leave time to discuss this one during the call

Another question I feel will come up:

7. When do we use col 16 and when do we request a new term?

Gene (col 2/3)   Term (col 5)                Ref (col 6)   Ext (col 16)
GeneXYZ          cell surface (GO:0005887)   PMID:nnn      part_of(CL:0000576), occurs_in(LS:00000xyz)