LEGO Model Draft Specification

From GO Wiki
  • Authors: Chris Mungall, Amelia Ireland
  • This version: 2011-08-22


See for latest


See File:Paul's LEGO white paper March 2010.pdf

Core LEGO Model

The LEGO model can be split into the core model, which describes the biology, and the evidence model. I will describe the core model first.

A Lego set consists of one or more Lego statements. A statement is essentially an annotation, but I use the term "statement" here to distinguish it from simpler GAF annotations, and to emphasize the fact that each annotation is an assertion with a defined meaning. A statement consists of the following parts:

  1. An (optional) annotation identifier
  2. A subject class expression
  3. An (optional) subject occurrence identifier
  4. A relation
  5. A filler class expression
  6. An (optional) filler occurrence identifier

Occurrence IDs are a means of referring to a class of entities as they appear in a particular context. For example, the class 'phosphorylation' can be instantiated in multiple different contexts. By using a unique occurrence identifier, we can reference a phosphorylation event that is particular to a specific pathway. Note that we treat genes and gene products such as NEDD4 as classes too; these gene products occur in multiple different contexts.

In this document I will use the syntax:

[Annot ID] Subj (Occurrence ID) Rel Filler (Occurrence ID)

Here I use gene symbols and GO labels (enclosed in single quotes) rather than identifiers. This is not intended to be the exchange syntax that computers will produce and consume. This syntax is primarily for unambiguous communication between humans.

The first thing to note is that every GAF annotation is a Lego statement with the occurrence identifiers omitted. The subject class is columns 1+2, and the filler class is column 5. The relation is implicit from the filler. Using the syntax above, we can write an annotation that NEDD4 has ubiquitin-protein ligase activity as follows:

 NEDD4 actively_participates_in 'ubiquitin-protein ligase activity'

We could also write this as:

 NEDD4 (_) actively_participates_in 'ubiquitin-protein ligase activity' (_)

Where the "_" indicates a null/blank value for the occurrence ID. It simply means that there is some occurrence, but we choose not to label it.

Another example: annotation of PTR2 to the term 'transmembrane transport':

 PTR2 actively_participates_in 'transmembrane transport'
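
The human-readable syntax above is regular enough to be read mechanically. The following is a minimal sketch, not a proposed exchange format: the `LegoStatement` class, the `parse_statement` function and the regular expression are names invented here for illustration, and the sketch assumes that labels containing spaces are always quoted and that a missing ID and "_" mean the same thing.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A term is either a label in single quotes or a bare symbol such as NEDD4.
TERM = r"(?:'[^']*'|[^\s()\[\]]+)"

STATEMENT_RE = re.compile(
    r"^\s*(?:\[(?P<annot>[^\]]+)\]\s*)?"   # optional [Annot ID]
    rf"(?P<subj>{TERM})\s*"                 # subject class
    r"(?:\((?P<subj_occ>[^)]*)\)\s*)?"      # optional (Occurrence ID)
    r"(?P<rel>\S+)\s+"                      # relation
    rf"(?P<filler>{TERM})\s*"               # filler class
    r"(?:\((?P<filler_occ>[^)]*)\))?\s*$"   # optional (Occurrence ID)
)


@dataclass(frozen=True)
class LegoStatement:
    """Hypothetical container for the six parts of a statement."""
    subject: str
    relation: str
    filler: str
    annotation_id: Optional[str] = None
    subject_occurrence: Optional[str] = None
    filler_occurrence: Optional[str] = None


def _clean(value):
    """Treat a missing ID or the blank marker "_" as 'not labeled'."""
    if value is None:
        return None
    value = value.strip()
    return None if value in ("", "_") else value


def parse_statement(line: str) -> LegoStatement:
    """Parse one line written in the shorthand used in this document."""
    m = STATEMENT_RE.match(line)
    if m is None:
        raise ValueError(f"not a Lego statement: {line!r}")
    return LegoStatement(
        subject=m["subj"].strip("'"),
        relation=m["rel"],
        filler=m["filler"].strip("'"),
        annotation_id=_clean(m["annot"]),
        subject_occurrence=_clean(m["subj_occ"]),
        filler_occurrence=_clean(m["filler_occ"]),
    )


short = parse_statement("NEDD4 actively_participates_in 'ubiquitin-protein ligase activity'")
explicit = parse_statement("NEDD4 (_) actively_participates_in 'ubiquitin-protein ligase activity' (_)")
```

Under these assumptions the two NEDD4 forms shown earlier parse to equal objects, which mirrors the point that omitting an occurrence ID and writing "_" are interchangeable.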


Simple chain example

The sentence we wish to encode is:

 Ubr1p degrades the transcriptional repressor Cup9p that represses
 transcription of the PTR2 gene, which encodes a transporter, which
 transports dipeptides across a membrane. (PMID: 9427760)

I will use the following relation shorthands:

 *-> is_active_participant_in
  -> is_input_for

We can encode this sentence as:

 Ubr1p (e1) *-> 'protease activity' (p1)
 Cup9p (e2)  -> 'protease activity' (p1)
 Cup9p (e2) *-> 'negative regulation of transcription' (p2)
 PTR2  (e3)  -> 'negative regulation of transcription' (p2)
 PTR2  (e3) *-> 'dipeptide transmembrane transporter activity'

Here I am using short IDs (e1, p1) for the sake of brevity. These would actually be much longer. See later in this document for possible shorthands.

The occurrence identifiers connect the entities in the statements together, in this case in the familiar topology of a linear chain of events.
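
To make the linking explicit, here is a small sketch that indexes the five statements above by the occurrence IDs they mention. Only the occurrence IDs and relation shorthands are taken from the example; the tuple layout and the `shared_occurrences` function are assumptions made for illustration.

```python
from collections import defaultdict

# (subject occurrence, relation shorthand, filler occurrence) per statement,
# in the order listed above; None marks an omitted occurrence ID.
chain = [
    ("e1", "*->", "p1"),
    ("e2", "->", "p1"),
    ("e2", "*->", "p2"),
    ("e3", "->", "p2"),
    ("e3", "*->", None),
]


def shared_occurrences(statements):
    """Map each occurrence ID to the indexes of the statements that mention it."""
    index = defaultdict(list)
    for i, (subj_occ, _rel, filler_occ) in enumerate(statements):
        for occ in (subj_occ, filler_occ):
            if occ is not None:
                index[occ].append(i)
    # Only IDs used by two or more statements actually link statements together.
    return {occ: idxs for occ, idxs in index.items() if len(idxs) > 1}


links = shared_occurrences(chain)
```

Each shared ID joins one statement to the next (p1 joins statements 0 and 1, e2 joins 1 and 2, and so on), which is the linear chain topology described above.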

UV example

(here I reuse IDs like p1, e1 for brevity, but these would in fact be longer IDs and distinct from the IDs in the previous example)

[a1] NEDD4 (e1) *-> 'ubq protein ligase activity' (p1)
[a2] RNAPII (e2) -> 'ubq protein ligase activity' (p1)
[a3] 'ubq protein ligase activity' (p1) negatively_regulates transcription (_)
[a4] 'ubq protein ligase activity' (p1) part_of 'cellular response to UV' (_)

See Paul's white paper for more details

We do not need to label every occurrence - since we never refer to the ultimate transcription event we can leave its identifier blank. However, if we wanted to extend this to include the transcription target then we could create an identifier.

We already have a term: "regulation of transcription from RNA polymerase II promoter in response to UV-induced DNA damage" (GO:0010767). This is not a problem, as we can infer that NEDD4 is involved in this - but for this we need to specify a formal semantics for the model (see next section).


We can post-compose a complex as follows:

 NEDD4 (e1) *-> 'ubq protein ligase activity' (p1)
 'protein complex' (e2) -> 'ubq protein ligase activity' (p1)
 'ubq protein ligase activity' (p1) occurs_in 'epithelial cell' (e3)
 'ubq protein ligase activity' (p1) negatively_regulates 'sodium channel activity' (p3)
 'sodium channel activity' (p3) occurs_in 'epithelial cell' (e3)
 SCNN1A (_) part_of 'protein complex' (e2)
 SCNN1B (_) part_of 'protein complex' (e2)
 SCNN1C (_) part_of 'protein complex' (e2)

The complex is anonymous. However, we can infer whether it is a subclass of a PRO complex, as PRO includes logical definitions.

Here we choose to name the cell occurrence. This is because we want to make it clear that the NEDD4 activity and the channel activity are in the same epithelial cell. There may be cases involving signaling where we want to make it clear that the cells are different.

Note that if we want to refer to observations of the above model in different tissues, such as the lung or nephron, we can do so as follows:

nephron (e4) has_part 'epithelial cell' (e3)

i.e. there are some nephrons that have as parts epithelial cells that are connected to the above events.

Formal Semantics

We specify a formal semantics for Lego statements as a translation to OWL. By implementing this translation to OWL we can use OWL reasoners to check validity, consistency and inferences from a set of Lego statements. The translation also stands as formal documentation and makes the biological meaning explicit; without this we have yet another way of representing diagrams.

There are two options: An individual-based formalization, where each occurrence is treated as an OWL individual, and a class-based formalization, where each occurrence is treated as a subclass.

We show the class-based formalization here:

 SC (SI) R FC (FI)
 SI SubClassOf SC
 FI SubClassOf FC
 SI SubClassOf R some FI

Together with the logical definitions in the ontology, this is sufficient to infer classification of post-composed descriptions under pre-composed terms.
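
The class-based translation can be sketched in a few lines. This is only an illustration that emits the three axioms as strings in the notation used above; the function names are invented here, and generating a fresh placeholder class for an unlabeled occurrence ("_") is an assumption, since the translation above only covers labeled occurrences.

```python
from itertools import count

_fresh = count(1)  # source of placeholder names for unlabeled occurrences


def occurrence_class(occ_id):
    """Return the occurrence ID, or a fresh placeholder if it is unlabeled.

    Assumption: "_" (or None) means "some occurrence we chose not to label",
    so each one gets its own anonymous class name.
    """
    if occ_id in (None, "_"):
        return f"_anon{next(_fresh)}"
    return occ_id


def class_based_axioms(sc, si, r, fc, fi):
    """Translate the statement  SC (SI) R FC (FI)  into three SubClassOf axioms."""
    si, fi = occurrence_class(si), occurrence_class(fi)
    return [
        f"{si} SubClassOf {sc}",
        f"{fi} SubClassOf {fc}",
        f"{si} SubClassOf {r} some {fi}",
    ]


axioms = class_based_axioms(
    "NEDD4", "e1", "is_active_participant_in", "'ubq protein ligase activity'", "p1"
)
for axiom in axioms:
    print(axiom)
```

A real implementation would build the axioms with an OWL library and hand them to a reasoner; strings are used here only to keep the mapping visible.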

It also opens up possibilities such as automatically generating pre-coordinated terms for use in advanced term enrichment

Drawing Diagrams from Lego Statements

There is a deterministic method for drawing diagrams from collections of LEGO statements.

Visual model:

Each occurrence is drawn as a box. Boxes can contain other boxes. Boxes can be adjacent to other boxes.

A lookup table determines how boxes are spatially co-located. E.g.

  • P1 part_of P2 ==> P1 contained_by P2
  • E actively_participates_in P ==> E immediately-left-of P
  • P has_input E ==> E immediately-below P
  • P1 regulates P2 ==> P1 connected to P2 by arrow that is labeled with +/-

Boxes are labeled with the class. The occurrence identifier is not shown on a standard diagram; these IDs are used purely to connect boxes.
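
The lookup table can be written down directly as data. The sketch below is an assumption about one possible shape for it: the `LAYOUT_RULES` dictionary, the `place` function and the "arrow-to" placement name are invented here, and only the four rules listed above are covered.

```python
# Each rule takes (subject box, filler box) and returns
# (box, placement, reference box), following the table above.
LAYOUT_RULES = {
    "part_of": lambda s, f: (s, "contained_by", f),
    "actively_participates_in": lambda s, f: (s, "immediately-left-of", f),
    # For has_input it is the filler (E) that is positioned relative to P.
    "has_input": lambda s, f: (f, "immediately-below", s),
    # The arrow label (+/-) would be chosen by the renderer.
    "regulates": lambda s, f: (s, "arrow-to", f),
}


def place(relation, subject_box, filler_box):
    """Return a placement instruction, or None if the relation has no rule."""
    rule = LAYOUT_RULES.get(relation)
    return rule(subject_box, filler_box) if rule else None


print(place("part_of", "P1", "P2"))
print(place("has_input", "P", "E"))
```

Because the table is a pure function of the relation, the same set of statements always yields the same placements, which is what makes the drawing method deterministic.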

Translating to GAF

Mapping to GAF1

First translate every statement:

[Annot ID] Subj (Occurrence ID) Rel FillerAsserted (Occurrence ID)

to:

[Annot ID] Subj (Occurrence ID) Rel Filler (Occurrence ID)

where Filler is the most specific class that an OWL reasoner can infer for the filler occurrence (often this will be equivalent to FillerAsserted).

Then we generate a GAF line:

c1/2 = Subj
c5   = Filler

If the Filler matches the OWL class expression



c4 = NOT
c5 = X

Ignore other fields (see the evidence model for converting other fields).

Note that this translation is lossy.

Mapping to GAF2

As for the translation to GAF1, but we can use column 16 (c16) to express part of a Lego statement.

Follow the translation for GAF1. Translate every set of annotations:

[?] Subj (?) ? Filler (x)
[?] ? (x) R1 Y1 (?)
[?] ? (x) R2 Y2 (?)
[?] ? (x) Rn Yn (?)

We create a GAF line:

c1/2 = Subj
c5   = Filler
c16  = R1(Y1),R2(Y2),...,Rn(Yn)

From this we can see that if the level of nesting is greater than 1, information is lost in the translation to GAF.
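
As a rough illustration of the GAF2 mapping, the sketch below builds the c1/2, c5 and c16 values for one statement from the UV example. The tuple layout, the `gaf2_row` function and the dictionary keys are assumptions made here; labels stand in for the identifiers a real GAF file would carry, the reasoner step that picks the most specific filler is skipped, and all other columns are ignored.

```python
def gaf2_row(head, statements):
    """Build the c1/2, c5 and c16 values for one head statement.

    Each statement is a tuple (subj, subj_occ, rel, filler, filler_occ);
    None stands for an unlabeled occurrence.
    """
    subj, _subj_occ, _rel, filler, filler_occ = head
    extensions = []
    if filler_occ is not None:
        for _s, s_occ, rel, ext_filler, _f_occ in statements:
            # Statements whose subject is the head's filler occurrence (x)
            # become Rn(Yn) entries in column 16.
            if s_occ == filler_occ:
                extensions.append(f"{rel}({ext_filler})")
    return {"c1/2": subj, "c5": filler, "c16": ",".join(extensions)}


uv = [
    ("NEDD4", "e1", "is_active_participant_in", "ubq protein ligase activity", "p1"),
    ("RNAPII", "e2", "is_input_for", "ubq protein ligase activity", "p1"),
    ("ubq protein ligase activity", "p1", "negatively_regulates", "transcription", None),
    ("ubq protein ligase activity", "p1", "part_of", "cellular response to UV", None),
]

row = gaf2_row(uv[0], uv)
print(row["c16"])
```

Only statements directly attached to the head's filler occurrence are collected. Anything attached one level further out has no place in c16, which is where the information loss comes from.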


Shorthand for occurrence IDs

Rather than forcing the use of lengthy occurrence identifiers, it may be possible to first group annotations into sets, and then use a short local identifier (1,2,3). The unique identifier is a composite key of the annotation set and the local ID.

This may introduce more complexity in the long run.
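
A composite key of this kind could look like the following sketch. The `OccurrenceKey` type and the set names "set-A" and "set-B" are hypothetical and only serve to show that equal local IDs in different annotation sets stay distinct.

```python
from typing import NamedTuple


class OccurrenceKey(NamedTuple):
    """Composite key: the annotation set plus a short local ID."""
    annotation_set: str
    local_id: int


a = OccurrenceKey("set-A", 1)
b = OccurrenceKey("set-B", 1)

# Same local ID, different sets: the keys do not collide.
print(a == b)
```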

Differences and similarities to Pathway models

This model places GO annotations one step further towards something like BioPAX or the Reactome model.

However, there are some crucial differences. In a pathway model, one typically has to specify all inputs and outputs. Here this is not necessary, because we are leveraging the semantics of the ontology. We only need to name the inputs and outputs if we want to refer to them again within the same picture.