Research on PATIKA

Ontology

PATIKA Ontology

We have described an ontology [1] to model networks of cellular processes through integration of information on individual pathways. Our ontology is suitable for modeling incomplete information and abstractions of varying levels for complexity management. Furthermore, it facilitates concurrent modifications and extensions to existing data while maintaining its validity and consistency.

PATIKA Objects

Every first-class object in the PATIKA ontology is a PATIKA object, which describes the common functionality and information. A PATIKA object has an ID that, together with a version, uniquely identifies it; an author (the user who first created the object); and a data source, which describes how the phenomenon was observed and points to the literature references. A data source can be:

Experimental
Inferred
Imported
Other

Every PATIKA object also has a name and a description, which comply with external naming conventions and vocabularies (such as the Human Gene Nomenclature [2]) whenever possible. Finally, every PATIKA object is optionally associated with a set of GO terms [3].
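The attributes listed above can be sketched as a small record type. This is only an illustration, assuming Python dataclasses; the class and field names (PatikaObject, object_id, go_terms, and so on) and the author value are hypothetical and are not taken from the PATIKA software.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataSource(Enum):
    """The four kinds of data source named in the text."""
    EXPERIMENTAL = "Experimental"
    INFERRED = "Inferred"
    IMPORTED = "Imported"
    OTHER = "Other"


@dataclass
class PatikaObject:
    """Hypothetical record holding the attributes common to all PATIKA objects."""
    object_id: int                 # identifies the object together with `version`
    version: int
    author: str                    # user who first created the object
    data_source: DataSource        # how the phenomenon was observed
    name: str
    description: str = ""
    references: list = field(default_factory=list)   # literature references
    go_terms: set = field(default_factory=set)       # optional GO term identifiers


# Illustrative instance; "alice" is a made-up author.
p53 = PatikaObject(
    object_id=1,
    version=1,
    author="alice",
    data_source=DataSource.EXPERIMENTAL,
    name="TP53",
    description="tumor protein p53",
)
```

The GO term set is left empty here because the association is optional.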

Bioentities

More often than not, actors, especially macromolecules, share a common path of synthesis and/or are chemically very similar. For example, a p53 protein may be in native, phosphorylated, and MDM2-bound forms. Another example is cytoplasmic and extracellular calcium. These molecules usually have different information contexts. It is possible to model all of them as separate entities; however, this is not practical, as these groupings are very natural and comply with the current biological paradigm. Moreover, there is a wealth of information at this level of detail, so we address entities at this level as well. It is therefore preferable to maintain such biological or chemical groupings as bioentities while representing these "minor" changes in their information context with states.

In most genomic and proteomic databases such as GeneCards, SwissProt or GO, and in high-throughput data such as microarray and Y2H, bioentities form the unit entries. Each bioentity stores a set of external references mapping to these databases, and acts as a gateway to the external resources.

We hope to cover other entities, such as operons, in future versions of the ontology.

States

We model the actors of cellular events as states. The term is very generic and encapsulates macromolecules (e.g., DNAs, RNAs, and proteins), small molecules (e.g., ions, ATP, and lipids), or even physical actors (e.g., heat, radiation, and mechanical stress). States also represent molecular complexes, or conceptual abstractions that behave like states. Depending on their nature, states are classified as either compound or simple.

Simple States

Simple states represent tangible and unit phenomena. They belong to a bioentity. Each state of a bioentity represents a change in the information context. Those changes are represented with the following bioentity variables:

Cellular Localization: Each state has a compartment in the cell. A change in compartment means a change in the molecule's information context, since the set of molecules it can interact with changes. This property is single and mandatory for all states. Compartments are described later in this document.
Attachment: Apart from localization, a protein might be attached to a membrane, as with cytosolic farnesyl-attached proteins. Modeling attachment as a separate variable has several advantages over modeling each compartment-attachment pair as a different location; otherwise, the two approaches are equivalent.
Complex Binding: This property describes a functional change due to long-lived non-covalent bonds between molecules; for instance, the p53-bound state of MDM2. This property is multiple and optional.
Homomerization: Non-covalent bonding of two or more copies of the same state is not considered complex formation, for the sake of simplicity. Instead, we use a separate property for modeling such a state. This property is multiple and optional.
Isomerization: Conformational non-covalent changes within the molecule. This property is multiple and optional.
Chemical Modification: Cleavages, group additions/removals and other covalent changes are classified in this group. This property is multiple and optional.
Aberration: Two types of aberrations exist: sequence and structure aberrations (e.g., due to misfolding). Sequence aberrations can be chromosomal or point mutations as well as polymorphisms. We can define a more detailed sequence state variable annotation scheme using GO. Sequence aberrations can occur in DNA, RNA and protein, although they must originate in DNA. Structure aberrations can occur in RNA and protein states and model changes due to aberrantly folded proteins, physical factors or prions. Most of those aberrations are non-specific, i.e., we are actually grouping combinatorially many different molecules under one name, but this is acceptable since they do not exhibit any function. In the case of prions these aberrant forms can be functional, but since this is a very special and exceptional case, we neglect it for now. This property is multiple and optional.

Any combination of bioentity variables forms a unique state of the bioentity. Note that only a very small portion of the state space actually occurs in biological systems.
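One way to picture "a combination of variables forms a unique state" is an immutable record whose equality is defined by its variable values. The sketch below assumes Python frozen dataclasses; the class name SimpleState, the field names, and the compartment labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class SimpleState:
    """Hypothetical simple state: a bioentity plus its bioentity variables."""
    bioentity: str
    compartment: str                                  # single and mandatory
    attachment: Optional[str] = None
    complex_bindings: FrozenSet[str] = frozenset()    # multiple and optional
    homomerizations: FrozenSet[str] = frozenset()
    isomerizations: FrozenSet[str] = frozenset()
    chemical_modifications: FrozenSet[str] = frozenset()
    aberrations: FrozenSet[str] = frozenset()


# Same bioentity, different compartment: two distinct states.
cytoplasmic_calcium = SimpleState("calcium", "cytoplasm")
extracellular_calcium = SimpleState("calcium", "extracellular space")

# Three states of the p53 bioentity (the compartment is illustrative).
native_p53 = SimpleState("p53", "nucleus")
phosphorylated_p53 = SimpleState(
    "p53", "nucleus", chemical_modifications=frozenset({"phosphorylated"})
)
mdm2_bound_p53 = SimpleState(
    "p53", "nucleus", complex_bindings=frozenset({"MDM2"})
)
```

Because the dataclass is frozen, two records with identical variable values compare equal and hash identically, so a set of states never holds the same combination twice.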

An important side note is that in pathway drawings it is common to represent different states as a single biological entity, even when the mechanistic detail is known. This is an oversimplification, as different states can have very different and sometimes conflicting effects. Mapping such information to PATIKA graphs might not be trivial, as in most cases the mechanistic detail is unknown. PATIKA allows defining relations at both the bioentity and the state level to address these levels of detail and abstraction.

In some cases, a bioentity's states are also labeled with various logical (as opposed to physical) tags, such as active form, open channel or resting state. It is best to capture such tags in the state's name, as they are context dependent.

Another point is that states can be ubiquitous, i.e., they participate in a very high number of reactions. Most of the time this is true for small molecules such as ATP or water, which have generic and structural roles. For visualization and analysis, such states can be problematic. PATIKA allows labeling states as ubique, and handles them differently during visualization (e.g., it splits up cytosolic ATP for each reaction it participates in) and querying (e.g., it ignores ATP during a shortest path query).
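The effect of ignoring ubique states in a shortest path query can be illustrated with a plain breadth-first search. This is a sketch under stated assumptions, not PATIKA's query engine: the graph is a hypothetical undirected adjacency map in which A, B, C, D and ATP are states and t1 to t4 are transitions.

```python
from collections import deque


def shortest_path(graph, source, target, ubique=frozenset()):
    """Breadth-first shortest path that never passes through ubique states."""
    if source in ubique or target in ubique:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor in visited or neighbor in ubique:
                continue
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None


# Hypothetical network: ATP takes part in both t1 and t2.
graph = {
    "A": ["t1"],
    "t1": ["A", "ATP", "B"],
    "ATP": ["t1", "t2"],
    "t2": ["ATP", "C"],
    "B": ["t1", "t3"],
    "t3": ["B", "D"],
    "D": ["t3", "t4"],
    "t4": ["D", "C"],
    "C": ["t2", "t4"],
}

via_atp = shortest_path(graph, "A", "C")
without_atp = shortest_path(graph, "A", "C", ubique={"ATP"})
```

Without the ubique label the search returns the shortcut through ATP, which links A and C only because both reactions happen to use ATP; with the label it returns the longer path through B and D.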

States map to a class of molecules/entities rather than a single molecule. This group might not be totally homogeneous. For example, in most cases it is not desirable to model the rotamers of a protein as different states, as there are combinatorially many of them, they are very short lived (in the range of nanoseconds), and switching from one to another is almost instantaneous. However, the PATIKA ontology does not draw hard lines for the level of abstraction, as it readily provides a framework for modeling and representing multiple levels of detail. So we can say that state variables can be incomplete and overlapping. An example of an incomplete state variable is "phosphorylated p53". This representation poses a subtle problem, however. Since we do not know at which site the p53 is phosphorylated, the relationship between this state and p53 phosphorylated at 153Arg is not clear. It might be that two authors are actually talking about different states, or that the latter is a non-proper subset of the former. A sensible approach is to delegate this issue to the submitter and to the expert, as it is really hard to come up with a context-free resolution rule. In the first case, the submitter must modify the phosphorylated p53 entry to bring it to the correct level of detail. In the second case, they must switch phosphorylated p53 into an incomplete state to indicate the different levels of detail.

Obviously, there are still many important issues to cover, the most important being combinatorial states, generics, polymerization, and semi-quantitative phosphorylations. We are constantly evaluating examples (use cases) from the biological literature to improve our ontology so that it includes these "hard" cases.

Compound States

A compound state is a grouping of other PATIKA objects, which exhibits a state-like behavior, and needs to be addressed at this level. There are two types of compound states, complex and abstraction.

In biological systems molecules often form clusters for performing proper tasks, behaving like a single state. We consider each member of a molecular complex as a new state of its biological entity. The function of a molecular complex is affected by the specific binding relations within itself. Therefore these binding relations must be represented in the model as well. Moreover, members of a molecular complex may independently participate in different transitions; thus one should be able to address each member individually. In addition, a molecular complex may contain members from multiple neighboring compartments. In that case, one of those compartments is always a membrane-type compartment. It is actually possible to model complexes in a similar fashion to membrane-spanning proteins.

A complex state has a set of simple state members, which are called complex members.

Complex states do not have a bioentity, as they are not simple. However their members have their own bioentities. This information is used for complexes as well, e.g. for querying.

An important question is "what is a complex really? Do we model short-lived binding relations as complexes or as activation relationships?". PATIKA's answer is "never use a compound graph unless you need to", and this also applies to complexes. If an activation relation would do the trick, it is best to use it. If that level of detail is not sufficient for another user, they can re-edit it to add a complex at that point.

Figure 1. A portion of a pathway containing a molecular complex of three states.

Transitions

A cell is not a static entity, and neither are its actors. Molecules in a cell are constantly synthesized, modified, transported and degraded to respond to changes in the environment or to accomplish a task. One can model such changes as quantitative chemical reactions. However, this would reduce the coverage of the model, as both molecular concentrations and rate constants are currently unknown for most of these reactions. It is often preferable to represent these changes qualitatively, since this better suits current experimental data.

A transition has a set of states as its substrates (inputs) and products (outputs). A transition occurs only when all of its substrates are present and its activation condition, a function of certain other states, is satisfied. These other states are called the effectors of the transition. Two types of effector relations are defined, activator and inhibitor, for positive and negative regulation respectively. When a transition occurs, all of its products are generated. We take great care to make all PATIKA transitions compatible with the canonical biochemical paradigm.

PATIKA uses a pragmatic approach to formally defining transitions: any event that changes one or more states into another set of states is a transition. This definition delegates the exact definition of a transition to the exact definition of a state and, as mentioned above, the level of modeling detail for PATIKA states is very flexible. It follows that the PATIKA ontology can model transitions at multiple levels, allowing high coverage without losing content. Two transitions are equal if they have the same sets of substrates and products. This reveals two invariants for transitions:

A transition has at least one substrate and one product.
There cannot be two transitions with the same sets of substrates and products.
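The equality rule and the two invariants above can be expressed in a few lines. The following is a minimal sketch, assuming states are identified by plain strings; the names Transition and TransitionRegistry are hypothetical and do not refer to actual PATIKA classes.

```python
class Transition:
    """Hypothetical transition identified by its substrate and product sets."""

    def __init__(self, substrates, products):
        self.substrates = frozenset(substrates)
        self.products = frozenset(products)
        # Invariant 1: at least one substrate and one product.
        if not self.substrates or not self.products:
            raise ValueError("a transition needs at least one substrate and one product")

    def __eq__(self, other):
        if not isinstance(other, Transition):
            return NotImplemented
        return (self.substrates, self.products) == (other.substrates, other.products)

    def __hash__(self):
        return hash((self.substrates, self.products))


class TransitionRegistry:
    """Hypothetical container that enforces invariant 2 (no duplicates)."""

    def __init__(self):
        self._transitions = set()

    def add(self, transition):
        if transition in self._transitions:
            raise ValueError("an equal transition already exists")
        self._transitions.add(transition)
        return transition

    def __len__(self):
        return len(self._transitions)


registry = TransitionRegistry()
registry.add(Transition({"A", "B"}, {"C"}))
```

Adding Transition({"B", "A"}, {"C"}) to the same registry raises an error, since the order in which substrates are listed does not matter for equality.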

Although transitions can span a very wide spectrum, we expect most of them to fall into certain classes. Those classes are captured by the PATIKA transition tree. Under certain circumstances, multiple transitions having the same state as a substrate may affect each other by depleting this common substrate. This happens when the equilibrium constant of one transition is much higher than those of the others. If such a difference occurs among the equilibrium constants of transitions, we call the transition with the highest equilibrium constant exhaustive over the other transitions for the common substrate. Transitions having equilibrium constants of the same order, on the other hand, are said to be cooperative. Transitions that share one or more substrates can be exhaustive on each other through a depleting substrate. Which of these substrates is likely to deplete is up to the modeler.
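The distinction between exhaustive and cooperative transitions can be illustrated with a simple comparison of equilibrium constants. The text gives no numeric threshold and leaves the decision to the modeler, so the factor of 100 used below is purely an assumption, as are the function name and the example values.

```python
def exhaustive_transition(equilibrium_constants, ratio=100.0):
    """Return the transition that is exhaustive for a shared substrate, if any.

    `equilibrium_constants` maps each competing transition to its equilibrium
    constant. `ratio` is an assumed threshold: the largest constant must exceed
    the second largest by at least this factor. If no transition dominates,
    the transitions are treated as cooperative and None is returned.
    """
    if len(equilibrium_constants) < 2:
        return None
    ranked = sorted(equilibrium_constants.items(), key=lambda kv: kv[1], reverse=True)
    (top_name, top_k), (_, next_k) = ranked[0], ranked[1]
    return top_name if top_k >= ratio * next_k else None


# Made-up constants for three transitions competing for one substrate.
dominant = exhaustive_transition({"t1": 1e6, "t2": 1e2, "t3": 5e1})
cooperative = exhaustive_transition({"t1": 3.0, "t2": 1.0})
```

In the ontology itself the outcome is recorded as the exhaustive property of a substrate relation rather than computed, so a function like this would at most serve as a modeler's aid.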

Figure 2. The transition tree.

Two transitions are called inverses of each other iff the product set of one is the substrate set of the other and vice versa.
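The inverse relation translates directly into a set comparison. In this sketch each transition is represented as a hypothetical (substrates, products) pair of state names.

```python
def are_inverse(first, second):
    """True iff each transition's products are the other's substrates."""
    first_substrates, first_products = (frozenset(s) for s in first)
    second_substrates, second_products = (frozenset(s) for s in second)
    return first_products == second_substrates and first_substrates == second_products


# Hypothetical association and dissociation of two states A and B.
association = ({"A", "B"}, {"AB"})
dissociation = ({"AB"}, {"A", "B"})
unrelated = ({"AB"}, {"A"})
```

Here are_inverse(association, dissociation) holds, while are_inverse(association, unrelated) does not, because the product set of the third transition lacks B.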

The following properties are missing from the current ontology; they were discussed at some point and left for future versions.

Transition Rules

The term transition logic covers a rather wide spectrum. In modeling transition logic we can use boolean predicates, linear equations, stochastic models, pi-calculus, etc. The PATIKA ontology assumes that the representation and equality of a transition are independent of its transition logic. We assume that transition logic is represented in the transition rule, which is not an internal part of the ontology. Currently the only way of associating a transition rule with a transition is via a custom user object.
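As an illustration of the simplest option, a boolean predicate, the sketch below attaches a rule to a transition as a user-supplied callable. The specific semantics (at least one activator present when activators are defined, and no inhibitor present) are an assumption made for this example, not something the ontology prescribes; all names are hypothetical.

```python
def boolean_rule(activators, inhibitors):
    """Build an assumed boolean transition rule over the set of present states."""
    activators = frozenset(activators)
    inhibitors = frozenset(inhibitors)

    def rule(present):
        activated = not activators or any(a in present for a in activators)
        inhibited = any(i in present for i in inhibitors)
        return activated and not inhibited

    return rule


def can_fire(transition, present):
    """A transition fires only if all substrates are present and its rule holds."""
    present = set(present)
    rule = transition["user_objects"]["rule"]
    return set(transition["substrates"]) <= present and rule(present)


# Hypothetical transition S -> P, activated by Act and inhibited by Inh.
transition = {
    "substrates": {"S"},
    "products": {"P"},
    "user_objects": {"rule": boolean_rule({"Act"}, {"Inh"})},
}
```

Keeping the rule in a separate user object mirrors the point made above: swapping in a different kind of logic leaves the substrate and product sets, and therefore the identity of the transition, unchanged.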

Effector Combinations

Currently we assume that any combination of effectors can regulate a transition. This might not be the case; for example, two inhibitors may never be present together in the cell, or two inhibitors may cancel each other out when both are present. One point of view is therefore to think of each effector combination as a separate transition. If it turns out that only a small number of all effector combinations are significant, a possible approach is to use the already existing compound graph notion and include child nodes in the transition for all significant combination sets, in order to be able to address them separately.

Interactions

Relations between bioentities, states and transitions are described using interactions, which can be directed or undirected. Interactions are divided into two groups based on their level of detail.

Mechanistic Interactions

Mechanistic interactions define relations between states and transitions at the chemical level of detail. There are five types of them:

A substrate relation is a directed relation with a state as its source and a transition as its target, which indicates that the state is consumed by the transition. A substrate relation has a stoichiometry attribute, which describes the number of source states consumed per target transition. Stoichiometry defaults to one. A substrate relation may also define an exhaustive property of its target transition over the source substrate. Exhaustive defaults to false.
A product relation is a directed relation with a transition as its source and a state as its target. It indicates that the state is produced by the transition. It also has a stoichiometry attribute to describe the number of states produced per transition.
An activator relation is a directed relation with a state as its source and a transition as its target. It describes the enabling or facilitating of the transition via the source state.
An inhibitor relation is a directed relation with a state as its source and a transition as its target. It describes the disabling or impeding of the transition via the source state. Irreversible inhibitions should be modeled as a separate state of the modified enzyme.
A bind relation is an undirected relation with two complex members as its source and target. It describes a non-covalent bonding between these two states. If all binding relations were known for a complex, then the graph defined by the binding relations and members would be connected.
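The connectivity remark about bind relations suggests a simple consistency check: if the bind graph of a complex is not connected, some binding relation is still unknown. The sketch below assumes members are identified by strings and bind relations are unordered pairs; the function name is hypothetical.

```python
def binds_connect_all_members(members, binds):
    """Check whether the undirected bind graph over the complex members is connected."""
    members = set(members)
    if not members:
        return True
    adjacency = {member: set() for member in members}
    for a, b in binds:
        if a not in members or b not in members:
            raise ValueError("bind relation refers to a non-member state")
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first traversal from an arbitrary member.
    start = next(iter(members))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return seen == members


# Hypothetical complex of three members.
fully_described = binds_connect_all_members({"A", "B", "C"}, [("A", "B"), ("B", "C")])
partially_described = binds_connect_all_members({"A", "B", "C"}, [("A", "B")])
```

A result of False does not make the complex invalid; it only signals that the recorded binding relations are incomplete.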

Bioentity Interactions

Bioentity interactions describe relations between bioentities rather than states. They represent incomplete information, and always map to one or more mechanistic-level interactions, although the latter might not have been identified yet. There are six types of bioentity interactions:

PPI (Protein-Protein Interaction) is a bi-directional relation, indicating that two proteins are observed to interact with each other in a Y2H or co-precipitation system. In other words, there is at least one state of entity A that somehow interacts with B. One or more mechanistic-level relations might be associated with this entity-level relation. For example, state 1 of protein A might be bound by protein B, whereas state 2 of protein A might be bound and cleaved by B. Even the nature of the chemical reactions need not be similar. Compartment information and n-ary relations cannot be captured by PPI. Some sample databases that contain PPI data include Incyte, DIP, BIND, IntAct, PIM, ProNet and Mint.
TR (Transcription Relation) is a uni-directional relation, indicating that at least one state of the source node activates/inhibits expression of at least one DNA state of the target. Although there is combinatorial information on TR, we have yet to incorporate it into our ontology. Some sample databases that contain TR data include Transfac, RegulonDB and ooTFD.
Co-clusters: A bi-directional relation indicating that expression levels of RNAs of two bioentities are tightly coupled as measured by microarray experiments and clustering algorithms. This maps to a common transcription factor or a regulation path in the "big picture".
Literature: A bi-directional or uni-directional relation indicating that two bioentities are frequently mentioned together in the literature; in some cases there can be inferred relationships such as A activates B.
Inferred: There are so-called "pathway inference" algorithms that attempt to find bioentity-level information, mostly based on bioentity data. Inferred edges can be used to capture such data.
Derived: These edges represent that there is a transition in the mechanistic graph that is adjacent to at least one state of each bioentity. Depending on the exact semantics this edge might have sub-types. For example a control edge might indicate that source bioentity has a state that is an effector of a transition, from which a state of the target is produced.
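The control edge example given for derived edges can be computed mechanically from the mechanistic graph. The sketch below assumes each transition is a dictionary with effectors and products, and that a mapping from state to bioentity is available; these data structures and names are hypothetical.

```python
def derived_control_edges(transitions, bioentity_of):
    """Derive (source bioentity, target bioentity) control edges.

    An edge is emitted when a state of the source bioentity is an effector of
    a transition that produces a state of the target bioentity.
    """
    edges = set()
    for transition in transitions:
        for effector in transition["effectors"]:
            for product in transition["products"]:
                source = bioentity_of.get(effector)
                target = bioentity_of.get(product)
                # States without a bioentity (e.g. complexes) are skipped here.
                if source is not None and target is not None:
                    edges.add((source, target))
    return edges


# Hypothetical data: state a1 of bioentity A regulates the production of b2.
bioentity_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
transitions = [
    {"substrates": {"b1"}, "products": {"b2"}, "effectors": {"a1"}},
    {"substrates": {"c1"}, "products": {"b1"}, "effectors": set()},
]
control_edges = derived_control_edges(transitions, bioentity_of)
```

Other sub-types of derived edges would follow the same pattern with a different choice of which transition roles to pair up.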

Compartments

A significant number of transitions transport molecules between cellular compartments. The transitions that a state can participate in are strictly related to its compartment; thus a change in the compartment means a change in the state's information context. We choose to incorporate the state's compartment in the model.

As the compartments and their adjacencies are cell type dependent, compartmental structure should be modeled as part of the ontology.

Membranes pose an additional problem, since a molecule may not only be located completely inside the membrane but may also span one or both of its neighboring compartments. For membranes there are four types of sub-locations: the two sides of the membrane, inside the membrane, and spanning the membrane.
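A compartment model of this kind can be sketched as an adjacency map plus an enumeration of the four membrane sub-locations. The compartment names and adjacencies below describe a hypothetical cell type and are assumptions for illustration, as are the identifiers.

```python
from enum import Enum


class MembraneSublocation(Enum):
    """The four membrane sub-locations; the side names are placeholders."""
    FIRST_SIDE = "first side of the membrane"
    SECOND_SIDE = "second side of the membrane"
    INSIDE = "inside the membrane"
    SPANNING = "spanning the membrane"


# Hypothetical, cell-type-dependent compartment adjacencies.
ADJACENCY = {
    "extracellular space": {"plasma membrane"},
    "plasma membrane": {"extracellular space", "cytoplasm"},
    "cytoplasm": {"plasma membrane", "nuclear membrane"},
    "nuclear membrane": {"cytoplasm", "nucleus"},
    "nucleus": {"nuclear membrane"},
}


def are_adjacent(adjacency, first, second):
    """True if the two compartments are direct neighbors in this cell model."""
    return second in adjacency.get(first, set())
```

With this map, are_adjacent(ADJACENCY, "cytoplasm", "nucleus") is False because the nuclear membrane lies between the two, which is the kind of constraint a transport transition could be checked against.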

Abstractions

The network of molecular interactions derived from current biological data is incomplete and complicated. The complete network of cellular events is clearly beyond human perception. Different levels of abstraction are necessary to analyze cellular processes effectively and to deal with this complexity.

Representing a cellular pathway as a single process, or grouping related processes under a certain cellular mechanism, enhances the comprehensibility of the network of events (Figure 3). Such mappings are already present and may also be valuable for querying. We model such groupings using regular abstractions. Regular abstractions can be arbitrarily nested and can intersect. However, they cannot be addressed directly, i.e., they have no incident edges.

Figure 3. A cellular pathway is represented as a grouping.

Since the data on cellular processes is not complete, different levels of information may be available for certain events. In cases where it is not identified which state among a set of states constitutes the substrate, product or effector of a transition, or where target transition of an effector is obscure, we may need to abstract these states (transitions) as a single state (transition) to represent the available information despite its incomplete nature. An edge defined on an incomplete state means that it is actually defined on at least one state inside but the exact state is not known. A similar semantic applies to incomplete transitions.

In biological systems, a gene is often duplicated during its evolution, with the copies serving different functions. A special case occurs when this differentiation serves as a specialization of a generic mechanism. For example, when referring to the wnt gene, we actually mean nineteen similar genes in human [4]. These genes are activated by different stimuli in different tissues and can lead to different responses even though the signal processing mechanism is similar. Bhalla also describes common process motifs in signaling pathways, which are even more elementary operations that are reused throughout the entire network [5]. Our ontology supports the representation of such homologies using abstractions.

References

[1] E. Demir, O. Babur, U. Dogrusoz, A. Gursoy, A. Ayaz, G. Gulesir, G. Nisanci and R. Cetin-Atalay (2004) "An Ontology for Collaborative Construction and Analysis of Cellular Pathways", Bioinformatics, 20(3), 349-356.

[2] H.M. Wain, M.J. Lush, F. Ducluzeau, V.K. Khodiyar, S. Povey (2004) Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res. 32 Database issue:D255-7. (PMID: 14681406)

[3] Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29.

[4] J. Miller (2001) The Wnts. Genome Biol., 3, reviews 3001.1-3001.15.

[5] U. Bhalla (2002) The chemical organization of signaling interactions, Bioinformatics, 18, 855-863.