Distributed James Server — Consistency Model

This page presents the consistency model used by the Distributed Server and points to the tools built around it.

Data Replication

The Distributed Server relies on different storage technologies, all having their own consistency models.

These data stores replicate data in order to enforce some level of availability; we call this process replication. By consistency, we mean the ability of all replicas to hold the same data. By availability, we mean the ability of a replica to answer a request.

According to the CAP theorem, distributed systems will necessarily encounter network partitions, so a trade-off needs to be made between consistency and availability.

This section details this trade-off for the data stores used by the Distributed Server.

Cassandra consistency model

Cassandra is an eventually consistent data store. This means that replicas can hold diverging data, but are guaranteed to converge over time.

Several mechanisms are built into Cassandra to enforce this convergence and need to be leveraged by the Distributed Server administrator: namely nodetool repair, hinted hand-off and read repair.

The Distributed Server tries to mitigate inconsistencies by relying on QUORUM read and write levels. This means that a majority of replicas is required for read and write operations to be performed.
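
Below is a minimal sketch of how QUORUM can be applied per statement with the DataStax Java driver; the mailbox table and its columns are hypothetical illustrations, not the actual James schema.

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    import java.util.UUID;

    public class QuorumExample {
        // Hypothetical table: CREATE TABLE mailbox (id uuid PRIMARY KEY, name text);
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().withKeyspace("james_example").build()) {
                UUID mailboxId = UUID.randomUUID();

                // Write at QUORUM: with a replication factor of 3, two replicas must acknowledge.
                session.execute(SimpleStatement
                    .newInstance("INSERT INTO mailbox (id, name) VALUES (?, ?)", mailboxId, "INBOX")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM));

                // Read at QUORUM: the read quorum overlaps the write quorum, so at least one
                // contacted replica holds the latest acknowledged write.
                String name = session.execute(SimpleStatement
                        .newInstance("SELECT name FROM mailbox WHERE id = ?", mailboxId)
                        .setConsistencyLevel(DefaultConsistencyLevel.QUORUM))
                    .one()
                    .getString("name");

                System.out.println(name);
            }
        }
    }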

Critical business operations, like UID allocation, rely on the strong consistency mechanisms brought by lightweight transactions.
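
As an illustration, here is a sketch of the compare-and-swap pattern that lightweight transactions enable; the message_uid table and the allocation logic are simplified assumptions, not the actual James implementation.

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    import java.util.UUID;

    public class UidAllocationSketch {
        // Hypothetical table: CREATE TABLE message_uid (mailbox_id uuid PRIMARY KEY, next_uid bigint);
        // Seeding the initial row is omitted for brevity.
        public static long allocateUid(CqlSession session, UUID mailboxId) {
            while (true) {
                // Linearizable read of the current counter value.
                Row row = session.execute(SimpleStatement
                        .newInstance("SELECT next_uid FROM message_uid WHERE mailbox_id = ?", mailboxId)
                        .setConsistencyLevel(DefaultConsistencyLevel.SERIAL))
                    .one();
                long current = row.getLong("next_uid");

                // Conditional (Paxos-based) update: only applied if no concurrent allocation happened.
                ResultSet result = session.execute(SimpleStatement
                    .newInstance("UPDATE message_uid SET next_uid = ? WHERE mailbox_id = ? IF next_uid = ?",
                        current + 1, mailboxId, current)
                    .setSerialConsistencyLevel(DefaultConsistencyLevel.SERIAL));

                if (result.wasApplied()) {
                    return current + 1; // this writer won the race, the UID is unique
                }
                // Another writer won: loop and retry against the fresh value.
            }
        }
    }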

About multi data-center setups

As strong consistency is required for some operations, and as lightweight transactions are slow across data centers, running James with a multi data-center Cassandra setup is discouraged.

However, alternative consistency levels can be configured, which could be acceptable given relaxed requirements. LOCAL_QUORUM coupled with LOCAL_SERIAL is likely the only scalable setup.

Running the Distributed Server IMAP server in a multi datacenter setup will likely result either in data loss or in very slow operations: as we rely on monotonic UID generation, without strong consistency the same UID could be allocated several times. Other protocols, like JMAP, suffer less from such limitations.

ElasticSearch consistency model

ElasticSearch relies on strong consistency, achieved with a home-grown algorithm.

The 6.x release line, which the Distributed Server uses, is known to be slow to recover from failures.

Be aware that data is indexed asynchronously in ElasticSearch; changes will eventually become visible.

RabbitMQ consistency model

Out of the box, the Distributed Server relies on a single RabbitMQ server, thus consistency concerns are not (yet) applicable. Availability concerns, however, do apply.

Denormalization

In Cassandra, data needs to be structured to match the read patterns. To support several conflicting read patterns, the data needs to be duplicated into different structures. This process is called denormalization.

While data can be consistent at the table level, some inconsistencies can sneak in at the application level across denormalization tables.

We write to a "table of truth" first, then duplicate the data to denormalization tables.
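
The sketch below illustrates this ordering under simplified assumptions; the Message and MessageDao abstractions are illustrative, not actual James classes.

    public class DenormalizationWriteSketch {
        // Hypothetical domain abstractions, for illustration only.
        interface Message {}
        interface MessageDao { void save(Message message); }

        private final MessageDao tableOfTruth; // authoritative copy, written first
        private final MessageDao projection;   // denormalization table serving a specific read pattern

        public DenormalizationWriteSketch(MessageDao tableOfTruth, MessageDao projection) {
            this.tableOfTruth = tableOfTruth;
            this.projection = projection;
        }

        public void save(Message message) {
            tableOfTruth.save(message); // step 1: the "table of truth" holds the reference copy
            projection.save(message);   // step 2: a failure here leaves the projection stale, never ahead of the truth
        }
    }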

The Distributed Server offers several mechanisms to mitigate these inconsistencies:

  • Writes to denormalization tables are retried.

  • Some SolveInconsistencies tasks are exposed and are able to heal a given denormalization table. They reset the denormalization table content to the "table of truth" content.

  • Read repairs, when implemented for a given denormalization table, enable auto-healing: when an inconsistency is detected, the denormalization table entry is reset to the "table of truth" entry (see the sketch after this list).
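
Below is a sketch of the read-repair idea under simplified assumptions; the Mailbox and MailboxDao abstractions, the repair probability and the equality check are illustrative, not the actual James implementation.

    import java.util.Objects;
    import java.util.Optional;
    import java.util.concurrent.ThreadLocalRandom;

    public class ReadRepairSketch {
        // Hypothetical domain abstractions, for illustration only.
        interface Mailbox {}
        interface MailboxDao {
            Optional<Mailbox> read(String id);
            void save(String id, Mailbox mailbox);
        }

        private static final double READ_REPAIR_PROBABILITY = 0.1;

        private final MailboxDao tableOfTruth;
        private final MailboxDao projection;

        public ReadRepairSketch(MailboxDao tableOfTruth, MailboxDao projection) {
            this.tableOfTruth = tableOfTruth;
            this.projection = projection;
        }

        public Optional<Mailbox> read(String id) {
            Optional<Mailbox> fromProjection = projection.read(id);

            // Only a fraction of reads pays the cost of the consistency check.
            if (ThreadLocalRandom.current().nextDouble() < READ_REPAIR_PROBABILITY) {
                Optional<Mailbox> fromTruth = tableOfTruth.read(id);
                if (!Objects.equals(fromProjection, fromTruth)) {
                    // Inconsistency detected: reset the projection entry to the "table of truth" entry.
                    fromTruth.ifPresent(mailbox -> projection.save(id, mailbox));
                    return fromTruth;
                }
            }
            return fromProjection;
        }
    }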

Consistency across data stores

The Distributed Server leverages several data stores:

  • Cassandra is used for metadata storage

  • ElasticSearch for search

  • Object Storage for large object storage

Thus the Distributed Server also offers mechanisms to enforce consistency across data stores.

Write path organisation

The primary data stores are composed of Cassandra for metadata and Object storage for binary data.

To ensure the data referenced in Cassandra is pointing to a valid object in the object store, we write the object store payload first, then write the corresponding metadata in Cassandra.

Such a procedure avoids metadata pointing to non-existent blobs; however, it might leave some unreferenced blobs behind.
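
A sketch of this ordering, with hypothetical store abstractions standing in for the actual James components:

    public class MessageWritePathSketch {
        // Hypothetical abstractions over the two primary data stores, for illustration only.
        interface BlobStore { String save(byte[] payload); }  // returns the blob identifier
        interface MetadataStore { void saveMessageMetadata(String blobId, long size); }

        private final BlobStore objectStorage;
        private final MetadataStore cassandra;

        public MessageWritePathSketch(BlobStore objectStorage, MetadataStore cassandra) {
            this.objectStorage = objectStorage;
            this.cassandra = cassandra;
        }

        public void append(byte[] messagePayload) {
            // Step 1: write the payload to the object store and obtain its identifier.
            String blobId = objectStorage.save(messagePayload);
            // Step 2: only then reference that blob from the Cassandra metadata.
            // A crash between the two steps leaves an unreferenced blob,
            // but never metadata pointing to a missing one.
            cassandra.saveMessageMetadata(blobId, messagePayload.length);
        }
    }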

Cassandra <=> ElasticSearch

After being written to the primary stores (namely Cassandra & Object Storage), email content is asynchronously indexed into ElasticSearch.

This process relies on the EventBus, which retries temporary errors and stores failing events for later admin-triggered retries; it is described further here. Its role is to spread load and limit inconsistencies.
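
The sketch below shows the retry-then-park idea under simplified assumptions; the Event, Indexer and DeadLetter interfaces and the retry policy are illustrative, not the actual EventBus API.

    public class AsyncIndexingSketch {
        // Hypothetical abstractions, for illustration only.
        interface Event {}
        interface Indexer { void index(Event event) throws Exception; } // pushes the change to ElasticSearch
        interface DeadLetter { void store(Event event); }               // keeps failed events for admin-triggered retries

        private static final int MAX_RETRIES = 3;

        private final Indexer indexer;
        private final DeadLetter deadLetter;

        public AsyncIndexingSketch(Indexer indexer, DeadLetter deadLetter) {
            this.indexer = indexer;
            this.deadLetter = deadLetter;
        }

        // Called asynchronously, after the Cassandra and Object Storage writes succeeded.
        public void onEvent(Event event) {
            for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                try {
                    indexer.index(event);
                    return; // the change will become visible in search results
                } catch (Exception e) {
                    // Temporary error: retry a bounded number of times.
                }
            }
            // Still failing: park the event so an administrator can replay it later.
            deadLetter.store(event);
        }
    }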

Furthermore, re-indexing tasks enable re-synchronising the ElasticSearch content with the primary data stores.