Distributed James Server — Operator guide
This guide aims to be an entry point into the James documentation for users managing a Distributed James Server.
It includes:

- Simple architecture explanations
- Proposed diagnostics for some common issues
- Procedures that can be set up to address these issues
To avoid duplicating information, links to the existing documentation are provided.
Please note that this product is under active development, should be considered experimental and thus targets advanced users.
Mail processing
Currently, an administrator can monitor mail processing failures through ERROR log review. We also recommend watching in Kibana the INFO logs using org.apache.james.transport.mailets.ToProcessor as their logger. Metrics about mail repository size, and the corresponding Grafana boards, are yet to be contributed.
Furthermore, given the default mailet container configuration, we recommend monitoring the cassandra://var/mail/error/ mail repository and ensuring it stays empty.
WebAdmin exposes all utilities for reprocessing all mails in a mail repository or reprocessing a single mail in a mail repository.
In order to prevent unbounded processing that could consume unbounded resources, we can provide a CRON job with a limit parameter (e.g. 10 mails reprocessed per minute). Note that this only supports reprocessing all mails.
Also, one can decide to delete all the mails of a mail repository or delete a single mail of a mail repository.
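As a sketch, the reprocessing operations above could be triggered through WebAdmin as follows. The host and port (localhost:8000), the URL-encoded repository path and the mail key are assumptions to adapt to your deployment:

```shell
# Reprocess all mails of the error mail repository
# (the path cassandra://var/mail/error/ is URL-encoded as var%2Fmail%2Ferror)
curl -XPATCH 'http://localhost:8000/mailRepositories/var%2Fmail%2Ferror/mails?action=reprocess'

# Reprocess a single mail, identified by its mail key (hypothetical key shown)
curl -XPATCH 'http://localhost:8000/mailRepositories/var%2Fmail%2Ferror/mails/mail-key-1?action=reprocess'
```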
Performance of mail processing can be monitored via the mailet grafana board and matcher grafana board.
Recipient rewriting
Given the default configuration, errors (like loops) upon recipient rewriting will lead to emails being stored in cassandra://var/mail/rrt-error/. We recommend monitoring this mail repository and ensuring it stays empty.
If it is not empty, we recommend verifying user mappings via the User Mappings webadmin API. Once the faulty mapping is identified, break the loop by removing the corresponding Recipient Rewrite Table entry via the Delete Alias, Delete Group member, Delete forward, Delete Address mapping, Delete Domain mapping or Delete Regex mapping APIs (as needed).
The Mail.error field can help diagnose the issue as well. Once the root cause has been addressed, the mail can be reprocessed.
Mailbox Event Bus
The administrator of James can define the mailbox listeners to use by adding them to the listeners.xml configuration file. It is also possible to add your own custom mailbox listeners. This makes it possible to enhance the capabilities of James as a Mail Delivery Agent. You can get more information about those here.
Currently, an administrator can monitor listener failures through ERROR log review. Metrics regarding mailbox listeners can be monitored via the mailbox_listeners grafana board and the mailbox_listeners_rate grafana board.
Upon exceptions, a bounded number of retries are performed (with exponential backoff delays). If after those retries the listener is still failing to perform its operation, then the event will be stored in the Event Dead Letter. This API allows diagnosing issues, as well as redelivering the events.
To check whether you have undelivered events in your system, you can first run the event dead letter health check. You can explore the Event DeadLetter content through WebAdmin: list the mailbox listener groups to get a list of groups back, then check whether each group contains registered events by listing its failed events.
If you get failed events IDs back, you can as well check their details.
An easy way to solve this is to then trigger the redeliver all events task. It will start reprocessing all the failed events registered in event dead letters.
In order to prevent unbounded processing that could consume unbounded resources, we can provide a CRON job with a limit parameter (e.g. 10 redeliveries per minute).
If for some other reason you don’t need to redeliver all events, you have more fine-grained operations allowing you to redeliver group events or even just redeliver a single event.
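A sketch of these redelivery calls through WebAdmin, assuming the server listens on localhost:8000 (the group name and event id shown are hypothetical placeholders):

```shell
# Redeliver all failed events registered in the event dead letter
curl -XPOST 'http://localhost:8000/events/deadLetter?action=reDeliver'

# Redeliver only the failed events of a single listener group (hypothetical group name)
curl -XPOST 'http://localhost:8000/events/deadLetter/groups/org.example.SomeListenerGroup?action=reDeliver'

# Redeliver a single event of that group (hypothetical event id)
curl -XPOST 'http://localhost:8000/events/deadLetter/groups/org.example.SomeListenerGroup/event-id-1?action=reDeliver'
```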
OpenSearch Indexing
A projection of messages is maintained in OpenSearch via a listener plugged into the mailbox event bus in order to enable search features.
You can find more information about OpenSearch configuration here.
Usual troubleshooting procedures
As explained in the Mailbox Event Bus section, processing those events can fail sometimes.
Currently, an administrator can monitor indexation failures through ERROR log review. You can as well list failed events by looking at the group called org.apache.james.mailbox.opensearch.events.OpenSearchListeningMessageSearchIndex$OpenSearchListeningMessageSearchIndexGroup. A first on-the-fly solution could be to just redeliver those group events with event dead letter.
If the event storage in dead-letters fails (for instance in the face of cassandra storage exceptions), then you might need to use our WebAdmin reIndexing tasks.
From there, you have multiple choices. You can reIndex all mails, reIndex mails from a mailbox or even just reIndex a single mail.
When checking the result of a reIndexing task, some mails may have failed to be reprocessed. You can still use the task ID to reprocess previously failed reIndexing mails.
On the fly OpenSearch Index setting update
Sometimes you might need to update index settings. Cases when an administrator might want to update index settings include:

- Scaling out: increasing the shard count might be needed
- Changing string analysers, for instance to target another language
- etc.
In order to achieve such a procedure, you need to:

- Create the new index with the right settings and mapping
- James uses two aliases on the mailbox index: one for reading (mailboxReadAlias) and one for writing (mailboxWriteAlias). First add the mailboxWriteAlias alias to the new index, so that James now writes to both the old and new indexes, while still reading only from the old one
- Now trigger a reindex from the old index to the new one (this actively relies on the _source field being present)
- When this is done, add the mailboxReadAlias alias to the new index
- Now that the migration to the new index is done, you can drop the old index
- You might as well modify the James configuration file opensearch.properties by setting the parameter opensearch.index.mailbox.name to the name of your new index, to avoid James re-creating the old index upon restart
Note: keep in mind that reindexing can be a very long operation depending on the volume of mails you have stored.
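As a sketch, the alias and reindex steps above could be performed with the OpenSearch REST API. The index names (mailbox_v1, mailbox_v2) and the localhost:9200 endpoint are assumptions:

```shell
# Add the write alias to the new index, so James writes to both indexes
curl -XPOST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d '{
  "actions": [{ "add": { "index": "mailbox_v2", "alias": "mailboxWriteAlias" } }]
}'

# Copy documents from the old index to the new one (relies on the _source field)
curl -XPOST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
  "source": { "index": "mailbox_v1" },
  "dest":   { "index": "mailbox_v2" }
}'

# Once the reindex is done, add the read alias to the new index
curl -XPOST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d '{
  "actions": [{ "add": { "index": "mailbox_v2", "alias": "mailboxReadAlias" } }]
}'

# Finally, drop the old index
curl -XDELETE 'http://localhost:9200/mailbox_v1'
```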
Mail Queue
Fine tune configuration for RabbitMQ
In order to adapt mail queue settings to the actual traffic load, an administrator needs to perform fine configuration tuning as explained in rabbitmq.properties.
Be aware that MailQueue::getSize is currently performing a browse and thus is expensive. Recurring size metric reporting therefore introduces performance issues. As such, we advise setting mailqueue.size.metricsEnabled=false.
Managing email queues
Managing an email queue is an easy task if you follow this procedure:

- First, List mail queues and get a mail queue details
- And then List the mails of a mail queue

In case you need to clear an email queue because it only contains spam or trash emails, follow this procedure:

- All mails from the given mail queue will be deleted with Clearing a mail queue
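A sketch of the procedure through WebAdmin, assuming the server listens on localhost:8000 and a queue named spool (the queue name is an assumption; list the queues first to get the real names):

```shell
# List the mail queues known to this James server
curl -XGET 'http://localhost:8000/mailQueues'

# Get details of a given queue, then list the mails it contains
curl -XGET 'http://localhost:8000/mailQueues/spool'
curl -XGET 'http://localhost:8000/mailQueues/spool/mails'

# Clear the queue: delete all the mails it contains
curl -XDELETE 'http://localhost:8000/mailQueues/spool/mails'
```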
Deleted Message Vault
We recommend that the administrator run the expired messages cleanup (described below) as a cron job to save storage volume.
How to configure deleted messages vault
To set up James with the Deleted Messages Vault, you need to follow these steps:

- Enable the Deleted Messages Vault by configuring Pre Deletion Hooks
- Configure the retention time for the Deleted Messages Vault
Enable Deleted Messages Vault by configuring Pre Deletion Hooks
You need to configure this hook in listeners.xml configuration file. More details about configuration & example can be found at Pre Deletion Hook Configuration
Configuring the retention time for the Deleted Messages Vault
In order to configure the retention time for the Deleted Messages Vault, an administrator needs to perform fine configuration tuning as explained in deletedMessageVault.properties. Mails are not retained forever: you have to configure a retention period (via retentionPeriod), with a one-year retention by default if not defined.
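A minimal sketch of deletedMessageVault.properties making the one-year default explicit. The exact value syntax is an assumption; check the linked documentation for the supported duration formats:

```properties
# Keep deleted messages restorable for one year before they are eligible for purge
retentionPeriod=1y
```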
Restore deleted messages after deletion
After users have deleted their mails and emptied the trash, the admin can use Restore Deleted Messages to restore all the deleted mails.
Cleaning expired deleted messages
You can delete all deleted messages older than the configured retentionPeriod by using Purge Deleted Messages. We recommend calling this API in a CRON job on the 1st day of each month.
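A sketch of such a cron entry, assuming WebAdmin is exposed on localhost:8000 (adapt the host and port to your deployment):

```shell
# Run at 00:00 on the 1st day of each month: purge messages past their retention period
0 0 1 * * curl -XDELETE 'http://localhost:8000/deletedMessages?scope=expired'
```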
Solving cassandra inconsistencies
The Cassandra backend uses data duplication to work around Cassandra query limitations. However, Cassandra does not provide transactions when writing to several tables, which can lead to consistency issues for a given piece of data. The consequence could be that the data is in a transient state (that should never appear outside of the system).
Because of the lack of transactions, it is hard to prevent these kinds of issues. We have developed some features to fix some existing Cassandra inconsistency issues that had been reported to James.
Jmap message fast view projections
When you read a Jmap message, some calculated properties are expected to be fast to retrieve, like preview or hasAttachment. James achieves this by pre-calculating and storing them in a caching table (message_fast_view_projection). Missing caches are populated on message reads and will temporarily decrease performance.
How to detect the outdated projections
You can watch the MessageFastViewProjection health check documented in the webadmin documentation. It provides a check based on the ratio of missed projection reads.
Mailboxes
The mailboxPath and mailbox tables share common fields like mailboxId and the mailbox name. A successful operation creating/renaming/deleting a mailbox has to update both the mailboxPath and mailbox tables. Any failure while creating/updating/deleting records in mailboxPath or mailbox can produce inconsistencies.
How to detect the inconsistencies
Finding suspicious MailboxNotFoundException entries in your logs is a hint. Currently, there is no dedicated detection tool; we recommend scheduling the SolveInconsistencies task below for the mailbox object on a regular basis, avoiding peak traffic, in order to address both inconsistency diagnostics and fixes.
How to solve
An admin can run the offline webadmin solve Cassandra mailbox object inconsistencies task in order to sanitize the mailbox denormalization.
In order to ensure being offline, stop the traffic on the SMTP, JMAP and IMAP ports, for example via re-configuration or firewall rules.
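A sketch of triggering this task through WebAdmin, assuming the server listens on localhost:8000 (the taskId shown in the second call is a hypothetical placeholder; use the one returned by the first call):

```shell
# Submit the mailbox object inconsistency fixing task; a JSON taskId is returned
curl -XPOST 'http://localhost:8000/cassandra/mailbox?task=SolveInconsistencies'

# Wait for the submitted task to complete and inspect its result
curl -XGET 'http://localhost:8000/tasks/3294a976-ce63-491e-bd52-1b6f465ed7a2/await'
```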
Mailboxes Counters
James maintains a per mailbox projection for message count and unseen message count. Failures during the denormalization process will lead to incorrect results being returned.
How to detect the inconsistencies
Incorrect message count / unseen message count could be seen in the Mail User Agent (IMAP or JMAP). Invalid values are reported in the logs as warnings by the org.apache.james.mailbox.model.MailboxCounters class, with the following message prefix: Invalid mailbox counters.
How to solve
Execute the recompute Mailbox counters task. This task is not concurrent-safe. Concurrent increments & decrements will be ignored during a single mailbox processing. Re-running this task may eventually return the correct result.
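A sketch of triggering this recomputation through WebAdmin, with the localhost:8000 endpoint as an assumption:

```shell
# Recompute message count and unseen message count for all mailboxes
curl -XPOST 'http://localhost:8000/mailboxes?task=RecomputeMailboxCounters'
```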
Messages
Messages are denormalized and stored in both the imapUidTable (source of truth) and the messageIdTable. Failures in the denormalization process will cause inconsistencies between the two tables.
How to detect the inconsistencies
A user can see a message in JMAP but not in IMAP, or mark a message as SEEN in JMAP while the message flag is still unchanged in IMAP.
How to solve
Execute the solve Cassandra message inconsistencies task. This task is not concurrent-safe. User actions concurrent to the inconsistency fixing task could result in new inconsistencies being created. However, the source of truth imapUidTable will not be affected, and thus re-running this task may eventually fix all issues.
Quotas
Users can monitor the amount of space and message count they are allowed to use, and that they are effectively using. James relies on an event bus and Cassandra to track the quota of a user. Upon Cassandra failure, this value can be incorrect.
How to detect the inconsistencies
Incorrect quotas could be seen in the Mail User Agent (IMAP or JMAP).
How to solve
Execute the recompute Quotas counters task. This task is not concurrent-safe. Concurrent operations will result in an invalid quota to be persisted. Re-running this task may eventually return the correct result.
RRT (RecipientRewriteTable) mapping sources
The rrt and mappings_sources tables store information about address mappings. The source of truth is rrt; mappings_sources is the projection table containing all mapping sources.
How to detect the inconsistencies
Right now there is no tool for detecting this; we are proposing a development plan. In the meantime, the recommendation is to execute the SolveInconsistencies task below on a regular basis.
How to solve
Execute the Cassandra mapping SolveInconsistencies task described in the webadmin documentation.
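A sketch of triggering this task through WebAdmin, with the localhost:8000 endpoint as an assumption:

```shell
# Submit the RRT mapping sources inconsistency fixing task
curl -XPOST 'http://localhost:8000/cassandra/mappings?action=SolveInconsistencies'
```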
Setting Cassandra user permissions
When a Cassandra cluster serves more than one James cluster, the keyspaces need isolation. This can be achieved by configuring the James server with credentials preventing access to or modification of other keyspaces.
We recommend not using the initial Cassandra admin user, and instead providing a different user with a subset of permissions for each application.
Prerequisites
We are going to use a Cassandra super user to create roles and grant permissions to them. To do that, Cassandra requires you to log in via username/password authentication, and to enable authorization in the Cassandra configuration file.
For example:
echo -e "\nauthenticator: PasswordAuthenticator" >> /etc/cassandra/cassandra.yaml
echo -e "\nauthorizer: org.apache.cassandra.auth.CassandraAuthorizer" >> /etc/cassandra/cassandra.yaml
Prepare Cassandra roles & keyspaces for James
Create a role
Have a look at the Cassandra documentation section CREATE ROLE for more information.
E.g.
CREATE ROLE james_one WITH PASSWORD = 'james_one' AND LOGIN = true;
Create a keyspace
Have a look at the Cassandra documentation section CREATE KEYSPACE for more information.
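E.g., a sketch of such a keyspace creation; the keyspace name matches the role example above, while the replication strategy and factor are assumptions to adapt to your cluster (NetworkTopologyStrategy is usually preferred in production):

```sql
CREATE KEYSPACE james_one_keyspace
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```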
Grant permissions on created keyspace to the role
The role to be used by James needs to have full rights on the keyspace that James is using. Assume the keyspace name is james_one_keyspace and the role is james_one.
GRANT CREATE ON KEYSPACE james_one_keyspace TO james_one; // Permission to create tables on the appointed keyspace
GRANT SELECT ON KEYSPACE james_one_keyspace TO james_one; // Permission to select from tables on the appointed keyspace
GRANT MODIFY ON KEYSPACE james_one_keyspace TO james_one; // Permission to update data in tables on the appointed keyspace
Warning: The granted role does not have the right to create keyspaces. Thus, if you have not created the keyspace, the James server will fail to start; this is expected.
Tips
Since all the Cassandra roles used by the different James servers are supposed to have the same set of permissions, you can reduce the work by creating a base role, like typical_james_role, with all the necessary permissions. After that, for each James server, create a new role and grant typical_james_role to the newly created one. Note that once a base role is updated (granting or revoking rights), all granted roles are automatically updated.
E.g.
CREATE ROLE james1 WITH PASSWORD = 'james1' AND LOGIN = true;
GRANT typical_james_role TO james1;
CREATE ROLE james2 WITH PASSWORD = 'james2' AND LOGIN = true;
GRANT typical_james_role TO james2;
Revoke harmful permissions from the created role
We want a specific role that cannot describe or query the information of other keyspaces or tables used by another application. By default, Cassandra allows every created role to describe any keyspace and table, and there is no configuration option affecting this behavior. Consequently, you have to accept that your data models are still exposed to anyone having credentials to Cassandra.
For more information, have a look at the Cassandra documentation section REVOKE PERMISSION.
Except for the case above, permissions are not automatically available to a specific role unless they are granted by the GRANT command. Therefore, if you did not provide more permissions than in the granting section above, there is no need to revoke anything.
Cassandra table level configuration
While Distributed James is shipped with default table configuration options, these settings should be refined depending on your usage.
These options are:

- The compaction algorithms
- The bloom filter probabilities
- The chunk size
- The caching options
The compaction algorithms allow a tradeoff between background IO upon writes and reads. We recommend:

- Using Leveled Compaction Strategy on read-intensive tables subject to updates. This limits the count of SSTables being read at the cost of more background IO. High garbage collection can be caused by an inappropriate use of Leveled Compaction Strategy.
- Otherwise using the default Size Tiered Compaction Strategy.
Bloom filters help avoid unnecessary reads on SSTables. This probabilistic data structure can assert with certainty that an entry is absent from an SSTable, and report the presence of an entry with an associated probability. If a lot of false positives are noticed, the size of the bloom filters can be increased.
As explained in this post, the chunk size used upon compression allows a tradeoff between reads and writes. A smaller size decreases compression, thus increasing the data stored on disk, but allows smaller chunks to be read to access data, and so favors reads. A bigger size means better compression, thus writing less, but it might imply reading bigger chunks.
Cassandra provides a key cache and a row cache. The key cache makes it possible to skip reading the partition index upon reads, thus performing one read to the disk instead of two. Enabling this cache is generally advised. The row cache stores the entire row in memory. It can be seen as an optimization, but it might actually use memory that is no longer available for the file system cache, for instance. We recommend turning it off on modern SSD hardware.
A review of your usage can be conducted using the nodetool utility. For example, nodetool tablestats {keyspace} allows reviewing the number of SSTables, the read/write ratios and the bloom filter efficiency. nodetool tablehistograms {keyspace}.{table} might give insight about read/write performance.
Table level options can be changed using ALTER TABLE for example with the cqlsh utility. A full compaction might be needed in order for the changes to be taken into account.
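As a sketch, switching a table to Leveled Compaction Strategy and tuning its compression chunk size could look like this in cqlsh. The keyspace and table names are hypothetical; adapt them to your deployment, and check your Cassandra version's documentation for the exact option names:

```sql
-- Hypothetical keyspace/table names: adapt to your deployment
ALTER TABLE james_keyspace.messagev3
  WITH compaction = {'class': 'LeveledCompactionStrategy'}
  AND compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};
```

As noted above, a full compaction (nodetool compact) might then be needed for the changes to be taken into account on existing SSTables.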
Updating Cassandra schema version
A schema version indicates which schema your James server is relying on, and tracks whether a migration is required. For instance, when the latest schema version is 2 and the current schema version is 1, you might still have data in the deprecated Message table in the database. Hence, you need to migrate these messages into the MessageV2 table. Once done, you can safely bump the current schema version to 2.
Relying on an outdated schema version prevents you from benefiting from the newest performance and safety improvements. Note one possibly unexpected aspect of the way we manage the Cassandra schema: we create new tables without asking the admin about it. That means your James version always uses the latest tables, but may also take the old ones into account if the migration is not done yet.
How to detect when we should update Cassandra schema version
When you see in the James logs that org.apache.james.modules.mailbox.CassandraSchemaVersionStartUpCheck shows a warning like Recommended version is versionX, you should perform an update of the Cassandra schema version.
Also, we keep track of changes needed when upgrading to a newer version. You can read this upgrade instructions.
How to update Cassandra schema version
These schema updates can be triggered by webadmin using the Cassandra backend. The steps for updating the Cassandra schema version are:

- First, retrieve the current Cassandra schema version
- Then, retrieve the latest available Cassandra schema version to make sure a newer version is available
- Eventually, update the current schema version to the one you got by upgrading to the latest version

Otherwise, if you need to run the migrations to a specific version, you can use Upgrading to a specific version.
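A sketch of these steps through WebAdmin, with the localhost:8000 endpoint as an assumption:

```shell
# Retrieve the current schema version
curl -XGET 'http://localhost:8000/cassandra/version'

# Retrieve the latest schema version available for this James release
curl -XGET 'http://localhost:8000/cassandra/version/latest'

# Trigger the migration to the latest version (returns a taskId you can await)
curl -XPOST 'http://localhost:8000/cassandra/version/upgrade/latest'
```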