Checkpointing
XTDB nodes can save checkpoints of their query indices on a regular basis, so that new nodes can start to service queries faster.
XTDB nodes that join a cluster have to obtain a local set of query indices before they can service queries. These can be built by replaying the transaction log from the beginning, although this may be slow for clusters with a lot of history. Checkpointing allows the nodes in a cluster to share checkpoints into a central 'checkpoint store', so that nodes joining a cluster can retrieve a recent checkpoint of the query indices, rather than replaying the whole history.
The checkpoint store is a pluggable module - there are a number of officially supported implementations:
- Java's NIO FileSystem (below)
- AWS's S3
- GCP's Cloud Storage
- Azure's Blob Storage
XTDB nodes in a cluster don't explicitly coordinate over which node is responsible for creating a checkpoint - instead, each node checks at random intervals whether any other node has recently created a checkpoint, and creates one if necessary.
The desired frequency of checkpoints can be set using approx-frequency.
The default lifecycle of checkpoints is unbounded - XTDB will not attempt to clean up old checkpoints unless explicitly told to. See setting up a retention policy below for how to manage checkpoint lifecycles within the XTDB checkpointer module itself.
Note that the checkpointer only creates a checkpoint if new transactions have been processed since the previous checkpoint. Be careful if you have any external lifecycle policies set on your checkpoints: if no new transactions are handled for a period, no new checkpoints will be made, and you risk being left with no checkpoints at all and having to replay from the transaction log!
Setting up
You can enable checkpoints on your index-store by adding a :checkpointer dependency to the underlying KV store:
JSON:
{
"xtdb/index-store": {
"kv-store": {
"xtdb/module": "xtdb.rocksdb/->kv-store",
...
"checkpointer": {
"xtdb/module": "xtdb.checkpoint/->checkpointer",
"store": {
"xtdb/module": "xtdb.checkpoint/->filesystem-checkpoint-store",
"path": "/path/to/cp-store"
},
"approx-frequency": "PT6H"
}
}
},
...
}
Clojure:
{:xtdb/index-store {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
...
:checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
:store {:xtdb/module 'xtdb.checkpoint/->filesystem-checkpoint-store
:path "/path/to/cp-store"}
:approx-frequency (Duration/ofHours 6)}}}
...}
EDN:
{:xtdb/index-store {:kv-store {:xtdb/module xtdb.rocksdb/->kv-store
...
:checkpointer {:xtdb/module xtdb.checkpoint/->checkpointer
:store {:xtdb/module xtdb.checkpoint/->filesystem-checkpoint-store
:path "/path/to/cp-store"}
:approx-frequency "PT6H"}}}
...}
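For reference, assuming the xtdb-core and xtdb-rocksdb modules are on the classpath, a node using this configuration could be started from Clojure along the lines of the following sketch (the db-dir and checkpoint store paths are placeholders):

(require '[xtdb.api :as xt])
(import '(java.time Duration))

;; Sketch only: start a node whose RocksDB index store uploads a checkpoint
;; to a filesystem checkpoint store roughly every 6 hours.
(def node
  (xt/start-node
   {:xtdb/index-store
    {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                :db-dir "/tmp/rocksdb"
                :checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                               :store {:xtdb/module 'xtdb.checkpoint/->filesystem-checkpoint-store
                                       :path "/path/to/cp-store"}
                               :approx-frequency (Duration/ofHours 6)}}}}))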
When using a checkpoint store implementation that requires authentication (such as S3), ensure that the checkpointer has the following permissions so that it can perform all of its operations:
- Read objects from the store
- Write objects to the store
- List objects in the store
- Delete objects from the store
Checkpointer parameters
- approx-frequency (required, Duration): approximate frequency for the cluster to save checkpoints (provided the conditions to make a checkpoint have been met - i.e. a new transaction has been processed)
- store (required, CheckpointStore): see the individual store documentation for more details.
- checkpoint-dir (string/File/Path): temporary directory to store checkpoints in before they're uploaded
- keep-dir-between-checkpoints? (boolean, default true): whether to keep the temporary checkpoint directory between checkpoints
- keep-dir-on-close? (boolean, default false): whether to keep the temporary checkpoint directory when the node shuts down
- retention-policy (map, optional): the retention settings for checkpoints made by the checkpointer; must contain at least one of the following (though both can be provided - see the section below for more info):
  - retain-newer-than (Duration): retain all checkpoints newer than the given duration
  - retain-at-least (integer): retain at least this many checkpoints (i.e. the most recent retain-at-least checkpoints are safe from deletion)
FileSystem Checkpoint Store parameters
- path (required, string/File/Path/URI): path to store checkpoints.
- no-colons-in-filenames? (optional, boolean, defaults to false): whether to replace the colons in checkpoint filenames. This must be true on Windows, which does not accept colons in filenames.
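For example, on a Windows host the filesystem store's EDN configuration might look like this sketch (the path is a placeholder):

:store {:xtdb/module xtdb.checkpoint/->filesystem-checkpoint-store
        :path "C:/xtdb/cp-store"
        :no-colons-in-filenames? true}   ; required on Windows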
Note about using the FileSystem Checkpoint Store
Most cloud providers offer some high-performance block storage facility (e.g. AWS EBS or Google Persistent Disk). These volumes can usually be mounted on compute platforms (e.g. EC2) or containers (e.g. K8s). Often the cloud provider also offers tooling to snapshot those volumes quickly (see EBS snapshots). A possible checkpoint strategy for XTDB nodes running in a cloud environment with such a facility would consist of:
- having [:checkpointer :store :path] point to a filesystem mounted on such a volume
- having the joining XTDB node's [:checkpointer :store :path] point to a filesystem mounted on a snapshot of that volume from the main node (above).
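As a sketch (the mount points below are purely hypothetical), the two nodes' checkpoint stores would then differ only in where the underlying filesystem is mounted from:

;; main node: path on a filesystem backed by the block-storage volume
:store {:xtdb/module xtdb.checkpoint/->filesystem-checkpoint-store
        :path "/mnt/xtdb-checkpoints"}

;; joining node: path on a filesystem mounted from a snapshot of that volume
:store {:xtdb/module xtdb.checkpoint/->filesystem-checkpoint-store
        :path "/mnt/xtdb-checkpoints-snapshot"}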
Using retention-policy on the checkpointer
When configuring the checkpointer module, we can provide a parameter, retention-policy, to have XTDB handle the deletion and retention behaviour of old checkpoints. The config for this will look like the following:
JSON:
{
...
"checkpointer": {
"xtdb/module": "xtdb.checkpoint/->checkpointer",
"store": { ... },
"approx-frequency": "PT6H",
"retention-policy": {
"retain-newer-than": "PT7D",
"retain-at-least": 5
}
}
...
}
Clojure:
{
...
:checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
:store {...}
:approx-frequency (Duration/ofHours 6)
:retention-policy {:retain-newer-than (Duration/ofDays 7)
:retain-at-least 5}}
...
}
EDN:
{
...
:checkpointer {:xtdb/module xtdb.checkpoint/->checkpointer
:store {...}
:approx-frequency "PT6H"
:retention-policy {:retain-newer-than "P7D"
:retain-at-least 5}}
...
}
When passing in retention-policy, we need at least one of retain-newer-than and retain-at-least - though you can provide both. What follows is the behaviour of the checkpointer immediately after it has completed a new checkpoint (a rough sketch of this logic is given at the end of this section):
- If only retain-at-least is provided, we take the list of available checkpoints, keep the latest retain-at-least checkpoints, and delete the rest.
- If only retain-newer-than is provided, we keep all checkpoints newer than the configured Duration and delete all checkpoints older than it.
- If both retain-at-least and retain-newer-than are provided:
  - We start by splitting the list of checkpoints into "safe (from deletion)" and "unsafe (from deletion)" checkpoints - the latest retain-at-least checkpoints are considered "safe" and are always kept.
  - The "unsafe" list is then considered - from it, we keep all checkpoints newer than retain-newer-than and delete the remaining, older checkpoints.
Permissions-wise, ensure that you have the ability to delete objects from the checkpoint store you are using, otherwise the above will not be able to clean up your checkpoints.
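The retention behaviour described above can be sketched roughly as follows. This is an illustrative Clojure sketch, not the actual XTDB implementation, and the :checkpoint-at key is only a stand-in for however the store exposes checkpoint timestamps:

(import '(java.time Instant))

;; `checkpoints` is assumed to be a sequence of maps, newest first, each holding an
;; Instant under :checkpoint-at (an illustrative key, not XTDB's real metadata key).
;; Returns the checkpoints that would be deleted under the given retention policy.
(defn checkpoints-to-delete
  [checkpoints {:keys [retain-at-least retain-newer-than]} ^Instant now]
  (let [;; the newest `retain-at-least` checkpoints are always safe from deletion
        unsafe (drop (or retain-at-least 0) checkpoints)]
    (if retain-newer-than
      ;; of the rest, keep anything newer than the cut-off and delete the older ones
      (let [cut-off (.minus now retain-newer-than)]
        (remove #(.isAfter ^Instant (:checkpoint-at %) cut-off) unsafe))
      ;; only retain-at-least was supplied: everything outside the safe set is deleted
      unsafe)))

For example, with :retain-at-least 5 and :retain-newer-than of seven days, the five most recent checkpoints are always kept, and any checkpoint outside that set is deleted once it is more than seven days old.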