Checkpointing

XTDB nodes can save checkpoints of their query indices on a regular basis, so that new nodes can start to service queries faster.

XTDB nodes that join a cluster have to obtain a local set of query indices before they can service queries. These can be built by replaying the transaction log from the beginning, although this may be slow for clusters with a lot of history. Checkpointing allows the nodes in a cluster to share checkpoints into a central 'checkpoint store', so that nodes joining a cluster can retrieve a recent checkpoint of the query indices, rather than replaying the whole history.

The checkpoint store is a pluggable module, and there are a number of officially supported implementations.

XTDB nodes in a cluster don’t explicitly communicate regarding which one is responsible for creating a checkpoint - instead, they check at random intervals to see whether any other node has recently created a checkpoint, and create one if necessary. The desired frequency of checkpoints can be set using approx-frequency.

The default lifecycle of checkpoints is unbounded - XTDB will not attempt to clean up old checkpoints unless explicitly told to. See Using retention-policy below for how to manage checkpoint lifecycles within the XTDB checkpointer module itself.

The checkpointer only creates a checkpoint if new transactions have been processed since the previous checkpoint. Be careful, therefore, if you apply any external lifecycle policies to your checkpoints: if no new transactions are handled for a period, no new checkpoints will be made, and you may be left with no checkpoints at all and have to replay from the transaction log!

Setting up

You can enable checkpoints on your index-store by adding a :checkpointer dependency to the underlying KV store:

  • JSON

  • Clojure

  • EDN

{
  "xtdb/index-store": {
    "kv-store": {
      "xtdb/module": "xtdb.rocksdb/->kv-store",
      ...
      "checkpointer": {
        "xtdb/module": "xtdb.checkpoint/->checkpointer",
        "store": {
          "xtdb/module": "xtdb.checkpoint/->filesystem-checkpoint-store",
          "path": "/path/to/cp-store"
        },
        "approx-frequency": "PT6H"
      }
    }
  },
  ...
}
{:xtdb/index-store {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                               ...
                               :checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                                              :store {:xtdb/module 'xtdb.checkpoint/->filesystem-checkpoint-store
                                                      :path "/path/to/cp-store"}
                                              :approx-frequency (Duration/ofHours 6)}}}
 ...}
{:xtdb/index-store {:kv-store {:xtdb/module xtdb.rocksdb/->kv-store
                               ...
                               :checkpointer {:xtdb/module xtdb.checkpoint/->checkpointer
                                              :store {:xtdb/module xtdb.checkpoint/->filesystem-checkpoint-store
                                                      :path "/path/to/cp-store"}
                                              :approx-frequency "PT6H"}}}
 ...}

When using a checkpoint store implementation that requires authentication (such as S3), ensure that you have the following permissions available so that the checkpointer can perform all of its operations:

  • Read objects from the system

  • Write objects to the system

  • List objects from the system

  • Delete objects from the system

Checkpointer parameters

  • approx-frequency (required, Duration): approximate frequency for the cluster to save checkpoints (provided the conditions for making a checkpoint have been met, i.e. a new transaction has been processed)

  • store (required, CheckpointStore): see the individual store implementations for more details.

  • checkpoint-dir (string/File/Path): temporary directory to store checkpoints in before they’re uploaded

  • keep-dir-between-checkpoints? (boolean, default true): whether to keep the temporary checkpoint directory between checkpoints

  • keep-dir-on-close? (boolean, default false): whether to keep the temporary checkpoint directory when the node shuts down

  • retention-policy (map, optional): the retention settings for checkpoints made by the checkpointer. Must contain at least one of the following (you can also provide both; see Using retention-policy below for more info):

    • retain-newer-than (Duration): retain all checkpoints that are "newer than" the given duration

    • retain-at-least (integer): retain at least this many checkpoints (i.e. the most recent retain-at-least checkpoints are safe from deletion)
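Putting these parameters together, a checkpointer configuration using the optional directory settings might look like the following EDN sketch (the checkpoint-dir path is a hypothetical placeholder):

```clojure
;; Illustrative EDN fragment - checkpoint-dir path is a placeholder
:checkpointer {:xtdb/module xtdb.checkpoint/->checkpointer
               :store {...}
               :approx-frequency "PT6H"
               ;; stage checkpoints here before upload (hypothetical path)
               :checkpoint-dir "/tmp/xtdb-checkpoint-staging"
               :keep-dir-between-checkpoints? true
               :keep-dir-on-close? false}
```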

FileSystem Checkpoint Store parameters

  • path (required, string/File/Path/URI): path to store checkpoints.

  • no-colons-in-filenames? (optional, boolean, defaults to false): whether to replace the colons in checkpoint filenames. Must be set to true on Windows, which does not accept colons in filenames.
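For example, a filesystem checkpoint store on Windows could be configured as follows (the path is illustrative):

```clojure
;; Illustrative EDN fragment - path is a placeholder
:store {:xtdb/module xtdb.checkpoint/->filesystem-checkpoint-store
        :path "C:/xtdb/cp-store"       ; hypothetical Windows path
        :no-colons-in-filenames? true} ; required on Windows
```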

Note about using FileSystem Checkpoint Store

Most Cloud providers offer some high-performance block storage facility (e.g. AWS EBS or Google Persistent Disk). These volumes can usually be mounted on compute platforms (e.g. EC2) or containers (e.g. K8S). Often the Cloud provider also offers tooling to quickly snapshot those volumes (see EBS snapshots). A possible checkpoint strategy for XTDB nodes running in a cloud environment with such a facility could consist of:

  • having the main XTDB node's [:checkpointer :store :path] point to a filesystem mounted on such a volume

  • having the joining XTDB node's [:checkpointer :store :path] point to a filesystem mounted on a snapshot from the main node (above).
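The two steps above can be sketched as the following EDN fragments. The mount points here are hypothetical: the main node writes checkpoints to the mounted volume, and the joining node reads from a filesystem mounted on a snapshot of that volume.

```clojure
;; Main node - :path on a filesystem backed by a block-storage volume
;; (mount point is a hypothetical placeholder)
:checkpointer {:xtdb/module xtdb.checkpoint/->checkpointer
               :store {:xtdb/module xtdb.checkpoint/->filesystem-checkpoint-store
                       :path "/mnt/xtdb-cp-volume"}
               :approx-frequency "PT6H"}

;; Joining node - :path on a filesystem mounted from a snapshot of the
;; volume above (mount point is a hypothetical placeholder)
:checkpointer {:xtdb/module xtdb.checkpoint/->checkpointer
               :store {:xtdb/module xtdb.checkpoint/->filesystem-checkpoint-store
                       :path "/mnt/xtdb-cp-snapshot"}
               :approx-frequency "PT6H"}
```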

Using retention-policy on the checkpointer

When configuring the checkpointer module, we can provide a parameter, retention-policy, to have XTDB handle the deletion & retention behaviour of old checkpoints. The config for this will look like the following:

  • JSON

  • Clojure

  • EDN

{
  ...
  "checkpointer": {
    "xtdb/module": "xtdb.checkpoint/->checkpointer",
    "store": { ... },
    "approx-frequency": "PT6H",
    "retention-policy": {
      "retain-newer-than": "PT7D",
      "retain-at-least": 5
    }
  }
  ...
}
{
  ...
  :checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                 :store {...}
                 :approx-frequency (Duration/ofHours 6)
                 :retention-policy {:retain-newer-than (Duration/ofDays 7)
                                    :retain-at-least 5}}
  ...
}
{
  ...
  :checkpointer {:xtdb/module xtdb.checkpoint/->checkpointer
                 :store {...}
                 :approx-frequency "PT6H"
                 :retention-policy {:retain-newer-than "P7D"
                                    :retain-at-least 5}}
  ...
}

When passing in retention-policy, we need at least one of retain-newer-than and retain-at-least, though you can provide both. The following describes the behaviour of the checkpointer immediately after a new checkpoint has been created:

  • If only retain-at-least is provided, we will take the list of available checkpoints, keep the latest retain-at-least checkpoints, and delete the rest.

  • If only retain-newer-than is provided, we will keep all checkpoints newer than the configured Duration and delete the rest.

  • If both retain-at-least and retain-newer-than are provided:

    • We start by splitting the list of checkpoints into "safe (from deletion)" and "unsafe (from deletion)" checkpoints - the latest retain-at-least checkpoints will be considered "safe", and will always be kept.

    • The "unsafe" list is then mapped over - from this, we will keep all checkpoints newer than retain-newer-than and delete all of the remaining checkpoints that are older.

Permissions-wise, ensure that you are able to delete objects from the checkpoint store you are using; otherwise the checkpointer will not be able to clean up your old checkpoints.