Scheduled backups in EKS
Automated EBS snapshots are easy to manage in Kubernetes, but none of the required components come installed by default in EKS.
You’re going to need all of the following:
- VolumeSnapshot CRDs
- Snapshot-controller
- EBS CSI driver
- A default StorageClass that uses EBS CSI
- A default VolumeSnapshotClass that uses EBS CSI
- Snapscheduler
CRDs
You can get the latest CRDs here, in the kubernetes-csi/external-snapshotter repo.
The 3 you need are:
- VolumeSnapshots
- VolumeSnapshotContents
- VolumeSnapshotClasses
You can just apply them with kubectl, but I like to drop them into my Flux repo under an infrastructure subdirectory alongside snapshot-controller.
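If you do go the plain kubectl route, it looks something like this. The tag and file paths are assumptions based on the client/config/crd layout of the external-snapshotter repo, so check them against whichever release you actually pin.

SNAPSHOTTER=https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.2.1/client/config/crd
kubectl apply -f $SNAPSHOTTER/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f $SNAPSHOTTER/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f $SNAPSHOTTER/snapshot.storage.k8s.io_volumesnapshots.yaml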
Once installed you should be able to see them like this…
➜ ~ kubectl get crds | grep volume
volumesnapshotclasses.snapshot.storage.k8s.io 2023-04-04T18:55:30Z
volumesnapshotcontents.snapshot.storage.k8s.io 2023-04-04T18:55:30Z
volumesnapshots.snapshot.storage.k8s.io 2023-04-04T18:55:30Z
➜ ~
snapshot-controller
You can get the latest manifests here (they live under deploy/kubernetes/snapshot-controller in the same external-snapshotter repo).
I basically just drop these into Flux too and let them install into the kube-system namespace. The only other time I interact with them is to check the snapshot-controller logs for any issues creating EBS snapshots; a quick way to do that is shown after the manifest below.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: snapshot-controller
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: snapshot-controller-runner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list", "watch", "create", "update", "patch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshotclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshotcontents"]
    verbs: ["create", "get", "list", "watch", "update", "delete", "patch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshotcontents/status"]
    verbs: ["patch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots/status"]
    verbs: ["update", "patch"]
  # Enable this RBAC rule only when using distributed snapshotting, i.e. when the enable-distributed-snapshotting flag is set to true
  # - apiGroups: [""]
  #   resources: ["nodes"]
  #   verbs: ["get", "list", "watch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: snapshot-controller-role
subjects:
  - kind: ServiceAccount
    name: snapshot-controller
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: snapshot-controller-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: snapshot-controller-leaderelection
  namespace: kube-system
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "watch", "list", "delete", "update", "create"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: snapshot-controller-leaderelection
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: snapshot-controller
    # namespace is required for ServiceAccount subjects
    namespace: kube-system
roleRef:
  kind: Role
  name: snapshot-controller-leaderelection
  apiGroup: rbac.authorization.k8s.io
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: snapshot-controller
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: snapshot-controller
  # the snapshot controller won't be marked as ready if the v1 CRDs are unavailable
  # in #504 the snapshot-controller will exit after around 7.5 seconds if it
  # can't find the v1 CRDs so this value should be greater than that
  minReadySeconds: 15
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: snapshot-controller
    spec:
      serviceAccountName: snapshot-controller
      containers:
        - name: snapshot-controller
          image: registry.k8s.io/sig-storage/snapshot-controller:v6.2.1
          args:
            - "--v=5"
            - "--leader-election=true"
          imagePullPolicy: IfNotPresent
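The manifest above runs two replicas with leader election, so when a snapshot never becomes ready, the controller logs are usually the first place to look:

# both replicas match this label, but only the elected leader does the work
kubectl -n kube-system logs -l app=snapshot-controller --tail=100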
EBS CSI Driver
This has to be installed to make use of VolumeSnapshots. The in-tree EBS driver will keep working for existing volumes, but only PVCs provisioned by the EBS CSI driver can be snapshotted.
I like to do this in Terraform via the EKS add-on. That way the driver is installed whenever I terraform my clusters using the community module.
resource "aws_eks_addon" "aws_ebs_csi_driver" {
cluster_name = var.cluster_name
addon_name = "aws-ebs-csi-driver"
}
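Once the add-on is active, it’s worth confirming the driver actually registered with the cluster. The grep below assumes the add-on’s default naming (ebs-csi-controller and ebs-csi-node pods):

kubectl get csidrivers                          # should list ebs.csi.aws.com
kubectl get pods -n kube-system | grep ebs-csi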
You also need the Amazon-managed IAM policy arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy. Amazon has a great tutorial for granting it via IRSA, but I prefer to keep it simple and attach the EBS CSI policy to the IAM role I pass to my node groups.
resource "aws_iam_role" "nodes" {
name = "${var.cluster_name}-eks-nodes"
force_detach_policies = true
assume_role_policy = jsonencode(
{
Version = "2012-10-17"
Statement = [
{
Sid = "EKSWorkerAssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
Action = "sts:AssumeRole"
}
]
}
)
managed_policy_arns = [
"arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
"arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
"arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
"arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy",
]
}
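If your node role is managed elsewhere and you don’t want to list its policies inline, a standalone policy attachment does the same job. This is just a sketch referencing the aws_iam_role.nodes resource above:

# attach the EBS CSI policy to an existing node role instead of
# managing it through managed_policy_arns
resource "aws_iam_role_policy_attachment" "ebs_csi" {
  role       = aws_iam_role.nodes.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}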
Default StorageClass
You still can’t take snapshots unless your PVCs are using the CSI driver, so let’s replace the default in-tree EBS gp2 storage class with an EBS CSI gp3 storage class.
First, remove the default annotation from the existing gp2 storage class.
➜ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 242d
➜ kubectl annotate sc gp2 storageclass.kubernetes.io/is-default-class="false" --overwrite
storageclass.storage.k8s.io/gp2 annotated
➜ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gp2 kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 242d
➜
Then apply the new storage class or drop it into Flux.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  type: gp3
allowVolumeExpansion: true
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
When you’re done your storage classes should look like this…
➜ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gp2 kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 242d
gp3 (default) ebs.csi.aws.com Delete WaitForFirstConsumer true 13s
➜
Keep in mind you’ll still have to migrate any existing PVCs from the old storage class to the new CSI-based one, but that’s beyond the scope of this article. Amazon has a guide here.
Default VolumeSnapshotClass
Setting a default VolumeSnapshotClass determines which CSI driver (and deletion policy) is used for any new VolumeSnapshot that doesn’t specify a class.
Just apply this or drop it into Flux.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-csi-aws
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Delete
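At this point you can sanity-check the whole chain with a one-off snapshot before scheduling anything. A minimal sketch; the namespace and PVC name (grafana) are placeholders, so point it at any CSI-backed PVC you actually have:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: manual-test
  namespace: grafana
spec:
  source:
    # placeholder; use one of your own CSI-backed PVCs
    persistentVolumeClaimName: grafana

Because the VolumeSnapshotClass above is the default, there’s no need to set spec.volumeSnapshotClassName; READYTOUSE should flip to true once the EBS snapshot completes.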
Snapscheduler
Snapscheduler is a great way to define cron schedules for creating VolumeSnapshots. From there, the snapshot-controller and EBS CSI driver take over and create the actual EBS snapshots.
First I install a helm release of snapscheduler in its own namespace under the infrastructure folder of my Flux repo.
---
apiVersion: v1
kind: Namespace
metadata:
  name: snapscheduler
---
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: snapscheduler
  namespace: snapscheduler
spec:
  interval: 30m
  url: https://backube.github.io/helm-charts/
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: snapscheduler
  namespace: snapscheduler
spec:
  releaseName: snapscheduler
  targetNamespace: snapscheduler
  interval: 10m
  chart:
    spec:
      chart: snapscheduler
      version: 3.2.0
      sourceRef:
        kind: HelmRepository
        name: snapscheduler
        namespace: snapscheduler
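Once Flux reconciles the release, a quick sanity check that the operator and its CRD are in place:

kubectl get pods -n snapscheduler
kubectl get crds | grep snapscheduler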
Once snapscheduler is installed, I can define schedules in the various application namespaces. In this example I’ll show what I did for daily backups of Grafana; I just drop this into my grafana folder in the apps section of my Flux repo as backups.yaml.
apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: daily
  namespace: grafana
spec:
  retention:
    maxCount: 30
  schedule: "0 0 * * *"
And you’re done. It might seem like a lot, but once you start building all of your clusters with these components preinstalled by Terraform and Flux, adding a scheduled backup for a new app becomes as simple as dropping in a backups.yaml with a cron schedule.
Two days later…
➜ kubectl get VolumeSnapshots -A
NAMESPACE NAME READYTOUSE SOURCEPVC SOURCESNAPSHOTCONTENT RESTORESIZE SNAPSHOTCLASS SNAPSHOTCONTENT CREATIONTIME AGE
grafana grafana-daily-202304040000 grafana 20h
grafana grafana-daily-202304050000 grafana 15h
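And if you want to double-check that the EBS snapshots actually landed on the AWS side, the CLI will show them too (this lists every snapshot your account owns, so filter further as needed):

aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[].{Id:SnapshotId,Started:StartTime,State:State}' \
  --output table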