Scheduled backups in EKS
Automated EBS snapshots are easy to manage in Kubernetes, but none of the required components come installed by default in EKS.
You’re going to need all of the following:
- VolumeSnapshot CRDs
- Snapshot-controller
- EBS CSI driver
- A default StorageClass that uses EBS CSI
- A default VolumeSnapshotClass that uses EBS CSI
- Snapscheduler
CRDs
You can get the latest CRDs here, in the kubernetes-csi/external-snapshotter repo.
The 3 you need are:
- VolumeSnapshots
- VolumeSnapshotContents
- VolumeSnapshotClasses
You can just apply them with kubectl, but I like to drop them into my Flux repo under an infrastructure subdirectory alongside snapshot-controller.
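If you do go the plain kubectl route, it looks something like this. The tag and file paths are assumptions based on the client/config/crd layout of the external-snapshotter repo, so check them against whichever release you actually pin.

SNAPSHOTTER=https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.2.1/client/config/crd
kubectl apply -f $SNAPSHOTTER/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f $SNAPSHOTTER/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f $SNAPSHOTTER/snapshot.storage.k8s.io_volumesnapshots.yaml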
Once installed you should be able to see them like this…
➜ ~ kubectl get crds | grep volume
volumesnapshotclasses.snapshot.storage.k8s.io 2023-04-04T18:55:30Z
volumesnapshotcontents.snapshot.storage.k8s.io 2023-04-04T18:55:30Z
volumesnapshots.snapshot.storage.k8s.io 2023-04-04T18:55:30Z
➜ ~
snapshot-controller
You can get the latest manifests here (they live under deploy/kubernetes/snapshot-controller in the same external-snapshotter repo).
I basically just drop these into Flux too and let them install into the kube-system namespace. The only other time I interact with them is to check the snapshot-controller logs for any issues creating EBS snapshots; a quick way to do that is shown after the manifest below.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: snapshot-controller
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: snapshot-controller-runner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list", "watch", "create", "update", "patch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshotclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshotcontents"]
    verbs: ["create", "get", "list", "watch", "update", "delete", "patch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshotcontents/status"]
    verbs: ["patch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots/status"]
    verbs: ["update", "patch"]
  # Enable this RBAC rule only when using distributed snapshotting, i.e. when the enable-distributed-snapshotting flag is set to true
  # - apiGroups: [""]
  #   resources: ["nodes"]
  #   verbs: ["get", "list", "watch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: snapshot-controller-role
subjects:
  - kind: ServiceAccount
    name: snapshot-controller
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: snapshot-controller-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: snapshot-controller-leaderelection
  namespace: kube-system
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "watch", "list", "delete", "update", "create"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: snapshot-controller-leaderelection
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: snapshot-controller
    # namespace is required for ServiceAccount subjects
    namespace: kube-system
roleRef:
  kind: Role
  name: snapshot-controller-leaderelection
  apiGroup: rbac.authorization.k8s.io
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: snapshot-controller
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: snapshot-controller
  # the snapshot controller won't be marked as ready if the v1 CRDs are unavailable
  # in #504 the snapshot-controller will exit after around 7.5 seconds if it
  # can't find the v1 CRDs so this value should be greater than that
  minReadySeconds: 15
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: snapshot-controller
    spec:
      serviceAccountName: snapshot-controller
      containers:
        - name: snapshot-controller
          image: registry.k8s.io/sig-storage/snapshot-controller:v6.2.1
          args:
            - "--v=5"
            - "--leader-election=true"
          imagePullPolicy: IfNotPresent
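The manifest above runs two replicas with leader election, so when a snapshot never becomes ready, the controller logs are usually the first place to look:

# both replicas match this label, but only the elected leader does the work
kubectl -n kube-system logs -l app=snapshot-controller --tail=100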
EBS CSI Driver
This has to be installed to make use of VolumeSnapshots. The in-tree EBS driver will keep working for existing volumes, but only PVCs provisioned by the EBS CSI driver can be snapshotted.
I like to do this in Terraform via the EKS add-on. That way the driver is installed whenever I terraform my clusters using the community module.
resource "aws_eks_addon" "aws_ebs_csi_driver" {
cluster_name = var.cluster_name
addon_name = "aws-ebs-csi-driver"
}
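Once the add-on is active, it’s worth confirming the driver actually registered with the cluster. The grep below assumes the add-on’s default naming (ebs-csi-controller and ebs-csi-node pods):

kubectl get csidrivers                          # should list ebs.csi.aws.com
kubectl get pods -n kube-system | grep ebs-csi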
You also need the Amazon-managed IAM policy arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy. Amazon has a great tutorial for granting it via IRSA, but I prefer to keep it simple and attach the EBS CSI policy to the IAM role I pass to my node groups.
resource "aws_iam_role" "nodes" {
name = "${var.cluster_name}-eks-nodes"
force_detach_policies = true
assume_role_policy = jsonencode(
{
Version = "2012-10-17"
Statement = [
{
Sid = "EKSWorkerAssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
Action = "sts:AssumeRole"
}
]
}
)
managed_policy_arns = [
"arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
"arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
"arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
"arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy",
]
}
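If your node role is managed elsewhere and you don’t want to list its policies inline, a standalone policy attachment does the same job. This is just a sketch referencing the aws_iam_role.nodes resource above:

# attach the EBS CSI policy to an existing node role instead of
# managing it through managed_policy_arns
resource "aws_iam_role_policy_attachment" "ebs_csi" {
  role       = aws_iam_role.nodes.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}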
Default StorageClass
You still can’t take snapshots unless your PVCs are using the CSI driver, so let’s replace the default in-tree EBS gp2 storage class with an EBS CSI gp3 storage class.
First, remove the default annotation from the existing gp2 storage class.
➜ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 242d
➜ kubectl annotate sc gp2 storageclass.kubernetes.io/is-default-class="false" --overwrite
storageclass.storage.k8s.io/gp2 annotated
➜ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gp2 kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 242d
➜
Then apply the new storage class or drop it into Flux.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  type: gp3
allowVolumeExpansion: true
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
When you’re done your storage classes should look like this…
➜ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
gp2 kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 242d
gp3 (default) ebs.csi.aws.com Delete WaitForFirstConsumer true 13s
➜
Keep in mind you’ll still have to migrate any existing PVCs from the old storage class to the new CSI-based one, but that’s beyond the scope of this article. Amazon has a guide here.
Default VolumeSnapshotClass
Setting a default VolumeSnapshotClass determines which CSI driver (and deletion policy) is used for any new VolumeSnapshot that doesn’t specify a class.
Just apply this or drop it into Flux.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-csi-aws
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Delete
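At this point you can sanity-check the whole chain with a one-off snapshot before scheduling anything. A minimal sketch; the namespace and PVC name (grafana) are placeholders, so point it at any CSI-backed PVC you actually have:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: manual-test
  namespace: grafana
spec:
  source:
    # placeholder; use one of your own CSI-backed PVCs
    persistentVolumeClaimName: grafana

Because the VolumeSnapshotClass above is the default, there’s no need to set spec.volumeSnapshotClassName; READYTOUSE should flip to true once the EBS snapshot completes.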
Snapscheduler
Snapscheduler is a great way to define cron schedules for creating VolumeSnapshots. From there, the snapshot-controller and EBS CSI driver take over and create the actual EBS snapshots.
First I install a helm release of snapscheduler in its own namespace under the infrastructure folder of my Flux repo.
---
apiVersion: v1
kind: Namespace
metadata:
  name: snapscheduler
---
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: snapscheduler
  namespace: snapscheduler
spec:
  interval: 30m
  url: https://backube.github.io/helm-charts/
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: snapscheduler
  namespace: snapscheduler
spec:
  releaseName: snapscheduler
  targetNamespace: snapscheduler
  interval: 10m
  chart:
    spec:
      chart: snapscheduler
      version: 3.2.0
      sourceRef:
        kind: HelmRepository
        name: snapscheduler
        namespace: snapscheduler
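Once Flux reconciles the release, a quick sanity check that the operator and its CRD are in place:

kubectl get pods -n snapscheduler
kubectl get crds | grep snapscheduler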
Once snapscheduler is installed, I can define schedules in the various application namespaces. In this example I’ll show what I did for daily backups of Grafana; I just drop this into my grafana folder in the apps section of my Flux repo as backups.yaml.
apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: daily
  namespace: grafana
spec:
  retention:
    maxCount: 30
  schedule: "0 0 * * *"
And you’re done. It might seem like a lot, but once you start building all of your clusters with these components preinstalled by Terraform and Flux, adding a scheduled backup for a new app becomes as simple as dropping in a backups.yaml with a cron schedule.
Two days later…
➜ kubectl get VolumeSnapshots -A
NAMESPACE NAME READYTOUSE SOURCEPVC SOURCESNAPSHOTCONTENT RESTORESIZE SNAPSHOTCLASS SNAPSHOTCONTENT CREATIONTIME AGE
grafana grafana-daily-202304040000 grafana 20h
grafana grafana-daily-202304050000 grafana 15h
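And if you want to double-check that the EBS snapshots actually landed on the AWS side, the CLI will show them too (this lists every snapshot your account owns, so filter further as needed):

aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[].{Id:SnapshotId,Started:StartTime,State:State}' \
  --output table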