123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259 |
- [#backup]
- # Backing up Neo4j Containers
- [NOTE]
- **This approach supports Google Cloud, AWS, and Azure storage, and assumes you have credentials and wish
- to store your backups on those cloud storage systems**. If this is not the case, you will need to adjust the backup
- script for your desired cloud storage method, but the approach will work for any backup location.
- [NOTE]
- **This approach works only for Neo4j 4.0+**. The backup tool and the
- DBMS itself changed quite a lot between 3.5 and 4.0, and the approach
- here will likely not work for older databases without substantial
- modification.
- [NOTE]
- If you are upgrading to the helm chart 4.1.3-1 or later from an earlier version, double check
- this documentation; the syntax of using the backup chart has changed a bit to accomodate multiple
- clouds. This documentation applies only to 4.1.3-1 and forward.
- ## Background & Important Information
- ### Required Neo4j Config
- This is provided for you out of the box by the helm chart, but if you
- customize you should bear these requirements in mind:
- * `dbms.backup.enabled=true`
- * `dbms.backup.listen_address=0.0.0.0:6362`
- The default for Neo4j is to listen only on 127.0.0.1, which will not
- work as other containers would not be able to access the backup port.
- ### Backup Storage
- Backups are stored in a temporary local volume before they are uploaded to cloud. By default an ephemeral Kubernetes `emptyDir` Volume is used.
- If the database being backed up is large there might not be sufficient space in local storage. To use alternative storage set `tempVolume` in values.yaml to a different https://kubernetes.io/docs/concepts/storage/volumes[Kubernetes Volume] object.
- ### Backup Pointers
- All backups will turn into .tar.gz files with date strings when they were taken, such as: `neo4j-2020-06-16-12:32:57.tar.gz`. They are named after the database
- they are a backup of.
- When you take a backup, you will get both the dated version, and a "latest" copy,
- e.g. the above file will also be copied to `neo4j/neo4j-latest.tar.gz` in the same bucket.
- [NOTE]
- **Reminder: Each time you take a backup, the latest file will be overwritten**.
- The purpose of doing this is to have a stable name in storage where the latest
- backup can always be found, without losing any of the previous backups.
- ### Neo4j Backs Up Databases, Not the DBMS
- In Neo4j 4.0, the system can be multidatabase; most systems have at least 2 DBs,
- "system" and "neo4j". *These need to be backed up and restored individually*.
- ## Steps to Take a Backup
- ### Use a Service Account to access cloud storage (Google Cloud only)
- **GCP**
- > Workload Identity is the recommended way to access Google Cloud services from applications running within GKE due to its improved security properties and manageability.
- Follow the https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity[GCP instructions] to:
- - Enable Workload Identity on your GKE cluster
- - Create a Google Cloud IAMServiceAccount that has read and write permissions for your backup location
- - Bind the IAMServiceAccount to the Neo4j deployment's Kubernetes ServiceAccount*
- [*] you can configure the name of the Kubernetes ServiceAccount that a Neo4j deployment uses by setting `serviceAccountName` in values.yaml. To check the name of the Kubernetes ServiceAccount that a Neo4j deployment is using run `kubectl get pods -o=jsonpath='{.spec.serviceAccountName}{"\n"}' <your neo4j pod name>`
- If you are unable to use Workload Identity with GKE then you can create a service key secret instead as described in the next section.
- ### Create a service key secret to access cloud storage
- First you want to create a kubernetes secret that contains the content of your account service key. This key must have permissions to access the bucket and backup set that you're trying to restore.
- #### Azure
- - You must create the credential file and this file should look like this:
- ```azure-credentials.sh
- export ACCOUNT_NAME=<NAME_STORAGE_ACCOUNT>
- export ACCOUNT_KEY=<STORAGE_ACCOUNT_KEY>
- ```
- If you're unsure what the account key secret should be, you can recover it with the following command:
- ```
- ACCOUNT_KEY=$(az storage account keys list --resource-group "$AKS_RESOURCE_GROUP" --account-name "$STORAGE_ACCOUNT" --query [0].value -o tsv)
- ```
- - You have to create a secret for this file
- ```shell
- kubectl create secret generic neo4j-azure-credentials \
- --from-file=credentials=azure-credentials.sh
- ```
- #### AWS
- - You must create the credential file and this file should look like this:
- ```aws-credentials
- [default]
- region=
- aws_access_key_id=
- aws_secret_access_key=
- ```
- - You have to create a secret for this file
- ```shell
- kubectl create secret generic neo4j-aws-credentials \
- --from-file=credentials=aws-credentials
- ```
- #### GCP
- You do NOT need to follow the steps in this section if you are using Workload Identity for GCP.
- Download a JSON formatted service key from Google Cloud that looks like this:
- ```gcp-credentials.json
- {
- "type": "",
- "project_id": "",
- "private_key_id": "",
- "private_key": "",
- "client_email": "",
- "client_id": "",
- "auth_uri": "",
- "token_uri": "",
- "auth_provider_x509_cert_url": "",
- "client_x509_cert_url": ""
- }
- ```
- - You have to create a secret for this file
- ```shell
- kubectl create secret generic neo4j-gcp-credentials \
- --from-file=credentials=gcp-credentials.json
- ```
- [NOTE]
- **'--from-file=credentials=<your-config-path>' here is important**; the credentials under the secret must be named `credentials`
- ### Running a Backup
- The backup method is itself a mini-helm chart, and so to run a backup, you just
- do this as a minimal required example:
- [NOTE]
- **This command must be run in 'https://github.com/neo4j-contrib/neo4j-helm/tree/master/tools/backup'**
- **AWS**
- ```shell
- helm install my-neo4j-backup . \
- --set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
- --set bucket=s3://my-bucket \
- --set database="neo4j\,system" \
- --set cloudProvider=aws \
- --set secretName=neo4j-aws-credentials \
- --set jobSchedule="0 */12 * * *"
- ```
- **GCP with Workload Identity**
- ```shell
- helm install my-neo4j-backup . \
- --set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
- --set bucket=gs://my-bucket \
- --set database="neo4j\,system" \
- --set cloudProvider=gcp \
- --set secretName=NULL \
- --set serviceAccountName=my-neo4j-backup-sa \
- --set jobSchedule="0 */12 * * *"
- ```
- **GCP with service key secret**
- ```shell
- helm install my-neo4j-backup . \
- --set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
- --set bucket=gs://my-bucket \
- --set database="neo4j\,system" \
- --set cloudProvider=gcp \
- --set secretName=neo4j-gcp-credentials \
- --set jobSchedule="0 */12 * * *"
- ```
- **Azure**
- ```shell
- helm install my-neo4j-backup . \
- --set neo4jaddr=my-neo4j.default.svc.cluster.local:6362 \
- --set bucket=my-blob-container-name \
- --set database="neo4j\,system" \
- --set cloudProvider=azure \
- --set secretName=neo4j-azure-credentials \
- --set jobSchedule="0 */12 * * *"
- ```
- [NOTE]
- **Special notes for Azure Storage**. The chart requires a "bucket" but for Azure storage,
- the naming is slightly different; the bucket specified is the "blob container name"
- where the files will be placed. Relative paths will be respected; if you set bucket
- to be `container/path/to/directory`, then you will find your backup files stored in
- `container` at the path `/path/to/directory/db/db-latest.tar.gz` where "db" is the
- name of the database being backed up (i.e. neo4j and system).
- If all goes well, after a period of time when the Kubernetes Job is complete, you
- will simply see the backup files appear in the designated bucket, under directories named
- after the databases you backed up.
- [NOTE]
- **If your backup does not appear, consult the job's pod container logs to find out
- why**
- **If you want to get a hot backup before schedule, you can use this command:**
- ```shell
- kubectl create job --from=cronjob/my-neo4j-backup-job neo4j-hot-backup
- ```
- **Required parameters**
- * `neo4jaddr` pointing to an address where your cluster is running, ideally the
- discovery address.
- * `bucket` where you want the backup copied to. It should be `gs://bucketname` or `s3://bucketname`.
- * `databases` a comma separated list of databases to back up. The default is
- `neo4j,system`. If your DBMS has many individual databases, you should change this.
- * `cloudProvider` Which cloud service do you want to keep backups on?(gcp or aws)
- * `jobSchedule` what intervals do you want to take backup? It should be cron like "0 */12 * * *". You can set your own schedule(https://crontab.guru/#0_*/12_*_*_*)
- At least one of `secretName` and `serviceAccountName` must be set.
- * `secretName` the name of the secret you created (set NULL if using Workload Identity on GKE)
- * `serviceAccountName` the name of the Kubernetes ServiceAccount to use for the backup Job (required if using Workload Identity on GKE)
- **Optional environment variables**
- All of the following variables mimic the command line options
- for https://neo4j.com/docs/operations-manual/current/backup/performing/#backup-performing-command[neo4j-admin backup documented here]
- * `pageCache`
- * `heapSize`
- * `fallbackToFull` (true/false), default=true
- * `checkConsistency` (true/false), default=true
- * `checkIndexes` (true/false) default=true
- * `checkGraph` (true/false), default=true
- * `checkLabelScanStore` (true/false), default=true
- * `checkPropertyOwners` (true/false), default=false
- ### Exit Conditions
- If the backup of any of the individual databases mentioned in the database parameters
- fails, the entire container will exit with a non-zero exit code and fail.
- **Note**: it is possible for Neo4j backups to succeed, but with failed consistency checks.
- This will be noted in the logs, but will operationally behave as a successful backup.
|