cback agents

A cback agent is a stateless process responsible for executing cback jobs and/or updating job definitions in the job database as different actions occur. At the time of writing, cback provides backup, restore, prune, switch, verify and portal agents.

Below is a basic, high-level diagram of the general layout of a cback system. Agents fall into two main classes: those that orchestrate and update the cback jobs database and the jobs within it, and those that pick jobs from the database and run them as units of work.

screenshot

[note] As the cback jobs documentation page explains, agents that run jobs also update certain fields of those jobs on completion or failure. In this diagram, "update agents" refers specifically to the agents that allow cback operators or users to perform significantly more complex update operations on the cback jobs database.

Backup

The backup agent is responsible for the execution of backup jobs within cback. It picks backup jobs that have entered a pending status, executes them by copying the delta of files in a defined source location into a new snapshot in the repository, and finally updates the job to completed so that a switch agent may schedule it to run again.

Restore

The restore agent is responsible for the execution of restore jobs within cback. It picks restore jobs that have entered a pending status and executes them by copying the files defined in a specific snapshot (and those linked below it) in the repository to a destination location defined within the restore job. Restore jobs are typically one-off: a cback operator adds them manually, and the switch agent does not switch them back to pending.

Prune

The prune agent is responsible for the execution of prune jobs within cback. It picks prune jobs that have entered a pending status and executes them by:

  • checking against a group configuration retention_policy and the age of snapshots in a repository, to determine which snapshots have exceeded their lifespan and must be deleted from the repository.

  • if graceful deletion is enabled for a tag provided by a cback portal SNAPSHOT DELETE HTTP operation, the snapshot is not deleted immediately; instead, it is retained for the period specified by graceful_deletion_retention_period before being permanently removed.

[note] By default, graceful_deletion is enabled in the prune agent at the group level, which protects snapshots in a cback system from accidental scheduled pruning. It must be explicitly disabled to use a retention policy.
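
The two timing checks described above can be sketched in Python. This is an illustration only, not cback's implementation. The function names are hypothetical, and modelling retention_policy as a single maximum age (`max_age`) is an assumption; the real policy may be richer. Only graceful_deletion_retention_period is taken from the text, and the sketch does not model how the two mechanisms interact.

```python
from datetime import datetime, timedelta


def exceeds_lifespan(created: datetime, max_age: timedelta, now: datetime) -> bool:
    """True when a snapshot is older than the assumed maximum age.

    `max_age` is a hypothetical stand-in for whatever retention_policy encodes.
    """
    return now - created > max_age


def graceful_removal_due(delete_requested: datetime,
                         graceful_deletion_retention_period: timedelta,
                         now: datetime) -> bool:
    """True once a snapshot flagged for deletion has been kept for the full grace period.

    `delete_requested` is a hypothetical timestamp recording when deletion was requested.
    """
    return now - delete_requested >= graceful_deletion_retention_period
```

With a 7-day grace period, for example, a snapshot flagged two days ago would still be retained, while one flagged eight days ago would be due for permanent removal.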

Switch

The switch agent is responsible for resetting finished prune, verify and backup jobs so that they may be picked again by their respective agents. It does this by comparing the run_completed field of a given job with an expiration_time parameter set per job type. If a job has 'expired', i.e. run_completed + expiration_time < current_time, the switch agent sets the job back to pending from completed so that it may be picked again.
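
The expiry check can be sketched in a few lines of Python. This is a minimal illustration rather than cback's own code. It treats a job as expired once the current time has passed run_completed plus expiration_time. The function names and the use of `datetime`/`timedelta` values are assumptions; only run_completed, expiration_time and the pending/completed statuses come from the text.

```python
from datetime import datetime, timedelta


def is_expired(run_completed: datetime, expiration_time: timedelta, now: datetime) -> bool:
    """True once the expiration window after the last completed run has elapsed."""
    return run_completed + expiration_time < now


def next_status(status: str, run_completed: datetime,
                expiration_time: timedelta, now: datetime) -> str:
    """Hypothetical helper: reset a completed, expired job to pending; leave others unchanged."""
    if status == "completed" and is_expired(run_completed, expiration_time, now):
        return "pending"
    return status
```

For a job that completed at 12:00 with a 24-hour expiration_time, `next_status` keeps it at completed an hour later and returns pending once more than 24 hours have passed.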

Verify

The verify agent is responsible for validating the integrity of a cback repository. It executes verify jobs by selecting repositories scheduled for verification and applying the following rules:

  • The repository size is compared against the group verify agent configuration parameter full_verify_threshold.

  • If the size is below this threshold, a full verification of the repository is performed.

  • If the size is above this threshold, a partial verification is performed instead.

For partial verification, the agent uses two group verify agent configuration parameters:

  • partial_verify_percentage — defines what fraction of the repository should be verified.
  • partial_verify_threshold — defines an upper limit on the portion size to prevent excessively large or resource-intensive verification runs. If the calculated portion size from partial_verify_percentage exceeds this limit, only partial_verify_threshold worth of data is verified.

The verification logic described above is illustrated in the following diagram:

screenshot
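
The same decision can be written as a short Python function. This is a sketch under stated assumptions, not the agent's actual code. The function name is hypothetical, sizes are assumed to be in bytes, partial_verify_percentage is assumed to be a value between 0 and 100, and a repository exactly at full_verify_threshold is treated as partial because the text does not say which branch applies.

```python
def bytes_to_verify(repo_size: int,
                    full_verify_threshold: int,
                    partial_verify_percentage: float,
                    partial_verify_threshold: int) -> int:
    """Return how much of the repository a verify run should cover."""
    if repo_size < full_verify_threshold:
        # Small repository: verify everything.
        return repo_size
    # Large repository: verify a percentage of it, capped at the upper limit.
    portion = int(repo_size * partial_verify_percentage / 100)
    return min(portion, partial_verify_threshold)
```

With a 100 GiB full_verify_threshold, 10 percent partial verification and a 20 GiB partial_verify_threshold, a 50 GiB repository is verified in full, a 150 GiB repository has 15 GiB verified, and a 500 GiB repository is capped at 20 GiB.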

Portal

The cback portal is an optional agent aimed at operators who wish to interact with cback programmatically. It provides an HTTP(S) REST web server that can be used to create, manipulate and delete jobs / workloads in a cback system. The portal is self-documenting: it serves a FastAPI interface at http(s)://<portal-node-url>:<port>/docs# that lists the methods available to you.

Enabling agents

Just as jobs have an enabled state, so do agents. This is useful for temporarily disabling all agents of a class on a given worker node, as shown below:

# disable all backup agents on a worker node
$ cback backup agent --disable "example reason"
2024-05-17 11:44:17 INFO /usr/lib/python3.9/site-packages/cback/model/agents/agent.py:141 > agents disabled pid=1058614 agent=backup id=0 reason=example reason

# check to see the lock file exists
$ ls /etc/cback/locks
backup-0.disabled

# check that a running agent will not pick jobs by running one ephemerally
$ cback backup agent
2024-05-17 12:09:27 DEBUG /usr/lib/python3.9/site-packages/cback/model/agents/agent.py:109 > agent is disabled pid=1072942 agent=backup id=0 reason=example reason
2024-05-17 12:09:27 DEBUG /usr/lib/python3.9/site-packages/cback/model/agents/agent.py:112 > agent will sleep for 3 secs pid=1072942 agent=backup id=0

# reenable the backup agents
$ cback backup agent --enable
2024-05-17 11:44:28 INFO /usr/lib/python3.9/site-packages/cback/model/agents/agent.py:120 > agents enabled pid=1058626 agent=backup id=0

[warning❗] Disabling a set of agents only stops them from picking new jobs; it does not terminate a job that an agent has already collected and is processing. Keep this in mind before performing any action that might fail the job, such as a systemd unit restart or stop, and wait for the agents to become idle first.
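
The behaviour in the transcript and the warning can be pictured with a small Python sketch. It is not taken from cback's source. The function names are hypothetical, the `<agent>-<id>.disabled` file name pattern is inferred from the `backup-0.disabled` listing, and storing the reason as the lock file's contents is an assumption. The sketch consults the lock only before picking a job, which is why a job already in progress is not interrupted.

```python
from pathlib import Path
from typing import Callable, Optional


def disabled_reason(lock_dir: Path, agent: str, agent_id: int) -> Optional[str]:
    """Return the stored reason if the agent's lock file exists, else None."""
    lock_file = lock_dir / f"{agent}-{agent_id}.disabled"  # pattern inferred, not confirmed
    if not lock_file.exists():
        return None
    return lock_file.read_text().strip()  # assumption: the file holds the reason text


def agent_cycle(lock_dir: Path, agent: str, agent_id: int,
                pick_and_run: Callable[[], None]) -> str:
    """One loop iteration: check the lock, then pick and run a job if enabled."""
    reason = disabled_reason(lock_dir, agent, agent_id)
    if reason is not None:
        # Disabled: skip picking; a real agent would sleep and re-check.
        return f"disabled: {reason}"
    pick_and_run()  # once started, the job runs to the end regardless of the lock
    return "ran"
```

Pointing `lock_dir` at a temporary directory rather than /etc/cback/locks lets the sketch be tried without touching a real worker node.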