
Drift Detection

Drift is the gap between what you declared in source and what’s actually deployed. Most IaC tools track this through an authoritative state file — chant doesn’t. This page explains the model, the diff vocabulary, and when (and when not) to invest in continuous drift detection. For where this sits in the broader model — the dial from observe to reconcile to authoritative — see Lifecycle Models.

chant lifecycle snapshot <env> writes a record of what was observed in the cloud at a point in time. It’s stored on a chant/lifecycle orphan branch in your repo. There’s no central state server, no lock file, no encrypted backend.

This is the opposite of how Terraform, Pulumi, and CDK/CloudFormation think about state. Those tools treat state as authoritative — the state file is the truth, and apply reconciles the cloud to match. The state file must be locked during writes, secured against tampering, and protected from corruption because deployments depend on it.

chant’s snapshots are observational. They’re a forensic record: “this is what we saw at 10:42 UTC on Tuesday.” Snapshots don’t drive deployments; they exist to be diffed against. Four consequences follow:

| Authoritative state (Terraform, Pulumi, CDK) | Observational state (chant) |
| --- | --- |
| State must be locked; concurrent writes corrupt | A stale snapshot is inconvenient, not dangerous |
| Sensitive data leaks into state file | Snapshots record metadata only; no secrets |
| Bad state breaks deploys | Bad snapshot is a record problem; deploys query live APIs |
| apply knows exactly what to change | No automatic plan/apply; agents verify before acting |

For the broader trade-off comparison, see State: authoritative vs. observational in the comparison guide.

Drift can mean two structurally different things, and chant tracks them through two different plugin contracts:

Resources are 1:1 cloud equivalents of declared chant entities — an AWS CloudFormation resource, a K8s Deployment, an ARM resource group, a Temporal namespace. Each declaration has a name, the cloud version has a name, and they’re correlated. Lexicons that fit this model implement describeResources() (entity-keyed: “look up the live state of these declared things”).

Artifacts are runtime concepts created by tooling outside chant’s entity model. A Helm release isn’t declared in chant — chant declares Chart.yaml + templates, and helm install later creates the release. Same story for Docker containers. Lexicons that fit this model implement listArtifacts() (context-keyed: “tell me what artifacts exist in this environment right now”).

The diff engine treats them differently because they are different. Resources have a “declared” axis to compare against; artifacts don’t.
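The two contracts can be pictured as two optional methods on a lexicon. This is a minimal sketch, not chant's actual plugin API: only the method names describeResources() and listArtifacts() come from the text above. The type shapes, the synchronous signatures, the fakeCloud stand-in, and the observationCoverage helper are all assumptions made for illustration (a real implementation would presumably be asynchronous).

```typescript
// Hypothetical shapes for illustration; only the two method names come from the text.
interface ObservedResource {
  status: string;
  physicalId?: string;
  attributes: Record<string, string>;
}

interface ObservedArtifact {
  id: string;
  metadata: Record<string, string>;
}

interface Lexicon {
  name: string;
  // Entity-keyed: look up the live state of these declared things.
  describeResources?(declaredNames: string[]): Record<string, ObservedResource>;
  // Context-keyed: report what artifacts exist in this environment right now.
  listArtifacts?(env: string): ObservedArtifact[];
}

// Stand-in for a cloud API, keyed by resource name (assumed data).
const fakeCloud: Record<string, ObservedResource> = {
  web: { status: "ACTIVE", physicalId: "res-123", attributes: { replicas: "3" } },
};

const resourceLexicon: Lexicon = {
  name: "example-resources",
  describeResources(declaredNames: string[]) {
    const found: Record<string, ObservedResource> = {};
    for (const name of declaredNames) {
      const live = fakeCloud[name];
      if (live) found[name] = live; // names absent from the cloud are simply omitted
    }
    return found;
  },
};

const artifactLexicon: Lexicon = {
  name: "example-artifacts",
  listArtifacts(env: string) {
    // The caller supplies only a context; the lexicon reports whatever exists there.
    return [{ id: env + "/release-1", metadata: { chart: "demo" } }];
  },
};

// Which observation contracts does a lexicon implement?
function observationCoverage(lexicon: Lexicon): { resources: boolean; artifacts: boolean } {
  return {
    resources: typeof lexicon.describeResources === "function",
    artifacts: typeof lexicon.listArtifacts === "function",
  };
}
```

The difference in input is the point: the resource contract is handed the declared names and answers per name, while the artifact contract is handed only an environment and enumerates what it finds.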

For the implementer-side walkthrough, see Implementing Observation.

chant lifecycle diff <env> --live returns ten categories — six for resources, four for artifacts.

The diff engine compares three axes: what’s declared in source now, what was observed in the previous snapshot, and what the live API reports right now.

| Category | Declared now | In last snapshot | Observed now | Meaning |
| --- | --- | --- | --- | --- |
| missing | yes | | no | Declared but not in cloud: never deployed, manually deleted, or stack rolled back |
| orphan | no | | yes | In cloud but not declared: manual creation, untracked tooling, or imported-pending |
| disappeared | | yes | no | Was there at last snapshot, gone now |
| newly observed | | no | yes | Observed now, not in any prior snapshot |
| drifted | | yes | yes | Present in both, but status, physicalId, or attributes.* changed |
| unchanged | | yes | yes | Present in both, metadata identical |

Drifted entries include attribute-level deltas so you can see what changed, not just that something changed.
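The six resource categories can be sketched as a pure function over the three axes. This is an illustration under stated assumptions, not chant's diff engine: the Observed, Delta, and classifyResources names are hypothetical, only a single prior snapshot is consulted, and each category definition is evaluated independently, so one name may land in more than one category (the text does not say how chant handles that overlap).

```typescript
// Hypothetical types; the compared fields (status, physicalId, attributes.*) come from the text.
type Observed = { status: string; physicalId?: string; attributes: Record<string, string> };
type Category = "missing" | "orphan" | "disappeared" | "newly observed" | "drifted" | "unchanged";
type Delta = { field: string; before: string | undefined; after: string | undefined };

// Field-level differences between the snapshot view and the live view of one resource.
function deltas(before: Observed, now: Observed): Delta[] {
  const out: Delta[] = [];
  if (before.status !== now.status) {
    out.push({ field: "status", before: before.status, after: now.status });
  }
  if (before.physicalId !== now.physicalId) {
    out.push({ field: "physicalId", before: before.physicalId, after: now.physicalId });
  }
  const keys = Object.keys(before.attributes).concat(Object.keys(now.attributes));
  for (const key of keys.filter((k, i) => keys.indexOf(k) === i)) {
    const b = before.attributes[key];
    const a = now.attributes[key];
    if (b !== a) out.push({ field: "attributes." + key, before: b, after: a });
  }
  return out;
}

function classifyResources(
  declared: string[],
  snapshot: Record<string, Observed>,
  live: Record<string, Observed>,
): { categories: Record<Category, string[]>; details: Record<string, Delta[]> } {
  const categories: Record<Category, string[]> = {
    missing: [], orphan: [], disappeared: [], "newly observed": [], drifted: [], unchanged: [],
  };
  const details: Record<string, Delta[]> = {};
  const all = declared.concat(Object.keys(snapshot), Object.keys(live));
  for (const name of all.filter((n, i) => all.indexOf(n) === i)) {
    const isDeclared = declared.indexOf(name) !== -1;
    const before = snapshot[name];
    const now = live[name];
    if (isDeclared && !now) categories.missing.push(name);
    if (!isDeclared && now) categories.orphan.push(name);
    if (before && !now) categories.disappeared.push(name);
    if (!before && now) categories["newly observed"].push(name);
    if (before && now) {
      const changed = deltas(before, now);
      if (changed.length === 0) {
        categories.unchanged.push(name);
      } else {
        categories.drifted.push(name);
        details[name] = changed; // attribute-level deltas for drifted entries
      }
    }
  }
  return { categories, details };
}
```

For example, a resource that is not declared, absent from the snapshot, and present live is reported by this sketch as both an orphan and newly observed.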

An orphan is a resource in the cloud that source doesn’t know about. Detection is the first position on the dial; resolving it is the next. There are two moves:

  • Adopt it into source. Regenerate the resource as chant TypeScript with live import: chant import --from <env> --name <orphan>. The orphan stops being a surprise and starts being declared — the ReconcileOp workflow automates this as a PR.
  • Delete it. Only ever for a chant-owned orphan — one carrying the ownership marker. A foreign orphan (no marker) is never auto-deleted; it escalates to adopt-or-review. chant lifecycle plan classifies which is which, and ApplyOp deletes only the owned ones.
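The owned-versus-foreign rule above can be written as a small guard. This is a sketch only: the text says an ownership marker exists but not what form it takes, so the OWNERSHIP_TAG key, the Orphan shape, and the allowedMoves function are hypothetical and do not describe how chant lifecycle plan is implemented.

```typescript
// Hypothetical: the marker is assumed here to be a tag with this key.
const OWNERSHIP_TAG = "chant:owned";

type Orphan = { name: string; tags: Record<string, string> };
type Move = "adopt" | "delete" | "review";

// Deletion is offered only for orphans that carry the ownership marker.
// A foreign orphan (no marker) escalates to adopt-or-review instead.
function allowedMoves(orphan: Orphan): Move[] {
  const owned = orphan.tags[OWNERSHIP_TAG] === "true";
  return owned ? ["adopt", "delete"] : ["adopt", "review"];
}
```

Keeping "delete" out of the foreign branch entirely means no later step can auto-delete a resource chant did not create.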

For artifacts there’s no “declared” axis, so the engine just compares now-vs-then:

| Category | In last snapshot | Observed now | Meaning |
| --- | --- | --- | --- |
| artifacts added | no | yes | Newly created in the cloud since last snapshot |
| artifacts removed | yes | no | Existed at last snapshot, gone now |
| artifacts changed | yes | yes | Present in both, metadata changed |
| artifacts unchanged | yes | yes | Present in both, metadata identical |
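With only two axes, the artifact comparison reduces to a set difference plus a metadata check. The sketch below assumes artifacts are keyed by an id and carry flat string metadata; the ArtifactSet type and diffArtifacts function are hypothetical names, not chant's API.

```typescript
// Hypothetical shape: artifact id -> flat metadata.
type ArtifactSet = Record<string, Record<string, string>>;

function sameMetadata(a: Record<string, string>, b: Record<string, string>): boolean {
  const keys = Object.keys(a).concat(Object.keys(b));
  return keys.every((k) => a[k] === b[k]);
}

// Compare the last snapshot ("then") with the live listing ("now").
function diffArtifacts(
  then: ArtifactSet,
  now: ArtifactSet,
): { added: string[]; removed: string[]; changed: string[]; unchanged: string[] } {
  const result = {
    added: [] as string[],
    removed: [] as string[],
    changed: [] as string[],
    unchanged: [] as string[],
  };
  for (const id of Object.keys(now)) {
    const before = then[id];
    if (!before) result.added.push(id);
    else if (sameMetadata(before, now[id])) result.unchanged.push(id);
    else result.changed.push(id);
  }
  for (const id of Object.keys(then)) {
    if (!now[id]) result.removed.push(id);
  }
  return result;
}
```

Unlike the resource case, the four outcomes here are mutually exclusive, because there is no third axis for an artifact to disagree with.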

The same lexicon may emit both resources and artifacts. For instance, a future K8s lexicon could report Deployments as resources and in-cluster Pods (created by the Deployment, not by chant) as artifacts. lifecycle diff --live shows them in separate sections.

Drift detection is most valuable in environments where the gap between source and reality has a real chance of opening.

It helps when:

  • Multiple humans have cluster/cloud access. Someone scaled an ASG by hand to ride out an incident, didn’t get back to update source — drift detection catches the gap on the next snapshot.
  • Coordination across teams. Platform team owns the VPC, app teams own services. A platform-side change (subnet CIDR, tag policy) shows up as drift in app-team snapshots before it surfaces as a deploy failure.
  • Long-running infrastructure between change windows. Anything declared once and expected to stay put — IAM roles, Cloud DNS zones, KMS keys, Temporal namespaces. The longer the gap between intentional changes, the higher the chance of out-of-band ones.
  • Audit and incident timelines. Snapshots in git give you a forensic record: “the bucket policy was permissive on Tuesday morning, restrictive by Wednesday afternoon.” Useful at compliance review time.

It doesn’t help when:

  • Every deploy is a full teardown + redeploy. If the env is rebuilt from scratch each release, there’s no surface for drift to accumulate on.
  • Single operator, ephemeral envs. A solo developer’s dev cluster that’s destroyed nightly doesn’t need a drift cron.
  • Stateless apps with no persistent infra. If the only thing chant manages is an ECS task definition that’s redeployed on every CI run, drift detection adds noise without signal.
  • Tools outside chant own the resource. If a third-party operator owns and reconciles a CRD, observed drift will be the operator’s normal behavior, not a problem.

The pragmatic test: would you act on a drift signal if it fired right now? If yes, snapshot+diff is worth the cost. If no, skip it for that env.

chant’s observational model is a deliberate choice, not a missing feature. The costs:

  • No automatic remediation. Drift tells you something changed; it doesn’t snap the cloud back. That’s an agent or a human decision because the right answer is domain-specific.
  • No automatic plan/apply. Without authoritative state, chant can’t compute a precise change set the way Terraform does. Diff output is informational; the apply step is your existing tooling.
  • No locking. Two operators snapshotting the same env at the same time race on the orphan branch. The push uses --force-with-lease so the second writer fails fast rather than silently overwriting (see Concurrent snapshots).
  • Coverage is per-lexicon. A lexicon without describeResources() / listArtifacts() is warn-skipped. The Runtime observation coverage matrix lists where coverage exists today.
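The fail-fast behavior in the "No locking" point is a compare-and-swap on the branch tip: a push is accepted only if the remote still points where the writer last saw it. The in-memory model below illustrates that semantics of git's --force-with-lease; the RemoteBranch class is a hypothetical stand-in, not chant code, and it does not run git.

```typescript
// In-memory stand-in for the remote orphan branch (hypothetical model).
class RemoteBranch {
  private tip: string | null = null;

  read(): string | null {
    return this.tip;
  }

  // Accept the push only if the branch still points where the writer last saw it.
  pushWithLease(expectedTip: string | null, newTip: string): boolean {
    if (this.tip !== expectedTip) return false; // someone else pushed first: fail fast
    this.tip = newTip;
    return true;
  }
}

// Two operators snapshot the same env at the same time.
const remote = new RemoteBranch();
const seenByA = remote.read();
const seenByB = remote.read();

const firstPush = remote.pushWithLease(seenByA, "snapshot-a");  // accepted
const secondPush = remote.pushWithLease(seenByB, "snapshot-b"); // rejected, nothing overwritten

// The second writer re-reads the tip and retries on top of it.
const retryPush = remote.pushWithLease(remote.read(), "snapshot-b");
```

The rejected writer loses nothing but time, which is consistent with a stale snapshot being inconvenient rather than dangerous.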

Read these trade-offs in context in How chant compares.