A lightweight Kubernetes chaos engineering operator with transparent target selection and optional manual approval.
Omen lets you declaratively define chaos experiments against your workloads. Each run:
- Selects a fixed set of target pods (preview)
- Optionally waits for manual approval
- Executes the chaos action against those exact targets
- Records per-target results and a summary
Two CRDs are provided:
- Experiment β defines the schedule, target selector, action, safety limits, and approval policy
- ExperimentRun β a single execution instance created by the controller, holding the target preview, approval state, and results
Curious about what's coming next? Check out our Roadmap to see our plans for advanced target filtering, ChatOps integrations, and more!
Version 0.3.0 introduces a major architectural shift to an opt-in sidecar model for network chaos, replacing the old ephemeral containers approach.
- Namespace Opt-in Required: You must explicitly label target namespaces with
chaos.kreicer.dev/enabled=true. - Sidecar Injection: A mutating webhook now automatically injects the
omen-agentsidecar into pods in enabled namespaces. - Removed Flags: The
ProtectedNamespacesCLI flag and Helm value have been completely removed in favor of the new opt-in label system.
helm install omen oci://ghcr.io/k-krew/charts/omen \
--namespace omen-system \
--create-namespace \
--version <version>To customise the installation:
helm install omen oci://ghcr.io/k-krew/charts/omen \
--namespace omen-system \
--create-namespace \
--version <version> \
--set manager.leaderElect=true \
--set resources.limits.memory=256Mi \
--set manager.agentImage="ghcr.io/k-krew/omen-agent:<version>" \
--set manager.agentPort=9999| Flag | Default | Description |
|---|---|---|
--webhook-timeout |
10s |
Timeout for outgoing approval webhook HTTP requests. |
--leader-elect |
false |
Enable leader election for HA deployments. |
--metrics-bind-address |
0 |
Address for the metrics endpoint (0 disables it). |
--health-probe-bind-address |
:8081 |
Address for liveness/readiness probes. |
--agent-image |
ghcr.io/k-krew/omen-agent:v0.3.1 |
Container image injected as the omen-agent sidecar into target pods. |
--agent-port |
9999 |
Port the agent sidecar listens on. Change if it conflicts with application ports. |
Ready-to-apply YAML manifests live in the examples/ directory:
| File | Description |
|---|---|
delete-pod-once.yaml |
One-shot pod deletion, fixed count |
delete-pod-percent.yaml |
One-shot pod deletion, percentage-based |
delete-pod-repeat-approval.yaml |
Recurring deletion with manual approval and webhook notification |
network-fault-latency.yaml |
Inject 100ms latency + 10ms jitter for 5 minutes |
network-fault-packet-loss.yaml |
Drop 30% of packets for 3 minutes |
network-fault-blackhole.yaml |
Complete network blackhole (100% packet loss) with approval gate |
To approve a pending run:
kubectl patch experimentrun <run-name> \
--type=merge \
-p '{"spec":{"approved":true}}'Deletes the selected pods. Supports force: true for immediate deletion (grace period 0).
Injects network chaos into target pods using Linux Traffic Control (tc netem). The controller sends HTTP requests to the omen-agent sidecar running inside each target pod, which applies and removes the fault. The fault is automatically rolled back after the configured duration.
Prerequisite: The target namespace must be labeled chaos.kreicer.dev/enabled=true so that the sidecar is injected (see Architecture below).
Parameters (spec.action.networkFault):
| Field | Type | Description |
|---|---|---|
latency |
duration | Fixed delay added to outgoing packets (e.g., 100ms). |
jitter |
duration | Random variation on top of latency (e.g., 10ms). Requires latency. |
packetLoss |
integer (1-100) | Percentage of packets to drop. Set to 100 for a full blackhole. |
duration |
duration | How long to hold the fault before automatic rollback. Defaults to 5m. |
At least one of latency or packetLoss must be set.
Omen uses an opt-in sidecar model. Chaos is only allowed in namespaces explicitly labeled with chaos.kreicer.dev/enabled=true. A Mutating Webhook automatically injects the omen-agent sidecar into all new pods in these namespaces.
kubectl label namespace <target-ns> chaos.kreicer.dev/enabled=true
The controller then:
- Selects targets only from pods that live in labeled namespaces.
- For
delete_pod: deletes the pod via the Kubernetes API. - For
network_fault: sends an HTTPPOST /network-faultto the agent sidecar inside the pod to applytcrules, thenDELETE /network-faultafter the duration to roll back.
The omen-agent sidecar requires the NET_ADMIN Linux capability to run tc commands. This means namespaces used for network chaos must allow it via Pod Security Admission:
kubectl label namespace <target-ns> \
pod-security.kubernetes.io/enforce=baselineCommunication between the controller and agents is authenticated with a shared token (generated by Helm and stored in a Kubernetes Secret). The token is automatically injected into each agent sidecar as OMEN_SECRET_TOKEN by the mutating webhook.
A NetworkPolicy is shipped with the Helm chart that restricts ingress to agent sidecars so only the controller pod can reach them.
On startup, the controller performs a TCP connectivity check to the agent image registry. If the registry is unreachable (e.g., in an air-gapped cluster without proper registry credentials), sidecar injection is disabled automatically so that user pods are never blocked by ImagePullBackOff. A warning is logged:
WARNING: agent image registry is not reachable β sidecar injection will be disabled
Individual pods can be excluded from all chaos experiments by adding the annotation chaos.kreicer.dev/ignore: "true". Annotated pods are neither injected with the agent sidecar nor selected as targets.
kubectl annotate pod <pod-name> chaos.kreicer.dev/ignore=trueOr in the pod template:
metadata:
annotations:
chaos.kreicer.dev/ignore: "true"Experiment-level protection is also available via spec.safety.denyNamespaces:
spec:
safety:
denyNamespaces:
- my-critical-namespaceEvery phase transition of an ExperimentRun emits a standard Kubernetes Event on the object:
kubectl describe experimentrun <run-name>Events use Normal type for successful transitions (PreviewGenerated, Approved, Running, Completed) and Warning for failure states (Failed, Expired).
The TOTAL column in kubectl get expruns is populated as soon as targets are selected during the PreviewGenerated phase, so you can see how many pods will be affected before the run executes.
Experiment objects carry a finalizer (chaos.omen.com/finalizer). When an Experiment is deleted, the controller first deletes all owned ExperimentRuns and waits for them to be removed before releasing the finalizer.
ExperimentRuns executing a network_fault action carry an additional finalizer (chaos.omen.com/network-fault). Before the run object is removed, the controller sends DELETE /network-fault to the agent in all targets where the fault was still active, ensuring the network is restored even if the experiment is aborted mid-flight.
Set dryRun: true on the Experiment to preview target selection without executing any action. For delete_pod, no pods are deleted. For network_fault, no HTTP requests are sent to the agent. Results are recorded as Success in both cases.
- Go 1.26+
kubebuilderv4kubectlpointing at a local cluster
# Install CRDs
GOTOOLCHAIN=local make install
# Run the controller locally (uses ~/.kube/config)
GOTOOLCHAIN=local make runThe controller reads POD_NAMESPACE to exclude its own pods from target selection. Set it when running locally:
POD_NAMESPACE=omen-system GOTOOLCHAIN=local make run# Regenerate CRDs and RBAC after editing types
GOTOOLCHAIN=local make manifests generate
# Build the binary
GOTOOLCHAIN=local make build
# Run tests (requires setup-envtest)
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
export KUBEBUILDER_ASSETS=$(setup-envtest use --print path)
go test ./... -v