Isolation Model

Each layer defends against a different class of attack. No single layer is sufficient. Together, they make sandbox escape require simultaneously defeating all eight.

Layer	Kernel primitive	What it prevents
1. PID namespace	`CLONE_NEWPID`	Seeing or signalling host processes
2. Mount namespace	`CLONE_NEWNS` + chroot	Accessing host filesystem
3. Network namespace	`CLONE_NEWNET`	Network access (no sockets, no DNS)
4. Cgroups	cgroup v2 (v1 fallback)	Memory bombs, fork bombs, CPU hogging
5. Seccomp-BPF	`prctl` + BPF	Dangerous syscalls (ptrace, mount, bpf)
6. Capabilities	`capset` + bounding set	Privilege escalation
7. Credential drop	`setresuid`/`setresgid`	Running as root
8. NO_NEW_PRIVS	`prctl(PR_SET_NO_NEW_PRIVS)`	Regaining privileges via setuid binaries

Namespaces (layers 1-3)

Namespaces give the sandbox its own view of the world. The sandboxed process sees PID 1 as itself, an empty network, and a minimal filesystem via chroot.

Cgroups (layer 4)

The enforcer for resource limits. Cgroups are the only mechanism that can actually kill a process for using too much memory - rlimits can only limit virtual memory, not resident memory.

Seccomp-BPF (layer 5)

A BPF program loaded into the kernel that intercepts every syscall. io_uring gets special treatment - it returns ENOSYS instead of killing the process, because some standard libraries probe for it on startup.

Privilege stripping (layers 6-8)

After the sandbox environment is set up, we strip everything:

Drop all capabilities from bounding, ambient, effective, permitted, and inheritable sets
Drop to unprivileged UID/GID from the UID pool (range 60000-60999)
Set NO_NEW_PRIVS - even setuid binaries won’t grant privileges

The order matters. You must drop capabilities before dropping UID. You must set NO_NEW_PRIVS last.