Isolation Model
Each layer defends against a different class of attack, and no single layer is sufficient. Together, they mean a sandbox escape has to defeat all eight at once.
| Layer | Kernel primitive | What it prevents |
|---|---|---|
| 1. PID namespace | CLONE_NEWPID | Seeing or signalling host processes |
| 2. Mount namespace | CLONE_NEWNS + chroot | Accessing host filesystem |
| 3. Network namespace | CLONE_NEWNET | Network access (no usable interfaces, no DNS) |
| 4. Cgroups | cgroup v2 (v1 fallback) | Memory bombs, fork bombs, CPU hogging |
| 5. Seccomp-BPF | prctl + BPF | Dangerous syscalls (ptrace, mount, bpf) |
| 6. Capabilities | capset + bounding set | Privilege escalation |
| 7. Credential drop | setresuid/setresgid | Running as root |
| 8. NO_NEW_PRIVS | prctl(PR_SET_NO_NEW_PRIVS) | Regaining privileges via setuid binaries |
Namespaces (layers 1-3)
Namespaces give the sandbox its own view of the world. The sandboxed process sees itself as PID 1, an empty network, and a minimal filesystem via chroot.
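As a rough illustration of how the three namespace flags from the table combine, here is a minimal Python sketch. The flag values come from `<linux/sched.h>`; the helper names `namespace_flags` and `enter_namespaces` are hypothetical and not part of the sandbox itself, and the real setup may use `clone` rather than `unshare`.

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
CLONE_NEWNS = 0x00020000   # mount namespace
CLONE_NEWPID = 0x20000000  # PID namespace
CLONE_NEWNET = 0x40000000  # network namespace


def namespace_flags() -> int:
    """Combine the flags for layers 1-3 into one bitmask (hypothetical helper)."""
    return CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET


def enter_namespaces() -> None:
    """Call unshare(2) through libc. Linux only; needs CAP_SYS_ADMIN.

    With CLONE_NEWPID the caller stays in its old PID namespace;
    the first child forked afterwards becomes PID 1 in the new one.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_flags()) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Calling `enter_namespaces()` without privileges is expected to fail with `EPERM`, so the sketch keeps the flag computation separate from the syscall.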
Cgroups (layer 4)
The enforcer for resource limits. Cgroups are the only mechanism that can actually kill a process for using too much memory - rlimits can only cap virtual address space (RLIMIT_AS), not resident memory.
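To make the three limits from the table concrete, the following sketch writes cgroup v2 control files into an existing cgroup directory. The file names (`memory.max`, `pids.max`, `cpu.max`, `cgroup.procs`) are the standard cgroup v2 interface; the function names, the directory layout, and the example values are assumptions for illustration, not the sandbox's actual configuration.

```python
import os


def write_limits(cgroup_dir: str, memory_bytes: int, max_pids: int,
                 cpu_quota_us: int, cpu_period_us: int = 100_000) -> None:
    """Write cgroup v2 limits into an existing cgroup directory (hypothetical helper)."""
    settings = {
        "memory.max": str(memory_bytes),                # memory bombs
        "pids.max": str(max_pids),                      # fork bombs
        "cpu.max": f"{cpu_quota_us} {cpu_period_us}",   # CPU hogging
    }
    for name, value in settings.items():
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(value)


def add_process(cgroup_dir: str, pid: int) -> None:
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

For example, `write_limits(path, 64 * 1024 * 1024, 32, 50_000)` would cap the group at 64 MiB of memory, 32 processes, and half of one CPU, assuming `path` is a cgroup v2 directory with those controllers enabled.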
Seccomp-BPF (layer 5)
A BPF program loaded into the kernel that intercepts every syscall. io_uring gets special treatment - it returns ENOSYS instead of killing the process, because some standard libraries probe for it on startup.
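The policy described above can be modelled as a mapping from syscall name to seccomp return action. This is only a sketch of the decision logic, not the BPF program itself: the action constants are from `<linux/seccomp.h>`, but the two syscall sets, the `action_for` name, and the allow-by-default fallback are assumptions, since this page does not list the full filter.

```python
import errno

# Action values from <linux/seccomp.h>.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000   # low 16 bits carry the errno value
SECCOMP_RET_ALLOW = 0x7FFF0000

# Assumed sets for illustration; the real filter may cover more syscalls.
KILLED = {"ptrace", "mount", "bpf"}
PROBED = {"io_uring_setup", "io_uring_enter", "io_uring_register"}


def action_for(syscall: str) -> int:
    """Return the seccomp action for a syscall name (hypothetical helper)."""
    if syscall in PROBED:
        # Fail softly so libraries that probe for io_uring fall back.
        return SECCOMP_RET_ERRNO | errno.ENOSYS
    if syscall in KILLED:
        return SECCOMP_RET_KILL_PROCESS
    # Assumption: anything not listed is allowed.
    return SECCOMP_RET_ALLOW
```

Returning an errno rather than killing lets a runtime that probes for io_uring at startup see an ordinary "not implemented" failure and continue with its fallback path.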
Privilege stripping (layers 6-8)
After the sandbox environment is set up, we strip everything:
- Drop all capabilities from bounding, ambient, effective, permitted, and inheritable sets
- Drop to unprivileged UID/GID from the UID pool (range 60000-60999)
- Set NO_NEW_PRIVS - even setuid binaries won’t grant privileges
The order matters. You must drop capabilities before dropping UID. You must set NO_NEW_PRIVS last.
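The UID pool mentioned in the list above could be managed along the following lines. Only the range 60000-60999 comes from this page; the `UidPool` class, the one-UID-per-sandbox policy, and the lowest-first allocation order are assumptions made for the sake of a small, self-contained sketch.

```python
class UidPool:
    """Hands out UIDs from a fixed range, one per sandbox (hypothetical)."""

    def __init__(self, first: int = 60000, last: int = 60999) -> None:
        # Stored in descending order so pop() yields the lowest UID first.
        self._free = list(range(last, first - 1, -1))
        self._in_use = set()

    def acquire(self) -> int:
        """Reserve a UID for a new sandbox."""
        if not self._free:
            raise RuntimeError("UID pool exhausted")
        uid = self._free.pop()
        self._in_use.add(uid)
        return uid

    def release(self, uid: int) -> None:
        """Return a UID to the pool once its sandbox has exited."""
        if uid not in self._in_use:
            raise ValueError(f"UID {uid} is not in use")
        self._in_use.remove(uid)
        self._free.append(uid)
```

Giving each sandbox a distinct UID means two concurrent sandboxes cannot signal or inspect each other through shared ownership, which is one plausible reason for using a pool rather than a single unprivileged user.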