Isolation Model

Each layer defends against a different class of attack. No single layer is sufficient. Together, they mean that escaping the sandbox requires defeating all eight at once.

8-layer isolation model
| Layer | Kernel primitive | What it prevents |
| --- | --- | --- |
| 1. PID namespace | CLONE_NEWPID | Seeing or signalling host processes |
| 2. Mount namespace | CLONE_NEWNS + chroot | Accessing host filesystem |
| 3. Network namespace | CLONE_NEWNET | Network access (no sockets, no DNS) |
| 4. Cgroups | cgroup v2 (v1 fallback) | Memory bombs, fork bombs, CPU hogging |
| 5. Seccomp-BPF | prctl + BPF | Dangerous syscalls (ptrace, mount, bpf) |
| 6. Capabilities | capset + bounding set | Privilege escalation |
| 7. Credential drop | setresuid/setresgid | Running as root |
| 8. NO_NEW_PRIVS | prctl(PR_SET_NO_NEW_PRIVS) | Regaining privileges via setuid binaries |

Namespaces give the sandbox its own view of the world. The sandboxed process sees itself as PID 1, with an empty network and a minimal filesystem provided by chroot.
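The namespace setup described above could be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the function names `namespace_flags`, `enter_namespaces`, and `spawn_in_sandbox` are hypothetical, and the sketch uses `unshare(2)` through ctypes, whereas the real implementation may pass the same flags to `clone(2)`. Only the three `CLONE_NEW*` flags and the use of chroot come from the table.

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
CLONE_NEWNS = 0x00020000   # mount namespace
CLONE_NEWPID = 0x20000000  # PID namespace
CLONE_NEWNET = 0x40000000  # network namespace


def namespace_flags():
    """Combine the three namespace flags listed in the table."""
    return CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET


def enter_namespaces():
    """Detach from the host namespaces (needs root or CAP_SYS_ADMIN)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_flags()) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def spawn_in_sandbox(root, argv):
    """Hypothetical helper: run argv inside the new namespaces under `root`."""
    enter_namespaces()
    # unshare(CLONE_NEWPID) does not move the caller; the first child
    # forked afterwards becomes PID 1 of the new PID namespace.
    pid = os.fork()
    if pid == 0:
        try:
            os.chroot(root)
            os.chdir("/")
            os.execv(argv[0], argv)
        finally:
            os._exit(127)  # only reached if chroot or exec failed
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

Calling `spawn_in_sandbox` requires privileges and a prepared root directory, so it is best tried inside a disposable VM or container.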

Cgroups are the enforcer for resource limits. They are the only mechanism that can actually kill a process for using too much memory: rlimits can only limit virtual memory (RLIMIT_AS), not resident memory.
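As a minimal sketch of how such limits are applied under cgroup v2, the helpers below write to the standard interface files `memory.max`, `pids.max`, `cpu.max`, and `cgroup.procs`. The function names, the cgroup path, and the limit values in the usage note are assumptions made for illustration; the v1 fallback is not shown.

```python
from pathlib import Path


def apply_limits(cgroup_dir, memory_bytes, max_pids,
                 cpu_quota_us, cpu_period_us=100_000):
    """Write memory, process-count, and CPU limits into a cgroup v2 directory."""
    cg = Path(cgroup_dir)
    cg.mkdir(parents=True, exist_ok=True)
    # Hard cap on memory: the kernel OOM-kills inside the group above this.
    (cg / "memory.max").write_text(f"{memory_bytes}\n")
    # Cap on the number of tasks, which stops fork bombs.
    (cg / "pids.max").write_text(f"{max_pids}\n")
    # CPU bandwidth: at most `quota` microseconds of CPU per `period`.
    (cg / "cpu.max").write_text(f"{cpu_quota_us} {cpu_period_us}\n")


def add_process(cgroup_dir, pid):
    """Move a process into the cgroup so the limits apply to it."""
    (Path(cgroup_dir) / "cgroup.procs").write_text(f"{pid}\n")
```

For example, `apply_limits("/sys/fs/cgroup/sandbox-1", 256 * 1024 * 1024, 64, 50_000)` would allow 256 MiB of memory, 64 tasks, and half of one CPU, assuming the path exists on a cgroup v2 mount and the caller may write to it.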

The seccomp filter is a BPF program loaded into the kernel that intercepts every syscall. io_uring gets special treatment: its syscalls return ENOSYS instead of killing the process, because some standard libraries probe for io_uring on startup.
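To make the ENOSYS-versus-kill distinction concrete, here is a small sketch that builds a classic BPF filter as data and evaluates it in pure Python. Several things are assumptions: the deny-list shape (the real filter may be an allow-list), the x86_64-only syscall numbers, the exact set of syscalls, and the names `build_filter`, `pack_filter`, and `simulate`. Nothing here loads a filter into the kernel.

```python
import errno
import struct

# Classic BPF opcodes (linux/bpf_common.h).
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

# seccomp return actions (linux/seccomp.h).
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000

AUDIT_ARCH_X86_64 = 0xC000003E
OFF_NR, OFF_ARCH = 0, 4  # field offsets in struct seccomp_data

# x86_64 syscall numbers; the selection is illustrative.
DENY_KILL = {"ptrace": 101, "mount": 165, "bpf": 321}
DENY_ENOSYS = {"io_uring_setup": 425, "io_uring_enter": 426,
               "io_uring_register": 427}


def build_filter():
    """Return the filter as a list of (code, jt, jf, k) instructions."""
    prog = [
        (BPF_LD_W_ABS, 0, 0, OFF_ARCH),
        (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),  # match: skip the kill below
        (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        (BPF_LD_W_ABS, 0, 0, OFF_NR),
    ]
    for nr in DENY_ENOSYS.values():
        prog.append((BPF_JEQ_K, 0, 1, nr))  # no match: skip the next RET
        prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | errno.ENOSYS))
    for nr in DENY_KILL.values():
        prog.append((BPF_JEQ_K, 0, 1, nr))
        prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    return prog


def pack_filter(prog):
    """Encode as an array of struct sock_filter {u16 code; u8 jt; u8 jf; u32 k}."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def simulate(prog, arch, nr):
    """Evaluate the filter for one syscall and return its action value."""
    data = {OFF_ARCH: arch, OFF_NR: nr}
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = data[k]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
```

With this sketch, `simulate(build_filter(), AUDIT_ARCH_X86_64, 425)` yields `SECCOMP_RET_ERRNO | errno.ENOSYS`, so a library probing for io_uring sees "not implemented" and falls back, while `ptrace` (101) yields `SECCOMP_RET_KILL_PROCESS`. In a real sandbox the packed bytes would be handed to the kernel through `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)` or the `seccomp` syscall.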

After the sandbox environment is set up, we strip everything:

  1. Drop all capabilities from bounding, ambient, effective, permitted, and inheritable sets
  2. Drop to unprivileged UID/GID from the UID pool (range 60000-60999)
  3. Set NO_NEW_PRIVS, so that even setuid binaries won’t grant privileges

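The order matters. You must drop capabilities before dropping UID: removing capabilities from the bounding set requires CAP_SETPCAP, which the process loses once it is no longer root. You must set NO_NEW_PRIVS last.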
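One way to confirm that all three steps took effect is to inspect `/proc/self/status` from inside the sandboxed process. The checker below is a sketch, not part of the documented implementation: `parse_status` and `check_dropped` are hypothetical names, and it assumes GIDs are drawn from the same 60000-60999 range as UIDs. The field names (`CapInh`, `CapPrm`, `CapEff`, `CapBnd`, `CapAmb`, `Uid`, `Gid`, `NoNewPrivs`) are the ones the Linux kernel reports.

```python
UID_POOL = range(60000, 61000)  # 60000-60999 inclusive
CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def parse_status(text):
    """Parse the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_dropped(status_text):
    """Return a list of problems; an empty list means every check passed."""
    fields = parse_status(status_text)
    problems = []
    # Step 1: every capability set must be an all-zero hex mask.
    # A missing field counts as a failure.
    for name in CAP_FIELDS:
        if int(fields.get(name, "1"), 16) != 0:
            problems.append(f"{name} is not empty")
    # Step 2: real, effective, saved, and filesystem IDs must be in the pool.
    for name in ("Uid", "Gid"):
        ids = [int(part) for part in fields.get(name, "0").split()]
        if not all(i in UID_POOL for i in ids):
            problems.append(f"{name} is outside the pool")
    # Step 3: NO_NEW_PRIVS must be set.
    if fields.get("NoNewPrivs") != "1":
        problems.append("NoNewPrivs is not set")
    return problems
```

Inside the sandbox, `check_dropped(open("/proc/self/status").read())` should return an empty list; anything else points at the step that was skipped or failed.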