Seccomp
Why deny-list, not allowlist
This is the most common question about our seccomp design.
Allowlist (nsjail-style for simple binaries): enumerate every allowed syscall. Everything else is blocked. Most secure if you know exactly what your program needs.
Deny-list (our approach): block the specific syscalls that enable sandbox escape. Everything else is allowed.
We chose deny-list because:
- Python calls 100+ different syscalls during import sys. An allowlist would be enormous and break on version upgrades.
- Java spawns threads and uses futex, mmap, and clone; the JVM’s syscall profile changes between minor versions.
- C++ standard library probes for features (io_uring, statx) at startup. Blocking probes kills innocent programs.
nsjail uses allowlists for simple, controlled workloads and deny-lists for complex runtimes. We only run complex runtimes.
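The difference between the two models can be sketched in a few lines of Rust. This is an illustrative model only, not the project's code: `Verdict`, `allowlist_verdict`, and `denylist_verdict` are hypothetical names, and syscalls are compared by name rather than by number. It shows why a startup probe such as statx is blocked by an allowlist that never anticipated it, but passes a deny-list that only names escape-relevant syscalls.

```rust
use std::collections::HashSet;

/// What a policy decides for a single syscall (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Block,
}

/// Allowlist: only syscalls named in `allowed` pass; anything unlisted is blocked.
pub fn allowlist_verdict(allowed: &HashSet<&str>, syscall: &str) -> Verdict {
    if allowed.contains(syscall) {
        Verdict::Allow
    } else {
        Verdict::Block
    }
}

/// Deny-list: everything passes except syscalls named in `denied`.
pub fn denylist_verdict(denied: &HashSet<&str>, syscall: &str) -> Verdict {
    if denied.contains(syscall) {
        Verdict::Block
    } else {
        Verdict::Allow
    }
}
```

With an allowlist of read, write, and mmap, a runtime that starts calling statx after an upgrade is blocked until the list is updated. With a deny-list of ptrace and mount, the same call goes through and only the listed syscalls are stopped.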
Implementation
We use the seccompiler crate (from AWS Firecracker). It builds BPF programs from Rust data structures:
    const BUILTIN_DENY_LIST: &[SyscallRule] = &[
        SyscallRule { name: "io_uring_setup", action: Errno(ENOSYS) },
        SyscallRule { name: "ptrace", action: KillProcess },
        SyscallRule { name: "process_vm_readv", action: KillProcess },
        // ... 15 more
    ];

Two actions:
- Errno(ENOSYS) for probe syscalls (io_uring): the process gets “not supported” and continues
- KillProcess for exploit syscalls (ptrace, bpf, mount): immediate termination
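Two BPF programs are stacked, one per action, because a seccompiler filter applies a single match action to all of its rules. The kernel runs every installed filter and applies the most restrictive result, so both must allow a syscall for it to proceed.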
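How two stacked filters combine can be modelled without touching the kernel. The sketch below is a simplified simulation, not the seccompiler API: `Action`, `Filter`, and `evaluate_stack` are hypothetical names, syscalls are matched by name, and the `ENOSYS` constant assumes the Linux value of 38. Each filter carries one match action, and the stack returns the most restrictive verdict, mirroring the kernel's precedence in which killing the process outranks returning an errno, which outranks allowing the call.

```rust
/// Linux errno value for "function not implemented" (assumed for this sketch).
pub const ENOSYS: i32 = 38;

/// Simplified seccomp return actions. Variant order encodes restrictiveness:
/// the derived `Ord` makes Allow < Errno(_) < KillProcess.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Action {
    Allow,
    Errno(i32),
    KillProcess,
}

/// One filter: a list of syscall names that all share a single match action.
/// Syscalls not in the list fall through as Allow (deny-list behaviour).
pub struct Filter {
    pub names: &'static [&'static str],
    pub match_action: Action,
}

impl Filter {
    pub fn evaluate(&self, syscall: &str) -> Action {
        if self.names.iter().any(|n| *n == syscall) {
            self.match_action
        } else {
            Action::Allow
        }
    }
}

/// Evaluate every filter in the stack and keep the most restrictive result.
/// An empty stack allows everything.
pub fn evaluate_stack(filters: &[Filter], syscall: &str) -> Action {
    filters
        .iter()
        .map(|f| f.evaluate(syscall))
        .max()
        .unwrap_or(Action::Allow)
}
```

With one filter mapping io_uring_setup to `Action::Errno(ENOSYS)` and another mapping ptrace and process_vm_readv to `Action::KillProcess`, a call to read passes both filters, io_uring_setup returns the errno, and ptrace terminates the process.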
Custom policies
    judge execute-code --seccomp-policy /path/to/policy.json --language python --code '...'

Custom policies replace the built-in deny-list entirely. You own the security posture.