Seccomp Filtering
Rustbox installs a seccomp-BPF filter on every sandbox. The kernel runs the filter at syscall entry, before the syscall is dispatched, so every syscall a sandboxed process makes is checked. This is layer 5 of the 8-layer isolation model.
Deny-list, not allowlist
We use a deny-list approach: block the specific syscalls that enable sandbox escape and allow everything else.
Why not an allowlist? Because the languages we support have complex runtime requirements:
- Python calls 100+ different syscalls during import sys. An allowlist would break on version upgrades.
- Java spawns threads and uses futex, mmap, and clone; the JVM’s syscall profile changes between minor versions.
- C++ standard library probes for features (io_uring, statx) at startup. Blocking probes kills innocent programs.
A deny-list that targets the specific syscalls attackers need is more robust across language runtimes and version upgrades.
The 51-syscall deny-list
| Family | Syscalls | Action | Why |
|---|---|---|---|
| io_uring | io_uring_setup, io_uring_enter, io_uring_register | ERRNO(ENOSYS) | Kernel LPE history (CVE-2021-41073, CVE-2023-2598) |
| Tracing | ptrace | KILL | Cross-process inspection |
| Process memory | process_vm_readv, process_vm_writev | ERRNO(EPERM) | Runtime crash handlers probe these at startup |
| Kernel subsystems | bpf, userfaultfd, perf_event_open | KILL | eBPF loading, page fault interception, perf abuse |
| Module loading | kexec_load, kexec_file_load, init_module, finit_module, delete_module | KILL | Kernel module/boot manipulation |
| Mount/swap | mount, umount2, pivot_root, swapon, swapoff | KILL | Filesystem manipulation |
| New mount API | fsopen, fsmount, fsconfig, fspick, move_mount, open_tree, mount_setattr | KILL | Linux 5.2+ mount manipulation |
| Namespace escape | unshare, chroot, setns | KILL | Nested namespace creation, chroot escape |
| DAC bypass | name_to_handle_at, open_by_handle_at | KILL | File-handle access that bypasses path permission checks (the 2014 "Shocker" container escape) |
| System clock | reboot, settimeofday, clock_settime, acct | KILL | System state manipulation |
| Kernel keyring | add_key, keyctl, request_key | KILL | Not namespaced (CVE-2016-0728) |
| NUMA | mbind, set_mempolicy, move_pages | KILL | Memory policy manipulation |
| Execution domain | personality | ERRNO(EPERM) | Blocks READ_IMPLIES_EXEC |
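The deny-list semantics of the table can be sketched as a lookup with an allow-by-default fallthrough. This is only an illustration: the `Action` enum and `action_for` function are hypothetical names, only some rows of the table are included, and a real BPF filter matches on syscall numbers rather than names.

```rust
/// What the filter does with a syscall (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Allow,
    ErrnoEnosys,
    ErrnoEperm,
    Kill,
}

/// Look up the action for a syscall name, using a subset of the table rows.
pub fn action_for(syscall: &str) -> Action {
    match syscall {
        // Probe syscalls: report "not supported" so runtimes fall back.
        "io_uring_setup" | "io_uring_enter" | "io_uring_register" => Action::ErrnoEnosys,
        // Diagnostic syscalls: fail with a permission error.
        "process_vm_readv" | "process_vm_writev" | "personality" => Action::ErrnoEperm,
        // Exploit-class syscalls: terminate the process.
        "ptrace" | "bpf" | "userfaultfd" | "perf_event_open" | "mount" | "umount2"
        | "pivot_root" | "unshare" | "chroot" | "setns" => Action::Kill,
        // Deny-list semantics: anything not named above is allowed.
        _ => Action::Allow,
    }
}
```

With this shape, a syscall introduced by a new runtime version (for example `statx`) falls through to `Action::Allow` without any change to the list.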
Three response modes
Not all blocked syscalls deserve the same response:
- Errno(ENOSYS) for probe syscalls (io_uring). The process gets “not supported” and falls back to a safe alternative. This prevents unnecessary crashes when runtimes probe for kernel features.
- Errno(EPERM) for diagnostic syscalls (process_vm_readv/writev, personality). Runtime crash handlers degrade gracefully instead of receiving SIGSYS.
- KillProcess for exploit-class syscalls (ptrace, bpf, mount). Immediate process termination. No second chances.
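At the kernel level, each response mode is a 32-bit value returned by the BPF program. The sketch below shows how the three modes could map onto those values. The constants come from the Linux UAPI header `linux/seccomp.h`; the errno numbers assume x86_64 or aarch64; the `Response` enum and `encode` function are hypothetical names for illustration, not Rustbox's or seccompiler's types.

```rust
// Return-value constants from <linux/seccomp.h>.
pub const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
pub const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
pub const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;
// The low 16 bits carry data, here the errno handed back to the caller.
pub const SECCOMP_RET_DATA: u32 = 0x0000_ffff;

// errno numbers, assuming x86_64 or aarch64 (they differ on some architectures).
pub const EPERM: u32 = 1;
pub const ENOSYS: u32 = 38;

/// The response modes, plus the default for syscalls not on the deny-list.
pub enum Response {
    Allow,
    Errno(u32),
    KillProcess,
}

/// Encode a response as the value a seccomp BPF program returns.
pub fn encode(response: Response) -> u32 {
    match response {
        Response::Allow => SECCOMP_RET_ALLOW,
        // The errno is packed into the data bits of the ERRNO action.
        Response::Errno(errno) => SECCOMP_RET_ERRNO | (errno & SECCOMP_RET_DATA),
        Response::KillProcess => SECCOMP_RET_KILL_PROCESS,
    }
}
```

From the sandboxed program's point of view, an `Errno` response looks like an ordinary failed syscall with `errno` set, which is why a runtime probing io_uring can fall back instead of crashing.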
Implementation
The BPF filter is built using the seccompiler crate (from AWS Firecracker). Four BPF programs are stacked:
- ENOSYS filter (io_uring probes)
- EPERM filter (diagnostic syscalls)
- KILL filter (exploit syscalls)
- clone(NEWUSER) argument filter (blocks user namespace creation)
The kernel runs all four filters on every syscall and applies the highest-precedence, most restrictive result: kill outranks errno, which outranks allow.
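How the kernel picks one result from several stacked filters can be sketched as follows. The kernel masks each return value down to its action bits and keeps the one that is lowest when compared as a signed 32-bit integer, which puts KILL_PROCESS first and ALLOW last. The `combine` function is a hypothetical model of that rule, not kernel or Rustbox code.

```rust
// Return-value constants from <linux/seccomp.h>.
pub const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
pub const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
pub const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;
// Mask selecting the action bits of a return value.
pub const SECCOMP_RET_ACTION_FULL: u32 = 0xffff_0000;

/// Model of the kernel's rule for stacked filters: the return value whose
/// action is lowest as a signed 32-bit integer wins. Returns None when no
/// filters are installed.
pub fn combine(results: &[u32]) -> Option<u32> {
    results
        .iter()
        .copied()
        // 0x8000_0000 becomes i32::MIN after the cast, so KILL_PROCESS
        // always sorts ahead of every other action.
        .min_by_key(|r| (r & SECCOMP_RET_ACTION_FULL) as i32)
}
```

For a syscall that only the ENOSYS filter matches, the other three filters return ALLOW and the errno result wins; a single KILL_PROCESS from any filter overrides everything else.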
clone(NEWUSER) special case
User namespace creation (clone with CLONE_NEWUSER) gets a dedicated argument-level BPF filter rather than a blanket block on clone, because clone is essential for process creation and threading. Only clone calls whose flags include CLONE_NEWUSER are blocked.
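The argument-level check amounts to a masked comparison on the flags argument of clone. The sketch below models that predicate in plain Rust. The flag values come from `linux/sched.h`; `SIGCHLD = 17` assumes x86_64; `clone_is_blocked` is a hypothetical name, and the actual filter expresses this test as a BPF condition rather than Rust code.

```rust
// Flag values from <linux/sched.h>.
pub const CLONE_VM: u64 = 0x0000_0100;
pub const CLONE_THREAD: u64 = 0x0001_0000;
pub const CLONE_NEWUSER: u64 = 0x1000_0000;
// Exit signal commonly passed in the low bits of the flags (x86_64 value).
pub const SIGCHLD: u64 = 17;

/// Returns true when a clone call asks for a new user namespace.
/// Every other flag combination is left alone so that ordinary process
/// and thread creation keep working.
pub fn clone_is_blocked(flags: u64) -> bool {
    (flags & CLONE_NEWUSER) == CLONE_NEWUSER
}
```

A thread-style call such as `clone_is_blocked(CLONE_VM | CLONE_THREAD)` and a fork-style call such as `clone_is_blocked(SIGCHLD)` both return `false`, while any flag set that includes `CLONE_NEWUSER` returns `true`.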