AI-assisted documentation, GitHub, and the art of managing a complex Linux computing project
This is the story of "pepperpots" — a 14-core Intel Kubuntu workstation with a 16GB Radeon GPU running four BOINC projects simultaneously. Twelve of those cores are currently grinding through ATLAS event simulations for CERN. The setup works beautifully now. Getting here was instructive.
There's a moment, maybe two weeks into running BOINC, when you open a terminal and realise your PC has been quietly simulating proton collisions for CERN while you slept. The task completed at 3am. Credits were awarded. Some fraction of a Higgs boson cross-section calculation now has your hostname in its provenance chain.
That moment is worth chasing.
BOINC — the Berkeley Open Infrastructure for Network Computing — is a volunteer distributed computing platform that lets you donate idle CPU and GPU cycles to real scientific research. Einstein@Home searches LIGO gravitational wave data for pulsars. MilkyWay@home maps dark matter distribution by simulating tidal streams of stars being ripped apart by the Milky Way's gravity. LHC@home runs actual Monte Carlo simulations of particle collisions at CERN, feeding data into the same physics pipeline as the full-time computing clusters.
This isn't screensaver science. This is the real pipeline.
But here's what the documentation doesn't tell you: LHC@home is not a simple application you install and forget. It's a system that quietly runs two entirely separate container runtimes — VirtualBox for CMS tasks, Docker for ATLAS simulations — each with its own permission model, each with its own failure modes, and almost none of it documented in one place. When something breaks (and it will break, usually silently, usually after a routine apt upgrade), you are largely on your own.
This article is about what happens when you stop treating your homelab like a hobby and start treating it like a software project. It's about using GitHub and an AI assistant not just to fix problems, but to remember them — so that when the same package update blows away your carefully crafted workaround six weeks later, the diagnosis takes ten minutes instead of three days.
It's also about the moment you run ps aux and realise one of your CPU cores has been pegged at 100% for days by a CERN physics simulation that escaped its container, re-parented itself to PID 1, and is now completely invisible to every tool you'd normally use to kill it.
That's a real thing that happens. We'll get to it.
The Setup — pepperpots and the Problem Stack
pepperpots is not a server. It sits on a desk, runs KDE Plasma, and occasionally has a browser open with seventeen tabs. It is also, simultaneously, a node in several of the most computationally demanding scientific projects on the planet.
The hardware is respectable but not exotic: 14-core Intel CPU, 16GB Radeon GPU, NVMe storage, running Kubuntu. The kind of machine a technically inclined person builds or buys for general use and then, one idle afternoon, decides to point at the universe's unsolved problems.
Four projects run simultaneously:
Einstein@Home searches LIGO and Arecibo data for previously undiscovered pulsars and gravitational wave signals. The GPU earns its keep here — OpenCL-accelerated semi-coherent searches sweep through frequency bands measuring fractions of a hertz, covering sky partitions numbered in the tens of thousands. A single work unit on pepperpots processes roughly 864 trillion floating point operations while searching a 0.5 Hz band around 2059 Hz for a neutron star that may or may not exist.
LHC@home runs Monte Carlo simulations of particle collisions at CERN. Your CPU becomes, in a meaningful sense, part of the same computing infrastructure that processes data from the Large Hadron Collider. CMS tasks simulate detector responses to collision events. ATLAS tasks run full GEANT4 simulations of particle interactions through the detector geometry — multi-gigabyte input files, hours of compute, gigabytes of output that get uploaded to CERN's grid.
MilkyWay@home fits N-body simulations of dwarf galaxy tidal streams to real sky observations. The Milky Way is slowly tearing apart its satellite galaxies, leaving stellar debris trails across the sky. The shape of those trails encodes the gravitational field they moved through — mostly dark matter. Each work unit tests one set of initial conditions out of tens of thousands required to constrain the dark matter distribution. The project runs out of Rensselaer Polytechnic Institute and now includes the gravitational influence of the Large Magellanic Cloud in its models.
Asteroids@home reconstructs 3D shape models of asteroids from photometric light curves. When an asteroid rotates, its brightness changes in ways that depend on its shape. Enough observations from enough angles, combined with enough CPU time, produces a model accurate enough to inform deflection mission planning.
Four projects. Four schedulers competing for resources. Two container runtimes. One AMD GPU shared between Einstein@Home and the desktop. And a BOINC client that, it turns out, has some deeply non-obvious opinions about preference hierarchies, process ownership, and what happens when Docker releases a minor version update.
The problems didn't all arrive at once. They accumulated, each one teaching something the documentation had omitted. What made the difference was deciding early on to treat the whole thing as a software project rather than a hobby setup — version controlled, documented, with a running record of what broke and why.
That decision paid dividends faster than expected.
Two Engines, One Project, Plentiful Opportunities for Failure
Most BOINC projects are conceptually simple: download a work unit, run an executable, upload results. LHC@home is not most BOINC projects.
Beneath the single project entry in your BOINC Manager, LHC@home runs two entirely separate execution runtimes, serving two different physics experiments, using two different container technologies, each with its own installation requirements, permission model, and failure modes. You don't choose between them — BOINC runs whichever tasks the LHC@home scheduler sends, and the scheduler sends both.
Understanding this architecture is the single most important thing you can do before setting up LHC@home on a Linux system. Almost nothing that follows makes sense without it.
Engine One: VirtualBox and the CMS Experiment
The older of the two paths. CMS tasks — simulations for the Compact Muon Solenoid detector at CERN — run inside a full virtual machine managed by VirtualBox.
When BOINC receives a CMS work unit, it invokes boinc_vbox_wrapper, which launches a VBoxHeadless process — a headless VirtualBox VM — using a pre-downloaded .vdi disk image as its root filesystem. The simulation runs entirely inside the VM. Results are extracted via a shared folder. The VM shuts down. BOINC uploads the output.
The .vdi image (CMS_2025_04_08_prod.vdi on pepperpots at time of writing) is managed by BOINC itself — downloaded, cached, and updated as CERN releases new versions. You don't interact with it directly. You just need VirtualBox installed and the boinc system user able to invoke it.
That last requirement is where things quietly fall apart.
VirtualBox on Linux restricts VM management to members of the vboxusers group. If the boinc system user — the account under which the BOINC daemon runs — is not in that group, VBoxManage will refuse to execute and CMS tasks will simply never launch. No error surfaced to the BOINC interface. No computation error logged. Tasks sit in the queue, unstarted, and BOINC eventually returns them to the server as unfinished.
The diagnostic is straightforward once you know to look:
```bash
sudo -u boinc VBoxManage list vms
```

If this returns a permission error rather than an empty list, your boinc user is not in vboxusers. The fix is a single command:
```bash
sudo usermod -aG vboxusers boinc
```

Followed by a full BOINC restart to pick up the new group membership. This is not documented in the LHC@home setup instructions. It is not flagged during BOINC installation. It is the kind of thing you discover by noticing, weeks later, that you have never once received a CMS task despite being attached to the project.
Engine Two: Docker and the ATLAS Experiment
The newer path, and the more complex one. ATLAS tasks — simulations for the ATLAS detector, the largest of the LHC's four main experiments — run via Docker, using software distributed through CVMFS: the CERN Virtual Machine File System, a globally distributed, read-only software repository that mounts like a local filesystem.
When BOINC receives an ATLAS work unit, it invokes docker_wrapper, which pulls the task configuration and launches a Docker container. Inside the container, the ATLAS software stack — accessed live via /cvmfs/atlas.cern.ch/ — runs a GEANT4 simulation of particle interactions through the ATLAS detector geometry. Input files are multi-gigabyte collections of Monte Carlo generated collision events. Output is a processed file that eventually feeds into CERN's central analysis pipeline.
The compute demand is substantial. A single ATLAS task running on pepperpots consumes 200-300% CPU (multi-threaded), 2-3 GB of RAM, and runs for several hours. When the system is working correctly, this is impressive and deeply satisfying. When it isn't, it fails in ways that are neither obvious nor well-documented.
The failure that bit pepperpots — and will bite any system that upgrades Docker without knowing what to look for — is a security profile change introduced in docker-ce 29.x. The ATLAS container needs to mount a tmpfs filesystem internally, used by a lighttpd instance that serves simulation results back to the wrapper. Mounting tmpfs inside a container requires CAP_SYS_ADMIN. Docker 29.x's default security profile does not grant it.
The symptom is a task that either dies immediately or runs to 100% completion and then fails — because the lighttpd server couldn't set up its working directory, so the results, though computed, can't be served and collected. BOINC logs it as a computation error. Credits are not awarded.
The obvious fix — adding --privileged to the container options via BOINC's cc_config.xml docker_container_options field — doesn't work. The field is silently ignored in BOINC 8.2.8. This is a confirmed upstream bug, reported to both the LHC@home forums and the BOINC GitHub.
The actual fix is a wrapper script that intercepts every docker run and docker create call and injects the required flags:
```bash
#!/bin/bash
# Wrapper: injects --privileged --user 122:127 into docker run/create
# Workaround for docker-ce 29.x + BOINC 8.2.8 LHC@home tmpfs incompatibility
# Real binary: /usr/bin/docker.real (managed by dpkg-divert)
REAL_DOCKER=/usr/bin/docker.real
if [[ "$1" == "run" || "$1" == "create" ]]; then
    exec "$REAL_DOCKER" "$1" --privileged --user 122:127 "${@:2}"
fi
exec "$REAL_DOCKER" "$@"
```

This lives at /usr/bin/docker. The real Docker binary lives at /usr/bin/docker.real. Every ATLAS container launch passes through the wrapper transparently. The --user 122:127 flag ensures output files are written as boinc:boinc rather than root:root — a separate ownership problem that would otherwise prevent BOINC from moving completed output files out of the slot directory.
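The injection logic is easy to exercise in isolation. A minimal sketch, with `echo` standing in for the real Docker binary so the rewritten command line is simply printed (the 122:127 UID:GID pair is specific to pepperpots):

```bash
#!/bin/bash
# Simulate the wrapper's argument handling with echo in place of
# /usr/bin/docker.real, so the injected flags become visible.
REAL_DOCKER=echo

wrap() {
  if [[ "$1" == "run" || "$1" == "create" ]]; then
    "$REAL_DOCKER" "$1" --privileged --user 122:127 "${@:2}"
  else
    "$REAL_DOCKER" "$@"
  fi
}

wrap run --rm atlas-sim   # prints: run --privileged --user 122:127 --rm atlas-sim
wrap ps -a                # prints: ps -a  (non-run/create commands pass through untouched)
```

Note that the flags are spliced in immediately after the subcommand, so anything the wrapper's caller passed — image name, environment flags, volume mounts — survives unchanged in `"${@:2}"`.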
There is one more wrinkle. The wrapper approach has an adversary: the package manager.
The Wrapper That Kept Disappearing
The first time apt upgrade overwrote /usr/bin/docker with a fresh Docker binary, the failure was immediate and total — every ATLAS task ended in a computation error, the project went into backoff, and no new work was requested. Diagnosis was fast because the problem was documented. The wrapper was rewritten and redeployed.
The naive protective measure — apt-mark hold docker-ce — prevents automatic upgrades but doesn't survive an intentional apt install docker-ce. It also means never getting Docker security updates without manually removing the hold first. Fragile in both directions.
The correct solution is dpkg-divert:
```bash
dpkg-divert --add --no-rename --divert /usr/bin/docker.real /usr/bin/docker
```

This instructs the Debian package system that whenever any package tries to install a file at /usr/bin/docker, it should be redirected to /usr/bin/docker.real instead. The wrapper owns /usr/bin/docker permanently. Future Docker upgrades update the real binary at docker.real and never touch the wrapper. The divert survives apt upgrade, apt full-upgrade, and apt install docker-ce alike.
Verified by running apt-get install --reinstall docker-ce and confirming the wrapper was untouched. The real binary updated. The wrapper survived. ATLAS tasks continued running.
dpkg-divert is one of those tools that most Linux users never encounter until the moment they need it — at which point it's exactly the right instrument for the job. Any time you find yourself maintaining a customised version of a file that a package also wants to own, this is the pattern.
The Group Membership Matrix
Putting it all together, full LHC@home functionality on a standard BOINC Linux installation requires the boinc system user to be a member of five groups:
| Group | Purpose |
|---|---|
| boinc | Primary group — BOINC data directory access |
| video | GPU access for Einstein@Home OpenCL tasks |
| render | AMD/Intel GPU render node access |
| docker | ATLAS task container execution |
| vboxusers | CMS task VM execution |
Check your current state: `id boinc`
Add any missing groups: `sudo usermod -aG docker,vboxusers boinc`
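The check is simple enough to script. A sketch — the comparison is factored into a function so it can be demonstrated on sample input; against a live system you would feed it the output of `id -nG boinc` (adjust the video/render entries for your GPU stack):

```bash
#!/bin/bash
# Report which of the required groups are missing from a user's group list.
REQUIRED=(boinc video render docker vboxusers)

check_groups() {
  local have=" $1 " g missing=()
  for g in "${REQUIRED[@]}"; do
    [[ "$have" == *" $g "* ]] || missing+=("$g")
  done
  if ((${#missing[@]} == 0)); then
    echo "ok"
  else
    echo "missing: ${missing[*]}"
  fi
}

# Live usage: check_groups "$(id -nG boinc)"
check_groups "boinc video render docker"   # prints: missing: vboxusers
```

Run periodically (or from a post-upgrade hook), this turns the silent-scheduling-drought failure mode into a one-line diagnostic.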
None of this is documented in one place in the official LHC@home or BOINC documentation. The docker group requirement appears in passing in some forum posts. The vboxusers requirement is implied but never stated explicitly. The consequence of missing either is silent task failure that looks, from BOINC's perspective, like a scheduling drought.
The Feedback Loop — Why This Time Was Different
Every Linux administrator has a folder somewhere. Maybe it's called notes, maybe fixes, maybe just a graveyard of text files with names like docker_thing_march.txt and BOINC_SOLVED_FINALLY.md. The notes made sense when you wrote them. Six months later they're archaeological artifacts requiring interpretation.
This is the normal state of homelab documentation. It's also why the same problems get solved two, three, four times — each time from scratch, each time eating hours that the previous solution already paid for.
The pepperpots setup broke that pattern, not by being more disciplined about notes, but by changing what "documentation" means in practice.
The Repo as Institutional Memory
From early in the troubleshooting process, everything went into a GitHub repository: scripts, systemd unit files, configuration, and most importantly, a running TIMELINE.md documenting every problem encountered, its root cause, the fix applied, and the outcome. Not a changelog. Not a commit log. A human-readable narrative of what broke and why.
When the Docker wrapper disappeared the second time — overwritten by apt upgrade during a routine update — the diagnosis looked like this:
```
boinc-affinity: LHC tasks failing with computation errors
→ check wrapper: head -5 /usr/bin/docker
→ ELF binary, not bash script
→ wrapper overwritten by package update
→ TIMELINE.md: Phase 2, same failure mode
→ fix: redeploy wrapper, this time use dpkg-divert
```

Ten minutes from symptom to resolution, including the time to implement a better fix than the original. Because the problem had been seen before, recorded in detail, and the record was immediately accessible.
Without the documentation, the same sequence would have started from scratch: what's failing, why is it failing, what did we do last time, where did we put that script, was it --privileged or something else we injected. An hour minimum. Probably more.
The wrapper not being committed to the repo the first time around was its own lesson. The fix existed. The understanding existed. But because the script hadn't been checked in, the repo couldn't answer the question "what exactly did we deploy?" with certainty. That gap — between solving a problem and recording the solution — is where institutional memory leaks out.
The rule that emerged: a fix that isn't committed didn't happen.
The AI Project as Technical Co-Pilot
The other half of the pattern is less conventional: a Claude Project (or ChatGPT, or whatever your personal choice may be) — a persistent AI assistant session with memory scoped to the project — used not as a search engine or a code generator, but as a technical collaborator that remembers the full context of the system.
This needs some honest framing. An AI assistant doesn't replace understanding. Every fix described in this article required actual diagnosis — reading logs, testing hypotheses, understanding why CAP_SYS_ADMIN matters for tmpfs mounts, knowing what dpkg-divert does and why it's the right tool rather than apt-mark hold. The AI doesn't short-circuit that process and shouldn't.
What it does is compress the overhead around it.
When the Docker wrapper failed the second time, the conversation didn't start with "explain what LHC@home is" or "what's a Docker wrapper." It started with the symptom, and the assistant already knew the architecture, the prior fix, the known failure modes, and the reason dpkg-divert was the upgrade from the hold approach. The context that would otherwise have to be reconstructed from scratch — or read back from notes — was already present.
The session becomes a single continuous thread: problem identified, diagnosed in context, fixed with full awareness of the system's history, documented, committed. The AI session and the git log tell the same story because they're built in parallel, each informing the other.
For a solo administrator managing a complex system over months, this changes the economics of documentation significantly. Writing things down stops feeling like overhead and starts feeling like the natural end of a problem-solving session — because the record is already half-built by the time the fix is confirmed working.
What Gets Captured
The sable-boinc_admin repository now contains, in addition to the scripts and config files:
- `TIMELINE.md` — nine phases of troubleshooting history, root causes and all
- `STARTUP_SEQUENCE.md` — annotated start and stop procedures covering every non-obvious decision: why boinc runs as root, why boincmgr doesn't need sudo, the `boinccmd` working directory quirk, shutdown ordering
- `ATLAS_ORPHAN_PROBLEM.md` — a deep dive into CVMFS process escape and the detection and cleanup approach (more on this shortly)
Each document is written for an external audience — another BOINC administrator who has hit the same wall and is looking for a path through it. That constraint turns out to be clarifying. Writing for someone who doesn't already know the system forces precision about what actually matters and why, which makes the documentation more useful to future-you as well.
The repository is public: github.com/black-vajra/sable-boinc_admin.
A Deeper Dive — How an AI Copilot Pattern Actually Works
Section 4 described the feedback loop in broad strokes — problems get solved and recorded in the same session, the repo and the AI session tell the same story. What follows is the mechanics of how that's set up and why the specific choices matter.
There's a version of "using AI for sysadmin work" that looks like typing questions into a chat window and copying the answers into a terminal. That's not what's described here, and the distinction matters.
What follows is a specific methodology built around Claude's Projects feature that turns an AI assistant into something closer to a technical co-author with persistent context — one that knows your system's history, understands its quirks, and can pick up mid-session without needing to be re-briefed from scratch every time. Getting there requires some deliberate setup. The payoff is significant.
Projects vs Conversations — Why It Matters
A standard Claude conversation is stateless. Each new chat starts with a blank slate. You explain the system, describe the problem, provide context, get help, close the tab. The next time something breaks, you start over. For a one-off question this is fine. For a system you're actively developing and maintaining over months, it's death by a thousand re-explanations.
Claude Projects changes this. A Project is a persistent workspace that maintains context across all conversations within it. Everything discussed in previous sessions — the Docker wrapper saga, the VBox group membership discovery, the nine phases of the timeline — is searchable and accessible in new sessions via conversation search. The assistant doesn't need to be told what dpkg-divert is in the context of this system, or why boinccmd requires the cd /var/lib/boinc-client subshell pattern. It already knows.
For a complex, evolving system like pepperpots, this is the foundational capability that makes everything else work.
The System Prompt — Your Briefing Document
Every Claude Project accepts a system prompt — instructions and context that are injected at the start of every conversation in the project. Think of it as the briefing document the assistant reads before each session.
The sable-boinc_admin project system prompt covers:
- Machine profile: hostname, CPU, RAM, GPU, OS, kernel
- Active BOINC projects and their execution characteristics
- Known quirks and gotchas (the `boinccmd` directory issue, the manual startup architecture, the Docker wrapper, the ATLAS orphan problem)
- Repository structure
- Behavioral instructions: technical audience, no hand-holding, flag stability risks proactively, write documentation for external readers
This means every new conversation starts with the assistant already oriented. "LHC@home tasks are failing" lands in a context where the assistant already knows the system has a Docker wrapper, knows what the wrapper does, knows the prior failure modes, and can immediately triage rather than asking preliminary questions.
Writing a good system prompt is worth investing time in. It should describe the system accurately, document the non-obvious decisions and their rationale, and specify how you want the assistant to behave. It's a living document — update it as the system evolves. The system prompt for this project went through several revisions as new quirks were discovered and the scope of the work became clearer.
Project Files — Persistent Reference Material
Projects also support file uploads — documents that persist across all conversations and are available for the assistant to reference at any time. On pepperpots this includes:
- `TIMELINE.md` — the full troubleshooting history
- `boinc_affinity.sh` — the current affinity script
- `SCRIPTS_AND_DOCS.md` — inventory and documentation roadmap
- `BOINC_Admin_Addendum_Mar2_2026.docx` — the session addendum covering February 28 to March 2
When a new session begins, these files are available immediately. A question like "what version of the affinity script are we on and what were the key changes in v3?" gets answered from the actual file, not from training data. A question like "when did we first encounter the ATLAS orphan problem?" gets answered from the timeline, accurately, with the specific context of this system.
The practical implication: keep your key documents in the project. The more complete and current the uploaded files, the less the assistant needs to ask, and the more useful its contributions are from the first message.
Memory Settings — The Subtle Controls
Claude's memory system operates at two levels that are worth understanding separately.
Automatic memory generates summaries of conversations and stores them as memories that persist across the entire Claude account — not just within a project. These are the fragments of context that let Claude greet you by name, know you work in cybersecurity, remember that the machine is called pepperpots. Useful ambient context, but not fine-grained enough to replace project files for technical detail.
User-controlled memory edits let you explicitly instruct Claude to remember or forget specific things. In the Claude settings under memory, you can view, add, edit, and delete these entries directly. For a technical project this is where you'd capture things like preferred terminology, workflow preferences, or system facts that aren't in the uploaded files. The key discipline: keep these entries clean and current. Stale memory entries can mislead more than help.
The memory toggle in conversation settings controls whether memories are applied in a given session. Leave it on for technical project work. The context it provides is net positive.
Within a Project, the scoping is important: memories generated from project conversations are scoped to that project. This is what you want — pepperpots context doesn't bleed into unrelated conversations, and vice versa.
Token Economy — Managing the Context Window
This is the part that isn't discussed enough in AI workflow guides.
Every conversation has a context window — a finite amount of text the model can hold in active attention at once. In a long technical session covering multiple problems, debugging sequences, and document drafts, that window fills up. When it does, early context can fall out of active attention, and the quality of responses can degrade subtly — the assistant may lose track of a detail established early in the session or start asking questions it already answered.
Practical management strategies:
Start new conversations for new problems. Don't let a single conversation sprawl across five unrelated issues. A conversation that began debugging the Docker wrapper and ends writing STARTUP_SEQUENCE.md has covered a lot of ground — that's fine. A conversation that started in February and has accumulated fifty exchanges across three different problem domains is burning context on history that the project files already capture better.
Front-load context at session start. Open each new conversation by establishing what you're working on and what the current state is. "Continuing from last session — Docker wrapper is fixed and committed, now working on STARTUP_SEQUENCE.md" gives the assistant a clean starting point without needing to read back through prior exchanges.
Use the project files as the source of truth, not the conversation history. When a document is finished and committed, it lives in the repo and can be re-uploaded to the project. The conversation that produced it can end. The artifact persists; the thread doesn't need to.
Uploads vs conversation text. Pasting long files into conversation messages consumes context window rapidly. Uploading them as project files makes them available without burning active context. For reference material you'll need across multiple sessions, always upload rather than paste.
Draft documents in Claude, but own the final commit. Claude can produce a first draft of STARTUP_SEQUENCE.md or a TIMELINE entry that's 90% correct. The remaining 10% — verifying technical accuracy, checking that commands are right, confirming the narrative matches what actually happened — is always the human's job. The assistant is confident; confidence is not accuracy. Read everything before it goes in the repo.
The Live Session Pattern
The workflow that emerged for pepperpots looks like this:
- Problem appears — something breaks, something needs building
- Open project conversation — assistant has system context from prompt and files
- Diagnose together — assistant brings knowledge of the system's history, prior failure modes, relevant commands; human runs them and reports output
- Fix is identified and applied — human executes, assistant verifies logic
- Documentation drafted — assistant writes the TIMELINE entry, doc section, or commit message while the session is fresh and the details are live in context
- Committed — human reviews, adjusts for accuracy, commits to the repo
The documentation step is the one most easily skipped and most worth protecting. The moment a fix is confirmed working is exactly when the details are clearest — the root cause is understood, the solution is fresh, the gotchas are top of mind. Writing it down then costs ten minutes. Writing it down a week later costs an hour and produces a worse document.
With the assistant doing the initial draft, that ten minutes shrinks to two. The friction is low enough that skipping it starts to feel like the lazier option.
What the AI Doesn't Do
To be direct about the limits: the assistant doesn't diagnose problems from first principles. It doesn't know that docker-ce 29.x changed its security profiles unless that information is in its training data or you tell it. It doesn't know what's running on your system. It doesn't run commands or read logs.
What it does is hold context, reason about it, suggest the right questions to ask, know which tools are appropriate for which problems, and produce documentation that reflects the session accurately. The human is always the diagnostic instrument. The assistant is the memory and the pen.
For solo administrators managing complex systems — which describes most homelab setups — that division of labour is extremely well-matched to the actual workflow.
The Ghost in the Machine — ATLAS Orphans and the Limits of Process Management
Everything described so far has a clean fix. Wrong group membership — add the group. Overwritten wrapper — use dpkg-divert. Authentication failure — use the subshell pattern. The problems are real but the solutions are surgical.
The ATLAS orphan problem is different. It's the one that requires you to fundamentally rethink your assumptions about how processes work on Linux — specifically, the assumption that stopping the program that started a process will stop the process.
On a normal system, that assumption holds. On a system running ATLAS simulations via CVMFS, it doesn't.
How ATLAS Jobs Escape the Process Tree
When BOINC launches an ATLAS task, it spawns a chain of processes. The chain looks roughly like this:
```
boinc (PID 1234)
└── docker_wrapper
    └── docker run ...
        └── [container: runpilot2-wrapper.sh]
            └── python runargs.EVNTtoHITS.py
```

Under normal circumstances, killing the root of this tree — the boinc process — would propagate SIGTERM down through the hierarchy and clean everything up. This is the expected behaviour and it's how every other BOINC project on pepperpots works.
ATLAS tasks run their simulation software via CVMFS — the CERN Virtual Machine File System, a distributed, read-only software repository that mounts like a local filesystem at /cvmfs/atlas.cern.ch/. At some point in the wrapper chain, a process executed via CVMFS gets re-parented to PID 1 — the init process. The mechanism is a standard Unix behaviour: when a parent process exits before its child, the child is adopted by init. In the ATLAS execution chain, this re-parenting is not accidental — it's a consequence of how the CVMFS-hosted wrapper scripts manage their child processes.
The result is a process — typically python runargs.EVNTtoHITS.py, the GEANT4 simulation itself — that is completely detached from BOINC's process tree. It has PID 1 as its parent. It owns a CPU core. It is invisible to any tool that navigates the process hierarchy from the boinc process downward. And it survives BOINC client restarts entirely.
The first sign something was wrong on pepperpots: one CPU core pegged at 100% after stopping boinc-client. Not throttling. Not tapering off. Solid 100%, indefinitely. pgrep boinc returned nothing. The BOINC interface showed no running tasks. But the core was pinned, temperatures were climbing, and ps aux showed a python runargs.EVNTtoHITS.py process consuming 238% CPU (multi-threaded) with 2.9 GB RSS — owned by root, parent PID 1.
BOINC had no idea it existed.
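Spotting an escapee like this means searching for init-parented processes with ATLAS-like names rather than walking BOINC's tree. A sketch of the filter, demonstrated here on synthetic input; on a live system you would pipe `ps -eo pid,ppid,user,pcpu,rss,args` into it:

```bash
#!/bin/bash
# Keep lines whose PPID (column 2) is 1 and whose command field
# matches an ATLAS simulation name.
find_orphans() {
  awk '$2 == 1 && /EVNTtoHITS|runargs|AtlasG4/'
}

# Synthetic ps output: one escaped simulation, one normal BOINC child.
printf '%s\n' \
  "4242 1 root 238 2900000 python runargs.EVNTtoHITS.py" \
  "5678 1234 boinc 99 100000 einstein_O3AS" | find_orphans
# prints only the first line — the re-parented simulation
```

Anything this filter catches is, by definition, invisible to tools that start from the boinc process and walk downward.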
Why KillMode=process Made It Worse
The systemd drop-in for boinc-client.service had KillMode=process at the time. This setting tells systemd: when stopping this service, send the kill signal only to the main process, not to the entire cgroup. The rationale was to let BOINC clean up its own children gracefully before dying.
The problem: KillMode=process means systemd explicitly does not kill the cgroup. Any process that escaped the process tree — like an ATLAS simulation re-parented to init — was already outside the scope of a process-targeted kill anyway, but KillMode=process made the situation worse by also exempting processes that hadn't escaped. Legitimate BOINC children running inside containers could persist after boinc-client stopped.
The fix was to change to KillMode=control-group:
```ini
[Service]
KillMode=control-group
```

This tells systemd to nuke the entire cgroup when the service stops — every process associated with the boinc-client service unit, regardless of process tree position. Containers, wrappers, grandchildren, all of it.
There's a significant caveat, and it matters: this only works when BOINC is started via systemctl. On pepperpots, BOINC is started manually via ~/start-boinc-procedure. A manually started process is not in the boinc-client service's cgroup. The KillMode=control-group setting is present and correct in the drop-in, but it only protects sessions where systemctl was used to start the daemon. For the manual startup case, the orphan problem remains theoretically possible for long-running sessions.
This is a known, documented limitation of the current setup — a conscious tradeoff between the control afforded by manual startup and the cgroup management that systemd provides.
Detection — Casting a Wider Net
The CPU affinity script (boinc_affinity.sh) originally detected workers by walking the BOINC process tree from the client PID downward. This is the right approach for every project except ATLAS — for ATLAS, the workers have already escaped the tree by the time the affinity script sees them.
The fix was a parallel detection function that ignores the process tree entirely and searches by process name pattern:
```bash
get_atlas_pids() {
    pgrep -f "runargs\|EVNTtoHITS\|AtlasG4\|Sim_tf\|Gen_tf\|python.*atlas\|python.*cern" 2>/dev/null
}
```

This runs alongside the normal descendant walk and the results are merged:
```bash
mapfile -t ALL_DESCENDANTS < <({ get_descendants "$CLIENT_PID"; get_atlas_pids; } | sort -u)
```

The orphans are now visible to the affinity manager. They get CPU affinity assignments and renice -n 19 treatment along with every other worker — which means even an escaped ATLAS simulation running as root at 238% CPU gets pushed to the lowest scheduling priority and pinned to a rotating window of cores rather than hammering the same two cores indefinitely.
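In simplified form, the per-worker treatment amounts to something like this (names are illustrative; the real boinc_affinity.sh adds the rotating window, logging, and error handling):

```bash
#!/usr/bin/env bash
# Sketch of the per-worker treatment applied to every merged PID.
manage_worker() {
    local pid=$1 window=$2
    renice -n 19 -p "$pid" >/dev/null 2>&1    # yield to interactive work
    taskset -cp "$window" "$pid" >/dev/null 2>&1   # pin to the core window
}

# demo on a scratch process
sleep 60 & demo=$!
manage_worker "$demo" "0"
ps -o ni= -p "$demo"    # shows the nice value, now 19
kill "$demo"
```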
Cleanup — The Start Script pkill
Detection helps during a running session. But orphans from a previous session — processes that survived a BOINC restart and have been running unmanaged since — need to be cleaned up before a new session begins. The start script handles this:
```bash
sudo pkill -f "runargs.EVNTtoHITS" 2>/dev/null
sudo pkill -f "EVNTtoHITS" 2>/dev/null
sudo boinc --redirectio &
```

The pkill runs before boinc starts. Any orphan from the previous session is terminated before the new session inherits it. The 2>/dev/null suppression means this is silent when there's nothing to kill — which is most of the time. When there is something to kill, it's exactly the right moment to do it.
The Remaining Exposure
With all of the above in place — get_atlas_pids() detection, KillMode=control-group in the systemd unit, pre-start pkill in the start script — the orphan problem is managed but not eliminated. A long-running session started manually, where CVMFS re-parents a process mid-session, will produce an orphan that the affinity script will detect and manage (renice, pin to rotating cores) but that won't be killed when BOINC stops, because the cgroup management only applies to systemd-started sessions.
The practical consequence is minimal: the orphan gets cleaned up at the next start sequence. But it means a stopped BOINC session on pepperpots can leave a low-priority CERN simulation running in the background until the next startup. For a homelab context this is acceptable. For a production environment it would not be.
It's documented. Future-you — or the next administrator — can make an informed decision about whether the manual startup architecture is still the right tradeoff.
Why This Is Worth Documenting
The ATLAS orphan problem is not a bug in any single piece of software. It's an emergent behaviour at the intersection of CVMFS's process management, Docker's container lifecycle, Linux's process re-parenting semantics, and BOINC's assumption that it controls its own process tree. None of those components is doing anything wrong individually. Together they produce a situation that will confuse any administrator who hasn't seen it before.
As of the writing of this article, there is no clear upstream documentation describing this behaviour or its mitigations. The pattern described here — name-based detection, cgroup kill mode, pre-start cleanup — was developed empirically on pepperpots over several sessions and represents, as far as we know, the most complete publicly available treatment of the problem.
If you're running LHC@home ATLAS tasks and you've ever noticed a CPU core that won't settle down even after stopping BOINC — now you know why.
Tools of the Trade — The Short List That Actually Matters
Every complex system accumulates a toolkit. Most of the tools in a Linux administrator's arsenal are familiar — ps, grep, systemctl, journalctl. What follows isn't a comprehensive Linux reference. It's the specific set of tools that proved indispensable for this particular class of problem: a long-running, multi-project BOINC installation with container runtimes, process escape, thermal management, and a package manager that occasionally destroys your careful workarounds.
dpkg-divert — Own a File the Package Manager Wants
Already covered in depth in the Docker section, but worth restating as a general principle because it applies far beyond this specific use case.
Any time you find yourself maintaining a customised version of a file that a Debian/Ubuntu package also installs — a config file, a binary, a wrapper script — dpkg-divert is the correct tool. It registers a permanent redirect in the dpkg database: "when any package tries to install /usr/bin/docker, put it at /usr/bin/docker.real instead." Your version owns the original path. Package updates update the diverted path. The two never collide.
```bash
# Register the divert
dpkg-divert --add --no-rename --divert /usr/bin/docker.real /usr/bin/docker
# Verify
dpkg-divert --list | grep docker
# Remove when no longer needed
dpkg-divert --remove --no-rename /usr/bin/docker
```

apt-mark hold is a blunt instrument that blocks updates entirely. dpkg-divert is a scalpel that lets updates proceed while protecting your customisation. Learn it once, reach for it whenever the situation calls for it.
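If the deployment lives in a setup script that may run more than once, a small idempotency guard (a hypothetical helper, not from the repository) keeps the state explicit:

```bash
#!/usr/bin/env bash
# Register a diversion only if one isn't already recorded for the path.
divert_registered() {
    dpkg-divert --list 2>/dev/null | grep -qF -- "$1"
}

ensure_divert() {
    local path=$1 dest=$2
    if ! divert_registered "$path"; then
        sudo dpkg-divert --add --no-rename --divert "$dest" "$path"
    fi
}

# ensure_divert /usr/bin/docker /usr/bin/docker.real
```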
systemd Drop-ins — Modify Units Without Touching the Original
Systemd unit files installed by packages live in /lib/systemd/system/. You could edit them directly — and they'd be overwritten the next time the package updates. The correct approach is drop-in files: small override snippets that live in /etc/systemd/system/<unit>.d/ and are merged with the original at load time.
```bash
# Create a drop-in directory and file
mkdir -p /etc/systemd/system/boinc-client.service.d/
nano /etc/systemd/system/boinc-client.service.d/docker-dep.conf
```

Drop-ins can add, override, or clear directives. The clearing syntax is non-obvious but important — to remove an inherited directive, set it to empty:
```ini
# Clear an inherited directive
BindsTo=
# Override a value
KillMode=control-group
# Add a dependency
After=docker.service
Requires=docker.service
```

The BindsTo= empty assignment was specifically needed for boinc-affinity.service — the main unit file set BindsTo=boinc-client.service, which caused the affinity service to die whenever boinc-client restarted. The drop-in cleared it. Package updates to either service leave the drop-in untouched.
Always prefer drop-ins over direct unit file edits. Your future self will thank you when an update doesn't silently revert a change you made six months ago and forgot about.
taskset and renice — CPU Affinity and Priority
taskset pins a process to a specific set of CPU cores. renice adjusts its scheduling priority. Together they give you precise control over how compute-intensive processes share hardware.
```bash
# Pin PID 12345 to cores 0-3
taskset -cp 0-3 12345
# Set PID 12345 to lowest scheduling priority
renice -n 19 -p 12345
```
On pepperpots these are called programmatically by `boinc_affinity.sh` every ten seconds, with a rotating core window to distribute thermal load. The key insight from building that script: CPU affinity without priority management is incomplete. A process pinned to cores 0-1 at normal priority will still starve other work on those cores. `renice -n 19` ensures BOINC workers yield to anything that needs the CPU — the desktop, the browser, the user — while still consuming all available idle cycles.
The combination is what makes BOINC a genuinely good citizen on a machine that's also used for other things.
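The rotation itself is simple modular arithmetic. A minimal sketch, with an illustrative function name and window size rather than the exact boinc_affinity.sh implementation:

```bash
#!/usr/bin/env bash
# Compute a wrap-around window of cores for taskset: on a 14-core
# machine, a 4-core window at offset 12 yields "12,13,0,1".
window_cores() {
    local total=$1 win=$2 offset=$3 out="" i
    for ((i = 0; i < win; i++)); do
        out+="${out:+,}$(( (offset + i) % total ))"
    done
    echo "$out"
}

window_cores 14 4 12    # → 12,13,0,1
# each cycle, advance the offset, then: taskset -cp "$(window_cores 14 4 "$offset")" "$pid"
```

Advancing the offset each cycle walks the window around the package, so sustained load doesn't cook the same cores.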
---
### boinccmd — The CLI Interface and Its Quirks
`boinccmd` is the command-line interface to a running BOINC client. It communicates via the GUI RPC interface — the same channel `boincmgr` uses — which means it needs to authenticate with a password stored in `gui_rpc_auth.cfg`.
The quirk: `boinccmd` looks for that file in the **current working directory**, not at the path where the file actually lives. The file is at `/etc/boinc-client/gui_rpc_auth.cfg`, symlinked to `/var/lib/boinc-client/gui_rpc_auth.cfg`. Running `boinccmd` from any directory other than `/var/lib/boinc-client/` produces:
```
gui_rpc_auth.cfg exists but can't be read
```

The fix is the subshell pattern, used consistently throughout the start and stop scripts:
```bash
(cd /var/lib/boinc-client && boinccmd --set_run_mode auto)
```

The subshell `()` means the `cd` doesn't affect the calling shell's working directory. Clean, portable, and immediately obvious to anyone reading the script.
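The property the pattern relies on is easy to verify in isolation. A minimal demonstration, with pwd standing in for boinccmd:

```bash
#!/usr/bin/env bash
# The parenthesised command runs in a child shell, so its cd
# cannot leak back into the caller.
before=$(pwd)
( cd /tmp && pwd )          # prints /tmp
after=$(pwd)
[ "$before" = "$after" ] && echo "caller's cwd unchanged"
```

The same pattern works for any tool that insists on a particular working directory.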
Do not use `--passwd $(cat /path/to/file)` as an alternative — it exposes the password in the process list, where any user can see it with `ps aux`.
pgrep and pkill — Process Management by Name
pgrep and pkill find and signal processes by name or command line pattern rather than PID. For BOINC administration, where process names are more stable than PIDs across restarts, they're more useful than PID-based approaches.
```bash
# Find processes matching a pattern
pgrep -f "runargs.EVNTtoHITS"
# Kill processes matching a pattern (requires sudo for root-owned processes)
sudo pkill -f "EVNTtoHITS"
# List all matching processes with full command line
pgrep -af "boinc"
```
The `-f` flag matches against the full command line, not just the process name. Essential for ATLAS orphan detection where the distinguishing information is in the arguments (`runargs.EVNTtoHITS.py`), not the binary name (`python`).
One consistent trap: if BOINC is running as root (as it does on pepperpots via `sudo boinc --redirectio`), `pkill boinc` run as an unprivileged user silently does nothing. The process doesn't terminate, no error is reported, and the shell returns normally. Always verify with `pgrep -a boinc` after a pkill if you're not certain of the privilege level.
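A defensive pattern worth adopting, sketched here with a hypothetical function name, is to pair every pkill with a pgrep verification:

```bash
#!/usr/bin/env bash
# Kill by pattern, then confirm the processes are actually gone.
# pkill's exit status alone doesn't tell you whether the target died,
# so only a pgrep afterwards tells the truth.
kill_and_verify() {
    local pattern=$1
    pkill -f "$pattern" 2>/dev/null
    sleep 1
    if pgrep -f "$pattern" >/dev/null; then
        echo "WARNING: '$pattern' still running; try again with sudo" >&2
        return 1
    fi
}

# kill_and_verify "EVNTtoHITS"
```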
---
### git — Institutional Memory With Timestamps
Not a sysadmin tool in the traditional sense, but the most important one on this list for long-term system management.
The value of version control for infrastructure isn't primarily about branching or collaboration. It's about the commit log — a timestamped, annotated record of every change made to every file, with the human reasoning preserved in the commit message.
```
Phase 9: docker wrapper dpkg-divert, vboxusers fix, stop script repair

- stop-boinc-procedure: add sudo to pkill commands; fix boinccmd auth pattern
- STARTUP_SEQUENCE.md: new doc covering root/user decisions, boinccmd quirk,
  group membership requirements, annotated start/stop sequences
- TIMELINE.md: append Phase 9 covering docker-ce overwrite, vboxusers discovery,
  stop script pkill failure and dpkg-divert solution
```

Six months from now, that commit tells you exactly what was wrong, what was fixed, and why. The timestamp tells you when. The diff tells you precisely what changed.
Write commit messages that explain the why, not just the what. "Fix stop script" is useless history. "Add sudo to pkill: boinc runs as root, unprivileged pkill silently fails" is the kind of message that saves an hour of head-scratching when the same issue surfaces in a different context.
The repository for this project — github.com/black-vajra/sable-boinc_admin — is public and intended as a reference for other BOINC administrators. The commit history is part of the documentation.
What Those CPU Cycles Are Actually Doing
It's worth stepping back from the tooling and the troubleshooting for a moment to talk about the science. Not because the technical problems aren't interesting — they clearly are — but because the reason to solve them carefully, to document them properly, to keep the system running reliably, is that the work it's doing is genuinely important.
pepperpots is not folding proteins or rendering someone's CGI pipeline. It is contributing to active, peer-reviewed scientific research at four separate institutions. Understanding what that research is makes the effort of maintaining the system feel less like yak shaving and more like what it actually is: participating in science at a scale that was impossible for individuals a generation ago.
Einstein@Home — Listening for Dead Stars
The Einstein@Home work unit running on pepperpots during the session described in this article was searching a 0.5 Hz band centred on 2059 Hz for continuous gravitational wave signals from rotating neutron stars. The data came from LIGO's fourth observing run — the same interferometers that made history in 2015 with the first direct detection of gravitational waves from a black hole merger.
A neutron star spinning at roughly 1030 revolutions per second emits gravitational waves at twice its spin frequency — in this case about 2059 Hz. The wave amplitude is extraordinarily small — spacetime distortions measured in fractions of the diameter of a proton, propagating across thousands of light years before arriving at the LIGO detectors in Hanford, Washington and Livingston, Louisiana. Finding them in the noise requires searching 11,448 patches of sky, each independently, across thousands of frequency bands, each requiring on the order of 864 trillion floating point operations per work unit.
No single computing cluster has the resources to do this at the required depth. The search is distributed across hundreds of thousands of volunteer computers worldwide. Each work unit covers one sky patch, one frequency band. The aggregate is a survey of the entire sky at a sensitivity no other approach can match.
If a continuous gravitational wave source is discovered in Einstein@Home data, the work unit that found it will be traceable to the specific volunteer computer that processed it. That computer's hostname will appear in the paper.
LHC@home — Simulating the Collider
The Large Hadron Collider produces approximately one billion proton-proton collisions per second at the interaction points inside its detectors. The vast majority are uninteresting — well-understood Standard Model processes producing familiar particles. The physics of interest is in the rare events: Higgs boson production, potential supersymmetric particle signatures, precision measurements of known particles that might reveal deviations from theoretical predictions.
To distinguish signal from background, physicists need to know exactly what the detectors should see for any given physics process. This requires Monte Carlo simulation: generate a collision event according to the theoretical model, propagate every resulting particle through a detailed simulation of the detector geometry, record the simulated detector response, compare to real data.
GEANT4 — the simulation toolkit running inside every ATLAS work unit on pepperpots — models the interaction of particles with matter at the level of individual detector elements. It tracks every particle through every layer of the detector, simulating electromagnetic showers, hadronic interactions, energy deposits in calorimeter cells, track hits in silicon pixel detectors. A single simulated event can involve thousands of particles and millions of individual physics calculations.
The CMS tasks running via VirtualBox are doing the same thing for the Compact Muon Solenoid — the other large general-purpose detector at the LHC, independently searching the same collision data for the same and different physics signatures. Two experiments, same collider, independent analyses. Agreement between them is one of the strongest validations in experimental particle physics.
When you see a new LHC result — a refined Higgs mass measurement, a new constraint on dark matter production, a precision electroweak measurement — the Monte Carlo samples that underpinned it were produced by a distributed computing grid that included volunteer machines like pepperpots.
MilkyWay@home — Weighing Dark Matter
The Milky Way is cannibalising its satellite galaxies. As dwarf galaxies orbit within the Milky Way's gravitational field, tidal forces strip stars from their outskirts, leaving long streams of stellar debris tracing the orbit through space. These streams — visible in wide-field sky surveys as overdensities of stars following great circle arcs — are fossils of the interaction.
The shape of a tidal stream depends on the gravitational field it moved through. A stream that formed in a spherical gravitational potential looks different from one that formed in a flattened or triaxial potential. Since most of the Milky Way's mass is dark matter, and dark matter's distribution determines the potential, the streams are visible tracers of an invisible mass distribution.
MilkyWay@home at Rensselaer Polytechnic Institute runs N-body simulations of dwarf galaxy disruption — thousands of simulations per run, each testing different assumptions about the dark matter halo's mass, shape, and density profile. Each work unit is one parameter combination. The simulation's final stellar distribution is compared to actual sky observations using a likelihood function. Differential evolution drives the parameter search toward the best fit.
The result is a measurement of dark matter distribution in the Milky Way derived entirely from the positions of stars — no dark matter detector required, no assumptions about dark matter's particle nature, just gravity and patient computation.
Asteroids@home — Mapping the Solar System's Small Bodies
Asteroid shape models matter for reasons beyond pure scientific interest. An asteroid's rotation state, shape, and surface properties determine how thermal radiation forces and torques — the Yarkovsky and YORP effects — will evolve its orbit and spin state over centuries. Accurate shape models are a prerequisite for reliable long-term orbit prediction, which is in turn a prerequisite for meaningful impact risk assessment.
Asteroids@home reconstructs three-dimensional shape models from photometric light curves — brightness measurements over time as an asteroid rotates. The brightness depends on the projected cross-section toward the observer, which changes as the asteroid rotates and as the viewing geometry changes across multiple oppositions. With enough light curves from enough viewing angles, the shape can be constrained.
Each work unit tests one candidate rotation pole and shape combination against the observational data. The computational demand is modest by ATLAS standards, but the search space is large — thousands of candidates per asteroid, thousands of asteroids in the queue.
The Aggregate
On a typical day, pepperpots contributes to all four of these efforts simultaneously. The GPU searches for gravitational wave sources in LIGO data while the CPU cores split between ATLAS particle physics simulations, dark matter distribution fitting, and asteroid shape reconstruction.
None of this requires any particular expertise from the volunteer. Install BOINC, attach to projects, keep the machine running. The science happens automatically.
What requires expertise — and what this article has been about — is keeping it running reliably when the software stack is complex, the container runtimes are finicky, and routine package updates occasionally break things in non-obvious ways. The science is the reason to invest that effort. The tooling and the documentation are what make the investment sustainable.
Getting Started — Your Machine, CERN's Data
If this article has done its job, you're either already running BOINC and now understand why some things weren't working, or you're not running it yet and want to be. Either way, here's the practical path forward.
Step 1 — Install BOINC
On Debian/Ubuntu/Kubuntu:
```bash
sudo apt install boinc-client boinc-manager
```

This installs the daemon (boinc-client), the GUI (boincmgr), and the command-line interface (boinccmd). The systemd service will be enabled automatically. If you plan to follow the manual startup pattern described in this article, disable it:
```bash
sudo systemctl disable boinc-client
```

Step 2 — Install the Container Runtimes
For LHC@home specifically, you need both:
Docker:
```bash
# Follow the official docker-ce installation for your distro
# at docs.docker.com/engine/install/ubuntu/
# Do NOT use the docker.io package from Ubuntu repos — it's outdated
```

VirtualBox:
```bash
# Follow the official VirtualBox installation for your distro
# at virtualbox.org/wiki/Linux_Downloads
# Install the current 7.x series
```

Step 3 — Sort Out Group Memberships
Before attaching to any project, get the boinc system user's group memberships right. Do it now, before you've spent hours wondering why CMS tasks never run:
```bash
sudo usermod -aG docker,vboxusers boinc
id boinc   # verify
```

Expected output includes: docker and vboxusers in the groups list.
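A small pre-flight check (hypothetical helper, not from the repository) catches a missing membership before it costs you hours:

```bash
#!/usr/bin/env bash
# Report any required group a user is missing.
check_groups() {
    local user=$1; shift
    local g missing=0
    for g in "$@"; do
        if ! id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qx "$g"; then
            echo "missing: $user not in $g"
            missing=1
        fi
    done
    return "$missing"
}

# check_groups boinc docker vboxusers
```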
Step 4 — Deploy the Docker Wrapper
Before starting BOINC or attaching to LHC@home, deploy the wrapper and protect it with dpkg-divert:
```bash
# Move real binary aside
sudo mv /usr/bin/docker /usr/bin/docker.real
# Create wrapper at /usr/bin/docker
sudo tee /usr/bin/docker << 'EOF'
#!/bin/bash
# Wrapper: injects --privileged --user 122:127 into docker run/create
# Workaround for docker-ce 29.x + BOINC 8.2.8 LHC@home tmpfs incompatibility
# Real binary: /usr/bin/docker.real (managed by dpkg-divert)
# Deployment: dpkg-divert --add --no-rename --divert /usr/bin/docker.real /usr/bin/docker
# Removal:    dpkg-divert --remove --no-rename /usr/bin/docker
REAL_DOCKER=/usr/bin/docker.real
if [[ "$1" == "run" || "$1" == "create" ]]; then
    exec "$REAL_DOCKER" "$1" --privileged --user 122:127 "${@:2}"
fi
exec "$REAL_DOCKER" "$@"
EOF
sudo chmod 755 /usr/bin/docker
# Register the divert — future docker-ce updates go to docker.real
sudo dpkg-divert --add --no-rename --divert /usr/bin/docker.real /usr/bin/docker
# Verify
dpkg-divert --list | grep docker
head -3 /usr/bin/docker
```

Step 5 — Attach to Projects
Launch boincmgr and use the project wizard, or use boinccmd directly:
```bash
# From /var/lib/boinc-client/ for auth reasons covered in this article
(cd /var/lib/boinc-client && boinccmd --project_attach https://lhcathome.cern.ch/lhcathome <account_key>)
(cd /var/lib/boinc-client && boinccmd --project_attach https://einsteinathome.org <account_key>)
```

Account keys are found in your account settings on each project's website after registration.
Step 6 — Start a GitHub Repo
Do this before anything starts breaking. The marginal effort of setting up version control on day one is tiny. The value compounds immediately.
```bash
mkdir ~/boinc-admin
cd ~/boinc-admin
git init
mkdir scripts systemd docs config
touch README.md docs/TIMELINE.md
git add .
git commit -m "Initial repo structure"
```

Copy your scripts in as you write them. Commit with messages that explain why, not just what. Start TIMELINE.md with today's date and the baseline system state. The next time something breaks — and something will break — you'll have context to work from.
The sable-boinc_admin repository at github.com/black-vajra/sable-boinc_admin is available as a reference implementation: scripts, systemd units, configuration files, and the full documentation set described in this article. Fork it, adapt it, use what's useful.
Step 7 — Start a Claude Project
Open Claude, create a new Project, and write a system prompt that describes your machine: hostname, hardware, OS, BOINC version, active projects, known quirks. Upload your key documents as project files. Keep both current as the system evolves.
The investment is an hour of setup. The return is months of contextual technical assistance that doesn't require re-explaining your system from scratch every session. Refer back to the methodology section of this article for the full workflow pattern.
What to Expect
The first few weeks of running BOINC on a complex setup will surface problems. Some will be in this article. Some won't. The ATLAS orphan issue took weeks to manifest clearly — it requires a long-running session and specific task types to reproduce. The VBox group membership issue won't show itself until CMS tasks are available and you notice you never get any.
Expect the unexpected. Document as you go. Commit early and often. And when you find something that isn't in this article or the repository — write it up and share it. The BOINC community is small enough that a well-documented solution to a non-obvious problem has outsized impact.
Your CPU cores are idle right now. They don't have to be.
The sable-boinc_admin repository: github.com/black-vajra/sable-boinc_admin
LHC@home: lhcathome.cern.ch
Einstein@Home: einsteinathome.org
BOINC: boinc.berkeley.edu
Jonathan Brown ~ Border Cyber Group