Getting Started with JobDistributor
A practical, end-to-end guide to running large hyperparameter sweeps across one machine, multiple machines, or one or more HPC clusters — without changing a single line of your training code. Already know the flow? Try the Quick Start instead.
Introduction
JobDistributor is an open-source platform for distributing a large number of independent compute tasks across any number of machines. It consists of three components:
| Component | What it does | Where it runs |
|---|---|---|
| Hub | Manages accounts, experiments, API keys, and secure public access for your server | Central cloud server (hosted at jobdistributor.net) |
| Server | Holds the job queue; serves jobs to workers; stores results; provides a web dashboard | Any machine with Docker — your laptop, lab server, or VM |
| Worker (jd_worker_cli) | Pulls jobs from the server, runs your script, reports completion | Any compute node: laptop, desktop, GPU server, Slurm nodes on one or more HPC clusters, VPS instances, or cloud VMs |
JobDistributor is the right tool when:
- You need to run the same script many times with different parameter combinations.
- Each run is completely independent — no job needs results from another.
- You want to spread work across laptops, GPU servers, and cluster nodes simultaneously.
- You want one central dashboard to track progress, failures, and results in real time.
Typical use cases:
- Machine learning hyperparameter search
- Physics or network simulation sweeps
- Benchmark matrices across algorithm variants
- Repeated experiments with different random seeds
1 Example Problem — ML Hyperparameter Search
Throughout this tutorial we will train a neural network classifier to recognize handwritten digits. The dataset (scikit-learn's built-in optical digits, 1,797 samples, 10 classes) is a classic benchmark for classification algorithms.
The objective: find the combination of four hyperparameters — learning rate, network size, regularization strength, and maximum training iterations — that produces the highest test-set accuracy. There is no mathematical shortcut; we simply need to train and evaluate the model once for every combination.
| Hyperparameter | Values | Count |
|---|---|---|
| learning_rate_init | 0.001, 0.01, 0.1 | 3 |
| hidden_layer_sizes | 64; 128; 64,64 | 3 |
| alpha (L2 regularization) | 0.0001, 0.001, 0.01 | 3 |
| max_iter | 100, 200 | 2 |
| Total combinations (3 × 3 × 3 × 2) | | 54 |
Serial baseline — running all combinations on one machine
Before distributing anything, here is the naïve approach. We need two
files: train.py, which trains the model for a single
parameter combination, and serial_grid.py, which calls
train.py in a loop over all 54 combinations.
train.py — trains and evaluates the model for one combination of hyperparameters:
#!/usr/bin/env python3
"""Train an MLP on the digits dataset for one hyperparameter combination.

Usage:
    python train.py --learning_rate_init 0.01 --hidden_layer_sizes 128 \
        --alpha 0.001 --max_iter 200
"""
import argparse
import json
import os

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate_init", type=float, required=True)
parser.add_argument("--hidden_layer_sizes", type=str, required=True,
                    help="Single int (e.g. 128) or comma-separated (e.g. 64,64)")
parser.add_argument("--alpha", type=float, required=True)
parser.add_argument("--max_iter", type=int, required=True)
args = parser.parse_args()

# Parse hidden_layer_sizes: "64" → (64,)  "64,64" → (64, 64)
hidden = tuple(int(x) for x in args.hidden_layer_sizes.split(","))

# Load and split data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalise
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train
clf = MLPClassifier(
    hidden_layer_sizes=hidden,
    learning_rate_init=args.learning_rate_init,
    alpha=args.alpha,
    max_iter=args.max_iter,
    random_state=42,
)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

result = {
    "learning_rate_init": args.learning_rate_init,
    "hidden_layer_sizes": args.hidden_layer_sizes,
    "alpha": args.alpha,
    "max_iter": args.max_iter,
    "accuracy": round(accuracy, 4),
}
print(json.dumps(result))
You can run it directly to verify it works before wiring it into a sweep:
python train.py --learning_rate_init 0.01 --hidden_layer_sizes 128 \
    --alpha 0.001 --max_iter 200
# {"learning_rate_init": 0.01, "hidden_layer_sizes": "128", ..., "accuracy": 0.9722}
serial_grid.py — iterates over every combination and calls train.py once per run:
#!/usr/bin/env python3
"""Naïve serial sweep — runs all 54 combinations one by one."""
import itertools, subprocess, sys, time
GRID = {
    "learning_rate_init": [0.001, 0.01, 0.1],
    "hidden_layer_sizes": ["64", "128", "64,64"],
    "alpha": [0.0001, 0.001, 0.01],
    "max_iter": [100, 200],
}

keys = list(GRID.keys())
combos = list(itertools.product(*GRID.values()))
print(f"Total combinations: {len(combos)}")

for i, combo in enumerate(combos, 1):
    params = dict(zip(keys, combo))
    print(f"\n[{i:02d}/{len(combos)}] {params}")
    t0 = time.time()
    # Build "--key value" pairs and run train.py as a child process
    proc = subprocess.run(
        [sys.executable, "train.py"] +
        [x for k, v in params.items() for x in (f"--{k}", str(v))],
        capture_output=True, text=True,
    )
    print(f"  {'OK' if proc.returncode == 0 else 'FAILED'} ({time.time()-t0:.1f}s)")
    if proc.stdout:
        print("  ", proc.stdout.strip())

print("\nAll done.")
Why this problem is embarrassingly parallel
Each of the 54 combinations is completely independent: no run needs results from another, and there is no shared state. This makes the problem embarrassingly parallel — the ideal case for a distributed job queue.
- With 1 worker: 54 jobs × 5 min (an assumed average runtime per job) = 4.5 hours.
- With 10 workers: ≈ 27 to 30 minutes.
- With 54 workers: ≈ 5 minutes — all jobs in parallel.
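The estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes every job takes the same 5 minutes (an illustrative figure, not a measurement) and that workers pick up jobs in rounds; `wall_clock_minutes` is a name made up for this example.

```python
import math

JOBS = 54            # combinations in the grid
MINUTES_PER_JOB = 5  # assumed average runtime per job (illustrative)


def wall_clock_minutes(n_jobs, n_workers, minutes_per_job):
    """Wall-clock time if all jobs take equally long and run in rounds."""
    rounds = math.ceil(n_jobs / n_workers)  # last round may be partly idle
    return rounds * minutes_per_job


for workers in (1, 10, 54):
    print(workers, "worker(s):", wall_clock_minutes(JOBS, workers, MINUTES_PER_JOB), "min")
```

With whole jobs, 10 workers need six rounds, so the sketch reports 30 minutes rather than the ideal 270 / 10 = 27; real sweeps with uneven job durations usually land somewhere in between.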
You might be thinking: "I can already use Python's
multiprocessing or concurrent.futures to run
jobs in parallel." That is true — and it is a perfectly reasonable
first step. However, multiprocessing is limited to the cores of a
single machine. Once that machine's CPU is fully
occupied, you cannot go faster without provisioning more hardware.
JobDistributor removes that ceiling. Instead of being confined to one
machine, you can point any machine you have access to
at the same experiment and let it pick up jobs from the shared queue —
your laptop, a colleague's desktop, one or more HPC clusters, multiple
VPS instances, cloud VMs, or any combination of these simultaneously. Each machine runs
jd_worker_cli, and the server distributes work across all
of them automatically. There is no configuration change required to add
or remove a machine mid-sweep.
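For comparison, the single-machine approach mentioned above might look like the following sketch. It uses `concurrent.futures` with threads, since each thread only waits on a child process. The stand-in commands are placeholders so the example runs without train.py; `run_one` and `run_local_parallel` are hypothetical helper names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd):
    """Run one command and return (exit code, stripped stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def run_local_parallel(commands, max_workers):
    """Run commands concurrently on this machine; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


# Stand-in commands. In a real sweep each entry would look like
# [sys.executable, "train.py", "--alpha", "0.001", ...].
demo = [[sys.executable, "-c", f"print({i} * {i})"] for i in range(4)]
results = run_local_parallel(demo, max_workers=2)
print(results)
```

However large `max_workers` is set, all child processes still share the cores of one machine, which is the ceiling described above.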
| Approach | Machines | Setup needed | Scales beyond one machine? |
|---|---|---|---|
| Serial loop | 1 | None | No |
| Python multiprocessing | 1 | Minimal | No, limited to local CPU cores |
| JobDistributor | Any number: laptops, desktops, multiple HPC clusters, multiple VPS, cloud VMs | Install jd-worker, set API key | Yes |
train.py above has no dependency on the
jd library — it is pure scikit-learn. You will add a
few jd calls in Step 3, but the core training logic
stays exactly the same. The same script runs whether it is called
by the serial loop or by jd_worker_cli.
① Step 1 — Hub Setup
The Hub is the central control plane. It registers your experiments, manages your API keys, and coordinates secure access so workers on any machine can reach your job server. You only need to do this once per experiment.
1a. Create an account
- Open JobDistributor Hub → Create account.
- Enter your email address and a password of at least 10 characters.
- Check your inbox for the 6-digit verification code and enter it on the verification page.
1b. Create an experiment
- Sign in and click Dashboard → + New Experiment.
- Choose a short, descriptive name such as digits-tune. Rules: lowercase letters, digits, and hyphens only; must start with a letter; maximum 48 characters.
- Click Create experiment.
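If you generate experiment names from a script, it can help to check them locally before opening the Hub. The sketch below encodes only the three rules stated above; the Hub may apply further checks that are not covered here, and `is_valid_experiment_name` is a hypothetical helper, not part of any JobDistributor package.

```python
import re

# Lowercase letters, digits, hyphens; must start with a letter; at most 48 characters.
_NAME_RE = re.compile(r"[a-z][a-z0-9-]{0,47}")


def is_valid_experiment_name(name):
    """Return True if name satisfies the naming rules listed in this tutorial."""
    return _NAME_RE.fullmatch(name) is not None


for candidate in ("digits-tune", "Digits-Tune", "2nd-run", "a" * 49):
    print(repr(candidate[:20]), is_valid_experiment_name(candidate))
```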
After creation, the Hub provisions a public URL for your server so that workers anywhere on the internet can reach it. You will see this URL on the experiment detail page once your server container is running — there is nothing you need to configure manually.
1c. Generate an API key
- Click API Keys in the top navigation bar.
- Click + New key and give it a descriptive name, e.g. lab-server.
- Click Create key. The full key is shown once in a green banner immediately after creation. Copy it now; it cannot be retrieved again without entering your password.
Keep the key out of version control, for example in a .env file excluded via .gitignore.
Export the key in your shell:
export JD_API_KEY=jd_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Add this line to ~/.bashrc or ~/.zshrc so it is set automatically in every new terminal session.
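Helper scripts around a sweep often need to confirm that the key is present without printing it. The following is a minimal sketch under that assumption: `load_api_key` and `mask` are hypothetical names, not part of jd-worker, and the only documented detail used is the JD_API_KEY variable name.

```python
import os


def load_api_key(env=os.environ):
    """Return JD_API_KEY from the environment, failing early if it is missing."""
    key = env.get("JD_API_KEY", "").strip()
    if not key:
        raise RuntimeError("JD_API_KEY is not set; export it before starting the server or workers")
    return key


def mask(key, visible=4):
    """Return a log-safe form that shows only the last few characters."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


# Demo with a placeholder value rather than a real key.
print(mask(load_api_key({"JD_API_KEY": "jd_example_placeholder"})))
```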
② Step 2 — Start the Job Server with Docker
The job server runs as a Docker container on any machine you control — your laptop, a lab server, or a cloud VM. Once started, it registers itself with the Hub automatically and becomes publicly accessible. No manual network configuration or port forwarding is required.
2a. Install Docker
Docker is the only prerequisite for running the server. Follow the official instructions for your operating system:
Linux (Ubuntu / Debian)
# Install Docker Engine (official script — runs on most Debian/Ubuntu releases)
curl -fsSL https://get.docker.com | sudo sh
# Allow your user to run Docker without sudo
sudo usermod -aG docker $USER
newgrp docker # apply group change without logging out
# Verify
docker --version
macOS
Download and install Docker Desktop for Mac.
Once installed, Docker Desktop runs in the background and provides
the docker command in your terminal.
Windows
Download and install Docker Desktop for Windows. Enable WSL 2 integration during setup for the best performance (recommended). After installation, use Docker from a PowerShell or WSL terminal.
Verify the installation on any platform:
docker --version
docker run hello-world
2b. Download run.sh and start the server
All you need is a single shell script. Download it once, then run it with your API key and experiment name:
# Download run.sh (one-time)
curl -fsSL https://raw.githubusercontent.com/NWSL-UCF/job-distributor/main/server/run.sh \
-o run.sh && chmod +x run.sh
JD_API_KEY=jd_xxxxxxxxxxxxxxxx ./run.sh digits-tune
The script automatically:
- Pulls the latest jobdistributor/jd-server Docker image.
- Contacts the Hub to obtain credentials and configuration for your experiment.
- Starts the job server and registers it with the Hub.
- Sends a heartbeat to the Hub every few minutes so the dashboard shows the server as online.
Other useful commands:
./run.sh digits-tune logs # tail live container logs
./run.sh digits-tune status # check if container is running
./run.sh digits-tune stop # stop container (data is preserved on disk)
All experiment data is stored on the host at
~/jd_server/digits-tune/ and persists across restarts.
2c. Dashboard PIN
The server dashboard is protected by a 6-digit PIN. There are three ways to manage it:
- View the default PIN: Go to the Hub Dashboard → click on your experiment → you will see the current PIN listed in the experiment details. You can update it directly from this page at any time without needing to know the old PIN.
- Update from the Hub: On the experiment detail page, click Reset PIN and enter a new PIN. This takes effect immediately.
- Update from the server dashboard: Open the server dashboard → go to Settings → PIN section. This requires you to enter the current PIN first. Use this option when you are already logged into the server dashboard.
③ Step 3 — Write Your Training Script
A job is a single execution of your script with one specific combination of parameters — one row in your experiment grid. Making your script atomic means each run performs a complete, self-contained unit of work: it reads its parameters, trains or computes something, saves the result, and exits. No run depends on any other.
JobDistributor passes each job's parameters to your script as standard
command-line arguments (--key value). Your script needs to:
- Accept parameters via argparse.
- Save outputs to the per-job directory provided by jd_job_dir().
- Optionally call jd_upload() to push results to the central server.
3a. Complete atomic entry script (train.py)
#!/usr/bin/env python3
"""
Atomic hyperparameter sweep entry script.

JobDistributor calls this script once per job, passing one complete
parameter combination as CLI arguments:
    --learning_rate_init 0.01 --hidden_layer_sizes 128 ...
"""
import argparse, json

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

from jd import jd_job_dir, jd_upload

# ── Parse hyperparameters ──────────────────────────────────────────────────
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate_init", type=float, default=0.001)
parser.add_argument("--hidden_layer_sizes", type=str, default="128")
parser.add_argument("--alpha", type=float, default=1e-4)
parser.add_argument("--max_iter", type=int, default=200)
args = parser.parse_args()

# Convert "64,64" → (64, 64); "128" → (128,)
hidden = tuple(int(x) for x in args.hidden_layer_sizes.split(","))

# ── Train ──────────────────────────────────────────────────────────────────
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = MLPClassifier(
    hidden_layer_sizes=hidden,
    learning_rate_init=args.learning_rate_init,
    alpha=args.alpha,
    max_iter=args.max_iter,
    random_state=42,
)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy: {accuracy:.4f}")

# ── Save result to this job's local directory ──────────────────────────────
job_dir = jd_job_dir()  # ~/jd_data/digits-tune/<job_id>/
job_dir.mkdir(parents=True, exist_ok=True)

result = {
    "learning_rate_init": args.learning_rate_init,
    "hidden_layer_sizes": args.hidden_layer_sizes,
    "alpha": args.alpha,
    "max_iter": args.max_iter,
    "accuracy": round(accuracy, 4),
}
out = job_dir / "result.json"
out.write_text(json.dumps(result, indent=2))
print(f"Saved: {out}")

# ── Upload result to the central server (optional) ─────────────────────────
# The file appears in the server dashboard under this job's entry.
jd_upload(str(out))
3b. Path helpers and environment variables
| Helper / Variable | Returns |
|---|---|
| jd_job_dir() | This job's local folder: ~/jd_data/<exp>/<job_id>/ |
| jd_exp_dir() | All jobs for this experiment: ~/jd_data/<exp>/ |
| jd_worker_workspace() | The jd_data root: ~/jd_data/ |
| os.environ["JD_JOB_ID"] | Integer job ID (as a string) |
| os.environ["JD_EXP_ID"] | Experiment name, e.g. digits-tune |
3c. Install the jd library
pip install jd-worker scikit-learn
④ Step 4 — Create Jobs on the Server Dashboard
Open the server dashboard from the Hub — click your experiment in the Hub Dashboard and then click Open Server Dashboard. Enter your PIN when prompted.
In the dashboard, go to the Add Jobs section and select the JSON tab. Paste the parameter grid below. The server generates all 54 combinations automatically.
{
"learning_rate_init": [0.001, 0.01, 0.1],
"hidden_layer_sizes": ["64", "128", "64,64"],
"alpha": [0.0001, 0.001, 0.01],
"max_iter": [100, 200]
}
After submission, the dashboard will show 54 jobs with status PENDING. Jobs move to RUNNING as workers pick them up, and to DONE or ABORTED when complete.
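To see where the number 54 comes from, the grid can be expanded locally. The sketch below takes the Cartesian product of the lists and turns one combination into the `--key value` arguments described in Step 3. It illustrates the idea only: the order in which the server enumerates combinations is an assumption, and `expand_grid` and `to_cli_args` are hypothetical names.

```python
import itertools
import json

GRID_JSON = """
{
  "learning_rate_init": [0.001, 0.01, 0.1],
  "hidden_layer_sizes": ["64", "128", "64,64"],
  "alpha": [0.0001, 0.001, 0.01],
  "max_iter": [100, 200]
}
"""


def expand_grid(grid):
    """Return one dict per combination (Cartesian product of all value lists)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in itertools.product(*grid.values())]


def to_cli_args(params):
    """Flatten {"alpha": 0.001} into ["--alpha", "0.001"]."""
    args = []
    for key, value in params.items():
        args += [f"--{key}", str(value)]
    return args


jobs = expand_grid(json.loads(GRID_JSON))
print(len(jobs))
print(to_cli_args(jobs[0]))
```

Running the sketch before pasting a grid into the dashboard is a cheap way to catch a grid that is much larger than intended.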
⑤ Step 5 — Run Workers with jd_worker_cli
jd_worker_cli is a command-line interface (CLI)
— a program you run in your terminal. It continuously polls the server
for pending jobs, launches your script with each job's parameters, and
reports the result back to the server. You can run as many
jd_worker_cli processes in parallel as your machine has
cores or GPUs available.
5a. Create a Python virtual environment
A virtual environment is an isolated Python installation
for your project. It keeps your dependencies separate from other projects
and from the system Python. Always create one before installing
jd-worker.
Linux / macOS
# Create the environment (run once)
python3 -m venv .venv
# Activate it (run every new terminal session)
source .venv/bin/activate
# Confirm the active Python
which python
Windows (PowerShell)
# Create the environment (run once)
python -m venv .venv
# Activate it
.\.venv\Scripts\Activate.ps1
5b. Install jd-worker
pip install jd-worker scikit-learn
5c. Launch workers
By default, jd_worker_cli detaches to the background
and returns immediately — you do not need tmux or
screen. The worker keeps polling the job queue even if you
close the SSH session or terminal. Process metadata is stored in
~/.cache/<expId>/workers.db (or under
JD_CACHE_PATH if you set it).
source .venv/bin/activate
export JD_API_KEY=jd_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
jd_worker_cli expId=digits-tune entry_script=train.py machine_type=cpu
Manage background workers from any terminal (same API key and venv):
jd_worker_cli expId=digits-tune worker-list
jd_worker_cli expId=digits-tune worker-logs 0
jd_worker_cli expId=digits-tune stop all
Use foreground=true only when you want the worker attached to the current terminal (local debugging). On Slurm, see Step 6: foreground=true is required there.
5d. Spawn multiple workers at once
If your machine has multiple CPU cores or GPUs, you can run one
worker per core to fully utilize the hardware. Use the
num_workers=N argument to launch N parallel workers
with a single command — no shell loops or multiple terminals needed:
# Spawn 4 parallel workers (process IDs 0–3 are assigned automatically)
jd_worker_cli expId=digits-tune entry_script=train.py \
num_workers=4 machine_type=cpu
Each worker gets an auto-assigned process_id (0, 1, 2, …)
so they appear as separate entries in the server dashboard.
The command blocks until all workers finish or the queue is empty.
Press Ctrl+C to stop all workers at once.
Set num_workers to the number of available CPU cores (or GPUs). On Linux you can check with nproc; on macOS with sysctl -n hw.physicalcpu.
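The same check can be done from Python, which is handy when a launcher script chooses the worker count. This is a sketch with a hypothetical helper name, `suggested_num_workers`. Note that it counts logical CPUs, which can be higher than the physical-core figure reported by sysctl -n hw.physicalcpu.

```python
import os


def suggested_num_workers(reserve=0):
    """Logical CPUs usable by this process, minus an optional reserve (at least 1)."""
    try:
        # Linux: respects CPU affinity limits (e.g. taskset or a Slurm cpuset)
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # macOS / Windows: fall back to the machine-wide logical CPU count
        usable = os.cpu_count() or 1
    return max(1, usable - reserve)


print(f"num_workers={suggested_num_workers(reserve=1)}")
```

Reserving one core (`reserve=1`) leaves headroom for the operating system and for the worker's own polling; whether that is worthwhile depends on how CPU-bound each job is.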
All available jd_worker_cli options:
| Argument | Env variable | Default | Description |
|---|---|---|---|
| expId=<name> | JD_EXP_ID | required | Experiment name |
| entry_script=<path> | JD_ENTRY_SCRIPT | required | Python script to run per job |
| api_key=<key> | JD_API_KEY | — | API key; enables Hub mode (auto-discovers server URL) |
| machine_type=<label> | JD_MACHINE_TYPE | worker | Label shown in dashboard |
| process_id=<N> | — | 0 | Unique ID for multiple workers on one machine |
| num_workers=<N> | JD_NUM_WORKERS | 1 | Spawn N parallel workers in one command (process IDs assigned automatically) |
| once=true | JD_ONCE | false | Run one job then exit (useful for debugging) |
| foreground=true | JD_FOREGROUND | false | Run in the foreground; required on Slurm, optional for local debugging |
| hub=<url> | JD_HUB_URL | hub.jobdistributor.net | Hub base URL |
| — | JD_WORKSPACE_PATH | ~ | Parent directory for jd_data/<exp>/<job_id>/ (job files, default logs) |
| — | JD_CACHE_PATH | same as workspace | Parent directory for .cache/<exp>/workers.db (registry). On HPC, e.g. /tmp/.jd_cache — see Step 6 |
Worker logs are saved to:
~/jd_data/digits-tune/jd_worker_logs/jd_worker_<process_id>.log
⑥ Step 6 — Scale on HPC with Slurm Array Jobs
If you have access to one or more HPC clusters managed by Slurm, you
can launch many workers as a Slurm array job. Each array task runs one
jd_worker_cli process that continuously pulls jobs from the
queue until the queue is empty or the time limit is reached.
No manual partitioning is required — workers across all machines
share the same queue.
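The reason no partitioning is needed is that workers pull jobs rather than being assigned them. The sketch below imitates that pattern inside one Python process, with threads standing in for array tasks and a `queue.Queue` standing in for the server. It is an illustration of the pull model only, not the JobDistributor protocol.

```python
import queue
import threading


def drain(jobs, n_workers):
    """Let n_workers threads pull from one shared queue; return jobs handled per worker."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)
    handled = {w: [] for w in range(n_workers)}

    def worker(worker_id):
        while True:
            try:
                job = pending.get_nowait()  # claim the next pending job, if any
            except queue.Empty:
                return                      # queue is empty: this worker exits
            handled[worker_id].append(job)  # each worker writes only to its own list

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled


handled = drain(list(range(54)), n_workers=10)
all_jobs = sorted(j for jobs in handled.values() for j in jobs)
print(all_jobs == list(range(54)))  # every job ran exactly once
```

How many jobs each worker ends up with varies from run to run, which is the point: faster or earlier workers simply take more of the queue.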
Save the following as submit_jd.slurm next to
train.py, then adjust the #SBATCH directives
(partition, array size, time limit, mail) for your cluster.
Use foreground=true so the worker stays attached to the
batch job — without it, the launcher exits and Slurm kills the workers
when the script ends.
Two storage paths on HPC.
JD_WORKSPACE_PATH is where job sandboxes live
(…/jd_data/<expId>/<job_id>/) — usually a
shared directory (home or Lustre) so checkpoints and
uploads persist.
JD_CACHE_PATH is where the worker keeps its SQLite registry
(…/.cache/<expId>/workers.db) — use
node-local scratch (e.g. /tmp/.jd_cache)
so registry I/O stays off Lustre/NFS (which can cause
disk I/O error in worker logs). One
.jd_cache per allocated node is enough; experiments still
get separate .cache/<expId>/ subdirs.
On a laptop you typically omit JD_CACHE_PATH; both paths
default to your home directory.
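The two locations can be worked out from the environment as in the sketch below, which follows the defaults listed in the options table (workspace defaults to the home directory, cache defaults to the workspace). It is a reading aid, not the worker's actual resolution code, and `resolve_storage` is a hypothetical name.

```python
import os
from pathlib import Path


def resolve_storage(exp_id, env=os.environ):
    """Return (experiment data dir, registry db path) for the documented layout."""
    workspace = Path(env.get("JD_WORKSPACE_PATH") or Path.home())
    cache = Path(env.get("JD_CACHE_PATH") or workspace)  # default: same as workspace
    exp_data = workspace / "jd_data" / exp_id            # job dirs live below this
    registry = cache / ".cache" / exp_id / "workers.db"  # SQLite registry
    return exp_data, registry


hpc_env = {"JD_WORKSPACE_PATH": "/lustre/me", "JD_CACHE_PATH": "/tmp/.jd_cache"}
print(resolve_storage("digits-tune", hpc_env))
print(resolve_storage("digits-tune", {}))  # laptop-style defaults
```

The `/lustre/me` path is a made-up example of a shared filesystem; substitute whatever shared directory your cluster provides.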
#!/bin/bash
#SBATCH --job-name=jd
#SBATCH --array=0-99 # one worker per array task; adjust range
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=48:00:00
#SBATCH --partition=normal # replace with your cluster's partition
#SBATCH --output=logs/slurm-%A_%a.out
#SBATCH --error=logs/slurm-%A_%a.err
#SBATCH --mail-user=you@example.com
#SBATCH --mail-type=BEGIN,END,FAIL
set -euxo pipefail
mkdir -p logs
# Activate your virtual environment (create with: python -m venv venv)
source ./venv/bin/activate
export JD_API_KEY=jd_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Shared — job dirs, checkpoints, worker logs (adjust for Lustre/home)
export JD_WORKSPACE_PATH="${SLURM_SUBMIT_DIR}/jd_data"
# Node-local — registry only (one .jd_cache per compute node)
export JD_CACHE_PATH="${TMPDIR:-/tmp}/.jd_cache"
mkdir -p "${JD_WORKSPACE_PATH}" "${JD_CACHE_PATH}"
python --version
which jd_worker_cli
jd_worker_cli expId=digits-tune entry_script=train.py \
machine_type=stk foreground=true
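mkdir -p logs   # Slurm does not create the --output/--error directory; it must exist before submission
sbatch submit_jd.slurm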
# Monitor
squeue -u $USER
On Slurm, always pass foreground=true. The default background mode detaches workers and returns immediately; the batch script then exits and Slurm terminates those worker processes.
The Registry DB: line shows the resolved workers.db path (under JD_CACHE_PATH); job data paths use JD_WORKSPACE_PATH. If you use management commands on the login node, run jd_worker_cli expId=digits-tune where there with the same exports.
✓ End-to-End Checklist
- ✔ Hub account created and email verified
- ✔ Experiment created on the Hub Dashboard (digits-tune)
- ✔ API key generated → copied and exported as JD_API_KEY
- ✔ Docker installed (docs.docker.com/get-docker/)
- ✔ Server started: JD_API_KEY=jd_xxx ./run.sh digits-tune
- ✔ Dashboard PIN confirmed from Hub experiment detail page
- ✔ train.py accepts argparse arguments and calls jd_job_dir()
- ✔ Virtual environment created and pip install jd-worker scikit-learn run
- ✔ Parameter grid submitted from server dashboard → 54 PENDING jobs
- ✔ Workers launched: jd_worker_cli expId=digits-tune entry_script=train.py
- ✔ (Optional) Slurm array submitted: sbatch submit_jd.slurm
- ✔ Results visible in server dashboard and / or ~/jd_data/digits-tune/
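Once jobs have written their result.json files, a short script can pick the best combination. The sketch below assumes the layout used in this tutorial (one result.json per job directory under ~/jd_data/digits-tune/) and only sees jobs that ran on the machine where it is executed; results from other machines would need to be downloaded first. `load_results` and `best` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_results(exp_dir):
    """Read every <job_id>/result.json under exp_dir, skipping unreadable files."""
    if not exp_dir.is_dir():
        return []
    results = []
    for path in sorted(exp_dir.glob("*/result.json")):
        try:
            results.append(json.loads(path.read_text()))
        except (OSError, json.JSONDecodeError):
            continue  # the job may still be writing, or it failed part-way
    return results


def best(results):
    """Return the result dict with the highest accuracy, or None if there are none."""
    return max(results, key=lambda r: r["accuracy"], default=None)


exp_dir = Path.home() / "jd_data" / "digits-tune"
found = load_results(exp_dir)
print(f"{len(found)} result(s) found; best: {best(found)}")
```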