Getting Started with JobDistributor

A practical, end-to-end guide to running large hyperparameter sweeps across one machine, multiple machines, or one or more HPC clusters — without changing a single line of your training code. Already know the flow? Try the Quick Start instead.

? Introduction

JobDistributor is an open-source platform for distributing a large number of independent compute tasks across any number of machines. It consists of three components:

ComponentWhat it doesWhere it runs
Hub Manages accounts, experiments, API keys, and secure public access for your server Central cloud server (hosted at jobdistributor.net)
Server Holds the job queue; serves jobs to workers; stores results; provides a web dashboard Any machine with Docker — your laptop, lab server, or VM
Worker (jd_worker_cli) Pulls jobs from the server, runs your script, reports completion Any compute node — laptop, desktop, GPU server, Slurm nodes on one or more HPC clusters, VPS instances, or cloud VMs

JobDistributor is the right tool when:

  • You need to run the same script many times with different parameter combinations.
  • Each run is completely independent — no job needs results from another.
  • You want to spread work across laptops, GPU servers, and cluster nodes simultaneously.
  • You want one central dashboard to track progress, failures, and results in real time.

Typical use cases:

  • Machine learning hyperparameter search
  • Physics or network simulation sweeps
  • Benchmark matrices across algorithm variants
  • Repeated experiments with different random seeds
ℹ️ JobDistributor is not designed for tightly coupled jobs (MPI programs, multi-GPU training with shared gradients, etc.). It is optimized for embarrassingly parallel workloads — where each run is fully independent.

1 Example Problem — ML Hyperparameter Search

Throughout this tutorial we will train a neural network classifier to recognize handwritten digits. The dataset (scikit-learn's built-in optical digits, 1,797 samples, 10 classes) is a classic benchmark for classification algorithms.

The objective: find the combination of four hyperparameters — learning rate, network size, regularization strength, and maximum training iterations — that produces the highest test-set accuracy. There is no mathematical shortcut; we simply need to train and evaluate the model once for every combination.

HyperparameterValuesCount
learning_rate_init 0.0010.010.1 3
hidden_layer_sizes 6412864,64 3
alpha (L2 regularization) 0.00010.0010.01 3
max_iter 100200 2
Total combinations (3 × 3 × 3 × 2) 54

Serial baseline — running all combinations on one machine

Before distributing anything, here is the naïve approach. We need two files: train.py, which trains the model for a single parameter combination, and serial_grid.py, which calls train.py in a loop over all 54 combinations.

train.py — trains and evaluates the model for one combination of hyperparameters:

train.py — Python
#!/usr/bin/env python3
"""Train an MLP on the digits dataset for one hyperparameter combination.

Usage:
    python train.py --learning_rate_init 0.01 --hidden_layer_sizes 128 \
                    --alpha 0.001 --max_iter 200
"""
import argparse
import json
import os

from sklearn.datasets   import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network  import MLPClassifier
from sklearn.preprocessing   import StandardScaler

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate_init", type=float, required=True)
parser.add_argument("--hidden_layer_sizes", type=str,   required=True,
                    help="Single int (e.g. 128) or comma-separated (e.g. 64,64)")
parser.add_argument("--alpha",              type=float, required=True)
parser.add_argument("--max_iter",           type=int,   required=True)
args = parser.parse_args()

# Parse hidden_layer_sizes: "64" → (64,)   "64,64" → (64, 64)
hidden = tuple(int(x) for x in args.hidden_layer_sizes.split(","))

# Load and split data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalise
scaler  = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)

# Train
clf = MLPClassifier(
    hidden_layer_sizes=hidden,
    learning_rate_init=args.learning_rate_init,
    alpha=args.alpha,
    max_iter=args.max_iter,
    random_state=42,
)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)

result = {
    "learning_rate_init": args.learning_rate_init,
    "hidden_layer_sizes": args.hidden_layer_sizes,
    "alpha":              args.alpha,
    "max_iter":           args.max_iter,
    "accuracy":           round(accuracy, 4),
}
print(json.dumps(result))

You can run it directly to verify it works before wiring it into a sweep:

bash
python train.py --learning_rate_init 0.01 --hidden_layer_sizes 128 \
               --alpha 0.001 --max_iter 200
# {"learning_rate_init": 0.01, "hidden_layer_sizes": "128", ..., "accuracy": 0.9722}

serial_grid.py — iterates over every combination and calls train.py once per run:

serial_grid.py — Python
#!/usr/bin/env python3
"""Naïve serial sweep — runs all 54 combinations one by one."""
import itertools, subprocess, sys, time

GRID = {
    "learning_rate_init": [0.001, 0.01, 0.1],
    "hidden_layer_sizes": ["64", "128", "64,64"],
    "alpha":              [0.0001, 0.001, 0.01],
    "max_iter":           [100, 200],
}

keys   = list(GRID.keys())
combos = list(itertools.product(*GRID.values()))
print(f"Total combinations: {len(combos)}")

for i, combo in enumerate(combos, 1):
    params = dict(zip(keys, combo))
    print(f"\n[{i:02d}/{len(combos)}]  {params}")
    t0   = time.time()
    proc = subprocess.run(
        [sys.executable, "train.py"] +
        [x for k, v in params.items() for x in (f"--{k}", str(v))],
        capture_output=True, text=True,
    )
    print(f"  {'OK' if proc.returncode == 0 else 'FAILED'}  ({time.time()-t0:.1f}s)")
    if proc.stdout:
        print(" ", proc.stdout.strip())

print("\nAll done.")

Why this problem is embarrassingly parallel

Each of the 54 combinations is completely independent: no run needs results from another, and there is no shared state. This makes the problem embarrassingly parallel — the ideal case for a distributed job queue.

  • With 1 worker: 54 jobs × 5 min = 4.5 hours.
  • With 10 workers: ≈ 27 minutes.
  • With 54 workers: ≈ 5 minutes — all jobs in parallel.

You might be thinking: "I can already use Python's multiprocessing or concurrent.futures to run jobs in parallel." That is true — and it is a perfectly reasonable first step. However, multiprocessing is limited to the cores of a single machine. Once that machine's CPU is fully occupied, you cannot go faster without provisioning more hardware.

JobDistributor removes that ceiling. Instead of being confined to one machine, you can point any machine you have access to at the same experiment and let it pick up jobs from the shared queue — your laptop, a colleague's desktop, one or more HPC clusters, multiple VPS instances, cloud VMs, or any combination of these simultaneously. Each machine runs jd_worker_cli, and the server distributes work across all of them automatically. There is no configuration change required to add or remove a machine mid-sweep.

ApproachMachinesSetup neededScales beyond one machine?
Serial loop 1 None No
Python multiprocessing 1 Minimal No — limited to local CPU cores
JobDistributor Any number — laptops, desktops, multiple HPC clusters, multiple VPS, cloud VMs Install jd-worker, set API key Yes
💡 Notice that train.py above has no dependency on the jd library — it is pure scikit-learn. You will add a few jd calls in Step 3, but the core training logic stays exactly the same. The same script runs whether it is called by the serial loop or by jd_worker_cli.

Step 1 — Hub Setup

The Hub is the central control plane. It registers your experiments, manages your API keys, and coordinates secure access so workers on any machine can reach your job server. You only need to do this once per experiment.

1a. Create an account

  1. Open JobDistributor Hub → Create account.
  2. Enter your email address and a password of at least 10 characters.
  3. Check your inbox for the 6-digit verification code and enter it on the verification page.

1b. Create an experiment

  1. Sign in and click Dashboard → + New Experiment.
  2. Choose a short, descriptive name such as digits-tune.
    Rules: lowercase letters, digits, and hyphens only; must start with a letter; maximum 48 characters.
  3. Click Create experiment.

After creation, the Hub provisions a public URL for your server so that workers anywhere on the internet can reach it. You will see this URL on the experiment detail page once your server container is running — there is nothing you need to configure manually.

1c. Generate an API key

  1. Click API Keys in the top navigation bar.
  2. Click + New key and give it a descriptive name, e.g. lab-server.
  3. Click Create key. The full key is shown once in a green banner immediately after creation. Copy it now — it cannot be retrieved again without entering your password.
⚠️ Treat your API key like a password. Never commit it to version control. Use environment variables or a .env file excluded via .gitignore.

Export the key in your shell:

bash
export JD_API_KEY=jd_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
💡 Add this line to ~/.bashrc or ~/.zshrc so it is set automatically in every new terminal session.

Step 2 — Start the Job Server with Docker

The job server runs as a Docker container on any machine you control — your laptop, a lab server, or a cloud VM. Once started, it registers itself with the Hub automatically and becomes publicly accessible. No manual network configuration or port forwarding is required.

2a. Install Docker

Docker is the only prerequisite for running the server. Follow the official instructions for your operating system:

Linux (Ubuntu / Debian)

bash
# Install Docker Engine (official script — runs on most Debian/Ubuntu releases)
curl -fsSL https://get.docker.com | sudo sh

# Allow your user to run Docker without sudo
sudo usermod -aG docker $USER
newgrp docker          # apply group change without logging out

# Verify
docker --version

macOS

Download and install Docker Desktop for Mac. Once installed, Docker Desktop runs in the background and provides the docker command in your terminal.

Windows

Download and install Docker Desktop for Windows. Enable WSL 2 integration during setup for the best performance (recommended). After installation, use Docker from a PowerShell or WSL terminal.

Verify the installation on any platform:

bash
docker --version
docker run hello-world

2b. Download run.sh and start the server

All you need is a single shell script. Download it once, then run it with your API key and experiment name:

bash
# Download run.sh (one-time)
curl -fsSL https://raw.githubusercontent.com/NWSL-UCF/job-distributor/main/server/run.sh \
     -o run.sh && chmod +x run.sh
bash
JD_API_KEY=jd_xxxxxxxxxxxxxxxx ./run.sh digits-tune

The script automatically:

  1. Pulls the latest jobdistributor/jd-server Docker image.
  2. Contacts the Hub to obtain credentials and configuration for your experiment.
  3. Starts the job server and registers it with the Hub.
  4. Sends a heartbeat to the Hub every few minutes so the dashboard shows the server as online.

Other useful commands:

bash
./run.sh digits-tune logs      # tail live container logs
./run.sh digits-tune status    # check if container is running
./run.sh digits-tune stop      # stop container (data is preserved on disk)

All experiment data is stored on the host at ~/jd_server/digits-tune/ and persists across restarts.

2c. Dashboard PIN

The server dashboard is protected by a 6-digit PIN. There are three ways to manage it:

  • View the default PIN: Go to the Hub Dashboard → click on your experiment → you will see the current PIN listed in the experiment details. You can update it directly from this page at any time without needing to know the old PIN.
  • Update from the Hub: On the experiment detail page, click Reset PIN and enter a new PIN. This takes effect immediately.
  • Update from the server dashboard: Open the server dashboard → go to Settings → PIN section. This requires you to enter the current PIN first. Use this option when you are already logged into the server dashboard.

Step 3 — Write Your Training Script

A job is a single execution of your script with one specific combination of parameters — one row in your experiment grid. Making your script atomic means each run performs a complete, self-contained unit of work: it reads its parameters, trains or computes something, saves the result, and exits. No run depends on any other.

JobDistributor passes each job's parameters to your script as standard command-line arguments (--key value). Your script needs to:

  1. Accept parameters via argparse.
  2. Save outputs to the per-job directory provided by jd_job_dir().
  3. Optionally call jd_upload() to push results to the central server.

3a. Complete atomic entry script (train.py)

train.py — Python
#!/usr/bin/env python3
"""
Atomic hyperparameter sweep entry script.

JobDistributor calls this script once per job, passing one complete
parameter combination as CLI arguments:
    --learning_rate_init 0.01 --hidden_layer_sizes 128 ...
"""
import argparse, json

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

from jd import jd_job_dir, jd_upload


# ── Parse hyperparameters ──────────────────────────────────────────────────
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate_init", type=float, default=0.001)
parser.add_argument("--hidden_layer_sizes", type=str,   default="128")
parser.add_argument("--alpha",              type=float, default=1e-4)
parser.add_argument("--max_iter",           type=int,   default=200)
args = parser.parse_args()

# Convert "64,64" → (64, 64);  "128" → (128,)
hidden = tuple(int(x) for x in args.hidden_layer_sizes.split(","))


# ── Train ──────────────────────────────────────────────────────────────────
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = MLPClassifier(
    hidden_layer_sizes=hidden,
    learning_rate_init=args.learning_rate_init,
    alpha=args.alpha,
    max_iter=args.max_iter,
    random_state=42,
)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Accuracy: {accuracy:.4f}")


# ── Save result to this job's local directory ──────────────────────────────
job_dir = jd_job_dir()       # ~/jd_data/digits-tune//
job_dir.mkdir(parents=True, exist_ok=True)

result = {
    "learning_rate_init": args.learning_rate_init,
    "hidden_layer_sizes": args.hidden_layer_sizes,
    "alpha":              args.alpha,
    "max_iter":           args.max_iter,
    "accuracy":           round(accuracy, 4),
}
out = job_dir / "result.json"
out.write_text(json.dumps(result, indent=2))
print(f"Saved: {out}")


# ── Upload result to the central server (optional) ─────────────────────────
# The file appears in the server dashboard under this job's entry.
jd_upload(str(out))

3b. Path helpers and environment variables

Helper / VariableReturns
jd_job_dir()This job's local folder: ~/jd_data/<exp>/<job_id>/
jd_exp_dir()All jobs for this experiment: ~/jd_data/<exp>/
jd_worker_workspace()The jd_data root: ~/jd_data/
os.environ["JD_JOB_ID"]Integer job ID (as a string)
os.environ["JD_EXP_ID"]Experiment name, e.g. digits-tune

3c. Install the jd library

bash
pip install jd-worker scikit-learn

Step 4 — Create Jobs on the Server Dashboard

Open the server dashboard from the Hub — click your experiment in the Hub Dashboard and then click Open Server Dashboard. Enter your PIN when prompted.

In the dashboard, go to the Add Jobs section and select the JSON tab. Paste the parameter grid below. The server generates all 54 combinations automatically.

JSON — parameter grid
{
  "learning_rate_init": [0.001, 0.01, 0.1],
  "hidden_layer_sizes": ["64", "128", "64,64"],
  "alpha":              [0.0001, 0.001, 0.01],
  "max_iter":           [100, 200]
}

After submission, the dashboard will show 54 jobs with status PENDING. Jobs move to RUNNING as workers pick them up, and to DONE or ABORTED when complete.

ℹ️ The Replace option clears all existing jobs before creating new ones. Leave it unchecked to append jobs to an existing queue.

Step 5 — Run Workers with jd_worker_cli

jd_worker_cli is a command-line interface (CLI) — a program you run in your terminal. It continuously polls the server for pending jobs, launches your script with each job's parameters, and reports the result back to the server. You can run as many jd_worker_cli processes in parallel as your machine has cores or GPUs available.

5a. Create a Python virtual environment

A virtual environment is an isolated Python installation for your project. It keeps your dependencies separate from other projects and from the system Python. Always create one before installing jd-worker.

Linux / macOS

bash
# Create the environment (run once)
python3 -m venv .venv

# Activate it (run every new terminal session)
source .venv/bin/activate

# Confirm the active Python
which python

Windows (PowerShell)

bash
# Create the environment (run once)
python -m venv .venv

# Activate it
.\.venv\Scripts\Activate.ps1

5b. Install jd-worker

bash
pip install jd-worker scikit-learn

5c. Launch workers

By default, jd_worker_cli detaches to the background and returns immediately — you do not need tmux or screen. The worker keeps polling the job queue even if you close the SSH session or terminal. Process metadata is stored in ~/.cache/<expId>/workers.db (or under JD_CACHE_PATH if you set it).

bash
source .venv/bin/activate
export JD_API_KEY=jd_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

jd_worker_cli expId=digits-tune entry_script=train.py machine_type=cpu

Manage background workers from any terminal (same API key and venv):

bash
jd_worker_cli expId=digits-tune worker-list
jd_worker_cli expId=digits-tune worker-logs 0
jd_worker_cli expId=digits-tune stop all
💡 Use foreground=true only when you want the worker attached to the current terminal (local debugging). On Slurm, see Step 6 — foreground=true is required there.

5d. Spawn multiple workers at once

If your machine has multiple CPU cores or GPUs, you can run one worker per core to fully utilize the hardware. Use the num_workers=N argument to launch N parallel workers with a single command — no shell loops or multiple terminals needed:

bash
# Spawn 4 parallel workers (process IDs 0–3 are assigned automatically)
jd_worker_cli expId=digits-tune entry_script=train.py \
              num_workers=4 machine_type=cpu

Each worker gets an auto-assigned process_id (0, 1, 2, …) so they appear as separate entries in the server dashboard. The command blocks until all workers finish or the queue is empty. Press Ctrl+C to stop all workers at once.

💡 A good rule of thumb: set num_workers to the number of available CPU cores (or GPUs). On Linux you can check with nproc; on macOS with sysctl -n hw.physicalcpu.

All available jd_worker_cli options:

ArgumentEnv variableDefaultDescription
expId=<name>JD_EXP_IDrequiredExperiment name
entry_script=<path>JD_ENTRY_SCRIPTrequiredPython script to run per job
api_key=<key>JD_API_KEYAPI key — enables Hub mode (auto-discovers server URL)
machine_type=<label>JD_MACHINE_TYPEworkerLabel shown in dashboard
process_id=<N>0Unique ID for multiple workers on one machine
num_workers=<N>JD_NUM_WORKERS1Spawn N parallel workers in one command (process IDs assigned automatically)
once=trueJD_ONCEfalseRun one job then exit (useful for debugging)
foreground=trueJD_FOREGROUNDfalseRun in the foreground — required on Slurm; optional for local debugging
hub=<url>JD_HUB_URLhub.jobdistributor.netHub base URL
JD_WORKSPACE_PATH~Parent directory for jd_data/<exp>/<job_id>/ (job files, default logs)
JD_CACHE_PATHsame as workspaceParent directory for .cache/<exp>/workers.db (registry). On HPC, e.g. /tmp/.jd_cache — see Step 6

Worker logs are saved to:

path
~/jd_data/digits-tune/jd_worker_logs/jd_worker_<process_id>.log

Step 6 — Scale on HPC with Slurm Array Jobs

If you have access to one or more HPC clusters managed by Slurm, you can launch many workers as a Slurm array job. Each array task runs one jd_worker_cli process that continuously pulls jobs from the queue until the queue is empty or the time limit is reached. No manual partitioning is required — workers across all machines share the same queue.

Save the following as submit_jd.slurm next to train.py, then adjust the #SBATCH directives (partition, array size, time limit, mail) for your cluster. Use foreground=true so the worker stays attached to the batch job — without it, the launcher exits and Slurm kills the workers when the script ends.

Two storage paths on HPC. JD_WORKSPACE_PATH is where job sandboxes live (…/jd_data/<expId>/<job_id>/) — usually a shared directory (home or Lustre) so checkpoints and uploads persist. JD_CACHE_PATH is where the worker keeps its SQLite registry (…/.cache/<expId>/workers.db) — use node-local scratch (e.g. /tmp/.jd_cache) so registry I/O stays off Lustre/NFS (which can cause disk I/O error in worker logs). One .jd_cache per allocated node is enough; experiments still get separate .cache/<expId>/ subdirs. On a laptop you typically omit JD_CACHE_PATH; both paths default to your home directory.

submit_jd.slurm — Slurm batch script
#!/bin/bash
#SBATCH --job-name=jd
#SBATCH --array=0-99              # one worker per array task; adjust range
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=48:00:00
#SBATCH --partition=normal        # replace with your cluster's partition
#SBATCH --output=logs/slurm-%A_%a.out
#SBATCH --error=logs/slurm-%A_%a.err
#SBATCH --mail-user=you@example.com
#SBATCH --mail-type=BEGIN,END,FAIL

set -euxo pipefail

mkdir -p logs

# Activate your virtual environment (create with: python -m venv venv)
source ./venv/bin/activate

export JD_API_KEY=jd_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Shared — job dirs, checkpoints, worker logs (adjust for Lustre/home)
export JD_WORKSPACE_PATH="${SLURM_SUBMIT_DIR}/jd_data"
# Node-local — registry only (one .jd_cache per compute node)
export JD_CACHE_PATH="${TMPDIR:-/tmp}/.jd_cache"

mkdir -p "${JD_WORKSPACE_PATH}" "${JD_CACHE_PATH}"

python --version
which jd_worker_cli

jd_worker_cli expId=digits-tune entry_script=train.py \
              machine_type=stk foreground=true
bash
sbatch submit_jd.slurm

# Monitor
squeue -u $USER
⚠️ On Slurm, always pass foreground=true. The default background mode detaches workers and returns immediately; the batch script then exits and Slurm terminates those worker processes.
💡 Each Slurm task is a long-running worker that consumes multiple jobs over its lifetime — not one task per parameter combination. This minimizes Slurm scheduling overhead for large sweeps.
ℹ️ You can run Slurm workers and local workers simultaneously. All workers pull from the same queue, so jobs are distributed dynamically across all available resources.
💡 After a worker starts, its log line Registry DB: shows the resolved workers.db path (under JD_CACHE_PATH). Job data paths use JD_WORKSPACE_PATH. Run jd_worker_cli expId=digits-tune where on the login node with the same exports if you use management commands there.

End-to-End Checklist

  • Hub account created and email verified
  • Experiment created on the Hub Dashboard (digits-tune)
  • API key generated → copied and exported as JD_API_KEY
  • Docker installed (docs.docker.com/get-docker/)
  • Server started: JD_API_KEY=jd_xxx ./run.sh digits-tune
  • Dashboard PIN confirmed from Hub experiment detail page
  • train.py accepts argparse arguments and calls jd_job_dir()
  • Virtual environment created and pip install jd-worker scikit-learn run
  • Parameter grid submitted from server dashboard → 54 PENDING jobs
  • Workers launched: jd_worker_cli expId=digits-tune entry_script=train.py
  • (Optional) Slurm array submitted: sbatch submit_jd.slurm
  • Results visible in server dashboard and / or ~/jd_data/digits-tune/
⚠️ Is your job long enough? JobDistributor is designed for jobs that take at least a minute to run — and works best for jobs that take several minutes to several hours. If a single job completes in just a few seconds, the overhead of requesting a job from the server, launching a subprocess, and reporting the result back will be comparable to or greater than the actual work. In that case, reconsider how you define an atomic job: combine multiple lightweight operations into one job, or batch more work into each run so that the compute time clearly dominates the coordination overhead.