Introduction

job-orchestrator is an asynchronous job orchestration system for managing and distributing computational workloads across heterogeneous computing resources with intelligent quota-based load balancing.

What is job-orchestrator?

job-orchestrator is a central component of WeNMR, a worldwide e-Infrastructure for structural biology operated by the BonvinLab at Utrecht University. It serves as a reactive middleware layer that connects web applications to diverse computing resources, enabling efficient job distribution for scientific computing workflows.

Key Features

  • Asynchronous Job Management: Built with Rust and Tokio for high-performance async operations
  • Quota-Based Load Balancing: Per-user, per-service quotas prevent resource exhaustion
  • Dual-Mode Architecture: Runs as server (job orchestration) or client (job execution)
  • Multiple Backend Support: Extensible to integrate with various computing resources:
    • Native client mode for local job execution
    • DIRAC Interware (planned)
    • SLURM clusters (planned)
    • Educational cloud services (planned)
  • RESTful API: Simple HTTP interface for job submission and retrieval
  • Automatic Cleanup: Configurable retention policies for completed jobs

Use Cases

job-orchestrator is designed for scenarios requiring:

  • Scientific Computing Workflows: Distribute computational biology/chemistry jobs across clusters
  • Multi-Tenant Systems: Fair resource allocation with per-user quotas
  • Heterogeneous Computing: Route jobs to appropriate backends (local, HPC, cloud)
  • Web-Based Science Platforms: Decouple frontend from compute infrastructure
  • Batch Processing: Handle high-throughput job submissions with automatic queuing

Project Status

Current State: Production-ready with server/client architecture

Planned Features:

  • Auto-Scaling: Dynamic creation and termination of cloud-based client instances based on workload
  • DIRAC Interware integration
  • SLURM direct integration
  • Enhanced monitoring and metrics
  • Job priority queues
  • Advanced scheduling policies

Getting Help

License

MIT License - see LICENSE for details.

Installation

There are several ways to install job-orchestrator depending on your needs.

From crates.io

The easiest way to install job-orchestrator is via Cargo:

cargo install job-orchestrator

From Source

Clone the repository and build:

git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
cargo build --release

The binary will be available at target/release/job-orchestrator.

Using Docker

Pull the pre-built image:

docker pull ghcr.io/rvhonorato/job-orchestrator:latest

Or build locally:

docker build -t job-orchestrator .

Prerequisites

For Building from Source

  • Rust: 1.75 or later (edition 2021)
  • SQLite: Development libraries

On Debian/Ubuntu:

apt-get install libsqlite3-dev

On macOS:

brew install sqlite

For Running

  • SQLite: Runtime library (usually preinstalled on most systems)
  • Filesystem access: Write permissions for database and job storage directories

Verifying Installation

After installation, verify it works:

job-orchestrator --version

You should see the version number displayed.

Next Steps

Quick Start

The fastest way to get job-orchestrator running is with Docker Compose.

Running with Docker Compose

git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
docker compose up --build

This starts:

  • Orchestrator server on port 5000
  • Example client on port 9000

Verify It’s Running

Check the server is responding:

curl http://localhost:5000/health

You should receive a health status response.

Access the API Documentation

Open your browser and navigate to:

http://localhost:5000/swagger-ui/

This provides interactive API documentation where you can explore and test all endpoints.

What’s Next?

Now that you have job-orchestrator running, proceed to Your First Job to learn how to submit and retrieve jobs.

Stopping the Services

To stop the services:

docker compose down

To stop and remove all data (volumes):

docker compose down -v

Your First Job

This guide walks you through submitting and retrieving your first job.

Prerequisites

Make sure you have job-orchestrator running. See Quick Start if you haven’t set it up yet.

Understanding Jobs

A job in job-orchestrator consists of:

  1. Files: One or more files to be processed
  2. A run.sh script: The entry point that gets executed
  3. User ID: Identifies who submitted the job (for quota tracking)
  4. Service: Which service/backend should process this job

Creating a Simple Job

Create a simple run.sh script:

cat > run.sh << 'EOF'
#!/bin/bash
echo "Hello from job-orchestrator!" > output.txt
echo "Processing complete at $(date)" >> output.txt
EOF
chmod +x run.sh

Submitting the Job

Submit the job using curl:

curl -X POST http://localhost:5000/upload \
  -F "file=@run.sh" \
  -F "user_id=1" \
  -F "service=example" | jq

You’ll receive a response like:

{
  "id": 1,
  "status": "Queued",
  "message": "Job successfully uploaded"
}

Note the id field - you’ll need this to check status and download results.

Checking Job Status

Check the job status via GET request:

curl http://localhost:5000/download/1

If the job is not yet completed, you’ll get a JSON response:

{
  "id": 1,
  "status": "Submitted",
  "message": ""
}

The status field will be one of: Queued, Processing, Submitted, Running, Completed, Failed, Invalid, Cleaned, or Unknown.
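Because the status endpoint returns JSON until the job completes, polling is easy to script. The helper below is a minimal sketch that extracts the status field with sed (assuming the response shape shown above) so jq is not required:

```shell
#!/bin/bash
# Extract the "status" field from a /download/:id JSON response.
# Assumes the flat response shape shown above; uses sed so jq is not required.
extract_status() {
  printf '%s' "$1" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example response while the job is still running:
response='{"id": 1, "status": "Submitted", "message": ""}'
extract_status "$response"   # prints: Submitted

# A simple poll loop (uncomment to run against a live server):
# while [ "$(extract_status "$(curl -s http://localhost:5000/download/1)")" != "Completed" ]; do
#   sleep 2
# done
```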

Downloading Results

Once the status is Completed, the same endpoint returns the ZIP file:

curl -o results.zip http://localhost:5000/download/1

Extract and view:

unzip results.zip
cat output.txt

You should see:

Hello from job-orchestrator!
Processing complete at <timestamp>

A More Complex Example

Here’s a job that processes an input file:

# Create an input file
echo "sample data" > input.txt

# Create a processing script
cat > run.sh << 'EOF'
#!/bin/bash
# Count lines and words in input
wc input.txt > stats.txt
# Transform the data
tr 'a-z' 'A-Z' < input.txt > output.txt
echo "Done!" >> output.txt
EOF
chmod +x run.sh

# Submit with multiple files
curl -X POST http://localhost:5000/upload \
  -F "file=@run.sh" \
  -F "file=@input.txt" \
  -F "user_id=1" \
  -F "service=example"

Important Notes

The run.sh Script

  • Must be named exactly run.sh
  • Must be executable (or start with #!/bin/bash)
  • Exit code 0 indicates success
  • Non-zero exit code indicates failure
  • All output files in the working directory are included in results
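Since the exit code alone decides success or failure, it helps to make run.sh fail loudly. The following sketch (illustrative, not a required pattern) writes an error file before exiting non-zero, so the failure reason ships back with the results:

```shell
# Create an input file and a defensive run.sh
echo "sample data" > input.txt

cat > run.sh << 'EOF'
#!/bin/bash
set -euo pipefail                        # abort on the first failing command
if [ ! -f input.txt ]; then
  echo "missing input.txt" > error.log   # error.log is returned with the results
  exit 1
fi
wc -l input.txt > stats.txt
echo "done" > output.txt
EOF
chmod +x run.sh

# Dry-run locally before submitting; the exit code mirrors what the client records
./run.sh && echo "exit code 0: job would be marked Completed"
```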

File Size Limits

The default maximum upload size is 400MB. This can be configured on the server.

Job Retention

Completed jobs are automatically cleaned up after the configured retention period (default: 48 hours). Make sure to download your results before they expire.

Next Steps

Architecture Overview

job-orchestrator uses a distributed architecture with a central server coordinating job execution across multiple client nodes.

High-Level Architecture

flowchart TB
 subgraph Tasks["Background Tasks"]
        Sender["Sender<br>500ms"]
        Getter["Getter<br>500ms"]
        Cleaner["Cleaner<br>60s"]
  end
 subgraph Server["Orchestrator Server"]
        API["REST API<br>upload/download"]
        DB[("SQLite<br>Persistent")]
        FS[/"Filesystem<br>Job Storage"/]
        Tasks
        Queue["Queue Manager<br>Quota Enforcement"]
  end
 subgraph Client["Client Service"]
        ClientAPI["REST API<br>submit/retrieve/load"]
        ClientDB[("SQLite<br>In-Memory")]
        ClientFS[/"Working Dir"/]
        Runner["Runner Task<br>500ms"]
        Executor["Bash Executor<br>run.sh"]
  end
    User(["User/Web App"]) -- POST /upload --> API
    User -- GET /download/:id --> API
    API --> DB & FS
    DB --> Queue
    Queue --> Sender
    Sender -- POST /submit --> ClientAPI
    Getter -- GET /retrieve/:id --> ClientAPI
    Getter --> FS
    Cleaner --> DB & FS
    ClientAPI --> ClientDB
    ClientDB --> Runner
    Runner --> Executor
    Executor --> ClientFS

Components

Orchestrator Server

The central server manages:

  • REST API: Handles job uploads and result downloads from users
  • SQLite Database: Persistent storage for job metadata and status
  • Filesystem Storage: Stores uploaded files and downloaded results
  • Queue Manager: Enforces per-user quotas and manages job distribution
  • Background Tasks: Automated processes for job distribution, result retrieval, and cleanup

Client Service

Each client node handles:

  • REST API: Receives jobs from server, returns results
  • In-Memory Database: Lightweight tracking of current payloads
  • Working Directory: Temporary storage for job execution
  • Runner Task: Monitors for new payloads and executes them
  • Bash Executor: Runs the run.sh script for each job

Background Tasks

Server Tasks

| Task | Interval | Purpose |
|------|----------|---------|
| Sender | 500ms | Picks up queued jobs, enforces quotas, dispatches to clients |
| Getter | 500ms | Retrieves completed results from clients |
| Cleaner | 60s | Removes expired jobs from disk and database |

Client Tasks

| Task | Interval | Purpose |
|------|----------|---------|
| Runner | 500ms | Executes prepared payloads, captures results |

Data Flow

  1. User submits files via POST /upload
  2. Server stores files and creates job record (status: Queued)
  3. Sender task picks up job, checks quotas, sends to available client
  4. Client receives job, stores as payload (status: Prepared)
  5. Runner task executes run.sh, updates status to Completed
  6. Getter task retrieves results, stores locally
  7. User downloads results via GET /download/:id
  8. Cleaner task removes job after retention period

Auto-Scaling Architecture (Planned)

The orchestrator will support automatic scaling of client instances based on workload:

---
config:
  layout: dagre
---
flowchart TB
 subgraph Server["Orchestrator Server"]
        API["REST API"]
        Queue["Queue Manager"]
        AutoScaler["Auto-Scaler"]
        ServicePool["Service Pool"]
  end
 subgraph Cloud["Cloud Provider"]
        CloudAPI["Cloud API"]
  end
 subgraph Clients["Client Instances"]
        Dynamic["Dynamic Clients<br>Auto-created"]
        Static["Static Client<br>"]
  end
    User(["User/Web App"]) -- Submits/Retrieves --> API
    API --> Queue
    Queue -- Distribute jobs --> Clients
    ServicePool <-- Monitors --> Queue
    AutoScaler <-- Register/Trigger --> ServicePool
    AutoScaler -- Scale Up/Down --> CloudAPI
    CloudAPI -- Create/Terminate --> Clients

This feature will enable:

  • Dynamic creation of cloud-based client instances during high demand
  • Automatic termination of idle instances to reduce costs
  • Load-aware job distribution across available clients

Job Lifecycle

Understanding the job lifecycle is essential for working with job-orchestrator effectively.

Lifecycle Sequence

sequenceDiagram
    participant User
    participant Server
    participant Client
    participant Executor

    User->>Server: POST /upload (files, user_id, service)
    Server->>Server: Store job (status: Queued)
    Server-->>User: Job ID

    Note over Server: Sender task (500ms interval)
    Server->>Server: Update status: Processing
    Server->>Client: POST /submit (job files)
    Client->>Client: Store payload (status: Prepared)
    Client-->>Server: Payload ID
    Server->>Server: Update status: Submitted

    Note over Client: Runner task (500ms interval)
    Client->>Executor: Execute run.sh
    Executor->>Executor: Process files
    Executor-->>Client: Exit code
    Client->>Client: Update status: Completed

    Note over Server: Getter task (500ms interval)
    Server->>Client: GET /retrieve/:id
    Client-->>Server: ZIP results
    Server->>Server: Store results, status: Completed

    User->>Server: GET /download/:id
    Server-->>User: results.zip

    Note over Server: Cleaner task (60s interval)
    Server->>Server: Remove jobs older than MAX_AGE

Job States

stateDiagram-v2
    [*] --> Queued: Job submitted

    Queued --> Processing: Sender picks up job
    Processing --> Submitted: Sent to client
    Processing --> Failed: Client unreachable

    Submitted --> Completed: Execution successful
    Submitted --> Unknown: Retrieval failed or execution failed

    Unknown --> Completed: Retry successful

    Completed --> Cleaned: After MAX_AGE
    Failed --> Cleaned: After MAX_AGE
    Unknown --> Cleaned: After MAX_AGE (if applicable)

    Cleaned --> [*]

State Descriptions

| State | Description |
|-------|-------------|
| Queued | Job received and waiting for dispatch |
| Processing | Server is sending the job to a client |
| Submitted | Job successfully sent to client, awaiting execution |
| Completed | Job finished successfully, results available |
| Failed | Job failed permanently (client unreachable, execution error) |
| Unknown | Temporary state when retrieval fails; will retry |
| Cleaned | Job data removed after retention period |

Lifecycle Stages

1. Submission

User uploads files via POST /upload with:

  • One or more files (including run.sh)
  • user_id - identifies the submitting user
  • service - which backend should process this job

The server:

  1. Validates the request
  2. Creates a unique directory for the job
  3. Stores all uploaded files
  4. Creates a database record with status Queued
  5. Returns the job ID to the user

2. Queuing & Quota Check

The Sender background task (runs every 500ms):

  1. Finds jobs in Queued status
  2. Checks if user has available quota for the service
  3. If quota available, marks job as Processing
  4. If quota exceeded, job remains Queued

3. Distribution

For jobs in Processing status:

  1. Server packages job files
  2. Sends to configured client via POST /submit
  3. On success: updates status to Submitted, stores client’s payload ID
  4. On failure: updates status to Failed

4. Execution

On the client side:

  1. Runner task finds payloads in Prepared status
  2. Executes run.sh in the job directory
  3. Captures exit code and any output files
  4. Updates payload status to Completed (or Failed on error)

5. Retrieval

The Getter background task (runs every 500ms):

  1. Finds jobs in Submitted status
  2. Requests results from client via GET /retrieve/:id
  3. Downloads and stores the result ZIP
  4. Updates job status to Completed

6. Download

User can now:

  1. Check status via GET /download/:id (returns a JSON status object while the job is not completed)
  2. Download results via the same endpoint (returns a ZIP archive once the job is completed)
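Because the same endpoint serves both JSON and a ZIP, a download script can tell them apart by inspecting the body: ZIP archives always begin with the bytes "PK". A minimal sketch:

```shell
# The same endpoint returns JSON (status) or a ZIP (results); ZIP archives
# always start with the two bytes "PK", which is enough to tell them apart.
is_zip() {
  [ "$(head -c 2 "$1")" = "PK" ]
}

# Offline demonstration with two stand-in files:
printf 'PK\003\004rest-of-archive' > results.bin
printf '{"id":1,"status":"Running"}' > status.bin

is_zip results.bin && echo "results.bin: ZIP archive"
is_zip status.bin  || echo "status.bin: JSON status"

# Against a live server (not run here):
# curl -s -o body http://localhost:5000/download/1
# if is_zip body; then mv body results.zip; else cat body; fi
```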

7. Cleanup

The Cleaner background task (runs every 60s):

  1. Finds jobs older than MAX_AGE
  2. Deletes job files from filesystem
  3. Updates status to Cleaned or removes record

Error Handling

Client Unreachable

If the server cannot reach a client during distribution:

  • Job status changes to Failed
  • Job will not be retried automatically
  • User can resubmit if needed

Execution Failure

If run.sh exits with non-zero code:

  • Payload status changes to Failed
  • Server retrieves whatever output exists
  • Job status reflects the failure

Retrieval Failure

If the server cannot retrieve results:

  • Job status changes to Unknown
  • Server will retry on subsequent Getter cycles
  • Eventually succeeds or times out

Timing Considerations

| Event | Typical Duration |
|-------|------------------|
| Upload to Queued | Immediate |
| Queued to Processing | Up to 500ms (+ quota wait) |
| Processing to Submitted | Depends on file size and network |
| Submitted to Completed | Depends on job execution time |
| Completed to Cleaned | Configured via MAX_AGE (default: 48 hours) |

Server & Client Modes

job-orchestrator provides both server and client functionality in a single binary, configured via command-line arguments.

Dual-Mode Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Same Binary                               │
│                                                                  │
│   ┌─────────────────────┐       ┌─────────────────────┐         │
│   │    Server Mode      │       │    Client Mode      │         │
│   │                     │       │                     │         │
│   │  - Job orchestration│       │  - Job execution    │         │
│   │  - Quota management │       │  - Result packaging │         │
│   │  - Persistent DB    │       │  - In-memory DB     │         │
│   │  - User-facing API  │       │  - Server-facing API│         │
│   └─────────────────────┘       └─────────────────────┘         │
│                                                                  │
│              job-orchestrator server    job-orchestrator client  │
└─────────────────────────────────────────────────────────────────┘

Server Mode

The server is the central orchestrator that:

  • Receives job submissions from users/applications
  • Manages job queues and enforces quotas
  • Distributes jobs to available clients
  • Retrieves results and serves them to users
  • Handles cleanup of expired jobs

Starting the Server

job-orchestrator server --port 5000

Or with environment variables:

PORT=5000 job-orchestrator server

Server Responsibilities

| Component | Purpose |
|-----------|---------|
| REST API | Handle /upload and /download requests |
| Queue Manager | Enforce per-user, per-service quotas |
| Sender Task | Dispatch jobs to clients |
| Getter Task | Retrieve completed results |
| Cleaner Task | Remove expired jobs |
| SQLite DB | Persistent job tracking |

Server API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| /upload | POST | Submit new job |
| /download/:id | GET | Get results or status |
| /health | GET | Health check |
| /swagger-ui/ | GET | API documentation |

Client Mode

The client executes jobs on behalf of the server:

  • Receives job payloads from the server
  • Executes the run.sh script
  • Packages results for retrieval
  • Reports system load for scheduling decisions

Starting the Client

job-orchestrator client --port 9000

Or with environment variables:

PORT=9000 job-orchestrator client

Client Responsibilities

| Component | Purpose |
|-----------|---------|
| REST API | Handle /submit and /retrieve requests |
| Runner Task | Execute prepared payloads |
| Bash Executor | Run run.sh scripts |
| In-Memory DB | Lightweight payload tracking |

Client API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| /submit | POST | Receive job from server |
| /retrieve/:id | GET | Return completed results |
| /load | GET | Report CPU usage |
| /health | GET | Health check |

Communication Flow

User                Server                    Client
  │                   │                         │
  │──POST /upload────▶│                         │
  │◀─── Job ID ───────│                         │
  │                   │                         │
  │                   │──POST /submit──────────▶│
  │                   │◀─── Payload ID ─────────│
  │                   │                         │
  │                   │                    ┌────┴────┐
  │                   │                    │ Execute │
  │                   │                    │ run.sh  │
  │                   │                    └────┬────┘
  │                   │                         │
  │                   │──GET /retrieve/:id─────▶│
  │                   │◀─── results.zip ────────│
  │                   │                         │
  │─GET /download/:id▶│                         │
  │◀─── results.zip ──│                         │

Deployment Patterns

Single Machine (Development)

Both server and client on the same machine:

# Terminal 1
job-orchestrator server --port 5000

# Terminal 2
job-orchestrator client --port 9000

Distributed (Production)

Server on one machine, clients on compute nodes:

                    ┌─────────────┐
                    │   Server    │
                    │  (port 5000)│
                    └──────┬──────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
   ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐
   │  Client 1   │  │  Client 2   │  │  Client 3   │
   │ (compute-1) │  │ (compute-2) │  │ (compute-3) │
   └─────────────┘  └─────────────┘  └─────────────┘

Multi-Service Setup

Different clients for different services:

# Server configuration
SERVICE_EXAMPLE_UPLOAD_URL: http://client-example:9000/submit
SERVICE_EXAMPLE_DOWNLOAD_URL: http://client-example:9000/retrieve

SERVICE_HADDOCK_UPLOAD_URL: http://client-haddock:9001/submit
SERVICE_HADDOCK_DOWNLOAD_URL: http://client-haddock:9001/retrieve

Database Differences

Server Database (Persistent)

  • Uses SQLite file on disk
  • Survives restarts
  • Stores complete job history
  • Location configured via DB_PATH

Client Database (In-Memory)

  • SQLite in-memory database
  • Cleared on restart
  • Only tracks active payloads
  • Lightweight and fast

When to Scale

Add More Clients When:

  • Job queue is consistently backing up
  • Execution time is the bottleneck
  • You have available compute resources

Scale Server When:

  • Upload/download becomes slow
  • Many concurrent users
  • Database queries become slow

See Also

Server Configuration

The orchestrator server is configured primarily through environment variables.

Environment Variables

Core Settings

| Variable | Default | Description |
|----------|---------|-------------|
| PORT | 5000 | HTTP port the server listens on |
| DB_PATH | ./db.sqlite | Path to SQLite database file |
| DATA_PATH | ./data | Directory for job file storage |
| MAX_AGE | 172800 | Job retention time in seconds (default: 48 hours) |

Service Configuration

For each service you want to support, configure these variables:

| Variable Pattern | Description |
|------------------|-------------|
| SERVICE_<NAME>_UPLOAD_URL | Client endpoint for submitting jobs |
| SERVICE_<NAME>_DOWNLOAD_URL | Client endpoint for retrieving results |
| SERVICE_<NAME>_RUNS_PER_USER | Maximum concurrent jobs per user (default: 5) |

Note: <NAME> must be uppercase. For a service called “example”, use SERVICE_EXAMPLE_*.

Example Configuration

Minimal Setup

export PORT=5000
export DB_PATH=/var/lib/job-orchestrator/db.sqlite
export DATA_PATH=/var/lib/job-orchestrator/data
export SERVICE_EXAMPLE_UPLOAD_URL=http://localhost:9000/submit
export SERVICE_EXAMPLE_DOWNLOAD_URL=http://localhost:9000/retrieve

Production Setup

# Core settings
export PORT=5000
export DB_PATH=/opt/orchestrator/db.sqlite
export DATA_PATH=/opt/orchestrator/data
export MAX_AGE=172800  # 48 hours

# Example service (general purpose)
export SERVICE_EXAMPLE_UPLOAD_URL=http://compute-1:9000/submit
export SERVICE_EXAMPLE_DOWNLOAD_URL=http://compute-1:9000/retrieve
export SERVICE_EXAMPLE_RUNS_PER_USER=10

# HADDOCK service (specialized)
export SERVICE_HADDOCK_UPLOAD_URL=http://haddock-cluster:9001/submit
export SERVICE_HADDOCK_DOWNLOAD_URL=http://haddock-cluster:9001/retrieve
export SERVICE_HADDOCK_RUNS_PER_USER=3

Docker Compose

services:
  server:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: server
    ports:
      - "5000:5000"
    environment:
      PORT: 5000
      DB_PATH: /opt/data/db.sqlite
      DATA_PATH: /opt/data
      MAX_AGE: 172800
      SERVICE_EXAMPLE_UPLOAD_URL: http://client:9000/submit
      SERVICE_EXAMPLE_DOWNLOAD_URL: http://client:9000/retrieve
      SERVICE_EXAMPLE_RUNS_PER_USER: 5
    volumes:
      - server-data:/opt/data

volumes:
  server-data:

Configuration Details

PORT

The HTTP port for the REST API. Users will connect to this port to submit jobs and download results.

PORT=5000

DB_PATH

Path to the SQLite database file. The directory must exist and be writable.

DB_PATH=/var/lib/job-orchestrator/db.sqlite

The database is created automatically on first run. It stores:

  • Job metadata (ID, user, service, status)
  • Job locations and timestamps
  • Client payload references

DATA_PATH

Directory where job files are stored. Each job gets a unique subdirectory.

DATA_PATH=/var/lib/job-orchestrator/data

Structure:

/var/lib/job-orchestrator/data/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890/
│   ├── run.sh
│   ├── input.pdb
│   └── output.zip  (after completion)
├── b2c3d4e5-f6a7-8901-bcde-f12345678901/
│   └── ...

MAX_AGE

How long to keep completed jobs before cleanup, in seconds.

| Value | Duration |
|-------|----------|
| 3600 | 1 hour |
| 86400 | 24 hours |
| 172800 | 48 hours (default) |
| 604800 | 1 week |

MAX_AGE=172800

Jobs older than this are removed by the Cleaner task.
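Since MAX_AGE is specified in plain seconds, computing it from hours in the shell avoids off-by-a-zero mistakes:

```shell
# MAX_AGE is plain seconds; deriving it from hours avoids off-by-a-zero errors.
HOURS=48
MAX_AGE=$(( HOURS * 3600 ))
echo "$MAX_AGE"   # 172800, the default 48-hour retention
export MAX_AGE
```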

Service URLs

Each service needs upload and download URLs pointing to a client:

SERVICE_MYSERVICE_UPLOAD_URL=http://client-host:9000/submit
SERVICE_MYSERVICE_DOWNLOAD_URL=http://client-host:9000/retrieve

  • UPLOAD_URL: Where to POST job files
  • DOWNLOAD_URL: Where to GET results (:id is appended automatically)

RUNS_PER_USER

Controls how many jobs a single user can have running simultaneously for a service:

SERVICE_EXAMPLE_RUNS_PER_USER=5

  • Jobs exceeding the quota remain in Queued status
  • They’re automatically dispatched when slots become available
  • Set higher for quick jobs, lower for resource-intensive jobs

File Permissions

Ensure the server process has:

  • Read/Write access to DB_PATH parent directory
  • Read/Write access to DATA_PATH directory
  • Network access to all configured client URLs

Validating Configuration

Start the server and check logs:

job-orchestrator server

You should see:

  • Port binding confirmation
  • Database initialization
  • Service configuration loaded

Test with a health check:

curl http://localhost:5000/health

See Also

Client Configuration

The client is configured through environment variables and runs as a job executor.

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| PORT | 9000 | HTTP port the client listens on |

Example Configuration

Basic Setup

export PORT=9000
job-orchestrator client

Docker Compose

services:
  client:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    ports:
      - "9000:9000"
    environment:
      PORT: 9000
    volumes:
      - client-data:/opt/data

volumes:
  client-data:

How the Client Works

In-Memory Database

Unlike the server, the client uses an in-memory SQLite database:

  • Fast: No disk I/O for database operations
  • Ephemeral: Data is lost on restart
  • Lightweight: Minimal resource usage

This is intentional - the client only needs to track active payloads. The server maintains the authoritative job history.

Working Directory

The client stores job files in a working directory. Each payload gets a unique subdirectory:

/opt/data/
├── payload-uuid-1/
│   ├── run.sh
│   ├── input.pdb
│   └── output.txt  (created by run.sh)
├── payload-uuid-2/
│   └── ...

Execution Environment

When the Runner task executes a job:

  1. Changes to the payload directory
  2. Executes ./run.sh
  3. Captures the exit code
  4. All files in the directory are included in results

Resource Reporting

The client exposes a /load endpoint that reports CPU usage:

curl http://localhost:9000/load

Returns a float representing CPU usage percentage. This can be used by the server for load-aware scheduling (planned feature).
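The planned load-aware scheduling boils down to picking the client reporting the lowest load. The sketch below illustrates the idea in shell over "host load" pairs; the hostnames in the commented section are hypothetical:

```shell
# Sketch of load-aware client selection (a planned server-side feature).
# Given "host load" pairs on stdin, pick the host with the lowest CPU load.
pick_least_loaded() {
  sort -k2 -g | head -n 1 | cut -d' ' -f1
}

# Offline demonstration with static readings:
printf '%s\n' "client-1 72.5" "client-2 13.0" "client-3 41.2" | pick_least_loaded
# prints: client-2

# With live clients (hypothetical hostnames, not run here):
# for h in client-1 client-2; do
#   echo "$h $(curl -s "http://$h:9000/load")"
# done | pick_least_loaded
```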

Multiple Clients

You can run multiple clients for:

  • Scaling: Handle more concurrent jobs
  • Isolation: Different services on different machines
  • Redundancy: Failover capability

Same Service, Multiple Clients

Currently, each service points at a single client URL; round-robin across multiple clients is planned:

# On server - points to primary client
SERVICE_EXAMPLE_UPLOAD_URL=http://client-1:9000/submit
SERVICE_EXAMPLE_DOWNLOAD_URL=http://client-1:9000/retrieve

Different Services

Run specialized clients for different workloads:

# Client for general jobs
PORT=9000 job-orchestrator client

# Client for heavy computation (different machine)
PORT=9001 job-orchestrator client

Server configuration:

SERVICE_LIGHT_UPLOAD_URL=http://client-1:9000/submit
SERVICE_HEAVY_UPLOAD_URL=http://client-2:9001/submit

Client Security

Network Access

The client should only be accessible by the orchestrator server:

  • Use internal networks / VPCs
  • Firewall rules to restrict access
  • Never expose client ports to the internet

Execution Sandbox

The client executes arbitrary run.sh scripts. Consider:

  • Running in containers with resource limits
  • Using separate user accounts with minimal permissions
  • Mounting only necessary directories
  • Network isolation if jobs don’t need internet

Docker Resource Limits

services:
  client:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          cpus: '1'
          memory: 1G

Monitoring

Health Check

curl http://localhost:9000/health

Load Check

curl http://localhost:9000/load

Container Logs

docker logs -f client-container

Troubleshooting

Client Not Receiving Jobs

  1. Verify server can reach client URL
  2. Check firewall rules
  3. Verify service configuration on server

Jobs Stuck in Prepared

  1. Check if Runner task is running (look for logs)
  2. Verify run.sh is executable
  3. Check for permission issues in working directory

High Memory Usage

The in-memory database grows with active payloads. If memory is high:

  1. Check for stuck/zombie payloads
  2. Restart the client (safe - server tracks jobs)
  3. Consider more frequent cleanup

See Also

Quota System

The quota system ensures fair resource allocation by limiting concurrent jobs per user per service.

How Quotas Work

User 1 submits 10 jobs for "example" service
Quota: SERVICE_EXAMPLE_RUNS_PER_USER=5

┌─────────────────────────────────────────┐
│ Jobs 1-5:  Dispatched immediately       │
│ Jobs 6-10: Remain queued                │
└─────────────────────────────────────────┘

When Job 1 completes:
┌─────────────────────────────────────────┐
│ Job 6: Now dispatched (slot available)  │
└─────────────────────────────────────────┘

Configuration

Set quotas per service using environment variables:

SERVICE_<NAME>_RUNS_PER_USER=<limit>

Examples

# Allow 5 concurrent jobs per user for "example" service
SERVICE_EXAMPLE_RUNS_PER_USER=5

# Allow 3 concurrent jobs per user for "haddock" service
SERVICE_HADDOCK_RUNS_PER_USER=3

# Allow 10 concurrent jobs per user for "quick" service
SERVICE_QUICK_RUNS_PER_USER=10

Default Value

If not specified, the default quota is 5 concurrent jobs per user per service.

Quota Scope

Quotas are enforced per user, per service:

| User | Service | Quota | Can Submit |
|------|---------|-------|------------|
| user_1 | example | 5 | Up to 5 concurrent |
| user_1 | haddock | 3 | Up to 3 concurrent |
| user_2 | example | 5 | Up to 5 concurrent (independent of user_1) |

Users don’t compete with each other - each user has their own quota allocation.
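The check the Sender applies per (user, service) pair can be sketched as a one-line comparison; this is an illustration of the rule, not the server's actual Rust implementation:

```shell
# Illustrative sketch of the Sender's per-(user, service) quota rule;
# the real check runs in Rust against the SQLite job table.
quota_allows() {
  local running=$1 quota=$2
  [ "$running" -lt "$quota" ]
}

# user_1 has 5/5 "example" jobs running: the 6th stays Queued
quota_allows 5 5 || echo "quota exhausted: job remains Queued"
# user_2 is independent of user_1:
quota_allows 0 5 && echo "slot available: job moves to Processing"
```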

Quota States

Jobs transition through these states relative to quotas:

┌──────────┐     Quota      ┌────────────┐
│  Queued  │ ──Available──▶ │ Processing │
└──────────┘                └────────────┘
     │                            │
     │ Quota Exhausted            │
     ▼                            ▼
┌──────────────────┐      ┌────────────┐
│ Remains Queued   │      │ Submitted  │
│ (waits for slot) │      │ (running)  │
└──────────────────┘      └────────────┘

Choosing Quota Values

Factors to Consider

  1. Job Duration: Longer jobs need lower quotas
  2. Resource Usage: CPU/memory intensive jobs need lower quotas
  3. User Base: More users may need lower per-user quotas
  4. Client Capacity: Match quotas to available compute resources

Guidelines

| Job Type | Suggested Quota |
|----------|-----------------|
| Quick jobs (< 1 min) | 10-20 |
| Medium jobs (1-10 min) | 5-10 |
| Long jobs (10+ min) | 2-5 |
| Resource-intensive | 1-3 |

Example Scenarios

Scientific Computing Platform

# Quick validation jobs - high quota
SERVICE_VALIDATE_RUNS_PER_USER=20

# Standard analysis - medium quota
SERVICE_ANALYZE_RUNS_PER_USER=5

# Heavy simulation - low quota
SERVICE_SIMULATE_RUNS_PER_USER=2

Educational Platform

# Student exercises - moderate quota
SERVICE_EXERCISE_RUNS_PER_USER=3

# Final projects - allow more
SERVICE_PROJECT_RUNS_PER_USER=5

Monitoring Quota Usage

Check Queue Status

Jobs waiting due to quota exhaustion remain in Queued status:

# Check how many jobs are queued vs running
curl http://localhost:5000/swagger-ui/  # Use API explorer

Server Logs

The server logs quota decisions:

INFO: User 1 has 5/5 jobs running for service 'example', job 123 remains queued
INFO: User 1 slot available, dispatching job 123 to service 'example'

Testing Quotas

Submit multiple jobs to observe throttling:

# Submit 10 jobs with quota of 5
for i in {1..10}; do
  echo '#!/bin/bash
sleep 30
echo "Job complete" > output.txt' > run.sh

  curl -s -X POST http://localhost:5000/upload \
    -F "file=@run.sh" \
    -F "user_id=1" \
    -F "service=example" | jq -r '.status'
done

You’ll see:

  • First 5 jobs: Quickly move to Submitted
  • Jobs 6-10: Stay in Queued until slots open

Fair Scheduling

The quota system provides basic fairness:

  • Per-user isolation: One user can’t starve others
  • Per-service isolation: Heavy service usage doesn’t block other services
  • Automatic queuing: No jobs are rejected, just delayed
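The three properties above follow from a FIFO pass over the queue that skips jobs whose owner has no free slot. An illustrative sketch with hypothetical names, not the real dispatcher:

```python
def dispatch_round(queued, running, quotas):
    """One FIFO pass: dispatch every job whose (user, service) has a free slot."""
    dispatched = []
    for job in list(queued):
        key = (job["user_id"], job["service"])
        if running.get(key, 0) < quotas.get(job["service"], 0):
            running[key] = running.get(key, 0) + 1
            queued.remove(job)
            dispatched.append(job)  # moves to Processing
        # otherwise the job stays Queued; later jobs are still considered,
        # so one user's backlog cannot starve other users
    return dispatched
```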

Limitations

Current limitations (improvements planned):

  • No priority queues (FIFO within quota constraints)
  • No global quotas (only per-user)
  • No time-based quotas (e.g., jobs per hour)
  • No burst allowances

See Also

Server API Endpoints

The orchestrator server exposes a REST API for job submission and retrieval.

Base URL

http://localhost:5000

Interactive Documentation

Swagger UI is available at:

http://localhost:5000/swagger-ui/

Endpoints

POST /upload

Submit a new job for processing.

Request

  • Content-Type: multipart/form-data
  • Max size: 400MB

| Field   | Type    | Required | Description                                 |
|---------|---------|----------|---------------------------------------------|
| file    | file    | Yes      | One or more files (repeat for multiple)     |
| user_id | integer | Yes      | User identifier for quota tracking          |
| service | string  | Yes      | Service name (must be configured on server) |

Example

curl -X POST http://localhost:5000/upload \
  -F "file=@run.sh" \
  -F "file=@input.pdb" \
  -F "user_id=1" \
  -F "service=example"

Response

{
  "id": 1,
  "status": "Queued",
  "message": "Job successfully uploaded"
}

Status Codes

| Code | Description                                     |
|------|-------------------------------------------------|
| 201  | Job created successfully                        |
| 400  | Invalid request (missing fields, invalid service) |
| 500  | Server error                                    |

Notes

  • At least one file must be named run.sh
  • The service must match a configured service on the server
  • dest_id is populated after the job is dispatched to a client
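These requirements can be checked client-side before submitting. A pre-flight sketch; the server remains the authority on what it accepts:

```python
def validate_submission(filenames, user_id, service, configured_services):
    """Mirror the documented /upload requirements; return a list of problems."""
    errors = []
    if "run.sh" not in filenames:
        errors.append("at least one file must be named run.sh")
    if not isinstance(user_id, int):
        errors.append("user_id must be an integer")
    if service not in configured_services:
        errors.append(f"service '{service}' is not configured on the server")
    return errors
```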

GET /download/{id}

Check job status or download results.

Parameters

| Parameter | Type    | Description                 |
|-----------|---------|-----------------------------|
| id        | integer | Job ID from upload response |

Example

# Check status (returns JSON when not completed)
curl http://localhost:5000/download/1

# Download results (returns ZIP when completed)
curl -o results.zip http://localhost:5000/download/1

Response

When the job is not yet completed, returns a JSON body:

{
  "id": 1,
  "status": "Submitted",
  "message": ""
}

When the job is completed, returns:

  • Content-Type: application/zip
  • Body: ZIP archive containing all result files

Status Codes

| Code | Description                                       |
|------|---------------------------------------------------|
| 200  | JSON status body or ZIP file (check Content-Type) |
| 404  | Job not found                                     |
| 500  | Server error                                      |

Usage Pattern

Poll until status is Completed, then save the ZIP:

while true; do
  response=$(curl -s http://localhost:5000/download/1)
  status=$(echo "$response" | jq -r '.status // empty')
  if [ -z "$status" ]; then
    # No JSON status field means we got the ZIP
    curl -o results.zip http://localhost:5000/download/1
    break
  elif [ "$status" = "Completed" ]; then
    curl -o results.zip http://localhost:5000/download/1
    break
  else
    echo "Status: $status"
    sleep 5
  fi
done
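The branch in the loop above hinges on whether the response is JSON or a ZIP. The same decision expressed as a pure function (illustrative):

```python
import json

def next_action(content_type, body):
    """Decide how to handle a /download/{id} response.

    The endpoint returns a JSON status while the job is pending and a
    ZIP archive once it has completed.
    """
    if content_type == "application/zip":
        return "save"
    status = json.loads(body).get("status", "")
    return "save" if status == "Completed" else "poll"
```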

GET /health

Health check endpoint.

Example

curl http://localhost:5000/health

Response

{
  "status": "healthy"
}

Status Codes

| Code | Description         |
|------|---------------------|
| 200  | Server is healthy   |
| 500  | Server is unhealthy |

GET /

Ping endpoint for basic connectivity check.

Example

curl http://localhost:5000/

Response

Simple acknowledgment that the server is running.


GET /swagger-ui/

Interactive API documentation.

Example

Open in browser:

http://localhost:5000/swagger-ui/

Provides:

  • Interactive API explorer
  • Request/response schemas
  • Try-it-out functionality

Error Responses

All error responses follow this format:

{
  "id": 0,
  "status": "Unknown",
  "message": "Description of the error"
}

Rate Limiting

The server does not implement rate limiting directly. Use a reverse proxy (nginx, traefik) for rate limiting in production.

Authentication

The server does not implement authentication directly. The user_id field is trusted as provided. Implement authentication at the reverse proxy layer or in your application.

See Also

Client API Endpoints

The client exposes endpoints for the orchestrator server to submit jobs and retrieve results.

Note: These endpoints are typically only accessed by the orchestrator server, not by end users.

Base URL

http://localhost:9000

Endpoints

POST /submit

Receive a job payload from the orchestrator server.

Request

  • Content-Type: multipart/form-data
| Field | Type | Required | Description           |
|-------|------|----------|-----------------------|
| file  | file | Yes      | One or more job files |

Example

curl -X POST http://localhost:9000/submit \
  -F "file=@run.sh" \
  -F "file=@input.pdb"

Response

{
  "id": 1,
  "status": "Prepared",
  "loc": "/opt/data/abc123-def456"
}

Status Codes

| Code | Description                   |
|------|-------------------------------|
| 200  | Payload received successfully |
| 500  | Server error                  |

Notes

  • The client stores files and creates a payload record
  • Status starts as Prepared, waiting for the Runner task
  • The id is returned to the server and stored as dest_id

GET /retrieve/{id}

Retrieve results of a completed payload.

Parameters

| Parameter | Type    | Description                     |
|-----------|---------|---------------------------------|
| id        | integer | Payload ID from submit response |

Example

curl -o results.zip http://localhost:9000/retrieve/1

Response

When the payload is not yet completed, returns a JSON body:

{
  "id": 1,
  "status": "Running",
  "loc": "/opt/data/abc123-def456"
}

When the payload is completed, returns:

  • Content-Type: application/zip
  • Body: ZIP archive of all files in the payload directory

Status Codes

| Code | Description                                        |
|------|----------------------------------------------------|
| 200  | JSON payload status or ZIP file (check Content-Type) |
| 404  | Payload not found                                  |
| 500  | Server error                                       |

Notes

  • The ZIP includes all files in the working directory after run.sh execution
  • Original input files are included unless deleted by run.sh
  • After successful retrieval, the payload may be cleaned up

GET /load

Report current CPU usage.

Example

curl http://localhost:9000/load

Response

45.2

Returns a float representing CPU usage percentage (0-100).

Use Cases

  • Load-aware job distribution (planned feature)
  • Monitoring client health
  • Capacity planning
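For the planned load-aware distribution, the reported float is enough to pick the least-loaded client. A minimal sketch, assuming each client's /load value has already been fetched:

```python
def pick_client(loads):
    """Pick the client URL with the lowest reported CPU usage.

    `loads` maps client base URL -> float from GET /load.
    """
    if not loads:
        raise ValueError("no clients available")
    return min(loads, key=loads.get)
```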

GET /health

Health check endpoint.

Example

curl http://localhost:9000/health

Response

{
  "status": "healthy"
}

GET /

Ping endpoint for basic connectivity check.

Example

curl http://localhost:9000/

Payload States

Payloads on the client go through these states:

| State     | Description                                 |
|-----------|---------------------------------------------|
| Prepared  | Received from server, waiting for execution |
| Running   | Currently executing run.sh                  |
| Completed | Execution finished successfully             |
| Failed    | Execution failed (non-zero exit code)       |
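The table implies a small set of legal transitions. A sketch of that state machine (not the client's actual code):

```python
TRANSITIONS = {
    "Prepared": {"Running"},
    "Running": {"Completed", "Failed"},
}

def can_transition(current, target):
    """True if a payload may legally move from `current` to `target`."""
    return target in TRANSITIONS.get(current, set())
```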

Security Considerations

The client API should never be exposed to the public internet:

  • No authentication is implemented
  • Arbitrary code execution via run.sh
  • Internal service communication only

Recommendations:

  • Use internal networks / VPCs
  • Firewall rules: allow only orchestrator server IP
  • Docker networks with no external exposure

See Also

Docker Deployment

Docker is the recommended way to deploy job-orchestrator in production.

Quick Start

git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
docker compose up --build

This starts:

  • Server on port 5000
  • Example client on port 9000

Docker Images

Official Image

docker pull ghcr.io/rvhonorato/job-orchestrator:latest

Build Locally

docker build -t job-orchestrator .

Docker Compose

Basic Setup

version: '3.8'

services:
  server:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: server
    ports:
      - "5000:5000"
    environment:
      PORT: 5000
      DB_PATH: /opt/data/db.sqlite
      DATA_PATH: /opt/data
      MAX_AGE: 172800
      SERVICE_EXAMPLE_UPLOAD_URL: http://client:9000/submit
      SERVICE_EXAMPLE_DOWNLOAD_URL: http://client:9000/retrieve
      SERVICE_EXAMPLE_RUNS_PER_USER: 5
    volumes:
      - server-data:/opt/data
    depends_on:
      - client

  client:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    environment:
      PORT: 9000
    volumes:
      - client-data:/opt/data

volumes:
  server-data:
  client-data:

Production Setup

See Production Deployment - Container Hardening for details on each security option.

Multiple Clients

Scaling Horizontally

services:
  server:
    # ... server config ...
    environment:
      SERVICE_EXAMPLE_UPLOAD_URL: http://client-1:9000/submit
      SERVICE_EXAMPLE_DOWNLOAD_URL: http://client-1:9000/retrieve

  client-1:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    environment:
      PORT: 9000

  client-2:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    environment:
      PORT: 9000

  client-3:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    environment:
      PORT: 9000

Multiple Services

services:
  server:
    environment:
      # Light jobs
      SERVICE_LIGHT_UPLOAD_URL: http://client-light:9000/submit
      SERVICE_LIGHT_DOWNLOAD_URL: http://client-light:9000/retrieve
      SERVICE_LIGHT_RUNS_PER_USER: 10

      # Heavy jobs
      SERVICE_HEAVY_UPLOAD_URL: http://client-heavy:9000/submit
      SERVICE_HEAVY_DOWNLOAD_URL: http://client-heavy:9000/retrieve
      SERVICE_HEAVY_RUNS_PER_USER: 2

  client-light:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G

  client-heavy:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 16G

Volume Management

Persistent Storage

Always use named volumes for production:

volumes:
  server-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/job-orchestrator/server

  client-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/job-orchestrator/client

Backup Strategy

# Stop services (optional, for consistent backup)
docker compose stop

# Backup server data
tar -czf backup-$(date +%Y%m%d).tar.gz /data/job-orchestrator/server

# Resume services
docker compose start

Networking

Internal Network

Keep client internal:

services:
  server:
    ports:
      - "5000:5000"  # Exposed to host
    networks:
      - internal
      - external

  client:
    # No ports exposed to host
    networks:
      - internal

networks:
  internal:
    internal: true
  external:

With Reverse Proxy

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - server

  server:
    # No ports exposed, accessed via nginx
    networks:
      - internal

Logging

View Logs

# All services
docker compose logs -f

# Server only
docker compose logs -f server

# Last 100 lines
docker compose logs --tail 100 server

Log Rotation

services:
  server:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

Commands

# Start
docker compose up -d

# Stop
docker compose down

# Restart
docker compose restart

# Rebuild and start
docker compose up --build -d

# View status
docker compose ps

# Shell into container
docker compose exec server /bin/sh

See Also

Production Deployment

This guide covers best practices for deploying job-orchestrator in production environments.

Architecture Recommendations

Minimum Setup

┌─────────────┐     ┌─────────────┐
│   Server    │────▶│   Client    │
│ (1 instance)│     │ (1 instance)│
└─────────────┘     └─────────────┘

Scaled Setup

                    ┌──────────────┐
                    │ Load Balancer│
                    │   (nginx)    │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │    Server    │
                    │  (1 instance)│
                    └──────┬───────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
   ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐
   │  Client 1   │  │  Client 2   │  │  Client 3   │
   └─────────────┘  └─────────────┘  └─────────────┘

Security

Script Validation

The client includes a built-in script validator that rejects run.sh scripts containing obviously dangerous patterns before execution. This covers destructive commands (rm -rf /, mkfs), network exfiltration tools (curl, wget, socat), reverse shells (/dev/tcp/), privilege escalation (sudo, chmod +s), container escapes (nsenter, docker), obfuscated execution (base64 | bash, python -c), persistence mechanisms (crontab, systemctl), crypto miners, and environment secret access.

This is a sanity check, not a sandbox. It can be bypassed by determined actors. Input scripts are still expected to come from trusted or semi-trusted sources. True isolation must be enforced at the deployment level using the container hardening measures below.
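Conceptually, the validator is a pattern scan over the script text. An illustrative subset follows; the real rule list lives in the source and is more extensive:

```python
import re

# A few of the pattern classes mentioned above; illustrative only
DANGEROUS_PATTERNS = [
    r"rm\s+-rf\s+/",           # destructive commands
    r"/dev/tcp/",              # reverse shells
    r"\bsudo\b",               # privilege escalation
    r"base64\s*\|\s*(ba)?sh",  # obfuscated execution
]

def looks_dangerous(script):
    """Reject scripts matching any known-bad pattern (sanity check only)."""
    return any(re.search(p, script) for p in DANGEROUS_PATTERNS)
```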

Container Hardening

The client executes user-submitted scripts with the full privileges of the process. Apply all of the following to limit blast radius:

| Measure               | Docker Compose                            | Purpose                                   |
|-----------------------|-------------------------------------------|-------------------------------------------|
| Read-only rootfs      | `read_only: true`                         | Prevent filesystem tampering              |
| Drop all capabilities | `cap_drop: [ALL]`                         | Remove kernel-level privileges            |
| No new privileges     | `security_opt: [no-new-privileges:true]`  | Block setuid/setgid escalation            |
| CPU limit             | `deploy.resources.limits.cpus`            | Prevent CPU starvation                    |
| Memory limit          | `deploy.resources.limits.memory`          | Prevent OOM on host                       |
| PIDs limit            | `deploy.resources.limits.pids`            | Prevent fork bombs                        |
| Internal network      | `networks: [internal]`                    | Block outbound internet access            |
| Writable tmpfs        | `tmpfs: [/tmp]`                           | Provide scratch space on read-only rootfs |

Example (applied to the client service):

services:
  client:
    read_only: true
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    tmpfs:
      - /tmp
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 2G
          pids: 256
    networks:
      - internal

networks:
  internal:
    internal: true

Future improvement: Run the container as a non-root user (USER appuser in the Dockerfile). This requires migrating ownership of existing volumes first – see the TODO in the Dockerfile.

Network Security

  1. Never expose clients to the internet

    • Clients execute user-submitted scripts
    • Use internal networks only
    • Block all outbound access from client containers
  2. Use a reverse proxy

    • TLS termination
    • Rate limiting
    • Request filtering
  3. Firewall rules

    # Allow only orchestrator server to reach clients
    iptables -A INPUT -p tcp --dport 9000 -s <server-ip> -j ACCEPT
    iptables -A INPUT -p tcp --dport 9000 -j DROP
    

Reverse Proxy (nginx)

# Rate limiting zones must be declared in the http context,
# outside any server block
limit_req_zone $binary_remote_addr zone=upload:10m rate=10r/s;

upstream orchestrator {
    server 127.0.0.1:5000;
}

server {
    listen 443 ssl http2;
    server_name jobs.example.com;

    ssl_certificate /etc/nginx/certs/cert.pem;
    ssl_certificate_key /etc/nginx/certs/key.pem;

    location /upload {
        limit_req zone=upload burst=20 nodelay;
        client_max_body_size 400M;
        proxy_pass http://orchestrator;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /download {
        proxy_pass http://orchestrator;
        proxy_set_header Host $host;
    }

    location /health {
        proxy_pass http://orchestrator;
    }

    # Block swagger in production (optional)
    location /swagger-ui {
        deny all;
    }
}

Authentication

job-orchestrator does not implement authentication. Options:

  1. Reverse proxy authentication

    location / {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://orchestrator;
    }
    
  2. Application-level authentication

    • Wrap the API in your application
    • Validate users before calling job-orchestrator
  3. OAuth2 Proxy

    • Use oauth2-proxy in front of the service
    • Integrates with identity providers

Resource Planning

Server Requirements

| Load Level                 | CPU     | Memory | Storage |
|----------------------------|---------|--------|---------|
| Light (< 100 jobs/day)     | 1 core  | 512MB  | 10GB    |
| Medium (100-1000 jobs/day) | 2 cores | 1GB    | 50GB    |
| Heavy (> 1000 jobs/day)    | 4 cores | 2GB    | 100GB+  |

Storage depends heavily on job file sizes and retention period.

Client Requirements

Depends entirely on your job workloads:

| Job Type             | CPU            | Memory |
|----------------------|----------------|--------|
| Text processing      | 1 core         | 512MB  |
| Scientific computing | 4-8 cores      | 8-16GB |
| ML/Deep learning     | 8+ cores + GPU | 32GB+  |

Storage Calculation

Storage = (avg_job_size) × (jobs_per_day) × (retention_days)

Example:
- 10MB average job
- 500 jobs/day
- 2 day retention
= 10MB × 500 × 2 = 10GB
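The formula as a one-line helper (using 1 GB = 1000 MB, matching the rounding in the example):

```python
def storage_gb(avg_job_mb, jobs_per_day, retention_days):
    """Estimate server storage needs from the sizing formula above."""
    return avg_job_mb * jobs_per_day * retention_days / 1000
```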

Monitoring

Health Checks

# Server health
curl -f http://localhost:5000/health

# Client health
curl -f http://localhost:9000/health

# Client load
curl http://localhost:9000/load

Prometheus Metrics (External)

Use a sidecar or external monitoring:

services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # Monitor container metrics
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

Log Aggregation

services:
  server:
    logging:
      driver: "fluentd"
      options:
        fluentd-address: "localhost:24224"
        tag: "job-orchestrator.server"

Backup & Recovery

What to Backup

  1. Server database (DB_PATH)

    • Contains job history and status
    • Critical for job tracking
  2. Server data directory (DATA_PATH)

    • Contains job files and results
    • Large, may use incremental backups

Backup Script

#!/bin/bash
BACKUP_DIR=/backups/job-orchestrator
DATE=$(date +%Y%m%d_%H%M%S)

# Backup database
sqlite3 /opt/data/db.sqlite ".backup '${BACKUP_DIR}/db_${DATE}.sqlite'"

# Backup data (incremental with rsync)
rsync -av --delete /opt/data/ ${BACKUP_DIR}/data/

# Cleanup old backups (keep 7 days)
find ${BACKUP_DIR} -name "db_*.sqlite" -mtime +7 -delete

Recovery

# Stop server
docker compose stop server

# Restore database
cp /backups/job-orchestrator/db_latest.sqlite /opt/data/db.sqlite

# Restore data
rsync -av /backups/job-orchestrator/data/ /opt/data/

# Start server
docker compose start server

High Availability

Current Limitations

  • Single server architecture
  • No built-in clustering
  • SQLite doesn’t support concurrent writes

Workarounds

  1. Quick recovery

    • Automated health checks
    • Container auto-restart
    • Fast backup restoration
  2. Stateless clients

    • Clients can be restarted freely
    • Jobs are tracked by server
  3. Future improvements

    • PostgreSQL support (planned)
    • Server clustering (planned)

Maintenance

Database Maintenance

# Vacuum database (reclaim space)
sqlite3 /opt/data/db.sqlite "VACUUM;"

# Check integrity
sqlite3 /opt/data/db.sqlite "PRAGMA integrity_check;"

Log Rotation

Ensure logs don’t fill disk:

services:
  server:
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"

Updates

# Pull latest image
docker pull ghcr.io/rvhonorato/job-orchestrator:latest

# Recreate containers
docker compose up -d

Troubleshooting

See Troubleshooting Guide for common issues.

See Also

Building from Source

This guide covers building job-orchestrator from source code.

Prerequisites

Rust Toolchain

Install Rust via rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Minimum version: Rust 1.75 (edition 2021)

Verify installation:

rustc --version
cargo --version

System Dependencies

Debian/Ubuntu

apt-get update
apt-get install -y build-essential libsqlite3-dev pkg-config

Fedora/RHEL

dnf install gcc sqlite-devel

macOS

brew install sqlite

Windows

Install Visual Studio Build Tools and SQLite development libraries.

Clone Repository

git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator

Build Commands

Debug Build

Fast compilation, includes debug symbols:

cargo build

Binary location: target/debug/job-orchestrator

Release Build

Optimized for performance:

cargo build --release

Binary location: target/release/job-orchestrator

Check (No Build)

Verify code compiles without producing binary:

cargo check

Running

From Cargo

# Server mode
cargo run -- server --port 5000

# Client mode
cargo run -- client --port 9000

From Binary

# After release build
./target/release/job-orchestrator server --port 5000

Build Options

Features

Currently no optional features. All functionality is included by default.

Target Platforms

Cross-compile for different targets:

# Add target
rustup target add x86_64-unknown-linux-musl

# Build for target
cargo build --release --target x86_64-unknown-linux-musl

Common targets:

  • x86_64-unknown-linux-gnu - Linux (glibc)
  • x86_64-unknown-linux-musl - Linux (static)
  • x86_64-apple-darwin - macOS Intel
  • aarch64-apple-darwin - macOS Apple Silicon
  • x86_64-pc-windows-msvc - Windows

Docker Build

Using Dockerfile

docker build -t job-orchestrator .

Multi-stage Build

The Dockerfile uses multi-stage builds for smaller images:

  1. Builder stage: Compiles with full toolchain
  2. Runtime stage: Minimal image with just the binary

See Also

Testing

This guide covers running and writing tests for job-orchestrator.

Running Tests

All Tests

cargo test

With Output

See println! output from tests:

cargo test -- --nocapture

Specific Test

# By name
cargo test test_upload

# By module
cargo test orchestrator::tests

Ignored Tests

Some tests may be ignored by default (slow, require setup):

cargo test -- --ignored

Test Coverage

Using cargo-tarpaulin

Install:

cargo install cargo-tarpaulin

Generate coverage:

# HTML report
cargo tarpaulin --out Html --output-dir ./coverage

# XML report (for CI)
cargo tarpaulin --out Xml --output-dir ./coverage

View report:

open coverage/tarpaulin-report.html

Test Structure

Tests are organized alongside the code they test:

src/
├── lib.rs
├── orchestrator/
│   ├── mod.rs
│   └── tests.rs      # Orchestrator tests
├── client/
│   ├── mod.rs
│   └── tests.rs      # Client tests
└── utils/
    └── mod.rs        # Inline tests with #[cfg(test)]

Writing Tests

Unit Tests

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_something() {
        let result = function_under_test();
        assert_eq!(result, expected_value);
    }
}

Async Tests

Use tokio::test for async functions:

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_async_function() {
        let result = async_function().await;
        assert!(result.is_ok());
    }
}

Integration Tests

Create files in tests/ directory:

// tests/integration_test.rs
use job_orchestrator::*;

#[tokio::test]
async fn test_full_workflow() {
    // Setup
    // Test
    // Verify
}

Mocking

Using mockall

For trait-based mocking:

use mockall::{automock, predicate::eq};

#[automock]
trait Database {
    fn get(&self, id: i32) -> Option<Job>;
}

#[test]
fn test_with_mock() {
    let mut mock = MockDatabase::new();
    mock.expect_get()
        .with(eq(1))
        .returning(|_| Some(Job::default()));

    // Use mock in test
}

Using mockito

For HTTP mocking:

use mockito::Server;

#[tokio::test]
async fn test_http_client() {
    let mut server = Server::new_async().await;
    let mock = server.mock("GET", "/health")
        .with_status(200)
        .create_async()
        .await;

    // Test against server.url()

    mock.assert_async().await;
}

Test Utilities

Test Fixtures

Create reusable test data:

#[cfg(test)]
mod test_utils {
    pub fn create_test_job() -> Job {
        Job {
            id: 1,
            user_id: 1,
            service: "test".to_string(),
            status: Status::Queued,
            ..Default::default()
        }
    }
}

Temporary Directories

Use tempfile for temporary test directories:

use tempfile::TempDir;

#[test]
fn test_file_operations() {
    let temp_dir = TempDir::new().unwrap();
    let file_path = temp_dir.path().join("test.txt");

    // Test file operations

    // TempDir is automatically cleaned up
}

CI Testing

Tests run automatically on GitHub Actions:

# .github/workflows/ci.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo test

Linting

Clippy

Run Clippy for additional checks:

cargo clippy -- -D warnings

Common fixes:

  • #[allow(clippy::lint_name)] to suppress specific lints
  • Configure in clippy.toml or Cargo.toml

Formatting

Check formatting:

cargo fmt -- --check

Fix formatting:

cargo fmt

Debugging Tests

With println

#[test]
fn test_debug() {
    let value = compute_something();
    println!("Debug value: {:?}", value);  // Use --nocapture to see
    assert!(value.is_valid());
}

With RUST_BACKTRACE

RUST_BACKTRACE=1 cargo test

See Also

Contributing

Contributions to job-orchestrator are welcome! This guide explains how to contribute.

Ways to Contribute

  • Bug reports: Found a bug? Open an issue
  • Feature requests: Have an idea? Open an issue to discuss
  • Code contributions: Fix bugs or implement features
  • Documentation: Improve docs, fix typos, add examples
  • Testing: Add tests, report edge cases

Getting Started

1. Fork the Repository

Click “Fork” on GitHub, then clone your fork:

git clone https://github.com/YOUR_USERNAME/job-orchestrator.git
cd job-orchestrator

2. Set Up Development Environment

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install dependencies (Debian/Ubuntu)
apt-get install libsqlite3-dev

# Build
cargo build

# Run tests
cargo test

3. Create a Branch

git checkout -b feature/your-feature-name
# or
git checkout -b fix/your-bug-fix

Development Workflow

Making Changes

  1. Write code
  2. Add tests for new functionality
  3. Run tests: cargo test
  4. Run linter: cargo clippy -- -D warnings
  5. Format code: cargo fmt

Commit Messages

Follow conventional commits:

type(scope): description

[optional body]

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation
  • refactor: Code refactoring
  • test: Adding tests
  • chore: Maintenance

Examples:

feat(quota): add per-service quota limits
fix(client): handle non-zero exit codes correctly
docs(readme): update quick start instructions

Pull Request Process

  1. Push your branch:

    git push origin feature/your-feature-name
    
  2. Open a Pull Request on GitHub

  3. Fill in the PR template:

    • Describe your changes
    • Link related issues
    • Note any breaking changes
  4. Wait for review

    • CI must pass
    • Maintainer will review
    • Address feedback
  5. Merge!

Code Style

Rust Style

Follow standard Rust conventions:

  • Use cargo fmt for formatting
  • Use cargo clippy for linting
  • Prefer descriptive variable names
  • Add doc comments for public APIs

/// Uploads a job to the orchestrator.
///
/// # Arguments
///
/// * `files` - Files to upload
/// * `user_id` - User submitting the job
/// * `service` - Target service name
///
/// # Returns
///
/// The created job with its ID and initial status.
pub async fn upload_job(
    files: Vec<File>,
    user_id: i32,
    service: String,
) -> Result<Job, Error> {
    // Implementation
}

Error Handling

  • Use Result types, not panics
  • Provide context with error messages
  • Use ? operator for propagation

// Good
let file = File::open(path)
    .map_err(|e| Error::FileOpen { path: path.clone(), source: e })?;

// Avoid
let file = File::open(path).unwrap();

Testing Guidelines

Write Tests For

  • New functionality
  • Bug fixes (regression tests)
  • Edge cases

Test Structure

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_function_describes_behavior() {
        // Arrange
        let input = create_test_input();

        // Act
        let result = function_under_test(input);

        // Assert
        assert_eq!(result, expected);
    }
}

Async Tests

#[tokio::test]
async fn test_async_operation() {
    let result = async_function().await;
    assert!(result.is_ok());
}

Documentation

Code Documentation

Add doc comments for:

  • Public functions
  • Public structs
  • Public modules

/// A job represents a unit of work to be processed.
pub struct Job {
    /// Unique identifier for the job.
    pub id: i32,
    /// User who submitted the job.
    pub user_id: i32,
    // ...
}

mdbook Documentation

Documentation is in docs/src/. To preview:

# Install mdbook
cargo install mdbook

# Serve locally
cd docs
mdbook serve --open

Reporting Issues

Bug Reports

Include:

  • job-orchestrator version
  • Operating system
  • Steps to reproduce
  • Expected vs actual behavior
  • Logs if relevant

Feature Requests

Include:

  • Use case description
  • Proposed solution (if any)
  • Alternatives considered

Code of Conduct

Be respectful and constructive. We’re all here to build something useful together.

Questions?

License

Contributions are licensed under the MIT License.

Troubleshooting

Common issues and solutions for job-orchestrator.

Server Issues

Server Won’t Start

Symptom: Server fails to start, exits immediately

Possible Causes:

  1. Port already in use

    Error: Address already in use
    

    Solution:

    # Find process using the port
    lsof -i :5000
    # Kill it or use a different port
    PORT=5001 job-orchestrator server
    
  2. Database path not writable

    Error: unable to open database file
    

    Solution:

    # Check directory exists and is writable
    mkdir -p /opt/data
    chmod 755 /opt/data
    
  3. Missing service configuration

    Error: No services configured
    

    Solution: Configure at least one service:

    export SERVICE_EXAMPLE_UPLOAD_URL=http://client:9000/submit
    export SERVICE_EXAMPLE_DOWNLOAD_URL=http://client:9000/retrieve
    

Jobs Stuck in Queued

Symptom: Jobs stay in Queued status indefinitely

Possible Causes:

  1. Quota exhausted

    Check if user has reached their limit:

    • Default quota is 5 concurrent jobs per user per service
    • Wait for running jobs to complete, or increase quota
  2. Client unreachable

    Verify client connectivity:

    curl http://client:9000/health
    
  3. Service misconfigured

    Verify service URLs are correct:

    echo $SERVICE_EXAMPLE_UPLOAD_URL
    curl -X POST $SERVICE_EXAMPLE_UPLOAD_URL  # Should return error, not timeout
    

Jobs Stuck in Submitted

Symptom: Jobs move to Submitted but never complete

Possible Causes:

  1. Client not executing jobs

    Check client logs for errors

    docker logs client
    
  2. run.sh hanging

    Your script may be waiting for input or stuck in a loop

  3. Getter task not running

    Server may need restart

Upload Fails with 400

Symptom: POST /upload returns 400 Bad Request

Possible Causes:

  1. Missing required fields

    # Ensure all fields are provided (user_id and service are required)
    curl -X POST http://localhost:5000/upload \
      -F "file=@run.sh" \
      -F "user_id=1" \
      -F "service=example"
    
  2. Unknown service

    Service must be configured on server:

    export SERVICE_EXAMPLE_UPLOAD_URL=...
    
  3. File too large

    Default limit is 400MB. Check file sizes.

Client Issues

Client Not Receiving Jobs

Symptom: Client running but no jobs arrive

Check:

  1. Network connectivity

    # From server, can you reach client?
    curl http://client:9000/health
    
  2. Firewall rules

    # Client port must be accessible from server
    iptables -L -n | grep 9000
    
  3. Docker networking

    # Containers must be on same network
    docker network inspect job-orchestrator_default
    

Jobs Stuck in Prepared

Symptom: Payloads stay in Prepared status

Possible Causes:

  1. Runner task not running

    Check the client logs; the client may need a restart

  2. run.sh not found or not executable

    Ensure the script exists and is executable:

    # In your upload
    chmod +x run.sh
    
  3. Permission issues

    Client working directory may have permission issues

Execution Fails

Symptom: Jobs complete but with Failed status

Check:

  1. Exit code

    run.sh must exit with code 0 for success:

    #!/bin/bash
    # Your commands here
    exit 0  # Explicit success
    
  2. Script errors

    Check output files for error messages

  3. Missing dependencies

    Your script may need tools not available in the container
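The three checks above can be baked into the script itself. A minimal defensive run.sh template (the tool names in the loop are examples; substitute your script's actual dependencies):

```shell
#!/bin/bash
set -euo pipefail   # exit non-zero on any error, unset variable, or pipe failure

# Fail early with a clear message if a required tool is missing in the container
for tool in curl python3; do   # example dependencies
  command -v "$tool" >/dev/null 2>&1 || { echo "missing dependency: $tool" >&2; exit 1; }
done

# ... your actual commands here ...

exit 0   # explicit success
```

With set -e in place, any failing command fails the job immediately instead of leaving a misleading exit code 0.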

Database Issues

Database Locked

Symptom: “database is locked” errors

Cause: multiple processes accessing the SQLite database file concurrently

Solution:

  • Ensure only one server instance runs
  • Check for zombie processes
  • Restart server

Database Corrupted

Symptom: Strange errors, missing data

Solution:

  1. Stop server

  2. Backup current database

  3. Run integrity check:

    sqlite3 db.sqlite "PRAGMA integrity_check;"
    
  4. If corrupted, restore from backup or delete and restart
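For step 2, SQLite's online backup command is safer than a raw file copy because it produces a consistent snapshot even if the file is mid-write. A sketch, assuming the database lives at db.sqlite:

```shell
# Create a consistent copy with the sqlite3 CLI's online backup
sqlite3 db.sqlite ".backup db-backup.sqlite"

# Verify the copy before trusting it; prints "ok" when healthy
sqlite3 db-backup.sqlite "PRAGMA integrity_check;"
```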

Out of Disk Space

Symptom: “disk full” errors

Solution:

  1. Check disk usage:

    df -h
    
  2. Clean old jobs:

    # Reduce MAX_AGE and restart
    export MAX_AGE=3600  # 1 hour
    
  3. Manually clean data directory
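Step 3 can be done safely with find: list candidates first, then delete. The /opt/data path and 7-day cutoff are examples; adjust to your setup:

```shell
# Dry run: show top-level entries under /opt/data older than 7 days
find /opt/data -mindepth 1 -maxdepth 1 -mtime +7 -print

# Once the list looks right, uncomment to actually delete them:
# find /opt/data -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} +
```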

Docker Issues

Container Exits Immediately

Check logs:

docker logs container_name

Common causes:

  • Missing environment variables
  • Port conflicts
  • Permission issues

Cannot Connect Between Containers

Ensure same network:

services:
  server:
    networks:
      - app-network
  client:
    networks:
      - app-network

networks:
  app-network:

Use service names, not localhost:

# Wrong
SERVICE_EXAMPLE_UPLOAD_URL=http://localhost:9000/submit

# Correct
SERVICE_EXAMPLE_UPLOAD_URL=http://client:9000/submit

Volume Permission Issues

Symptom: Permission denied when writing to volumes

Solution:

services:
  server:
    user: "1000:1000"  # Match host user
    volumes:
      - ./data:/opt/data

Or fix permissions:

sudo chown -R 1000:1000 ./data

Performance Issues

Slow Job Processing

Possible Causes:

  1. Slow database

    • Use SSD storage for database
    • Run VACUUM periodically
  2. Network latency

    • Place server and clients on same network
    • Check for packet loss
  3. Client overloaded

    • Add more clients
    • Reduce RUNS_PER_USER
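The VACUUM step mentioned above can be run directly against the server's SQLite file. The path is an example; stop the server first so the file is not locked:

```shell
# Rebuild the database file, reclaiming free pages and reducing fragmentation
sqlite3 /opt/data/db.sqlite "VACUUM;"
```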

High Memory Usage

Server:

  • Memory grows with job count
  • Clean old jobs with lower MAX_AGE

Client:

  • In-memory database grows with payloads
  • Restart client to clear

Disk Usage Growing

Check:

du -sh /opt/data/*

Solutions:

  • Reduce MAX_AGE
  • Increase cleanup frequency
  • Archive old results externally
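Archiving and cleanup can be combined into one pipeline. A sketch assuming GNU tar; the /opt/data path and 7-day cutoff are examples:

```shell
# Bundle entries older than 7 days into a dated archive; review it, then delete
find /opt/data -mindepth 1 -maxdepth 1 -mtime +7 -print0 \
  | tar -czf "jobs-archive-$(date +%F).tar.gz" --null --files-from=-
```

The -print0 / --null pairing keeps the pipeline safe for paths containing spaces or newlines.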

Getting Help

If you can’t resolve an issue:

  1. Check logs for specific error messages
  2. Search existing issues: GitHub Issues
  3. Open new issue with:
    • Version
    • Configuration
    • Steps to reproduce
    • Logs

See Also