Introduction

job-orchestrator is an asynchronous job orchestration system for managing and distributing computational workloads across heterogeneous computing resources with intelligent quota-based load balancing.

What is job-orchestrator?

job-orchestrator is a central component of WeNMR, a worldwide e-Infrastructure for structural biology operated by the BonvinLab at Utrecht University. It serves as a reactive middleware layer that connects web applications to diverse computing resources, enabling efficient job distribution for scientific computing workflows.

Key Features

  • Asynchronous Job Management: Built with Rust and Tokio for high-performance async operations
  • Quota-Based Load Balancing: Per-user, per-service quotas prevent resource exhaustion
  • Dual-Mode Architecture: Runs as server (job orchestration) or client (job execution)
  • Multiple Backend Support: Extensible to integrate with various computing resources:
    • Native client mode for local job execution
    • DIRAC Interware (planned)
    • SLURM clusters (planned)
    • Educational cloud services (planned)
  • RESTful API: Simple HTTP interface for job submission and retrieval
  • Automatic Cleanup: Configurable retention policies for completed jobs

Use Cases

job-orchestrator is designed for scenarios requiring:

  • Scientific Computing Workflows: Distribute computational biology/chemistry jobs across clusters
  • Multi-Tenant Systems: Fair resource allocation with per-user quotas
  • Heterogeneous Computing: Route jobs to appropriate backends (local, HPC, cloud)
  • Web-Based Science Platforms: Decouple frontend from compute infrastructure
  • Batch Processing: Handle high-throughput job submissions with automatic queuing

Project Status

Current State: Production-ready with server/client architecture

Planned Features:

  • Auto-Scaling: Dynamic creation and termination of cloud-based client instances based on workload
  • DIRAC Interware integration
  • SLURM direct integration
  • Enhanced monitoring and metrics
  • Job priority queues
  • Advanced scheduling policies

Getting Help

License

MIT License - see LICENSE for details.

Installation

There are several ways to install job-orchestrator depending on your needs.

From crates.io

The easiest way to install job-orchestrator is via Cargo:

cargo install job-orchestrator

From Source

Clone the repository and build:

git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
cargo build --release

The binary will be available at target/release/job-orchestrator.

Using Docker

Pull the pre-built image:

docker pull ghcr.io/rvhonorato/job-orchestrator:latest

Or build locally:

docker build -t job-orchestrator .

Prerequisites

For Building from Source

  • Rust: 1.75 or later (edition 2021)
  • SQLite: Development libraries

On Debian/Ubuntu:

apt-get install libsqlite3-dev

On macOS:

brew install sqlite

For Running

  • SQLite: Runtime library (usually preinstalled on most systems)
  • Filesystem access: Write permissions for database and job storage directories

Verifying Installation

After installation, verify it works:

job-orchestrator --version

You should see the version number displayed.

Next Steps

Quick Start

The fastest way to get job-orchestrator running is with Docker Compose.

Running with Docker Compose

git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
docker compose up --build

This starts:

  • Orchestrator server on port 5000
  • Example client on port 9000

Verify It’s Running

Check the server is responding:

curl http://localhost:5000/health

You should receive a health status response.

Access the API Documentation

Open your browser and navigate to:

http://localhost:5000/swagger-ui/

This provides interactive API documentation where you can explore and test all endpoints.

What’s Next?

Now that you have job-orchestrator running, proceed to Your First Job to learn how to submit and retrieve jobs.

Stopping the Services

To stop the services:

docker compose down

To stop and remove all data (volumes):

docker compose down -v

Your First Job

This guide walks you through submitting and retrieving your first job.

Prerequisites

Make sure you have job-orchestrator running. See Quick Start if you haven’t set it up yet.

Understanding Jobs

A job in job-orchestrator consists of:

  1. Files: One or more files to be processed
  2. A run.sh script: The entry point that gets executed
  3. User ID: Identifies who submitted the job (for quota tracking)
  4. Service: Which service/backend should process this job

Creating a Simple Job

Create a simple run.sh script:

cat > run.sh << 'EOF'
#!/bin/bash
echo "Hello from job-orchestrator!" > output.txt
echo "Processing complete at $(date)" >> output.txt
EOF
chmod +x run.sh

Submitting the Job

Submit the job using curl:

curl -X POST http://localhost:5000/upload \
  -F "file=@run.sh" \
  -F "user_id=1" \
  -F "service=example" | jq

You’ll receive a response like:

{
  "id": 1,
  "status": "Queued",
  "message": "Job successfully uploaded"
}

Note the id field - you’ll need this to check status and download results.

Checking Job Status

Check the job status via GET request:

curl http://localhost:5000/download/1

If the job is not yet completed, you’ll get a JSON response:

{
  "id": 1,
  "status": "Submitted",
  "message": ""
}

The status field will be one of: Queued, Processing, Submitted, Running, Completed, Failed, Invalid, Cleaned, or Unknown.
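Because the status endpoint returns JSON until the job completes, polling is easy to script. The helper below is a minimal sketch that extracts the status field with sed (assuming the response shape shown above) so jq is not required:

```shell
#!/bin/bash
# Extract the "status" field from a /download/:id JSON response.
# Assumes the flat response shape shown above; uses sed so jq is not required.
extract_status() {
  printf '%s' "$1" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example response while the job is still running:
response='{"id": 1, "status": "Submitted", "message": ""}'
extract_status "$response"   # prints: Submitted

# A simple poll loop (uncomment to run against a live server):
# while [ "$(extract_status "$(curl -s http://localhost:5000/download/1)")" != "Completed" ]; do
#   sleep 2
# done
```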

Downloading Results

Once the status is Completed, the same endpoint returns the ZIP file:

curl -o results.zip http://localhost:5000/download/1

Extract and view:

unzip results.zip
cat output.txt

You should see:

Hello from job-orchestrator!
Processing complete at <timestamp>

A More Complex Example

Here’s a job that processes an input file:

# Create an input file
echo "sample data" > input.txt

# Create a processing script
cat > run.sh << 'EOF'
#!/bin/bash
# Count lines and words in input
wc input.txt > stats.txt
# Transform the data
tr 'a-z' 'A-Z' < input.txt > output.txt
echo "Done!" >> output.txt
EOF
chmod +x run.sh

# Submit with multiple files
curl -X POST http://localhost:5000/upload \
  -F "file=@run.sh" \
  -F "file=@input.txt" \
  -F "user_id=1" \
  -F "service=example"

Important Notes

The run.sh Script

  • Must be named exactly run.sh
  • Must be executable (or start with #!/bin/bash)
  • Exit code 0 indicates success
  • Non-zero exit code indicates failure
  • All output files in the working directory are included in results
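Since the exit code alone decides success or failure, it helps to make run.sh fail loudly. The following sketch (illustrative, not a required pattern) writes an error file before exiting non-zero, so the failure reason ships back with the results:

```shell
# Create an input file and a defensive run.sh
echo "sample data" > input.txt

cat > run.sh << 'EOF'
#!/bin/bash
set -euo pipefail                        # abort on the first failing command
if [ ! -f input.txt ]; then
  echo "missing input.txt" > error.log   # error.log is returned with the results
  exit 1
fi
wc -l input.txt > stats.txt
echo "done" > output.txt
EOF
chmod +x run.sh

# Dry-run locally before submitting; the exit code mirrors what the client records
./run.sh && echo "exit code 0: job would be marked Completed"
```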

File Size Limits

The default maximum upload size is 400MB. This can be configured on the server.

Job Retention

Completed jobs are automatically cleaned up after the configured retention period (default: 48 hours). Make sure to download your results before they expire.

Next Steps

Architecture Overview

job-orchestrator uses a distributed architecture with a central server coordinating job execution across multiple client nodes.

High-Level Architecture

flowchart TB
 subgraph Tasks["Background Tasks"]
        Sender["Sender<br>500ms"]
        Getter["Getter<br>500ms"]
        Cleaner["Cleaner<br>60s"]
  end
 subgraph Server["Orchestrator Server"]
        API["REST API<br>upload/download"]
        DB[("SQLite<br>Persistent")]
        FS[/"Filesystem<br>Job Storage"/]
        Tasks
        Queue["Queue Manager<br>Quota Enforcement"]
  end
 subgraph Client["Client Service"]
        ClientAPI["REST API<br>submit/retrieve/load"]
        ClientDB[("SQLite<br>In-Memory")]
        ClientFS[/"Working Dir"/]
        Runner["Runner Task<br>500ms"]
        Executor["Bash Executor<br>run.sh"]
  end
    User(["User/Web App"]) -- POST /upload --> API
    User -- GET /download/:id --> API
    API --> DB & FS
    DB --> Queue
    Queue --> Sender
    Sender -- POST /submit --> ClientAPI
    Getter -- GET /retrieve/:id --> ClientAPI
    Getter --> FS
    Cleaner --> DB & FS
    ClientAPI --> ClientDB
    ClientDB --> Runner
    Runner --> Executor
    Executor --> ClientFS

Components

Orchestrator Server

The central server manages:

  • REST API: Handles job uploads and result downloads from users
  • SQLite Database: Persistent storage for job metadata and status
  • Filesystem Storage: Stores uploaded files and downloaded results
  • Queue Manager: Enforces per-user quotas and manages job distribution
  • Background Tasks: Automated processes for job distribution, result retrieval, and cleanup

Client Service

Each client node handles:

  • REST API: Receives jobs from server, returns results
  • In-Memory Database: Lightweight tracking of current payloads
  • Working Directory: Temporary storage for job execution
  • Runner Task: Monitors for new payloads and executes them
  • Bash Executor: Runs the run.sh script for each job

Background Tasks

Server Tasks

| Task | Interval | Purpose |
|------|----------|---------|
| Sender | 500ms | Picks up queued jobs, enforces quotas, dispatches to clients |
| Getter | 500ms | Retrieves completed results from clients |
| Cleaner | 60s | Removes expired jobs from disk and database |

Client Tasks

| Task | Interval | Purpose |
|------|----------|---------|
| Runner | 500ms | Executes prepared payloads, captures results |

Data Flow

  1. User submits files via POST /upload
  2. Server stores files and creates job record (status: Queued)
  3. Sender task picks up job, checks quotas, sends to available client
  4. Client receives job, stores as payload (status: Prepared)
  5. Runner task executes run.sh, updates status to Completed
  6. Getter task retrieves results, stores locally
  7. User downloads results via GET /download/:id
  8. Cleaner task removes job after retention period

Auto-Scaling Architecture (Planned)

The orchestrator will support automatic scaling of client instances based on workload:

---
config:
  layout: dagre
---
flowchart TB
 subgraph Server["Orchestrator Server"]
        API["REST API"]
        Queue["Queue Manager"]
        AutoScaler["Auto-Scaler"]
        ServicePool["Service Pool"]
  end
 subgraph Cloud["Cloud Provider"]
        CloudAPI["Cloud API"]
  end
 subgraph Clients["Client Instances"]
        Dynamic["Dynamic Clients<br>Auto-created"]
        Static["Static Client<br>"]
  end
    User(["User/Web App"]) -- Submits/Retrieves --> API
    API --> Queue
    Queue -- Distribute jobs --> Clients
    ServicePool <-- Monitors --> Queue
    AutoScaler <-- Register/Trigger --> ServicePool
    AutoScaler -- Scale Up/Down --> CloudAPI
    CloudAPI -- Create/Terminate --> Clients

This feature will enable:

  • Dynamic creation of cloud-based client instances during high demand
  • Automatic termination of idle instances to reduce costs
  • Load-aware job distribution across available clients

Job Lifecycle

Understanding the job lifecycle is essential for working with job-orchestrator effectively.

Lifecycle Sequence

sequenceDiagram
    participant User
    participant Server
    participant Client
    participant Executor

    User->>Server: POST /upload (files, user_id, service)
    Server->>Server: Store job (status: Queued)
    Server-->>User: Job ID

    Note over Server: Sender task (500ms interval)
    Server->>Server: Update status: Processing
    Server->>Client: POST /submit (job files)
    Client->>Client: Store payload (status: Prepared)
    Client-->>Server: Payload ID
    Server->>Server: Update status: Submitted

    Note over Client: Runner task (500ms interval)
    Client->>Executor: Execute run.sh
    Executor->>Executor: Process files
    Executor-->>Client: Exit code
    Client->>Client: Update status: Completed

    Note over Server: Getter task (500ms interval)
    Server->>Client: GET /retrieve/:id
    Client-->>Server: ZIP results
    Server->>Server: Store results, status: Completed

    User->>Server: GET /download/:id
    Server-->>User: results.zip

    Note over Server: Cleaner task (60s interval)
    Server->>Server: Remove jobs older than MAX_AGE

Job States

stateDiagram-v2
    [*] --> Queued: Job submitted

    Queued --> Processing: Sender picks up job
    Processing --> Submitted: Sent to client
    Processing --> Failed: Client unreachable

    Submitted --> Completed: Execution successful
    Submitted --> Unknown: Retrieval failed or execution failed

    Unknown --> Completed: Retry successful

    Completed --> Cleaned: After MAX_AGE
    Failed --> Cleaned: After MAX_AGE
    Unknown --> Cleaned: After MAX_AGE (if applicable)

    Cleaned --> [*]

State Descriptions

| State | Description |
|-------|-------------|
| Queued | Job received and waiting for dispatch |
| Processing | Server is sending the job to a client |
| Submitted | Job successfully sent to client, awaiting execution |
| Completed | Job finished successfully, results available |
| Failed | Job failed permanently (client unreachable, execution error) |
| Unknown | Temporary state when retrieval fails; will retry |
| Cleaned | Job data removed after retention period |

Lifecycle Stages

1. Submission

User uploads files via POST /upload with:

  • One or more files (including run.sh)
  • user_id - identifies the submitting user
  • service - which backend should process this job

The server:

  1. Validates the request
  2. Creates a unique directory for the job
  3. Stores all uploaded files
  4. Creates a database record with status Queued
  5. Returns the job ID to the user

2. Queuing & Quota Check

The Sender background task (runs every 500ms):

  1. Finds jobs in Queued status
  2. Checks if user has available quota for the service
  3. If quota available, marks job as Processing
  4. If quota exceeded, job remains Queued

3. Distribution

For jobs in Processing status:

  1. Server packages job files
  2. Sends to configured client via POST /submit
  3. On success: updates status to Submitted, stores client’s payload ID
  4. On failure: updates status to Failed

4. Execution

On the client side:

  1. Runner task finds payloads in Prepared status
  2. Executes run.sh in the job directory
  3. Captures exit code and any output files
  4. Updates payload status to Completed (or Failed on error)

5. Retrieval

The Getter background task (runs every 500ms):

  1. Finds jobs in Submitted status
  2. Requests results from client via GET /retrieve/:id
  3. Downloads and stores the result ZIP
  4. Updates job status to Completed

6. Download

User can now:

  1. Check status via GET /download/:id (returns a JSON status object while the job is not completed)
  2. Download results via the same endpoint (returns a ZIP archive once the job is completed)
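Because the same endpoint serves both JSON and a ZIP, a download script can tell them apart by inspecting the body: ZIP archives always begin with the bytes "PK". A minimal sketch:

```shell
# The same endpoint returns JSON (status) or a ZIP (results); ZIP archives
# always start with the two bytes "PK", which is enough to tell them apart.
is_zip() {
  [ "$(head -c 2 "$1")" = "PK" ]
}

# Offline demonstration with two stand-in files:
printf 'PK\003\004rest-of-archive' > results.bin
printf '{"id":1,"status":"Running"}' > status.bin

is_zip results.bin && echo "results.bin: ZIP archive"
is_zip status.bin  || echo "status.bin: JSON status"

# Against a live server (not run here):
# curl -s -o body http://localhost:5000/download/1
# if is_zip body; then mv body results.zip; else cat body; fi
```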

7. Cleanup

The Cleaner background task (runs every 60s):

  1. Finds jobs older than MAX_AGE
  2. Deletes job files from filesystem
  3. Updates status to Cleaned or removes record

Error Handling

Client Unreachable

If the server cannot reach a client during distribution:

  • Job status changes to Failed
  • Job will not be retried automatically
  • User can resubmit if needed

Execution Failure

If run.sh exits with non-zero code:

  • Payload status changes to Failed
  • Server retrieves whatever output exists
  • Job status reflects the failure

Retrieval Failure

If the server cannot retrieve results:

  • Job status changes to Unknown
  • Server will retry on subsequent Getter cycles
  • Eventually succeeds or times out

Timing Considerations

| Event | Typical Duration |
|-------|------------------|
| Upload to Queued | Immediate |
| Queued to Processing | Up to 500ms (+ quota wait) |
| Processing to Submitted | Depends on file size and network |
| Submitted to Completed | Depends on job execution time |
| Completed to Cleaned | Configured via MAX_AGE (default: 48 hours) |

Server & Client Modes

job-orchestrator provides both server and client functionality in a single binary, configured via command-line arguments.

Dual-Mode Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Same Binary                               │
│                                                                  │
│   ┌─────────────────────┐       ┌─────────────────────┐         │
│   │    Server Mode      │       │    Client Mode      │         │
│   │                     │       │                     │         │
│   │  - Job orchestration│       │  - Job execution    │         │
│   │  - Quota management │       │  - Result packaging │         │
│   │  - Persistent DB    │       │  - In-memory DB     │         │
│   │  - User-facing API  │       │  - Server-facing API│         │
│   └─────────────────────┘       └─────────────────────┘         │
│                                                                  │
│              job-orchestrator server    job-orchestrator client  │
└─────────────────────────────────────────────────────────────────┘

Server Mode

The server is the central orchestrator that:

  • Receives job submissions from users/applications
  • Manages job queues and enforces quotas
  • Distributes jobs to available clients
  • Retrieves results and serves them to users
  • Handles cleanup of expired jobs

Starting the Server

job-orchestrator server --port 5000

Or with environment variables:

PORT=5000 job-orchestrator server

Server Responsibilities

| Component | Purpose |
|-----------|---------|
| REST API | Handle /upload and /download requests |
| Queue Manager | Enforce per-user, per-service quotas |
| Sender Task | Dispatch jobs to clients |
| Getter Task | Retrieve completed results |
| Cleaner Task | Remove expired jobs |
| SQLite DB | Persistent job tracking |

Server API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| /upload | POST | Submit new job |
| /download/:id | GET | Get results or status |
| /health | GET | Health check |
| /swagger-ui/ | GET | API documentation |

Client Mode

The client executes jobs on behalf of the server:

  • Receives job payloads from the server
  • Executes the run.sh script
  • Packages results for retrieval
  • Reports system load for scheduling decisions

Starting the Client

job-orchestrator client --port 9000

Or with environment variables:

PORT=9000 job-orchestrator client

Client Responsibilities

| Component | Purpose |
|-----------|---------|
| REST API | Handle /submit and /retrieve requests |
| Runner Task | Execute prepared payloads |
| Bash Executor | Run run.sh scripts |
| In-Memory DB | Lightweight payload tracking |

Client API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| /submit | POST | Receive job from server |
| /retrieve/:id | GET | Return completed results |
| /load | GET | Report CPU usage |
| /health | GET | Health check |

Communication Flow

User                Server                    Client
  │                   │                         │
  │──POST /upload────▶│                         │
  │◀─── Job ID ───────│                         │
  │                   │                         │
  │                   │──POST /submit──────────▶│
  │                   │◀─── Payload ID ─────────│
  │                   │                         │
  │                   │                    ┌────┴────┐
  │                   │                    │ Execute │
  │                   │                    │ run.sh  │
  │                   │                    └────┬────┘
  │                   │                         │
  │                   │──GET /retrieve/:id─────▶│
  │                   │◀─── results.zip ────────│
  │                   │                         │
  │─GET /download/:id▶│                         │
  │◀─── results.zip ──│                         │

Deployment Patterns

Single Machine (Development)

Both server and client on the same machine:

# Terminal 1
job-orchestrator server --port 5000

# Terminal 2
job-orchestrator client --port 9000

Distributed (Production)

Server on one machine, clients on compute nodes:

                    ┌─────────────┐
                    │   Server    │
                    │  (port 5000)│
                    └──────┬──────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
   ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐
   │  Client 1   │  │  Client 2   │  │  Client 3   │
   │ (compute-1) │  │ (compute-2) │  │ (compute-3) │
   └─────────────┘  └─────────────┘  └─────────────┘

Multi-Service Setup

Different clients for different services:

# Server configuration
SERVICE_EXAMPLE_UPLOAD_URL: http://client-example:9000/submit
SERVICE_EXAMPLE_DOWNLOAD_URL: http://client-example:9000/retrieve

SERVICE_HADDOCK_UPLOAD_URL: http://client-haddock:9001/submit
SERVICE_HADDOCK_DOWNLOAD_URL: http://client-haddock:9001/retrieve

Database Differences

Server Database (Persistent)

  • Uses SQLite file on disk
  • Survives restarts
  • Stores complete job history
  • Location configured via DB_PATH

Client Database (In-Memory)

  • SQLite in-memory database
  • Cleared on restart
  • Only tracks active payloads
  • Lightweight and fast

When to Scale

Add More Clients When:

  • Job queue is consistently backing up
  • Execution time is the bottleneck
  • You have available compute resources

Scale Server When:

  • Upload/download becomes slow
  • Many concurrent users
  • Database queries become slow

See Also

Server Configuration

The orchestrator server is configured primarily through environment variables.

Environment Variables

Core Settings

| Variable | Default | Description |
|----------|---------|-------------|
| PORT | 5000 | HTTP port the server listens on |
| DB_PATH | ./db.sqlite | Path to SQLite database file |
| DATA_PATH | ./data | Directory for job file storage |
| MAX_AGE | 172800 | Job retention time in seconds (default: 48 hours) |

Service Configuration

For each service you want to support, configure these variables:

| Variable Pattern | Description |
|------------------|-------------|
| SERVICE_<NAME>_UPLOAD_URL | Client endpoint for submitting jobs |
| SERVICE_<NAME>_DOWNLOAD_URL | Client endpoint for retrieving results |
| SERVICE_<NAME>_RUNS_PER_USER | Maximum concurrent jobs per user (default: 5) |

Note: <NAME> must be uppercase. For a service called “example”, use SERVICE_EXAMPLE_*.

Example Configuration

Minimal Setup

export PORT=5000
export DB_PATH=/var/lib/job-orchestrator/db.sqlite
export DATA_PATH=/var/lib/job-orchestrator/data
export SERVICE_EXAMPLE_UPLOAD_URL=http://localhost:9000/submit
export SERVICE_EXAMPLE_DOWNLOAD_URL=http://localhost:9000/retrieve

Production Setup

# Core settings
export PORT=5000
export DB_PATH=/opt/orchestrator/db.sqlite
export DATA_PATH=/opt/orchestrator/data
export MAX_AGE=172800  # 48 hours

# Example service (general purpose)
export SERVICE_EXAMPLE_UPLOAD_URL=http://compute-1:9000/submit
export SERVICE_EXAMPLE_DOWNLOAD_URL=http://compute-1:9000/retrieve
export SERVICE_EXAMPLE_RUNS_PER_USER=10

# HADDOCK service (specialized)
export SERVICE_HADDOCK_UPLOAD_URL=http://haddock-cluster:9001/submit
export SERVICE_HADDOCK_DOWNLOAD_URL=http://haddock-cluster:9001/retrieve
export SERVICE_HADDOCK_RUNS_PER_USER=3

Docker Compose

services:
  server:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: server
    ports:
      - "5000:5000"
    environment:
      PORT: 5000
      DB_PATH: /opt/data/db.sqlite
      DATA_PATH: /opt/data
      MAX_AGE: 172800
      SERVICE_EXAMPLE_UPLOAD_URL: http://client:9000/submit
      SERVICE_EXAMPLE_DOWNLOAD_URL: http://client:9000/retrieve
      SERVICE_EXAMPLE_RUNS_PER_USER: 5
    volumes:
      - server-data:/opt/data

volumes:
  server-data:

Configuration Details

PORT

The HTTP port for the REST API. Users will connect to this port to submit jobs and download results.

PORT=5000

DB_PATH

Path to the SQLite database file. The directory must exist and be writable.

DB_PATH=/var/lib/job-orchestrator/db.sqlite

The database is created automatically on first run. It stores:

  • Job metadata (ID, user, service, status)
  • Job locations and timestamps
  • Client payload references

DATA_PATH

Directory where job files are stored. Each job gets a unique subdirectory.

DATA_PATH=/var/lib/job-orchestrator/data

Structure:

/var/lib/job-orchestrator/data/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890/
│   ├── run.sh
│   ├── input.pdb
│   └── output.zip  (after completion)
├── b2c3d4e5-f6a7-8901-bcde-f12345678901/
│   └── ...

MAX_AGE

How long to keep completed jobs before cleanup, in seconds.

| Value | Duration |
|-------|----------|
| 3600 | 1 hour |
| 86400 | 24 hours |
| 172800 | 48 hours (default) |
| 604800 | 1 week |

MAX_AGE=172800

Jobs older than this are removed by the Cleaner task.
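Since MAX_AGE is specified in plain seconds, computing it from hours in the shell avoids off-by-a-zero mistakes:

```shell
# MAX_AGE is plain seconds; deriving it from hours avoids off-by-a-zero errors.
HOURS=48
MAX_AGE=$(( HOURS * 3600 ))
echo "$MAX_AGE"   # 172800, the default 48-hour retention
export MAX_AGE
```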

Service URLs

Each service needs upload and download URLs pointing to a client:

SERVICE_MYSERVICE_UPLOAD_URL=http://client-host:9000/submit
SERVICE_MYSERVICE_DOWNLOAD_URL=http://client-host:9000/retrieve

  • UPLOAD_URL: Where to POST job files
  • DOWNLOAD_URL: Where to GET results (:id is appended automatically)

RUNS_PER_USER

Controls how many jobs a single user can have running simultaneously for a service:

SERVICE_EXAMPLE_RUNS_PER_USER=5

  • Jobs exceeding the quota remain in Queued status
  • They’re automatically dispatched when slots become available
  • Set higher for quick jobs, lower for resource-intensive jobs

File Permissions

Ensure the server process has:

  • Read/Write access to DB_PATH parent directory
  • Read/Write access to DATA_PATH directory
  • Network access to all configured client URLs

Validating Configuration

Start the server and check logs:

job-orchestrator server

You should see:

  • Port binding confirmation
  • Database initialization
  • Service configuration loaded

Test with a health check:

curl http://localhost:5000/health

See Also

Client Configuration

The client is configured through environment variables and runs as a job executor.

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| PORT | 9000 | HTTP port the client listens on |

Example Configuration

Basic Setup

export PORT=9000
job-orchestrator client

Docker Compose

services:
  client:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    ports:
      - "9000:9000"
    environment:
      PORT: 9000
    volumes:
      - client-data:/opt/data

volumes:
  client-data:

How the Client Works

In-Memory Database

Unlike the server, the client uses an in-memory SQLite database:

  • Fast: No disk I/O for database operations
  • Ephemeral: Data is lost on restart
  • Lightweight: Minimal resource usage

This is intentional - the client only needs to track active payloads. The server maintains the authoritative job history.

Working Directory

The client stores job files in a working directory. Each payload gets a unique subdirectory:

/opt/data/
├── payload-uuid-1/
│   ├── run.sh
│   ├── input.pdb
│   └── output.txt  (created by run.sh)
├── payload-uuid-2/
│   └── ...

Execution Environment

When the Runner task executes a job:

  1. Changes to the payload directory
  2. Executes ./run.sh
  3. Captures the exit code
  4. All files in the directory are included in results

Resource Reporting

The client exposes a /load endpoint that reports CPU usage:

curl http://localhost:9000/load

Returns a float representing CPU usage percentage. This can be used by the server for load-aware scheduling (planned feature).
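The planned load-aware scheduling boils down to picking the client reporting the lowest load. The sketch below illustrates the idea in shell over "host load" pairs; the hostnames in the commented section are hypothetical:

```shell
# Sketch of load-aware client selection (a planned server-side feature).
# Given "host load" pairs on stdin, pick the host with the lowest CPU load.
pick_least_loaded() {
  sort -k2 -g | head -n 1 | cut -d' ' -f1
}

# Offline demonstration with static readings:
printf '%s\n' "client-1 72.5" "client-2 13.0" "client-3 41.2" | pick_least_loaded
# prints: client-2

# With live clients (hypothetical hostnames, not run here):
# for h in client-1 client-2; do
#   echo "$h $(curl -s "http://$h:9000/load")"
# done | pick_least_loaded
```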

Multiple Clients

You can run multiple clients for:

  • Scaling: Handle more concurrent jobs
  • Isolation: Different services on different machines
  • Redundancy: Failover capability

Same Service, Multiple Clients

Currently, each service points at a single client URL; round-robin across multiple clients is planned:

# On server - points to primary client
SERVICE_EXAMPLE_UPLOAD_URL=http://client-1:9000/submit
SERVICE_EXAMPLE_DOWNLOAD_URL=http://client-1:9000/retrieve

Different Services

Run specialized clients for different workloads:

# Client for general jobs
PORT=9000 job-orchestrator client

# Client for heavy computation (different machine)
PORT=9001 job-orchestrator client

Server configuration:

SERVICE_LIGHT_UPLOAD_URL=http://client-1:9000/submit
SERVICE_HEAVY_UPLOAD_URL=http://client-2:9001/submit

Client Security

Network Access

The client should only be accessible by the orchestrator server:

  • Use internal networks / VPCs
  • Firewall rules to restrict access
  • Never expose client ports to the internet

Execution Sandbox

The client executes arbitrary run.sh scripts. Consider:

  • Running in containers with resource limits
  • Using separate user accounts with minimal permissions
  • Mounting only necessary directories
  • Network isolation if jobs don’t need internet

Docker Resource Limits

services:
  client:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          cpus: '1'
          memory: 1G

Monitoring

Health Check

curl http://localhost:9000/health

Load Check

curl http://localhost:9000/load

Container Logs

docker logs -f client-container

Troubleshooting

Client Not Receiving Jobs

  1. Verify server can reach client URL
  2. Check firewall rules
  3. Verify service configuration on server

Jobs Stuck in Prepared

  1. Check if Runner task is running (look for logs)
  2. Verify run.sh is executable
  3. Check for permission issues in working directory

High Memory Usage

The in-memory database grows with active payloads. If memory is high:

  1. Check for stuck/zombie payloads
  2. Restart the client (safe - server tracks jobs)
  3. Consider more frequent cleanup

See Also

Quota System

The quota system ensures fair resource allocation by limiting concurrent jobs per user per service.

How Quotas Work

User 1 submits 10 jobs for "example" service
Quota: SERVICE_EXAMPLE_RUNS_PER_USER=5

┌─────────────────────────────────────────┐
│ Jobs 1-5:  Dispatched immediately       │
│ Jobs 6-10: Remain queued                │
└─────────────────────────────────────────┘

When Job 1 completes:
┌─────────────────────────────────────────┐
│ Job 6: Now dispatched (slot available)  │
└─────────────────────────────────────────┘

Configuration

Set quotas per service using environment variables:

SERVICE_<NAME>_RUNS_PER_USER=<limit>

Examples

# Allow 5 concurrent jobs per user for "example" service
SERVICE_EXAMPLE_RUNS_PER_USER=5

# Allow 3 concurrent jobs per user for "haddock" service
SERVICE_HADDOCK_RUNS_PER_USER=3

# Allow 10 concurrent jobs per user for "quick" service
SERVICE_QUICK_RUNS_PER_USER=10

Default Value

If not specified, the default quota is 5 concurrent jobs per user per service.

Quota Scope

Quotas are enforced per user, per service:

| User | Service | Quota | Can Submit |
|------|---------|-------|------------|
| user_1 | example | 5 | Up to 5 concurrent |
| user_1 | haddock | 3 | Up to 3 concurrent |
| user_2 | example | 5 | Up to 5 concurrent (independent of user_1) |

Users don’t compete with each other - each user has their own quota allocation.
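The check the Sender applies per (user, service) pair can be sketched as a one-line comparison; this is an illustration of the rule, not the server's actual Rust implementation:

```shell
# Illustrative sketch of the Sender's per-(user, service) quota rule;
# the real check runs in Rust against the SQLite job table.
quota_allows() {
  local running=$1 quota=$2
  [ "$running" -lt "$quota" ]
}

# user_1 has 5/5 "example" jobs running: the 6th stays Queued
quota_allows 5 5 || echo "quota exhausted: job remains Queued"
# user_2 is independent of user_1:
quota_allows 0 5 && echo "slot available: job moves to Processing"
```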

Quota States

Jobs transition through these states relative to quotas:

┌──────────┐     Quota      ┌────────────┐
│  Queued  │ ──Available──▶ │ Processing │
└──────────┘                └────────────┘
     │                            │
     │ Quota Exhausted            │
     ▼                            ▼
┌──────────────────┐      ┌────────────┐
│ Remains Queued   │      │ Submitted  │
│ (waits for slot) │      │ (running)  │
└──────────────────┘      └────────────┘

Choosing Quota Values

Factors to Consider

  1. Job Duration: Longer jobs need lower quotas
  2. Resource Usage: CPU/memory intensive jobs need lower quotas
  3. User Base: More users may need lower per-user quotas
  4. Client Capacity: Match quotas to available compute resources

Guidelines

| Job Type | Suggested Quota |
|----------|-----------------|
| Quick jobs (< 1 min) | 10-20 |
| Medium jobs (1-10 min) | 5-10 |
| Long jobs (10+ min) | 2-5 |
| Resource-intensive | 1-3 |

Example Scenarios

Scientific Computing Platform

# Quick validation jobs - high quota
SERVICE_VALIDATE_RUNS_PER_USER=20

# Standard analysis - medium quota
SERVICE_ANALYZE_RUNS_PER_USER=5

# Heavy simulation - low quota
SERVICE_SIMULATE_RUNS_PER_USER=2

Educational Platform

# Student exercises - moderate quota
SERVICE_EXERCISE_RUNS_PER_USER=3

# Final projects - allow more
SERVICE_PROJECT_RUNS_PER_USER=5

Monitoring Quota Usage

Check Queue Status

Jobs waiting due to quota exhaustion remain in Queued status:

# Check how many jobs are queued vs running
curl http://localhost:5000/swagger-ui/  # Use API explorer

Server Logs

The server logs quota decisions:

INFO: User 1 has 5/5 jobs running for service 'example', job 123 remains queued
INFO: User 1 slot available, dispatching job 123 to service 'example'

Testing Quotas

Submit multiple jobs to observe throttling:

# Submit 10 jobs with quota of 5
for i in {1..10}; do
  echo '#!/bin/bash
sleep 30
echo "Job complete" > output.txt' > run.sh

  curl -s -X POST http://localhost:5000/upload \
    -F "file=@run.sh" \
    -F "user_id=1" \
    -F "service=example" | jq -r '.status'
done

You’ll see:

  • First 5 jobs: Quickly move to Submitted
  • Jobs 6-10: Stay in Queued until slots open

Fair Scheduling

The quota system provides basic fairness:

  • Per-user isolation: One user can’t starve others
  • Per-service isolation: Heavy service usage doesn’t block other services
  • Automatic queuing: No jobs are rejected, just delayed
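The three properties above follow from a FIFO pass over the queue that skips jobs whose owner has no free slot. An illustrative sketch with hypothetical names, not the real dispatcher:

```python
def dispatch_round(queued, running, quotas):
    """One FIFO pass: dispatch every job whose (user, service) has a free slot."""
    dispatched = []
    for job in list(queued):
        key = (job["user_id"], job["service"])
        if running.get(key, 0) < quotas.get(job["service"], 0):
            running[key] = running.get(key, 0) + 1
            queued.remove(job)
            dispatched.append(job)  # moves to Processing
        # otherwise the job stays Queued; later jobs are still considered,
        # so one user's backlog cannot starve other users
    return dispatched
```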

Limitations

Current limitations (improvements planned):

  • No priority queues (FIFO within quota constraints)
  • No global quotas (only per-user)
  • No time-based quotas (e.g., jobs per hour)
  • No burst allowances

See Also

Server API Endpoints

The orchestrator server exposes a REST API for job submission and retrieval.

Base URL

http://localhost:5000

Interactive Documentation

Swagger UI is available at:

http://localhost:5000/swagger-ui/

Endpoints

POST /upload

Submit a new job for processing.

Request

  • Content-Type: multipart/form-data
  • Max size: 400MB

| Field   | Type    | Required | Description                                 |
|---------|---------|----------|---------------------------------------------|
| file    | file    | Yes      | One or more files (repeat for multiple)     |
| user_id | integer | Yes      | User identifier for quota tracking          |
| service | string  | Yes      | Service name (must be configured on server) |

Example

curl -X POST http://localhost:5000/upload \
  -F "file=@run.sh" \
  -F "file=@input.pdb" \
  -F "user_id=1" \
  -F "service=example"

Response

{
  "id": 1,
  "status": "Queued",
  "message": "Job successfully uploaded"
}

Status Codes

| Code | Description                                     |
|------|-------------------------------------------------|
| 201  | Job created successfully                        |
| 400  | Invalid request (missing fields, invalid service) |
| 500  | Server error                                    |

Notes

  • At least one file must be named run.sh
  • The service must match a configured service on the server
  • dest_id is populated after the job is dispatched to a client
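These requirements can be checked client-side before submitting. A pre-flight sketch; the server remains the authority on what it accepts:

```python
def validate_submission(filenames, user_id, service, configured_services):
    """Mirror the documented /upload requirements; return a list of problems."""
    errors = []
    if "run.sh" not in filenames:
        errors.append("at least one file must be named run.sh")
    if not isinstance(user_id, int):
        errors.append("user_id must be an integer")
    if service not in configured_services:
        errors.append(f"service '{service}' is not configured on the server")
    return errors
```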

GET /download/{id}

Check job status or download results.

Parameters

| Parameter | Type    | Description                 |
|-----------|---------|-----------------------------|
| id        | integer | Job ID from upload response |

Example

# Check status (returns JSON when not completed)
curl http://localhost:5000/download/1

# Download results (returns ZIP when completed)
curl -o results.zip http://localhost:5000/download/1

Response

When the job is not yet completed, returns a JSON body:

{
  "id": 1,
  "status": "Submitted",
  "message": ""
}

When the job is completed, returns:

  • Content-Type: application/zip
  • Body: ZIP archive containing all result files

Status Codes

| Code | Description                                       |
|------|---------------------------------------------------|
| 200  | JSON status body or ZIP file (check Content-Type) |
| 404  | Job not found                                     |
| 500  | Server error                                      |

Usage Pattern

Poll until status is Completed, then save the ZIP:

while true; do
  response=$(curl -s http://localhost:5000/download/1)
  status=$(echo "$response" | jq -r '.status // empty')
  if [ -z "$status" ]; then
    # No JSON status field means we got the ZIP
    curl -o results.zip http://localhost:5000/download/1
    break
  elif [ "$status" = "Completed" ]; then
    curl -o results.zip http://localhost:5000/download/1
    break
  else
    echo "Status: $status"
    sleep 5
  fi
done
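The branch in the loop above hinges on whether the response is JSON or a ZIP. The same decision expressed as a pure function (illustrative):

```python
import json

def next_action(content_type, body):
    """Decide how to handle a /download/{id} response.

    The endpoint returns a JSON status while the job is pending and a
    ZIP archive once it has completed.
    """
    if content_type == "application/zip":
        return "save"
    status = json.loads(body).get("status", "")
    return "save" if status == "Completed" else "poll"
```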

GET /health

Health check endpoint.

Example

curl http://localhost:5000/health

Response

{
  "status": "healthy"
}

Status Codes

| Code | Description         |
|------|---------------------|
| 200  | Server is healthy   |
| 500  | Server is unhealthy |

GET /

Ping endpoint for basic connectivity check.

Example

curl http://localhost:5000/

Response

Simple acknowledgment that the server is running.


GET /swagger-ui/

Interactive API documentation.

Example

Open in browser:

http://localhost:5000/swagger-ui/

Provides:

  • Interactive API explorer
  • Request/response schemas
  • Try-it-out functionality

Error Responses

All error responses follow this format:

{
  "id": 0,
  "status": "Unknown",
  "message": "Description of the error"
}

Rate Limiting

The server does not implement rate limiting directly. Use a reverse proxy (nginx, traefik) for rate limiting in production.

Authentication

The server does not implement authentication directly. The user_id field is trusted as provided. Implement authentication at the reverse proxy layer or in your application.

See Also

Client API Endpoints

The client exposes endpoints for the orchestrator server to submit jobs and retrieve results.

Note: These endpoints are typically only accessed by the orchestrator server, not by end users.

Base URL

http://localhost:9000

Endpoints

POST /submit

Receive a job payload from the orchestrator server.

Request

  • Content-Type: multipart/form-data
| Field | Type | Required | Description           |
|-------|------|----------|-----------------------|
| file  | file | Yes      | One or more job files |

Example

curl -X POST http://localhost:9000/submit \
  -F "file=@run.sh" \
  -F "file=@input.pdb"

Response

{
  "id": 1,
  "status": "Prepared",
  "loc": "/opt/data/abc123-def456"
}

Status Codes

| Code | Description                   |
|------|-------------------------------|
| 200  | Payload received successfully |
| 500  | Server error                  |

Notes

  • The client stores files and creates a payload record
  • Status starts as Prepared, waiting for the Runner task
  • The id is returned to the server and stored as dest_id

GET /retrieve/{id}

Retrieve results of a completed payload.

Parameters

| Parameter | Type    | Description                     |
|-----------|---------|---------------------------------|
| id        | integer | Payload ID from submit response |

Example

curl -o results.zip http://localhost:9000/retrieve/1

Response

When the payload is not yet completed, returns a JSON body:

{
  "id": 1,
  "status": "Running",
  "loc": "/opt/data/abc123-def456"
}

When the payload is completed, returns:

  • Content-Type: application/zip
  • Body: ZIP archive of all files in the payload directory

Status Codes

| Code | Description                                        |
|------|----------------------------------------------------|
| 200  | JSON payload status or ZIP file (check Content-Type) |
| 404  | Payload not found                                  |
| 500  | Server error                                       |

Notes

  • The ZIP includes all files in the working directory after run.sh execution
  • Original input files are included unless deleted by run.sh
  • After successful retrieval, the payload may be cleaned up

GET /load

Report current CPU usage.

Example

curl http://localhost:9000/load

Response

45.2

Returns a float representing CPU usage percentage (0-100).

Use Cases

  • Load-aware job distribution (planned feature)
  • Monitoring client health
  • Capacity planning
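For the planned load-aware distribution, the reported float is enough to pick the least-loaded client. A minimal sketch, assuming each client's /load value has already been fetched:

```python
def pick_client(loads):
    """Pick the client URL with the lowest reported CPU usage.

    `loads` maps client base URL -> float from GET /load.
    """
    if not loads:
        raise ValueError("no clients available")
    return min(loads, key=loads.get)
```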

GET /health

Health check endpoint.

Example

curl http://localhost:9000/health

Response

{
  "status": "healthy"
}

GET /

Ping endpoint for basic connectivity check.

Example

curl http://localhost:9000/

Payload States

Payloads on the client go through these states:

| State     | Description                                 |
|-----------|---------------------------------------------|
| Prepared  | Received from server, waiting for execution |
| Running   | Currently executing run.sh                  |
| Completed | Execution finished successfully             |
| Failed    | Execution failed (non-zero exit code)       |
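The table implies a small set of legal transitions. A sketch of that state machine (not the client's actual code):

```python
TRANSITIONS = {
    "Prepared": {"Running"},
    "Running": {"Completed", "Failed"},
}

def can_transition(current, target):
    """True if a payload may legally move from `current` to `target`."""
    return target in TRANSITIONS.get(current, set())
```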

Security Considerations

The client API should never be exposed to the public internet:

  • No authentication is implemented
  • Arbitrary code execution via run.sh
  • Internal service communication only

Recommendations:

  • Use internal networks / VPCs
  • Firewall rules: allow only orchestrator server IP
  • Docker networks with no external exposure

See Also

Docker Deployment

Docker is the recommended way to deploy job-orchestrator in production.

Quick Start

git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
docker compose up --build

This starts:

  • Server on port 5000
  • Example client on port 9000

Docker Images

Official Image

docker pull ghcr.io/rvhonorato/job-orchestrator:latest

Build Locally

docker build -t job-orchestrator .

Docker Compose

Basic Setup

version: '3.8'

services:
  server:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: server
    ports:
      - "5000:5000"
    environment:
      PORT: 5000
      DB_PATH: /opt/data/db.sqlite
      DATA_PATH: /opt/data
      MAX_AGE: 172800
      SERVICE_EXAMPLE_UPLOAD_URL: http://client:9000/submit
      SERVICE_EXAMPLE_DOWNLOAD_URL: http://client:9000/retrieve
      SERVICE_EXAMPLE_RUNS_PER_USER: 5
    volumes:
      - server-data:/opt/data
    depends_on:
      - client

  client:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    environment:
      PORT: 9000
    volumes:
      - client-data:/opt/data

volumes:
  server-data:
  client-data:

Production Setup

See Production Deployment - Container Hardening for details on each security option.

Multiple Clients

Scaling Horizontally

services:
  server:
    # ... server config ...
    environment:
      SERVICE_EXAMPLE_UPLOAD_URL: http://client-1:9000/submit
      SERVICE_EXAMPLE_DOWNLOAD_URL: http://client-1:9000/retrieve

  client-1:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    environment:
      PORT: 9000

  client-2:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    environment:
      PORT: 9000

  client-3:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    environment:
      PORT: 9000

Multiple Services

services:
  server:
    environment:
      # Light jobs
      SERVICE_LIGHT_UPLOAD_URL: http://client-light:9000/submit
      SERVICE_LIGHT_DOWNLOAD_URL: http://client-light:9000/retrieve
      SERVICE_LIGHT_RUNS_PER_USER: 10

      # Heavy jobs
      SERVICE_HEAVY_UPLOAD_URL: http://client-heavy:9000/submit
      SERVICE_HEAVY_DOWNLOAD_URL: http://client-heavy:9000/retrieve
      SERVICE_HEAVY_RUNS_PER_USER: 2

  client-light:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G

  client-heavy:
    image: ghcr.io/rvhonorato/job-orchestrator:latest
    command: client
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 16G

Volume Management

Persistent Storage

Always use named volumes for production:

volumes:
  server-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/job-orchestrator/server

  client-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/job-orchestrator/client

Backup Strategy

# Stop services (optional, for consistent backup)
docker compose stop

# Backup server data
tar -czf backup-$(date +%Y%m%d).tar.gz /data/job-orchestrator/server

# Resume services
docker compose start

Networking

Internal Network

Keep client internal:

services:
  server:
    ports:
      - "5000:5000"  # Exposed to host
    networks:
      - internal
      - external

  client:
    # No ports exposed to host
    networks:
      - internal

networks:
  internal:
    internal: true
  external:

With Reverse Proxy

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - server

  server:
    # No ports exposed, accessed via nginx
    networks:
      - internal

Logging

View Logs

# All services
docker compose logs -f

# Server only
docker compose logs -f server

# Last 100 lines
docker compose logs --tail 100 server

Log Rotation

services:
  server:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

Commands

# Start
docker compose up -d

# Stop
docker compose down

# Restart
docker compose restart

# Rebuild and start
docker compose up --build -d

# View status
docker compose ps

# Shell into container
docker compose exec server /bin/sh

See Also

Production Deployment

This guide covers best practices for deploying job-orchestrator in production environments.

Architecture Recommendations

Minimum Setup

┌─────────────┐     ┌─────────────┐
│   Server    │────▶│   Client    │
│ (1 instance)│     │ (1 instance)│
└─────────────┘     └─────────────┘

Scaled Setup

                    ┌──────────────┐
                    │ Load Balancer│
                    │   (nginx)    │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │    Server    │
                    │  (1 instance)│
                    └──────┬───────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
   ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐
   │  Client 1   │  │  Client 2   │  │  Client 3   │
   └─────────────┘  └─────────────┘  └─────────────┘

Security

Script Validation

The client includes a built-in script validator that rejects run.sh scripts containing obviously dangerous patterns before execution. This covers destructive commands (rm -rf /, mkfs), network exfiltration tools (curl, wget, socat), reverse shells (/dev/tcp/), privilege escalation (sudo, chmod +s), container escapes (nsenter, docker), obfuscated execution (base64 | bash, python -c), persistence mechanisms (crontab, systemctl), crypto miners, and environment secret access.

This is a sanity check, not a sandbox. It can be bypassed by determined actors. Input scripts are still expected to come from trusted or semi-trusted sources. True isolation must be enforced at the deployment level using the container hardening measures below.
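Conceptually, the validator is a pattern scan over the script text. An illustrative subset follows; the real rule list lives in the source and is more extensive:

```python
import re

# A few of the pattern classes mentioned above; illustrative only
DANGEROUS_PATTERNS = [
    r"rm\s+-rf\s+/",           # destructive commands
    r"/dev/tcp/",              # reverse shells
    r"\bsudo\b",               # privilege escalation
    r"base64\s*\|\s*(ba)?sh",  # obfuscated execution
]

def looks_dangerous(script):
    """Reject scripts matching any known-bad pattern (sanity check only)."""
    return any(re.search(p, script) for p in DANGEROUS_PATTERNS)
```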

Container Hardening

The client executes user-submitted scripts with the full privileges of the process. Apply all of the following to limit blast radius:

| Measure               | Docker Compose                            | Purpose                                   |
|-----------------------|-------------------------------------------|-------------------------------------------|
| Read-only rootfs      | `read_only: true`                         | Prevent filesystem tampering              |
| Drop all capabilities | `cap_drop: [ALL]`                         | Remove kernel-level privileges            |
| No new privileges     | `security_opt: [no-new-privileges:true]`  | Block setuid/setgid escalation            |
| CPU limit             | `deploy.resources.limits.cpus`            | Prevent CPU starvation                    |
| Memory limit          | `deploy.resources.limits.memory`          | Prevent OOM on host                       |
| PIDs limit            | `deploy.resources.limits.pids`            | Prevent fork bombs                        |
| Internal network      | `networks: [internal]`                    | Block outbound internet access            |
| Writable tmpfs        | `tmpfs: [/tmp]`                           | Provide scratch space on read-only rootfs |

Example (applied to the client service):

services:
  client:
    read_only: true
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    tmpfs:
      - /tmp
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 2G
          pids: 256
    networks:
      - internal

networks:
  internal:
    internal: true

Future improvement: Run the container as a non-root user (USER appuser in the Dockerfile). This requires migrating ownership of existing volumes first – see the TODO in the Dockerfile.

Network Security

  1. Never expose clients to the internet

    • Clients execute user-submitted scripts
    • Use internal networks only
    • Block all outbound access from client containers
  2. Use a reverse proxy

    • TLS termination
    • Rate limiting
    • Request filtering
  3. Firewall rules

    # Allow only orchestrator server to reach clients
    iptables -A INPUT -p tcp --dport 9000 -s <server-ip> -j ACCEPT
    iptables -A INPUT -p tcp --dport 9000 -j DROP
    

Reverse Proxy (nginx)

# Rate limiting zones must be declared in the http context,
# outside any server block
limit_req_zone $binary_remote_addr zone=upload:10m rate=10r/s;

upstream orchestrator {
    server 127.0.0.1:5000;
}

server {
    listen 443 ssl http2;
    server_name jobs.example.com;

    ssl_certificate /etc/nginx/certs/cert.pem;
    ssl_certificate_key /etc/nginx/certs/key.pem;

    location /upload {
        limit_req zone=upload burst=20 nodelay;
        client_max_body_size 400M;
        proxy_pass http://orchestrator;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /download {
        proxy_pass http://orchestrator;
        proxy_set_header Host $host;
    }

    location /health {
        proxy_pass http://orchestrator;
    }

    # Block swagger in production (optional)
    location /swagger-ui {
        deny all;
    }
}

Authentication

job-orchestrator does not implement authentication. Options:

  1. Reverse proxy authentication

    location / {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://orchestrator;
    }
    
  2. Application-level authentication

    • Wrap the API in your application
    • Validate users before calling job-orchestrator
  3. OAuth2 Proxy

    • Use oauth2-proxy in front of the service
    • Integrates with identity providers

Resource Planning

Server Requirements

| Load Level                 | CPU     | Memory | Storage |
|----------------------------|---------|--------|---------|
| Light (< 100 jobs/day)     | 1 core  | 512MB  | 10GB    |
| Medium (100-1000 jobs/day) | 2 cores | 1GB    | 50GB    |
| Heavy (> 1000 jobs/day)    | 4 cores | 2GB    | 100GB+  |

Storage depends heavily on job file sizes and retention period.

Client Requirements

Depends entirely on your job workloads:

| Job Type             | CPU            | Memory |
|----------------------|----------------|--------|
| Text processing      | 1 core         | 512MB  |
| Scientific computing | 4-8 cores      | 8-16GB |
| ML/Deep learning     | 8+ cores + GPU | 32GB+  |

Storage Calculation

Storage = (avg_job_size) × (jobs_per_day) × (retention_days)

Example:
- 10MB average job
- 500 jobs/day
- 2 day retention
= 10MB × 500 × 2 = 10GB
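The formula as a one-line helper (using 1 GB = 1000 MB, matching the rounding in the example):

```python
def storage_gb(avg_job_mb, jobs_per_day, retention_days):
    """Estimate server storage needs from the sizing formula above."""
    return avg_job_mb * jobs_per_day * retention_days / 1000
```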

Monitoring

Health Checks

# Server health
curl -f http://localhost:5000/health

# Client health
curl -f http://localhost:9000/health

# Client load
curl http://localhost:9000/load

Prometheus Metrics (External)

Use a sidecar or external monitoring:

services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # Monitor container metrics
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

Log Aggregation

services:
  server:
    logging:
      driver: "fluentd"
      options:
        fluentd-address: "localhost:24224"
        tag: "job-orchestrator.server"

Backup & Recovery

What to Backup

  1. Server database (DB_PATH)

    • Contains job history and status
    • Critical for job tracking
  2. Server data directory (DATA_PATH)

    • Contains job files and results
    • Large, may use incremental backups

Backup Script

#!/bin/bash
BACKUP_DIR=/backups/job-orchestrator
DATE=$(date +%Y%m%d_%H%M%S)

# Backup database
sqlite3 /opt/data/db.sqlite ".backup '${BACKUP_DIR}/db_${DATE}.sqlite'"

# Backup data (incremental with rsync)
rsync -av --delete /opt/data/ ${BACKUP_DIR}/data/

# Cleanup old backups (keep 7 days)
find ${BACKUP_DIR} -name "db_*.sqlite" -mtime +7 -delete

Recovery

# Stop server
docker compose stop server

# Restore database
cp /backups/job-orchestrator/db_latest.sqlite /opt/data/db.sqlite

# Restore data
rsync -av /backups/job-orchestrator/data/ /opt/data/

# Start server
docker compose start server

High Availability

Current Limitations

  • Single server architecture
  • No built-in clustering
  • SQLite doesn’t support concurrent writes

Workarounds

  1. Quick recovery

    • Automated health checks
    • Container auto-restart
    • Fast backup restoration
  2. Stateless clients

    • Clients can be restarted freely
    • Jobs are tracked by server
  3. Future improvements

    • PostgreSQL support (planned)
    • Server clustering (planned)

Maintenance

Database Maintenance

# Vacuum database (reclaim space)
sqlite3 /opt/data/db.sqlite "VACUUM;"

# Check integrity
sqlite3 /opt/data/db.sqlite "PRAGMA integrity_check;"

Log Rotation

Ensure logs don’t fill disk:

services:
  server:
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"

Updates

# Pull latest image
docker pull ghcr.io/rvhonorato/job-orchestrator:latest

# Recreate containers
docker compose up -d

Troubleshooting

See Troubleshooting Guide for common issues.

See Also

Building from Source

This guide covers building job-orchestrator from source code.

Prerequisites

Rust Toolchain

Install Rust via rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Minimum version: Rust 1.75 (edition 2021)

Verify installation:

rustc --version
cargo --version

System Dependencies

Debian/Ubuntu

apt-get update
apt-get install -y build-essential libsqlite3-dev pkg-config

Fedora/RHEL

dnf install gcc sqlite-devel

macOS

brew install sqlite

Windows

Install Visual Studio Build Tools and SQLite development libraries.

Clone Repository

git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator

Build Commands

Debug Build

Fast compilation, includes debug symbols:

cargo build

Binary location: target/debug/job-orchestrator

Release Build

Optimized for performance:

cargo build --release

Binary location: target/release/job-orchestrator

Check (No Build)

Verify code compiles without producing binary:

cargo check

Running

From Cargo

# Server mode
cargo run -- server --port 5000

# Client mode
cargo run -- client --port 9000

From Binary

# After release build
./target/release/job-orchestrator server --port 5000

Build Options

Features

Currently no optional features. All functionality is included by default.

Target Platforms

Cross-compile for different targets:

# Add target
rustup target add x86_64-unknown-linux-musl

# Build for target
cargo build --release --target x86_64-unknown-linux-musl

Common targets:

  • x86_64-unknown-linux-gnu - Linux (glibc)
  • x86_64-unknown-linux-musl - Linux (static)
  • x86_64-apple-darwin - macOS Intel
  • aarch64-apple-darwin - macOS Apple Silicon
  • x86_64-pc-windows-msvc - Windows

Docker Build

Using Dockerfile

docker build -t job-orchestrator .

Multi-stage Build

The Dockerfile uses multi-stage builds for smaller images:

  1. Builder stage: Compiles with full toolchain
  2. Runtime stage: Minimal image with just the binary

See Also

Testing

This guide covers running and writing tests for job-orchestrator.

Running Tests

All Tests

cargo test

With Output

See println! output from tests:

cargo test -- --nocapture

Specific Test

# By name
cargo test test_upload

# By module
cargo test orchestrator::tests

Ignored Tests

Some tests may be ignored by default (slow, require setup):

cargo test -- --ignored

Test Coverage

Using cargo-tarpaulin

Install:

cargo install cargo-tarpaulin

Generate coverage:

# HTML report
cargo tarpaulin --out Html --output-dir ./coverage

# XML report (for CI)
cargo tarpaulin --out Xml --output-dir ./coverage

View report:

open coverage/tarpaulin-report.html

Test Structure

Tests are organized alongside the code they test:

src/
├── lib.rs
├── orchestrator/
│   ├── mod.rs
│   └── tests.rs      # Orchestrator tests
├── client/
│   ├── mod.rs
│   └── tests.rs      # Client tests
└── utils/
    └── mod.rs        # Inline tests with #[cfg(test)]

Writing Tests

Unit Tests

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_something() {
        let result = function_under_test();
        assert_eq!(result, expected_value);
    }
}

Async Tests

Use tokio::test for async functions:

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_async_function() {
        let result = async_function().await;
        assert!(result.is_ok());
    }
}

Integration Tests

Create files in tests/ directory:

// tests/integration_test.rs
use job_orchestrator::*;

#[tokio::test]
async fn test_full_workflow() {
    // Setup
    // Test
    // Verify
}

Mocking

Using mockall

For trait-based mocking:

use mockall::{automock, predicate::eq};

#[automock]
trait Database {
    fn get(&self, id: i32) -> Option<Job>;
}

#[test]
fn test_with_mock() {
    let mut mock = MockDatabase::new();
    mock.expect_get()
        .with(eq(1))
        .returning(|_| Some(Job::default()));

    // Use mock in test
}

Using mockito

For HTTP mocking:

use mockito::Server;

#[tokio::test]
async fn test_http_client() {
    let mut server = Server::new_async().await;
    let mock = server.mock("GET", "/health")
        .with_status(200)
        .create_async()
        .await;

    // Test against server.url()

    mock.assert_async().await;
}

Test Utilities

Test Fixtures

Create reusable test data:

#[cfg(test)]
mod test_utils {
    pub fn create_test_job() -> Job {
        Job {
            id: 1,
            user_id: 1,
            service: "test".to_string(),
            status: Status::Queued,
            ..Default::default()
        }
    }
}

Temporary Directories

Use tempfile for temporary test directories:

use tempfile::TempDir;

#[test]
fn test_file_operations() {
    let temp_dir = TempDir::new().unwrap();
    let file_path = temp_dir.path().join("test.txt");

    // Test file operations

    // TempDir is automatically cleaned up
}

CI Testing

Tests run automatically on GitHub Actions:

# .github/workflows/ci.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - run: cargo test

Linting

Clippy

Run Clippy for additional checks:

cargo clippy -- -D warnings

Common fixes:

  • #[allow(clippy::lint_name)] to suppress specific lints
  • Configure in clippy.toml or Cargo.toml

Formatting

Check formatting:

cargo fmt -- --check

Fix formatting:

cargo fmt

Debugging Tests

With println

#[test]
fn test_debug() {
    let value = compute_something();
    println!("Debug value: {:?}", value);  // Use --nocapture to see
    assert!(value.is_valid());
}

With RUST_BACKTRACE

RUST_BACKTRACE=1 cargo test

See Also

Contributing

Contributions to job-orchestrator are welcome! This guide explains how to contribute.

Ways to Contribute

  • Bug reports: Found a bug? Open an issue
  • Feature requests: Have an idea? Open an issue to discuss
  • Code contributions: Fix bugs or implement features
  • Documentation: Improve docs, fix typos, add examples
  • Testing: Add tests, report edge cases

Getting Started

1. Fork the Repository

Click “Fork” on GitHub, then clone your fork:

git clone https://github.com/YOUR_USERNAME/job-orchestrator.git
cd job-orchestrator

2. Set Up Development Environment

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install dependencies (Debian/Ubuntu)
apt-get install libsqlite3-dev

# Build
cargo build

# Run tests
cargo test

3. Create a Branch

git checkout -b feature/your-feature-name
# or
git checkout -b fix/your-bug-fix

Development Workflow

Making Changes

  1. Write code
  2. Add tests for new functionality
  3. Run tests: cargo test
  4. Run linter: cargo clippy -- -D warnings
  5. Format code: cargo fmt

Commit Messages

Follow conventional commits:

type(scope): description

[optional body]

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation
  • refactor: Code refactoring
  • test: Adding tests
  • chore: Maintenance

Examples:

feat(quota): add per-service quota limits
fix(client): handle non-zero exit codes correctly
docs(readme): update quick start instructions

Pull Request Process

  1. Push your branch:

    git push origin feature/your-feature-name
    
  2. Open a Pull Request on GitHub

  3. Fill in the PR template:

    • Describe your changes
    • Link related issues
    • Note any breaking changes
  4. Wait for review

    • CI must pass
    • Maintainer will review
    • Address feedback
  5. Merge!

Code Style

Rust Style

Follow standard Rust conventions:

  • Use cargo fmt for formatting
  • Use cargo clippy for linting
  • Prefer descriptive variable names
  • Add doc comments for public APIs

/// Uploads a job to the orchestrator.
///
/// # Arguments
///
/// * `files` - Files to upload
/// * `user_id` - User submitting the job
/// * `service` - Target service name
///
/// # Returns
///
/// The created job with its ID and initial status.
pub async fn upload_job(
    files: Vec<File>,
    user_id: i32,
    service: String,
) -> Result<Job, Error> {
    // Implementation
}

Error Handling

  • Use Result types, not panics
  • Provide context with error messages
  • Use ? operator for propagation

// Good
let file = File::open(path)
    .map_err(|e| Error::FileOpen { path: path.clone(), source: e })?;

// Avoid
let file = File::open(path).unwrap();

Testing Guidelines

Write Tests For

  • New functionality
  • Bug fixes (regression tests)
  • Edge cases

Test Structure

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_function_describes_behavior() {
        // Arrange
        let input = create_test_input();

        // Act
        let result = function_under_test(input);

        // Assert
        assert_eq!(result, expected);
    }
}

Async Tests

#[tokio::test]
async fn test_async_operation() {
    let result = async_function().await;
    assert!(result.is_ok());
}

Documentation

Code Documentation

Add doc comments for:

  • Public functions
  • Public structs
  • Public modules

/// A job represents a unit of work to be processed.
pub struct Job {
    /// Unique identifier for the job.
    pub id: i32,
    /// User who submitted the job.
    pub user_id: i32,
    // ...
}

mdbook Documentation

Documentation is in docs/src/. To preview:

# Install mdbook
cargo install mdbook

# Serve locally
cd docs
mdbook serve --open

Reporting Issues

Bug Reports

Include:

  • job-orchestrator version
  • Operating system
  • Steps to reproduce
  • Expected vs actual behavior
  • Logs if relevant

Feature Requests

Include:

  • Use case description
  • Proposed solution (if any)
  • Alternatives considered

Code of Conduct

Be respectful and constructive. We’re all here to build something useful together.

Questions?

License

Contributions are licensed under the MIT License.

Troubleshooting

Common issues and solutions for job-orchestrator.

Server Issues

Server Won’t Start

Symptom: Server fails to start, exits immediately

Possible Causes:

  1. Port already in use

    Error: Address already in use
    

    Solution:

    # Find process using the port
    lsof -i :5000
    # Kill it or use a different port
    PORT=5001 job-orchestrator server
    
  2. Database path not writable

    Error: unable to open database file
    

    Solution:

    # Check directory exists and is writable
    mkdir -p /opt/data
    chmod 755 /opt/data
    
  3. Missing service configuration

    Error: No services configured
    

    Solution: Configure at least one service:

    export SERVICE_EXAMPLE_UPLOAD_URL=http://client:9000/submit
    export SERVICE_EXAMPLE_DOWNLOAD_URL=http://client:9000/retrieve
    

Jobs Stuck in Queued

Symptom: Jobs stay in Queued status indefinitely

Possible Causes:

  1. Quota exhausted

    Check if user has reached their limit:

    • Default quota is 5 concurrent jobs per user per service
    • Wait for running jobs to complete, or increase quota
  2. Client unreachable

    Verify client connectivity:

    curl http://client:9000/health
    
  3. Service misconfigured

    Verify service URLs are correct:

    echo $SERVICE_EXAMPLE_UPLOAD_URL
    curl -X POST $SERVICE_EXAMPLE_UPLOAD_URL  # Should return error, not timeout
    

Jobs Stuck in Submitted

Symptom: Jobs move to Submitted but never complete

Possible Causes:

  1. Client not executing jobs

    Check client logs for errors

    docker logs client
    
  2. run.sh hanging

    Your script may be waiting for input or stuck in a loop

  3. Getter task not running

    Server may need restart

Upload Fails with 400

Symptom: POST /upload returns 400 Bad Request

Possible Causes:

  1. Missing required fields

    # Ensure all fields are provided (user_id and service are required)
    curl -X POST http://localhost:5000/upload \
      -F "file=@run.sh" \
      -F "user_id=1" \
      -F "service=example"
    
  2. Unknown service

    Service must be configured on server:

    export SERVICE_EXAMPLE_UPLOAD_URL=...
    
  3. File too large

    Default limit is 400MB. Check file sizes.

Client Issues

Client Not Receiving Jobs

Symptom: Client running but no jobs arrive

Check:

  1. Network connectivity

    # From server, can you reach client?
    curl http://client:9000/health
    
  2. Firewall rules

    # Client port must be accessible from server
    iptables -L -n | grep 9000
    
  3. Docker networking

    # Containers must be on same network
    docker network inspect job-orchestrator_default
    

Jobs Stuck in Prepared

Symptom: Payloads stay in Prepared status

Possible Causes:

  1. Runner task not running

    Check the client logs; the client may need a restart

  2. run.sh not found or not executable

    Ensure the script exists and is executable:

    # In your upload
    chmod +x run.sh
    
  3. Permission issues

    Client working directory may have permission issues

Execution Fails

Symptom: Jobs complete but with Failed status

Check:

  1. Exit code

    run.sh must exit with code 0 for success:

    #!/bin/bash
    # Your commands here
    exit 0  # Explicit success
    
  2. Script errors

    Check output files for error messages

  3. Missing dependencies

    Your script may need tools not available in the container
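The three checks above can be baked into the script itself. A minimal defensive run.sh template (the tool names in the loop are examples; substitute your script's actual dependencies):

```shell
#!/bin/bash
set -euo pipefail   # exit non-zero on any error, unset variable, or pipe failure

# Fail early with a clear message if a required tool is missing in the container
for tool in curl python3; do   # example dependencies
  command -v "$tool" >/dev/null 2>&1 || { echo "missing dependency: $tool" >&2; exit 1; }
done

# ... your actual commands here ...

exit 0   # explicit success
```

With set -e in place, any failing command fails the job immediately instead of leaving a misleading exit code 0.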

Database Issues

Database Locked

Symptom: “database is locked” errors

Cause: multiple processes accessing the SQLite database file concurrently

Solution:

  • Ensure only one server instance runs
  • Check for zombie processes
  • Restart server

Database Corrupted

Symptom: Strange errors, missing data

Solution:

  1. Stop server

  2. Backup current database

  3. Run integrity check:

    sqlite3 db.sqlite "PRAGMA integrity_check;"
    
  4. If corrupted, restore from backup or delete and restart
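For step 2, SQLite's online backup command is safer than a raw file copy because it produces a consistent snapshot even if the file is mid-write. A sketch, assuming the database lives at db.sqlite:

```shell
# Create a consistent copy with the sqlite3 CLI's online backup
sqlite3 db.sqlite ".backup db-backup.sqlite"

# Verify the copy before trusting it; prints "ok" when healthy
sqlite3 db-backup.sqlite "PRAGMA integrity_check;"
```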

Out of Disk Space

Symptom: “disk full” errors

Solution:

  1. Check disk usage:

    df -h
    
  2. Clean old jobs:

    # Reduce MAX_AGE and restart
    export MAX_AGE=3600  # 1 hour
    
  3. Manually clean data directory
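Step 3 can be done safely with find: list candidates first, then delete. The /opt/data path and 7-day cutoff are examples; adjust to your setup:

```shell
# Dry run: show top-level entries under /opt/data older than 7 days
find /opt/data -mindepth 1 -maxdepth 1 -mtime +7 -print

# Once the list looks right, uncomment to actually delete them:
# find /opt/data -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} +
```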

Docker Issues

Container Exits Immediately

Check logs:

docker logs container_name

Common causes:

  • Missing environment variables
  • Port conflicts
  • Permission issues

Cannot Connect Between Containers

Ensure same network:

services:
  server:
    networks:
      - app-network
  client:
    networks:
      - app-network

networks:
  app-network:

Use service names, not localhost:

# Wrong
SERVICE_EXAMPLE_UPLOAD_URL=http://localhost:9000/submit

# Correct
SERVICE_EXAMPLE_UPLOAD_URL=http://client:9000/submit

Volume Permission Issues

Symptom: Permission denied when writing to volumes

Solution:

services:
  server:
    user: "1000:1000"  # Match host user
    volumes:
      - ./data:/opt/data

Or fix permissions:

sudo chown -R 1000:1000 ./data

Performance Issues

Slow Job Processing

Possible Causes:

  1. Slow database

    • Use SSD storage for database
    • Run VACUUM periodically
  2. Network latency

    • Place server and clients on same network
    • Check for packet loss
  3. Client overloaded

    • Add more clients
    • Reduce RUNS_PER_USER
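The VACUUM step mentioned above can be run directly against the server's SQLite file. The path is an example; stop the server first so the file is not locked:

```shell
# Rebuild the database file, reclaiming free pages and reducing fragmentation
sqlite3 /opt/data/db.sqlite "VACUUM;"
```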

High Memory Usage

Server:

  • Memory grows with job count
  • Clean old jobs with lower MAX_AGE

Client:

  • In-memory database grows with payloads
  • Restart client to clear

Disk Usage Growing

Check:

du -sh /opt/data/*

Solutions:

  • Reduce MAX_AGE
  • Increase cleanup frequency
  • Archive old results externally
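Archiving and cleanup can be combined into one pipeline. A sketch assuming GNU tar; the /opt/data path and 7-day cutoff are examples:

```shell
# Bundle entries older than 7 days into a dated archive; review it, then delete
find /opt/data -mindepth 1 -maxdepth 1 -mtime +7 -print0 \
  | tar -czf "jobs-archive-$(date +%F).tar.gz" --null --files-from=-
```

The -print0 / --null pairing keeps the pipeline safe for paths containing spaces or newlines.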

Getting Help

If you can’t resolve an issue:

  1. Check logs for specific error messages
  2. Search existing issues: GitHub Issues
  3. Open new issue with:
    • Version
    • Configuration
    • Steps to reproduce
    • Logs

See Also