Introduction
job-orchestrator is an asynchronous job orchestration system that manages and distributes computational workloads across heterogeneous computing resources, using quota-based load balancing to keep resource allocation fair.
What is job-orchestrator?
job-orchestrator is a central component of WeNMR, a worldwide e-Infrastructure for structural biology operated by the BonvinLab at Utrecht University. It serves as a reactive middleware layer that connects web applications to diverse computing resources, enabling efficient job distribution for scientific computing workflows.
Key Features
- Asynchronous Job Management: Built with Rust and Tokio for high-performance async operations
- Quota-Based Load Balancing: Per-user, per-service quotas prevent resource exhaustion
- Dual-Mode Architecture: Runs as server (job orchestration) or client (job execution)
- Multiple Backend Support: Extensible to integrate with various computing resources:
  - Native client mode for local job execution
  - DIRAC Interware (planned)
  - SLURM clusters (planned)
  - Educational cloud services (planned)
- RESTful API: Simple HTTP interface for job submission and retrieval
- Automatic Cleanup: Configurable retention policies for completed jobs
Use Cases
job-orchestrator is designed for scenarios requiring:
- Scientific Computing Workflows: Distribute computational biology/chemistry jobs across clusters
- Multi-Tenant Systems: Fair resource allocation with per-user quotas
- Heterogeneous Computing: Route jobs to appropriate backends (local, HPC, cloud)
- Web-Based Science Platforms: Decouple frontend from compute infrastructure
- Batch Processing: Handle high-throughput job submissions with automatic queuing
Project Status
Current State: Production-ready with server/client architecture
Planned Features:
- Auto-Scaling: Dynamic creation and termination of cloud-based client instances based on workload
- DIRAC Interware integration
- SLURM direct integration
- Enhanced monitoring and metrics
- Job priority queues
- Advanced scheduling policies
Getting Help
- API Documentation: Available via Swagger UI at /swagger-ui/ when running
- Issues: GitHub Issues
- Email: Rodrigo V. Honorato rvhonorato@protonmail.com
License
MIT License - see LICENSE for details.
Installation
There are several ways to install job-orchestrator depending on your needs.
From crates.io
The easiest way to install job-orchestrator is via Cargo:
cargo install job-orchestrator
From Source
Clone the repository and build:
git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
cargo build --release
The binary will be available at target/release/job-orchestrator.
Using Docker
Pull the pre-built image:
docker pull ghcr.io/rvhonorato/job-orchestrator:latest
Or build locally:
docker build -t job-orchestrator .
Prerequisites
For Building from Source
- Rust: 1.75 or later (edition 2021)
- SQLite: Development libraries
On Debian/Ubuntu:
apt-get install libsqlite3-dev
On macOS:
brew install sqlite
For Running
- SQLite: Runtime library (usually included in most systems)
- Filesystem access: Write permissions for database and job storage directories
Verifying Installation
After installation, verify it works:
job-orchestrator --version
You should see the version number displayed.
Next Steps
- Quick Start - Get running with Docker Compose
- Your First Job - Submit and retrieve a job
Quick Start
The fastest way to get job-orchestrator running is with Docker Compose.
Running with Docker Compose
git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
docker compose up --build
This starts:
- Orchestrator server on port 5000
- Example client on port 9000
Verify It’s Running
Check the server is responding:
curl http://localhost:5000/health
You should receive a health status response.
Access the API Documentation
Open your browser and navigate to:
http://localhost:5000/swagger-ui/
This provides interactive API documentation where you can explore and test all endpoints.
What’s Next?
Now that you have job-orchestrator running, proceed to Your First Job to learn how to submit and retrieve jobs.
Stopping the Services
To stop the services:
docker compose down
To stop and remove all data (volumes):
docker compose down -v
Your First Job
This guide walks you through submitting and retrieving your first job.
Prerequisites
Make sure you have job-orchestrator running. See Quick Start if you haven’t set it up yet.
Understanding Jobs
A job in job-orchestrator consists of:
- Files: One or more files to be processed
- A run.sh script: The entry point that gets executed
- User ID: Identifies who submitted the job (for quota tracking)
- Service: Which service/backend should process this job
Creating a Simple Job
Create a simple run.sh script:
cat > run.sh << 'EOF'
#!/bin/bash
echo "Hello from job-orchestrator!" > output.txt
echo "Processing complete at $(date)" >> output.txt
EOF
chmod +x run.sh
Submitting the Job
Submit the job using curl:
curl -X POST http://localhost:5000/upload \
-F "file=@run.sh" \
-F "user_id=1" \
-F "service=example" | jq
You’ll receive a response like:
{
"id": 1,
"status": "Queued",
"message": "Job successfully uploaded"
}
Note the id field - you’ll need this to check status and download results.
Checking Job Status
Check the job status via GET request:
curl http://localhost:5000/download/1
If the job is not yet completed, you’ll get a JSON response:
{
"id": 1,
"status": "Submitted",
"message": ""
}
The status field will be one of: Queued, Processing, Submitted, Running, Completed, Failed, Invalid, Cleaned, or Unknown.
Downloading Results
Once the status is Completed, the same endpoint returns the ZIP file:
curl -o results.zip http://localhost:5000/download/1
Extract and view:
unzip results.zip
cat output.txt
You should see:
Hello from job-orchestrator!
Processing complete at <timestamp>
A More Complex Example
Here’s a job that processes an input file:
# Create an input file
echo "sample data" > input.txt
# Create a processing script
cat > run.sh << 'EOF'
#!/bin/bash
# Count lines and words in input
wc input.txt > stats.txt
# Transform the data
tr 'a-z' 'A-Z' < input.txt > output.txt
echo "Done!" >> output.txt
EOF
chmod +x run.sh
# Submit with multiple files
curl -X POST http://localhost:5000/upload \
-F "file=@run.sh" \
-F "file=@input.txt" \
-F "user_id=1" \
-F "service=example"
Important Notes
The run.sh Script
- Must be named exactly run.sh
- Must be executable (or start with #!/bin/bash)
- Exit code 0 indicates success
- Non-zero exit code indicates failure
- All output files in the working directory are included in results
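Putting those rules together, a defensive run.sh template can make failures explicit. This is a sketch: the `set` flags and the fallback input are our additions, not requirements of job-orchestrator.

```shell
#!/bin/bash
# Defensive run.sh template: any failing command makes the script
# exit non-zero, which job-orchestrator records as a job failure.
set -euo pipefail

# Fallback input so this template also runs standalone (our addition;
# in a real job, input.txt would be uploaded alongside run.sh).
[ -f input.txt ] || echo "sample data" > input.txt

# Everything written to the working directory is returned in the results ZIP.
tr 'a-z' 'A-Z' < input.txt > output.txt
echo "finished at $(date -u +%FT%TZ)" > log.txt
```

With `set -e`, any failing command yields a non-zero exit code, so the server records the job as failed instead of returning silently incomplete results.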
File Size Limits
The default maximum upload size is 400MB. This can be configured on the server.
Job Retention
Completed jobs are automatically cleaned up after the configured retention period (default: 48 hours). Make sure to download your results before they expire.
Next Steps
- Learn about the Job Lifecycle
- Configure Quotas for your users
- Set up Production Deployment
Architecture Overview
job-orchestrator uses a distributed architecture with a central server coordinating job execution across multiple client nodes.
High-Level Architecture
flowchart TB
subgraph Tasks["Background Tasks"]
Sender["Sender<br>500ms"]
Getter["Getter<br>500ms"]
Cleaner["Cleaner<br>60s"]
end
subgraph Server["Orchestrator Server"]
API["REST API<br>upload/download"]
DB[("SQLite<br>Persistent")]
FS[/"Filesystem<br>Job Storage"/]
Tasks
Queue["Queue Manager<br>Quota Enforcement"]
end
subgraph Client["Client Service"]
ClientAPI["REST API<br>submit/retrieve/load"]
ClientDB[("SQLite<br>In-Memory")]
ClientFS[/"Working Dir"/]
Runner["Runner Task<br>500ms"]
Executor["Bash Executor<br>run.sh"]
end
User(["User/Web App"]) -- POST /upload --> API
User -- GET /download/:id --> API
API --> DB & FS
DB --> Queue
Queue --> Sender
Sender -- POST /submit --> ClientAPI
Getter -- GET /retrieve/:id --> ClientAPI
Getter --> FS
Cleaner --> DB & FS
ClientAPI --> ClientDB
ClientDB --> Runner
Runner --> Executor
Executor --> ClientFS
Components
Orchestrator Server
The central server manages:
- REST API: Handles job uploads and result downloads from users
- SQLite Database: Persistent storage for job metadata and status
- Filesystem Storage: Stores uploaded files and downloaded results
- Queue Manager: Enforces per-user quotas and manages job distribution
- Background Tasks: Automated processes for job distribution, result retrieval, and cleanup
Client Service
Each client node handles:
- REST API: Receives jobs from server, returns results
- In-Memory Database: Lightweight tracking of current payloads
- Working Directory: Temporary storage for job execution
- Runner Task: Monitors for new payloads and executes them
- Bash Executor: Runs the run.sh script for each job
Background Tasks
Server Tasks
| Task | Interval | Purpose |
|---|---|---|
| Sender | 500ms | Picks up queued jobs, enforces quotas, dispatches to clients |
| Getter | 500ms | Retrieves completed results from clients |
| Cleaner | 60s | Removes expired jobs from disk and database |
Client Tasks
| Task | Interval | Purpose |
|---|---|---|
| Runner | 500ms | Executes prepared payloads, captures results |
Data Flow
1. User submits files via POST /upload
2. Server stores files and creates job record (status: Queued)
3. Sender task picks up job, checks quotas, sends to available client
4. Client receives job, stores as payload (status: Prepared)
5. Runner task executes run.sh, updates status to Completed
6. Getter task retrieves results, stores locally
7. User downloads results via GET /download/:id
8. Cleaner task removes job after retention period
Auto-Scaling Architecture (Planned)
The orchestrator will support automatic scaling of client instances based on workload:
---
config:
layout: dagre
---
flowchart TB
subgraph Server["Orchestrator Server"]
API["REST API"]
Queue["Queue Manager"]
AutoScaler["Auto-Scaler"]
ServicePool["Service Pool"]
end
subgraph Cloud["Cloud Provider"]
CloudAPI["Cloud API"]
end
subgraph Clients["Client Instances"]
Dynamic["Dynamic Clients<br>Auto-created"]
Static["Static Client<br>"]
end
User(["User/Web App"]) -- Submits/Retrieves --> API
API --> Queue
Queue -- Distribute jobs --> Clients
ServicePool <-- Monitors --> Queue
AutoScaler <-- Register/Trigger --> ServicePool
AutoScaler -- Scale Up/Down --> CloudAPI
CloudAPI -- Create/Terminate --> Clients
This feature will enable:
- Dynamic creation of cloud-based client instances during high demand
- Automatic termination of idle instances to reduce costs
- Load-aware job distribution across available clients
Job Lifecycle
Understanding the job lifecycle is essential for working with job-orchestrator effectively.
Lifecycle Sequence
sequenceDiagram
participant User
participant Server
participant Client
participant Executor
User->>Server: POST /upload (files, user_id, service)
Server->>Server: Store job (status: Queued)
Server-->>User: Job ID
Note over Server: Sender task (500ms interval)
Server->>Server: Update status: Processing
Server->>Client: POST /submit (job files)
Client->>Client: Store payload (status: Prepared)
Client-->>Server: Payload ID
Server->>Server: Update status: Submitted
Note over Client: Runner task (500ms interval)
Client->>Executor: Execute run.sh
Executor->>Executor: Process files
Executor-->>Client: Exit code
Client->>Client: Update status: Completed
Note over Server: Getter task (500ms interval)
Server->>Client: GET /retrieve/:id
Client-->>Server: ZIP results
Server->>Server: Store results, status: Completed
User->>Server: GET /download/:id
Server-->>User: results.zip
Note over Server: Cleaner task (60s interval)
Server->>Server: Remove jobs older than MAX_AGE
Job States
stateDiagram-v2
[*] --> Queued: Job submitted
Queued --> Processing: Sender picks up job
Processing --> Submitted: Sent to client
Processing --> Failed: Client unreachable
Submitted --> Completed: Execution successful
Submitted --> Unknown: Retrieval failed or execution failed
Unknown --> Completed: Retry successful
Completed --> Cleaned: After MAX_AGE
Failed --> Cleaned: After MAX_AGE
Unknown --> Cleaned: After MAX_AGE (if applicable)
Cleaned --> [*]
State Descriptions
| State | Description |
|---|---|
| Queued | Job received and waiting for dispatch |
| Processing | Server is sending job to a client |
| Submitted | Job successfully sent to client, awaiting execution |
| Completed | Job finished successfully, results available |
| Failed | Job failed permanently (client unreachable, execution error) |
| Unknown | Temporary state when retrieval fails, will retry |
| Cleaned | Job data removed after retention period |
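When scripting against the API, it helps to know which of those states are terminal. A small helper, sketched from the table above (the function name is ours):

```shell
# Returns 0 (true) when a job state will no longer change on its own,
# based on the state table above. Unknown is excluded because the
# server keeps retrying retrieval for it.
is_terminal() {
  case "$1" in
    Completed|Failed|Cleaned) return 0 ;;  # nothing left to wait for
    *) return 1 ;;                         # still in flight (or retrying)
  esac
}

is_terminal "Completed" && echo "stop polling"   # prints: stop polling
is_terminal "Submitted" || echo "keep polling"   # prints: keep polling
```

A polling script can loop until `is_terminal` succeeds instead of matching each state individually.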
Lifecycle Stages
1. Submission
User uploads files via POST /upload with:
- One or more files (including run.sh)
- user_id - identifies the submitting user
- service - which backend should process this job
The server:
- Validates the request
- Creates a unique directory for the job
- Stores all uploaded files
- Creates a database record with status Queued
- Returns the job ID to the user
2. Queuing & Quota Check
The Sender background task (runs every 500ms):
- Finds jobs in Queued status
- Checks if user has available quota for the service
- If quota available, marks job as Processing
- If quota exceeded, job remains Queued
3. Distribution
For jobs in Processing status:
- Server packages job files
- Sends to configured client via POST /submit
- On success: updates status to Submitted, stores client’s payload ID
- On failure: updates status to Failed
4. Execution
On the client side:
- Runner task finds payloads in Prepared status
- Executes run.sh in the job directory
- Captures exit code and any output files
- Updates payload status to Completed (or Failed on error)
5. Retrieval
The Getter background task (runs every 500ms):
- Finds jobs in Submitted status
- Requests results from client via GET /retrieve/:id
- Downloads and stores the result ZIP
- Updates job status to Completed
6. Download
User can now:
- Check status via GET /download/:id (returns JSON with job state when not completed)
- Download results via GET /download/:id (returns ZIP when completed)
- Results are returned as a ZIP archive
7. Cleanup
The Cleaner background task (runs every 60s):
- Finds jobs older than MAX_AGE
- Deletes job files from filesystem
- Updates status to Cleaned or removes record
Error Handling
Client Unreachable
If the server cannot reach a client during distribution:
- Job status changes to Failed
- Job will not be retried automatically
- User can resubmit if needed
Execution Failure
If run.sh exits with non-zero code:
- Payload status changes to Failed
- Server retrieves whatever output exists
- Job status reflects the failure
Retrieval Failure
If the server cannot retrieve results:
- Job status changes to Unknown
- Server will retry on subsequent Getter cycles
- Eventually succeeds or times out
Timing Considerations
| Event | Typical Duration |
|---|---|
| Upload to Queued | Immediate |
| Queued to Processing | Up to 500ms (+ quota wait) |
| Processing to Submitted | Depends on file size and network |
| Submitted to Completed | Depends on job execution time |
| Completed to Cleaned | Configured via MAX_AGE (default: 48 hours) |
Server & Client Modes
job-orchestrator provides both server and client functionality in a single binary, configured via command-line arguments.
Dual-Mode Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Same Binary │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Server Mode │ │ Client Mode │ │
│ │ │ │ │ │
│ │ - Job orchestration│ │ - Job execution │ │
│ │ - Quota management │ │ - Result packaging │ │
│ │ - Persistent DB │ │ - In-memory DB │ │
│ │ - User-facing API │ │ - Server-facing API│ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ job-orchestrator server job-orchestrator client │
└─────────────────────────────────────────────────────────────────┘
Server Mode
The server is the central orchestrator that:
- Receives job submissions from users/applications
- Manages job queues and enforces quotas
- Distributes jobs to available clients
- Retrieves results and serves them to users
- Handles cleanup of expired jobs
Starting the Server
job-orchestrator server --port 5000
Or with environment variables:
PORT=5000 job-orchestrator server
Server Responsibilities
| Component | Purpose |
|---|---|
| REST API | Handle /upload and /download requests |
| Queue Manager | Enforce per-user, per-service quotas |
| Sender Task | Dispatch jobs to clients |
| Getter Task | Retrieve completed results |
| Cleaner Task | Remove expired jobs |
| SQLite DB | Persistent job tracking |
Server API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /upload | POST | Submit new job |
| /download/:id | GET | Get results or status |
| /health | GET | Health check |
| /swagger-ui/ | GET | API documentation |
Client Mode
The client executes jobs on behalf of the server:
- Receives job payloads from the server
- Executes the run.sh script
- Packages results for retrieval
- Reports system load for scheduling decisions
Starting the Client
job-orchestrator client --port 9000
Or with environment variables:
PORT=9000 job-orchestrator client
Client Responsibilities
| Component | Purpose |
|---|---|
| REST API | Handle /submit and /retrieve requests |
| Runner Task | Execute prepared payloads |
| Bash Executor | Run run.sh scripts |
| In-Memory DB | Lightweight payload tracking |
Client API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /submit | POST | Receive job from server |
| /retrieve/:id | GET | Return completed results |
| /load | GET | Report CPU usage |
| /health | GET | Health check |
Communication Flow
User Server Client
│ │ │
│──POST /upload────▶│ │
│◀─── Job ID ───────│ │
│ │ │
│ │──POST /submit──────────▶│
│ │◀─── Payload ID ─────────│
│ │ │
│ │ ┌────┴────┐
│ │ │ Execute │
│ │ │ run.sh │
│ │ └────┬────┘
│ │ │
│ │──GET /retrieve/:id─────▶│
│ │◀─── results.zip ────────│
│ │ │
│──GET /download/:id▶│ │
│◀─── results.zip ──│ │
Deployment Patterns
Single Machine (Development)
Both server and client on the same machine:
# Terminal 1
job-orchestrator server --port 5000
# Terminal 2
job-orchestrator client --port 9000
Distributed (Production)
Server on one machine, clients on compute nodes:
┌─────────────┐
│ Server │
│ (port 5000)│
└──────┬──────┘
│
┌────────────────┼────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Client 1 │ │ Client 2 │ │ Client 3 │
│ (compute-1) │ │ (compute-2) │ │ (compute-3) │
└─────────────┘ └─────────────┘ └─────────────┘
Multi-Service Setup
Different clients for different services:
# Server configuration
SERVICE_EXAMPLE_UPLOAD_URL: http://client-example:9000/submit
SERVICE_EXAMPLE_DOWNLOAD_URL: http://client-example:9000/retrieve
SERVICE_HADDOCK_UPLOAD_URL: http://client-haddock:9001/submit
SERVICE_HADDOCK_DOWNLOAD_URL: http://client-haddock:9001/retrieve
Database Differences
Server Database (Persistent)
- Uses SQLite file on disk
- Survives restarts
- Stores complete job history
- Location configured via DB_PATH
Client Database (In-Memory)
- SQLite in-memory database
- Cleared on restart
- Only tracks active payloads
- Lightweight and fast
When to Scale
Add More Clients When:
- Job queue is consistently backing up
- Execution time is the bottleneck
- You have available compute resources
Scale Server When:
- Upload/download becomes slow
- Many concurrent users
- Database queries become slow
See Also
Server Configuration
The orchestrator server is configured primarily through environment variables.
Environment Variables
Core Settings
| Variable | Default | Description |
|---|---|---|
| PORT | 5000 | HTTP port the server listens on |
| DB_PATH | ./db.sqlite | Path to SQLite database file |
| DATA_PATH | ./data | Directory for job file storage |
| MAX_AGE | 172800 | Job retention time in seconds (default: 48 hours) |
Service Configuration
For each service you want to support, configure these variables:
| Variable Pattern | Description |
|---|---|
| SERVICE_<NAME>_UPLOAD_URL | Client endpoint for submitting jobs |
| SERVICE_<NAME>_DOWNLOAD_URL | Client endpoint for retrieving results |
| SERVICE_<NAME>_RUNS_PER_USER | Maximum concurrent jobs per user (default: 5) |
Note: <NAME> must be uppercase. For a service called “example”, use SERVICE_EXAMPLE_*.
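To sanity-check which services your environment actually defines, you can list the SERVICE_* variables and extract the service names. A shell sketch (job-orchestrator parses these variables internally; this is only a quick inspection aid):

```shell
# Example service definition, as it would appear in your environment.
export SERVICE_EXAMPLE_UPLOAD_URL=http://localhost:9000/submit
export SERVICE_EXAMPLE_DOWNLOAD_URL=http://localhost:9000/retrieve

# List the unique service names from SERVICE_<NAME>_* variables.
env | grep '^SERVICE_' \
  | sed -E 's/^SERVICE_([A-Z0-9]+)_(UPLOAD_URL|DOWNLOAD_URL|RUNS_PER_USER)=.*/\1/' \
  | sort -u
# prints: EXAMPLE
```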
Example Configuration
Minimal Setup
export PORT=5000
export DB_PATH=/var/lib/job-orchestrator/db.sqlite
export DATA_PATH=/var/lib/job-orchestrator/data
export SERVICE_EXAMPLE_UPLOAD_URL=http://localhost:9000/submit
export SERVICE_EXAMPLE_DOWNLOAD_URL=http://localhost:9000/retrieve
Production Setup
# Core settings
export PORT=5000
export DB_PATH=/opt/orchestrator/db.sqlite
export DATA_PATH=/opt/orchestrator/data
export MAX_AGE=172800 # 48 hours
# Example service (general purpose)
export SERVICE_EXAMPLE_UPLOAD_URL=http://compute-1:9000/submit
export SERVICE_EXAMPLE_DOWNLOAD_URL=http://compute-1:9000/retrieve
export SERVICE_EXAMPLE_RUNS_PER_USER=10
# HADDOCK service (specialized)
export SERVICE_HADDOCK_UPLOAD_URL=http://haddock-cluster:9001/submit
export SERVICE_HADDOCK_DOWNLOAD_URL=http://haddock-cluster:9001/retrieve
export SERVICE_HADDOCK_RUNS_PER_USER=3
Docker Compose
services:
server:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: server
ports:
- "5000:5000"
environment:
PORT: 5000
DB_PATH: /opt/data/db.sqlite
DATA_PATH: /opt/data
MAX_AGE: 172800
SERVICE_EXAMPLE_UPLOAD_URL: http://client:9000/submit
SERVICE_EXAMPLE_DOWNLOAD_URL: http://client:9000/retrieve
SERVICE_EXAMPLE_RUNS_PER_USER: 5
volumes:
- server-data:/opt/data
volumes:
server-data:
Configuration Details
PORT
The HTTP port for the REST API. Users will connect to this port to submit jobs and download results.
PORT=5000
DB_PATH
Path to the SQLite database file. The directory must exist and be writable.
DB_PATH=/var/lib/job-orchestrator/db.sqlite
The database is created automatically on first run. It stores:
- Job metadata (ID, user, service, status)
- Job locations and timestamps
- Client payload references
DATA_PATH
Directory where job files are stored. Each job gets a unique subdirectory.
DATA_PATH=/var/lib/job-orchestrator/data
Structure:
/var/lib/job-orchestrator/data/
├── a1b2c3d4-e5f6-7890-abcd-ef1234567890/
│ ├── run.sh
│ ├── input.pdb
│ └── output.zip (after completion)
├── b2c3d4e5-f6a7-8901-bcde-f12345678901/
│ └── ...
MAX_AGE
How long to keep completed jobs before cleanup, in seconds.
| Value | Duration |
|---|---|
| 3600 | 1 hour |
| 86400 | 24 hours |
| 172800 | 48 hours (default) |
| 604800 | 1 week |
MAX_AGE=172800
Jobs older than this are removed by the Cleaner task.
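Conceptually, the Cleaner selects jobs whose age exceeds MAX_AGE. The idea can be sketched with find over a throwaway directory (a standalone sketch; the real task works from database timestamps, not directory mtimes):

```shell
# Simulate a data directory with one old and one fresh job.
DATA_PATH=$(mktemp -d)
MAX_AGE=172800  # 48 hours, in seconds
mkdir "$DATA_PATH/old-job" "$DATA_PATH/fresh-job"
touch -d '3 days ago' "$DATA_PATH/old-job"   # GNU touch

# Select job directories older than MAX_AGE (find counts in minutes).
find "$DATA_PATH" -mindepth 1 -maxdepth 1 -type d -mmin +$((MAX_AGE / 60))
# prints only the old-job path
```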
Service URLs
Each service needs upload and download URLs pointing to a client:
SERVICE_MYSERVICE_UPLOAD_URL=http://client-host:9000/submit
SERVICE_MYSERVICE_DOWNLOAD_URL=http://client-host:9000/retrieve
- UPLOAD_URL: Where to POST job files
- DOWNLOAD_URL: Where to GET results (:id is appended automatically)
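Concretely, for a payload with ID 42 the resulting request URL looks like this (a sketch of the joining, which the server performs internally):

```shell
SERVICE_MYSERVICE_DOWNLOAD_URL=http://client-host:9000/retrieve
payload_id=42

# The server appends the payload ID to the configured download URL.
url="${SERVICE_MYSERVICE_DOWNLOAD_URL}/${payload_id}"
echo "$url"   # prints: http://client-host:9000/retrieve/42
```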
RUNS_PER_USER
Controls how many jobs a single user can have running simultaneously for a service:
SERVICE_EXAMPLE_RUNS_PER_USER=5
- Jobs exceeding the quota remain in Queued status
- They’re automatically dispatched when slots become available
- Set higher for quick jobs, lower for resource-intensive jobs
File Permissions
Ensure the server process has:
- Read/Write access to the DB_PATH parent directory
- Read/Write access to the DATA_PATH directory
- Network access to all configured client URLs
Validating Configuration
Start the server and check logs:
job-orchestrator server
You should see:
- Port binding confirmation
- Database initialization
- Service configuration loaded
Test with a health check:
curl http://localhost:5000/health
See Also
Client Configuration
The client is configured through environment variables and runs as a job executor.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| PORT | 9000 | HTTP port the client listens on |
Example Configuration
Basic Setup
export PORT=9000
job-orchestrator client
Docker Compose
services:
client:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: client
ports:
- "9000:9000"
environment:
PORT: 9000
volumes:
- client-data:/opt/data
volumes:
client-data:
How the Client Works
In-Memory Database
Unlike the server, the client uses an in-memory SQLite database:
- Fast: No disk I/O for database operations
- Ephemeral: Data is lost on restart
- Lightweight: Minimal resource usage
This is intentional - the client only needs to track active payloads. The server maintains the authoritative job history.
Working Directory
The client stores job files in a working directory. Each payload gets a unique subdirectory:
/opt/data/
├── payload-uuid-1/
│ ├── run.sh
│ ├── input.pdb
│ └── output.txt (created by run.sh)
├── payload-uuid-2/
│ └── ...
Execution Environment
When the Runner task executes a job:
- Changes to the payload directory
- Executes ./run.sh
- Captures the exit code
- All files in the directory are included in results
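You can emulate those steps locally to debug a job before submitting it. A sketch (the temporary directory is ours; the real client also packages the directory into a ZIP):

```shell
# Build a throwaway payload directory like the client would.
payload_dir=$(mktemp -d)
cat > "$payload_dir/run.sh" << 'EOF'
#!/bin/bash
echo "hello" > output.txt
EOF
chmod +x "$payload_dir/run.sh"

# Mirror the Runner: enter the directory, execute, capture the exit code.
( cd "$payload_dir" && ./run.sh )
exit_code=$?
echo "exit code: $exit_code"

# Everything now in the directory would be included in the results.
ls "$payload_dir"
```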
Resource Reporting
The client exposes a /load endpoint that reports CPU usage:
curl http://localhost:9000/load
Returns a float representing CPU usage percentage. This can be used by the server for load-aware scheduling (planned feature).
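An external script could already use that value for a simple capacity decision. A sketch, with the load hard-coded in place of the live curl call shown above:

```shell
# In practice: load=$(curl -s http://localhost:9000/load)
load=73.5
threshold=80

# awk handles the floating-point comparison that plain [ ] cannot.
if awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l < t) }'; then
  echo "client has capacity"
else
  echo "client is busy"
fi
```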
Multiple Clients
You can run multiple clients for:
- Scaling: Handle more concurrent jobs
- Isolation: Different services on different machines
- Redundancy: Failover capability
Same Service, Multiple Clients
Currently, configure multiple URLs in the server (round-robin planned):
# On server - points to primary client
SERVICE_EXAMPLE_UPLOAD_URL=http://client-1:9000/submit
SERVICE_EXAMPLE_DOWNLOAD_URL=http://client-1:9000/retrieve
Different Services
Run specialized clients for different workloads:
# Client for general jobs
PORT=9000 job-orchestrator client
# Client for heavy computation (different machine)
PORT=9001 job-orchestrator client
Server configuration:
SERVICE_LIGHT_UPLOAD_URL=http://client-1:9000/submit
SERVICE_HEAVY_UPLOAD_URL=http://client-2:9001/submit
Client Security
Network Access
The client should only be accessible by the orchestrator server:
- Use internal networks / VPCs
- Firewall rules to restrict access
- Never expose client ports to the internet
Execution Sandbox
The client executes arbitrary run.sh scripts. Consider:
- Running in containers with resource limits
- Using separate user accounts with minimal permissions
- Mounting only necessary directories
- Network isolation if jobs don’t need internet
Docker Resource Limits
services:
client:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: client
deploy:
resources:
limits:
cpus: '4'
memory: 8G
reservations:
cpus: '1'
memory: 1G
Monitoring
Health Check
curl http://localhost:9000/health
Load Check
curl http://localhost:9000/load
Container Logs
docker logs -f client-container
Troubleshooting
Client Not Receiving Jobs
- Verify server can reach client URL
- Check firewall rules
- Verify service configuration on server
Jobs Stuck in Prepared
- Check if Runner task is running (look for logs)
- Verify run.sh is executable
- Check for permission issues in working directory
High Memory Usage
The in-memory database grows with active payloads. If memory is high:
- Check for stuck/zombie payloads
- Restart the client (safe - server tracks jobs)
- Consider more frequent cleanup
See Also
Quota System
The quota system ensures fair resource allocation by limiting concurrent jobs per user per service.
How Quotas Work
User 1 submits 10 jobs for "example" service
Quota: SERVICE_EXAMPLE_RUNS_PER_USER=5
┌─────────────────────────────────────────┐
│ Jobs 1-5: Dispatched immediately │
│ Jobs 6-10: Remain queued │
└─────────────────────────────────────────┘
When Job 1 completes:
┌─────────────────────────────────────────┐
│ Job 6: Now dispatched (slot available) │
└─────────────────────────────────────────┘
Configuration
Set quotas per service using environment variables:
SERVICE_<NAME>_RUNS_PER_USER=<limit>
Examples
# Allow 5 concurrent jobs per user for "example" service
SERVICE_EXAMPLE_RUNS_PER_USER=5
# Allow 3 concurrent jobs per user for "haddock" service
SERVICE_HADDOCK_RUNS_PER_USER=3
# Allow 10 concurrent jobs per user for "quick" service
SERVICE_QUICK_RUNS_PER_USER=10
Default Value
If not specified, the default quota is 5 concurrent jobs per user per service.
Quota Scope
Quotas are enforced per user, per service:
| User | Service | Quota | Can Submit |
|---|---|---|---|
| user_1 | example | 5 | Up to 5 concurrent |
| user_1 | haddock | 3 | Up to 3 concurrent |
| user_2 | example | 5 | Up to 5 concurrent (independent of user_1) |
Users don’t compete with each other - each user has their own quota allocation.
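The bookkeeping behind that table is a per-(user, service) count of active jobs compared against the quota. A standalone sketch (the flat job-list format is ours, for illustration only):

```shell
# One active (Queued/Processing/Submitted) job per line: <user_id> <service>
jobs=$(mktemp)
cat > "$jobs" << 'EOF'
1 example
1 example
1 haddock
2 example
EOF

quota=5
# Count user 1's active "example" jobs and derive the free slots.
used=$(grep -c '^1 example$' "$jobs")
free=$((quota - used))
echo "user 1 / example: $used used, $free slots free"
# prints: user 1 / example: 2 used, 3 slots free
```

Note that user 1's haddock job and user 2's example job do not count against this quota: each (user, service) pair is tracked independently.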
Quota States
Jobs transition through these states relative to quotas:
┌──────────┐ Quota ┌────────────┐
│ Queued │ ──Available──▶ │ Processing │
└──────────┘ └────────────┘
│ │
│ Quota Exhausted │
▼ ▼
┌──────────────────┐ ┌────────────┐
│ Remains Queued │ │ Submitted │
│ (waits for slot) │ │ (running) │
└──────────────────┘ └────────────┘
Choosing Quota Values
Factors to Consider
- Job Duration: Longer jobs need lower quotas
- Resource Usage: CPU/memory intensive jobs need lower quotas
- User Base: More users may need lower per-user quotas
- Client Capacity: Match quotas to available compute resources
Guidelines
| Job Type | Suggested Quota |
|---|---|
| Quick jobs (< 1 min) | 10-20 |
| Medium jobs (1-10 min) | 5-10 |
| Long jobs (10+ min) | 2-5 |
| Resource-intensive | 1-3 |
Example Scenarios
Scientific Computing Platform
# Quick validation jobs - high quota
SERVICE_VALIDATE_RUNS_PER_USER=20
# Standard analysis - medium quota
SERVICE_ANALYZE_RUNS_PER_USER=5
# Heavy simulation - low quota
SERVICE_SIMULATE_RUNS_PER_USER=2
Educational Platform
# Student exercises - moderate quota
SERVICE_EXERCISE_RUNS_PER_USER=3
# Final projects - allow more
SERVICE_PROJECT_RUNS_PER_USER=5
Monitoring Quota Usage
Check Queue Status
Jobs waiting due to quota exhaustion remain in Queued status:
# Check how many jobs are queued vs running
curl http://localhost:5000/swagger-ui/ # Use API explorer
Server Logs
The server logs quota decisions:
INFO: User 1 has 5/5 jobs running for service 'example', job 123 remains queued
INFO: User 1 slot available, dispatching job 123 to service 'example'
Testing Quotas
Submit multiple jobs to observe throttling:
# Submit 10 jobs with quota of 5
for i in {1..10}; do
echo '#!/bin/bash
sleep 30
echo "Job complete" > output.txt' > run.sh
curl -s -X POST http://localhost:5000/upload \
-F "file=@run.sh" \
-F "user_id=1" \
-F "service=example" | jq -r '.status'
done
You’ll see:
- First 5 jobs: Quickly move to Submitted
- Jobs 6-10: Stay in Queued until slots open
Fair Scheduling
The quota system provides basic fairness:
- Per-user isolation: One user can’t starve others
- Per-service isolation: Heavy service usage doesn’t block other services
- Automatic queuing: No jobs are rejected, just delayed
Limitations
Current limitations (improvements planned):
- No priority queues (FIFO within quota constraints)
- No global quotas (only per-user)
- No time-based quotas (e.g., jobs per hour)
- No burst allowances
See Also
Server API Endpoints
The orchestrator server exposes a REST API for job submission and retrieval.
Base URL
http://localhost:5000
Interactive Documentation
Swagger UI is available at:
http://localhost:5000/swagger-ui/
Endpoints
POST /upload
Submit a new job for processing.
Request
- Content-Type: multipart/form-data
- Max size: 400MB
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | One or more files (repeat for multiple) |
| user_id | integer | Yes | User identifier for quota tracking |
| service | string | Yes | Service name (must be configured on server) |
Example
curl -X POST http://localhost:5000/upload \
-F "file=@run.sh" \
-F "file=@input.pdb" \
-F "user_id=1" \
-F "service=example"
Response
{
"id": 1,
"status": "Queued",
"message": "Job successfully uploaded"
}
Status Codes
| Code | Description |
|---|---|
| 201 | Job created successfully |
| 400 | Invalid request (missing fields, invalid service) |
| 500 | Server error |
Notes
- At least one file must be named run.sh
- The service must match a configured service on the server
- dest_id is populated after the job is dispatched to a client
GET /download/{id}
Check job status or download results.
Parameters
| Parameter | Type | Description |
|---|---|---|
| id | integer | Job ID from upload response |
Example
# Check status (returns JSON when not completed)
curl http://localhost:5000/download/1
# Download results (returns ZIP when completed)
curl -o results.zip http://localhost:5000/download/1
Response
When the job is not yet completed, returns a JSON body:
{
"id": 1,
"status": "Submitted",
"message": ""
}
When the job is completed, returns:
- Content-Type: `application/zip`
- Body: ZIP archive containing all result files
Status Codes
| Code | Description |
|---|---|
| 200 | JSON status body or ZIP file (check Content-Type) |
| 404 | Job not found |
| 500 | Server error |
Usage Pattern
Poll until status is Completed, then save the ZIP:
while true; do
response=$(curl -s http://localhost:5000/download/1)
status=$(echo "$response" | jq -r '.status // empty')
if [ -z "$status" ]; then
# No JSON status field means we got the ZIP
curl -o results.zip http://localhost:5000/download/1
break
elif [ "$status" = "Completed" ]; then
curl -o results.zip http://localhost:5000/download/1
break
else
echo "Status: $status"
sleep 5
fi
done
GET /health
Health check endpoint.
Example
curl http://localhost:5000/health
Response
{
"status": "healthy"
}
Status Codes
| Code | Description |
|---|---|
| 200 | Server is healthy |
| 500 | Server is unhealthy |
GET /
Ping endpoint for basic connectivity check.
Example
curl http://localhost:5000/
Response
Simple acknowledgment that the server is running.
GET /swagger-ui/
Interactive API documentation.
Example
Open in browser:
http://localhost:5000/swagger-ui/
Provides:
- Interactive API explorer
- Request/response schemas
- Try-it-out functionality
Error Responses
All error responses follow this format:
{
"id": 0,
"status": "Unknown",
"message": "Description of the error"
}
Rate Limiting
The server does not implement rate limiting directly. Use a reverse proxy (nginx, traefik) for rate limiting in production.
Authentication
The server does not implement authentication directly. The user_id field is trusted as provided. Implement authentication at the reverse proxy layer or in your application.
Client API Endpoints
The client exposes endpoints for the orchestrator server to submit jobs and retrieve results.
Note: These endpoints are typically only accessed by the orchestrator server, not by end users.
Base URL
http://localhost:9000
Endpoints
POST /submit
Receive a job payload from the orchestrator server.
Request
- Content-Type: `multipart/form-data`
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | One or more job files |
Example
curl -X POST http://localhost:9000/submit \
-F "file=@run.sh" \
-F "file=@input.pdb"
Response
{
"id": 1,
"status": "Prepared",
"loc": "/opt/data/abc123-def456"
}
Status Codes
| Code | Description |
|---|---|
| 200 | Payload received successfully |
| 500 | Server error |
Notes
- The client stores files and creates a payload record
- Status starts as `Prepared`, waiting for the Runner task
- The `id` is returned to the server and stored as `dest_id`
GET /retrieve/{id}
Retrieve results of a completed payload.
Parameters
| Parameter | Type | Description |
|---|---|---|
| id | integer | Payload ID from submit response |
Example
curl -o results.zip http://localhost:9000/retrieve/1
Response
When the payload is not yet completed, returns a JSON body:
{
"id": 1,
"status": "Running",
"loc": "/opt/data/abc123-def456"
}
When the payload is completed, returns:
- Content-Type: `application/zip`
- Body: ZIP archive of all files in the payload directory
Status Codes
| Code | Description |
|---|---|
| 200 | JSON payload status or ZIP file (check Content-Type) |
| 404 | Payload not found |
| 500 | Server error |
Notes
- The ZIP includes all files in the working directory after `run.sh` execution
- Original input files are included unless deleted by `run.sh`
- After successful retrieval, the payload may be cleaned up
GET /load
Report current CPU usage.
Example
curl http://localhost:9000/load
Response
45.2
Returns a float representing CPU usage percentage (0-100).
Use Cases
- Load-aware job distribution (planned feature)
- Monitoring client health
- Capacity planning
GET /health
Health check endpoint.
Example
curl http://localhost:9000/health
Response
{
"status": "healthy"
}
GET /
Ping endpoint for basic connectivity check.
Example
curl http://localhost:9000/
Payload States
Payloads on the client go through these states:
| State | Description |
|---|---|
| Prepared | Received from server, waiting for execution |
| Running | Currently executing `run.sh` |
| Completed | Execution finished successfully |
| Failed | Execution failed (non-zero exit code) |
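The happy-path progression can be sketched as a tiny state function (illustrative shell only, not the client's actual code; a non-zero exit from `run.sh` would move Running to Failed instead):

```shell
# advance() maps each payload state to its successor on the happy path.
advance() {
  case "$1" in
    Prepared) echo "Running" ;;     # Runner task picked up the payload
    Running)  echo "Completed" ;;   # run.sh exited with code 0
    *)        echo "$1" ;;          # terminal states stay put
  esac
}

state="Prepared"
state=$(advance "$state"); echo "$state"   # prints "Running"
state=$(advance "$state"); echo "$state"   # prints "Completed"
```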
Security Considerations
The client API should never be exposed to the public internet:
- No authentication is implemented
- Arbitrary code execution via `run.sh`
- Internal service communication only
Recommendations:
- Use internal networks / VPCs
- Firewall rules: allow only orchestrator server IP
- Docker networks with no external exposure
Docker Deployment
Docker is the recommended way to deploy job-orchestrator in production.
Quick Start
git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
docker compose up --build
This starts:
- Server on port 5000
- Example client on port 9000
Docker Images
Official Image
docker pull ghcr.io/rvhonorato/job-orchestrator:latest
Build Locally
docker build -t job-orchestrator .
Docker Compose
Basic Setup
version: '3.8'
services:
server:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: server
ports:
- "5000:5000"
environment:
PORT: 5000
DB_PATH: /opt/data/db.sqlite
DATA_PATH: /opt/data
MAX_AGE: 172800
SERVICE_EXAMPLE_UPLOAD_URL: http://client:9000/submit
SERVICE_EXAMPLE_DOWNLOAD_URL: http://client:9000/retrieve
SERVICE_EXAMPLE_RUNS_PER_USER: 5
volumes:
- server-data:/opt/data
depends_on:
- client
client:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: client
environment:
PORT: 9000
volumes:
- client-data:/opt/data
volumes:
server-data:
client-data:
Production Setup
See Production Deployment - Container Hardening for details on each security option.
Multiple Clients
Scaling Horizontally
services:
server:
# ... server config ...
environment:
SERVICE_EXAMPLE_UPLOAD_URL: http://client-1:9000/submit
SERVICE_EXAMPLE_DOWNLOAD_URL: http://client-1:9000/retrieve
client-1:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: client
environment:
PORT: 9000
client-2:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: client
environment:
PORT: 9000
client-3:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: client
environment:
PORT: 9000
Multiple Services
services:
server:
environment:
# Light jobs
SERVICE_LIGHT_UPLOAD_URL: http://client-light:9000/submit
SERVICE_LIGHT_DOWNLOAD_URL: http://client-light:9000/retrieve
SERVICE_LIGHT_RUNS_PER_USER: 10
# Heavy jobs
SERVICE_HEAVY_UPLOAD_URL: http://client-heavy:9000/submit
SERVICE_HEAVY_DOWNLOAD_URL: http://client-heavy:9000/retrieve
SERVICE_HEAVY_RUNS_PER_USER: 2
client-light:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: client
deploy:
resources:
limits:
cpus: '2'
memory: 2G
client-heavy:
image: ghcr.io/rvhonorato/job-orchestrator:latest
command: client
deploy:
resources:
limits:
cpus: '8'
memory: 16G
Volume Management
Persistent Storage
Always use named volumes for production:
volumes:
server-data:
driver: local
driver_opts:
type: none
o: bind
device: /data/job-orchestrator/server
client-data:
driver: local
driver_opts:
type: none
o: bind
device: /data/job-orchestrator/client
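Note that bind-backed volumes like these require the host directories to exist before `docker compose up`, or volume creation fails. A quick preparation step (the `/tmp` path below is for illustration; substitute your real `device:` paths):

```shell
# Create the host directories backing the named volumes ahead of time.
base=/tmp/job-orchestrator-demo   # stand-in for /data/job-orchestrator
mkdir -p "$base/server" "$base/client"
ls "$base"   # prints: client  server
```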
Backup Strategy
# Stop services (optional, for consistent backup)
docker compose stop
# Backup server data
tar -czf backup-$(date +%Y%m%d).tar.gz /data/job-orchestrator/server
# Resume services
docker compose start
Networking
Internal Network
Keep client internal:
services:
server:
ports:
- "5000:5000" # Exposed to host
networks:
- internal
- external
client:
# No ports exposed to host
networks:
- internal
networks:
internal:
internal: true
external:
With Reverse Proxy
services:
nginx:
image: nginx:alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
- ./certs:/etc/nginx/certs:ro
depends_on:
- server
server:
# No ports exposed, accessed via nginx
networks:
- internal
Logging
View Logs
# All services
docker compose logs -f
# Server only
docker compose logs -f server
# Last 100 lines
docker compose logs --tail 100 server
Log Rotation
services:
server:
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
Commands
# Start
docker compose up -d
# Stop
docker compose down
# Restart
docker compose restart
# Rebuild and start
docker compose up --build -d
# View status
docker compose ps
# Shell into container
docker compose exec server /bin/sh
Production Deployment
This guide covers best practices for deploying job-orchestrator in production environments.
Architecture Recommendations
Minimum Setup
┌─────────────┐ ┌─────────────┐
│ Server │────▶│ Client │
│ (1 instance)│ │ (1 instance)│
└─────────────┘ └─────────────┘
Recommended Setup
┌──────────────┐
│ Load Balancer│
│ (nginx) │
└──────┬───────┘
│
┌──────▼───────┐
│ Server │
│ (1 instance)│
└──────┬───────┘
│
┌────────────────┼────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Client 1 │ │ Client 2 │ │ Client 3 │
└─────────────┘ └─────────────┘ └─────────────┘
Security
Script Validation
The client includes a built-in script validator that rejects run.sh
scripts containing obviously dangerous patterns before execution. This
covers destructive commands (rm -rf /, mkfs), network exfiltration
tools (curl, wget, socat), reverse shells (/dev/tcp/), privilege
escalation (sudo, chmod +s), container escapes (nsenter, docker),
obfuscated execution (base64 | bash, python -c), persistence
mechanisms (crontab, systemctl), crypto miners, and environment
secret access.
This is a sanity check, not a sandbox. It can be bypassed by determined actors. Input scripts are still expected to come from trusted or semi-trusted sources. True isolation must be enforced at the deployment level using the container hardening measures below.
Container Hardening
The client executes user-submitted scripts with the full privileges of the process. Apply all of the following to limit blast radius:
| Measure | Docker Compose | Purpose |
|---|---|---|
| Read-only rootfs | read_only: true | Prevent filesystem tampering |
| Drop all capabilities | cap_drop: [ALL] | Remove kernel-level privileges |
| No new privileges | security_opt: [no-new-privileges:true] | Block setuid/setgid escalation |
| CPU limit | deploy.resources.limits.cpus | Prevent CPU starvation |
| Memory limit | deploy.resources.limits.memory | Prevent OOM on host |
| PIDs limit | deploy.resources.limits.pids | Prevent fork bombs |
| Internal network | networks: [internal] | Block outbound internet access |
| Writable tmpfs | tmpfs: [/tmp] | Provide scratch space on read-only rootfs |
Example (applied to the client service):
services:
client:
read_only: true
cap_drop:
- ALL
security_opt:
- no-new-privileges:true
tmpfs:
- /tmp
deploy:
resources:
limits:
cpus: "2"
memory: 2G
pids: 256
networks:
- internal
networks:
internal:
internal: true
Future improvement: Run the container as a non-root user (USER appuser
in the Dockerfile). This requires migrating ownership of existing volumes
first – see the TODO in the Dockerfile.
Network Security
1. Never expose clients to the internet
   - Clients execute user-submitted scripts
   - Use internal networks only
   - Block all outbound access from client containers

2. Use a reverse proxy
   - TLS termination
   - Rate limiting
   - Request filtering

3. Firewall rules

   # Allow only orchestrator server to reach clients
   iptables -A INPUT -p tcp --dport 9000 -s <server-ip> -j ACCEPT
   iptables -A INPUT -p tcp --dport 9000 -j DROP
Reverse Proxy (nginx)
# limit_req_zone must be defined at the http level, outside server blocks
limit_req_zone $binary_remote_addr zone=upload:10m rate=10r/s;

upstream orchestrator {
server 127.0.0.1:5000;
}
server {
listen 443 ssl http2;
server_name jobs.example.com;
ssl_certificate /etc/nginx/certs/cert.pem;
ssl_certificate_key /etc/nginx/certs/key.pem;
location /upload {
limit_req zone=upload burst=20 nodelay;
client_max_body_size 400M;
proxy_pass http://orchestrator;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
location /download {
proxy_pass http://orchestrator;
proxy_set_header Host $host;
}
location /health {
proxy_pass http://orchestrator;
}
# Block swagger in production (optional)
location /swagger-ui {
deny all;
}
}
Authentication
job-orchestrator does not implement authentication. Options:
1. Reverse proxy authentication

   location / {
     auth_basic "Restricted";
     auth_basic_user_file /etc/nginx/.htpasswd;
     proxy_pass http://orchestrator;
   }

2. Application-level authentication
   - Wrap the API in your application
   - Validate users before calling job-orchestrator

3. OAuth2 Proxy
   - Use oauth2-proxy in front of the service
   - Integrates with identity providers
Resource Planning
Server Requirements
| Load Level | CPU | Memory | Storage |
|---|---|---|---|
| Light (< 100 jobs/day) | 1 core | 512MB | 10GB |
| Medium (100-1000 jobs/day) | 2 cores | 1GB | 50GB |
| Heavy (> 1000 jobs/day) | 4 cores | 2GB | 100GB+ |
Storage depends heavily on job file sizes and retention period.
Client Requirements
Depends entirely on your job workloads:
| Job Type | CPU | Memory |
|---|---|---|
| Text processing | 1 core | 512MB |
| Scientific computing | 4-8 cores | 8-16GB |
| ML/Deep learning | 8+ cores + GPU | 32GB+ |
Storage Calculation
Storage = (avg_job_size) × (jobs_per_day) × (retention_days)
Example:
- 10MB average job
- 500 jobs/day
- 2 day retention
= 10MB × 500 × 2 = 10GB
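As a quick sanity check, the same calculation in shell arithmetic:

```shell
# Storage estimate = avg job size (MB) x jobs/day x retention days
avg_job_mb=10
jobs_per_day=500
retention_days=2
total_mb=$((avg_job_mb * jobs_per_day * retention_days))
echo "${total_mb} MB"   # prints "10000 MB" (= 10 GB)
```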
Monitoring
Health Checks
# Server health
curl -f http://localhost:5000/health
# Client health
curl -f http://localhost:9000/health
# Client load
curl http://localhost:9000/load
Prometheus Metrics (External)
Use a sidecar or external monitoring:
services:
prometheus:
image: prom/prometheus
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
# Monitor container metrics
cadvisor:
image: gcr.io/cadvisor/cadvisor
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
Log Aggregation
services:
server:
logging:
driver: "fluentd"
options:
fluentd-address: "localhost:24224"
tag: "job-orchestrator.server"
Backup & Recovery
What to Backup
1. Server database (`DB_PATH`)
   - Contains job history and status
   - Critical for job tracking

2. Server data directory (`DATA_PATH`)
   - Contains job files and results
   - Large, may use incremental backups
Backup Script
#!/bin/bash
BACKUP_DIR=/backups/job-orchestrator
DATE=$(date +%Y%m%d_%H%M%S)
# Backup database
sqlite3 /opt/data/db.sqlite ".backup '${BACKUP_DIR}/db_${DATE}.sqlite'"
# Backup data (incremental with rsync)
rsync -av --delete /opt/data/ ${BACKUP_DIR}/data/
# Cleanup old backups (keep 7 days)
find ${BACKUP_DIR} -name "db_*.sqlite" -mtime +7 -delete
Recovery
# Stop server
docker compose stop server
# Restore database
cp /backups/job-orchestrator/db_latest.sqlite /opt/data/db.sqlite
# Restore data
rsync -av /backups/job-orchestrator/data/ /opt/data/
# Start server
docker compose start server
High Availability
Current Limitations
- Single server architecture
- No built-in clustering
- SQLite doesn’t support concurrent writes
Workarounds
1. Quick recovery
   - Automated health checks
   - Container auto-restart
   - Fast backup restoration

2. Stateless clients
   - Clients can be restarted freely
   - Jobs are tracked by server

3. Future improvements
   - PostgreSQL support (planned)
   - Server clustering (planned)
Maintenance
Database Maintenance
# Vacuum database (reclaim space)
sqlite3 /opt/data/db.sqlite "VACUUM;"
# Check integrity
sqlite3 /opt/data/db.sqlite "PRAGMA integrity_check;"
Log Rotation
Ensure logs don’t fill disk:
services:
server:
logging:
driver: "json-file"
options:
max-size: "50m"
max-file: "5"
Updates
# Pull latest image
docker pull ghcr.io/rvhonorato/job-orchestrator:latest
# Recreate containers
docker compose up -d
Troubleshooting
See Troubleshooting Guide for common issues.
Building from Source
This guide covers building job-orchestrator from source code.
Prerequisites
Rust Toolchain
Install Rust via rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Minimum version: Rust 1.75 (edition 2021)
Verify installation:
rustc --version
cargo --version
System Dependencies
Debian/Ubuntu
apt-get update
apt-get install -y build-essential libsqlite3-dev pkg-config
Fedora/RHEL
dnf install gcc sqlite-devel
macOS
brew install sqlite
Windows
Install Visual Studio Build Tools and SQLite development libraries.
Clone Repository
git clone https://github.com/rvhonorato/job-orchestrator.git
cd job-orchestrator
Build Commands
Debug Build
Fast compilation, includes debug symbols:
cargo build
Binary location: target/debug/job-orchestrator
Release Build
Optimized for performance:
cargo build --release
Binary location: target/release/job-orchestrator
Check (No Build)
Verify code compiles without producing binary:
cargo check
Running
From Cargo
# Server mode
cargo run -- server --port 5000
# Client mode
cargo run -- client --port 9000
From Binary
# After release build
./target/release/job-orchestrator server --port 5000
Build Options
Features
Currently no optional features. All functionality is included by default.
Target Platforms
Cross-compile for different targets:
# Add target
rustup target add x86_64-unknown-linux-musl
# Build for target
cargo build --release --target x86_64-unknown-linux-musl
Common targets:
- `x86_64-unknown-linux-gnu` - Linux (glibc)
- `x86_64-unknown-linux-musl` - Linux (static)
- `x86_64-apple-darwin` - macOS Intel
- `aarch64-apple-darwin` - macOS Apple Silicon
- `x86_64-pc-windows-msvc` - Windows
Docker Build
Using Dockerfile
docker build -t job-orchestrator .
Multi-stage Build
The Dockerfile uses multi-stage builds for smaller images:
- Builder stage: Compiles with full toolchain
- Runtime stage: Minimal image with just the binary
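A generic sketch of such a two-stage layout (illustrative only; the base images, paths, and Rust version here are assumptions, not the project's actual Dockerfile):

```dockerfile
# Builder stage: compile with the full Rust toolchain
FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

# Runtime stage: minimal image containing only the binary
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/job-orchestrator /usr/local/bin/job-orchestrator
ENTRYPOINT ["/usr/local/bin/job-orchestrator"]
```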
Testing
This guide covers running and writing tests for job-orchestrator.
Running Tests
All Tests
cargo test
With Output
See println! output from tests:
cargo test -- --nocapture
Specific Test
# By name
cargo test test_upload
# By module
cargo test orchestrator::tests
Ignored Tests
Some tests may be ignored by default (slow, require setup):
cargo test -- --ignored
Test Coverage
Using cargo-tarpaulin
Install:
cargo install cargo-tarpaulin
Generate coverage:
# HTML report
cargo tarpaulin --out Html --output-dir ./coverage
# XML report (for CI)
cargo tarpaulin --out Xml --output-dir ./coverage
View report:
open coverage/tarpaulin-report.html
Test Structure
Tests are organized alongside the code they test:
src/
├── lib.rs
├── orchestrator/
│ ├── mod.rs
│ └── tests.rs # Orchestrator tests
├── client/
│ ├── mod.rs
│ └── tests.rs # Client tests
└── utils/
└── mod.rs # Inline tests with #[cfg(test)]
Writing Tests
Unit Tests
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_something() {
        let result = function_under_test();
        assert_eq!(result, expected_value);
    }
}
Async Tests
Use tokio::test for async functions:
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_async_function() {
        let result = async_function().await;
        assert!(result.is_ok());
    }
}
Integration Tests
Create files in tests/ directory:
// tests/integration_test.rs
use job_orchestrator::*;

#[tokio::test]
async fn test_full_workflow() {
    // Setup
    // Test
    // Verify
}
Mocking
Using mockall
For trait-based mocking:
use mockall::automock;
use mockall::predicate::eq; // needed for the .with(eq(1)) matcher

#[automock]
trait Database {
    fn get(&self, id: i32) -> Option<Job>;
}

#[test]
fn test_with_mock() {
    let mut mock = MockDatabase::new();
    mock.expect_get()
        .with(eq(1))
        .returning(|_| Some(Job::default()));
    // Use mock in test
}
Using mockito
For HTTP mocking:
use mockito::Server;

#[tokio::test]
async fn test_http_client() {
    // Use the async constructors inside a tokio runtime
    let mut server = Server::new_async().await;
    let mock = server.mock("GET", "/health")
        .with_status(200)
        .create_async().await;

    // Test against server.url()
    mock.assert_async().await;
}
Test Utilities
Test Fixtures
Create reusable test data:
#[cfg(test)]
mod test_utils {
    pub fn create_test_job() -> Job {
        Job {
            id: 1,
            user_id: 1,
            service: "test".to_string(),
            status: Status::Queued,
            ..Default::default()
        }
    }
}
Temporary Directories
Use tempfile for temporary test directories:
use tempfile::TempDir;

#[test]
fn test_file_operations() {
    let temp_dir = TempDir::new().unwrap();
    let file_path = temp_dir.path().join("test.txt");
    // Test file operations
    // TempDir is automatically cleaned up
}
CI Testing
Tests run automatically on GitHub Actions:
# .github/workflows/ci.yml
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- run: cargo test
Linting
Clippy
Run Clippy for additional checks:
cargo clippy -- -D warnings
Common fixes:
- `#[allow(clippy::lint_name)]` to suppress specific lints
- Configure in `clippy.toml` or `Cargo.toml`
Formatting
Check formatting:
cargo fmt -- --check
Fix formatting:
cargo fmt
Debugging Tests
With println
#[test]
fn test_debug() {
    let value = compute_something();
    println!("Debug value: {:?}", value); // Use --nocapture to see
    assert!(value.is_valid());
}
With RUST_BACKTRACE
RUST_BACKTRACE=1 cargo test
Contributing
Contributions to job-orchestrator are welcome! This guide explains how to contribute.
Ways to Contribute
- Bug reports: Found a bug? Open an issue
- Feature requests: Have an idea? Open an issue to discuss
- Code contributions: Fix bugs or implement features
- Documentation: Improve docs, fix typos, add examples
- Testing: Add tests, report edge cases
Getting Started
1. Fork the Repository
Click “Fork” on GitHub, then clone your fork:
git clone https://github.com/YOUR_USERNAME/job-orchestrator.git
cd job-orchestrator
2. Set Up Development Environment
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install dependencies (Debian/Ubuntu)
apt-get install libsqlite3-dev
# Build
cargo build
# Run tests
cargo test
3. Create a Branch
git checkout -b feature/your-feature-name
# or
git checkout -b fix/your-bug-fix
Development Workflow
Making Changes
- Write code
- Add tests for new functionality
- Run tests: `cargo test`
- Run linter: `cargo clippy -- -D warnings`
- Format code: `cargo fmt`
Commit Messages
Follow conventional commits:
type(scope): description
[optional body]
Types:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation
- `refactor`: Code refactoring
- `test`: Adding tests
- `chore`: Maintenance
Examples:
feat(quota): add per-service quota limits
fix(client): handle non-zero exit codes correctly
docs(readme): update quick start instructions
Pull Request Process
1. Push your branch:

   git push origin feature/your-feature-name

2. Open a Pull Request on GitHub

3. Fill in the PR template:
   - Describe your changes
   - Link related issues
   - Note any breaking changes

4. Wait for review:
   - CI must pass
   - Maintainer will review
   - Address feedback

5. Merge!
Code Style
Rust Style
Follow standard Rust conventions:
- Use `cargo fmt` for formatting
- Use `cargo clippy` for linting
- Prefer descriptive variable names
- Add doc comments for public APIs
/// Uploads a job to the orchestrator.
///
/// # Arguments
///
/// * `files` - Files to upload
/// * `user_id` - User submitting the job
/// * `service` - Target service name
///
/// # Returns
///
/// The created job with its ID and initial status.
pub async fn upload_job(
    files: Vec<File>,
    user_id: i32,
    service: String,
) -> Result<Job, Error> {
    // Implementation
}
Error Handling
- Use `Result` types, not panics
- Provide context with error messages
- Use the `?` operator for propagation
// Good
let file = File::open(path)
    .map_err(|e| Error::FileOpen { path: path.clone(), source: e })?;

// Avoid
let file = File::open(path).unwrap();
Testing Guidelines
Write Tests For
- New functionality
- Bug fixes (regression tests)
- Edge cases
Test Structure
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_function_describes_behavior() {
        // Arrange
        let input = create_test_input();

        // Act
        let result = function_under_test(input);

        // Assert
        assert_eq!(result, expected);
    }
}
Async Tests
#[tokio::test]
async fn test_async_operation() {
    let result = async_function().await;
    assert!(result.is_ok());
}
Documentation
Code Documentation
Add doc comments for:
- Public functions
- Public structs
- Public modules
/// A job represents a unit of work to be processed.
pub struct Job {
    /// Unique identifier for the job.
    pub id: i32,
    /// User who submitted the job.
    pub user_id: i32,
    // ...
}
mdbook Documentation
Documentation is in docs/src/. To preview:
# Install mdbook
cargo install mdbook
# Serve locally
cd docs
mdbook serve --open
Reporting Issues
Bug Reports
Include:
- job-orchestrator version
- Operating system
- Steps to reproduce
- Expected vs actual behavior
- Logs if relevant
Feature Requests
Include:
- Use case description
- Proposed solution (if any)
- Alternatives considered
Code of Conduct
Be respectful and constructive. We’re all here to build something useful together.
Questions?
- Open an issue for questions
- Email: Rodrigo V. Honorato rvhonorato@protonmail.com
License
Contributions are licensed under the MIT License.
Troubleshooting
Common issues and solutions for job-orchestrator.
Server Issues
Server Won’t Start
Symptom: Server fails to start, exits immediately
Possible Causes:
1. Port already in use

   Error: Address already in use

   Solution:

   # Find process using the port
   lsof -i :5000
   # Kill it or use a different port
   PORT=5001 job-orchestrator server

2. Database path not writable

   Error: unable to open database file

   Solution:

   # Check directory exists and is writable
   mkdir -p /opt/data
   chmod 755 /opt/data

3. Missing service configuration

   Error: No services configured

   Solution: Configure at least one service:

   export SERVICE_EXAMPLE_UPLOAD_URL=http://client:9000/submit
   export SERVICE_EXAMPLE_DOWNLOAD_URL=http://client:9000/retrieve
Jobs Stuck in Queued
Symptom: Jobs stay in Queued status indefinitely
Possible Causes:
1. Quota exhausted

   Check if the user has reached their limit:
   - Default quota is 5 concurrent jobs per user per service
   - Wait for running jobs to complete, or increase the quota

2. Client unreachable

   Verify client connectivity:

   curl http://client:9000/health

3. Service misconfigured

   Verify service URLs are correct:

   echo $SERVICE_EXAMPLE_UPLOAD_URL
   curl -X POST $SERVICE_EXAMPLE_UPLOAD_URL
   # Should return error, not timeout
Jobs Stuck in Submitted
Symptom: Jobs move to Submitted but never complete
Possible Causes:
1. Client not executing jobs

   Check client logs for errors:

   docker logs client

2. `run.sh` hanging

   Your script may be waiting for input or stuck in a loop

3. Getter task not running

   Server may need restart
Upload Fails with 400
Symptom: POST /upload returns 400 Bad Request
Possible Causes:
1. Missing required fields

   # Ensure all fields are provided; user_id and service are required
   curl -X POST http://localhost:5000/upload \
     -F "file=@run.sh" \
     -F "user_id=1" \
     -F "service=example"

2. Unknown service

   Service must be configured on server:

   export SERVICE_EXAMPLE_UPLOAD_URL=...

3. File too large

   Default limit is 400MB. Check file sizes.
Client Issues
Client Not Receiving Jobs
Symptom: Client running but no jobs arrive
Check:
1. Network connectivity

   # From server, can you reach client?
   curl http://client:9000/health

2. Firewall rules

   # Client port must be accessible from server
   iptables -L -n | grep 9000

3. Docker networking

   # Containers must be on same network
   docker network inspect job-orchestrator_default
Jobs Stuck in Prepared
Symptom: Payloads stay in Prepared status
Possible Causes:
1. Runner task not running

   Check client logs, may need restart

2. `run.sh` not found or not executable

   Ensure the script exists and is executable:

   # Before uploading
   chmod +x run.sh

3. Permission issues

   Client working directory may have permission issues
Execution Fails
Symptom: Jobs complete but with Failed status
Check:
1. Exit code

   `run.sh` must exit with code 0 for success:

   #!/bin/bash
   # Your commands here
   exit 0  # Explicit success

2. Script errors

   Check output files for error messages

3. Missing dependencies

   Your script may need tools not available in the container
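A minimal `run.sh` skeleton that makes the success condition explicit (the `/tmp` paths below are for illustration only):

```shell
# Write and execute a run.sh that exits 0 on success; with `set -e`,
# any failing command aborts the script with a non-zero code,
# which would mark the payload as Failed.
cat > /tmp/run.sh <<'EOF'
#!/bin/bash
set -e                          # abort on the first failing command
echo "Job complete" > /tmp/output.txt
exit 0                          # explicit success
EOF

bash /tmp/run.sh
echo "exit=$?"   # prints "exit=0"
```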
Database Issues
Database Locked
Symptom: “database is locked” errors
Causes: Multiple processes accessing SQLite
Solution:
- Ensure only one server instance runs
- Check for zombie processes
- Restart server
Database Corrupted
Symptom: Strange errors, missing data
Solution:
1. Stop server

2. Backup current database

3. Run integrity check:

   sqlite3 db.sqlite "PRAGMA integrity_check;"

4. If corrupted, restore from backup or delete and restart
Out of Disk Space
Symptom: “disk full” errors
Solution:
1. Check disk usage:

   df -h

2. Clean old jobs:

   # Reduce MAX_AGE and restart
   export MAX_AGE=3600  # 1 hour

3. Manually clean data directory
Docker Issues
Container Exits Immediately
Check logs:
docker logs container_name
Common causes:
- Missing environment variables
- Port conflicts
- Permission issues
Cannot Connect Between Containers
Ensure same network:
services:
server:
networks:
- app-network
client:
networks:
- app-network
networks:
app-network:
Use service names, not localhost:
# Wrong
SERVICE_EXAMPLE_UPLOAD_URL=http://localhost:9000/submit
# Correct
SERVICE_EXAMPLE_UPLOAD_URL=http://client:9000/submit
Volume Permission Issues
Symptom: Permission denied when writing to volumes
Solution:
services:
server:
user: "1000:1000" # Match host user
volumes:
- ./data:/opt/data
Or fix permissions:
sudo chown -R 1000:1000 ./data
Performance Issues
Slow Job Processing
Possible Causes:
1. Slow database
   - Use SSD storage for database
   - Run VACUUM periodically

2. Network latency
   - Place server and clients on same network
   - Check for packet loss

3. Client overloaded
   - Add more clients
   - Reduce RUNS_PER_USER
High Memory Usage
Server:
- Memory grows with job count
- Clean old jobs with lower MAX_AGE
Client:
- In-memory database grows with payloads
- Restart client to clear
Disk Usage Growing
Check:
du -sh /opt/data/*
Solutions:
- Reduce MAX_AGE
- Increase cleanup frequency
- Archive old results externally
Getting Help
If you can’t resolve an issue:
- Check logs for specific error messages
- Search existing issues: GitHub Issues
- Open new issue with:
- Version
- Configuration
- Steps to reproduce
- Logs