# Troubleshooting

Common issues and solutions for job-orchestrator.
## Server Issues

### Server Won't Start

**Symptom:** Server fails to start, exits immediately

**Possible Causes:**

- **Port already in use**

  Error: `Address already in use`

  Solution:

  ```bash
  # Find process using the port
  lsof -i :5000

  # Kill it or use a different port
  PORT=5001 job-orchestrator server
  ```

- **Database path not writable**

  Error: `unable to open database file`

  Solution:

  ```bash
  # Check directory exists and is writable
  mkdir -p /opt/data
  chmod 755 /opt/data
  ```

- **Missing service configuration**

  Error: `No services configured`

  Solution: Configure at least one service:

  ```bash
  export SERVICE_EXAMPLE_UPLOAD_URL=http://client:9000/submit
  export SERVICE_EXAMPLE_DOWNLOAD_URL=http://client:9000/retrieve
  ```
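The three checks above can be bundled into a quick preflight function. This is a sketch: port 5000, `/opt/data`, and the `SERVICE_EXAMPLE_UPLOAD_URL` variable are the defaults used in this guide and may differ in your deployment.

```bash
# Sketch: run the startup checks above in one go before launching the server.
preflight() {
    local port="${PORT:-5000}"
    local data_dir="${DATA_DIR:-/opt/data}"
    local status=0

    # 1. Port already in use? (skipped silently if lsof is unavailable)
    if lsof -i ":$port" >/dev/null 2>&1; then
        echo "FAIL: port $port is already in use"
        status=1
    fi

    # 2. Database directory present and writable?
    if [ ! -d "$data_dir" ] || [ ! -w "$data_dir" ]; then
        echo "FAIL: $data_dir missing or not writable"
        status=1
    fi

    # 3. At least one service configured?
    if [ -z "${SERVICE_EXAMPLE_UPLOAD_URL:-}" ]; then
        echo "FAIL: no service configured (SERVICE_EXAMPLE_UPLOAD_URL unset)"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "preflight OK"
    return "$status"
}
```

Run it in the same shell (or container) you start the server from, so it sees the same environment.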
### Jobs Stuck in Queued

**Symptom:** Jobs stay in Queued status indefinitely

**Possible Causes:**

- **Quota exhausted**

  Check if the user has reached their limit:

  - Default quota is 5 concurrent jobs per user per service
  - Wait for running jobs to complete, or increase the quota

- **Client unreachable**

  Verify client connectivity:

  ```bash
  curl http://client:9000/health
  ```

- **Service misconfigured**

  Verify service URLs are correct:

  ```bash
  echo $SERVICE_EXAMPLE_UPLOAD_URL
  curl -X POST $SERVICE_EXAMPLE_UPLOAD_URL  # Should return an error, not time out
  ```
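The reachability and configuration checks can be run together. This is a sketch: `check_client` is a hypothetical helper, and `http://client:9000/health` is the example client endpoint used throughout this guide.

```bash
# Sketch: report client reachability and server-side service configuration.
check_client() {
    local url="${1:-http://client:9000/health}"
    # -f: treat HTTP errors as failures; --max-time: don't hang on dead hosts
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
        echo "client reachable"
    else
        echo "client UNREACHABLE: check network, firewall, and service URLs"
    fi
}

# Confirm the service is configured on the server side, too
echo "upload URL:   ${SERVICE_EXAMPLE_UPLOAD_URL:-<unset>}"
echo "download URL: ${SERVICE_EXAMPLE_DOWNLOAD_URL:-<unset>}"
```

If either URL prints `<unset>`, the server cannot dispatch jobs for that service and they stay Queued.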
### Jobs Stuck in Submitted

**Symptom:** Jobs move to Submitted but never complete

**Possible Causes:**

- **Client not executing jobs**

  Check client logs for errors:

  ```bash
  docker logs client
  ```

- **`run.sh` hanging**

  Your script may be waiting for input or stuck in a loop

- **Getter task not running**

  Server may need a restart
### Upload Fails with 400

**Symptom:** POST /upload returns 400 Bad Request

**Possible Causes:**

- **Missing required fields**

  ```bash
  # Ensure all fields are provided; user_id and service are required
  curl -X POST http://localhost:5000/upload \
    -F "file=@run.sh" \
    -F "user_id=1" \
    -F "service=example"
  ```

- **Unknown service**

  The service must be configured on the server:

  ```bash
  export SERVICE_EXAMPLE_UPLOAD_URL=...
  ```

- **File too large**

  Default limit is 400MB. Check file sizes.
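All three causes can be caught before sending the request. This is a sketch: `validate_upload` is a hypothetical helper; the 400MB limit and the `user_id`/`service` field names are the defaults described above.

```bash
# Sketch: client-side sanity checks before POSTing to /upload.
validate_upload() {
    local file="$1" user_id="$2" service="$3"
    local max_bytes=$((400 * 1024 * 1024))   # default server limit

    [ -f "$file" ]    || { echo "missing file: $file"; return 1; }
    [ -n "$user_id" ] || { echo "user_id is required"; return 1; }
    [ -n "$service" ] || { echo "service is required"; return 1; }

    # wc -c is portable; stat's flags differ between Linux and macOS
    local size
    size=$(wc -c < "$file")
    if [ "$size" -gt "$max_bytes" ]; then
        echo "file too large: $size bytes (limit $max_bytes)"
        return 1
    fi
    echo "ok"
}
```

If it prints `ok`, the curl command above should not be rejected for any of these three reasons.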
## Client Issues

### Client Not Receiving Jobs

**Symptom:** Client running but no jobs arrive

**Check:**

- **Network connectivity**

  ```bash
  # From the server, can you reach the client?
  curl http://client:9000/health
  ```

- **Firewall rules**

  ```bash
  # Client port must be accessible from the server
  iptables -L -n | grep 9000
  ```

- **Docker networking**

  ```bash
  # Containers must be on the same network
  docker network inspect job-orchestrator_default
  ```
### Jobs Stuck in Prepared

**Symptom:** Payloads stay in Prepared status

**Possible Causes:**

- **Runner task not running**

  Check client logs; the client may need a restart

- **`run.sh` not found or not executable**

  Ensure the script exists and is executable:

  ```bash
  # In your upload
  chmod +x run.sh
  ```

- **Permission issues**

  The client working directory may have permission problems
### Execution Fails

**Symptom:** Jobs complete but with Failed status

**Check:**

- **Exit code**

  `run.sh` must exit with code 0 for success:

  ```bash
  #!/bin/bash
  # Your commands here
  exit 0  # Explicit success
  ```

- **Script errors**

  Check output files for error messages

- **Missing dependencies**

  Your script may need tools not available in the container
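A fail-fast `run.sh` skeleton rules out the first two causes, and a local dry-run before uploading catches most script errors. The commands inside are placeholders.

```bash
# Sketch: write out a defensive run.sh, make it executable, and dry-run it.
cat > run.sh <<'EOF'
#!/bin/bash
set -euo pipefail   # abort on errors, unset variables, and pipe failures

echo "job started"

# --- your commands here, e.g.: ---
# ./process-input data.csv   # (placeholder)

echo "job finished"
exit 0   # explicit success; any earlier failure exits non-zero via set -e
EOF
chmod +x run.sh   # the client cannot execute it otherwise

# Exit code 0 here means the orchestrator will count the job as a success
./run.sh && echo "exit code: 0"
```

With `set -e`, a failing command anywhere in the script produces a non-zero exit code, so the job is correctly reported as Failed instead of silently succeeding.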
## Database Issues

### Database Locked

**Symptom:** "database is locked" errors

**Cause:** Multiple processes accessing SQLite

**Solution:**

- Ensure only one server instance runs
- Check for zombie processes
- Restart the server

### Database Corrupted

**Symptom:** Strange errors, missing data

**Solution:**

1. Stop the server
2. Back up the current database
3. Run an integrity check:

   ```bash
   sqlite3 db.sqlite "PRAGMA integrity_check;"
   ```

4. If corrupted, restore from backup or delete the database and restart
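The backup and check steps can be wrapped in a small helper that always copies the file before inspecting it. This is a sketch: `db_check` is a hypothetical helper, and `db.sqlite` is an assumed filename; substitute your server's actual database path.

```bash
# Sketch: back up, then integrity-check, the server database.
db_check() {
    local db="${1:-db.sqlite}"
    [ -f "$db" ] || { echo "no database at $db"; return 1; }

    # Always copy the file before touching a possibly-corrupt database
    cp "$db" "$db.bak.$(date +%Y%m%d%H%M%S)"

    # Prints "ok" for a healthy database; anything else lists the damage
    sqlite3 "$db" "PRAGMA integrity_check;"
}
```

Run it only after stopping the server, so the copy is consistent.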
### Out of Disk Space

**Symptom:** "disk full" errors

**Solution:**

1. Check disk usage:

   ```bash
   df -h
   ```

2. Clean old jobs:

   ```bash
   # Reduce MAX_AGE and restart
   export MAX_AGE=3600  # 1 hour
   ```

3. Manually clean the data directory
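For the manual cleanup step, old results can be archived out of the data directory before deletion. This is a sketch: `archive_old` is a hypothetical helper, and `/opt/data` and the archive name are assumptions.

```bash
# Sketch: archive files untouched for N days so they can then be deleted.
archive_old() {
    local data_dir="${1:-/opt/data}" days="${2:-7}" out="${3:-old-results.tar.gz}"

    # Collect files not modified in the last $days days into one archive;
    # --null -T - reads find's NUL-separated list, so odd filenames survive
    find "$data_dir" -type f -mtime +"$days" -print0 \
        | tar czf "$out" --null -T -
    echo "archived to $out"
}
```

This relies on GNU or BSD tar for `--null -T -`; verify the archive before deleting anything.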
## Docker Issues

### Container Exits Immediately

Check the logs:

```bash
docker logs container_name
```

Common causes:

- Missing environment variables
- Port conflicts
- Permission issues

### Cannot Connect Between Containers

Ensure the containers share a network:

```yaml
services:
  server:
    networks:
      - app-network
  client:
    networks:
      - app-network

networks:
  app-network:
```

Use service names, not localhost:

```bash
# Wrong
SERVICE_EXAMPLE_UPLOAD_URL=http://localhost:9000/submit

# Correct
SERVICE_EXAMPLE_UPLOAD_URL=http://client:9000/submit
```
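Putting the pieces together, a minimal compose file might look like the following. This is a sketch: the image names, port, and volume path are assumptions, not the project's published defaults.

```yaml
# docker-compose.yml sketch; adjust image names, ports, and paths
services:
  server:
    image: job-orchestrator   # assumed image name
    ports:
      - "5000:5000"
    environment:
      # Service names, not localhost, so containers can reach each other
      SERVICE_EXAMPLE_UPLOAD_URL: http://client:9000/submit
      SERVICE_EXAMPLE_DOWNLOAD_URL: http://client:9000/retrieve
    volumes:
      - ./data:/opt/data
    networks:
      - app-network

  client:
    image: job-orchestrator-client   # assumed image name
    networks:
      - app-network

networks:
  app-network:
```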
### Volume Permission Issues

**Symptom:** Permission denied when writing to volumes

**Solution:**

```yaml
services:
  server:
    user: "1000:1000"  # Match host user
    volumes:
      - ./data:/opt/data
```

Or fix the permissions on the host:

```bash
sudo chown -R 1000:1000 ./data
```
## Performance Issues

### Slow Job Processing

**Possible Causes:**

- **Slow database**

  - Use SSD storage for the database
  - Run `VACUUM` periodically

- **Network latency**

  - Place the server and clients on the same network
  - Check for packet loss

- **Client overloaded**

  - Add more clients
  - Reduce `RUNS_PER_USER`
### High Memory Usage

**Server:**

- Memory grows with job count
- Clean old jobs with a lower `MAX_AGE`

**Client:**

- The in-memory database grows with payloads
- Restart the client to clear it

### Disk Usage Growing

Check:

```bash
du -sh /opt/data/*
```

Solutions:

- Reduce `MAX_AGE`
- Increase cleanup frequency
- Archive old results externally
## Getting Help

If you can't resolve an issue:

- Check logs for specific error messages
- Search existing issues: GitHub Issues
- Open a new issue with:
  - Version
  - Configuration
  - Steps to reproduce
  - Logs