Job Lifecycle
Understanding the job lifecycle is essential for working with job-orchestrator effectively.
Lifecycle Sequence
sequenceDiagram
participant User
participant Server
participant Client
participant Executor
User->>Server: POST /upload (files, user_id, service)
Server->>Server: Store job (status: Queued)
Server-->>User: Job ID
Note over Server: Sender task (500ms interval)
Server->>Server: Update status: Processing
Server->>Client: POST /submit (job files)
Client->>Client: Store payload (status: Prepared)
Client-->>Server: Payload ID
Server->>Server: Update status: Submitted
Note over Client: Runner task (500ms interval)
Client->>Executor: Execute run.sh
Executor->>Executor: Process files
Executor-->>Client: Exit code
Client->>Client: Update status: Completed
Note over Server: Getter task (500ms interval)
Server->>Client: GET /retrieve/:id
Client-->>Server: ZIP results
Server->>Server: Store results, status: Completed
User->>Server: GET /download/:id
Server-->>User: results.zip
Note over Server: Cleaner task (60s interval)
Server->>Server: Remove jobs older than MAX_AGE
Job States
stateDiagram-v2
[*] --> Queued: Job submitted
Queued --> Processing: Sender picks up job
Processing --> Submitted: Sent to client
Processing --> Failed: Client unreachable
Submitted --> Completed: Execution successful
Submitted --> Unknown: Retrieval failed or execution failed
Unknown --> Completed: Retry successful
Completed --> Cleaned: After MAX_AGE
Failed --> Cleaned: After MAX_AGE
Unknown --> Cleaned: After MAX_AGE (if applicable)
Cleaned --> [*]
State Descriptions
| State | Description |
|---|---|
| Queued | Job received and waiting for dispatch |
| Processing | Server is sending job to a client |
| Submitted | Job successfully sent to client, awaiting execution |
| Completed | Job finished successfully, results available |
| Failed | Job failed permanently (client unreachable, execution error) |
| Unknown | Temporary state entered when retrieval fails; the server retries on later Getter cycles |
| Cleaned | Job data removed after retention period |
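The transitions in the state diagram can be written down as a lookup table, which is handy when reasoning about which status changes are legal. This is only an illustrative sketch: the names TRANSITIONS and can_transition are hypothetical and are not part of job-orchestrator.

```python
# Allowed transitions, copied from the state diagram above.
# TRANSITIONS and can_transition are hypothetical names for illustration.
TRANSITIONS = {
    "Queued": {"Processing"},
    "Processing": {"Submitted", "Failed"},
    "Submitted": {"Completed", "Unknown"},
    "Unknown": {"Completed", "Cleaned"},
    "Completed": {"Cleaned"},
    "Failed": {"Cleaned"},
    "Cleaned": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())
```

For example, can_transition("Queued", "Processing") is true, while can_transition("Queued", "Completed") is false because a job must first be dispatched and submitted.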
Lifecycle Stages
1. Submission
User uploads files via POST /upload with:
- One or more files (including run.sh)
- user_id - identifies the submitting user
- service - which backend should process this job
The server:
- Validates the request
- Creates a unique directory for the job
- Stores all uploaded files
- Creates a database record with status Queued
- Returns the job ID to the user
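A submission request might be assembled as in the sketch below, using only the Python standard library. Several details are assumptions rather than documented behavior: that the endpoint accepts multipart/form-data, that files are sent under a form field named "file", and the helper name build_upload_request itself. Only the /upload path and the user_id and service fields come from the description above.

```python
import uuid
import urllib.request


def build_upload_request(base_url, files, user_id, service):
    """Build (but do not send) a multipart POST /upload request.

    files maps a filename to its bytes. The multipart encoding and the
    form field name "file" are assumptions; user_id and service are the
    fields named in the text.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in (("user_id", str(user_id)), ("service", service)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    for filename, content in files.items():
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        )
        parts.append(header.encode() + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/upload",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

The returned request can be sent with urllib.request.urlopen; the response should carry the job ID, in whatever format the server uses.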
2. Queuing & Quota Check
The Sender background task (runs every 500ms):
- Finds jobs in Queued status
- Checks if user has available quota for the service
- If quota available, marks job as Processing
- If quota exceeded, job remains Queued
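The quota check can be pictured as one pass over the queued jobs. The sketch below works on in-memory dictionaries instead of a database, and it assumes that quota means a cap on concurrently active jobs per user and service, with Processing and Submitted jobs counted as active. The real quota rules may differ; sender_cycle is a hypothetical name.

```python
from collections import Counter


def sender_cycle(jobs, quota):
    """One pass of a Sender-like loop over in-memory job dicts.

    jobs: list of dicts with "user_id", "service", and "status".
    quota: max concurrently active jobs per (user_id, service) pair.
    Counting Processing and Submitted jobs as "active" is an assumption.
    """
    active = Counter(
        (j["user_id"], j["service"])
        for j in jobs
        if j["status"] in ("Processing", "Submitted")
    )
    for job in jobs:
        if job["status"] != "Queued":
            continue
        key = (job["user_id"], job["service"])
        if active[key] < quota:
            job["status"] = "Processing"  # quota available: dispatch
            active[key] += 1
        # otherwise the job stays Queued until a later cycle
    return jobs
```

With a quota of 2, a user who has three queued jobs for the same service sees two move to Processing while the third waits for a later cycle.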
3. Distribution
For jobs in Processing status:
- Server packages job files
- Sends to configured client via POST /submit
- On success: updates status to Submitted, stores client's payload ID
- On failure: updates status to Failed
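The success and failure branches of distribution can be sketched as follows. The network call is replaced by a caller-supplied submit function, so the example stays self-contained; distribute and the "payload_id" key are hypothetical names.

```python
def distribute(job, submit):
    """Send one Processing job to a client and record the outcome.

    submit is a caller-supplied function standing in for POST /submit; it
    returns the client's payload ID or raises OSError if unreachable.
    """
    try:
        payload_id = submit(job)
    except OSError:
        job["status"] = "Failed"  # no automatic retry
        return job
    job["status"] = "Submitted"
    job["payload_id"] = payload_id
    return job
```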
4. Execution
On the client side:
- Runner task finds payloads in Prepared status
- Executes run.sh in the job directory
- Captures exit code and any output files
- Updates payload status to Completed (or Failed on error)
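The execution step amounts to running a script in a directory and mapping its exit code to a status. A minimal sketch, assuming the script is invoked through sh (the actual runner may launch it differently); run_payload is a hypothetical name.

```python
import subprocess
from pathlib import Path


def run_payload(job_dir):
    """Execute run.sh inside job_dir and map the exit code to a status.

    Invoking the script through "sh" is an assumption; the real runner may
    execute it differently.
    """
    result = subprocess.run(
        ["sh", "run.sh"],
        cwd=Path(job_dir),
        capture_output=True,
        text=True,
    )
    status = "Completed" if result.returncode == 0 else "Failed"
    return status, result.returncode, result.stdout
```

Because the working directory is the job directory, any files that run.sh writes with relative paths end up next to the inputs, where they can be collected as results.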
5. Retrieval
The Getter background task (runs every 500ms):
- Finds jobs in Submitted status
- Requests results from client via GET /retrieve/:id
- Downloads and stores the result ZIP
- Updates job status to Completed
6. Download
User can now:
- Check status via GET /download/:id (returns JSON with job state when not completed)
- Download results via GET /download/:id (returns ZIP when completed)
- Results are returned as a ZIP archive
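Since the same endpoint returns either JSON or a ZIP archive, a client has to tell the two apart. The sketch below does so by inspecting the response body; whether the server also signals this through a status code or header is not stated here, so detecting by content is an assumption. parse_download_response is a hypothetical helper.

```python
import io
import json
import zipfile


def parse_download_response(body: bytes):
    """Classify the body returned by GET /download/:id.

    Returns ("results", ZipFile) for a ZIP archive, otherwise
    ("status", parsed_json). Detecting the ZIP by content rather than by
    a response header is an assumption made for this sketch.
    """
    buffer = io.BytesIO(body)
    if zipfile.is_zipfile(buffer):
        return "results", zipfile.ZipFile(buffer)
    return "status", json.loads(body)
```

A polling client could call this on each response and stop as soon as the first element is "results".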
7. Cleanup
The Cleaner background task (runs every 60s):
- Finds jobs older than MAX_AGE
- Deletes job files from filesystem
- Updates status to Cleaned or removes record
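The cleanup pass can be sketched as an age check followed by a directory removal. The job fields "created_at" and "path" are hypothetical, as is the choice to mark the record Cleaned instead of deleting it.

```python
import shutil
import time


def clean_expired(jobs, max_age_seconds, now=None):
    """Delete files of jobs older than max_age_seconds and mark them Cleaned.

    jobs: list of dicts with "created_at" (Unix seconds), "path", "status".
    These field names are hypothetical.
    """
    now = time.time() if now is None else now
    for job in jobs:
        if job["status"] == "Cleaned":
            continue
        if now - job["created_at"] > max_age_seconds:
            # Remove the job directory; a missing directory is not an error.
            shutil.rmtree(job["path"], ignore_errors=True)
            job["status"] = "Cleaned"
    return jobs
```

The optional now argument exists only to make the function easy to test with fixed timestamps.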
Error Handling
Client Unreachable
If the server cannot reach a client during distribution:
- Job status changes to Failed
- Job will not be retried automatically
- User can resubmit if needed
Execution Failure
If run.sh exits with non-zero code:
- Payload status changes to Failed
- Server retrieves whatever output exists
- Job status reflects the failure
Retrieval Failure
If the server cannot retrieve results:
- Job status changes to Unknown
- Server will retry on subsequent Getter cycles
- Eventually succeeds or times out
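The retry behavior can be modeled as a Getter pass that polls both Submitted and Unknown jobs. This is a sketch under assumptions: retrieval is replaced by a caller-supplied function, and the timeout policy is left out because it is not specified here. getter_cycle and the dictionary keys are hypothetical names.

```python
def getter_cycle(jobs, retrieve):
    """One pass of a Getter-like loop over in-memory job dicts.

    retrieve stands in for GET /retrieve/:id: it returns the result bytes
    or raises OSError. Re-polling Unknown jobs on every cycle is how this
    sketch models the retry; the real retry and timeout policy may differ.
    """
    for job in jobs:
        if job["status"] not in ("Submitted", "Unknown"):
            continue
        try:
            job["results"] = retrieve(job["payload_id"])
        except OSError:
            job["status"] = "Unknown"  # picked up again next cycle
        else:
            job["status"] = "Completed"
    return jobs
```

A job whose first retrieval attempt fails moves to Unknown, and a later successful cycle moves it to Completed.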
Timing Considerations
| Event | Typical Duration |
|---|---|
| Upload to Queued | Immediate |
| Queued to Processing | Up to 500ms (+ quota wait) |
| Processing to Submitted | Depends on file size and network |
| Submitted to Completed | Depends on job execution time |
| Completed to Cleaned | Configured via MAX_AGE (default: 48 hours) |
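The polling intervals give a rough upper bound on the latency the orchestrator itself adds to a job. The arithmetic below assumes each of the three 500ms tasks (Sender, Runner, Getter) contributes at most one full interval of waiting, and that the job is never held back by quota; it ignores the time each task spends doing its work.

```python
# Polling intervals taken from the text, in seconds.
SENDER_INTERVAL = 0.5
RUNNER_INTERVAL = 0.5
GETTER_INTERVAL = 0.5


def max_polling_overhead():
    """Upper bound on time spent waiting for the three polling tasks.

    Assumes each task adds at most one full interval of waiting and that
    the job is never held back by quota.
    """
    return SENDER_INTERVAL + RUNNER_INTERVAL + GETTER_INTERVAL


def end_to_end_estimate(transfer_seconds, execution_seconds):
    """Worst-case submit-to-Completed time under the assumptions above."""
    return max_polling_overhead() + transfer_seconds + execution_seconds
```

Under this model, a job with 2 seconds of total transfer time and 10 seconds of execution would reach Completed within about 13.5 seconds, of which at most 1.5 seconds is polling overhead.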