# 08 - Scheduling & Cron

Concurrency control and periodic task execution. The scheduler provides lane-based isolation and per-session serialization. Cron extends the agent loop with time-triggered behavior.

> Cron jobs and run logs are stored in the `cron_jobs` and `cron_run_logs` PostgreSQL tables. Cache invalidation propagates via the `cache:cron` event on the message bus.

### Responsibilities

- Scheduler: lane-based concurrency control, per-session message queue serialization
- Cron: three schedule kinds (at/every/cron), run logging, retry with exponential backoff

---

## 1. Scheduler Lanes

Named worker pools (semaphore-based) with configurable concurrency limits. Each lane processes requests independently. Unknown lane names fall back to the `main` lane.

```mermaid
flowchart TD
    subgraph "Lane: main (concurrency = 30)"
        M1["User chat 1"]
        M2["User chat 2"]
        M3["..."]
    end

    subgraph "Lane: subagent (concurrency = 50)"
        S1["Subagent 1"]
        S2["Subagent 2"]
        S3["..."]
    end

    subgraph "Lane: team (concurrency = 100)"
        D1["Delegation 1"]
        D2["Delegation 2"]
        D3["..."]
    end

    subgraph "Lane: cron (concurrency = 30)"
        C1["Cron job 1"]
        C2["Cron job 2"]
        C3["..."]
    end

    REQ["Incoming request"] --> SCHED["Scheduler.Schedule(ctx, lane, req)"]
    SCHED --> QUEUE["getOrCreateSession(sessionKey, lane)"]
    QUEUE --> SQ["SessionQueue.Enqueue()"]
    SQ --> LANE["Lane.Submit(fn)"]
```

### Lane Defaults

| Lane | Concurrency | Env Override | Purpose |
|------|:-----------:|-------------|---------|
| `main` | 30 | `GOCLAW_LANE_MAIN` | Primary user chat sessions |
| `subagent` | 50 | `GOCLAW_LANE_SUBAGENT` | Sub-agents spawned by the main agent |
| `team` | 100 | `GOCLAW_LANE_TEAM` | Agent team/delegation executions |
| `cron` | 30 | `GOCLAW_LANE_CRON` | Scheduled cron jobs (per-session serialization prevents same-job races) |

`GetOrCreate()` allows creating new lanes on demand with custom concurrency. All lane concurrency values are configurable via environment variables.

---

## 2. Session Queue

Each session key gets a dedicated queue that manages agent runs. The queue supports configurable concurrent runs per session and adaptive throttling.

### Concurrent Runs

The scheduler configuration defines a default `MaxConcurrent` value (typically 1 for serial execution). Per-request overrides are available via `ScheduleWithOpts()`:

| Context | `maxConcurrent` | Rationale |
|---------|:--------------:|-----------|
| DMs | 1 | Single-threaded per user (no interleaving) |
| Groups | 3+ | Multiple users can get responses in parallel |

Application code (not the scheduler) decides whether to override based on channel type.

**Adaptive throttle**: When session history exceeds 60% of the context window, concurrency automatically drops to 1 to prevent context window overflow. Controlled by optional `TokenEstimateFunc` callback set on the scheduler.

### Queue Modes

| Mode | Behavior |
|------|----------|
| `queue` (default) | FIFO -- messages wait until a run slot is available |
| `followup` | Same as `queue` -- messages are queued as follow-ups |
| `interrupt` | Cancel the active run, drain the queue, start the new message immediately |

### Drop Policies

When the queue reaches capacity, one of two drop policies applies.

| Policy | When Queue Is Full | Error Returned |
|--------|-------------------|----------------|
| `old` (default) | Drop the oldest queued message, add the new one | `ErrQueueDropped` |
| `new` | Reject the incoming message | `ErrQueueFull` |

### Queue Config Defaults

| Parameter | Default | Description |
|-----------|---------|-------------|
| `mode` | `queue` | Queue mode (queue, followup, interrupt) |
| `cap` | 10 | Maximum messages in the queue |
| `drop` | `old` | Drop policy when full (old or new) |
| `debounce_ms` | 800 | Collapse rapid messages within this window |

---

## 3. /stop and /stopall Commands

Cancel commands for Telegram and other channels.

| Command | Behavior |
|---------|----------|
| `/stop` | Cancel the oldest running task; others keep going |
| `/stopall` | Cancel all running tasks + drain the queue |

### Implementation Details

- **Debouncer bypass**: `/stop` and `/stopall` are intercepted before the 800ms debouncer to avoid being merged with the next user message
- **Cancel mechanism**: `SessionQueue.CancelOne()` (for `/stop`) and `SessionQueue.CancelAll()` (for `/stopall`) expose the cancel functions. Context cancellation propagates to the agent loop
- **Stale message skipping**: `/stopall` sets an abort cutoff timestamp. Messages enqueued before the cutoff are skipped on next scheduling, preventing old messages from running after an abort
- **Empty outbound**: On cancel, an empty outbound message is published to trigger cleanup (stop typing indicator, clear reactions)
- **Trace finalization**: When `ctx.Err() != nil`, trace finalization falls back to `context.Background()` for the final DB write. Status is set to `"cancelled"`
- **Context survival**: Context values (traceID, collector) survive cancellation -- only the Done channel fires
- **Generation counter**: Each `SessionQueue` tracks a generation counter. When reset (e.g., during SIGUSR1 in-process restart), old generations are ignored, preventing stale completions from interfering with new requests

---

## 4. Adaptive Concurrency Control

The scheduler can automatically reduce concurrency based on token usage. When a session's context history approaches the summary threshold (60% of context window), the effective `MaxConcurrent` is reduced to 1, enforcing serial execution to prevent overflow.

**Implementation:**
- Set via `Scheduler.SetTokenEstimateFunc(fn TokenEstimateFunc)`
- `TokenEstimateFunc` returns `(tokens int, contextWindow int)` for a session
- Checked in `SessionQueue.effectiveMaxConcurrent()` before starting new runs
- Does not affect already-running tasks, only gates new task starts

---

## 5. Cron Lifecycle

Scheduled tasks that run agent turns automatically. The run loop checks every second for due jobs.

```mermaid
stateDiagram-v2
    [*] --> Created: AddJob()
    Created --> Scheduled: Compute nextRunAtMS
    Scheduled --> DueCheck: runLoop (every 1s)
    DueCheck --> Scheduled: Not yet due
    DueCheck --> Executing: nextRunAtMS <= now
    Executing --> Completed: Success
    Executing --> Failed: Failure
    Failed --> Retrying: retry < MaxRetries (0-3)
    Retrying --> Executing: Backoff delay (2s to 30s)
    Failed --> ErrorLogged: Retries exhausted
    Completed --> Scheduled: Compute next nextRunAtMS (every/cron)
    Completed --> Deleted: deleteAfterRun (at jobs)
    Scheduled --> Paused: Paused via EnableJob(false)
    Paused --> Scheduled: Re-enabled via EnableJob(true)
```

### Schedule Types

| Type | Parameter | Example |
|------|-----------|---------|
| `at` | `atMs` (epoch ms) | Reminder at 3PM tomorrow, auto-deleted after execution |
| `every` | `everyMs` | Every 30 minutes (1,800,000 ms) |
| `cron` | `expr` (5-field) | `"0 9 * * 1-5"` (9AM on weekdays) |

### Job States

Jobs have an `Enabled` boolean flag. When `false`, the job is skipped during the due-job check. When re-enabled, the next run is recomputed. Run results are logged in-memory (last 200 entries) and persisted to the PostgreSQL `cron_run_logs` table. Job state changes propagate via the message bus cache invalidation (`cache:cron` event).

### Retry -- Exponential Backoff with Jitter

When a cron job execution fails, it's automatically retried with exponential backoff before being logged as an error.

| Parameter | Default |
|-----------|---------|
| MaxRetries | 3 |
| BaseDelay | 2 seconds |
| MaxDelay | 30 seconds |

**Formula**: `delay = min(base × 2^attempt, max) ± 25% jitter`

Example retry sequence: fail → wait 2s → retry → fail → wait 4s → retry → fail → wait 8s → retry → fail → wait 16s → stop.

Retries are transparent to the user; final run status (ok or error) is logged to the `cron_run_logs` table.

---

## File Reference

### Scheduler (Lane-Based Concurrency)
| File | Description |
|------|-------------|
| `internal/scheduler/lanes.go` | Lane and LaneManager (semaphore-based worker pools) |
| `internal/scheduler/queue.go` | SessionQueue, Scheduler, drop policies, debounce, cancel mechanics |
| `internal/scheduler/scheduler.go` | Scheduler top-level API, draining mode for graceful shutdown |
| `internal/scheduler/errors.go` | Error types: ErrQueueFull, ErrQueueDropped, ErrMessageStale, ErrGatewayDraining, ErrLaneCleared |

### Cron Service (In-Memory)
| File | Description |
|------|-------------|
| `internal/cron/service.go` | Cron service lifecycle (start/stop), job CRUD |
| `internal/cron/service_execution.go` | Run loop (every 1s), job execution, schedule parsing, persistence |
| `internal/cron/retry.go` | Retry with exponential backoff + jitter, output truncation |
| `internal/cron/types.go` | Job, Schedule, JobState, RunLogEntry types |

### Cron Persistence (PostgreSQL)
| File | Description |
|------|-------------|
| `internal/store/cron_store.go` | CronStore interface (jobs + run logs) |
| `internal/store/pg/cron.go` | PostgreSQL cron operations (create, list, update, delete) |
| `internal/store/pg/cron_crud.go` | CRUD helpers for job mutations |
| `internal/store/pg/cron_scheduler.go` | PG job cache, due-job detection, execution |
| `internal/store/pg/cron_exec.go` | Execution flow and result recording |
| `internal/store/pg/cron_scan.go` | Row scanning for jobs and run logs |
| `internal/store/pg/cron_update.go` | Job state updates in PostgreSQL |

### Gateway Integration
| File | Description |
|------|-------------|
| `cmd/gateway_cron.go` | makeCronJobHandler (routes cron execution to scheduler) |
| `cmd/gateway_agents.go` | Agent initialization and run loop setup |
| `internal/gateway/methods/cron.go` | RPC method handlers (list, create, update, delete, toggle, run, runs) |

---

## Cross-References

| Document | Relevant Content |
|----------|-----------------|
| [00-architecture-overview.md](./00-architecture-overview.md) | Scheduler lanes in startup sequence |
| [01-agent-loop.md](./01-agent-loop.md) | Agent loop triggered by scheduler |
| [06-store-data-model.md](./06-store-data-model.md) | cron_jobs, cron_run_logs tables |