# 08 - Scheduling & Cron

Concurrency control and periodic task execution. The scheduler provides lane-based isolation and per-session serialization. Cron extends the agent loop with time-triggered behavior.

> Cron jobs and run logs are stored in the `cron_jobs` and `cron_run_logs` PostgreSQL tables. Cache invalidation propagates via the `cache:cron` event on the message bus.

### Responsibilities

- Scheduler: lane-based concurrency control, per-session message queue serialization
- Cron: three schedule kinds (at/every/cron), run logging, retry with exponential backoff

---

## 1. Scheduler Lanes

Named worker pools (semaphore-based) with configurable concurrency limits. Each lane processes requests independently. Unknown lane names fall back to the `main` lane.

```mermaid
flowchart TD
    subgraph "Lane: main (concurrency = 30)"
        M1["User chat 1"]
        M2["User chat 2"]
        M3["..."]
    end

    subgraph "Lane: subagent (concurrency = 50)"
        S1["Subagent 1"]
        S2["Subagent 2"]
        S3["..."]
    end

    subgraph "Lane: team (concurrency = 100)"
        D1["Delegation 1"]
        D2["Delegation 2"]
        D3["..."]
    end

    subgraph "Lane: cron (concurrency = 30)"
        C1["Cron job 1"]
        C2["Cron job 2"]
        C3["..."]
    end

    REQ["Incoming request"] --> SCHED["Scheduler.Schedule(ctx, lane, req)"]
    SCHED --> QUEUE["getOrCreateSession(sessionKey, lane)"]
    QUEUE --> SQ["SessionQueue.Enqueue()"]
    SQ --> LANE["Lane.Submit(fn)"]
```

### Lane Defaults

| Lane | Concurrency | Env Override | Purpose |
|------|:-----------:|-------------|---------|
| `main` | 30 | `GOCLAW_LANE_MAIN` | Primary user chat sessions |
| `subagent` | 50 | `GOCLAW_LANE_SUBAGENT` | Sub-agents spawned by the main agent |
| `team` | 100 | `GOCLAW_LANE_TEAM` | Agent team/delegation executions |
| `cron` | 30 | `GOCLAW_LANE_CRON` | Scheduled cron jobs (per-session serialization prevents same-job races) |

`GetOrCreate()` allows creating new lanes on demand with custom concurrency. All lane concurrency values are configurable via environment variables.

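
The lane behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `internal/scheduler/lanes.go`: the struct layouts, the `envInt` helper, the `Get` method, and the `GetOrCreate` signature are assumptions made for the example. Only the lane names, default limits, environment variable names, and the fallback to `main` come from this document.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

// Lane is a named worker pool; the buffered channel acts as a counting semaphore.
type Lane struct {
	name string
	sem  chan struct{}
}

// Submit blocks until a slot is free, then runs fn on its own goroutine.
func (l *Lane) Submit(fn func()) {
	l.sem <- struct{}{}
	go func() {
		defer func() { <-l.sem }() // release the slot when fn returns
		fn()
	}()
}

// LaneManager owns the lanes. Its shape is an assumption for this sketch.
type LaneManager struct {
	mu    sync.Mutex
	lanes map[string]*Lane
}

// envInt (hypothetical helper) returns the positive integer stored in envVar,
// or def when the variable is unset or invalid.
func envInt(envVar string, def int) int {
	if n, err := strconv.Atoi(os.Getenv(envVar)); err == nil && n > 0 {
		return n
	}
	return def
}

// GetOrCreate returns the named lane, creating it with the given limit if missing.
func (m *LaneManager) GetOrCreate(name string, concurrency int) *Lane {
	m.mu.Lock()
	defer m.mu.Unlock()
	if l, ok := m.lanes[name]; ok {
		return l
	}
	l := &Lane{name: name, sem: make(chan struct{}, concurrency)}
	m.lanes[name] = l
	return l
}

// Get returns the named lane, falling back to main for unknown names.
func (m *LaneManager) Get(name string) *Lane {
	m.mu.Lock()
	defer m.mu.Unlock()
	if l, ok := m.lanes[name]; ok {
		return l
	}
	return m.lanes["main"]
}

func newLaneManager() *LaneManager {
	m := &LaneManager{lanes: map[string]*Lane{}}
	m.GetOrCreate("main", envInt("GOCLAW_LANE_MAIN", 30))
	m.GetOrCreate("subagent", envInt("GOCLAW_LANE_SUBAGENT", 50))
	m.GetOrCreate("team", envInt("GOCLAW_LANE_TEAM", 100))
	m.GetOrCreate("cron", envInt("GOCLAW_LANE_CRON", 30))
	return m
}

func main() {
	m := newLaneManager()
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		m.Get("cron").Submit(wg.Done)
	}
	wg.Wait()
	fmt.Println(m.Get("no-such-lane").name) // prints "main"
}
```

A buffered channel is used as the semaphore here because it needs no extra dependency; the real implementation may use a different primitive.
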
---

## 2. Session Queue

Each session key gets a dedicated queue that manages agent runs. The queue supports configurable concurrent runs per session and adaptive throttling.

### Concurrent Runs

The scheduler configuration defines a default `MaxConcurrent` value (typically 1 for serial execution). Per-request overrides are available via `ScheduleWithOpts()`:

| Context | `maxConcurrent` | Rationale |
|---------|:--------------:|-----------|
| DMs | 1 | Single-threaded per user (no interleaving) |
| Groups | 3+ | Multiple users can get responses in parallel |

Application code (not the scheduler) decides whether to override based on channel type.

**Adaptive throttle**: When session history exceeds 60% of the context window, concurrency automatically drops to 1 to prevent context window overflow. This is controlled by an optional `TokenEstimateFunc` callback set on the scheduler; without the callback, no throttling is applied (see section 4).

### Queue Modes

| Mode | Behavior |
|------|----------|
| `queue` (default) | FIFO -- messages wait until a run slot is available |
| `followup` | Same as `queue` -- messages are queued as follow-ups |
| `interrupt` | Cancel the active run, drain the queue, start the new message immediately |

### Drop Policies

When the queue reaches capacity, one of two drop policies applies.

| Policy | When Queue Is Full | Error Returned |
|--------|-------------------|----------------|
| `old` (default) | Drop the oldest queued message, add the new one | `ErrQueueDropped` |
| `new` | Reject the incoming message | `ErrQueueFull` |

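
A minimal sketch of the two drop policies, assuming a slice-backed queue. The `pending` and `sessionQueue` types and the `enqueue` method are hypothetical, and the sketch assumes `ErrQueueDropped` is delivered to the waiter of the evicted message while `ErrQueueFull` is returned to the caller of the rejected one; the document only names the errors, not who receives them.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrQueueFull    = errors.New("queue full")
	ErrQueueDropped = errors.New("queue dropped")
)

// pending is a queued message plus a channel that reports its outcome.
type pending struct {
	text string
	done chan error // buffered so eviction never blocks
}

// sessionQueue is a hypothetical, non-thread-safe stand-in for the real queue.
type sessionQueue struct {
	cap   int
	drop  string // "old" or "new"
	items []*pending
}

// enqueue appends a message, applying the drop policy when the queue is full.
func (q *sessionQueue) enqueue(text string) (*pending, error) {
	if len(q.items) >= q.cap {
		if q.drop == "new" {
			return nil, ErrQueueFull // reject the incoming message
		}
		// "old": evict the oldest queued message and notify its waiter.
		oldest := q.items[0]
		q.items = q.items[1:]
		oldest.done <- ErrQueueDropped
	}
	p := &pending{text: text, done: make(chan error, 1)}
	q.items = append(q.items, p)
	return p, nil
}

func main() {
	q := &sessionQueue{cap: 2, drop: "old"}
	first, _ := q.enqueue("a")
	q.enqueue("b")
	q.enqueue("c") // queue is full, so "a" is evicted
	fmt.Println(<-first.done, len(q.items))

	strict := &sessionQueue{cap: 1, drop: "new"}
	strict.enqueue("a")
	_, err := strict.enqueue("b")
	fmt.Println(err)
}
```
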
### Queue Config Defaults

| Parameter | Default | Description |
|-----------|---------|-------------|
| `mode` | `queue` | Queue mode (queue, followup, interrupt) |
| `cap` | 10 | Maximum messages in the queue |
| `drop` | `old` | Drop policy when full (old or new) |
| `debounce_ms` | 800 | Collapse rapid messages within this window |

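
To illustrate what `debounce_ms` does, the sketch below collapses messages that arrive within the window of the previous one. It is a timer-free model for clarity, not the actual debouncer: the `message` type, the `debounce` function, and the choice to join texts with a newline are all assumptions. Messages are assumed to be sorted by arrival time.

```go
package main

import "fmt"

// message is a hypothetical inbound message with its arrival time in epoch ms.
type message struct {
	AtMs int64
	Text string
}

// debounce merges each message into the previous batch when it arrives
// within windowMs of that batch's most recent message.
func debounce(msgs []message, windowMs int64) []message {
	var out []message
	for _, m := range msgs {
		if n := len(out); n > 0 && m.AtMs-out[n-1].AtMs <= windowMs {
			out[n-1].Text += "\n" + m.Text // collapse into the open batch
			out[n-1].AtMs = m.AtMs         // the window restarts at the latest message
			continue
		}
		out = append(out, m)
	}
	return out
}

func main() {
	in := []message{{0, "hi"}, {300, "one more thing"}, {900, "and this"}, {3000, "new topic"}}
	for _, b := range debounce(in, 800) {
		fmt.Printf("%q\n", b.Text)
	}
}
```

With the default 800 ms window, the first three messages form one batch and the fourth starts a new one.
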
---

## 3. /stop and /stopall Commands

Cancel commands for Telegram and other channels.

| Command | Behavior |
|---------|----------|
| `/stop` | Cancel the oldest running task; others keep going |
| `/stopall` | Cancel all running tasks + drain the queue |

### Implementation Details

- **Debouncer bypass**: `/stop` and `/stopall` are intercepted before the 800ms debouncer to avoid being merged with the next user message
- **Cancel mechanism**: `SessionQueue.CancelOne()` (for `/stop`) and `SessionQueue.CancelAll()` (for `/stopall`) expose the cancel functions. Context cancellation propagates to the agent loop
- **Stale message skipping**: `/stopall` sets an abort cutoff timestamp. Messages enqueued before the cutoff are skipped on next scheduling, preventing old messages from running after an abort
- **Empty outbound**: On cancel, an empty outbound message is published to trigger cleanup (stop typing indicator, clear reactions)
- **Trace finalization**: When `ctx.Err() != nil`, trace finalization falls back to `context.Background()` for the final DB write. Status is set to `"cancelled"`
- **Context survival**: Context values (traceID, collector) survive cancellation -- only the Done channel fires
- **Generation counter**: Each `SessionQueue` tracks a generation counter. When reset (e.g., during SIGUSR1 in-process restart), old generations are ignored, preventing stale completions from interfering with new requests

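
The cancel mechanics above rely on standard `context` behavior, which the following sketch demonstrates. The method names `CancelOne` and `CancelAll` come from this document, but their signatures, the `sessionQueue` layout, and the helpers `start`, `newTracedContext`, and `finalizeCtx` are assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type traceKey struct{}

// sessionQueue (hypothetical layout) keeps the cancel funcs of active runs, oldest first.
type sessionQueue struct {
	mu      sync.Mutex
	cancels []context.CancelFunc
}

// newTracedContext returns a root context carrying a trace ID value.
func newTracedContext(traceID string) context.Context {
	return context.WithValue(context.Background(), traceKey{}, traceID)
}

// start registers a new run and returns its cancellable context.
func (q *sessionQueue) start(parent context.Context) context.Context {
	ctx, cancel := context.WithCancel(parent)
	q.mu.Lock()
	q.cancels = append(q.cancels, cancel)
	q.mu.Unlock()
	return ctx
}

// CancelOne cancels the oldest active run and reports whether one existed.
func (q *sessionQueue) CancelOne() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.cancels) == 0 {
		return false
	}
	q.cancels[0]()
	q.cancels = q.cancels[1:]
	return true
}

// CancelAll cancels every active run and returns how many were cancelled.
func (q *sessionQueue) CancelAll() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	n := len(q.cancels)
	for _, cancel := range q.cancels {
		cancel()
	}
	q.cancels = nil
	return n
}

// finalizeCtx picks the context for the final trace write: a cancelled
// context cannot be used for DB I/O, so fall back to context.Background().
func finalizeCtx(ctx context.Context) context.Context {
	if ctx.Err() != nil {
		return context.Background()
	}
	return ctx
}

func main() {
	q := &sessionQueue{}
	base := newTracedContext("trace-1")
	first, second := q.start(base), q.start(base)

	q.CancelOne()                               // like /stop: oldest run only
	fmt.Println(first.Err(), second.Err())      // context canceled <nil>
	fmt.Println(first.Value(traceKey{}))        // trace-1 (values survive cancellation)
	fmt.Println(finalizeCtx(first).Err() == nil) // true
	fmt.Println(q.CancelAll())                  // 1 (like /stopall)
}
```
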
---

## 4. Adaptive Concurrency Control

The scheduler can automatically reduce concurrency based on token usage. When a session's context history approaches the summary threshold (60% of context window), the effective `MaxConcurrent` is reduced to 1, enforcing serial execution to prevent overflow.

**Implementation:**
- Set via `Scheduler.SetTokenEstimateFunc(fn TokenEstimateFunc)`
- `TokenEstimateFunc` returns `(tokens int, contextWindow int)` for a session
- Checked in `SessionQueue.effectiveMaxConcurrent()` before starting new runs
- Does not affect already-running tasks, only gates new task starts

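
The gating rule can be sketched as a pure function. The return values of `TokenEstimateFunc` follow the description above, but its `sessionKey` parameter, the free-function form of `effectiveMaxConcurrent`, and the strict "greater than 60%" comparison are assumptions; the real check lives on `SessionQueue`.

```go
package main

import "fmt"

// TokenEstimateFunc reports the estimated history size and the model's
// context window for a session. The sessionKey parameter is assumed.
type TokenEstimateFunc func(sessionKey string) (tokens int, contextWindow int)

// thresholdPercent is the summary threshold as a percentage of the context window.
const thresholdPercent = 60

// effectiveMaxConcurrent returns the concurrency limit to apply to new runs.
// Integer math avoids floating-point rounding at the exact threshold.
func effectiveMaxConcurrent(sessionKey string, maxConcurrent int, estimate TokenEstimateFunc) int {
	if estimate == nil || maxConcurrent <= 1 {
		return maxConcurrent // no callback set, or already serial
	}
	tokens, window := estimate(sessionKey)
	if window > 0 && tokens*100 > window*thresholdPercent {
		return 1 // history is past the threshold: serialize new runs
	}
	return maxConcurrent
}

func main() {
	usage := map[string]int{"group:busy": 130000, "group:quiet": 50000}
	estimate := func(key string) (int, int) { return usage[key], 200000 }

	fmt.Println(effectiveMaxConcurrent("group:busy", 3, estimate))  // 1
	fmt.Println(effectiveMaxConcurrent("group:quiet", 3, estimate)) // 3
	fmt.Println(effectiveMaxConcurrent("group:busy", 3, nil))       // 3
}
```
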
---

## 5. Cron Lifecycle

Scheduled tasks that run agent turns automatically. The run loop checks every second for due jobs.

```mermaid
stateDiagram-v2
    [*] --> Created: AddJob()
    Created --> Scheduled: Compute nextRunAtMS
    Scheduled --> DueCheck: runLoop (every 1s)
    DueCheck --> Scheduled: Not yet due
    DueCheck --> Executing: nextRunAtMS <= now
    Executing --> Completed: Success
    Executing --> Failed: Failure
    Failed --> Retrying: retry < MaxRetries (0-3)
    Retrying --> Executing: Backoff delay (2s to 30s)
    Failed --> ErrorLogged: Retries exhausted
    Completed --> Scheduled: Compute next nextRunAtMS (every/cron)
    Completed --> Deleted: deleteAfterRun (at jobs)
    Scheduled --> Paused: Paused via EnableJob(false)
    Paused --> Scheduled: Re-enabled via EnableJob(true)
```

### Schedule Types

| Type | Parameter | Example |
|------|-----------|---------|
| `at` | `atMs` (epoch ms) | Reminder at 3PM tomorrow, auto-deleted after execution |
| `every` | `everyMs` | Every 30 minutes (1,800,000 ms) |
| `cron` | `expr` (5-field) | `"0 9 * * 1-5"` (9AM on weekdays) |

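
As a rough illustration of how the next run time could be derived for the `at` and `every` kinds, consider the sketch below. The `Schedule` struct and `nextRunAtMs` function are hypothetical, the treatment of an `at` time that is already in the past is an assumption, and the `cron` kind is deliberately left out because it needs a 5-field expression parser.

```go
package main

import "fmt"

// Schedule mirrors the three schedule kinds; field names follow the
// parameters in the table above, the struct itself is an assumption.
type Schedule struct {
	Kind    string // "at", "every", or "cron"
	AtMs    int64  // absolute time in epoch ms (kind "at")
	EveryMs int64  // interval in ms (kind "every")
	Expr    string // 5-field cron expression (kind "cron")
}

// nextRunAtMs returns the next run time in epoch ms and whether one exists.
func nextRunAtMs(s Schedule, nowMs int64) (int64, bool) {
	switch s.Kind {
	case "at":
		if s.AtMs > nowMs {
			return s.AtMs, true
		}
		return 0, false // assumed: a one-shot time in the past never fires
	case "every":
		if s.EveryMs <= 0 {
			return 0, false
		}
		return nowMs + s.EveryMs, true
	default:
		// "cron" requires an expression parser and is omitted from this sketch.
		return 0, false
	}
}

func main() {
	now := int64(1_000_000)
	next, ok := nextRunAtMs(Schedule{Kind: "every", EveryMs: 1_800_000}, now)
	fmt.Println(next, ok) // 2800000 true

	next, ok = nextRunAtMs(Schedule{Kind: "at", AtMs: 500_000}, now)
	fmt.Println(next, ok) // 0 false
}
```
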
### Job States

Jobs have an `Enabled` boolean flag. When `false`, the job is skipped during the due-job check. When re-enabled, the next run is recomputed. Run results are logged in-memory (last 200 entries) and persisted to the PostgreSQL `cron_run_logs` table. Job state changes propagate via the message bus cache invalidation (`cache:cron` event).

### Retry -- Exponential Backoff with Jitter

When a cron job execution fails, it's automatically retried with exponential backoff before being logged as an error.

| Parameter | Default |
|-----------|---------|
| MaxRetries | 3 |
| BaseDelay | 2 seconds |
| MaxDelay | 30 seconds |

**Formula**: `delay = min(base × 2^attempt, max) ± 25% jitter`, where `attempt` starts at 0 for the first retry.

Example retry sequence (delays shown before jitter): fail → wait 2s → retry → fail → wait 4s → retry → fail → wait 8s → retry → fail → stop. With the default `MaxRetries` of 3, the job runs at most four times and the third retry is the last, so the 30-second cap is not reached.

Retries are transparent to the user; final run status (ok or error) is logged to the `cron_run_logs` table.

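
The formula and retry loop can be sketched as follows, using the defaults from the table. This is an illustration under stated assumptions, not the code in `internal/cron/retry.go`: the function names, the injected `sleep` parameter, and the uniform jitter distribution are all choices made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

const (
	maxRetries = 3
	baseDelay  = 2 * time.Second
	maxDelay   = 30 * time.Second
)

var errAlwaysFails = errors.New("job failed")

// backoffDelay returns min(base * 2^attempt, max); attempt is 0 for the first retry.
func backoffDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withJitter scales d by a factor in [0.75, 1.25) for r in [0, 1).
func withJitter(d time.Duration, r float64) time.Duration {
	return time.Duration(float64(d) * (0.75 + 0.5*r))
}

// noSleep lets callers run the loop without actually waiting.
func noSleep(time.Duration) {}

// runWithRetry runs job once plus up to maxRetries retries.
// It returns the number of executions and the final error, if any.
func runWithRetry(job func() error, sleep func(time.Duration)) (int, error) {
	for attempt := 0; ; attempt++ {
		err := job()
		if err == nil {
			return attempt + 1, nil
		}
		if attempt >= maxRetries {
			return attempt + 1, err // retries exhausted: caller logs the error
		}
		sleep(withJitter(backoffDelay(attempt), rand.Float64()))
	}
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt)) // 2s, 4s, 8s, 16s, 30s
	}
	runs, err := runWithRetry(func() error { return errAlwaysFails }, noSleep)
	fmt.Println(runs, err) // 4 job failed
}
```

Passing the sleep function as a parameter keeps the loop testable; production code would pass `time.Sleep` or a context-aware wait.
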
---

## File Reference

### Scheduler (Lane-Based Concurrency)
| File | Description |
|------|-------------|
| `internal/scheduler/lanes.go` | Lane and LaneManager (semaphore-based worker pools) |
| `internal/scheduler/queue.go` | SessionQueue, Scheduler, drop policies, debounce, cancel mechanics |
| `internal/scheduler/scheduler.go` | Scheduler top-level API, draining mode for graceful shutdown |
| `internal/scheduler/errors.go` | Error types: ErrQueueFull, ErrQueueDropped, ErrMessageStale, ErrGatewayDraining, ErrLaneCleared |

### Cron Service (In-Memory)
| File | Description |
|------|-------------|
| `internal/cron/service.go` | Cron service lifecycle (start/stop), job CRUD |
| `internal/cron/service_execution.go` | Run loop (every 1s), job execution, schedule parsing, persistence |
| `internal/cron/retry.go` | Retry with exponential backoff + jitter, output truncation |
| `internal/cron/types.go` | Job, Schedule, JobState, RunLogEntry types |

### Cron Persistence (PostgreSQL)
| File | Description |
|------|-------------|
| `internal/store/cron_store.go` | CronStore interface (jobs + run logs) |
| `internal/store/pg/cron.go` | PostgreSQL cron operations (create, list, update, delete) |
| `internal/store/pg/cron_crud.go` | CRUD helpers for job mutations |
| `internal/store/pg/cron_scheduler.go` | PG job cache, due-job detection, execution |
| `internal/store/pg/cron_exec.go` | Execution flow and result recording |
| `internal/store/pg/cron_scan.go` | Row scanning for jobs and run logs |
| `internal/store/pg/cron_update.go` | Job state updates in PostgreSQL |

### Gateway Integration
| File | Description |
|------|-------------|
| `cmd/gateway_cron.go` | makeCronJobHandler (routes cron execution to scheduler) |
| `cmd/gateway_agents.go` | Agent initialization and run loop setup |
| `internal/gateway/methods/cron.go` | RPC method handlers (list, create, update, delete, toggle, run, runs) |

---

## Cross-References

| Document | Relevant Content |
|----------|-----------------|
| [00-architecture-overview.md](./00-architecture-overview.md) | Scheduler lanes in startup sequence |
| [01-agent-loop.md](./01-agent-loop.md) | Agent loop triggered by scheduler |
| [06-store-data-model.md](./06-store-data-model.md) | cron_jobs, cron_run_logs tables |