Files
goclaw/skills/pdf/scripts/fill_fillable_fields.py
T
Viet Tran ace07509b7 feat(skills): system skills integration — toggle, dep checking, per-item install (#161)
* feat(infra): add runtime package support for skills

Install nodejs, npm, pandoc, github-cli + pre-install Python packages
(openpyxl, pandas, python-pptx, markitdown) and Node packages
(docx, pptxgenjs). Configure runtime dirs for agent pip/npm installs
with PIP_TARGET, NPM_CONFIG_PREFIX, NODE_PATH to enable dynamic
package installation in read-only container environment.

* feat(infra): add bundled skills with runtime package support

- Add 5 bundled skills: docx, pdf, pptx, xlsx, skill-creator from container skills-store
- Wire GOCLAW_BUILTIN_SKILLS_DIR env var in gateway and CLI
- Support optional runtime packages alongside dynamic skill loading
- Update Dockerfile to COPY bundled-skills at /app/bundled-skills/
- Add PIP_CACHE_DIR in docker-entrypoint.sh for clean pip installs
- Document bundled skills in 14-skills-runtime.md section 6

* feat(infra): remove ai-multimodal skill directory from bundled skills

Remove the ai-multimodal skill package as part of consolidating runtime
package support for bundled skills. This directory is no longer needed
in the bundled skills structure.

* feat(ci): add semantic release and Docker Hub publishing

Add go-semantic-release workflow to auto-create semver tags on merge to
main. Extend docker-publish to push all variants to both GHCR and
Docker Hub (digitop/goclaw).

* feat(skills): add system skills infrastructure with is_system column, dep scanning, and seeder

- Migration 000017: add is_system boolean column with partial index
- Store layer: UpsertSystemSkill, delete protection, IsSystemSkill
- ListAccessible auto-includes system skills (no grants needed)
- ListWithGrantStatus returns is_system field
- Dependency scanner: auto-detect deps from scripts/ or skill-manifest.json
- Dependency checker: verify system binaries, Python/Node packages
- Seeder: seed bundled skills into DB on startup (idempotent via hash)
- Gateway wiring: GOCLAW_BUNDLED_SKILLS_DIR env for bundled skills
- HTTP: delete guard (403), slug conflict check (409), rescan-deps endpoint
- UI: System badge, hide delete for system skills, rescan deps button
- Agent skills tab: "Always available" for system skills
- i18n: en/vi/zh keys for system skills, deps scanning

* feat(skills): conditional system prompt, skill manifests, and Zip Slip fix

- System prompt: only show package list when python3/node are available
- Add skill-manifest.json for pdf, docx, xlsx, pptx bundled skills
- Fix Zip Slip vulnerability in office/unpack.py (all 3 copies)

* refactor(skills): extract shared office code to _shared/ and deduplicate

Move office scripts (pack, unpack, validate, schemas, validators) from
duplicated copies in docx/xlsx/pptx to skills/_shared/office/ with
symlinks. Remove soffice.py (non-functional in containers) and update
SKILL.md references to use soffice binary directly. Update seeder
copyDir to follow symlinks.

Removes ~45K lines of duplicate code across 3 skills.

* fix(skills): address code review findings for system skills integration

- H1: Remove dead symlink branch in copyDir (filepath.Walk follows symlinks)
- H3: Fix rescan-deps to query ALL skills (including archived) and re-activate
  when deps become available; add ListAllSkills() + Status field to SkillInfo
- H4: Add Status field to SkillCreateParams, stop overloading Visibility
- M1: Batch Python/Node dep checks into single subprocess per runtime
- M4: Add rows.Err() check in ListSkills to prevent caching partial results

* feat(skills): async dep checking with realtime WS events

Split Seed() into sync DB upsert + async CheckDepsAsync() goroutine.
Gateway startup no longer blocks on Python/Node subprocess dep checks.

- Seed() returns seeded skills list, all initially status="active"
- CheckDepsAsync() runs in background, emits skill.deps.checked per-skill
- skill.deps.complete event emitted when all checks finish
- Each failed dep check: archives skill + BumpVersion() for immediate
  cache invalidation so next agent turn picks up the change
- UI: use-query-invalidation listens to skill.deps.* events → auto-refresh
  skills list in realtime

* feat(skills): system skills integration with toggle, dep checking, and per-item install

- Add is_system, deps, enabled columns to skills table (migration 017)
- Seed bundled core skills (pdf, docx, pptx, xlsx, skill-creator) on startup
- PYTHONPATH-based dep detection — eliminates false positives from local modules
- Per-item dep install UI with individual status (installing/success/error)
- Enable/disable toggle for core and custom skills (independent of dep status)
- Re-run dep check when skill is toggled back on
- Inline skill thresholds: 40 skills / 5000 tokens before switching to search mode
- Fix UpsertSystemSkill: backfill null file_hash without bumping DB version
- Remove redundant skill-manifest.json files (replaced by deps JSONB column)
- Show author from frontmatter in custom skills tab
- Runtime checker for python3/pip3/node/npm availability
- WS events for dep checking/installing progress
- docs: add 15-core-skills-system.md, 16-skill-publishing.md

---------

Co-authored-by: Goon <duy@wearetopgroup.com>
2026-03-12 09:20:41 +07:00

99 lines
3.7 KiB
Python

import json
import sys
from pypdf import PdfReader, PdfWriter
from extract_form_field_info import get_field_info
def fill_pdf_fields(input_pdf_path: str, fields_json_path: str, output_pdf_path: str):
with open(fields_json_path) as f:
fields = json.load(f)
fields_by_page = {}
for field in fields:
if "value" in field:
field_id = field["field_id"]
page = field["page"]
if page not in fields_by_page:
fields_by_page[page] = {}
fields_by_page[page][field_id] = field["value"]
reader = PdfReader(input_pdf_path)
has_error = False
field_info = get_field_info(reader)
fields_by_ids = {f["field_id"]: f for f in field_info}
for field in fields:
existing_field = fields_by_ids.get(field["field_id"])
if not existing_field:
has_error = True
print(f"ERROR: `{field['field_id']}` is not a valid field ID")
elif field["page"] != existing_field["page"]:
has_error = True
print(f"ERROR: Incorrect page number for `{field['field_id']}` (got {field['page']}, expected {existing_field['page']})")
else:
if "value" in field:
err = validation_error_for_field_value(existing_field, field["value"])
if err:
print(err)
has_error = True
if has_error:
sys.exit(1)
writer = PdfWriter(clone_from=reader)
for page, field_values in fields_by_page.items():
writer.update_page_form_field_values(writer.pages[page - 1], field_values, auto_regenerate=False)
writer.set_need_appearances_writer(True)
with open(output_pdf_path, "wb") as f:
writer.write(f)
def validation_error_for_field_value(field_info, field_value):
field_type = field_info["type"]
field_id = field_info["field_id"]
if field_type == "checkbox":
checked_val = field_info["checked_value"]
unchecked_val = field_info["unchecked_value"]
if field_value != checked_val and field_value != unchecked_val:
return f'ERROR: Invalid value "{field_value}" for checkbox field "{field_id}". The checked value is "{checked_val}" and the unchecked value is "{unchecked_val}"'
elif field_type == "radio_group":
option_values = [opt["value"] for opt in field_info["radio_options"]]
if field_value not in option_values:
return f'ERROR: Invalid value "{field_value}" for radio group field "{field_id}". Valid values are: {option_values}'
elif field_type == "choice":
choice_values = [opt["value"] for opt in field_info["choice_options"]]
if field_value not in choice_values:
return f'ERROR: Invalid value "{field_value}" for choice field "{field_id}". Valid values are: {choice_values}'
return None
def monkeypatch_pydpf_method():
from pypdf.generic import DictionaryObject
from pypdf.constants import FieldDictionaryAttributes
original_get_inherited = DictionaryObject.get_inherited
def patched_get_inherited(self, key: str, default = None):
result = original_get_inherited(self, key, default)
if key == FieldDictionaryAttributes.Opt:
if isinstance(result, list) and all(isinstance(v, list) and len(v) == 2 for v in result):
result = [r[0] for r in result]
return result
DictionaryObject.get_inherited = patched_get_inherited
if __name__ == "__main__":
if len(sys.argv) != 4:
print("Usage: fill_fillable_fields.py [input pdf] [field_values.json] [output pdf]")
sys.exit(1)
monkeypatch_pydpf_method()
input_pdf = sys.argv[1]
fields_json = sys.argv[2]
output_pdf = sys.argv[3]
fill_pdf_fields(input_pdf, fields_json, output_pdf)