Files
goclaw/skills/_shared/office/pack.py
T
Viet Tran ace07509b7 feat(skills): system skills integration — toggle, dep checking, per-item install (#161)
* feat(infra): add runtime package support for skills

Install nodejs, npm, pandoc, github-cli + pre-install Python packages
(openpyxl, pandas, python-pptx, markitdown) and Node packages
(docx, pptxgenjs). Configure runtime dirs for agent pip/npm installs
with PIP_TARGET, NPM_CONFIG_PREFIX, NODE_PATH to enable dynamic
package installation in read-only container environment.

* feat(infra): add bundled skills with runtime package support

- Add 5 bundled skills: docx, pdf, pptx, xlsx, skill-creator from container skills-store
- Wire GOCLAW_BUILTIN_SKILLS_DIR env var in gateway and CLI
- Support optional runtime packages alongside dynamic skill loading
- Update Dockerfile to COPY bundled-skills at /app/bundled-skills/
- Add PIP_CACHE_DIR in docker-entrypoint.sh for clean pip installs
- Document bundled skills in 14-skills-runtime.md section 6

* feat(infra): remove ai-multimodal skill directory from bundled skills

Remove the ai-multimodal skill package as part of consolidating runtime
package support for bundled skills. This directory is no longer needed
in the bundled skills structure.

* feat(ci): add semantic release and Docker Hub publishing

Add go-semantic-release workflow to auto-create semver tags on merge to
main. Extend docker-publish to push all variants to both GHCR and
Docker Hub (digitop/goclaw).

* feat(skills): add system skills infrastructure with is_system column, dep scanning, and seeder

- Migration 000017: add is_system boolean column with partial index
- Store layer: UpsertSystemSkill, delete protection, IsSystemSkill
- ListAccessible auto-includes system skills (no grants needed)
- ListWithGrantStatus returns is_system field
- Dependency scanner: auto-detect deps from scripts/ or skill-manifest.json
- Dependency checker: verify system binaries, Python/Node packages
- Seeder: seed bundled skills into DB on startup (idempotent via hash)
- Gateway wiring: GOCLAW_BUNDLED_SKILLS_DIR env for bundled skills
- HTTP: delete guard (403), slug conflict check (409), rescan-deps endpoint
- UI: System badge, hide delete for system skills, rescan deps button
- Agent skills tab: "Always available" for system skills
- i18n: en/vi/zh keys for system skills, deps scanning

* feat(skills): conditional system prompt, skill manifests, and Zip Slip fix

- System prompt: only show package list when python3/node are available
- Add skill-manifest.json for pdf, docx, xlsx, pptx bundled skills
- Fix Zip Slip vulnerability in office/unpack.py (all 3 copies)

* refactor(skills): extract shared office code to _shared/ and deduplicate

Move office scripts (pack, unpack, validate, schemas, validators) from
duplicated copies in docx/xlsx/pptx to skills/_shared/office/ with
symlinks. Remove soffice.py (non-functional in containers) and update
SKILL.md references to use soffice binary directly. Update seeder
copyDir to follow symlinks.

Removes ~45K lines of duplicate code across 3 skills.

* fix(skills): address code review findings for system skills integration

- H1: Remove dead symlink branch in copyDir (filepath.Walk follows symlinks)
- H3: Fix rescan-deps to query ALL skills (including archived) and re-activate
  when deps become available; add ListAllSkills() + Status field to SkillInfo
- H4: Add Status field to SkillCreateParams, stop overloading Visibility
- M1: Batch Python/Node dep checks into single subprocess per runtime
- M4: Add rows.Err() check in ListSkills to prevent caching partial results

* feat(skills): async dep checking with realtime WS events

Split Seed() into sync DB upsert + async CheckDepsAsync() goroutine.
Gateway startup no longer blocks on Python/Node subprocess dep checks.

- Seed() returns seeded skills list, all initially status="active"
- CheckDepsAsync() runs in background, emits skill.deps.checked per-skill
- skill.deps.complete event emitted when all checks finish
- Each failed dep check: archives skill + BumpVersion() for immediate
  cache invalidation so next agent turn picks up the change
- UI: use-query-invalidation listens to skill.deps.* events → auto-refresh
  skills list in realtime

* feat(skills): system skills integration with toggle, dep checking, and per-item install

- Add is_system, deps, enabled columns to skills table (migration 017)
- Seed bundled core skills (pdf, docx, pptx, xlsx, skill-creator) on startup
- PYTHONPATH-based dep detection — eliminates false positives from local modules
- Per-item dep install UI with individual status (installing/success/error)
- Enable/disable toggle for core and custom skills (independent of dep status)
- Re-run dep check when skill is toggled back on
- Inline skill thresholds: 40 skills / 5000 tokens before switching to search mode
- Fix UpsertSystemSkill: backfill null file_hash without bumping DB version
- Remove redundant skill-manifest.json files (replaced by deps JSONB column)
- Show author from frontmatter in custom skills tab
- Runtime checker for python3/pip3/node/npm availability
- WS events for dep checking/installing progress
- docs: add 15-core-skills-system.md, 16-skill-publishing.md

---------

Co-authored-by: Goon <duy@wearetopgroup.com>
2026-03-12 09:20:41 +07:00

160 lines
4.9 KiB
Python

"""Pack a directory into a DOCX, PPTX, or XLSX file.
Validates with auto-repair, condenses XML formatting, and creates the Office file.
Usage:
python pack.py <input_directory> <output_file> [--original <file>] [--validate true|false]
Examples:
python pack.py unpacked/ output.docx --original input.docx
python pack.py unpacked/ output.pptx --validate false
"""
import argparse
import sys
import shutil
import tempfile
import zipfile
from pathlib import Path
import defusedxml.minidom
from validators import DOCXSchemaValidator, PPTXSchemaValidator, RedliningValidator
def pack(
input_directory: str,
output_file: str,
original_file: str | None = None,
validate: bool = True,
infer_author_func=None,
) -> tuple[None, str]:
input_dir = Path(input_directory)
output_path = Path(output_file)
suffix = output_path.suffix.lower()
if not input_dir.is_dir():
return None, f"Error: {input_dir} is not a directory"
if suffix not in {".docx", ".pptx", ".xlsx"}:
return None, f"Error: {output_file} must be a .docx, .pptx, or .xlsx file"
if validate and original_file:
original_path = Path(original_file)
if original_path.exists():
success, output = _run_validation(
input_dir, original_path, suffix, infer_author_func
)
if output:
print(output)
if not success:
return None, f"Error: Validation failed for {input_dir}"
with tempfile.TemporaryDirectory() as temp_dir:
temp_content_dir = Path(temp_dir) / "content"
shutil.copytree(input_dir, temp_content_dir)
for pattern in ["*.xml", "*.rels"]:
for xml_file in temp_content_dir.rglob(pattern):
_condense_xml(xml_file)
output_path.parent.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
for f in temp_content_dir.rglob("*"):
if f.is_file():
zf.write(f, f.relative_to(temp_content_dir))
return None, f"Successfully packed {input_dir} to {output_file}"
def _run_validation(
unpacked_dir: Path,
original_file: Path,
suffix: str,
infer_author_func=None,
) -> tuple[bool, str | None]:
output_lines = []
validators = []
if suffix == ".docx":
author = "Claude"
if infer_author_func:
try:
author = infer_author_func(unpacked_dir, original_file)
except ValueError as e:
print(f"Warning: {e} Using default author 'Claude'.", file=sys.stderr)
validators = [
DOCXSchemaValidator(unpacked_dir, original_file),
RedliningValidator(unpacked_dir, original_file, author=author),
]
elif suffix == ".pptx":
validators = [PPTXSchemaValidator(unpacked_dir, original_file)]
if not validators:
return True, None
total_repairs = sum(v.repair() for v in validators)
if total_repairs:
output_lines.append(f"Auto-repaired {total_repairs} issue(s)")
success = all(v.validate() for v in validators)
if success:
output_lines.append("All validations PASSED!")
return success, "\n".join(output_lines) if output_lines else None
def _condense_xml(xml_file: Path) -> None:
try:
with open(xml_file, encoding="utf-8") as f:
dom = defusedxml.minidom.parse(f)
for element in dom.getElementsByTagName("*"):
if element.tagName.endswith(":t"):
continue
for child in list(element.childNodes):
if (
child.nodeType == child.TEXT_NODE
and child.nodeValue
and child.nodeValue.strip() == ""
) or child.nodeType == child.COMMENT_NODE:
element.removeChild(child)
xml_file.write_bytes(dom.toxml(encoding="UTF-8"))
except Exception as e:
print(f"ERROR: Failed to parse {xml_file.name}: {e}", file=sys.stderr)
raise
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Pack a directory into a DOCX, PPTX, or XLSX file"
)
parser.add_argument("input_directory", help="Unpacked Office document directory")
parser.add_argument("output_file", help="Output Office file (.docx/.pptx/.xlsx)")
parser.add_argument(
"--original",
help="Original file for validation comparison",
)
parser.add_argument(
"--validate",
type=lambda x: x.lower() == "true",
default=True,
metavar="true|false",
help="Run validation with auto-repair (default: true)",
)
args = parser.parse_args()
_, message = pack(
args.input_directory,
args.output_file,
original_file=args.original,
validate=args.validate,
)
print(message)
if "Error" in message:
sys.exit(1)