Files
goclaw/skills/_shared/office/validators/pptx.py
T
Viet Tran ace07509b7 feat(skills): system skills integration — toggle, dep checking, per-item install (#161)
* feat(infra): add runtime package support for skills

Install nodejs, npm, pandoc, github-cli + pre-install Python packages
(openpyxl, pandas, python-pptx, markitdown) and Node packages
(docx, pptxgenjs). Configure runtime dirs for agent pip/npm installs
with PIP_TARGET, NPM_CONFIG_PREFIX, NODE_PATH to enable dynamic
package installation in read-only container environment.

* feat(infra): add bundled skills with runtime package support

- Add 5 bundled skills: docx, pdf, pptx, xlsx, skill-creator from container skills-store
- Wire GOCLAW_BUILTIN_SKILLS_DIR env var in gateway and CLI
- Support optional runtime packages alongside dynamic skill loading
- Update Dockerfile to COPY bundled-skills at /app/bundled-skills/
- Add PIP_CACHE_DIR in docker-entrypoint.sh for clean pip installs
- Document bundled skills in 14-skills-runtime.md section 6

* feat(infra): remove ai-multimodal skill directory from bundled skills

Remove the ai-multimodal skill package as part of consolidating runtime
package support for bundled skills. This directory is no longer needed
in the bundled skills structure.

* feat(ci): add semantic release and Docker Hub publishing

Add go-semantic-release workflow to auto-create semver tags on merge to
main. Extend docker-publish to push all variants to both GHCR and
Docker Hub (digitop/goclaw).

* feat(skills): add system skills infrastructure with is_system column, dep scanning, and seeder

- Migration 000017: add is_system boolean column with partial index
- Store layer: UpsertSystemSkill, delete protection, IsSystemSkill
- ListAccessible auto-includes system skills (no grants needed)
- ListWithGrantStatus returns is_system field
- Dependency scanner: auto-detect deps from scripts/ or skill-manifest.json
- Dependency checker: verify system binaries, Python/Node packages
- Seeder: seed bundled skills into DB on startup (idempotent via hash)
- Gateway wiring: GOCLAW_BUNDLED_SKILLS_DIR env for bundled skills
- HTTP: delete guard (403), slug conflict check (409), rescan-deps endpoint
- UI: System badge, hide delete for system skills, rescan deps button
- Agent skills tab: "Always available" for system skills
- i18n: en/vi/zh keys for system skills, deps scanning

* feat(skills): conditional system prompt, skill manifests, and Zip Slip fix

- System prompt: only show package list when python3/node are available
- Add skill-manifest.json for pdf, docx, xlsx, pptx bundled skills
- Fix Zip Slip vulnerability in office/unpack.py (all 3 copies)

* refactor(skills): extract shared office code to _shared/ and deduplicate

Move office scripts (pack, unpack, validate, schemas, validators) from
duplicated copies in docx/xlsx/pptx to skills/_shared/office/ with
symlinks. Remove soffice.py (non-functional in containers) and update
SKILL.md references to use soffice binary directly. Update seeder
copyDir to follow symlinks.

Removes ~45K lines of duplicate code across 3 skills.

* fix(skills): address code review findings for system skills integration

- H1: Remove dead symlink branch in copyDir (filepath.Walk follows symlinks)
- H3: Fix rescan-deps to query ALL skills (including archived) and re-activate
  when deps become available; add ListAllSkills() + Status field to SkillInfo
- H4: Add Status field to SkillCreateParams, stop overloading Visibility
- M1: Batch Python/Node dep checks into single subprocess per runtime
- M4: Add rows.Err() check in ListSkills to prevent caching partial results

* feat(skills): async dep checking with realtime WS events

Split Seed() into sync DB upsert + async CheckDepsAsync() goroutine.
Gateway startup no longer blocks on Python/Node subprocess dep checks.

- Seed() returns seeded skills list, all initially status="active"
- CheckDepsAsync() runs in background, emits skill.deps.checked per-skill
- skill.deps.complete event emitted when all checks finish
- Each failed dep check: archives skill + BumpVersion() for immediate
  cache invalidation so next agent turn picks up the change
- UI: use-query-invalidation listens to skill.deps.* events → auto-refresh
  skills list in realtime

* feat(skills): system skills integration with toggle, dep checking, and per-item install

- Add is_system, deps, enabled columns to skills table (migration 017)
- Seed bundled core skills (pdf, docx, pptx, xlsx, skill-creator) on startup
- PYTHONPATH-based dep detection — eliminates false positives from local modules
- Per-item dep install UI with individual status (installing/success/error)
- Enable/disable toggle for core and custom skills (independent of dep status)
- Re-run dep check when skill is toggled back on
- Inline skill thresholds: 40 skills / 5000 tokens before switching to search mode
- Fix UpsertSystemSkill: backfill null file_hash without bumping DB version
- Remove redundant skill-manifest.json files (replaced by deps JSONB column)
- Show author from frontmatter in custom skills tab
- Runtime checker for python3/pip3/node/npm availability
- WS events for dep checking/installing progress
- docs: add 15-core-skills-system.md, 16-skill-publishing.md

---------

Co-authored-by: Goon <duy@wearetopgroup.com>
2026-03-12 09:20:41 +07:00

276 lines
9.6 KiB
Python

"""
Validator for PowerPoint presentation XML files against XSD schemas.
"""
import re
from .base import BaseSchemaValidator
class PPTXSchemaValidator(BaseSchemaValidator):
PRESENTATIONML_NAMESPACE = (
"http://schemas.openxmlformats.org/presentationml/2006/main"
)
ELEMENT_RELATIONSHIP_TYPES = {
"sldid": "slide",
"sldmasterid": "slidemaster",
"notesmasterid": "notesmaster",
"sldlayoutid": "slidelayout",
"themeid": "theme",
"tablestyleid": "tablestyles",
}
def validate(self):
if not self.validate_xml():
return False
all_valid = True
if not self.validate_namespaces():
all_valid = False
if not self.validate_unique_ids():
all_valid = False
if not self.validate_uuid_ids():
all_valid = False
if not self.validate_file_references():
all_valid = False
if not self.validate_slide_layout_ids():
all_valid = False
if not self.validate_content_types():
all_valid = False
if not self.validate_against_xsd():
all_valid = False
if not self.validate_notes_slide_references():
all_valid = False
if not self.validate_all_relationship_ids():
all_valid = False
if not self.validate_no_duplicate_slide_layouts():
all_valid = False
return all_valid
def validate_uuid_ids(self):
import lxml.etree
errors = []
uuid_pattern = re.compile(
r"^[\{\(]?[0-9A-Fa-f]{8}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{4}-?[0-9A-Fa-f]{12}[\}\)]?$"
)
for xml_file in self.xml_files:
try:
root = lxml.etree.parse(str(xml_file)).getroot()
for elem in root.iter():
for attr, value in elem.attrib.items():
attr_name = attr.split("}")[-1].lower()
if attr_name == "id" or attr_name.endswith("id"):
if self._looks_like_uuid(value):
if not uuid_pattern.match(value):
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: "
f"Line {elem.sourceline}: ID '{value}' appears to be a UUID but contains invalid hex characters"
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(
f" {xml_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
if errors:
print(f"FAILED - Found {len(errors)} UUID ID validation errors:")
for error in errors:
print(error)
return False
else:
if self.verbose:
print("PASSED - All UUID-like IDs contain valid hex values")
return True
def _looks_like_uuid(self, value):
clean_value = value.strip("{}()").replace("-", "")
return len(clean_value) == 32 and all(c.isalnum() for c in clean_value)
def validate_slide_layout_ids(self):
import lxml.etree
errors = []
slide_masters = list(self.unpacked_dir.glob("ppt/slideMasters/*.xml"))
if not slide_masters:
if self.verbose:
print("PASSED - No slide masters found")
return True
for slide_master in slide_masters:
try:
root = lxml.etree.parse(str(slide_master)).getroot()
rels_file = slide_master.parent / "_rels" / f"{slide_master.name}.rels"
if not rels_file.exists():
errors.append(
f" {slide_master.relative_to(self.unpacked_dir)}: "
f"Missing relationships file: {rels_file.relative_to(self.unpacked_dir)}"
)
continue
rels_root = lxml.etree.parse(str(rels_file)).getroot()
valid_layout_rids = set()
for rel in rels_root.findall(
f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship"
):
rel_type = rel.get("Type", "")
if "slideLayout" in rel_type:
valid_layout_rids.add(rel.get("Id"))
for sld_layout_id in root.findall(
f".//{{{self.PRESENTATIONML_NAMESPACE}}}sldLayoutId"
):
r_id = sld_layout_id.get(
f"{{{self.OFFICE_RELATIONSHIPS_NAMESPACE}}}id"
)
layout_id = sld_layout_id.get("id")
if r_id and r_id not in valid_layout_rids:
errors.append(
f" {slide_master.relative_to(self.unpacked_dir)}: "
f"Line {sld_layout_id.sourceline}: sldLayoutId with id='{layout_id}' "
f"references r:id='{r_id}' which is not found in slide layout relationships"
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(
f" {slide_master.relative_to(self.unpacked_dir)}: Error: {e}"
)
if errors:
print(f"FAILED - Found {len(errors)} slide layout ID validation errors:")
for error in errors:
print(error)
print(
"Remove invalid references or add missing slide layouts to the relationships file."
)
return False
else:
if self.verbose:
print("PASSED - All slide layout IDs reference valid slide layouts")
return True
def validate_no_duplicate_slide_layouts(self):
import lxml.etree
errors = []
slide_rels_files = list(self.unpacked_dir.glob("ppt/slides/_rels/*.xml.rels"))
for rels_file in slide_rels_files:
try:
root = lxml.etree.parse(str(rels_file)).getroot()
layout_rels = [
rel
for rel in root.findall(
f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship"
)
if "slideLayout" in rel.get("Type", "")
]
if len(layout_rels) > 1:
errors.append(
f" {rels_file.relative_to(self.unpacked_dir)}: has {len(layout_rels)} slideLayout references"
)
except Exception as e:
errors.append(
f" {rels_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
if errors:
print("FAILED - Found slides with duplicate slideLayout references:")
for error in errors:
print(error)
return False
else:
if self.verbose:
print("PASSED - All slides have exactly one slideLayout reference")
return True
def validate_notes_slide_references(self):
import lxml.etree
errors = []
notes_slide_references = {}
slide_rels_files = list(self.unpacked_dir.glob("ppt/slides/_rels/*.xml.rels"))
if not slide_rels_files:
if self.verbose:
print("PASSED - No slide relationship files found")
return True
for rels_file in slide_rels_files:
try:
root = lxml.etree.parse(str(rels_file)).getroot()
for rel in root.findall(
f".//{{{self.PACKAGE_RELATIONSHIPS_NAMESPACE}}}Relationship"
):
rel_type = rel.get("Type", "")
if "notesSlide" in rel_type:
target = rel.get("Target", "")
if target:
normalized_target = target.replace("../", "")
slide_name = rels_file.stem.replace(
".xml", ""
)
if normalized_target not in notes_slide_references:
notes_slide_references[normalized_target] = []
notes_slide_references[normalized_target].append(
(slide_name, rels_file)
)
except (lxml.etree.XMLSyntaxError, Exception) as e:
errors.append(
f" {rels_file.relative_to(self.unpacked_dir)}: Error: {e}"
)
for target, references in notes_slide_references.items():
if len(references) > 1:
slide_names = [ref[0] for ref in references]
errors.append(
f" Notes slide '{target}' is referenced by multiple slides: {', '.join(slide_names)}"
)
for slide_name, rels_file in references:
errors.append(f" - {rels_file.relative_to(self.unpacked_dir)}")
if errors:
print(
f"FAILED - Found {len([e for e in errors if not e.startswith(' ')])} notes slide reference validation errors:"
)
for error in errors:
print(error)
print("Each slide may optionally have its own slide file.")
return False
else:
if self.verbose:
print("PASSED - All notes slide references are unique")
return True
if __name__ == "__main__":
raise RuntimeError("This module should not be run directly.")