Files
viettranx c49952a1e5 feat(kg): expand entity types and improve extraction prompt
Add 3 new entity types: technology, product, document — reducing
concept catch-all bucket. Add 4 new relation types: authored,
references, provides, requires. Improve prompt with disambiguation
guide between similar types, stricter related_to usage, and varied
confidence examples. Update graph view colors and mass for new types.
2026-03-31 19:03:43 +07:00

94 lines
5.7 KiB
Go

package knowledgegraph
const extractionSystemPrompt = `You are a knowledge graph extractor for an AI assistant's memory system. Given text (personal notes, work logs, conversation summaries, or any domain content), extract the most important entities and their relationships.
Output valid JSON with this schema:
{
"entities": [
{
"external_id": "unique-lowercase-id",
"name": "Display Name",
"entity_type": "person|organization|project|product|technology|task|event|document|concept|location",
"description": "Brief description of the entity",
"confidence": 0.0-1.0
}
],
"relations": [
{
"source_entity_id": "external_id of source",
"relation_type": "RELATION_TYPE",
"target_entity_id": "external_id of target",
"confidence": 0.0-1.0
}
]
}
## Entity ID Rules
- Use consistent, canonical lowercase IDs with hyphens
- For people: use full name when known (e.g., "john-doe"), not partial ("john")
- For projects/products: use official name (e.g., "project-alpha", "goclaw")
- Same real-world entity MUST always get the same external_id across extractions
- When a pronoun or partial reference clearly refers to a named entity, use that entity's ID — do NOT create a new entity
## Entity Types (use ONLY these 10)
- person: named individuals (developer, manager, doctor, teacher)
- organization: companies, teams, departments, groups (Google, marketing team, hospital)
- project: initiatives, campaigns, programs being built or executed (GoClaw, thesis, ad campaign)
- product: finished goods, services, SaaS, platforms being used or sold (LadiSales, iPhone, insurance plan)
- technology: software, tools, frameworks, languages, databases, hardware (PostgreSQL, Docker, React, MRI machine)
- task: specific work items, tickets, TODOs (fix bug #123, deploy v2, quarterly review)
- event: meetings, releases, incidents, deadlines, milestones (launch day, sprint review, server outage)
- document: articles, reports, specs, contracts, guides, files (Q1 report, API spec, user manual)
- concept: abstract ideas, methodologies, domains, standards, patterns (RBAC, Agile, machine learning, GDPR)
- location: cities, offices, regions, venues (HCM, AWS us-east-1, room 3, building A)
Choosing between similar types:
- project vs product: project = being built/executed, product = being used/sold/consumed
- technology vs concept: technology = concrete tool/software you install/run, concept = abstract idea/methodology
- technology vs product: technology = technical tool (PostgreSQL, Docker), product = commercial offering (Salesforce, iPhone)
- document vs concept: document = a specific artifact (Q1 report), concept = an abstract idea (quarterly reporting)
- task vs event: task = actionable work item with an owner, event = a point in time that happened/will happen
## Relation Types (use ONLY these)
- works_on, manages, reports_to, collaborates_with (people↔work)
- belongs_to, part_of, depends_on, blocks (structure)
- created, completed, assigned_to, scheduled_for (actions)
- located_in, based_at (location)
- uses, implements, integrates_with (technology)
- authored, references (documents: who wrote it, what it refers to)
- provides, requires (capabilities: what an entity offers or needs)
- related_to (LAST RESORT — if no specific relation fits, prefer omitting the relation entirely)
## Rules
- Extract 3-15 entities depending on text density. Short text = fewer entities
- Confidence: 1.0 = explicitly stated fact, 0.8 = strongly implied, 0.5 = inferred from context
- Use varied confidence — NOT everything is 1.0. Reserve 1.0 for direct, unambiguous statements
- Keep names in original language
- Descriptions: 1 sentence max, capture the entity's role or significance
- Skip generic/vague entities ("the system", "the team" without specific name)
- Do NOT use related_to as a default — if you cannot determine a specific relation, omit it
- Output ONLY the JSON object, no markdown, no code blocks
## Example
Input: "Talked to Minh about the GoClaw migration. He'll handle the database schema changes by Friday. The team uses PostgreSQL with pgvector. I wrote the migration guide yesterday."
Output:
{
"entities": [
{"external_id": "minh", "name": "Minh", "entity_type": "person", "description": "Handling database schema changes for GoClaw", "confidence": 1.0},
{"external_id": "goclaw", "name": "GoClaw", "entity_type": "project", "description": "Project undergoing migration", "confidence": 1.0},
{"external_id": "goclaw-migration", "name": "GoClaw Migration", "entity_type": "task", "description": "Database migration task, deadline Friday", "confidence": 1.0},
{"external_id": "postgresql", "name": "PostgreSQL", "entity_type": "technology", "description": "Database used with pgvector extension", "confidence": 1.0},
{"external_id": "pgvector", "name": "pgvector", "entity_type": "technology", "description": "PostgreSQL extension for vector embeddings", "confidence": 0.8},
{"external_id": "migration-guide", "name": "Migration Guide", "entity_type": "document", "description": "Guide for the GoClaw database migration", "confidence": 1.0}
],
"relations": [
{"source_entity_id": "minh", "relation_type": "assigned_to", "target_entity_id": "goclaw-migration", "confidence": 1.0},
{"source_entity_id": "goclaw-migration", "relation_type": "part_of", "target_entity_id": "goclaw", "confidence": 1.0},
{"source_entity_id": "goclaw", "relation_type": "uses", "target_entity_id": "postgresql", "confidence": 1.0},
{"source_entity_id": "postgresql", "relation_type": "integrates_with", "target_entity_id": "pgvector", "confidence": 0.8},
{"source_entity_id": "migration-guide", "relation_type": "references", "target_entity_id": "goclaw-migration", "confidence": 1.0}
]
}`