Log #12868

ID	12868
Timestamp	2025-12-24 10:58:25.037819
Client	root
IP	145.224.72.140
Model	claude-sonnet-4-20250514
Status	completed
Tokens	5,223 (Input: 2,609, Output: 2,614)
Duration	161 ms
Request time	2025-12-24 10:58:25.037819
Response time	2025-12-24 10:58:25.198467
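
The recorded duration can be cross-checked against the two timestamps above. The sketch below assumes the log rounds the difference to the nearest whole millisecond; the function name `duration_ms` is hypothetical and not part of the logged system.

```python
from datetime import datetime

# Timestamp layout used in the metadata block above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def duration_ms(request_time: str, response_time: str) -> int:
    """Return the elapsed time between two log timestamps in whole milliseconds.

    Assumption: the log rounds to the nearest millisecond.
    """
    start = datetime.strptime(request_time, FMT)
    end = datetime.strptime(response_time, FMT)
    return round((end - start).total_seconds() * 1000)


elapsed = duration_ms("2025-12-24 10:58:25.037819", "2025-12-24 10:58:25.198467")
print(elapsed)  # 161
```

The raw difference is 160.648 ms, which is consistent with the 161 ms shown in the entry.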

Request

{
    "event": "PreToolUse",
    "tool_name": "mcp__mcp-docs__docs_update",
    "tool_input": {
        "id": 100,
        "content": "<nav class=\"breadcrumb\">\n    <a href=\"\/docs\">Dokumentation<\/a> &raquo; <a href=\"\/docs\/prozesse\">Prozesse<\/a> &raquo; RAG-Prozess\n<\/nav>\n\n<h1>RAG-Prozess<\/h1>\n\n<section>\n    <h2>Übersicht<\/h2>\n    <p>Der RAG-Prozess (Retrieval Augmented Generation) bildet das Herzstück der KI-gestützten Wissensgenerierung. Die Architektur trennt strikt zwischen <strong>Offline-Pipeline<\/strong> (Wissensaufbau) und <strong>Online-Pipeline<\/strong> (Wissenskonsum).<\/p>\n    \n    <h3>Grundprinzip<\/h3>\n    <ul>\n        <li>Die <strong>Offline-Pipeline<\/strong> erzeugt und strukturiert Wissen (Python-Skripte)<\/li>\n        <li>Die <strong>Online-Pipeline<\/strong> konsumiert Wissen, ohne es zu verändern (PHP-Services)<\/li>\n        <li>Zwischen beiden Pipelines besteht <strong>keine Rückkopplung zur Laufzeit<\/strong><\/li>\n    <\/ul>\n<\/section>\n\n<section>\n    <h2>Drei-Säulen-Architektur<\/h2>\n    <table>\n        <thead><tr><th>System<\/th><th>Rolle<\/th><th>Inhalt<\/th><\/tr><\/thead>\n        <tbody>\n            <tr><td><strong>SQL-Datenbank<\/strong><\/td><td>Single Source of Truth<\/td><td>Text, Struktur, Entitäten, Aussagen, Provenienz<\/td><\/tr>\n            <tr><td><strong>Vektordatenbank<\/strong><\/td><td>Ähnlichkeitsindex<\/td><td>Embeddings für Chunk-Retrieval (keine Semantik!)<\/td><\/tr>\n            <tr><td><strong>Graph<\/strong><\/td><td>Wissensmodell<\/td><td>Entitäten, Relationen, Ontologie, Taxonomie<\/td><\/tr>\n        <\/tbody>\n    <\/table>\n<\/section>\n\n<!-- ============================================================== -->\n<!-- OFFLINE-PIPELINE - TATSÄCHLICHE IMPLEMENTIERUNG                -->\n<!-- ============================================================== -->\n\n<hr>\n<h1>Offline-Pipeline (Import)<\/h1>\n<p><em>Basierend auf Code-Analyse: <code>\/var\/www\/scripts\/pipeline\/<\/code><\/em><\/p>\n\n<section>\n    <h2>Pipeline-Architektur (IST-Zustand)<\/h2>\n    \n    
<h3>Orchestrierung<\/h3>\n    <p><strong>Hauptskript:<\/strong> <code>pipeline.py<\/code><\/p>\n    <pre>\n# CLI-Befehle\npython pipeline.py scan      # Dokumente scannen\npython pipeline.py process   # Queue abarbeiten\npython pipeline.py embed     # Ausstehende Embeddings\npython pipeline.py all       # Vollständiger Durchlauf\npython pipeline.py file &lt;path&gt;  # Einzeldatei verarbeiten\npython pipeline.py status    # Status anzeigen\n    <\/pre>\n    \n    <h3>Verarbeitungsfluss (process_file)<\/h3>\n    <p><strong>Quelle:<\/strong> <code>pipeline.py:32-187<\/code><\/p>\n    <pre>\n┌─────────────┐\n│   Extract   │  extract.py - Text aus PDF\/DOCX\/PPTX\/MD\/TXT\n└──────┬──────┘\n       │ (nur PDF)\n       ▼\n┌─────────────┐\n│   Vision    │  vision.py - Bild\/Tabellen-Analyse mit llama3.2-vision:11b\n└──────┬──────┘\n       │\n       ▼\n┌─────────────┐\n│   Chunk     │  chunk.py - Semantisches Chunking nach Struktur\n└──────┬──────┘\n       │ (nur PDF)\n       ▼\n┌─────────────┐\n│   Enrich    │  enrich.py - Vision-Kontext zu Chunks hinzufügen\n└──────┬──────┘\n       │\n       ▼\n┌─────────────┐\n│   Embed     │  embed.py - Vektorisierung → Qdrant\n└──────┬──────┘\n       │\n       ▼\n┌─────────────┐\n│   Analyze   │  analyze.py - Entitäten, Relationen, Taxonomie\n└─────────────┘\n    <\/pre>\n    \n    <h3>Vollständiger Durchlauf (run_full_pipeline)<\/h3>\n    <p><strong>Quelle:<\/strong> <code>pipeline.py:234-365<\/code><\/p>\n    <pre>\nPhase 1: SCAN\n  └─ scan_directory()  → Dateien mit Hash-Vergleich finden\n  └─ queue_files()     → In pipeline_queue einfügen\n\nPhase 2: PROCESS\n  └─ get_pending_queue_items(limit=100)\n  └─ Für jedes Item: process_file() aufrufen\n  └─ Status in pipeline_queue aktualisieren\n\nPhase 3: EMBED REMAINING\n  └─ embed_pending_chunks()  → Chunks ohne qdrant_id verarbeiten\n    <\/pre>\n<\/section>\n\n<section>\n    <h2>Konfiguration (IST-Zustand)<\/h2>\n    <p><strong>Quelle:<\/strong> <code>config.py<\/code><\/p>\n    \n   
 <table>\n        <thead><tr><th>Parameter<\/th><th>Wert<\/th><th>Beschreibung<\/th><\/tr><\/thead>\n        <tbody>\n            <tr><td>NEXTCLOUD_PATH<\/td><td><code>\/var\/www\/nextcloud\/data\/root\/files\/Documents<\/code><\/td><td>Quellverzeichnis<\/td><\/tr>\n            <tr><td>SUPPORTED_EXTENSIONS<\/td><td><code>[\".pdf\", \".pptx\", \".docx\", \".md\", \".txt\"]<\/code><\/td><td>Dateitypen<\/td><\/tr>\n            <tr><td>EMBEDDING_MODEL<\/td><td><code>mxbai-embed-large<\/code><\/td><td>Ollama-Modell<\/td><\/tr>\n            <tr><td>EMBEDDING_DIMENSION<\/td><td><code>1024<\/code><\/td><td>Vektordimension<\/td><\/tr>\n            <tr><td>MAX_EMBED_CHARS<\/td><td><code>800<\/code><\/td><td>Max. Zeichen pro Embedding<\/td><\/tr>\n            <tr><td>MIN_CHUNK_SIZE<\/td><td><code>100<\/code><\/td><td>Min. Chunk-Größe<\/td><\/tr>\n            <tr><td>MAX_CHUNK_SIZE<\/td><td><code>2000<\/code><\/td><td>Max. Chunk-Größe<\/td><\/tr>\n            <tr><td>CHUNK_OVERLAP_PERCENT<\/td><td><code>10<\/code><\/td><td>Überlappung<\/td><\/tr>\n            <tr><td>DB_CONFIG.database<\/td><td><code>ki_content<\/code><\/td><td>Content-Datenbank<\/td><\/tr>\n            <tr><td>DB_LOG_CONFIG.database<\/td><td><code>ki_dev<\/code><\/td><td>Log-Datenbank<\/td><\/tr>\n            <tr><td>QDRANT_HOST<\/td><td><code>localhost<\/code><\/td><td>Qdrant-Server<\/td><\/tr>\n            <tr><td>QDRANT_PORT<\/td><td><code>6333<\/code><\/td><td>Qdrant-Port<\/td><\/tr>\n        <\/tbody>\n    <\/table>\n    \n    <h3>Qdrant Collections<\/h3>\n    <pre>\nQDRANT_COLLECTIONS = {\n    \"documents\": {\"size\": 1024, \"distance\": \"Cosine\"},\n    \"mail\":      {\"size\": 1024, \"distance\": \"Cosine\"},\n    \"entities\":  {\"size\": 1024, \"distance\": \"Cosine\"}\n}\n    <\/pre>\n<\/section>\n\n<section>\n    <h2>Skript-Details<\/h2>\n    \n    <h3>detect.py - Datei-Erkennung<\/h3>\n    <p><strong>Quelle:<\/strong> <code>detect.py:23-86<\/code><\/p>\n    <pre>\nFunktion: 
scan_directory(path=NEXTCLOUD_PATH)\n  - Rekursiver Scan, versteckte Dateien\/Ordner ignoriert\n  - SHA-256 Hash-Berechnung pro Datei\n  - Prüfung gegen documents.file_hash\n  - Rückgabe: Liste mit {path, name, ext, size, hash, action: \"new\"|\"update\"}\n\nFunktion: queue_files(files)\n  - Einfügen in pipeline_queue via db.add_to_queue()\n    <\/pre>\n    \n    <h3>extract.py - Text-Extraktion<\/h3>\n    <pre>\nUnterstützte Formate:\n  - PDF:  pdfplumber + OCR (tesseract, Sprache: deu)\n  - DOCX: python-docx\n  - PPTX: python-pptx\n  - MD:   direktes Lesen\n  - TXT:  direktes Lesen\n\nRückgabe: {success, content, file_type, error?}\n    <\/pre>\n    \n    <h3>chunk.py - Chunking<\/h3>\n    <pre>\nFunktion: chunk_by_structure(extraction)\n  - Semantisches Chunking basierend auf Dokumenttyp\n  - Erhält heading_path (JSON-Array der Überschriften)\n  - Respektiert MIN_CHUNK_SIZE und MAX_CHUNK_SIZE\n  - Rückgabe: [{content, heading_path, position_start, position_end, metadata}]\n    <\/pre>\n    \n    <h3>embed.py - Embedding<\/h3>\n    <p><strong>Quelle:<\/strong> <code>embed.py:20-116<\/code><\/p>\n    <pre>\nFunktion: get_embedding(text)\n  - Kollabiert mehrfache Punkte (z.B. 
\"...\" für Inhaltsverzeichnis)\n  - Truncation bei MAX_EMBED_CHARS (800 Zeichen)\n  - POST an {OLLAMA_HOST}\/api\/embeddings\n  - Modell: mxbai-embed-large\n\nFunktion: store_in_qdrant(collection, point_id, vector, payload)\n  - PUT an \/collections\/{collection}\/points\n  - Payload enthält: document_id, document_title, chunk_index, \n    content (truncated 1000 chars), heading_path, source_path\n\nFunktion: embed_chunks(chunks, document_id, document_title, source_path)\n  - Iteriert über Chunks\n  - Erzeugt UUID v4 für Qdrant point_id\n  - Speichert in Qdrant und aktualisiert chunks.qdrant_id\n    <\/pre>\n    \n    <h3>analyze.py - Semantische Analyse<\/h3>\n    <p><strong>Quelle:<\/strong> <code>analyze.py<\/code><\/p>\n    <pre>\nFunktion: analyze_document(document_id, text, use_anthropic=True)\n  - Extrahiert Entitäten → entities Tabelle\n  - Extrahiert Relationen → entity_relations Tabelle\n  - Klassifiziert in Taxonomie → document_taxonomy Tabelle\n  - Analysiert Chunks → chunk_semantics Tabelle\n  \nGespeicherte Felder in chunk_semantics:\n  - summary, keywords, sentiment, topics\n  - analysis_model (z.B. 
\"claude-opus-4-5-20251101\")\n    <\/pre>\n<\/section>\n\n<section>\n    <h2>Pipeline-Konfigurationen (DB)<\/h2>\n    <p><strong>Tabellen:<\/strong> <code>ki_content.pipeline_configs<\/code>, <code>ki_content.pipeline_steps<\/code><\/p>\n    \n    <h3>Bestehende Pipelines<\/h3>\n    <table>\n        <thead><tr><th>ID<\/th><th>Name<\/th><th>Steps<\/th><th>Default<\/th><th>Status<\/th><\/tr><\/thead>\n        <tbody>\n            <tr><td>1<\/td><td>Standard<\/td><td>5<\/td><td>Ja<\/td><td>Produktiv<\/td><\/tr>\n            <tr><td>2<\/td><td>Schulungsunterlagen<\/td><td>20<\/td><td>Nein<\/td><td>Spezialisiert<\/td><\/tr>\n        <\/tbody>\n    <\/table>\n    \n    <h3>Verfügbare Step-Types (ENUM)<\/h3>\n    <pre>\ndetect, validate, page_split, vision_analyze, extract, structure,\nsegment, chunk, metadata_store, embed, collection_setup, vector_store,\nindex_optimize, entity_extract, relation_extract, taxonomy_build,\nsemantic_analyze, summarize, question_generate, finalize, analyze,\nknowledge_page, knowledge_section, knowledge_document, knowledge_validate\n    <\/pre>\n<\/section>\n\n<!-- ============================================================== -->\n<!-- ONLINE-PIPELINE - TATSÄCHLICHE IMPLEMENTIERUNG                 -->\n<!-- ============================================================== -->\n\n<hr>\n<h1>Online-Pipeline (Query)<\/h1>\n<p><em>Basierend auf Code-Analyse: PHP-Services<\/em><\/p>\n\n<section>\n    <h2>Architektur (IST-Zustand)<\/h2>\n    \n    <h3>Komponenten<\/h3>\n    <table>\n        <thead><tr><th>Komponente<\/th><th>Datei<\/th><th>Verantwortung<\/th><\/tr><\/thead>\n        <tbody>\n            <tr><td>ChatController<\/td><td><code>src\/Controller\/ChatController.php<\/code><\/td><td>HTTP-Endpunkte<\/td><\/tr>\n            <tr><td>SendChatMessageUseCase<\/td><td><code>src\/UseCases\/Chat\/SendChatMessageUseCase.php<\/code><\/td><td>Orchestrierung<\/td><\/tr>\n            
<tr><td>ChatService<\/td><td><code>src\/Infrastructure\/AI\/ChatService.php<\/code><\/td><td>RAG-Pipeline<\/td><\/tr>\n            <tr><td>QdrantService<\/td><td><code>src\/Infrastructure\/AI\/QdrantService.php<\/code><\/td><td>Vektor-Suche<\/td><\/tr>\n            <tr><td>OllamaService<\/td><td><code>src\/Infrastructure\/AI\/OllamaService.php<\/code><\/td><td>Embe... [TRUNCATED-fa907802b2a93488]"
    }
}
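
The request payload above describes `get_embedding` as collapsing repeated dots, truncating at `MAX_EMBED_CHARS` (800), and posting to `{OLLAMA_HOST}/api/embeddings` with the model `mxbai-embed-large`. A minimal sketch of that preprocessing follows. The exact dot-collapsing rule is not given in the payload, so the regular expression is an assumption, and the helper names `prepare_embedding_text` and `build_embedding_request` are hypothetical. No network call is made; the second helper only builds the JSON body in the `model`/`prompt` shape that Ollama's embeddings endpoint accepts.

```python
import json
import re

# Values taken from the configuration table in the logged payload.
MAX_EMBED_CHARS = 800
EMBEDDING_MODEL = "mxbai-embed-large"


def prepare_embedding_text(text: str, max_chars: int = MAX_EMBED_CHARS) -> str:
    """Collapse dot runs and truncate the text before embedding.

    Assumption: any run of two or more dots (for example table-of-contents
    leaders) is reduced to a single dot. The real script may use another rule.
    """
    collapsed = re.sub(r"\.{2,}", ".", text)
    return collapsed[:max_chars]


def build_embedding_request(text: str) -> str:
    """Build the JSON body for a POST to {OLLAMA_HOST}/api/embeddings."""
    body = {"model": EMBEDDING_MODEL, "prompt": prepare_embedding_text(text)}
    return json.dumps(body)


print(prepare_embedding_text("Introduction ........ 5"))  # Introduction . 5
```

Truncating after collapsing, rather than before, keeps more real content inside the 800-character budget when a chunk contains long leader runs.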

Response

{
    "tool_response": [
        {
            "type": "text",
            "text": "{\n  \"success\": true,\n  \"doc\": {\n    \"id\": 100,\n    \"parent_id\": 91,\n    \"slug\": \"rag-prozess\",\n    \"path\": \"\/prozesse\/rag-prozess\",\n    \"title\": \"RAG-Prozess\",\n    \"description\": \"Dokumentation des gesamten RAG-Prozesses (Retrieval Augmented Generation) für Chat und Content Studio\",\n    \"content\": \"<nav class=\\\"breadcrumb\\\">\\n    <a href=\\\"\/docs\\\">Dokumentation<\/a> &raquo; <a href=\\\"\/docs\/prozesse\\\">Prozesse<\/a> &raquo; RAG-Prozess\\n<\/nav>\\n\\n<h1>RAG-Prozess<\/h1>\\n\\n<section>\\n    <h2>Übersicht<\/h2>\\n    <p>Der RAG-Prozess (Retrieval Augmented Generation) bildet das Herzstück der KI-gestützten Wissensgenerierung. Die Architektur trennt strikt zwischen <strong>Offline-Pipeline<\/strong> (Wissensaufbau) und <strong>Online-Pipeline<\/strong> (Wissenskonsum).<\/p>\\n    \\n    <h3>Grundprinzip<\/h3>\\n    <ul>\\n        <li>Die <strong>Offline-Pipeline<\/strong> erzeugt und strukturiert Wissen (Python-Skripte)<\/li>\\n        <li>Die <strong>Online-Pipeline<\/strong> konsumiert Wissen, ohne es zu verändern (PHP-Services)<\/li>\\n        <li>Zwischen beiden Pipelines besteht <strong>keine Rückkopplung zur Laufzeit<\/strong><\/li>\\n    <\/ul>\\n<\/section>\\n\\n<section>\\n    <h2>Drei-Säulen-Architektur<\/h2>\\n    <table>\\n        <thead><tr><th>System<\/th><th>Rolle<\/th><th>Inhalt<\/th><\/tr><\/thead>\\n        <tbody>\\n            <tr><td><strong>SQL-Datenbank<\/strong><\/td><td>Single Source of Truth<\/td><td>Text, Struktur, Entitäten, Aussagen, Provenienz<\/td><\/tr>\\n            <tr><td><strong>Vektordatenbank<\/strong><\/td><td>Ähnlichkeitsindex<\/td><td>Embeddings für Chunk-Retrieval (keine Semantik!)<\/td><\/tr>\\n            <tr><td><strong>Graph<\/strong><\/td><td>Wissensmodell<\/td><td>Entitäten, Relationen, Ontologie, Taxonomie<\/td><\/tr>\\n        <\/tbody>\\n    <\/table>\\n<\/section>\\n\\n<!-- ============================================================== 
-->\\n<!-- OFFLINE-PIPELINE - TATSÄCHLICHE IMPLEMENTIERUNG                -->\\n<!-- ============================================================== -->\\n\\n<hr>\\n<h1>Offline-Pipeline (Import)<\/h1>\\n<p><em>Basierend auf Code-Analyse: <code>\/var\/www\/scripts\/pipeline\/<\/code><\/em><\/p>\\n\\n<section>\\n    <h2>Pipeline-Architektur (IST-Zustand)<\/h2>\\n    \\n    <h3>Orchestrierung<\/h3>\\n    <p><strong>Hauptskript:<\/strong> <code>pipeline.py<\/code><\/p>\\n    <pre>\\n# CLI-Befehle\\npython pipeline.py scan      # Dokumente scannen\\npython pipeline.py process   # Queue abarbeiten\\npython pipeline.py embed     # Ausstehende Embeddings\\npython pipeline.py all       # Vollständiger Durchlauf\\npython pipeline.py file &lt;path&gt;  # Einzeldatei verarbeiten\\npython pipeline.py status    # Status anzeigen\\n    <\/pre>\\n    \\n    <h3>Verarbeitungsfluss (process_file)<\/h3>\\n    <p><strong>Quelle:<\/strong> <code>pipeline.py:32-187<\/code><\/p>\\n    <pre>\\n┌─────────────┐\\n│   Extract   │  extract.py - Text aus PDF\/DOCX\/PPTX\/MD\/TXT\\n└──────┬──────┘\\n       │ (nur PDF)\\n       ▼\\n┌─────────────┐\\n│   Vision    │  vision.py - Bild\/Tabellen-Analyse mit llama3.2-vision:11b\\n└──────┬──────┘\\n       │\\n       ▼\\n┌─────────────┐\\n│   Chunk     │  chunk.py - Semantisches Chunking nach Struktur\\n└──────┬──────┘\\n       │ (nur PDF)\\n       ▼\\n┌─────────────┐\\n│   Enrich    │  enrich.py - Vision-Kontext zu Chunks hinzufügen\\n└──────┬──────┘\\n       │\\n       ▼\\n┌─────────────┐\\n│   Embed     │  embed.py - Vektorisierung → Qdrant\\n└──────┬──────┘\\n       │\\n       ▼\\n┌─────────────┐\\n│   Analyze   │  analyze.py - Entitäten, Relationen, Taxonomie\\n└─────────────┘\\n    <\/pre>\\n    \\n    <h3>Vollständiger Durchlauf (run_full_pipeline)<\/h3>\\n    <p><strong>Quelle:<\/strong> <code>pipeline.py:234-365<\/code><\/p>\\n    <pre>\\nPhase 1: SCAN\\n  └─ scan_directory()  → Dateien mit Hash-Vergleich finden\\n  └─ queue_files()     → In 
pipeline_queue einfügen\\n\\nPhase 2: PROCESS\\n  └─ get_pending_queue_items(limit=100)\\n  └─ Für jedes Item: process_file() aufrufen\\n  └─ Status in pipeline_queue aktualisieren\\n\\nPhase 3: EMBED REMAINING\\n  └─ embed_pending_chunks()  → Chunks ohne qdrant_id verarbeiten\\n    <\/pre>\\n<\/section>\\n\\n<section>\\n    <h2>Konfiguration (IST-Zustand)<\/h2>\\n    <p><strong>Quelle:<\/strong> <code>config.py<\/code><\/p>\\n    \\n    <table>\\n        <thead><tr><th>Parameter<\/th><th>Wert<\/th><th>Beschreibung<\/th><\/tr><\/thead>\\n        <tbody>\\n            <tr><td>NEXTCLOUD_PATH<\/td><td><code>\/var\/www\/nextcloud\/data\/root\/files\/Documents<\/code><\/td><td>Quellverzeichnis<\/td><\/tr>\\n            <tr><td>SUPPORTED_EXTENSIONS<\/td><td><code>[\\\".pdf\\\", \\\".pptx\\\", \\\".docx\\\", \\\".md\\\", \\\".txt\\\"]<\/code><\/td><td>Dateitypen<\/td><\/tr>\\n            <tr><td>EMBEDDING_MODEL<\/td><td><code>mxbai-embed-large<\/code><\/td><td>Ollama-Modell<\/td><\/tr>\\n            <tr><td>EMBEDDING_DIMENSION<\/td><td><code>1024<\/code><\/td><td>Vektordimension<\/td><\/tr>\\n            <tr><td>MAX_EMBED_CHARS<\/td><td><code>800<\/code><\/td><td>Max. Zeichen pro Embedding<\/td><\/tr>\\n            <tr><td>MIN_CHUNK_SIZE<\/td><td><code>100<\/code><\/td><td>Min. Chunk-Größe<\/td><\/tr>\\n            <tr><td>MAX_CHUNK_SIZE<\/td><td><code>2000<\/code><\/td><td>Max. 
Chunk-Größe<\/td><\/tr>\\n            <tr><td>CHUNK_OVERLAP_PERCENT<\/td><td><code>10<\/code><\/td><td>Überlappung<\/td><\/tr>\\n            <tr><td>DB_CONFIG.database<\/td><td><code>ki_content<\/code><\/td><td>Content-Datenbank<\/td><\/tr>\\n            <tr><td>DB_LOG_CONFIG.database<\/td><td><code>ki_dev<\/code><\/td><td>Log-Datenbank<\/td><\/tr>\\n            <tr><td>QDRANT_HOST<\/td><td><code>localhost<\/code><\/td><td>Qdrant-Server<\/td><\/tr>\\n            <tr><td>QDRANT_PORT<\/td><td><code>6333<\/code><\/td><td>Qdrant-Port<\/td><\/tr>\\n        <\/tbody>\\n    <\/table>\\n    \\n    <h3>Qdrant Collections<\/h3>\\n    <pre>\\nQDRANT_COLLECTIONS = {\\n    \\\"documents\\\": {\\\"size\\\": 1024, \\\"distance\\\": \\\"Cosine\\\"},\\n    \\\"mail\\\":      {\\\"size\\\": 1024, \\\"distance\\\": \\\"Cosine\\\"},\\n    \\\"entities\\\":  {\\\"size\\\": 1024, \\\"distance\\\": \\\"Cosine\\\"}\\n}\\n    <\/pre>\\n<\/section>\\n\\n<section>\\n    <h2>Skript-Details<\/h2>\\n    \\n    <h3>detect.py - Datei-Erkennung<\/h3>\\n    <p><strong>Quelle:<\/strong> <code>detect.py:23-86<\/code><\/p>\\n    <pre>\\nFunktion: scan_directory(path=NEXTCLOUD_PATH)\\n  - Rekursiver Scan, versteckte Dateien\/Ordner ignoriert\\n  - SHA-256 Hash-Berechnung pro Datei\\n  - Prüfung gegen documents.file_hash\\n  - Rückgabe: Liste mit {path, name, ext, size, hash, action: \\\"new\\\"|\\\"update\\\"}\\n\\nFunktion: queue_files(files)\\n  - Einfügen in pipeline_queue via db.add_to_queue()\\n    <\/pre>\\n    \\n    <h3>extract.py - Text-Extraktion<\/h3>\\n    <pre>\\nUnterstützte Formate:\\n  - PDF:  pdfplumber + OCR (tesseract, Sprache: deu)\\n  - DOCX: python-docx\\n  - PPTX: python-pptx\\n  - MD:   direktes Lesen\\n  - TXT:  direktes Lesen\\n\\nRückgabe: {success, content, file_type, error?}\\n    <\/pre>\\n    \\n    <h3>chunk.py - Chunking<\/h3>\\n    <pre>\\nFunktion: chunk_by_structure(extraction)\\n  - Semantisches Chunking basierend auf Dokumenttyp\\n  - Erhält heading_path 
(JSON-Array der Überschriften)\\n  - Respektiert MIN_CHUNK_SIZE und MAX_CHUNK_SIZE\\n  - Rückgabe: [{content, heading_path, position_start, position_end, metadata}]\\n    <\/pre>\\n    \\n    <h3>embed.py - Embedding<\/h3>\\n    <p><strong>Quelle:<\/strong> <code>embed.py:20-116<\/code><\/p>\\n    <pre>\\nFunktion: get_embedding(text)\\n  - Kollabiert mehrfache Punkte (z.B. \\\"...\\\" für Inhaltsverzeichnis)\\n  - Truncation bei MAX_EMBED_CHARS (800 Zeichen)\\n  - POST an {OLLAMA_HOST}\/api\/embeddings\\n  - Modell: mxbai-embed-large\\n\\nFunktion: store_in_qdrant(collection, point_id, vector, payload)\\n  - PUT an \/collections\/{collection}\/points\\n  - Payload enthält: document_id, document_title, chunk_index, \\n    content (truncated 1000 chars), heading_path, source_path\\n\\nFunktion: embed_chunks(chunks, document_id, document_title, source_path)\\n  - Iteriert über Chunks\\n  - Erzeugt UUID v4 für Qdrant point_id\\n  - Speichert in Qdrant und aktualisiert chunks.qdrant_id\\n    <\/pre>\\n    \\n    <h3>analyze.py - Semantische Analyse<\/h3>\\n    <p><strong>Quelle:<\/strong> <code>analyze.py<\/code><\/p>\\n    <pre>\\nFunktion: analyze_document(document_id, text, use_anthropic=True)\\n  - Extrahiert Entitäten → entities Tabelle\\n  - Extrahiert Relationen → entity_relations Tabelle\\n  - Klassifiziert in Taxonomie → document_taxonomy Tabelle\\n  - Analysiert Chunks → chunk_semantics Tabelle\\n  \\nGespeicherte Felder in chunk_semantics:\\n  - summary, keywords, sentiment, topics\\n  - analysis_model (z.B. 
\\\"claude-opus-4-5-20251101\\\")\\n    <\/pre>\\n<\/section>\\n\\n<section>\\n    <h2>Pipeline-Konfigurationen (DB)<\/h2>\\n    <p><strong>Tabellen:<\/strong> <code>ki_content.pipeline_configs<\/code>, <code>ki_content.pipeline_steps<\/code><\/p>\\n    \\n    <h3>Bestehende Pipelines<\/h3>\\n    <table>\\n        <thead><tr><th>ID<\/th><th>Name<\/th><th>Steps<\/th><th>Default<\/th><th>Status<\/th><\/tr><\/thead>\\n        <tbody>\\n            <tr><td>1<\/td><td>Standard<\/td><td>5<\/td><td>Ja<\/td><td>Produktiv<\/td><\/tr>\\n            <tr><td>2<\/td><td>Schulungsunterlagen<\/td><td>20<\/td><td>Nein<\/td><td>Spezialisiert<\/td><\/tr>\\n        <\/tbody>\\n    <\/table>\\n    \\n    <h3>Verfügbare Step-Types (ENUM)<\/h3>\\n    <pre>\\ndetect, validate, page_split, vision_analyze, extract, structure,\\nsegment, chunk, metadata_store, embed, collection_setup, vector_store,\\nindex_optimize, entity_extract, relation_extract, taxonomy_build,\\nsemantic_analyze, summarize, question_generate, finalize, analyze,\\nknowledge_page, knowledge_section, knowledge_document, knowledge_validate\\n    <\/pre>\\n<\/section>\\n\\n<!-- ============================================================== -->\\n<!-- ONLINE-PIPELINE - TATSÄCHLICHE IMPLEMENTIERUNG                 -->\\n<!-- ============================================================== -->\\n\\n<hr>\\n<h1>Online-Pipeline (Query)<\/h1>\\n<p><em>Basierend auf Code-Analyse: PHP-Services<\/em><\/p>\\n\\n<section>\\n    <h2>Architektur (IST-Zustand)<\/h2>\\n    \\n    <h3>Komponenten<\/h3>\\n    <table>\\n        <thead><tr><th>Komponente<\/th><th>Datei<\/th><th>Verantwortung<\/th><\/tr><\/thead>\\n        <tbody>\\n            <tr><td>ChatCon... [TRUNCATED-915d109d5fbae5b8]"
        }
    ]
}
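
The response above is doubly encoded: each `tool_response` item of type `text` carries a JSON document serialized as a string. A minimal sketch of decoding such a record is shown below. The function name `decode_tool_response` is hypothetical, and the sample input is a shortened stand-in built from fields visible in the log, because the logged `text` value is truncated and would not parse as it stands.

```python
import json
from typing import Any, Dict, List


def decode_tool_response(raw: str) -> List[Dict[str, Any]]:
    """Parse the outer record, then parse each nested JSON string of type 'text'."""
    outer = json.loads(raw)
    decoded = []
    for item in outer.get("tool_response", []):
        if item.get("type") == "text":
            decoded.append(json.loads(item["text"]))
    return decoded


# Shortened stand-in for the logged response (the real one is truncated).
sample = json.dumps({
    "tool_response": [
        {
            "type": "text",
            "text": json.dumps({
                "success": True,
                "doc": {"id": 100, "parent_id": 91, "slug": "rag-prozess"},
            }),
        }
    ]
})

result = decode_tool_response(sample)
print(result[0]["doc"]["slug"])  # rag-prozess
```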
