Import Pipeline

Import Pipeline - Planning Document

1. Existing System (As-Is Analysis)

1.1 Python scripts under /opt/scripts/pipeline/

File	Function	Core logic
pipeline.py	Orchestrator	CLI with scan, process, embed, all, file, status
config.py	Configuration	Hardcoded paths, models, limits
detect.py	File detection	Scan Nextcloud, hash comparison, queue
extract.py	Text extraction	PDF (OCR), DOCX, PPTX, MD, TXT
chunk.py	Chunking	Semantic by type, heading path
embed.py	Embedding	Ollama → Qdrant
analyze.py	Semantic analysis	Entities, relations, taxonomy
db.py	Database wrapper	CRUD for documents, chunks, queue
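
The subcommands of pipeline.py listed above could be wired up roughly as follows. This is only a sketch under assumptions: the help texts, the positional "path" argument of the "file" command, and the function names are hypothetical and not taken from the existing script.

```python
import argparse

# Subcommand names come from the table above. The help texts are
# assumptions made for illustration.
COMMANDS = {
    "scan": "detect new or changed files",
    "process": "extract and chunk pending documents",
    "embed": "embed chunks and store the vectors",
    "all": "run scan, process and embed in sequence",
    "file": "run the pipeline for a single file",
    "status": "print a status summary",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per pipeline action."""
    parser = argparse.ArgumentParser(prog="pipeline.py")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS.items():
        cmd = sub.add_parser(name, help=help_text)
        if name == "file":
            # Hypothetical argument: the plan does not say how "file"
            # receives its target.
            cmd.add_argument("path", help="file to process")
    return parser


def main(argv: list[str]) -> str:
    """Parse argv and return the chosen command (dispatch is omitted)."""
    args = build_parser().parse_args(argv)
    return args.command
```

Calling main(["file", "/tmp/report.pdf"]) would select the "file" command; the actual dispatch to detect.py, extract.py and so on is left out here.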

1.2 Data Flow

Nextcloud (Files)
       ↓
   [detect.py] Scan + Hash
       ↓
   documents (DB) status=pending
       ↓
   [extract.py] PDF/DOCX/... → Text
       ↓
   [chunk.py] Semantic chunking
       ↓
   chunks (DB) + heading_path, metadata
       ↓
   [embed.py] Ollama mxbai-embed-large
       ↓
   Qdrant (vectors) + chunks.qdrant_id
       ↓
   [analyze.py] Entity/Relation/Taxonomy
       ↓
   entities, entity_relations, chunk_entities,
   chunk_taxonomy, chunk_semantics (DB)
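
The first step of the flow, scan plus hash comparison, can be illustrated with a small sketch. It assumes that a file counts as changed when its SHA-256 digest differs from the stored one; the function names and the in-memory dictionary standing in for the documents table are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def classify(path: Path, known_hashes: dict[str, str]) -> str:
    """Classify a file as 'new', 'changed' or 'unchanged'.

    known_hashes maps source paths to stored hashes and stands in
    for a lookup against the documents table (an assumption).
    """
    previous = known_hashes.get(str(path))
    if previous is None:
        return "new"
    return "unchanged" if previous == sha256_of(path) else "changed"
```

In such a scheme, files classified as "new" or "changed" would be the ones written to documents with status=pending.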

1.3 Current Configuration (config.py)

Parameter	Value
NEXTCLOUD_PATH	/var/www/nextcloud/data/root/files/Documents
SUPPORTED_EXTENSIONS	.pdf, .pptx, .docx, .md, .txt
QDRANT_HOST	localhost:6333
QDRANT_COLLECTIONS	documents, mail, entities
OLLAMA_HOST	localhost:11434
EMBED_MODEL	mxbai-embed-large (1024 dims)
MIN_CHUNK_SIZE	100 characters
MAX_CHUNK_SIZE	2000 characters
CHUNK_OVERLAP	10%
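
To show how the three chunk parameters interact, here is a minimal sketch using plain character windows. It is not the semantic chunking that chunk.py performs. How MIN_CHUNK_SIZE is applied is an assumption: here a too-short trailing chunk is folded into its predecessor.

```python
def chunk_text(text: str, min_size: int = 100, max_size: int = 2000,
               overlap: float = 0.10) -> list[str]:
    """Split text into overlapping character windows.

    Defaults mirror MIN_CHUNK_SIZE, MAX_CHUNK_SIZE and CHUNK_OVERLAP.
    """
    if max_size <= 0 or not 0 <= overlap < 1:
        raise ValueError("max_size must be > 0 and 0 <= overlap < 1")
    if not text:
        return []
    # Each window starts this many characters after the previous one.
    step = max_size - int(max_size * overlap)
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break
        start += step
    # Assumption: a tail below min_size is merged into the previous
    # chunk, which may then exceed max_size slightly.
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        tail = chunks.pop()
        shared = max_size - step  # characters already in the predecessor
        chunks[-1] += tail[shared:]
    return chunks
```

With the defaults, consecutive chunks share 200 characters (10% of 2000), so each new chunk starts 1800 characters after the previous one.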

1.4 Database Structure (ki_content)

documents (2 Rows)

id INT PK AUTO
source_path VARCHAR(500)
folder_path VARCHAR(500)
filename VARCHAR(255)
mime_type VARCHAR(100)
file_hash VARCHAR(64) - SHA-256 for change detection
file_size INT
language VARCHAR(10) DEFAULT 'de'
imported_at DATETIME
processed_at DATETIME
status ENUM('pending','processing','done','error')
error_message TEXT

chunks (6 Rows)

id INT PK AUTO
document_id INT FK
chunk_index INT
content TEXT
token_count INT
heading_path JSON - ["H1", "H2", ...]
metadata JSON
qdrant_id VARCHAR(36) - UUID in Qdrant
created_at DATETIME

entities (49 Rows)

id INT PK AUTO
name VARCHAR(255)
type ENUM('PERSON','ORGANIZATION','LOCATION','CONCEPT','METHOD','TOOL','EVENT','OTHER')
description TEXT
canonical_name VARCHAR(255) - deduplication
created_at, updated_at DATETIME
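
How canonical_name is derived is not specified here. One plausible normalization for deduplication, given purely as an assumption, is Unicode normalization plus whitespace collapsing and case folding:

```python
import re
import unicodedata


def canonical_name(name: str) -> str:
    """Normalize an entity name so that spelling variants compare equal.

    Hypothetical rule set: NFKC normalization, collapsed whitespace,
    case folding. The existing analyze.py may do something different.
    """
    normalized = unicodedata.normalize("NFKC", name)
    collapsed = re.sub(r"\s+", " ", normalized).strip()
    return collapsed.casefold()
```

Two extracted entities would then be treated as duplicates when their canonical names and types match.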

entity_relations (47 Rows)

id INT PK AUTO
source_entity_id INT FK
target_entity_id INT FK
relation_type VARCHAR(100) - e.g. DEVELOPED_BY, RELATED_TO
strength FLOAT DEFAULT 1
context TEXT
chunk_id INT FK - source chunk
created_at DATETIME

taxonomy_terms (8 Rows)

id INT PK AUTO
name VARCHAR(255)
slug VARCHAR(255) UNIQUE
parent_id INT FK (self-ref)
description TEXT
depth INT DEFAULT 0
path VARCHAR(1000) - e.g. "/Methoden/Systemisch"
created_at DATETIME
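
The depth and path columns are redundant with the parent_id chain and can be derived from it. The sketch below assumes that root terms have depth 0 (matching the column default) and that path joins the term names with "/"; the Term class and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Term:
    id: int
    name: str
    parent_id: Optional[int] = None


def path_and_depth(term_id: int, terms: dict[int, Term]) -> tuple[str, int]:
    """Walk up the parent chain and return (path, depth) for a term."""
    names = []
    seen = set()
    current = terms[term_id]
    while True:
        if current.id in seen:
            raise ValueError("cycle in taxonomy parent chain")
        seen.add(current.id)
        names.append(current.name)
        if current.parent_id is None:
            break
        current = terms[current.parent_id]
    names.reverse()  # root first
    return "/" + "/".join(names), len(names) - 1
```

For a term "Systemisch" below the root term "Methoden", this yields the path "/Methoden/Systemisch" and depth 1.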

Link tables

chunk_entities: chunk_id, entity_id, relevance_score, mention_count
chunk_taxonomy: chunk_id, taxonomy_term_id, confidence
chunk_semantics: chunk_id, summary, keywords, sentiment, topics, analysis_model

1.5 Missing Tables (expected by the code)

processing_queue - does NOT exist
processing_log - does NOT exist

2. Target Concept (GUI)

2.1 Requirements

Visual representation of the pipeline steps
Configurable parameters per step
Support for multiple pipeline definitions
Status overview for documents
Manual trigger option

2.2 New table: pipeline_configs (ki_content)

id INT PK AUTO
name VARCHAR(100) UNIQUE - e.g. "Standard", "Nur-Embedding"
description TEXT
is_default BOOLEAN DEFAULT FALSE
source_path VARCHAR(500) - Nextcloud folder
extensions JSON - [".pdf", ".docx", ...]
steps JSON - enabled steps + order
created_at, updated_at DATETIME

Example steps:
[
  {"step": "detect", "enabled": true, "order": 1},
  {"step": "extract", "enabled": true, "order": 2, "config": {"ocr": true}},
  {"step": "chunk", "enabled": true, "order": 3, "config": {"min": 100, "max": 2000, "overlap": 0.1}},
  {"step": "embed", "enabled": true, "order": 4, "config": {"model": "mxbai-embed-large", "collection": "documents"}},
  {"step": "analyze", "enabled": false, "order": 5}
]
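
On the Python side, a steps value like the example above could be reduced to the list of steps to execute. This is a sketch only; the function name is hypothetical, and treating a missing "enabled" key as disabled is an assumption.

```python
import json

# The example value from above, as it might be read from
# pipeline_configs.steps.
STEPS_JSON = """[
  {"step": "detect", "enabled": true, "order": 1},
  {"step": "extract", "enabled": true, "order": 2, "config": {"ocr": true}},
  {"step": "chunk", "enabled": true, "order": 3,
   "config": {"min": 100, "max": 2000, "overlap": 0.1}},
  {"step": "embed", "enabled": true, "order": 4,
   "config": {"model": "mxbai-embed-large", "collection": "documents"}},
  {"step": "analyze", "enabled": false, "order": 5}
]"""


def active_steps(steps_json: str) -> list[dict]:
    """Return the enabled steps, sorted by their 'order' field."""
    steps = json.loads(steps_json)
    enabled = [s for s in steps if s.get("enabled", False)]
    return sorted(enabled, key=lambda s: s["order"])


if __name__ == "__main__":
    for step in active_steps(STEPS_JSON):
        print(step["order"], step["step"], step.get("config", {}))
```

Sorting by "order" rather than relying on array position keeps the result stable even if the GUI stores the steps in a different sequence.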

2.3 New table: pipeline_step_configs (ki_content)

id INT PK AUTO
pipeline_id INT FK
step_type ENUM('detect','extract','chunk','embed','analyze')
config JSON - step-specific settings
sort_order INT
enabled BOOLEAN DEFAULT TRUE
created_at, updated_at DATETIME

2.4 New table: pipeline_runs (ki_content)

id INT PK AUTO
pipeline_id INT FK
status ENUM('pending','running','completed','failed','cancelled')
started_at DATETIME
completed_at DATETIME
documents_processed INT DEFAULT 0
documents_failed INT DEFAULT 0
error_log TEXT
created_at DATETIME
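
The status values and counters of pipeline_runs suggest a small state machine. The sketch below mirrors the columns in a dataclass; the transition rules, the class, and the method names are assumptions, since the plan only lists the possible status values.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed transition rules between the ENUM values of pipeline_runs.status.
ALLOWED = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}


@dataclass
class PipelineRun:
    pipeline_id: int
    status: str = "pending"
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    documents_processed: int = 0
    documents_failed: int = 0
    error_log: str = ""

    def transition(self, new_status: str, now: Optional[datetime] = None) -> None:
        """Move to new_status and stamp started_at or completed_at."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        now = now or datetime.now()
        if new_status == "running":
            self.started_at = now
        else:
            self.completed_at = now
        self.status = new_status

    def record(self, ok: bool, error: str = "") -> None:
        """Count one document and append its error message, if any."""
        if ok:
            self.documents_processed += 1
        else:
            self.documents_failed += 1
            self.error_log += error + "\n"
```

A run would start as "pending", move to "running" when the Python process begins, and end in one of the three terminal states.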

2.5 URL Structure

/content-pipeline                 - Overview of all pipelines
/content-pipeline/import          - Import configuration (first page)
/content-pipeline/{id}            - Pipeline detail
/content-pipeline/{id}/run        - Start pipeline (POST)
/content-pipeline/{id}/status     - Live run status
/content-pipeline/new             - Create new pipeline

2.6 View Components

┌─────────────────────────────────────────────────────────┐
│ Content Pipeline: Standard                        [Run] │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────┐   ┌─────────┐   ┌───────┐   ┌───────┐   ┌────┐│
│  │Detect│ → │ Extract │ → │ Chunk │ → │ Embed │ → │Anal││
│  │  ✓   │   │   ✓     │   │   ✓   │   │   ✓   │   │ ✗  ││
│  └──────┘   └─────────┘   └───────┘   └───────┘   └────┘│
│                                                         │
│  Source: /Documents                                     │
│  Formats: .pdf, .docx, .pptx, .md, .txt                 │
│                                                         │
│  Last run: 2025-12-20 14:30                             │
│  Processed: 2 documents, 6 chunks                       │
└─────────────────────────────────────────────────────────┘

3. Implementation Plan

Phase 1: Tables + Repository

DDL for pipeline_configs, pipeline_step_configs, pipeline_runs
Domain\Repository\PipelineRepositoryInterface
Infrastructure\Persistence\PipelineRepository

Phase 2: Controller + Views

Controller\ContentPipelineController
View\content-pipeline\index.php (overview)
View\content-pipeline\show.php (Detail + Steps)
View\content-pipeline\form.php (Create/Edit)

Phase 3: Python Integration

Pipeline invocation via Bash (with config ID)
Status polling via AJAX
Log streaming

Principles

DRY: Reuse the existing Python logic
KISS: Minimal new tables, JSON for flexible config
SRP: Repository handles pipelines only, controller handles HTTP only
YAGNI: No features that were not requested