"content": "# Content Pipeline - RAG Document Processing System\n\nA complete document processing pipeline for Retrieval-Augmented Generation (RAG) systems. Processes PDFs, DOCX, PPTX, Markdown, and text files through chunking, embedding, and semantic analysis.\n\n## Architecture Overview\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│ CONTENT PIPELINE │\n├─────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │\n│ │ DETECT │ → │ EXTRACT │ → │ CHUNK │ → │ EMBED │ │\n│ │ Files │ │ Text │ │ Split │ │ Vectors │ │\n│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │\n│ │ │ │\n│ ▼ ▼ │\n│ ┌──────────┐ ┌──────────┐ │\n│ │ ANALYZE │ │ QDRANT │ │\n│ │ Semantic │ │ VectorDB │ │\n│ └──────────┘ └──────────┘ │\n│ │ │\n│ ▼ │\n│ ┌──────────┐ │\n│ │ ENTITIES │ │\n│ │ Extract │ │\n│ └──────────┘ │\n│ │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n## Quick Start\n\n### 1. Prerequisites\n\n```bash\n# Python 3.11+\npython3 --version\n\n# MariaDB\/MySQL\nmariadb --version\n\n# Qdrant Vector Database\ndocker run -p 6333:6333 qdrant\/qdrant\n\n# Ollama (for local LLM)\nollama --version\nollama pull mxbai-embed-large # Embedding model\nollama pull llama3.2:3b # Chat model (or your preferred)\n\n# Tesseract OCR (for PDF images)\ntesseract --version\n```\n\n### 2. Installation\n\n```bash\n# Clone\/copy pipeline\ncd \/path\/to\/content-pipeline\n\n# Create virtual environment\npython3 -m venv venv\nsource venv\/bin\/activate\n\n# Install dependencies\npip install -r src\/requirements.txt\n```\n\n### 3. Database Setup\n\n```bash\n# Create database\nmariadb -e \"CREATE DATABASE content_pipeline CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci\"\n\n# Import schema\nmariadb content_pipeline < sql\/schema.sql\n```\n\n### 4. Configuration\n\n```bash\n# Copy example config\ncp config\/settings.env.example .env\n\n# Edit configuration\nnano .env\n```\n\nKey settings to configure:\n- `PIPELINE_DOCUMENT_PATH` - Directory with documents to process\n- `DB_*` - Database connection\n- `QDRANT_*` - Vector database connection\n- `OLLAMA_*` - LLM settings\n\n### 5. 
### 5. Run Pipeline

```bash
# Activate the environment
source venv/bin/activate

# Load the environment settings
set -a; source .env; set +a

# Run the full pipeline
python src/pipeline.py

# Or run specific steps
python src/detect.py    # Detect new documents
python src/extract.py   # Extract text
python src/chunk.py     # Create chunks
python src/embed.py     # Generate embeddings
python src/analyze.py   # Semantic analysis
```

## Pipeline Steps

### Phase 1: Document Detection
- **detect.py** - Scans the source directory for new or modified documents
- Compares file hashes to avoid reprocessing
- Queues documents for processing

### Phase 2: Text Extraction
- **extract.py** - Extracts text from documents
- **vision.py** - OCR for images and scanned PDFs
- Supports: PDF, DOCX, PPTX, MD, TXT

### Phase 3: Chunking
- **chunk.py** - Semantic chunking with heading preservation
- Configurable overlap and size limits
- Maintains document structure context

### Phase 4: Embedding
- **embed.py** - Generates vector embeddings via Ollama
- **step_embed.py** - Batch embedding step
- Stores vectors in the Qdrant vector database

### Phase 5: Semantic Analysis
- **analyze.py** - LLM-based semantic analysis
- **analyzers/** - Specialized analyzers:
  - Entity extraction
  - Relation detection
  - Taxonomy classification
  - Ontology mapping

### Phase 6: Knowledge Graph
- **knowledge/** - Entity and relation management
- Builds connections between concepts
- Taxonomy and ontology integration

## Directory Structure

```
content-pipeline/
├── config/
│   ├── settings.py              # Abstracted configuration
│   └── settings.env.example     # Environment template
├── docs/
│   ├── README.md                # This file
│   ├── ARCHITECTURE.md          # Detailed architecture
│   └── API.md                   # API documentation
├── sql/
│   └── schema.sql               # Database schema (DDL)
├── src/
│   ├── pipeline.py              # Main orchestrator
│   ├── config.py                # Configuration loader
│   ├── constants.py             # System constants
│   │
│   ├── # Core Steps
│   ├── detect.py                # Document detection
│   ├── extract.py               # Text extraction
│   ├── chunk.py                 # Semantic chunking
│   ├── embed.py                 # Vector embedding
│   ├── enrich.py                # Metadata enrichment
│   ├── vision.py                # OCR processing
│   │
│   ├── # Database
│   ├── db.py                    # Main DB interface
│   ├── db_core.py               # Core DB operations
│   ├── db_documents.py          # Document operations
│   ├── db_semantic.py           # Semantic data ops
│   ├── db_queue.py              # Queue management
│   │
│   ├── # Analyzers
│   ├── analyzers/
│   │   ├── client.py            # LLM client
│   │   ├── entity_extractor.py
│   │   ├── relation_extractor.py
│   │   ├── taxonomy_classifier.py
│   │   └── semantic_analyzer.py
│   │
│   ├── # Knowledge Graph
│   ├── knowledge/
│   │   ├── entity_extractor.py
│   │   ├── taxonomy_extractor.py
│   │   ├── storage.py
│   │   └── models.py
│   │
│   └── # Utilities
│       ├── json_utils.py
│       └── model_registry.py
└── scripts/
    └── setup.sh                 # Setup script
```

## Database Schema

### Core Tables

| Table | Purpose |
|-------|---------|
| `documents` | Source document metadata |
| `document_pages` | Individual pages (PDF) |
| `chunks` | Text chunks with embeddings |
| `chunk_semantics` | Semantic analysis results |
| `entities` | Extracted named entities |
| `entity_relations` | Entity relationships |
| `taxonomy_terms` | Hierarchical categories |

### Pipeline Tables

| Table | Purpose |
|-------|---------|
| `pipeline_configs` | Pipeline configurations |
| `pipeline_steps` | Step definitions |
| `pipeline_runs` | Execution history |
| `pipeline_queue` | Processing queue |

## Qdrant Collections

```python
# Default collections
QDRANT_COLLECTIONS = {
    "documents": {"size": 1024, "distance": "Cosine"},  # Chunk embeddings
    "entities": {"size": 1024, "distance": "Cosine"},   # Entity embeddings
}
```

### Setup Collections

```python
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

client = QdrantClient(host="localhost", port=6333)

# Create a collection whose vector size matches the embedding model
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1024,  # Match the embedding model dimension
        distance=Distance.COSINE
    )
)
```

## Entity Types

The system recognizes 40+ entity types, including:

| Category | Types |
|----------|-------|
| People & Orgs | PERSON, ORGANIZATION, ROLE |
| Concepts | CONCEPT, THEORY, MODEL, PRINCIPLE, METHOD |
| Content | TOOL, ARTIFACT, PUBLICATION_WORK |
| Structure | PROCESS_PHASE_STEP, INTERVENTION_EXERCISE |
| Relations | RELATIONSHIP_TYPE, COMMUNICATION_PATTERN |

## Customization

### Add New Pipeline Step

1. Create a step module in `src/step_*.py`
2. Register it in the `pipeline_steps` table
3. Implement an `execute(document_id, config)` function (sketched below)
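For step 3, a minimal sketch of what such a module could look like. Only the `execute(document_id, config)` signature comes from the list above; the module name `step_wordcount.py`, the `chunks` column names, and the return convention are illustrative assumptions.

```python
# src/step_wordcount.py -- hypothetical example step; adjust names
# and return convention to the actual pipeline contract.
from db import get_connection

def execute(document_id, config):
    """Count words in each chunk of one document (illustrative only)."""
    conn = get_connection()
    with conn.cursor() as cur:
        # Column names are assumed; adjust to the actual chunks schema.
        cur.execute(
            "SELECT id, content FROM chunks WHERE document_id = %s",
            (document_id,),
        )
        counts = {chunk_id: len(text.split()) for chunk_id, text in cur.fetchall()}
    conn.close()
    return {"status": "ok", "chunks_processed": len(counts)}
```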
### Add New Entity Type

1. Add the type to the `entities.type` ENUM in the schema
2. Update the `entity_types` table
3. Adjust the extraction prompts in `analyzers/`

### Change Embedding Model

1. Update `OLLAMA_EMBEDDING_MODEL` in `.env`
2. Update `EMBEDDING_DIMENSION` to match
3. Recreate the Qdrant collections with the new size
4. Re-embed all documents

## API Usage

### Python Integration

```python
from qdrant_client import QdrantClient

from db import get_connection
from embed import embed_text

# Get a database connection
conn = get_connection()

# Embed the query
query_vector = embed_text("What is systemic therapy?")

# Search Qdrant
client = QdrantClient(host="localhost", port=6333)

results = client.search(
    collection_name="documents",
    query_vector=query_vector,
    limit=5
)
```
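Each entry in `results` is a scored point carrying a similarity score and the payload stored at indexing time. A minimal way to inspect the hits (the payload field name `text` is an assumption; use whatever fields the pipeline stores):

```python
# Print the top hits; adjust the payload field names to what was stored.
for hit in results:
    payload = hit.payload or {}
    print(f"{hit.score:.3f}  {payload.get('text', '')[:80]}")
```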
## Troubleshooting

### Common Issues

1. **Embedding timeouts**
   - Increase `OLLAMA_TIMEOUT` in the config
   - Check the Ollama service status

2. **Poor OCR quality**
   - Ensure the Tesseract language packs are installed
   - Check the image resolution

3. **Memory issues**
   - Reduce `SEMANTIC_QUEUE_BATCH_SIZE`
   - Process documents in smaller batches

### Logging

```bash
# Enable debug logging
export LOG_LEVEL=DEBUG
python src/pipeline.py
```

## License

Internal use only. Contact the system administrator for licensing.