"content": "# Content Pipeline - RAG Document Processing System\n\nA complete document processing pipeline for Retrieval-Augmented Generation (RAG) systems. Processes PDFs, DOCX, PPTX, Markdown, and text files through chunking, embedding, and semantic analysis.\n\n## Architecture Overview\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│ CONTENT PIPELINE │\n├─────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │\n│ │ DETECT │ → │ EXTRACT │ → │ CHUNK │ → │ EMBED │ │\n│ │ Files │ │ Text │ │ Split │ │ Vectors │ │\n│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │\n│ │ │ │\n│ ▼ ▼ │\n│ ┌──────────┐ ┌──────────┐ │\n│ │ ANALYZE │ │ QDRANT │ │\n│ │ Semantic │ │ VectorDB │ │\n│ └──────────┘ └──────────┘ │\n│ │ │\n│ ▼ │\n│ ┌──────────┐ │\n│ │ ENTITIES │ │\n│ │ Extract │ │\n│ └──────────┘ │\n│ │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n## Quick Start\n\n### 1. Prerequisites\n\n```bash\n# Python 3.11+\npython3 --version\n\n# MariaDB\/MySQL\nmariadb --version\n\n# Qdrant Vector Database\ndocker run -p 6333:6333 qdrant\/qdrant\n\n# Ollama (for local LLM)\nollama --version\nollama pull mxbai-embed-large # Embedding model\nollama pull llama3.2:3b # Chat model (or your preferred)\n\n# Tesseract OCR (for PDF images)\ntesseract --version\n```\n\n### 2. Installation\n\n```bash\n# Clone\/copy pipeline\ncd \/path\/to\/content-pipeline\n\n# Create virtual environment\npython3 -m venv venv\nsource venv\/bin\/activate\n\n# Install dependencies\npip install -r src\/requirements.txt\n```\n\n### 3. Database Setup\n\n```bash\n# Create database\nmariadb -e \"CREATE DATABASE content_pipeline CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci\"\n\n# Import schema\nmariadb content_pipeline < sql\/schema.sql\n```\n\n### 4. Configuration\n\n```bash\n# Copy example config\ncp config\/settings.env.example .env\n\n# Edit configuration\nnano .env\n```\n\nKey settings to configure:\n- `PIPELINE_DOCUMENT_PATH` - Directory with documents to process\n- `DB_*` - Database connection\n- `QDRANT_*` - Vector database connection\n- `OLLAMA_*` - LLM settings\n\n### 5. 
### 5. Run Pipeline

```bash
# Activate the environment
source venv/bin/activate

# Load the environment settings
set -a; source .env; set +a

# Run the full pipeline
python src/pipeline.py

# Or run specific steps
python src/detect.py    # Detect new documents
python src/extract.py   # Extract text
python src/chunk.py     # Create chunks
python src/embed.py     # Generate embeddings
python src/analyze.py   # Semantic analysis
```

## Pipeline Steps

### Phase 1: Document Detection
- **detect.py** - Scans the source directory for new or modified documents
- Compares file hashes to avoid reprocessing
- Queues documents for processing

### Phase 2: Text Extraction
- **extract.py** - Extracts text from documents
- **vision.py** - OCR for images and scanned PDFs
- Supports: PDF, DOCX, PPTX, MD, TXT

### Phase 3: Chunking
- **chunk.py** - Semantic chunking with heading preservation
- Configurable overlap and size limits
- Maintains document structure context

### Phase 4: Embedding
- **embed.py** - Generates vector embeddings via Ollama
- **step_embed.py** - Batch embedding step
- Stores vectors in the Qdrant vector database

### Phase 5: Semantic Analysis
- **analyze.py** - LLM-based semantic analysis
- **analyzers/** - Specialized analyzers:
  - Entity extraction
  - Relation detection
  - Taxonomy classification
  - Ontology mapping

### Phase 6: Knowledge Graph
- **knowledge/** - Entity and relation management
- Builds connections between concepts
- Taxonomy and ontology integration

## Directory Structure

```
content-pipeline/
├── config/
│   ├── settings.py              # Abstracted configuration
│   └── settings.env.example     # Environment template
├── docs/
│   ├── README.md                # This file
│   ├── ARCHITECTURE.md          # Detailed architecture
│   └── API.md                   # API documentation
├── sql/
│   └── schema.sql               # Database schema (DDL)
├── src/
│   ├── pipeline.py              # Main orchestrator
│   ├── config.py                # Configuration loader
│   ├── constants.py             # System constants
│   │
│   ├── # Core Steps
│   ├── detect.py                # Document detection
│   ├── extract.py               # Text extraction
│   ├── chunk.py                 # Semantic chunking
│   ├── embed.py                 # Vector embedding
│   ├── enrich.py                # Metadata enrichment
│   ├── vision.py                # OCR processing
│   │
│   ├── # Database
│   ├── db.py                    # Main DB interface
│   ├── db_core.py               # Core DB operations
│   ├── db_documents.py          # Document operations
│   ├── db_semantic.py           # Semantic data ops
│   ├── db_queue.py              # Queue management
│   │
│   ├── # Analyzers
│   ├── analyzers/
│   │   ├── client.py            # LLM client
│   │   ├── entity_extractor.py
│   │   ├── relation_extractor.py
│   │   ├── taxonomy_classifier.py
│   │   └── semantic_analyzer.py
│   │
│   ├── # Knowledge Graph
│   ├── knowledge/
│   │   ├── entity_extractor.py
│   │   ├── taxonomy_extractor.py
│   │   ├── storage.py
│   │   └── models.py
│   │
│   └── # Utilities
│       ├── json_utils.py
│       └── model_registry.py
└── scripts/
    └── setup.sh                 # Setup script
```

## Database Schema

### Core Tables

| Table | Purpose |
|-------|---------|
| `documents` | Source document metadata |
| `document_pages` | Individual pages (PDF) |
| `chunks` | Text chunks with embeddings |
| `chunk_semantics` | Semantic analysis results |
| `entities` | Extracted named entities |
| `entity_relations` | Entity relationships |
| `taxonomy_terms` | Hierarchical categories |

### Pipeline Tables

| Table | Purpose |
|-------|---------|
| `pipeline_configs` | Pipeline configurations |
| `pipeline_steps` | Step definitions |
| `pipeline_runs` | Execution history |
| `pipeline_queue` | Processing queue |

## Qdrant Collections

```python
# Default collections
QDRANT_COLLECTIONS = {
    "documents": {"size": 1024, "distance": "Cosine"},  # Chunk embeddings
    "entities": {"size": 1024, "distance": "Cosine"},   # Entity embeddings
}
```

### Setup Collections

```python
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

client = QdrantClient(host="localhost", port=6333)

# Create a collection whose vector size matches the embedding model
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1024,  # Match the embedding model dimension
        distance=Distance.COSINE
    )
)
```

## Entity Types

The system recognizes 40+ entity types, including:

| Category | Types |
|----------|-------|
| People & Orgs | PERSON, ORGANIZATION, ROLE |
| Concepts | CONCEPT, THEORY, MODEL, PRINCIPLE, METHOD |
| Content | TOOL, ARTIFACT, PUBLICATION_WORK |
| Structure | PROCESS_PHASE_STEP, INTERVENTION_EXERCISE |
| Relations | RELATIONSHIP_TYPE, COMMUNICATION_PATTERN |

## Customization

### Add New Pipeline Step

1. Create a step module in `src/step_*.py`
2. Register it in the `pipeline_steps` table
3. Implement an `execute(document_id, config)` function (sketched below)
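For step 3, a minimal sketch of what such a module could look like. Only the `execute(document_id, config)` signature comes from the list above; the module name `step_wordcount.py`, the `chunks` column names, and the return convention are illustrative assumptions.

```python
# src/step_wordcount.py -- hypothetical example step; adjust names
# and return convention to the actual pipeline contract.
from db import get_connection

def execute(document_id, config):
    """Count words in each chunk of one document (illustrative only)."""
    conn = get_connection()
    with conn.cursor() as cur:
        # Column names are assumed; adjust to the actual chunks schema.
        cur.execute(
            "SELECT id, content FROM chunks WHERE document_id = %s",
            (document_id,),
        )
        counts = {chunk_id: len(text.split()) for chunk_id, text in cur.fetchall()}
    conn.close()
    return {"status": "ok", "chunks_processed": len(counts)}
```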
### Add New Entity Type

1. Add the type to the `entities.type` ENUM in the schema
2. Update the `entity_types` table
3. Adjust the extraction prompts in `analyzers/`

### Change Embedding Model

1. Update `OLLAMA_EMBEDDING_MODEL` in `.env`
2. Update `EMBEDDING_DIMENSION` to match
3. Recreate the Qdrant collections with the new size
4. Re-embed all documents

## API Usage

### Python Integration

```python
from qdrant_client import QdrantClient

from db import get_connection
from embed import embed_text

# Get a database connection
conn = get_connection()

# Embed the query
query_vector = embed_text("What is systemic therapy?")

# Search Qdrant
client = QdrantClient(host="localhost", port=6333)

results = client.search(
    collection_name="documents",
    query_vector=query_vector,
    limit=5
)
```
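Each entry in `results` is a scored point carrying a similarity score and the payload stored at indexing time. A minimal way to inspect the hits (the payload field name `text` is an assumption; use whatever fields the pipeline stores):

```python
# Print the top hits; adjust the payload field names to what was stored.
for hit in results:
    payload = hit.payload or {}
    print(f"{hit.score:.3f}  {payload.get('text', '')[:80]}")
```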
## Troubleshooting

### Common Issues

1. **Embedding timeouts**
   - Increase `OLLAMA_TIMEOUT` in the config
   - Check the Ollama service status

2. **Poor OCR quality**
   - Ensure the Tesseract language packs are installed
   - Check the image resolution

3. **Memory issues**
   - Reduce `SEMANTIC_QUEUE_BATCH_SIZE`
   - Process documents in smaller batches

### Logging

```bash
# Enable debug logging
export LOG_LEVEL=DEBUG
python src/pipeline.py
```

## License

Internal use only. Contact the system administrator for licensing.