Log #4484

ID: 4484
Timestamp: 2025-12-22 10:08:28.690081
Client: root
IP: 145.224.96.190
Model: claude-sonnet-4-20250514
Status: completed
Tokens: 1,208 (Input: 37, Output: 1,171)
Duration: 44,424 ms
Request time: 2025-12-22 10:08:28.690081
Response time: 2025-12-22 10:09:13.113951

Request

{
    "event": "PreToolUse",
    "tool_name": "WebSearch",
    "tool_input": {
        "query": "best vision model for PDF document analysis 2025 ollama phi llava minicpm"
    }
}
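
The request records a PreToolUse hook event for a WebSearch call. Below is a minimal sketch of a hook that consumes such a payload on stdin, assuming a Claude Code-style setup where the hook script is registered in the settings, jq is installed, and exit code 0 allows the tool call to proceed; the log path is hypothetical:

```bash
#!/usr/bin/env bash
# Read the PreToolUse event JSON from stdin.
payload=$(cat)

# Pull out the tool name and, for WebSearch events, the query.
tool=$(jq -r '.tool_name' <<<"$payload")
query=$(jq -r '.tool_input.query // empty' <<<"$payload")

# Append a timestamped line to a local audit log (hypothetical path).
printf '%s %s: %s\n' "$(date -u +%FT%TZ)" "$tool" "$query" >> "$HOME/tool-audit.log"

# Exit 0 to let the tool call proceed unchanged.
exit 0
```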

Response

{
    "tool_response": {
        "query": "best vision model for PDF document analysis 2025 ollama phi llava minicpm",
        "results": [
            {
                "tool_use_id": "srvtoolu_01E19t8E4C3MNRaRwA88L3VF",
                "content": [
                    {
                        "title": "GitHub - OpenBMB\/MiniCPM-V: MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone",
                        "url": "https:\/\/github.com\/OpenBMB\/MiniCPM-V"
                    },
                    {
                        "title": "GitHub - imanoop7\/Ollama-OCR",
                        "url": "https:\/\/github.com\/imanoop7\/Ollama-OCR"
                    },
                    {
                        "title": "Best Local Vision-Language Models for Offline AI",
                        "url": "https:\/\/blog.roboflow.com\/local-vision-language-models\/"
                    },
                    {
                        "title": "openbmb\/MiniCPM-V-4_5 · Hugging Face",
                        "url": "https:\/\/huggingface.co\/openbmb\/MiniCPM-V-4_5"
                    },
                    {
                        "title": "library",
                        "url": "https:\/\/ollama.com\/library"
                    },
                    {
                        "title": "Vision Models | imanoop7\/Ollama-OCR | DeepWiki",
                        "url": "https:\/\/deepwiki.com\/imanoop7\/Ollama-OCR\/7-vision-models"
                    },
                    {
                        "title": "The Latest Ollama Models in 2025 Update",
                        "url": "https:\/\/www.elightwalk.com\/blog\/latest-ollama-models"
                    },
                    {
                        "title": "Vision models · Ollama Blog",
                        "url": "https:\/\/ollama.com\/blog\/vision-models"
                    },
                    {
                        "title": "Best Open-Source Vision Language Models of 2025",
                        "url": "https:\/\/www.labellerr.com\/blog\/top-open-source-vision-language-models\/"
                    },
                    {
                        "title": "llava-phi3",
                        "url": "https:\/\/ollama.com\/library\/llava-phi3"
                    }
                ]
            },
            "Based on the search results, here's a comprehensive overview of the best vision models for PDF document analysis in 2025 that work with Ollama:\n\n## Top Vision Models for PDF\/Document Analysis\n\n### **MiniCPM-V (Recommended for High-Resolution Documents)**\nBased on LLaVA-UHD architecture, MiniCPM-V 4.5 can process high-resolution images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), using 4x fewer visual tokens than most MLLMs.(no content)With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B for vision-language capabilities.(no content)MiniCPM-V 2.6 handles images up to 1.8 million pixels and shows top-tier results on OCR use cases.(no content)MiniCPM-V is designed to handle high-resolution images with flexible aspect ratios. Usage Recommendation: Ideal for processing high-resolution images or scans where maintaining detail is important, such as large format documents or detailed diagrams.\n\n### **Llama 3.2 Vision (Best for Complex Documents)**\nLlama 3.2 Vision: Advanced model with high accuracy for complex documents.(no content)Llama 3.2 Vision is an advanced vision-language model with high accuracy, particularly for complex documents. Usage Recommendation: Ideal for processing complex documents where accuracy is critical, such as legal documents, financial statements, or technical papers.\n\n### **Granite3.2-Vision (Designed for Document Understanding)**\nGranite3.2-vision: A compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more.\n\n### **LLaVA 1.6 (General Purpose)**\nThe LLaVA (Large Language-and-Vision Assistant) model collection has been updated to version 1.6 supporting: Higher image resolution: support for up to 4x more pixels, allowing the model to grasp more details. Improved text recognition and reasoning capabilities: trained on additional document, chart and diagram data sets.(no content)Usage Recommendation: Best suited for general-purpose OCR tasks where processing speed is important, and occasional inaccuracies are acceptable.\n\n### **LLaVA-Phi3 (Lightweight Option)**\nllava-phi3 is a LLaVA model fine-tuned from Phi 3 Mini 4k, with strong performance benchmarks on par with the original LLaVA model.\n\n### **Qwen-VL 2.5**\nQwen-VL 2.5: Capable of document OCR, layout analysis, translation, and visual reasoning.(no content)Qwen supports 29 languages and dynamic resolution for mixed image-text docs.\n\n## Ollama OCR Package\nA powerful OCR (Optical Character Recognition) package that uses state-of-the-art vision language models through Ollama to extract text from images and PDF. Available both as a Python package and a Streamlit web application.\n\n**Quick Install Commands:**\n```bash\nollama pull llama3.2-vision:11b\nollama pull granite3.2-vision\nollama pull minicpm-v\nollama pull moondream\n```\n\n## Summary Recommendation\n- **Best overall for documents**: MiniCPM-V or Llama 3.2 Vision\n- **Best for structured data (tables\/charts)**: Granite3.2-Vision\n- **Best for lightweight\/edge deployment**: LLaVA-Phi3 or Moondream\n- **Best for multilingual documents**: Qwen-VL 2.5"
        ],
        "durationSeconds": 33.001641676999625
    }
}
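
The response's install commands pull image models; they do not read PDFs directly, so a page has to be rasterized first. A minimal end-to-end sketch, assuming Ollama is running on its default port (11434) with `minicpm-v` already pulled, and that `pdftoppm` (poppler-utils) and jq 1.6+ are installed; `document.pdf` is a hypothetical input file:

```bash
# Render page 1 of the PDF to page.png at 200 dpi.
pdftoppm -png -r 200 -f 1 -l 1 -singlefile document.pdf page

# Base64-encode the image; --rawfile keeps the large string off argv.
base64 < page.png | tr -d '\n' > page.b64
jq -n --rawfile img page.b64 '{
  model: "minicpm-v",
  prompt: "Extract all text from this document page.",
  stream: false,
  images: [$img]
}' > payload.json

# POST to the local Ollama generate endpoint and print the answer.
curl -s http://localhost:11434/api/generate --data-binary @payload.json \
  | jq -r '.response'
```

Looping the `-f`/`-l` page range and concatenating the responses gives a rough whole-document pass; the Ollama-OCR package cited in the results wraps essentially this flow.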