Crawler Phase 3 – Infrastructure Layer (T-030..T-040)
| ID | 756 |
|---|---|
| Status | completed |
| Type | ai_task |
| Created by | mcp-tasks (ai) |
| Created | 2026-04-21 02:47:45.654298 |
| Completed | 2026-04-21 03:02:00.890525 |
Description
MariaDB adapters (12 repositories + connection), DbUrlQueue, HttpFetcher (httpx), PlaywrightFetcher (Chromium), HtmlParser (BS4/lxml), StructuredDataParser (extruct), FileStorage, Logger, SystemClock. Also adds the url_queue table to the schema. Integration tests run against a live DB plus a fixture server.
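These components are adapters sitting behind domain-side ports. As a hedged sketch of that shape (the port and field names here are assumptions for illustration, not the repo's actual API), a fetcher port that both HttpFetcher and PlaywrightFetcher could implement might look like this:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class FetchResult:
    """Hypothetical result type; the field names are illustrative assumptions."""

    url: str
    status_code: int
    body: bytes


@runtime_checkable
class Fetcher(Protocol):
    """Port the domain depends on; infrastructure adapters implement it."""

    async def fetch(self, url: str) -> FetchResult:
        ...
```

Keeping the port in the domain layer and the httpx/Playwright adapters in infrastructure is what lets the import-linter direction contracts below hold.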
Actions
Results
claude (claude)
success
| Tokens | - (In: -, Out: -) |
|---|---|
| Duration | - |
| Time | 2026-04-21 03:01:57 |
Response:
Phase 3 infrastructure layer complete.

Persistence (src/infrastructure/persistence/):
- MariaDbConnectionFactory with a session() context manager (commit/rollback).
- 12 repositories (CrawlRepository, UrlRepository, PageRepository, LinkRepository, ImageRepository, ResourceRepository, StructuredDataRepository, HreflangRepository, JsErrorRepository, PageHeaderRepository, CrawlErrorRepository, ExternalCheckRepository); each class stays ≤ 80 LOC by moving the mappers out into module-level functions.
- Mappers (crawl_mapper, url_mapper, page_mapper) as pure transform functions; page_mapper split into 7 small _*_fields helpers.

Queue:
- DbUrlQueue (url_queue table, FIFO + dedupe via UNIQUE KEY + FOR UPDATE lock).

Fetchers:
- HttpxFetcher: async httpx with redirect following, retry-safe, structured FetchResult.
- PlaywrightFetcher: headless Chromium; pageerror/console/response listeners for JS errors and resources; optional screenshot; DOM node count via JS; performance metrics.

Parsers:
- Bs4HtmlParser (BeautifulSoup + lxml): title, meta (description/keywords/robots/canonical/lang/charset/viewport), OG, Twitter, h1-h6 counts + texts, word count, SHA-256 text_hash, hreflang, internal/external links, images with alt/title/dimensions.
- ExtructStructuredDataParser: JSON-LD, Microdata, RDFa, and OpenGraph via extruct.

Storage:
- LocalFileStorage (gzip HTML, PNG screenshots) with url_hash sanitization.

Logging:
- JsonLogger (stdlib logging with a JSON formatter).

Clock:
- SystemClock (UTC).

Config:
- db_config (loads credentials via /var/www/scripts/pipeline/config.py), constants.py (HTTP_CONCURRENCY, USER_AGENT, TIMEOUTs, STORAGE_ROOTs).

Schema extension:
- url_queue table (13 tables total). MCP DB allowlist updated.
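The session() commit/rollback pattern described above can be sketched with contextlib. This is a minimal illustration, not the repo's implementation: it uses sqlite3 as an in-process stand-in for a MariaDB driver, and the class name mirrors MariaDbConnectionFactory only loosely:

```python
import sqlite3
from collections.abc import Iterator
from contextlib import contextmanager


class ConnectionFactory:
    """Illustrative stand-in for MariaDbConnectionFactory, backed by sqlite3."""

    def __init__(self, dsn: str = ":memory:") -> None:
        self._conn = sqlite3.connect(dsn)

    @contextmanager
    def session(self) -> Iterator[sqlite3.Connection]:
        """Commit on success, roll back on any exception."""
        try:
            yield self._conn
            self._conn.commit()
        except Exception:
            self._conn.rollback()
            raise
```

With this shape, a repository receives its connection from session() and never manages transactions itself, which is what keeps each repository class small.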
Quality gates (all green):
- G1 Ruff ALL, Ruff format, Mypy --strict (102 files), Radon cc/mi, MaxLOC 80: all PASS
- G2 Unit tests: 73/73 PASS
- G3 Integration tests: 9/9 PASS (CrawlRepository, UrlRepository, PageRepository roundtrip, DbUrlQueue, LocalFileStorage, Bs4HtmlParser; all against the live DB or tmp_path)
- G4 import-linter: 3 contracts kept (hexagonal direction, Infra→!Interfaces, Domain→!third-party)
- G5 pip-audit: no vulnerabilities
- G6 schema drift: 13 tables in sync

Coverage: 72% (the interfaces layer is still empty, so it has no tests yet).
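The LocalFileStorage integration test against tmp_path presumably exercises a store/load roundtrip. As a hedged sketch (store_html, load_html, and the exact hashing/naming scheme are assumptions, not the repo's API), the gzip-plus-url_hash idea looks like this:

```python
import gzip
import hashlib
from pathlib import Path


def store_html(root: Path, url: str, html: str) -> Path:
    """Write gzip-compressed HTML under a SHA-256 url_hash filename.

    Hashing the URL sanitizes it: no path separators, query strings,
    or unusual characters from the URL ever reach the filesystem.
    """
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = root / f"{url_hash}.html.gz"
    path.write_bytes(gzip.compress(html.encode("utf-8")))
    return path


def load_html(path: Path) -> str:
    """Read back and decompress a stored page."""
    return gzip.decompress(path.read_bytes()).decode("utf-8")
```

A test would point root at pytest's tmp_path, store a page, and assert the decompressed content matches byte for byte.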