Crawler Phase 3 – Infrastructure Layer (T-030..T-040)

ID: 756
Status: completed
Type: ai_task
Created by: mcp-tasks (ai)
Created: 2026-04-21 02:47:45.654298
Completed: 2026-04-21 03:02:00.890525

Description

MariaDB adapters (12 repositories + connection), DbUrlQueue, HttpFetcher (httpx), PlaywrightFetcher (Chromium), HtmlParser (BS4/lxml), StructuredDataParser (extruct), FileStorage, Logger, SystemClock. Plus the url_queue table added to the schema. Integration tests against a live DB + fixture server.


Results

claude (claude): success
Tokens: - (In: -, Out: -)
Duration: -
Time: 2026-04-21 03:01:57
Response:
Phase 3 Infrastructure Layer completed.

Persistence (src/infrastructure/persistence/):
- MariaDbConnectionFactory with a session() context manager (commit/rollback).
- 12 repositories (CrawlRepository, UrlRepository, PageRepository, LinkRepository, ImageRepository, ResourceRepository, StructuredDataRepository, HreflangRepository, JsErrorRepository, PageHeaderRepository, CrawlErrorRepository, ExternalCheckRepository) — each class ≤ 80 LOC, achieved by extracting the mappers into module-level functions.
- Mappers (crawl_mapper, url_mapper, page_mapper) as pure transform functions; page_mapper split into 7 small _*_fields helpers.
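The session() pattern above can be sketched as follows. This is a minimal illustration, not the project's code: sqlite3 stands in for the MariaDB driver purely so the sketch is self-contained, and the class/method names follow the report.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


class ConnectionFactory:
    """Sketch of the session() idea: commit on success, rollback on error.

    sqlite3 is used here only for portability; the real
    MariaDbConnectionFactory wraps a MariaDB connection the same way.
    """

    def __init__(self, dsn: str) -> None:
        self._dsn = dsn

    @contextmanager
    def session(self) -> Iterator[sqlite3.Connection]:
        conn = sqlite3.connect(self._dsn)
        try:
            yield conn        # caller runs queries inside the with-block
            conn.commit()     # no exception: persist the transaction
        except Exception:
            conn.rollback()   # any error: undo and re-raise
            raise
        finally:
            conn.close()
```

Repositories then only ever see a live connection and never manage transaction boundaries themselves.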

Queue: DbUrlQueue (url_queue table, FIFO + dedupe via UNIQUE KEY + FOR UPDATE lock).

Fetchers:
- HttpxFetcher: async httpx with redirect following, retry-safe behavior, and a structured FetchResult.
- PlaywrightFetcher: headless Chromium; pageerror/console/response listeners for JS errors and resources, optional screenshot, DOM node count via JS, performance metrics.

Parsers:
- Bs4HtmlParser (BeautifulSoup + lxml): title, meta (description/keywords/robots/canonical/lang/charset/viewport), OG, Twitter, h1-h6 counts + texts, word count, SHA-256 text_hash, hreflang, internal/external links, images with alt/title/dimensions.
- ExtructStructuredDataParser: JSON-LD, Microdata, RDFa, and OpenGraph via extruct.
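The word-count and SHA-256 text_hash step from the parser can be sketched with stdlib tools alone (the real Bs4HtmlParser extracts text via BeautifulSoup + lxml; the whitespace normalization shown here is an assumption):

```python
import hashlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data: str) -> None:
        if not self._skip_depth:
            self.chunks.append(data)


def text_metrics(html: str) -> tuple[int, str]:
    """Return (word_count, text_hash) for an HTML document."""
    extractor = _TextExtractor()
    extractor.feed(html)
    # Collapse all runs of whitespace so the hash is layout-independent.
    text = " ".join(" ".join(extractor.chunks).split())
    return len(text.split()), hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Hashing the normalized text rather than the raw HTML makes the hash stable across markup-only changes, which is useful for duplicate-content detection across a crawl.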

Storage: LocalFileStorage (gzip HTML, PNG screenshots) with url_hash sanitization.
Logging: JsonLogger (stdlib logging with a JSON formatter).
Clock: SystemClock (UTC).
Config: db_config (loads credentials via /var/www/scripts/pipeline/config.py), constants.py (HTTP_CONCURRENCY, USER_AGENT, TIMEOUTs, STORAGE_ROOTs).
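The url_hash sanitization in LocalFileStorage can be sketched like this: deriving the filename from a SHA-256 of the URL turns arbitrary URLs (query strings, unicode, slashes) into safe, fixed-length file names. Method names here are assumptions; the report only names the class.

```python
import gzip
import hashlib
from pathlib import Path


class LocalFileStorage:
    """Sketch of the storage adapter: HTML gzipped to disk under a url_hash."""

    def __init__(self, root: Path) -> None:
        self._root = root
        root.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def url_hash(url: str) -> str:
        # Fixed-length hex digest: no path separators, no unsafe characters.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def save_html(self, url: str, html: str) -> Path:
        path = self._root / f"{self.url_hash(url)}.html.gz"
        path.write_bytes(gzip.compress(html.encode("utf-8")))
        return path

    def load_html(self, url: str) -> str:
        path = self._root / f"{self.url_hash(url)}.html.gz"
        return gzip.decompress(path.read_bytes()).decode("utf-8")
```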

Schema extension: url_queue table (13 tables total). MCP DB allowlist updated.

Quality gates (all green):
- G1 Ruff ALL, Ruff format, Mypy --strict (102 files), Radon cc/mi, MaxLOC 80 — all PASS
- G2 Unit tests 73/73 PASS
- G3 Integration tests 9/9 PASS (CrawlRepository, UrlRepository, PageRepository roundtrip, DbUrlQueue, LocalFileStorage, Bs4HtmlParser — all against the live DB or tmp_path, respectively)
- G4 Import-linter: 3 Contracts kept (Hexagonal-Direction, Infra→!Interfaces, Domain→!third-party)
- G5 pip-audit: no vulnerabilities
- G6 schema-drift: 13 tables in sync
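The three import-linter contracts from G4 could look roughly like this in an `.importlinter` config. The package and module names are assumptions (the report only names the contract directions), and the third-party list is illustrative:

```ini
[importlinter]
root_package = src
include_external_packages = True

; Hexagonal direction: outer layers may import inner, never the reverse.
[importlinter:contract:hexagonal-direction]
name = Hexagonal direction
type = layers
layers =
    src.interfaces
    src.infrastructure
    src.application
    src.domain

[importlinter:contract:infra-not-interfaces]
name = Infrastructure must not import interfaces
type = forbidden
source_modules = src.infrastructure
forbidden_modules = src.interfaces

[importlinter:contract:domain-no-third-party]
name = Domain must stay free of third-party packages
type = forbidden
source_modules = src.domain
forbidden_modules =
    httpx
    playwright
    bs4
    extruct
```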

Coverage: 72% (interfaces layer still empty → no tests there yet)
