ldy
2026-06-17 08:38:15 -04:00
commit f13b8fc1ca
28 changed files with 894 additions and 0 deletions

@@ -0,0 +1,11 @@
"""Deterministic careers-page heuristics: URL probing, homepage scan, sitemap (Stage 2, tiers 2-4).
Scaffold stub -- not implemented yet.
"""
# TODO (Stage 2, tiers 2-4): implement per CLAUDE.md "Stage 2 — URL patterns / homepage / sitemap".
# Tier 2 — URL patterns: probe /careers, /career, /jobs, /join-us, /join,
# careers.{domain}, jobs.{domain} via HTTP HEAD (or GET if HEAD fails).
# Tier 3 — Homepage link scan: fetch homepage HTML, parse with BeautifulSoup + lxml,
# rank <a> anchors by career/job keywords in href/text, return highest-ranked.
# Tier 4 — Sitemap: fetch sitemap.xml (and sitemap index if present), scan for career/job URLs.
# Each function returns the careers-page URL as a str, or None on a miss, so cascade.py
#   can return early on the first hit.
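#
# A minimal sketch of how the three tiers described above might be shaped, written as
# pure helpers so the HTTP fetching (HEAD/GET) stays with the caller. Every function and
# constant name below is hypothetical; only the probe paths, subdomains, and the
# career/job keywords come from the TODO. The stdlib html.parser and xml.etree modules
# stand in for BeautifulSoup + lxml here purely to keep the sketch dependency-free, and
# sitemap-index recursion is left out.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

# Paths and subdomains as listed in the TODO; the constant names are assumptions.
CAREER_PATHS = ("/careers", "/career", "/jobs", "/join-us", "/join")
CAREER_SUBDOMAINS = ("careers", "jobs")
KEYWORDS = ("career", "job")


def candidate_urls(domain: str) -> list[str]:
    """Tier 2: build the URLs to probe, paths first, then subdomains."""
    urls = [f"https://{domain}{path}" for path in CAREER_PATHS]
    urls += [f"https://{sub}.{domain}/" for sub in CAREER_SUBDOMAINS]
    return urls


class _AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self) -> None:
        super().__init__()
        self.anchors: list[tuple[str, str]] = []
        self._href: str | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def _score(href: str, text: str) -> int:
    """Assumed weighting: a keyword in the href counts double one in the link text."""
    href_l, text_l = href.lower(), text.lower()
    return sum(2 * (kw in href_l) + (kw in text_l) for kw in KEYWORDS)


def best_homepage_link(html: str, base_url: str) -> str | None:
    """Tier 3: return the highest-scoring anchor as an absolute URL, or None."""
    parser = _AnchorCollector()
    parser.feed(html)
    best = max(parser.anchors, key=lambda a: _score(*a), default=None)
    if best is None or _score(*best) == 0:
        return None
    return urljoin(base_url, best[0])


def first_sitemap_match(xml_text: str) -> str | None:
    """Tier 4: return the first <loc> URL containing a keyword, or None."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Tags are namespaced, e.g. "{http://www.sitemaps.org/...}loc".
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text:
            url = el.text.strip()
            if any(kw in url.lower() for kw in KEYWORDS):
                return url
    return None
```

# Each helper returns None on a miss, so a caller such as cascade.py could try them in
# tier order and stop at the first non-None result.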