12 lines
768 B
Python
12 lines
768 B
Python
"""Deterministic careers-page heuristics: URL probing, homepage scan, sitemap (Stage 2, tiers 2–4).
|
||
|
||
Scaffold stub -- not implemented yet.
|
||
"""
|
||
# TODO (Stage 2, tiers 2–4): implement per CLAUDE.md "Stage 2 — URL patterns / homepage / sitemap".
|
||
# Tier 2 — URL patterns: probe /careers, /career, /jobs, /join-us, /join,
|
||
# careers.{domain}, jobs.{domain} via HTTP HEAD (or GET if HEAD fails).
|
||
# Tier 3 — Homepage link scan: fetch homepage HTML, parse with BeautifulSoup + lxml,
|
||
# rank <a> anchors by career/job keywords in href/text, return highest-ranked.
|
||
# Tier 4 — Sitemap: fetch sitemap.xml (and sitemap index if present), scan for career/job URLs.
|
||
# Each function returns (url: str | None) so cascade.py can return early on first hit.
|