phase2-ATS + heuristic careers finding

2026-06-17 17:33:11 -04:00
parent cd9ab9b95e
commit 113a4ced36
11 changed files with 2836 additions and 39 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -177,10 +177,20 @@ pytest -q

 ## Gotchas (append confirmed findings here as you build — this section is durable memory across /clear)

- Verify ATS JSON field names against live responses before trusting them: Greenhouse `jobs[].absolute_url`; Lever `[].hostedUrl`; Ashby `jobs[].jobUrl`; Workday varies by tenant. Fix in code AND note the confirmed shape here.
+- **ATS JSON shapes confirmed live (2026-06-17)** — use these field names in code:
+  - **Greenhouse**: `GET https://boards-api.greenhouse.io/v1/boards/{slug}/jobs?per_page=1` → `{"jobs":[{"absolute_url":"…"}],"meta":{"total":N}}`. Detect from `boards.greenhouse.io/{slug}`, `job-boards.greenhouse.io/{slug}`, or `embed/job_board?for={slug}` in page HTML. Individual job URLs may appear on either `boards.greenhouse.io` or `job-boards.greenhouse.io` — both are valid; always use the `absolute_url` field verbatim.
+  - **Lever**: `GET https://api.lever.co/v0/postings/{slug}?mode=json&limit=1` → **JSON array** `[{"hostedUrl":"…"}]` (empty array `[]` for unknown slug, not 404). Detect from `jobs.lever.co/{slug}`.
+  - **Ashby**: `GET https://api.ashbyhq.com/posting-api/job-board/{slug}` → `{"jobs":[{"jobUrl":"…"}],"apiVersion":"…"}`. **GET only — not POST** (earlier entry was wrong). Slug case is preserved in the returned `jobUrl` but the API is case-insensitive — both `ramp` and `Ramp` return 200 with the same jobs. Detect from `jobs.ashbyhq.com/{slug}`.
+  - **Workday**: `POST https://{host}/wday/cxs/{tenant}/{site}/jobs` body `{"appliedFacets":{},"limit":1,"offset":0,"searchText":""}` → `{"total":N,"jobPostings":[{"externalPath":"/job/…"}]}`. Build job URL as `https://{host}/en-US/{site}{externalPath}`. Detect from `{tenant}.wd{N}.myworkdayjobs.com/…/{site}` in HTML.
+- **ATS embeds are typically on the /careers page, not the homepage** (verified 2026-06-17 against Vercel/Figma/Ramp/Anthropic). `detect_ats_in_html` on the homepage will miss most companies. The `_finalize` upgrade in `cascade.py` handles this by fetching the heuristic-found careers URL and re-running ATS detection on it — do not remove this step.
+- **SPA-rendered careers pages (e.g. Anthropic/Next.js) cannot be detected by static HTML parsing**: the ATS embed is injected by JavaScript after the initial page load. Static-tier resolution falls through to `url_pattern` only; the browser-agent tier is needed for full ATS detection on these sites. Anthropic is confirmed to use Greenhouse slug `anthropic` (370+ jobs as of 2026-06-17).
+- **Soft-404 and off-brand redirect filtering in `probe_url_patterns`** (added 2026-06-17): Netflix `/careers` redirects to `/NotFound?prev=…` with HTTP 200 (SPA catch-all); Microsoft `/careers` redirects to `bing.com` via aka.ms. Both are rejected by `_is_plausible_careers_url` in `heuristics.py`, so the next probe candidates are tried instead.
+- **Live smoke-test results (2026-06-17, 10 companies)**: 10/10 careers-URL hit rate; 4/10 ATS-resolved (Vercel→greenhouse, Linear→ashby, Figma→greenhouse, Ramp→ashby); 4/10 position_url populated. Google/Microsoft/Apple/Netflix/Amazon/Anthropic resolve via `url_pattern` only (custom or SPA-rendered ATS).
 - **JobSpy `company_url_direct` fill rate: 0% observed** (5/5 jobs had `website=None` in a live fetch on 2026-06-17, search: "software engineer", United States, `linkedin_fetch_description=False`). `resolve.py` is essential for **every** job, not just a gap-filler. Do not assume any job arrives with a website pre-populated.
 - **JobSpy `date_posted` / `listed_at` fill rate: ~40% observed** (2/5 jobs had a date; 3/5 were `None`). This is because `linkedin_fetch_description=False` (our default for speed) means LinkedIn's posted date is often absent. `listed_at` is best-effort metadata only; do not gate pipeline logic on it.
 - **JobSpy confirmed column names** (verified 2026-06-17): `job_url` (full LinkedIn URL incl. tracking params), `company` (display name), `company_url_direct` (company own site — always `None` in practice so far), `date_posted` (sparse when `linkedin_fetch_description=False`), `title`, `location`, `id` (may be `None`; always parse job_id from `job_url` instead). `company_url` is the LinkedIn *company page* URL — never use it as the company website.
+- **Tier 1b slug-guess ATS recovery (added 2026-06-17)**: when HTML detection misses (JS-embedded / SPA boards like Anthropic's Next.js Greenhouse embed), `recover_via_slug_guess()` in `ats.py` probes Greenhouse/Lever/Ashby public JSON APIs with guessed slugs. Slug candidates: domain stem first (e.g. `anthropic.com` → `anthropic`), then `_slug(company_name)` (strips legal suffixes/punctuation). False-positive guards: require `job_count > 0`; Greenhouse responses include `company_name` (used for a loose substring cross-check against input; clear mismatches rejected). Workday skipped (needs tenant+site). Confidence 0.90 / method `ats:{name}:slug_guess`. Wired in `cascade.py` as Tier 1b: fires after HTML-ATS misses, before URL-pattern probing.
+- **Greenhouse `company_name` field**: live Greenhouse API responses include `jobs[0]["company_name"]` — used by slug-guess cross-check. Do not remove this field from the `ATSFetch` parsing in `_fetch_greenhouse`.
 - LinkedIn parses the numeric job id from `/jobs/view/{id}`; strip tracking query params.
 - Browser Use needs Chromium installed (`playwright install chromium`) and an LLM key; without them the tier must degrade gracefully.
 - LinkedIn rate-limits aggressively; keep batches small while testing.
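
The detection patterns and endpoints recorded in the ATS gotchas above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `sketch_detect_ats`, `sketch_api_url`, and `sketch_workday_job_url` are hypothetical names (not the repo's `detect_ats_in_html` or `_fetch_greenhouse`), the regexes are assumptions about how slugs appear in page HTML, and Workday detection is left out because it needs tenant and site. Only the endpoint templates and the Workday URL rule come from the notes.

```python
import re
from typing import Optional, Tuple

# Assumed slug patterns; the embed form is checked first so that
# "boards.greenhouse.io/embed/..." is never read as the slug "embed".
_PATTERNS = [
    ("greenhouse", re.compile(r"embed/job_board\?for=([\w-]+)")),
    # Unanchored, so this also matches job-boards.greenhouse.io/{slug}.
    ("greenhouse", re.compile(r"boards\.greenhouse\.io/(?!embed/)([\w-]+)")),
    ("lever", re.compile(r"jobs\.lever\.co/([\w-]+)")),
    ("ashby", re.compile(r"jobs\.ashbyhq\.com/([\w-]+)")),
]

# Endpoint templates copied from the confirmed shapes in the notes.
_ENDPOINTS = {
    "greenhouse": "https://boards-api.greenhouse.io/v1/boards/{slug}/jobs?per_page=1",
    "lever": "https://api.lever.co/v0/postings/{slug}?mode=json&limit=1",
    "ashby": "https://api.ashbyhq.com/posting-api/job-board/{slug}",
}


def sketch_detect_ats(html: str) -> Optional[Tuple[str, str]]:
    """Return (ats_name, slug) for the first pattern found in html, else None."""
    for name, pattern in _PATTERNS:
        match = pattern.search(html)
        if match:
            return name, match.group(1)
    return None


def sketch_api_url(name: str, slug: str) -> str:
    """Build the public JSON endpoint (all three are GET) for a detected board."""
    return _ENDPOINTS[name].format(slug=slug)


def sketch_workday_job_url(host: str, site: str, external_path: str) -> str:
    """Apply the rule https://{host}/en-US/{site}{externalPath}."""
    return f"https://{host}/en-US/{site}{external_path}"
```

Because static HTML from SPA-rendered pages contains none of these hostnames, a function like this returns `None` there, which is the case the slug-guess tier is meant to cover.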