scaffold

2026-06-17 08:38:15 -04:00
commit f13b8fc1ca
28 changed files with 894 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,186 @@
+# CLAUDE.md
+
+Operating instructions for Claude Code on this project. Read this fully before planning or editing. These are decisions, not suggestions — do not re-derive or override them without asking.
+
+---
+
+## Project goal
+
+Build the **AI Job Source Agent**: a Python pipeline that, for recently posted LinkedIn jobs, produces records of the form:
+
+```
+company_name, career_page_url, open_position_url
+```
+
+It runs in configurable batches, on a schedule, and is **incremental** — re-runs process only NEW jobs. The output is a CSV at `output/results.csv` plus rows in a local database.
+
+The four logical steps:
+1. From LinkedIn job listings, get **company name** and **company website URL**.
+2. From the company website, find the **careers/jobs page URL**.
+3. From the careers page, get **one open position URL**.
+4. Emit `company_name, career_page_url, open_position_url`.
+
+---
+
+## Architecture decisions (non-negotiable)
+
+**This is a WORKFLOW, not a multi-agent system.** The orchestrator is plain code (Prefect), not an LLM. Most stages are deterministic. Genuine LLM agency appears in exactly one place: the last-resort fallback for steps 2–3.
+
+1. **Stage 1 (ingestion) uses a managed data API — NEVER browser automation on LinkedIn.** LinkedIn is a hostile anti-bot target and browser agents get blocked / require login (ToS + ban risk). Default provider is **JobSpy** (free); **Apify** is a drop-in alternative behind the same interface. No hand-built LinkedIn crawler.
+2. **Company website is a separate, deterministic resolution step.** LinkedIn often does not expose the company's own site. Resolve `company name → website` via the provider field if present, else a verified domain guess, else an optional search API. This is plumbing, not an agent.
+3. **Steps 2 and 3 share ONE cascade, cheapest tier first.** Each tier returns early on success. A full browser agent is the LAST tier only.
+4. **When the browser-agent tier fires, it does steps 2 AND 3 in a single session** (find careers page + return one job URL). One agent run, not two.
+5. **Dedup keys:** jobs are keyed on the LinkedIn numeric `jobPostingId` (parsed from the job URL); companies are keyed on normalized domain. Resolved careers pages are cached per company so a company is never re-resolved.
+6. **Everything swappable lives behind an interface** (provider pattern): job sources, the careers cascade tiers, the extractor. Swapping JobSpy↔Apify, or heuristics↔agent, must not require touching neighbors.
+7. **No fine-tuning.** The task is solved with tool use + prompting + the cascade. Use a small/cheap model for link classification and a stronger model for the browser agent; both configurable.
+8. **Graceful degradation is mandatory.** If the LLM key or Browser Use / its browser is unavailable at runtime, the affected tier logs clearly and returns `None`, and the pipeline still completes (those records get status `needs_review`).
+9. **Design for extension:** adding new ingestion sources (Indeed, Wellfound, ATS firehoses) and swapping SQLite→Postgres should drop in behind the existing interfaces without refactors. Cross-source dedup (later) will use a `(company_domain, normalized_title, location)` fingerprint.
+
+---
+
+## Pipeline stages (the cascade, in order)
+
+**Stage 1 — Ingest (deterministic):** call the job source for recent postings (`hours_old` window) → list of `RawJob{job_id, company, website?, linkedin_url, listed_at}`. Dedup by `job_id`.
+
+**Stage 1b — Resolve website (deterministic):** if `website` empty, resolve from company name (verified `{slug}.com` guess, optional search API).
+
+**Stage 2 — Find careers page (cascade, return on first hit):**
+1. **ATS detection** — detect Greenhouse / Lever / Ashby / Workday from the site and use their **public JSON APIs** (most reliable; also yields a job URL for Stage 3).
+2. **URL patterns** — probe `/careers`, `/career`, `/jobs`, `/join-us`, `/join`, `careers.{domain}`, `jobs.{domain}`.
+3. **Homepage link scan** — fetch homepage, rank anchors by career/job keywords in href/text.
+4. **Sitemap** — parse `sitemap.xml` for career/job URLs.
+5. **Cheap-LLM classification** — pass extracted anchors to a small model; pick the careers link (Pydantic AI, typed output).
+6. **Browser-agent fallback** — Browser Use; fused with Stage 3 (see below).
+
+**Stage 3 — Extract one open position (return on first hit):**
+1. **ATS JSON** — if ATS known from Stage 2, return the first posting URL directly.
+2. **JobPosting JSON-LD** — parse `application/ld+json` for a `url`.
+3. **Job-like anchors** — first link matching `/job`, `/position`, `/opening`, `/vacancy`.
+4. **Cheap-LLM classification** — pick the single-job link from anchors.
+5. **Browser-agent fallback** — handled inside the fused Stage-2 agent call.
+
+**Stage 4 — Persist & export:** write status to DB, export the 3-field CSV.
+
+---
+
+## Tech stack
+
+- **Python 3.11+**
+- **Orchestration/scheduling:** Prefect (`@flow`, retries, interval schedule). Cron documented as a no-daemon fallback.
+- **HTTP:** httpx (shared client; timeouts + bounded retries).
+- **HTML parsing:** BeautifulSoup + lxml.
+- **Ingestion:** JobSpy (`python-jobspy`) default; Apify (`apify-client`) alternative.
+- **Structured LLM extraction:** Pydantic AI (model-agnostic, typed).
+- **Browser agent (fallback only):** Browser Use (`browser-use`) + Playwright/Chromium.
+- **Config:** pydantic-settings (env-driven).
+- **Data models:** Pydantic v2.
+- **Storage:** SQLite via stdlib `sqlite3` (Postgres-ready behind the DB module).
+- **Tests:** pytest.
+
+Do not add other heavy dependencies without asking. (`uv` may be used instead of pip/venv if preferred.)
+
+---
+
+## Project structure
+
+```
+jobsource/
+  __init__.py
+  config.py            # pydantic-settings; env-driven; model IDs/keys read from env with placeholder defaults (never hardcode or look up model IDs)
+  models.py            # Pydantic: RawJob, JobResult; JobStatus enum
+  http.py              # shared httpx client factory: timeout, headers, retry
+  db.py                # SQLite: companies, jobs; dedup, company cache, CSV export
+  resolve.py           # company name -> website (deterministic)
+  sources/
+    __init__.py
+    base.py            # JobSource interface: fetch_recent_jobs() -> list[RawJob]
+    jobspy_source.py   # default provider
+    apify_source.py    # alternative provider (same interface)
+  careers/
+    __init__.py
+    cascade.py         # find_careers_page() orchestrates the tiers
+    ats.py             # ATS detect + public JSON (Greenhouse/Lever/Ashby/Workday)
+    heuristics.py      # URL patterns, homepage scan, sitemap
+    classify_llm.py    # Pydantic AI link classifier (careers link / job link)
+  extract.py           # extract_open_position(): ATS -> JSON-LD -> anchors -> LLM
+  agent_fallback.py    # Browser Use: fused find-careers + extract-job (last resort)
+  pipeline.py          # run_batch(): dedup, per-record isolation, persistence, summary
+  flow.py              # Prefect flow + schedule
+  main.py              # CLI entry
+tests/                 # pytest
+output/                # results.csv (gitignored)
+.env.example
+requirements.txt
+README.md
+```
+
+---
+
+## Data model
+
+`JobStatus` enum: `new | website_resolved | careers_found | position_found | failed | needs_review`.
+A record is **complete** when status is `position_found`.
+
+`jobs` table: `job_id` (PK, LinkedIn numeric id), `company_key`, `linkedin_url`, `position_url`, `status`, `listed_at`, `first_seen`.
+`companies` table: `company_key` (PK; normalized domain, else lowercased name), `name`, `website`, `career_url` (cached), `first_seen`.
+
+CSV columns, exactly: `company_name,career_page_url,open_position_url`. Empty cells allowed for incomplete rows; complete rows sorted first.
+
+---
+
+## Commands
+
+```bash
+# setup
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+playwright install chromium        # for the browser-agent tier
+cp .env.example .env               # fill keys as available
+
+# run a batch
+python -m jobsource.main --batch-size 20 --search "software engineer" --location "United States"
+
+# scheduled run (Prefect)
+python -m jobsource.flow            # serves the flow on an interval schedule
+# cron fallback (daily at 06:00): 0 6 * * * cd <repo> && ./.venv/bin/python -m jobsource.main --batch-size 50
+
+# tests
+pytest -q
+```
+
+`--search` is repeatable. Provide `--help` from `main.py`.
+
+---
+
+## Coding conventions
+
+- Full type hints; Pydantic models for all records crossing module boundaries.
+- Every external call (job provider, HTTP fetch, ATS API, LLM, agent) wrapped with a timeout, bounded retry, and try/except. **One failing company must never abort the batch** — catch, record `failed`/`needs_review`, continue.
+- Secrets only from env (pydantic-settings). Never hardcode keys; never commit `.env`.
+- Each cascade/extract function returns a typed result including which tier/method resolved it (for observability and metrics).
+- Keep functions small and independently testable. Pure functions where possible; side effects (DB, network) isolated.
+- Log at INFO per stage with the method used; log failures with context.
+- Prefer standard library and the listed stack; ask before introducing alternatives.
+- Model identifiers are configurable env values with placeholder defaults. Never hardcode specific model IDs or fetch model references; the operator sets real values in `.env`.
+
+---
+
+## Output contract & success criteria
+
+- `python -m jobsource.main --batch-size 20` completes without an unhandled exception and writes `output/results.csv`.
+- Every row has exactly the three contract columns.
+- Re-running immediately processes **0 new jobs** and adds **0 rows** (dedup proven).
+- A run summary prints per-stage counts and end-to-end coverage (% of new jobs reaching `position_found`).
+- Spot-checked `career_page_url` and `open_position_url` resolve (HTTP 200, not a 404/login wall).
+
+---
+
+## Gotchas (append confirmed findings here as you build — this section is durable memory across /clear)
+
+- Verify ATS JSON field names against live responses before trusting them: Greenhouse `jobs[].absolute_url`; Lever `[].hostedUrl`; Ashby `jobs[].jobUrl`; Workday varies by tenant. Fix in code AND note the confirmed shape here.
+- JobSpy populates the company's own site (`company_url_direct`) only sometimes; `resolve.py` must cover the gap. Record the observed fill rate here after the first live fetch.
+- Parse the numeric LinkedIn job id from `/jobs/view/{id}` in the job URL; strip tracking query params first.
+- Browser Use needs Chromium installed (`playwright install chromium`) and an LLM key; without them the tier must degrade gracefully.
+- LinkedIn rate-limits aggressively; keep batches small while testing.
+- Standard `pip` struggles with pydantic dependency resolution in this stack; always use `uv pip install` instead.
+- The system Python is protected by PEP 668 (`externally-managed-environment`). Always use explicit virtual environment paths (e.g., `.venv/bin/python`, `.venv/bin/pytest`) for all terminal commands instead of relying on global commands.
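
Note on the job-id gotcha above: a minimal sketch of how the dedup key might be derived from a job URL. It is not part of this commit. The function name `parse_job_id` is hypothetical, and the slug-prefixed URL shape (`/jobs/view/some-title-{id}`) is an assumption that should be checked against live provider output; only `/jobs/view/{id}` is stated in CLAUDE.md.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Matches /jobs/view/{id} and, as an assumption, /jobs/view/{slug}-{id}.
_JOB_PATH = re.compile(r"/jobs/view/(?:[^/]*-)?(\d+)/?$")


def parse_job_id(linkedin_url: str) -> Optional[str]:
    """Return the numeric job id as a string, or None if the URL has no id.

    urlsplit().path excludes the query string and fragment, so tracking
    parameters are dropped before the pattern is applied.
    """
    path = urlsplit(linkedin_url).path
    match = _JOB_PATH.search(path)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(parse_job_id("https://www.linkedin.com/jobs/view/3812345678/?trk=abc"))
    print(parse_job_id("https://www.linkedin.com/company/acme"))
```

The id is kept as a string here so that it round-trips unchanged into the `jobs.job_id` column; converting to `int` would work equally well if the schema prefers it.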