# CLAUDE.md

Operating instructions for Claude Code on this project. Read this fully before planning or editing. These are decisions, not suggestions: do not re-derive or override them without asking.

---

## Project goal

Build the **AI Job Source Agent**: a Python pipeline that, for recently posted LinkedIn jobs, produces records of the form:

```
company_name, career_page_url, open_position_url
```

It runs in configurable batches, on a schedule, and is **incremental**: re-runs process only NEW jobs. The output is a CSV at `output/results.csv` plus rows in a local database.

The four logical steps:
1. From LinkedIn job listings, get **company name** and **company website URL**.
2. From the company website, find the **careers/jobs page URL**.
3. From the careers page, get **one open position URL**.
4. Emit `company_name, career_page_url, open_position_url`.

---

## Architecture decisions (non-negotiable)

**This is a WORKFLOW, not a multi-agent system.** The orchestrator is plain code (Prefect), not an LLM. Most stages are deterministic. Genuine LLM agency appears in exactly one place: the last-resort fallback for steps 2–3.

1. **Stage 1 (ingestion) uses a managed data API, NEVER browser automation on LinkedIn.** LinkedIn is a hostile anti-bot target: browser agents get blocked or hit login walls, and automation carries ToS and ban risk. Default provider is **JobSpy** (free); **Apify** is a drop-in alternative behind the same interface. No hand-built LinkedIn crawler.
2. **Company website is a separate, deterministic resolution step.** LinkedIn often does not expose the company's own site. Resolve `company name → website` via the provider field if present, else a verified domain guess, else an optional search API. This is plumbing, not an agent.
3. **Steps 2 and 3 share ONE cascade, cheapest tier first.** Each tier returns early on success. A full browser agent is the LAST tier only.
4. **When the browser-agent tier fires, it does steps 2 AND 3 in a single session** (find careers page + return one job URL). One agent run, not two.
5. **Dedup keys:** jobs are keyed on the LinkedIn numeric `jobPostingId` (parsed from the job URL); companies are keyed on normalized domain. Resolved careers pages are cached per company so a company is never re-resolved.
6. **Everything swappable lives behind an interface** (provider pattern): job sources, the careers cascade tiers, the extractor. Swapping JobSpy↔Apify, or heuristics↔agent, must not require touching neighbors.
7. **No fine-tuning.** The task is solved with tool use + prompting + the cascade. Use a small/cheap model for link classification and a stronger model for the browser agent; both configurable.
8. **Graceful degradation is mandatory.** If the LLM key, Browser Use, or its browser is unavailable at runtime, the affected tier logs clearly and returns `None`, and the pipeline still completes (those records get status `needs_review`).
9. **Design for extension:** adding new ingestion sources (Indeed, Wellfound, ATS firehoses) and swapping SQLite→Postgres should drop in behind the existing interfaces without refactors. Cross-source dedup (later) will use a `(company_domain, normalized_title, location)` fingerprint.
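
The dedup keys in decision 5 can be sketched as small pure functions. This is a minimal sketch under assumptions, not the project's implementation: `parse_job_id`, `normalize_domain`, and `company_key` are hypothetical names, and accepting slugged URLs such as `/jobs/view/some-title-123` is an assumption beyond the `/jobs/view/{id}` shape this file documents. The fallback to a lowercased name mirrors the `company_key` definition in the data model.

```python
from __future__ import annotations

import re
from urllib.parse import urlsplit

# Assumption: the id is the trailing run of digits in the /jobs/view/ path
# segment, optionally preceded by a hyphenated title slug.
_JOB_ID_RE = re.compile(r"/jobs/view/(?:[^/?#]*-)?(\d+)/?$")


def parse_job_id(job_url: str) -> str | None:
    """Return the numeric job id from a LinkedIn job URL, or None."""
    # urlsplit().path excludes the query string, so tracking params are ignored.
    match = _JOB_ID_RE.search(urlsplit(job_url).path)
    return match.group(1) if match else None


def normalize_domain(website: str) -> str | None:
    """Return the lowercased host without scheme, port, path, or leading 'www.'."""
    candidate = website.strip()
    if not candidate:
        return None
    if "://" not in candidate:
        candidate = "//" + candidate  # make urlsplit treat a bare host as netloc
    host = (urlsplit(candidate).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host or None


def company_key(name: str, website: str | None) -> str:
    """Normalized domain when a website is known, else the lowercased name."""
    domain = normalize_domain(website) if website else None
    return domain or name.strip().lower()
```

Keeping these functions free of I/O makes the dedup behavior testable without touching the database or the network.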

---

## Pipeline stages (the cascade, in order)

**Stage 1: Ingest (deterministic).** Call the job source for recent postings (`hours_old` window) → list of `RawJob{job_id, company, website?, linkedin_url, listed_at}`. Dedup by `job_id`.

**Stage 1b: Resolve website (deterministic).** If `website` is empty, resolve from company name (verified `{slug}.com` guess, optional search API).

**Stage 2: Find careers page (cascade, return on first hit):**
1. **ATS detection**: detect Greenhouse / Lever / Ashby / Workday from the site and use their **public JSON APIs** (most reliable; also yields a job URL for Stage 3).
2. **URL patterns**: probe `/careers`, `/career`, `/jobs`, `/join-us`, `/join`, `careers.{domain}`, `jobs.{domain}`.
3. **Homepage link scan**: fetch homepage, rank anchors by career/job keywords in href/text.
4. **Sitemap**: parse `sitemap.xml` for career/job URLs.
5. **Cheap-LLM classification**: pass extracted anchors to a small model; pick the careers link (Pydantic AI, typed output).
6. **Browser-agent fallback**: Browser Use; fused with Stage 3 (see below).
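
The cascade contract (cheapest tier first, return on first hit, report which tier resolved it) can be sketched as below. It is an illustration under assumptions: `CareersResult`, `Tier`, and `candidate_urls` are hypothetical names, a stdlib dataclass stands in for the Pydantic model the project would use, and each tier is reduced to a callable that returns a URL or `None`.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

logger = logging.getLogger("jobsource.careers")


@dataclass(frozen=True)
class CareersResult:
    url: str
    method: str  # name of the tier that resolved the URL


# A tier is a (name, callable) pair; the callable returns a URL or None.
Tier = tuple[str, Callable[[str], Optional[str]]]


def candidate_urls(domain: str) -> list[str]:
    """Probe candidates for the URL-pattern tier, in the documented order."""
    paths = ["/careers", "/career", "/jobs", "/join-us", "/join"]
    urls = [f"https://{domain}{path}" for path in paths]
    urls += [f"https://careers.{domain}", f"https://jobs.{domain}"]
    return urls


def find_careers_page(website: str, tiers: Sequence[Tier]) -> Optional[CareersResult]:
    """Run tiers in order and return the first hit; later tiers never run."""
    for name, tier in tiers:
        try:
            url = tier(website)
        except Exception:
            # A failing tier degrades to a miss so the cascade can continue.
            logger.warning("tier %s failed for %s", name, website, exc_info=True)
            continue
        if url:
            logger.info("careers page for %s resolved by %s", website, name)
            return CareersResult(url=url, method=name)
    return None
```

Because the tiers are passed in as data, swapping heuristics for the agent (or dropping a tier when its dependency is missing) does not require touching the orchestration loop.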

**Stage 3: Extract one open position (return on first hit):**
1. **ATS JSON**: if ATS known from Stage 2, return the first posting URL directly.
2. **JobPosting JSON-LD**: parse `application/ld+json` for a `url`.
3. **Job-like anchors**: first link matching `/job`, `/position`, `/opening`, `/vacancy`.
4. **Cheap-LLM classification**: pick the single-job link from anchors.
5. **Browser-agent fallback**: handled inside the fused Stage-2 agent call.
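
The JSON-LD tier of Stage 3 could look roughly like this. It is a sketch under assumptions: `first_job_posting_url` is a hypothetical name, and the example uses the stdlib `html.parser` to stay dependency-free where the project itself would likely use BeautifulSoup + lxml. It walks nested structures (for example an `@graph` array) and returns the `url` of the first object typed `JobPosting`.

```python
from __future__ import annotations

import json
from html.parser import HTMLParser
from typing import Any, Iterator, Optional


class _LdJsonCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_ld = False
        self._buf: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self.blocks.append("".join(self._buf))
            self._in_ld = False


def _walk(node: Any) -> Iterator[dict]:
    """Yield every dict nested anywhere inside parsed JSON."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk(item)


def first_job_posting_url(html: str) -> Optional[str]:
    """Return the url of the first JobPosting object, or None."""
    parser = _LdJsonCollector()
    parser.feed(html)
    parser.close()
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: treat as a miss, try the next one
        for obj in _walk(data):
            types = obj.get("@type")
            types = types if isinstance(types, list) else [types]
            url = obj.get("url")
            if "JobPosting" in types and isinstance(url, str) and url:
                return url
    return None
```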

**Stage 4: Persist & export.** Write status to DB, export the 3-field CSV.

---

## Tech stack

- **Python 3.11+**
- **Orchestration/scheduling:** Prefect (`@flow`, retries, interval schedule). Cron documented as a no-daemon fallback.
- **HTTP:** httpx (shared client; timeouts + bounded retries).
- **HTML parsing:** BeautifulSoup + lxml.
- **Ingestion:** JobSpy (`python-jobspy`) default; Apify (`apify-client`) alternative.
- **Structured LLM extraction:** Pydantic AI (model-agnostic, typed).
- **Browser agent (fallback only):** Browser Use (`browser-use`) + Playwright/Chromium.
- **Config:** pydantic-settings (env-driven).
- **Data models:** Pydantic v2.
- **Storage:** SQLite via stdlib `sqlite3` (Postgres-ready behind the DB module).
- **Tests:** pytest.

Do not add other heavy dependencies without asking. (`uv` may be used instead of pip/venv if preferred.)

---

## Project structure

```
jobsource/
  __init__.py
  config.py            # pydantic-settings; env-driven; model IDs/keys read from env with placeholder defaults (never hardcode or look up model IDs)
  models.py            # Pydantic: RawJob, JobResult; JobStatus enum
  http.py              # shared httpx client factory: timeout, headers, retry
  db.py                # SQLite: companies, jobs; dedup, company cache, CSV export
  resolve.py           # company name -> website (deterministic)
  sources/
    __init__.py
    base.py            # JobSource interface: fetch_recent_jobs() -> list[RawJob]
    jobspy_source.py   # default provider
    apify_source.py    # alternative provider (same interface)
  careers/
    __init__.py
    cascade.py         # find_careers_page() orchestrates the tiers
    ats.py             # ATS detect + public JSON (Greenhouse/Lever/Ashby/Workday)
    heuristics.py      # URL patterns, homepage scan, sitemap
    classify_llm.py    # Pydantic AI link classifier (careers link / job link)
  extract.py           # extract_open_position(): ATS -> JSON-LD -> anchors -> LLM
  agent_fallback.py    # Browser Use: fused find-careers + extract-job (last resort)
  pipeline.py          # run_batch(): dedup, per-record isolation, persistence, summary
  flow.py              # Prefect flow + schedule
  main.py              # CLI entry
tests/                 # pytest
output/                # results.csv (gitignored)
.env.example
requirements.txt
README.md
```

---

## Data model

`JobStatus` enum: `new | website_resolved | careers_found | position_found | failed | needs_review`.
A record is **complete** when status is `position_found`.

`jobs` table: `job_id` (PK, LinkedIn numeric id), `company_key`, `linkedin_url`, `position_url`, `status`, `listed_at`, `first_seen`.
`companies` table: `company_key` (PK, normalized domain else lowercased name), `name`, `website`, `career_url` (cached), `first_seen`.

CSV columns, exactly: `company_name,career_page_url,open_position_url`. Empty cells allowed for incomplete rows; complete rows sorted first.
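
The data model above can be sketched with the stdlib alone. Treat this as an illustration under assumptions rather than the contents of `db.py`: the column types, the `CURRENT_TIMESTAMP` defaults, and the helper names `insert_job_if_new` and `export_csv` are hypothetical. It shows how a primary key on `job_id` plus `INSERT OR IGNORE` gives incremental dedup, and how the export can sort complete rows first.

```python
from __future__ import annotations

import csv
import io
import sqlite3
from enum import Enum


class JobStatus(str, Enum):
    NEW = "new"
    WEBSITE_RESOLVED = "website_resolved"
    CAREERS_FOUND = "careers_found"
    POSITION_FOUND = "position_found"
    FAILED = "failed"
    NEEDS_REVIEW = "needs_review"


# Assumption: TEXT columns and timestamp defaults; the real schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    company_key TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    website     TEXT,
    career_url  TEXT,
    first_seen  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS jobs (
    job_id       TEXT PRIMARY KEY,
    company_key  TEXT NOT NULL REFERENCES companies(company_key),
    linkedin_url TEXT NOT NULL,
    position_url TEXT,
    status       TEXT NOT NULL,
    listed_at    TEXT,
    first_seen   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def insert_job_if_new(conn: sqlite3.Connection, job_id: str,
                      company_key: str, linkedin_url: str) -> bool:
    """Insert a job; return False when job_id was already seen (dedup)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO jobs (job_id, company_key, linkedin_url, status) "
        "VALUES (?, ?, ?, ?)",
        (job_id, company_key, linkedin_url, JobStatus.NEW.value),
    )
    return cur.rowcount == 1


def export_csv(conn: sqlite3.Connection) -> str:
    """Render the three contract columns, complete rows first."""
    rows = conn.execute(
        """
        SELECT c.name, COALESCE(c.career_url, ''), COALESCE(j.position_url, '')
        FROM jobs AS j JOIN companies AS c USING (company_key)
        ORDER BY (j.status = ?) DESC, j.first_seen, j.job_id
        """,
        (JobStatus.POSITION_FOUND.value,),
    ).fetchall()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["company_name", "career_page_url", "open_position_url"])
    writer.writerows(rows)
    return out.getvalue()
```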

---

## Commands

```bash
# setup
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium          # for the browser-agent tier
cp .env.example .env                  # fill keys as available

# run a batch
python -m jobsource.main --batch-size 20 --search "software engineer" --location "United States"

# scheduled run (Prefect)
python -m jobsource.flow              # serves the flow on an interval schedule
# cron fallback (daily at 06:00): 0 6 * * * cd <repo> && ./.venv/bin/python -m jobsource.main --batch-size 50

# tests
pytest -q
```

`--search` is repeatable. `main.py` must provide `--help`.

---

## Coding conventions

- Full type hints; Pydantic models for all records crossing module boundaries.
- Every external call (job provider, HTTP fetch, ATS API, LLM, agent) wrapped with a timeout, bounded retry, and try/except. **One failing company must never abort the batch**: catch, record `failed`/`needs_review`, continue.
- Secrets only from env (pydantic-settings). Never hardcode keys; never commit `.env`.
- Each cascade/extract function returns a typed result including which tier/method resolved it (for observability and metrics).
- Keep functions small and independently testable. Pure functions where possible; side effects (DB, network) isolated.
- Log at INFO per stage with the method used; log failures with context.
- Prefer standard library and the listed stack; ask before introducing alternatives.
- Model identifiers are configurable env values with placeholder defaults. Never hardcode specific model IDs or fetch model references; the operator sets real values in `.env`.
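
The bounded-retry and per-record isolation conventions above can be sketched as follows. This is a hedged illustration, not the contents of `pipeline.py`: `with_retries` is a hypothetical helper, the exponential backoff schedule is an assumption, and this `run_batch` is reduced to a loop over dicts with an injected `process_one` callable so the isolation behavior is easy to see.

```python
from __future__ import annotations

import logging
import time
from collections import Counter
from typing import Callable, Iterable, TypeVar

logger = logging.getLogger("jobsource.pipeline")
T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn up to `attempts` times, backing off exponentially between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # retries exhausted: let the caller record the failure
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("attempts must be >= 1")


def run_batch(jobs: Iterable[dict], process_one: Callable[[dict], str]) -> Counter:
    """Process each job in isolation and return a count of final statuses."""
    summary: Counter = Counter()
    for job in jobs:
        try:
            status = process_one(job)
        except Exception:
            # One failing record is logged with context and never aborts the batch.
            logger.exception("job %s failed", job.get("job_id"))
            status = "failed"
        summary[status] += 1
    return summary
```

Injecting `sleep` keeps the retry helper testable without real delays, which fits the preference for small, independently testable functions.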

---

## Output contract & success criteria

- `python -m jobsource.main --batch-size 20` completes without an unhandled exception and writes `output/results.csv`.
- Every row has exactly the three contract columns.
- Re-running immediately processes **0 new jobs** and adds **0 rows** (dedup proven).
- A run summary prints per-stage counts and end-to-end coverage (% of new jobs reaching `position_found`).
- Spot-checked `career_page_url` and `open_position_url` resolve (HTTP 200, not a 404/login wall).

---

## Gotchas (append confirmed findings here as you build; this section is durable memory across /clear)

- **ATS JSON shapes confirmed live (2026-06-17).** Use these field names in code:
  - **Greenhouse**: `GET https://boards-api.greenhouse.io/v1/boards/{slug}/jobs?per_page=1` → `{"jobs":[{"absolute_url":"…"}],"meta":{"total":N}}`. Detect from `boards.greenhouse.io/{slug}`, `job-boards.greenhouse.io/{slug}`, or `embed/job_board?for={slug}` in page HTML. Individual job URLs may appear on either `boards.greenhouse.io` or `job-boards.greenhouse.io`; both are valid, so always use the `absolute_url` field verbatim.
  - **Lever**: `GET https://api.lever.co/v0/postings/{slug}?mode=json&limit=1` → **JSON array** `[{"hostedUrl":"…"}]` (empty array `[]` for unknown slug, not 404). Detect from `jobs.lever.co/{slug}`.
  - **Ashby**: `GET https://api.ashbyhq.com/posting-api/job-board/{slug}` → `{"jobs":[{"jobUrl":"…"}],"apiVersion":"…"}`. **GET only, not POST** (earlier entry was wrong). Slug case is preserved in the returned `jobUrl` but the API is case-insensitive: both `ramp` and `Ramp` return 200 with the same jobs. Detect from `jobs.ashbyhq.com/{slug}`.
  - **Workday**: `POST https://{host}/wday/cxs/{tenant}/{site}/jobs` body `{"appliedFacets":{},"limit":1,"offset":0,"searchText":""}` → `{"total":N,"jobPostings":[{"externalPath":"/job/…"}]}`. Build job URL as `https://{host}/en-US/{site}{externalPath}`. Detect from `{tenant}.wd{N}.myworkdayjobs.com/…/{site}` in HTML.
- **ATS embeds are typically on the /careers page, not the homepage** (verified 2026-06-17 against Vercel/Figma/Ramp/Anthropic). `detect_ats_in_html` on the homepage will miss most companies. The `_finalize` upgrade in `cascade.py` handles this by fetching the heuristic-found careers URL and re-running ATS detection on it — do not remove this step.
- **SPA-rendered careers pages (e.g. Anthropic/Next.js) cannot be detected by static HTML parsing**: the ATS embed is injected by JavaScript after the initial page load. Static-tier resolution falls through to `url_pattern` only; the browser-agent tier is needed for full ATS detection on these sites. Anthropic is confirmed to use Greenhouse slug `anthropic` (370+ jobs as of 2026-06-17).
- **Soft-404 and off-brand redirect filtering in `probe_url_patterns`** (added 2026-06-17): Netflix `/careers` redirects to `/NotFound?prev=…` with HTTP 200 (SPA catch-all); Microsoft `/careers` redirects to `bing.com` via aka.ms. Both are rejected by `_is_plausible_careers_url` in heuristics.py — the next probe candidates are tried instead.
- **Live smoke-test results (2026-06-17, 10 companies)**: 10/10 careers-URL hit rate; 4/10 ATS-resolved (Vercel→greenhouse, Linear→ashby, Figma→greenhouse, Ramp→ashby); 4/10 position_url populated. Google/Microsoft/Apple/Netflix/Amazon/Anthropic resolve via `url_pattern` only (custom or SPA-rendered ATS).
- **JobSpy `company_url_direct` fill rate: 0% observed** (5/5 jobs had `website=None` in a live fetch on 2026-06-17, search: "software engineer", United States, `linkedin_fetch_description=False`). `resolve.py` is essential for **every** job, not just a gap-filler. Do not assume any job arrives with a website pre-populated.
- **JobSpy `date_posted` / `listed_at` fill rate: ~40% observed** (2/5 jobs had a date; 3/5 were `None`). This is because `linkedin_fetch_description=False` (our default for speed) means LinkedIn's posted date is often absent. `listed_at` is best-effort metadata only; do not gate pipeline logic on it.
- **JobSpy confirmed column names** (verified 2026-06-17): `job_url` (full LinkedIn URL incl. tracking params), `company` (display name), `company_url_direct` (company own site — always `None` in practice so far), `date_posted` (sparse when `linkedin_fetch_description=False`), `title`, `location`, `id` (may be `None`; always parse job_id from `job_url` instead). `company_url` is the LinkedIn *company page* URL — never use it as the company website.
- **Tier 1b slug-guess ATS recovery (added 2026-06-17)**: when HTML detection misses (JS-embedded / SPA boards like Anthropic's Next.js Greenhouse embed), `recover_via_slug_guess()` in `ats.py` probes Greenhouse/Lever/Ashby public JSON APIs with guessed slugs. Slug candidates: domain stem first (e.g. `anthropic.com` → `anthropic`), then `_slug(company_name)` (strips legal suffixes/punctuation). False-positive guards: require `job_count > 0`; Greenhouse responses include `company_name` (used for a loose substring cross-check against input; clear mismatches rejected). Workday skipped (needs tenant+site). Confidence 0.90 / method `ats:{name}:slug_guess`. Wired in `cascade.py` as Tier 1b: fires after HTML-ATS misses, before URL-pattern probing.
- **Greenhouse `company_name` field**: live Greenhouse API responses include `jobs[0]["company_name"]` — used by slug-guess cross-check. Do not remove this field from the `ATSFetch` parsing in `_fetch_greenhouse`.
- Parse the LinkedIn numeric job id from the `/jobs/view/{id}` path of the job URL; strip tracking query params.
- Browser Use needs Chromium installed (`playwright install chromium`) and an LLM key; without them the tier must degrade gracefully.
- LinkedIn rate-limits aggressively; keep batches small while testing.
- Standard pip struggles with pydantic dependency resolution in this stack; always use `uv pip install` instead.
- The system Python is protected by PEP 668 (externally-managed-environment). Always use explicit virtual environment paths (e.g., `.venv/bin/python`, `.venv/bin/pytest`) for all terminal commands instead of relying on global commands.
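
The ATS JSON shapes recorded at the top of this list can be turned into one small pure function that picks the first posting URL from an already-parsed response. This is a sketch under assumptions: `first_job_url` and its `host`/`site` parameters are hypothetical and are not the `ATSFetch` parsing in `ats.py`; only the field names (`absolute_url`, `hostedUrl`, `jobUrl`, `externalPath`) and the Workday URL template come from the notes above.

```python
from __future__ import annotations

from typing import Any, Optional


def first_job_url(ats: str, payload: Any, *, host: str = "", site: str = "") -> Optional[str]:
    """Return the first posting URL from a parsed ATS response, or None if empty."""
    if ats not in {"greenhouse", "lever", "ashby", "workday"}:
        raise ValueError(f"unknown ATS: {ats!r}")

    if ats == "lever":
        # Lever returns a bare JSON array; an unknown slug yields [] rather than 404.
        postings = payload if isinstance(payload, list) else []
        return postings[0].get("hostedUrl") if postings else None

    if not isinstance(payload, dict):
        return None

    if ats == "greenhouse":
        jobs = payload.get("jobs") or []
        # Use absolute_url verbatim; either Greenhouse board host is valid.
        return jobs[0].get("absolute_url") if jobs else None

    if ats == "ashby":
        jobs = payload.get("jobs") or []
        return jobs[0].get("jobUrl") if jobs else None

    # Workday: the response carries only a path, so host and site are required.
    postings = payload.get("jobPostings") or []
    path = postings[0].get("externalPath") if postings else None
    if not path or not host or not site:
        return None
    return f"https://{host}/en-US/{site}{path}"
```

Keeping the response parsing separate from the HTTP call means these shapes can be regression-tested against recorded payloads without hitting the live APIs.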