JobSourceAgent/CLAUDE.md
2026-06-17 13:59:00 -04:00

# CLAUDE.md
Operating instructions for Claude Code on this project. Read this fully before planning or editing. These are decisions, not suggestions — do not re-derive or override them without asking.
---
## Project goal
Build the **AI Job Source Agent**: a Python pipeline that, for recently posted LinkedIn jobs, produces records of the form:
```
company_name, career_page_url, open_position_url
```
It runs in configurable batches, on a schedule, and is **incremental** — re-runs process only NEW jobs. The output is a CSV at `output/results.csv` plus rows in a local database.
The four logical steps:
1. From LinkedIn job listings, get **company name** and **company website URL**.
2. From the company website, find the **careers/jobs page URL**.
3. From the careers page, get **one open position URL**.
4. Emit `company_name, career_page_url, open_position_url`.
---
## Architecture decisions (non-negotiable)
**This is a WORKFLOW, not a multi-agent system.** The orchestrator is plain code (Prefect), not an LLM. Most stages are deterministic. Genuine LLM agency appears in exactly one place: the last-resort fallback for steps 2 and 3.
1. **Stage 1 (ingestion) uses a managed data API — NEVER browser automation on LinkedIn.** LinkedIn is a hostile anti-bot target and browser agents get blocked / require login (ToS + ban risk). Default provider is **JobSpy** (free); **Apify** is a drop-in alternative behind the same interface. No hand-built LinkedIn crawler.
2. **Company website is a separate, deterministic resolution step.** LinkedIn often does not expose the company's own site. Resolve `company name → website` via the provider field if present, else a verified domain guess, else an optional search API. This is plumbing, not an agent.
3. **Steps 2 and 3 share ONE cascade, cheapest tier first.** Each tier returns early on success. A full browser agent is the LAST tier only.
4. **When the browser-agent tier fires, it does steps 2 AND 3 in a single session** (find careers page + return one job URL). One agent run, not two.
5. **Dedup keys:** jobs are keyed on the LinkedIn numeric `jobPostingId` (parsed from the job URL); companies are keyed on normalized domain. Resolved careers pages are cached per company so a company is never re-resolved.
6. **Everything swappable lives behind an interface** (provider pattern): job sources, the careers cascade tiers, the extractor. Swapping JobSpy↔Apify, or heuristics↔agent, must not require touching neighbors.
7. **No fine-tuning.** The task is solved with tool use + prompting + the cascade. Use a small/cheap model for link classification and a stronger model for the browser agent; both configurable.
8. **Graceful degradation is mandatory.** If the LLM key or Browser Use / its browser is unavailable at runtime, the affected tier logs clearly and returns `None`, and the pipeline still completes (those records get status `needs_review`).
9. **Design for extension:** adding new ingestion sources (Indeed, Wellfound, ATS firehoses) and swapping SQLite→Postgres should drop in behind the existing interfaces without refactors. Cross-source dedup (later) will use a `(company_domain, normalized_title, location)` fingerprint.
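
Decision 6 (the provider pattern) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes a `typing.Protocol` interface, uses a dataclass as a stand-in for the Pydantic `RawJob`, and the `hours_old`/`limit` parameters, `StaticSource`, and `ingest` are hypothetical names introduced here. The point is that the caller only sees `JobSource`, so swapping JobSpy for Apify would not touch it.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class RawJob:
    # Stand-in for the Pydantic RawJob model; fields follow the Stage 1 description.
    job_id: str
    company: str
    linkedin_url: str
    website: Optional[str] = None
    listed_at: Optional[str] = None


class JobSource(Protocol):
    # Assumed signature; the document only fixes the name and the return type.
    def fetch_recent_jobs(self, hours_old: int, limit: int) -> list[RawJob]: ...


class StaticSource:
    """Hypothetical in-memory provider, used only to demonstrate the swap."""

    def __init__(self, jobs: list[RawJob]) -> None:
        self._jobs = list(jobs)

    def fetch_recent_jobs(self, hours_old: int = 24, limit: int = 20) -> list[RawJob]:
        return self._jobs[:limit]


def ingest(source: JobSource, seen_ids: set[str], limit: int = 20) -> list[RawJob]:
    """Return only jobs whose job_id is new, recording them in seen_ids."""
    fresh: list[RawJob] = []
    for job in source.fetch_recent_jobs(hours_old=24, limit=limit):
        if job.job_id not in seen_ids:
            seen_ids.add(job.job_id)
            fresh.append(job)
    return fresh
```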
---
## Pipeline stages (the cascade, in order)
**Stage 1 — Ingest (deterministic):** call the job source for recent postings (`hours_old` window) → list of `RawJob{job_id, company, website?, linkedin_url, listed_at}`. Dedup by `job_id`.
**Stage 1b — Resolve website (deterministic):** if `website` empty, resolve from company name (verified `{slug}.com` guess, optional search API).
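
A verified `{slug}.com` guess might look like the sketch below. It assumes a simple slug rule (lowercase, drop punctuation and trailing legal suffixes) and takes the liveness check as an injected callable so the logic stays pure and testable offline; `company_slug`, `guess_website`, `is_live`, and the suffix list are illustrative names, not part of the project.

```python
import re
from typing import Callable, Optional

# Assumption: a small, illustrative set of legal suffixes to drop before slugging.
_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co", "gmbh"}


def company_slug(name: str) -> str:
    """Lowercase the name, keep alphanumeric words, strip trailing legal suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    while len(words) > 1 and words[-1] in _SUFFIXES:
        words.pop()
    return "".join(words)


def guess_website(name: str, is_live: Callable[[str], bool]) -> Optional[str]:
    """Return https://{slug}.com only if the injected check confirms it."""
    slug = company_slug(name)
    if not slug:
        return None
    url = f"https://{slug}.com"
    return url if is_live(url) else None
```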
**Stage 2 — Find careers page (cascade, return on first hit):**
1. **ATS detection** — detect Greenhouse / Lever / Ashby / Workday from the site and use their **public JSON APIs** (most reliable; also yields a job URL for Stage 3).
2. **URL patterns** — probe `/careers`, `/career`, `/jobs`, `/join-us`, `/join`, `careers.{domain}`, `jobs.{domain}`.
3. **Homepage link scan** — fetch homepage, rank anchors by career/job keywords in href/text.
4. **Sitemap** — parse `sitemap.xml` for career/job URLs.
5. **Cheap-LLM classification** — pass extracted anchors to a small model; pick the careers link (Pydantic AI, typed output).
6. **Browser-agent fallback** — Browser Use; fused with Stage 3 (see below).
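
The "return on first hit" behaviour of the Stage 2 cascade can be sketched as an ordered list of tier functions. This is an assumption-laden outline: each tier is modelled as a callable from website to an optional URL, `CareersResult` and the tier names are hypothetical, and a raising tier is treated as a miss so that later tiers still run.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CareersResult:
    url: str
    method: str  # name of the tier that resolved the URL, for observability


# A tier is (name, function); the function returns a URL or None.
Tier = tuple[str, Callable[[str], Optional[str]]]


def find_careers_page(website: str, tiers: Sequence[Tier]) -> Optional[CareersResult]:
    """Try tiers in the given (cheapest-first) order and stop at the first hit."""
    for name, tier in tiers:
        try:
            url = tier(website)
        except Exception:
            # A failing tier counts as a miss; fall through to the next one.
            continue
        if url:
            return CareersResult(url=url, method=name)
    return None
```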
**Stage 3 — Extract one open position (return on first hit):**
1. **ATS JSON** — if ATS known from Stage 2, return the first posting URL directly.
2. **JobPosting JSON-LD** — parse `application/ld+json` for a `url`.
3. **Job-like anchors** — first link matching `/job`, `/position`, `/opening`, `/vacancy`.
4. **Cheap-LLM classification** — pick the single-job link from anchors.
5. **Browser-agent fallback** — handled inside the fused Stage-2 agent call.
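
The JobPosting JSON-LD tier of Stage 3 might work along these lines. To keep the example self-contained it uses the standard library's `html.parser` rather than the project's BeautifulSoup stack, and `first_jobposting_url` is a hypothetical helper name. It walks nested structures such as `@graph` and returns the first `url` on an object typed `JobPosting`.

```python
import json
from html.parser import HTMLParser
from typing import Any, Iterator, Optional


class _LdJsonCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_ld = False
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld = False


def _walk(node: Any) -> Iterator[dict]:
    """Yield every dict nested anywhere inside a parsed JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for value in node:
            yield from _walk(value)


def first_jobposting_url(html: str) -> Optional[str]:
    parser = _LdJsonCollector()
    parser.feed(html)
    parser.close()
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip it rather than fail the tier
        for obj in _walk(data):
            types = obj.get("@type")
            types = types if isinstance(types, list) else [types]
            if "JobPosting" in types and isinstance(obj.get("url"), str):
                return obj["url"]
    return None
```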
**Stage 4 — Persist & export:** write status to DB, export the 3-field CSV.
---
## Tech stack
- **Python 3.11+**
- **Orchestration/scheduling:** Prefect (`@flow`, retries, interval schedule). Cron documented as a no-daemon fallback.
- **HTTP:** httpx (shared client; timeouts + bounded retries).
- **HTML parsing:** BeautifulSoup + lxml.
- **Ingestion:** JobSpy (`python-jobspy`) default; Apify (`apify-client`) alternative.
- **Structured LLM extraction:** Pydantic AI (model-agnostic, typed).
- **Browser agent (fallback only):** Browser Use (`browser-use`) + Playwright/Chromium.
- **Config:** pydantic-settings (env-driven).
- **Data models:** Pydantic v2.
- **Storage:** SQLite via stdlib `sqlite3` (Postgres-ready behind the DB module).
- **Tests:** pytest.
Do not add other heavy dependencies without asking. (`uv` may be used instead of pip/venv if preferred.)
---
## Project structure
```
jobsource/
  __init__.py
  config.py            # pydantic-settings; env-driven; model IDs/keys read from env with placeholder defaults (never hardcode or look up model IDs)
  models.py            # Pydantic: RawJob, JobResult; JobStatus enum
  http.py              # shared httpx client factory: timeout, headers, retry
  db.py                # SQLite: companies, jobs; dedup, company cache, CSV export
  resolve.py           # company name -> website (deterministic)
  sources/
    __init__.py
    base.py            # JobSource interface: fetch_recent_jobs() -> list[RawJob]
    jobspy_source.py   # default provider
    apify_source.py    # alternative provider (same interface)
  careers/
    __init__.py
    cascade.py         # find_careers_page() orchestrates the tiers
    ats.py             # ATS detect + public JSON (Greenhouse/Lever/Ashby/Workday)
    heuristics.py      # URL patterns, homepage scan, sitemap
    classify_llm.py    # Pydantic AI link classifier (careers link / job link)
  extract.py           # extract_open_position(): ATS -> JSON-LD -> anchors -> LLM
  agent_fallback.py    # Browser Use: fused find-careers + extract-job (last resort)
  pipeline.py          # run_batch(): dedup, per-record isolation, persistence, summary
  flow.py              # Prefect flow + schedule
  main.py              # CLI entry
tests/                 # pytest
output/                # results.csv (gitignored)
.env.example
requirements.txt
README.md
```
---
## Data model
`JobStatus` enum: `new | website_resolved | careers_found | position_found | failed | needs_review`.
A record is **complete** when status is `position_found`.
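
As a rough sketch, the enum could be a string-valued `Enum` so that values round-trip cleanly through SQLite and CSV. The member names and the `is_complete` helper are assumptions; only the six string values come from this document.

```python
from enum import Enum


class JobStatus(str, Enum):
    NEW = "new"
    WEBSITE_RESOLVED = "website_resolved"
    CAREERS_FOUND = "careers_found"
    POSITION_FOUND = "position_found"
    FAILED = "failed"
    NEEDS_REVIEW = "needs_review"


def is_complete(status: JobStatus) -> bool:
    """A record is complete only once an open position URL has been found."""
    return status is JobStatus.POSITION_FOUND
```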
`jobs` table: `job_id` (PK, LinkedIn numeric id), `company_key`, `linkedin_url`, `position_url`, `status`, `listed_at`, `first_seen`.
`companies` table: `company_key` (PK, normalized domain else lowercased name), `name`, `website`, `career_url` (cached), `first_seen`.
CSV columns, exactly: `company_name,career_page_url,open_position_url`. Empty cells allowed for incomplete rows; complete rows sorted first.
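
One way the two tables and the CSV export could fit together is sketched below. Column types, the default timestamps, and the helper names `insert_job` and `export_csv` are assumptions made for illustration; the table and column names follow the description above. `INSERT OR IGNORE` on the primary key gives job-level dedup, and the `ORDER BY` puts complete rows first.

```python
import csv
import sqlite3
from typing import TextIO

SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    company_key TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    website     TEXT,
    career_url  TEXT,
    first_seen  TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS jobs (
    job_id       TEXT PRIMARY KEY,
    company_key  TEXT NOT NULL REFERENCES companies(company_key),
    linkedin_url TEXT NOT NULL,
    position_url TEXT,
    status       TEXT NOT NULL DEFAULT 'new',
    listed_at    TEXT,
    first_seen   TEXT NOT NULL DEFAULT (datetime('now'))
);
"""

EXPORT_SQL = """
SELECT c.name, COALESCE(c.career_url, ''), COALESCE(j.position_url, '')
FROM jobs j JOIN companies c USING (company_key)
ORDER BY (j.status = 'position_found') DESC, j.first_seen, j.job_id
"""


def insert_job(conn: sqlite3.Connection, job_id: str, company_key: str, linkedin_url: str) -> bool:
    """Insert a job if unseen; return True only when a new row was added."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO jobs (job_id, company_key, linkedin_url) VALUES (?, ?, ?)",
        (job_id, company_key, linkedin_url),
    )
    return cur.rowcount == 1


def export_csv(conn: sqlite3.Connection, out: TextIO) -> int:
    """Write the three contract columns, complete rows first; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["company_name", "career_page_url", "open_position_url"])
    rows = conn.execute(EXPORT_SQL).fetchall()
    writer.writerows(rows)
    return len(rows)
```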
---
## Commands
```bash
# setup
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium # for the browser-agent tier
cp .env.example .env # fill keys as available
# run a batch
python -m jobsource.main --batch-size 20 --search "software engineer" --location "United States"
# scheduled run (Prefect)
python -m jobsource.flow # serves the flow on an interval schedule
# cron fallback: 0 6 * * * cd <repo> && ./.venv/bin/python -m jobsource.main --batch-size 50
# tests
pytest -q
```
`--search` is repeatable. `main.py` must provide `--help`.
---
## Coding conventions
- Full type hints; Pydantic models for all records crossing module boundaries.
- Every external call (job provider, HTTP fetch, ATS API, LLM, agent) wrapped with a timeout, bounded retry, and try/except. **One failing company must never abort the batch** — catch, record `failed`/`needs_review`, continue.
- Secrets only from env (pydantic-settings). Never hardcode keys; never commit `.env`.
- Each cascade/extract function returns a typed result including which tier/method resolved it (for observability and metrics).
- Keep functions small and independently testable. Pure functions where possible; side effects (DB, network) isolated.
- Log at INFO per stage with the method used; log failures with context.
- Prefer standard library and the listed stack; ask before introducing alternatives.
- Model identifiers are configurable env values with placeholder defaults. Never hardcode specific model IDs or fetch model references; the operator sets real values in `.env`.
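
The per-record isolation rule above could be implemented roughly like this. It is a sketch under assumptions: `run_batch_isolated` and `process_one` are hypothetical names, and `process_one` is assumed to return a final status string for one job. Any exception is logged with context, recorded as `failed`, and the loop continues.

```python
import logging
from collections import Counter
from typing import Callable, Iterable

log = logging.getLogger("jobsource.pipeline")


def run_batch_isolated(job_ids: Iterable[str], process_one: Callable[[str], str]) -> Counter:
    """Process each job independently; one failure never aborts the batch."""
    summary: Counter = Counter()
    for job_id in job_ids:
        try:
            status = process_one(job_id)
        except Exception:
            # Log with context and keep going; the record is marked failed.
            log.exception("job %s failed", job_id)
            status = "failed"
        summary[status] += 1
    return summary
```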
---
## Output contract & success criteria
- `python -m jobsource.main --batch-size 20` completes without an unhandled exception and writes `output/results.csv`.
- Every row has exactly the three contract columns.
- Re-running immediately processes **0 new jobs** and adds **0 rows** (dedup proven).
- A run summary prints per-stage counts and end-to-end coverage (% of new jobs reaching `position_found`).
- Spot-checked `career_page_url` and `open_position_url` resolve (HTTP 200, not a 404/login wall).
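
End-to-end coverage, as used in the criteria above, is the share of new jobs that reached `position_found`. A minimal way to compute and print it, assuming the run keeps a mapping from status to count (`coverage_pct` and `format_summary` are illustrative names):

```python
def coverage_pct(status_counts: dict[str, int]) -> float:
    """Percentage of processed jobs whose final status is position_found."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0  # an immediate re-run processes no new jobs
    return 100.0 * status_counts.get("position_found", 0) / total


def format_summary(new_jobs: int, status_counts: dict[str, int]) -> str:
    lines = [f"new jobs: {new_jobs}"]
    lines += [f"  {status}: {count}" for status, count in sorted(status_counts.items())]
    lines.append(f"coverage: {coverage_pct(status_counts):.1f}%")
    return "\n".join(lines)
```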
---
## Gotchas (append confirmed findings here as you build — this section is durable memory across /clear)
- Verify ATS JSON field names against live responses before trusting them: Greenhouse `jobs[].absolute_url`; Lever `[].hostedUrl`; Ashby `jobs[].jobUrl`; Workday varies by tenant. Fix in code AND note the confirmed shape here.
- **JobSpy `company_url_direct` fill rate: 0% observed** (5/5 jobs had `website=None` in a live fetch on 2026-06-17, search: "software engineer", United States, `linkedin_fetch_description=False`). `resolve.py` is essential for **every** job, not just a gap-filler. Do not assume any job arrives with a website pre-populated.
- **JobSpy `date_posted` / `listed_at` fill rate: ~40% observed** (2/5 jobs had a date; 3/5 were `None`). This is because `linkedin_fetch_description=False` (our default for speed) means LinkedIn's posted date is often absent. `listed_at` is best-effort metadata only; do not gate pipeline logic on it.
- **JobSpy confirmed column names** (verified 2026-06-17): `job_url` (full LinkedIn URL incl. tracking params), `company` (display name), `company_url_direct` (the company's own site; always `None` in practice so far), `date_posted` (sparse when `linkedin_fetch_description=False`), `title`, `location`, `id` (may be `None`; always parse job_id from `job_url` instead). `company_url` is the LinkedIn *company page* URL; never use it as the company website.
- Parse the numeric LinkedIn job id from the `/jobs/view/{id}` path of `job_url`; strip tracking query params first.
- Browser Use needs Chromium installed (`playwright install chromium`) and an LLM key; without them the tier must degrade gracefully.
- LinkedIn rate-limits aggressively; keep batches small while testing.
- Standard pip struggles with pydantic dependency resolution in this stack; always use `uv pip install` instead.
- The system Python is protected by PEP 668 (externally-managed-environment). Always use explicit virtual environment paths (e.g., `.venv/bin/python`, `.venv/bin/pytest`) for all terminal commands instead of relying on global commands.
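
The two dedup keys (LinkedIn numeric job id, normalized company domain) might be derived as in the sketch below. The helper names are hypothetical. Handling of the slugged URL form (`/jobs/view/some-title-at-company-{id}`) is an assumption added here and should be checked against live `job_url` values before being trusted.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Matches /jobs/view/{id} and, as an assumption, /jobs/view/{slug}-{id}.
_JOB_PATH = re.compile(r"/jobs/view/(?:[^/]*-)?(\d+)/?$")


def parse_job_id(job_url: str) -> Optional[str]:
    """Extract the numeric job id from the URL path, ignoring query and fragment."""
    path = urlsplit(job_url).path
    match = _JOB_PATH.search(path)
    return match.group(1) if match else None


def normalize_domain(url_or_host: str) -> str:
    """Lowercase the host and drop scheme, port, path, and a leading www."""
    raw = url_or_host.strip().lower()
    host = urlsplit(raw if "//" in raw else "//" + raw).hostname or ""
    return host[4:] if host.startswith("www.") else host
```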