CLAUDE.md
Operating instructions for Claude Code on this project. Read this fully before planning or editing. These are decisions, not suggestions — do not re-derive or override them without asking.
Project goal
Build the AI Job Source Agent: a Python pipeline that, for recently posted LinkedIn jobs, produces records of the form:
company_name, career_page_url, open_position_url
It runs in configurable batches, on a schedule, and is incremental — re-runs process only NEW jobs. The output is a CSV at output/results.csv plus rows in a local database.
The four logical steps:
- From LinkedIn job listings, get company name and company website URL.
- From the company website, find the careers/jobs page URL.
- From the careers page, get one open position URL.
- Emit company_name, career_page_url, open_position_url.
Architecture decisions (non-negotiable)
This is a WORKFLOW, not a multi-agent system. The orchestrator is plain code (Prefect), not an LLM. Most stages are deterministic. Genuine LLM agency appears in exactly one place: the last-resort fallback for steps 2–3.
- Stage 1 (ingestion) uses a managed data API — NEVER browser automation on LinkedIn. LinkedIn is a hostile anti-bot target and browser agents get blocked / require login (ToS + ban risk). Default provider is JobSpy (free); Apify is a drop-in alternative behind the same interface. No hand-built LinkedIn crawler.
- Company website is a separate, deterministic resolution step. LinkedIn often does not expose the company's own site. Resolve company name → website via the provider field if present, else a verified domain guess, else an optional search API. This is plumbing, not an agent.
- Steps 2 and 3 share ONE cascade, cheapest tier first. Each tier returns early on success. A full browser agent is the LAST tier only.
- When the browser-agent tier fires, it does steps 2 AND 3 in a single session (find careers page + return one job URL). One agent run, not two.
- Dedup keys: jobs are keyed on the LinkedIn numeric jobPostingId (parsed from the job URL); companies are keyed on normalized domain. Resolved careers pages are cached per company so a company is never re-resolved.
- Everything swappable lives behind an interface (provider pattern): job sources, the careers cascade tiers, the extractor. Swapping JobSpy↔Apify, or heuristics↔agent, must not require touching neighbors.
- No fine-tuning. The task is solved with tool use + prompting + the cascade. Use a small/cheap model for link classification and a stronger model for the browser agent; both configurable.
- Graceful degradation is mandatory. If the LLM key or Browser Use / its browser is unavailable at runtime, the affected tier logs clearly and returns None, and the pipeline still completes (those records get status needs_review).
- Design for extension: adding new ingestion sources (Indeed, Wellfound, ATS firehoses) and swapping SQLite→Postgres should drop in behind the existing interfaces without refactors. Cross-source dedup (later) will use a (company_domain, normalized_title, location) fingerprint.
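The dedup keys described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names `parse_job_id` and `normalize_domain` are hypothetical, and tolerating a title slug before the numeric id (`/jobs/view/some-title-123`) is an assumption beyond the `/jobs/view/{id}` form this file confirms.

```python
from __future__ import annotations

import re
from urllib.parse import urlsplit

# Matches /jobs/view/<id> and, as an assumption, /jobs/view/<title-slug>-<id>.
_JOB_ID_RE = re.compile(r"/jobs/view/(?:[^/]*-)?(\d+)/?$")


def parse_job_id(linkedin_url: str) -> str | None:
    """Return the numeric job id from a LinkedIn job URL, or None."""
    # urlsplit separates the query string, so tracking params never reach the regex.
    path = urlsplit(linkedin_url).path
    match = _JOB_ID_RE.search(path)
    return match.group(1) if match else None


def normalize_domain(website: str) -> str | None:
    """Return a lowercased host without a leading 'www.', or None if empty."""
    candidate = website.strip().lower()
    if not candidate:
        return None
    if "//" not in candidate:
        # Bare domains need a network-location marker for urlsplit to find the host.
        candidate = "//" + candidate
    host = urlsplit(candidate).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host
```

Keeping both helpers pure makes them easy to unit-test and reuse from both the ingestion and persistence modules.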
Pipeline stages (the cascade, in order)
Stage 1 — Ingest (deterministic): call the job source for recent postings (hours_old window) → list of RawJob{job_id, company, website?, linkedin_url, listed_at}. Dedup by job_id.
Stage 1b — Resolve website (deterministic): if website empty, resolve from company name (verified {slug}.com guess, optional search API).
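A rough sketch of the verified `{slug}.com` guess in Stage 1b. The names `slugify_company` and `guess_website`, the suffix list, and the injected `verify` callable are all assumptions made for illustration; in the real pipeline the verifier would presumably be an HTTP check through the shared httpx client.

```python
from __future__ import annotations

import re
from typing import Callable

# Hypothetical, non-exhaustive list of legal suffixes to drop from company names.
_LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co", "gmbh", "plc"}


def slugify_company(name: str) -> str:
    """Lowercase the name, drop punctuation and trailing legal suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    while words and words[-1] in _LEGAL_SUFFIXES:
        words.pop()
    return "".join(words)


def guess_website(name: str, verify: Callable[[str], bool]) -> str | None:
    """Return https://{slug}.com only if the injected verifier accepts it."""
    slug = slugify_company(name)
    if not slug:
        return None
    candidate = f"https://{slug}.com"
    # The guess is never trusted on its own; an unverified guess is a miss.
    return candidate if verify(candidate) else None
```

Injecting the verifier keeps the guessing logic deterministic and testable without network access.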
Stage 2 — Find careers page (cascade, return on first hit):
- ATS detection: detect Greenhouse / Lever / Ashby / Workday from the site and use their public JSON APIs (most reliable; also yields a job URL for Stage 3).
- URL patterns: probe /careers, /career, /jobs, /join-us, /join, careers.{domain}, jobs.{domain}.
- Homepage link scan: fetch homepage, rank anchors by career/job keywords in href/text.
- Sitemap: parse sitemap.xml for career/job URLs.
- Cheap-LLM classification: pass extracted anchors to a small model; pick the careers link (Pydantic AI, typed output).
- Browser-agent fallback: Browser Use; fused with Stage 3 (see below).
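The "return on first hit" behavior of the Stage 2 cascade could look roughly like the sketch below. `CareersResult`, the `Tier` alias, and the tier names used in the usage note are illustrative assumptions; the real tiers live behind the project's own interfaces.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger("jobsource.careers")


@dataclass(frozen=True)
class CareersResult:
    url: str
    method: str  # which tier resolved the URL, kept for observability


# A tier is a (name, callable) pair; the callable returns a URL or None.
Tier = Tuple[str, Callable[[str], Optional[str]]]


def find_careers_page(website: str, tiers: Sequence[Tier]) -> Optional[CareersResult]:
    """Try tiers in order (cheapest first) and stop at the first hit."""
    for name, tier in tiers:
        try:
            url = tier(website)
        except Exception:
            # A failing tier degrades to a miss instead of aborting the cascade.
            logger.exception("tier %s failed for %s", name, website)
            url = None
        if url:
            logger.info("careers page for %s resolved via %s", website, name)
            return CareersResult(url=url, method=name)
    return None
```

Calling `find_careers_page(site, [("ats", detect_ats), ("url_pattern", probe_patterns)])` (both callables hypothetical) never reaches a later tier once an earlier one returns a URL, which is what keeps the browser agent a last resort.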
Stage 3 — Extract one open position (return on first hit):
- ATS JSON: if ATS known from Stage 2, return the first posting URL directly.
- JobPosting JSON-LD: parse application/ld+json for a url.
- Job-like anchors: first link matching /job, /position, /opening, /vacancy.
- Cheap-LLM classification: pick the single-job link from anchors.
- Browser-agent fallback: handled inside the fused Stage-2 agent call.
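One way the JobPosting JSON-LD tier might work is sketched here. To stay self-contained it uses the standard library's `html.parser` rather than the BeautifulSoup + lxml stack this project specifies, and the function name `first_job_posting_url` is hypothetical.

```python
from __future__ import annotations

import json
from html.parser import HTMLParser
from typing import Any, Iterator, Optional


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _iter_nodes(node: Any) -> Iterator[dict]:
    """Yield every dict in a decoded JSON document, including nested ones."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _iter_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from _iter_nodes(item)


def first_job_posting_url(html: str) -> Optional[str]:
    """Return the url of the first JobPosting node found, or None."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed blocks are skipped, not fatal
        for node in _iter_nodes(data):
            node_type = node.get("@type")
            types = node_type if isinstance(node_type, list) else [node_type]
            url = node.get("url")
            if "JobPosting" in types and isinstance(url, str) and url:
                return url
    return None
```

Walking nested nodes covers pages that wrap postings in an `@graph` array as well as pages that emit a single top-level object.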
Stage 4 — Persist & export: write status to DB, export the 3-field CSV.
Tech stack
- Python 3.11+
- Orchestration/scheduling: Prefect (@flow, retries, interval schedule). Cron documented as a no-daemon fallback.
- HTTP: httpx (shared client; timeouts + bounded retries).
- HTML parsing: BeautifulSoup + lxml.
- Ingestion: JobSpy (python-jobspy) default; Apify (apify-client) alternative.
- Structured LLM extraction: Pydantic AI (model-agnostic, typed).
- Browser agent (fallback only): Browser Use (browser-use) + Playwright/Chromium.
- Config: pydantic-settings (env-driven).
- Data models: Pydantic v2.
- Storage: SQLite via stdlib sqlite3 (Postgres-ready behind the DB module).
- Tests: pytest.
Do not add other heavy dependencies without asking. (uv may be used instead of pip/venv if preferred.)
Project structure
jobsource/
  __init__.py
  config.py  # pydantic-settings; env-driven; model IDs/keys read from env with placeholder defaults (never hardcode or look up model IDs)
  models.py  # Pydantic: RawJob, JobResult; JobStatus enum
  http.py  # shared httpx client factory: timeout, headers, retry
  db.py  # SQLite: companies, jobs; dedup, company cache, CSV export
  resolve.py  # company name -> website (deterministic)
  sources/
    __init__.py
    base.py  # JobSource interface: fetch_recent_jobs() -> list[RawJob]
    jobspy_source.py  # default provider
    apify_source.py  # alternative provider (same interface)
  careers/
    __init__.py
    cascade.py  # find_careers_page() orchestrates the tiers
    ats.py  # ATS detect + public JSON (Greenhouse/Lever/Ashby/Workday)
    heuristics.py  # URL patterns, homepage scan, sitemap
    classify_llm.py  # Pydantic AI link classifier (careers link / job link)
    extract.py  # extract_open_position(): ATS -> JSON-LD -> anchors -> LLM
    agent_fallback.py  # Browser Use: fused find-careers + extract-job (last resort)
  pipeline.py  # run_batch(): dedup, per-record isolation, persistence, summary
  flow.py  # Prefect flow + schedule
  main.py  # CLI entry
tests/  # pytest
output/  # results.csv (gitignored)
.env.example
requirements.txt
README.md
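The provider pattern behind `sources/base.py` might be shaped like the sketch below. It is an illustration only: a dataclass stands in for the Pydantic `RawJob` model so the example runs on the standard library, the `hours_old` and `limit` parameters are assumed (this file only fixes the name `fetch_recent_jobs()` and its return type), and `StaticJobSource` and `ingest` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Protocol


@dataclass(frozen=True)
class RawJob:
    # Stand-in for the Pydantic model; fields follow the Stage 1 description.
    job_id: str
    company: str
    linkedin_url: str
    website: Optional[str] = None
    listed_at: Optional[datetime] = None


class JobSource(Protocol):
    """Anything with this method can be plugged in as an ingestion provider."""

    def fetch_recent_jobs(self, hours_old: int, limit: int) -> List[RawJob]:
        ...


class StaticJobSource:
    """Hypothetical in-memory provider, handy for tests."""

    def __init__(self, jobs: Iterable[RawJob]) -> None:
        self._jobs = list(jobs)

    def fetch_recent_jobs(self, hours_old: int, limit: int) -> List[RawJob]:
        return self._jobs[:limit]


def ingest(source: JobSource, hours_old: int = 24, limit: int = 20) -> List[RawJob]:
    """Fetch from any provider and drop duplicate job ids, keeping first seen."""
    seen = set()
    unique = []
    for job in source.fetch_recent_jobs(hours_old, limit):
        if job.job_id not in seen:
            seen.add(job.job_id)
            unique.append(job)
    return unique
```

Because `ingest` depends only on the protocol, swapping JobSpy for Apify would mean passing a different object, with no change to the caller.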
Data model
JobStatus enum: new | website_resolved | careers_found | position_found | failed | needs_review.
A record is complete when status is position_found.
jobs table: job_id (PK, LinkedIn numeric id), company_key, linkedin_url, position_url, status, listed_at, first_seen.
companies table: company_key (PK, normalized domain else lowercased name), name, website, career_url (cached), first_seen.
CSV columns, exactly: company_name,career_page_url,open_position_url. Empty cells allowed for incomplete rows; complete rows sorted first.
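A small sketch of the status enum and the export rule (three columns, empty cells allowed, complete rows first). The `JobResult` fields and the `export_csv` name are assumptions, a dataclass stands in for the Pydantic model, and the function returns text instead of writing output/results.csv so it can be tested in isolation.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class JobStatus(str, Enum):
    NEW = "new"
    WEBSITE_RESOLVED = "website_resolved"
    CAREERS_FOUND = "careers_found"
    POSITION_FOUND = "position_found"
    FAILED = "failed"
    NEEDS_REVIEW = "needs_review"


@dataclass(frozen=True)
class JobResult:
    # Hypothetical field set; the real model is defined in models.py.
    company_name: str
    career_page_url: Optional[str]
    open_position_url: Optional[str]
    status: JobStatus


CSV_COLUMNS = ["company_name", "career_page_url", "open_position_url"]


def export_csv(results: Iterable[JobResult]) -> str:
    """Render the three contract columns, complete rows sorted first."""
    # sorted() is stable, so rows keep their input order within each group.
    ordered = sorted(results, key=lambda r: r.status is not JobStatus.POSITION_FOUND)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(CSV_COLUMNS)
    for row in ordered:
        writer.writerow([row.company_name, row.career_page_url or "", row.open_position_url or ""])
    return buffer.getvalue()
```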
Commands
# setup
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium # for the browser-agent tier
cp .env.example .env # fill keys as available
# run a batch
python -m jobsource.main --batch-size 20 --search "software engineer" --location "United States"
# scheduled run (Prefect)
python -m jobsource.flow # serves the flow on an interval schedule
# cron fallback: 0 6 * * * cd <repo> && ./.venv/bin/python -m jobsource.main --batch-size 50
# tests
pytest -q
--search is repeatable. Provide --help from main.py.
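A repeatable `--search` flag can be expressed with argparse's `append` action, as in this sketch. The default values and help strings are placeholders chosen for illustration, not settled project defaults.

```python
from __future__ import annotations

import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="jobsource.main",
        description="Run one batch of the job source pipeline.",
    )
    parser.add_argument("--batch-size", type=int, default=20,
                        help="Maximum number of new jobs to process.")
    # default=None (not a list) avoids argparse appending onto a shared default.
    parser.add_argument("--search", action="append", default=None,
                        help="Search term; repeat the flag for several terms.")
    parser.add_argument("--location", default="United States",
                        help="Location filter passed to the job source.")
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if not args.search:
        args.search = ["software engineer"]  # placeholder default, an assumption
    return args
```

argparse supplies `--help` automatically, so the flag needs no extra code.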
Coding conventions
- Full type hints; Pydantic models for all records crossing module boundaries.
- Every external call (job provider, HTTP fetch, ATS API, LLM, agent) is wrapped with a timeout, bounded retry, and try/except. One failing company must never abort the batch: catch, record failed/needs_review, continue.
- Secrets only from env (pydantic-settings). Never hardcode keys; never commit .env.
- Each cascade/extract function returns a typed result including which tier/method resolved it (for observability and metrics).
- Keep functions small and independently testable. Pure functions where possible; side effects (DB, network) isolated.
- Log at INFO per stage with the method used; log failures with context.
- Prefer standard library and the listed stack; ask before introducing alternatives.
- Model identifiers are configurable env values with placeholder defaults; never hardcode specific model IDs or fetch model references. The operator sets real values in .env.
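The per-record isolation and bounded-retry conventions above might combine as follows. This is a simplified sketch: `with_retries` and this `run_batch` signature are hypothetical, statuses are plain strings rather than the JobStatus enum, there is no backoff, and timeouts are assumed to be enforced by the underlying client (for example the shared httpx client) rather than here.

```python
from __future__ import annotations

import logging
from typing import Callable, Dict, Iterable, Optional, TypeVar

logger = logging.getLogger("jobsource.pipeline")
T = TypeVar("T")


def with_retries(call: Callable[[], T], attempts: int = 3) -> T:
    """Run call() up to `attempts` times and re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:
            last_error = exc
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
    assert last_error is not None
    raise last_error


def run_batch(companies: Iterable[str],
              resolve: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Resolve each company independently; one failure never stops the loop."""
    statuses: Dict[str, str] = {}
    for company in companies:
        try:
            url = with_retries(lambda: resolve(company))
        except Exception:
            logger.exception("resolution failed for %s", company)
            statuses[company] = "failed"
            continue
        statuses[company] = "position_found" if url else "needs_review"
    return statuses
```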
Output contract & success criteria
- python -m jobsource.main --batch-size 20 completes without an unhandled exception and writes output/results.csv.
- Every row has exactly the three contract columns.
- Re-running immediately processes 0 new jobs and adds 0 rows (dedup proven).
- A run summary prints per-stage counts and end-to-end coverage (% of new jobs reaching position_found).
- Spot-checked career_page_url and open_position_url resolve (HTTP 200, not a 404/login wall).
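The "re-run adds 0 rows" criterion can be checked against SQLite with a primary key plus `INSERT OR IGNORE`, as sketched below. The column list mirrors the jobs table described in this file, but the column types, defaults, and the `insert_new_jobs` helper are assumptions.

```python
from __future__ import annotations

import sqlite3
from typing import Iterable, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id       TEXT PRIMARY KEY,
    company_key  TEXT NOT NULL,
    linkedin_url TEXT NOT NULL,
    position_url TEXT,
    status       TEXT NOT NULL DEFAULT 'new',
    listed_at    TEXT,
    first_seen   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def insert_new_jobs(conn: sqlite3.Connection,
                    jobs: Iterable[Tuple[str, str, str]]) -> int:
    """Insert (job_id, company_key, linkedin_url) rows; return how many were new."""
    conn.execute(SCHEMA)
    inserted = 0
    for job_id, company_key, linkedin_url in jobs:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO jobs (job_id, company_key, linkedin_url) "
            "VALUES (?, ?, ?)",
            (job_id, company_key, linkedin_url),
        )
        # rowcount is 1 for a fresh insert and 0 when the primary key already exists.
        inserted += cursor.rowcount
    conn.commit()
    return inserted
```

Running the same batch twice against one connection should report the full count the first time and 0 the second, which is the dedup proof the success criteria ask for.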
Gotchas (append confirmed findings here as you build — this section is durable memory across /clear)
- ATS JSON shapes confirmed live (2026-06-17); use these field names in code:
  - Greenhouse: GET https://boards-api.greenhouse.io/v1/boards/{slug}/jobs?per_page=1 → {"jobs":[{"absolute_url":"…"}],"meta":{"total":N}}. Detect from boards.greenhouse.io/{slug}, job-boards.greenhouse.io/{slug}, or embed/job_board?for={slug} in page HTML. Individual job URLs may appear on either boards.greenhouse.io or job-boards.greenhouse.io; both are valid, so always use the absolute_url field verbatim.
  - Lever: GET https://api.lever.co/v0/postings/{slug}?mode=json&limit=1 → JSON array [{"hostedUrl":"…"}] (empty array [] for unknown slug, not 404). Detect from jobs.lever.co/{slug}.
  - Ashby: GET https://api.ashbyhq.com/posting-api/job-board/{slug} → {"jobs":[{"jobUrl":"…"}],"apiVersion":"…"}. GET only, not POST (earlier entry was wrong). Slug case is preserved in the returned jobUrl but the API is case-insensitive: both ramp and Ramp return 200 with the same jobs. Detect from jobs.ashbyhq.com/{slug}.
  - Workday: POST https://{host}/wday/cxs/{tenant}/{site}/jobs body {"appliedFacets":{},"limit":1,"offset":0,"searchText":""} → {"total":N,"jobPostings":[{"externalPath":"/job/…"}]}. Build job URL as https://{host}/en-US/{site}{externalPath}. Detect from {tenant}.wd{N}.myworkdayjobs.com/…/{site} in HTML.
- ATS embeds are typically on the /careers page, not the homepage (verified 2026-06-17 against Vercel/Figma/Ramp/Anthropic). detect_ats_in_html on the homepage will miss most companies. The _finalize upgrade in cascade.py handles this by fetching the heuristic-found careers URL and re-running ATS detection on it. Do not remove this step.
- The ATS behind SPA-rendered careers pages (e.g. Anthropic/Next.js) cannot be detected by static HTML parsing, because the ATS embed is injected by JavaScript after the initial page load. Static-tier resolution falls through to url_pattern only; the browser-agent tier is needed for full ATS detection on these sites. Anthropic is confirmed to use Greenhouse slug anthropic (370+ jobs as of 2026-06-17).
- Soft-404 and off-brand redirect filtering in probe_url_patterns (added 2026-06-17): Netflix /careers redirects to /NotFound?prev=… with HTTP 200 (SPA catch-all); Microsoft /careers redirects to bing.com via aka.ms. Both are rejected by _is_plausible_careers_url in heuristics.py, and the next probe candidates are tried instead.
- Live smoke-test results (2026-06-17, 10 companies): 10/10 careers-URL hit rate; 4/10 ATS-resolved (Vercel→greenhouse, Linear→ashby, Figma→greenhouse, Ramp→ashby); 4/10 position_url populated. Google/Microsoft/Apple/Netflix/Amazon/Anthropic resolve via url_pattern only (custom or SPA-rendered ATS).
- JobSpy company_url_direct fill rate: 0% observed (5/5 jobs had website=None in a live fetch on 2026-06-17, search: "software engineer", United States, linkedin_fetch_description=False). resolve.py is essential for every job, not just a gap-filler. Do not assume any job arrives with a website pre-populated.
- JobSpy date_posted/listed_at fill rate: ~40% observed (2/5 jobs had a date; 3/5 were None). With linkedin_fetch_description=False (our default for speed), LinkedIn's posted date is often absent. listed_at is best-effort metadata only; do not gate pipeline logic on it.
- JobSpy confirmed column names (verified 2026-06-17): job_url (full LinkedIn URL incl. tracking params), company (display name), company_url_direct (company's own site; always None in practice so far), date_posted (sparse when linkedin_fetch_description=False), title, location, id (may be None; always parse job_id from job_url instead). company_url is the LinkedIn company page URL; never use it as the company website.
- Tier 1b slug-guess ATS recovery (added 2026-06-17): when HTML detection misses (JS-embedded / SPA boards like Anthropic's Next.js Greenhouse embed), recover_via_slug_guess() in ats.py probes Greenhouse/Lever/Ashby public JSON APIs with guessed slugs. Slug candidates: domain stem first (e.g. anthropic.com → anthropic), then _slug(company_name) (strips legal suffixes/punctuation). False-positive guards: require job_count > 0; Greenhouse responses include company_name (used for a loose substring cross-check against input; clear mismatches rejected). Workday skipped (needs tenant+site). Confidence 0.90 / method ats:{name}:slug_guess. Wired in cascade.py as Tier 1b: fires after HTML-ATS misses, before URL-pattern probing.
- Greenhouse company_name field: live Greenhouse API responses include jobs[0]["company_name"], which the slug-guess cross-check uses. Do not remove this field from the ATSFetch parsing in _fetch_greenhouse.
- Parse the LinkedIn numeric job id from /jobs/view/{id}; strip tracking query params.
- Browser Use needs Chromium installed (playwright install chromium) and an LLM key; without them the tier must degrade gracefully.
- LinkedIn rate-limits aggressively; keep batches small while testing.
- Standard pip struggles with pydantic dependency resolution in this stack; always use uv pip install instead.
- The system Python is protected by PEP 668 (externally-managed-environment). Always use explicit virtual environment paths (e.g., .venv/bin/python, .venv/bin/pytest) for all terminal commands instead of relying on global commands.
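As a worked illustration of the ATS JSON shapes recorded in the gotchas above, the sketch below pulls the first job URL out of an already-decoded response. The function name `first_job_url` and its `host`/`site` parameters are hypothetical; only the field names (`absolute_url`, `hostedUrl`, `jobUrl`, `externalPath`) and the Workday URL template come from this file.

```python
from __future__ import annotations

from typing import Any, Optional


def first_job_url(ats: str, payload: Any, host: str = "", site: str = "") -> Optional[str]:
    """Return the first posting URL from a decoded ATS response, or None."""
    try:
        if ats == "greenhouse":
            # Use absolute_url verbatim; it may point at either Greenhouse host.
            return payload["jobs"][0]["absolute_url"]
        if ats == "lever":
            # Lever returns a bare array, and [] for an unknown slug.
            return payload[0]["hostedUrl"]
        if ats == "ashby":
            return payload["jobs"][0]["jobUrl"]
        if ats == "workday":
            path = payload["jobPostings"][0]["externalPath"]
            return f"https://{host}/en-US/{site}{path}"
    except (KeyError, IndexError, TypeError):
        # Empty boards and unexpected shapes count as a miss, not an error.
        return None
    return None
```

Treating an empty or malformed payload as `None` lets the caller fall through to the next extraction tier without special-casing each ATS.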