CLAUDE.md

Operating instructions for Claude Code on this project. Read this fully before planning or editing. These are decisions, not suggestions — do not re-derive or override them without asking.

Project goal

Build the AI Job Source Agent: a Python pipeline that, for recently posted LinkedIn jobs, produces records of the form:

company_name, career_page_url, open_position_url

It runs in configurable batches, on a schedule, and is incremental — re-runs process only NEW jobs. The output is a CSV at output/results.csv plus rows in a local database.

The four logical steps:

From LinkedIn job listings, get company name and company website URL.
From the company website, find the careers/jobs page URL.
From the careers page, get one open position URL.
Emit company_name, career_page_url, open_position_url.

Architecture decisions (non-negotiable)

This is a WORKFLOW, not a multi-agent system. The orchestrator is plain code (Prefect), not an LLM. Most stages are deterministic. Genuine LLM agency appears in exactly one place: the last-resort fallback for steps 2–3.

Stage 1 (ingestion) uses a managed data API — NEVER browser automation on LinkedIn. LinkedIn is a hostile anti-bot target and browser agents get blocked / require login (ToS + ban risk). Default provider is JobSpy (free); Apify is a drop-in alternative behind the same interface. No hand-built LinkedIn crawler.
Company website is a separate, deterministic resolution step. LinkedIn often does not expose the company's own site. Resolve company name → website via the provider field if present, else a verified domain guess, else an optional search API. This is plumbing, not an agent.
Steps 2 and 3 share ONE cascade, cheapest tier first. Each tier returns early on success. A full browser agent is the LAST tier only.
When the browser-agent tier fires, it does steps 2 AND 3 in a single session (find careers page + return one job URL). One agent run, not two.
Dedup keys: jobs are keyed on the LinkedIn numeric jobPostingId (parsed from the job URL); companies are keyed on normalized domain. Resolved careers pages are cached per company so a company is never re-resolved.
Everything swappable lives behind an interface (provider pattern): job sources, the careers cascade tiers, the extractor. Swapping JobSpy↔Apify, or heuristics↔agent, must not require touching neighbors.
No fine-tuning. The task is solved with tool use + prompting + the cascade. Use a small/cheap model for link classification and a stronger model for the browser agent; both configurable.
Graceful degradation is mandatory. If the LLM key or Browser Use / its browser is unavailable at runtime, the affected tier logs clearly and returns None, and the pipeline still completes (those records get status needs_review).
Design for extension: adding new ingestion sources (Indeed, Wellfound, ATS firehoses) and swapping SQLite→Postgres should drop in behind the existing interfaces without refactors. Cross-source dedup (later) will use a (company_domain, normalized_title, location) fingerprint.
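
The provider pattern and the job_id dedup rule above can be sketched as follows. This is a minimal illustration, not the project's code: the dataclass stands in for the Pydantic RawJob model, the parameters of fetch_recent_jobs (search, hours_old, limit) are assumptions, and StaticSource and ingest are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class RawJob:
    # Stand-in for the Pydantic RawJob model; fields follow Stage 1.
    job_id: str
    company: str
    linkedin_url: str
    website: Optional[str] = None
    listed_at: Optional[str] = None


class JobSource(Protocol):
    """Interface every ingestion provider implements (parameters are assumed)."""

    def fetch_recent_jobs(self, search: str, hours_old: int, limit: int) -> list[RawJob]:
        ...


class StaticSource:
    """Hypothetical in-memory provider, useful for tests."""

    def __init__(self, jobs: list[RawJob]) -> None:
        self._jobs = list(jobs)

    def fetch_recent_jobs(self, search: str, hours_old: int, limit: int) -> list[RawJob]:
        return self._jobs[:limit]


def ingest(
    source: JobSource,
    seen_ids: set[str],
    *,
    search: str,
    hours_old: int = 24,
    limit: int = 20,
) -> list[RawJob]:
    """Return only jobs whose job_id has not been seen; updates seen_ids in place."""
    fresh: list[RawJob] = []
    for job in source.fetch_recent_jobs(search, hours_old, limit):
        if job.job_id in seen_ids:
            continue
        seen_ids.add(job.job_id)
        fresh.append(job)
    return fresh
```

Because ingest depends only on the JobSource protocol, a JobSpy-backed or Apify-backed class with the same method can be passed in without changing the caller.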

Pipeline stages (the cascade, in order)

Stage 1 — Ingest (deterministic): call the job source for recent postings (hours_old window) → list of RawJob{job_id, company, website?, linkedin_url, listed_at}. Dedup by job_id.

Stage 1b — Resolve website (deterministic): if website empty, resolve from company name (verified {slug}.com guess, optional search API).
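
A sketch of the Stage 1b resolution order, under stated assumptions: the suffix list and the function names are illustrative, the liveness check is injected as a callable (in the real module it would presumably be an HTTP request through the shared client), and the optional search API step is left out.

```python
import re
from typing import Callable, Optional

# Assumed list of legal suffixes to drop before guessing a domain.
_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh"}


def company_slug(name: str) -> str:
    """Lowercase, keep alphanumeric words, drop trailing legal suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    while len(words) > 1 and words[-1] in _SUFFIXES:
        words.pop()
    return "".join(words)


def resolve_website(
    name: str,
    provider_website: Optional[str],
    is_live: Callable[[str], bool],
) -> Optional[str]:
    """Provider field first, then a verified {slug}.com guess, else None."""
    if provider_website:
        return provider_website
    slug = company_slug(name)
    if not slug:
        return None
    guess = f"https://{slug}.com"
    # Only return the guess if the injected check confirms it responds.
    return guess if is_live(guess) else None
```

Injecting is_live keeps the function pure and testable without network access.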

Stage 2 — Find careers page (cascade, return on first hit):

ATS detection — detect Greenhouse / Lever / Ashby / Workday from the site and use their public JSON APIs (most reliable; also yields a job URL for Stage 3).
URL patterns — probe /careers, /career, /jobs, /join-us, /join, careers.{domain}, jobs.{domain}.
Homepage link scan — fetch homepage, rank anchors by career/job keywords in href/text.
Sitemap — parse sitemap.xml for career/job URLs.
Cheap-LLM classification — pass extracted anchors to a small model; pick the careers link (Pydantic AI, typed output).
Browser-agent fallback — Browser Use; fused with Stage 3 (see below).
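
The "return on first hit" behaviour of the tiers above could look roughly like this. It is a sketch only: CareersHit and the (name, callable) tier shape are assumptions, and each tier is modelled as a plain function from website to an optional URL.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CareersHit:
    url: str
    method: str  # which tier resolved it, for observability


# A tier is a (name, function) pair; the function returns a URL or None.
Tier = tuple[str, Callable[[str], Optional[str]]]


def find_careers_page(website: str, tiers: Sequence[Tier]) -> Optional[CareersHit]:
    """Run tiers cheapest first and stop at the first one that returns a URL."""
    for name, tier in tiers:
        try:
            url = tier(website)
        except Exception:
            # A failing tier degrades to "no result"; real code would log context.
            url = None
        if url:
            return CareersHit(url=url, method=name)
    return None
```

Ordering the list from ATS detection down to the browser agent means the expensive tiers are reached only when every cheaper one has returned None.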

Stage 3 — Extract one open position (return on first hit):

ATS JSON — if ATS known from Stage 2, return the first posting URL directly.
JobPosting JSON-LD — parse application/ld+json for a url.
Job-like anchors — first link matching /job, /position, /opening, /vacancy.
Cheap-LLM classification — pick the single-job link from anchors.
Browser-agent fallback — handled inside the fused Stage-2 agent call.
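
As an illustration of the JobPosting JSON-LD tier, here is a stdlib-only sketch. The project stack lists BeautifulSoup for HTML parsing; html.parser is used here only to keep the example self-contained. The function and class names are hypothetical, and pages where @type is a list are not handled.

```python
import json
from html.parser import HTMLParser
from typing import Any, Iterator, Optional


class _LdJsonCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_ld = False
        self._buf: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self.blocks.append("".join(self._buf))
            self._in_ld = False


def _iter_nodes(doc: Any) -> Iterator[dict]:
    """Yield dict nodes from a JSON-LD document, including @graph members."""
    if isinstance(doc, list):
        for item in doc:
            yield from _iter_nodes(item)
    elif isinstance(doc, dict):
        yield doc
        yield from _iter_nodes(doc.get("@graph", []))


def first_job_posting_url(html: str) -> Optional[str]:
    """Return the url of the first JobPosting node found, else None."""
    parser = _LdJsonCollector()
    parser.feed(html)
    for block in parser.blocks:
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed block: skip it, keep scanning
        for node in _iter_nodes(doc):
            if node.get("@type") == "JobPosting" and isinstance(node.get("url"), str):
                return node["url"]
    return None
```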

Stage 4 — Persist & export: write status to DB, export the 3-field CSV.

Tech stack

Python 3.11+
Orchestration/scheduling: Prefect (@flow, retries, interval schedule). Cron documented as a no-daemon fallback.
HTTP: httpx (shared client; timeouts + bounded retries).
HTML parsing: BeautifulSoup + lxml.
Ingestion: JobSpy (python-jobspy) default; Apify (apify-client) alternative.
Structured LLM extraction: Pydantic AI (model-agnostic, typed).
Browser agent (fallback only): Browser Use (browser-use) + Playwright/Chromium.
Config: pydantic-settings (env-driven).
Data models: Pydantic v2.
Storage: SQLite via stdlib sqlite3 (Postgres-ready behind the DB module).
Tests: pytest.

Do not add other heavy dependencies without asking. (uv may be used instead of pip/venv if preferred.)

Project structure

jobsource/
  __init__.py
  config.py            # pydantic-settings; env-driven; model IDs/keys read from env with placeholder defaults (never hardcode or look up model IDs)
  models.py            # Pydantic: RawJob, JobResult; JobStatus enum
  http.py              # shared httpx client factory: timeout, headers, retry
  db.py                # SQLite: companies, jobs; dedup, company cache, CSV export
  resolve.py           # company name -> website (deterministic)
  sources/
    __init__.py
    base.py            # JobSource interface: fetch_recent_jobs() -> list[RawJob]
    jobspy_source.py   # default provider
    apify_source.py    # alternative provider (same interface)
  careers/
    __init__.py
    cascade.py         # find_careers_page() orchestrates the tiers
    ats.py             # ATS detect + public JSON (Greenhouse/Lever/Ashby/Workday)
    heuristics.py      # URL patterns, homepage scan, sitemap
    classify_llm.py    # Pydantic AI link classifier (careers link / job link)
  extract.py           # extract_open_position(): ATS -> JSON-LD -> anchors -> LLM
  agent_fallback.py    # Browser Use: fused find-careers + extract-job (last resort)
  pipeline.py          # run_batch(): dedup, per-record isolation, persistence, summary
  flow.py              # Prefect flow + schedule
  main.py              # CLI entry
tests/                 # pytest
output/                # results.csv (gitignored)
.env.example
requirements.txt
README.md

Data model

jobs table: job_id (PK, LinkedIn numeric id), company_key, linkedin_url, position_url, status, listed_at, first_seen.
companies table: company_key (PK, normalized domain, else lowercased name), name, website, career_url (cached), first_seen.
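
One possible SQLite rendering of these two tables, with an insert helper that makes re-runs idempotent. The column types, the default timestamps, the initial status value 'new', and the helper names are assumptions; only the column names come from the description above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    company_key TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    website     TEXT,
    career_url  TEXT,
    first_seen  TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS jobs (
    job_id       TEXT PRIMARY KEY,
    company_key  TEXT NOT NULL REFERENCES companies(company_key),
    linkedin_url TEXT NOT NULL,
    position_url TEXT,
    status       TEXT NOT NULL,
    listed_at    TEXT,
    first_seen   TEXT NOT NULL DEFAULT (datetime('now'))
);
"""


def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def insert_job_if_new(
    conn: sqlite3.Connection, job_id: str, company_key: str, linkedin_url: str
) -> bool:
    """Insert a job keyed on job_id; return False if it was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO jobs (job_id, company_key, linkedin_url, status) "
        "VALUES (?, ?, ?, 'new')",  # 'new' is an assumed initial status
        (job_id, company_key, linkedin_url),
    )
    conn.commit()
    return cur.rowcount == 1
```

Relying on the primary key with INSERT OR IGNORE keeps the dedup rule in the database itself, so a second run over the same postings adds no rows.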

CSV columns, exactly: company_name,career_page_url,open_position_url. Empty cells allowed for incomplete rows; complete rows sorted first.
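
A small sketch of an export that honours this contract: exactly three columns, empty cells for missing values, complete rows first. Rows are modelled as plain dicts and export_csv is a hypothetical name.

```python
import csv
from typing import Iterable, Mapping, Optional, TextIO

COLUMNS = ["company_name", "career_page_url", "open_position_url"]


def export_csv(rows: Iterable[Mapping[str, Optional[str]]], out: TextIO) -> None:
    """Write the three contract columns, with complete rows sorted first."""

    def is_complete(row: Mapping[str, Optional[str]]) -> bool:
        return all(row.get(col) for col in COLUMNS)

    # sorted() is stable, so rows keep their relative order within each group.
    ordered = sorted(rows, key=lambda row: not is_complete(row))
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in ordered:
        # Extra keys (e.g. status) are dropped; None becomes an empty cell.
        writer.writerow({col: row.get(col) or "" for col in COLUMNS})
```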

Commands

# setup
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
playwright install chromium        # for the browser-agent tier
cp .env.example .env               # fill keys as available

# run a batch
python -m jobsource.main --batch-size 20 --search "software engineer" --location "United States"

# scheduled run (Prefect)
python -m jobsource.flow            # serves the flow on an interval schedule
# cron fallback (daily at 06:00): 0 6 * * * cd <repo> && ./.venv/bin/python -m jobsource.main --batch-size 50

# tests
pytest -q

--search is repeatable (pass it once per query). main.py must support --help.
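
A minimal argparse sketch of a CLI with these flags. The defaults, help strings, and the build_parser name are assumptions; only the flag names and the repeatable --search come from the commands above. argparse provides --help automatically.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser; --help is generated by argparse."""
    parser = argparse.ArgumentParser(
        prog="jobsource",
        description="Run one batch of the job source pipeline.",
    )
    parser.add_argument(
        "--batch-size", type=int, default=20,
        help="maximum number of new jobs to process in this run",
    )
    parser.add_argument(
        "--search", action="append", default=None,
        help="search term; repeat the flag to run several searches",
    )
    parser.add_argument(
        "--location", default="United States",
        help="location filter passed to the job source",
    )
    return parser
```

With action="append", each occurrence of --search adds one entry to a list, so `--search "software engineer" --search "data engineer"` yields two search terms.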

Coding conventions

Full type hints; Pydantic models for all records crossing module boundaries.
Every external call (job provider, HTTP fetch, ATS API, LLM, agent) wrapped with a timeout, bounded retry, and try/except. One failing company must never abort the batch — catch, record failed/needs_review, continue.
Secrets only from env (pydantic-settings). Never hardcode keys; never commit .env.
Each cascade/extract function returns a typed result including which tier/method resolved it (for observability and metrics).
Keep functions small and independently testable. Pure functions where possible; side effects (DB, network) isolated.
Log at INFO per stage with the method used; log failures with context.
Prefer standard library and the listed stack; ask before introducing alternatives.
Model identifiers are configurable env values with placeholder defaults. Never hardcode specific model IDs or fetch model references; the operator sets real values in .env.
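
The bounded-retry and per-record isolation conventions can be sketched as below. The helper names, the exponential backoff, and the default attempt count are assumptions; timeouts are left to the underlying client (for example the shared httpx client) and are not shown.

```python
import logging
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
log = logging.getLogger("jobsource")


def with_retry(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` up to `attempts` times, backing off exponentially between tries."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:
            if attempt == attempts:
                raise  # retries exhausted: let the caller record the failure
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


def run_batch(companies: Iterable[str], process: Callable[[str], str]) -> dict[str, str]:
    """Process each company in isolation; one failure never aborts the batch."""
    statuses: dict[str, str] = {}
    for company in companies:
        try:
            statuses[company] = process(company)
        except Exception:
            log.exception("processing failed for %s", company)
            statuses[company] = "failed"
    return statuses
```

Passing sleep as a parameter lets tests verify the backoff schedule without actually waiting.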

Output contract & success criteria

python -m jobsource.main --batch-size 20 completes without an unhandled exception and writes output/results.csv.
Every row has exactly the three contract columns.
Re-running immediately processes 0 new jobs and adds 0 rows (dedup proven).
A run summary prints per-stage counts and end-to-end coverage (% of new jobs reaching position_found).
Spot-checked career_page_url and open_position_url resolve (HTTP 200, not a 404/login wall).
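
The coverage figure in the run summary could be computed along these lines. This is a sketch: summarize is a hypothetical name, and the input is assumed to be the list of final statuses for the new jobs in the run.

```python
from collections import Counter


def summarize(statuses: list[str]) -> dict:
    """Count final statuses and compute the share that reached position_found."""
    counts = Counter(statuses)
    total = len(statuses)
    coverage = 100.0 * counts["position_found"] / total if total else 0.0
    return {
        "total": total,
        "counts": dict(counts),
        "coverage_pct": round(coverage, 1),
    }
```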

Gotchas (append confirmed findings here as you build — this section is durable memory across /clear)

Verify ATS JSON field names against live responses before trusting them: Greenhouse jobs[].absolute_url; Lever [].hostedUrl; Ashby jobs[].jobUrl; Workday varies by tenant. Fix in code AND note the confirmed shape here.
JobSpy company_url_direct fill rate: 0% observed (5/5 jobs had website=None in a live fetch on 2026-06-17, search: "software engineer", United States, linkedin_fetch_description=False). resolve.py is essential for every job, not just a gap-filler. Do not assume any job arrives with a website pre-populated.
JobSpy date_posted / listed_at fill rate: ~40% observed (2/5 jobs had a date; 3/5 were None). With linkedin_fetch_description=False (our default for speed), LinkedIn's posted date is often absent. listed_at is best-effort metadata only; do not gate pipeline logic on it.
JobSpy confirmed column names (verified 2026-06-17): job_url (full LinkedIn URL incl. tracking params), company (display name), company_url_direct (the company's own site; always None in practice so far), date_posted (sparse when linkedin_fetch_description=False), title, location, id (may be None; always parse job_id from job_url instead). company_url is the LinkedIn company page URL; never use it as the company website.
Parse the numeric LinkedIn job id from the /jobs/view/{id} path of job_url; strip tracking query params.
Browser Use needs Chromium installed (playwright install chromium) and an LLM key; without them the tier must degrade gracefully.
LinkedIn rate-limits aggressively; keep batches small while testing.
Standard pip struggles with pydantic dependency resolution in this stack; always use uv pip install instead.
The system Python is protected by PEP 668 (externally-managed-environment). Always use explicit virtual environment paths (e.g., .venv/bin/python, .venv/bin/pytest) for all terminal commands instead of relying on global commands.
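
For the job id gotcha above (parse from /jobs/view/{id}, strip tracking params), a possible helper is sketched here. The function names are hypothetical, and support for a slug-prefixed path such as /jobs/view/some-title-1234 is an assumption beyond what the notes above confirm.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Matches /jobs/view/<digits>, optionally preceded by a "slug-" prefix (assumed).
_JOB_PATH = re.compile(r"/jobs/view/(?:[^/]*-)?(\d+)/?$")


def parse_job_id(job_url: str) -> Optional[str]:
    """Extract the numeric job id from a LinkedIn job URL, ignoring the query string."""
    path = urlsplit(job_url).path  # query and fragment are dropped here
    match = _JOB_PATH.search(path)
    return match.group(1) if match else None


def canonical_job_url(job_url: str) -> Optional[str]:
    """Rebuild a tracking-free URL from the parsed id, or None if no id is found."""
    job_id = parse_job_id(job_url)
    return f"https://www.linkedin.com/jobs/view/{job_id}" if job_id else None
```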
