
LLM-Powered Job Matching: Building a Lead Scoring Pipeline

Keyword filtering is fast, cheap, and wrong in exactly the ways that matter. When a recruiter defines an ideal candidate profile — specifying domain expertise, technology depth, communication style, and a dozen contextual preferences — no bag-of-words approach can evaluate whether a given job listing fits that profile faithfully. The semantic gap between “5+ years Python” and “writes Python for production data engineering workflows with heavy async I/O usage” is invisible to a keyword filter. It is, however, precisely what a language model can evaluate given the right prompt.

We built a matching pipeline that takes a fixed candidate profile and scores a queue of job listings against it using an LLM, running matches in parallel, tracking token costs in real time, and exposing pause/resume controls from a native SwiftUI macOS application. This post covers how the system is structured and where LLM matching genuinely earns its cost over simpler approaches.


Prompt Architecture

The core of the system is a single scoring prompt, called once per job listing. Each call receives two inputs: the candidate profile and the listing text. The profile is fixed for the entire session — it includes skills, experience level, domain preferences, non-negotiable constraints, and any contextual notes the recruiter wants the model to weigh. The listing provides the job title, company, description, and requirements as-scraped.

The model returns a structured JSON object with two fields: a numeric score from 0 to 100, and a brief rationale explaining the key factors in the match or mismatch. Enforcing structured output matters here. An unstructured response requires fragile parsing and introduces variance in how the score is presented. Using a tool-calling or function-calling interface to enforce a JSON schema eliminates that class of problem entirely and makes downstream aggregation trivial.
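The structured contract can be pinned down with an explicit JSON schema and a small parsing step. The schema and `parse_score` helper below are illustrative rather than the pipeline's exact definitions; the field names simply mirror the two fields described above:

```python
import json

# Illustrative tool/function-calling schema enforcing the two fields
# described above: a 0-100 score and a brief rationale.
SCORE_SCHEMA = {
    "type": "object",
    "properties": {
        "score": {"type": "integer", "minimum": 0, "maximum": 100},
        "rationale": {"type": "string"},
    },
    "required": ["score", "rationale"],
}

def parse_score(raw: str) -> tuple[int, str]:
    """Parse the model's JSON payload and sanity-check the score range."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return score, data["rationale"]
```

Even with schema enforcement on the API side, a defensive range check at parse time is cheap insurance against a malformed response slipping through.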

One design choice worth making explicit: the profile should be verbose. Short profiles produce noisy scores because the model has insufficient context to distinguish a 70 from an 85. A profile that clearly articulates why certain constraints matter — not just what they are — gives the model enough signal to score with useful discrimination. Treat the profile like a detailed rubric rather than a filter string.

Concurrent Execution with ThreadPoolExecutor

With a queue of 50 or 100 listings, sequential API calls are too slow to be useful interactively. Python’s concurrent.futures.ThreadPoolExecutor handles this cleanly. Worker count is set based on the API provider’s rate limits — typically between 5 and 10 concurrent requests achieves meaningful throughput without triggering quota errors.

The pattern uses as_completed(), which yields futures in the order they finish rather than the order they were submitted. This means the fastest matches surface first and the UI can update progressively rather than waiting for an entire batch:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=8) as executor:
    # Map each future back to its listing so results can be attributed.
    futures = {
        executor.submit(score_listing, listing, profile): listing
        for listing in listings
    }
    # as_completed() yields futures as they finish, so the fastest
    # matches reach the UI first.
    for future in as_completed(futures):
        listing = futures[future]
        score, rationale = future.result()
        update_results(listing, score, rationale)

Error handling deserves explicit attention here. Individual API calls can fail — rate limit errors, transient network issues, content policy rejections. Each worker catches exceptions and returns a sentinel value rather than letting the exception propagate and cancel the entire executor. A listing that fails to score gets queued for retry rather than silently dropped. Transparent failure is far easier to reason about than a score list with unexplained gaps.
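As a sketch of that pattern, the worker below takes the API call as an injected `call_model` function (a hypothetical stand-in for the real wrapper) and returns a sentinel on failure, recording the listing for a later retry pass:

```python
SENTINEL = (None, None)  # placeholder result for a listing that failed to score

def score_listing(listing, profile, call_model, retry_queue):
    """Score one listing via call_model; on failure, queue the listing
    for retry instead of propagating the exception, which would cancel
    the rest of the executor's work."""
    try:
        score, rationale = call_model(profile, listing)
        return score, rationale
    except Exception:
        retry_queue.append(listing)
        return SENTINEL
```

The sentinel keeps the results list the same length as the queue, so a failed listing is visible in the UI as "unscored" rather than silently absent.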

Live Token Tracking

Token costs at scale are easy to underestimate. A 400-token prompt repeated across 100 listings, each returning a 150-token response, works out to roughly 40,000 input and 15,000 output tokens per session: negligible on a cheap model tier, noticeable on a premium one. The pipeline tracks running totals using a thread-safe counter — a simple integer protected by a threading.Lock, incremented after each API response using the token counts from the response metadata.
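A minimal version of that counter might look like the following; the class and method names are illustrative:

```python
import threading

class TokenCounter:
    """Running input/output token totals, safe to update from worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        # Called once per API response with counts from response metadata.
        with self._lock:
            self.input_tokens += input_tokens
            self.output_tokens += output_tokens

    def snapshot(self) -> tuple[int, int]:
        # Read both fields under the lock so the UI never sees a torn pair.
        with self._lock:
            return self.input_tokens, self.output_tokens
```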

The SwiftUI frontend polls this counter on a timer, displaying cumulative input tokens, output tokens, and an estimated cost based on the current model’s per-token pricing. This gives the recruiter a live sense of session cost and helps identify when the profile is producing unusually long rationales — often a signal that the prompt itself needs tightening or that the listing text is noisier than expected.

One practical refinement: estimate token counts before sending, not just after. Most LLM SDKs expose a tokenisation method or a separate counting endpoint. Pre-counting lets the pipeline warn when a listing’s text is unusually long and would consume disproportionate context, or when the combined prompt approaches the model’s context limit. Catching this before the API call avoids wasted spend on a request that will be truncated.
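A rough sketch of that pre-check is below. It substitutes a crude characters-per-token heuristic for a real SDK tokeniser, and the limits are assumed values, not the pipeline's actual configuration:

```python
def approx_tokens(text: str) -> int:
    # Crude stand-in for an SDK tokeniser: ~4 characters per token is a
    # common rule of thumb for English text.
    return max(1, len(text) // 4)

CONTEXT_LIMIT = 8_000   # assumed model context window
LISTING_WARN = 3_000    # flag listings that would dominate the prompt

def check_budget(profile: str, listing: str) -> list[str]:
    """Return warnings worth surfacing before spending tokens on the call."""
    warnings = []
    listing_tokens = approx_tokens(listing)
    total = approx_tokens(profile) + listing_tokens
    if listing_tokens > LISTING_WARN:
        warnings.append("listing unusually long")
    if total > CONTEXT_LIMIT:
        warnings.append("prompt exceeds context limit")
    return warnings
```

In practice the heuristic would be replaced by the provider's own counting method, but the shape of the check stays the same.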

Pause and Resume

When a recruiter is reviewing results mid-session, they may want to pause new scoring requests without losing work already in progress. The pause/resume mechanism uses a threading.Event as a shared signal. Before each API call, the worker thread checks the event state. If paused, it blocks on event.wait(). When the event is set again, execution continues.

This is cleaner than killing and restarting threads because in-flight API calls are allowed to complete naturally. Pausing means “stop starting new calls” rather than “abort everything now.” The distinction matters when you want the current batch to finish before the user reviews results — partial batches are harder to reason about than complete ones.
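The mechanism fits in a few lines; `score_with_pause` and `call_model` are hypothetical names for illustration:

```python
import threading

resume = threading.Event()
resume.set()  # start in the running state

def score_with_pause(listing, profile, call_model):
    # Block here while paused. In-flight calls are never interrupted,
    # because the check happens before each new request, not during one.
    resume.wait()
    return call_model(profile, listing)

def pause():
    resume.clear()   # workers block before their next API call

def unpause():
    resume.set()     # blocked workers proceed
```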

State persistence across pause/resume is worth handling explicitly. The pipeline maintains a checkpoint file — a simple JSON list of listing IDs already scored. On resume, the pipeline skips any listing already in the checkpoint. This also means that if the application crashes mid-session, no already-scored listings need to be re-scored on restart. Idempotency at the session level costs almost nothing to implement and saves significant annoyance.
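A minimal checkpoint implementation, assuming string listing IDs and rewriting the file on every update:

```python
import json
from pathlib import Path

def load_checkpoint(path: Path) -> set[str]:
    """IDs of listings already scored; empty on first run."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()

def record_scored(path: Path, scored: set[str], listing_id: str) -> None:
    # Rewrite the whole file each time. The list is small enough that
    # simplicity beats an append-only log here.
    scored.add(listing_id)
    path.write_text(json.dumps(sorted(scored)))
```

On resume or restart, any listing whose ID is in the loaded set is skipped before it ever reaches the executor.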

SwiftUI macOS Frontend

We chose SwiftUI for the frontend because the target machine was macOS and we wanted a native experience — responsive, low overhead, proper macOS window management. An Electron shell would have worked but added unnecessary complexity for a single-user productivity tool. The Python backend runs as a local process and exposes a minimal HTTP API using Flask; the SwiftUI app communicates with it over localhost.

SwiftUI’s ObservableObject and @Published pattern handles the reactive update loop cleanly. A view model polls the backend on a timer, updating the published arrays that drive the results list and the token counter display. The main window shows three panels: a results list sorted by score descending, a live token and estimated cost counter, and a control bar with Pause/Resume and Stop buttons.

Each result row shows the job title, company name, a colour-coded score badge, and the LLM’s one-line rationale. Tapping a row opens a detail sheet with the full rationale, the key factors the model flagged, and a direct link to the original listing. The detail sheet is where most of the recruiter’s review time is spent — the score is a signal to direct attention, not a final decision.

When LLM Matching Beats Keyword Filtering

Keyword filtering is the right tool when requirements are binary and explicit: must hold a specific certification, must be located in a specific city, must have a minimum number of years in a role. For these, a simple filter runs in milliseconds and costs nothing.

But in reality, most interesting constraints are contextual. “Looking for someone who writes clean, well-tested code” is a preference that shows up in how a job description is written, not as a discrete keyword. “Not interested in roles where the stack is legacy-only with no modernisation path” requires reading and interpreting the description. A keyword filter cannot surface this signal. An LLM given an explicit profile with these constraints laid out clearly can evaluate them directly.

The gap widens further in markets with significant keyword inflation. When every job posting lists the same skills regardless of whether they are actually central to the role, keyword matching produces a near-flat distribution where most listings look equivalent. Semantic scoring breaks through this noise because it evaluates the overall coherence of the listing against the profile, not just term presence.

When It Is Overkill

LLM scoring adds real cost and latency. For high-volume, low-nuance screening — thousands of listings with straightforward binary requirements — a structured filter chain is faster and cheaper by orders of magnitude. The right architecture often combines both approaches: a fast pre-filter removes obvious non-matches, reducing the queue to a manageable subset, and LLM scoring is applied only to what remains. Pre-filtering 80% of listings before LLM scoring dramatically reduces cost without meaningfully reducing match quality.
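The combined approach amounts to a cheap triage pass ahead of the LLM queue. The specific constraints below are invented for illustration; any binary, explicit requirement fits the pattern:

```python
def hard_filters(listing: dict) -> bool:
    """Binary requirements that need no model: cheap and deterministic."""
    return (
        listing.get("city") == "Berlin"              # example location constraint
        and listing.get("years_experience", 0) >= 5  # example minimum-years constraint
    )

def triage(listings: list[dict]) -> list[dict]:
    # Only survivors of the hard filters are worth an LLM call.
    return [l for l in listings if hard_filters(l)]
```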

The other consideration is consistency. LLMs introduce non-determinism. The same listing scored twice may receive slightly different scores. For most recruiting applications this variance is acceptable — a spread of a few points in either direction does not change which listings are worth reviewing. But it is worth being aware of when comparing scores across sessions that used different model versions or significantly revised prompt iterations.
