Verified 2026-05-03 against MEMORY.md (extraction pipeline versioning) and prod feature flags.
02 — Mentions & Citations
The second stage of the loop. Every raw ProviderResponse from Stage 1 is processed by the Extraction Pipeline to identify what the response actually says about brands.
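A minimal sketch of the Stage 1 hand-off, assuming a hypothetical shape for `ProviderResponse` (the field names are illustrative, not the real schema):

```python
from dataclasses import dataclass, field

@dataclass
class ProviderResponse:
    # Hypothetical Stage 1 record; field names are illustrative only.
    provider: str        # "chatgpt", "perplexity", "gemini", ...
    prompt_id: str       # which Stage 1 prompt produced this response
    raw_text: str        # the full model answer, unparsed
    raw_citations: list[dict] = field(default_factory=list)  # provider-specific payload
```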
What gets extracted
Per response, per provider (see the record sketch after this list):
- Brand mentions: Which brands appear, where in the response (position 1, 2, 3…), and in what context
- Sentiment: Positive, neutral, or negative — extracted via batch sentiment LLM call with structured output
- Key passages: The exact text surrounding each brand mention
- Citations: Source URLs the AI model references alongside each mention
- Source classification: Whether cited sources are Editorial, UGC, Reference, or owned by the brand or a competitor (handled by the `source-classification-daily-eu` scheduled job)
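What a single extracted mention could look like, as a hedged sketch (hypothetical names; the real model lives in the pipeline code):

```python
from dataclasses import dataclass
from enum import Enum

class Sentiment(Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"

class SourceType(Enum):
    EDITORIAL = "editorial"
    UGC = "ugc"
    REFERENCE = "reference"
    BRAND_OWNED = "brand_owned"
    COMPETITOR_OWNED = "competitor_owned"

@dataclass
class Citation:
    url: str
    source_type: SourceType | None = None  # filled in later by the scheduled classification job

@dataclass
class BrandMention:
    brand: str
    position: int              # 1-based order of appearance in the response
    sentiment: Sentiment       # from the batch sentiment LLM call
    key_passage: str           # exact text surrounding the mention
    citations: list[Citation]  # URLs the model cited alongside this mention
```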
Why per-provider matters
Each provider has a distinct response shape, citation model, and quality profile. Extraction parses each accordingly:
- ChatGPT — citations as inline links, sometimes none at all; mentions in conversational prose
- Perplexity — explicit numbered citation index, dense citation graph per response
- AI Overviews — structured AI Overview blocks alongside organic SERP results; both feed extraction
- Gemini / Claude / Grok — varied citation conventions; mention extraction normalises across them
- Copilot / AI Mode — provider-specific scrape shapes from DataForSEO
The output is a normalised mention model the rest of the loop doesn’t have to think about.
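One way to contain that provider variance is a parser registry keyed by provider name, where every parser returns the same normalised shape. A sketch under assumed names (the real dispatch mechanism may differ):

```python
from typing import Callable

RawResponse = dict  # provider-specific payload as returned by Stage 1
Parser = Callable[[RawResponse], list[dict]]  # each parser emits normalised mention dicts

PARSERS: dict[str, Parser] = {}

def register(provider: str):
    def wrap(fn: Parser) -> Parser:
        PARSERS[provider] = fn
        return fn
    return wrap

@register("perplexity")
def parse_perplexity(raw: RawResponse) -> list[dict]:
    # Perplexity exposes an explicit numbered citation index, so citations
    # can be resolved by number before mentions are located in the answer.
    index = {i + 1: url for i, url in enumerate(raw.get("citations", []))}
    return [{"text": raw.get("answer", ""), "citation_index": index}]

@register("chatgpt")
def parse_chatgpt(raw: RawResponse) -> list[dict]:
    # ChatGPT: inline links, sometimes none; mentions sit in conversational prose.
    return [{"text": raw.get("message", ""), "citation_index": {}}]

def normalise(provider: str, raw: RawResponse) -> list[dict]:
    return PARSERS[provider](raw)
```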
Pipeline version state (production)
| Version | Status | Notes |
|---|---|---|
| v1.5 | Superseded | Per-brand sentiment (N LLM calls) |
| v1.7 | Superseded | Batch sentiment + key_passage + structured output. Was prod 2026-03-14 → 2026-04-23. |
| v1.10 | Shipped + rolled back 2026-04-20 | Identity normalisation + disambiguation. Pure-Python fuzzy matcher caused 17s/response regression. |
| v1.11 | Shipped, fuzzy reverted 2026-04-22 | rapidfuzz swap + thread-pool offload + disambig batch parallelism. OOM under batch load forced fuzzy flag off. |
| v1.12 | PRODUCTION as of 2026-04-23 | Decoupled fuzzy from identity normalisation. 892ms/response mean (34% faster than v1.7). Identity + disambig flags ON, fuzzy permanently OFF. |
| v1.13 | Design (Linear GEN-51) | Candidate-first matcher: Discovery LLM produces tokens → rapidfuzz on candidates × workspace brands. ~44,000× compute reduction over old sliding-window fuzzy. Subsumes v1.8 BrandMatcher portion. Gated on stable v1.12 + v1.1 observed mode. |
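To make the v1.5 → v1.7 change concrete: per-brand sentiment meant N LLM calls per response (one per brand), while batch sentiment covers every brand in a single structured-output call. A sketch assuming a generic `llm_json` helper standing in for the real client:

```python
import json

def llm_json(prompt: str) -> dict:
    """Placeholder for a structured-output LLM call; swap in the real client."""
    raise NotImplementedError

def batch_sentiment(response_text: str, brands: list[str]) -> dict[str, str]:
    # v1.7+ behaviour: one call covering every brand, instead of N per-brand calls.
    prompt = (
        "For each brand below, classify its sentiment in the text as "
        "positive, neutral, or negative. Respond as JSON: "
        '{"<brand>": "<sentiment>", ...}\n\n'
        f"Brands: {json.dumps(brands)}\n\nText:\n{response_text}"
    )
    return llm_json(prompt)
```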
Feature flags on prod (2026-05-03)
- `USE_BATCH_SENTIMENT=True` (v1.7+ behaviour, default)
- `BRAND_IDENTITY_NORMALIZATION_ENABLED=True` (v1.12 default)
- `BRAND_DISAMBIGUATION_ENABLED=True` (v1.12 default)
- `BRAND_FUZZY_MATCHING_ENABLED=False` (permanently OFF — replaced by v1.13)
- `INLINE_EXTRACTION_CONCURRENCY=6` (per-container semaphore from v1.10.1)
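A sketch of how these flags might gate execution, assuming environment-variable flags and an asyncio semaphore for the per-container concurrency cap (helper names are illustrative):

```python
import asyncio
import os

def flag(name: str, default: str) -> str:
    return os.environ.get(name, default)

USE_BATCH_SENTIMENT = flag("USE_BATCH_SENTIMENT", "True") == "True"
FUZZY_ENABLED = flag("BRAND_FUZZY_MATCHING_ENABLED", "False") == "True"
CONCURRENCY = int(flag("INLINE_EXTRACTION_CONCURRENCY", "6"))

# Per-container semaphore: at most CONCURRENCY extractions in flight at once.
_extraction_sem = asyncio.Semaphore(CONCURRENCY)

async def extract_one(response_id: str) -> None:
    async with _extraction_sem:
        ...  # run the extraction pipeline for a single response
```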
Upcoming capabilities
- v1.13 Candidate-first matching — Discovery LLM-proposed tokens fuzzy-matched against the workspace brand list (~150 comparisons/response vs ~6.6M in the old sliding-window approach; see the sketch after this list). Better quality plus four orders of magnitude less compute. Prototype-first to avoid a repeat of the v1.11 rabbit-hole.
- v1.8 Brand Discovery (partial) — Discovery LLM + persistence halves still planned. BrandMatcher portion is now superseded by v1.13.
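A sketch of the candidate-first idea, assuming rapidfuzz and hypothetical inputs: the Discovery LLM proposes candidate tokens, and only those are fuzzy-matched against the workspace brand list, rather than sliding a window across the whole response text:

```python
from rapidfuzz import fuzz, process, utils

def match_candidates(
    candidate_tokens: list[str],   # proposed by the Discovery LLM
    workspace_brands: list[str],   # known brands for this workspace
    threshold: float = 90.0,
) -> dict[str, str]:
    """Map each candidate token to the best-matching workspace brand above threshold."""
    matches: dict[str, str] = {}
    for token in candidate_tokens:
        best = process.extractOne(
            token,
            workspace_brands,
            scorer=fuzz.WRatio,
            processor=utils.default_process,  # case/punctuation-insensitive matching
            score_cutoff=threshold,
        )
        if best is not None:
            brand, _score, _idx = best
            matches[token] = brand
    return matches

# Work scales as len(candidate_tokens) x len(workspace_brands), on the order of
# ~150 comparisons per response, versus millions in the old sliding-window scan.
print(match_candidates(["Acme Corp", "acme", "Globex"], ["Acme", "Globex Corporation"]))
```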
See also
- Stage 1 — Collections & Prompts — where responses come from
- Stage 3 — Aggregated Stats — what the structured mentions become
- `docs/active/extraction-pipeline/V1_12_DECOUPLE_FUZZY_PLAN.md` — current extraction prod state (in main docs repo)
- `docs/active/extraction-pipeline/V1_13_CANDIDATE_FIRST_SCOPE.md` — next iteration design