Verified 2026-05-03 against MEMORY.md (extraction pipeline versioning) and prod feature flags.
02 — Mentions & Citations
The second stage of the loop. Every raw ProviderResponse from Stage 1 is processed by the Extraction Pipeline to identify what the response actually says about brands.
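A minimal sketch of the Stage 1 hand-off, assuming a hypothetical shape for `ProviderResponse` (the field names are illustrative, not the real schema):

```python
from dataclasses import dataclass, field

@dataclass
class ProviderResponse:
    # Hypothetical Stage 1 record; field names are illustrative only.
    provider: str        # "chatgpt", "perplexity", "gemini", ...
    prompt_id: str       # which Stage 1 prompt produced this response
    raw_text: str        # the full model answer, unparsed
    raw_citations: list[dict] = field(default_factory=list)  # provider-specific payload
```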
What gets extracted
Per response, per provider (see the record sketch after this list):
- Brand mentions: Which brands appear, where in the response (position 1, 2, 3…), and in what context
- Sentiment: Positive, neutral, or negative — extracted via batch sentiment LLM call with structured output
- Key passages: The exact text surrounding each brand mention
- Citations: Source URLs the AI model references alongside each mention
- Source classification: Whether cited sources are Editorial, UGC, Reference, or owned by the brand or a competitor (handled by the `source-classification-daily-eu` scheduled job)
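What a single extracted mention could look like, as a hedged sketch (hypothetical names; the real model lives in the pipeline code):

```python
from dataclasses import dataclass
from enum import Enum

class Sentiment(Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"

class SourceType(Enum):
    EDITORIAL = "editorial"
    UGC = "ugc"
    REFERENCE = "reference"
    BRAND_OWNED = "brand_owned"
    COMPETITOR_OWNED = "competitor_owned"

@dataclass
class Citation:
    url: str
    source_type: SourceType | None = None  # filled in later by the scheduled classification job

@dataclass
class BrandMention:
    brand: str
    position: int              # 1-based order of appearance in the response
    sentiment: Sentiment       # from the batch sentiment LLM call
    key_passage: str           # exact text surrounding the mention
    citations: list[Citation]  # URLs the model cited alongside this mention
```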
Why per-provider matters
Each provider has a distinct response shape, citation model, and quality profile. Extraction parses each accordingly:
- ChatGPT — citations as inline links, sometimes none at all; mentions in conversational prose
- Perplexity — explicit numbered citation index, dense citation graph per response
- AI Overviews — structured AI Overview blocks alongside organic SERP results; both feed extraction
- Gemini / Claude / Grok — varied citation conventions; mention extraction normalises across them
- Copilot / AI Mode — provider-specific scrape shapes from DataForSEO
The output is a normalised mention model the rest of the loop doesn’t have to think about.
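One way to contain that provider variance is a parser registry keyed by provider name, where every parser returns the same normalised shape. A sketch under assumed names (the real dispatch mechanism may differ):

```python
from typing import Callable

RawResponse = dict  # provider-specific payload as returned by Stage 1
Parser = Callable[[RawResponse], list[dict]]  # each parser emits normalised mention dicts

PARSERS: dict[str, Parser] = {}

def register(provider: str):
    def wrap(fn: Parser) -> Parser:
        PARSERS[provider] = fn
        return fn
    return wrap

@register("perplexity")
def parse_perplexity(raw: RawResponse) -> list[dict]:
    # Perplexity exposes an explicit numbered citation index, so citations
    # can be resolved by number before mentions are located in the answer.
    index = {i + 1: url for i, url in enumerate(raw.get("citations", []))}
    return [{"text": raw.get("answer", ""), "citation_index": index}]

@register("chatgpt")
def parse_chatgpt(raw: RawResponse) -> list[dict]:
    # ChatGPT: inline links, sometimes none; mentions sit in conversational prose.
    return [{"text": raw.get("message", ""), "citation_index": {}}]

def normalise(provider: str, raw: RawResponse) -> list[dict]:
    return PARSERS[provider](raw)
```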
Pipeline version state (production)
| Version | Status | Notes |
|---|---|---|
| v1.5 | Superseded | Per-brand sentiment (N LLM calls) |
| v1.7 | Superseded | Batch sentiment + key_passage + structured output. Was prod 2026-03-14 → 2026-04-23. |
| v1.10 | Shipped + rolled back 2026-04-20 | Identity normalisation + disambiguation. Pure-Python fuzzy matcher caused 17s/response regression. |
| v1.11 | Shipped, fuzzy reverted 2026-04-22 | rapidfuzz swap + thread-pool offload + disambig batch parallelism. OOM under batch load forced fuzzy flag off. |
| v1.12 | PRODUCTION as of 2026-04-23 | Decoupled fuzzy from identity normalisation. 892ms/response mean (34% faster than v1.7). Identity + disambig flags ON, fuzzy permanently OFF. |
| v1.13 | Design (Linear GEN-51) | Candidate-first matcher: Discovery LLM produces tokens → rapidfuzz on candidates × workspace brands. ~44,000× compute reduction over old sliding-window fuzzy. Subsumes v1.8 BrandMatcher portion. Gated on stable v1.12 + v1.1 observed mode. |
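To make the v1.5 → v1.7 change concrete: per-brand sentiment meant N LLM calls per response (one per brand), while batch sentiment covers every brand in a single structured-output call. A sketch assuming a generic `llm_json` helper standing in for the real client:

```python
import json

def llm_json(prompt: str) -> dict:
    """Placeholder for a structured-output LLM call; swap in the real client."""
    raise NotImplementedError

def batch_sentiment(response_text: str, brands: list[str]) -> dict[str, str]:
    # v1.7+ behaviour: one call covering every brand, instead of N per-brand calls.
    prompt = (
        "For each brand below, classify its sentiment in the text as "
        "positive, neutral, or negative. Respond as JSON: "
        '{"<brand>": "<sentiment>", ...}\n\n'
        f"Brands: {json.dumps(brands)}\n\nText:\n{response_text}"
    )
    return llm_json(prompt)
```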
Feature flags on prod (2026-05-03)
- `USE_BATCH_SENTIMENT=True` (v1.7+ behaviour, default)
- `BRAND_IDENTITY_NORMALIZATION_ENABLED=True` (v1.12 default)
- `BRAND_DISAMBIGUATION_ENABLED=True` (v1.12 default)
- `BRAND_FUZZY_MATCHING_ENABLED=False` (permanently OFF — replaced by v1.13)
- `INLINE_EXTRACTION_CONCURRENCY=6` (per-container semaphore from v1.10.1)
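A sketch of how these flags might gate execution, assuming environment-variable flags and an asyncio semaphore for the per-container concurrency cap (helper names are illustrative):

```python
import asyncio
import os

def flag(name: str, default: str) -> str:
    return os.environ.get(name, default)

USE_BATCH_SENTIMENT = flag("USE_BATCH_SENTIMENT", "True") == "True"
FUZZY_ENABLED = flag("BRAND_FUZZY_MATCHING_ENABLED", "False") == "True"
CONCURRENCY = int(flag("INLINE_EXTRACTION_CONCURRENCY", "6"))

# Per-container semaphore: at most CONCURRENCY extractions in flight at once.
_extraction_sem = asyncio.Semaphore(CONCURRENCY)

async def extract_one(response_id: str) -> None:
    async with _extraction_sem:
        ...  # run the extraction pipeline for a single response
```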
Upcoming capabilities
- v1.13 Candidate-first matching — Discovery LLM-proposed tokens fuzzy-matched against the workspace brand list (~150 comparisons/response vs ~6.6M in the old sliding-window approach; see the sketch after this list). Better quality plus four orders of magnitude less compute. Prototype-first to avoid a repeat of the v1.11 rabbit-hole.
- v1.8 Brand Discovery (partial) — Discovery LLM + persistence halves still planned. BrandMatcher portion is now superseded by v1.13.
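A sketch of the candidate-first idea, assuming rapidfuzz and hypothetical inputs: the Discovery LLM proposes candidate tokens, and only those are fuzzy-matched against the workspace brand list, rather than sliding a window across the whole response text:

```python
from rapidfuzz import fuzz, process, utils

def match_candidates(
    candidate_tokens: list[str],   # proposed by the Discovery LLM
    workspace_brands: list[str],   # known brands for this workspace
    threshold: float = 90.0,
) -> dict[str, str]:
    """Map each candidate token to the best-matching workspace brand above threshold."""
    matches: dict[str, str] = {}
    for token in candidate_tokens:
        best = process.extractOne(
            token,
            workspace_brands,
            scorer=fuzz.WRatio,
            processor=utils.default_process,  # case/punctuation-insensitive matching
            score_cutoff=threshold,
        )
        if best is not None:
            brand, _score, _idx = best
            matches[token] = brand
    return matches

# Work scales as len(candidate_tokens) x len(workspace_brands), on the order of
# ~150 comparisons per response, versus millions in the old sliding-window scan.
print(match_candidates(["Acme Corp", "acme", "Globex"], ["Acme", "Globex Corporation"]))
```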
See also
- Stage 1 — Collections & Prompts — where responses come from
- Stage 3 — Aggregated Stats — what the structured mentions become
- `docs/active/extraction-pipeline/V1_12_DECOUPLE_FUZZY_PLAN.md` — current extraction prod state (in main docs repo)
- `docs/active/extraction-pipeline/V1_13_CANDIDATE_FIRST_SCOPE.md` — next iteration design