02 — Mentions & Citations

Verified 2026-05-03 against MEMORY.md (extraction pipeline versioning) and prod feature flags.

The second stage of the loop. Every raw ProviderResponse from Stage 1 is processed by the Extraction Pipeline to identify what the response actually says about brands.

What gets extracted

Per response, per provider (one possible record shape is sketched after this list):

  • Brand mentions: Which brands appear, where in the response (position 1, 2, 3…), and in what context
  • Sentiment: Positive, neutral, or negative — extracted via batch sentiment LLM call with structured output
  • Key passages: The exact text surrounding each brand mention
  • Citations: Source URLs the AI model references alongside each mention
  • Source classification: Whether cited sources are Editorial, UGC, Reference, or owned by the brand or a competitor (handled by the source-classification-daily-eu scheduled job)
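
For illustration, a single normalised record covering the fields above might look like the sketch below. The `Mention` and `Citation` class names and every field name are hypothetical; the pipeline's actual model may differ.

```python
# Hypothetical sketch of the per-mention record described above.
# Class and field names are illustrative, not the pipeline's real types.
from dataclasses import dataclass, field
from enum import Enum

class Sentiment(Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"

class SourceType(Enum):
    EDITORIAL = "editorial"
    UGC = "ugc"
    REFERENCE = "reference"
    BRAND_OWNED = "brand_owned"
    COMPETITOR_OWNED = "competitor_owned"

@dataclass
class Citation:
    url: str
    source_type: SourceType | None = None  # filled later by the scheduled classification job

@dataclass
class Mention:
    brand: str
    position: int                      # 1-based order of appearance in the response
    sentiment: Sentiment               # from the batch sentiment LLM call (v1.7+)
    key_passage: str                   # exact text surrounding the mention
    citations: list[Citation] = field(default_factory=list)
```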

Why per-provider matters

Each provider has a distinct response shape, citation model, and quality profile. Extraction parses each accordingly:

  • ChatGPT — citations as inline links, sometimes none at all; mentions in conversational prose
  • Perplexity — explicit numbered citation index, dense citation graph per response
  • AI Overviews — structured AI Overview blocks alongside organic SERP results; both feed extraction
  • Gemini / Claude / Grok — varied citation conventions; mention extraction normalises across them
  • Copilot / AI Mode — provider-specific scrape shapes from DataForSEO

The output is a normalised mention model the rest of the loop doesn’t have to think about.
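
One plausible way to structure this per-provider parsing is a registry that maps each provider to its own parser, as in the sketch below. The registry pattern, function names, and raw-payload shape are assumptions for illustration (reusing the hypothetical `Mention` type sketched earlier), not the pipeline's actual dispatch code.

```python
# Illustrative parser registry; provider keys and parser bodies are assumptions.
from typing import Callable

# Each parser takes one raw provider payload and returns normalised mentions.
Parser = Callable[[dict], "list[Mention]"]  # Mention as sketched above

PARSERS: dict[str, Parser] = {}

def register(provider: str) -> Callable[[Parser], Parser]:
    """Register a parser for one provider's response shape."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[provider] = fn
        return fn
    return wrap

@register("perplexity")
def parse_perplexity(raw: dict) -> "list[Mention]":
    # Perplexity ships an explicit numbered citation index, so citations
    # can be resolved by index before mentions are extracted.
    ...

@register("chatgpt")
def parse_chatgpt(raw: dict) -> "list[Mention]":
    # ChatGPT citations are inline links and may be absent entirely;
    # mentions are pulled out of conversational prose.
    ...

def extract(provider: str, raw: dict) -> "list[Mention]":
    return PARSERS[provider](raw)
```

Adding a provider then means adding one parser; downstream stages keep consuming the same normalised model.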

Pipeline version state (production)

| Version | Status | Notes |
| --- | --- | --- |
| v1.5 | Superseded | Per-brand sentiment (N LLM calls) |
| v1.7 | Superseded | Batch sentiment + key_passage + structured output. Was prod 2026-03-14 → 2026-04-23. |
| v1.10 | Shipped, rolled back 2026-04-20 | Identity normalisation + disambiguation. Pure-Python fuzzy matcher caused a 17 s/response regression. |
| v1.11 | Shipped, fuzzy reverted 2026-04-22 | rapidfuzz swap + thread-pool offload + disambig batch parallelism. OOM under batch load forced the fuzzy flag off. |
| v1.12 | PRODUCTION as of 2026-04-23 | Decoupled fuzzy from identity normalisation. 892 ms/response mean (34% faster than v1.7). Identity + disambig flags ON, fuzzy permanently OFF. |
| v1.13 | Design (Linear GEN-51) | Candidate-first matcher: Discovery LLM produces tokens → rapidfuzz on candidates × workspace brands. ~44,000× compute reduction over the old sliding-window fuzzy. Subsumes the v1.8 BrandMatcher portion. Gated on stable v1.12 + v1.1 observed mode. |

Feature flags on prod (2026-05-03)

  • USE_BATCH_SENTIMENT=True (v1.7+ behaviour, default)
  • BRAND_IDENTITY_NORMALIZATION_ENABLED=True (v1.12 default)
  • BRAND_DISAMBIGUATION_ENABLED=True (v1.12 default)
  • BRAND_FUZZY_MATCHING_ENABLED=False (permanently OFF; to be superseded by the v1.13 candidate-first matcher)
  • INLINE_EXTRACTION_CONCURRENCY=6 (per-container semaphore from v1.10.1)
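
Read at startup, these flags most likely come straight from the environment. A minimal sketch, assuming plain env vars with the defaults listed above (the loader helper is hypothetical, not the service's actual config code):

```python
# Hypothetical flag loader mirroring the prod flags above; the env-var
# mechanism and parsing helper are assumptions, defaults match this page.
import os

def _flag(name: str, default: bool) -> bool:
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

USE_BATCH_SENTIMENT = _flag("USE_BATCH_SENTIMENT", True)
BRAND_IDENTITY_NORMALIZATION_ENABLED = _flag("BRAND_IDENTITY_NORMALIZATION_ENABLED", True)
BRAND_DISAMBIGUATION_ENABLED = _flag("BRAND_DISAMBIGUATION_ENABLED", True)
BRAND_FUZZY_MATCHING_ENABLED = _flag("BRAND_FUZZY_MATCHING_ENABLED", False)

# Backs the per-container semaphore introduced in v1.10.1, e.g.
# asyncio.Semaphore(INLINE_EXTRACTION_CONCURRENCY).
INLINE_EXTRACTION_CONCURRENCY = int(os.environ.get("INLINE_EXTRACTION_CONCURRENCY", "6"))
```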

Upcoming capabilities

  • v1.13 Candidate-first matching — Discovery LLM-driven tokens fuzzy-matched against the workspace brand list (~150 comparisons/response vs ~6.6M in the old sliding-window approach). Better quality plus four orders of magnitude less compute; see the sketch after this list. Prototype-first to avoid a repeat of the v1.11 rabbit-hole.
  • v1.8 Brand Discovery (partial) — the Discovery LLM and persistence halves are still planned; the BrandMatcher portion is now superseded by v1.13.
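
To make the candidate-first idea concrete, here is a minimal sketch of the matching step using rapidfuzz. The function name, score cutoff, and input shapes are assumptions drawn from this page's description of GEN-51, not the actual v1.13 implementation.

```python
# Hedged sketch of v1.13 candidate-first matching, per the description above.
# Inputs, cutoff, and scorer choice are illustrative assumptions.
from rapidfuzz import fuzz, process

def match_candidates(
    candidate_tokens: list[str],   # brand-like tokens from the Discovery LLM
    workspace_brands: list[str],   # the workspace's configured brand list
    score_cutoff: float = 90.0,    # assumed threshold, not a confirmed value
) -> dict[str, str]:
    """Map each candidate token to its best-matching workspace brand, if any.

    At e.g. ~15 candidates x ~10 brands this is on the order of the ~150
    comparisons/response quoted above, versus ~6.6M for sliding-window fuzzy.
    """
    matches: dict[str, str] = {}
    for token in candidate_tokens:
        hit = process.extractOne(
            token, workspace_brands, scorer=fuzz.WRatio, score_cutoff=score_cutoff
        )
        if hit is not None:
            brand, _score, _index = hit
            matches[token] = brand
    return matches
```

The compute saving comes purely from shrinking the comparison set: the Discovery LLM proposes a handful of candidate tokens, so fuzzy matching only runs candidates × workspace brands rather than every response substring.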

See also

  • Stage 1 — Collections & Prompts — where responses come from
  • Stage 3 — Aggregated Stats — what the structured mentions become
  • docs/active/extraction-pipeline/V1_12_DECOUPLE_FUZZY_PLAN.md — current extraction prod state (in the main docs repo)
  • docs/active/extraction-pipeline/V1_13_CANDIDATE_FIRST_SCOPE.md — next iteration design