Verified 2026-05-03 against
nudg3-workflows/utils/provider_name_mapper.py, prod env vars, and MEMORY.md.
01 — Collections & Prompts
The first stage of the loop. Daily, every monitored workspace’s prompts are run against every provider it has enabled, generating raw AI responses for the rest of the pipeline to chew on.
Prompts
A prompt is a tracked AI query — usually customer-defined, sometimes auto-generated during onboarding. Each prompt is tagged with:
- Funnel phase: Discovery (unbranded — “best CRM software”), Research (branded — “tell me about Acme Corp”), or Purchase (intent — “should I buy Acme CRM”). Discovery and Research/Purchase have very different behaviour and are scored differently downstream.
- Tags: arbitrary content/topic labels for grouping
- Active flag: only `is_active=True` prompts are collected against, and only active prompts feed downstream analysis (enforced at `fetch_responses_node`; see PR #318 / v3.8.1)
A workspace’s active prompt set is the optimisation surface for everything else. Active-prompt rotation >20% triggers change-guards downstream (see Historical Tracking).
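To make the active-flag rule concrete, here is a minimal sketch of the filter described above. The `Prompt` dataclass and `collectable` helper are illustrative names, not the real models; the actual enforcement lives in `fetch_responses_node`.

```python
from dataclasses import dataclass, field

@dataclass
class Prompt:
    text: str
    funnel_phase: str          # "discovery" | "research" | "purchase"
    tags: list[str] = field(default_factory=list)
    is_active: bool = True

def collectable(prompts: list[Prompt]) -> list[Prompt]:
    # Only active prompts are collected against and fed downstream.
    return [p for p in prompts if p.is_active]

prompts = [
    Prompt("best CRM software", "discovery", ["crm"]),
    Prompt("tell me about Acme Corp", "research", ["brand"], is_active=False),
]
print(len(collectable(prompts)))  # → 1 (the inactive prompt is skipped)
```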
Providers
Confirmed live in provider_name_mapper.py (2026-05-03):
Base tier:
- ChatGPT (gpt-4o-mini, also gpt-5-mini routes)
- Perplexity (sonar)
- AI Overviews (DataForSEO Google AIO)
Premium tier (per-workspace flag):
- Gemini (gemini-2.5-flash, gemini-3-pro-preview)
- Claude (claude-haiku-4-5, claude-sonnet-4-5, claude-opus-4-5)
- Grok (grok-3-mini, grok-4-1-fast-reasoning)
- Copilot (DataForSEO Bing Copilot)
- AI Mode (DataForSEO Google AI Mode)
- ChatGPT Premium (gpt-5.2, o4-mini reasoning)
- Perplexity Pro (sonar-pro)
Brave Search AI and Mistral were referenced in pre-2026 internal drafts but are not in the production provider mapper.
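The tier split above can be sketched as a simple mapping. The dictionary keys and the `enabled_providers` helper are hypothetical; the authoritative names and routing live in `provider_name_mapper.py`.

```python
# Hypothetical sketch of the base/premium split; see provider_name_mapper.py
# for the production mapping.
BASE_PROVIDERS = {
    "chatgpt": ["gpt-4o-mini", "gpt-5-mini"],
    "perplexity": ["sonar"],
    "ai_overviews": ["dataforseo_google_aio"],
}
PREMIUM_PROVIDERS = {
    "gemini": ["gemini-2.5-flash", "gemini-3-pro-preview"],
    "claude": ["claude-haiku-4-5", "claude-sonnet-4-5", "claude-opus-4-5"],
    "grok": ["grok-3-mini", "grok-4-1-fast-reasoning"],
    "copilot": ["dataforseo_bing_copilot"],
    "ai_mode": ["dataforseo_google_ai_mode"],
    "chatgpt_premium": ["gpt-5.2", "o4-mini"],
    "perplexity_pro": ["sonar-pro"],
}

def enabled_providers(premium: bool) -> dict[str, list[str]]:
    # Premium is a per-workspace flag; premium workspaces get both tiers.
    return {**BASE_PROVIDERS, **PREMIUM_PROVIDERS} if premium else dict(BASE_PROVIDERS)
```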
How collection runs (V3 pipeline)
The V3 Collection Pipeline queries AI providers every day using a queue-based architecture (Cloro).
- Cloud Scheduler triggers daily at 00:30 UTC (`v3-collection-define-eu`)
- Backend builds a prompt × provider matrix for each eligible workspace
- Cloro async API executes queries with managed concurrency (`CLORO_CONCURRENCY_LIMIT=25`)
- Webhooks return results, persisted as `ProviderResponse` records via `cloro_webhook`
- Recovery scheduler at 03:30 UTC retries stuck/failed tasks (70-minute buffer)
- Health check at 04:00 UTC produces a Slack digest
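The matrix-building step above is a plain Cartesian product: one Cloro task per (prompt, provider) pair. A minimal sketch, with a hypothetical `build_task_matrix` helper (the real task payloads carry more fields):

```python
from itertools import product

def build_task_matrix(prompts: list[str], providers: list[str]) -> list[dict]:
    # One collection task per (prompt, provider) pair for a workspace.
    return [{"prompt": p, "provider": pr} for p, pr in product(prompts, providers)]

tasks = build_task_matrix(
    ["best CRM software"],
    ["chatgpt", "perplexity", "ai_overviews"],
)
print(len(tasks))  # → 3
```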
Scheduler timings (production)
| Cron job | Schedule (UTC) | Purpose |
|---|---|---|
| `v3-collection-define-eu` | 30 0 * * * | Submit Cloro tasks for eligible workspaces |
| `v3-collection-recovery-eu` | 30 3 * * * | Retry stuck/failed tasks |
| `collection-health-check-eu` | 0 4 * * * | Slack digest |
| `extraction-backlog-processor` | */5 5-9 * * * | Safety net for inline extraction (feeds Stage 2) |
| `source-classification-daily-eu` | 0 10 * * * | Classify cited sources (feeds Stage 2) |
| `detect-duplicates-daily-eu` | 0 11 * * * | De-duplicate mentions |
| `subscription-expiration-check-eu` | 0 14 * * * | Billing checks |
Scale
Cloro plan: Standard, 25 concurrent tasks, ~36 tasks/min, ~111 min for ~4,000 tasks.
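A quick sanity check of the figures above: at ~36 tasks/min, ~4,000 tasks take roughly 111 minutes, which fits comfortably between the 00:30 submit and the 03:30 recovery run.

```python
# Back-of-envelope check of the Standard-plan throughput quoted above.
tasks = 4_000
tasks_per_min = 36
minutes = round(tasks / tasks_per_min)
print(minutes)  # → 111
```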
Why managed API, not scraping
Cloro provides managed async collection against each provider via official APIs and structured scrapes. This matters: competitors that scrape provider UIs (Peec, Hall) face fragility every time a provider changes its interface. Our approach is sustainable and reliable.
See also
- Stage 2 — Mentions & Citations — what happens to the raw responses next
- Architecture Overview — where collection sits in the broader system