Verified 2026-05-03 against
nudg3-workflows/utils/provider_name_mapper.py, prod env vars, and MEMORY.md.
01 — Collections & Prompts
The first stage of the loop. Daily, every monitored workspace’s prompts are run against every provider it has enabled, generating raw AI responses for the rest of the pipeline to chew on.
Prompts
A prompt is a tracked AI query — usually customer-defined, sometimes auto-generated during onboarding. Each prompt is tagged with:
- Funnel phase: Discovery (unbranded — “best CRM software”), Research (branded — “tell me about Acme Corp”), or Purchase (intent — “should I buy Acme CRM”). Discovery and Research/Purchase have very different behaviour and are scored differently downstream.
- Tags: arbitrary content/topic labels for grouping
- Active flag: only `is_active=True` prompts are collected against, and only active prompts feed downstream analysis (enforced at `fetch_responses_node`; see PR #318 / v3.8.1)
A workspace’s active prompt set is the optimisation surface for everything else. Active-prompt rotation >20% triggers change-guards downstream (see Historical Tracking).
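To make the active-flag rule concrete, here is a minimal sketch of the filter described above. The `Prompt` dataclass and `collectable` helper are illustrative names, not the real models; the actual enforcement lives in `fetch_responses_node`.

```python
from dataclasses import dataclass, field

@dataclass
class Prompt:
    text: str
    funnel_phase: str          # "discovery" | "research" | "purchase"
    tags: list[str] = field(default_factory=list)
    is_active: bool = True

def collectable(prompts: list[Prompt]) -> list[Prompt]:
    # Only active prompts are collected against and fed downstream.
    return [p for p in prompts if p.is_active]

prompts = [
    Prompt("best CRM software", "discovery", ["crm"]),
    Prompt("tell me about Acme Corp", "research", ["brand"], is_active=False),
]
print(len(collectable(prompts)))  # → 1 (the inactive prompt is skipped)
```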
Providers
Confirmed live in provider_name_mapper.py (2026-05-03):
Base tier:
- ChatGPT (gpt-4o-mini, also gpt-5-mini routes)
- Perplexity (sonar)
- AI Overviews (DataForSEO Google AIO)
Premium tier (per-workspace flag):
- Gemini (gemini-2.5-flash, gemini-3-pro-preview)
- Claude (claude-haiku-4-5, claude-sonnet-4-5, claude-opus-4-5)
- Grok (grok-3-mini, grok-4-1-fast-reasoning)
- Copilot (DataForSEO Bing Copilot)
- AI Mode (DataForSEO Google AI Mode)
- ChatGPT Premium (gpt-5.2, o4-mini reasoning)
- Perplexity Pro (sonar-pro)
Brave Search AI and Mistral were referenced in pre-2026 internal drafts but are not in the production provider mapper.
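The tier split above can be sketched as a simple mapping. The dictionary keys and the `enabled_providers` helper are hypothetical; the authoritative names and routing live in `provider_name_mapper.py`.

```python
# Hypothetical sketch of the base/premium split; see provider_name_mapper.py
# for the production mapping.
BASE_PROVIDERS = {
    "chatgpt": ["gpt-4o-mini", "gpt-5-mini"],
    "perplexity": ["sonar"],
    "ai_overviews": ["dataforseo_google_aio"],
}
PREMIUM_PROVIDERS = {
    "gemini": ["gemini-2.5-flash", "gemini-3-pro-preview"],
    "claude": ["claude-haiku-4-5", "claude-sonnet-4-5", "claude-opus-4-5"],
    "grok": ["grok-3-mini", "grok-4-1-fast-reasoning"],
    "copilot": ["dataforseo_bing_copilot"],
    "ai_mode": ["dataforseo_google_ai_mode"],
    "chatgpt_premium": ["gpt-5.2", "o4-mini"],
    "perplexity_pro": ["sonar-pro"],
}

def enabled_providers(premium: bool) -> dict[str, list[str]]:
    # Premium is a per-workspace flag; premium workspaces get both tiers.
    return {**BASE_PROVIDERS, **PREMIUM_PROVIDERS} if premium else dict(BASE_PROVIDERS)
```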
How collection runs (V3 pipeline)
The V3 Collection Pipeline queries AI providers every day using a queue-based architecture (Cloro).
- Cloud Scheduler triggers daily at 00:30 UTC (`v3-collection-define-eu`)
- Backend builds a prompt × provider matrix for each eligible workspace
- Cloro async API executes queries with managed concurrency (`CLORO_CONCURRENCY_LIMIT=25`)
- Webhooks return results, persisted as `ProviderResponse` records via `cloro_webhook`
- Recovery scheduler at 03:30 UTC retries stuck/failed tasks (70-minute buffer)
- Health check at 04:00 UTC produces a Slack digest
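The matrix-building step above is a plain Cartesian product: one Cloro task per (prompt, provider) pair. A minimal sketch, with a hypothetical `build_task_matrix` helper (the real task payloads carry more fields):

```python
from itertools import product

def build_task_matrix(prompts: list[str], providers: list[str]) -> list[dict]:
    # One collection task per (prompt, provider) pair for a workspace.
    return [{"prompt": p, "provider": pr} for p, pr in product(prompts, providers)]

tasks = build_task_matrix(
    ["best CRM software"],
    ["chatgpt", "perplexity", "ai_overviews"],
)
print(len(tasks))  # → 3
```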
Scheduler timings (production)
| Cron job | Schedule (UTC) | Purpose |
|---|---|---|
| `v3-collection-define-eu` | 30 0 * * * | Submit Cloro tasks for eligible workspaces |
| `v3-collection-recovery-eu` | 30 3 * * * | Retry stuck/failed tasks |
| `collection-health-check-eu` | 0 4 * * * | Slack digest |
| `extraction-backlog-processor` | */5 5-9 * * * | Safety net for inline extraction (feeds Stage 2) |
| `source-classification-daily-eu` | 0 10 * * * | Classify cited sources (feeds Stage 2) |
| `detect-duplicates-daily-eu` | 0 11 * * * | De-duplicate mentions |
| `subscription-expiration-check-eu` | 0 14 * * * | Billing checks |
Scale
Cloro plan: Standard, 25 concurrent tasks, ~36 tasks/min, ~111 min for ~4,000 tasks.
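A quick sanity check of the figures above: at ~36 tasks/min, ~4,000 tasks take roughly 111 minutes, which fits comfortably between the 00:30 submit and the 03:30 recovery run.

```python
# Back-of-envelope check of the Standard-plan throughput quoted above.
tasks = 4_000
tasks_per_min = 36
minutes = round(tasks / tasks_per_min)
print(minutes)  # → 111
```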
Why managed API, not scraping
Cloro provides managed async collection against each provider via official APIs and structured scrapes. This matters: competitors that scrape provider UIs (Peec, Hall) face fragility every time a provider changes its interface. Our approach is sustainable and reliable.
See also
- Stage 2 — Mentions & Citations — what happens to the raw responses next
- Architecture Overview — where collection sits in the broader system