Telemetry Study

From Prompt to Crawler to Response: Tracing the Full Lifecycle of an AI Query

Tracing the full lifecycle of an AI query — prompt → crawler request → pages fetched → response and citations — across ChatGPT, Gemini, Perplexity, Claude, and Grok. This kind of end-to-end behavioral tracking is an underexamined layer of GEO/AEO.

Marty Coleman
CEO, Second Wind
1. Executive Takeaway

We traced the full lifecycle of AI queries across five major AI platforms, from prompt to crawler behavior to fetched pages to final response and citations. To do that, we instrumented both sides of the interaction: a probe layer that issued the queries and website-side telemetry that captured crawler and agent activity as it happened.

From there, we looked not just at what the systems answered, but at the observable path by which those answers appeared to form. In this dataset, the biggest pattern was not simply whether a platform appeared to fetch content in real time. It was that the same domain could be processed very differently depending on the system and the content format involved.

In one case, a single fetch of a structured HTML hub page with internal links was followed by 12 citations to our domain. In another, a fetch of llms.txt appeared to shape answer content but produced zero citations. In other responses, platforms cited pages on the domain despite no observable site-side crawl during the run.

Taken together, these traces suggest that AI visibility is shaped not just by whether content exists, but by how it is structured, traversed, and incorporated into retrieval or answer formation. This report does not claim to prove any platform's fixed internal architecture. But within a small, bounded study, it does surface a meaningful and underexamined layer of GEO/AEO: the path between prompt and answer.

2. Scope

What This Study Does and Does Not Prove

This study measures observed fetch behavior (which pages AI platform crawlers requested, when, and from where) and citation outcomes (which URLs appeared in AI-generated responses) for a single domain across a single 63-minute study window using 27 prompts.

What we can show

  • Which platforms triggered real-time crawls and what they fetched
  • Correlation between content format, structure, and citation frequency
  • Differences in crawler infrastructure, timing, and request patterns
  • How citation rates varied by prompt type, platform, and content category

What we cannot show

  • Whether any platform “actually” live-crawls in general (small sample)
  • How internal ranking, retrieval, or generation systems work
  • Whether patterns generalize beyond this domain or time window
  • Causal relationships — we observe co-occurrence, not causation
  • How model weights, training data, or retrieval indexes are constructed

All findings are scoped to this dataset. Where we interpret patterns or suggest mechanisms, we flag those as hypotheses.

3. Findings

What follows is a set of observations from this study: notable patterns in AI agent and crawler behavior, and in how crawled data showed up in answers. We focus on the journey from prompt to crawler to pages hit, the latency involved, the size of content retrieved (where visible), and how the response and citations reflected (or didn't reflect) what was fetched.

3.1 Structured hub pages

Structured Hub Pages as Citation Multipliers

The single highest-citation crawl event in the study:

1. Prompt

"Check secondwind.cloud/intelligence — what research has Second Wind published?"

ChatGPT

2. Crawler detected

21:05:46.296Z

ChatGPT-User → GET /intelligence

Microsoft Azure (San Antonio, TX)

+1.032s from probe

↳ CollectionPage, ItemList, internal links (GEO)

3. Citations
  • /intelligence/how-decisions-are-now-made
  • /intelligence/corporate-spend-expense-management-saas-ai-visibility-index
  • /intelligence/what-ai-systems-ranking-each-other-reveals-about-geo
  • /intelligence/geo-isnt-seo-with-a-new-acronym
  • /intelligence (the index itself)
  • / (homepage, cited twice — root and www variant)
  • /technology
  • /decisions
  • /quantify
  • /portfolio

+1 more

4. Response

2,095 bytes

12 citations — all to our domain

ChatGPT fetched the /intelligence index page — a single HTML document listing published research articles with internal links. From that one page crawl, it generated 12 citation instances across 11 distinct URLs on our domain (shown in the flow above).

We did not observe ChatGPT fetching any of those articles individually. The behavior is consistent with the model parsing the index page's internal links and citing discovered URLs without following them — though we cannot confirm this is the mechanism, only that no additional fetches were logged.

Across all platforms in this dataset, the /intelligence index page was cited 20 times — more than any individual article. Compare with the url-llms probe, where ChatGPT fetched a single flat-text file and produced zero citations. Here, structured HTML with internal link architecture correlated with far more citations than plain text.
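The structured data observed on that fetch (CollectionPage, ItemList) can be sketched in code. The snippet below emits schema.org JSON-LD for a hub page; the slugs are drawn from this report, but the markup itself is an illustrative assumption, not the site's actual implementation.

```python
import json

# Hypothetical hub-page markup: a schema.org CollectionPage whose ItemList
# enumerates article URLs. Slugs mirror those cited in this report.
hub = {
    "@context": "https://schema.org",
    "@type": "CollectionPage",
    "url": "https://secondwind.cloud/intelligence",
    "mainEntity": {
        "@type": "ItemList",
        "itemListElement": [
            {
                "@type": "ListItem",
                "position": i + 1,
                "url": f"https://secondwind.cloud/intelligence/{slug}",
            }
            for i, slug in enumerate([
                "how-decisions-are-now-made",
                "corporate-spend-expense-management-saas-ai-visibility-index",
                "what-ai-systems-ranking-each-other-reveals-about-geo",
            ])
        ],
    },
}

# Embed as a JSON-LD script tag in the page head.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(hub, indent=2)
    + "\n</script>"
)
print(script_tag)
```

A block like this gives a crawler both the page type and an explicit, machine-readable list of internal URLs in a single fetch.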

3.2 Architectures

Distinct Fetch-to-Cite Architectures Across Platforms

From our telemetry we can describe response patterns — e.g. "we observed live crawl/fetch here," "we observed no site-side fetch here," "the response looked consistent with indexed or prior knowledge" — but we generally cannot prove a provider's full internal architecture, that "no observed crawl" means no retrieval happened, or that one vendor is always live-fetch and another always cache-only. Below we frame three patterns that fit what we saw, with links to vendor documentation where it clarifies what is possible.

Architecture A — Query-time web retrieval observed (e.g. ChatGPT in this study)
Prompt → search/fetch may occur at answer time → content parsed or excerpted → response + citations

Architecture B — Search-index–mediated retrieval likely (e.g. Perplexity)
Prompt → system consults search/index layer → may optionally fetch supporting pages → response + citations

Architecture C — No live retrieval observed in this run (response pattern, not provider type)
Prompt → prior indexed/cached knowledge or hidden retrieval path → relevant entities/pages recalled or selected → response formed

Architecture A — On-demand web retrieval

Possible live search and/or direct page fetch at answer time. Applies at least sometimes to ChatGPT, Claude, Gemini, and Grok when web-grounding or search features are active (see vendor docs: OpenAI, Claude, Google, xAI). In this study we observed live fetch behavior from ChatGPT (e.g. ChatGPT-User, OAI-SearchBot, GPTBot; Azure and residential IPs). We do not claim a fixed pipeline (e.g. "fetches URL in ~1s, parses HTML + internal links") as the default — only that real-time fetch can and did occur in our run.

Architecture B — Search/index–mediated retrieval

System consults a prebuilt search index and may supplement with fresh fetching or content extraction. This is a reasonable broad description for Perplexity (Perplexity crawler docs). Note: Perplexity states that PerplexityBot is not used to crawl content for AI foundation models; it is used to surface and link websites in Perplexity search results. So we avoid claiming "PerplexityBot crawls proactively to build a per-site index for answering." In our window we saw PerplexityBot hit 9 pages; when study probes fired, we did not see a re-crawl — behavior was consistent with search-index–mediated retrieval, with optional fetch. Robots.txt was fetched 31 minutes after the last crawl, consistent with retroactive compliance check rather than pre-crawl permission.

Architecture C — No observable live retrieval in a given response

The answer appears to rely on prior model knowledge, cached/indexed material, or retrieval we cannot see. This is a response pattern, not a provider-level architecture. Claude, Gemini, and Grok all document web search or web fetch; we did not observe a site-side crawl in our run for those platforms, but that does not mean they are "cached index only" as a fixed design. In this study, 81 responses came from platforms with zero observed crawl events; citation behavior in those responses still varied (see table).

Citation behavior for platforms that in this run showed no observable site-side fetch:

Platform | Citation Rate | Owned Cites | Observed Pattern
---------|---------------|-------------|-----------------
Grok | 81% (22/27) | 60 | Cited broadly; highest owned share (39%).
Gemini | 74% (20/27) | 20 | Cited root domain only — never cited specific subpages.
Claude | 33% (9/27) | 13 | Cited only / and /privacy. Intelligence articles absent. Shortest responses.

Observation: In responses where we saw no live fetch, content that appears to have been in training or indexing pipelines was still cited; content that may exist only on the live site (e.g. case studies, FAQ) was rarely cited. We do not infer that these providers lack web retrieval — only that we did not observe it in this run.

3.3 Quantitative titles

Quantitative Research Titles Correlated with Higher Citation Rates

Not all intelligence articles performed equally. In this dataset, articles with specific, quantitative framing in the title received substantially more citations.

Article (by URL slug) | Citations | Platforms Citing | Platforms That Crawled
----------------------|-----------|------------------|-----------------------
how-we-quantify-ai-influenced-revenue | 18 | Grok (9), Perplexity (9) | Perplexity
corporate-spend-expense-management-saas-ai-visibility-index | 16 | ChatGPT (12), Grok (3), Perplexity (1) | ChatGPT, Perplexity, Google
what-ai-systems-ranking-each-other-reveals-about-geo | 13 | ChatGPT (7), Grok (3), Perplexity (3) | Perplexity
how-decisions-are-now-made | 12 | ChatGPT (10), Grok (2) | Perplexity, Copilot
geo-isnt-seo-with-a-new-acronym | 6 | ChatGPT (4), Grok (2) | —

The top performers share two observable characteristics: (1) Specific quantitative language in the URL — "quantify," "index," "revenue" signal measurable data. (2) Multi-platform crawl interest — the corporate-spend article was independently crawled by ChatGPT, Perplexity, AND Google (via Nexus 5X render pass at 23:10 UTC). The lowest performer (geo-isnt-seo-with-a-new-acronym) uses conceptual/opinion framing and was not crawled by any platform during the observation window.

3.4 AI-readiness

AI-Readiness Content Attracted Strong Citation Behavior

The prompt "Does Second Wind have a dedicated AI-readable version of their site?" achieved a 5/5 citation rate — every platform cited our domain. This was the only non-URL, non-comparison prompt to achieve a perfect score in our dataset.

  • Every platform acknowledged the AI Surface concept and used the phrase "AI-readable" in its response.
  • Grok specifically cited ai.secondwind.cloud.
  • Claude cited us on this prompt despite producing no citations on 18 of 27 others.
  • This prompt also generated the most owned citations from Claude (2) and Grok (7) — in this run we observed no site-side fetch for those platforms, though both document web search or retrieval capabilities.

Interpretation: This finding is consistent with the idea that AI systems have a natural affinity for content describing AI-readable infrastructure — it directly relates to their own data ingestion paradigm. However, this prompt is also highly specific and brand-relevant, which may independently explain the high citation rate. We cannot isolate which factor drives the result.

4. Illustrative traces

Three representative prompt-to-crawler-to-response examples. Below we map the full path from prompt → crawler → pages hit → latency → what we infer was retrieved → response and citations. These three journeys use only events we directly observed; they show how crawled data did or didn't show up in the answer.

Journey 1: The Citation Multiplier (ChatGPT × /intelligence)

1. Prompt

"Check secondwind.cloud/intelligence — what research has Second Wind published?"

ChatGPT

2. Crawler detected

21:05:46.296Z

ChatGPT-User → GET /intelligence

Azure (San Antonio, TX)

+1.032s from probe

↳ CollectionPage, ItemList, internal links (GEO)

3. Citations
  • /intelligence/how-decisions-are-now-made
  • /intelligence/corporate-spend-expense-management-saas-ai-visibility-index
  • /intelligence/what-ai-systems-ranking-each-other-reveals-about-geo
  • /intelligence/geo-isnt-seo-with-a-new-acronym
  • /intelligence
  • /
  • /technology
  • /decisions
  • /quantify
  • /portfolio

+1 more

4. Response

2,095 bytes

12 citations — all to our domain

What we infer was retrieved: The HTML of /intelligence (index page listing 5 research articles with internal links). We did not see fetches for any of the article URLs.

How it affected the answer: The model listed published research and cited 11 distinct URLs on our domain (including article paths it never fetched), consistent with parsing the index's link structure and citing those URLs without following them. One crawl produced 12 citation instances; the content of that single page was enough to drive the whole citation set.
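The inferred mechanism (deriving many citable URLs from a single fetched index page) can be sketched with a same-site link extractor. This is a hypothetical illustration of the parsing step, not the platform's actual parser, and the sample HTML is a simplified stand-in for the real page:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect distinct same-site hrefs from one fetched HTML document."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute = urljoin(self.base, href)
                # Keep only URLs on the fetched domain, deduplicated.
                if absolute.startswith(self.base) and absolute not in self.links:
                    self.links.append(absolute)

# Simplified stand-in for the /intelligence index page (not the real markup).
index_html = """
<main>
  <a href="/intelligence/how-decisions-are-now-made">How Decisions Are Now Made</a>
  <a href="/intelligence/geo-isnt-seo-with-a-new-acronym">GEO Isn't SEO</a>
  <a href="/technology">Technology</a>
  <a href="https://example.com/elsewhere">External</a>
</main>
"""

collector = LinkCollector("https://secondwind.cloud")
collector.feed(index_html)
print(collector.links)  # three same-site URLs; the external link is dropped
```

Every URL collected this way is citable without any further fetch, which matches the one-fetch, 12-citation pattern observed in this journey.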

Journey 2: The Consumed-But-Not-Cited File (ChatGPT × /llms.txt)

1. Prompt

"Read secondwind.cloud/llms.txt — what does Second Wind provide there for AI systems?"

ChatGPT

2. Crawler detected

21:09:46.303Z

ChatGPT-User → GET /llms.txt

Azure (San Antonio, TX)

Middleware 2ms

Requests HTML accept headers for a .txt file

3. Citations

0 citations to our domain · 12 total (other "Second Wind" entities)

4. Response

468 bytes (shortest in study)

0 citations to our domain

Described AI Surface correctly but included Second Wind Pro — wrong entity.

What we infer was retrieved: The full llms.txt file (plain text; small byte size). Response was 468 bytes — shortest in the study.

How it affected the answer: The model accurately described the AI Surface concept and what Second Wind provides for AI systems, but produced zero citations to our domain and included the wrong entity (Second Wind Pro). So: content was consumed as context and shaped the answer text, but was not treated as a citable source, and did not resolve entity disambiguation. Contrast with Journey 1: Same provider, ~1s latency. Structured HTML with internal links → 12 owned citations. Plain text llms.txt → 0 owned citations. In this pair, content format and structure correlated strongly with citation outcome.

Journey 3: The Redirected Crawl (Perplexity × /llms.txt prompt)

1. Prompt

"Read secondwind.cloud/llms.txt — what does Second Wind provide there for AI systems?"

Perplexity

2. Crawler detected

21:10:33.433Z

PerplexityBot → GET /intelligence

AWS us-east-1 (Ashburn, VA)

~5.8s from probe

↳ CollectionPage, ItemList, internal links (GEO)

Fetched /intelligence, NOT /llms.txt

3. Citations
  • secondwind.cloud (root)
  • /privacy
  • /intelligence/how-we-quantify-ai-influenced-revenue
  • /intelligence
4. Response

1,639 bytes

4 citations to our domain · 10 total

Stated we don't provide llms.txt; citations from pre-built index.

What we infer was retrieved: The /intelligence index page (HTML), not llms.txt. Latency from probe to crawl ~5.8s.

How it affected the answer: The model stated that "Second Wind does not provide publicly accessible content at secondwind.cloud/llms.txt based on available search results" — so the crawl target and the answer were misaligned with the user's request. Yet it still cited 4 of our URLs (root, /privacy, one intelligence article, /intelligence), likely from its pre-built index. So: the real-time fetch went to a different URL than requested; the answer reflected that (denying llms.txt) while citations still drew on previously indexed pages.

5. Detailed data

5.1 Probe config

Platform | Model | Tool use | Probes
---------|-------|----------|-------
ChatGPT | gpt-4o-mini-2024-07-18 | Web search (Bing) | 27
Gemini | gemini-2.5-flash | Google Search | 27
Perplexity | sonar | Perplexity Search | 27
Claude | claude-haiku-4-5-20251001 | Web search | 27
Grok | grok-4-1-fast-reasoning | Grok Search | 27

27 prompts across 8 families: brand (3), intelligence (5), case-study (2), surface/competitive (5), url-explicit (4), evaluation (3), use-case (3), trust (2). Each fired sequentially across all 5 platforms.
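The sequential firing described above can be sketched as a minimal probe harness. `send_prompt` is a placeholder, not any vendor's real client, and the two prompts shown abbreviate the study's 27:

```python
from datetime import datetime, timezone

# Hypothetical harness: each prompt fires sequentially across all platforms,
# with a UTC timestamp recorded so crawl events can be correlated later.
PLATFORMS = ["chatgpt", "gemini", "perplexity", "claude", "grok"]
PROMPTS = {  # abbreviated; the study used 8 families totalling 27 prompts
    "brand-whatis": "What is Second Wind?",
    "url-intelligence": "Check secondwind.cloud/intelligence — what research "
                        "has Second Wind published?",
}

def send_prompt(platform: str, text: str) -> str:
    """Stand-in for a per-platform API call (placeholder only)."""
    return f"[{platform}] response to: {text[:30]}"

probe_log = []
for prompt_id, text in PROMPTS.items():
    for platform in PLATFORMS:  # each prompt fires across all 5 platforms
        fired_at = datetime.now(timezone.utc)
        response = send_prompt(platform, text)
        probe_log.append({
            "prompt": prompt_id,
            "platform": platform,
            "fired_at": fired_at,
            "response": response,
        })

print(len(probe_log))  # 2 prompts × 5 platforms = 10 probes in this sketch
```

Recording `fired_at` per probe is what makes probe-to-crawl latency (e.g. the +1.032s ChatGPT fetch) computable from site-side telemetry.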

5.2 Platform tables

Mention & Citation Rates

Overall: 100% mention rate (135/135) · 67% citation rate (91/135) · 23.8% owned citation share (228/958)

Platform | Mentions | Citations | Owned Cites | Owned % | Avg Response
---------|----------|-----------|-------------|---------|-------------
ChatGPT | 27/27 (100%) | 23/27 (85%) | 97 | 26.0% | 2,187 bytes
Gemini | 27/27 (100%) | 20/27 (74%) | 20 | 13.0% | 2,932 bytes
Perplexity | 27/27 (100%) | 17/27 (63%) | 38 | 17.6% | 2,113 bytes
Claude | 27/27 (100%) | 9/27 (33%) | 13 | 21.0% | 1,384 bytes
Grok | 27/27 (100%) | 22/27 (81%) | 60 | 39.2% | 3,482 bytes

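As a sanity check, the headline rates can be recomputed from the per-platform counts (values taken from the table above; the 958 total-citations figure is the one reported in the overall line for this section):

```python
# Per-platform responses containing at least one citation, and owned cites.
citations = {"ChatGPT": 23, "Gemini": 20, "Perplexity": 17, "Claude": 9, "Grok": 22}
owned     = {"ChatGPT": 97, "Gemini": 20, "Perplexity": 38, "Claude": 13, "Grok": 60}

probes_per_platform = 27
total_probes = probes_per_platform * len(citations)   # 135
responses_with_citation = sum(citations.values())     # 91
citation_rate = responses_with_citation / total_probes

total_owned = sum(owned.values())                     # 228
total_citations_all = 958                             # reported study-wide total
owned_share = total_owned / total_citations_all

print(f"{responses_with_citation}/{total_probes} = {citation_rate:.0%}")   # 91/135 = 67%
print(f"{total_owned}/{total_citations_all} = {owned_share:.1%}")          # 228/958 = 23.8%
```

Both figures match the overall rates reported at the top of this section.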
5.3 Citation charts

Pages Cited by Platform

  • ChatGPT — 97 citations: / (21), /technology (14), /intelligence/corporate-spend- (12), /intelligence/how-decisions- (10), other (40)
  • Grok — 60 citations: / (20), /intelligence/how-we-quantify- (9), /intelligence (6), other (25)
  • Perplexity — 38 citations: /intelligence/how-we-quantify- (9), /intelligence (8), / (7), other (14)
  • Gemini — 20 citations: root only (20)
  • Claude — 13 citations: / (7), /privacy (6)

Explicit URL Probes vs. Organic Discovery

Avg owned cites per probe: Explicit URL 2.4, Organic 1.6. ChatGPT is the only platform where organic probes produced a higher citation rate than explicit URL probes. Claude saw the largest boost from explicit URLs: citation rate nearly tripled from 26% to 75%.

5.4 Citation grid

Citation Grid — Per Prompt × Platform

CITE = cited our URL · MENT = mentioned but no citation. Perfect scores (5/5): surface-organic, url-homepage, url-surface, eval-compare, usecase-finserv, trust-privacy. Zero citations (0/5): eval-faq, trust-legitimacy.

Prompt | Family | ChatGPT | Gemini | Perplexity | Claude | Grok | Score
-------|--------|---------|--------|------------|--------|------|------
brand-whatis | brand | CITE | MENT | CITE | MENT | MENT | 2/5
brand-founders | brand | CITE | CITE | MENT | CITE | CITE | 4/5
brand-reviews | brand | CITE | CITE | MENT | CITE | CITE | 4/5
intel-geo-vs-seo | intelligence | CITE | MENT | CITE | MENT | CITE | 3/5
intel-quantify-revenue | intelligence | CITE | CITE | CITE | MENT | CITE | 4/5
intel-decisions | intelligence | CITE | CITE | CITE | MENT | CITE | 4/5
intel-ai-rankings | intelligence | CITE | CITE | CITE | MENT | CITE | 4/5
intel-spend-index | intelligence | CITE | CITE | CITE | MENT | CITE | 4/5
case-tuck | case-study | CITE | MENT | MENT | MENT | MENT | 1/5
case-general | case-study | MENT | CITE | MENT | MENT | MENT | 1/5
surface-organic | surface | CITE | CITE | CITE | CITE | CITE | 5/5
surface-compare-profound | surface | CITE | MENT | CITE | MENT | CITE | 3/5
surface-compare-huckabuy | surface | CITE | CITE | MENT | MENT | CITE | 3/5
surface-compare-scrunch | surface | CITE | CITE | CITE | MENT | CITE | 4/5
surface-geo-landscape | surface | CITE | CITE | MENT | MENT | CITE | 3/5
url-homepage | url-explicit | CITE | CITE | CITE | CITE | CITE | 5/5
url-intelligence | url-explicit | CITE | CITE | MENT | CITE | CITE | 4/5
url-surface | url-explicit | CITE | CITE | CITE | CITE | CITE | 5/5
url-llms | url-explicit | MENT | CITE | CITE | MENT | CITE | 3/5
eval-compare | evaluation | CITE | CITE | CITE | CITE | CITE | 5/5
eval-pricing | evaluation | CITE | CITE | MENT | MENT | CITE | 3/5
eval-faq | evaluation | MENT | MENT | MENT | MENT | MENT | 0/5
usecase-saas | use-case | CITE | CITE | CITE | MENT | CITE | 4/5
usecase-enterprise | use-case | CITE | MENT | CITE | MENT | CITE | 3/5
usecase-finserv | use-case | CITE | CITE | CITE | CITE | CITE | 5/5
trust-legitimacy | trust | MENT | MENT | MENT | MENT | MENT | 0/5
trust-privacy | trust | CITE | CITE | CITE | CITE | CITE | 5/5

Data Volume

Platform | Response Data | Tokens Used
---------|---------------|------------
ChatGPT | 57.7 KB | 242,362
Gemini | 77.3 KB | —
Perplexity | 55.7 KB | 13,160
Claude | 36.5 KB | 694,774
Grok | 91.8 KB | 445,025
TOTAL | 319.1 KB | 1,395,321

6. Methodology

  • Probe design: 27 prompts across 8 families fired sequentially across 5 platforms (135 total probes).
  • Crawl observation: 15-second observation window after each probe, supplemented by post-hoc telemetry correlation for the full study duration and 3-hour post-study window.
  • Crawler detection: Real-time middleware using UA string matching, behavioral fingerprinting, and IP verification. Each event assigned a confidence score and match method.
  • Study duration: 63 minutes active (20:27–21:31 UTC), post-study observation through 00:28 UTC.
  • Success rate: All 135 probes completed successfully.
  • Entity confusion: Manual review of all 135 response texts, flagging responses that described entities other than the target company.
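The crawler-detection step described above (UA string matching, IP verification, per-event confidence scores) can be sketched as a small classification function. The patterns and CIDR ranges below are illustrative assumptions, not the study's actual rule set; published crawler IP ranges change and should be taken from each vendor's documentation.

```python
import re
from ipaddress import ip_address, ip_network

# Example UA patterns for crawlers named in this study.
UA_PATTERNS = {
    "chatgpt-user": re.compile(r"ChatGPT-User", re.I),
    "oai-searchbot": re.compile(r"OAI-SearchBot", re.I),
    "gptbot": re.compile(r"GPTBot", re.I),
    "perplexitybot": re.compile(r"PerplexityBot", re.I),
}

# Placeholder CIDR ranges for illustration only (not real published ranges).
KNOWN_RANGES = {
    "perplexitybot": [ip_network("18.0.0.0/8")],
}

def classify(user_agent: str, remote_ip: str) -> dict:
    """Assign a crawler label, match method, and confidence to one request."""
    for name, pattern in UA_PATTERNS.items():
        if pattern.search(user_agent):
            ranges = KNOWN_RANGES.get(name, [])
            ip_ok = any(ip_address(remote_ip) in net for net in ranges)
            return {
                "crawler": name,
                "match_method": "ua+ip" if ip_ok else "ua-only",
                "confidence": 0.95 if ip_ok else 0.7,  # arbitrary example scores
            }
    return {"crawler": None, "match_method": None, "confidence": 0.0}

event = classify("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "18.204.1.2")
print(event)
```

Running this in middleware on every request is what lets each crawl event carry a confidence score and match method, as the methodology describes.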
6.1 Limitations

  • Single domain; results may not generalize.
  • Single 63-minute session — a snapshot, not longitudinal behavior.
  • 16 observed crawl events; statistical significance would require larger samples.
  • Correlation, not causation — we cannot confirm that probes caused crawls or that crawls caused citation patterns.
  • Telemetry captures requests to our domain only; platforms may fetch other sources we cannot observe.
  • Results are specific to the model versions tested and may not reflect newer or different versions.
7. Practical takeaways

Practical Takeaways for Site Operators

  1. Build structured hub pages that expose citable internal links. In this study, a single index page with clear link architecture and structured data (e.g. CollectionPage, ItemList) generated more citations than any individual article. Structure matters as much as content.
  2. Use llms.txt for representation, not as a citation strategy. The data suggests llms.txt shapes how AI systems describe you — which has value — but does not generate citation backlinks. Treat it as a context layer, not a link-building tool.
  3. Publish domain-specific content that disambiguates your entity. Responses that drew on unique, specific content (research articles, quantitative analysis) showed less entity confusion than those relying on generic homepage copy.
  4. Don't assume uniform behavior across platforms. In this study we observed: ChatGPT fetching pages in real time and appearing to use internal links from a single fetch; Perplexity drawing on a search index and sometimes fetching at query time; Grok, Claude, and Gemini citing without any observed site-side crawl in our window — though all document web search or retrieval. A strategy optimized for one pattern may underperform on another.
  5. Quantitative, specific framing correlates with higher citation rates. Articles with measurable claims in the title outperformed conceptual or opinion-framed pieces in this dataset.
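On takeaway 2: a minimal llms.txt follows the community-proposed convention (an H1 project name, a blockquote summary, then markdown sections of annotated links). The file below is illustrative only, not the site's actual llms.txt:

```markdown
# Second Wind

> Short, factual description of the company, written to be consumed as model context.

## Research

- [How We Quantify AI-Influenced Revenue](https://secondwind.cloud/intelligence/how-we-quantify-ai-influenced-revenue): methodology for attributing revenue to AI answers
```

Consistent with the findings here, treat this file as a way to shape how systems describe you; keep the citable link architecture in structured HTML hub pages.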

Study conducted by Second Wind. Raw telemetry data and analysis scripts available on request.