Telemetry Study

From Prompt to Crawler to Response: Tracing the Full Lifecycle of an AI Query

Tracing the full lifecycle of an AI query — prompt → crawler request → pages fetched → response and citations — across ChatGPT, Gemini, Perplexity, Claude, and Grok. This kind of end-to-end behavioral tracking is an underexamined layer of GEO/AEO.

Marty Coleman
CEO, Second Wind
1. Executive Takeaway

We traced the full lifecycle of AI queries across five major AI platforms, from prompt to crawler behavior to fetched pages to final response and citations. To do that, we instrumented both sides of the interaction: a probe layer that issued the queries and website-side telemetry that captured crawler and agent activity as it happened.

From there, we looked not just at what the systems answered, but at the observable path by which those answers appeared to form. In this dataset, the biggest pattern was not simply whether a platform appeared to fetch content in real time. It was that the same domain could be processed very differently depending on the system and the content format involved.

In one case, a single fetch of a structured HTML hub page with internal links was followed by 12 citations to our domain. In another, a fetch of llms.txt appeared to shape answer content but produced zero citations. In other responses, platforms cited pages on the domain despite no observable site-side crawl during the run.

Taken together, these traces suggest that AI visibility is shaped not just by whether content exists, but by how it is structured, traversed, and incorporated into retrieval or answer formation. This report does not claim to prove any platform's fixed internal architecture. But within a small, bounded study, it does surface a meaningful and underexamined layer of GEO/AEO: the path between prompt and answer.

2. Scope

What This Study Does and Does Not Prove

This study measures observed fetch behavior (which pages AI platform crawlers requested, when, and from where) and citation outcomes (which URLs appeared in AI-generated responses) for a single domain across a single 63-minute study window using 27 prompts.

What we can show

  • Which platforms triggered real-time crawls and what they fetched
  • Correlation between content format, structure, and citation frequency
  • Differences in crawler infrastructure, timing, and request patterns
  • How citation rates varied by prompt type, platform, and content category

What we cannot show

  • Whether any platform “actually” live-crawls in general (small sample)
  • How internal ranking, retrieval, or generation systems work
  • Whether patterns generalize beyond this domain or time window
  • Causal relationships — we observe co-occurrence, not causation
  • How model weights, training data, or retrieval indexes are constructed

All findings are scoped to this dataset. Where we interpret patterns or suggest mechanisms, we flag those as hypotheses.

3. Findings

What follows is a set of observations from this study: notable patterns in AI agent and crawler behavior, and in how crawled data showed up in answers. We focus on the journey from prompt to crawler to pages hit, the latency involved, the size of content retrieved (where visible), and how the response and citations reflected (or didn't reflect) what was fetched.

3.1 Structured hub pages

Structured Hub Pages as Citation Multipliers

The single highest-citation crawl event in the study:

1. Prompt

"Check secondwind.cloud/intelligence — what research has Second Wind published?"

ChatGPT

2. Crawler detected

21:05:46.296Z

ChatGPT-User → GET /intelligence

Microsoft Azure (San Antonio, TX)

+1.032s from probe

↳ CollectionPage, ItemList, internal links (GEO)

3. Citations
  • /intelligence/how-decisions-are-now-made
  • /intelligence/corporate-spend-expense-management-saas-ai-visibility-index
  • /intelligence/what-ai-systems-ranking-each-other-reveals-about-geo
  • /intelligence/geo-isnt-seo-with-a-new-acronym
  • /intelligence (the index itself)
  • / (homepage, cited twice — root and www variant)
  • /technology
  • /decisions
  • /quantify
  • /portfolio

+1 more

4. Response

2,095 bytes

12 citations — all to our domain

ChatGPT fetched the /intelligence index page — a single HTML document listing published research articles with internal links. From that one page crawl, it generated 12 citation instances across 11 distinct URLs on our domain (shown in the flow above).

We did not observe ChatGPT fetching any of those articles individually. The behavior is consistent with the model parsing the index page's internal links and citing discovered URLs without following them — though we cannot confirm this is the mechanism, only that no additional fetches were logged.

Across all platforms in this dataset, the /intelligence index page was cited 20 times — more than any individual article. Compare with the url-llms probe, where ChatGPT fetched a single flat-text file and produced zero citations. Here, structured HTML with internal link architecture correlated with far more citations than plain text.
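The structured data observed on that fetch (CollectionPage, ItemList) can be sketched in code. The snippet below emits schema.org JSON-LD for a hub page; the slugs are drawn from this report, but the markup itself is an illustrative assumption, not the site's actual implementation.

```python
import json

# Hypothetical hub-page markup: a schema.org CollectionPage whose ItemList
# enumerates article URLs. Slugs mirror those cited in this report.
hub = {
    "@context": "https://schema.org",
    "@type": "CollectionPage",
    "url": "https://secondwind.cloud/intelligence",
    "mainEntity": {
        "@type": "ItemList",
        "itemListElement": [
            {
                "@type": "ListItem",
                "position": i + 1,
                "url": f"https://secondwind.cloud/intelligence/{slug}",
            }
            for i, slug in enumerate([
                "how-decisions-are-now-made",
                "corporate-spend-expense-management-saas-ai-visibility-index",
                "what-ai-systems-ranking-each-other-reveals-about-geo",
            ])
        ],
    },
}

# Embed as a JSON-LD script tag in the page head.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(hub, indent=2)
    + "\n</script>"
)
print(script_tag)
```

A block like this gives a crawler both the page type and an explicit, machine-readable list of internal URLs in a single fetch.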

3.2 Architectures

Distinct Fetch-to-Cite Architectures Across Platforms

From our telemetry we can describe response patterns — e.g. "we observed live crawl/fetch here," "we observed no site-side fetch here," "the response looked consistent with indexed or prior knowledge" — but we generally cannot prove a provider's full internal architecture, that "no observed crawl" means no retrieval happened, or that one vendor is always live-fetch and another always cache-only. Below we frame three patterns that fit what we saw, with links to vendor documentation where it clarifies what is possible.

Architecture A — Query-time web retrieval observed (e.g. ChatGPT in this study)
Prompt → search/fetch may occur at answer time → content parsed or excerpted → response + citations

Architecture B — Search-index–mediated retrieval likely (e.g. Perplexity)
Prompt → system consults search/index layer → may optionally fetch supporting pages → response + citations

Architecture C — No live retrieval observed in this run (response pattern, not provider type)
Prompt → prior indexed/cached knowledge or hidden retrieval path → relevant entities/pages recalled or selected → response formed

Architecture A — On-demand web retrieval

Possible live search and/or direct page fetch at answer time. Applies at least sometimes to ChatGPT, Claude, Gemini, and Grok when web-grounding or search features are active (see vendor docs: OpenAI, Claude, Google, xAI). In this study we observed live fetch behavior from ChatGPT (e.g. ChatGPT-User, OAI-SearchBot, GPTBot; Azure and residential IPs). We do not claim a fixed pipeline (e.g. "fetches URL in ~1s, parses HTML + internal links") as the default — only that real-time fetch can and did occur in our run.

Architecture B — Search/index–mediated retrieval

System consults a prebuilt search index and may supplement with fresh fetching or content extraction. This is a reasonable broad description for Perplexity (Perplexity crawler docs). Note: Perplexity states that PerplexityBot is not used to crawl content for AI foundation models; it is used to surface and link websites in Perplexity search results. So we avoid claiming "PerplexityBot crawls proactively to build a per-site index for answering." In our window we saw PerplexityBot hit 9 pages; when study probes fired, we did not see a re-crawl — behavior was consistent with search-index–mediated retrieval, with optional fetch. Robots.txt was fetched 31 minutes after the last crawl, consistent with retroactive compliance check rather than pre-crawl permission.

Architecture C — No observable live retrieval in a given response

The answer appears to rely on prior model knowledge, cached/indexed material, or retrieval we cannot see. This is a response pattern, not a provider-level architecture. Claude, Gemini, and Grok all document web search or web fetch; we did not observe a site-side crawl in our run for those platforms, but that does not mean they are "cached index only" as a fixed design. In this study, 81 responses came from platforms with zero observed crawl events; citation behavior in those responses still varied (see table).

Citation behavior for platforms that in this run showed no observable site-side fetch:

Platform | Citation Rate | Owned Cites | Observed Pattern
---------|---------------|-------------|-----------------
Grok | 81% (22/27) | 60 | Cited broadly; highest owned share (39%).
Gemini | 74% (20/27) | 20 | Cited root domain only — never cited specific subpages.
Claude | 33% (9/27) | 13 | Cited only / and /privacy. Intelligence articles absent. Shortest responses.

Observation: In responses where we saw no live fetch, content that appears to have been in training or indexing pipelines was still cited; content that may exist only on the live site (e.g. case studies, FAQ) was rarely cited. We do not infer that these providers lack web retrieval — only that we did not observe it in this run.

3.3 Quantitative titles

Quantitative Research Titles Correlated with Higher Citation Rates

Not all intelligence articles performed equally. In this dataset, articles with specific, quantitative framing in the title received substantially more citations.

Article (by URL slug) | Citations | Platforms Citing | Platforms That Crawled
----------------------|-----------|------------------|-----------------------
how-we-quantify-ai-influenced-revenue | 18 | Grok (9), Perplexity (9) | Perplexity
corporate-spend-expense-management-saas-ai-visibility-index | 16 | ChatGPT (12), Grok (3), Perplexity (1) | ChatGPT, Perplexity, Google
what-ai-systems-ranking-each-other-reveals-about-geo | 13 | ChatGPT (7), Grok (3), Perplexity (3) | Perplexity
how-decisions-are-now-made | 12 | ChatGPT (10), Grok (2) | Perplexity, Copilot
geo-isnt-seo-with-a-new-acronym | 6 | ChatGPT (4), Grok (2) | —

The top performers share two observable characteristics: (1) Specific quantitative language in the URL — "quantify," "index," "revenue" signal measurable data. (2) Multi-platform crawl interest — the corporate-spend article was independently crawled by ChatGPT, Perplexity, AND Google (via Nexus 5X render pass at 23:10 UTC). The lowest performer (geo-isnt-seo-with-a-new-acronym) uses conceptual/opinion framing and was not crawled by any platform during the observation window.

3.4 AI-readiness

AI-Readiness Content Attracted Strong Citation Behavior

The prompt "Does Second Wind have a dedicated AI-readable version of their site?" achieved a 5/5 citation rate — every platform cited our domain. This was the only non-URL, non-comparison prompt to achieve a perfect score in our dataset.

  • Every platform acknowledged the AI Surface concept and used the phrase "AI-readable" in its response.
  • Grok specifically cited ai.secondwind.cloud.
  • Claude cited us on this prompt despite producing no citations on 18 of 27 others.
  • This prompt also generated the most owned citations from Claude (2) and Grok (7) — in this run we observed no site-side fetch for those platforms, though both document web search or retrieval capabilities.

Interpretation: This finding is consistent with the idea that AI systems have a natural affinity for content describing AI-readable infrastructure — it directly relates to their own data ingestion paradigm. However, this prompt is also highly specific and brand-relevant, which may independently explain the high citation rate. We cannot isolate which factor drives the result.

4. Illustrative traces

Three representative prompt-to-crawler-to-response examples. Below we map the full path from prompt → crawler → pages hit → latency → what we infer was retrieved → response and citations. These three journeys use only events we directly observed; they show how crawled data did or didn't show up in the answer.

Journey 1: The Citation Multiplier (ChatGPT × /intelligence)

1. Prompt

"Check secondwind.cloud/intelligence — what research has Second Wind published?"

ChatGPT

2. Crawler detected

21:05:46.296Z

ChatGPT-User → GET /intelligence

Azure (San Antonio, TX)

+1.032s from probe

↳ CollectionPage, ItemList, internal links (GEO)

3. Citations
  • /intelligence/how-decisions-are-now-made
  • /intelligence/corporate-spend-expense-management-saas-ai-visibility-index
  • /intelligence/what-ai-systems-ranking-each-other-reveals-about-geo
  • /intelligence/geo-isnt-seo-with-a-new-acronym
  • /intelligence
  • /
  • /technology
  • /decisions
  • /quantify
  • /portfolio

+1 more

4. Response

2,095 bytes

12 citations — all to our domain

What we infer was retrieved: The HTML of /intelligence (index page listing 5 research articles with internal links). We did not see fetches for any of the article URLs.

How it affected the answer: The model listed published research and cited 11 distinct URLs on our domain (including article paths it never fetched), consistent with parsing the index's link structure and citing those URLs without following them. One crawl produced 12 citation instances; the content of that single page was enough to drive the whole citation set.
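The inferred mechanism (deriving many citable URLs from a single fetched index page) can be sketched with a same-site link extractor. This is a hypothetical illustration of the parsing step, not the platform's actual parser, and the sample HTML is a simplified stand-in for the real page:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect distinct same-site hrefs from one fetched HTML document."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute = urljoin(self.base, href)
                # Keep only URLs on the fetched domain, deduplicated.
                if absolute.startswith(self.base) and absolute not in self.links:
                    self.links.append(absolute)

# Simplified stand-in for the /intelligence index page (not the real markup).
index_html = """
<main>
  <a href="/intelligence/how-decisions-are-now-made">How Decisions Are Now Made</a>
  <a href="/intelligence/geo-isnt-seo-with-a-new-acronym">GEO Isn't SEO</a>
  <a href="/technology">Technology</a>
  <a href="https://example.com/elsewhere">External</a>
</main>
"""

collector = LinkCollector("https://secondwind.cloud")
collector.feed(index_html)
print(collector.links)  # three same-site URLs; the external link is dropped
```

Every URL collected this way is citable without any further fetch, which matches the one-fetch, 12-citation pattern observed in this journey.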

Journey 2: The Consumed-But-Not-Cited File (ChatGPT × /llms.txt)

1. Prompt

"Read secondwind.cloud/llms.txt — what does Second Wind provide there for AI systems?"

ChatGPT

2. Crawler detected

21:09:46.303Z

ChatGPT-User → GET /llms.txt

Azure (San Antonio, TX)

Middleware 2ms

Requests HTML accept headers for a .txt file

3. Citations

0 citations to our domain · 12 total (other "Second Wind" entities)

4. Response

468 bytes (shortest in study)

0 citations to our domain

Described AI Surface correctly but included Second Wind Pro — wrong entity.

What we infer was retrieved: The full llms.txt file (plain text; small byte size). Response was 468 bytes — shortest in the study.

How it affected the answer: The model accurately described the AI Surface concept and what Second Wind provides for AI systems, but produced zero citations to our domain and included the wrong entity (Second Wind Pro). So: content was consumed as context and shaped the answer text, but was not treated as a citable source, and did not resolve entity disambiguation. Contrast with Journey 1: Same provider, ~1s latency. Structured HTML with internal links → 12 owned citations. Plain text llms.txt → 0 owned citations. In this pair, content format and structure correlated strongly with citation outcome.

Journey 3: The Redirected Crawl (Perplexity × /llms.txt prompt)

1. Prompt

"Read secondwind.cloud/llms.txt — what does Second Wind provide there for AI systems?"

Perplexity

2. Crawler detected

21:10:33.433Z

PerplexityBot → GET /intelligence

AWS us-east-1 (Ashburn, VA)

~5.8s from probe

↳ CollectionPage, ItemList, internal links (GEO)

Fetched /intelligence, NOT /llms.txt

3. Citations
  • secondwind.cloud (root)
  • /privacy
  • /intelligence/how-we-quantify-ai-influenced-revenue
  • /intelligence
4. Response

1,639 bytes

4 citations to our domain · 10 total

Stated we don't provide llms.txt; citations from pre-built index.

What we infer was retrieved: The /intelligence index page (HTML), not llms.txt. Latency from probe to crawl ~5.8s.

How it affected the answer: The model stated that "Second Wind does not provide publicly accessible content at secondwind.cloud/llms.txt based on available search results" — so the crawl target and the answer were misaligned with the user's request. Yet it still cited 4 of our URLs (root, /privacy, one intelligence article, /intelligence), likely from its pre-built index. So: the real-time fetch went to a different URL than requested; the answer reflected that (denying llms.txt) while citations still drew on previously indexed pages.

5. Detailed data

5.1 Probe config

Platform | Model | Tool use | Probes
---------|-------|----------|-------
ChatGPT | gpt-4o-mini-2024-07-18 | Web search (Bing) | 27
Gemini | gemini-2.5-flash | Google Search | 27
Perplexity | sonar | Perplexity Search | 27
Claude | claude-haiku-4-5-20251001 | Web search | 27
Grok | grok-4-1-fast-reasoning | Grok Search | 27

27 prompts across 8 families: brand (3), intelligence (5), case-study (2), surface/competitive (5), url-explicit (4), evaluation (3), use-case (3), trust (2). Each fired sequentially across all 5 platforms.
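The sequential firing described above can be sketched as a minimal probe harness. `send_prompt` is a placeholder, not any vendor's real client, and the two prompts shown abbreviate the study's 27:

```python
from datetime import datetime, timezone

# Hypothetical harness: each prompt fires sequentially across all platforms,
# with a UTC timestamp recorded so crawl events can be correlated later.
PLATFORMS = ["chatgpt", "gemini", "perplexity", "claude", "grok"]
PROMPTS = {  # abbreviated; the study used 8 families totalling 27 prompts
    "brand-whatis": "What is Second Wind?",
    "url-intelligence": "Check secondwind.cloud/intelligence — what research "
                        "has Second Wind published?",
}

def send_prompt(platform: str, text: str) -> str:
    """Stand-in for a per-platform API call (placeholder only)."""
    return f"[{platform}] response to: {text[:30]}"

probe_log = []
for prompt_id, text in PROMPTS.items():
    for platform in PLATFORMS:  # each prompt fires across all 5 platforms
        fired_at = datetime.now(timezone.utc)
        response = send_prompt(platform, text)
        probe_log.append({
            "prompt": prompt_id,
            "platform": platform,
            "fired_at": fired_at,
            "response": response,
        })

print(len(probe_log))  # 2 prompts × 5 platforms = 10 probes in this sketch
```

Recording `fired_at` per probe is what makes probe-to-crawl latency (e.g. the +1.032s ChatGPT fetch) computable from site-side telemetry.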

5.2 Platform tables

Mention & Citation Rates

Overall: 100% mention rate (135/135) · 67% citation rate (91/135) · 23.8% owned citation share (228/958)

Platform | Mentions | Citations | Owned Cites | Owned % | Avg Response
---------|----------|-----------|-------------|---------|-------------
ChatGPT | 27/27 (100%) | 23/27 (85%) | 97 | 26.0% | 2,187 bytes
Gemini | 27/27 (100%) | 20/27 (74%) | 20 | 13.0% | 2,932 bytes
Perplexity | 27/27 (100%) | 17/27 (63%) | 38 | 17.6% | 2,113 bytes
Claude | 27/27 (100%) | 9/27 (33%) | 13 | 21.0% | 1,384 bytes
Grok | 27/27 (100%) | 22/27 (81%) | 60 | 39.2% | 3,482 bytes

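As a sanity check, the headline rates can be recomputed from the per-platform counts (values taken from the table above; the 958 total-citations figure is the one reported in the overall line for this section):

```python
# Per-platform responses containing at least one citation, and owned cites.
citations = {"ChatGPT": 23, "Gemini": 20, "Perplexity": 17, "Claude": 9, "Grok": 22}
owned     = {"ChatGPT": 97, "Gemini": 20, "Perplexity": 38, "Claude": 13, "Grok": 60}

probes_per_platform = 27
total_probes = probes_per_platform * len(citations)   # 135
responses_with_citation = sum(citations.values())     # 91
citation_rate = responses_with_citation / total_probes

total_owned = sum(owned.values())                     # 228
total_citations_all = 958                             # reported study-wide total
owned_share = total_owned / total_citations_all

print(f"{responses_with_citation}/{total_probes} = {citation_rate:.0%}")   # 91/135 = 67%
print(f"{total_owned}/{total_citations_all} = {owned_share:.1%}")          # 228/958 = 23.8%
```

Both figures match the overall rates reported at the top of this section.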
5.3 Citation charts

Pages Cited by Platform

  • ChatGPT — 97 citations: / (21), /technology (14), /intelligence/corporate-spend- (12), /intelligence/how-decisions- (10), other (40)
  • Grok — 60 citations: / (20), /intelligence/how-we-quantify- (9), /intelligence (6), other (25)
  • Perplexity — 38 citations: /intelligence/how-we-quantify- (9), /intelligence (8), / (7), other (14)
  • Gemini — 20 citations: root only (20)
  • Claude — 13 citations: / (7), /privacy (6)

Explicit URL Probes vs. Organic Discovery

Avg owned cites per probe: Explicit URL 2.4, Organic 1.6. ChatGPT is the only platform where organic probes produced a higher citation rate than explicit URL probes. Claude saw the largest boost from explicit URLs: citation rate nearly tripled from 26% to 75%.

5.4 Citation grid

Citation Grid — Per Prompt × Platform

CITE = cited our URL · MENT = mentioned but no citation. Perfect scores (5/5): surface-organic, url-homepage, url-surface, eval-compare, usecase-finserv, trust-privacy. Zero citations (0/5): eval-faq, trust-legitimacy.

Prompt | Family | ChatGPT | Gemini | Perplexity | Claude | Grok | Score
-------|--------|---------|--------|------------|--------|------|------
brand-whatis | brand | CITE | MENT | CITE | MENT | MENT | 2/5
brand-founders | brand | CITE | CITE | MENT | CITE | CITE | 4/5
brand-reviews | brand | CITE | CITE | MENT | CITE | CITE | 4/5
intel-geo-vs-seo | intelligence | CITE | MENT | CITE | MENT | CITE | 3/5
intel-quantify-revenue | intelligence | CITE | CITE | CITE | MENT | CITE | 4/5
intel-decisions | intelligence | CITE | CITE | CITE | MENT | CITE | 4/5
intel-ai-rankings | intelligence | CITE | CITE | CITE | MENT | CITE | 4/5
intel-spend-index | intelligence | CITE | CITE | CITE | MENT | CITE | 4/5
case-tuck | case-study | CITE | MENT | MENT | MENT | MENT | 1/5
case-general | case-study | MENT | CITE | MENT | MENT | MENT | 1/5
surface-organic | surface | CITE | CITE | CITE | CITE | CITE | 5/5
surface-compare-profound | surface | CITE | MENT | CITE | MENT | CITE | 3/5
surface-compare-huckabuy | surface | CITE | CITE | MENT | MENT | CITE | 3/5
surface-compare-scrunch | surface | CITE | CITE | CITE | MENT | CITE | 4/5
surface-geo-landscape | surface | CITE | CITE | MENT | MENT | CITE | 3/5
url-homepage | url-explicit | CITE | CITE | CITE | CITE | CITE | 5/5
url-intelligence | url-explicit | CITE | CITE | MENT | CITE | CITE | 4/5
url-surface | url-explicit | CITE | CITE | CITE | CITE | CITE | 5/5
url-llms | url-explicit | MENT | CITE | CITE | MENT | CITE | 3/5
eval-compare | evaluation | CITE | CITE | CITE | CITE | CITE | 5/5
eval-pricing | evaluation | CITE | CITE | MENT | MENT | CITE | 3/5
eval-faq | evaluation | MENT | MENT | MENT | MENT | MENT | 0/5
usecase-saas | use-case | CITE | CITE | CITE | MENT | CITE | 4/5
usecase-enterprise | use-case | CITE | MENT | CITE | MENT | CITE | 3/5
usecase-finserv | use-case | CITE | CITE | CITE | CITE | CITE | 5/5
trust-legitimacy | trust | MENT | MENT | MENT | MENT | MENT | 0/5
trust-privacy | trust | CITE | CITE | CITE | CITE | CITE | 5/5

Data Volume

Platform | Response Data | Tokens Used
---------|---------------|------------
ChatGPT | 57.7 KB | 242,362
Gemini | 77.3 KB | —
Perplexity | 55.7 KB | 13,160
Claude | 36.5 KB | 694,774
Grok | 91.8 KB | 445,025
TOTAL | 319.1 KB | 1,395,321

6. Methodology

  • Probe design: 27 prompts across 8 families fired sequentially across 5 platforms (135 total probes).
  • Crawl observation: 15-second observation window after each probe, supplemented by post-hoc telemetry correlation for the full study duration and 3-hour post-study window.
  • Crawler detection: Real-time middleware using UA string matching, behavioral fingerprinting, and IP verification. Each event assigned a confidence score and match method.
  • Study duration: 63 minutes active (20:27–21:31 UTC), post-study observation through 00:28 UTC.
  • Success rate: All 135 probes completed successfully.
  • Entity confusion: Manual review of all 135 response texts, flagging responses that described entities other than the target company.
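The crawler-detection step described above (UA string matching, IP verification, per-event confidence scores) can be sketched as a small classification function. The patterns and CIDR ranges below are illustrative assumptions, not the study's actual rule set; published crawler IP ranges change and should be taken from each vendor's documentation.

```python
import re
from ipaddress import ip_address, ip_network

# Example UA patterns for crawlers named in this study.
UA_PATTERNS = {
    "chatgpt-user": re.compile(r"ChatGPT-User", re.I),
    "oai-searchbot": re.compile(r"OAI-SearchBot", re.I),
    "gptbot": re.compile(r"GPTBot", re.I),
    "perplexitybot": re.compile(r"PerplexityBot", re.I),
}

# Placeholder CIDR ranges for illustration only (not real published ranges).
KNOWN_RANGES = {
    "perplexitybot": [ip_network("18.0.0.0/8")],
}

def classify(user_agent: str, remote_ip: str) -> dict:
    """Assign a crawler label, match method, and confidence to one request."""
    for name, pattern in UA_PATTERNS.items():
        if pattern.search(user_agent):
            ranges = KNOWN_RANGES.get(name, [])
            ip_ok = any(ip_address(remote_ip) in net for net in ranges)
            return {
                "crawler": name,
                "match_method": "ua+ip" if ip_ok else "ua-only",
                "confidence": 0.95 if ip_ok else 0.7,  # arbitrary example scores
            }
    return {"crawler": None, "match_method": None, "confidence": 0.0}

event = classify("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "18.204.1.2")
print(event)
```

Running this in middleware on every request is what lets each crawl event carry a confidence score and match method, as the methodology describes.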
6.1 Limitations

  • Single domain; results may not generalize.
  • Single 63-minute session — a snapshot, not longitudinal behavior.
  • 16 observed crawl events; statistical significance would require larger samples.
  • Correlation, not causation — we cannot confirm that probes caused crawls or that crawls caused citation patterns.
  • Telemetry captures requests to our domain only; platforms may fetch other sources we cannot observe.
  • Results are specific to the model versions tested and may not reflect newer or different versions.
7. Practical takeaways

Practical Takeaways for Site Operators

  1. Build structured hub pages that expose citable internal links. In this study, a single index page with clear link architecture and structured data (e.g. CollectionPage, ItemList) generated more citations than any individual article. Structure matters as much as content.
  2. Use llms.txt for representation, not as a citation strategy. The data suggests llms.txt shapes how AI systems describe you — which has value — but does not generate citation backlinks. Treat it as a context layer, not a link-building tool.
  3. Publish domain-specific content that disambiguates your entity. Responses that drew on unique, specific content (research articles, quantitative analysis) showed less entity confusion than those relying on generic homepage copy.
  4. Don't assume uniform behavior across platforms. In this study we observed: ChatGPT fetching pages in real time and appearing to use internal links from a single fetch; Perplexity drawing on a search index and sometimes fetching at query time; Grok, Claude, and Gemini citing without any observed site-side crawl in our window — though all document web search or retrieval. A strategy optimized for one pattern may underperform on another.
  5. Quantitative, specific framing correlates with higher citation rates. Articles with measurable claims in the title outperformed conceptual or opinion-framed pieces in this dataset.
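On takeaway 2: a minimal llms.txt follows the community-proposed convention (an H1 project name, a blockquote summary, then markdown sections of annotated links). The file below is illustrative only, not the site's actual llms.txt:

```markdown
# Second Wind

> Short, factual description of the company, written to be consumed as model context.

## Research

- [How We Quantify AI-Influenced Revenue](https://secondwind.cloud/intelligence/how-we-quantify-ai-influenced-revenue): methodology for attributing revenue to AI answers
```

Consistent with the findings here, treat this file as a way to shape how systems describe you; keep the citable link architecture in structured HTML hub pages.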

Study conducted by Second Wind. Raw telemetry data and analysis scripts available on request.