Relying solely on search queries grounded in Search Console and Bing can highlight intent, but it may not fully capture broader patterns of engagement. To understand how AI systems access and display content, we need to look beyond queries and move closer to the infrastructure itself.
Traditionally, log files have been difficult for SEO teams to work with – often locked behind technical barriers and owned by infrastructure or security teams. It’s probably quicker to find a needle in a haystack than to earn access to server logs in an SEO role.
Fortunately, the rapid growth of AI platforms has changed this dynamic. As AI crawlers and agents increasingly interact with websites, their impact on infrastructure and cybersecurity has pushed log providers to make this data far more accessible and usable.
As ChatGPT dominates AI referrals, collecting proprietary server logs tied to the ChatGPT-User agent, filtered to valid status codes, provides a strong starting point for shifting behavioural analysis from what users might ask to how AI systems actually access your site.
In this article, I'll show how server logs can reveal AI access patterns, measure zero-click interactions, and provide actionable insights for improving visibility across both search engines and AI-driven platforms.
⚠️ Disclaimer
This is part 2 of a series about improving AI search tracking.
Catch up with the previous article on how your Google Search Console and Bing data can track AI search behavioural trails.
This is not a prompt tracking framework.
The only thing you should measure is how often your brand appears in LLM responses for your most representative topics:
- Does the LLM correctly associate your brand with its marketed categories?
- Given that tracking tools produce inconsistent output, does the model consistently place your brand in the right 'bucket'?
What You Need to Get Started
The main requirements to replicate this framework are:
- Access to server logs – via your web server, a CDN (e.g. Akamai) or a third-party tool (e.g. Botify log analyzer) – I strongly suggest you get access to your web server or CDN.
- Familiarity with Python and SQL – a piece of cake in the era of LLMs.
What is ChatGPT-User?
This user agent is only used when ChatGPT answers a user's question and decides to fetch your page to support its response.
So every hit from ChatGPT-User is a direct signal of user intent – it reflects real prompts happening inside ChatGPT.
It tends to retrieve only the raw HTML and skip JavaScript altogether. So if key content appears only after JS execution or if metadata is injected dynamically, ChatGPT-User is unlikely to see it or use it in its output.
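You can test this yourself by fetching a page the same way a non-JS client would and checking whether your key content survives. Below is a minimal sketch; the URL and the snippets are placeholders, not anything from a real site.

```python
# Minimal sketch: fetch the raw HTML (no JS execution, just like ChatGPT-User)
# and check whether key content is present. URL and snippets are placeholders.
import requests

URL = "https://www.yoursite.com/some-page/"  # replace with a page you care about
KEY_SNIPPETS = ["Your key heading", "Your product name"]  # content that must be visible

html = requests.get(URL, timeout=10).text  # requests never executes JavaScript

for snippet in KEY_SNIPPETS:
    status = "present in raw HTML" if snippet in html else "MISSING - likely injected by JS"
    print(f"{snippet!r}: {status}")
```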
Allowing or disallowing ChatGPT-User on your website is not as straightforward as it sounds. Recent updates to the OpenAI documentation confirm that ChatGPT-User may no longer follow robots.txt rules, as it can operate in tandem with GPTBot or even be triggered by custom GPTs.
If you really want to block it from accessing your website, to prevent your content from furnishing answers in chats, restrict its IP ranges with your hosting provider or at the CDN level with aggressive WAF rules.
Query Your Log Files
Compose a SQL query retrieving your server logs for:
- HTTP status `200` and `304` (not modified)
- Response content type `text/html`, to capture only HTML requests
- User agent `ChatGPT-User`
- `cliIP`, the IP address used by ChatGPT-User to hit your endpoints
- The last 7 days – a longer lookback may exhaust bandwidth, depending on the data retention configured with your provider (usually the last 30 days)
💡 A Few Heads-Ups
1. Use a DNS checker to run IP lookups and identify potential bot impersonators. You can use this insight to limit the output of the following query to genuine ChatGPT-User IPs – see the sketch after this list.
2. Bear in mind that you don't need perfect accuracy – a good sample is enough.
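One way to act on the first heads-up is to check log IPs against the CIDR ranges OpenAI publishes for its bots. A minimal sketch follows; the JSON URL and its schema are assumptions based on OpenAI's bot documentation, so verify both before trusting the output.

```python
# Minimal sketch: flag impersonators by checking cliIP values against the
# CIDR ranges OpenAI publishes for ChatGPT-User. The URL and the JSON schema
# ("prefixes" -> "ipv4Prefix") are assumptions - confirm them in OpenAI's docs.
import ipaddress
import requests

RANGES_URL = "https://openai.com/chatgpt-user.json"  # assumed endpoint

payload = requests.get(RANGES_URL, timeout=10).json()
networks = [
    ipaddress.ip_network(p["ipv4Prefix"])
    for p in payload.get("prefixes", [])
    if "ipv4Prefix" in p
]

def is_genuine_chatgpt_user(ip: str) -> bool:
    """True if the IP falls inside a published ChatGPT-User range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

print(is_genuine_chatgpt_user("203.0.113.7"))  # illustrative IP (TEST-NET-3)
```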
Run this beautiful query:
```sql
WITH base AS (
    SELECT
        reqPath,
        cliIP,
        reqTimeSec,
        transferTimeMSec,
        downloadTime,
        toUnixTimestamp64Milli(reqTimeSec)
            - toUnixTimestamp64Milli(
                lagInFrame(reqTimeSec, 1) OVER (PARTITION BY cliIP ORDER BY reqTimeSec ASC)
            ) AS time_diff
    FROM akamai.logs
    WHERE $__timeFilter(reqTimeSec)
      AND statusCode IN ('200', '304')
      AND reqHost = 'www.yoursite.com'
      AND rspContentType = 'text/html'
      AND UA LIKE '%ChatGPT-User%'
),
sessions AS (
    SELECT
        *,
        sum(
            CASE
                WHEN time_diff IS NULL THEN 1
                WHEN time_diff > 100 THEN 1 -- 100 ms threshold to capture the full round trip of LLM requests
                ELSE 0
            END
        ) OVER (
            PARTITION BY cliIP
            ORDER BY reqTimeSec ASC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS session_id
    FROM base
)
SELECT
    cliIP,
    session_id,
    min(reqTimeSec) AS session_start_time,
    toUnixTimestamp64Milli(max(reqTimeSec))
        - toUnixTimestamp64Milli(min(reqTimeSec)) AS duration_ms,
    COUNT(*) AS count_requests,
    COUNT(DISTINCT reqPath) AS unique_urls,
    groupArray(reqPath) AS reqpath
FROM sessions
GROUP BY cliIP, session_id
ORDER BY session_start_time DESC
```

This query is quite advanced, and I was able to put it together off the back of recent insightful findings from Ruben Remy.
This is the expected output, exported as a CSV file:
What do you have in there?
- session_id: a unique code that bundles near-simultaneous requests from ChatGPT-User into one event. A new session starts whenever the gap between requests from the same IP exceeds 100 ms, isolating the "parallel bursts" that define a query fan-out event, according to Ruben's research.
- session_start_time: the exact moment ChatGPT-User began its parallel crawl for that specific question.
- duration_ms: the session length, reflecting how quickly the AI "fanned out" across your site. The longer a session, the greater the chance that ChatGPT-User fetched distinct URLs.
- count_requests: the total number of times ChatGPT-User hit your server in a single session.
- unique_urls: the number of distinct pages fetched by the bot in one session. A high number signals topical authority: ChatGPT-User returned an answer based on a sequence of topically related pages on your site.
- reqPath: the list of pages that ChatGPT-User considers related to a single topic.
This is already an advanced deep dive into server logs.
From it alone, you can estimate how many query fan-out cycles occurred (the total number of unique session_id values) and get a rough indication of ChatGPT topical authority, based on the total number of distinct URLs fetched within one session – see the sketch below.
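Those two readouts take only a few lines of pandas over the exported CSV. A minimal sketch, assuming the column names from the query output and a hypothetical filename:

```python
# Minimal sketch: fan-out metrics from the exported CSV. Each row is one
# (cliIP, session_id) group, per the GROUP BY in the query above.
import pandas as pd

df = pd.read_csv("chatgpt_user_sessions.csv")  # hypothetical export path

fan_out_cycles = len(df)  # one row per session = one fan-out cycle
avg_unique_urls = df["unique_urls"].mean()
multi_url_share = (df["unique_urls"] > 1).mean()

print(f"Fan-out cycles: {fan_out_cycles}")
print(f"Avg distinct URLs per session: {avg_unique_urls:.1f}")
print(f"Share of sessions touching 2+ URLs: {multi_url_share:.0%}")
```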
URL Paths Pre-Processing & Clustering
But the raw output still needs surgery.
The next step is pre-processing the URL paths so we can convert them into traditional search queries and group them into clean clusters.
You can crack on with the Google Colab notebook, which covers this section.
The script will:
- Normalise paths into search-query-like strings.
- Perform HDBSCAN clustering and assign a membership probability to each URL.
The number of clusters you end up with is somewhat arbitrary, but it depends on how many rows you managed to export from the query above.
If you end up with around 1,000 rows, aim for 15-30 clusters by adjusting MIN_CLUSTER_SIZE (20 is a good start) – a sketch of this step follows.
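For orientation, here is a minimal sketch of what those two steps can look like. The TF-IDF featurisation, filenames and exact parameters are my assumptions; the actual Colab may differ.

```python
# Minimal sketch of the pre-processing + clustering steps. TF-IDF character
# n-grams and the filenames are assumptions; the Colab notebook may differ.
import re

import hdbscan
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

MIN_CLUSTER_SIZE = 20  # the knob discussed above

def normalise_path(path: str) -> str:
    # "/blog/ai-search-tracking/" -> "blog ai search tracking"
    return re.sub(r"[-_/]+", " ", path.strip("/")).lower()

# Hypothetical input: one URL path per row (explode the groupArray column first).
df = pd.read_csv("chatgpt_user_paths.csv")
df["normalised_path"] = df["reqPath"].map(normalise_path)

vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(df["normalised_path"])

clusterer = hdbscan.HDBSCAN(min_cluster_size=MIN_CLUSTER_SIZE)
df["cluster"] = clusterer.fit_predict(vectors.toarray())
df["cluster_probability"] = clusterer.probabilities_

df.to_csv("clustered_paths.csv", index=False)
```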
Once fine-tuned to your liking, run the script. A similar output will be returned.
The output will be saved as a CSV file.
Make sure to open it and remove all rows where cluster_probability < 1 – these may be outliers from irrelevant cluster assignments.
Also, remove all columns before normalised_path – from now on, we won't need the raw server-log fields. A pandas version of this cleanup is sketched below.
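A minimal sketch of that cleanup, assuming the column names produced above and placeholder filenames:

```python
# Minimal sketch of the cleanup: drop probable outliers, then drop every
# column that precedes normalised_path. Filenames are placeholders.
import pandas as pd

df = pd.read_csv("clustered_paths.csv")
df = df[df["cluster_probability"] >= 1]  # remove rows flagged as likely outliers

first_kept = df.columns.get_loc("normalised_path")
df = df.iloc[:, first_kept:]  # keep normalised_path and everything after it

df.to_csv("clustered_paths_clean.csv", index=False)
```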
Rename Clusters with an LLM
Upload the document to Gemini, Claude or ChatGPT and ask it to perform a simple NLP task: translating your HDBSCAN cluster scores into human-readable topic labels.
I'm not going to recommend one LLM over another – Gemini 3 is now the default model for AI Overviews, and ChatGPT may handle content better than Claude.
However, turning numbers into words is a basic NLP operation that shouldn't differ significantly between models.
Prompt example:
Give me a non-generic content topic that represents all pages in each cluster. Populate the cluster_label column using the table copied from my clipboard.
As an example, I used Gemini, then copied and pasted its suggestions into the CSV file to run a VLOOKUP.
That's how I mapped cluster numbers to their corresponding topic labels – the pandas equivalent is sketched below.
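If you'd rather stay in Python than Excel, the same mapping is a one-line merge. A sketch, assuming the LLM's suggestions were saved to a hypothetical CSV with cluster and cluster_label columns:

```python
# Minimal sketch: the pandas equivalent of the VLOOKUP step. Assumes the
# LLM's labels were saved to a CSV with "cluster" and "cluster_label" columns.
import pandas as pd

paths = pd.read_csv("clustered_paths_clean.csv")
labels = pd.read_csv("cluster_labels.csv")  # hypothetical file of LLM suggestions

paths = paths.merge(labels, on="cluster", how="left")
paths.to_csv("clustered_paths_labelled.csv", index=False)
```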
Now that you've got clusters, let's tie them to reality.
- Concatenate normalised_path with your site's domain.
- Run a list crawl with the GSC API to gather Impressions and Clicks for the concatenated URLs.
- VLOOKUP the findings against your original export (i.e. the last image above) to retain only Impressions (optionally Clicks too) – see the sketch after this list.
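If you prefer querying the Search Analytics API directly instead of a list crawl, the sketch below pulls page-level Impressions and Clicks and merges them in (a pandas VLOOKUP). The service account file, dates, filenames and the URL reconstruction are placeholders to adapt.

```python
# Minimal sketch: page-level Impressions/Clicks from the GSC Search Analytics
# API, merged against the labelled export. Credentials, dates and filenames
# are placeholders; the property must be verified for the service account.
import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build

SITE = "https://www.yoursite.com/"

creds = service_account.Credentials.from_service_account_file(
    "service_account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
gsc = build("searchconsole", "v1", credentials=creds)

resp = gsc.searchanalytics().query(
    siteUrl=SITE,
    body={
        "startDate": "2025-01-01",
        "endDate": "2025-01-28",
        "dimensions": ["page"],
        "rowLimit": 25000,
    },
).execute()

gsc_df = pd.DataFrame(
    [
        {"url": row["keys"][0], "impressions": row["impressions"], "clicks": row["clicks"]}
        for row in resp.get("rows", [])
    ]
)

paths = pd.read_csv("clustered_paths_labelled.csv")
# Reconstruct URLs from paths; adjust this line to undo your own normalisation.
paths["url"] = SITE.rstrip("/") + "/" + paths["normalised_path"].str.replace(" ", "-")
merged = paths.merge(gsc_df, on="url", how="left")  # pandas equivalent of a VLOOKUP
```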
Now, you have two options based on the strategic driver:
| Strategy | Tactic | Target | Assignment |
|---|---|---|---|
| Knowledge Protection | Build or enhance reporting around zero-click searches and ChatGPT-User access | Internal team (SEO, Paid media) | Dump the keywords in Ahrefs to expand the range of similar long tail queries |
| Offensive Actions | Prompt building and tracking by funnel stages to intercept gaps in ChatGPT retrieval | Clients, company stakeholders and C-Suite | Review the list of paths and select a keyword modifier that specifies search intent for custom prompts |
The second option is the most requested in SEO services. For the life of me, I wouldn't subscribe to it myself, but I have to get paid somehow, so I do!
Below is an example of how that tactic will render in practice based on the process detailed in this article.
At this point, you might consider submitting these prompts to tracking platforms such as Peec AI – again, this is not a sponsored link!
Server Logs are not without Limitations
In the previous article, we noted that proprietary data may not reflect the full picture, as the multiple preprocessing steps involved can introduce confirmation bias.
Similarly, this approach also comes with some caveats.
- The link between the generated response and the referenced source is not always factual; in some cases, the citation is produced by template logic rather than true source attribution.
- LLMs tend to provide sources or citations, even when they aren't fully accurate. Tracking citations shows how confident the model is, not necessarily what's correct.
- This method doesn't scale well unless you run it on your own server. Pulling more than 30 days of logs can use a lot of resources and may trigger limits from your CDN – this happened to me when I overqueried.
AI Search Tracking moves from Attribution to Understanding
Tracking AI visibility is not about chasing perfect attribution. It is about building directional clarity in an ecosystem where traditional analytics no longer apply.
By combining first-party search data with server-side signals, you can move from speculation to structured observation – identifying patterns that explain how AI systems surface your brand.
This framework will not tell you exactly which prompt, topic, query or keyword triggered an answer, but it will help you understand why your content is being selected.