GPTBot

Definition

GPTBot is OpenAI's primary web crawler for collecting public web content used in pre-training future GPT model versions. It identifies itself with the User-Agent string Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.x; +https://openai.com/gptbot) and publishes its IP range list at openai.com/gptbot.
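
A User-Agent header is easy to spoof, so the published IP list is what lets you confirm that a request really came from GPTBot. Below is a minimal sketch of that check, assuming you have already loaded the ranges into a list of CIDR strings. The prefixes shown are documentation placeholder ranges, not OpenAI's real ones, and is_from_published_range is a hypothetical helper name.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges).
# Assumption: replace these with the prefixes from the list OpenAI publishes.
EXAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def is_from_published_range(client_ip, cidr_ranges):
    """Return True if client_ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if address.version == network.version and address in network:
            return True
    return False
```

A request whose User-Agent says GPTBot but whose source IP fails this check is most likely an impersonator.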

Why it matters

Whether GPTBot can crawl your site determines whether your content can be collected to train future models. Blocking it does not affect ChatGPT search citations (those come from OAI-SearchBot, a separate bot with its own User-Agent), but it does keep your content out of future crawls for the training corpus that shapes baseline model knowledge.

Crawl behavior

Respects robots.txt
Honors the Crawl-delay directive (a nonstandard robots.txt extension)
Backs off when the server responds with 429 Too Many Requests and a Retry-After header
Refresh cadence: weekly to monthly for established sites
No execution of client-side JavaScript (server-side rendering required for content visibility)
No image OCR — image alt text is captured, image content is not
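
On the origin side, the 429 and Retry-After behavior in the list above means your server answers with that status and header once a crawler exceeds your request budget. Here is a rough sketch, assuming your server already counts requests per crawler in a time window. The helper name throttle_decision, the limit, and the delay are illustrative values, not OpenAI recommendations.

```python
def throttle_decision(requests_in_window, limit, retry_after_seconds=120):
    """Return (status_code, headers) for one crawler request.

    Hypothetical helper: the caller is assumed to track how many requests
    the crawler has made in the current time window.
    """
    if requests_in_window > limit:
        # 429 Too Many Requests; Retry-After says how many seconds to wait.
        return 429, {"Retry-After": str(retry_after_seconds)}
    return 200, {}
```

Returning 429 rather than 403 signals a temporary limit, not a permanent refusal.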

Common confusions

GPTBot ≠ ChatGPT-User. ChatGPT-User is the agent that fetches a page when a ChatGPT user pastes a URL into chat. The fetch is user-initiated, so it is not a crawler.
GPTBot ≠ OAI-SearchBot. OAI-SearchBot grounds ChatGPT's web search citations. Allowing GPTBot but blocking OAI-SearchBot is a common misconfiguration that hurts ChatGPT visibility.
GPTBot blocked at Cloudflare ≠ blocked in robots.txt. Cloudflare's AI Bot Block feature (and any WAF rules customers enable) can block GPTBot even when your robots.txt allows it. Check both layers.
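
To keep the three agents apart when reading logs or writing rules, you can match on the bot name in the User-Agent header. This is a minimal sketch. The role labels paraphrase the descriptions above, classify_openai_agent is a hypothetical helper, and substring matching is an assumption about how you want to match, not a documented requirement.

```python
# Bot names used in this section, lowercased for case-insensitive matching.
OPENAI_AGENTS = {
    "gptbot": "pre-training crawler",
    "oai-searchbot": "search citations",
    "chatgpt-user": "user-initiated fetch",
}


def classify_openai_agent(user_agent):
    """Return (token, role) for a known OpenAI agent, or None if no match."""
    lowered = user_agent.lower()
    for token, role in OPENAI_AGENTS.items():
        if token in lowered:
            return token, role
    return None
```

Because the header can be forged, treat the result as a label for reporting, not as proof of identity.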

How to allow it (and verify)

In robots.txt:

User-Agent: GPTBot
Allow: /
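
Before deploying, you can check that the rules parse the way you expect. The sketch below uses Python's standard-library robots.txt parser. The is_allowed helper and the example.com URL are illustrative, and the standard-library parser may not match GPTBot's own rule matching in every edge case.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ROBOTS_TXT = """\
User-Agent: GPTBot
Allow: /
"""

# Hypothetical URL used only for illustration.
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/pricing"))  # prints True
```

Running the same check against your live robots.txt text catches a stray Disallow rule before the crawler does.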

In Cloudflare: Security → Bots → AI Crawl Control → "Do not block (allow crawlers)" + "Do not manage robots.txt".

Verify access at the server-log level:

grep -i "gptbot" /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn

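Zero hits over 14 days while robots.txt allows GPTBot usually points to an upstream block (WAF, Bot Management, or a Geo-IP rule) that stops requests before they reach the origin. Make sure the search covers rotated log files for the whole window, not only the current access.log.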
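
The grep pipeline above counts hits per path. If you also want the response status, a short script can do both. This sketch assumes the default nginx combined log format, where the request path is the 7th whitespace-separated field and the status is the 9th. The helper name gptbot_hits is hypothetical.

```python
from collections import Counter


def gptbot_hits(log_lines):
    """Count GPTBot requests per (path, status) in combined-format log lines."""
    hits = Counter()
    for line in log_lines:
        if "gptbot" not in line.lower():
            continue
        fields = line.split()
        if len(fields) >= 9:
            # fields[6] is the request path (awk's $7); fields[8] is the status.
            hits[(fields[6], fields[8])] += 1
    return hits
```

Hits that show up with a 403 status mean requests reach your server and are refused there, which is a different problem from seeing no hits at all.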

See /ai-bot-crawlers for the full bot reference table.

Frequently asked

Should I block GPTBot from training on my content?

It is a trade-off. Blocking keeps your content out of future GPT model pre-training, which over several years can reduce the model's baseline knowledge of your brand. Allowing means your content helps shape the model's understanding of your category. Most brands choose to allow it because pre-training visibility compounds.

What's the difference between GPTBot and OAI-SearchBot?

GPTBot collects training data for model pre-training. OAI-SearchBot crawls pages so that ChatGPT can surface and cite them in web search responses. The two use different User-Agents and different IP ranges, and serve different purposes. To appear in ChatGPT search citations, allow OAI-SearchBot. To shape future-model baseline knowledge, allow GPTBot. Most sites should allow both.

Does GPTBot render JavaScript?

No. GPTBot fetches the server-side HTML and does not execute client-side scripts. Content rendered only by JavaScript after page load is invisible to GPTBot. Server-side rendering or pre-rendering of priority content is required.
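
One way to check what a non-rendering crawler can see is to look for a key phrase in the raw HTML response, ignoring anything inside script tags. The sketch below uses only the standard library. The helper name visible_without_js and the sample markup are illustrative, and it approximates rather than reproduces how GPTBot extracts text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_without_js(raw_html, phrase):
    """Return True if phrase appears in the HTML text before any script runs."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    # Collapse whitespace so line breaks in the markup do not split the phrase.
    text = " ".join(" ".join(extractor.chunks).split())
    return phrase.lower() in text.lower()
```

Feed it the body of a plain HTTP fetch of the page (not the DOM from a browser). If a priority headline or product name comes back False, that content depends on client-side rendering.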