
GPTBot

GPTBot is OpenAI's web crawler for training data collection. It is identifiable by the User-Agent token 'GPTBot/1.x' and by the IP ranges published at openai.com/gptbot, and it is distinct from OAI-SearchBot, which fetches pages for real-time ChatGPT search citations.

Definition

GPTBot is OpenAI's primary web crawler for collecting public web content used in pre-training future GPT model versions. It identifies itself with the User-Agent string Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.x; +https://openai.com/gptbot) and publishes its IP range list at openai.com/gptbot.
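
Because a User-Agent header is easy to spoof, the two identifiers above are usually checked together. The following is a minimal sketch of that check, not an official verification method. The function name looks_like_gptbot and the CIDR ranges are assumptions for illustration: the ranges shown are reserved documentation blocks, not OpenAI's real ranges, and would need to be replaced with the list published at openai.com/gptbot.

```python
import ipaddress

# Hypothetical placeholder ranges (reserved documentation blocks), NOT OpenAI's
# real ranges. Replace with the CIDR blocks published at openai.com/gptbot.
EXAMPLE_GPTBOT_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def looks_like_gptbot(user_agent, remote_ip, cidr_ranges=EXAMPLE_GPTBOT_RANGES):
    """Return True only if the User-Agent names GPTBot AND the IP is in a listed range."""
    if "gptbot/" not in user_agent.lower():
        return False
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in cidr_ranges)
```

A request that carries the GPTBot User-Agent but arrives from an address outside the published ranges is most likely an impersonator and can be treated like any other unknown client.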

Why it matters

GPTBot crawl access determines whether your content can be collected to train future models. Blocking it does not affect ChatGPT real-time search citations (those come from OAI-SearchBot, a separate bot using a different User-Agent), but it does keep your pages out of future training crawls, and therefore out of the corpus that shapes baseline model knowledge.

Crawl behavior

  • Respects robots.txt
  • Honors Crawl-delay directive
  • Uses 429 + Retry-After backoff
  • Refresh cadence: weekly to monthly for established sites
  • No execution of client-side JavaScript (server-side rendering required for content visibility)
  • No image OCR — image alt text is captured, image content is not
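
If you would rather throttle GPTBot than block it, the 429 + Retry-After behavior listed above gives you a way to do so from the server side. The sketch below shows one possible approach, a fixed-window rate limiter that produces the status code and header a crawler would back off on. The class name FixedWindowLimiter and the limits used are hypothetical, and in practice this logic usually lives in the web server or CDN configuration rather than in application code.

```python
import math


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each client key."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._windows = {}  # key -> (window_start, request_count)

    def check(self, key, now):
        """Return (status_code, headers) for a request arriving at time `now` (seconds)."""
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # previous window expired, start a new one
        if count >= self.limit:
            # Tell the client how many whole seconds remain in the current window.
            retry_after = math.ceil(start + self.window_seconds - now)
            return 429, {"Retry-After": str(max(retry_after, 1))}
        self._windows[key] = (start, count + 1)
        return 200, {}
```

Keying the limiter on the bot name (or on the verified IP range) keeps the throttle from affecting human visitors.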

Common confusions

  • GPTBot ≠ ChatGPT-User. ChatGPT-User is the bot that fetches pages when a ChatGPT user pastes a URL into chat. It's user-initiated, not a crawler.
  • GPTBot ≠ OAI-SearchBot. OAI-SearchBot grounds ChatGPT's web search citations. Allowing GPTBot but blocking OAI-SearchBot is a common misconfiguration that hurts ChatGPT visibility.
  • GPTBot blocked at Cloudflare ≠ blocked in robots.txt. Cloudflare's AI Bot Block feature (and the WAF rules customers enable) can override your robots.txt allow. Check both layers.
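
To keep the three OpenAI agents apart in logs or middleware, a small classifier on the User-Agent token is often enough. This is an illustrative sketch: the function name classify_openai_agent is hypothetical, and it matches only on the token names used in this article, not on full User-Agent strings or IP ranges.

```python
def classify_openai_agent(user_agent):
    """Map a User-Agent header to the OpenAI agent it names, or 'other'."""
    ua = user_agent.lower()
    # The three tokens do not contain one another, so check order does not matter.
    if "oai-searchbot" in ua:
        return "OAI-SearchBot"  # grounds ChatGPT web search citations
    if "chatgpt-user" in ua:
        return "ChatGPT-User"   # user-initiated fetch, not a crawler
    if "gptbot" in ua:
        return "GPTBot"         # training-data crawler
    return "other"
```

Counting hits per category makes the "GPTBot allowed, OAI-SearchBot blocked" misconfiguration easy to spot: one category shows traffic and the other shows none.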

How to allow it (and verify)

In robots.txt:

User-Agent: GPTBot
Allow: /
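
Before deploying, a rule like the one above can be sanity-checked locally with Python's standard-library robots.txt parser. This is a rough sketch: the helper name is_allowed and the example.com URL are placeholders, and the standard-library parser may differ in edge cases from how GPTBot itself interprets a robots.txt file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: GPTBot
Allow: /
"""


def is_allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/pricing"))
```

Running the same check for OAI-SearchBot against your real robots.txt shows whether a rule meant for one bot accidentally covers, or misses, the other.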

In Cloudflare: Security → Bots → AI Crawl Control → "Do not block (allow crawlers)" + "Do not manage robots.txt".

Verify access at the server-log level:

grep -i "gptbot" /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn

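Zero hits over 14 days while robots.txt allows GPTBot suggests an upstream block (WAF, Bot Management, Geo-IP rule). Confirm that the log file actually covers the full 14 days before drawing that conclusion, since rotated logs are not searched by the command above.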
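
The same per-path count can be produced in Python, which is convenient when the result feeds a report or an alert. This sketch assumes the common nginx "combined" log format, where the request path is the seventh whitespace-separated field (the same field as awk's $7). The function name gptbot_hits_by_path is hypothetical.

```python
from collections import Counter


def gptbot_hits_by_path(log_lines):
    """Count GPTBot requests per path, most-requested first."""
    counts = Counter()
    for line in log_lines:
        if "gptbot" not in line.lower():
            continue  # not a GPTBot request
        fields = line.split()
        if len(fields) >= 7:
            counts[fields[6]] += 1  # 7th field is the request path
    return counts.most_common()


if __name__ == "__main__":
    # Path taken from the shell example; adjust to your server's log location.
    with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as f:
        for path, hits in gptbot_hits_by_path(f):
            print(hits, path)
```

An empty result from a log that spans the whole observation window is the "zero hits" signal described above.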

See /ai-bot-crawlers for the full bot reference table.

Frequently asked

Should I block GPTBot from training on my content?

It is a trade-off. Blocking keeps your content out of future GPT model pre-training, which over several years can reduce how much the model knows about your brand by default. Allowing means accepting that your content helps shape the model's understanding of your category. Most brands choose to allow it because pre-training visibility compounds.

What's the difference between GPTBot and OAI-SearchBot?

GPTBot collects training data for model pre-training. OAI-SearchBot fetches pages in real time when ChatGPT generates web search responses. Different User-Agents, different IP ranges, different purposes. To appear in ChatGPT search citations, allow OAI-SearchBot. To shape future-model baseline knowledge, allow GPTBot. Most sites should allow both.

Does GPTBot render JavaScript?

No. GPTBot fetches the server-side HTML and does not execute client-side scripts. Content rendered only by JavaScript after page load is invisible to GPTBot. Server-side rendering or pre-rendering of priority content is required.
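
A quick way to approximate what a non-rendering crawler sees is to look for a key phrase in the server's raw HTML while ignoring anything inside script and style tags. The sketch below uses only the Python standard library. The names _TextExtractor and visible_without_js are hypothetical, and the check is a rough proxy, not a reproduction of GPTBot's actual parsing.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_without_js(server_html, phrase):
    """Return True if `phrase` appears in the HTML's text without running any scripts."""
    parser = _TextExtractor()
    parser.feed(server_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return phrase.lower() in text.lower()
```

Feed it the response body from a plain HTTP fetch of a priority page. If a headline or product description comes back as not visible, that content depends on client-side rendering.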
