AI Crawler Setup Checklist: Configure Your Site for GEO, LLMO & AI Visibility
Complete technical checklist for AI SEO audit and AI readiness. Configure GPTBot, PerplexityBot, and Google-Extended for maximum LLM visibility and ChatGPT optimization.
Ok So You Know How AI Crawlers Work. Now Let's Actually Set This Up.
The previous article covered how AI crawlers read your website. Now let's get practical.
Whether you care about GEO, LLMO, or AEO (honestly these terms overlap a lot), this checklist will help you configure your site for AI visibility.
We'll cover robots.txt, sitemaps, HTTP headers, schema validation, and how to check if your AI readiness is actually improving.
Bookmark this page. You'll come back to it.
Why Bother With All This
What we're optimizing for:
- Getting recommended when people ask ChatGPT questions
- Showing up as a source in Perplexity answers
- Being featured in Google's AI summaries
- Making sure AI models can actually read and cite your stuff
Part 1: robots.txt Configuration
Your robots.txt is the first thing AI crawlers check. Mess this up and your whole AI visibility fails before it starts.
The AI Crawlers You Need to Know (2026 Complete List)
| Crawler | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Powers ChatGPT's web browsing and training |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT users |
| Google-Extended | Google | Gemini and AI Overviews |
| Googlebot | Google | Traditional search (also feeds AI) |
| PerplexityBot | Perplexity | Perplexity AI search engine |
| Anthropic-AI | Anthropic | Claude's web capabilities |
| ClaudeBot | Anthropic | Claude's browsing feature |
| Bytespider | ByteDance | AI features across ByteDance products |
| Applebot-Extended | Apple | Apple Intelligence and Siri |
| Meta-ExternalAgent | Meta | Meta AI features |
| Cohere-AI | Cohere | Enterprise AI applications |
```
# Allow all AI crawlers for maximum visibility
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Anthropic-AI
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Cohere-AI
Allow: /

# Standard search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Default rule for all other bots
User-agent: *
Allow: /

# Sitemap location
Sitemap: https://yoursite.com/sitemap.xml
```

If You Want to Block Specific Crawlers
Maybe you want AI recommendations but don't want your content used for training. Here's a selective approach:
```
# Allow browsing, block training (where supported)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
```

Note: This is a trade-off. Blocking GPTBot may reduce your visibility in ChatGPT recommendations.
Common robots.txt Mistakes
Mistake 1: Accidental wildcard blocks
```
# BAD - blocks everything including AI crawlers
User-agent: *
Disallow: /
```

Mistake 2: Blocking important directories
```
# BAD - blocks product pages from AI
User-agent: *
Disallow: /products/
```

Mistake 3: Conflicting rules
```
# CONFUSING - order matters, be explicit
User-agent: GPTBot
Allow: /blog/
Disallow: /
```

Verify Your robots.txt
Visit yoursite.com/robots.txt directly in your browser to confirm the file loads and contains the rules you expect.

Part 2: Sitemap Configuration
Sitemaps tell crawlers what pages exist and when they were updated. AI crawlers use this to prioritize what to index.
Basic Sitemap Structure
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yoursite.com/</loc>
    <lastmod>2026-01-26</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://yoursite.com/products/widget</loc>
    <lastmod>2026-01-25</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

Sitemap Best Practices for AI
Keep lastmod accurate
AI crawlers use lastmod to decide whether to re-crawl. If it hasn't changed, they might skip it.
```xml
<!-- Update this whenever content actually changes -->
<lastmod>2026-01-26</lastmod>
```

Prioritize important pages
Use priority values to signal what matters most:
- Homepage: 1.0
- Category pages: 0.8
- Product pages: 0.7
- Blog posts: 0.6
- Utility pages: 0.3
If you have more than 50,000 URLs, split into multiple sitemaps:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/sitemap-products.xml</loc>
    <lastmod>2026-01-26</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-blog.xml</loc>
    <lastmod>2026-01-26</lastmod>
  </sitemap>
</sitemapindex>
```

Generate Sitemaps Automatically
Shopify: Built-in at yourstore.com/sitemap.xml
WordPress: Use Yoast SEO or RankMath plugins
Next.js: Use next-sitemap package
Custom sites: Generate dynamically from your database/CMS
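For custom sites, a sitemap generator can be a few lines. A sketch using Python's standard library, where `BASE_URL` and the hardcoded `PAGES` list are placeholders standing in for your real domain and database query:

```python
# Sketch: build a minimal sitemap from (path, last_modified, priority)
# records, e.g. pulled from your CMS. BASE_URL and PAGES are illustrative.
from datetime import date
from xml.etree.ElementTree import Element, SubElement, tostring

BASE_URL = "https://yoursite.com"
PAGES = [
    ("/", date(2026, 1, 26), "1.0"),
    ("/products/widget", date(2026, 1, 25), "0.8"),
]

def build_sitemap(pages) -> str:
    urlset = Element("urlset",
                     xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for path, lastmod, priority in pages:
        url = SubElement(urlset, "url")
        SubElement(url, "loc").text = BASE_URL + path
        SubElement(url, "lastmod").text = lastmod.isoformat()  # keep accurate!
        SubElement(url, "priority").text = priority
    return tostring(urlset, encoding="unicode")

print(build_sitemap(PAGES))
```

Because `lastmod` comes straight from your records, it stays accurate automatically, which is exactly what the best practices above ask for.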
Part 3: HTTP Headers for AI Crawlers
HTTP headers give crawlers instructions about how to handle your pages.
Essential Headers
Cache-Control
Tell crawlers how fresh your content is:
```
Cache-Control: public, max-age=3600
```

This says "content is public, cache for 1 hour."
Last-Modified
When the content was last changed:
```
Last-Modified: Mon, 26 Jan 2026 10:00:00 GMT
```

Crawlers use this to avoid re-fetching unchanged content.
X-Robots-Tag
Page-level crawler instructions (alternative to meta robots):
```
X-Robots-Tag: index, follow
```

For AI-specific control:

```
X-Robots-Tag: googlebot: index, follow
X-Robots-Tag: gptbot: noindex
```

Headers to Avoid
Don't block with authentication:
```
# BAD - crawlers can't authenticate
WWW-Authenticate: Basic realm="Protected"
```

Don't use aggressive rate limiting:
AI crawlers are generally polite, but overly aggressive rate limiting might cause them to give up.
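To see how these headers work together, here is a sketch of the freshness logic a polite crawler might apply: honor `max-age` from `Cache-Control`, and fall back to revalidating when no freshness information is present. The header values are illustrative:

```python
# Sketch of crawler-side freshness logic (an assumption about typical
# behavior, not any vendor's documented algorithm).
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime

def is_fresh(headers: dict, fetched_at: datetime, now: datetime) -> bool:
    """True if a copy fetched at `fetched_at` is still fresh at `now`."""
    cache_control = headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            max_age = int(directive.split("=", 1)[1])
            return now - fetched_at < timedelta(seconds=max_age)
    return False  # no freshness info: assume stale, revalidate

headers = {"Cache-Control": "public, max-age=3600",
           "Last-Modified": "Mon, 26 Jan 2026 10:00:00 GMT"}
fetched = parsedate_to_datetime(headers["Last-Modified"])
print(is_fresh(headers, fetched, fetched + timedelta(minutes=30)))  # → True
```

A stale result doesn't mean a full re-download: a crawler can send the `Last-Modified` value back as `If-Modified-Since` and get a cheap `304 Not Modified` when nothing changed.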
Part 4: Structured Data Validation for GEO
Schema markup is critical for Generative Engine Optimization. It helps AI understand your content with zero ambiguity. But broken schema is worse than no schema.
Validation Tools
Google Rich Results Test: Paste your URL, see what structured data Google detects.
Schema.org Validator: More detailed validation against the full schema.org spec.
JSON-LD Playground: Debug JSON-LD syntax issues.
Common Schema Errors
Missing required properties:
```
{
  "@type": "Product"
  // ERROR: missing "name" property
}
```

Wrong data types:
```
{
  "@type": "Product",
  "price": "one hundred dollars" // ERROR: should be number
}
```

Invalid URLs:
```
{
  "@type": "Product",
  "image": "logo.png" // ERROR: should be full URL
}
```

Schema Checklist by Page Type
Product Pages:
- ☐ Product name
- ☐ Description (50+ words)
- ☐ Price with currency
- ☐ Availability status
- ☐ Brand
- ☐ SKU or productID
- ☐ Images (full URLs)
- ☐ Reviews/ratings (if available)
Article/Blog Pages:
- ☐ Headline
- ☐ Author with name
- ☐ datePublished
- ☐ dateModified
- ☐ Publisher with logo
- ☐ Image
Organization:
- ☐ Name
- ☐ URL
- ☐ Logo
- ☐ Contact information
- ☐ Social profiles
FAQ Pages:
- ☐ Question text
- ☐ Answer text (complete, not truncated)
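The required-property errors shown earlier are easy to catch before a validator does. A sketch of a pre-flight check; note the required-field lists below mirror this checklist, not the full schema.org spec:

```python
# Sketch: flag missing required properties in a JSON-LD snippet.
# REQUIRED is an assumption based on this checklist, not schema.org itself.
import json

REQUIRED = {
    "Product": ["name", "description", "offers", "image"],
    "Article": ["headline", "author", "datePublished"],
}

def missing_properties(jsonld: str) -> list:
    """Return required properties absent from a JSON-LD object."""
    data = json.loads(jsonld)
    required = REQUIRED.get(data.get("@type"), [])
    return [prop for prop in required if prop not in data]

snippet = '{"@type": "Product", "name": "Widget", "image": "https://yoursite.com/widget.png"}'
print(missing_properties(snippet))  # → ['description', 'offers']
```

Run this in CI against your templates' JSON-LD output and broken schema never reaches production, where it would be worse than no schema at all.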
Part 5: Page Speed for Crawlers
AI crawlers have timeouts. If your page takes too long to load, they get partial content or nothing.
Speed Targets
| Metric | Target | Why It Matters |
|---|---|---|
| Time to First Byte | < 200ms | Crawler connection success |
| First Contentful Paint | < 1.8s | Initial content visible |
| Largest Contentful Paint | < 2.5s | Main content loaded |
| Total page weight | < 2MB | Faster transfer |
1. Compress images
- Use WebP format
- Lazy load below-fold images
- Serve appropriate sizes
2. Optimize JavaScript
- Remove unused code
- Defer non-critical scripts
- Avoid render-blocking JS

3. Enable compression

```apache
# In .htaccess or server config
AddOutputFilterByType DEFLATE text/html text/css application/javascript
```

4. Use a CDN
- Cloudflare (free tier available)
- Fastly
- AWS CloudFront
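As a rough illustration of why the DEFLATE configuration above is worth enabling, compressing a typical chunk of repetitive HTML with gzip (the same algorithm family) cuts its transfer size dramatically. The sample markup is synthetic:

```python
# Rough demo: gzip savings on repetitive HTML. Real pages vary, but
# markup-heavy HTML commonly compresses to a fraction of its raw size.
import gzip

html = ("<html><body>"
        + "<div class='product'><h2>Widget</h2>"
          "<p>A great widget for every home.</p></div>" * 200
        + "</body></html>").encode()

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)
print(f"{len(html)} bytes -> {len(compressed)} bytes ({ratio:.0%})")
```

Smaller transfers help every visitor, but they matter doubly for crawlers working against a timeout and a crawl budget.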
Test Your Speed
Run your key pages through Google PageSpeed Insights to confirm you're hitting the targets in the table above.
Part 6: AI Readiness Verification Checklist
Run through this AI readiness checklist after making changes:
robots.txt
- ☐ File accessible at /robots.txt
- ☐ GPTBot is allowed (or intentionally blocked)
- ☐ PerplexityBot is allowed
- ☐ Google-Extended is allowed
- ☐ No accidental wildcard blocks
- ☐ Sitemap URL is listed
Sitemap
- ☐ Sitemap accessible at listed URL
- ☐ All important pages included
- ☐ lastmod dates are accurate
- ☐ No 404s in sitemap URLs
- ☐ Sitemap submitted to Google Search Console
Headers
- ☐ Pages return 200 status
- ☐ No authentication required for public pages
- ☐ Cache headers present
- ☐ No noindex on important pages
Structured Data
- ☐ Passes Google Rich Results Test
- ☐ All required properties present
- ☐ No errors in Schema.org Validator
- ☐ Schema matches visible content
Page Speed
- ☐ TTFB under 200ms
- ☐ LCP under 2.5s
- ☐ No render-blocking resources
- ☐ Images optimized
Content Accessibility
- ☐ Critical content in HTML (not just JavaScript)
- ☐ No content behind login walls
- ☐ No content in iframes
- ☐ Alt text on images
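Several of these checks can be scripted. The sketch below runs simple string checks against the raw text of your robots.txt and sitemap; the inputs here are sample strings, so swap in the real files fetched from your site:

```python
# Sketch: a few checklist items as automated checks. These are naive
# substring tests for illustration, not a full parser-based audit.
def readiness_report(robots_txt: str, sitemap_xml: str) -> dict:
    return {
        "gptbot_mentioned": "GPTBot" in robots_txt,
        "no_wildcard_block": "User-agent: *\nDisallow: /" not in robots_txt,
        "sitemap_listed": "Sitemap:" in robots_txt,
        "lastmod_present": "<lastmod>" in sitemap_xml,
    }

robots = "User-agent: GPTBot\nAllow: /\n\nSitemap: https://yoursite.com/sitemap.xml\n"
sitemap = ("<urlset><url><loc>https://yoursite.com/</loc>"
           "<lastmod>2026-01-26</lastmod></url></urlset>")
print(readiness_report(robots, sitemap))
```

Any `False` in the report points you back at the relevant part of this checklist.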
Part 7: Monitoring AI Crawler Activity and LLM Visibility
After setup, monitor to ensure crawlers are actually visiting and your AI visibility is improving.
Server Log Analysis
Look for these user agents in your logs:
```
GPTBot/1.0
ChatGPT-User/1.0
PerplexityBot/1.0
Google-Extended
Anthropic-AI
ClaudeBot/1.0
```

Log analysis command (Linux):

```bash
grep -E "GPTBot|PerplexityBot|ClaudeBot" /var/log/nginx/access.log
```

Google Search Console
Check the Crawl Stats report for Googlebot activity. While this doesn't show AI-specific crawlers, it indicates overall crawlability.
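To go one step past grep, a short script can count hits per AI crawler so you can watch crawl frequency over time. The log lines below are made-up samples in common log format; point the function at your real access log instead:

```python
# Sketch: count access-log hits per AI crawler user agent.
# `sample` is fabricated demo data; read your real log file in practice.
from collections import Counter

AI_BOTS = ["GPTBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot",
           "Google-Extended", "Anthropic-AI"]

def count_ai_hits(lines) -> Counter:
    """Tally lines mentioning each known AI crawler user agent."""
    hits = Counter()
    for line in lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

sample = [
    '1.2.3.4 - - [26/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [26/Jan/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 200 900 "-" "PerplexityBot/1.0"',
    '9.9.9.9 - - [26/Jan/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
]
print(count_ai_hits(sample))  # GPTBot: 2, PerplexityBot: 1
```

Run it weekly (e.g. `count_ai_hits(open("/var/log/nginx/access.log"))`) and falling counts become an early warning that something in your configuration broke.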
Recomaze Monitoring
Recomaze tracks your AI visibility over time and alerts you to issues.
Quick Reference Card
Save this for quick reference:
| Task | Check | Tool |
|---|---|---|
| robots.txt | /robots.txt accessible | Browser |
| AI crawler access | GPTBot allowed | robots.txt Tester |
| Sitemap | Valid XML, no 404s | Google Search Console |
| Schema | No errors | Rich Results Test |
| Page speed | LCP < 2.5s | PageSpeed Insights |
| Overall AI readiness | Score 80+ | Recomaze Audit |
Understanding GEO, LLMO, and AEO: What's the Difference?
You'll hear these terms used interchangeably. Here's what each actually means:
GEO (Generative Engine Optimization)
The practice of optimizing content for generative AI systems like ChatGPT, Perplexity, and Google AI Overviews. Focuses on getting your content cited and recommended in AI-generated responses.
LLMO (Large Language Model Optimization)
A broader term covering optimization for all LLMs. While GEO focuses on search-oriented AI, LLMO includes optimization for any AI system that might reference your content, including Claude, Gemini, and enterprise AI tools.
AEO (Answer Engine Optimization)
Specifically targets AI systems that provide direct answers, like Google's AI Overviews and featured snippets. AEO emphasizes structured answers, FAQ content, and clear formatting that AI can extract and display.
The Bottom Line:
All three share the same core requirements: clear content structure, proper schema markup, accessible crawling, and authoritative information. This checklist covers the technical foundation for all of them.
AI Readiness Score: What to Aim For
After implementing this checklist, use an AI readiness checker to measure your progress:
| Score | Status | What It Means |
|---|---|---|
| 0-40 | Critical | AI crawlers likely blocked or major issues |
| 41-60 | Needs Work | Basic access works, optimization gaps |
| 61-80 | Good | Solid foundation, room for content improvement |
| 81-100 | Excellent | Well-optimized for AI visibility |
What to Do After Setup
Configuration is just the beginning. Once your technical foundation is solid:
Tools for Ongoing AI Visibility Monitoring
Stay on top of your LLM visibility with these approaches:
- Recomaze AI Audit - Free AI readiness checker with detailed recommendations
- Manual testing - Ask ChatGPT and Perplexity questions about your industry and track mentions
- Server logs - Monitor GPTBot and PerplexityBot crawl frequency
- Search Console - Track overall crawlability and indexing
Need Help With Your AI SEO Audit?
Technical setup can be tricky. If you're not sure whether your configuration is correct:
Run a free Recomaze AI readiness audit - We'll check your robots.txt, sitemap, schema, and overall AI visibility in under a minute.
Get your baseline AI readiness score, identify GEO issues, and use this checklist to fix them.
Whether you call it GEO, LLMO, or AEO, the goal is the same: make your content visible to AI. Start with this checklist, measure your progress, and keep optimizing.
