
AI Crawler Setup Checklist: Configure Your Site for GEO, LLMO & AI Visibility

Complete technical checklist for AI SEO audit and AI readiness. Configure GPTBot, PerplexityBot, and Google-Extended for maximum LLM visibility and ChatGPT optimization.

Jipianu Adin-Daniel · 11 min read

CTO & Co-Founder at Recomaze. AI and ecommerce expert with years of experience in search technology, generative engine optimization (GEO), and AI visibility strategies. Specialist in helping ecommerce businesses get discovered and recommended by AI assistants like ChatGPT, Perplexity, and Google AI.

Ok So You Know How AI Crawlers Work. Now Let's Actually Set This Up.

The previous article covered how AI crawlers read your website. Now let's get practical.

Whether you care about GEO, LLMO, or AEO (honestly these terms overlap a lot), this checklist will help you configure your site for AI visibility.

We'll cover robots.txt, sitemaps, HTTP headers, schema validation, and how to check if your AI readiness is actually improving.

Bookmark this page. You'll come back to it.


Why Bother With All This

What we're optimizing for:

  • Getting recommended when people ask ChatGPT questions
  • Showing up as a source in Perplexity answers
  • Being featured in Google's AI summaries
  • Making sure AI models can actually read and cite your stuff

A proper AI readiness audit checks all of this. This checklist helps you fix what those audits find.


Part 1: robots.txt Configuration

Your robots.txt is the first thing AI crawlers check. Mess this up and your whole AI visibility effort fails before it starts.

The AI Crawlers You Need to Know (2026 Complete List)

| Crawler | Company | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | Powers ChatGPT's web browsing and training |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT users |
| Google-Extended | Google | Gemini and AI Overviews |
| Googlebot | Google | Traditional search (also feeds AI) |
| PerplexityBot | Perplexity | Perplexity AI search engine |
| Anthropic-AI | Anthropic | Claude's web capabilities |
| ClaudeBot | Anthropic | Claude's browsing feature |
| Bytespider | ByteDance | AI features across ByteDance products |
| Applebot-Extended | Apple | Apple Intelligence and Siri |
| Meta-ExternalAgent | Meta | Meta AI features |
| Cohere-AI | Cohere | Enterprise AI applications |

Recommended robots.txt for Maximum LLM Visibility

# Allow all AI crawlers for maximum visibility
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Anthropic-AI
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Cohere-AI
Allow: /

# Standard search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Default rule for all other bots
User-agent: *
Allow: /

# Sitemap location
Sitemap: https://yoursite.com/sitemap.xml

If You Want to Block Specific Crawlers

Maybe you want AI recommendations but don't want your content used for training. Here's a selective approach:

# Allow browsing, block training (where supported)
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /

Note: This is a trade-off. Blocking GPTBot may reduce your visibility in ChatGPT recommendations.

Common robots.txt Mistakes

Mistake 1: Accidental wildcard blocks

# BAD - blocks everything including AI crawlers
User-agent: *
Disallow: /

Mistake 2: Blocking important directories

# BAD - blocks product pages from AI
User-agent: *
Disallow: /products/

Mistake 3: Conflicting rules

# AMBIGUOUS - parsers differ: Google applies the most specific rule,
# others read top to bottom, so be explicit
User-agent: GPTBot
Allow: /blog/
Disallow: /

Verify Your robots.txt

  • Visit yoursite.com/robots.txt directly
  • Use Google's robots.txt Tester
  • Run a Recomaze audit to check AI-specific access
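If you'd rather script the check, Python's standard library parses robots.txt with the same matching rules crawlers use. A minimal sketch — the agent list and `https://yoursite.com` are illustrative placeholders:

```python
# Check which AI crawlers a robots.txt allows, using Python's built-in parser.
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ChatGPT-User", "Google-Extended",
             "PerplexityBot", "ClaudeBot"]

def check_ai_access(robots_txt: str, site: str = "https://yoursite.com") -> dict:
    """Return {agent: allowed?} for each AI crawler, given robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    # can_fetch() applies the same user-agent matching rules crawlers use
    return {agent: rp.can_fetch(agent, site + "/") for agent in AI_AGENTS}
```

Fetch yoursite.com/robots.txt (for example with urllib.request), pass its text in, and any False result means that crawler is blocked.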

    Part 2: Sitemap Configuration

    Sitemaps tell crawlers what pages exist and when they were updated. AI crawlers use this to prioritize what to index.

    Basic Sitemap Structure

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://yoursite.com/</loc>
        <lastmod>2026-01-26</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
      </url>
      <url>
        <loc>https://yoursite.com/products/widget</loc>
        <lastmod>2026-01-25</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>

    Sitemap Best Practices for AI

    Keep lastmod accurate

    AI crawlers use lastmod to decide whether to re-crawl. If it hasn't changed, they might skip it.

    <!-- Update this whenever content actually changes -->
    <lastmod>2026-01-26</lastmod>

    Prioritize important pages

    Use priority values to signal what matters most:

    • Homepage: 1.0
    • Category pages: 0.8
    • Product pages: 0.7
    • Blog posts: 0.6
    • Utility pages: 0.3

    Use sitemap index for large sites

    If you have more than 50,000 URLs, split into multiple sitemaps:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>https://yoursite.com/sitemap-products.xml</loc>
        <lastmod>2026-01-26</lastmod>
      </sitemap>
      <sitemap>
        <loc>https://yoursite.com/sitemap-blog.xml</loc>
        <lastmod>2026-01-26</lastmod>
      </sitemap>
    </sitemapindex>

    Generate Sitemaps Automatically

    Shopify: Built-in at yourstore.com/sitemap.xml

    WordPress: Use Yoast SEO or RankMath plugins

    Next.js: Use next-sitemap package

    Custom sites: Generate dynamically from your database/CMS
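To sanity-check a generated sitemap, you can parse it with the standard library and inspect each URL's lastmod before submitting it. A minimal sketch:

```python
# Extract (loc, lastmod) pairs from sitemap XML so you can spot missing or
# stale lastmod dates, or feed the URLs to a 404 checker.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text: str):
    """Return a list of (loc, lastmod) tuples; lastmod is None if absent."""
    root = ET.fromstring(xml_text)
    return [
        (url.findtext("sm:loc", namespaces=NS),
         url.findtext("sm:lastmod", namespaces=NS))
        for url in root.findall("sm:url", NS)
    ]
```

Any entry with a None lastmod is a page crawlers can't prioritize by freshness.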


    Part 3: HTTP Headers for AI Crawlers

    HTTP headers give crawlers instructions about how to handle your pages.

    Essential Headers

    Cache-Control

    Tell crawlers how fresh your content is:

    Cache-Control: public, max-age=3600

    This says "content is public, cache for 1 hour."

    Last-Modified

    When the content was last changed:

    Last-Modified: Sun, 26 Jan 2026 10:00:00 GMT

    Crawlers use this to avoid re-fetching unchanged content.
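Here's how that works on the wire: a crawler sends the stored Last-Modified value back as If-Modified-Since, and an unchanged page answers 304 Not Modified with an empty body. A minimal sketch of that conditional fetch:

```python
# Conditional GET: re-check a page without re-downloading it.
import urllib.error
import urllib.request

def fetch_if_changed(url: str, last_seen: str):
    """Return page bytes if modified since last_seen, else None."""
    req = urllib.request.Request(url, headers={"If-Modified-Since": last_seen})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # unchanged - nothing to re-crawl
        raise
```

If your server never answers 304, crawlers pay full download cost on every visit.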

    X-Robots-Tag

    Page-level crawler instructions (alternative to meta robots):

    X-Robots-Tag: index, follow

    For crawler-specific control (note: not every AI crawler honors targeted X-Robots-Tag directives):

    X-Robots-Tag: googlebot: index, follow
    X-Robots-Tag: gptbot: noindex

    Headers to Avoid

    Don't block with authentication:

    # BAD - crawlers can't authenticate
    WWW-Authenticate: Basic realm="Protected"

    Don't use aggressive rate limiting:

    AI crawlers are generally polite, but overly aggressive rate limiting might cause them to give up.
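The header checks in this section are easy to automate. A minimal sketch that flags the problems above, given a page's response headers as a dict (header names are case-insensitive in HTTP, so normalize first):

```python
# Flag the header issues discussed above. Expects a dict of response
# headers, e.g. dict(urllib.request.urlopen(url).headers).
def audit_headers(headers: dict) -> list:
    h = {k.lower(): v for k, v in headers.items()}
    issues = []
    if "cache-control" not in h and "last-modified" not in h:
        issues.append("no freshness headers (Cache-Control / Last-Modified)")
    if "noindex" in h.get("x-robots-tag", "").lower():
        issues.append("X-Robots-Tag contains noindex")
    if "www-authenticate" in h:
        issues.append("WWW-Authenticate present - crawlers cannot log in")
    return issues
```

An empty list means the page passes these particular checks; it says nothing about content quality.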


    Part 4: Structured Data Validation for GEO

    Schema markup is critical for Generative Engine Optimization. It helps AI understand your content with zero ambiguity. But broken schema is worse than no schema.

    Validation Tools

    Google Rich Results Test

    Paste your URL, see what structured data Google detects.

    Schema.org Validator

    More detailed validation against the full schema.org spec.

    JSON-LD Playground

    Debug JSON-LD syntax issues.

    Common Schema Errors

    Missing required properties:

    {
      "@type": "Product"
      // ERROR: missing "name" property
    }

    Wrong data types:

    {
      "@type": "Product",
      "price": "one hundred dollars"  // ERROR: should be number
    }

    Invalid URLs:

    {
      "@type": "Product",
      "image": "logo.png"  // ERROR: should be full URL
    }
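A small script can catch the first and third error classes before a validator does. A minimal sketch — the required-property list is an illustrative subset, not the full schema.org Product spec:

```python
# Rough JSON-LD sanity check for the error classes shown above.
# REQUIRED is an illustrative subset of schema.org's Product properties.
import json

REQUIRED = {"Product": ["name", "image", "offers"]}

def check_jsonld(raw: str) -> list:
    data = json.loads(raw)
    errors = []
    for prop in REQUIRED.get(data.get("@type"), []):
        if prop not in data:
            errors.append("missing required property: " + prop)
    image = data.get("image")
    if isinstance(image, str) and not image.startswith("http"):
        errors.append("image is not a full URL: " + image)
    return errors
```

Run it over your JSON-LD blocks in CI so broken schema never ships, then confirm with the Rich Results Test.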

    Schema Checklist by Page Type

    Product Pages:

    • ☐ Product name
    • ☐ Description (50+ words)
    • ☐ Price with currency
    • ☐ Availability status
    • ☐ Brand
    • ☐ SKU or productID
    • ☐ Images (full URLs)
    • ☐ Reviews/ratings (if available)

    Article Pages:

    • ☐ Headline
    • ☐ Author with name
    • ☐ datePublished
    • ☐ dateModified
    • ☐ Publisher with logo
    • ☐ Image

    Organization (site-wide):

    • ☐ Name
    • ☐ URL
    • ☐ Logo
    • ☐ Contact information
    • ☐ Social profiles

    FAQ Pages:

    • ☐ Question text
    • ☐ Answer text (complete, not truncated)

    Part 5: Page Speed for Crawlers

    AI crawlers have timeouts. If your page takes too long to load, they get partial content or nothing.

    Speed Targets

    | Metric | Target | Why It Matters |
    | --- | --- | --- |
    | Time to First Byte | < 200ms | Crawler connection success |
    | First Contentful Paint | < 1.8s | Initial content visible |
    | Largest Contentful Paint | < 2.5s | Main content loaded |
    | Total page weight | < 2MB | Faster transfer |

    Quick Speed Wins

    1. Compress images

    • Use WebP format
    • Lazy load below-fold images
    • Serve appropriate sizes

    2. Minimize JavaScript

    • Remove unused code
    • Defer non-critical scripts
    • Avoid render-blocking JS

    3. Enable compression

    # In .htaccess or server config
    AddOutputFilterByType DEFLATE text/html text/css application/javascript

    4. Use a CDN

    Test Your Speed

    Use Google PageSpeed Insights to measure these metrics against the targets above.
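For a quick command-line sanity check of TTFB (dedicated tools are far more thorough), a minimal sketch using only the standard library:

```python
# Rough time-to-first-byte measurement. Network conditions vary, so treat
# single runs as a sanity check, not a benchmark.
import time
import urllib.request

def ttfb(url: str) -> float:
    """Seconds from request start until the first response byte arrives."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read(1)  # first byte received
    return time.monotonic() - start
```

Call it as `ttfb("https://yoursite.com/")` and average several runs; compare the result against the 200ms target.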


    Part 6: AI Readiness Verification Checklist

    Run through this AI readiness checklist after making changes:

    robots.txt

    • ☐ File accessible at /robots.txt
    • ☐ GPTBot is allowed (or intentionally blocked)
    • ☐ PerplexityBot is allowed
    • ☐ Google-Extended is allowed
    • ☐ No accidental wildcard blocks
    • ☐ Sitemap URL is listed

    Sitemap

    • ☐ Sitemap accessible at listed URL
    • ☐ All important pages included
    • ☐ lastmod dates are accurate
    • ☐ No 404s in sitemap URLs
    • ☐ Sitemap submitted to Google Search Console

    Headers

    • ☐ Pages return 200 status
    • ☐ No authentication required for public pages
    • ☐ Cache headers present
    • ☐ No noindex on important pages

    Structured Data

    • ☐ Passes Google Rich Results Test
    • ☐ All required properties present
    • ☐ No errors in Schema.org Validator
    • ☐ Schema matches visible content

    Page Speed

    • ☐ TTFB under 200ms
    • ☐ LCP under 2.5s
    • ☐ No render-blocking resources
    • ☐ Images optimized

    Content Accessibility

    • ☐ Critical content in HTML (not just JavaScript)
    • ☐ No content behind login walls
    • ☐ No content in iframes
    • ☐ Alt text on images

    Part 7: Monitoring AI Crawler Activity and LLM Visibility

    After setup, monitor to ensure crawlers are actually visiting and your AI visibility is improving.

    Server Log Analysis

    Look for these user agents in your logs:

    GPTBot/1.0
    ChatGPT-User/1.0
    PerplexityBot/1.0
    Google-Extended
    Anthropic-AI
    ClaudeBot/1.0

    Log analysis command (Linux):

    grep -E "GPTBot|PerplexityBot|ClaudeBot" /var/log/nginx/access.log
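If you'd rather aggregate than eyeball grep output, here's a small Python equivalent that counts hits per crawler (the bot list mirrors the agents above):

```python
# Count access-log hits per AI crawler. Matches on substring, which is how
# these bots identify themselves in the User-Agent field.
from collections import Counter

AI_BOTS = ["GPTBot", "ChatGPT-User", "PerplexityBot",
           "Google-Extended", "ClaudeBot"]

def count_ai_hits(log_lines) -> Counter:
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits
```

Usage: `count_ai_hits(open("/var/log/nginx/access.log"))`. A crawler stuck at zero for weeks suggests it's blocked or hasn't discovered your site.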

    Google Search Console

    Check the Crawl Stats report for Googlebot activity. While this doesn't show AI-specific crawlers, it indicates overall crawlability.

    Recomaze Monitoring

    Recomaze tracks your AI visibility over time and alerts you to issues.


    Quick Reference Card

    Save this for quick reference:

    | Task | Check | Tool |
    | --- | --- | --- |
    | robots.txt | /robots.txt accessible | Browser |
    | AI crawler access | GPTBot allowed | robots.txt Tester |
    | Sitemap | Valid XML, no 404s | Google Search Console |
    | Schema | No errors | Rich Results Test |
    | Page speed | LCP < 2.5s | PageSpeed Insights |
    | Overall AI readiness | Score 80+ | Recomaze Audit |

    ---

    Understanding GEO, LLMO, and AEO: What's the Difference?

    You'll hear these terms used interchangeably. Here's what each actually means:

    GEO (Generative Engine Optimization)

    The practice of optimizing content for generative AI systems like ChatGPT, Perplexity, and Google AI Overviews. Focuses on getting your content cited and recommended in AI-generated responses.

    LLMO (Large Language Model Optimization)

    A broader term covering optimization for all LLMs. While GEO focuses on search-oriented AI, LLMO includes optimization for any AI system that might reference your content, including Claude, Gemini, and enterprise AI tools.

    AEO (Answer Engine Optimization)

    Specifically targets AI systems that provide direct answers, like Google's AI Overviews and featured snippets. AEO emphasizes structured answers, FAQ content, and clear formatting that AI can extract and display.

    The Bottom Line:

    All three share the same core requirements: clear content structure, proper schema markup, accessible crawling, and authoritative information. This checklist covers the technical foundation for all of them.


    AI Readiness Score: What to Aim For

    After implementing this checklist, use an AI readiness checker to measure your progress:

    | Score | Status | What It Means |
    | --- | --- | --- |
    | 0-40 | Critical | AI crawlers likely blocked or major issues |
    | 41-60 | Needs Work | Basic access works, optimization gaps |
    | 61-80 | Good | Solid foundation, room for content improvement |
    | 81-100 | Excellent | Well-optimized for AI visibility |

    Most sites start in the 40-60 range. Following this checklist should get you to 70+.


    What to Do After Setup

    Configuration is just the beginning. Once your technical foundation is solid:

  • Create AI-friendly content - Answer questions directly, use clear structure, include statistics and quotes
  • Build entity clarity - Ensure AI understands who you are and what you represent
  • Optimize for answer engines - Structure content with clear headings, bullet points, and direct answers
  • Monitor your AI visibility - Track mentions across ChatGPT, Perplexity, and Google AI Overviews
  • Iterate based on data - Run regular AI SEO audits to identify new opportunities

    Tools for Ongoing AI Visibility Monitoring

    Stay on top of your LLM visibility with these approaches:

    • Recomaze AI Audit - Free AI readiness checker with detailed recommendations
    • Manual testing - Ask ChatGPT and Perplexity questions about your industry and track mentions
    • Server logs - Monitor GPTBot and PerplexityBot crawl frequency
    • Search Console - Track overall crawlability and indexing

    Need Help With Your AI SEO Audit?

    Technical setup can be tricky. If you're not sure whether your configuration is correct:

    Run a free Recomaze AI readiness audit - We'll check your robots.txt, sitemap, schema, and overall AI visibility in under a minute.

    Get your baseline AI readiness score, identify GEO issues, and use this checklist to fix them.

    Whether you call it GEO, LLMO, or AEO, the goal is the same: make your content visible to AI. Start with this checklist, measure your progress, and keep optimizing.

    AI Crawlers
    GEO
    LLMO
    AI SEO Audit
    AI Readiness
    ChatGPT Optimization
    AI Visibility

    Check Your AI Readiness

    Get a free audit of your website's GEO optimization and AI visibility.
