The robots.txt file is a text file placed in the root directory of a website that tells search engine robots (crawlers) which parts of the site can be crawled and which should be skipped. It’s a fundamental element of technical SEO and crawl budget optimization.

How Does robots.txt Work?

The robots.txt file uses the Robots Exclusion Protocol (REP) - a standard created in 1994, formalized as RFC 9309 in 2022, and still honored by all major search engines today.

How It Works

  1. A search engine robot (e.g., Googlebot) visits the site
  2. Before crawling, it requests https://example.com/robots.txt
  3. It parses the directives and follows them (or, if badly behaved, doesn't)
  4. It crawls the allowed pages according to the rules
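
The steps above can be sketched with Python's standard-library parser (urllib.robotparser). A real crawler would fetch https://example.com/robots.txt over the network; here the file's content is inlined for clarity:

```python
# Sketch of steps 2-4 using Python's standard urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())   # step 3: parse the directives

# Step 4: consult the rules before crawling each URL.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
```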

Important Caveat

Robots.txt is a request, not an enforcement mechanism. Well-behaved robots (Google, Bing) respect its directives, but:

  • Malicious bots may ignore them
  • Blocked URLs can still appear in results (if linked to)
  • This is not a tool for protecting private data

Basic Syntax

User-agent: *
Allow: /
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml

robots.txt File Elements

Directive | Description | Example
--- | --- | ---
User-agent | Specifies the robot (or * for all) | User-agent: Googlebot
Allow | Allows crawling of a path | Allow: /public/
Disallow | Blocks crawling of a path | Disallow: /admin/
Sitemap | URL of the XML sitemap | Sitemap: https://example.com/sitemap.xml
Crawl-delay | Delay between requests (ignored by Google) | Crawl-delay: 10

Syntax Rules

  • Paths are case-sensitive
  • A block starts with one or more User-agent lines, followed by the directives that apply to them
  • A blank line separates blocks
  • Comments start with #
  • Rule order doesn't matter to Google - the most specific (longest) matching rule wins

Directives in Detail

User-agent

Specifies which robot the following rules apply to:

# All robots
User-agent: *

# Only Googlebot
User-agent: Googlebot

# Only Googlebot Images
User-agent: Googlebot-Image

# Bing
User-agent: Bingbot

Popular search engine robots:

Robot | Search Engine/Service
--- | ---
Googlebot | Google Search
Googlebot-Image | Google Images
Googlebot-News | Google News
Bingbot | Bing Search
Slurp | Yahoo
DuckDuckBot | DuckDuckGo
Yandex | Yandex

SEO tool robots:

Robot | Tool
--- | ---
AhrefsBot | Ahrefs (link analysis)
MJ12bot | Majestic (link analysis)
SemrushBot | Semrush (SEO analysis)
DotBot | Moz (SEO analysis)
BLEXBot | Webmeup (backlink explorer)
SeznamBot | Seznam (Czech search engine)
PetalBot | Huawei (search engine)

Social media robots:

Robot | Platform
--- | ---
facebot | Facebook
Twitterbot | Twitter/X
LinkedInBot | LinkedIn
Pinterestbot | Pinterest

AI robots:

Robot | Service
--- | ---
GPTBot | OpenAI/ChatGPT
ChatGPT-User | ChatGPT browsing
CCBot | Common Crawl (AI training data)
anthropic-ai | Anthropic/Claude
Google-Extended | Google AI (Gemini)
Bytespider | ByteDance/TikTok AI
ClaudeBot | Anthropic Claude

Disallow

Blocks crawling of specified paths:

# Block directory
Disallow: /admin/

# Block specific file
Disallow: /private-page.html

# Block everything
Disallow: /

# Empty value = allow everything
Disallow:

Allow

Allows crawling (overrides Disallow):

# Block directory, but allow specific file
Disallow: /folder/
Allow: /folder/public-file.html
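
This exception pattern can be checked with the standard-library parser. One caveat: urllib.robotparser applies rules in file order (first match wins) rather than Google's longest-match semantics, so in this sketch the Allow line is listed before the Disallow:

```python
# Demonstrating an Allow exception inside a blocked directory.
# Note: the stdlib parser uses first-match-in-order, unlike Google's
# longest-match rule, so Allow is deliberately listed first here.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /folder/public-file.html
Disallow: /folder/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/folder/public-file.html"))  # True
print(parser.can_fetch("*", "https://example.com/folder/secret.html"))       # False
```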

Wildcards

Google and Bing support wildcards:

# * = any sequence of characters
Disallow: /private-*/

# $ = end of URL
Disallow: /*.pdf$
Disallow: /*.doc$

# Combination
Disallow: /*?sessionid=

Usage examples:

# Block all URL parameters
Disallow: /*?

# Block sorting and filters
Disallow: /*?sort=
Disallow: /*?filter=

# Block search result pages
Disallow: /search
Disallow: /*?s=
Disallow: /*?q=
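
The standard-library parser does not implement these wildcards, but the matching logic is easy to sketch by translating a rule into a regular expression. rule_matches below is a hypothetical helper for illustration, not part of any library:

```python
# How Google-style wildcards map onto regular expressions:
# '*' matches any character sequence; a trailing '$' anchors the rule
# to the end of the URL; otherwise the rule is a prefix match.
import re

def rule_matches(rule: str, path: str) -> bool:
    """Return True if a robots.txt path rule matches the given URL path."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-len(r"\$")] + "$"  # anchor at end of URL
    return re.match(pattern, path) is not None  # rules match from the start

print(rule_matches("/*.pdf$", "/files/report.pdf"))      # True
print(rule_matches("/*.pdf$", "/files/report.pdf?v=2"))  # False ($ anchor)
print(rule_matches("/*?sort=", "/products?sort=price"))  # True
```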

Sitemap

Specifies the location of the XML sitemap:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml

  • Multiple Sitemap directives are allowed
  • Use the full, absolute URL (including https://)
  • A sitemap can also be submitted through Search Console
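
Declared sitemaps can be read back programmatically with the standard-library parser (RobotFileParser.site_maps(), available since Python 3.8):

```python
# Reading Sitemap directives with urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())
print(parser.site_maps())
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-posts.xml']
```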

Crawl-delay

Specifies minimum time (in seconds) between requests:

User-agent: Bingbot
Crawl-delay: 10

Note: Google ignores Crawl-delay and manages crawl rate automatically; the legacy crawl-rate setting in Search Console has been retired.
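
For bots that do honor it, the value can be read with the standard-library parser (RobotFileParser.crawl_delay()):

```python
# Reading Crawl-delay back with urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Bingbot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())
print(parser.crawl_delay("Bingbot"))  # 10
```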

Configuration Examples for Different Platforms

WordPress

User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /*?s=
Disallow: /*?p=
Disallow: /category/*?
Disallow: /tag/*?

Sitemap: https://example.com/sitemap_index.xml

E-commerce (WooCommerce, Shopify)

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?add-to-cart=
Disallow: /*?orderby=
Disallow: /*?filter=
Allow: /wp-content/uploads/

Sitemap: https://example.com/sitemap_index.xml

Astro / Static Sites

User-agent: *
Allow: /

# Block technical pages
Disallow: /404/
Disallow: /_astro/

Sitemap: https://example.com/sitemap-index.xml

Blocking AI Bots

In 2024+, many sites block AI bots due to content scraping:

# Block AI bots
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

# Other robots - allowed
User-agent: *
Allow: /

Note: Google-Extended only controls content use for AI training (Bard/Gemini), doesn’t affect Google Search.

Blocking SEO Tool Bots

Some sites block SEO tool bots to make it harder for competitors to analyze their link profile:

# Block competitor SEO tools
User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: BLEXBot
Disallow: /

Disadvantages of blocking SEO bots:

  • Makes it harder to analyze your own site in these tools
  • Doesn’t fully block - historical data remains
  • Competitors can use other methods (e.g., proxies)

When to consider blocking:

  • When you want to hide your link strategy
  • For protection against negative SEO
  • For sites with sensitive structure

robots.txt vs meta robots

Aspect | robots.txt | meta robots
--- | --- | ---
Control level | Directories, files | Individual pages
Location | Root directory | <head> tag or HTTP header
Flexibility | Less | More
Blocks crawling | Yes | No (the page is crawled)
Blocks indexing | No* | Yes (noindex)

*Important: robots.txt blocks crawling, but the page can still be indexed if linked from other sources.

How to Effectively Block Indexing?

If you want a page not to appear in search results, use meta robots:

<meta name="robots" content="noindex, nofollow">

Or HTTP header:

X-Robots-Tag: noindex, nofollow

Do not block a page in robots.txt if you want to noindex it - Googlebot must be able to read it to see the noindex directive.

Crawl Budget Optimization

Crawl budget is the number of pages Google is willing to crawl in a given time. For large sites (100k+ URLs), robots.txt optimization is crucial.

What to Block

# Pages without SEO value
Disallow: /thank-you/
Disallow: /confirmation/

# Duplicate content
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

# Internal resources
Disallow: /internal/
Disallow: /staging/

# Internal search
Disallow: /search
Disallow: /*?q=

What NOT to Block

  • CSS and JavaScript - Google needs them for rendering
  • Images - unless you intentionally want to exclude them from Google Images
  • Important pages - check carefully before blocking

More about crawl optimization in the context of Core Web Vitals.

Common Mistakes

1. Accidentally Blocking the Entire Site

# WRONG - blocks everything!
User-agent: *
Disallow: /

2. Missing robots.txt File

If the file doesn't exist, robots assume everything is allowed. Always create one, even a minimal file:

User-agent: *
Allow: /

3. Blocking Resources Needed for Rendering

# WRONG - blocks CSS/JS needed for rendering
Disallow: /css/
Disallow: /js/

4. Expecting robots.txt to Protect Data

Robots.txt is not a security measure. Protect sensitive data through:

  • Authorization (login/password)
  • .htaccess / nginx config
  • Proper server permissions

5. Conflicting Rules

# Conflicting - which rule wins?
Disallow: /folder/
Allow: /folder/

Google chooses the most specific rule. In case of equal length, Allow wins.
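
This precedence can be sketched as a standalone function (is_allowed is a hypothetical helper for illustration, not a library API):

```python
# Sketch of Google's precedence: among all rules matching a path,
# the longest one wins; on a length tie, Allow beats Disallow.
def is_allowed(rules, path):
    """rules: list of (kind, prefix) pairs, kind in {'allow', 'disallow'}."""
    matching = [(len(prefix), kind) for kind, prefix in rules
                if path.startswith(prefix)]
    if not matching:
        return True  # no rule applies -> crawling is allowed
    # Longest prefix wins; on a tie, 'allow' (True > False) wins.
    _, kind = max(matching, key=lambda m: (m[0], m[1] == "allow"))
    return kind == "allow"

rules = [("disallow", "/folder/"), ("allow", "/folder/")]
print(is_allowed(rules, "/folder/page.html"))  # True - equal length, Allow wins

rules = [("disallow", "/folder/"), ("allow", "/folder/public-file.html")]
print(is_allowed(rules, "/folder/public-file.html"))  # True - longer Allow wins
print(is_allowed(rules, "/folder/secret.html"))       # False
```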

6. Blocking Pages with noindex

# WRONG - Googlebot won't see noindex
Disallow: /private-page/

# On the page:
# <meta name="robots" content="noindex">

If you block a page in robots.txt, Googlebot won’t read it and won’t see the noindex directive.

Testing and Validation

Google Search Console

  1. Go to Search Console
  2. Use URL Inspection to check whether a specific URL is blocked by robots.txt
  3. Review the robots.txt report (under Settings) - it replaced the legacy robots.txt Tester

Online Tools

  • Google Search Console - official tool
  • Bing Webmaster Tools - for Bingbot
  • robots.txt Checker - online validators

Manual Testing

curl https://example.com/robots.txt

Where to Place the File?

The file must be in the root directory of the domain:

https://example.com/robots.txt       ✓ Correct
https://example.com/folder/robots.txt  ✗ Ignored

For subdomains, a separate file is needed:

https://example.com/robots.txt
https://blog.example.com/robots.txt
https://shop.example.com/robots.txt

robots.txt and Security

Don’t Reveal Sensitive Paths

# WRONG - informing attackers about admin panel
Disallow: /super-secret-admin-panel/
Disallow: /backup-database/

Attackers can read robots.txt and discover hidden paths. Instead of blocking in robots.txt, secure these resources with authorization.

Proper Approach

  • Public pages → robots.txt for crawl control
  • Private data → authorization + firewall
  • Staging/dev → IP whitelist or HTTP auth

Summary

A properly configured robots.txt file:

  • Optimizes crawl budget - robots focus on important pages
  • Keeps crawlers out of duplicate content - filters, sorting, URL parameters
  • Controls bot access - including AI bots
  • Points to the sitemap - facilitates content discovery

However, remember that:

  • Robots.txt is not a data security measure
  • Does not guarantee removal from index
  • Is respected by good bots, but ignored by malicious ones

Regularly check your robots.txt file and adjust it to changes in site structure. Combined with meta robots and proper server configuration, it’s an important element of White Hat SEO strategy.

Sources

  1. Google Search Central - robots.txt https://developers.google.com/search/docs/crawling-indexing/robots/intro

  2. Google Search Central - Create and submit a robots.txt file https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt

  3. Google Search Central - robots.txt Specifications https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt

  4. Bing Webmaster Guidelines - robots.txt https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec

  5. robotstxt.org - Original Specification https://www.robotstxt.org/