The robots.txt file is a text file placed in the root directory of a website that tells search engine robots (crawlers) which parts of the site can be crawled and which should be skipped. It’s a fundamental element of technical SEO and crawl budget optimization.

How Does robots.txt Work?

The robots.txt file uses the Robots Exclusion Protocol (REP) - a standard created in 1994, formalized as RFC 9309 in 2022, and still honored by all major search engines today.

How It Works

  1. A search engine robot (e.g., Googlebot) visits the site
  2. Before crawling, it requests https://example.com/robots.txt
  3. It parses the directives and follows them (or, if badly behaved, doesn't)
  4. It crawls the allowed pages according to the rules
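
The steps above can be sketched with Python's standard-library parser (urllib.robotparser). A real crawler would fetch https://example.com/robots.txt over the network; here the file's content is inlined for clarity:

```python
# Sketch of steps 2-4 using Python's standard urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())   # step 3: parse the directives

# Step 4: consult the rules before crawling each URL.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
```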

Important Caveat

Robots.txt is a request, not an enforcement mechanism. Well-behaved robots (Google, Bing) respect its directives, but:

  • Malicious bots may ignore them
  • Blocked URLs can still appear in results (if linked to)
  • This is not a tool for protecting private data

Basic Syntax

User-agent: *
Allow: /
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml

robots.txt File Elements

Directive | Description | Example
--- | --- | ---
User-agent | Specifies the robot (or * for all) | User-agent: Googlebot
Allow | Allows crawling of a path | Allow: /public/
Disallow | Blocks crawling of a path | Disallow: /admin/
Sitemap | URL of the XML sitemap | Sitemap: https://example.com/sitemap.xml
Crawl-delay | Delay between requests (ignored by Google) | Crawl-delay: 10

Syntax Rules

  • Paths are case-sensitive
  • A block starts with one or more User-agent lines, followed by the directives that apply to them
  • A blank line separates blocks
  • Comments start with #
  • Rule order doesn't matter to Google - the most specific (longest) matching rule wins

Directives in Detail

User-agent

Specifies which robot the following rules apply to:

# All robots
User-agent: *

# Only Googlebot
User-agent: Googlebot

# Only Googlebot Images
User-agent: Googlebot-Image

# Bing
User-agent: Bingbot

Popular search engine robots:

Robot | Search Engine/Service
--- | ---
Googlebot | Google Search
Googlebot-Image | Google Images
Googlebot-News | Google News
Bingbot | Bing Search
Slurp | Yahoo
DuckDuckBot | DuckDuckGo
Yandex | Yandex

SEO tool robots:

Robot | Tool
--- | ---
AhrefsBot | Ahrefs (link analysis)
MJ12bot | Majestic (link analysis)
SemrushBot | Semrush (SEO analysis)
DotBot | Moz (SEO analysis)
BLEXBot | Webmeup (backlink explorer)
SeznamBot | Seznam (Czech search engine)
PetalBot | Huawei (search engine)

Social media robots:

Robot | Platform
--- | ---
facebot | Facebook
Twitterbot | Twitter/X
LinkedInBot | LinkedIn
Pinterestbot | Pinterest

AI robots:

Robot | Service
--- | ---
GPTBot | OpenAI/ChatGPT
ChatGPT-User | ChatGPT browsing
CCBot | Common Crawl (AI training data)
anthropic-ai | Anthropic/Claude
Google-Extended | Google AI (Gemini)
Bytespider | ByteDance/TikTok AI
ClaudeBot | Anthropic Claude

Disallow

Blocks crawling of specified paths:

# Block directory
Disallow: /admin/

# Block specific file
Disallow: /private-page.html

# Block everything
Disallow: /

# Empty value = allow everything
Disallow:

Allow

Allows crawling (overrides Disallow):

# Block directory, but allow specific file
Disallow: /folder/
Allow: /folder/public-file.html
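
This exception pattern can be checked with the standard-library parser. One caveat: urllib.robotparser applies rules in file order (first match wins) rather than Google's longest-match semantics, so in this sketch the Allow line is listed before the Disallow:

```python
# Demonstrating an Allow exception inside a blocked directory.
# Note: the stdlib parser uses first-match-in-order, unlike Google's
# longest-match rule, so Allow is deliberately listed first here.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /folder/public-file.html
Disallow: /folder/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/folder/public-file.html"))  # True
print(parser.can_fetch("*", "https://example.com/folder/secret.html"))       # False
```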

Wildcards

Google and Bing support wildcards:

# * = any sequence of characters
Disallow: /private-*/

# $ = end of URL
Disallow: /*.pdf$
Disallow: /*.doc$

# Combination
Disallow: /*?sessionid=

Usage examples:

# Block all URL parameters
Disallow: /*?

# Block sorting and filters
Disallow: /*?sort=
Disallow: /*?filter=

# Block search result pages
Disallow: /search
Disallow: /*?s=
Disallow: /*?q=
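
The standard-library parser does not implement these wildcards, but the matching logic is easy to sketch by translating a rule into a regular expression. rule_matches below is a hypothetical helper for illustration, not part of any library:

```python
# How Google-style wildcards map onto regular expressions:
# '*' matches any character sequence; a trailing '$' anchors the rule
# to the end of the URL; otherwise the rule is a prefix match.
import re

def rule_matches(rule: str, path: str) -> bool:
    """Return True if a robots.txt path rule matches the given URL path."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-len(r"\$")] + "$"  # anchor at end of URL
    return re.match(pattern, path) is not None  # rules match from the start

print(rule_matches("/*.pdf$", "/files/report.pdf"))      # True
print(rule_matches("/*.pdf$", "/files/report.pdf?v=2"))  # False ($ anchor)
print(rule_matches("/*?sort=", "/products?sort=price"))  # True
```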

Sitemap

Specifies the location of the XML sitemap:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml

  • Multiple Sitemap directives are allowed
  • Use the full, absolute URL (including https://)
  • A sitemap can also be submitted through Search Console
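
Declared sitemaps can be read back programmatically with the standard-library parser (RobotFileParser.site_maps(), available since Python 3.8):

```python
# Reading Sitemap directives with urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())
print(parser.site_maps())
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-posts.xml']
```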

Crawl-delay

Specifies minimum time (in seconds) between requests:

User-agent: Bingbot
Crawl-delay: 10

Note: Google ignores Crawl-delay and manages crawl rate automatically; the legacy crawl-rate setting in Search Console has been retired.
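
For bots that do honor it, the value can be read with the standard-library parser (RobotFileParser.crawl_delay()):

```python
# Reading Crawl-delay back with urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Bingbot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())
print(parser.crawl_delay("Bingbot"))  # 10
```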

Configuration Examples for Different Platforms

WordPress

User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /*?s=
Disallow: /*?p=
Disallow: /category/*?
Disallow: /tag/*?

Sitemap: https://example.com/sitemap_index.xml

E-commerce (WooCommerce, Shopify)

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?add-to-cart=
Disallow: /*?orderby=
Disallow: /*?filter=
Allow: /wp-content/uploads/

Sitemap: https://example.com/sitemap_index.xml

Astro / Static Sites

User-agent: *
Allow: /

# Block technical pages
Disallow: /404/
Disallow: /_astro/

Sitemap: https://example.com/sitemap-index.xml

Blocking AI Bots

In 2024+, many sites block AI bots due to content scraping:

# Block AI bots
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

# Other robots - allowed
User-agent: *
Allow: /

Note: Google-Extended only controls content use for AI training (Bard/Gemini), doesn’t affect Google Search.

Blocking SEO Tool Bots

Some sites block SEO tool bots to make it harder for competitors to analyze their link profile:

# Block competitor SEO tools
User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: BLEXBot
Disallow: /

Disadvantages of blocking SEO bots:

  • Makes it harder to analyze your own site in these tools
  • Doesn’t fully block - historical data remains
  • Competitors can use other methods (e.g., proxies)

When to consider blocking:

  • When you want to hide your link strategy
  • For protection against negative SEO
  • For sites with sensitive structure

robots.txt vs meta robots

Aspect | robots.txt | meta robots
--- | --- | ---
Control level | Directories, files | Individual pages
Location | Root directory | <head> tag or HTTP header
Flexibility | Less | More
Blocks crawling | Yes | No (the page is crawled)
Blocks indexing | No* | Yes (noindex)

*Important: robots.txt blocks crawling, but the page can still be indexed if linked from other sources.

How to Effectively Block Indexing?

If you want a page not to appear in search results, use meta robots:

<meta name="robots" content="noindex, nofollow">

Or HTTP header:

X-Robots-Tag: noindex, nofollow

Do not block a page in robots.txt if you want to noindex it - Googlebot must be able to read it to see the noindex directive.

Crawl Budget Optimization

Crawl budget is the number of pages Google is willing to crawl in a given time. For large sites (100k+ URLs), robots.txt optimization is crucial.

What to Block

# Pages without SEO value
Disallow: /thank-you/
Disallow: /confirmation/

# Duplicate content
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

# Internal resources
Disallow: /internal/
Disallow: /staging/

# Internal search
Disallow: /search
Disallow: /*?q=

What NOT to Block

  • CSS and JavaScript - Google needs them for rendering
  • Images - unless you intentionally want to exclude them from Google Images
  • Important pages - check carefully before blocking

More about crawl optimization in the context of Core Web Vitals.

Common Mistakes

1. Accidentally Blocking the Entire Site

# WRONG - blocks everything!
User-agent: *
Disallow: /

2. Missing robots.txt File

If the file doesn't exist, robots assume everything is allowed. Always create one, even a minimal file:

User-agent: *
Allow: /

3. Blocking Resources Needed for Rendering

# WRONG - blocks CSS/JS needed for rendering
Disallow: /css/
Disallow: /js/

4. Expecting robots.txt to Protect Data

Robots.txt is not a security measure. Protect sensitive data through:

  • Authorization (login/password)
  • .htaccess / nginx config
  • Proper server permissions

5. Conflicting Rules

# Conflicting - which rule wins?
Disallow: /folder/
Allow: /folder/

Google chooses the most specific rule. In case of equal length, Allow wins.
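
This precedence can be sketched as a standalone function (is_allowed is a hypothetical helper for illustration, not a library API):

```python
# Sketch of Google's precedence: among all rules matching a path,
# the longest one wins; on a length tie, Allow beats Disallow.
def is_allowed(rules, path):
    """rules: list of (kind, prefix) pairs, kind in {'allow', 'disallow'}."""
    matching = [(len(prefix), kind) for kind, prefix in rules
                if path.startswith(prefix)]
    if not matching:
        return True  # no rule applies -> crawling is allowed
    # Longest prefix wins; on a tie, 'allow' (True > False) wins.
    _, kind = max(matching, key=lambda m: (m[0], m[1] == "allow"))
    return kind == "allow"

rules = [("disallow", "/folder/"), ("allow", "/folder/")]
print(is_allowed(rules, "/folder/page.html"))  # True - equal length, Allow wins

rules = [("disallow", "/folder/"), ("allow", "/folder/public-file.html")]
print(is_allowed(rules, "/folder/public-file.html"))  # True - longer Allow wins
print(is_allowed(rules, "/folder/secret.html"))       # False
```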

6. Blocking Pages with noindex

# WRONG - Googlebot won't see noindex
Disallow: /private-page/

# On the page:
# <meta name="robots" content="noindex">

If you block a page in robots.txt, Googlebot won’t read it and won’t see the noindex directive.

Testing and Validation

Google Search Console

  1. Go to Search Console
  2. Use URL Inspection to check whether a specific URL is blocked by robots.txt
  3. Review the robots.txt report (under Settings) - it replaced the legacy robots.txt Tester

Online Tools

  • Google Search Console - official tool
  • Bing Webmaster Tools - for Bingbot
  • robots.txt Checker - online validators

Manual Testing

curl https://example.com/robots.txt

Where to Place the File?

The file must be in the root directory of the domain:

https://example.com/robots.txt       ✓ Correct
https://example.com/folder/robots.txt  ✗ Ignored

For subdomains, a separate file is needed:

https://example.com/robots.txt
https://blog.example.com/robots.txt
https://shop.example.com/robots.txt

robots.txt and Security

Don’t Reveal Sensitive Paths

# WRONG - informing attackers about admin panel
Disallow: /super-secret-admin-panel/
Disallow: /backup-database/

Attackers can read robots.txt and discover hidden paths. Instead of blocking in robots.txt, secure these resources with authorization.

Proper Approach

  • Public pages → robots.txt for crawl control
  • Private data → authorization + firewall
  • Staging/dev → IP whitelist or HTTP auth

Summary

A properly configured robots.txt file:

  • Optimizes crawl budget - robots focus on important pages
  • Keeps crawlers out of duplicate content - filters, sorting, URL parameters
  • Controls bot access - including AI bots
  • Points to the sitemap - facilitates content discovery

However, remember that:

  • Robots.txt is not a data security measure
  • Does not guarantee removal from index
  • Is respected by good bots, but ignored by malicious ones

Regularly check your robots.txt file and adjust it to changes in site structure. Combined with meta robots and proper server configuration, it’s an important element of White Hat SEO strategy.

Sources

  1. Google Search Central - robots.txt https://developers.google.com/search/docs/crawling-indexing/robots/intro

  2. Google Search Central - Create and submit a robots.txt file https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt

  3. Google Search Central - robots.txt Specifications https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt

  4. Bing Webmaster Guidelines - robots.txt https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec

  5. robotstxt.org - Original Specification https://www.robotstxt.org/