The robots.txt file is a text file placed in the root directory of a website that tells search engine robots (crawlers) which parts of the site can be crawled and which should be skipped. It’s a fundamental element of technical SEO and crawl budget optimization.
How Does robots.txt Work?
The robots.txt file uses the Robots Exclusion Protocol (REP), a convention created in 1994 and formalized as RFC 9309 in 2022, which all major search engines follow today.
How It Works
- A search engine robot (e.g., Googlebot) visits the site
- Before crawling, it checks https://example.com/robots.txt
- It parses the directives and follows them (or doesn't)
- It crawls allowed pages according to the rules
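This check can be simulated with Python's standard-library robots.txt parser. The rules below are a hypothetical example; note that urllib.robotparser applies rules in file order rather than Google's longest-match precedence, so the sketch keeps the rules unambiguous:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com
rules = """\
User-agent: *
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Regular content is crawlable; anything under /admin/ is not
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/admin/users")) # False
```

In production the parser would fetch the live file via `rp.set_url(...)` and `rp.read()` instead of `parse()`.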
Important Caveat
Robots.txt is a recommendation, not a block. Well-behaved robots (Google, Bing) respect these directives, but:
- Malicious bots may ignore them
- Blocked URLs can still appear in results (if linked to)
- This is not a tool for protecting private data
Basic Syntax
User-agent: *
Allow: /
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
robots.txt File Elements
| Directive | Description | Example |
|---|---|---|
| User-agent | Specifies the robot (or * for all) | User-agent: Googlebot |
| Allow | Allows crawling of a path | Allow: /public/ |
| Disallow | Blocks crawling of a path | Disallow: /admin/ |
| Sitemap | URL of the XML sitemap | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Delay between requests (ignored by Google) | Crawl-delay: 10 |
Syntax Rules
- Paths are case-sensitive
- Each block starts with one or more User-agent lines, followed by its directives
- An empty line separates blocks
- Comments start with #
- For Google, order does not matter - the most specific (longest) matching rule wins
Directives in Detail
User-agent
Specifies which robot the following rules apply to:
# All robots
User-agent: *
# Only Googlebot
User-agent: Googlebot
# Only Googlebot Images
User-agent: Googlebot-Image
# Bing
User-agent: Bingbot
Popular search engine robots:
| Robot | Search Engine/Service |
|---|---|
| Googlebot | Google Search |
| Googlebot-Image | Google Images |
| Googlebot-News | Google News |
| Bingbot | Bing Search |
| Slurp | Yahoo |
| DuckDuckBot | DuckDuckGo |
| Yandex | Yandex |
SEO tool robots:
| Robot | Tool |
|---|---|
| AhrefsBot | Ahrefs (link analysis) |
| MJ12bot | Majestic (link analysis) |
| SemrushBot | Semrush (SEO analysis) |
| DotBot | Moz (SEO analysis) |
| BLEXBot | WebMeUp (backlink explorer) |
| SeznamBot | Seznam (Czech search engine) |
| PetalBot | Huawei (Petal Search) |
Social media robots:
| Robot | Platform |
|---|---|
| facebot | Facebook |
| Twitterbot | Twitter/X |
| LinkedInBot | LinkedIn |
| Pinterestbot | Pinterest |
AI robots:
| Robot | Service |
|---|---|
| GPTBot | OpenAI/ChatGPT |
| ChatGPT-User | ChatGPT browsing |
| CCBot | Common Crawl (AI training data) |
| anthropic-ai | Anthropic/Claude |
| ClaudeBot | Anthropic Claude |
| Google-Extended | Google AI (Gemini) |
| Bytespider | ByteDance/TikTok AI |
Disallow
Blocks crawling of specified paths:
# Block directory
Disallow: /admin/
# Block specific file
Disallow: /private-page.html
# Block everything
Disallow: /
# Empty value = allow everything
Disallow:
Allow
Allows crawling (overrides Disallow):
# Block directory, but allow specific file
Disallow: /folder/
Allow: /folder/public-file.html
Wildcards
Google and Bing support wildcards:
# * = any sequence of characters
Disallow: /private-*/
# $ = end of URL
Disallow: /*.pdf$
Disallow: /*.doc$
# Combination
Disallow: /*?sessionid=
Usage examples:
# Block all URL parameters
Disallow: /*?
# Block sorting and filters
Disallow: /*?sort=
Disallow: /*?filter=
# Block search result pages
Disallow: /search
Disallow: /*?s=
Disallow: /*?q=
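Python's standard-library robotparser does not implement these wildcard extensions, but the matching logic is easy to sketch by translating a pattern into a regular expression. This is a simplified illustration, not Google's exact matcher:

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern to a regex.

    '*' matches any character sequence; a trailing '$' anchors the
    pattern to the end of the URL. Matching is anchored at the start
    of the path, as in the robots.txt specification.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    return robots_pattern_to_regex(pattern).match(path) is not None

print(matches("/*.pdf$", "/files/report.pdf"))        # True
print(matches("/*.pdf$", "/files/report.pdf?v=2"))    # False ($ anchors the end)
print(matches("/*?sessionid=", "/page?sessionid=abc"))  # True
```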
Sitemap
Specifies the location of the XML sitemap:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
- Can have multiple Sitemap directives
- Use full URL (with https://)
- Sitemap can also be submitted through Search Console
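For programmatic access, urllib.robotparser (Python 3.8+) exposes parsed Sitemap lines via `site_maps()`. A quick sketch using the example.com sitemaps above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with multiple Sitemap directives
rules = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# site_maps() returns all Sitemap URLs, or None if there are none
print(rp.site_maps())
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-posts.xml']
```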
Crawl-delay
Specifies minimum time (in seconds) between requests:
User-agent: Bingbot
Crawl-delay: 10
Note: Google ignores Crawl-delay and adjusts its crawl rate automatically; the legacy crawl-rate limiter in Search Console was retired in 2024.
Configuration Examples for Different Platforms
WordPress
User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /*?s=
Disallow: /*?p=
Disallow: /category/*?
Disallow: /tag/*?
Sitemap: https://example.com/sitemap_index.xml
E-commerce (WooCommerce, Shopify)
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?add-to-cart=
Disallow: /*?orderby=
Disallow: /*?filter=
Allow: /wp-content/uploads/
Sitemap: https://example.com/sitemap_index.xml
Astro / Static Sites
User-agent: *
Allow: /
# Block technical pages
Disallow: /404/
Disallow: /_astro/
Sitemap: https://example.com/sitemap-index.xml
Blocking AI Bots
In 2024+, many sites block AI bots due to content scraping:
# Block AI bots
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
# Other robots - allowed
User-agent: *
Allow: /
Note: Google-Extended only controls content use for AI training (Bard/Gemini), doesn’t affect Google Search.
Blocking SEO Tool Bots
Some sites block SEO tool bots to make it harder for competitors to analyze their link profile:
# Block competitor SEO tools
User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: BLEXBot
Disallow: /
Disadvantages of blocking SEO bots:
- Makes it harder to analyze your own site in these tools
- Doesn’t fully block - historical data remains
- Competitors can use other methods (e.g., proxies)
When to consider blocking:
- When you want to hide your link strategy
- For protection against negative SEO
- For sites with sensitive structure
robots.txt vs meta robots
| Aspect | robots.txt | meta robots |
|---|---|---|
| Control level | Directories, files | Individual pages |
| Location | Root directory | <head> tag or HTTP header |
| Flexibility | Less | More |
| Blocks crawling | Yes | No (page is crawled) |
| Blocks indexing | No* | Yes (noindex) |
*Important: robots.txt blocks crawling, but the page can still be indexed if linked from other sources.
How to Effectively Block Indexing?
If you want a page not to appear in search results, use meta robots:
<meta name="robots" content="noindex, nofollow">
Or HTTP header:
X-Robots-Tag: noindex, nofollow
Do not block a page in robots.txt if you want to noindex it - Googlebot must be able to read it to see the noindex directive.
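As a minimal sketch, here is how the X-Robots-Tag header could be sent using Python's built-in http.server; the handler class and page content are purely illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    """Serves every page with an X-Robots-Tag: noindex header."""

    def do_GET(self):
        body = b"<html><body>Crawlable but not indexable</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        # Equivalent to <meta name="robots" content="noindex, nofollow">,
        # but also works for non-HTML resources like PDFs
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), NoIndexHandler).serve_forever()
```

In practice this header is usually set in the web server config (e.g., nginx `add_header`) rather than application code.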
Crawl Budget Optimization
Crawl budget is the number of pages Google is willing to crawl in a given time. For large sites (100k+ URLs), robots.txt optimization is crucial.
What to Block
# Pages without SEO value
Disallow: /thank-you/
Disallow: /confirmation/
# Duplicate content
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
# Internal resources
Disallow: /internal/
Disallow: /staging/
# Internal search
Disallow: /search
Disallow: /*?q=
What NOT to Block
- CSS and JavaScript - Google needs them for rendering
- Images - unless you intentionally want to exclude them from Google Images
- Important pages - check carefully before blocking
More about crawl optimization in the context of Core Web Vitals.
Common Mistakes
1. Accidentally Blocking the Entire Site
# WRONG - blocks everything!
User-agent: *
Disallow: /
2. Missing robots.txt File
If the file doesn't exist, robots assume everything is allowed. It's still good practice to create at least a minimal file:
User-agent: *
Allow: /
3. Blocking Resources Needed for Rendering
# WRONG - blocks CSS/JS needed for rendering
Disallow: /css/
Disallow: /js/
4. Expecting robots.txt to Protect Data
Robots.txt is not a security measure. Protect sensitive data through:
- Authorization (login/password)
- .htaccess / nginx config
- Proper server permissions
5. Conflicting Rules
# Conflicting - which rule wins?
Disallow: /folder/
Allow: /folder/
Google chooses the most specific (longest) matching rule. When Allow and Disallow match with equal specificity, Allow wins.
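This precedence logic can be sketched as follows, simplified to prefix matching without wildcard support:

```python
def google_precedence(rules, path):
    """Pick the winning rule for *path* using Google's documented
    precedence: the most specific (longest) matching rule wins, and
    on a tie the least restrictive (Allow) rule is used.

    rules: list of (directive, pattern) tuples, e.g. ("Allow", "/folder/").
    Returns True if crawling is allowed.
    """
    best = None  # (pattern_length, is_allow) - tuple comparison does the work
    for directive, pattern in rules:
        if path.startswith(pattern):  # simple prefix match, no wildcards
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

rules = [("Disallow", "/folder/"), ("Allow", "/folder/public-file.html")]
print(google_precedence(rules, "/folder/secret.html"))       # False
print(google_precedence(rules, "/folder/public-file.html"))  # True
```

On a tie, `(length, True)` sorts above `(length, False)`, which is exactly the "Allow wins" rule.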
6. Blocking Pages with noindex
# WRONG - Googlebot won't see noindex
Disallow: /private-page/
# On the page:
# <meta name="robots" content="noindex">
If you block a page in robots.txt, Googlebot won’t read it and won’t see the noindex directive.
Testing and Validation
Google Search Console
- Go to Search Console
- Use URL Inspection to check whether a specific URL is blocked by robots.txt
- Open the robots.txt report (Settings → robots.txt) to see fetch status and parse errors; it replaced the retired standalone Robots.txt Tester
Online Tools
- Google Search Console - official tool
- Bing Webmaster Tools - for Bingbot
- robots.txt Checker - online validators
Manual Testing
curl https://example.com/robots.txt
Where to Place the File?
The file must be in the root directory of the domain:
https://example.com/robots.txt ✓ Correct
https://example.com/folder/robots.txt ✗ Ignored
For subdomains, a separate file is needed:
https://example.com/robots.txt
https://blog.example.com/robots.txt
https://shop.example.com/robots.txt
robots.txt and Security
Don’t Reveal Sensitive Paths
# WRONG - informing attackers about admin panel
Disallow: /super-secret-admin-panel/
Disallow: /backup-database/
Attackers can read robots.txt and discover hidden paths. Instead of blocking in robots.txt, secure these resources with authorization.
Proper Approach
- Public pages → robots.txt for crawl control
- Private data → authorization + firewall
- Staging/dev → IP whitelist or HTTP auth
Summary
A properly configured robots.txt file:
- Optimizes crawl budget - robots focus on important pages
- Prevents indexing duplicates - filters, sorting, parameters
- Controls bot access - including AI bots
- Points to sitemap - facilitates content discovery
However, remember that:
- Robots.txt is not a data security measure
- Does not guarantee removal from index
- Is respected by good bots, but ignored by malicious ones
Regularly check your robots.txt file and adjust it to changes in site structure. Combined with meta robots and proper server configuration, it’s an important element of White Hat SEO strategy.
Sources
- Google Search Central - robots.txt: https://developers.google.com/search/docs/crawling-indexing/robots/intro
- Google Search Central - Create and submit a robots.txt file: https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
- Google Search Central - robots.txt Specifications: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
- Bing Webmaster Guidelines - robots.txt: https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec
- robotstxt.org - Original Specification: https://www.robotstxt.org/