---
title: "robots.txt File - Complete Configuration Guide"
description: "The robots.txt file controls search engine robot access to your site. Learn how to configure it correctly, understand modern directives, and apply best practices."
date: 2015-11-05
updated: 2025-01-10
category: Optimization
tags: ["SEO", "robots.txt", "optimization", "crawling", "Googlebot"]
url: https://uper.pl/en/blog/robots-txt/
---

# robots.txt File - Complete Configuration Guide

**The robots.txt file** is a text file placed in the root directory of a website that tells search engine robots (crawlers) which parts of the site can be crawled and which should be skipped. It's a fundamental element of technical SEO and crawl budget optimization.

## How Does robots.txt Work?

The robots.txt file uses the **Robots Exclusion Protocol (REP)** - a standard created in 1994, formalized as RFC 9309 in 2022, and still followed by all major search engines today.

### How It Works

1. A search engine robot (e.g., Googlebot) arrives at the site
2. Before crawling any page, it requests `https://example.com/robots.txt`
3. It parses the directives and, if it is a well-behaved robot, follows them
4. It crawls the allowed pages according to the rules
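
This decision flow can be reproduced with Python's standard-library robots.txt parser - a minimal sketch with illustrative URLs (note that `urllib.robotparser` implements the original REP rules and does not support every Google extension, such as wildcards):

```python
from urllib import robotparser

# Step 2: point the parser at the site's robots.txt and download it
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the directives

# Steps 3-4: decide whether a given robot may crawl a given URL
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))  # False if /admin/ is disallowed
print(rp.can_fetch("Googlebot", "https://example.com/blog/"))   # True if /blog/ is allowed
```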

### Important Caveat

**Robots.txt is a recommendation, not a block.** Well-behaved robots (Google, Bing) respect these directives, but:
- Malicious bots may ignore them
- Blocked URLs can still appear in results (if linked to)
- This is not a tool for protecting private data

## Basic Syntax

```txt
User-agent: *
Allow: /
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
```

### robots.txt File Elements

| Directive | Description | Example |
|-----------|-------------|---------|
| `User-agent` | Specifies the robot (or `*` for all) | `User-agent: Googlebot` |
| `Allow` | Allows crawling of path | `Allow: /public/` |
| `Disallow` | Blocks crawling of path | `Disallow: /admin/` |
| `Sitemap` | URL of XML sitemap | `Sitemap: https://example.com/sitemap.xml` |
| `Crawl-delay` | Delay between requests in seconds (ignored by Google) | `Crawl-delay: 10` |

### Syntax Rules

- **Paths are case-sensitive** - `/Admin/` and `/admin/` are different paths
- **Rules are grouped into blocks** - one or more `User-agent` lines followed by their directives
- **An empty line** separates blocks
- **Comments** start with `#`
- **Order of rules doesn't matter** - the most specific (longest) matching rule wins
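
For example, a file with two blocks and comments (paths are illustrative):

```txt
# Rules for all robots
User-agent: *
Disallow: /tmp/

# A separate, stricter block for Googlebot
User-agent: Googlebot
Disallow: /tmp/
Disallow: /archive/
```

Note that Google follows only the most specific matching block: here Googlebot obeys the second block and ignores the `User-agent: *` rules entirely.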

## Directives in Detail

### User-agent

Specifies which robot the following rules apply to:

```txt
# All robots
User-agent: *

# Only Googlebot
User-agent: Googlebot

# Only Googlebot Images
User-agent: Googlebot-Image

# Bing
User-agent: Bingbot
```

**Popular search engine robots:**
| Robot | Search Engine/Service |
|-------|----------------------|
| `Googlebot` | Google Search |
| `Googlebot-Image` | Google Images |
| `Googlebot-News` | Google News |
| `Bingbot` | Bing Search |
| `Slurp` | Yahoo |
| `DuckDuckBot` | DuckDuckGo |
| `Yandex` | Yandex |

**SEO tool robots and other crawlers:**
| Robot | Tool |
|-------|------|
| `AhrefsBot` | Ahrefs (link analysis) |
| `MJ12bot` | Majestic (link analysis) |
| `SemrushBot` | Semrush (SEO analysis) |
| `DotBot` | Moz (SEO analysis) |
| `BLEXBot` | Webmeup (backlink explorer) |
| `SeznamBot` | Seznam (Czech search engine) |
| `PetalBot` | Huawei (search engine) |

**Social media robots:**
| Robot | Platform |
|-------|----------|
| `facebot` | Facebook |
| `Twitterbot` | Twitter/X |
| `LinkedInBot` | LinkedIn |
| `Pinterest` | Pinterest |

**AI robots:**
| Robot | Service |
|-------|---------|
| `GPTBot` | OpenAI/ChatGPT |
| `ChatGPT-User` | ChatGPT browsing |
| `CCBot` | Common Crawl (AI training data) |
| `anthropic-ai` | Anthropic/Claude |
| `Google-Extended` | Google AI (Gemini) |
| `Bytespider` | ByteDance/TikTok AI |
| `ClaudeBot` | Anthropic Claude |

### Disallow

Blocks crawling of specified paths:

```txt
# Block directory
Disallow: /admin/

# Block specific file
Disallow: /private-page.html

# Block everything
Disallow: /

# Empty value = allow everything
Disallow:
```

### Allow

Allows crawling (overrides Disallow):

```txt
# Block directory, but allow specific file
Disallow: /folder/
Allow: /folder/public-file.html
```

### Wildcards

Google and Bing support wildcards:

```txt
# * = any sequence of characters
Disallow: /private-*/

# $ = end of URL
Disallow: /*.pdf$
Disallow: /*.doc$

# Combination
Disallow: /*?sessionid=
```

**Usage examples:**

```txt
# Block all URL parameters
Disallow: /*?

# Block sorting and filters
Disallow: /*?sort=
Disallow: /*?filter=

# Block search result pages
Disallow: /search
Disallow: /*?s=
Disallow: /*?q=
```

### Sitemap

Specifies the location of the XML sitemap:

```txt
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
```

- Can have multiple Sitemap directives
- Use full URL (with https://)
- Sitemap can also be submitted through Search Console

### Crawl-delay

Specifies minimum time (in seconds) between requests:

```txt
User-agent: Bingbot
Crawl-delay: 10
```

**Note:** Google **ignores** Crawl-delay. Googlebot adjusts its crawl rate automatically based on how quickly your server responds.

## Configuration Examples for Different Platforms

### WordPress

```txt
User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /*?s=
Disallow: /*?p=
Disallow: /category/*?
Disallow: /tag/*?

Sitemap: https://example.com/sitemap_index.xml
```

### E-commerce (WooCommerce, Shopify)

```txt
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?add-to-cart=
Disallow: /*?orderby=
Disallow: /*?filter=
Allow: /wp-content/uploads/

Sitemap: https://example.com/sitemap_index.xml
```

### Astro / Static Sites

```txt
User-agent: *
Allow: /

# Block technical pages
Disallow: /404/
Disallow: /_astro/

Sitemap: https://example.com/sitemap-index.xml
```

### Blocking AI Bots

In 2024+, many sites block [AI bots](/en/blog/ai-crawlers-vs-search-crawlers/) due to content scraping:

```txt
# Block AI bots
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

# Other robots - allowed
User-agent: *
Allow: /
```

**Note:** `Google-Extended` only controls whether your content is used to train Google's AI models (Gemini); it does not affect how your site is crawled or ranked in Google Search.
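
Since a group may list several `User-agent` lines before its directives, the same policy can also be written more compactly (same bots as above):

```txt
# Block AI bots - compact form
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: anthropic-ai
User-agent: Google-Extended
Disallow: /

# Other robots - allowed
User-agent: *
Allow: /
```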

### Blocking SEO Tool Bots

Some sites block SEO tool bots to make it harder for competitors to analyze their link profile:

```txt
# Block competitor SEO tools
User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: BLEXBot
Disallow: /
```

**Disadvantages of blocking SEO bots:**
- Makes it harder to analyze your own site in these tools
- Doesn't fully block - historical data remains
- Competitors can use other methods (e.g., proxies)

**When to consider blocking:**
- When you want to hide your link strategy
- For protection against negative SEO
- For sites with sensitive structure

## robots.txt vs meta robots

| Aspect | robots.txt | meta robots |
|--------|------------|-------------|
| **Control level** | Directories, files | Individual pages |
| **Location** | Root directory | `<head>` tag or HTTP header |
| **Flexibility** | Less | More |
| **Blocks crawling** | Yes | No (page is crawled) |
| **Blocks indexing** | No* | Yes (noindex) |

*Important: robots.txt blocks **crawling**, but the page can still be **indexed** if linked from other sources.

### How to Effectively Block Indexing?

If you want a page **not to appear in search results**, use meta robots:

```html
<meta name="robots" content="noindex, nofollow">
```

Or HTTP header:

```
X-Robots-Tag: noindex, nofollow
```
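
To confirm the header is actually being sent, you can inspect the response headers (the URL is illustrative):

```bash
# HEAD request; print only the X-Robots-Tag header if present
curl -sI https://example.com/private-page/ | grep -i x-robots-tag
```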

**Do not** block a page in robots.txt if you want to noindex it - Googlebot must be able to read it to see the noindex directive.

## Crawl Budget Optimization

Crawl budget is the number of pages Google is willing to crawl in a given time. For large sites (100k+ URLs), robots.txt optimization is crucial.

### What to Block

```txt
# Pages without SEO value
Disallow: /thank-you/
Disallow: /confirmation/

# Duplicate content
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

# Internal resources
Disallow: /internal/
Disallow: /staging/

# Internal search
Disallow: /search
Disallow: /*?q=
```

### What NOT to Block

- **CSS and JavaScript** - Google needs them to render pages (see the example after this list)
- **Images** - unless you intentionally want to exclude them from Google Images
- **Important pages** - check carefully before blocking
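
If stylesheets or scripts live inside an otherwise blocked directory, one way to keep them crawlable is an explicit `Allow` - a sketch with a hypothetical `/assets/` path:

```txt
User-agent: *
Disallow: /assets/
# Keep rendering resources crawlable
Allow: /assets/*.css$
Allow: /assets/*.js$
```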

More about crawl optimization in the context of [Core Web Vitals](/en/blog/core-web-vitals/).

## Common Mistakes

### 1. Accidentally Blocking the Entire Site

```txt
# WRONG - blocks everything!
User-agent: *
Disallow: /
```

### 2. Missing robots.txt File

If the file doesn't exist, robots will assume everything is allowed. Always create a file, even a minimal one:

```txt
User-agent: *
Allow: /
```

### 3. Blocking Resources Needed for Rendering

```txt
# WRONG - blocks CSS/JS needed for rendering
Disallow: /css/
Disallow: /js/
```

### 4. Expecting robots.txt to Protect Data

Robots.txt is **not a security measure**. Protect sensitive data through:
- Authorization (login/password)
- .htaccess / nginx config
- Proper server permissions

### 5. Conflicting Rules

```txt
# Conflicting - which rule wins?
Disallow: /folder/
Allow: /folder/
```

Google applies the **most specific** (longest) matching rule; when an Allow and a Disallow rule are equally specific, Allow wins.
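
A concrete illustration of that precedence (paths are illustrative):

```txt
Disallow: /folder/
Allow: /folder/page.html

# /folder/page.html  -> crawlable: the Allow rule is longer, i.e. more specific
# /folder/other.html -> blocked: only the shorter Disallow rule matches
```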

### 6. Blocking Pages with noindex

```txt
# WRONG - Googlebot won't see noindex
Disallow: /private-page/

# On the page:
# <meta name="robots" content="noindex">
```

If you block a page in robots.txt, Googlebot won't read it and won't see the noindex directive.

## Testing and Validation

### Google Search Console

1. Go to Search Console
2. Use **URL Inspection** to check whether a specific URL is blocked by robots.txt
3. Review the **robots.txt report** (under Settings), which replaced the old robots.txt Tester

### Online Tools

- **[Google Search Console](/en/blog/google-search-console/)** - official tool
- **Bing Webmaster Tools** - for Bingbot
- **robots.txt Checker** - online validators

### Manual Testing

```bash
curl https://example.com/robots.txt
```

## Where to Place the File?

The file **must** be in the root directory of the domain:

```
https://example.com/robots.txt       ✓ Correct
https://example.com/folder/robots.txt  ✗ Ignored
```

For subdomains, a separate file is needed:

```
https://example.com/robots.txt
https://blog.example.com/robots.txt
https://shop.example.com/robots.txt
```

## robots.txt and Security

### Don't Reveal Sensitive Paths

```txt
# WRONG - informing attackers about admin panel
Disallow: /super-secret-admin-panel/
Disallow: /backup-database/
```

Attackers can read robots.txt and discover hidden paths. Instead of blocking in robots.txt, secure these resources with authorization.

### Proper Approach

- **Public pages** → robots.txt for crawl control
- **Private data** → authorization + firewall
- **Staging/dev** → IP whitelist or HTTP auth

## Summary

A properly configured robots.txt file:

- **Optimizes crawl budget** - robots focus on important pages
- **Reduces crawling of duplicates** - filters, sorting, parameters
- **Controls bot access** - including AI bots
- **Points to sitemap** - facilitates content discovery

However, remember that:
- Robots.txt is **not a data security measure**
- **Does not guarantee** removal from index
- Is **respected** by good bots, but **ignored** by malicious ones

Regularly check your robots.txt file and adjust it to changes in site structure. Combined with meta robots and proper server configuration, it's an important element of [White Hat SEO](/en/blog/white-hat-seo/) strategy.

## Sources

1. **Google Search Central - robots.txt**
[https://developers.google.com/search/docs/crawling-indexing/robots/intro](https://developers.google.com/search/docs/crawling-indexing/robots/intro)

2. **Google Search Central - Create and submit a robots.txt file**
[https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt](https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt)

3. **Google Search Central - robots.txt Specifications**
[https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt)

4. **Bing Webmaster Guidelines - robots.txt**
[https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec](https://www.bing.com/webmasters/help/how-to-create-a-robots-txt-file-cb7c31ec)

5. **robotstxt.org - Original Specification**
[https://www.robotstxt.org/](https://www.robotstxt.org/)
