SiteShiftCo

Robots.txt

A plain-text file at the root of a website that tells web crawlers which parts of the site they may or may not access.

Also known as: robots.txt, robots file

robots.txt is a plain-text file located at the root of a website (e.g., https://example.com/robots.txt) that tells web crawlers which parts of the site they may or may not access. It is part of the Robots Exclusion Protocol, originally proposed in 1994 and now formalized as RFC 9309.

robots.txt is honored voluntarily by well-behaved crawlers. It is not a security mechanism; malicious bots typically ignore it.

Basic structure

A minimal example:

User-agent: *
Disallow: /admin/
Disallow: /cart/

User-agent: Googlebot
Allow: /admin/preview/

Sitemap: https://example.com/sitemap.xml

Each block consists of:

  • A User-agent: line specifying which crawler the rules apply to (or * for all)
  • Allow: and Disallow: directives specifying paths
  • An optional Sitemap: reference (placed at the file level, not inside a user-agent block)
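These rules can be exercised offline with Python's standard-library robots.txt parser, urllib.robotparser. The sketch below parses the example file from this section as a string; the bot name ExampleBot is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# The example file from above, parsed from a string instead of
# being fetched over the network.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/

User-agent: Googlebot
Allow: /admin/preview/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Generic crawlers fall under the * block, so /admin/ is off limits.
print(parser.can_fetch("ExampleBot", "https://example.com/admin/page"))    # False
# Googlebot has its own block, which contains only an Allow rule.
print(parser.can_fetch("Googlebot", "https://example.com/admin/preview/")) # True
# Paths no rule matches are allowed by default.
print(parser.can_fetch("ExampleBot", "https://example.com/products/"))     # True
```

Note that because Googlebot matches its own block, the * block is not applied to it at all.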

Common directives

Directive                      Effect
User-agent: *                  Rules apply to all crawlers
User-agent: Googlebot          Rules apply only to Google’s crawler
Disallow: /                    Block all paths
Disallow: (empty)              Allow everything (no restrictions)
Disallow: /private/            Block paths starting with /private/
Allow: /private/public-page    Allow this specific path within an otherwise disallowed directory
Sitemap: https://...           Indicate the sitemap location
Crawl-delay: 10                Request a 10-second delay between requests (Google ignores this; some other crawlers honor it)
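Two of these directives can also be read back programmatically with urllib.robotparser; a small sketch (the rules and bot name are illustrative, and site_maps() requires Python 3.8+):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Crawl-delay is exposed per user agent; ExampleBot falls under *.
print(parser.crawl_delay("ExampleBot"))  # 10
# Sitemap lines are file-level and collected into a list.
print(parser.site_maps())                # ['https://example.com/sitemap.xml']
```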

Common patterns

Allow everything

User-agent: *
Disallow:

(Equivalent to having no robots.txt at all)

Block everything

User-agent: *
Disallow: /

(Used during development or for staging sites; should never be left on production)

Block admin and shopping cart

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/

Sitemap: https://example.com/sitemap.xml

Block all crawlers except Google

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
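This pattern can be verified offline with urllib.robotparser (bot names other than Googlebot are made up):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own block; an empty Disallow permits everything.
print(parser.can_fetch("Googlebot", "https://example.com/anything"))  # True
# Everyone else falls under the * block and is blocked entirely.
print(parser.can_fetch("SomeOtherBot", "https://example.com/"))       # False
```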

What robots.txt does and does not do

robots.txt:

  • Politely asks crawlers not to access specified URLs
  • Helps reduce crawl load on the server
  • Indicates where the sitemap is located
  • Honored by major search engines and most well-behaved bots

robots.txt does NOT:

  • Prevent indexing of disallowed pages (Google may still index URLs found through links, even if it cannot crawl them)
  • Provide security (sensitive content should be protected by authentication, not robots.txt)
  • Block malicious or non-compliant crawlers
  • Prevent users from accessing pages directly

robots.txt vs noindex

These solve different problems:

Mechanism              Purpose
robots.txt Disallow    Tells crawlers not to fetch the page
noindex meta tag       Tells crawlers not to include the page in search results

If a page is disallowed in robots.txt, search engines cannot fetch it to read a noindex tag, so they may index the URL anyway based on external signals. To reliably exclude a page from search results, allow crawling but use noindex.
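As an illustration of the second mechanism, here is a minimal sketch that detects a robots noindex meta tag in fetched HTML, using Python's stdlib html.parser; the class name and sample page are made up:

```python
from html.parser import HTMLParser

class NoindexDetector(HTMLParser):
    """Flags pages carrying <meta name="robots" content="... noindex ...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            content = (attr.get("content") or "").lower()
            if name == "robots" and "noindex" in content:
                self.noindex = True

# A page the crawler was allowed to fetch, asking not to be indexed.
page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'

detector = NoindexDetector()
detector.feed(page)
print(detector.noindex)  # True
```

The key point: a crawler can only see this tag if robots.txt lets it fetch the page in the first place.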

Common mistakes

  • Blocking CSS or JavaScript. Search engines need to fetch CSS/JS to render and rank pages correctly; blocking these resources can hurt rankings
  • Blocking /wp-admin/ without an Allow: /wp-admin/admin-ajax.php exception. Some sites need admin-ajax.php to be accessible
  • Forgetting to remove the development block on launch. A site launched with Disallow: / cannot be indexed at all
  • Using robots.txt for security. Disallowed paths are publicly visible in robots.txt itself
  • Conflicting Allow and Disallow rules. Most crawlers prefer the most specific rule, but interpretation varies; keeping rules simple is safer
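The last point is easy to demonstrate. Python's stdlib parser applies the first matching rule in file order, whereas Google and RFC 9309 use the most specific (longest) match, so the same Allow/Disallow pair behaves differently depending on ordering (the helper and bot name below are illustrative):

```python
from urllib.robotparser import RobotFileParser

def can_fetch(rules: str, url: str) -> bool:
    """Parse a robots.txt string and check one URL for a generic bot."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("ExampleBot", url)

url = "https://example.com/private/public-page"

# Allow listed first: the stdlib parser hits the Allow rule first.
print(can_fetch("User-agent: *\nAllow: /private/public-page\nDisallow: /private/\n", url))  # True

# Same rules, reversed: the broader Disallow rule now matches first.
print(can_fetch("User-agent: *\nDisallow: /private/\nAllow: /private/public-page\n", url))  # False
```

Under Google's longest-match rule, both files would allow the URL, which is exactly why relying on rule-precedence subtleties is risky.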

Crawler behavior

Compliance with robots.txt varies:

  • Major search engines (Google, Bing, Yandex, Baidu, DuckDuckGo) honor robots.txt
  • Reputable crawlers (Ahrefs, Semrush, Moz, archive.org, etc.) generally honor it
  • AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) typically honor robots.txt; specific user-agent strings can be blocked
  • Malicious bots routinely ignore robots.txt

Where to put robots.txt

robots.txt must be at the root of the host: https://example.com/robots.txt. It cannot live in a subdirectory, and each subdomain needs its own robots.txt file at its own root.

For sites generated by static site generators, robots.txt is typically placed in the public/ or static assets folder so it’s served at the root.

How to test robots.txt

  • Google Search Console provides a robots.txt report that shows how Google fetched and interpreted the file
  • Manually fetch /robots.txt to verify it’s being served correctly
  • Crawl the site with a tool like Screaming Frog (it respects robots.txt by default; can be toggled to ignore for diagnostics)

Common misconceptions

  • “robots.txt prevents pages from appearing in Google.” It prevents crawling, not indexing; Google may show the URL based on external signals.
  • “You need a robots.txt file.” It is optional; a site without one is treated as allowing all crawling.
  • “Disallowing in robots.txt is private.” Anyone can read /robots.txt directly.
  • “All bots respect robots.txt.” Compliance is voluntary; malicious and non-compliant bots ignore it.