Robots.txt
A plain-text file at the root of a website that tells web crawlers which parts of the site they may or may not access.
Also known as: robots.txt, robots file
Located at the root of a website (e.g., https://example.com/robots.txt), robots.txt is part of the Robots Exclusion Protocol, originally proposed in 1994 and now formalized as RFC 9309.
robots.txt is voluntarily honored by well-behaved crawlers. It is not a security mechanism; malicious bots typically ignore it.
Basic structure
A minimal example:
User-agent: *
Disallow: /admin/
Disallow: /cart/
User-agent: Googlebot
Allow: /admin/preview/
Sitemap: https://example.com/sitemap.xml
Each block consists of:
- `User-agent:`, which crawler the rules apply to (or `*` for all)
- `Allow:` and `Disallow:` directives specifying paths
- An optional `Sitemap:` reference (placed at the file level, not inside a user-agent block)
Common directives
| Directive | Effect |
|---|---|
| `User-agent: *` | Rules apply to all crawlers |
| `User-agent: Googlebot` | Rules apply only to Google’s crawler |
| `Disallow: /` | Block all paths |
| `Disallow: /private/` | Block paths starting with `/private/` |
| `Disallow:` (empty) | Allow everything (no restrictions) |
| `Allow: /private/public-page` | Allow this specific path within an otherwise disallowed directory |
| `Sitemap: https://...` | Indicate the sitemap location |
| `Crawl-delay: 10` | Request a delay between successive requests (Google ignores this; some other crawlers honor it) |
Common patterns
Allow everything
User-agent: *
Disallow:
(Equivalent to having no robots.txt at all)
Block everything
User-agent: *
Disallow: /
(Used during development or for staging sites; should never be left on production)
Block admin and shopping cart
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://example.com/sitemap.xml
Block all crawlers except Google
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
What robots.txt does and does not do
robots.txt:
- Politely asks crawlers not to access specified URLs
- Helps reduce crawl load on the server
- Indicates where the sitemap is located
- Honored by major search engines and most well-behaved bots
robots.txt does NOT:
- Prevent indexing of disallowed pages (Google may still index URLs found through links, even if it cannot crawl them)
- Provide security (sensitive content should be protected by authentication, not `robots.txt`)
- Block malicious or non-compliant crawlers
- Prevent users from accessing pages directly
robots.txt vs noindex
These solve different problems:
| Mechanism | Purpose |
|---|---|
| `robots.txt` `Disallow` | Tells crawlers not to fetch the page |
| `noindex` meta tag | Tells crawlers not to include the page in search results |
If a page is disallowed in robots.txt, search engines cannot fetch it to read a noindex tag, so they may index the URL anyway based on external signals. To reliably exclude a page from search results, allow crawling but use noindex.
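For reference, a noindex directive lives in the page’s HTML head; the page must remain crawlable for search engines to see it:
<meta name="robots" content="noindex">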
Common mistakes
- Blocking CSS or JavaScript. Search engines need to fetch CSS/JS to render and rank pages correctly; blocking these resources can hurt rankings
- Blocking `/wp-admin/` but not allowing `/wp-admin/admin-ajax.php`. Some sites need admin-ajax.php to be accessible (see the example after this list)
- Forgetting to remove the development block on launch. A site launched with `Disallow: /` cannot be indexed at all
- Using `robots.txt` for security. Disallowed paths are publicly visible in `robots.txt` itself
- Conflicting Allow and Disallow rules. Most crawlers prefer the most specific rule, but interpretation varies; keeping rules simple is safer
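A WordPress-style configuration that keeps the admin area blocked while leaving admin-ajax.php reachable looks like this (a sketch; adapt the paths to your site):
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php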
Crawler behavior
Compliance with robots.txt varies:
- Major search engines (Google, Bing, Yandex, Baidu, DuckDuckGo) honor `robots.txt`
- Reputable crawlers (Ahrefs, Semrush, Moz, archive.org, etc.) generally honor it
- AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) typically honor `robots.txt`; specific user-agent strings can be blocked (see the example after this list)
- Malicious bots routinely ignore `robots.txt`
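To opt specific AI crawlers out of the entire site, block their published user-agent tokens (a sketch; the token names are set by each vendor and may change):
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /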
Where to put robots.txt
robots.txt must be at the root of the domain: https://example.com/robots.txt. It cannot live in a subdirectory, and it does not carry over across subdomains (each subdomain needs its own robots.txt at its own root).
For sites generated by static site generators, robots.txt is typically placed in the public/ or static assets folder so it’s served at the root.
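For example, with a generator that copies a public/ directory verbatim to the site root (an illustrative layout; folder names vary by generator):
public/
  robots.txt     → served at https://example.com/robots.txt
  sitemap.xml    → served at https://example.com/sitemap.xml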
How to test robots.txt
- Google Search Console shows how Google fetches and interprets the file (via its robots.txt report)
- Manually fetch `/robots.txt` to verify it’s being served correctly
- Crawl the site with a tool like Screaming Frog (it respects `robots.txt` by default; this can be toggled off for diagnostics)
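Such checks can also be scripted. A minimal sketch using Python’s standard-library urllib.robotparser (the URLs and user-agent names below are placeholders):
from urllib.robotparser import RobotFileParser

# Fetch and parse the live robots.txt (placeholder URL)
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether a given crawler may fetch a given URL
print(rp.can_fetch("Googlebot", "https://example.com/admin/preview/"))
print(rp.can_fetch("*", "https://example.com/cart/"))

# Note: robotparser implements the core protocol; wildcard path rules
# from vendor extensions may not be interpreted the way Google does.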
Common misconceptions
- “`robots.txt` prevents pages from appearing in Google.” It prevents crawling, not indexing; Google may show the URL based on external signals.
- “You need a robots.txt file.” It is optional; a site without one is treated as allowing all crawling.
- “Disallowing in `robots.txt` is private.” Anyone can read `/robots.txt` directly.
- “All bots respect `robots.txt`.” Compliance is purely voluntary.