SiteShiftCo

Robots.txt

A plain-text file at the root of a website that tells web crawlers which parts of the site they may or may not access.

Also known as: robots.txt, robots file

robots.txt is a plain-text file located at the root of a website (e.g., https://example.com/robots.txt) that tells web crawlers which parts of the site they may or may not access. It is part of the Robots Exclusion Protocol, originally proposed in 1994 and now formalized as RFC 9309.

robots.txt is honored voluntarily by well-behaved crawlers. It is not a security mechanism; malicious bots typically ignore it.

Basic structure

A minimal example:

User-agent: *
Disallow: /admin/
Disallow: /cart/

User-agent: Googlebot
Allow: /admin/preview/

Sitemap: https://example.com/sitemap.xml

Each block consists of:

  • A User-agent: line specifying which crawler the rules apply to (or * for all)
  • Allow: and Disallow: directives specifying paths
  • An optional Sitemap: reference (placed at the file level, not inside a user-agent block)
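These rules can be exercised offline with Python's standard-library robots.txt parser, urllib.robotparser. The sketch below parses the example file from this section as a string; the bot name ExampleBot is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# The example file from above, parsed from a string instead of
# being fetched over the network.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/

User-agent: Googlebot
Allow: /admin/preview/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Generic crawlers fall under the * block, so /admin/ is off limits.
print(parser.can_fetch("ExampleBot", "https://example.com/admin/page"))    # False
# Googlebot has its own block, which contains only an Allow rule.
print(parser.can_fetch("Googlebot", "https://example.com/admin/preview/")) # True
# Paths no rule matches are allowed by default.
print(parser.can_fetch("ExampleBot", "https://example.com/products/"))     # True
```

Note that because Googlebot matches its own block, the * block is not applied to it at all.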

Common directives

Directive                      Effect
User-agent: *                  Rules apply to all crawlers
User-agent: Googlebot          Rules apply only to Google’s crawler
Disallow: /                    Block all paths
Disallow: (empty)              Allow everything (no restrictions)
Disallow: /private/            Block paths starting with /private/
Allow: /private/public-page    Allow this specific path within an otherwise disallowed directory
Sitemap: https://...           Indicate the sitemap location
Crawl-delay: 10                Request a 10-second delay between requests (Google ignores this; some other crawlers honor it)
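Two of these directives can also be read back programmatically with urllib.robotparser; a small sketch (the rules and bot name are illustrative, and site_maps() requires Python 3.8+):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Crawl-delay is exposed per user agent; ExampleBot falls under *.
print(parser.crawl_delay("ExampleBot"))  # 10
# Sitemap lines are file-level and collected into a list.
print(parser.site_maps())                # ['https://example.com/sitemap.xml']
```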

Common patterns

Allow everything

User-agent: *
Disallow:

(Equivalent to having no robots.txt at all)

Block everything

User-agent: *
Disallow: /

(Used during development or for staging sites; should never be left on production)

Block admin and shopping cart

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/

Sitemap: https://example.com/sitemap.xml

Block all crawlers except Google

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
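This pattern can be verified offline with urllib.robotparser (bot names other than Googlebot are made up):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own block; an empty Disallow permits everything.
print(parser.can_fetch("Googlebot", "https://example.com/anything"))  # True
# Everyone else falls under the * block and is blocked entirely.
print(parser.can_fetch("SomeOtherBot", "https://example.com/"))       # False
```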

What robots.txt does and does not do

robots.txt:

  • Politely asks crawlers not to access specified URLs
  • Helps reduce crawl load on the server
  • Indicates where the sitemap is located
  • Honored by major search engines and most well-behaved bots

robots.txt does NOT:

  • Prevent indexing of disallowed pages (Google may still index URLs found through links, even if it cannot crawl them)
  • Provide security (sensitive content should be protected by authentication, not robots.txt)
  • Block malicious or non-compliant crawlers
  • Prevent users from accessing pages directly

robots.txt vs noindex

These solve different problems:

Mechanism              Purpose
robots.txt Disallow    Tells crawlers not to fetch the page
noindex meta tag       Tells crawlers not to include the page in search results

If a page is disallowed in robots.txt, search engines cannot fetch it to read a noindex tag, so they may index the URL anyway based on external signals. To reliably exclude a page from search results, allow crawling but use noindex.
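As an illustration of the second mechanism, here is a minimal sketch that detects a robots noindex meta tag in fetched HTML, using Python's stdlib html.parser; the class name and sample page are made up:

```python
from html.parser import HTMLParser

class NoindexDetector(HTMLParser):
    """Flags pages carrying <meta name="robots" content="... noindex ...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            content = (attr.get("content") or "").lower()
            if name == "robots" and "noindex" in content:
                self.noindex = True

# A page the crawler was allowed to fetch, asking not to be indexed.
page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'

detector = NoindexDetector()
detector.feed(page)
print(detector.noindex)  # True
```

The key point: a crawler can only see this tag if robots.txt lets it fetch the page in the first place.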

Common mistakes

  • Blocking CSS or JavaScript. Search engines need to fetch CSS/JS to render and rank pages correctly; blocking these resources can hurt rankings
  • Blocking /wp-admin/ without an Allow: /wp-admin/admin-ajax.php exception. Some sites need admin-ajax.php to be accessible
  • Forgetting to remove the development block on launch. A site launched with Disallow: / cannot be indexed at all
  • Using robots.txt for security. Disallowed paths are publicly visible in robots.txt itself
  • Conflicting Allow and Disallow rules. Most crawlers prefer the most specific rule, but interpretation varies; keeping rules simple is safer
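The last point is easy to demonstrate. Python's stdlib parser applies the first matching rule in file order, whereas Google and RFC 9309 use the most specific (longest) match, so the same Allow/Disallow pair behaves differently depending on ordering (the helper and bot name below are illustrative):

```python
from urllib.robotparser import RobotFileParser

def can_fetch(rules: str, url: str) -> bool:
    """Parse a robots.txt string and check one URL for a generic bot."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("ExampleBot", url)

url = "https://example.com/private/public-page"

# Allow listed first: the stdlib parser hits the Allow rule first.
print(can_fetch("User-agent: *\nAllow: /private/public-page\nDisallow: /private/\n", url))  # True

# Same rules, reversed: the broader Disallow rule now matches first.
print(can_fetch("User-agent: *\nDisallow: /private/\nAllow: /private/public-page\n", url))  # False
```

Under Google's longest-match rule, both files would allow the URL, which is exactly why relying on rule-precedence subtleties is risky.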

Crawler behavior

Compliance with robots.txt varies:

  • Major search engines (Google, Bing, Yandex, Baidu, DuckDuckGo) honor robots.txt
  • Reputable crawlers (Ahrefs, Semrush, Moz, archive.org, etc.) generally honor it
  • AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) typically honor robots.txt; specific user-agent strings can be blocked
  • Malicious bots routinely ignore robots.txt

Where to put robots.txt

robots.txt must be at the root of the host: https://example.com/robots.txt. It cannot live in a subdirectory, and each subdomain needs its own robots.txt file at its own root.

For sites generated by static site generators, robots.txt is typically placed in the public/ or static assets folder so it’s served at the root.

How to test robots.txt

  • Google Search Console provides a robots.txt report that shows how Google fetched and interpreted the file
  • Manually fetch /robots.txt to verify it’s being served correctly
  • Crawl the site with a tool like Screaming Frog (it respects robots.txt by default; can be toggled to ignore for diagnostics)

Common misconceptions

  • “robots.txt prevents pages from appearing in Google.” It prevents crawling, not indexing; Google may show the URL based on external signals.
  • “You need a robots.txt file.” It is optional; a site without one is treated as allowing all crawling.
  • “Disallowing in robots.txt is private.” Anyone can read /robots.txt directly.
  • “All bots respect robots.txt.” Compliance is voluntary; malicious and non-compliant bots ignore it.