robots.txt Generator
Generate a robots.txt file for search engine crawlers
User-agent: *
Disallow:
About This Tool
Builds a robots.txt file from a structured form: user-agent rules, allow/disallow paths, sitemap URLs, and crawl-delay directives. Output follows the Robots Exclusion Protocol as standardized in RFC 9309.
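As a rough illustration of how a structured form can map onto the file format, here is a minimal Python sketch. The names Group and build_robots_txt are hypothetical and are not this tool's actual code. It assumes that a group with no rules should emit an empty Disallow line, which permits everything.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Group:
    """One rule block (hypothetical structure): crawler names plus their paths."""
    user_agents: List[str]
    disallow: List[str] = field(default_factory=list)
    allow: List[str] = field(default_factory=list)
    crawl_delay: Optional[float] = None  # non-standard field; some crawlers ignore it


def build_robots_txt(groups: Sequence[Group], sitemaps: Sequence[str] = ()) -> str:
    """Render groups and sitemap URLs as robots.txt text."""
    blocks = []
    for group in groups:
        lines = [f"User-agent: {agent}" for agent in group.user_agents]
        lines += [f"Allow: {path}" for path in group.allow]
        lines += [f"Disallow: {path}" for path in group.disallow]
        if not group.allow and not group.disallow:
            # Assumption: an empty Disallow means "nothing is blocked".
            lines.append("Disallow:")
        if group.crawl_delay is not None:
            lines.append(f"Crawl-delay: {group.crawl_delay:g}")
        blocks.append("\n".join(lines))
    if sitemaps:
        # Sitemap lines sit outside any user-agent group.
        blocks.append("\n".join(f"Sitemap: {url}" for url in sitemaps))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        [Group(["*"], disallow=["/cart", "/checkout"])],
        ["https://example.com/sitemap.xml"],
    ))
```

Separating groups with a blank line is a readability convention, not a requirement of the format.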
Directive names are case-insensitive, but path matching is case-sensitive: 'Disallow: /Admin' does not block /admin. Support for non-standard fields such as crawl-delay varies by engine: Google ignores it, while Bing honors it.
The Robots Exclusion Protocol was first proposed in 1994 by Martijn Koster and remained an informal convention until it was codified as RFC 9309 in 2022. The format is line-oriented: each block opens with one or more 'User-agent: <name>' lines naming a crawler, followed by 'Allow:' and 'Disallow:' lines listing paths. A wildcard 'User-agent: *' applies to crawlers not specifically named. The file lives at the root of a host (https://example.com/robots.txt), and well-behaved crawlers fetch it before any other URL on that host. Each subdomain requires its own robots.txt; one at the apex does not cover sub.example.com.
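The two ideas in this paragraph, where the file lives and how lines form blocks, can be sketched in Python. Both function names are hypothetical. The parser is a simplification: it handles only User-agent, Allow, and Disallow lines and treats anything after '#' as a comment.

```python
from typing import List, Tuple
from urllib.parse import urlsplit, urlunsplit

Rule = Tuple[str, str]                 # ("allow" | "disallow", path)
ParsedGroup = Tuple[List[str], List[Rule]]


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def parse_groups(text: str) -> List[ParsedGroup]:
    """Split robots.txt text into (user_agents, rules) blocks."""
    groups: List[ParsedGroup] = []
    agents: List[str] = []
    rules: List[Rule] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if rules:
                # A User-agent line after rules opens a new block.
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif key in ("allow", "disallow") and agents:
            rules.append((key, value))
    if agents:
        groups.append((agents, rules))
    return groups


if __name__ == "__main__":
    print(robots_txt_url("https://sub.example.com/shop/item?id=3"))
    print(parse_groups("User-agent: *\nDisallow: /cart\n"))
```

Because robots_txt_url keeps the host unchanged, a page on sub.example.com resolves to that subdomain's own file rather than the apex one.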
A worked example: a typical e-commerce robots.txt blocks crawlers from cart and checkout while allowing the product catalog.
User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /api/
Allow: /api/sitemap.xml
Sitemap: https://example.com/sitemap.xml
The Allow exception for /api/sitemap.xml overrides the broader /api/ disallow because RFC 9309 specifies that the most-specific (longest matching) rule wins. Google and Bing follow this convention; older or smaller crawlers may evaluate rules in document order, where the first match takes precedence.
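To make the difference between the two evaluation strategies concrete, the following Python sketch applies both to the e-commerce rules above. The function names are hypothetical, and the code makes simplifying assumptions: rules are plain path prefixes (no wildcards), length is counted in characters, and an Allow wins when two matching rules have equal length.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # ("allow" | "disallow", path prefix)

RULES: List[Rule] = [
    ("disallow", "/cart"),
    ("disallow", "/checkout"),
    ("disallow", "/api/"),
    ("allow", "/api/sitemap.xml"),
]


def allowed_longest_match(rules: List[Rule], path: str) -> bool:
    """Longest matching prefix wins; no matching rule means the path is allowed."""
    best_kind, best_len = "allow", -1
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value matches nothing
        longer = len(prefix) > best_len
        tie_prefers_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_prefers_allow:
            best_kind, best_len = kind, len(prefix)
    return best_kind == "allow"


def allowed_first_match(rules: List[Rule], path: str) -> bool:
    """Document order: the first rule whose prefix matches decides."""
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            return kind == "allow"
    return True


if __name__ == "__main__":
    for path in ("/api/sitemap.xml", "/api/orders", "/products/42"):
        print(path, allowed_longest_match(RULES, path), allowed_first_match(RULES, path))
```

Under longest-match, /api/sitemap.xml is allowed. Under first-match it is blocked, because 'Disallow: /api/' appears earlier in the file. Listing the Allow line before the broader Disallow makes both strategies agree.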
Limitations are widely misunderstood. Disallow blocks crawling, not indexing. A page that cannot be crawled can still appear in search results if other sites link to it, sometimes with no description because the crawler never read the content. Suppressing indexing requires a noindex meta tag or X-Robots-Tag HTTP header, which can only be discovered if the page is crawlable. Robots.txt is also strictly advisory; well-behaved crawlers honor it, malicious scrapers ignore it. Sensitive paths should never be hidden by robots.txt alone, because the file itself is publicly readable and effectively advertises the locations of restricted content. Authentication, server-side IP blocking, or rate limiting are appropriate for genuine access control.
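One way to send the X-Robots-Tag header mentioned above is a small WSGI wrapper, sketched here in Python. The names add_noindex and demo_app and the /account prefix are hypothetical, and real deployments often set this header in the web server or framework configuration instead. The affected paths must stay crawlable in robots.txt, otherwise a crawler never sees the header.

```python
from typing import Callable, Iterable, List, Tuple

Headers = List[Tuple[str, str]]


def add_noindex(app: Callable, prefixes: Iterable[str]) -> Callable:
    """Wrap a WSGI app so responses under the given path prefixes carry noindex."""
    prefixes = tuple(prefixes)

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if not path.startswith(prefixes):
            return app(environ, start_response)

        def start_with_header(status: str, headers: Headers, exc_info=None):
            # Append the header rather than mutating the app's own list.
            tagged = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, tagged, exc_info)

        return app(environ, start_with_header)

    return wrapped


def demo_app(environ, start_response):
    """Tiny stand-in application used only for this illustration."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"hello"]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serves on localhost:8000; /account/... responses include the header.
    with make_server("127.0.0.1", 8000, add_noindex(demo_app, ["/account"])) as server:
        server.serve_forever()
```

The equivalent in-page signal is a robots meta tag with content "noindex" in the HTML head, which works only for HTML documents; the header also covers PDFs and other file types.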
Wildcard syntax (* and $) is supported by Google, Bing, Yandex, and most major engines. It was an extension to the original 1994 proposal and is now part of RFC 9309. Crawl-delay (a numeric pause between requests, in seconds) is honored by Bing but ignored by Google, which uses its own crawl-rate algorithm; Yandex honored it historically but has since stopped. The Sitemap directive is supported by all major engines and is independent of user-agent blocks.
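The wildcard semantics can be approximated by translating a rule path into a regular expression, as in this Python sketch. The function names are hypothetical, and the translation assumes the common interpretation: '*' matches any run of characters, a trailing '$' anchors the end of the URL path, and everything else is a literal prefix match. Percent-encoding normalization is left out.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces, then let each '*' match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"  # '$' in the rule means the path must end here
    return re.compile(regex)


def path_matches(pattern: str, path: str) -> bool:
    """True if path matches pattern from its first character onward."""
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(path_matches("/*.pdf$", "/docs/report.pdf"))
    print(path_matches("/*.pdf$", "/docs/report.pdf?download=1"))
    print(path_matches("/private*/", "/private-area/page"))
```

Without a trailing '$', a pattern behaves as a prefix, so 'Disallow: /cart' also matches /cartoon.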
The about text and FAQ on this page were drafted with AI assistance and reviewed by a member of the Coherence Daddy team before publishing. See our Content Policy for editorial standards.