
Robots.txt is a plain text file in the site root that uses the robots exclusion protocol to tell web robots which pages to crawl and which to skip. Before visiting a target page, a search engine checks robots.txt for instructions. Rules are grouped by user-agent, so different crawlers can be given different instructions. Robots.txt can be used to guide bots away from technical clutter and low-value pages so they spend their time on important content. AI crawlers such as GPTBot, ClaudeBot, Google-Extended, and CCBot can be blocked individually by using their specific user-agent strings. Common mistakes include disallowing all paths (Disallow: /) on a live site, blocking CSS or JavaScript files that affect rendering, and confusing disallow with noindex: disallow only stops crawling, so a disallowed page can still be indexed if it is linked externally.
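As a rough illustration of per-crawler blocking, the sketch below parses a small robots.txt with Python's standard-library `urllib.robotparser` and checks which user-agents may fetch which paths. The file contents and the paths (`/cart/`, `/search`, `/blog/post`) are hypothetical and do not come from the source article. Real crawlers may interpret edge cases differently from this parser, so treat the result as an approximation rather than a guarantee of crawler behavior.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths below are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /cart/
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named AI crawlers are blocked from everything; other bots are only
# kept out of the low-value paths listed under "User-agent: *".
for agent, path in [
    ("GPTBot", "/blog/post"),
    ("ClaudeBot", "/blog/post"),
    ("Googlebot", "/blog/post"),
    ("Googlebot", "/cart/checkout"),
]:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:10s} {path:16s} {verdict}")
```

Note that a "blocked" result here only means a compliant crawler will not fetch the page. It says nothing about indexing, which is why disallow is not a substitute for noindex.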
"Robots.txt is a plain text file in your root directory that tells search engine and AI crawlers which pages on your site to crawl and which to skip. By guiding bots away from technical clutter and low-value pages, you make sure they spend their time on the important, high-value content that drives results."
"The four AI crawlers most worth knowing (GPTBot, ClaudeBot, Google-Extended, and CCBot) respect robots.txt directives and can be blocked individually with their user-agent strings. Common robots.txt mistakes include using disallow: / on a live site, blocking CSS or JavaScript files (which hurts rendering), and confusing disallow with noindex, since a disallowed page can still be indexed if linked externally."
"The robots.txt file, also known as the robots exclusion protocol or standard, is a text file that tells web robots (often search engine crawlers and AI scrapers) which pages on your site to crawl. It also tells web robots which pages not to crawl. Let's say a search engine is about to visit a site. Before it visits the target page, it will check the robots.txt for instructions."
Read at Neil Patel