What is a Robots.txt File?
Robots.txt is a plain text file that tells search engine spiders not to crawl certain pages or sections of a website. Most major search engines (including Google, Bing and Yahoo) recognize and honor robots.txt requests.
Why Is a Robots.txt File Important?
Most websites don’t need a robots.txt file. That’s because Google can usually find and index all of the important pages on your site. Plus, it will automatically NOT index pages that aren’t important or that are duplicate versions of other pages.
That said, there are 3 main reasons that you’d want to use a robots.txt file.
Block Non-Public Pages: Sometimes you have pages on your site that you don’t want showing up in search. For example, you might have a staging version of a page. Or a login page. These pages need to exist, but you don’t want random people landing on them from search results. This is a case where you’d use robots.txt to block these pages from search engine crawlers and bots. (Keep in mind that robots.txt only blocks crawling; a blocked URL can still be indexed if other sites link to it, so truly sensitive pages should also sit behind authentication.)
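For example, a robots.txt file that asks all well-behaved crawlers to stay out of a staging area and a login page might look like this (the paths here are hypothetical):

```
User-agent: *
Disallow: /staging/
Disallow: /login/
```

The `User-agent: *` line means the rules apply to every crawler, and each `Disallow` line blocks any URL whose path starts with that prefix.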
Maximize Crawl Budget: If you’re having a tough time getting all of your pages indexed, you might have a crawl budget problem. By blocking unimportant pages with robots.txt, you free Googlebot to spend more of your crawl budget on the pages that actually matter.
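As a sketch, if filtered or sorted URL variations were eating your crawl budget, you could block the URL parameters involved (the parameter names here are made up for illustration; Google supports the * wildcard in robots.txt rules, though not every crawler does):

```
User-agent: Googlebot
Disallow: /*?sort=
Disallow: /*?filter=
```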
Prevent Indexing of Resources: Meta directives can work just as well as robots.txt for keeping pages out of the index. However, meta directives don’t work for non-HTML resources, like PDFs and images, because there’s no HTML page to place the tag in. That’s where robots.txt comes into play.
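For instance, a wildcard rule like this would block crawling of every PDF on the site (Google supports both the * wildcard and the $ end-of-URL anchor in robots.txt, though support varies across other crawlers):

```
User-agent: *
Disallow: /*.pdf$
```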
The robots.txt file lives at the root of your site. So, for site www.example.com, the robots.txt file lives at www.example.com/robots.txt. The bottom line? Robots.txt tells search engine spiders not to crawl specific pages on your website.
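If you want to sanity-check how a set of rules will be interpreted, one option is Python’s built-in robots.txt parser. Here’s a minimal sketch using hypothetical rules and URLs (note that this standard-library parser handles simple path prefixes, not Google-style wildcards):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, as they might appear in www.example.com/robots.txt
rules = """\
User-agent: *
Disallow: /staging/
Disallow: /login/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A blocked path is reported as not fetchable for any user agent
print(parser.can_fetch("*", "https://www.example.com/login/"))     # False

# Everything else remains crawlable by default
print(parser.can_fetch("*", "https://www.example.com/blog/post"))  # True
```

In practice you would point the parser at the live file with `set_url(...)` and `read()` instead of parsing an inline string.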
This helpful guide from Google has more info on the different rules you can use to block or allow bots from crawling different pages of your site.