Rebel Insights

The Ultimate Guide to Optimizing Your Robots.txt

A robots.txt file is a critical file on your website that gives web crawlers and robots a set of instructions about which pages they can or cannot access.

It helps control the crawling behavior of search engine bots so that your website is not overwhelmed with requests and so that crawlers stay away from pages they don’t need to fetch. If you want to keep a specific page out of Google Search, use a noindex directive or protect the page with a password. But if you want to steer crawlers away from lots of pages at once, robots.txt works well.
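
As a quick, generic illustration (not specific to any one platform), a noindex directive can be delivered either inside the page’s HTML or as an HTTP response header:

  <!-- In the page's <head>: tells compliant crawlers not to index this page -->
  <meta name="robots" content="noindex">

  # Or as an HTTP response header (useful for PDFs and other non-HTML files)
  X-Robots-Tag: noindex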

It’s important that you fully understand the power of robots.txt, because a poorly written file can severely damage your site’s SEO. On the flip side, a well-written one has plenty of benefits: it improves website performance by keeping crawlers out of parts of your site they shouldn’t access, which reduces traffic to your servers; it improves security by steering well-behaved bots away from your most sensitive areas; and it improves the search indexing process by guiding crawlers to your most relevant pages.

Components of Robots.txt

The most important lines of a robots.txt file can be broken down into four buckets:

  1. User-agent: This specifies which web crawler or user agent the rules apply to. A wildcard character (*) signifies that the rules apply to all crawlers. An example of calling out specific user agents like Google-Extended and GPTBot can be found in Narcity’s robots.txt.

  2. Disallow: This directive tells crawlers which pages or directories they are not allowed to crawl. One common use is keeping particularly sensitive areas away from crawlers, and Google recommends disallow as the way to manage crawler traffic to pages that don’t need to be crawled. It also saves crawl budget by preventing crawlers from wasting time on such pages. Often you’ll block entire directories of files; for example, anything matching /core/* is blocked in our robots.txt.

  3. Allow: There may be instances when you want to make exceptions to a disallow rule, and this is when you use the allow directive. Specific pages or directories listed here may be crawled despite a broader disallow rule. For example, Raw Story’s robots.txt allows /r/kappa/api/ to be crawled because it contains a custom-built sitemap, even though the /r/ folder is otherwise disallowed.

  4. Sitemap: This directive provides the location of your XML sitemap file, which lists all of the URLs on your website that you want indexed. A good crawler will discover your URLs on its own, but a sitemap speeds up the process. If your website has multiple sitemaps, they all belong here. An example of listing multiple sitemaps can be found in Panorama’s robots.txt. Before referencing any sitemap in robots.txt, check that it loads properly and actually contains URLs.

With the four components above, you can configure your robots.txt in a way that makes it clear which pages you want crawlers to visit and which pages you want robots to stay away from, as in the sketch below. You can hide internal resources or non-public pages and keep duplicate content from confusing crawlers. In the process, you are also optimizing your crawl budget.
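
Here is a minimal sketch of how those four directives fit together. The paths and sitemap URLs are illustrative placeholders, not recommendations for your site:

  # Rules for every crawler
  User-agent: *
  Disallow: /core/          # keep bots out of internal resources
  Disallow: /search/        # avoid wasting crawl budget on search result pages
  Allow: /core/public/      # exception to the /core/ rule above

  # Rules for one specific crawler
  User-agent: GPTBot
  Disallow: /

  # Tell crawlers where your sitemaps live
  Sitemap: https://www.example.com/sitemap.xml
  Sitemap: https://www.example.com/news-sitemap.xml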

One important note: While robots.txt provides a set of instructions, it doesn’t enforce them. Search engine crawlers and site health crawlers like Semrush are among the good bots that follow the rules, but spam bots are likely to ignore them. For that reason, be especially careful with any sensitive information that you are exposing on your website.

Common Issues

Search Engine Journal has a great list of the most common issues with robots.txt files that you should definitely give a read. Some of these include:

  • noindex: If you have this in your robots.txt, your file may be very outdated, as Google began ignoring noindex rules in robots.txt as of 2019. It's best to remove noindex references.
  • crawl-delay: This is supported by Bing but not Google, and crawl rate settings were removed entirely from Google Search Console at the end of 2023, so the directive has limited usefulness in your robots.txt.
  • missing sitemap: At least one sitemap should be in your robots.txt file.
  • incorrect use of wildcards: The asterisk (*) matches any sequence of valid characters, and the dollar sign ($) marks the end of a URL, such as a filetype extension. Use these carefully so you don’t accidentally block entire parts of your site; see the short sketch after this list.
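
To illustrate how the wildcard characters behave, here is a small hypothetical sketch; the paths are placeholders:

  User-agent: *
  Disallow: /*.pdf$      # blocks any URL that ends in .pdf
  Disallow: /*?print=    # blocks any URL containing the ?print= parameter
  Disallow: /*           # careful: this is equivalent to Disallow: / and blocks the entire site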

Update Your Robots.txt

RebelMouse users can easily make changes to their robots.txt by launching the Layout & Design Tool from the Posts Dashboard menu. Navigate to Global Settings and you’ll find a line for robots.txt. Click it and you can make updates right there.

Validate Your Robots.txt Setup

Google Search Console includes the ability to check that your robots.txt is set up properly. To do this, navigate to Settings at the bottom of the left-side navigation menu. Under Crawling, you should see robots.txt: “Valid.” For more insight, open the robots.txt report (right side of the screen), which shows the last time the file was checked, the file path, the fetch status (fetched successfully, or not fetched for reasons such as not found), and the size of the file. Any issues will be noted there, and if you need to request a recrawl, you can do so on the same page.

This is what you should see in Google Search Console for a valid robots.txt file.

If the robots.txt is not valid, you will see an error message and you can troubleshoot from there.
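
Outside of Search Console, you can also sanity-check how a crawler would interpret your rules with Python’s built-in urllib.robotparser. This is just a quick local sketch; the domain and paths are placeholders:

  from urllib import robotparser

  # Point the parser at your live robots.txt file (placeholder domain)
  rp = robotparser.RobotFileParser()
  rp.set_url("https://www.example.com/robots.txt")
  rp.read()

  # Check whether a given user agent may fetch specific URLs
  print(rp.can_fetch("*", "https://www.example.com/core/settings"))               # False if /core/ is disallowed
  print(rp.can_fetch("Googlebot", "https://www.example.com/articles/some-post"))  # True if not disallowed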

Request a Review

If you’d like one of our strategists to take a look at your robots.txt and make suggestions for optimizing it, simply get in touch and we can set that up with you.
