
What is Robots.txt? My Process On How to Block Your Content

As a digital marketer or ecommerce website owner, how your site ranks on search engine results pages could make or break your business.

There are many things you can do to control the way your website is ranked, both on and off the page. Things like SEO and keyword research might come to mind, but are you familiar with your robots.txt file?

It plays a big role in how your site is indexed and ranked so you’ll want to pay close attention to it.

Let’s talk about what a robots.txt file is, how it affects your SEO, and how and when you should use it on your site.

What is Robots.txt?

A robots.txt file, also known as a robots exclusion file, is a text file that tells search engine robots how to crawl and index your website. It is a key technical SEO tool used to prevent search engine robots from crawling restricted areas of your site.

How these robots crawl your website is very important in terms of how your site is indexed. In turn, this has huge implications for how your website ranks on search engine results pages.

Sometimes, you will have information or files on your site that are important to the way your site functions but aren't necessarily important enough to be indexed or shown in search results. When you list those files in your robots.txt file, they will be blocked from being crawled.

Where Does a Robots.txt File Exist?

A robots.txt file exists on the root of the domain. It will look like this:

Robots.txt in Website Domain
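For example, if your site is example.com (a placeholder domain used here for illustration), the file would live at:

https://www.example.com/robots.txt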

Keep in mind, this file only applies to that specific domain. Each individual subdomain needs its own robots.txt file; the file can't be placed on a subpage or in a subdirectory.

Google explains all of the specifications necessary in their guide but, in general, it should be a plain text file encoded in UTF-8. The records should be separated by CR, CR/LF, or LF. While each search engine has its own maximum file size limits, the maximum file size for Google is 500KB.

When Should a Robots.txt File Be Used?

The goal of your site should be to make it as easy to crawl as possible. Since this file will interrupt crawling and indexing a bit, be picky when deciding which pages of your site need to be blocked in your robots.txt file.

Rather than always using a robots.txt file, focus more on keeping your site clean and easily indexable.

However, situations requiring a robots.txt file are not always avoidable. The file exists to help you work around server load and crawl efficiency issues.

Examples of these issues include:

  • Pages with sensitive content or information
  • Unmoderated user-generated content such as comments
  • Category pages with non-standard sorting that creates duplication
  • Internal search pages that can result in an infinite number of pages
  • Calendar pages which produce a new page for each date

If you have an instance on your site where a Googlebot could get trapped and waste time, you should add a robots.txt rule for it. Not only will this save crawling time, but it will also improve the way your site is indexed and later ranked.
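For instance, a minimal sketch covering the last two cases above might look like this (the /search and /calendar/ paths are placeholders for your own internal search and calendar URLs):

User-agent: *
Disallow: /search
Disallow: /calendar/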

When Shouldn’t You Use a Robots.txt File?

While a robots.txt file should be used sparingly, there are situations where it shouldn't even be considered.

Examples of these situations are:

  • To Block JavaScript/CSS: These assets dramatically affect a user's experience on your site, and blocking them can lead to manual penalties from Google. Those penalties do more damage to your rankings than almost anything else, so avoid this whenever possible.
  • When There Is Nothing to Block: This one seems obvious. If your site has a clean and organized architecture, an occasional 404 is far less of a problem than it is for sites with messy or disconnected sitemaps.
  • To Block Access to Staging or Development Sites: You most likely don't want a staging site indexed, but there are better ways to achieve that goal than a robots.txt file. To eliminate confusion, simply make the site inaccessible to anyone outside of your administrative team.
  • Ignoring Social Media Network Crawlers: Robots.txt files can affect the snippets that social media networks build from your pages. Keep this in mind when you're building your site: you want a snippet to appear when someone shares your site on social media, so don't add rules that block those crawlers.
  • Blocking URL Parameters: Handle any parameter-specific issues directly inside Google Search Console.
  • Blocking URLs with Backlinks: A website's authority is built heavily on backlinks. When you use a robots.txt file to block URLs that have earned backlinks, you're harming the authority your SEO work has been building.
  • Getting Indexed Pages Deindexed: Sometimes disallowed pages still end up indexed. Don't use a robots.txt file to try to fix that; blocking the page only prevents crawlers from seeing a noindex directive on it.

A robots.txt file is very helpful when it comes to crawling but, if used incorrectly, it does have the ability to do more harm than good.

How to Format Robots.txt & Technical Robots.txt Syntax

There is a standardized syntax and specific formatting rules to follow when it comes to preparing a robots.txt file.

The three main robots.txt configurations are (see the sketch after this list):

  • Full allow – meaning all content is allowed to be crawled
  • Full disallow – meaning no content is allowed to be crawled
  • Conditional allow – meaning that your robots.txt file outlines which aspects are open for crawling and what content is blocked
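As a rough sketch, a full allow uses an empty Disallow value and a full disallow blocks the root path (a conditional allow simply mixes Disallow and Allow rules for specific paths):

User-agent: *
Disallow:

User-agent: *
Disallow: /

The first record blocks nothing; the second blocks the entire site.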

There are also some rules to follow when creating your robots.txt file. Let’s explore them.

Comments

Within the file, you can add comments. These lines start with a # and are ignored by search engines. They exist solely so that you can add in notes or comments about what each line of your file does and when and why it was added.
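For example, a comment line might look like this (the note itself is just a placeholder):

# Block internal search pages to save crawl budget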

Specific User Agent Tokens

You can also target a block of rules at a specific user agent. To do so, use the "User-agent" directive. The named crawler will follow that block of rules, while other crawlers will ignore it.

The most common specific user agent tokens are:

  • Googlebot: All Google crawlers.
  • Googlebot-News: The crawler for Google News.
  • Googlebot-Image: The crawler for Google Images.
  • Mediapartners-Google: Google's AdSense crawler.
  • Twitterbot: Twitter’s crawler.
  • Facebot: Facebook’s crawler.
  • Bingbot: Bing’s crawler.
  • Yandex: Yandex's crawler.
  • Baiduspider: Baidu's crawler.
  • *: Rules apply to every bot.

While these are a few of the most popular specific user agent tokens, this list is far from complete. Be sure to check for more on each search engine’s website: Google, Twitter, Facebook, Bing, Yandex, and Baidu.
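For example, a sketch that gives Google Images its own rule while every bot is kept out of an admin area might look like this (the paths are placeholders):

User-agent: Googlebot-Image
Disallow: /photos/

User-agent: *
Disallow: /admin/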

Robots.txt Sitemap Link

You can also insert a sitemap directive into your robots.txt file. This will tell search engines where to find the sitemap.

A sitemap will expose all of the URLs on a website and further guide search engines on where your content is located.
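For example (example.com is a placeholder domain), the directive is a single line pointing to the full sitemap URL:

Sitemap: https://www.example.com/sitemap.xml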

While this addition helps search engines discover all of your URLs, the sitemap won't appear in Bing Webmaster Tools or Google Search Console unless you submit it there manually.

Pattern Matching URLs

If you want a certain URL string to be blocked from search engine crawling, using a pattern matching URL in your robots.txt file is much more effective than including a list of complete URLs.

To use this tool, you’ll need the $ and * symbols.

The * symbol is a wildcard that represents any sequence of characters. It can be used as many times as you like, anywhere in the URL string.

The $ symbol signifies the end of a URL string.

For example:

  • *?*search= will block URLs such as /everything?any=parameter&search=word
  • /a will block any URLs starting with a lowercase a
  • *.pdf$ will block all PDF files on your site

Pattern matching URLs is a very effective way to block a large number of URLs or related URLs.

Robots.txt Blocks

To block particular content from being crawled, you'll want to use the robots.txt disallow rule.

When you insert Disallow: before a path or pattern, search engines will know to skip whatever URLs match it.

For example, the code below will block Google’s bots from crawling any of your URLs that start with a lowercase a:

User-agent: googlebot

Disallow: /a
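You can also combine Disallow with the pattern-matching symbols from the previous section. For example, this sketch blocks internal search parameters and all PDF files for every bot (the patterns are placeholders):

User-agent: *
Disallow: /*?*search=
Disallow: /*.pdf$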

Robots.txt Allow

On the flip side, if you want to make sure a specific URL can be crawled, adding "Allow:" will override a robots.txt disallow rule and ensure that it is crawled.

This is useful when you want one or two pages from a large group of content crawled while blocking the rest of that group.
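For example, a sketch that blocks an entire directory but still allows one page inside it to be crawled might look like this (the paths are placeholders):

User-agent: *
Disallow: /resources/
Allow: /resources/free-guide.html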

Robots.txt NoIndex

Robots.txt noindex is a directive intended to let you manage search engine indexing without using up crawl budget, by keeping a particular URL out of the index.
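The unofficial syntax mirrors the Disallow rule; for example (the path is a placeholder):

User-agent: googlebot
Noindex: /example-page/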

However, Google never officially supported noindex in robots.txt, and it announced in 2019 that its crawlers would stop honoring the directive entirely. Don't rely on it; keep the page crawlable and use a robots meta tag or an X-Robots-Tag HTTP header instead.

Common Robots.txt Issues

Like any powerful web tool, robots.txt has its issues. Some of them include:

  • Case sensitivity problems
  • The crawl-delay directive can cause issues and isn't honored by every search engine
  • Disallowed backlinks affect a site's authority
  • An out-of-date robots.txt file can give crawlers conflicting or obsolete directions
  • Robots.txt disallow overrides the parameter removal tool, so don't use them together
  • Disallowing a migrated domain keeps crawlers from seeing your redirects, which can undermine the migration

All of these issues are small when handled properly but can compound to larger issues when neglected. To minimize any negative impact, be sure to keep your robots.txt file up to date and accurate.

Best SEO Practices for Using Robots.txt

The most important SEO best practice is to save your robots.txt file correctly and test it after every change.

You can complete this task in your Google Search Console account. Using the tester tool, you can see which parts of your site are being crawled and which aren’t.

If you find that some of your content isn’t being crawled properly, you’ll know that you need to update your robots.txt file.

In addition, an improperly configured robots.txt file (one that disallows the whole site, for example) can cause Google to ignore your site altogether. When your site is not crawled and indexed, it won't appear on any search engine results pages.

Wrapping Up with Robots.txt

Robots.txt can be a really useful tool that benefits your site. However, it can do a lot of damage if not used properly.

Be sure that you’re installing your file correctly and updating it regularly to see the best results from this process.

Still confused? You can always find a quality website developer to walk you through the process!

