What Is a Robots.txt File and How Do You Create One?

A robots.txt file tells search engine spiders not to crawl certain web pages. Explore this blog to learn how to create a perfect robots.txt file and improve your site's SEO.

The robots.txt file tells search engine "robots" which pages to crawl and which not to crawl. Put simply, it restricts how Google and other search engines crawl your site. When used correctly, robots.txt can support your SEO efforts by making it easier for crawlers to crawl your website.

If you want to know how to create a robots.txt file, don't worry: by the end of this post, you will have a full understanding of the topic.

So, let’s quickly dive right in!

According to Google, what is the robots.txt file used for?

According to Google:

A robots.txt file is used primarily to manage web crawler traffic to your site, and usually to keep a file off Google.

Why is the robots.txt file important?

There are three main reasons you might want to use a robots.txt file, and this article will go over each one.

1: Block non-public pages

Sometimes you have pages on your site that you want to keep to yourself and don't want indexed, for a variety of reasons.

For example, you might have a staging version of a page running alongside the live website. To keep these sections hidden from search engines and bots, you use the robots.txt file, as in the sketch below.
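
A minimal sketch of such a rule, assuming the staging copy lives under a hypothetical /staging/ directory:

# Keep the staging area out of crawlers' reach (directory name is a placeholder)
User-agent: *
Disallow: /staging/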

2: Specify the sitemap's location

It is important to let crawlers know where your sitemap is located so they can crawl it.
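
A single Sitemap line in robots.txt is enough; the URL below is only a placeholder for your own sitemap:

Sitemap: https://www.example.com/sitemap.xml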

3: Keep duplicate content away from the SERPs

Adding a rule to your robots.txt file can prevent crawlers from crawling web pages that contain duplicate content.
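
As a minimal sketch, assuming printer-friendly duplicates live under a hypothetical /print/ directory:

User-agent: *
Disallow: /print/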

Steps to set up a robots.txt file:

1. Create a Robots.txt file

When you create a robots.txt file, it is recommended to use a plain text editor such as Notepad, TextEdit, or Emacs. These editors produce a valid .txt file with UTF-8 encoding and no formatting that could confuse crawlers.

Don't use a word processor for this task, because word processors often save files in a proprietary format that crawlers may find difficult to interpret.
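
To give you an idea of what the finished file might look like, here is a minimal sketch; the directory name and sitemap URL are placeholders, not values from this guide:

User-agent: *
Disallow: /admin/
Sitemap: https://www.example.com/sitemap.xml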

Limitations:

  • The file must be named "robots.txt"; the name is case sensitive, so you can't write Robots.txt, ROBOTS.TXT, or anything else.
  • Your site can only have one robots.txt file.
  • Make sure to save the file with UTF-8 encoding if prompted when saving.

2. Set the User-agent for Robots.txt

A user-agent identifies a specific crawler, and you need to name it so that your rules apply to the right program.

There are three different ways to declare a user-agent within your robots.txt file.

  • Define a single User-agent:

Here you only define one user-agent, and the syntax will be:

User-agent: NameOfBot

Note: A list of Google's user agents and the names of other popular crawlers can be found here.

  • Define more than one User-agent:

If more than one bot needs to be added, list the name of each bot on its own line. In this example, we use DuckDuckBot and Facebot.

Example:

User-agent: DuckDuckBot
User-agent: Facebot

  • Set all crawlers as the User-agent:

To apply your rules to all bots and crawlers at once, substitute an asterisk (*) in place of the bot name.

# Example using the asterisk
User-agent: *
Note: The # symbol marks the start of a comment.

3. Set Directives for your Robots.txt

A directive is a rule that tells the user-agent (crawler) which files or directories it can or cannot access. The main directives are:

  • Disallow:

You can prevent bots from accessing a certain section of the site by specifying the disallow directive.                                                                        

Example 1: # Block crawlers from crawling the entire website
Disallow: /

Example 2: # Block crawlers from crawling a specific directory
Disallow: /subfolder-name/

Example 3: # Empty Disallow
Disallow:

An empty Disallow means nothing is blocked: the bots are free to visit wherever they want.

Example 4: # Block a specific web page
Disallow: /website/pagename.html

Note: paths are case sensitive, so this rule applies to /website/pagename.html but not to /website/Pagename.html.

  • Allow:

By specifying this directive, you can allow bots to access a certain section of the site, even inside an otherwise disallowed area; see the sketch below.
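
A minimal sketch, assuming a hypothetical /blog/ directory that is blocked except for one post:

User-agent: *
Disallow: /blog/
Allow: /blog/allowed-post.html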

  • Sitemap:

This directive is used to call out the location of any XML sitemap associated with the website.

Example:

# Sitemap files

Sitemap: https://www.brandoverflow.com/sitemap.xml
Note: this directive is only supported by Google, Ask, Bing, and Yahoo.

  • Crawl-delay:

It defines how long a crawler should wait before loading and crawling the page content, as in the example below.
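
A minimal sketch; the 10-second delay is an arbitrary example value, and not every search engine honors this directive:

User-agent: *
Crawl-delay: 10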

4. Upload the Robots.txt file

Once you create a robots.txt file, upload it to the root directory of your site, which you can find by opening your FTP client or cPanel file manager and browsing to the public_html folder. Once uploaded, the file should be reachable at yourdomain.com/robots.txt.

5. Verify your Robots.txt File is functioning properly

Your robots.txt file is the key to making sure that bots and search engines follow your instructions, so test it thoroughly. One great tool for checking the file's syntax and logic is the robots.txt Tester available in Google Search Console. You can also check individual URLs programmatically, as in the sketch below.
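
As an additional quick check, Python's standard urllib.robotparser module can tell you whether a given URL is allowed for a given user-agent; the domain and paths below are placeholders:

from urllib.robotparser import RobotFileParser

# Point the parser at the live robots.txt file (placeholder domain)
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# Ask whether specific URLs may be fetched by a given user-agent
print(parser.can_fetch("*", "https://www.example.com/blog/allowed-post.html"))
print(parser.can_fetch("Googlebot", "https://www.example.com/staging/index.html"))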

Robots.txt syntax with a few useful rules:

#Disallow crawling of the entire website, from all crawlers:
User-agent: *
Disallow: /

#Disallow crawling of specific directories and their contents, from all crawlers:
User-agent: *
Disallow: /calendar/
Disallow: /seo/

#Allow all crawlers to crawl the entire website:
User-agent: *
Allow: /

#Block a specific image from a single crawler, Googlebot-Image:
User-agent: Googlebot-Image
Disallow: /images/seo.png

                                                                                                                                                                                                   
#Block all images on your site from a single crawler, Googlebot-Image:
User-agent: Googlebot-Image
Disallow: /

#Block all URLs that end with a specific string:
User-agent: Googlebot
Disallow: /*.xls$

Note: the $ matches URLs that end with the specified string, in this case .xls or any other extension you choose.

Final Words

The robots.txt file is important for both your website design and your SEO strategy. It is important that you have one, and since it is not hard to make, you can have one today. I have covered all the key points in this blog, and I hope it helps you create a robots.txt file for your website.

To view any website's robots.txt file, simply type website.com/robots.txt into your browser's address bar.

If you have any questions while creating a robots.txt file, feel free to contact me at [email protected]