
robots.txt allow and disallow – How we create it

Feb 26, 2021

Want to know more about the robots.txt allow and disallow functionality? Take a peek at this blog.

Robots.txt follows the robots exclusion standard.

It is a text file that tells search engines how they should crawl the website.

At Bobcares, we often receive requests to create robots.txt files and fix related errors as part of our Server Management Services.

Today, let’s take a closer look at robots.txt and see how our Support Engineers create it and avoid related errors.

 

Robots.txt allow and disallow

Robots.txt basically works like a “No Trespassing” sign. It tells robots which parts of the website we want them to crawl and which we do not. However, it is only a request; it does not actually block access.

The robots.txt file resides in the document root folder of the website.
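
For instance, for a site at example.com (a placeholder domain used here only for illustration), the file is served from the site root as https://example.com/robots.txt. A minimal sketch of an allow-all file looks like this:

# Nothing is disallowed, so all crawlers may access the whole site
User-agent: *
Disallow: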

Now, let’s explore more about how to allow and disallow search engine access to website folders using robots.txt directives.

 

Disallow robots and search engines from crawling

We can tell search engines which parts or folders of a website they must not access. This is easily done using the ‘disallow’ directive.

After the directive, we specify the path or folder name that the search engine must not access. If no path or folder is mentioned, the directive is ignored.

Here is an example:

User-agent: *
Disallow: /wp-admin/
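
A couple of other common disallow rules, sketched here with placeholder folder and file names:

# Block a private folder and a specific file for all crawlers
User-agent: *
Disallow: /private/
Disallow: /drafts/notes.html

# Or, as a separate file, block the entire site
User-agent: *
Disallow: /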

 

Allow robots and search engines to crawl

We can also tell search engines which folders they may access while crawling the website. This is easily done using the ‘allow’ directive.

By using the allow and disallow directives together, we can tell search engines to access only specific directories while the rest stay disallowed.

Here is an example:

User-agent: *
Allow: /blog/terms-and-condition.pdf
Disallow: /blog/

In the above example, search engines will not crawl anything inside the /blog/ folder except the file terms-and-condition.pdf.
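
If we want to double-check how such rules are interpreted, Python’s built-in urllib.robotparser module can test them. Below is a minimal sketch using the example rules above (domain.com is only a placeholder; note that the standard-library parser uses simple prefix matching and does not implement the $ wildcard discussed later):

from urllib.robotparser import RobotFileParser

# The example rules from above; domain.com is only a placeholder
rules = """User-agent: *
Allow: /blog/terms-and-condition.pdf
Disallow: /blog/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The PDF is explicitly allowed, everything else under /blog/ is not
print(parser.can_fetch("*", "http://domain.com/blog/terms-and-condition.pdf"))  # True
print(parser.can_fetch("*", "http://domain.com/blog/some-post/"))               # False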

 

A few common mistakes made while creating robots.txt allow or disallow rules

 

1. Separate line for each directive while using allow or disallow

When adding allow or disallow directives, each one must be on a separate line.

One of our customers had added the code below to robots.txt, and it was not working.

User-agent: * Disallow: /directory-1/ Disallow: /directory-2/ Disallow: /directory-3/

The above is an incorrect way of specifying directives in robots.txt. Our Support Engineers corrected the file with the code below:

User-agent: *
Disallow: /directory-1/
Disallow: /directory-2/
Disallow: /directory-3/

After adding this code, the robots.txt file started working fine.
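
To catch this kind of formatting problem early, a small script can flag lines that pack several directives together. This is only a rough sketch (a simple heuristic, not a full robots.txt validator):

import re

def find_packed_lines(robots_txt):
    """Return (line number, line) pairs where more than one directive shares a line."""
    problems = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        directives = re.findall(r"(?i)\b(?:user-agent|disallow|allow)\s*:", line)
        if len(directives) > 1:
            problems.append((number, line.strip()))
    return problems

broken = "User-agent: * Disallow: /directory-1/ Disallow: /directory-2/ Disallow: /directory-3/"
print(find_packed_lines(broken))  # flags line 1, which holds four directives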

 

2. Conflicting directives while using robots.txt

Recently, one of our customers had a robots.txt file with the below code in it.

User-agent: *
Allow: /directory
Disallow: /*.html

Here, search engines are unsure about what to do with the URL http://domain.com/directory.html, as it is not clear whether they’re allowed to access it.

So our Support Engineers rewrote the rule, anchoring the pattern with the $ character so that it only matches URLs ending in .html.

User-agent: *
Allow: /directory
Disallow: /*.html$

In the above code, search engines are not allowed to access URLs that end with .html. However, a URL like https://example.com/page.html?lang=en is still accessible, as it doesn’t end with .html.
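
To see why the $ sign matters, here is a rough sketch that translates the two patterns into regular expressions, reading * as “any characters” and a trailing $ as “end of the URL”. It only illustrates the matching idea; real crawlers have their own implementations:

import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex (illustrative only)."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"   # a trailing $ anchors the pattern to the end of the URL
    return regex

urls = ["/directory.html", "/page.html?lang=en"]

for pattern in ["/*.html", "/*.html$"]:
    regex = robots_pattern_to_regex(pattern)
    for url in urls:
        blocked = re.match(regex, url) is not None
        print(pattern, url, "blocked" if blocked else "allowed")

# /*.html blocks both URLs, while /*.html$ leaves /page.html?lang=en accessible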

[Need further assistance with robots.txt? – We’ll help you]

 

Conclusion

In short, we can instruct crawlers as to which pages to crawl and which pages not to crawl using the robots.txt allow and disallow directives. Today, we saw how our Support Engineers set up robots.txt files and fix errors related to them.
