This post is your ultimate guide to noindexing web pages. You will learn why you need to noindex a page, what noindexing methods you can apply, and other essentials.
In this post, we’ll explain what indexing is, why you might need to noindex or deindex a page, what techniques are available, and how to noindex a page smartly. Keep reading to broaden your SEO skills, learn professional tricks, and improve your website’s ranking.
Search engine indexing is the process of gathering, analyzing, and storing information to make its retrieval faster and more accurate.
Search engines are a tool for users who are trying to find some information. As users want to get necessary (and relevant) answers as fast as possible, search engines must organize the available information in advance. Web crawlers scan information from billions of web pages, collect it, and keep it in the search index, i.e., a huge database. According to Google, its search index comprises more than 100,000,000 gigabytes of data.
Web crawlers analyze publicly available web pages. Robots meta directives (or ‘meta tags’) are pieces of code that instruct search bots how to crawl and index a website’s pages. The two basic parameters are ‘index’ and ‘noindex.’ By default, a page is treated as ‘index.’ Precisely, search bots can index a page when the tag includes ‘index’ (for example, <meta name="robots" content="index,nofollow">) or when the tag doesn’t mention ‘noindex’ at all (for example, <meta name="robots" content="nofollow">).
Accordingly, if you apply a ‘noindex’ meta tag or a directive in an HTTP response header, you’re instructing search engines not to add your page to the index. To inform search engines about your website’s list of pages available for crawling, you can use sitemaps.
As new websites constantly appear and search bots scan billions of pages every day, it’s clear that search engines can’t crawl each page every day. Search crawlers will come to the page periodically, checking whether there are any changes.
Additionally, to avoid overloading servers, search bots don’t scan all pages of a multipage website simultaneously. Thus, it can take from a few days to several months for the whole website to be inspected and indexed.
There are certain cases when website pages should not be crawled (or indexed) by search engines. Let’s have a look at them.
If you are just in the process of creating your website, it’s necessary to prevent search engines from indexing it. It’s better to noindex your website when it is in the ‘construction mode.’ So, while installing a theme or template and customizing your pages, website crawlers should not get access to them. You never know when a search bot visits your page. And you don’t want incomplete information to get into the index, do you?
Search engines care about the quality of pages and provide quality guidelines on how to create useful pages. The basic principle is making your pages primarily for users, not search engines, and avoiding tricks to improve search engine rankings (such as using hidden texts or links, for instance).
Consequently, if your page doesn’t contain useful information for website users, search bots may consider it as ‘thin content’ and even penalize the website.
Not all pages have the same value for users and should be available in search engine results. Often, blogs have numerous archives, author, and tag pages that by themselves don’t have much importance for readers. When these archives or tags are indexed, people may see numerous results for the pages contained in archives and tags, alongside blog posts.
Archive or tag pages don’t add to users’ experience and may confuse visitors. Thus, it can be wise to noindex archive and tag pages and not waste the crawl budget. In other words, noindexing specific pages helps to prioritize crawling and have better results on the SERP.
If your website has an internal search, it’s a wise idea to noindex its multiple pages. This way, you are helping search crawlers to understand your website better and display only relevant results on the SERP.
However, be careful with noindexing paginated pages on eCommerce websites. If your store has categories with multiple pages and you decide to noindex page 2 onwards, this can reduce the crawling of pages listed on paginated categories. You can use self-referencing canonical with standard pagination to avoid duplicate content in search indexes.
Frequently, websites include ‘thank you pages’ that simply express gratitude to new customers after a purchase or to new subscribers after they have signed up for a newsletter. These ‘thank you pages’ add no value to people who are using search engines to find helpful information. So, consider noindexing such pages. Likewise, most admin and login pages should have a noindex tag.
Content not intended for search engines, such as confidential or sensitive information, should also have a ‘noindex’ tag or directive. You can consider making shopping carts or checkout pages of an online store as ‘noindex,’ too.
If you have decided that you need to noindex a page/pages, you probably wonder how to do this quickly and safely. In this section, we’ll consider various ways of how to noindex a page and their peculiarities.
Robots.txt is a file that instructs search engine crawlers which URLs they can access on a website. This file is useful for managing crawler traffic to your website. A robots.txt file contains one or more rules that block or allow access for a specific crawler to a particular path. Specifically, a robots.txt file can let some search engines crawl your website and block pages for other search engines.
The two basic rules in robots.txt are ‘allow’ and ‘disallow.’ By default, a search bot can crawl any page or directory unless it includes a ‘disallow’ rule. If you want to prevent all search engines (i.e., user-agents) from crawling your website, you can use the following command:
User-agent: *
Disallow: /
The part ‘User-agent: *’ means that it refers to all search bots, and the ‘Disallow: /’ rule states that you are blocking the whole website.
If you want to avoid crawling unimportant pages, robots.txt is of great help as this file mainly works by exclusion. For instance, you can instruct search bots not to crawl PDF files of your website by adding two lines:
User-agent: *
Disallow: /*.pdf$
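To see which URLs a pattern such as /*.pdf$ would cover, a small helper can translate it into a regular expression. This is only a sketch of the commonly documented wildcard behavior (‘*’ matches any run of characters, and a trailing ‘$’ anchors the end of the URL path); the function names robots_pattern_to_regex and path_matches are our own, and real crawlers may differ in edge cases.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Without '$', the pattern behaves like a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the URL path is covered by the robots.txt pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None
```

For example, path_matches("/*.pdf$", "/files/guide.pdf") is true, while a URL with a query string after ".pdf" no longer ends in ".pdf" and therefore isn’t covered by this rule.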
Similarly, for a WordPress website, it’s possible to block access to such directories as /wp-admin/, /wp-content/comments/, or /wp-content/plugins/. The commands will look like this:
User-agent: *
Disallow: /wp-admin/
User-agent: *
Disallow: /wp-content/comments/
User-agent: *
Disallow: /wp-content/plugins/
Thus, you are disallowing bots from crawling your admin folder, comments, and plugins.
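Before publishing rules like these, you can check them locally. The sketch below uses Python’s standard urllib.robotparser module; the ExampleBot user agent and the example.com URLs are placeholders, and the three Disallow rules are merged into a single ‘User-agent: *’ group because, as far as we can tell, this parser keeps only the first group for ‘*’. It also matches plain path prefixes, so it isn’t suitable for testing wildcard rules such as /*.pdf$.

```python
from urllib.robotparser import RobotFileParser

# The WordPress rules from the text, merged into one group (our assumption
# that this is equivalent to repeating "User-agent: *" before each rule).
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/comments/
Disallow: /wp-content/plugins/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/wp-admin/options.php") should report the URL as blocked, while an ordinary blog URL stays crawlable.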
A robots.txt file is not the best way to keep a web page out of the search index. Firstly, this file can stop search engines from crawling pages but not from indexing them. If crawlers find links to these pages (even on other websites) and consider them important, the URLs can still get into the index. In that case, the results on the search engine results page may appear without descriptions in the snippets.
Secondly, it’s up to search engines to follow the instructions, so search crawlers might just ignore your robots.txt file. Thirdly, anyone can access a robots.txt file and see what pages you wish to keep away from search crawlers. You can’t hide pages with confidential information this way, either.
Since search engines can’t access pages protected by passwords, such pages will not be crawled and indexed. Accordingly, if you set password protection, a web page will not appear on the index.
For instance, while creating a store on Shopify, your website will have password protection and thus won’t be indexed. When you are ready to launch the store, you should remove password protection.
Unfortunately, not all bots are as friendly and harmless as Google or Bing that crawl and index pages to provide users with the most relevant information. There are unwanted bots that can harm a website owner or even bring down a site. Spambots, hacker bots, or malicious bots are just a few examples.
Suppose you have spotted unwanted activity on your website pages and want to block a bad bot. In that case, you should identify either the bot’s IP address or the ‘User-agent string’ that the bot is using.
Blocking the bot’s activity is possible by modifying the .htaccess file, and it’s advisable to be very careful with it. To block a specific IP address, you need to add the following lines to the .htaccess file, inserting the necessary IP address (note that Apache doesn’t allow a space after the comma in the Order directive):
Order Deny,Allow
Deny from IP address
If you need to disallow the bot’s activity by a user-agent string, it’s necessary to find a part of the user-agent string unique to that bot. Then, you’ll have to add the rule that disallows this particular bot to the .htaccess file. For instance, if you know that a bot SecuitySpider is affecting your website, you’ll use this RewriteRule:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} SecuitySpider [NC]
RewriteRule .* - [F,L]
Alternatively, it’s possible to use a directive BrowserMatchNoCase to block a bot:
BrowserMatchNoCase "SecuitySpider" bots
Order Allow,Deny
Allow from ALL
Deny from env=bots
If you are not an experienced developer who knows the commands of .htaccess files well, we highly recommend backing up your .htaccess file before overwriting it.
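Before editing the live file, it can help to confirm that the fragment you picked really matches the bot’s user-agent string and doesn’t catch legitimate visitors. The following sketch imitates the case-insensitive, unanchored match that the [NC] flag and BrowserMatchNoCase perform. It’s an approximation only: Python’s re module and Apache’s regex engine differ in details, and the function name and sample strings are hypothetical.

```python
import re

# Fragments of user-agent strings to block; "SecuitySpider" is the bot name
# used in the rules above. Extend the list with your own patterns.
BLOCKED_AGENT_PATTERNS = ["SecuitySpider"]


def would_be_blocked(user_agent: str, patterns=BLOCKED_AGENT_PATTERNS) -> bool:
    """Case-insensitive, unanchored regex search on the User-Agent value."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)
```

A string such as "Mozilla/5.0 (compatible; secuityspider/1.2)" would be blocked despite the different letter case, while a regular browser or Googlebot string would pass.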
We have already mentioned robots meta tags that define the interaction of a search bot with a page. It’s necessary to note that search crawlers can see and understand these tags only if a robots.txt file doesn’t block access to the pages that contain them.
The basic syntax for making a page noindex is <meta name="robots" content="noindex">. This tag should be a part of the <head> section of the page that you are blocking from indexing.
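If you want to confirm that a page’s HTML actually carries the tag, a short script can look for it. Below is a minimal sketch based on Python’s standard html.parser module; the class and function names are our own. It treats the ‘none’ value as equivalent to ‘noindex, nofollow’ and ignores bot-specific tags such as <meta name="googlebot">.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute holds a comma-separated list of directives.
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex_meta(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex/none."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives or "none" in parser.directives
```

Passing the page source to has_noindex_meta tells you whether the tag is present; a page with content="index,nofollow" or with no robots tag at all is reported as indexable.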
If you are using WordPress, you can easily discourage search engines from indexing the whole website. All you have to do is go to the Reading section in Settings of your admin panel and tick the box ‘Discourage search engines from indexing this site.’ Yet, as the admin panel warns you, it’s up to search engines to honor this request.
Applying a noindex meta tag to separate pages depends on what platform, website builder, or content management system you use to create a website. Frankly speaking, more and more website builders are facilitating the process of optimizing websites for search engines. So, blocking a page from indexing can be pretty straightforward. Let’s have a look at just a few examples.
As leading open-source software, WordPress offers numerous tools for work. The easiest way to block a separate page from indexing is by using a plugin, such as Yoast or All-in-One SEO. By using these plugins, you can adjust the SEO settings of single pages or blog posts.
Open the page you need to hide, scroll to the Yoast SEO section, go to the Advanced tab, and choose ‘No’ for the option ‘Allow search engines to show this Page in search results?’
Being a powerful e-commerce platform, Shopify also lets you control SEO settings. Noindexing a URL may not be as straightforward as it is with plugins in WordPress: you’ll have to edit the code. To exclude a specific page from indexing in Shopify, you need to find the theme.liquid layout file in the Themes section of your Online Store and insert the following code:
{% if handle contains 'page-handle-you-want-to-exclude' %}
<meta name="robots" content="noindex">
{% endif %}
Remember to replace page-handle-you-want-to-exclude with the correct page handle.
The Editor in Wix is quite user-friendly when you need to prevent crawlers from indexing the page. Select the necessary page in the Pages section, click ‘Show More,’ choose ‘SEO Basics,’ and disable the toggle ‘Let search engines index this page.’
The HubSpot CMS features the tools that allow customers to change page settings and add the necessary noindex meta tag. In the Advanced Options in Settings, you need to choose the ‘Head HTML’ field, insert the tag <meta name="robots" content="noindex">, and save the changes.
Another popular content management system, Squarespace, has two ways of applying a noindex tag. Firstly, you can take advantage of built-in options in Page Settings without having to work with coding. On the SEO tab, check ‘Hide this page from search engine results’ and save the settings.
Secondly, it’s possible to add the noindex tag via code injection. In the Page Settings, select the Advanced tab and insert the tag in the field (Page Header Code Injection). Remember to save the settings.
Page settings in Tilda also allow you to hide a specific page from indexing. Simply go to Page Settings, open Facebook and SEO, choose ‘Appearance in search results,’ and check the box ‘Forbid search engines from indexing this page.’
As a flexible CMS, Webflow lets you modify page coding and apply the noindex tag for a page. In the Page panel, select Page Settings and go to Custom code. Then, insert the necessary noindex tag, save, and publish the content.
This website builder lets you modify SEO settings and apply ‘noindex’ right in the Page Properties of the admin panel. Make sure you have chosen the relevant page, tick the ‘No-index’ box, and save the settings.
Besides noindex meta tags, you can use the X-Robots-Tag as an element of the HTTP response header for a specific URL. This approach works for blocking non-HTML resources, such as PDF files, video files, images, etc. To apply this tag, you should be able to access your website’s header.php, .htaccess, or server configuration file. Here, you can see an example of the X-Robots-Tag blocking PDF files:
<Files ~ "\.pdf$">
Header set X-Robots-Tag "noindex, follow"
</Files>
It’s necessary to add the code to your .htaccess file or httpd.conf file.
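Once the header is in place, you can verify it by inspecting the response headers of the file. The helper below is a hedged sketch: it takes a plain dictionary of response headers (obtained with any HTTP client) and reports whether an X-Robots-Tag value contains ‘noindex’ or ‘none.’ The function name is our own, and the sketch ignores values scoped to a particular bot, such as "googlebot: noindex".

```python
def header_blocks_indexing(headers: dict) -> bool:
    """Return True if an X-Robots-Tag header carries noindex (or none)."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() != "x-robots-tag":
            continue
        directives = {part.strip().lower() for part in value.split(",")}
        if "noindex" in directives or "none" in directives:
            return True
    return False
```

With the rule above applied, a PDF response containing X-Robots-Tag: noindex, follow would make header_blocks_indexing return True, while a response without the header returns False.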
After blocking the page from indexing, you may want to make sure that everything is working correctly. Here, we’ll examine some means of checking if a page is indexable or not.
Firstly, Google Search Console offers the URL Inspection Tool that lets you see the current index status of a page. Additionally, you can test a URL with the Live URL test. It examines the page in real-time, so the data you’ll get with this test may differ from the indexed page.
Various add-ons can check the page’s indexability directly in the browser. In Chrome, user-friendly Website SEO Checker or SEO Minion are just a few examples. Similarly, in Firefox, you can find the SeoQuake SEO extension, Detailed SEO Extension, SEO Minion, etc.
Do you prefer using tools on your PC and getting a complete audit? Take advantage of the Screaming Frog SEO Spider. It will analyze your website and provide a list of pages blocked by robots.txt, meta robots, or X-Robots-Tag directives such as ‘noindex’ or ‘nofollow.’
You can also use the online SEO platform Sitechecker. This platform will help you perform an SEO audit and regular checks of your website. Page Counter by Sitechecker will help you find all pages of your site and check their indexing in search engines.
Sometimes you realize that your website has some pages that should be excluded from the search index. By the way, to check which pages are visible to Google, you can type site:yourdomain in the Google search and see all the results of the Google index for your website. So, let’s study the most common cases when you need to deindex a page and how to do this.
Firstly, all pages in WordPress are indexed by default when a website is published. If you don’t have much experience in SEO settings for a WordPress website, unnecessary pages can get indexed by mistake. For instance, besides pages, you may find archives, tags, menu items, theme parts, smart filters, etc., displayed on the SERP.
Fortunately, the Yoast plugin lets you choose various content types settings that specify which types of content should appear in search results.
Secondly, it’s advisable to deindex page duplicates and hide those pages that don’t add to the user experience. For instance, if your website contains a printer-friendly version together with an ordinary one, only one of them should appear in the search index.
Thirdly, if you happen to have hacked website pages (we hope it’s not the case now), it’s necessary to remove these pages from indexing. Google’s hacking classifiers monitor websites and can block them in case of hacking to prevent other computers from getting infected.
However, this protection doesn’t catch every case, and attackers still manage to harm websites. Thus, if any hacked URLs are showing in the search index, you should remove them manually.
First and foremost, as discussed before, you can choose various ways of making a page noindex (using noindex meta tags, disallowing specific pages via robots.txt, canonicalizing to other pages, etc.).
Furthermore, Google Search Console provides instructions on how to remove a page from the Google index. You may hide pages from search results temporarily with the help of the Removal tool.
Keep in mind that this is only a temporary measure, although it can be the first step of a permanent page removal. The block works for about six months, and then you have to decide which action to take.
If you don’t need a page anymore (like in the case of hacked pages), you can remove the content by deleting the page. Then, your server will return a 404 (Not Found) status code, and Google will remove the page from its index with time.
Alternatively, as described earlier, you can choose other blocking techniques to block access to the pages. In short, it’s up to you to decide how to prevent Google from indexing a page in each particular situation.
Let’s briefly summarize what indexing/noindexing is, why noindex or deindex a page, and how to do it.
Search engine indexing performed by bots or crawlers is collecting, analyzing, and keeping information. Indexes aim to organize data from different pages and to make it accessible and valuable. A page that has an ‘index’ status is available for crawlers.
Noindexing is blocking access to website pages and excluding them from appearing in search engine results.
When a website includes pages that don’t have much value and shouldn’t be visible on the SERP, make them ‘noindex.’ The most typical examples are archive, tag, author pages for blogs, or login and admin pages. Also, blocking access to confidential pages or Thank You pages is necessary.
The safest means is applying a ‘noindex’ meta tag or using the X-Robots-Tag in the HTTP header response. Additionally, you can restrict access to website directories, such as /wp-admin/ by ‘Disallow’ commands in the robots.txt file.
Indeed, you can. If the page became available to search engines by mistake, you can make it noindex.
A quick but temporary decision for removing a page is with the Removal tool in the Google Search Console. Afterward, remember to choose an appropriate way to deal with the situation (either delete the page entirely or make it noindex).
We hope you find this information helpful. Remember to update your website regularly, revise the content, monitor performance, check the site’s health, and deindex pages when necessary.
Still need help with noindexing a page or any other web development task? Just drop us a line. In more than 16 years in the industry, we have helped thousands of businesses and agencies meet their needs.
We are a leading provider of web and mobile development services, from building custom themes for websites based on various platforms (WordPress, Drupal, etc.) and eCommerce development (Shopify development, Magento, WooCommerce) to hand-crafting email templates and engaging HTML5 banners.