  1. Indexing 101
  2. Here's what I did to identify indexing issues
  3. And the verdict is …
  4. Key takeaways on general indexing problems

Google is open about the fact that not every page it finds gets indexed. You can use Google Search Console to see which pages on your website are not indexed.

Google Search Console can also give you useful information about the specific issue that prevented a page from being indexed.

These issues include server errors, 404s, and indications that the page may have thin or duplicate content.

However, we never see data showing what problems are most common across the web.

So … I decided to collect data and do the statistics myself!

This article examines the most common indexing issues that are preventing your pages from showing up in Google Search.

Indexing 101

Indexing is like building a library, except that Google deals with websites instead of books.

In order for your pages to appear in search, they must be properly indexed. In plain terms, Google has to find and save them.

Google can then analyze their content to decide which queries they might be relevant for.

Indexing is a prerequisite for getting organic traffic from Google, and the more pages of your website are indexed, the more chances you have of showing up in search results.

Because of this, it is very important for you to know if Google can index your content.

Here's what I did to identify indexing issues

My daily work involves optimizing websites from a technical SEO standpoint to make them more visible in Google, and as a result I have access to several dozen websites in Google Search Console.

I decided to put this access to use, hoping to make popular indexing problems … well, less popular.

For the sake of transparency, I've broken down the methodology that led me to some interesting conclusions.

Methodology

I started by building a sample of pages, combining data from two sources:

  • I used the data from our customers that was available to me.
  • I asked other SEO experts to share anonymized data with me by posting a Twitter poll and contacting some SEOs directly.

SEOs, I need 3-10 minutes of your time.
Can you help me with my indexing research and share some non-sensitive GSC stats with me?
If I find interesting insights, I'll publish an article about it.

Thanks in advance! Please RT.

🙏🙏 https://t.co/vAwMulQtsx

– Tomek Rudzki (@TomekRudzki), November 9, 2020

Both proved to be fruitful sources of information.

Excluding non-indexable pages

Some pages are better off not being indexed. These include old URLs, articles that are no longer relevant, filter parameters in e-commerce, and much more.

Webmasters can ensure that Google ignores them in a number of ways, including the robots.txt file and the noindex tag.

Including such pages would have had a negative impact on the quality of my results, so I removed from the sample any page that meets one of the following criteria (see the sketch after this list):

  • Blocked by robots.txt.
  • Marked as noindex.
  • Redirected.
  • Returns an HTTP 404 status code.
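
As a rough illustration of this filtering step, here is a minimal Python sketch. It is not the exact tooling used for the research; the user agent and function name are mine, and it only checks the four signals listed above (robots.txt rules, noindex directives, redirects, and 404s):

    import urllib.error
    import urllib.request
    from urllib import robotparser
    from urllib.parse import urlparse

    def is_indexable(url, user_agent="Googlebot"):
        """Return True if the URL passes the basic sample filters described above."""
        parsed = urlparse(url)

        # 1. Blocked by robots.txt?
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
        rp.read()
        if not rp.can_fetch(user_agent, url):
            return False

        # 2.-4. Redirected, returning an error status, or marked as noindex?
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        try:
            response = urllib.request.urlopen(request)
        except urllib.error.HTTPError:
            return False  # 404 or another error status
        if response.geturl() != url:
            return False  # the request was redirected elsewhere
        if "noindex" in response.headers.get("X-Robots-Tag", "").lower():
            return False  # noindex via HTTP header
        body = response.read().decode("utf-8", errors="ignore").lower()
        if 'name="robots"' in body and "noindex" in body:
            return False  # noindex via meta tag (deliberately naive string check)
        return True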

Excluding low-value pages

To further improve the quality of my sample, I only considered the pages that are included in sitemaps.

In my experience, sitemaps are the clearest representation of valuable URLs from a given website.

Of course, there are many websites whose sitemaps contain junk. Some even include the same URLs in their sitemaps and robots.txt files.

But I had already taken care of that in the previous step.
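
To show how a sitemap-based sample can be assembled, here is a minimal sketch; the sitemap URL in the usage comment is a placeholder, and it only handles a single urlset-style sitemap (not a sitemap index):

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def urls_from_sitemap(sitemap_url):
        """Return the <loc> entries from a urlset-style XML sitemap."""
        with urllib.request.urlopen(sitemap_url) as response:
            root = ET.fromstring(response.read())
        return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

    # Example (placeholder URL):
    # sample = urls_from_sitemap("https://example.com/sitemap.xml")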

Categorizing the data

I've found that popular indexing problems vary based on the size of a website.

This is how I split the data (a small helper capturing these thresholds is sketched after the list):

  • Small websites (up to 10,000 pages).
  • Medium-sized websites (from 10,000 to 100,000 pages).
  • Large websites (from 100,000 to 1 million pages).
  • Huge websites (over 1 million pages).
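
For clarity, the size buckets above can be expressed as a tiny helper (the function name is mine; the thresholds are the ones just listed):

    def size_bucket(page_count):
        """Assign a website to a size bucket based on its number of pages."""
        if page_count <= 10_000:
            return "small"
        if page_count <= 100_000:
            return "medium"
        if page_count <= 1_000_000:
            return "large"
        return "huge"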

Because the sites in my sample differ so much in size, I had to find a way to normalize the data.

A very large website struggling with a particular problem could otherwise outweigh the problems of other, smaller websites.

So I looked at each website individually to sort out the indexing issues it struggles with. Then I assigned scores to the indexing issues based on the number of pages affected by a given issue on a given website.
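
As one plausible reading of that normalization (not necessarily the exact scoring used), each site can contribute the share of its affected pages per issue, so a huge website cannot dominate the aggregate ranking. The data structure and function name below are mine:

    from collections import defaultdict

    def rank_issues(sites):
        """Rank issues by their average per-site share of affected pages.

        sites: {site_name: {issue_name: affected_page_count}}
        """
        totals = defaultdict(float)
        for issue_counts in sites.values():
            site_total = sum(issue_counts.values()) or 1
            for issue, count in issue_counts.items():
                totals[issue] += count / site_total
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)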

And the verdict is …

Here are the top five problems I found on websites of all sizes.

  1. Crawled – currently not indexed (quality issue).
  2. Duplicate content.
  3. Discovered – currently not indexed (crawl budget/quality issue).
  4. Soft 404.
  5. Crawl issue.

Let's break these down.

Quality

Quality issues include your pages being thin, misleading, or overly biased.

If your page doesn't have clear, valuable content that Google wants to show users, you'll have a hard time getting it indexed (and you shouldn't be surprised).

Duplicate content

Google may recognize some of your pages as duplicate content even if you didn't intend to.

A common problem is canonical tags pointing to different pages, with the result that the original page doesn't get indexed.

If you have duplicate content, use the rel=canonical attribute or a 301 redirect.

This ensures that pages on your own website aren't competing with each other for views, clicks, and links.
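
To illustrate the kind of check involved, here is a minimal sketch that fetches a page and reports where its rel=canonical points; the regex-based extraction is deliberately naive and the user agent string is a placeholder:

    import re
    import urllib.request

    CANONICAL_RE = re.compile(
        r'<link[^>]+rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']',
        re.IGNORECASE,
    )

    def canonical_target(url):
        """Return the rel=canonical URL declared on the page, or None."""
        request = urllib.request.Request(url, headers={"User-Agent": "indexing-audit"})
        with urllib.request.urlopen(request) as response:
            html = response.read().decode("utf-8", errors="ignore")
        match = CANONICAL_RE.search(html)
        return match.group(1) if match else None

    # A page whose canonical points at a different URL is a candidate for the
    # "duplicate content" bucket:
    # if canonical_target(page_url) not in (None, page_url): ...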

Crawl budget

What is crawl budget? Based on several factors, Googlebot will only crawl a certain number of URLs on each website.

This means optimization is vital: don't let Googlebot waste its time on pages you don't care about.

Soft 404s

404 errors mean you submitted a deleted or non-existent page for indexing. Soft 404s display "not found" information but don't return an HTTP 404 status code in the response.

Redirecting deleted pages to irrelevant ones is a common mistake.

Long redirect chains can also show up as soft 404 errors. Make an effort to keep your redirect chains as short as possible.
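
To make the idea concrete, here is a small sketch that follows redirects one hop at a time and returns the chain of URLs visited; the hop limit, timeout, and user agent are arbitrary choices of mine:

    import http.client
    from urllib.parse import urljoin, urlparse

    def redirect_chain(url, max_hops=10):
        """Follow Location headers one hop at a time; return the URLs visited."""
        chain = [url]
        for _ in range(max_hops):
            parsed = urlparse(chain[-1])
            conn_cls = (http.client.HTTPSConnection if parsed.scheme == "https"
                        else http.client.HTTPConnection)
            conn = conn_cls(parsed.netloc, timeout=10)
            conn.request("HEAD", parsed.path or "/",
                         headers={"User-Agent": "indexing-audit"})
            response = conn.getresponse()
            location = response.getheader("Location")
            conn.close()
            if response.status not in (301, 302, 303, 307, 308) or not location:
                break
            chain.append(urljoin(chain[-1], location))
        return chain

    # A chain longer than two or three hops is worth flattening:
    # print(redirect_chain("https://example.com/old-page"))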

Crawl issues

There are many crawl issues, but an important one involves robots.txt: if Googlebot detects that your website has a robots.txt file but cannot fetch it, it won't crawl the site at all.
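
A quick way to check this yourself is to request robots.txt and look at the status code; the sketch below flags server errors, since Google generally treats a 404 as "no robots.txt" but may stop crawling a site whose robots.txt keeps returning 5xx (the helper name and timeout are mine):

    import urllib.error
    import urllib.request

    def robots_txt_status(site):
        """Return (status_code, looks_crawlable) for https://<site>/robots.txt."""
        url = f"https://{site}/robots.txt"
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.status, True
        except urllib.error.HTTPError as error:
            # 4xx is treated as "no robots.txt"; repeated 5xx can halt crawling.
            return error.code, error.code < 500
        except urllib.error.URLError:
            return None, False

    # print(robots_txt_status("example.com"))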

Finally, let's look at the results for different website sizes.

Small websites

Sample size: 44 websites

  1. Crawled – currently not indexed (quality or crawl budget issue).
  2. Duplicate content.
  3. Crawl budget issue.
  4. Soft 404.
  5. Crawl issue.

Medium-sized websites

Sample size: 8 websites

  1. Duplicate content.
  2. Discovered – currently not indexed (crawl budget/quality issue).
  3. Crawled – currently not indexed (quality issue).
  4. Soft 404 (quality issue).
  5. Crawl issue.

Large websites

Sample size: 9 websites

  1. Crawled – currently not indexed (quality issue).
  2. Discovered – currently not indexed (crawl budget/quality issue).
  3. Duplicate content.
  4. Soft 404.
  5. Crawl issue.

Huge websites

Sample size: 9 websites

  1. Crawled – currently not indexed (quality issue).
  2. Discovered – currently not indexed (crawl budget/quality issue).
  3. Duplicate content (duplicate, submitted URL not selected as canonical).
  4. Soft 404.
  5. Crawl issue.

Key takeaways on general indexing problems

Interestingly, these findings show that two website size classes suffer from the same problems, which illustrates how difficult it is to maintain quality on large websites:

  • Greater than 100,000 but fewer than 1 million pages.
  • Greater than 1 million pages.

The takeaways, however, are:

  • Even relatively small websites (over 10,000 pages) may not be fully indexed due to an insufficient crawl budget.
  • The bigger the site, the more pressing the crawl budget and quality issues become.
  • The duplicate content problem is serious, but it varies depending on the website.

P.S. A note on URLs unknown to Google

During my research, I realized that there is another common problem that is preventing pages from being indexed.

It didn't earn a place in the ranking above, but it is still significant, and I was surprised to see how common it still is.

I'm talking about orphaned pages.

Some pages on your website may not have internal links leading to them.

If Googlebot has no path to a page through your website's internal links, it may never discover it at all.

What's the solution? Add internal links from related pages.

You can also fix this manually by adding the orphaned page to your sitemap. Unfortunately, many webmasters still neglect this.
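
One way to spot orphaned pages is to compare the URLs in your sitemap against the set of URLs your internal links actually point to. Here is a minimal sketch of that comparison; it assumes you already have both sets, for example the sitemap list from the earlier sketch and a link export from whatever crawler you use (linked_urls_from_crawl is a hypothetical name):

    def find_orphans(sitemap_urls, internally_linked_urls):
        """Return sitemap URLs that no internal link points to."""
        return sorted(set(sitemap_urls) - set(internally_linked_urls))

    # orphans = find_orphans(
    #     urls_from_sitemap("https://example.com/sitemap.xml"),  # placeholder URL
    #     linked_urls_from_crawl,  # hypothetical crawl export
    # )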
