Google uses a predictive method to identify duplicate content based on URL patterns. This can result in pages being incorrectly identified as duplicates.
To avoid unnecessary crawling and indexing, Google tries to use URL patterns to predict when pages may contain similar or duplicate content.
If Google crawls pages with similar URL patterns and finds that they contain the same content, it may assume that all other pages with that URL pattern contain the same content as well.
Unfortunately for site owners, this could mean that pages with unique content are written off as duplicates because they share a URL pattern with pages that are actual duplicates. Those pages would then be left out of Google's index.
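Google hasn't published how this prediction works, but the behavior Mueller describes can be sketched roughly in Python. Everything below, the URL scheme, the pattern rules, and the function names, is an illustrative assumption rather than Google's actual implementation:

```python
import hashlib
import re
from collections import defaultdict

def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse pattern, e.g. /events/berlin/1 -> /events/{city}/{id}.
    The segment rules here are purely illustrative."""
    path = re.sub(r"/\d+", "/{id}", url)
    return re.sub(r"/events/[^/]+", "/events/{city}", path)

def content_hash(html: str) -> str:
    """Fingerprint page content so identical pages hash identically."""
    return hashlib.sha256(html.encode()).hexdigest()

def predict_duplicates(sampled_pages: dict, all_urls: list) -> set:
    """If every sampled page sharing a URL pattern has identical content,
    predict that unsampled URLs with that pattern are duplicates too."""
    hashes_by_pattern = defaultdict(set)
    for url, html in sampled_pages.items():
        hashes_by_pattern[url_pattern(url)].add(content_hash(html))
    # A pattern whose sampled pages all hash the same is treated as duplicative.
    duplicate_patterns = {p for p, h in hashes_by_pattern.items() if len(h) == 1}
    return {u for u in all_urls
            if u not in sampled_pages and url_pattern(u) in duplicate_patterns}
```

In this sketch, if sampled pages for `/events/berlin/1` and `/events/potsdam/1` return identical content, an unsampled URL like `/events/munich/1` would be predicted to be a duplicate and skipped, even if its content is actually unique, which is exactly the failure mode described above.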
This topic is discussed in the Google Search Central SEO hangout recorded on March 5th. Website owner Ruchit Patel asks Mueller about his event website, on which thousands of URLs are not being indexed correctly.
One of Mueller's theories as to why this is happening is based on the predictive method used to identify duplicate content.
Read Mueller's answer in the following section.
Google's John Mueller on Duplicate Content Prediction
Google has several layers to determine when web pages have duplicate content.
One is to look directly at the page's content, and another is to use URL patterns to predict when pages are duplicates.
“What happens a lot on our side is that we try to understand, on several levels, when there is duplicate content on a website. One is when we look directly at the content of the page and see that this page has this content and that page has different content, so we should treat them as separate pages.
The other thing is sort of a broader predictive approach, where we look at the URL structure of a website and see that, in the past, URLs that look like this have shown the same content as URLs that look like that. We then essentially learn that pattern and say that URLs which look like this are the same as URLs which look like that.”
Mueller goes on to explain that Google does this to save resources on crawling and indexing.
If Google thinks a page is a duplicate version of another page because it has a similar URL, it won't even crawl that page to see what the content really looks like.
“Even without looking at the individual URLs, we can sometimes say we'll save ourselves the crawling and indexing and just focus on these assumed or very likely duplication cases.
I've seen this happen with things like cities. I've seen this happen with things like, I don't know, cars, where our systems essentially recognize that what you put in as the city name is not that relevant for the actual URLs. And we usually learn this kind of pattern when a website has a lot of the same content with alternate names.”
Mueller talks about how Google's predictive method of detecting duplicate content can affect event websites:
“With an event site, I don't know if this applies to your website, but with an event site it can happen that you take one city, and a city that is perhaps a kilometer away, and the event pages show exactly the same thing because the same events are relevant to both places.
And you take a city that is maybe three miles away and again show exactly the same events. On our side, this could easily lead to a situation where we say we checked 10 event URLs, and this parameter that looks like a city name is actually irrelevant, because we checked 10 of them and they displayed the same content.
Our systems can then say that maybe the city name as a whole is irrelevant and we can simply ignore it.”
What can a website owner do to fix this problem?
As a possible solution to this problem, Mueller suggests looking for situations in which there are genuine cases of duplicate content and limiting those as much as possible.
“In a case like this, I would try to find out whether there are situations in which content overlaps heavily, and find ways to limit that as much as possible.
That could be done with something like a rel canonical on the page, where you say, well, this small town that is just outside the big city, I'll set the canonical to the big city because it shows exactly the same content.
So that really, for every URL that we crawl on your website and index, we can see that this URL and its content are unique, and that it's important for us to keep all of these URLs indexed.
Or we see clear information that this URL, you know, is supposed to be the same as this other one, because perhaps you've set up a redirect or a rel canonical there. Then we can just focus on those main URLs and still understand that the city aspect is critical for your individual pages.”
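Mueller's suggestion can be sketched as a simple mapping from small towns to the big-city page whose events they mirror, with the canonical URL rendered into a rel=canonical link tag. The town names, domain, and helper functions below are invented for illustration:

```python
# Hypothetical mapping: suburbs whose event listings duplicate a nearby
# big city point their canonical at that city's page.
CANONICAL_CITY = {
    "potsdam": "berlin",   # shows the same events as Berlin
    "pasing": "munich",    # shows the same events as Munich
}

def canonical_url(city: str) -> str:
    """Return the URL a city page's rel=canonical should point to.
    Cities with unique listings remain their own canonical."""
    target = CANONICAL_CITY.get(city, city)
    return f"https://example.com/events/{target}"

def canonical_link_tag(city: str) -> str:
    """Render the <link rel="canonical"> tag for a city's event page."""
    return f'<link rel="canonical" href="{canonical_url(city)}">'
```

A page for the duplicated suburb would then emit, for example, `canonical_link_tag("potsdam")`, consolidating the duplicate listings onto the Berlin URL while cities with unique events keep their own indexable URLs.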
Mueller doesn't address this aspect of the problem, but it's worth noting that there is no penalty or negative ranking signal associated with duplicate content.
At worst, Google won't index duplicate content, but it won't negatively affect the website as a whole.
Hear Mueller's answer in the video below: