Google’s John Mueller discussed the role of TF-IDF in Google’s algorithm. He explained what it is and offered a better way to optimize web pages for ranking.
What is TF-IDF?
Wikipedia has a concise definition of what TF-IDF is:
“…tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection… The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.”
The key thing to focus on is that TF-IDF is a metric related to the entire “collection” or “corpus.” In the case of web search, the corpus is every page in the index, so the metric depends on how often the word or phrase appears across all of those pages, not just on your own page. This is a statistical analysis.
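Written as a formula, one common textbook variant of the definition looks like the following. The symbols are assumptions chosen for illustration: t is a term, d is a document, D is the corpus, N is the number of documents in D, and tf(t, d) is the number of times t appears in d. Other weightings exist, so treat this as a sketch of the idea and not as Google’s formula.

```latex
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\left|\{\, d' \in D : t \in d' \,\}\right|}
```

The denominator counts the documents in the corpus that contain the term, which is why the score cannot be computed from a single page alone.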
That part about “some words appear more frequently in general” refers to how TF-IDF discounts commonly used words (like and, a, and the), which effectively removes them from consideration for ranking purposes.
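As a rough illustration of how common words end up with little or no weight, here is a minimal Python sketch of the “inverse document frequency” half of the metric. The function name, the three-page corpus, and the choice of a plain logarithm are all assumptions made for this example; nothing here reflects how Google actually computes anything.

```python
import math

def inverse_document_frequency(term, corpus):
    """Return log(N / df): N documents in total, df of them containing the term."""
    n_docs = len(corpus)
    docs_with_term = sum(1 for doc in corpus if term in doc.lower().split())
    if docs_with_term == 0:
        return 0.0  # assumption: a term found nowhere carries no weight
    return math.log(n_docs / docs_with_term)

# Hypothetical three-page corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang a song",
]

for word in ["the", "cat", "bird"]:
    print(word, round(inverse_document_frequency(word, corpus), 3))
# the 0.0
# cat 0.405
# bird 1.099
```

Because “the” appears in every document, its weight is zero, while the rarer “bird” gets the highest weight.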
TF-IDF is used to create statistical averages of the use of words and phrases throughout the web. It’s not the magic content solution that some people have suggested.
Here is the question Mueller was asked:
“What are your thoughts on TF-IDF keywords? Does Google use a similar mechanism?
Should we make use of this to make our content better?”
John Mueller answered:
“…TF-IDF keywords is essentially a metric that is used in information retrieval.”
“Information retrieval” here refers to the general field, which covers everything from web search to searching through a Gmail inbox. It is a broad term.
Then he said this:
“With regards to trying to understand which are the relevant words on a page, we use a ton of different techniques from information retrieval. And there’s tons of these metrics that have come out over the years.”
This hints that focusing on a single old metric, one mainly useful for finding “stop words,” is unproductive because Google uses many other techniques.
TF-IDF and Ranking in Google
“…My general recommendation here is not to focus on these kinds of artificial metrics… because it’s something where on the one hand you can’t reproduce this metric directly because it’s based on the overall index of all of the content on the web.
So it’s not that you can kind of like say well, this is what I need to do, because you don’t really have that metric overall.”
This means that it’s not possible to reproduce the TF-IDF values Google would calculate, because they depend on statistics drawn from its index of the entire web.
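A small Python sketch makes the dependence on the corpus concrete. It scores the same word on the same page against two different collections of documents. The function, the sample page, and both corpora are hypothetical and exist only for this illustration; the scoring is a simple textbook variant, not Google’s.

```python
import math

def tf_idf(term, document, corpus):
    """Textbook TF-IDF: share of the document's words that are the term, times log(N / df)."""
    words = document.lower().split()
    tf = words.count(term) / len(words)
    df = sum(1 for doc in corpus if term in doc.lower().split())
    if df == 0:
        return 0.0  # assumption: a term found nowhere carries no weight
    return tf * math.log(len(corpus) / df)

page = "seo tips for better seo"

# Two hypothetical corpora that both include the page.
marketing_sites = [page, "seo audit checklist", "local seo guide"]
general_sites = [page, "banana bread recipe", "train timetable", "weather forecast today"]

print(tf_idf("seo", page, marketing_sites))  # 0.0, because every page mentions "seo"
print(tf_idf("seo", page, general_sites))    # about 0.55
```

The page did not change, yet its score did. Without access to the real corpus, any number a third-party tool reports is based on that tool’s own sample of documents.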
John Mueller Recommendations for Ranking Better
John Mueller went on to describe a better alternative to focusing on TF-IDF:
“Instead, I would strongly recommend focusing on your website and its users and making sure that what you’re providing is something that Google will in the long term still recognize and continue to use as something valuable.”
Mueller noted that this is a fairly old metric, implying that modern information retrieval has become more sophisticated:
“The other thing is… this is a fairly old metric and things have evolved quite a bit over the years. …there are lots of other metrics as well.”
Then he said that focusing on users is a better approach because it is more resilient to algorithm changes. Google is focused on delivering the most useful search results. If you focus on useful content, the page will likely remain popular and continue to be shown in Google.
Here’s what Mueller said:
“So just blindly focusing on just one kind of theoretical metric and trying to squeeze those words into your pages, I don’t think that’s a useful thing.
I think that’s very shortsighted thinking because you’re focusing just purely on a search engine where you think that these words have a stronger effect.
So, don’t just focus on artificially adding keywords. Make sure that you’re doing something where all of the new algorithms will continue to look at your pages and say, well this is really awesome stuff. We should show it more visibly in the search results.”
TF-IDF and SEO
- A major use for TF-IDF is finding stop words such as “a,” “the,” and “and.”
- It is an old and basic content metric.
- Many other content metrics are better than the simple TF-IDF metric.
In a world where AI, neural networks and machine learning are the norm, TF-IDF is like a kid’s bike on training wheels compared to a Ferrari.
Mueller referenced its use for weeding out stop words (words like and, the, and that). That seems a fitting use for such an old technology. A basic algorithm like this could very well be limited to the simple task of identifying stop words.
We can’t know for sure, but the fact that Mueller mentioned TF-IDF in the context of stop word removal and didn’t mention any other context is meaningful.
Watch the Google Webmaster Hangout here.