Google’s Index Size Revealed by Google’s VP of Search
Google rarely discusses the size of its web index—at least publicly. What exactly is Google’s index? Simply put, it’s a digital record of web pages and other documents eligible to be served in Google’s search results.

If it’s not in the index, you won’t see it in search results.

Google’s Index Size Revealed
Many might believe you can simply search Google for any page on the web, but the opposite is closer to the truth. Out of trillions and trillions of possible pages, Google must narrow its index down to mere “billions” of the most important documents.

Google typically keeps the actual size of its index a secret, but recently, during testimony in the USA vs. Google antitrust trial, questioning by US attorneys revealed that Google maintained a web index of “about 400 billion documents.”

The number came up during the cross-examination of Google’s VP of Search, Pandu Nayak. The 400 billion refers to Google’s index size in 2020.

Nayak also testified that “for a time,” the index actually shrank below that figure. Finally, when asked if Google had made any changes to its index capacity since 2020, Nayak replied, “I don’t know in the past three years if there’s been a specific change in the size of the index.”

The takeaway is that while 400 billion probably isn’t the exact index size, it’s most likely a good ballpark figure. Also, the size of the index shifts over time and may even be shrinking.

How Big is 400 Billion Documents?
Make no mistake, 400 billion is a big number. For example, the size of this (very small) website you are reading right now—Zyppy—is about 50 pages. So Google’s index could hold 8 billion websites like this one.

Some sites are much larger. Wikipedia, for example, has 7 billion pages in Google’s index. So, Google could hold only about 50-60 Wikipedias.
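For a quick back-of-the-envelope check of these comparisons, here is the arithmetic spelled out. The only inputs are the figures already cited above (400 billion documents, roughly 50 Zyppy pages, and 7 billion Wikipedia pages):

```python
# Back-of-the-envelope math for the comparisons above.
GOOGLE_INDEX_DOCS = 400_000_000_000   # ~400 billion documents (2020 figure from testimony)

zyppy_pages = 50                      # approximate size of the Zyppy site, per the article
wikipedia_pages = 7_000_000_000       # Wikipedia pages in Google's index, per the article

print(GOOGLE_INDEX_DOCS / zyppy_pages)      # ~8,000,000,000 sites the size of Zyppy
print(GOOGLE_INDEX_DOCS / wikipedia_pages)  # ~57 Wikipedias
```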

To put this figure in perspective, consider the size of Google’s index compared to popular indexes SEOs might know about – Ahrefs, Moz, and the Wayback Machine. And remember that Google, while it filters out a lot of junk, is more likely to contain vast numbers of documents such as books, patents, PDFs, and scientific papers that serve smaller, more niche audiences.

Google Excludes An Increasing Number of Documents
Google can’t index every page it finds on the web. Nor does it want to.

Google actually discovers trillions of pages while crawling. But as Nayak testified, most of those pages aren’t helpful to users.

“Like I said, trillions of pages is a lot of pages. So it’s a little difficult to get an index of the whole web. It’s not even clear you want an index of the whole web, because the web has a lot of spam in it. So you want an index of sort of the useful parts of the web that would help users.”

Beyond getting rid of spam, Nayak listed several other factors that impact the size of Google’s index:

1. Freshness of Documents
Some pages on the web change quickly – like the front page of CNN. Other important pages can stay the same for years. The challenge Google faces is estimating how often a page might change to keep its index fresh without unnecessary crawling.
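Google hasn’t published how it schedules recrawls, but the basic idea Nayak describes – estimate how often a page changes and revisit it accordingly – can be sketched with a simple adaptive interval. This is purely illustrative, not Google’s actual system:

```python
from datetime import timedelta

def next_crawl_interval(current_interval: timedelta,
                        content_changed: bool,
                        min_interval: timedelta = timedelta(hours=1),
                        max_interval: timedelta = timedelta(days=180)) -> timedelta:
    """Illustrative (not Google's) adaptive recrawl scheduling:
    check sooner if the page changed since the last visit, back off if it didn't."""
    if content_changed:
        new_interval = current_interval / 2   # page is changing: crawl more often
    else:
        new_interval = current_interval * 2   # page is stable: save crawl budget
    return max(min_interval, min(new_interval, max_interval))
```

A news homepage that changes every visit would quickly drift toward the one-hour floor, while a static page would back off toward the six-month ceiling.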

2. Document Size
Webpages are simply getting bigger. Ads, images, and ever more code mean the average page size has grown substantially over time.

Since it costs money to crawl and process web documents, this growth makes indexing more expensive for Google.

“… over time at various times, the average size of documents has gone up for whatever reason. Webmasters have been creating larger and larger documents in various ways. And so for the same size of storage, you can index fewer documents, because each document has now become larger.”

Bigger documents mean pressure to index fewer pages.
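As a purely hypothetical illustration of the trade-off Nayak describes (the storage budget and page sizes below are invented, not from the testimony):

```python
# Hypothetical numbers: a fixed storage budget divided by average document size.
storage_bytes = 10 * 1024**5             # 10 PB of index storage (made-up figure)

avg_doc_then = 50 * 1024                  # assume an average document once took ~50 KB
avg_doc_now = 200 * 1024                  # and now takes ~200 KB

print(storage_bytes // avg_doc_then)      # documents that fit at the smaller size
print(storage_bytes // avg_doc_now)       # 4x larger documents -> 1/4 as many fit
```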

3. Metadata Storage
Not only does Google store each document, it also creates a tremendous amount of metadata about it, including all the words and concepts related to the page.

“… when we get these documents, not only do we create an index, we create a bunch of metadata associated with the document which reflects our understanding of the document. And that has also grown over time. And so that also takes space in the index. And as a result, that results in the number of documents that you can index in a fixed size of storage to go down.”

As Google’s algorithms become more sophisticated, the amount of metadata per document increases, limiting how much the index can grow.
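Continuing the same hypothetical arithmetic, per-document metadata shrinks capacity further (again, every number here is invented for illustration):

```python
# Same made-up 10 PB budget, now accounting for per-document metadata.
storage_bytes = 10 * 1024**5

avg_doc = 200 * 1024                       # hypothetical average document size
avg_metadata = 100 * 1024                  # hypothetical metadata stored per document

print(storage_bytes // avg_doc)                    # capacity ignoring metadata
print(storage_bytes // (avg_doc + avg_metadata))   # capacity once metadata is counted
```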

4. Cost of Indexing and Processing
At the end of the day, all those data centers cost a lot of money – and use a lot of electricity!

“… there is this trade-off that we have in terms of amount of data that you use, the diminishing returns of the data, and the cost of processing the data. And so usually, there’s a sweet spot along the way where the value has started diminishing, the costs have gone up, and that’s where you would stop.”
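One way to picture that sweet spot: keep adding documents while the marginal value of the next batch still exceeds its marginal cost, and stop when it doesn’t. The curves below are invented purely to illustrate the shape of the trade-off Nayak describes, not anything Google has disclosed:

```python
import math

def marginal_value(docs_billions: float) -> float:
    """Made-up diminishing-returns curve: each extra billion documents adds less value."""
    return 100.0 / math.sqrt(docs_billions + 1)

def marginal_cost(docs_billions: float) -> float:
    """Made-up cost curve: storage and processing cost grows with index size."""
    return 0.5 + 0.01 * docs_billions

size = 1.0
while marginal_value(size) > marginal_cost(size):
    size += 1.0   # keep growing while the next billion documents is still "worth it"

print(f"Sweet spot (for these invented curves): ~{size:.0f} billion documents")
```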

Takeaways for Web Publishers
As AI-generated content becomes cheaper to produce and floods the web, Google may be forced to index an ever-smaller percentage of the pages it finds. As Nayak explained, the goal of Google’s index isn’t to make a complete record of every document, but to index enough pages to satisfy users.

“… making sure that when users come to us with queries, we want to make sure that we’ve indexed enough of the web so we can serve those queries. And so that’s why the index is such a crucial piece of the puzzle.”

This supports what Google has been publicly hinting at for years: when Google doesn’t index a page, it is often because it doesn’t believe the page will be useful to users.

If Google isn’t indexing your pages, you may need to evaluate your site’s technical SEO, the usefulness of the content, links to your site, and your user engagement, among other factors.

It may seem like being in Google’s index is a no-brainer, but increasingly, we may see more pages excluded from it.

Source: Zyppy.com
