In an update to its Googlebot Help document, Google has specified that Googlebot will only crawl and index the first 15MB of an HTML file or supported text-based file.
“Googlebot can crawl the first 15MB of an HTML file or supported text-based file. Any resources referenced in the HTML such as images, videos, CSS, and JavaScript are fetched separately. After the first 15MB of the file, Googlebot stops crawling and only considers the first 15MB of the file for indexing. The file size limit is applied on the uncompressed data. Other crawlers may have different limits.”
The update caused some head-scratching among SEOs. For example, would images count towards the size limit, meaning that any text below images that had pushed the page past the limit would simply be ignored?
In response, Google’s John Mueller tweeted on 24 June to clarify that embedded resources or content with IMG tags would not count as part of the HTML file.
John Mueller’s tweet read:
“It’s specific to the HTML file itself, like it’s written. Embedded resources / content pulled in with IMG tags is not a part of the HTML file.”
John Mueller also confirmed that this is not a change, just official documentation of an already existing policy.
“This is not a change, it’s just not previously been officially documented (folks have experimentally determined it too, eg https://icg.agency/blog/whats-the-maximum-file-size-google-can-index ). Feel free to ping us or submit feedback if you’re unsure about the docs!”
So what is SEO best practice…
Google has now put on the record what the crawl cut-off is for Googlebot. However, 15MB is a large amount for a single HTML file, so there’s no need for undue worry.
It’s good practice (also editorially) to place important content at the top of the page to ensure it’s not missed, so Google can rank your page appropriately. It’s also a good idea to keep your web pages light. This is better both for users, who will just move on if your page takes too long to load, and for crawlers such as Googlebot.
You can check your HTML page size with free tools such as Sitechecker, and you can use the URL Inspection tool in Search Console to see which parts of the page Google renders.
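If you would rather check the size yourself, the rule quoted above (the limit applies to the uncompressed HTML file only) can be sketched in a few lines of Python. This is a minimal illustration, not Google’s method: the function name `html_size_report` and the constant `ASSUMED_LIMIT_BYTES` are made up for this example, and reading “15MB” as 15 × 1024 × 1024 bytes is an assumption, since the documentation does not say which definition of a megabyte it uses.

```python
import gzip

# Assumption: "15MB" is read as 15 * 1024 * 1024 bytes. Google's documentation
# does not say whether it means this or 15,000,000 bytes.
ASSUMED_LIMIT_BYTES = 15 * 1024 * 1024


def html_size_report(raw_body: bytes, content_encoding: str = "") -> dict:
    """Measure an HTML response body on its uncompressed bytes.

    raw_body is the body exactly as received; content_encoding is the value
    of the Content-Encoding response header (only gzip is handled here).
    """
    if content_encoding.lower() == "gzip":
        html = gzip.decompress(raw_body)
    else:
        html = raw_body
    size = len(html)
    return {
        "uncompressed_bytes": size,
        "over_limit": size > ASSUMED_LIMIT_BYTES,
        "bytes_past_limit": max(0, size - ASSUMED_LIMIT_BYTES),
    }


# Example: a small page sent gzip-compressed is measured after decompression.
page = b"<html><body>" + b"<p>hello</p>" * 1000 + b"</body></html>"
report = html_size_report(gzip.compress(page), "gzip")
print(report)
```

Images, videos, CSS and JavaScript files referenced by the page are fetched separately, so they are deliberately left out of this count.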
In light of the confusion caused by this documentation of the crawl limit, Google published a blog post clarifying which content the 15MB limit applies to. The post reiterates that, with the median HTML file being about 30KB in size, the overwhelming majority of site owners will not be affected by this crawl limit.
Google added:
“However, if you are the owner of an HTML page that’s over 15 MB, perhaps you could at least move some inline scripts and CSS dust to external files, pretty please.”
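To get a rough idea of how much of a page is made up of inline scripts and CSS, something like the following sketch could be used. It relies on Python’s standard `html.parser` module; the class `InlineWeightParser` and the function `inline_weight` are hypothetical names for this example, and the tally is in characters rather than bytes, so treat the result as an estimate.

```python
from html.parser import HTMLParser


class InlineWeightParser(HTMLParser):
    """Tally the characters found inside inline <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._current = None  # "script" or "style" while inside an inline block
        self.inline_chars = {"script": 0, "style": 0}

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._current = "style"
        elif tag == "script" and not any(name == "src" for name, _ in attrs):
            # Scripts with a src attribute already live in external files.
            self._current = "script"

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.inline_chars[self._current] += len(data)


def inline_weight(html: str) -> dict:
    """Return character counts for inline script and style content."""
    parser = InlineWeightParser()
    parser.feed(html)
    parser.close()
    return dict(parser.inline_chars)


sample = (
    "<html><head><style>body{}</style>"
    '<script src="a.js"></script>'
    "<script>var x=1;</script></head>"
    "<body><p>hi</p></body></html>"
)
print(inline_weight(sample))
```

Large totals here point to content that could be moved into external files, which Googlebot fetches separately from the HTML.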
Read the full details on Google’s Search Central blog.