Google 2024 Leak

Google confirmed the authenticity of the documents, commenting as follows:

“We caution against making incorrect search assumptions based on information that is out of context, out of date, or incomplete. We have shared extensive information about how search works and what factors our systems take into account, while working to protect the integrity of our results from manipulation.”

Google search ranking algorithms

This Google data leak, along with other leaks and recent testimony in the US Department of Justice antitrust case, has shed light on many aspects of Google’s ranking algorithms and calls into question some of the company’s public statements. Here are a few key points that diverge from its claims about ranking methods and are of great interest to SEOs:

  • On-Site Behavior: NavBoost, an important component of ranking, uses click data to boost or demote a site’s ranking. It analyzes user clicks on search results, taking into account parameters such as “good clicks”, “bad clicks” (badClicks), click duration (lastLongestClicks) and others. This allows Google to understand which search results satisfy users most and which pages are worth boosting in the rankings. NavBoost also takes into account behavior such as pogo-sticking (quickly returning to the search results after clicking a result that did not satisfy the query). Click length (how long the user stays on the page) helps determine the usefulness and relevance of the page.
  • Chrome Browser Data Usage: The leak revealed that Google collects extensive user behavior data that is used to rank pages and domains. For example, Google may use the number of clicks on pages in the Chrome browser to determine the most popular URLs on a site, which influences the creation of Sitelinks.
  • Website Whitelists: Google has whitelists for websites related to travel, COVID, and elections. This allows Google to control search results for controversial or potentially problematic queries, ensuring that only verified and trusted sources are shown.
  • Domain Authority: Google has repeatedly stated that it does not use a Domain Authority metric in its algorithms. However, the leak revealed a siteAuthority metric that is used in the Q* system to assess the authority of a site. This indicates that Google has an internal equivalent of domain authority.
  • Sandbox: Google has stated that there is no sandbox and that new sites are not subject to special restrictions. However, the leak mentions the hostAge attribute, which is used for a “fresh spam sandbox.” This suggests that Google does use some form of sandboxing for new or suspicious sites.
  • Data from EWOK: EWOK is Google’s internal search quality assessment platform, where real people view search results pages and rate them based on a number of criteria such as relevance, usefulness and trustworthiness of the source. Data from quality raters can be used to directly influence page rankings.
  • Considering brand size: Popular and well-known brands have priority in ranking. Google uses various methods to identify and rank brands, including brand size, which is determined not only by the site itself but also by mentions of the site across the Internet (even without links).
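To make the click-signal idea more concrete, here is a minimal sketch of how counters like these could be folded into a single satisfaction ratio. The field names echo the attributes mentioned in the leak, but the class, the function, the 0.5 weight and the formula itself are assumptions made for illustration. The documents do not disclose how NavBoost actually combines these values.

```python
from dataclasses import dataclass


@dataclass
class ClickSignals:
    """Click counters for one (query, URL) pair; names echo the leaked attributes."""
    good_clicks: int
    bad_clicks: int
    last_longest_clicks: int


def click_satisfaction(signals: ClickSignals) -> float:
    """Hypothetical satisfaction ratio in [0, 1]; this is not Google's formula."""
    total = signals.good_clicks + signals.bad_clicks
    if total == 0:
        return 0.5  # no click evidence either way
    # Assumed weight: a "last longest click" counts as half an extra good click.
    bonus = 0.5 * signals.last_longest_clicks
    return (signals.good_clicks + bonus) / (total + bonus)


if __name__ == "__main__":
    print(click_satisfaction(ClickSignals(80, 20, 40)))
    print(click_satisfaction(ClickSignals(10, 90, 0)))
```

A page with mostly good clicks and many long final clicks scores close to 1, while a page that users keep bouncing from scores close to 0.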

Additional Important Points

  • Date matters: Google actively associates dates with content using bylineDate (a given date on a page), syntacticDate (a date extracted from a URL or title), and semanticDate (a date extracted from a page’s content).
  • Original Content and Keywords: Short content is assessed for originality and this affects its ranking. Page titles must be relevant to user queries, which remains an important factor.
  • Font Size matters: Google tracks the weighted average font size of terms in documents and links, which also affects rankings.
  • The PageRank of the home page is taken into account for all pages: Each document has the PageRank of its site’s home page associated with it. It’s likely that this value and siteAuthority are used as proxies for new pages until they have their own PageRank calculated.
  • Google may specifically target small sites in ranking: Google has a special flag indicating that a site is a “small personal site.” The documents give no definition for such sites, but a flag like this would make it easy for Google to boost or demote them.
  • Indexing level affects the value of links: A metric called sourceType shows the relationship between where a page is indexed and its value. For reference, Google’s index is divided into tiers, where the most important, regularly updated and frequently accessed content is stored in flash memory. Less important content is stored on solid-state drives, and irregularly updated content is stored on regular hard drives. That is, the higher the tier, the more valuable the link. Pages that are considered “fresh” are also considered to be of higher quality. This partly explains why links from high-ranking pages and news pages lead to better rankings.
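The date attributes above are easier to picture with a small example. The sketch below derives a syntacticDate-style value from a URL and checks it against a byline date. The regular expression, the function names and the two-day tolerance are assumptions chosen for illustration; the leak names the attributes but does not describe how Google extracts or compares them.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: /YYYY/MM/DD/ or /YYYY-MM-DD inside a URL path.
_URL_DATE = re.compile(r"/(\d{4})[/-](\d{2})[/-](\d{2})(?:/|$|[-_.])")


def syntactic_date(url: str) -> Optional[date]:
    """Return the date embedded in the URL, or None if there is no valid one."""
    match = _URL_DATE.search(url)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 or day 40
        return None


def dates_agree(byline: Optional[date], syntactic: Optional[date],
                max_gap_days: int = 2) -> bool:
    """True when both dates exist and are close, or when one of them is missing."""
    if byline is None or syntactic is None:
        return True  # nothing to contradict
    return abs((byline - syntactic).days) <= max_gap_days


if __name__ == "__main__":
    url_date = syntactic_date("https://example.com/blog/2024/05/28/leak")
    print(url_date, dates_agree(date(2024, 5, 28), url_date))
```

The practical takeaway for site owners is the same as in the text: keep the date in the URL, the visible byline and the content itself consistent with each other.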

Demotion in Google ranking algorithms

Demotion is the decline in the ranking of web pages in search results due to the presence of certain factors that negatively affect their quality or relevance. The data leak revealed that Google uses many different algorithmic mechanisms to demote pages. Here are some of them:

  • Anchor Mismatch – when a link’s anchor does not match the target site it links to, the link is demoted in the calculations.
  • SERP Demotion – a signal indicating potential user dissatisfaction with a page, likely measured by clicks.
  • Nav Demotion – presumably applied to pages that exhibit poor navigation or a poor user experience.
  • Exact Match Domains Demotion – a special feature to demote exact match domains (e.g. buy-cheap-shoes.com) if they do not provide quality content.
  • Product Review Demotion – the documents give no specific details, but it is likely related to the 2023 product reviews update.
  • Location Demotion – there is an indication that “global” pages may be demoted in the search results. This suggests that Google is trying to associate pages with a location and rank them accordingly.
  • Porn Demotion – demotion for pornographic content.
  • Other link demotions – further demotions triggered by link signals.
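One simple way to picture how several demotions could stack is as multipliers applied to a page’s base score. The sketch below is purely illustrative. The flag names are derived from the list above, while the multiplier values and the multiplicative model are assumptions, because the leak names the demotions but says nothing about their strength or how they are combined.

```python
from typing import Dict, Iterable

# Hypothetical multipliers; the leaked documents do not disclose any weights.
DEMOTION_FACTORS: Dict[str, float] = {
    "anchor_mismatch": 0.9,
    "serp_demotion": 0.8,
    "nav_demotion": 0.85,
    "exact_match_domain": 0.7,
}


def apply_demotions(base_score: float, flags: Iterable[str]) -> float:
    """Multiply the base score by the factor of every distinct demotion flag."""
    score = base_score
    for flag in set(flags):  # a repeated flag is applied only once
        score *= DEMOTION_FACTORS.get(flag, 1.0)  # unknown flags change nothing
    return score


if __name__ == "__main__":
    print(apply_demotions(1.0, ["serp_demotion", "exact_match_domain"]))
```

Under this assumed model, a page hit by two demotions at once loses more than a page hit by either one alone, which matches the general intuition that quality problems compound.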

Architecture of the ranking system

Below are the main systems at Google, listed by their internal names, with a short description of what each one does and where it sits in the pipeline.

Crawling

  • Trawler is the web crawling system. It has a crawl queue, maintains crawl rates and understands how often pages change.

Indexing

  • Alexandria is the main indexing system.
  • SegIndexer is a system that places documents into tiers in an index.
  • TeraGoogle is a secondary indexing system for documents that are stored on disk for a long time.

Rendering

  • HtmlrenderWebkitHeadless is a rendering system for JavaScript pages.

Processing

  • LinkExtractor – Extracts links from pages.
  • WebMirror – Canonicalization and duplication management system.

Ranking

  • Mustang is the main system for scoring, ranking and serving results.
  • Ascorer is the main ranking algorithm, which ranks pages before any re-ranking adjustments.
  • NavBoost is a re-ranking system based on click logs and user behavior.
  • FreshnessTwiddler is a system for ranking documents based on their freshness.
  • WebChooserScorer – defines the feature names used when scoring snippets.

Serving

  • Google Web Server (GWS) is the server that Google’s frontend interacts with. It receives the data to display to the user.
  • SuperRoot is the brain of Google Search. It sends messages to Google’s servers and manages the post-processing system that re-ranks and presents results.
  • SnippetBrain is a system that generates snippets for search results.
  • Glue – A system for combining universal results based on user behavior.
  • Cookbook is a system for generating signals.

What are Twiddlers?

Twiddlers are re-ranking functions that run after the main ranking algorithm, Ascorer, has been executed. Twiddlers can adjust a document’s information retrieval score or change its ranking, as well as impose category constraints on the results.

Presumably, any of the functions with the Boost suffix work using the Twiddler framework. Here are some Boosts described in the documentation:

  • NavBoost
  • QualityBoost
  • RealTimeBoost
  • WebImageBoost 
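The re-ranking idea can be sketched in a few lines of code. In the example below, each twiddler is a function that takes the already scored results and returns an adjusted list: one scales scores by a click factor (in the spirit of NavBoost), the other enforces a category constraint. All names, the data layout and the numbers are hypothetical and only illustrate the concept; this is not the interface of Google’s Twiddler framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Result:
    url: str
    score: float   # stand-in for the score produced by the primary ranker
    category: str


Twiddler = Callable[[List[Result]], List[Result]]


def boost_by_clicks(click_factor: Dict[str, float]) -> Twiddler:
    """Build a twiddler that scales each score by a per-URL click factor."""
    def twiddle(results: List[Result]) -> List[Result]:
        return [Result(r.url, r.score * click_factor.get(r.url, 1.0), r.category)
                for r in results]
    return twiddle


def limit_per_category(max_per_category: int) -> Twiddler:
    """Build a twiddler that keeps only the top N results of each category."""
    def twiddle(results: List[Result]) -> List[Result]:
        kept: List[Result] = []
        seen: Dict[str, int] = {}
        for r in sorted(results, key=lambda r: r.score, reverse=True):
            seen[r.category] = seen.get(r.category, 0) + 1
            if seen[r.category] <= max_per_category:
                kept.append(r)
        return kept
    return twiddle


def rerank(results: List[Result], twiddlers: List[Twiddler]) -> List[Result]:
    """Apply the twiddlers in order, then sort by the final score."""
    for twiddle in twiddlers:
        results = twiddle(results)
    return sorted(results, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    initial = [Result("a", 1.0, "news"), Result("b", 0.9, "news"),
               Result("c", 0.8, "blog")]
    final = rerank(initial, [boost_by_clicks({"c": 1.5}), limit_per_category(1)])
    print([r.url for r in final])
```

The order of the twiddlers matters in this sketch: the click boost runs first, so the category limit is applied to the adjusted scores rather than to the original ones.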

The factors outlined above show which signals Google collects and can use to rank sites, although the documents do not reveal how each one is weighted. It is worth noting that this information may be updated with new data, since the documentation has only recently appeared on the Internet.