Earlier this year, Google’s Gary Illyes stated that 30x redirects (301, 302, etc.) do not result in a loss or dilution of PageRank. As you can imagine, many SEOs have greeted this claim with skepticism.
In a recent Webmaster Central Office Hours Hangout, I asked Google’s John Mueller whether perhaps the skepticism was because when SEOs experience loss of visibility during migrations, they might not realize that all signals impacting rankings haven’t passed to the new pages yet, so they assume that PageRank was lost.
Yeah, I mean, any time you do a bigger change on your website — if you redirect a lot of URLs, if you go from one domain to another, if you change your site structure — then all of that does take time for things to settle down. So, we can follow that pretty quickly, we can definitely forward the signals there, but that doesn’t mean it will happen from one day to the next.
During a migration, Googlebot needs to collect huge amounts of data, which must be collated in logs, mapped and updated internally, and rankings can fluctuate throughout this process. Beyond that, the timing of Googlebot’s visits plays a fundamental part in ranking fluctuation during a migration, and that ties into “URL scheduling,” a key component of crawl budget.
URL scheduling is essentially “What does Googlebot want to visit (URLs), and how often?” Host load, on the other hand, is based around “What can Googlebot visit from an IP/host, based on capacity and server resources?” Together, these make up “crawl budget” for an IP or host. Both of these still matter in migrations.
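To make that relationship concrete, here is a minimal sketch of my own (not Google’s actual system) of how a scheduler might pick URLs by priority while staying inside a host-load cap. The function name, priorities and capacity figure are all hypothetical:

```python
# Hypothetical sketch: crawl budget as the intersection of URL scheduling
# (what the crawler wants to fetch, ordered by priority) and host load
# (how many fetches the host can comfortably serve in a day).
from heapq import heappush, heappop

def plan_crawl(urls_with_priority, host_capacity_per_day):
    """Return the URLs that fit into one day's crawl budget for a host.

    urls_with_priority: list of (url, priority), higher priority = more wanted
    host_capacity_per_day: max fetches the server/IP can absorb per day
    """
    heap = []
    for url, priority in urls_with_priority:
        heappush(heap, (-priority, url))  # max-heap via negated priority

    scheduled = []
    while heap and len(scheduled) < host_capacity_per_day:
        _, url = heappop(heap)
        scheduled.append(url)
    return scheduled

# Example: five URLs competing for a budget of three fetches today.
print(plan_crawl([("/", 10), ("/sale", 8), ("/about", 2),
                  ("/blog/post-1", 5), ("/old-page", 1)], 3))
```

The point of the sketch is simply that the two halves constrain each other: demand (scheduling) decides the order, supply (host load) decides the cut-off.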
On a 10-page brochure site, you likely won’t see any loss of visibility during a site migration. But what if your site is, for example, an e-commerce or news site with tens of thousands, hundreds of thousands, or more URLs? Or what if you’re merging several sites into one on the same IP host?
For everything to be fully passed, it all has to start, at a bare minimum, with at least one complete site crawl by Googlebot. It may even take a few complete site crawls, as Googlebot understands more about URLs — and how everything fits and links together internally in a site — with each subsequent visit to a newly migrated site.
On larger sites, that may not happen as soon as you’d hoped.
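As a rough illustration of why, here is a back-of-the-envelope sketch (my own, with made-up numbers) that estimates how long a complete crawl could take based on the Googlebot fetch rate you observe in your own server logs:

```python
# Hypothetical helper: if server logs show Googlebot fetching N unique URLs
# per day on average, roughly how many days might one full pass of the site
# take? All figures below are invented for illustration.
def days_for_full_crawl(total_urls, avg_googlebot_fetches_per_day, passes=1):
    if avg_googlebot_fetches_per_day <= 0:
        raise ValueError("Need a positive observed crawl rate")
    return (total_urls * passes) / avg_googlebot_fetches_per_day

# Example: 250,000 URLs, ~4,000 Googlebot fetches/day observed in logs,
# and an assumption of two full passes before signals settle.
print(days_for_full_crawl(250_000, 4_000, passes=2))  # -> 125.0 days
```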
You’ve likely spidered your website with your favorite crawling tools prior to migration “go live,” and you’re confident that there are no issues. But then rankings and overall visibility drop. What could have gone wrong?
Many things can go wrong with a migration, but consider this: maybe nothing has gone wrong.
Maybe some of those signals that have not been passed are just “late and very late signals in transit,” rather than “lost signals.”
Some signals could even take months to pass. Why? Because Googlebot does not crawl large websites the way crawling tools do, and its behavior is well-nigh impossible for tools to emulate.
Your migration schedule is not Googlebot’s schedule
You have a migration schedule. It doesn’t follow that Googlebot will fall into step. Googlebots have their own work schedules, too.
Crawl frequency of URLs is on a per-URL basis. Google’s John Mueller confirmed this, saying:
Some URLs are crawled every few minutes, others just every couple months, and many somewhere in between.
While Google states that there are many factors affecting the crawl frequency of URLs, in a recent webinar, Gary Illyes referred to “scheduling” and “buckets” of URLs prepared beforehand for Googlebot to visit. So we know that scheduling exists. It’s also covered in lots of Google patents on crawl efficiency.
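We can’t see Google’s buckets, but we can approximate our own view of them from server logs. Here is a minimal sketch that groups URLs by how often Googlebot fetched them over a 30-day window; the thresholds are arbitrary assumptions of mine, not anything Google has published:

```python
# Hypothetical illustration of the "buckets" idea: group URLs by how many
# times Googlebot has fetched them in a fixed window, using hits pulled
# from your own access logs.
from collections import Counter

def bucket_urls(googlebot_hits, window_days=30):
    """googlebot_hits: list of URLs, one entry per Googlebot fetch in the window."""
    counts = Counter(googlebot_hits)
    buckets = {"frequent": [], "occasional": [], "rare": []}
    for url, hits in counts.items():
        if hits >= window_days:   # roughly daily or more often
            buckets["frequent"].append(url)
        elif hits >= 4:           # roughly weekly
            buckets["occasional"].append(url)
        else:
            buckets["rare"].append(url)
    return buckets

# Toy data standing in for a month of parsed Googlebot log entries.
hits = ["/"] * 45 + ["/news"] * 32 + ["/category/shoes"] * 6 + ["/about"] * 1
print(bucket_urls(hits))
```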
It is worth noting that crawl frequency is not just based on PageRank, either. Both Google’s Andrey Lipattsev and Gary Illyes have remarked in separate webinars recently that PageRank is not the sole driver for crawling or ranking, with Lipattsev saying, “This (PageRank) has become just one thing among very many things.”
‘Importance’ is important
I’m not going to apologize for my overuse of the word “important,” because it’s been confirmed that crawl scheduling is mostly driven by the “importance” of URLs.
In fact, Gary Illyes states just that in a recent Virtual Keynote recorded interview with Eric Enge, and he notes that we should not keep focusing on PageRank as the sole driver for crawling or ranking.
Many of the Google Patents touch on Page Importance and mention that this “may include PageRank,” but it is clear that PageRank is only a part of it. So Page Importance and PageRank are not the same, but one (Importance) may include the other (PageRank).
What we do know is that important pages are crawled more often.
There is the kind of relationship where … when we think something is important we tend to crawl it more frequently.
So, just what is ‘page importance’?
Of course, Google is not going to tell us all of the contributors to Page Importance, but a number of Google patents around crawl efficiency and managing URLs touch on the subject.
These are a few of my findings from patents, webinars, Google Webmaster Hangouts, old interviews, blog posts and Google Search Console help. Just to be clear, there are undoubtedly more factors than this, and only some of the factors listed below are confirmed by Google.
There are other clues about page importance, too:
Recently, Gary Illyes mentioned in a Virtual Keynote webinar with Eric Enge that if a page was included in an XML sitemap, it would likely be considered more important than others not included.
We know that hreflang and canonicalization are used as signals (in page robots management).
As mentioned above, PageRank “may be included in Page Importance” (and presumably internal PageRank along with it).
In Google’s Search Console Help Center, internal backlinks are stated as “a signal to search engines about the relative importance of that page.”
Matt Cutts, Google’s former Head of Webspam, spoke of search engines understanding the importance of pages according to their position in URL parameter levels. Illyes also uses the example of an “about us” page and a “home page which changes frequently” as having different levels of importance to users who want to see fresh content. The “about us” page does not change much.
File types and page types are also mentioned in patents, and we know that, for instance, image types are crawled less frequently than other URLs because they don’t change that often.
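To show how the clues above might combine, here is a toy scoring sketch. Every weight and signal choice here is my own assumption, not a published Google formula; the only point it illustrates is that PageRank is one input among many:

```python
# Purely illustrative "page importance" score built from the clues listed
# above. Weights are invented; this is not Google's formula.
def toy_importance(pagerank, in_sitemap, internal_links_in,
                   url_depth, changes_per_month):
    score = 0.0
    score += 3.0 * pagerank                   # PageRank: one factor, not the whole story
    score += 1.0 if in_sitemap else 0.0       # listed in an XML sitemap
    score += 0.1 * internal_links_in          # internal backlinks pointing at the page
    score += max(0.0, 2.0 - 0.5 * url_depth)  # shallower URLs score higher
    score += 0.2 * changes_per_month          # freshness / change frequency
    return round(score, 2)

# A frequently updated category page vs. a deep, static "about us" page.
print(toy_importance(0.6, True, 120, 1, 20))  # -> 20.3
print(toy_importance(0.3, True, 5, 3, 0))     # -> 2.9
```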
Change management/freshness is important, too
One thing we do know is that change frequency impacts crawl frequency.
URLs change all the time on the web. For search engines, keeping the probability of “embarrassment” (returning stale content in search results, sometimes called the “embarrassment metric”) below acceptable thresholds is key, and it must be managed efficiently.
Most of the academic papers on web crawling efficiency and information retrieval, conference proceedings and even patents attribute the term “search engine embarrassment” to Wolf et al.
To combat “embarrassment” (returning stale content in results), scheduling systems are built to prioritize the crawling of important pages, particularly important pages that change frequently, over less important pages, such as those with insignificant changes or low authority.
These key pages have the highest probability of being seen by search engine users versus pages which don’t get found often in search engine results pages.
In general, we try to do our crawling based on what we think this page might be changing or how often it might be changing. So, if we think that something stays the same for a longer period of time, we might not crawl it for a couple of months.
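To make that prioritization concrete, here is a simplified sketch of the idea; it is my own reading of the Wolf et al. “embarrassment” concept, not Google’s code, and the scores are invented. The product of importance and change probability stands in for the expected cost of serving a page stale:

```python
# Simplified sketch: pages that are both important and likely to have changed
# get crawled first, because serving them stale is most likely to
# "embarrass" the search engine.
def crawl_order(pages):
    """pages: list of (url, importance, probability_page_has_changed)."""
    return sorted(pages,
                  key=lambda p: p[1] * p[2],  # expected cost of staleness
                  reverse=True)

pages = [
    ("/home",        0.9, 0.80),  # important and changes often
    ("/breaking",    0.7, 0.90),  # very fresh content
    ("/about-us",    0.4, 0.05),  # important-ish but rarely changes
    ("/old-archive", 0.1, 0.01),
]
for url, _, _ in crawl_order(pages):
    print(url)
```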