Google crawling schedule 2008

I’ve seen many people asking about this, and even more people generally mystified by the way Google works. Most people don’t understand how and when Google crawls, and tend to think it’s a secret.

It’s not really that big a secret, but it is a bit tricky to predict when Google will come. Of course, if you deal with this stuff as often as I do, you start to get used to the schedules. What is a bit confusing, though, is how this crawl schedule changes like hell depending on a million factors Google finds important.

We’ll start off with a bit of information from the Google Webmaster Center. As always, they give us the usual “follow the guidelines” and “it depends on many things” crap, but a few factors come out as obvious in the process:

  • PageRank
  • links to a page
  • crawling constraints (such as the number of parameters in a URL)

Ok, so we know what helps us. PR is the most important, then links, then the ability of your site to be crawled (the number of parameters refers to the fact that Google doesn’t like URLs with many PHP query parameters; use mod_rewrite to clean them up). So we have a starting point. But as always Google is cryptic and doesn’t really help… So we move on.
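If you want to see which of your own URLs carry a lot of parameters, here is a minimal sketch in Python. The function names and the cutoff of two parameters are my own assumptions for illustration; Google doesn’t publish an exact limit.

```python
from urllib.parse import urlsplit, parse_qsl


def count_query_params(url):
    """Return the number of query-string parameters in a URL."""
    query = urlsplit(url).query
    # keep_blank_values=True so that "?a=&b=1" still counts as two parameters
    return len(parse_qsl(query, keep_blank_values=True))


def flag_parameter_heavy(urls, max_params=2):
    """Return the URLs that have more than max_params query parameters.

    max_params=2 is an arbitrary cutoff chosen for illustration,
    not a number published by Google.
    """
    return [u for u in urls if count_query_params(u) > max_params]
```

Feed it the URLs from your sitemap, and whatever comes back is a candidate for a mod_rewrite rule.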

As early as 2002 people were asking about the Google crawl schedule, and some were guessing at it. However, results were strange, and back then high PR sites were a lot more. Many saw full Google crawls at around 1st June, while another had one start in May and still running in June. An interesting piece of info was that for large sites Googlebot came in about every three minutes, indexing about 2-10 pages a second, which I feel was a bit of a slurp but was meant to keep some of the strain off the webserver. Their discussion goes off-topic from then on, but for the purists, go read…

Our next source is a “For Dummies” book excerpt, in which we get a bunch of terms related to the crawl. In doing research for this I was really surprised to see there’s very little info to be found. Then again, it’s not such a hot topic for SEO, but it is somewhat important. They say the deep crawl occurs about every month and that fresh crawls occur randomly. Also, they consider the index static between deep crawls, apart from the strange updates given by fresh crawls, which they call everflux. My opinion later ;)

There’s not much else on the web, except a mention of the Google Dance. I find all these names so amusing, since they don’t really explain the phenomenon and there’s no dancing involved. I guess they got bored of using “crawl” in everything. It’s basically the deep crawl, and we get the info that it usually begins at the end of the month, lasts 3-5 days, and usually updates PR. Also, for the people out there who know how to monitor server logs, the deep crawl uses an IP range of 216.239.46.x whereas the fresh crawl uses the 64.68.82.x range. At that link above you can also find a so-called Google Dance Tool, which could be useful to see what pages Google finds important and crawls, but you could just use Webmaster Tools for that.
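For those who want to check their own logs against the two IP ranges mentioned above, here is a rough sketch in Python. It assumes an Apache-style access log where the client IP is the first whitespace-separated field on each line; the function names are mine, and the ranges are simply the ones reported above, which Google may well have changed since.

```python
from collections import Counter

# IP prefixes as reported above; treat them as a snapshot, not a guarantee.
DEEP_CRAWL_PREFIX = "216.239.46."
FRESH_CRAWL_PREFIX = "64.68.82."


def classify_ip(ip):
    """Label an IP as 'deep', 'fresh' or 'other' based on its prefix."""
    if ip.startswith(DEEP_CRAWL_PREFIX):
        return "deep"
    if ip.startswith(FRESH_CRAWL_PREFIX):
        return "fresh"
    return "other"


def count_crawl_hits(log_lines):
    """Count hits per crawl type, assuming the IP is the first field of each line."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[classify_ip(fields[0])] += 1
    return counts
```

Pass it an open log file, for example `count_crawl_hits(open("access.log"))` with a hypothetical `access.log`, and compare the "deep" and "fresh" counts day by day to see when each crawl shows up.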

Now for my take on the whole thing. I feel that there are not two, but three kinds of crawls.

Firstly, there’s an almost immediate crawl, triggered by pings and links, done by basically whichever spider Google uses for Google Alerts. That happens at once, and crawls the title and the post, but does not index it. It only notices it’s there. Then, in a few days to a week, the post becomes indexed completely, and starts showing up in Google results (in quite a high position at first, then gradually lower if no further activity on that post or no search activity for that keyword is detected).

The next kind of crawl is a longer-term crawl, which usually includes the homepage, and is done every week, or two weeks, or even a month for less active sites. This updates the cache on your active pages, but doesn’t touch the others.

And the last kind of crawl happens about three or four times a year, and reindexes everything. This usually happens in February or March, June, and November, or in some cases any other month. Google tends to vary this stuff, presumably due to factors on and off the site. So be prepared for a couple of crawls this year, at the beginning of June and in mid-November or so, and see if it happens as I’ve predicted.

One more thing: an important factor in crawling is the kind of server you are hosted on. Use GoDaddy or any other established host rather than hosting on your old machine, so Google can download the data properly. The crawl intensity depends a lot on that. Also, Google does not have the same schedule as Yahoo, for example. Yahoo performed a deep crawl of my site a few days ago, whereas Google didn’t.
