So I just submitted my site to Ping-O-Matic. Very interesting. The blogosphere is really pushing the envelope in keeping search engines notified when blogs are updated. I wish websites would do the same thing... and so does Google:
From their blog: Google Site Feed Rumor. Google has been innovating in the ways websites can inform search engines of changes -- with Sitemaps, for example. This lets them improve their recall -- to make sure they get the pages that webmasters want them to get. What's next? Submitting your whole site feed to Google. That would mark a very big change in SE philosophy.
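For readers who haven't seen one, a Sitemaps file is just XML listing each URL along with hints about freshness. Here's a minimal sketch -- the URLs, dates, and change frequencies below are invented for illustration, and the 0.84 schema namespace is the one Google's documentation uses:

```python
# Sketch of building a Google Sitemaps-style XML file. The pages listed
# here are made-up examples, not a real site.
from xml.sax.saxutils import escape

pages = [
    ("http://www.example.com/", "2005-06-01", "daily"),
    ("http://www.example.com/about.html", "2005-01-15", "monthly"),
]

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">']
for loc, lastmod, changefreq in pages:
    lines.append("  <url>")
    lines.append("    <loc>%s</loc>" % escape(loc))
    lines.append("    <lastmod>%s</lastmod>" % lastmod)
    lines.append("    <changefreq>%s</changefreq>" % changefreq)
    lines.append("  </url>")
lines.append("</urlset>")

sitemap = "\n".join(lines)
```

The interesting part is the lastmod / changefreq hints -- that's exactly the freshness information crawlers otherwise have to guess at.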
Traditionally, search engines are pull organisms. They send out crawlers to vacuum content off of the web. However, they are far from perfect at making sure the content they have is fresh. Perfect freshness is impossible, because the second a SE crawler downloads a page it could already be out of date, especially if the page is database driven. Think of highly dynamic news / forum sites, for example.
It's very interesting that blogs have been pioneering a push-centric architecture with pinging. From various sources it looks like there are at least 30-40 pinging services. Some, like FeedBurner, even offer more advanced services, automatically notifying blog search engines when a new post is made.
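Under the hood, most of these services accept the same tiny XML-RPC call, weblogUpdates.ping. A sketch of what blog software sends -- the blog name and URLs below are placeholders, so don't actually ping with them:

```python
# Sketch of the weblogUpdates.ping XML-RPC call that blog software
# fires off when a new post is published. Names and URLs are placeholders.
import xmlrpc.client

def ping(service_url, blog_name, blog_url):
    server = xmlrpc.client.ServerProxy(service_url)
    # The service replies with a struct, typically containing
    # 'flerror' (False on success) and a human-readable 'message'.
    return server.weblogUpdates.ping(blog_name, blog_url)

# Example call (commented out to avoid a live network request):
# result = ping("http://rpc.pingomatic.com/", "My Blog",
#               "http://www.example.com/blog")
```

Two strings and you're done -- it's hard to imagine a lower barrier to telling a search engine "I changed."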
On the traditional web, search engines go to great lengths to estimate the patterns with which pages update. For example:
The evolution of the web and implications for an incremental web crawler. The problem is, this isn't perfect -- it requires a long history of observations to make even semi-accurate predictions. Search engines waste a lot of bandwidth crawling sites more often than they change, because it hurts too badly when a user clicks a result and doesn't get what they expect. Along this vein...
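To give a feel for what "estimating update patterns" means in practice: the crawler literature typically models page changes as a Poisson process and estimates a change rate from the visit history, correcting for changes the crawler never got to observe between visits. A rough sketch of one such estimator (my paraphrase of the idea, with illustrative numbers):

```python
# Sketch of estimating a page's change rate (changes per day) from a
# history of equally spaced crawler visits. The correction terms (+0.5)
# account for changes the crawler may have missed between visits.
import math

def estimated_change_rate(visits, changes_seen, interval_days):
    """Estimate changes/day given `visits` equally spaced visits,
    of which `changes_seen` found the page modified."""
    n, x = visits, changes_seen
    if x == n:  # changed on every single visit: rate is effectively unbounded
        return float("inf")
    return -math.log((n - x + 0.5) / (n + 0.5)) / interval_days

# A page seen changed on 3 of 10 weekly visits:
rate = estimated_change_rate(10, 3, 7.0)  # a small fraction of a change/day
```

Note how much history it takes: ten visits over ten weeks just to get one noisy rate estimate for one page -- which is exactly why a push from the webmaster is so appealing.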
There is some interesting research going on around different approaches to web crawling, like
User Centric Web Crawling. Chris Olston, Sandeep Pandey, and others at CMU and Stanford are doing some very interesting research in this area. Very few pages ever appear in search results, so it makes sense for search engines to pay more attention to the pages users actually see -- especially pages with relatively unpredictable update schedules, or sites so new that there isn't enough data yet to accurately predict when they will next be updated.
I think it is safe to say that we (both webmasters and SEs) are all looking forward to this innovation by Google. Hopefully, Google's market power will push (no pun intended!) webmasters to submit site feeds to search engines for indexing. It's about time! I am very curious how they are going to cope with all of the potential problems -- cloaking, spam, storage and bandwidth, etc... Perhaps an extension of the system they use for partners / Froogle?
Take a look at these guidelines for submitting a Froogle product feed. Very semantic webby. GlobalSpec has similar technology for partners and suppliers to submit data into our product search. I'll say it again -- getting web data in a structured format direct from webmasters / companies is a dream come true for search engines.
However, it can be more trouble than it's worth. Explaining the ins and outs of tab-delimited formats, valid XML feeds, etc. -- these formats are a pain to get right. Getting valid feeds reliably can prove more problematic than just crawling the website! The bottom line is, this is a nice step, but I don't think we will see much benefit for some time. Perhaps when future web applications have feed generation built in.
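To make the pain concrete, here's roughly what generating a tab-delimited product feed looks like. The field names below are illustrative, not Froogle's exact required columns -- the guidelines linked above define the real ones:

```python
# Sketch of generating a tab-delimited product feed of the kind product
# search engines accept. Field names and products are made-up examples.
import csv
import io

FIELDS = ["product_url", "name", "description", "price"]

products = [
    {"product_url": "http://www.example.com/widget",
     "name": "Widget",
     "description": "A fine widget",
     "price": "9.99"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(products)
feed = buf.getvalue()
```

Trivial when a library does the escaping for you -- but most webmasters hand-roll these files in Excel or a text editor, and that's where the stray tabs, encodings, and malformed rows come from.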
Site feeds would solve a whole slew of problems for search engines, but create some new ones. They would hopefully appease angry webmasters sick of SE bots that download pages too often and eat large amounts of bandwidth. This is especially important as search engines proliferate.
This probably won't help with crawlers that aren't webmaster friendly and don't obey the robots.txt exclusion standard correctly. GlobalSpec's robot, Ocelli, is fully compliant, but there is still some difficulty getting webmasters to use and understand the standard. Having a feed would save us (and Google) time if we could stop worrying about crawling-related complaints from webmasters!
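For what it's worth, honoring the exclusion standard is not hard -- Python's standard library even ships a parser for it. A sketch of the check a compliant crawler makes before every fetch (the rules and user-agent below are made up):

```python
# How a compliant crawler consults the robots.txt exclusion standard
# before fetching a URL. The rules and user-agent here are invented.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# In practice the crawler fetches http://site/robots.txt; here we parse
# an inline example instead of making a network request.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

allowed = rp.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = not rp.can_fetch("ExampleBot", "http://www.example.com/private/x")
```

If a few lines of standard-library code suffice, non-compliance is a choice, not a technical hurdle -- which is the real reason site feeds won't fix the rude-crawler problem.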
Here's looking forward to the next generation of pinging for websites -- whole site feeds. I can't wait to see the implementation details Google has in store. It would be nice to get the search engines together and form standards on things like this -- does anyone know of any attempts?