Wednesday, October 26

Blog pinging and next generation site feeds

So I just submitted my site to Ping-O-Matic. Very interesting. The blogosphere is really pushing the envelope in keeping search engines notified when blogs are updated. I wish websites would do the same thing... and so does Google:

From their blog: Google Site Feed Rumor. Google has been innovating in the way websites inform SE of changes -- with Sitemaps, for example. This allows them to improve their recall -- to make sure they get the pages that webmasters want them to get. What's next -- submitting your site feed to Google? This marks a very big change in SE philosophy.
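To make the idea concrete, here is a rough Python sketch of generating the kind of sitemap file Google accepts. The URLs and change frequencies are made up, and the namespace shown is the published sitemap schema -- the exact schema version Google expects may differ:

    import xml.etree.ElementTree as ET

    # Namespace of the published sitemap schema (an assumption here;
    # check Google's documentation for the version they currently accept).
    NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    def write_sitemap(pages, path="sitemap.xml"):
        """Write a minimal sitemap: (loc, lastmod, changefreq) tuples."""
        urlset = ET.Element("urlset", xmlns=NS)
        for loc, lastmod, changefreq in pages:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = loc
            ET.SubElement(url, "lastmod").text = lastmod
            ET.SubElement(url, "changefreq").text = changefreq
        ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

    write_sitemap([
        ("http://example.com/", "2005-10-26", "daily"),
        ("http://example.com/about", "2005-09-01", "monthly"),
    ])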

Traditionally, search engines are pull organisms. They send out crawlers to vacuum content off of the web. However, they are far from perfect at making sure the content they have is fresh. Perfect freshness is impossible, because the second an SE crawler downloads a page it could be out of date, especially if the page is database driven. Think of highly dynamic news / forum sites, for example.

It's very interesting that blogs have been pioneering a push-centric architecture with pinging. From various sources it looks like there are at least 30-40 pinging services. Some, like FeedBurner, even offer more advanced services, automatically notifying blog search engines when a new post is made.
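For the curious, the pinging these services rely on is just a tiny XML-RPC call (the weblogUpdates.ping method). A minimal Python sketch, assuming Ping-O-Matic's public endpoint and a made-up blog:

    import xmlrpc.client

    # Ping-O-Matic relays one ping to many blog search engines.
    # The endpoint and blog details below are illustrative assumptions.
    server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")
    result = server.weblogUpdates.ping(
        "My Example Blog",              # weblog name
        "http://example.com/blog/",     # weblog home page URL
    )
    print(result)   # typically a dict like {'flerror': False, 'message': '...'}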

On the traditional web, search engines go to great lengths to estimate the patterns with which pages update. For example:
The evolution of the web and implications for an incremental web crawler. The problem is, this isn't perfect -- it requires a long history of observations to make even semi-accurate predictions. Search engines waste a lot of bandwidth crawling sites more often than they change, because it hurts too badly when a user clicks a result and doesn't get what they expect.
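Just to illustrate the flavor of it, here is a toy Python sketch of the bookkeeping a crawl scheduler might do per page: treat changes as a Poisson process and estimate a change rate from how often a revisit found the page modified. The class and numbers are my own illustration, not any engine's actual algorithm:

    import math

    class PageHistory:
        """Toy per-page change history for a recrawl scheduler."""

        def __init__(self):
            self.visits = 0
            self.changes = 0

        def record_visit(self, changed):
            self.visits += 1
            self.changes += int(changed)

        def change_rate(self, revisit_interval_days):
            # Poisson-style estimate of changes per day; it is biased when a
            # page changes faster than we revisit, which is exactly the hard case.
            if self.visits == 0:
                return 0.0
            frac_changed = self.changes / self.visits
            if frac_changed >= 1.0:
                frac_changed = (self.visits - 0.5) / self.visits  # avoid log(0)
            return -math.log(1.0 - frac_changed) / revisit_interval_days

    history = PageHistory()
    for changed in [True, False, False, True, False]:
        history.record_visit(changed)
    print(history.change_rate(revisit_interval_days=7))   # estimated changes/day

Along this vein...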

There is some interesting research going on around different approaches to web crawling, like
User-Centric Web Crawling. Chris Olston, Sandeep, and others at CMU and Stanford are doing some very interesting research in this area. Very few pages ever appear in search results, so it makes sense for search engines to pay more attention to the pages users actually see, especially pages with relatively unpredictable update schedules, or sites that are so new there isn't yet enough data to accurately predict when they will next be updated.
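A back-of-the-envelope sketch of that intuition: weight a page's recrawl priority by how often users actually see it in results and how likely it is to have changed since the last crawl. The staleness model and numbers below are my own assumptions, not the paper's algorithm:

    import math
    import time

    def recrawl_priority(views_per_day, change_rate_per_day, last_crawl_ts, now=None):
        now = now if now is not None else time.time()
        days_since_crawl = (now - last_crawl_ts) / 86400.0
        # Probability the page changed at least once since the last crawl,
        # assuming Poisson changes at change_rate_per_day.
        p_stale = 1.0 - math.exp(-change_rate_per_day * days_since_crawl)
        return views_per_day * p_stale

    # A popular but slow-changing page vs. an obscure but volatile one:
    three_days_ago = time.time() - 3 * 86400
    print(recrawl_priority(5000, 0.01, three_days_ago))
    print(recrawl_priority(20, 2.0, three_days_ago))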

I think it is safe to say that we (both webmasters and SEs) are all looking forward to this innovation by Google. Hopefully, Google's market power will push (no pun intended!) webmasters to submit site feeds to search engines for indexing. It's about time! I am very curious how they are going to cope with all of the potential problems -- cloaking, spam, storage and bandwidth, etc... Perhaps an extension of the system they use for partners / Froogle?
Take a look at these guidelines for submitting a Froogle product feed. Very semantic webby. GlobalSpec has similar technology for partners and suppliers to submit data into our product search. I'll say it again -- getting web data in a structured format directly from webmasters / companies is a dream come true for search engines.
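For flavor, here is a small Python sketch of writing a tab-delimited product feed along the lines of those guidelines. The exact column names Froogle requires have changed over time, so the header here is illustrative rather than authoritative:

    import csv

    products = [
        {"product_url": "http://example.com/p/1", "name": "Widget",
         "description": "A sample widget", "price": "9.99", "image_url": ""},
    ]

    with open("product_feed.txt", "w", newline="", encoding="utf-8") as feed:
        writer = csv.DictWriter(
            feed,
            fieldnames=["product_url", "name", "description", "price", "image_url"],
            delimiter="\t",
        )
        writer.writeheader()
        writer.writerows(products)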

However, it can be more trouble than it's worth. Explaining the ins and outs of tab-delimited formats, valid XML feeds, etc. is a chore -- these formats are a pain to get right. Getting valid feeds reliably can prove more problematic than just crawling the website! The bottom line is, this is a nice step, but I don't think we will see much benefit for some time. Perhaps when future web applications have feed generation built in.
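A surprising amount of the breakage is feeds that aren't even well-formed XML. A quick standard-library sanity check like this would catch a lot of it before a search engine ever sees the file (the filename is just a placeholder):

    import xml.etree.ElementTree as ET

    def is_well_formed(path):
        """Return True if the file parses as XML, printing the error otherwise."""
        try:
            ET.parse(path)
            return True
        except ET.ParseError as err:
            print(f"{path}: {err}")
            return False

    print(is_well_formed("sitefeed.xml"))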

Site feeds would solve a whole slew of problems for search engines, but create some new ones. They would hopefully help appease angry webmasters sick of SE bots that download pages from their sites too often and use large amounts of bandwidth. This is especially important as the number of search engines proliferates.

This probably won't help with crawlers that aren't webmaster-friendly and don't obey the robots.txt exclusion standard correctly. GlobalSpec's robot, Ocelli, is fully compliant, but there is still some difficulty getting webmasters to use and understand the standards. Having a feed would save our time (and Google's) if we could stop worrying about webmaster complaints related to crawling!
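For comparison, the robots.txt check itself is tiny; here is a minimal Python sketch of the lookup a compliant crawler like Ocelli performs before fetching, with a placeholder user agent and URL:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("ExampleBot/1.0", "http://example.com/some/page.html"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")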

Here's looking forward to the next generation of pinging for websites -- whole site feeds. I can't wait to see the implementation details Google has in store. It would be nice to get the search engines together to form standards on things like this. Does anyone know of any attempts?
