Saturday, October 29

High speed data storage and unsigned Data types in Java

One of the reasons search is so fascinating is that performance really matters. I've always been drawn to speed and performance, and I had the wonderful opportunity to work in the performance group at the Java Technology Centre in Hursley, testing the performance of AspectJ.

What I want to talk about today is the classic trade-off in CS: space versus time. Search engines need to be fast, and ironically, to be fast they need to be very space efficient.

RAM efficiency and Search
So, you may ask, what do unsigned data types in Java have to do with search? A lot, it turns out. It all goes back to economics and scarcity of resources. The biggest limiting factor in the speed of web search is fast, cheap storage. And RAM is the most important source of fast storage in the modern computer. The more you can do in RAM (indexing / query serving), the faster the search engine.

In software, and search in particular, data structure size matters because RAM is precious -- a finite resource, a bit like oil. RAM is digital black gold! The biggest barrier to speed in search engines (and computers in general) is the limited size of RAM and the relative slowness of hard disks. Not much has changed since 1998, when Larry and Sergey wrote their now famous paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine:

Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently... However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness.

In order for a search engine to be fast, it needs fast access to data -- the faster the better. The more of the inverted index that can fit into memory, the faster the search engine. If it has to access a part of the index not in memory, it must go to a hard disk. Accessing a disk is slooww compared to RAM, and I don't think many people really appreciate how slow disks are or how little they have changed in recent years. Moore's Law has made web search feasible by delivering fast CPUs and cheap storage. However, just because storage has gotten cheaper doesn't mean it has gotten faster -- storage speeds do not obey Moore's Law. In fact, disk speeds have not improved much in the past 10 years; we are still stuck in the world of 8-10 ms random access times. Hard disks are serial access memory (as opposed to random access memory, RAM) made of mechanical pieces that move: a magnetic read/write head attached to an arm that sweeps back and forth over platters of magnetic material spun on a central spindle. In contrast, RAM is very fast. It has an access time measured in nanoseconds, not milliseconds. However, we have much less of it, and it isn't permanent -- the lights go out when the power is pulled (although there is some research that might be changing this).

A normal high-end workstation has between 2-8 GB of RAM; the rest of the storage is disk. Google serves its search engine from large clusters of cheap computers not because it is convenient, but because it is the easiest way to get large amounts of RAM. In a search engine, many cheap computers beat a few high-end computers: both kinds of machine hold roughly the same amount of RAM, so the cost per megabyte of RAM is much lower on the cheap ones. How this limited resource is used is critical to speed.

Java and Unsigned Data Types
So what does all of this have to do with unsigned data types in Java? As I have said, search engines are very conscious of data structure size. They pull out lots of fancy compression and encoding algorithms to shrink data structures so that everything fits into memory. The problem is that Java has no unsigned integer types. This means wasted space, since search engines usually only need positive numbers: two's complement representations spend half of their range on negative numbers, which some applications don't require. Other languages, like C and C++, have both signed and unsigned data types. Why is Java different? Why doesn't Java have unsigned data types?

Well, this has been bugging me for awhile, so I decided to do some research. I knew it had something to do with its cross platform compatibility and word sizes. However, I wanted more details. It took some research -- it was not nearly as easy or obvious as I thought it would be and there is still some debate.

I found Java and Unsigned types (Or rather, the lack thereof), which pointed me to some very interesting articles.

Here is at least one point that James Gosling makes in this very fascinating interview on C/C++/Java.

Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple.

Now, I am no C developer, but I would like to think that I know a bit about unsigned arithmetic. It is in the curriculum of any good CS program -- Computer Architecture 101. If you want a good resource to learn more, please see Computer Organization and Design by Hennessy and Patterson.

I understand that they wanted to abstract platform concerns away from the developer and create a cross-platform language. However, they provided no means of extension that developers could utilize to gain platform-specific performance -- which is very important in search.
Li Moore, from Google, reiterates this desire for a platform-extensible model in his recent interview on Google's migration to Java 1.5. Li would really like a pluggable interface for platform-specific file system operations. Ostensibly, Google wants this so they can fully utilize their Google File System and the I/O features of their custom Linux platform.

Li Moore also mentions that he would really like to see an unsigned byte. A signed byte can only store -128 to +127, but what Li wants is to store 0-255. I'm not sure what Li would use this for, but I can speculate. What if you wanted to store PageRank scores in the index for efficient searching? Having an unsigned byte -- 256 buckets -- would certainly provide more resolution than 128 buckets to distinguish between sites. While it is not terribly difficult to translate negative numbers to positive numbers at query time, avoiding that translation naively means keeping a larger data type in RAM.

In short, negative numbers are less convenient or just don't make sense in some applications, and dealing with all positive numbers clearly has advantages. To get around Java's lack of unsigned data types you could write an unsigned byte wrapper class, but because it would be a class and not a primitive, you would use more RAM -- each value would need an object reference -- and you would lose the space savings and performance offered by primitive types.
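The usual primitive-only workaround is to keep the raw byte and reinterpret it as unsigned by masking when you read it back. A minimal sketch (class and method names are my own, not from any library):

```java
// Emulating an unsigned byte in Java: store the raw 8 bits in a byte,
// and mask with 0xFF on the way out to get the 0-255 interpretation.
public class UnsignedByteDemo {
    public static int toUnsigned(byte b) {
        // b is sign-extended to int first; the mask zeroes the high 24 bits
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte stored = (byte) 200;               // held as -56 in two's complement
        System.out.println(stored);             // prints -56
        System.out.println(toUnsigned(stored)); // prints 200 -- the value we wanted
    }
}
```

This keeps the one-byte footprint in RAM and pays only a cheap mask at read time, instead of the object-header overhead a wrapper class would add.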

Java Sign Tips
So, if you've read this far without being bored out of your mind, I applaud you! Here are some interesting tricks. Java DOES have an unsigned data type -- the char. In Java, a char is a two-byte (16-bit) unsigned number; in other words, an unsigned short.
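You can see the unsigned behavior directly -- a char holds 0 through 65535 and wraps like any fixed-width unsigned integer:

```java
// char is Java's only unsigned primitive: 16 bits, range 0 to 65535.
public class CharDemo {
    public static void main(String[] args) {
        char c = 65535;              // maximum char value; no negative chars exist
        System.out.println((int) c); // prints 65535
        c++;                         // wraps around, as unsigned arithmetic does
        System.out.println((int) c); // prints 0
    }
}
```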

Another useful little trick, if you are doing a lot of bit arithmetic and want to make your code easier to read, is to initialize constants using something like Long.parseLong("0000111...", 2). Much easier than trying to read hex when you need to do efficient bitwise arithmetic. Comments are an alternative ;-).
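For example (constant names are mine; note that binary literals like 0b1111 didn't exist until much later, in Java 7, so parseLong is the trick of choice here):

```java
// Initializing a bit mask from a binary string instead of hex.
public class BitMaskDemo {
    // The low four bits set -- easier to eyeball than hex for long masks.
    static final long LOW_NIBBLE = Long.parseLong("1111", 2);

    public static void main(String[] args) {
        System.out.println(LOW_NIBBLE);        // prints 15
        System.out.println(0xAB & LOW_NIBBLE); // prints 11 -- keeps only the low nibble
    }
}
```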

Don't forget the difference in Java between signed bit shifting (>>) and unsigned bit shifting (>>>), which shifts in 0s.
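The difference only shows up on negative numbers, where >> copies the sign bit in from the left and >>> shifts in zeros:

```java
// Signed (>>) vs. unsigned (>>>) right shift on a negative int.
public class ShiftDemo {
    public static void main(String[] args) {
        int x = -8;                  // 0xFFFFFFF8 in two's complement
        System.out.println(x >> 1);  // prints -4: sign bit replicated from the left
        System.out.println(x >>> 1); // prints 2147483644: a zero shifted in instead
    }
}
```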

So why do you care? Saving 2 bytes may not sound like a lot, but over millions or billions of objects it adds up to significant savings! It is especially critical when it means the difference between fitting in RAM and spending money on new hardware.

What is the successor to the hard disk for storage? 7200 and 10k RPM drives are not the long-term solution. We need a better answer than spinning pieces of metal -- something without moving parts that is both fast and cheap. Perhaps a type of non-volatile memory like NRAM or MRAM? Optical or holographic media?

Thoughts anyone? Perhaps SDRAM is good enough for Google, but storage media is something search engines might consider putting more effort into developing and supporting with research dollars. Until then, bring on the unsigned data types!

We've talked about memory and RAM, but fast, cheap storage is only one part of the problem Larry and Sergey pointed out. The other was the robustness of operating systems. Fuel for a future post ;-).

Further reading.
Here is a very interesting article from 2000 out of IBM Research on the future of storage media.

Holographic Memory on Technology Review.
IBM Almaden Data Storage Research.

Java Data Types

Thursday, October 27

User Generated Content and Spam

Getting users to generate content is catching on again. Remember when forums became THE hot thing on the internet? Now it is blogs, folksonomies, tagging, Wikipedia, etc. -- all under the nebulous umbrella of "Web 2.0". We'll see what happens, but the spammers are assaulting here as well.

Greg has posted some interesting commentary on this phenomenon and some ways to combat it. Very timely considering the problems I had getting Blogger setup because of the new anti-spam 'features'.

Matt Cutts has a recent post about adding 'Captchas' to Wordpress. For those not yet in the know, a Captcha is one of those funky images that look like a two year old scrawling on the screen with a crayon. It's not only hard for humans to read -- it's darn near impossible for bots.

Can we make one that doesn't give me a headache and still be effective?

Wednesday, October 26

Google Clustering and Classification

I must've missed this post on Battelle's SearchBlog:
Peter Norvig's Demo at Web 2.0
Here is SE Watch's coverage.

Other web 2.0 highlights.

Named Entity Extraction, Clustering, and new UI innovations? Talk about a lot to bite off in one presentation...

Here is something he said that I found really interesting and have been pondering recently:
We break text into sentences and then match sentences against patterns. We discard noisy data and regularize over names. We also use the relations of concepts and the nesting sets of concepts to understand the concepts. -- Talk about a teaser. There is so much there that it is almost meaningless. I really want to be interested, to learn something from that, but Peter is a master of marketing speak -- lots of words that sound interesting, without much substance. How?? How do you break sentences up? What are the patterns? What is the noise? No secret sauce.

Is this presentation online somewhere? I am dying to see it. I don't think Google has published papers on much (or any!) of this.

I managed to find an MP3 on the unofficial Google blog. I'll try to listen soon.

There is some interesting work going on to identify and remove irrelevant parts of a page (things like navigational links, targeted ads, etc.) and focus on the content.

What I would like to know is: what are Google's categories for Bayesian classification? Were the categories 'automatically discovered' a la Clusty, or were pages classified into a more manual taxonomy a la DMOZ and GlobalSpec?

... there is a lot more here on text classification and AdSense. Peter Norvig wrote the book on AI (literally), so I guess I shouldn't be surprised. Did I mention it is a good book? We used it as the text in my AI course at Union.

Some people & research on text classification, entity extraction, and clustering :
The Bow Toolkit -- a C library from 1996. Talk about ancient, but it is an early example. The author, Andrew McCallum, is one of the pioneers in entity extraction, former VP of R&D at WhizBang Labs, and a major guy at FlipDog. His research students are publishing some very interesting papers.

Learning to Cluster Web Search Results by MS Asia. They actually have an online demo of their web search clustering technology. It's not bad, but it's a little slow, and it's not perfect: a search on Jaguar returns lots of clusters on cars, but only one on Mac OS and none on cats, helicopters, or fighter jets. Clusty does a much better job.

Word Clustering and disambiguation based on co-occurrence data from MS Research Japan.

If you want to drink from a firehose, here is Stanford's CS276 class on IR. This class covers most of the above mentioned topics with a plethora of research to keep you busy for a while. The old course website also has interesting material -- including lecture notes!

Haven't looked too closely at this, but I ran across it in my browsing stream and it looks interesting:
Techniques for Improving the performance of Naive Bayes
... now what about SVMs? Naive Bayes and SVMs seem to be the two text classification technologies most in vogue in academia today. More text classification discussion to follow...

Google Query Highlighting of Abbreviations

I noticed something interesting in my daily search engine travels.

Try a search on Google for ACM.

The first result shows the Association for Computing Machinery (one of my favorite organizations). What is fascinating is that it is highlighted in bold! Cool! Google recognized ACM as an abbreviation. Especially interesting since more and more users are looking at query highlighting as a measurement of relevance.

I tried this search on Yahoo, MSN, and Ask Jeeves. No other search engine is this sophisticated in its search result display technology. Does this help the user experience on Google by making users feel the results are more relevant? So, that leaves me asking... how much effort would it take to pull this off, and why haven't the others caught on?

Blog pinging and next generation site feeds

So I just submitted my site to Ping-O-Matic. Very interesting. The blogosphere is really pushing the envelope in keeping search engines notified when blogs are updated. I wish websites would do the same thing... and so does Google:

From their blog: Google Site Feed Rumor. Google has been innovating in the way websites inform search engines of changes -- with Sitemaps, for example. This allows them to improve their recall -- to make sure they get the pages that webmasters want them to get. What's next? Submit your site feed to Google. This marks a very big change in search engine philosophy.

Traditionally, search engines are pull organisms. They send out crawlers to vacuum content off of the web. However, they are far from perfect at keeping the content they have fresh. Perfect freshness is impossible: the second a crawler downloads a page, the page could be out of date, especially if it is database driven. Think of highly dynamic news or forum sites, for example.

It's very interesting that blogs have been pioneering a push-centric architecture with pinging. From various sources it looks like there are at least 30-40 pinging services. Some, like FeedBurner, even offer more advanced services, automatically notifying blog search engines when a new post is made.

On the traditional web, search engines go to great lengths to estimate the patterns with which pages update. For example:
The evolution of the web and implications for an incremental web crawler. The problem is, this isn't perfect -- it requires a long history of observations to make even semi-accurate predictions. Search engines waste a lot of bandwidth crawling sites more often than they change, because it hurts too badly when a user clicks a result and doesn't get what they expect. Along this vein...

There is some interesting research going on around different approaches to web crawling, like
User Centric Web Crawling. Chris Olston, Sandeep, and others at CMU and Stanford are doing some very interesting research in this area. Very few pages actually appear in search results, so it makes sense for search engines to pay more attention to the pages users see, and to pages with relatively unpredictable update schedules or sites too new to have enough data for accurate predictions of the next update.

I think it is safe to say that we (both webmasters and SEs) are all looking forward to this innovation by Google. Hopefully, Google's market power will push (no pun intended!) webmasters to submit site feeds to search engines for indexing. It's about time! I am very curious how they are going to cope with all of the potential problems -- cloaking, spam, storage and bandwidth, etc... Perhaps an extension of the system they use for partners / Froogle?
Take a look at these guidelines for submitting a Froogle product feed. Very semantic webby. GlobalSpec has similar technology for partners and suppliers to submit data into our product search. I'll say it again -- getting web data in a structured format direct from webmasters / companies is a dream come true for search engines.

However, it can be more trouble than it's worth. Explaining the ins and outs of tab-delimited formats, valid XML feeds, etc. -- the formats are a pain to get right. Getting valid feeds reliably can prove more problematic than just crawling the website! The bottom line is that this is a nice step, but I don't think we will see much benefit for some time -- perhaps when future web applications have feed generation built in.

Site feeds would solve a whole slew of problems for search engines, but create some new ones. They will hopefully help with angry webmasters sick of search engine bots that download pages too often and use large amounts of bandwidth. This is especially important as search engines proliferate.

This probably won't help with crawlers that aren't webmaster friendly and don't obey the robots.txt exclusion standard correctly. GlobalSpec's robot, Ocelli, is fully compliant, but there is still some difficulty getting webmasters to use and understand the standards. Having a feed would save our time (and Google's) if we could stop worrying about crawling-related complaints from webmasters!

Here's looking forward to the next generation of pinging for websites -- whole site feeds. I can't wait to see the implementation details Google has in store. It would be nice to get the search engines together and form standards on things like this, does anyone know of any attempts?

Google spam verification

Ok, now I am annoyed. This is the third try and I am still being asked for the word verification. I have cleared my cookies, cache, etc. One more time!

And again...

So I am still having my picture problem.

I was in the picasa forums and I was referred to this:

Apparently they have classified my blog as "spammy" so I can't upload pictures until I make a post and use the word verification system to make sure I am not a bot. Frustrating. I suppose this is a problem with new blogs that don't have a lot of content yet. Still, for a new user this is not a good first impression.

WordPress, the open source blog platform, is starting to look better and better... I just need to find some free hosting somewhere. Also, I wonder if Blogger has categories yet... more to come.

Hello nuisance

RateLimitingException. So I decided to upload a picture to go with my blog. To do this I had to download Hello (another Picasa program might also have worked). So I signed up AGAIN, and when I tried to upload a picture I got the above error.

I am attempting to do what it says --> create a post on my blog and solve a "captcha".

Whew, this is ridiculous. I already have Gmail / Google accounts; now I need two more? There is some room for improvement here, folks. So far, setting up a blog and getting it tweaked has taken almost an hour (including the Hello technical snafus). This is still way too complicated for the average user, IMHO. My parents would definitely be lost right now.

Google, if you are listening -- MAKE THIS EASIER. Still, this is better than Yahoo 360. I decided to sign up for both and see which I liked better. I've gotta say, so far Blogger is far superior. More on this later...

My new blog

I have finally decided to start a public blog.

I have been blogging internally at work for several months now. I use it as my online notebook. Very cool, and the integrated search is much easier than flipping pages!

I've been reading the blogs of other bloggers in the search arena and it is time for me to jump in! Speaking of which, I just signed up for Findory recently. I've been reading Geeking With Greg and finally tried it. Very cool!

And with that, here is my first drop in the great blogosphere.