Saturday, October 29

High speed data storage and unsigned Data types in Java

One of the reasons search is so fascinating is because performance really matters. I've always been fascinated by speed and performance, and I had the wonderful opportunity to work in the performance group at the Java Tech. Center in Hursley testing the performance of AspectJ .

What I want to talk about today is the classic trade-off in CS: space and time. Search engines need to be fast, and ironically, to do this they need to be very space efficient.

RAM efficiency and Search
So, you may ask, what does unsigned data types in Java have to do with search? A lot, it turns out. It all goes back to economics and scarcity of resources. The biggest limiting factor in the speed of web search is fast, cheap, storage. And RAM is the most important source of fast storage in the modern computer. The more you can do in RAM (indexing / query serving) the faster the search engine.

In software, and search in particular, data structure size matters because RAM is a very precious, it is a finite resource -- a bit like oil. RAM is digital black gold! The biggest barrier to speed in search engines (and computers in general) is the limited size of RAM and the relative slowness of hard disks. Not much has changed since 2000 when Larry and Sergey wrote their now famous paper The Anatomy of a Search Engine:

Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently... However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness.

In order for a SE to be fast, it needs fast access to data -- the faster the better. The more inverted index that can fit into memory the faster the search engine. If it has to access a part of index not in memory, it must access a hard disk. Accessing a disk is slooww compared to RAM, and I don't think many people really appreciate how slow disks are or how little they have changed in recent years. Moore's Law has made web search feasible with CPU speeds and cheap storage. However, just because storage has gotten cheaper doesn't mean it has gotten faster -- storage speeds do not obey Moore's law. In fact, disk speeds have not improved much in the past 10 years. We are still stuck in the world of 8-10 ms random access times. Hard disks are serial access memory (as opposed to random access memory, RAM) made of mechanical pieces that move -- a magnetic read/write head attached to an arm that moves back and forth over the top pieces of magnetic material (platters) that are spun on a central axis (spindles). In constrast, RAM is very fast. It is has an access time of nanoseconds, not milliseconds. However, we have much less of it and it isn't permanent -- the lights go out when the power is pulled (although there is some research that might be changing this).

On average, a normal high-end workstation can have between 2-8 GB of RAM. The rest of the storage is disk. Google has large clusters of cheap computers serving their search engine not because it is convenient -- but because it is the easiest way to get large amounts of RAM. In a search engine, more cheap computers are better than fewer high-end computers because both can hold roughly the same amount of RAM. The cost per megabyte of RAM is much lower for cheap computers. How this limited resource is used is critical to speed.

Java and Unsigned Data Types
So what does all of this have to do with unsigned data types in Java? As I have said, search engines are very conscious of data structure size. They pull out lots of fancy compression and encoding algorithms to shrink the size of data structures so that they can fit everything into memory. The problem is that all in Java there are no unsigned data types. This means wasted space when search engines usually only utilize positive numbers. Two's complement representations waste have of their bandwidth on negative numbers, which some applications don't require. Other languages, like C /C++ both signed and unsigned data structures. Why is Java different? Why doesn't Java have unsigned data types?

Well, this has been bugging me for awhile, so I decided to do some research. I knew it had something to do with its cross platform compatibility and word sizes. However, I wanted more details. It took some research -- it was not nearly as easy or obvious as I thought it would be and there is still some debate.

I found Java and Unsigned types (Or rather, the lack thereof). Which pointed me to some very interesting articles.

Here is at least on point that James Gosling makes in this very fascinating interview on C/C++/Java.

Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple.

Now, I am no C developer, but would like to think that I know about about unsigned arithmetic. It is in the curriculum of any good CS program -- Computer Architecture 101. If you want a good resource to learn more please see Computer Organization and Design by Hennessey and Patterson.

I understand that they wanted to abstract platform concerns away from the developer and create a cross-platform language. However, they provided no means of extension that developers could utilize to gain platform specific performance -- which is very important in search.
Li Moore, from Google, reiterates this desire for a platform extensible model in his recent interview on Google's migration to Java 1.5. Li would really like to see the ability to have a pluggable interface for platform specific file system operations. Ostensibly, Google wants this so they can fully utilize their Google File System and the I/O features of their custom Linux platform.

Li Moore also mentions that he would really like to see an unsigned byte. A signed byte can only store from -128 to +127. But, what Li wants is to store 0-255. I'm not sure what Li would like to use this for, but I can speculate. What if you wanted to store PageRank scores in the index for efficient searching? Having an unsigned short, 255 buckets, would certainly provide more bandwidth than 127 buckets to distinguish between sites. While it is not terribly difficult to translate negative numbers to positive numbers at query time, it requires a larger data type to be kept in RAM.

In short, negative numbers are less convenient or just don't make sense in some applications, dealing with all positive numbers clearly has advantages. To get around Java not having unsigned data types you could write an unsigned byte wrapper class, but because it would be a class and not a primitive you would use more RAM because you woulld need to assign an Object Reference and you would lose the space savings and performance offered by primitive types.

Java Sign Tips
So, if you've read this far without being bored out of your mind I applaud you!. Here are some interesting tricks. Java DOES have an unsigned data type -- it is a char. In Java, a char is a two byte (16-bit) unsigned number, in other words -- an unsigned short.

Another little useful trick if you are doing a lot of bit arithmetic and want to make your code easier to read is to initialize constants using something like Long.parseLong("0000111...", 2). Much easier than trying to read hex when you need to do efficient bitwsie arithmetic. Comments are an alternative ;-).

Don't forget the difference in Java between signed bit shifting >> and unsigned bit shifting that shifts in 0s >>>.

So why do you care? Being able to save 2 bytes may not sound like a lot, but over millions or billions of objects this can add up to be a significant savings! It is especially critical if the difference is spending money to buy new hardware because you need more RAM.

What is the sucessor to the hard disk for storage? 7200 and 10k RPM drives are not the long-term solution. We need a better answer than spinning pieces of metal. Something without moving parts that is both fast and cheap. Perhaps a type of non-volatile memory like NRAM or MRAM? Optical or Holographic media?

Thoughts anyone? Perhaps SDRAM is good enough for Google, but storage media is something search engines might consider putting more effort into developing and supporting with research dollars. Until then, bring on the unsigned data types!

We've talked about memory and RAM, fast, cheap storage is only one part of the problem Larry and Sergey pointed out in 2000. The other was the robustness of operating systems. Fuel for a future post ;-).

Further reading.
Here is an article from 2000 out of IBM research on the future of media storage that is very interesting.

Holographic Memory on Technology Review.
IBM Almaden Data Storage Research.

Java Data Types