Friday, December 10

Seeking Summer Internship Opportunities

I am beginning to explore internship opportunities for the summer of 2011.

I am an applied researcher interested in search in specialized domains and in building search tools that address complex information needs. My experience includes search in the engineering domain, medical search, local business data, food and recipes, and information extraction on book data. My work often involves processing large datasets with distributed frameworks such as Hadoop and Pig.

I am looking for opportunities that fit my background and, ideally, include research that could lead to a publication at a major conference.

If you know of any opportunities that would be appropriate, please contact me via email. My CV is available from my website (Word, PDF).

Several of my fellow PhD students here in the CIIR are also seeking internships, so I would be happy to pass along any appropriate opportunities.

Wednesday, December 8

New Book: Mining of Massive Datasets

Anand Rajaraman and Jeffrey D. Ullman have put together a new ebook, Mining of Massive Datasets. The book builds on the course materials for the Stanford CS345 course, "Web Mining," and the CS246 class, "Mining Massive Data Sets."

From the ToC, the book covers:
  1. An introduction to data mining
  2. Large-scale processing with distributed file systems and MapReduce
  3. Similarity search: nearest neighbors, minhashing, LSH, etc.
  4. Algorithms for mining streaming data
  5. (Web) graph analysis: PageRank, HITS, and spam detection
  6. Frequent-itemset algorithms
  7. Clustering algorithms
  8. Advertising on the web
  9. Recommendation systems
It is an interesting blend of topics that are not usually taught together. I look forward to examining it in more detail.
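The minhashing material in the similarity-search chapter, for instance, is compact enough to sketch in a few lines. The snippet below is my own minimal illustration, not code from the book: a bank of random hash functions maps a set of shingles to a short signature, and the fraction of positions on which two signatures agree estimates the Jaccard similarity of the underlying sets.

    import random

    def minhash_signature(shingles, num_hashes=100, seed=42, prime=2**61 - 1):
        """Compute a MinHash signature for a set of (hashable) shingles.

        Each row applies a random linear hash h(x) = (a*x + b) mod prime and
        keeps the minimum value seen, so two sets agree on a given row with
        probability equal to their Jaccard similarity.
        """
        rng = random.Random(seed)
        coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime))
                  for _ in range(num_hashes)]
        return [min((a * hash(s) + b) % prime for s in shingles)
                for a, b in coeffs]

    def estimate_jaccard(sig1, sig2):
        """Fraction of signature positions on which the two signatures agree."""
        return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

    doc1 = {"mining", "of", "massive", "datasets"}
    doc2 = {"mining", "massive", "data", "sets"}
    print(estimate_jaccard(minhash_signature(doc1), minhash_signature(doc2)))
    # prints roughly 0.33, close to the true Jaccard similarity of 2/6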

Tuesday, December 7

Barriers to Entry in Search Getting Lower

The Mims's Bits column in the MIT Technology Review has an article, You, Too, Can Be the Next Google. In it, Tom Annau, the CTO of blekko (see my previous post), argues that computing power is growing faster than the amount of 'useful' and 'interesting' content on the web.
"Web search is still an application that pushes the boundaries of current computing devices pretty hard," says Annau. But Blekko accomplishes a complete, up-to-the-minute index of the Web with less than 1000 servers...
To be more efficient, blekko is more careful about what it crawls by:
  1. Avoiding crawling spam and splog content
  2. Using a "split-crawl" strategy that refreshes different genres of content at different rates, so that blogs and news are refreshed often (a rough sketch of the idea follows below).
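The split-crawl idea is easy to sketch. Here is a rough illustration of my own, not blekko's code; the genre labels and refresh intervals are made-up assumptions:

    import heapq
    import time

    # Illustrative refresh intervals (seconds) per content genre; whatever
    # genres and rates blekko actually uses are not public, to my knowledge.
    REFRESH_INTERVALS = {
        "news":   15 * 60,        # recrawl every 15 minutes
        "blog":   60 * 60,        # every hour
        "forum":  6 * 60 * 60,    # every 6 hours
        "static": 7 * 24 * 3600,  # once a week
    }

    class SplitCrawlScheduler:
        """Priority queue of (next_crawl_time, url, interval) entries."""

        def __init__(self):
            self._queue = []

        def add(self, url, genre):
            interval = REFRESH_INTERVALS.get(genre, REFRESH_INTERVALS["static"])
            heapq.heappush(self._queue, (time.time(), url, interval))

        def next_due(self, now=None):
            """Pop every URL whose refresh time has arrived and reschedule it."""
            now = now if now is not None else time.time()
            due = []
            while self._queue and self._queue[0][0] <= now:
                _, url, interval = heapq.heappop(self._queue)
                due.append(url)
                heapq.heappush(self._queue, (now + interval, url, interval))
            return due

    scheduler = SplitCrawlScheduler()
    scheduler.add("http://example.com/news/story", "news")
    scheduler.add("http://example.com/about", "static")
    print(scheduler.next_due())   # both are due on the first pass

A spam/splog filter would then sit in front of add(), so junk never enters the queue at all.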
I'm not sure blekko's "efficiency" techniques are particularly interesting or novel. However, I do think that crawling and indexing the entire web is getting easier overall, especially with distributed crawlers like Bixo.
"Whether we succeed or fail as as startup, it will be true that every year that goes by individual servers will become more and more powerful, and the ability to crawl and index the useful info on the Web will actually become more and more affordable," says Annau.
Mei and Church's 2008 paper, Entropy of search logs: how hard is search? with personalization? with backoff?, analyzed a large search engine log to estimate the size of this 'interesting' part of the web. They find that the URLs appearing in search logs can be encoded in approximately 22 bits, which corresponds to only millions of pages (2^22 is about 4 million). As they say,
Large investments in clusters in the cloud could be wiped out if someone found a way to capture much of the value of billions with a small cache of millions.
In principle, if you knew these pages and had a way of accurately predicting which ones change, the price of search could be reduced significantly. The paper goes on to highlight that a personalized page cache, or one built from the profiles of similar users, offers an even greater opportunity. In short, there is a real opening for small, highly personalized verticals.
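The measurement behind that 22-bit figure is easy to illustrate: compute the empirical entropy of the clicked URLs in a log, and two raised to that entropy gives the 'effective' number of pages. The toy sketch below uses made-up data and only the basic idea, not the paper's actual models (which also bring in personalization and backoff):

    import math
    from collections import Counter

    def url_entropy(clicked_urls):
        """Empirical Shannon entropy (in bits) of a list of clicked URLs.

        If clicks were spread uniformly over N distinct URLs this would equal
        log2(N); the skewed distributions in real logs give far fewer bits.
        """
        counts = Counter(clicked_urls)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Toy log: a couple of head URLs dominate, a long tail appears once each.
    log = ["wikipedia.org"] * 50 + ["imdb.com"] * 20 + [f"tail{i}.com" for i in range(30)]
    h = url_entropy(log)
    print(f"{h:.2f} bits -> about {2**h:.0f} 'effective' URLs out of {len(set(log))} distinct")
    print(f"22 bits corresponds to about {2**22:,} pages")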

I think the main reason blekko gets by with a modest number of servers is that its query volume is small. One of the key reasons Google and the other web search engines need thousands and thousands of machines is to keep query latency very low for billions of queries per day from hundreds of millions of users around the world. To pull this off, Google keeps its search index in memory (see Jeff Dean's WSDM 2009 keynote).
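For a sense of scale, here is a back-of-envelope calculation with made-up but plausible numbers (the query volume, peak factor, and shard count are my assumptions, not published figures). Because each query typically fans out to every index shard, billions of queries per day become tens of millions of shard lookups per second, which is exactly the regime where the postings have to live in RAM:

    # Rough, illustrative numbers -- not actual Google or blekko figures.
    queries_per_day = 3e9                     # assumed web-scale query volume
    avg_qps = queries_per_day / (24 * 3600)   # ~35,000 queries per second
    peak_qps = 2 * avg_qps                    # assume peaks run ~2x the average

    # If the inverted index is split into shards and every query touches every
    # shard, the cluster must absorb peak_qps lookups on each shard.
    index_shards = 1000                       # assumed number of index partitions
    shard_lookups_per_sec = peak_qps * index_shards

    print(f"average qps: {avg_qps:,.0f}, peak qps: {peak_qps:,.0f}")
    print(f"shard lookups per second at peak: {shard_lookups_per_sec:,.0f}")

blekko, serving a tiny fraction of that query volume, simply does not face the same latency-at-scale problem.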