Monday, December 28

Dean and Ghemawat Strike Back on MapReduce

Jeff Dean and Sanjay Ghemawat wrote an article for the January edition of CACM, MapReduce: A Flexible Data Processing Tool. In the article, they refute the findings of A Comparison of Approaches to Large-Scale Data Analysis. The authors of that comparison had also written a blog post bashing MapReduce: MapReduce, A major step backwards. The post is no longer available, but thankfully Greg had good coverage.

In the article, Dean and Ghemawat address the paper and attempt to debunk its claims, although they lack the benchmarks to back it up. In the process, they describe the right way to run M/R jobs efficiently:
  1. Avoid starting processes for each new job; reuse workers.
  2. Shuffle data carefully to avoid O(M*R) disk seeks.
  3. Beware of text storage formats.
  4. Use natural indices, like timestamps on files.
  5. Do not merge reducer output.
They present some good M/R lessons in their refutation: use a binary serialization system like Avro or Protocol Buffers, and store your data in a format that provides efficient access, whether a natural file structure or a database system like HBase.
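To make the storage-format lesson concrete, here is a toy Python comparison of a text encoding against a fixed-width binary one. The record layout and field names are invented for illustration; real systems like Avro and Protocol Buffers are more compact still, with schemas and varint encoding.

```python
import json
import struct

# A hypothetical log record: (timestamp, user_id, score).
record = (1262044800, 42, 0.87)

# Text encoding (JSON): human-readable, but bulky and slower to parse.
text_bytes = json.dumps(
    {"ts": record[0], "uid": record[1], "score": record[2]}).encode()

# Fixed-width binary encoding: two unsigned 32-bit ints and a 32-bit float.
binary_bytes = struct.pack("<IIf", *record)

print(len(text_bytes), len(binary_bytes))  # the binary record is 12 bytes
```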

NY Times hits new low: Gives voice to quack on "Search Neutrality"

The NY Times ran an op-ed article, Search, but you may not find. I can't believe they ran such rubbish. I'm not going to bother to debunk it; Paul Kedrosky did a better job than I could.

The problem is that commercial search engines are inherently conflicted: they have products to sell and advertisers to please. The question is: Should search be a public service, like a library?

The French are taking on Google Books with Polinum, the "Operating Platform for Digital Books." Jimmy Wales's efforts with Wikia Search failed because they didn't execute and weren't profitable. Daniel, a long-time advocate of transparency in search, now works for Google.

There will always be disgruntled quacks, but in the long run, is a company or even a small group of companies with such a large share of search healthy?

Sunday, December 27

NY Times Article on Children's Search

The NY Times had a recent article on search for kids. They covered a study sponsored by Google and performed by Allison Druin at the HCI lab at UMd that conducted a user study with 83 kids to understand how they search. My wife is an elementary school teacher, so this is a topic we've often discussed, and it is particularly interesting to me.

In recent related work, Druin published How Children Search the Internet with Keyword Interfaces, a study performed with 12 kids. Read section 6 for their suggestions on user interfaces. Here are several of their possibilities: (1) using voice search instead of typing, (2) simplified results pages, and (3) results at an appropriate reading level. The NY Times article appears to describe a larger follow-up study.

The NY Times interviewed Irene Au, Google’s Director of User Experience, about ways the research could be incorporated into a product. They note that the keyword mismatch problem is much more challenging for kids, who have less of the conceptual framework of a subject necessary to search effectively. From the article, “The problems that kids have with search are probably the problems adults experience, just magnified... If we can solve that for children we can solve that for adults." However, I'm not convinced that this is a correct conclusion. Druin says that the bottom of the screen offers an important area to suggest related searches.

In the article, representatives from Bing and others also weigh in; a representative from Y! is notably absent given Y!'s presence in this market. Stefan Weitz, from Bing, suggests that visual interfaces offer an opportunity because kids haven't developed typing skills. Scott Kim says that kids are more likely than adults to ask questions. Perhaps if we catch them early enough, we can study them before they are brainwashed into keyword-speak.

Given their lack of typing skill, the article briefly mentions that voice search, like that used for mobile search, offers an interface opportunity for kids.

At the end, one of the kids interviewed suggests: “I think there should be a program where Google asks kids questions about what they’re searching for,” he said, “like a Google robot.”

I look forward to reading the paper on the study. Hopefully it will contain the concrete solutions to improve the search experience for kids that they foreshadow in their earlier work.

Sunday, December 13

Hadoop Eclipse Tip: Lib Dependencies

I'm writing a Hadoop job and I ran into a little problem that I wanted to share (and remind myself of the solution for the future).

I am packaging my Hadoop program into a jar file. It has external dependencies on text parsers. One way to include these with my program is to package the dependency jars inside my jar in a /lib directory. This ensures the jar and all of its dependencies get copied to the Hadoop workers running the Mappers.

I create my jar file by right-clicking on the project --> export --> Java --> Jar file. I then select my code and the lib directory. However, the problem I had was that my lib directory was not being exported. I learned that this happens if the jars in lib are on your build path. To solve this, the jars need to be "external" or in a different folder. Then you can export the lib directory as a resource.
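For repeatability, the same packaging can also be scripted outside Eclipse. Here is a sketch in Python using the standard zipfile module; the function name and the `bin/`/`lib/` layout are assumptions you would adapt to your own project.

```python
import os
import zipfile

def build_job_jar(jar_path, classes_dir, lib_jars):
    """Package compiled classes plus dependency jars into a Hadoop job jar."""
    with zipfile.ZipFile(jar_path, "w") as jar:
        # Add the compiled .class files at the jar root.
        for root, _, files in os.walk(classes_dir):
            for name in files:
                full = os.path.join(root, name)
                jar.write(full, os.path.relpath(full, classes_dir))
        # Hadoop adds lib/*.jar inside the job jar to the task classpath.
        for dep in lib_jars:
            jar.write(dep, "lib/" + os.path.basename(dep))
```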

Anyone care to share a better solution?

Friday, December 11

Google Quick Scroll: A better CTRL-F

Today, Google announced an extension, Quick Scroll, for Chrome 4 Beta that utilizes your previous search while browsing. Consider their example query [belgian waffles served by street vendors?] and browsing a result:
... a small black box appears in the lower right hand corner of the browser with a couple snippets of text from the page that might be relevant to your query.... Quick Scroll analyzes things like proximity, prominence and position of the words to identify the most relevant content.
... who will be the first to reverse engineer the formula from the extension?
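Google hasn't published the formula, but a naive guess at the kind of scoring involved, combining term coverage, proximity, and position with made-up weights, might look like:

```python
def score_passage(passage, query_terms, position, n_passages):
    """Score one passage of a page against a set of query terms."""
    words = passage.lower().split()
    hits = [i for i, w in enumerate(words) if w in query_terms]
    if not hits:
        return 0.0
    coverage = len(set(words) & query_terms) / len(query_terms)  # term coverage
    proximity = 1.0 / (1 + hits[-1] - hits[0])                   # tight clusters win
    prominence = 1.0 - position / n_passages                     # earlier is better
    return coverage + 0.5 * proximity + 0.25 * prominence

def best_passage(passages, query):
    """Index of the passage to scroll to for this query."""
    terms = set(query.lower().split())
    return max(range(len(passages)),
               key=lambda i: score_passage(passages[i], terms, i, len(passages)))
```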

Wednesday, December 9

ICWSM 2010 Data Challenge

The ICWSM is a conference on blogs and social media. For the conference, they issued a data challenge.
The dataset is a set of 44 million blog posts made between August 1st and October 1st, 2008. Each post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers that approximate search engine ranking to some degree. The total size of the dataset is 142 GB uncompressed (27 GB compressed).
The deadline is March 1st.

Something to look at after the SIGIR deadline....

Monday, December 7

Google Search Evolution Event

Google is hosting a press event at the Computer History Museum offering an "inside look at the evolution of Google search".

Danny Sullivan is live-blogging the event at SELand.

Saturday, December 5

Google Gets More Personal: Extends Personalization to All Users

Google announced in a blog post that it will personalize results for all users, utilizing data from cookies. It will use up to the last 180 days of history about your behavior.

It sounds like they've learned enough from using the histories of signed in users to generalize the technique to a wider audience. They also believe it has enough benefit to push on everyone, despite the fact that it raises more privacy issues. It will make people more aware that Google tracks how they interact with search results over an extended period of time.

Thursday, December 3

BixoLabs Public Terabyte Webcrawl

Ken Krugler and the team at BixoLabs are doing some things worth noting. They created Bixo, an open-source Java web crawler built on top of the Cascading MapReduce framework. I think it's one of the best open-source options for large-scale web crawling. Going further, they have their own cluster that they use for specialized client crawls.

You should read their blog for some of the recent talks Ken gave on web mining.

Particularly interesting is that they are starting to work on a Public Terabyte Project web crawl. The code and the crawl will be available for free via Amazon's dataset hosting.

I look forward to taking a look at it more. It's an important resource, because ClueWeb09 is only available to researchers who pay for it.

Microsoft EntityCube is live

Microsoft demoed EntityCube at TechFest, see my previous post. The system is now live.

See the top-ranked papers by domain, e.g., information retrieval.

It provides an interesting list of the most influential people in a field by citations.

Thursday, November 19

TREC 2009 This week

The annual TREC meeting is this week in Maryland. The proceedings won't be available until February, but you can get hints about what is happening (but no eval results) by following along on Twitter, #trec09. Some highlights:
Keep up the news, Ian and Iadh!

Evaluating LDA Clustering Output

Yesterday, I mentioned that Mahout has an implementation of LDA, a form of clustering.

Today, there is a post on the LingPipe blog covering a recent paper, Reading Tea Leaves: How Humans Interpret Topic Models. Read the post for an overview of what the authors found when they used Mechanical Turk to evaluate the coherence of topic-document and topic-word clusters.
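For a feel of the evaluation, here is the paper's word-intrusion idea in miniature (the topic words and judgments below are simulated, not taken from the paper): show judges a topic's top words plus one "intruder" word drawn from another topic; if the topic is coherent, judges spot the intruder, and "model precision" is the fraction who do.

```python
def model_precision(intruder, judgments):
    """judgments: the word each judge picked as the intruder."""
    return sum(1 for pick in judgments if pick == intruder) / len(judgments)

topic_words = ["pasta", "sauce", "cheese", "tomato", "basil"]  # one topic's top words
intruder = "touchdown"                                         # injected from another topic
judgments = ["touchdown", "touchdown", "basil", "touchdown"]   # simulated judges
print(model_precision(intruder, judgments))  # prints 0.75
```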

Microsoft Pivot: A visualization tool for faceted exploration

Yesterday Microsoft Live Labs launched Pivot. Pivot is a desktop application for faceted navigation and visualization to explore collections of information. Watch the YouTube demo.

I don't have an invitation for the tech preview, so you'll have to watch the demo for more details.

Wednesday, November 18

Apache Mahout 0.2: MapReduce LDA, Random Forests, and Frequent ItemSet Miner

Grant announced on the Lucid Imagination blog that Mahout 0.2 is released. Mahout is a library of scalable (distributed) machine learning algorithms using MapReduce.

Mahout 0.2 has several key new features worth a look, including distributed LDA, Random Forests, and a frequent itemset miner.
The release also has many other bug fixes and improvements. Keep up the good work guys!

Monday, November 16

New Yahoo! Research Demo: Quest for NLP based Q&A exploration

Hugo Zaragoza let me know that Yahoo! research has a new demo out, Quest. Quest is a faceted navigation interface on Q&A data. It lets you browse using key phrases, nouns, and verbs extracted from a dependency parse of the questions.

For a full description, you can read the announcement on the Y! Sandbox. The demo uses a set of 8 million Q&A documents from Yahoo! Answers collected in 2007. Here's their description of some of the challenges they faced:
The first one is to select the right "lexical units" of the collection in order to produce meaningful browsing suggestions. The next challenge is to develop interesting list suggestions, on the fly, for whatever query the user may submit. Lastly, we had to invent an interface that would allow users to interact with the suggestions and the results, and enable a natural browsing experience.
They used the DeSR dependency parser to extract terms and phrases and then use a forward index with Archive4J to count and sort the terms in the questions that are returned by a query.
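As a rough sketch of that counting step, here is how facet suggestion might work over the questions matching a query; I'm substituting plain unigrams for the parsed nouns and verbs, and the scoring is simply document frequency.

```python
from collections import Counter

def suggest_facets(matching_questions, query, k=5):
    """Surface the most frequent terms, excluding those already in the query."""
    query_terms = set(query.lower().split())
    counts = Counter()
    for q in matching_questions:
        for term in set(q.lower().split()):  # count each term once per question
            if term not in query_terms:
                counts[term] += 1
    return [term for term, _ in counts.most_common(k)]
```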

I tried it for pasta and then filtered to "pasta salad". I was hoping that some of the nouns would include common ingredients: bacon, chicken, olives, onion, pepperoni, mozzarella cheese, etc. However, most of the nouns/verbs are more general and somewhat redundant given my selected filters. I think the term selection algorithm could still be improved.

Faceted search interfaces are important browsing tools, and automatically extracting and selecting facets is a challenging problem. It's good to see first steps applying NLP to the task. I look forward to seeing how Quest evolves.

Be sure to check out the Correlator demo if you haven't seen it.

Thursday, November 12

Machine Learning Talk: Lee Spector on Genetic Programming; applications to Learning Ranking Functions

Today at the Yahoo! sponsored machine learning lunch, Lee Spector presented his work on genetic programming. His talk, Expressive Languages For Evolved Programs highlighted his work using the Push programming language for solving interesting and hard real-world problems.

He pointed to two key principles that these systems need to have to learn solutions, based on observations from biology:
  • Meaningful variation - Variations can't just be random, the mutations and selections have to produce meaningful effects in the domain.

  • Heritability - children need the ability to inherit desirable features from the parent without being clones.
During the talk, it struck me that a really obvious application would be to use GP to learn IR ranking functions. Recently, Ronan Cummins did some work in this area. Ronan's recent paper at SIGIR 2009 applied it to learning proximity functions: Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval.

I think there's still interesting work combining GP with IR. For example, one problem is that collections and users evolve over time, but most ranking functions are static.
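To make the idea concrete, here is a toy sketch of evolving a ranking function; this is nowhere near Push or Cummins's framework, just random search over small expression trees built from invented per-document features (tf, idf, doclen), with fitness measured by pairwise ordering accuracy.

```python
import random

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
FEATURES = ["tf", "idf", "doclen"]

def random_tree(depth=2):
    """Grow a random expression tree over the feature names."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(FEATURES)
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, doc):
    if isinstance(tree, str):
        return doc[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, doc), evaluate(right, doc))

def fitness(tree, pairs):
    """pairs: (relevant_doc, nonrelevant_doc) feature dicts; count correct orderings."""
    return sum(evaluate(tree, rel) > evaluate(tree, non) for rel, non in pairs)

def evolve(pairs, generations=200):
    """Crude GP stand-in: keep the best of a stream of random candidates."""
    best = random_tree()
    for _ in range(generations):
        challenger = random_tree()
        if fitness(challenger, pairs) > fitness(best, pairs):
            best = challenger
    return best
```

A real GP system would add crossover and mutation of surviving trees rather than pure random restarts; the point here is only the representation and the pairwise fitness function.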

Monday, November 2

New York Times Releases Subject Headings Data

Evan Sandhaus announced in a NYTimes Open blog post that they are opening up the NYT subject headings. Today they are announcing the release of the first batch of 5,000 headings.
Over the last several months we have manually mapped more than 5,000 person name subject headings onto Freebase and DBPedia. And today we are pleased to announce the launch of and the release of these 5,000 person name subject headings as Linked Open Data.
Over the next few months they plan to expand this to over 30,000 tags.

Also, check out the NYT Article Search API.

Browse the headings and get hacking!

CIKM 2009: Tutorials and Workshops

Today there were workshops and tutorials at CIKM.

Social Web Search and Mining (SWSM2009)
Web Information and Data Management (WIDM)
Cloud Data Management (CloudDB 2009)

There were also four tutorials.

I'm particularly disappointed to miss Marius Pasca's tutorial on the acquisition of Open-Domain Concepts and Conceptual Hierarchies.

There's little coverage of the conference so far, but I'll try to link to what I find.

Wednesday, October 28

CMU Read the Web Project on M45

Jon points out a post on the Y! Developer Network Blog detailing their use of the M45 cluster for Information Extraction.

The post is by Andy Carlson and Justin Betteridge, PhD students working on the Read the Web project. The goal is to generate a knowledge base from web documents.

They ran MapReduce jobs over a large web crawl to find:
  1. Given a list of patterns, what noun phrases fill in the blanks of those patterns?
  2. Given a list of noun phrases, what patterns do those noun phrases occur with?
  3. Given a list of patterns and noun phrases, how many times does each pattern co-occur with each noun phrase (or pair of noun phrases)?
They are currently scaling their techniques up to ClueWeb09 and using features from a dependency parse obtained from the Malt parser.
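The third job, co-occurrence counting, maps naturally onto MapReduce. Here is a single-machine sketch of the same logic (with sentence matching simplified to substring checks, and toy data in place of a web crawl):

```python
from collections import Counter

def map_sentence(sentence, patterns, noun_phrases):
    """Emit ((pattern, NP), 1) for every pattern/NP pair found in one sentence."""
    for p in patterns:
        if p in sentence:
            for np in noun_phrases:
                if np in sentence:
                    yield (p, np), 1

def reduce_counts(pairs):
    """Sum the emitted counts, as the reducer would."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts

sentences = ["Boston is a city in Massachusetts",
             "Paris is a city in France"]
patterns = ["is a city in"]
nps = ["Boston", "Paris", "France"]
emitted = [kv for s in sentences for kv in map_sentence(s, patterns, nps)]
counts = reduce_counts(emitted)  # (pattern, NP) -> co-occurrence count
```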

See their upcoming paper at WSDM 2010, Coupled Semi-Supervised Learning for Information Extraction.

You can also see the Read the Web course wiki page.

My group here at the CIIR uses M45 for large-scale extraction and organization work on the Million Book Project data. More on that work as it develops.

Friday, October 23

Conferences Coverage: RecSys09 and HCIR09

I'm not attending either, but trying to follow what's going on.

The 2009 conference on recommender systems (RecSys) in NY is happening this weekend. Follow the conference on Twitter, #recsys09. I'm particularly looking for coverage on the Netflix Challenge panel: What did we learn from the Netflix Prize? Perspectives from some of the leading contestants.

The HCIR Workshop is also taking place in DC. Daniel is one of the chairs. You can also see other coverage on #hcir09. The proceedings for the workshop are available. Henry is attending and taking part in a panel, so hopefully I'll be able to share some of his highlights.

Tuesday, October 20

Why I Don't Want Your Search Log Data

The IR field is largely driven by empirical experiments to validate theory. Today, one of the biggest perceived problems is that academia does not have access to the query and click log data collected by large web search engines. While this data is critical for improving a search service and useful for other interesting experiments, ultimately I believe it would lead to researchers being distracted by the wrong problems.

Data is limiting. Once you have it, you immediately start analyzing it and developing methods to improve relevance: for example, identifying navigational queries using click entropy. You can also apply supervised machine learning to rank documents and weight features. These are important and practical things to do if you run a service, but they aren't the fundamental problems that require research.
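Click entropy, for instance, takes only a few lines to compute: a navigational query concentrates its clicks on one URL (entropy near zero), while an informational query spreads them out. The click counts below are invented.

```python
import math

def click_entropy(clicks_per_url):
    """Shannon entropy (bits) of the click distribution over result URLs."""
    total = sum(clicks_per_url)
    probs = [c / total for c in clicks_per_url if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(click_entropy([98, 1, 1]))        # concentrated clicks: low entropy
print(click_entropy([25, 25, 25, 25]))  # spread-out clicks: 2.0 bits
```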

The IR community has its own data: TREC. TREC deserves credit for driving significant improvements in IR technology in the early and mid 90s. However, it too can be limiting. For many in academia, success and failure are measured by TREC retrieval performance. Too often, a researcher struggles with superhuman effort to eke out incremental improvements on well-studied corpora that won't make a significant long-term contribution to the field. What's missing are the big leaps: disruptive innovation.

Academia should be building solutions for tomorrow's data, not yesterday's.

What will the queries and documents look like in 5 or even 10 years and how can we improve retrieval for those? It's not an easy question to answer, but you can watch Bruce Croft's CIKM keynote for some ideas. Without going into too much detail, also consider trends like cross-language retrieval, structured data, and search from mobile phones.

One proven pattern is that breakthroughs often come from synthesizing a model from a radically different domain. One recent intriguing direction is Keith van Rijsbergen's work on The Geometry of Information Retrieval, applying models of quantum mechanics to describe document retrieval. Similarly, is there potential for models of information derived from molecular genetics and other fields? If you're a molecular geneticist and are interested in collaborating, e-mail me!

I still believe in empirical research. However, I'm also well-aware that over-reliance on limited data can lead to overfitting and incremental changes instead of ground-breaking research. To use an analogy from Wall Street, we become too focused on quarterly paper deadlines and lose sight of the fundamental science.

That said, if you want me to study your query logs... I'd be happy to do it. After all, I need those publications to graduate.

Am I wrong? I'm interested to hear your thoughts, tell me in the comments.

Thursday, October 15

Elements of Statistical Learning 2nd edition and Other Books

The second edition of the classic, Elements of Statistical Learning is available. The book covers topics such as:
  • Supervised and Unsupervised Learning
  • Regression
  • Linear Classification (including LDA)
  • Kernel Methods
  • Evaluation and Assessment (including the right and wrong way to do cross-validation)
  • Bayesian inference
  • Decision Trees and boosting methods
  • Neural Nets
  • SVMs
  • K-Means clustering and nearest neighbor classification
  • Random Forests
  • Ensemble learning (shown to be very effective in the Netflix competition)
  • Undirected Graphical Models (including RBMs)
The PDF is available for download, so you can read/search it before you buy it.

While you're looking at books, another book to check out is Probabilistic Graphical Models.

Thursday, October 8

Eric Schmidt: Disruptive Innovations In Search Not Possible

Last week BusinessWeek had a series of articles interviewing Google Search Quality leaders. In the interview with CEO Eric Schmidt, there was a question about innovation:
The days when you can come in with some new idea and change everything are gone. It's a much more sophisticated set of problems than can be done with a small team coming up with a new development.
Instead, he says that disruptive ideas will focus on a smaller part of the system, e.g. a new important ranking feature that will be assimilated into the massive behemoth of a system. One example of this that comes to mind is Sep Kamvar's work on personalized PageRank at Kaltix, which Google acquired in 2003 and has now integrated.

As Eric also mentions in the interview, a key obstacle to web search innovators is scale. In economics terms the market has a large barrier to entry.

Despite the barriers, I think he is wrong. There are still disruptive innovations left in search.

Wednesday, October 7

HCIR 2009: Proceedings Highlights

Daniel points out that the HCIR 2009 workshop proceedings are available. Here are a few highlights:
  • Modeling Searcher Frustration by Henry Feild
    Henry is a labmate who recently conducted an interesting user study analyzing the affective mental state of users during search tasks in order to detect 'frustration'. The goal is then to predict when a user is frustrated based on observable query log data. He has some interesting results:

    1) Users who get frustrated tend to stay frustrated
    2) Frustration tends to increase with the number of queries submitted
    3) Certain users are more predisposed to being frustrated than others
    4) Frustration levels depend on the type of task

  • Using Twitter to Assess Information Needs: Early Results by Max Wilson
    They analyzed 189,000 tweets, collected as the top 100 results for each of 10 search queries retrieved hourly over a two-week period. Their goal was to understand the kinds of things people are looking for.
  • I Come Not to Bury Cranfield, but to Praise It by Ellen Voorhees
    She argues that the very simplified (impoverished) role of the user in Cranfield is necessary in order to run highly controlled experiments. A key challenge is the cost of judging results. She says,
    Modifications as small as moving from MAP to a more user-focused measure like precision at ten documents retrieved require larger topic sets for a similar level of confidence. More radical departures will require even larger topic sets.
  • Freebase Cubed: Text-based Collection Queries for Large, Richly Interconnected Data Sets by David Huynh, creator of Parallax.
    David explores some of the challenges of presenting faceted interfaces across large, heterogeneous domain models. He writes,
    Any large data set such as Freebase that contains a large number of types and properties accumulated over actual use rather than fixed at design time poses challenges to designing easy-to-use faceted browsers. This is because the faceted browser cannot be tuned with domain knowledge at design time, but must operate in a generic manner, and thus become unwieldy.
  • Usefulness as the Criterion for Evaluation of Interactive Information Retrieval by Michael Cole, et al. from Belkin's group at Rutgers.

    The paper argues that pure relevance based measures fail to measure whether or not a system helped a user accomplish their task. They propose a method to measure 'usefulness'.
    ... usefulness judgment can be explicitly related to the perceived contribution of the judged object or process to progress towards satisfying the leading goal or a goal on the way. In contrast to relevance, a judgment of usefulness can be made of a result or a process, rather than only to the content of an information object. It also applies to all scales of an interaction.
  • Towards Timed Predictions of Human Performance for Interactive Information Retrieval Evaluation by Mark Smucker

    He advocates an extension of the Cranfield paradigm that measures the user's ability to find relevant documents within a timed environment. The overall goal is to develop a model of user behavior in order to inform decisions about which UI and search features provide the most opportunity for improvement. They use GOMS to estimate the time for users to complete a task given an interface. He writes,
    The acronym GOMS stands for Goals, Operators, Methods, and Selections. In simple terms, GOMS is about finding the sequence of operations on a user interface that allows the user to achieve the user’s goal in the shortest amount of time.
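As a flavor of the estimates GOMS produces, here is a minimal Keystroke-Level Model calculator. The operator times are the classic KLM estimates (K ≈ 0.2 s per keystroke for a skilled typist, P ≈ 1.1 s to point with a mouse, H ≈ 0.4 s to move hands between keyboard and mouse, M ≈ 1.35 s of mental preparation); the operator sequence is my own invented breakdown of a search interaction.

```python
KLM_SECONDS = {"K": 0.2, "P": 1.1, "H": 0.4, "M": 1.35}

def estimate_seconds(operator_sequence):
    """Total predicted time for a string of KLM operators, e.g. 'MKKKK'."""
    return sum(KLM_SECONDS[op] for op in operator_sequence)

# Think, type a 10-character query, move to the mouse, point at a result,
# and click (the click is modeled as a keystroke here).
query_time = estimate_seconds("M" + "K" * 10 + "HPK")  # about 5 seconds
```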
That's all for now, although there is a lot more interesting work in the proceedings!

Tuesday, October 6

Stalk My Semester: Information Retrieval and Stats

You may have noticed my post frequency decreasing. It's inversely proportional to the amount of homework. This semester I am taking three classes, all of which relate to my research interests (for a nice change).

CS646: Information retrieval
The graduate IR class with James. The slides from the class are available for you to follow along. For texts we're using:
  1. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice. Addison Wesley, February 2009. [amazon]

  2. C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008. [cup]. The authors also make it available online.
This is the first time there have been good, comprehensive texts for IR. I recommend picking them up if it's your area of interest.

STAT 607: Mathematical Statistics I
This is an introductory class on statistical theory taught by Michael Lavine. We're learning R for data analysis. The textbook, Introduction to Statistical Thought is available for free download. It has lots of good R examples.

CS791: Information Retrieval Seminar on User Modeling
Bruce is leading a seminar on User Modeling for IR. Last week we focused on query term weighting, led by Michael. This week we'll cover Query Reformulation techniques. There is no website or text for this course, but I'll try to provide some links to relevant papers and presentations as we cover material.

Thursday, October 1

Yahoo! Hadoop Distribution Update in Time for Hadoop World

To coincide with Cloudera's Hadoop World, today Y! announced an updated release of the distribution that runs their clusters. The new version is based on Hadoop 0.20.1.

Watch #HadoopWorld on Twitter for more updates on Hadoop World tomorrow.

Wednesday, September 30

Cloudera CDH2 New Testing Release

Cloudera today announced a new version of the Cloudera CDH2 testing release.

The new version is based on Hadoop 0.20.1 and has compatible versions of PIG, Hive, and HBase 0.20. This is a big deal because these were previously unavailable to early adopters.

They announced it just in time for Hadoop World in NYC, which starts tomorrow. I won't be attending due to classes and other scheduling conflicts, but please go and let me know what's going on.

Tuesday, September 22

Transparent Text 2009 Coverage

IBM Research Cambridge hosted the Transparent Text Symposium yesterday and today. Judging by the conversation it's sparking, I missed a great event. The sheer volume of interesting presentations looks astounding.

Start by checking twitter, #tt09. Daniel has been carpet-tweeting the entire event! Check out his coverage of Day 1 and Day 2.

Be sure to read Ethan Zuckerman's posts on what Matthew Gray is doing with the Google Books data and on Beth Noveck's open government keynote.

If you haven't checked out the Guardian's Datablog, do it! Tons of captivating and informative information visualization. Their talk at TT highlighted the MP Spending crowdsourcing project.

I look forward to catching up with the videos once they are posted!

Updated Yahoo! Search Page Framework: Faster, more consistent, and easier to change

Today Yahoo! launched a new search page framework.

The new framework integrates SearchMonkey, SearchPad, and SearchAssist. The team streamlined the layout to make it faster and more modular. As they write,
Now, here’s the best part: Rather than building this new experience on top of our existing front-end technology, our talented engineering and design teams rebuilt much of the foundational markup/CSS/JavaScript for the SRP design and core functionality completely from scratch. This allowed us to get rid of old cruft and take advantage of quite a few new techniques and best practices, reducing core page weight and render complexity in the process.
I like the SearchPad integration. However, I find the three-column Y! search result page layout cramped compared to Google and Bing. It feels cluttered and less readable. I have a big monitor and the restricted layout doesn't utilize the space well.

Interesting stuff, but now let's see what the Y! team can do with a new revamped code base.

Monday, September 21

The Winning $1M Netflix Prize Methods Published

In case you've been living under a rock, today Netflix formally awarded the $1M prize to BellKor's Pragmatic Chaos. See the NYTimes article. Wired has a picture of the team, which met in person for the first time to receive the award.

The top finishing teams recently published papers outlining their strategies; see the Netflix Prize Forum message.
I look forward to hearing more about the second iteration of the contest! From the NY Times article, it will be different:
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies.

Thursday, September 17

Hadoop 0.20.1 released

Hadoop 0.20.1 is finally here. Get it while it's hot! If you want, you can read the full release notes. This is the release to use if you are setting up a new cluster. It's also worth upgrading older pre 0.20.x clusters to this release.

Hadoop 0.20.x is very different from previous releases. The configuration and APIs have been overhauled. As previously mentioned, there is the new TFile storage format.

Look for an imminent PIG 0.4 release and the Cloudera CDH2 distribution with 0.20.x, Hive, and PIG support.

Wednesday, September 16

Yahoo! Key Scientific Challenges Coverage III: Web Information Management

Today continues the series (part I: search, part II: machine learning) of Henry's notes from the Yahoo! Key Scientific Challenges summit. Today we are covering Brian Cooper's talk on challenges in "Web Information Management," which deals with structured data, unstructured data, and making structure out of unstructured data.

Information extraction
  • Goal: from unstructured -> structured

  • How?
    - site-specific
    - format-specific
    - domain/category-centric

  • Goal: system and process for fast and easy building of domain-centric operations.
List extraction versus entity extraction
  • exploits structured regularities/proxies to nested concepts
    - lists, records, attributes
  • create business directories for store locations
  • pulling useful tidbits of info from around the web, dereferencing them, and then presenting them to the user
  • scalability is important
    - get rid of some complex features
  • speed
  • adaptive allocation for reduced server load
  • tagging
    - relying on these is messy
  • photo albums online allow for quick searching
  • image labeling
    - could use ESP, but relies on users playing the game
    - OR let people tag as normal and then offline...
  • detect similar photos
  • overlap tags
  • collaborative tagging
(See also PSOX and information extraction using community efforts).

Friday, September 11

Yahoo! Key Scientific Challenges Summit: Machine Learning

Yesterday I posted the notes from Andrew Tomkins's presentation on challenges in search. Today brings more of Henry's notes, this time from the machine learning talk by Sathiya Keerthi Selvaraj, Senior Research Scientist. It covers the use of ML at Y! and some current challenges.

Application view
  • understanding user behavior
  • choosing best content for presentation
  • serving right ads
  • extracting/semantic tagging of content
  • dealing with spam
    - rich data makes solutions for these possible
ML problems view
  • standard ML problems
    - regression/classification/clustering/feature selection/etc.

  • statistics

  • scale
    - dealing with large data sets
    - discovering faster algorithms
    - fast surround (?)

  • structure/signal
    - adversarial learning
    - budget on real-time
    - preserving privacy
    - multi-task and transfer learning
    - graph transduction w/ many types of info
    - injecting knowledge into models (non-traditional training data)
  • experimental design/quality metrics
  • estimating CTR
  • rare events/anomaly detection
  • forecasting (page views for displaying advertising)
    Ex: content optimization (COKE)
    - matching content to user intent
    - maximize "long-term utility" (satisfaction)
  • online tracking of content affinity
    - multi-armed bandits and time series analysis
  • SVD for user modeling
Clustering documents
  • one document, many topics
  • using graphical model representation
  • speed up algorithms
    - parallel implementation via pipeline
    (fastest LDA code)
  • uses many tricks
  • 1000 iterations (near convergence) of 1M docs in a few hours
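The fast parallel LDA code mentioned in the notes isn't reproduced here, but the collapsed Gibbs sampling sweep that such implementations speed up can be sketched in plain Python. This is a toy, single-threaded illustration; the function name, hyperparameters, and structure are my own choices, not the speaker's code:

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs: list of lists of word ids."""
    rng = random.Random(seed)
    K = num_topics
    V = 1 + max(w for d in docs for w in d)          # vocabulary size
    ndk = [[0] * K for _ in docs]                    # doc-topic counts
    nkw = [[0] * V for _ in range(K)]                # topic-word counts
    nk = [0] * K                                     # tokens per topic
    z = []                                           # topic assignment per token
    for di, d in enumerate(docs):                    # random initialization
        zd = []
        for w in d:
            k = rng.randrange(K)
            zd.append(k)
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):                           # Gibbs sweeps
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                k = z[di][wi]                        # remove current assignment
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # resample: p(k) proportional to (n_dk + a)(n_kw + b)/(n_k + V*b)
                wts = [(ndk[di][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
                r = rng.random() * sum(wts)
                for j, wt in enumerate(wts):
                    r -= wt
                    if r <= 0:
                        k = j
                        break
                z[di][wi] = k                        # add back under the new topic
                ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw
```

The "1M docs in a few hours" figure comes from parallelizing and heavily optimizing exactly this inner loop.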
Vowpal Wabbit
  • online learning (linear regression)
  • optimized to get the fastest speedup of the algorithms
    - open source
    - available on github
  • can use hashing techniques
    - allows for very large feature space
  • modularized
    - can swap out linear regression for other ML models
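Vowpal Wabbit itself is C++; as a hedged illustration of the two key ideas above — the hashing trick for a very large feature space, plus online SGD over a linear model — here is a minimal Python sketch (the CRC-based hash and all names are my own choices, not VW's internals):

```python
import zlib
from collections import defaultdict

def hash_features(tokens, dim=2 ** 18):
    """The hashing trick: map arbitrary string features into a fixed-size
    sparse vector, so the feature space never needs to be enumerated."""
    x = defaultdict(float)
    for t in tokens:
        x[zlib.crc32(t.encode()) % dim] += 1.0
    return x

class OnlineLinearRegression:
    """One-pass SGD on squared error, in the spirit of VW's online learner."""
    def __init__(self, lr=0.1):
        self.w = defaultdict(float)
        self.lr = lr

    def predict(self, x):
        return sum(self.w[i] * v for i, v in x.items())

    def update(self, x, y):
        err = self.predict(x) - y
        for i, v in x.items():          # touch only the non-zero coordinates
            self.w[i] -= self.lr * err * v
```

Because everything downstream sees only hash buckets, swapping in a different loss or model (per the "modularized" bullet) only changes `predict`/`update`.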
Spam filtering
  • abuse of Yahoo! accounts
    - spammers pay people to solve captchas
  • very lucrative
  • >80% of email is spam
  • classifiers have to be quick
  • users hate good mail being classified as spam (FPR)
  • must protect privacy
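The notes above don't say which classifier Y! mail uses; as a hedged illustration of why "classifiers have to be quick," here is a toy multinomial Naive Bayes filter, whose training and scoring are linear in the number of tokens (the class, names, and threshold knob are illustrative, not Yahoo!'s system):

```python
import math
from collections import Counter

class NaiveBayesSpam:
    """Toy multinomial Naive Bayes spam filter with add-one smoothing."""
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, tokens):
        self.counts[label].update(tokens)
        self.docs[label] += 1

    def score(self, tokens):
        """log P(spam|tokens) - log P(ham|tokens), up to shared constants."""
        total = sum(self.docs.values())
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        s = 0.0
        for label, sign in (("spam", 1), ("ham", -1)):
            n = sum(self.counts[label].values())
            lp = math.log((self.docs[label] + 1) / (total + 2))  # smoothed prior
            for t in tokens:
                lp += math.log((self.counts[label][t] + 1) / (n + vocab))
            s += sign * lp
        return s  # > 0 means spam is more likely

    def classify(self, tokens, threshold=0.0):
        # Raise the threshold to trade spam recall for fewer false positives,
        # since users hate good mail being flagged as spam.
        return "spam" if self.score(tokens) > threshold else "ham"
```

The `threshold` parameter is where the false-positive concern from the notes shows up: shifting it trades missed spam for fewer mislabeled good messages.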
Search Ranking (MLR)
  • features
    - queries only
    - documents only
    - queries AND documents
  • approaches
    - pointwise
    - pairwise
    - listwise
  • directly optimize a metric of interest
  • using click data for auto labeling
  • transfer learning
  • diversity
  • cascaded learning
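Of the three approaches, pairwise is the easiest to sketch: turn preferences (e.g. clicked vs. skipped results for the same query, per the click-data bullet) into constraints score(better) > score(worse), and adjust the weight vector whenever a pair is mis-ordered. A hedged toy version follows; the perceptron-style update is my choice for illustration, not what any production MLR system uses:

```python
def pairwise_rank_train(pairs, dim, epochs=20, lr=0.1):
    """pairs: (x_better, x_worse) feature-vector tuples for the same query.
    Learns w so that dot(w, x_better) > dot(w, x_worse)."""
    w = [0.0] * dim
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    for _ in range(epochs):
        for xb, xw in pairs:
            if dot(w, xb) - dot(w, xw) <= 0:   # pair mis-ordered: nudge w
                for i in range(dim):
                    w[i] += lr * (xb[i] - xw[i])
    return w
```

Pointwise methods instead regress each document's relevance label directly, and listwise methods optimize a metric over the whole ranked list, which connects to the "directly optimize a metric of interest" bullet.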
Bid Generation
  • ML techniques suggest what bidders should bid for: queries they hadn't thought of using
Domain-centric IE (PSOX)
  • wrappers
    - info extraction algorithms for pages with same/similar format
    - requires supervision
    - not scalable

  • web tables
    - looks for clean HTML tables
    - not scalable
    - needs some supervision

  • NLP
    - uses language signals
    - hard

  • domain-centric extraction
    - located somewhere between the above methods
    - look at one domain at a time
    e.g. blogs
    - what's the title? post time? etc.

  • schema
  • domain knowledge (weak labeling signals)
  • local presentation consistencies => accurate extraction
  • complex graphical models
  • domain-centric approach to deep web

A more interactive Bing 2.0 coming next week?

Yesterday news broke on Twitter about the demo of Bing 2.0 at the Microsoft annual meeting. Mary Jo Foley at ZDNet has coverage of the story.

The tweets say it's slated for this fall and could be released as early as next week.

The exact details are sketchy, but one feature that came out was the use of interactive maps, built with Silverlight, integrated into the results. Other "visualization" features were also alluded to. For example, one tweet:
Bing 2.0, out this month, has some exciting new features. Imagine seeing maps plus pics from the neighborhood of a restaurant to try.
We'll have to wait and see for ourselves...

Thursday, September 10

Yahoo! Key Scientific Challenges Coverage I: Challenges in Search

Yahoo! put out a press release on the Key Scientific Challenges Summit. My labmate, Henry, attended the summit, presenting work on detecting searcher frustration. He took some great notes to share. First up are the notes from Andrew Tomkins, Chief Scientist of Yahoo! Search, talking about key challenges in search.

Three key challenges
  1. Optimizing task-aware relevance (model long-running user tasks)
  2. Grid-based content analysis (new computing algorithms)
  3. Measure/predict/generate engagement
Many non-issues:
  1. Anything involving PageRank
  2. Algorithms for supercomputers
  3. Folksonomies and tag analysis (at least not yet)
  4. Friend of a friend
  5. Unsupervised user-facing techniques (showing result clusters)

Challenges (details)
1. Task-aware
o Dawn of search:
- navigation and packets of information

o Today:
- increasing migration of content online
- new forms of media available online
- infrastructure for payment more comfortable for users

o Moving away from 2.7 words and 10 blue links
- more structured results
- more satisfaction without clicking
- more interaction with web services
- much richer page structure

o The resources people search are changing
- search engines may or may not be the hub

2. Storage trends
o Storage is cheap: any company with tens of employees can store all
text produced by all humans on the planet
- multimedia is another story

o Move away from scale to deep understanding

o Richer models about what's on a page
- page semantics
- user consumption patterns
- aggregate properties
- how do we search it??

3. The problem is bigger than search -- Understanding the user
o why do people lurk versus participate?

o why do people create new personas?

o why are Facebook/YouTube/etc. so successful?

o what new genres are emerging?
- for content creation?
- participation?
- what tools are appropriate?

o haven't really gotten started
- many proxy measures based on views/clicks/etc.

o too low level

o some contributions
- click prediction
- dynamics of social network analysis
- models of viral marketing

o predictions of engagement still "embryonic"
- generation of engagement remains an art form

o need new science of engagement
- this is not a substitute for creativity
- scientific basis

Stay tuned for coverage of the machine learning challenges...

Wednesday, September 9

Getting started with Mahout on DeveloperWorks

Grant Ingersoll has an article, Introducing Apache Mahout, published on IBM Developer works.

The article gives an introduction to different ML tasks and Mahout's implementations of them. Mahout currently has the Taste recommendation system developed by Sean Owen; clustering implementations including k-Means, fuzzy k-Means, Canopy, Dirichlet, and Mean-Shift; and a Naive Bayes text classifier.

Grant covers the basics of getting these working in the article.
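Mahout runs its clustering as MapReduce jobs over Hadoop, but the Lloyd's k-Means iteration being distributed is simple enough to sketch in a few lines of plain Python (illustrative only; this is not Mahout's API):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest center,
    then recompute each center as the mean of its cluster."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step (the "map")
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, c in enumerate(clusters):      # update step (the "reduce")
            if c:
                centers[j] = [sum(dim) / len(c) for dim in zip(*c)]
    return centers, clusters
```

In Mahout the assignment step is the mapper (emitting point-to-center assignments) and the mean recomputation is the reducer, which is what lets the same loop scale to data that doesn't fit on one machine.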

At the end he comments on what's next for Mahout:
On the immediate horizon are Map-Reduce implementations of random decision forests for classification, association rules, Latent Dirichlet Allocation for identifying topics in documents, and more categorization options using HBase and other backing storage options...

CIKM 2009 Papers

The CIKM 2009 accepted papers list is available. Some of the papers are starting to appear online.

The CIIR has several papers this year:

Jinyoung posted a list of papers he thinks look interesting.

Thursday, September 3

Mahout gaining adoption: Mippin

Mippin is a mobile portal service. Sean Owen has a post on the Mippin blog advertising their use of Mahout to build a content recommendation system.

via Grant.

New IR blog: LiFiDeA

My labmate, Jinyoung, started blogging at LiFiDeA in English (he also has a Korean blog) to cover his work at the intersection of Personal Information Management (PIM) and Information Retrieval. He's done some good work on personal document retrieval and semi-structured retrieval. Check out his blog for details!

SpringBoard: Startup Incubator

A little too late, but I can dream.... sign me up for SpringBoard! Ah well, that whole PhD research thing gets in the way sometimes ;-).

Wednesday, September 2

NetBase launches healthBase: Another flawed "semantic" search engine

Today Netbase announced a new semantic health search engine: HealthBase. There is coverage on SELand. From their release,
healthBase is the first example of Content Intelligence that is open and available to the public. The showcase uses Content Intelligence technology to automatically find treatments for any health condition or disease; pros and cons of any treatment, medication and food, and more. Like all NetBase-powered applications, healthBase enables users to get summarized answers and insights automatically from millions of online sources.
Here are my first reactions using the engine. I first tried running related injuries that I've experienced and researched previously. First up is [pulled hamstring] which returns only one poor result. The query for [shin splints] returned better, mostly good results.

I also tried a few queries that are popular currently:
The [swine flu] results aren't bad, with results on the vaccine and antiviral medications. Next, I tried a "pros and cons of treatment" query for [acai berry], a controversial natural health supplement. The acai berry results are disappointing, with little reliable information; for example, one "pro" result is "Offer Free Trial"!! The engine also showed other flaws in its understanding technology by misclassifying "caused human cancer cells to self-destruct" as a con. Sigh.

A more structured presentation of results is a step in the right direction. There are few details on how or why their semantic technologies work (or don't as the case sometimes appears).

Overall healthBase is not a compelling offering right now.

Saturday, August 29

Hadoop: Major Platform Upgrades Coming Soon

The Hadoop world is undergoing rapid evolution. Tom White has a presentation called Hadoop Futures available on slideshare that outlines some of the next major directions.

There are some important changes to keep your eye on. In the next month we will see major releases that will change the Hadoop landscape.

First up is the Hadoop 0.20.1 release, a major one. It will (likely) be used as the basis for the next Y! and Cloudera distributions. Hadoop 0.20 was released in June, but hadn't been widely adopted until some of the bugs were worked out. Hadoop 0.20.x has a new API that will be used going forward. The upcoming point release has a lot of fixes and features, including the new TFile format. The new version is critical because it opens the way for releases from the sub-projects.

The 0.20.1 release paves the way for the PIG 0.5 release. PIG 0.4 and 0.5 will have significant performance and other improvements that have been developed over the past months. One key change is that they will likely include the PIG SQL support that is now in the trunk.

The release of HBase 0.20 is getting very close. There are great presentations on the new releases given at the recent HBase User Group Meeting at StumbleUpon. Again, one of the key new features is a new HFile format based on the TFile that will be in the 0.20.1 release.

In the very near future we will also see a bug fix release from the Avro serialization system, Avro 1.0.1.

In short, by the middle to end of September we should see the adoption of a new and radically improved Hadoop platform. We can ditch the aging 0.18.x platform. We will finally be able to use the new scheduling systems, simplified API, and take advantage of significant performance and reliability improvements.

Tuesday, August 25

Quick News: Google Traffic update, Yahoo! Search UI changes, and more

  • Google is using location data from phones with My Location enabled to monitor traffic. It is now integrating this data into Google Maps, which means expanded coverage of traffic conditions beyond main roads.
    Imagine if you knew the exact traffic speed on every road in the city — every intersection, backstreet and freeway on-ramp — and how that would affect the way you drive, help the environment and impact the way our government makes road planning decisions...
  • Yahoo! is testing new search UI changes.

  • Matt Cutts has his WhiteHat SEO Tips for Bloggers. It's a useful presentation because Matt shares what Wordpress plugins he uses and what tweaks he makes to make his blog more search engine friendly.

  • In case you missed it, the MSN toolbar is now powered by Bing.

Wednesday, August 19

Technology Review Search Innovators Under 35

I'm back from vacation and catching up on news and work.

While I was away, the MIT Tech Review magazine released its TR35 2009: Young Innovators Under 35. Making the list were several people working on search:

Jaime Teevan won the award for her work on personalized search. We recently saw a lot of her work highlighted when her collaborator Susan Dumais gave the Salton Award keynote at SIGIR (I recently organized and updated the notes).

Vik Singh won the award for his work on the Yahoo! BOSS API. He "opened up search secrets to spur innovation". Despite the TR hyperbole, BOSS has become a solid example of open search APIs.

Unrelated to search, Kevin Fu, a professor here at UMass, won the Innovator of the Year award. Congratulations to Kevin (and Ben and his other students!).

My congratulations to the winners!

Wednesday, August 12

The Google File System Evolved: Real-Time User Applications

The ACM has an interview with Sean Quinlan on the evolution of the Google File System.

They talk about the issues they dealt with as GFS has evolved with an emphasis on the move to a distributed master design.
Our distributed master system that will provide for 1-MB files is essentially a whole new design. That way, we can aim for something on the order of 100 million files per master. You can also have hundreds of masters.
Towards the end, Sean discusses how GFS is evolving beyond its batch design to meet the needs of user-facing and latency sensitive applications often using BigTable to store structured data:
... engineers at Google have been working for much of the past two years on a new distributed master system designed to take full advantage of BigTable to attack some of those problems that have proved particularly difficult for GFS
I'm sure the Hadoop and HBase teams will find it interesting reading. I haven't had a chance to read the entire interview in detail because I'm leaving for a week long vacation on Cape Cod. Don't expect many updates from the beach!

Tuesday, August 11

Hadoop Founder Doug Cutting Leaving Yahoo! for Cloudera

Doug Cutting, creator of the Hadoop project, is jumping ship at Yahoo! and joining Cloudera, a Hadoop centric startup offering enterprise support and services.

Doug has a post on his blog describing the move. According to the NY Times interview, the decision is unrelated to the Microsoft takeover of Y! search. Doug reports in the interview,
"This has been in the works for awhile and is unrelated," Mr. Cutting said. "I am definitely not leaving in any sort of protest, and the thing I like least about this move is that it might be perceived that way."
Congratulations to Doug on the new position and to Cloudera for the big win. Having project leaders outside Yahoo! is important for the ecosystem. As Doug works on projects for other clients, it will mean that the future of Hadoop will be driven by the needs of the greater community rather than the internal needs of Yahoo!.

Facebook Makes Updates, Photos, Links, and Videos Searchable

Facebook is rolling out new search functionality to better compete with Twitter search.

Akhil, the Engineering Director, has a post describing the new types of content that Facebook is making searchable.
You now will be able to search the last 30 days of your News Feed for status updates, photos, links, videos and notes being shared by your friends and the Facebook Pages of which you're a fan.

Google Unveils New 'Caffeine' Search Infrastructure Update

Caffeine is a top-secret project to rewrite Google's indexing system, and it's finally being released. According to this interview with Matt, infrastructure-wise this compares with the BigDaddy update in 2006. There have been major changes under the hood to make indexing more flexible, faster, and more robust. According to the Google post:
For the last several months, a large team of Googlers has been working on a secret project: a next-generation architecture for Google's web search.
You can try an index served on the new architecture in the sandbox they set up to let people try it out. Notice anything different?

Matt Cutts has a post on his blog. The infrastructure team have been working hard,
...a few weeks ago, I joked that the half-life of code at Google is about six months. That means that you can write some code and when you circle back around in six months, about half of that code has been replaced with better abstractions or cleaner infrastructure...
Congratulations to the infrastructure team: I didn't notice a significant difference in the results. I expect this will help Google to significantly increase the size and freshness of their index.

You may remember Cuil. Despite getting knocked pretty hard, Cuil was not about next-generation ranking, it was about infrastructure. Read my post for details. It's not clear, but perhaps the Caffeine update tackles some of the issues that Anna Patterson, former Google infrastructure architect, recounted in a Cuil interview,
If they [Google] wanted to triple size of their index, they'd have to triple the size of every server and cluster. It's not easy or fast...increasing the index size will be 'non-trivial' exercise.

Has Google tackled these architecture issues with 'Caffeine'? We may never know.

Monday, August 10

Hadoop Summit Video Roundup

Yahoo! has posted several new videos from the Hadoop summit held in June. Here's a roundup with links to the videos posted so far:
  • State of Hadoop
    Owen O'Malley, Eric Baldeschwieler, and Yahoo!'s Hadoop team talk about their work with Hadoop over the last year, including core capabilities and related sub-projects, deployment experiences, and future directions.

  • HBase Goes RealTime
    HBase is a storage system that's built on top of HDFS. The guiding philosophy of their release: to unjava-fy everything. Some of the major changes: new key format, new file format (HFile), new query API, new result API and optimized serialization, new scanner abstractions, and new concurrent LRU block cache.

  • Hive
    In this talk, Namit Jain and Zheng Shao discuss how and why Facebook uses Hive. They present Hive's progress and roadmap and describe how the open source community can contribute to its evolution. The service was generating about 1TB of data per day in March 2008; by mid-2009, data production had increased to 10TB per day.

  • Hadoop Futures Panel
    Yahoo!'s Sanjay Radia discusses backwards compatibility and the future of HDFS; Owen O'Malley covers MapReduce and security futures; Doug Cutting, the father of Hadoop, talks about Avro, a serialization system; Cloudera's Tom White discusses tools and usability; Facebook's Joydeep Sen Sama talks about Hive; and Yahoo!'s Alan Gates looks at Pig, SQL, and metadata.

  • Scaling Hadoop for multi-core and highly threaded Systems
    Here they present the basic architecture of CMT (chip multi-threading) processors, designed by Sun for maximum throughput, and then describe the work the team did using Hadoop and other virtualization technologies to help scale CMT.

  • Running Hadoop in the Cloud by Tom White
    He opens with a discussion of the Berkeley RAD Lab paper on cloud computing and walks us through a set of definitions to a discussion of the public cloud. He sees a realm of interesting possibilities: an apparently infinite resource; the elimination of user commitment; and the pay-as you go model, which enables elasticity. Tom describes the implementation of Hadoop in this landscape.

  • Amazon Elastic MapReduce
    Amazon Web Services (AWS) evangelist Jinesh Varia presents Amazon's Elastic MapReduce, a web service that simplifies the complexity of large-scale data processing operations for a growing ecosystem of AWS users.

  • The Growing Hadoop Community
    Cloudera co-founder Christophe Bisciglia takes a detailed look at the growth and evolution of Hadoop technology and community over the past year.

Friday, August 7

Yahoo! is Becoming a Newspaper

The NY Times has an interview with Carol Bartz. One highlight is that she claims that Y! was never a search engine:
“We have never been a search company,” she said. “It is: ‘I am on Yahoo. I am going to do a search.’ ”
Danny Sullivan takes issue with her interpretation of history.

As for the future, it sounds like Y! will become an online news and media organization:
“My fortunes are tied to my pages,” Ms. Bartz said... Its Sports section, for example, has reporters producing top-notch original material ranging from scoopy news items and blogs to long-form analysis pieces.
Y! has had a long internal battle over whether it is a media company or a software company. By divesting Y! of its biggest software systems, Ms. Bartz seems to be making the direction clear. Y!'s future will be in media, not software technology. In short, it means getting rid of developers and server farms and hiring reporters in their place. The direction is ironic because it's a throwback to Y!'s past as a manually created directory, which it abandoned in favor of technology-driven search algorithms.

Considering the recent fortunes of online media companies and news organizations it seems like the wrong direction. But, maybe she can pick up all those unemployed journalists on the cheap and run a better organization. Better yet, maybe she can get users involved in creating the content for her with more tools like Y! Answers.

These are sad words to hear and I believe the wrong choice for the future of the company. However, as a developer and researcher, I'm only slightly biased. Just a little bit.

Netflix Prize Sequel: A Sprint instead of a marathon

Neil Hunt, Chief Product Officer at Netflix, announced in the forum that they will be launching a sequel to their very successful recommender competition.
The advances spurred by the Netflix Prize have so impressed us that we’re planning Netflix Prize 2, a new big money contest with some new twists.

Here’s one: three years was a long time to compete in Prize 1, so the next contest will be a shorter time limited race, with grand prizes for the best results at 6 and 18 months.
Stay tuned for more news in September after they formally announce the winner of round one.

Wednesday, August 5

News of the day: Eclipse AppEngine Plugin, New Chrome Beta, Lucene Payloads, and MS-Y! SEC Filing

Google released a new Google Plugin for Eclipse 3.5. It provides support for building and deploying applications on App Engine in Eclipse.

Grant has a post introducing Lucene payloads. This is the primary way to get token/term-level metadata into a Lucene index. See also Michael Busch's slides from the SF Lucene/Solr meetup. If you are a Lucene power user, you should get into it.

SeLand has coverage of the SEC documents on the MS-Y! search deal that go into more details, if you care. One tidbit from the documents is how the merge will take place:
Microsoft will hire not less than 400 Yahoo! employees (the “Transferred Employees”) and will offer the Transferred Employees market competitive compensation packages. In addition, Yahoo! and Microsoft will mutually agree on a retention plan to be paid for by Microsoft to assist in retaining the Transferred Employees and an additional 150 Yahoo! employees to be mutually agreed upon between Microsoft and Yahoo! to assist with providing the transition services.
Google released a new Chrome beta for Windows. I'm still waiting for the Linux version.

Tuesday, August 4

Quick Links of the day: Tuesday

  • Daniel has been consistently rolling out articles covering the SIGIR industry track. Here are a few highlights:
The slides from Nick Craswell's SIGIR Industry Day talk on Query Modeling at Bing are available.

Vanja Josifovski gave a talk on computational advertising, presenting "Ad Retrieval – A New Frontier of Information Retrieval". The comments in Daniel's Post are worth reading! It's clear that ad retrieval research is still controversial.

Evan Sandhaus from the NY Times R&D Labs also presented, and much of what he covered is available in previously posted slides. He highlighted the NY Times Annotated Corpus, which you should take a look at: "The New York Times Annotated Corpus is a collection of over 1.8 million articles annotated with rich metadata published by The New York Times between January 1, 1987 and July 19, 2007."

Monday, August 3

Quick Links For the Day

I'm catching up from being away last week on vacation. Here are my quick links to read up on more:

Are Academic Conferences Broken? and Why Go to Conferences?
Time for Computer Science to Grow up?
A lot of discussion about Computer Science conferences and the future.

On a somewhat related note: Daniel Lemire blogs about why he doesn't blog his ongoing research.

iPhone app development course at Stanford is available online. Google also announced App Inventor for teaching mobile app development. It's a drag and drop GUI for Android apps.

Future of search podcast, featuring Anand Rajaraman, the founder of Kosmix.

BellKor's Pragmatic Chaos, the winning team on the Netflix prize, has a portal page for coverage of the news.

Evri announced a Javascript API.

Cloudera has an intensive app development tutorial for Hadoop.

The future of Hadoop: Don't panic, yet

The recent MS-Y! deal has a lot of people scared about what it means for Y!'s support of Hadoop. Y! search uses Hadoop to create its "WebMap" and is the largest Hadoop customer. See my previous coverage of the State of Hadoop talk at the Hadoop summit, where the search application was one of the primary featured applications. In fact, the cost of running and continually expanding the search clusters was likely a factor in Carol's decision to stop investing in search and to reallocate resources elsewhere.

Given the importance of Hadoop as an infrastructure tool for search, there is a lot of uncertainty about the future. For example, The Register ran an article titled Microsoft pact holds gun to stuffed elephant. To counter the uncertainty and fear, Eric has a post telling people not to panic! and that Yahoo! is still very committed to using Hadoop for infrastructure.

Despite this reassurance, Hadoop is losing a big customer driving requirements and changes that make it a better platform for building search applications, unless a miracle happens and some variant is adopted by Microsoft. The loss may not have short-term impact, but will change the long-term direction of the project as it focuses on being relevant to other teams and problems that are aligned with Y!'s new goals and strategies.

New York Times article on Microsoft-Yahoo Search Deal

In case you missed it, the NY Times ran an article yesterday on the Microsoft-Yahoo! search deal.

The decision Ms. Bartz had to make was go big or give up; she decided to give up:
Ms. Bartz said she sold the search business because Yahoo could no longer continue to match the level of investment Google and Microsoft were making in searching...
There's also an interesting quote where she lays out the goals for the company:
"My first reaction when I got here was that I wouldn’t even do a search deal... until I looked at our expense structure and our actual options and looked at what our prime job was, which is to grow audience."
I read this as saying that search is taking a disproportionate share of resources to run and maintain without being a leader in the market. She would rather spend the money on creating top-notch properties and developing content/media for consumers.

The article briefly mentions the potential impact on jobs at Y!, saying that 400 or more jobs could be hit by layoffs.

Thursday, July 30

Microsoft Bing Page Hunt Game

At SIGIR Microsoft showed off their Page Hunt game. Page Hunt is in the same vein as GWAP by Luis von Ahn. The game in short: guess the query given a web page. If the page is returned in the top 5 results by Bing, you win and get points.

If you didn't play it at SIGIR, give it a try.

It's a cute game, but I'm not sure about its real utility or long-term re-playability.

Here are my quick tips:
Homepages - homepages are pretty easy; the title usually works.
Detailed pages - pick what you think will be an infrequent phrase from the page and search for it with quotes. It almost always works. There's no incentive to issue shorter queries for higher points.

Post your top score in the comments.

Wednesday, July 29

It's Official: Microsoft Assimilates Yahoo! search.

The deal is official and Yahoo! search is dead. Search will be outsourced to Microsoft. Danny Sullivan at SELand liveblogged the conference call. There's also more coverage on TechCrunch and SearchEngineLand.

Highlights of the deal from SE Land:
Microsoft will acquire an exclusive 10 year license to Yahoo!’s core search technologies, and Microsoft will have the ability to integrate Yahoo! search technologies into its existing web search platforms.
From the QA in the conference call via Danny:
Revenue to Microsoft? Couple hundreds of millions of costs over the first two years. Upsides really come as able to improve relevance of search product. Ads are part of relevance. Then improve monetization on Microsoft and Yahoo site.
What does this mean for search teams at Y!? According to Carol:

Yes there are many Yahoo search employees who will be asked to take jobs at Microsoft. There will also be search employees who we look to help us on the display side. And then unfortunately there will be some redundancy in Yahoo.

Danny has more details on the Microhoo deal:
The deal covers “web, image and video search.” Mehdi explained there will be a single crawl and a single index that both parties will have equal access to — “parity” in his words.
The interview also reveals that Microsoft will be taking on responsibility for BOSS and SearchMonkey, as he says: "incorporating the best of Yahoo’s search assets and user experience into its platform and technology".

On the plus side for Microsoft, the teams there now have access to the Y! search platform and tools, including Panama. More importantly, they'll have access to search and log data from Yahoo!. This is an important resource for improving the relevance and quality of Bing's search and advertising. Kudos to them for pulling it off. I think it will result in improvements to Bing and a stronger competitor to Google.

Overall, I think this is bad news for Google and the long-term future of Yahoo!. I think Carol Bartz is underestimating the value in owning a platform that can search and analyze the entire web. The data and algorithms used to improve web search provide strategic knowledge and can be leveraged to improve the overall Y! experience across many of its properties. It will also lose the ability to deeply integrate search into its products because it won't control the platform.

Y! has been a strong supporter of IR research. They have a history of providing academic access to the BOSS API, cluster resources, internships, and other forms of support. I hope that MS steps up its support to avoid shortfalls in these areas.

I think it's a sad day in search when a big player gives up. My thoughts and best wishes to all my friends and colleagues working on search and infrastructure at Y!.

Tuesday, July 28

Microsoft Yahoo! Search Deal for Real?

Update: It's official: Microsoft assimilates Y! search.

It looks like it's finally happening: BoomTown and Wall Street Journal report that a deal is very close and could be announced as soon as tomorrow.

BoomTown reports:
Sources said Microsoft search technology will be used on Yahoo sites, although Yahoo would still sell search ads, which makes the deal much smaller than ones previously envisioned, which included Microsoft taking over both search and search advertising.
If it goes forward, I wonder what this means for the search and advertising teams at Yahoo!. Will the teams re-organize, be cut, or become part of Microsoft? What is the future of Yahoo!'s ad platform that they have invested a lot of time and money into?

I'm apprehensive about the implications of the deal for Y! and the great people there. I also think that Y! provides a solid third alternative search platform. Having only two major engines decreases diversity, which could be problematic in the future.

Friday, July 24

Ivory: A New MapReduce Indexing and Retrieval System

To coincide with the SIGIR MapReduce Tutorial, Jimmy Lin announced the release of Ivory, an open-source MapReduce search platform. It is a web-scale indexing and retrieval system built on top of Hadoop. Since it's based on Hadoop, it's clearly written in Java.
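The basic indexing pattern Ivory builds on is easy to sketch: in the map phase each document emits (term, (docid, term frequency)) pairs, the shuffle groups pairs by term, and the reduce phase collects each group into a postings list. Here is a minimal in-memory simulation of that pattern in Python; it illustrates the MapReduce idea only and is not Ivory's actual (Java) code:

```python
from collections import Counter, defaultdict

def map_phase(docs):
    """Map: each document emits (term, (docid, term_frequency)) pairs."""
    for docid, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            yield term, (docid, tf)

def reduce_phase(pairs):
    """Shuffle + reduce: group emitted pairs by term into sorted postings lists."""
    index = defaultdict(list)
    for term, posting in pairs:
        index[term].append(posting)
    return {term: sorted(postings) for term, postings in index.items()}

docs = {1: "hadoop is a mapreduce framework",
        2: "ivory is a mapreduce retrieval system built on hadoop"}
index = reduce_phase(map_phase(docs))
# index["mapreduce"] -> [(1, 1), (2, 1)]
```

In a real Hadoop job the shuffle and grouping are done by the framework across machines; the per-document map and per-term reduce logic is all the indexer has to supply.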

For retrieval it uses Don Metzler's Searching using Markov Random Fields (SMRF) Java implementation. You can read his publications on the topic. It's exciting to finally get a chance to play with the implementation of one of the state-of-the-art retrieval tools. To my knowledge this is the first time Don's Java MRF toolkit for retrieval is available to the public.
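For flavor, the sequential dependence variant of Metzler's MRF model scores documents by interpolating three feature classes: single query terms, exact ordered bigrams, and unordered co-occurrence within a small window, with interpolation weights commonly set around 0.85/0.10/0.05. The sketch below scores each feature with a Dirichlet-smoothed log-probability; the helper names, the crude window count, and the zero-count floor are my own simplifications, not SMRF's implementation:

```python
import math

def count_ordered(a, b, tokens):
    """Exact adjacent occurrences of the ordered bigram (a, b)."""
    return sum(1 for i in range(len(tokens) - 1)
               if tokens[i] == a and tokens[i + 1] == b)

def count_window(a, b, tokens, w=8):
    """Rough count of positions whose w-token window contains both terms."""
    return sum(1 for i in range(len(tokens))
               if a in tokens[i:i + w] and b in tokens[i:i + w])

def sdm_score(query, doc, collection, mu=2500.0, weights=(0.85, 0.10, 0.05)):
    """Interpolate unigram, ordered-bigram, and unordered-window features,
    each scored as a Dirichlet-smoothed log-probability."""
    coll_len = sum(len(d) for d in collection)

    def smoothed(doc_count, coll_count):
        cp = max(coll_count, 0.5) / coll_len  # floor avoids log(0) for unseen features
        return math.log((doc_count + mu * cp) / (len(doc) + mu))

    lt, lo, lu = weights
    score = sum(lt * smoothed(doc.count(q), sum(d.count(q) for d in collection))
                for q in query)
    for a, b in zip(query, query[1:]):
        score += lo * smoothed(count_ordered(a, b, doc),
                               sum(count_ordered(a, b, d) for d in collection))
        score += lu * smoothed(count_window(a, b, doc),
                               sum(count_window(a, b, d) for d in collection))
    return score

collection = ["the white house press briefing".split(),
              "white paint on the house".split()]
query = ["white", "house"]
# The document containing the exact phrase "white house" scores higher,
# even though both documents contain both query terms once.
```

The phrase and proximity features are what let the model separate documents that merely contain the query terms from documents that are actually about the phrase.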

Ivory is aimed at IR researchers as a platform for experimentation. This is an early release with a lot of rough edges.

Jimmy is using Ivory to index the ClueWeb09 dataset, which has 500 million English documents for the TREC web track.

SIGIR 2009 Wednesday, Day 3, Summary

I attended the Industry Day talk that Matt Cutts gave, Webspam and Adversarial IR: The Way Ahead. I have photos of some of his really interesting slides that I'll post soon.

I then was on volunteer duty helping James with registration and getting volunteer gifts ready.

I caught Matt's talk, An Improved Markov Random Field Model for Supporting Verbose Queries.

I missed the second morning industry track session. I heard the talks were quite interesting, including Nick Craswell's talk on Bing. If someone has detailed notes, please share them! I'd love to know more about what I missed.

Lunch was the ACM SIGIR business meeting. SIGIR 2012 will be in Portland, Oregon. SIGIR 2013 will be somewhere in Europe (multiple contenders) or in Israel.

I attended the Query Formulation Session.

Niranjan did a fantastic job presenting Reducing Long Queries Using Query Quality Predictors for his co-workers at Microsoft who were unable to attend.

I also enjoyed Extracting structured information from user queries with semi-supervised conditional random fields, presented by Xiao Li. It looked at the effectiveness of tagging queries, particularly for product search. I enjoyed the talk, but I think CIKM would've been a better forum for the paper; it did not use or measure retrieval, and its only connection to IR is that it used query log data.

I then went to the Web Retrieval II session. Nick Craswell gave a very good presentation of The impact of crawl policy on web search effectiveness. Anyone building a large-scale web crawler should read this paper. (On a related note you should also read IRLbot: Scaling to 6 Billion Pages and Beyond, PPT slides, which won best paper at WWW 2008).

I liked Marius Pasca's presentation of Web-Derived Resources for Web Information Retrieval: From Conceptual Hierarchies to Attribute Hierarchies, which aligned extracted attributes onto WordNet classes. They manually judged the accuracy, which sounded challenging and time-consuming. Instead, I would have liked to see an evaluation based on utility in a retrieval system, such as using the attributes in a faceted retrieval interface or validating them against query logs. This is another paper that I think would've been more appropriate for CIKM than SIGIR because it was more about extraction than retrieval.

There was a UMass CIIR Alumni reunion. I only attended part of it, but I got the chance to talk (briefly) to a great group of smart and interesting people.

The main event was the boat cruise on Boston Harbor. I have some beautiful photos of the water. It was a relaxing time with friends. The only detriment was the slight odor in part of the harbor coming from the sewage treatment plant. Yuck!

SIGIR 2009 Day 2 Summary

Here are some of my highlights from the day.

I didn't get a chance to go to the Retrieval models II session, but I heard there was some interesting work there on better modeling of term proximity. I hope to go back through those papers quite soon.

I had lunch at the Upper Crust with Jon, Matt, Andy, Elif, and a group from CMU/UMass that braved the rain. Great discussion and delicious thin-crust pizza.

I attended the Interactive Retrieval session.

I liked the Predicting User Interests from Contextual Information paper.
A few comments: 1) It was remarked that the ability to predict interests likely varies significantly across the different ODP categories. It would have been interesting to dig in further here. 2) Is ODP even being maintained anymore? I'm not sure how meaningful its categorizations really are.

I wanted to attend the paper on Effective Query Expansion for Federated Search. It looks interesting; another one to read later.

The last talk was the keynote, From Networks to Human Behavior.

That night was the conference banquet at the JFK Library and Museum, a fun venue with a great location right on the water near UMass Boston. The museum and the presentations really gave you a sense of JFK's remarkable speaking ability; many people remarked that the comparison to Obama's intelligence and rhetorical skill was easy to see. The museum ignored or minimized some of the more controversial aspects of his life and presidency, which is unsurprising and disappointing.

At the banquet, Daniel and I sat at the 'bloggers table' along with a group from Microsoft's SharePoint search team and talked at length about our blogging experiences, goals, etc. The food and setting were also very enjoyable.

It was Henry's birthday. James embarrassed him by having everyone at the banquet sing happy birthday!

Post Banquet
A group of us ended up at the Bukowski Tavern near the Sheraton. I hung out with Jeremy, Daniel, Victor, Diane, and others until the early hours of the morning. We had some really interesting conversations.

Thursday, July 23

SIGIR Social Media Workshop: Abdur Chowdhury Twitter invited talk

Abdur Chowdhury gave the invited talk, "Emergence," at the Social Media Workshop.

"...Emergence is the way complex systems and patterns arise out of a multiplicity of relatively simple interactions.."

Abdur's presentation primarily focused on Twitter Trends. It was a series of questions and illustrations to provoke discussion. You needed to be there or watch it to experience it. This post is a bit disorganized because it was hard to capture the discussion dynamics.

Twitter Trends
Last Thanksgiving he built Twitter Trends. When the word Mumbai popped up, he thought it was a bug.

What is a trend?
Examples: the Mumbai bombings, the plane crash in the Hudson, the Jakarta bombings.
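Twitter hasn't published how Trends works, but a simple baseline captures the intuition behind these examples: a trend is a term whose rate in the current window is far above its historical baseline. Here is a hypothetical burst-detection sketch; the thresholds and add-one smoothing are my own illustrative choices, not Twitter's algorithm:

```python
from collections import Counter

def detect_trends(current_window, history, min_count=5, ratio=3.0):
    """Flag terms whose rate in the current window is at least `ratio`
    times their (smoothed) historical rate, sorted by current count."""
    cur = Counter(current_window)
    hist = Counter(history)
    cur_total = max(len(current_window), 1)
    hist_total = max(len(history), 1)
    trends = [t for t, c in cur.items()
              if c >= min_count  # ignore rare terms
              and (c / cur_total) / ((hist[t] + 1) / hist_total) >= ratio]
    return sorted(trends, key=lambda t: -cur[t])

history = ["food"] * 50 + ["weather"] * 50   # steady background chatter
current = ["mumbai"] * 10 + ["food"] * 10 + ["weather"] * 5
# "mumbai" bursts relative to its baseline; "food" and "weather" do not.
```

Note that under this definition a term people discuss at a constant high volume never trends; only a sudden deviation from the baseline does.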

The ability to share information quickly and easily changes our awareness of the world and gives rise to emergent behaviors that are not yet fully understood:
  • What social, technological, and research questions emerge from this?
  • What happens when it's not one reporter, but millions of us all talking?
    Lowering the barrier for all of us to share information allows it to propagate quickly.
- He presented a movie of Twitter chatter during the Super Bowl.

(It's cool. I wonder what the bias is toward those who tweet. The Midwest is very underrepresented considering the interest in football and the Super Bowl.)

"More interesting than query logs, you can drill in and read the tweets; what people are actually saying."
  • What will be the emergent behaviors and benefits from sharing events with the world?
  • What does it mean when the water cooler is the size of the planet?
Trend spam
- Hackers band together and try to game Twitter. If you have 10k people sitting in basements, they can create a trend. It's getting harder as the traffic volume grows.

Country Perspective
Location-based trends.
There was a good question about detecting trends within a sub-community instead of just within a location... it's possible, but it just hasn't been done yet.

One of the big uses of Twitter is shared events: when you want to share an experience.
- Shared events make it in the trends list: SIGIR, Apple developer conference
Fun examples: location-restricted near: searches.

Go Marti! She brought up a very important missing topic: food! Despite the fact that people are talking about it, it's not a "trend" according to Abdur; it's background "chatter." (I disagree. You need to look at trends at a lower level in the hierarchy.)

Emergence: Interesting things happen:
FollowFriday -- a recommendation engine that emerged.

"What's it like working somewhere that simultaneously matters, and complete doesn't matter?"
  • What is the interesting research?
(Personally, I would have liked more depth on how Twitter and search interact, e.g. the use of hashtags.)