Saturday, December 29

Information Retrieval research challenges for 2008 and beyond

I am applying for PhD programs in CS for next fall. This has been consuming my nights and weekends for the past few months. When I haven't been writing research proposals (NSF GRFP), personal statements, etc... I have been reading papers and thinking about research and important challenges (hopefully) in IR and related fields.

Here are some of the papers I found inspiring in recent months, in no particular order:

Meeting of the MINDS: An Information Retrieval Research Agenda
The IR group's output from the 2007 MINDS workshop sponsored by the Retrieval Group at NIST. What the heck is MINDS? Well, as I discovered, it stands for a group of related fields in Human Language Technologies (HLT): Machine Translation, Information Retrieval, Natural Language Processing, Data Resources, and Speech.

Challenges in Information Retrieval and Language Modeling
Report of a Workshop held at the CIIR, UMass, September 2002

The Happy Searcher: Challenges in Web Information Retrieval
From Google, the researchers outline a few methods where AI approaches could potentially benefit retrieval. (2004)

On a somewhat related note, I recently replied to a post by Daniel Lemire, More CS Ph.D.s than ever, what about research jobs? where I commented briefly about some of my motivations for grad school.

To end the year here's a quote from Liz Liddy, chair of SIGIR, from the December issue of the SIGIR forum:
In closing, let me say what I so frequently share with the doctoral students in our school – what a truly opportune time for each of us to have chosen the information field. We are indeed either fortunate or brilliant to have chosen to work in this field at a time when the importance of access to information is recognized by virtually everyone as being so vital in every domain... But as we relish the current accomplishments of our field and look with anticipation to an even more exciting future, it is important for us to remember and build upon what we learn from the work of the past.
In this spirit, don't miss the executive summary from the MINDS workshop for an overview of the history of where we have been and what the future may hold.

Thursday, August 23

Powerset in the news

Continuing today's theme of catching up on Powerset news: about a month ago, the MIT Review had an article on Powerset and other NLP-based engines:
Building A Better Search Engine

The article also mentions IBM Avatar, which I have previously discussed.

Another reminder, if you missed Barney Pell's (Founder of Powerset) talk at UW in May, you can still catch the video.

Chad Walters and Patrick Tufts blogs

Today, I came across two interesting blogs that I wanted to share.

Chad Walters
Chad is the Search Architect at Powerset. His boss, Steve Newcomb, has an interview up on Steve's blog. Chad is a veteran of Yahoo Search, where he was the Lead Architect for Runtime Search under Sean Suchter (who was one of the leads on Inktomi).

Chad's blog has a great introductory article on query result and posting list caching in search engines (static versus dynamic caching). He hasn't blogged in a while, so let's hope he gets some more time!

Patrick Tufts
Patrick is an 'AI guy' working on Freebase for Metaweb (see my previous discussion). According to his blog he also invented one of the two product recommendation engines used at Amazon (cool!). Speaking of which, FreeBase just announced an open Alpha.

PowerSet Data Center Modeling

Steve Newcomb, founder and COO of Powerset, wrote a blog post about its data center model.

Steve provides the model in a flash application: Powerset Indexing Center Datacenter Dashboard

The application models several important factors in Powerset's cost model:
  • Index Size - How many servers are required to crawl and store a known portion of the Web?
  • Moore's Law - instead of modeling Moore's law as a trend line, we broke it out into its 2 components Server Speed and Server Cost
  • Lease vs. Buy - What drives a decision to lease servers versus paying cash?
  • Lease vs. EC2 - What drives a decision to lease servers versus virtual computing (e.g. EC2)?
Powerset's NLP analysis of documents during indexing is CPU intensive. In past presentations, Steve has given some rough estimates of their current indexing speed; I seem to recall it was on the order of 1 document/second.

It's common knowledge that Powerset has been using Hadoop on Amazon's EC2. They are likely using this to get a jump start on building scaled data before they have their own cluster in place.

The application is interesting to play with; give it a try.

Thursday, August 9

Data driven decision making and controlled experiments

Most of the major web companies use controlled experiments to test new features and products on their website.

First, if you want to learn how these work you should read a recent paper published by Ron Kohavi from Microsoft:
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO from SIGKDD 2007 (Knowledge Discovery and Data Mining), which is happening next week in San Jose. Essentially, he provides guidance on running experiments based on his experience running controlled tests at Amazon and Microsoft.

First, I have to say that creating good tests is HARD. One of the hardest parts in testing is creating what Ron calls the Overall Evaluation Criterion (OEC) that maximizes not only short term profits, but also long term profits and customer satisfaction. The long-term impact is often overlooked in favor of short-term gain.

For example, spamming a customer with lots of promotions now may be great for short term sales, but in the long term it is not the best decision because of its longer term effect of increasing user unsubscribe rates thus limiting the scope of possible future promotions. All too often it is hard for profit-driven companies to take a long view and think beyond this quarter's (or if you are lucky, this year's) bottom line.

The authors have some good advice: if you don't get the desired results, you should drill down and look at many metrics, slicing and dicing by user segment, to understand what happened in more detail and learn from what didn't work. The devil is in these oft-overlooked details.
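To make the mechanics a little more concrete, here is a minimal sketch of how a single metric might be compared between a control and a treatment group. It uses a standard two-proportion z-test, which is my choice for illustration and not necessarily what the paper or any particular company uses; the function name and the counts are made up.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 10,000 users in each variant.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The same check can be repeated per user segment when drilling down, keeping in mind that the more slices you test, the more likely one of them looks significant by chance.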

Erik Selberg raises this issue when he comments on the 'data driven decision process.' He writes,
A data-driven approach has to start with the right question, followed by experiments that provide data that is properly interpreted to provide the answer. Typically, people fail in either starting out with the wrong question, or by conducting poor experiments that produce flawed data. An evil twin of flawed data is what I call Executive Data Bias. A decision-maker will have a certain bias on what to do, and is looking for data to back up that decision. Thus, flawed data that backs up the decision is accepted without much probing, while good data and the implications are rejected, typically by asking for “more experiments” or “more data,” or questioning assumptions made in the experiment or question.
He goes on to say that it can work, but it must be done very very carefully.

I sometimes hear, 'we tested that, it didn't work', but too often when I ask why not, and what specifically didn't work, I don't get satisfactory answers. This incomplete test data is then used to dismiss projects that might still hold great potential, if only we understood in more detail what failed in the test that was run. Unfortunately, more often than I would like, I find myself in Erik's camp.

Hopefully, Ronny's team at MS and others will help all of us become more educated on this important topic.

Erik Selberg's SIGIR 2007 wrap-up

Erik, a senior developer at MSN search labs, has his wrap-up of SIGIR.

Definitely worth reading the papers on his list.

Monday, August 6

Semantic Web Progress: Radar Networks and MetaWeb

The recent news on Hypertext Solutions prompted me to check the status of two other Semantic Web based startups, MetaWeb and Radar Networks.

Radar Networks
First, a recent Business 2.0 article: Web 3.0: No Humans Required, finally provides some details on the application they are building:

The first consumer app Radar plans to launch is a sort of personal data organizer. It will allow you to bring in e-mail, contacts, photos, video, music --anything digital, really -- from anywhere on the Web, turn it into RDF, and access it in one place.

Semantic tags are added manually, or automatically if the item is a photo from Flickr or a video from YouTube. "We add a new level of order to connect and interact with these things at a higher level than is possible today," Spivack says. "We are letting you build a little semantic Web for your project, your group, or your interest."... When it's done, it should be like the best wiki you've ever used.

There is also a less thorough (and less accurate) BusinessWeek article. In the comments founder Nova Spivack clarifies:
Radar Networks is actually combining human and machine intelligence, leveraging social networks and user-generated content as well as artificial intelligence. We're not attempting to overlay a lot of new structure on the Web. We're actually trying to make sense of the structure that is already there. By combining the semantic web with social networks, a more powerful level of collective intelligence can be achieved. Our focus is not only on organizing information but also in helping people collaborate more intelligently around interests and activities. We'll be sharing more as we head towards our beta launch in the fall of 2008.
FreeBase (by MetaWeb)
FreeBase is a centralized, open, database of semantically tagged data. “We’re trying to create the world’s database, with all of the world’s information,” says founder Danny Hillis.

There is a recent IT Conversations interview with MetaWeb co-founder Robert Cook. Apart from that, there has been little news on FreeBase or MetaWeb in several months, but you can still read the March NY Times article.

It is exciting to see growth in this area, evident in the phenomenal growth of mashups as people start using RSS, XML technologies, and microformats to blaze the first trails connecting structured data across services. I will conclude with a quote from Tom Coates of Yahoo, from the Business 2.0 article, that captures it nicely:
"It's in the combination that the real power of this comes out. The mashup is an early example of the Web that is to come...The goal is the most important thing: reusable, repurposable, and reconnectable data. How we get there is not as important."
Yahoo Pipes is a platform for creating mashups; it is leading the way in this next phase of this development.

Whether specific technologies (RDF, OWL, etc...) are adopted is inconsequential in the long run. The promise of a web of data instead of mere HTML that can be combined and cross-referenced holds too much opportunity to go unrealized.

Hypertext Solutions: Another NLP based search startup

Hypertext Solutions is a stealth-mode company working on a next-generation 'web 3.0' (search?) platform. It recently purchased Insightful's InFact NLP text analytics platform for 3.65 million dollars, see the press release. As part of the transfer, Hypertext is getting Insightful's Director of Engineering for Text Analysis and Search, Deep Dhillon, see the post on his blog. According to his LinkedIn profile, Dhillon was the:
Project and Technical Lead for the design, development and deployment of InFact, a scalable natural language based next generation search engine.
The InFact product page, now defunct, says:
Advanced entity and relationship extraction capabilities enable researchers and analysts to quickly uncover the critical facts and trends contained within extensive collections of unstructured textual information. Results are sorted and summarized according to user parameters, with hidden relationships between people, places, objects, and organizations highlighted.
HyperText Solutions is one of several semantic search startups including PowerSet, Hakia, and Spock.

Sunday, August 5

SIGIR 2007 coverage including Karen Spärck Jones video

I lamented last week on the lack of SIGIR coverage on blogs. However, some is finally beginning to surface.

Andre Vellino has a brief overview, including the best paper awards:

Best Student Paper
Relaxed Online SVMs for Spam Filtering by D. Sculley, G. M. Wachman (Tufts University)
Best Paper
Neighborhood Restrictions in Geographic IR

Karen Spärck Jones Address Video
In other coverage, SIGIR has posted links to Karen Spärck Jones' acceptance of the ACM Athena Award on the University of Cambridge's website (as well as the ACM Portal). It is 33 minutes long and was recorded not long before her death.

Thursday, August 2

IR Courses Updated

I updated my list of recent search (IR) graduate classes.

Recent Graduate Classes in Information Retrieval

I added a class that Brian Davison taught this past spring:

CSE345/445 (Spring 2007) - WWW Search Engines Algorithms, Architectures and Implementations by Brian Davison at Lehigh University. Brian is one of the creators of DiscoWeb, later to become Teoma which was acquired by Ask. He also chairs the AIRWeb (Spam) workshop.

Of particular interest is the guest lecture by Marc Najork from Microsoft Research on the effectiveness of link analysis algorithms, Comparing the Effectiveness of Different Scoring Functions for Web Search. Marc also presented the research this past week at SIGIR, HITS on the Web: How does it Compare?. His interesting conclusion is that, when combined with the BM25F scoring algorithm, a simple inter-domain in-link count is more effective than the default PageRank algorithm and just as effective as the expensive HITS algorithm. It's interesting to see such a simple algorithm prevail here when Google continuously touts PageRank.
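For a sense of how simple that feature is, here is a rough sketch of an inter-domain in-link count. It assumes the link graph is available as (source URL, target URL) pairs and treats the URL host as the "domain"; both are simplifications of mine, and the paper's exact definition may differ.

```python
from collections import defaultdict
from urllib.parse import urlparse

def inter_domain_inlink_counts(edges):
    """Count, for each target URL, the distinct pages on other hosts linking to it.

    edges: iterable of (source_url, target_url) pairs.
    """
    sources = defaultdict(set)
    for src, dst in edges:
        # Ignore links between pages on the same host.
        if urlparse(src).netloc != urlparse(dst).netloc:
            sources[dst].add(src)
    return {url: len(srcs) for url, srcs in sources.items()}

# Hypothetical link graph.
edges = [
    ("http://a.com/1", "http://b.com/x"),
    ("http://a.com/2", "http://b.com/x"),
    ("http://b.com/y", "http://b.com/x"),  # same host, not counted
    ("http://c.org/", "http://a.com/1"),
]
print(inter_domain_inlink_counts(edges))
```

Unlike PageRank or HITS, this needs only one pass over the links and no iteration, which is part of what makes the result surprising.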

Wednesday, August 1

Conference Season, Digger, and GUESS

It's conference season. Both AAAI 2007 and SIGIR 2007 were last week. However, coverage on blogs (at least that I can find) has been sparse. At least there will be proceedings. Sadly, I wasn't able to attend either this year.

Also, programs for two upcoming conferences have been posted: the ACM Recommender conference (courtesy Paul Lamere) and SIGKDD (courtesy Matthew Hurst).

Speaking of Matthew, he has a great post on visualization software. Good software to visualize graphs is hard to find, and Matthew writes his own custom solutions. However, for beginners he recommends GUESS, created by Eytan Adar, a graduate student at the University of Washington.

One exception to the lack of conference coverage is Matthew Hurst's post on Digger from AAAI. I was accepted as a beta tester, so I'll try to find time to write something up after I play with it more. In the meantime, here is some background.

Digger was founded by Timothy Musgrove, formerly a Senior Research Fellow at CNet focusing on AI. Related to Digger, Tim published a paper on semantic query expansion at the Modeling and Retrieval of Context (MRC) Workshop of the International Joint Conference on Artificial Intelligence (IJCAI) in 2005 entitled Representing the context of equivalent query words as a means of preserving search precision, and here is the presentation. In it, he lays out a method of query expansion using WordNet, but with a twist that filters potential synonyms to ones that co-occur with other words in the query when examining the results from an initial retrieval (with a final post-processing step to remove proper names).
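Here is a minimal sketch of that filtering idea as I understand it from the description above, not the paper's actual implementation. The synonym table is a hypothetical stand-in for WordNet, documents are plain strings split on whitespace, and the proper-name post-processing step is left out.

```python
def expand_query(query_terms, synonyms, initial_results, min_docs=1):
    """Keep a candidate synonym only if it co-occurs with another query term
    in at least min_docs of the initially retrieved documents."""
    docs = [set(text.lower().split()) for text in initial_results]
    expansions = {}
    for term in query_terms:
        others = [t for t in query_terms if t != term]
        kept = []
        for candidate in synonyms.get(term, []):
            # Count documents containing the candidate plus some other query word.
            support = sum(
                1 for words in docs
                if candidate in words and any(o in words for o in others)
            )
            if support >= min_docs:
                kept.append(candidate)
        expansions[term] = kept
    return expansions

# Hypothetical synonym table and initial retrieval results.
synonyms = {"bank": ["depository", "riverbank"]}
results = [
    "the depository approved the loan quickly",
    "fishing from the riverbank at dawn",
]
print(expand_query(["bank", "loan"], synonyms, results))
```

In this toy example "depository" survives because it appears alongside "loan", while "riverbank" is dropped because it never co-occurs with the rest of the query.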

It's an interesting piece of work and I look forward to seeing Digger mature.

Wednesday, July 25

Google's Semantic Unit Locator

Bill Slawski over at SEO-By-The-SEA posted an article today on Google's recent patent on "Semantic Units," also known in the NLP world as Multiword Expressions (MWE). The patent lays out a system for identifying useful phrases ('compounds') based on queries and the documents retrieved during the search.

Background: Finding Meaningful phrases
First, the authors lay out the problems with finding meaningful phrases from web documents and query logs independently:
The disadvantage with this [document] approach is that it is inefficient, because there are many more compounds in the corpus than would typically occur in user queries. Thus, only a small fraction of the detected compounds are useful in practice... Identifying all compounds on the web is computationally difficult and would require considerable amounts of storage.
However, the query log data is also problematic:
A disadvantage associated with finding compounds in query logs using statistical techniques is that word sequences occurring in query logs may not correspond to compounds in the documents. This is because queries, especially on the web, tend to be abbreviated forms of natural language sequences.
It's clear it will be some combination of the two. The key is that it is contextual based on the relevant documents returned for the query.
For example, the queries "country western mp3" and "leaving the old country western migration" both have the words "country" and "western" next to each other. Only for the first query, however, is "country western" a representative compound. Segmenting such queries correctly requires some understanding of the meaning of the query. In the second query, the compound "western migration" is more appropriate, although it occurs less frequently in general.
Google Semantic Unit Locator
The method includes generating a list of relevant documents based on individual search terms of the query and identifying a subset of documents that are the most relevant documents from the list of relevant documents. Substrings are identified for the query and a value related to the portion of the subset of documents that contains the substring is generated. Semantic units are selected from the generated substrings based on the calculated values. Finally, the list of relevant documents is refined based on the semantic units.
User runs a query for "leaving the old country western migration"
  1. Generate the list of all phrases > length 1 from the user's query
    "leaving the," "leaving the old," "leaving the old country," "leaving the old country western,", etc...
  2. For the top k (say 30) documents, a fraction is calculated based on how many documents each phrase occurs in, for example "leaving the" is in 15 documents, and so FRAC = 15/30 = 1/2. This may be biased so that higher ranking documents have more weight.
  3. Select the semantic units. First, remove phrases where FRAC is below a threshold, say .25. Next, remove phrases that are subsumed by longer phrases and phrases that overlap with higher scoring phrases. This leaves "the old country" and "western migration," along with the single search term "leaving." In some cases stop words such as "the", etc... may be removed.
  4. Refine ranking of the originally retrieved results using the discovered meaningful phrases.
This semantic unit identification can be saved, or even computed offline based on query logs, and used for related queries in the future.
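Steps 1 through 3 can be sketched roughly as follows. This is my own reading of the steps, not code from the patent: documents are plain strings matched on whitespace-separated words, the fraction is unweighted (no rank bias), and overlaps are resolved greedily in favor of the higher-scoring phrase. The sample documents are made up.

```python
def semantic_units(query, top_docs, threshold=0.25):
    """Return (multi-word units, leftover single terms) for a query."""
    tokens = query.lower().split()
    n = len(tokens)
    if not top_docs:
        return [], tokens
    # Pad with spaces so phrases only match on whole-word boundaries.
    docs = [" " + " ".join(d.lower().split()) + " " for d in top_docs]

    # Step 1: every contiguous phrase of two or more words, as a token span.
    spans = [(i, j) for i in range(n) for j in range(i + 2, n + 1)]

    # Step 2: fraction of the top documents that contain each phrase.
    frac = {}
    for i, j in spans:
        phrase = " " + " ".join(tokens[i:j]) + " "
        frac[(i, j)] = sum(phrase in d for d in docs) / len(docs)

    # Step 3a: drop phrases whose fraction is below the threshold.
    kept = [s for s in spans if frac[s] >= threshold]
    # Step 3b: drop phrases contained in a longer surviving phrase.
    kept = [s for s in kept
            if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in kept)]
    # Step 3c: keep the highest-scoring phrases that do not overlap each other.
    chosen = []
    for s in sorted(kept, key=lambda s: frac[s], reverse=True):
        if all(s[1] <= c[0] or c[1] <= s[0] for c in chosen):
            chosen.append(s)
    chosen.sort()

    units = [" ".join(tokens[i:j]) for i, j in chosen]
    covered = {k for i, j in chosen for k in range(i, j)}
    singles = [tokens[k] for k in range(n) if k not in covered]
    return units, singles

# Hypothetical top-ranked documents for the example query.
docs = [
    "stories of the old country and the western migration of families",
    "the old country was left behind during western migration",
    "western migration from the old country",
    "country western music history",
]
print(semantic_units("leaving the old country western migration", docs))
```

With these four documents, "country western" clears the threshold but loses to the higher-scoring "the old country" and "western migration" that it overlaps, leaving "leaving" as a single term. Step 4, the re-ranking, is not shown.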

The Economist features Globalspec

A few weeks ago the Economist featured a story on topic-specific search engines, entitled Vertical search-engines, Know your subject. Globalspec is the leading example from the story:
GlobalSpec, for example, a profitable search-engine for engineers, has 3.5m registered users and signs up another 20,000 each week. “They own that market,” says Charlene Li of Forrester, a consultancy.
It's great to see Globalspec getting well-deserved recognition for its hard work over a period of a decade of helping Engineers to build products and inventions that change the world.

The Economist goes on to feature health as an emerging topic area for vertical search, featuring MedStory, Healia, Healthline, and Mamma Health.

The real challenge for specialized search engines is that most users still use Google for most of their search activity and it works 'good enough' for their specialized searches. As the story writes:
... a vertical search-engine that successfully pairs a broad target market with a complicated topic can do well... But that will mean getting consumers to kick their existing search habits. A study by the Pew Internet & American Life Project, a non-profit research group, found that two-thirds of Americans researching health-related topics online started with a general search-engine. Only 27% went on to a medical site of any kind, let alone a health-search site. “The path to general search engines is well-worn and familiar,” says Susannah Fox of Pew.
Yahoo shortcuts and Google Base integration with general search engines may be enough to spell the demise of weaker vertical engines that do not continue to differentiate themselves with significantly more relevant and comprehensive coverage of their specialty.

The article concludes with three options for vertical search engines: domination in a topic, death by Google (Base), and acquisition by GYM or other large media companies seeking to expand into new media.

Tuesday, July 17

RecipeComun - Recipe Search 3.0

RecipeComun is a search engine for people passionate about food and cooking. It searches the best recipes on the web from a single place. And our (faceted search) filtering technology means you no longer have to search on multiple websites in order to filter your results to your favorite recipe site.

RecipeComun is a semantic search engine. Its detailed understanding of recipe DNA: ingredients, author, user ratings, etc... allows it to provide better results and a richer experience than other search engines. See for yourself:

RecipeComun - Recipe Search Engine

Today is our first beta release, so give it a try and let me know what you think. Over the coming weeks we have a slew of improvements and fixes planned to improve both the functionality and content in our recipe search index. If you have a favorite recipe site you would like to see added, e-mail me and we'll add it to our list.

I hope you will now understand why my blogging has been a bit lax in recent weeks.

I'll keep you posted!

Friday, June 1

IBM Avatar: Combining Structured and Unstructured Data

I came across the blog of Anant Jhingran, a Distinguished Engineer and CTO of IBM's Information Management Division via Alon Halevy. In the process I learned about what IBM is doing with structured and unstructured information. Two projects that he highlighted were Avatar out of IBM Almaden and TAKMI out of IBM Tokyo.


From the website:
The goal of the Avatar project is two fold: (i) to enable the discovery and extraction of structured information buried in volumes of unstructured text (such as emails, web pages, and blogs), and (ii) to exploit this information to drive the next generation of search and business intelligence applications.
Anant describes the project in his post on Semantics:
What Avatar does is that it looks at a corpus of documents. And based on the analysis of documents, and knowing that there are only 6 different ways (oki, i am making it up, but you get the idea) in which people give their phone numbers in email ("Anant Jhingran, 408-xxx-xxxx:, "you can reach me at 408-xxx-xxx", "call me at 408-xxx-xxxx", ...) one can build the regular expression patterns, and voila, without any deep natural language processing, or building an OWL ontology, one can reliably derive people's phone numbers. I am grossly simplifying it, but you get the idea...
Avatar has three main components:
  1. Information Extraction System - Its purpose is to allow relatively unsophisticated users to build rule-based document annotators that can operate over very large corpora.

  2. Semantic Search - This takes a user's keyword search and transforms it into a structured query over the extracted concepts using statistical analysis (see the related paper Web-scale Data Integration: You can only afford to Pay As You Go ).

  3. Managing Uncertainty and Probabilistic Databases - The extracted annotations have a probability of being correct based on the precision of the extraction rules. Having a system that can deal with this uncertainty during querying and processing can improve the performance of the system.
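To illustrate how the first and third components fit together, here is a toy rule-based annotator in the spirit of Anant's phone number example. The rules, rule names, and precision values are all invented for illustration and have nothing to do with Avatar's actual rule language; each extracted number simply carries the assumed precision of the best rule that matched it.

```python
import re

# Hypothetical rules: (name, pattern, assumed precision of the rule).
PHONE_RULES = [
    ("reach_me", re.compile(r"reach me at (\d{3}-\d{3}-\d{4})"), 0.95),
    ("call_me", re.compile(r"call me at (\d{3}-\d{3}-\d{4})"), 0.95),
    ("bare_number", re.compile(r"\b(\d{3}-\d{3}-\d{4})\b"), 0.70),
]

def annotate_phones(text):
    """Return {phone number: confidence}, keeping the most precise rule that fired."""
    found = {}
    for _name, pattern, precision in PHONE_RULES:
        for match in pattern.finditer(text):
            number = match.group(1)
            found[number] = max(found.get(number, 0.0), precision)
    return found

print(annotate_phones("You can reach me at 408-555-0100, or try 650-555-0199."))
```

A downstream query could then filter or rank on the confidence value, which is a very rough picture of what managing uncertainty means here.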
TAKMI (Text Analysis and Knowledge Mining)
There are fewer details available on the project, but it says:
Although TAKMI was originally created for analyzing call center logs, it can be applicable for any type of large text data in general. In particular, we have offered a medical version of TAKMI system (called MedTAKMI) for analyzing medical publications.

Thursday, May 24

HBase: Powerset's BigTable

Jim Kellerman, a senior engineer at Powerset, has started an open source version of Google's BigTable, called HBase.

From the HBase wiki:
Design (and subsequently implement) a structured storage system as similar to Google's Bigtable as possible for the Hadoop environment. Both Google's Google File System and Hadoop's HDFS provide a mechanism to reliably store large amounts of data. However, there is not really a mechanism for organizing the data and accessing only the parts that are of interest to a particular application... Bigtable (and Hbase) provide a means for organizing and efficiently accessing these large data sets.
Current status:

As of this writing, there is just shy of 9000 lines of code in "src/contrib/hbase/src/java/org/apache/hadoop/hbase/" directory on the Hadoop SVN trunk.

There are also about 2500 lines of test cases.

All of the single-machine operations (safe-committing, merging, splitting, versioning, flushing, compacting, log-recovery) are complete, have been tested, and seem to work great.

The multi-machine stuff (the HMaster, the HRegionServer, and the HClient) are in the process of being debugged. And work is in progress to create scripts that will launch the HMaster and HRegionServer on a Hadoop cluster.

Jim became a committer on Hadoop (Doug Cutting and Yahoo's open source Map-Reduce framework) last week - congratulations!

Hopefully, it won't be long before HBase and the related Google clones mature and we have robust, open source Java implementations of much-needed infrastructure: GFS (HDFS), Map-Reduce (Hadoop), Sawzall (Pig, see my previous discussion), and BigTable (HBase). And we can thank Yahoo, Powerset, and the other supporters. Keep up the good work!

Wednesday, May 16

Behind Universal Search: Advanced Query Routing and Heterogeneous Result Ranking

Unveiling 'Universal Search'
The new Google 'Universal Search' brings all the content types under one roof, or at least more seamlessly blended into one set of search results. Marissa Mayer describes the change:
The ultimate goal of universal search is to break down the silos of information that exist on the web and provide the very best answer every time a user enters a query. While we still have a long way to go, today’s announcements are a big step in that direction... Google’s vision for universal search is to ultimately search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results that offers users precisely what they are looking for.
Searching across all of the different heterogeneous content types and ranking all of the results in real time is hard and expensive.

You can read more in Google's blog posts: Behind the scenes with universal search and Universal search: The best answer is still the best answer. Other coverage on Search Engine Land: Google 2.0: Google Universal Search.

Query Routing and Heterogeneous Result Ranking
So, how do they do that? Well, I don't have exact answers, but there are some clues. A good place to start looking is recent papers by University of Washington alumni (and now Googlers) Alon Halevy, Jayant Madhavan, and company on integrating web search with Google Base:

Web-scale Data Integration: You can only afford to Pay As You Go
See specifically sections 3.1 and 3.2 for how Google begins to go about performing the query mapping, heterogeneous result ranking, source ranking, and generation of structured queries from unstructured queries (e.g. Britney Spears is a person, a musician, and a performer, so music and video results may be relevant). Much of this computation can be done off-line for common queries, and feedback on relevance can be collected from user behavior:
The prime example of implicit feedback is to record which of the answers presented to the user are clicked on. Such measurements, in aggregate, offer significant feedback to our source selection and ranking algorithm. Observing which sources are viewed in conjunction with each other also offers feedback on whether sources are related or not. Finally, when the user selects a refinement of a given query, we obtain feedback on our query structuring mechanisms.
For more on Query Routing and integrating results from heterogeneous sources see also the following:

Query Routing for Web Search Engines: Architecture and Experiments also at UW by Oren Etzioni.

Alon Halevy and others working on Google Base also talk briefly about these problems in: Structured Data Meets the Web: A Few Observations; see Section 3, Integrating Structured and Unstructured Data. As an introduction:
The reality of web search characteristics dictates the following principle to any project that tries to leverage structured data in web search: querying structured data and presenting answers based on structured data must be seamlessly integrated into traditional web search.
Sounds like Universal Search to me. See my previous post Integrating a Database of Everything with Web Search for more details on that paper.

The above papers don't speak to Universal Search directly; they mostly relate to selecting different types of objects from Google Base. However, the same principles can be applied to integrating other kinds of heterogeneous data with web content as well.

Data 'Silos' Continue to Abound
It is worth noting that more and more organizations are finding themselves with a similar problem: lots of data silos to search and heterogeneous data to rank. For example, at Globalspec we have the same problem as Google. In fact, we just unveiled a new silo today: PartFinder, for part number search. At Globalspec we have a multitude of different content types mixing structured and unstructured content. We have our 'content petals': the Engineering Web, parts and services, engineering news, patents, material properties, etc. Figuring out what content is most appropriate for a given query is an important, non-trivial task.

Pushing towards true 'Universal Search' is a grand vision and it won't happen overnight. Google is taking steps in the right direction with this latest update and I'm sure other search engines will follow.

Wednesday, May 9

Octopart and SupplyFrame: Part Search Engines

Octopart is a new part search engine funded by Y Combinator.

It was started by two UC Berkeley physics grad student drop-outs, Sam and Andres. Octopart aggregates part data from major distributors: Newark InOne, Digi-Key, Allied Electronics, and Mouser (more to be added). It allows wildcard (very useful for part numbers), phrase, and boolean searches. The search results include pricing and availability comparison across the distributors, product images, product specs, and part data sheets. The UI is very reminiscent of Google. They get daily part feeds from Newark to keep their availability fresh. The engine is written in Python.

TechCrunch has a write-up on them.

Overall, a very nice start. We'll see how they evolve and add features.

Another competitor in this field is SupplyFrame. SupplyFrame is more tightly integrated into the buying process, with tools (including desktop integration) for RFQ handling, Bill of Materials, etc., and is geared more towards large-scale business buyers.

From a recent press release:
SupplyFrame takes component searching one step further by giving users the ability to easily create and manage lists of parts. With SupplyFrame interactive quoting tools, buyers and engineers can run parts lists or complete Bills of Materials through a full quoting cycle with any suppliers in the world.
Another feature it has is pricing and lead-time trends for parts (the data still looks pretty sparse). See the trends in a search for: SN74HC14N.

There are other part search engines (ChipIndex, FindChips, etc...), but these are the two new contenders.

Disclaimer: my employer, Globalspec, is a part and components search engine.

Search Innovations Article on R/W Web

Read/Write Web has an article on the Top 17 Search Innovations Outside of Google. The article is broken up into 17 areas of innovation.

Globalspec is mentioned under number 7, parametric search:
GlobalSpec lets you specify a variety of parameters when searching for Engineering components (e.g. check out the parameters when searching for an industrial pipe). Parametric search is a natural feature for Vertical Search engines...Google has already incorporated this feature at a general level - such as the parameters on the Advanced Search page - but that waters down its usefulness. The most powerful use of this feature happens when additional parameters become available as you drill down further into standard search results or when you constrain the search to specific verticals.
Good press is always nice :-).

Parametric search (database-like) and Google-like text search are converging. Google is adding faceted search via Google Base, and the traditional SQL-like parametric engines are adding better full-text search. The goal is seamless integration between the two (a primary goal of faceted search). Faceted search engines like Ebay Express are leading the way in this area.

The full list of the 17 areas:
  1. Natural language processing
  2. Personalization
  3. Canned, specialized searches (I'm not too sure about this one)
  4. New content types (Video, Images, Audio, etc...)
  5. Restricted data sources (aka 'custom' search engines like Google Co-op and Rollyo)
  6. Domain-specific search
  7. Parametric search
  8. Social search (Delicious, Stumbleupon, etc...)
  9. Human input (aka answers, Yahoo Answers)
  10. Semantic search (I'm not sure I agree, Spock and ZoomInfo, information extraction)
  11. Discovery support (aka recommendation systems)
  12. Classification, Tag clouds, and clustering (Is this innovation? Google uses clustering behind the scenes... I don't think users need or want this.)
  13. Results Visualization (new ways to display search results)
  14. Results refinement and filters (I'm not sure this is big either, it borders on faceted search the way it is described).
  15. Results Platforms (search result APIs? This is a bit strange. The closest example is the Alexa Web Platform).
  16. Related Services (I'm not sure I understand this..., not well defined)
  17. Search Agents
Overall, not a bad list, especially the top of the list. It seemed like they were stretching a bit towards the end...

Monday, May 7

SIGIR 2007 workshops and Learning to Rank for IR

The SIGIR 2007 (in Amsterdam) workshops were announced last Thursday.

Of particular interest is the Learning to Rank for Information Retrieval workshop. Papers are due by June 8th. From the description:

The task of "learning to rank" has emerged as an active and growing area of research both in information retrieval and machine learning. The goal is to design and apply methods to automatically learn a function from training data, such that the function can sort objects (e.g., documents) according to their degrees of relevance, preference, or importance as defined in a specific application.

The relevance of this task for IR is without question, because many IR problems are by nature ranking problems. Improved algorithms for learning ranking functions promise improved retrieval quality and less of a need for manual parameter adaptation. In this way, many IR technologies can be potentially enhanced by using learning to rank techniques.
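To give a feel for what "learn a function from training data, such that the function can sort objects" means, here is a minimal sketch of one of the simplest pairwise approaches: a perceptron trained on preference pairs. This is my own illustration, not a method from the workshop or a LETOR baseline; the class name `PairwiseRanker` and the two-feature toy data are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class PairwiseRanker {
    private final double[] weights;

    PairwiseRanker(int numFeatures) {
        this.weights = new double[numFeatures];
    }

    /** Linear scoring function: dot product of the learned weights and the features. */
    double score(double[] features) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    /**
     * Perceptron-style training on preference pairs.
     * Each pair is {preferred, other}: the first vector should outscore the second.
     */
    void train(List<double[][]> pairs, int epochs) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (double[][] pair : pairs) {
                double[] preferred = pair[0];
                double[] other = pair[1];
                // Update only when the current weights order the pair wrongly (or tie).
                if (score(preferred) - score(other) <= 0) {
                    for (int i = 0; i < weights.length; i++) {
                        weights[i] += preferred[i] - other[i];
                    }
                }
            }
        }
    }

    /** Returns document indices sorted by descending learned score. */
    List<Integer> rank(List<double[]> docs) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            order.add(i);
        }
        order.sort((a, b) -> Double.compare(score(docs.get(b)), score(docs.get(a))));
        return order;
    }

    public static void main(String[] args) {
        // Toy features, e.g. {term match, some noisy signal}; values are made up.
        double[] a = {3, 1}, b = {1, 2}, c = {2, 0}, d = {0, 3};
        List<double[][]> pairs = new ArrayList<>();
        pairs.add(new double[][] {a, b});
        pairs.add(new double[][] {c, d});
        PairwiseRanker ranker = new PairwiseRanker(2);
        ranker.train(pairs, 10);
        List<double[]> docs = new ArrayList<>();
        docs.add(a); docs.add(b); docs.add(c); docs.add(d);
        System.out.println(ranker.rank(docs));
    }
}
```

The perceptron update only converges when the preferences are linearly separable; the learning-to-rank literature is largely about doing better than this on real, noisy relevance judgments.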

A major theme at the workshop will, of course, be LETOR, MSR Asia's collection of datasets for comparing these types of machine-learning-based ranking systems. See my previous post on LETOR.

The LETOR website now has some critical bug fixes posted on the first version and a formal release is planned for the end of the month (according to the website).

Sunday, May 6

Powerset and Natural Language Search at UW

Barney Pell, from Powerset, spoke at UW last week on Powerset and Natural Language Search. From the abstract:

In this talk, we discuss the concept of natural language search. Central to this is a new user experience, in which users express queries in natural language and the system responses respect the linguistic information in the query.

To realize this vision at broad scope and scale will require advances in a variety of technology areas, including natural language processing, information extraction, knowledge representation, and large-scale search indexing and retrieval systems.

In addition, it will require innovations in user interface. Issues include changing user behavior, educating users about the affordances and constraints of the technology, supporting users in formulating effective queries, and managing expectations.

Hopefully, the video will be online soon at UWTV.

Also in the same lecture series, Raghu Ramakrishnan from Yahoo Research gave an interesting lecture in February: Community Systems: The World Online.

Is Relevance Relevant?

Elizabeth van Couvering, a PhD student at the London School of Economics, recently published a paper: Is Relevance Relevant? Market, Science, and War: Discourses of Search Engine Quality. From her abstract:
The evidence presented here suggests that resources in search engine development are overwhelmingly allocated on the basis of market factors or scientific/technological concerns. Fairness and representativeness, core elements of the journalists' definition of quality media content, are not key determiners of search engine quality in the minds of search engine producers. Rather, alternative standards of quality, such as customer satisfaction and relevance, mean that tactics to silence or promote certain websites or site owners (such as blacklisting, whitelisting, and index "cleaning") are seen as unproblematic.

There is good discussion of her article on John Battelle's post on the topic, including follow-up with Matt Cutts from Google and from Elizabeth herself.

Oshoma Momoh, formerly a GM of MSN Search, also commented on the article on his blog.

Ms. van Couvering also previously published Web Behavior: Search Engines in Context.

Friday, May 4

Open Source Scraping (Wrapper Generation) Tools

Web information extraction is also sometimes referred to as 'screen scraping' or 'web scraping': converting the unstructured or semi-structured web content intended for human consumption into structured data suitable for computers.

Using a few simple tools it is easy to create wrappers that reliably extract structured content from semi-structured HTML web pages.

First, there is the open source project Web-Harvest.
Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.
Another similar project is JScrape (Alpha). JScrape is very similar to Web-Harvest in technique and technology. From the website:
JScrape using the HttpClient API to get an input stream to a web page, then using the TagSoup API to turn the HTML into an acceptable DOM object and then from their saxon is used to apply the XQuery.
Simple Alternative
Alternatively, it is easy to roll your own XML-based extractor. You will need an HTML-to-XHTML converter such as NekoHtml or TagSoup (see my post last year on these parsers) and an XPath/XQuery engine such as XOM or JDom.

Here is a quick example using XOM and TagSoup. You will need: Java JDK, XOM 1.1, and TagSoup 1.1. A simple data extractor:

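import java.io.ByteArrayInputStream;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.XPathContext;
import org.xml.sax.XMLReader;

// Set up the HTML-to-XML parser (TagSoup's Parser implements the SAX XMLReader interface).
XMLReader htmlToXmlParser = new org.ccil.cowan.tagsoup.Parser();
// Enable the SAX feature you need; the feature URI is left empty here.
htmlToXmlParser.setFeature("", true);
// Bind the "html" prefix used in the query below to a namespace URI (also left empty here).
XPathContext xpathContext = new XPathContext("html", "");

// Build / parse the HTML document (here represented as bytes from a String).
Document doc = new Builder(htmlToXmlParser).build(new ByteArrayInputStream(bytes));

// Query the document using an XPath expression.
Nodes nodes = doc.query("//html:span[@class='headline1']", xpathContext);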

You may need some help getting started with XQuery / XPath expressions. A great way to start is to download the freely available first chapter of XQuery: A Guided Tour.
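If you want to experiment with XPath expressions before pulling in XOM and TagSoup, the same kind of query can be tried with the XPath support that ships in the JDK. The sketch below is an illustration under a big assumption: the input is already well-formed XHTML with no DOCTYPE and no default namespace, so there is no tag-soup cleanup step and no namespace prefix in the query. The class name `HeadlineExtractor` is made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class HeadlineExtractor {
    /** Returns the trimmed text of every span with class 'headline1' in a well-formed XHTML string. */
    static List<String> headlines(String xhtml) {
        try {
            // The default factory is not namespace-aware, so the XPath needs no prefix.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//span[@class='headline1']", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Malformed input ends up here; real-world HTML needs a tidying parser first.
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<span class='headline1'>First</span>"
                + "<p><span class='headline1'> Second </span></p>"
                + "<span class='other'>ignored</span>"
                + "</body></html>";
        System.out.println(headlines(page));
    }
}
```

On real pages the strict XML parser will usually fail, which is exactly why the TagSoup or NekoHtml step matters.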

That's all there is to it, at least for simple wrappers. The hard part is scale and maintenance of large numbers of wrappers over time... and there are some commercial engines that help to manage this. More on these in a future post.

Wednesday, May 2

Live Product Search Images Follow-up

I posted last week on MSN's update to their product search.

Ling Bao from the product search team responded to my comments on the post:
Jeff, you make a good observation. We've verified your queries and all the offers without images are from Product Upload Beta feeds where the merchant has blocked our image bot. In terms of the categories that have more images, this is heavily skewed by what merchants are uploading.

Additionally, I think the big difference between our numbers is due to two reasons. Part of it is because of sample size. The other cause is that we're getting more feeds over time, exacerbating the problem.

As you can imagine, we are actively working to address the image issue with feeds in coordination with merchants.

I agree, I only tried three queries -- so my sample size was tiny. I guess I wonder how the numbers would change over a larger query set. Having products with images only matters if those are the results that appear first in the search results. In short, the overall percentage of products with images can differ drastically from what users actually see in search results.

Good luck to the team on working on the arrangements with merchants to get your crawls. You can also read the team's full post on the MSN Product Search Blog.

In the post on their blog they asked for some feedback on ranking product results. Here are some of my thoughts:

From their post: Is the product what the user was looking for given the query? The example given was a query for "speaker", where speaker stand and speaker system both contain the desired search term in the product name.

Here are my thoughts:
It would be nice if the search engine recognized that speaker was a modifier/adjective of stand and not a speaker itself. This would solve the speaker stand vs. speaker system problem (this may not be as easy as part-of-speech tagging)...

Can you cluster the results by similarity (features include: manufacturer, price, dimensions, weight, etc...) and then bias the results toward prevalent product clusters (i.e. large, expensive speaker systems are probably more prevalent than cheap and light speaker stands)? Factoring in aggregate product popularity might be important here...

Factors in Product Result Ranking
1) How popular is this product? I tend to be biased towards more popular (higher selling) items. (like the Amazon SalesRank)

2) How is the item rated? I like Amazon because it not only provides the overall rating, but also provides the number of ratings the items received and properties of those reviews (are there recent reviews? are there constant reviews over a long period of time?). Is it rated by consumer reviews or other rating services?

4) Who manufactured the product? I am probably going to prefer products from major name brands -- Wusthof, Sony, Canon, Microsoft, Apple, etc...

5) When was the item first released? I am going to prefer newer items / models.

6) What are the seller's shipping rates / policies? I am going to prefer sellers that have cheaper shipping fees and that can get me my item faster.

7) Seller proximity for some items. For some large items I might want to be able to pick the item up, and local sellers are better than distant sellers. (I probably won't ship a large-screen TV or large piece of furniture.)
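One simple way to combine factors like these is a weighted sum. The sketch below is only an illustration of that idea: the class and field names are hypothetical, every signal is assumed to be pre-normalized to the range 0..1, and the weights are numbers I picked out of the air, not anything a real product search engine is known to use. Seller proximity is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ProductRanker {
    /** Hypothetical per-product signals, each already normalized to the range 0..1. */
    static final class Product {
        final String name;
        final double popularity; // e.g. a scaled sales rank
        final double rating;     // average rating, discounted when there are few reviews
        final double brand;      // 1.0 for a well-known manufacturer
        final double recency;    // newer models closer to 1.0
        final double shipping;   // cheaper and faster shipping closer to 1.0

        Product(String name, double popularity, double rating,
                double brand, double recency, double shipping) {
            this.name = name;
            this.popularity = popularity;
            this.rating = rating;
            this.brand = brand;
            this.recency = recency;
            this.shipping = shipping;
        }
    }

    /** Weighted sum of the signals; the illustrative weights sum to 1.0. */
    static double score(Product p) {
        return 0.30 * p.popularity
                + 0.25 * p.rating
                + 0.15 * p.brand
                + 0.15 * p.recency
                + 0.15 * p.shipping;
    }

    /** Returns a copy of the products sorted by descending score. */
    static List<Product> rank(List<Product> products) {
        List<Product> sorted = new ArrayList<>(products);
        sorted.sort((x, y) -> Double.compare(score(y), score(x)));
        return sorted;
    }

    public static void main(String[] args) {
        Product stand = new Product("speaker stand", 0.3, 0.9, 0.0, 0.5, 0.9);
        Product system = new Product("speaker system", 0.9, 0.8, 1.0, 0.5, 0.4);
        for (Product p : rank(Arrays.asList(stand, system))) {
            System.out.println(p.name + " " + score(p));
        }
    }
}
```

In practice the weights would be tuned against click or purchase data rather than set by hand.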

ACM SIGKDD Webcasts Online

The ACM SIGKDD has started hosting webcasts to improve data mining education and share expertise.

There are currently two webcasts online and a third scheduled for May. Here is information on the first two:

Web Content Mining
By Bing Liu, University of Illinois at Chicago (UIC)
Web content mining aims to extract/mine useful information or knowledge from Web page contents. Apart from traditional tasks of Web page clustering and classification, there are many other Web content mining tasks, e.g., data/information extraction, information integration, mining opinions from the user-generated content, mining the Web to build concept hierarchies, Web page pre-processing and cleaning, etc.
His website also has slides and other data mining material to go with his new book: Web Data Mining (December 2006).

Towards Web-Scale Information Extraction
By Eugene Agichtein, Emory University
Data mining applications over text require efficient methods for extracting and structuring the information embedded in millions, or billions, of text documents... First I will briefly review common information extraction tasks such as entity, relation, and event extraction, indicating the main scalability bottlenecks associated with each task. I will then review the key algorithmic approaches to improving the efficiency of information extraction, which include applications of randomized algorithms, ideas adapted from information retrieval, and recently developed specialized indexing techniques.
Eugene has a web page that accompanies the webinar. The page has lots of good resources, including links to other similar tutorials.

Monday, April 30

Search at Ebay Part I: Faceted Search and Ebay Express

This is the beginning of a two-part series on search technology at EBay. Search is important at EBay because users need to be able to quickly find products. Not long ago, I blogged about EBay's San Dimas project, which uses a faceted search UI. This article will explore faceted search in more detail. It will first provide an introduction to faceted search terminology and then look at EBay Express as a model of a faceted search system.

Facets refer to categorized properties of objects in a collection. Each facet has a name, such as Cooking Method, Ingredients, Course, or Cuisine for a recipe collection. A facet may be flat (such as Author) or it may be hierarchical (Cuisine > Italian > North Italian > Milan). Facets are not categories because you don't place items INTO a facet; facet values are properties assigned TO items. Facets are structured tags. For more background on faceted search systems you can read SearchTools' report on Faceted Metadata Search.
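As a rough illustration of "values assigned TO items", here is a small sketch in which each item carries a value path per facet (one element for a flat facet, one element per level for a hierarchical one) and the counts shown next to each facet value are computed from the items. The class name `FacetCounts`, the map-based item representation, and the sample recipes are my own assumptions, not how any particular faceted engine stores its data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetCounts {
    /**
     * Counts items per facet value. Each item maps a facet name to a value path:
     * a flat facet has a one-element path, a hierarchical facet has one element per level.
     * An item is counted once at every level of its path.
     */
    static Map<String, Map<String, Integer>> count(List<Map<String, List<String>>> items) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Map<String, List<String>> item : items) {
            for (Map.Entry<String, List<String>> facet : item.entrySet()) {
                Map<String, Integer> byValue =
                        counts.computeIfAbsent(facet.getKey(), k -> new TreeMap<>());
                StringBuilder path = new StringBuilder();
                for (String level : facet.getValue()) {
                    if (path.length() > 0) {
                        path.append(" > ");
                    }
                    path.append(level);
                    byValue.merge(path.toString(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> risotto = new HashMap<>();
        risotto.put("Cuisine", Arrays.asList("Italian", "North Italian", "Milan"));
        risotto.put("Course", Arrays.asList("Main"));

        Map<String, List<String>> pizza = new HashMap<>();
        pizza.put("Cuisine", Arrays.asList("Italian", "South Italian"));
        pizza.put("Course", Arrays.asList("Main"));

        System.out.println(count(Arrays.asList(risotto, pizza)));
    }
}
```

Counting an item at every level of its path is what lets the UI show a total for Italian and a smaller total for each of its children.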

Marti Hearst at UC Berkeley is one of the leading experts on faceted search systems. She led the design of the Flamenco faceted search system. At CHI 2006 in Montreal Marti led a course with Preston Smalley and Cory Chandler from EBay (the San Dimas Project designers) entitled "Faceted Metadata for Information Architecture and Search".
The main objective of the course is to instruct attendees about how to integrate navigation and search for large collections in a seamless, flexible manner that helps users find things quickly and browse items comfortably...The instructors have designed an approachable, reproducible methodology for the design of highly usable, highly searchable information-centric web sites.
The goals of these systems are outlined well in Marti's 2006 paper Design Recommendations for Hierarchical Faceted Search Interfaces from the Faceted Search Workshop at SIGIR 2006:
...the overarching design goals are to support flexible navigation, seamless integration with directed (keyword) search, fluid alternation between refining and expanding, avoidance of empty results sets, and at all times retaining a feeling of control and understanding.
EBay Express
In the CHI course, Preston and Cory present Ebay Express as a new model for a state-of-the-art faceted search system. They outline a series of lessons learned and design pitfalls to avoid. Here are the main lessons they walk you through:
  • "Parsing" feels natural to users (and the text in the search box is not sacred)
  • Controls placed along the top of the page are used more than when on the left side.
  • People browse using the facets more when they are not familiar with the domain
  • Users stop using refinements when a) not useful, and b) item count low enough
  • Prominently showing 4 facets is sufficient (but prioritization is important)
  • Shifting columns doesn't disturb people
  • Truncated list of values per facet is okay (users know how to access the rest)
  • Showing sample values helps users understand facets and can expose breadth
  • Users often want to select multiple facet labels and are pleased when they can (treated as an OR by search engine)
  • Traditional breadcrumbs don't work here
  • Users understand the idea of applying and removing facets using this modified breadcrumb without instruction
The course is very rich and the above outline is only a cursory glimpse into the wealth of wisdom they walk you through.
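The lesson about selecting multiple facet labels (treated as an OR by the search engine) is easy to sketch. Below, values selected within one facet are OR-ed together; combining different facets with AND is my assumption about the usual behavior, not something stated in the course outline. The class name `FacetFilter` and the sample cameras are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FacetFilter {
    /**
     * Keeps items that satisfy every selected facet, where an item satisfies a
     * facet if its value is any one of the values selected for that facet.
     */
    static List<Map<String, String>> apply(List<Map<String, String>> items,
                                           Map<String, Set<String>> selections) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> item : items) {
            boolean matches = true;
            for (Map.Entry<String, Set<String>> selection : selections.entrySet()) {
                String value = item.get(selection.getKey());
                if (value == null || !selection.getValue().contains(value)) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                kept.add(item);
            }
        }
        return kept;
    }

    private static Map<String, String> camera(String brand, String resolution) {
        Map<String, String> item = new HashMap<>();
        item.put("Brand", brand);
        item.put("Resolution", resolution);
        return item;
    }

    public static void main(String[] args) {
        List<Map<String, String>> items = Arrays.asList(
                camera("Canon", "5 MP"), camera("Nikon", "5 MP"), camera("Sony", "7 MP"));
        Map<String, Set<String>> selections = new HashMap<>();
        // Two labels selected in the same facet: Canon OR Nikon.
        selections.put("Brand", new HashSet<>(Arrays.asList("Canon", "Nikon")));
        System.out.println(apply(items, selections));
    }
}
```

An empty selection map keeps every item, which matches the "no refinements applied" state of the UI.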

There is a good review of the course by Jessyca Frederick, a developer from ShopZilla who attended the course.

There are a lot of hard technical details to dig into, for starters:
  • How do you parse user queries intelligently and match query terms to facets?
    i.e. translate the query: 5 MP Cannon PowerShot A530 to Company:Canon, Resolution:5 Megapixel, Series: Powershot, Model: A530?
  • How do you pick what values of facets to display when the list of values is very large?
  • How do you efficiently integrate relational database querying with keyword search using inverted indexing systems? Do you even have to?
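For the first question, the most naive approach is a lexicon lookup plus a pattern or two. The sketch below handles the camera query from the example that way. It is a toy under stated assumptions: the lexicon entries (including treating "cannon" as a misspelling of Canon), the megapixel pattern, and the class name `QueryFacetParser` are all made up, and a real system would need spelling correction, multi-word values, and disambiguation.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueryFacetParser {
    // Hypothetical lexicon: lower-cased query token -> {facet name, canonical value}.
    private static final Map<String, String[]> LEXICON = new LinkedHashMap<>();
    static {
        LEXICON.put("canon", new String[] {"Company", "Canon"});
        LEXICON.put("cannon", new String[] {"Company", "Canon"}); // assumed misspelling
        LEXICON.put("powershot", new String[] {"Series", "PowerShot"});
        LEXICON.put("a530", new String[] {"Model", "A530"});
    }

    // Matches things like "5 MP" or "7mp".
    private static final Pattern MEGAPIXELS = Pattern.compile("(?i)\\b(\\d+)\\s*MP\\b");

    /** Maps recognized parts of the query to facets; anything left over goes under "Keywords". */
    static Map<String, String> parse(String query) {
        Map<String, String> facets = new LinkedHashMap<>();
        String rest = query;
        Matcher m = MEGAPIXELS.matcher(query);
        if (m.find()) {
            facets.put("Resolution", m.group(1) + " Megapixel");
            rest = m.replaceFirst(" ");
        }
        StringBuilder keywords = new StringBuilder();
        for (String token : rest.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String[] hit = LEXICON.get(token.toLowerCase(Locale.ROOT));
            if (hit != null) {
                facets.put(hit[0], hit[1]);
            } else {
                if (keywords.length() > 0) {
                    keywords.append(' ');
                }
                keywords.append(token);
            }
        }
        if (keywords.length() > 0) {
            facets.put("Keywords", keywords.toString());
        }
        return facets;
    }

    public static void main(String[] args) {
        System.out.println(parse("5 MP Cannon PowerShot A530"));
    }
}
```

Unrecognized tokens fall through to plain keyword search, which is one way to get the "seamless integration with directed (keyword) search" that the design goals call for.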
That's all for now, although there is certainly a lot more to be said on faceted search systems, including looking at the software that can power these interfaces. On that note, next up in the series is a look at EBay's search infrastructure: "Voyager" (no, not the Star Trek Series).

Thursday, April 26

Friday News: SIAM Data Mining Proceedings, LingPipe 3.0, and fun with Pig, Sawzall, and DryadLinq

SIAM Data Mining 2007
The SIAM Data Mining Conference is happening this week in Minneapolis. Daniel Lemire has coverage on his blog. All of the proceedings are available online for download (I wish the ACM did this). Here are some highlights:

Best Paper Awards
Research: Less Is More: Compact Matrix Decomposition for Large Sparse Graphs
Authors: J. Sun, Y. Xie, H. Zhang and C. Faloutsos

Application: Harmonium Models for Semantic Video Representation and Classification
Authors: J. Yang, Y. Liu, E. Xing and A. Hauptmann

Another paper that looked interesting was:
Bandits for Taxonomies: A Model-based Approach by Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti and Vanja Josifovski (all of Yahoo Research). The problem here is to match contextual ads to web pages as efficiently as possible, even when clicks (feedback) are rare. One of the tricks described is to use taxonomy matching -- classifying web pages into a hierarchical taxonomy (such as the Yahoo Directory) and then classifying ads into the taxonomy. They can then exploit relationships within the taxonomy to find other similar content. They put an interesting spin on it by framing the problem as a "multi-armed bandit problem." See the Wikipedia entry on the Multi-armed bandit problem for background on a very interesting gambling problem ;-).
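For anyone new to bandits, here is a minimal sketch of the textbook epsilon-greedy strategy: mostly show the arm (ad) with the best observed click rate, but explore a random arm some fraction of the time. To be clear, this is the generic baseline, not the model-based taxonomy approach from the paper; the class name `EpsilonGreedyBandit` and its parameters are my own.

```java
import java.util.Random;

final class EpsilonGreedyBandit {
    private final int[] pulls;
    private final double[] meanReward;
    private final double epsilon;
    private final Random random;

    EpsilonGreedyBandit(int arms, double epsilon, long seed) {
        this.pulls = new int[arms];
        this.meanReward = new double[arms];
        this.epsilon = epsilon;
        this.random = new Random(seed);
    }

    /** Picks a random arm with probability epsilon, otherwise the arm with the best observed mean. */
    int selectArm() {
        if (random.nextDouble() < epsilon) {
            return random.nextInt(pulls.length);
        }
        int best = 0;
        for (int arm = 1; arm < pulls.length; arm++) {
            if (meanReward[arm] > meanReward[best]) {
                best = arm;
            }
        }
        return best;
    }

    /** Folds one observed reward (e.g. 1.0 for a click, 0.0 otherwise) into the arm's running mean. */
    void update(int arm, double reward) {
        pulls[arm]++;
        meanReward[arm] += (reward - meanReward[arm]) / pulls[arm];
    }

    double estimate(int arm) {
        return meanReward[arm];
    }

    int pulls(int arm) {
        return pulls[arm];
    }

    public static void main(String[] args) {
        // Simulated click-through rates for three hypothetical ads.
        double[] trueRates = {0.02, 0.05, 0.01};
        EpsilonGreedyBandit bandit = new EpsilonGreedyBandit(trueRates.length, 0.1, 42L);
        Random clicks = new Random(7L);
        for (int i = 0; i < 10000; i++) {
            int arm = bandit.selectArm();
            bandit.update(arm, clicks.nextDouble() < trueRates[arm] ? 1.0 : 0.0);
        }
        for (int arm = 0; arm < trueRates.length; arm++) {
            System.out.println("arm " + arm + ": pulls=" + bandit.pulls(arm)
                    + " estimate=" + bandit.estimate(arm));
        }
    }
}
```

When clicks are as rare as they are in ad matching, this kind of per-arm estimate learns very slowly, which is the motivation for sharing information across related arms via a taxonomy.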

LingPipe 3.0
Alias-i has released LingPipe 3.0. There are full details on the new version on the LingPipe blog. The new system moves to Java 1.5 and uses generics. There is a great story about the upgrade process: Spring Cleaning Generics for Lingpipe 3.0. Generics are awesome -- and I love the for-each loop. Also, the clustering package was re-written from the ground-up; there is a new clustering tutorial as well.

Distributed Processing Abstractions: Pig, Sawzall, and DryadLinq
These are programming models designed to enable mere mortals to write programs that seamlessly scale for parallel processing on large computing clusters. In short, they are tools that enable efficient large scale data manipulation over web pages, query logs, etc... These languages usually (with the exception of Dryad) run on a map-reduce framework (such as Yahoo's Hadoop). All three of the major search engines are building languages to perform large scale distributed data processing:

The Pig Project from Yahoo (An open-source, Java, add-on to Hadoop).
The highest abstraction layer in Pig is a query language interface, whereby users express data analysis tasks as queries, in the style of SQL or Relational Algebra. Queries articulate data analysis tasks in terms of set-oriented transformations, e.g. apply a function to every record in a set, or group records according to some criterion and apply a function to each group.
DryadLinq from Microsoft (Distributed Systems and Web Search and Data Mining teams).

A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph... Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

Dryad is closed source, written using .Net and C#.

Sawzall from Google

Greg has coverage on them (Yahoo Pig and Google Sawzall) and goes into some depth on some of the similarities and differences in the languages.
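The "group records according to some criterion and apply a function to each group" style from the Pig description can be illustrated on a single machine with Java streams. This is only an in-memory analogy, not Pig, Sawzall, or Dryad code, and nothing here is distributed; the log format ("user<TAB>query") and the class name `QueryLogStats` are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class QueryLogStats {
    /** Groups log lines of the form "user<TAB>query" by normalized query and counts each group. */
    static Map<String, Long> countByQuery(List<String> logLines) {
        return logLines.stream()
                .map(line -> line.split("\t", 2))
                .filter(fields -> fields.length == 2)              // skip malformed lines
                .map(fields -> fields[1].trim().toLowerCase(Locale.ROOT))
                .collect(Collectors.groupingBy(q -> q, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("u1\tipod", "u2\tIpod ", "u3\txbox 360", "malformed");
        System.out.println(countByQuery(log));
    }
}
```

The appeal of the languages above is that the same group-then-aggregate description can be executed over many machines without the author writing any of the partitioning or fault-tolerance logic.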

Tuesday, April 24

Images on Windows Live Products, An Improvement?

Recently Google renamed Froogle and gave it an upgrade; now it is Microsoft's turn with upgrades to Live Product Search.

Microsoft has a post up on their blog, Live Product Search More Images, More Relevant. According to their latest information 88.6% of the products now have images (a 9% improvement over the old system). Why aren't there 100% images in the top results?
The reason is largely because many sites, including very reputable merchants like,, and, block image crawling bots or seriously throttle them. We will have to work with these sites to address these issues, but the latest improvements in the number of Product Search top results with images are already quite significant.
Webmasters are weird about crawlers. Some complain even when the search engine will drive traffic to their site... and then there are the really insane webmasters who complain when you make 100 hits to their site. Some appear to have nothing to do but pore over their web logs. Still, it is surprising that MSN is having these problems with major retailers.

I don't buy MSN's 88.6%, at least not from a user perspective. I tried some queries and I'm seeing much worse results. See XBox 360 (10 out of 18 have images), ipod (12 out of 18), and a hard one, kershaw shun (a brand of knife) (0 out of 18). This leads to my overall rating of approximately 41% (22 out of 54). Compare this with Google: XBox 360 (7 out of 10), Ipod (10 out of 10), Kershaw Shun (10 out of 10). Google gets 90% (27 out of 30). My unscientific 41% is a long way off from MS's claim of 89%. Now I can't see what is in their entire database, but from my user experience something is fishy here.
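For the record, here is how I pooled the numbers: add up the results with images across all the queries and divide by the total number of results shown. The tiny sketch below just reproduces that arithmetic from the counts quoted above; the class name `ImageCoverage` is made up.

```java
final class ImageCoverage {
    /** Pooled percentage of results with images; each row is {withImages, resultsShown} for one query. */
    static double percent(int[][] perQuery) {
        int withImages = 0;
        int shown = 0;
        for (int[] query : perQuery) {
            withImages += query[0];
            shown += query[1];
        }
        return 100.0 * withImages / shown;
    }

    public static void main(String[] args) {
        int[][] live = {{10, 18}, {12, 18}, {0, 18}};   // XBox 360, ipod, kershaw shun
        int[][] google = {{7, 10}, {10, 10}, {10, 10}}; // same three queries
        System.out.println("Live:   " + percent(live));
        System.out.println("Google: " + percent(google));
    }
}
```

With only three queries, a single query like kershaw shun moves the pooled figure by a third, so this is a sanity check and not a measurement.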

In other product-search-related Microsoft news... MS is working on Cloud DB (coverage via Geeking with Greg), a similar product to Google's BigTable. The key problem here is: how do you handle sparsely populated columns efficiently? From what appears to be some kind of leaked discussion on Cloud DB:
MSN Shopping. The total set of attributes that products can have (e.g. “Pixel Resolution”) is very large, but any given product only has a few (a vacuum cleaner doesn’t have ‘Pixel Resolution’).
A good review of BigTable and eventually BigTable and S3 is something for another night...

Future of Search Event at UC Berkeley

There has been a lot of talk about the future of search recently (see Hakia's Quest for Better Search).

Matthew Hurst pointed out that the Future of Search 'research event' at UC Berkeley is coming up on May 4th. It is billed as an opportunity for students and academics to get together and talk with industry, to set agendas for relevant research.
This event will examine the path towards the next generation of Search. This requires new technology for its development, engineering design and visualization. As the technological expertise for each component becomes increasingly complex, there is a need to better integrate them into a global model. The ultimate goal is to understand how we can fully mechanize search engines with cognitive and natural language capabilities. This event will endeavor to construct an overview of what is to come, to elucidate and formulate the main open questions in this grand quest and to highlight promising research directions.
About half the day will be taken up with three panels: NLP, Communities and Search, and Multi-Media.

Matthew is participating in the communities and search panel. The speakers and panels all look really interesting -- I wish I could attend! If you are near Berkeley, don't miss it. Registration is free!

Friday, April 20

StumblingUpon $40 Million-ish

Rumor has it that EBay has purchased StumbleUpon, a social search / bookmarking site that lets users explore new sites via "collaborative serendipity." The sale price is reportedly approximately $40 million (see TechCrunch and GigaOm). Not bad for a small company with only $1.5 million in investment. I can't remember if I blogged about it, but I predicted that they would be purchased this year. However, I predicted that one of the big three would buy them.

There is a great interview on SELand, Q&A with Garrett Camp, one of the founders, covering some of the technology:

Our 2 million registered users stumble around 5 million times a day, so we have a pretty active user base. If they find something new, it's incredibly easy for them to submit it to us. All they need to do is click the thumbs-up button on the toolbar and it's submitted to our database. We get over 16,000 new URL submissions a day - all new and unique content endorsed by our members... We have a classification engine which automatically places content into one of 500 predefined categories based upon on-the-page factors. This means most content submitted can be distributed to interested members even before tags have been applied.
It is also a great relatively undervalued marketing channel:
StumbleUpon has a unique business model that works well for marketers where we can deliver traffic directly to your site. You can target by category, age, gender and location. So for product launches, distributing audio/visual content or just getting feedback on your blog, StumbleUpon often works better than PPC approaches since targeting is precise and no click through is required.

Perhaps eBay will extend it to products... or videos of products. Who knows. I'm not sure I get this one...

In response, Google has launched its own blatant rip-off. See Google's blog post: "Searching Without a Query." Google's new personalization will take into account not only your search history, but now also your web browsing history via the Google Toolbar (in a separate post, Your Slice of the Web). You did know that the Google Toolbar tracked the sites you visit if you had PageRank turned on, right?

Remember my previous post on Hakia's future of search: how much do you trust Google with your data (mail, docs, purchase history (Checkout), search history, web browsing history, files (GDrive), etc...)? Imagine the possibilities... for good or evil.

QueryCat: The query's meow

One of my co-workers has launched a new vertical search engine: QueryCat, a FAQ search engine. Other coverage on Search Engine Land: QueryCat - Search FAQs.

Kevin created the search engine using Alexa's Web Search platform to mine the web for questions and answers in FAQs. After pages are discovered they are mined and questions and answers are extracted and indexed with Lucene.

According to Kevin:
The idea was inspired by some of the "answer engines", such as as well as Google's "one box". I think that the next level of search will involve more understanding of a user's query and matching it up with structured information parsed from the web. These sort of techniques help the user find the answer just a little bit faster...We have about 2 million questions and answers right now, but I believe we can double or even triple that in the next few weeks.
Some of the answers are spot on, but others still need some tuning. For example, a query for what is the capital of mexico? returns as the first result What is the difference between the mortgage rate and the APR? (presumably because mexico and capital are in the description).

Good luck Kevin.

Thursday, April 19

Ebay's Hybrid Desktop Application: Project San Dimas

There has been a lot of buzz about EBay's San Dimas prototype. Project San Dimas is a hybrid desktop-web app based on Adobe Apollo technology.

There has been a lot of coverage recently from the Web 2.0 Conference: TechCrunch, two of the developers' blogs (Rob Abbott and Alan Lewis), and of course the video from the Adobe Conference. (Side note: Alan has a great presentation, The Future of the Desktop, that he gave at Web 2.0 with some great slides on San Dimas.)

The Adobe video highlights some of the features, including the wicked feature that allows you to interact (post items, bid, etc...) with EBay even if your network is disconnected. Ebay will automatically sync up when the connection is restored.

As a side note, there is an interesting shake-up going on with the San Dimas project: one of the lead UI designers on the project, Alan Lewis, is leaving the company to join Ribbit. He writes, "Despite our success and raves from the San Dimas team, all design work on our side ended abruptly at the end of Q1."

I have not experienced San Dimas first hand, but I look forward to playing with what Apollo has to offer. It is interesting to note that the screenshots I have seen of San Dimas look somewhat similar to Ebay Express, at least with the controls for query refinement on the top of the page (what Google just abandoned for Product Search).

I will save an in-depth discussion of faceted search and EBay Express for another day.

Froogle Rebranded to boring name

Google has re-branded Froogle to Product Search. Maybe if your stock valuation is as high as Google's you can afford to go Dean and Deluca instead of Price Chopper.... Here is the news straight from their blog, Back to Basics. I must admit, the new name works, but it's a lot less witty. Next thing you know Google Base will become Google Object Database.

As usual, Danny Sullivan at SE Land has great in-depth coverage, Goodbye Froogle, Hello Google Product Search. One interesting aspect of the UI change is the change in the way query refinement is done:
The big giant box of query refinement options that were at the top of the page will move to the bottom and be more condensed. The refinements were relatively little used at the top of the page, Mayer said, and putting them down at the bottom also seemed to make more sense.
I'm not sure if I like the change, but it sure makes products the focus of the page, with more content above the fold.

CNet's coverage, Google takes the pun out of shopping, has a great title. It is a decent article, but most of it is on Google Base, not Product Search (highlighting some continued confusion in this area). Here are some highlights from what I would describe as the Google Base article:
Rather than encourage people to go to specific sites for specialized search, which is what vertical sites do, Google wants them to go to first and find the best results from its own specialized searches there. And most people do start their searches, for everything from cars to houses to jobs, on a major search site, experts say. Recent statistics from online traffic measurement firm Hitwise found that search engines are the primary way that Internet users navigate to key industry categories...

But Mayer says Google Base isn't intended to be competition for e-commerce companies. "Faceted search is an important part of the process," allowing people to search for part-time versus full-time jobs and to search for a five-bedroom house, she said. "We know that's important to search and that's something Google hasn't done particularly well in the past."

A case study on culinary Web site provided by Google said the company didn't see any results from its recipe listings on Google Base until it added descriptors such as cuisine type, course and main ingredient. Then traffic to the site jumped 6 percent immediately.

As Google's prominence and power of user attention grows, vertical sites can find Google's approach unsettling. Instead, many verticals are trying to lessen their dependence on Google and find ways to drive direct and repeat usage where Google is not a part of the transaction.

Wednesday, April 18

April Showers Bring... Search Engine Video Lectures

Here in the northeast it has not been pleasant; the cold rain is incessant. The old saying for April this year could be "April Monsoons (hopefully) bring May flowers." Since it's raining outside you may as well watch search engine video lectures, assuming you aren't completely under water.

Here are some of the best sources for search videos on the web.

SIMS 141: Search Engines: Technology, Society, and Business. The class lectures of Marti Hearst's UC Berkeley class from 2005. Great speakers including John Battelle, Sergey, and Jan Pedersen. A good mixture of technical content and business content.

Research Channel - A great wealth of academic lectures available online. A good starting place can be found via the SIGIR Talks page.

VideoLectures.Net - A European site focused on computer science research videos (from conferences and workshops) with over 1,000 video lectures online. It is focused primarily on machine learning and the semantic web. As a starting point, many of the videos from The Future of Web Search workshop from last May hosted by Yahoo Research Barcelona have been posted.

Google Tech Talks - A series of internal lectures given at Google. The topics run the gamut from biofuels to computer security and programming languages. While most are not search focused, they are quite fascinating (and obviously for the technical and geeky audience).

The hard part is choosing what to watch.

That should be enough to keep you busy for at least 40 days and 40 nights, or at least until the spring floods subside.

Sunday, April 15

The Spock Entity Resolution challenge and other miscellany

The Spock Contest - via O'Reilly Radar. Spock is a new people search engine. Like Netflix, Spock has started a contest. First, some background on Spock:
Spock is a search application that helps consumers discover more about people who matter in their lives. At the core, we organize relevant information around people and have developed unique technology to do so...With over one hundred million individuals indexed and millions added every day, Spock is the largest and most comprehensive people specific search application.
Next up, information on The Challenge:

We have selected one of our most interesting problems, namely Entity Resolution, to share with the community, allowing other leading computer scientists and engineers to compete in an open contest... You can work individually and in teams. The competition will last 4 months and the winning team will win a Grand Prize of $50,000! Most importantly you’ll be working on a very important and widely applicable problem. We will also be issuing prizes for 2nd and 3rd place.

The dataset is 1.5 GB compressed. Time to dig a little deeper... more soon.
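To make the problem a bit more concrete, here is a toy sketch of entity resolution: deciding which records refer to the same person and grouping them. Everything in it is an assumption for illustration. The record fields (`name`, `tags`), the matching rule in `same_person`, and the helper names are hypothetical, and this is not Spock's data format or method. It links two records when their normalized names match and they share at least one tag, then merges linked records with a small union-find.

```python
from itertools import combinations


def normalize_name(name):
    """Lowercase, drop periods, and collapse whitespace (hypothetical cleanup)."""
    return " ".join(name.lower().replace(".", "").split())


def same_person(a, b):
    """Hypothetical rule: same normalized name and at least one shared tag."""
    names_match = normalize_name(a["name"]) == normalize_name(b["name"])
    shared_tags = set(a["tags"]) & set(b["tags"])
    return names_match and bool(shared_tags)


def resolve(records):
    """Group record indices into clusters, one cluster per inferred person."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; fine for a toy example, far too slow at real scale.
    for i, j in combinations(range(len(records)), 2):
        if same_person(records[i], records[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


records = [
    {"name": "Michael Jordan", "tags": ["basketball", "chicago"]},
    {"name": "michael  jordan", "tags": ["nba", "basketball"]},
    {"name": "Michael Jordan", "tags": ["machine learning", "berkeley"]},
]
print(resolve(records))
```

The first two records end up in one cluster and the third stays on its own, which is the interesting part of the problem: the same name does not mean the same person. With over one hundred million individuals, comparing all pairs is out of the question, so a real system would need some way of limiting which records get compared.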

Now, other miscellany:

Microsoft Research Asia has released a package of benchmarks for creating and testing machine learning based ranking algorithms called Letor (LEarning TO Rank). Their goal is to create a platform that allows researchers to more easily compare the effectiveness of their ML based ranking systems through the use of a standard set of benchmarks.
Ranking is the central problem for many applications, and using machine learning technologies to learn the ranking function has been a promising research direction. However, the lack of public benchmark datasets (e.g. standard features, relevance judgments, data partitioning, and evaluation metrics) makes the existing work difficult to be compared with each other...We benchmarked several state-of-the-arts ranking models with these features and provide baseline results for future studies.

Found via a blog post on the topic by Fernando Diaz, a grad student at UMass CIIR.
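As an illustration of what a shared evaluation metric buys you, here is a minimal sketch of NDCG (normalized discounted cumulative gain), a commonly used metric for rankings with graded relevance judgments. The function names and the exact gain and discount formulas are my own assumptions; I have not checked them against the definitions used in the Letor package.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top k results.

    relevances: graded relevance labels in ranked order (0 = not relevant).
    Uses the (2^rel - 1) gain and a log2 position discount.
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Two hypothetical rankers ordering the same four judged documents.
print(ndcg([3, 2, 1, 0], k=4))
print(ndcg([0, 1, 2, 3], k=4))
```

The first ranking is already in ideal order, so it scores 1.0; the second puts the best document last and scores lower. With fixed judgments and a fixed metric, two different ranking functions can be compared on exactly the same footing.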

AIRWeb 2007 Papers Announced

Also, for the latest in Web Spam research, the AIRWeb 2007 accepted papers are now online. Search Engine Land has a great article on the topic, with links and descriptions of all the papers, something lacking on the website. One of the primary organizers of AIRWeb is Brian Davison. Brian is presenting a paper on link filtering at the conference, Measuring Similarity to Detect Qualified Links.

Friday, April 13

Friday Round-up: GSOC, ICWSM vids, and Edison

  1. The winners of Google's Summer of Code were announced today; all of the results are here. The collaborative filtering software Taste was a winner, with two projects selected.

  2. The ICWSM keynote videos are online at the ICWSM blog.

  3. Apostolos Gerasoulis, co-founder of Teoma and Professor at Rutgers, leaked at SES NY (more coverage on the Social Search panel at SES NY from SEORoundTable) that Ask is working on a new ranking system. The new system, "Edison", combines Teoma's HITS link analysis (see Kleinberg's Authoritative Sources in a Hyperlinked Environment) with user behavior analysis, an evolution of DirectHit technology.

    From Rahul Lahiri, Vice President of Product Management and Search Technology at

    Edison is still in development, so we can't say too much at this juncture. I can tell you that it's a next generation algorithm that, among many other things, synthesizes modernized versions of Teoma and DirectHit technologies, as AG said this morning. It's much more complicated than saying we're just counting clicks, in the case of DirectHit. The technologies we have, and the patents we hold, go way beyond that. We're also taking a deeper look at communities and calculating the authorities in those communities. We were really inspired by looking into the universe of user behavior, and what that could tell us, and the social fabric of the Web itself, and what that tells us. We're also rolling out an upgraded search infrastructure over the course of 2007 and building a new datacenter along the Columbia River in eastern Washington, which will help our speed, freshness and data quality. It's safe to say that Edison itself will roll out over the course of the year, as we improve it and tweak parameters.

    via SearchEngineLand. Remember what I said about implicit feedback being underrated...
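For anyone who has not read the Kleinberg paper mentioned above, here is a minimal sketch of the basic HITS iteration on a toy link graph. It is the textbook hubs-and-authorities computation only; the graph, the function names, and the iteration count are assumptions for illustration, and it says nothing about how Teoma or Edison actually work.

```python
import math


def normalize(scores):
    """Scale scores so the sum of squares is 1 (guards against all zeros)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(graph, iterations=50):
    """Basic HITS. graph maps each page to the list of pages it links to."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to you.
        new_auths = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auths[dst] += hubs[src]
        # Hub score: sum of the authority scores of the pages you link to.
        new_hubs = {
            n: sum(new_auths[dst] for dst in graph.get(n, []))
            for n in nodes
        }
        auths = normalize(new_auths)
        hubs = normalize(new_hubs)
    return hubs, auths


# Toy graph: pages a and b both link to c, and c links to d.
toy_graph = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
hubs, auths = hits(toy_graph)
print(max(auths, key=auths.get))
```

In the toy graph, c comes out as the top authority because two hubs point at it, while a and b share the top hub score. Note that the scores come purely from link structure; user behavior signals like clicks would be an additional input on top of this.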

Grand Re-Opening: un caffè, per favore

Now open later for your search engine information fix.

You may have noticed (or more likely not have noticed) that the name of the blog has been changed from "Jeff's Search Cafe" to "Jeff's Search Engine Caffè". Likewise, the address has been updated and has a new TLD name,

The reason for this is that I was contacted by a company who owns the trademark for "SearchCafe". In order to avoid any more conflict and hassle, I decided to take this as an opportunity to buy a new domain name and update the site. In the long run, having its own domain name will give me more freedom -- in case I want to switch off of Blogger.

In the next few weeks I hope to create a new template to go with the new name and address; the current look and feel is pretty lame.

Thanks for reading, and if you get a chance I would appreciate it if you update any bookmarks (such as those on Delicious).

P.S. Just a piece of advice to anyone who decides to do anything online: before you start, do a trademark search. At the very least, search the USPTO and Google -- even if you only decide to create something on a sub-domain of a free service and your content is completely non-commercial.

Wednesday, April 11

Hakia's Quest for Better Search

Hakia, a semantic and NLP based engine, started a discussion among some of the leading bloggers and search engine journalists on the future of search:

The Search For Better Search

Here are some of my thoughts on The Future.

The future of search engines is about context and authority.

First, future search engines will provide more relevant information because they will have more information about me: my level of expertise in different disciplines (5th grader versus a post-doc), my current location (geographic and home/work), and what I am working on at the moment (writing a research paper, writing code, reading news, researching my next trip, etc.).

Second, search engines will (hopefully) be able to tell the difference between overall popularity and topical authority. For example, The Wall Street Journal may not be an authority on Food or culinary information. And a mere blogger may be an authority on Personalization.

The future of search has major implications for digital identity and privacy as people do more online. Search engines will begin to be aware of what we do: what websites we visit through toolbars, what searches we execute, what search results and ads we click on, what information is in our documents, and how this information is connected with our blogs and even our Amazon Wishlists and product reviews.

How much privacy are you willing to give up to get good search results? How much is necessary?

As John Battelle commented:
To get there, we'll need to trust that everything we disclose online - our behaviors, our clickstreams, and our intent - are managed through a trusting relationship. The future of search is as a conversation with someone we trust.
How much do you trust Google? Or MSN or Yahoo?

In the future, the service that gets my search traffic may not be the one that provides the most relevant search results, but the one I trust most with my data.