Friday, May 4

Open Source Scraping (Wrapper Generation) Tools

Web information extraction, sometimes called 'screen scraping' or 'web scraping', converts the unstructured or semi-structured web content intended for human consumption into structured data suitable for computers.

Using a few simple tools it is easy to create wrappers that reliably extract structured content from semi-structured HTML web pages.

First, there is the open source project Web-Harvest.
Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.
Another similar project is JScrape (Alpha). JScrape is very similar to Web-Harvest in technique and technology. From the website:
JScrape using the HttpClient API to get an input stream to a web page, then using the TagSoup API to turn the HTML into an acceptable DOM object and then from their saxon is used to apply the XQuery.
Simple Alternative
Alternatively, it is easy to roll your own XML-based extractor. You will need an HTML to XHTML converter such as NekoHTML or TagSoup (see my post last year on these parsers) and an XPath/XQuery engine such as XOM or JDOM.

Here is a quick example using XOM and TagSoup. You will need a Java JDK, XOM 1.1, and TagSoup 1.1. A simple data extractor:

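import java.io.ByteArrayInputStream;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.XPathContext;
import org.xml.sax.XMLReader;

// Set up your HTML to XML parser.
XMLReader htmlToXmlParser = new org.ccil.cowan.tagsoup.Parser();
// Assumption: the standard SAX namespaces feature (already TagSoup's default).
htmlToXmlParser.setFeature("http://xml.org/sax/features/namespaces", true);
// TagSoup puts elements in the XHTML namespace, so bind the "html" prefix to it.
XPathContext xpathContext = new XPathContext("html", "http://www.w3.org/1999/xhtml");

// Build / parse your HTML document (here represented as bytes from a String)
Document doc = new Builder(htmlToXmlParser).build(new ByteArrayInputStream(bytes));

// Query your document using XPath Expressions
Nodes nodes = doc.query("//html:span[@class='headline1']", xpathContext);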

You may need some help getting started with XQuery / XPath expressions. A great way to start is to download the freely available first chapter of XQuery: A Guided Tour.
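If you want to experiment with XPath expressions before adding any libraries, here is a minimal sketch that runs the same headline query using only the JDK's built-in DOM parser and XPath engine. It assumes the page is already well-formed XHTML (in practice that is the job of TagSoup or NekoHTML), and the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: not part of XOM, TagSoup, or any tool named in this post.
class HeadlineExtractor {

    // Returns the trimmed text of every <span class="headline1"> element.
    // Assumes the input is well-formed XHTML; real-world HTML needs cleaning first.
    static List<String> extractHeadlines(String xhtml) {
        try {
            // Namespace awareness stays off (the default), so the XPath needs no prefix.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//span[@class='headline1']", doc, XPathConstants.NODESET);
            List<String> headlines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                headlines.add(nodes.item(i).getTextContent().trim());
            }
            return headlines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<span class=\"headline1\">First story</span>"
                + "<span class=\"byline\">Someone</span>"
                + "<span class=\"headline1\">Second story</span>"
                + "</body></html>";
        for (String headline : extractHeadlines(page)) {
            System.out.println(headline);
        }
    }
}
```

Because the parser here is not namespace-aware, the expression drops the html: prefix used in the XOM example above.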

That's all there is to it, at least for simple wrappers. The hard part is scale and maintenance of large numbers of wrappers over time... and there are some commercial engines that help to manage this. More on these in a future post.

Wednesday, May 2

Live Product Search Images Follow-up

I posted last week on MSN's update to their product search.

Ling Bao from the product search team responded to my comments on the post:
Jeff, you make a good observation. We've verified your queries and all the offers without images are from Product Upload Beta feeds where the merchant has blocked our image bot. In terms of the categories that have more images, this is heavily skewed by what merchants are uploading.

Additionally, I think the big difference between our numbers is due to two reasons. Part of it is because of sample size. The other cause is that we're getting more feeds over time, exacerbating the problem.

As you can imagine, we are actively working to address the image issue with feeds in coordination with merchants.

I agree, I only tried three queries -- so my sample size was tiny. I guess I wonder how the numbers would change over a larger query set. Having products with images only matters if those are the results that appear first in the search results. In short, the overall percentage of products with images can differ drastically from what users actually see in search results.

Good luck to the team on working on the arrangements with merchants to get your crawls. You can also read the team's full post on the MSN Product Search Blog.

In the post on their blog they asked for some feedback on ranking product results. Here are some of my thoughts:

From their post: Is the product what the user was looking for given the query? The example given was a query for "speaker", where speaker stand and speaker system both contain the desired search term in the product name.

Here are my thoughts:
It would be nice if the search engine recognized that "speaker" is a modifier/adjective of "stand" and not the speaker itself. This would solve the speaker stand vs. speaker system problem (though this may not be as easy as part-of-speech tagging)...

Can you cluster the results by similarity (features include manufacturer, price, dimensions, weight, etc.) and then bias the results toward prevalent product clusters (e.g. large, expensive speaker systems are probably more prevalent than cheap, light speaker stands)? Factoring in aggregate product popularity might be important here...
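To make the clustering idea a little more concrete, here is a rough sketch. It is only an illustration: instead of real clustering over several features, it buckets products by the order of magnitude of their price and then moves products from the most populated bucket to the front. The Product class, the price-band rule, and all names are my assumptions, not anything the product search team described.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: price bands stand in for real similarity clusters.
class ClusterBias {

    static final class Product {
        final String name;
        final double price;

        Product(String name, double price) {
            this.name = name;
            this.price = price;
        }
    }

    // Crude stand-in for clustering: the order of magnitude of the price.
    static int priceBand(Product p) {
        return (int) Math.floor(Math.log10(Math.max(p.price, 1.0)));
    }

    // Returns a copy of the results with products from larger bands first.
    static List<Product> biasTowardPrevalent(List<Product> results) {
        Map<Integer, Integer> bandSizes = new HashMap<>();
        for (Product p : results) {
            bandSizes.merge(priceBand(p), 1, Integer::sum);
        }
        List<Product> ordered = new ArrayList<>(results);
        // List.sort is stable: products in equally populated bands keep their order.
        ordered.sort(Comparator.comparingInt((Product p) -> -bandSizes.get(priceBand(p))));
        return ordered;
    }

    public static void main(String[] args) {
        List<Product> results = new ArrayList<>();
        results.add(new Product("speaker stand", 25.0));
        results.add(new Product("speaker system A", 300.0));
        results.add(new Product("speaker system B", 450.0));
        for (Product p : biasTowardPrevalent(results)) {
            System.out.println(p.name);
        }
    }
}
```

A real system would cluster on more than price and would probably blend this signal with the existing relevance score rather than re-sorting outright.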

Factors in Product Result Ranking
1) How popular is this product? I tend to be biased towards more popular (higher selling) items. (like the Amazon SalesRank)

2) How is the item rated? I like Amazon because it not only provides the overall rating, but also provides the number of ratings the items received and properties of those reviews (are there recent reviews? are there constant reviews over a long period of time?). Is it rated by consumer reviews or other rating services?

4) Who manufactured the product? I will probably prefer products from major name brands -- Wüsthof, Sony, Canon, Microsoft, Apple, etc.

5) When was the item first released? I am going to prefer newer items / models.

6) What are the seller's shipping rates / policies? I am going to prefer sellers that have cheaper shipping fees and that can get me my item faster.

7) Seller proximity for some items. For some large items I might want to pick the item up myself, so local sellers are better than distant sellers. (I probably wouldn't have a large-screen TV or a large piece of furniture shipped.)
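As a rough illustration of how a few of these factors could be folded into one number, here is a sketch of a weighted score. Everything in it is an assumption on my part: the weights, the 1-5 star scale, the neutral prior of 3.0 that counts as ten ratings, and the names. The one idea worth noting is that the rating is damped toward the prior when there are few ratings, which reflects the point above that the number of ratings matters as much as the average.

```java
// Hypothetical scoring sketch; the weights and priors are illustrative only.
class ProductScore {

    // Pulls an average rating toward a neutral prior when only a few ratings exist.
    static double smoothedRating(double averageRating, int ratingCount) {
        double priorRating = 3.0;  // assumed midpoint of a 1-5 star scale
        double priorWeight = 10.0; // assumed: the prior counts as ten ratings
        return (priorRating * priorWeight + averageRating * ratingCount)
                / (priorWeight + ratingCount);
    }

    // Combines popularity (0..1), smoothed rating, and recency into one score.
    static double score(double popularity, double averageRating,
                        int ratingCount, double ageYears) {
        double rating = smoothedRating(averageRating, ratingCount) / 5.0;
        double recency = 1.0 / (1.0 + ageYears); // newer items score closer to 1
        return 0.5 * popularity + 0.3 * rating + 0.2 * recency;
    }

    public static void main(String[] args) {
        // A popular, heavily reviewed item versus an obscure one with two 5-star reviews.
        System.out.println(score(0.9, 4.5, 1200, 1.0));
        System.out.println(score(0.2, 5.0, 2, 0.5));
    }
}
```

Shipping cost and seller proximity could be added as further terms in the same way, once they are normalized to a comparable range.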

ACM SIGKDD Webcasts Online

The ACM SIGKDD has started hosting webcasts to improve data mining education and share expertise.

There are currently two webcasts online and a third scheduled for May. Here is information on the first two:

Web Content Mining
By Bing Liu, University of Illinois at Chicago (UIC)
Web content mining aims to extract/mine useful information or knowledge from Web page contents. Apart from traditional tasks of Web page clustering and classification, there are many other Web content mining tasks, e.g., data/information extraction, information integration, mining opinions from the user-generated content, mining the Web to build concept hierarchies, Web page pre-processing and cleaning, etc.
His website also has slides and other data mining material to go with his new book: Web Data Mining (December 2006).

Towards Web-Scale Information Extraction
By Eugene Agichtein, Emory University
Data mining applications over text require efficient methods for extracting and structuring the information embedded in millions, or billions, of text documents... First I will briefly review common information extraction tasks such as entity, relation, and event extraction, indicating the main scalability bottlenecks associated with each task. I will then review the key algorithmic approaches to improving the efficiency of information extraction, which include applications of randomized algorithms, ideas adapted from information retrieval, and recently developed specialized indexing techniques.
Eugene has a web page that accompanies the webinar. The page has lots of good resources, including links to other similar tutorials.

Monday, April 30

Search at eBay Part I: Faceted Search and eBay Express

This is the first of a two-part series on search technology at eBay. Search is important at eBay because users need to be able to find products quickly. Not long ago, I blogged about eBay's San Dimas project, which uses a faceted search UI. This article explores faceted search in more detail: it first introduces faceted search terminology and then looks at eBay Express as a model of a faceted search system.

Facets refer to categorized properties of objects in a collection. Each facet has a name, such as Cooking Method, Ingredients, Course, or Cuisine for a recipe collection. A facet may be flat (such as Author) or hierarchical (Cuisine > Italian > North Italian > Milan). Facets are not categories: you don't place items INTO a facet; rather, facet values are properties assigned TO items. In that sense, facets are structured tags. For more background on faceted search systems you can read SearchTools' report on Faceted Metadata Search.
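Here is one way to picture that in code. This is a minimal sketch under my own assumptions: each item is just a map from facet name to the value assigned to it, hierarchical values are written as " > " separated paths, and the class and method names are invented. It counts how many items fall under each value of a facet, crediting every level of a hierarchical path, which is the kind of count a faceted UI shows next to each value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of facet values as properties assigned TO items.
class FacetCounts {

    // Counts items per facet value; a hierarchical value such as
    // "Italian > North Italian" is credited to "Italian" as well.
    static Map<String, Integer> countValues(List<Map<String, String>> items, String facet) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> item : items) {
            String value = item.get(facet);
            if (value == null) {
                continue; // this item has no value for the facet
            }
            String path = "";
            for (String part : value.split(" > ")) {
                path = path.isEmpty() ? part : path + " > " + part;
                counts.merge(path, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> recipes = new ArrayList<>();
        Map<String, String> risotto = new HashMap<>();
        risotto.put("Cuisine", "Italian > North Italian > Milan");
        risotto.put("Course", "Main");
        recipes.add(risotto);
        Map<String, String> crepes = new HashMap<>();
        crepes.put("Cuisine", "French");
        crepes.put("Course", "Dessert");
        recipes.add(crepes);
        System.out.println(countValues(recipes, "Cuisine"));
    }
}
```

Note that nothing is stored "in" a facet: the facet is only a key on each item, and the counts are derived from the items.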

Marti Hearst at UC Berkeley is one of the leading experts on faceted search systems. She led the design of the Flamenco faceted search system. At CHI 2006 in Montreal, Marti led a course with Preston Smalley and Corey Chandler from eBay (the San Dimas Project designers) entitled "Faceted Metadata for Information Architecture and Search".
The main objective of the course is to instruct attendees about how to integrate navigation and search for large collections in a seamless, flexible manner that helps users find things quickly and browse items comfortably...The instructors have designed an approachable, reproducible methodology for the design of highly usable, highly searchable information-centric web sites.
The goals of these systems are outlined well in Marti's 2006 paper Design Recommendations for Hierarchical Faceted Search Interfaces from the Faceted Search Workshop at SIGIR 2006:
...the overarching design goals are to support flexible navigation, seamless integration with directed (keyword) search, fluid alternation between refining and expanding, avoidance of empty results sets, and at all times retaining a feeling of control and understanding.
eBay Express
In the CHI course, Preston and Corey present eBay Express as a new model for a state-of-the-art faceted search system. They outline a series of lessons learned and design pitfalls to avoid. Here are the main lessons they walk you through:
  • "Parsing" feels natural to users (and the text in the search box is not sacred)
  • Controls placed along the top of the page are used more than when on the left side.
  • People browse using the facets more when they are not familiar with the domain
  • Users stop using refinements when a) not useful, and b) item count low enough
  • Prominently showing 4 facets is sufficient (but prioritization is important)
  • Shifting columns doesn't disturb people
  • Truncated list of values per facet is okay (users know how to access the rest)
  • Showing sample values helps users understand facets and can expose breadth
  • Users often want to select multiple facet labels and are pleased when they can (treated as an OR by search engine)
  • Traditional breadcrumbs don't work here
  • Users understand the idea of applying and removing facets using this modified breadcrumb without instruction
The course is very rich, and the above outline is only a cursory glimpse into the wealth of wisdom they walk you through.
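The multi-select lesson is easy to sketch in code. The course notes only say that multiple labels selected within a facet are treated as an OR; combining different facets with AND is my assumption about the usual behavior, and the item representation and names below are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: OR within a facet, AND across facets (the AND is assumed).
class FacetFilter {

    // Keeps items that match at least one selected value in every facet
    // that has selections. Items lacking a selected facet are dropped.
    static List<Map<String, String>> filter(List<Map<String, String>> items,
                                            Map<String, Set<String>> selections) {
        List<Map<String, String>> matches = new ArrayList<>();
        for (Map<String, String> item : items) {
            boolean keep = true;
            for (Map.Entry<String, Set<String>> selection : selections.entrySet()) {
                String value = item.get(selection.getKey());
                if (value == null || !selection.getValue().contains(value)) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                matches.add(item);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Map<String, String>> cameras = List.of(
                Map.of("Company", "Canon", "Resolution", "5 Megapixel"),
                Map.of("Company", "Nikon", "Resolution", "5 Megapixel"),
                Map.of("Company", "Sony", "Resolution", "7 Megapixel"));
        Map<String, Set<String>> selections = Map.of(
                "Company", Set.of("Canon", "Nikon"),
                "Resolution", Set.of("5 Megapixel"));
        System.out.println(filter(cameras, selections).size() + " matching items");
    }
}
```

Removing a facet from the modified breadcrumb then amounts to deleting its entry from the selections map and filtering again.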

There is a good review of the course by Jessyca Frederick, a developer from ShopZilla who attended the course.

There are a lot of hard technical details to dig into, for starters:
  • How do you parse user queries intelligently and match query terms to facets?
    e.g. translate the query: 5 MP Cannon PowerShot A530 to Company: Canon, Resolution: 5 Megapixel, Series: Powershot, Model: A530?
  • How do you pick what values of facets to display when the list of values is very large?
  • How do you efficiently integrate relational database querying with keyword search using inverted indexing systems? Do you even have to?
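To give a feel for the first question, here is a deliberately naive sketch of query-to-facet parsing. It is not how eBay does it (I don't know how they do it): it assumes a small hand-built vocabulary that maps known tokens, including the "Cannon" misspelling, to a facet and canonical value, plus one regular expression for megapixels. Anything left over is kept as plain keywords. All names and vocabulary entries are made up.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: dictionary lookup plus one pattern, nothing smarter.
class QueryParser {

    // Assumed vocabulary: lower-cased token -> {facet name, canonical value}.
    static final Map<String, String[]> VOCABULARY = Map.of(
            "canon", new String[] {"Company", "Canon"},
            "cannon", new String[] {"Company", "Canon"}, // common misspelling
            "powershot", new String[] {"Series", "Powershot"},
            "a530", new String[] {"Model", "A530"});

    // Matches resolutions such as "5 mp" or "7mp" in a lower-cased query.
    static final Pattern MEGAPIXELS = Pattern.compile("\\b(\\d+)\\s*mp\\b");

    static Map<String, String> parse(String query) {
        Map<String, String> facets = new LinkedHashMap<>();
        String rest = query.toLowerCase(Locale.ROOT);
        Matcher megapixels = MEGAPIXELS.matcher(rest);
        if (megapixels.find()) {
            facets.put("Resolution", megapixels.group(1) + " Megapixel");
            rest = megapixels.replaceFirst(" ");
        }
        StringBuilder keywords = new StringBuilder();
        for (String token : rest.trim().split("\\s+")) {
            String[] facetValue = VOCABULARY.get(token);
            if (facetValue != null) {
                facets.put(facetValue[0], facetValue[1]);
            } else if (!token.isEmpty()) {
                if (keywords.length() > 0) {
                    keywords.append(' ');
                }
                keywords.append(token);
            }
        }
        if (keywords.length() > 0) {
            facets.put("Keywords", keywords.toString()); // unmatched terms
        }
        return facets;
    }

    public static void main(String[] args) {
        System.out.println(parse("5 MP Cannon PowerShot A530"));
    }
}
```

The hard parts are exactly what this leaves out: building and maintaining the vocabulary, handling ambiguous tokens, and deciding what to do when a term matches values in more than one facet.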
That's all for now, although there is certainly a lot more to be said on faceted search systems, including looking at the software that can power these interfaces. On that note, next up in the series is a look at eBay's search infrastructure: "Voyager" (no, not the Star Trek series).