Is it economically feasible for a new vertical search engine to build its own web crawler and a multi-terabyte data storage system? That cost is a sizable barrier to entry into the vertical search arena, and it is one of the main reasons there are still so few vertical search engines. Alexa hopes to change that by offering a hosted platform on which companies and users can create their own custom search engines, or perhaps just extract some metadata.
From Alexa's website:
The Alexa Web Search Platform provides public access to the vast web crawl collected by Alexa Internet. Users can search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools. Alexa provides compute and storage resources that allow users to quickly process and store large amounts of web data. Users can view the results of their processes interactively, transfer the results to their home machine, or publish them as a new web service.

My initial question was: How did they get around the legal problems associated with this? After all, they are essentially charging for access to my and other users' copyrighted content.
They got around this the same way IBM did with WebFountain. I had the opportunity to talk to the head of IBM's WebFountain project at a search engine conference earlier in the year. One of the reasons that WebFountain was never a runaway hit was that they couldn't provide direct access to the content. The reason the IBM employee gave was that it wasn't legal to charge for access to others' copyrighted works. Imagine me charging you to download all of the Star Trek episodes as a service off my website. IBM got around this by providing access to a derivative work: metadata. WebFountain aggregated the knowledge of the web to create a new product that they could sell. IBM would mine the web for you and provide answers to your business intelligence questions. However, IBM had to write the software to run on its cluster. You paid per question, and each question was expensive because it required custom programming.
The Alexa platform is the next evolution of this business model. I call it WebFountain 2.0. Instead of approaching IBM and asking them to design a program to answer your question, now you can create your own program and have Amazon, err Alexa, run it.
So, what exactly is the platform they are offering? According to the FAQ:
This store contains both the raw document content and the metadata extracted by Alexa's internal processes. All Platform users have access to the data in this store... Alexa maintains three Web snapshots in the Data Store. Each Web snapshot represents two months of crawling activity. Each snapshot contains about 100 Terabytes of uncompressed data so at any time, the Data Store contains 200 - 300 Terabytes of data.

In other words, they will give you access to run programs against their 5 billion page web crawl. The Alexa Web Platform allows you to run code on their cluster to process web data. At the end you can download your results (metadata) or you can publish your own private search engine hosted on their cluster. I don't think you can actually download the content directly from the repository.
The pricing model is pretty simple and straightforward: you pay for CPU time, bandwidth, and storage space.
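To make the three-part pricing concrete, here is a rough sketch of how a usage-based bill along those lines might be estimated. The rates, the billing units (CPU hours, gigabytes, gigabyte-months), and the names Rates and estimate_cost are all my own placeholders for illustration, not Alexa's actual prices or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Placeholder unit prices in dollars; NOT Alexa's actual rates."""
    per_cpu_hour: float
    per_gb_transferred: float
    per_gb_month_stored: float


def estimate_cost(rates, cpu_hours, gb_transferred, gb_stored, months=1.0):
    """Sum the three usage-based charges: CPU time, bandwidth, and storage."""
    cpu = rates.per_cpu_hour * cpu_hours
    bandwidth = rates.per_gb_transferred * gb_transferred
    storage = rates.per_gb_month_stored * gb_stored * months
    return cpu + bandwidth + storage


# Made-up rates and usage, purely for illustration.
example_rates = Rates(per_cpu_hour=1.0, per_gb_transferred=1.0, per_gb_month_stored=1.0)
example_total = estimate_cost(
    example_rates, cpu_hours=10, gb_transferred=5, gb_stored=20, months=2
)
```

Plug in the real published rates and your own expected usage to get a meaningful number; the point is only that the total is a sum of three independent charges.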
What's innovative about the Alexa platform over WebFountain comes down to two things: 1) you can write your own code against the system, and 2) the end product can be a private or custom search engine instead of just some metadata.
We'll see what happens!