Thursday, April 17

Building a Web-Scale Recommendation Engine

Paul Lamere wrote about web-scale recommendation engine projects.

He talks about Project Aura and Project Caroline.

Project Aura is a web-scale, open, hybrid recommendation system that uses social data (the wisdom of the crowds) combined with the 'aura' of information extracted directly from content or mined from the web to make recommendations. By combining content-based methods with social methods, Project Aura can avoid many of the 'cold start' problems that plague traditional collaborative filtering recommenders, while providing a way to offer explainable recommendations.
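To make the hybrid idea concrete, here is a minimal sketch in Python. It is not Project Aura's code: the function names, the dict-based data layout, and the `alpha` blending weight are all assumptions made for illustration. It blends a social similarity (who listens to what) with a content similarity (tags), falls back to content alone when an item has no social data yet, and reports the shared tags as a simple explanation.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_score(seed, candidate, listeners, tags, alpha=0.5):
    """Blend social and content similarity (hypothetical weighting scheme).

    listeners: item -> {user: weight}   (social data)
    tags:      item -> {tag: weight}    (content data)
    alpha:     assumed weight given to the social signal
    """
    content = cosine(tags.get(seed, {}), tags.get(candidate, {}))
    # Cold start: with no social data on either side, rely on content alone.
    if not listeners.get(seed) or not listeners.get(candidate):
        return content
    social = cosine(listeners[seed], listeners[candidate])
    return alpha * social + (1 - alpha) * content


def explain(seed, candidate, tags):
    """Return the tags both items share, as a simple explanation."""
    return sorted(set(tags.get(seed, {})) & set(tags.get(candidate, {})))


# Toy data: "new" has tags but no listeners yet.
listeners = {"a": {"u1": 1, "u2": 1}, "b": {"u1": 1, "u2": 1}}
tags = {
    "a": {"rock": 1.0, "indie": 1.0},
    "b": {"rock": 1.0},
    "new": {"rock": 1.0, "indie": 1.0},
}
```

With this toy data, `hybrid_score("a", "new", listeners, tags)` still produces a usable score from tags alone, and `explain("a", "new", tags)` lists the tags behind it. A pure collaborative filter would have nothing to say about "new" until someone listened to it.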

Project Caroline is a platform that allows you to programmatically control all of the infrastructure resources you might need in building a horizontally scaled system. You can allocate and configure databases, file systems, private networks (VLANs), load balancers, and a lot more, all dynamically, which makes it easy to flex the resources your application uses up and down as required.

From Paul's post:
Now that we have Project Aura running on top of Project Caroline - I'm getting used to the idea of having 60 web crawling threads feeding a 16-way datastore that is being continually indexed by our search engine - and all of this is running across some number of processors - I don't really know how many, and I don't really care.
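As a rough illustration of that shape (many crawler threads feeding a partitioned datastore), here is a small Python sketch. It is not taken from Paul's post or from Project Aura: `ShardedStore`, `crawl`, and `run_crawl` are hypothetical names, the shards are plain in-memory dicts, and the fetch is a stand-in that does no network I/O. Only the numbers 60 and 16 come from the quote.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 16    # "16-way datastore" from the quote
NUM_CRAWLERS = 60  # "60 web crawling threads" from the quote


class ShardedStore:
    """Toy in-memory store partitioned by a stable hash of the key."""

    def __init__(self, num_shards=NUM_SHARDS):
        self._shards = [{} for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self._shards)

    def put(self, key, value):
        i = self.shard_for(key)
        with self._locks[i]:
            self._shards[i][key] = value

    def get(self, key):
        i = self.shard_for(key)
        with self._locks[i]:
            return self._shards[i].get(key)

    def size(self):
        return sum(len(shard) for shard in self._shards)


def crawl(url):
    """Stand-in for a real fetch; returns fake page content."""
    return "content of " + url


def run_crawl(urls, store, workers=NUM_CRAWLERS):
    """Fetch every URL on a thread pool and write each result to its shard."""

    def worker(url):
        store.put(url, crawl(url))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(worker, urls))
```

For example, `run_crawl(["http://example.com/%d" % i for i in range(200)], ShardedStore())` spreads 200 pages over the 16 shards. The caller never names a shard or a thread, which is a small-scale version of the "I don't really care how many processors" point.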


  1. Anonymous, 9:29 AM EDT

    I couldn't find the link to P. Aura. Am I missing it?
    Closed source?

    N nodes fetching, indexing, etc. - sounds like Nutch+Hadoop, really.

  2. From what I know, Aura is a research project at Sun research labs. Therefore, you won't find much information on it, at least not yet.

    Their example is pretty simple. From what I understand, which is not too much, I admit, I see P. Caroline as very different from Nutch and Hadoop. First, Nutch is only a distributed search engine. Hadoop is a distributed file system and a platform for running map-reduce jobs. Project Caroline is neither of these.

    Project Caroline is a platform for building applications that build on and use scalable services. In that respect, it is closer to Amazon's EC2 platform. It allows developers to use a distributed cluster to run tasks or access data (say, from a database) without having to worry about the details of which machines are running what. These annoying concerns are abstracted away behind the service interfaces.

    I hope I got that right.