Wednesday, April 21

Bixo Labs Makes Web Crawl Data Available

Bixo Labs announced today that the first data from the Public Terabyte Dataset Project is available. The project uses the Bixo crawler to collect a large set of webpages and make them publicly available.

A sample of the data is available on Amazon S3, at bixolabs-ptd-demo, for anyone who wants to get started. The data is serialized with Avro, and the very simple schema is available on their website.
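For anyone who wants to poke at the sample, here is a minimal Python sketch of listing what is in it. It assumes that bixolabs-ptd-demo is an S3 bucket name and that the bucket allows anonymous listing; neither is confirmed by the announcement. The helper names (bucket_url, parse_keys, list_bucket) are made up for illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: "bixolabs-ptd-demo" is an S3 bucket that permits anonymous listing.
BUCKET = "bixolabs-ptd-demo"

# XML namespace used by S3 bucket listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def bucket_url(bucket):
    """Build the virtual-hosted-style URL for a bucket."""
    return f"https://{bucket}.s3.amazonaws.com/"


def parse_keys(listing_xml):
    """Extract object keys from an S3 ListBucketResult XML document."""
    root = ET.fromstring(listing_xml)
    return [c.findtext(f"{S3_NS}Key") for c in root.iter(f"{S3_NS}Contents")]


def list_bucket(bucket=BUCKET):
    """Fetch the first page of the bucket listing and return its object keys."""
    with urllib.request.urlopen(bucket_url(bucket)) as resp:
        return parse_keys(resp.read())
```

Calling list_bucket() would return the object keys from the first page of the listing, if the bucket is reachable. Reading the files themselves would additionally need an Avro library and the schema from the project's website, which this sketch does not cover.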

I look forward to seeing more from the project!

