Jeff's Search Engine Caffè
Information Retrieval research and search engine development discussion.
Wednesday, April 21
Bixo Labs Makes Web Crawl Data Available
Bixo Labs announced that the first data from the Public Terabyte Dataset Project is available. The project uses the Bixo crawler to collect a large set of web pages and make them publicly available.
A sample of the data to get started is available on Amazon S3 in the bixolabs-ptd-demo bucket. The data is serialized with Avro, and the very simple schema is available on their website.
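Avro container files normally carry their schema in the file header, so it should be possible to peek at a downloaded sample without installing an Avro library. Here is a minimal sketch, assuming the sample files follow the standard Avro object container layout (the `Obj\x01` magic bytes, then a metadata map with an `avro.schema` entry). The function names are my own, not part of any Avro API, and I have not checked this against the actual files in the bucket.

```python
import io
import json

MAGIC = b"Obj\x01"  # 4-byte marker that opens an Avro object container file


def read_long(stream):
    """Decode one Avro long: a zigzag-encoded, variable-length integer."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a variable-length integer")
        value = byte[0]
        accum |= (value & 0x7F) << shift
        if not value & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' and 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside a byte string")
    return data


def read_header_metadata(stream):
    """Return the header's metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-length block ends the map
            break
        if count < 0:  # negative count is followed by the block's byte size
            count = -count
            read_long(stream)  # byte size is not needed here
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def embedded_schema(stream):
    """Parse the JSON schema stored under the 'avro.schema' metadata key."""
    return json.loads(read_header_metadata(stream)["avro.schema"])
```

To try it on a file fetched from the bucket, open it in binary mode and pass the file object to `embedded_schema`, for example `with open("sample.avro", "rb") as f: print(embedded_schema(f))`, where `sample.avro` is a hypothetical local file name. For reading the actual records, a full Avro library is the better choice.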
I look forward to seeing more from the project!