Monday, January 26

Aperture framework for content crawling and conversion

Aperture is a framework for crawling, extracting, and storing data from different systems for indexing and other processing. It provides crawlers for different content systems and content converters that extract text from a variety of common file formats. It writes the extracted data as RDF for storage and indexing.
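To make the crawl, extract, and store-as-RDF flow concrete, here is a minimal sketch in Python. It is not Aperture's API (Aperture is a Java framework), and every function name below is hypothetical. It assumes plain-text files only, uses the Dublin Core `title` predicate for the file name, and uses a made-up `http://example.org/vocab#plainText` predicate for the extracted text.

```python
import os
from pathlib import Path

DC_TITLE = "http://purl.org/dc/elements/1.1/title"
# Hypothetical predicate for the extracted text; not part of any real vocabulary.
PLAIN_TEXT = "http://example.org/vocab#plainText"


def escape_literal(value):
    """Escape a string so it can be written as an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def crawl(root):
    """Crawler step: yield the path of every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def extract_text(path):
    """Extractor step: return the text of a .txt file, or None if unsupported."""
    if not path.lower().endswith(".txt"):
        return None
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def to_ntriples(path, text):
    """Storage step: describe one file as two N-Triples statements."""
    subject = Path(path).resolve().as_uri()
    title = escape_literal(os.path.basename(path))
    body = escape_literal(text)
    return [
        f'<{subject}> <{DC_TITLE}> "{title}" .',
        f'<{subject}> <{PLAIN_TEXT}> "{body}" .',
    ]


def index(root):
    """Run the whole pipeline and yield RDF statements for supported files."""
    for path in crawl(root):
        text = extract_text(path)
        if text is None:
            continue  # a real framework would pick a converter per file format
        yield from to_ntriples(path, text)
```

Calling `print("\n".join(index("some/directory")))` would emit one pair of statements per `.txt` file found. A real framework replaces `extract_text` with a set of format-specific converters, which is the part that libraries such as Tika also cover.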

Found via the Search and Text Analysis presentation from Grant.

See also the Lucene Tika incubator project for extracting text and structured data from a variety of formats.

3 comments:

  1. Curious to know what real world applications can use Aperture.

    Also, why not use libraries like POI / Tika for metadata and content extraction?

    The only evident difference is that it provides a query interface supporting SPARQL, et al.

  2. Anonymous, 9:18 AM EST

    Hi Jeff,
    This is Gyanesh. I want to access the data and metadata of any type of file, so I am moving to Aperture. I am new to Aperture; can you guide me on how to extract data and metadata from any type of file? I have seen some of the given examples, but they generate the data/metadata in RDF or XML format. How can I get the data/metadata directly, without any encoding, for example DATA.creator or DATA.title? One more thing: I have not found any DATA class for extraction.
    Please help me, it's urgent.

    Thanks in advance.

    Gyanesh Gupta
    gupta.gyanesh@gmail.com

  3. Anonymous, 8:53 AM EDT

    Hi Jeff,

    I want to crawl binary files in a Linux filesystem, and I found Aperture, which runs successfully on Windows. But when I run one of its example applications, say crawlerwindow.sh, on a Fedora Core 4 system, it gives me a Java classpath error like this:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/semanticdesktop/aperture/examples/filecrawler/CrawlerFrame

    As I am new to Aperture, maybe I am missing something, like setting some path. Please help me, it's urgent!

    Thanks in advance.

    Priyanka
    prisood@gmail.com
