Aperture is a framework for crawling, extracting, and storing data from different systems for indexing and other processing. It provides crawlers for a range of content systems and content converters that extract text from a variety of common file formats. It writes the extracted data as RDF for storage and indexing.
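To make the crawl, extract, and store-as-RDF flow concrete, here is a minimal Python sketch of the same idea. It does not use Aperture or its API. The function names (`crawl`, `extract_text`, `index`) and the `http://example.org/schema#` predicate URIs are hypothetical, chosen only for illustration, and the "extractor" handles only plain `.txt` files.

```python
import os
from pathlib import Path

# Hypothetical predicate URIs, for illustration only (not Aperture's vocabulary).
PRED_TEXT = "<http://example.org/schema#plainTextContent>"
PRED_SIZE = "<http://example.org/schema#byteSize>"
XSD_INTEGER = "<http://www.w3.org/2001/XMLSchema#integer>"


def crawl(root):
    """Crawler step: yield the path of every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def extract_text(path):
    """Extractor step: read .txt files; return None for anything else."""
    if not path.lower().endswith(".txt"):
        return None
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read()


def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def index(root):
    """Run the pipeline and yield one N-Triples line per extracted fact."""
    for path in crawl(root):
        text = extract_text(path)
        if text is None:
            continue  # no extractor for this file type
        subject = "<" + Path(path).resolve().as_uri() + ">"
        yield f'{subject} {PRED_TEXT} "{escape_literal(text)}" .'
        yield f'{subject} {PRED_SIZE} "{os.path.getsize(path)}"^^{XSD_INTEGER} .'
```

Calling `"\n".join(index("some/directory"))` gives N-Triples text that could be loaded into an RDF store or passed to an indexer. A real framework would add more source types and format-specific extractors behind the same three steps.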
Found via the Search and Text Analysis presentation from Grant.
See also the Lucene Tika incubator project for extracting text and structured data from a variety of formats.