Lucene Desktop Notes

Here are some notes about Lucene Desktop, Kevin Burton's latest crazy concoction.

PDFBox library for PDF text extraction (anything for doing thumbnails?). BSD license.

TextMining.org for Word documents. Apache License.

OpenOffice.org also has an API.

Java 1.4 for creating image thumbnails? GIF, PNG, JPEG, BMP supported...
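
A rough sketch of the thumbnail idea using javax.imageio (the package that arrived with Java 1.4) and plain Java2D scaling. The class name, method names and the ".thumb.png" naming scheme are made up for illustration. Which formats can actually be read depends on the ImageIO plugins in the JRE at hand, so treat the format list above as something to verify.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper; nothing here comes from an existing Lucene Desktop codebase.
class ThumbnailSketch {

    // Scale src so that neither side exceeds maxSide, keeping the aspect ratio.
    // Images that are already small enough are copied at their original size.
    static BufferedImage scale(BufferedImage src, int maxSide) {
        int longest = Math.max(src.getWidth(), src.getHeight());
        double ratio = Math.min(1.0, (double) maxSide / longest);
        int w = Math.max(1, (int) Math.round(src.getWidth() * ratio));
        int h = Math.max(1, (int) Math.round(src.getHeight() * ratio));

        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }

    // Read an image file and write a PNG thumbnail into cacheDir.
    // The output file name is an assumed convention, not a settled design.
    static File writeThumbnail(File image, File cacheDir, int maxSide) throws IOException {
        BufferedImage src = ImageIO.read(image);
        if (src == null) {
            // ImageIO.read returns null when no registered reader understands the file.
            throw new IOException("No ImageIO reader for " + image);
        }
        File out = new File(cacheDir, image.getName() + ".thumb.png");
        ImageIO.write(scale(src, maxSide), "png", out);
        return out;
    }
}
```

Something like ThumbnailSketch.writeThumbnail(file, cacheDir, 128) would be called once per image at index time. Writing PNG regardless of the input format keeps the cache uniform.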

POPsearch, a similar idea, already implemented in C. Lots of features.

Index should have basic file metadata structures for all files, plus additional fields for each file type (images would have width and height, for example).
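
One way to picture the "common fields plus per-type extras" layout, sketched with plain maps so it does not commit to any particular Lucene API. All field names here (path, name, size, modified, type, width, height) are guesses at what the basic metadata might be.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the field layout; field names are assumptions.
class FileFieldsSketch {

    // Fields that every indexed file would carry, whatever its type.
    static Map<String, String> commonFields(File f) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", f.getPath());
        fields.put("name", f.getName());
        fields.put("size", Long.toString(f.length()));
        fields.put("modified", Long.toString(f.lastModified()));
        return fields;
    }

    // Image-specific fields layered on top of the common set.
    static Map<String, String> imageFields(File f, int width, int height) {
        Map<String, String> fields = commonFields(f);
        fields.put("type", "image");
        fields.put("width", Integer.toString(width));
        fields.put("height", Integer.toString(height));
        return fields;
    }
}
```

Each map entry would then become one field on the indexed document, with other file types adding their own extras the same way.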

Field names should be lower case and human type-able for advanced queries.
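
A small sketch of what "lower case and type-able" could mean in practice, so that a query such as width:640 works without shift keys or quoting. The exact rule (letters and digits only, starting with a letter) is an assumption, not a decision.

```java
import java.util.Locale;

// Hypothetical naming rule for index fields; the rule itself is an assumption.
class FieldNameSketch {

    // Turn a label like "Last Modified" into a candidate field name
    // by lower-casing it and dropping everything except a-z and 0-9.
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "");
    }

    // True if the name is already in the assumed form: a lower-case letter
    // followed by lower-case letters or digits.
    static boolean isTypeable(String name) {
        return name.matches("[a-z][a-z0-9]*");
    }
}
```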

Easy install should be a goal.

Windows has Google Desktop -- who would switch from that? Mac OS X Tiger will have Spotlight -- who would switch from that? But there is a niche on pre-Tiger Macs and Linux.

How do you index email?

Jetty would make a good embedded web server and servlet engine; Tomcat blows and is way too big.

Index home directory only? What about privacy? Probably needs to be configurable.
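
A possible shape for the configurable scope: a list of roots to index (defaulting to the home directory) and a list of excluded directories for the private stuff. The property keys index.roots and index.excludes are invented for this sketch, and the matching is a plain string prefix check with no symlink or case handling.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical config holder; property names are assumptions.
class IndexScopeSketch {
    final List<String> roots = new ArrayList<>();
    final List<String> excludes = new ArrayList<>();

    IndexScopeSketch(Properties p) {
        // With no explicit roots, fall back to the user's home directory.
        String home = System.getProperty("user.home");
        split(p.getProperty("index.roots", home), roots);
        split(p.getProperty("index.excludes", ""), excludes);
    }

    // Values are lists separated by the platform path separator (":" or ";").
    private static void split(String value, List<String> out) {
        for (String part : value.split(File.pathSeparator)) {
            if (!part.trim().isEmpty()) {
                out.add(part.trim());
            }
        }
    }

    // Excludes win over roots; anything outside every root is skipped.
    boolean shouldIndex(String path) {
        for (String ex : excludes) {
            if (isUnder(path, ex)) return false;
        }
        for (String root : roots) {
            if (isUnder(path, root)) return true;
        }
        return false;
    }

    // True if path is dir itself or lies somewhere below it.
    private static boolean isUnder(String path, String dir) {
        String prefix = dir.endsWith(File.separator) ? dir : dir + File.separator;
        return path.equals(dir) || path.startsWith(prefix);
    }
}
```

Because it is driven by a Properties object, the same settings could be read from a file or filled in from a form in the web interface.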

All config should be possible through the web interface.

File formats: HTML, text, Word, OpenOffice.org, PDF, MP3, images (GIF, BMP, JPEG, PNG -- create thumbnails in the cache directory)...

Meta-file formats (must index inside the file): mbox, Microsoft mailbox formats, maildir (sorta).
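
To illustrate "indexing inside the file" for mbox: the usual convention is that each message starts with a line beginning with "From " (with the trailing space) at the start of the file or after a blank line. The sketch below splits on that and nothing more. It does not parse headers, does not undo ">From " quoting, and the class name is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical mbox splitter; a sketch of the boundary rule only.
class MboxSplitSketch {

    // Returns one string per message, without the "From " separator line.
    static List<String> split(Reader in) throws IOException {
        List<String> messages = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        StringBuilder current = null;
        boolean prevBlank = true; // the start of the file counts as a boundary
        String line;
        while ((line = reader.readLine()) != null) {
            if (prevBlank && line.startsWith("From ")) {
                // A new message begins: flush the previous one, if any.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
            prevBlank = line.isEmpty();
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```

Each returned message could then be indexed as its own document, with the mbox path plus a message number as its location. Maildir would not need this step since it already stores one message per file.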

Need some spiffy 16x16 icons to represent file types for above.