Weblogs for Mailinglists
Here are the features I'd like to implement in this software:
- One weblog post per thread.
- Use JWZ's threading algorithm and inline threading like Don Marti's linux-elitists hack (here's an example)
- Replies to the original message will be placed "below the fold" of the weblog entry so the thread can grow quite large without the weblog becoming unmanageable.
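To make "one weblog post per thread" concrete, here is a rough sketch of grouping messages by thread root using only the References and In-Reply-To headers. This is not JWZ's full algorithm (it skips subject-based grouping and the placeholder containers for missing messages), and the function names `parent_id` and `group_by_thread` are made up for illustration. It assumes every message carries a Message-ID header.

```python
from email.message import Message


def parent_id(msg):
    """Return the Message-ID this message replies to, or None."""
    # Prefer the last entry in References; fall back to In-Reply-To.
    refs = (msg.get("References") or "").split()
    if refs:
        return refs[-1]
    reply_to = (msg.get("In-Reply-To") or "").split()
    return reply_to[0] if reply_to else None


def group_by_thread(messages):
    """Map each thread's root Message-ID to its messages, in input order."""
    parents = {m["Message-ID"]: parent_id(m) for m in messages}

    def root_of(msg_id):
        seen = set()
        # Walk up the parent chain while the parent is a message we have.
        # The 'seen' set guards against malformed, cyclic reference chains.
        while parents.get(msg_id) in parents and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parents[msg_id]
        return msg_id

    threads = {}
    for m in messages:
        threads.setdefault(root_of(m["Message-ID"]), []).append(m)
    return threads
```

The first message in each group would become the body of the weblog post, and the rest would go below the fold.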
Initially, I want to support MBOX format, because that's what I use. To set up a weblog for a mailing list, you'd subscribe to the mailing list and filter the subscription to an MBOX file which would then be monitored with this tool. You could even use this to make an automatic weblog out of your filtered spam!
One important detail is how to determine which items have been posted to the weblog already and how to post follow-ups to the entry they belong to.
RSS generation will be handled by the weblog tool the software posts to using the Blogger API. However, it may turn out to be easier to simply generate flat HTML and RSS for the weblog. But I'd like to leverage existing tools if I can.
There is some existing work on this:
- LaughingMeme: "RSSifying the Mailing List", "RSSifying the Mailing List, an update", and see also "Revolutionizing the Archive"
- MMRSS: scraping RSS feeds for Mailman
- eGroups RSS feeds
Why is my idea different?
- You can create a weblog for a mailing list that you don't control (normal web etiquette applies)
- You can create a weblog for MBOX folders which aren't really mailing lists (like the spam example, above)
- No screen scraping
- Will leverage existing weblog tools to generate the RSS feed and display the site
- I'm writing it, so it has to look good ;-)
Here are some challenges I foresee:
- Not becoming a mail archiver in its own right. I want to use existing software because it's easier, but the temptation to simply output good looking HTML and metadata-rich RSS feeds will be great.
- Dealing with large archives. For threading to work well, it's easiest to have one MBOX file. However, the limitations of the MBOX format make dealing with large files a pain. Pipermail splits mailing list files by month, but that breaks threads.
- Matching new posts with existing threads and appending them in the weblog software.
- Preserving metadata from the email. This will be difficult, but I am not a semantic web bigot so I don't care that much. I will try, though.
- Reformatting the email to HTML. We'll probably want to preserve monospaced emails, but I'll want to do auto-linking. HTML mail should be displayed as-is. I'd like to colorize quotations like Google Groups.
- A snappy name
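For the reformatting challenge, here is a minimal sketch of turning a plain-text body into HTML that keeps the monospaced layout, auto-links URLs, and marks quoted lines so a stylesheet can colorize them. The CSS class names (`quote1` to `quote3`) and the function names are hypothetical, and the URL pattern is deliberately simple: it will swallow trailing punctuation, for example.

```python
import html
import re

# Deliberately simple URL pattern; an assumption, not a full URL grammar.
URL_RE = re.compile(r"https?://[^\s<>\"]+")
# One or more leading '>' markers, each optionally followed by whitespace.
QUOTE_RE = re.compile(r"^(?:>\s?)+")


def quote_depth(line):
    """Count the leading '>' quote markers on a line."""
    m = QUOTE_RE.match(line)
    return m.group(0).count(">") if m else 0


def render_line(line):
    """Escape a raw line for HTML and wrap any URLs in anchor tags."""
    out = []
    pos = 0
    for m in URL_RE.finditer(line):
        out.append(html.escape(line[pos:m.start()]))
        url = html.escape(m.group(0), quote=True)
        out.append('<a href="{0}">{0}</a>'.format(url))
        pos = m.end()
    out.append(html.escape(line[pos:]))
    return "".join(out)


def body_to_html(text):
    """Render a plain-text email body as a <pre> block."""
    rendered = []
    for line in text.splitlines():
        markup = render_line(line)
        depth = quote_depth(line)
        if depth:
            # Cap the depth so the stylesheet only needs three quote colors.
            markup = '<span class="quote{}">{}</span>'.format(min(depth, 3), markup)
        rendered.append(markup)
    return "<pre>" + "\n".join(rendered) + "</pre>"
```

HTML mail would bypass this entirely and be displayed as-is.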
For challenge #2, we can look into supporting maildir instead, but that's in the future. My mail server doesn't use maildir so I'm not too concerned. Another solution is to periodically delete old threads from the archive file. However, I envision this for smaller mailing lists right now. If it becomes popular enough to be used on a big mailing list, we'll try to fix the problem.
For challenge #3, I have a decent solution. The program will generate the HTML for each thread in a cache directory (one file per thread) and save it. Thread files are generated from what JWZ calls "first principles" every time the script is run. It will also save a database containing file names and sizes (or SHA1 hash, but that's total overkill). If a thread file changes between program invocations, the weblog post corresponding to that thread is updated with the contents of the file.
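The change-detection step could look roughly like this. It is a sketch under a few assumptions: thread files end in `.html`, the pickled database lives outside the cache directory, and the names `file_digest` and `changed_threads` are invented for the example. It uses the SHA1 variant; comparing file sizes would be the lighter option, but a hash also catches edits that leave the length unchanged.

```python
import hashlib
import pickle
from pathlib import Path


def file_digest(path):
    """Return the SHA1 hex digest of a file's contents."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def changed_threads(cache_dir, state_file):
    """Return names of thread files that are new or changed since the last run."""
    state_path = Path(state_file)
    previous = {}
    if state_path.exists():
        with state_path.open("rb") as fh:
            previous = pickle.load(fh)

    # Map file name -> digest for every thread file currently in the cache.
    current = {p.name: file_digest(p) for p in sorted(Path(cache_dir).glob("*.html"))}
    changed = [name for name, digest in current.items() if previous.get(name) != digest]

    # Save the new state so the next invocation compares against this run.
    with state_path.open("wb") as fh:
        pickle.dump(current, fh)
    return changed
```

Each name returned is a thread whose weblog post needs to be created or updated. Only unpickle a state file the program wrote itself, since pickle is not safe on untrusted data.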
Challenge #6 is really the most serious. I need a catchy name!
Writing the software:
I want to write this in Python. It's just the right size for a good learning experience, but not so big that I'll get discouraged. I'm trying to convince Gabe to help me write it and learn some Python-fu at the same time. Hopefully, there's a library for reading MBOX format. I might try to convince Don Marti to lend me his implementation of JWZ's algorithm; otherwise I'll write a Python library to do that. For the database, I'll pickle a dictionary. Finally, there should be a Blogger API module floating around somewhere we can use.
Update: I found an implementation of JWZ's algorithm in Python. There seem to be a couple Blogger API implementations, including PyBlogger from Mark Pilgrim. Surprisingly, no MBOX API yet.
Update 2: Gabe told me that procmail can output maildir format. So I might support maildir first instead of MBOX.
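For what it's worth, current Python versions ship a `mailbox` module in the standard library whose `mbox` and `Maildir` classes share one interface, so the MBOX-or-maildir choice could come down to a single switch. A minimal sketch, where `open_mailbox`, `list_subjects`, and the `kind` parameter are names made up for illustration:

```python
import mailbox


def open_mailbox(path, kind="mbox"):
    """Open an mbox file or a maildir directory through the same interface."""
    # create=False: raise an error instead of silently creating an empty mailbox.
    if kind == "maildir":
        return mailbox.Maildir(path, create=False)
    return mailbox.mbox(path, create=False)


def list_subjects(path, kind="mbox"):
    """Return the Subject header of every message in the mailbox."""
    box = open_mailbox(path, kind)
    try:
        return [msg.get("Subject", "(no subject)") for msg in box]
    finally:
        box.close()
```

Everything downstream (threading, HTML generation) would then work on the message objects and never need to know which format they came from.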
Not sure about the license yet. GPL or BSD most likely.