threadfind

Summary: Efficient locator of mailing list threads.
Category: process
License: CollabNet/Tigris.org Apache-style license
Owner(s): kfogel, madanus

What is ThreadFind?

ThreadFind is an open source, web-based interface for discovering the message and thread URLs for messages on a mailing list, given the "Message-ID" header or other metadata. Secondarily, ThreadFind serves as a subsystem for SummaryDesk, which uses ThreadFind to gather thread information for writing summaries.

ThreadFind development is sponsored by: CollabNet

Why is ThreadFind necessary?

Often, mailing list management software is not integrated well enough with list archivers to know a message's archive URL at the time the message goes out to recipients. As a result, the message carries no header giving its archive URL, so anyone who wants to refer to that message has to browse the archives, find the message, and cut-and-paste the URL. This is a time-consuming process.

ThreadFind solves this problem by keeping a database that maps "Message-ID" headers (which virtually all emails carry) to message archive URLs. Thus, if you have the message itself in hand, you can get its archive URL immediately. Furthermore, when you query the database, ThreadFind by default returns the results in a form that is easy to cut and paste.
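The mapping can be pictured as a single table keyed by Message-ID. The following is a minimal sketch, not ThreadFind's actual schema: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the columns "url" and "thread_url" are assumptions made for illustration (only the "messages" table and its "message_id" column appear in this document).

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the ThreadFind database."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: only the table name 'messages' and the column
    # 'message_id' are taken from the document; the rest is assumed.
    db.execute(
        "CREATE TABLE messages ("
        " message_id TEXT PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " thread_url TEXT)"
    )
    return db


def record_message(db, message_id, url, thread_url=None):
    """Store (or overwrite) the archive URL known for one Message-ID."""
    db.execute(
        "INSERT OR REPLACE INTO messages (message_id, url, thread_url)"
        " VALUES (?, ?, ?)",
        (message_id, url, thread_url),
    )


def find_url(db, message_id):
    """Return the archive URL for a Message-ID, or None if it is unknown."""
    row = db.execute(
        "SELECT url FROM messages WHERE message_id = ?", (message_id,)
    ).fetchone()
    return row[0] if row else None


db = make_db()
record_message(
    db,
    "200004280600.BAA02369@floss.red-bean.com",
    "http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=3",
)
print(find_url(db, "200004280600.BAA02369@floss.red-bean.com"))
```

Keying the table on the Message-ID makes the lookup a single indexed query, which is what lets a message in hand be turned into its archive URL without browsing the archive.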

To use ThreadFind, you would, for example, visit a URL like this:

   http://www.red-bean.com/threadfind/findit?mid=200004280600.BAA02369@floss.red-bean.com

and get a result like this:

   http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=3
   (Thread URL: http://subversion.tigris.org/servlets/BrowseList?list=dev&by=thread&from=42742)
   Message-ID: 200004280600.BAA02369@floss.red-bean.com
   From: Karl Fogel <kfogel@red-bean.com>
   To: dev@subversion.tigris.org
   Cc: sussman@red-bean.com, jimb@red-bean.com
   Subject: Hello, and status update
   Date: Fri, 28 Apr 2000 01:00:39 -0500
   Reply-to: kfogel@red-bean.com

This is just the default output; ThreadFind can be configured (via the request URL) to give other kinds of output.
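A script can do the same thing a browser does. The sketch below builds a "findit" query URL and splits the default output into its parts. It assumes the output always has the shape of the example above (message URL first, an optional "(Thread URL: ...)" line, then "Name: value" header lines); the function names are made up for this illustration and are not part of ThreadFind.

```python
from urllib.parse import urlencode


def findit_url(base_url, message_id):
    """Build a findit query URL for one Message-ID."""
    # Drop the angle brackets that surround the ID in a raw mail header.
    mid = message_id.strip().strip("<>")
    # Keep '@' unescaped so the URL looks like the example in this document.
    return base_url + "?" + urlencode({"mid": mid}, safe="@")


def parse_default_output(text):
    """Split the default findit output into URL, thread URL, and headers.

    Assumes the layout shown in the example above; returns None for
    empty input.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    result = {"url": lines[0], "thread_url": None, "headers": {}}
    prefix = "(Thread URL:"
    for ln in lines[1:]:
        if ln.startswith(prefix) and ln.endswith(")"):
            result["thread_url"] = ln[len(prefix):-1].strip()
        else:
            # Split on the first colon only; header values may contain more.
            name, sep, value = ln.partition(":")
            if sep:
                result["headers"][name.strip()] = value.strip()
    return result


sample = """
   http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=3
   (Thread URL: http://subversion.tigris.org/servlets/BrowseList?list=dev&by=thread&from=42742)
   Message-ID: 200004280600.BAA02369@floss.red-bean.com
   Subject: Hello, and status update
   Date: Fri, 28 Apr 2000 01:00:39 -0500
"""
parsed = parse_default_output(sample)
print(findit_url("http://www.red-bean.com/threadfind/findit",
                 "<200004280600.BAA02369@floss.red-bean.com>"))
print(parsed["url"])
```

To fetch a live result, pass the URL built by findit_url to urllib.request.urlopen and feed the decoded response body to parse_default_output.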

Other functionality: because ThreadFind has to constantly scan the mailing list to keep itself up-to-date anyway, it grabs other important mail headers as well. This allows users to map quickly between, say, "Subject" and URL. Thus ThreadFind duplicates some of the search capabilities found in most archivers; but because this functionality came nearly for free, and allows people to add new kinds of search interfaces, we decided to implement it anyway.

Overview of how it works

You configure ThreadFind to watch a set of mailing lists, via the base URLs of their archives. For each mailing list, it starts at message number 1 and polls upward. Each time it finds a message, it pulls it in via HTTP and parses the page to get the header data. Then it stores the header data and the URL in a database record.

ThreadFind is a self-updating system: no manual update process is required when ThreadFind comes back online after having been offline for a while. It just looks at its config file and the mailing list archives, and brings itself up-to-date automatically.
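The polling and catch-up behavior described above can be sketched in a few lines. This is an illustration of the idea only, not ThreadFind's code: "fetch_headers" is a hypothetical callable standing in for the HTTP fetch and page parse, and a plain dict stands in for the database.

```python
def poll_list(fetch_headers, records, start=1):
    """Poll message numbers upward, resuming after the highest one stored.

    fetch_headers(n) is assumed to return a dict of header data for
    message number n, or None if the archive has no such message yet.
    records maps message number -> header data (a database stand-in).
    Returns the number of messages added on this run.
    """
    # Resume where the previous run stopped; an empty store begins at `start`.
    n = max(records, default=start - 1) + 1
    added = 0
    while True:
        headers = fetch_headers(n)
        if headers is None:
            break  # caught up with the archive, for now
        records[n] = headers
        added += 1
        n += 1
    return added


# A fake archive: message number -> header data.
archive = {
    1: {"Subject": "first"},
    2: {"Subject": "second"},
    3: {"Subject": "third"},
}
records = {}
first_run = poll_list(archive.get, records)

# A new message arrives while the poller is offline...
archive[4] = {"Subject": "fourth"}
# ...and the next run picks up only what is missing.
second_run = poll_list(archive.get, records)
print(first_run, second_run)
```

Because the starting point is derived from what is already stored, no separate bookkeeping is needed to recover after downtime.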

Starting and stopping is done through the threadfind-ctl.py program, using the commands "threadfind-ctl.py -c live.cfg start" and "threadfind-ctl.py -c live.cfg stop", respectively. You can also use "threadfind-ctl.py -c live.cfg restart" to get it to reread its configuration file, which is where you add new mailing lists, or tweak the configurations for existing lists. Run "threadfind-ctl.py -c live.cfg" with no other arguments to get a status printout of the instance.
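If you drive ThreadFind from another script (a cron job or a deploy step, say), a thin wrapper keeps the command lines in one place. This is a sketch under assumptions: it supposes threadfind-ctl.py is on your PATH, and since the program's exit status is not documented here, the wrapper just hands it back to the caller.

```python
import subprocess

ACTIONS = ("start", "stop", "restart")


def ctl_argv(config, action=None, program="threadfind-ctl.py"):
    """Build the argument list for one threadfind-ctl.py invocation.

    With no action, the command asks for a status printout.
    """
    argv = [program, "-c", config]
    if action is not None:
        if action not in ACTIONS:
            raise ValueError("unknown action: %r" % (action,))
        argv.append(action)
    return argv


def run_ctl(config, action=None):
    """Run threadfind-ctl.py and return its exit status unchanged."""
    return subprocess.run(ctl_argv(config, action), check=False).returncode


print(" ".join(ctl_argv("live.cfg", "restart")))
```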

Todo list

ThreadFind is alpha software. Some of the remaining work, in no particular order, is:

  • Develop all the queries that SummaryDesk and others might need.
  • Document those queries.
  • Improve threadfind-ctl.py to detect hangs better.
  • Learn to probe automatically upwards a bit when a gap is found in the message sequence, so that the "lower_bound" option in a list's configuration does not need to be manually tweaked for this case.
  • Add instructions to hide config files from web browsers.
  • Need to start using the bug tracker, instead of todo lists, to keep tabs on open issues :-).
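As a starting point for the gap-probing item above, here is one possible approach, sketched in Python. Nothing like this exists in ThreadFind yet; "exists" is a hypothetical callable that reports whether the archive has a given message number, and the "max_probe" limit is an arbitrary choice.

```python
def probe_past_gap(exists, missing, max_probe=20):
    """Look a bounded distance past a missing message number.

    Returns the first message number after `missing` that exists, or
    None if nothing turns up within `max_probe` numbers (in which case
    `missing` is probably just the end of the list so far).
    """
    for n in range(missing + 1, missing + 1 + max_probe):
        if exists(n):
            return n
    return None


# Messages 4 through 6 are absent from this pretend archive.
present = {1, 2, 3, 7, 8}
print(probe_past_gap(present.__contains__, 4))
```

The bound matters: without it, the poller could not tell a gap in the sequence from the end of the archive.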

How to get ThreadFind working, from scratch

  1. Create the database users.

    Make sure the MySQL users 'threadfindrw' and 'threadfindro' exist, that the first has read/write access to the database named threadfind (which is created in the next step), and that the second has read-only access:

       $ mysql -u root -p
       Password: *******
       mysql> grant all on threadfind.* to threadfindrw@localhost 
                identified by 'SECRET';
       mysql> grant select on threadfind.* to threadfindro@localhost 
                identified by 'SECRET';
       mysql> ^D
       $ 
    
  2. Create the database.

       $ echo "create database threadfind;"  \
           | mysql -u threadfindrw --password=SECRET
       $ cat init-threadfind.sql             \
           | mysql -u threadfindrw --password=SECRET threadfind
    
  3. Make sure that you have Python and the MySQLdb module installed.

  4. Set up your web server to make 'threadfind' CGI-executable:

       Alias /threadfind /home/yourname/path_to_threadfind_working_copy
       <Directory /home/yourname/path_to_threadfind_working_copy>
           Options +Indexes +ExecCGI
           <Files *.cfg>
             Deny from all
           </Files>
           <Files *.sh>
             Deny from all
           </Files>
           <Files findit>
             SetHandler cgi-script
           </Files>
       </Directory>
    

    Also add '.cgi' to the 'AddHandler cgi-script' directive, to enable viewing the default threadfind page.

  5. Copy "example.cfg" to a new file (say, "live.cfg"), and edit the new file to suit your installation. This is the file in which you list the mailing lists to watch.

  6. Take a deep breath, then run "threadfind-ctl.py -c live.cfg start". ThreadFind will now start populating the database with messages. Use threadfind-ctl.py to start, restart, and stop ThreadFind as necessary. When started again, ThreadFind will always pick up where it left off, no matter how long it's been stopped for.

  7. Test the database by pointing your browser at 'threadfind':

       http://localhost/threadfind/findit?mid=BLAHBLAHBLAH
    

    You can also query the database directly, of course:

       $ mysql -u threadfindro -p threadfind
       Password: *******
       mysql> select * from messages where
           ->   message_id = '200004280600.BAA02369@floss.red-bean.com';
       [... see results ...]
       mysql> ^D
       $