I meant to post a couple of days ago when I actually finished the script, but I've been waiting to put the finishing touches on it.
The script will:
- Instantiate a new HyperDatabase object using the same path that will be used when the list is fully installed and operational
- Rebuild all dumbbtrees from the /database directory of the mailing list
- Extract each pickle.dumps() representation of an article by message id from the dumbbtree dictionary
- Generate a new StormArticle entry in the sqlite database by calling addArticle on the HyperDatabase instance
- Generate a new Conversation object if necessary
- Zip the existing dumbbtree structures into old_archives.zip
- Remove the remnants of the dumbbtree marshal files, thereby clearing the directory of all files except for the HyperDatabase sqlite file and the old_archives.zip file
That's it. I can't think of anything else that needs to be done for the migration, so unless I receive any suggestions, I'm happy with the script. I've implemented all of the steps but #5: I'm having trouble finding any pattern that distinguishes articles in a conversation from those that are not, so I'm not sure what to look for when generating a Conversation. I'll track down Barry or dushyant in hopes that they can explain how conversations are set up.
God I LOVE Python’s IDLE.
Anyway, after hacking away for two days I was finally able to mock up a pipeline for the process behind pickling an article. I’m so pumped. Thanks pretty much entirely to the raw data from python.org’s playground Mm2 list archives, I was able to reverse engineer enough of the data to be able to develop some beginning theories on how this script should work.
I had been expecting to find pickles of articles sorted by index (article, thread, subject, etc.) somewhere, and instead discovered that Python's marshal module (http://docs.python.org/library/marshal.html) was able to open the contents of these files whereas the pickle module was not. Remembering that DumbBTree uses marshal, I looked back at the DumbBTree class in HyperDatabase.py and realized that IT is what HyperDatabase saves to the /database folder within the list archives. After reading back through HyperDB, I noticed that dumbbtree instances contain pickle.dumps() string representations of the articles they contain. Since DumbBTrees are dictionaries, the articles can be accessed via pickle.loads(tree[articlemsgid]), which is the core of the algorithm.
Now I need to develop an efficient algorithm for accessing each archive and generating StormArticle instances. Tomorrow's project.
Yeah I know, it's been a while. My bad.
Now, I’ve changed my focus from removing the remnants of dumbbtree from pipermail to writing that upgrade script to migrate the existing Mm2 pickle structures to the Mm3 storm schema. Barry was able to give me the raw list archives from python.org’s “playground” Mm2 implementation and the structure in pipermail.py is making more sense now that I can see how the database files are laid out. I have a general idea of the algorithm but not enough worth posting – I’ll post back as soon as I determine how I’m going to do this. I’m generating the database independently of mailman for now and will later add some testing if needed after I have a working implementation.
Mostly posting to assure any followers that I am not dead and am still working on Mm3.
Pretty sure I finished everything that needed to be done for HyperDatabase:
- Added StormArticle and Conversation models to be used in place of DumbBTrees
- Load sqlite database and create the tables if necessary in HyperDB __init__ method
- Rewrote methods in HyperDB to rely on the Storm API instead of indices/DumbBTrees
- Wrote __init__ and toArticle methods for StormArticle for easy conversion to and from pipermail Articles
- Added extra method(s) for conversation object support for dushyant’s dynamic page generation and ui code
As far as I can tell, the three methods next, first, and clearIndex in HyperDatabase are no longer needed and have been commented out and replaced with a pass. Since those three deal specifically with DumbBTree, I did not think it necessary to try to duplicate the functionality with the database. If this is not the case then I can simply go back and implement them.
It seems that I have two primary objectives now:
- Remove remnants of DumbBTrees and indices from pipermail.py
- Write database upgrade script for Mm2 -> Mm3 procedure
Barry had previously mentioned that marshal needs to be removed from the project. I agree, and since it is only used in DumbBTree and that class is no longer used (I'll probably just remove it), that pretty much takes care of itself.
If there's anything I'm missing, don't hesitate to find me on IRC (dcrodman) or shoot me an email (firstname.lastname@example.org). Feedback and suggestions are welcome.
Currently working on rewriting dushyant's improved implementation of HyperDatabase to support Storm by replacing much of the code with datastore queries. A couple of the requirements we identified:
1. getArticle by msgid
2. get a list of msgids by conversation id, using a Conversation object (a Storm object)
Currently working on implementing getArticle by querying the datastore for an article with the given msgid. In the StormArticle class I'm implementing a toArticle() method that will return a pipermail.Article representation of the StormArticle for external use (i.e., when the user calls getArticle). Also working on replacing many of the existing methods with datastore queries. The main snag I'm predicting is how to transition from article indices/sequences to conversation ids; in that respect it seems that the best approach might be to leave setArticleIndex and similar methods empty (via pass) and add methods for returning the msgids in Conversation objects instead.
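The shape of the two requirements, sketched with the stdlib's sqlite3 in place of Storm (in the real code Storm's store.find/store.get would replace the raw SQL; the dict returned here is a stand-in for a pipermail.Article, and the conversation_id column is an assumption since conversations are still an open question):

```python
import sqlite3

def get_article(conn, msgid):
    """Requirement 1: fetch one article by msgid. The dict stands in for
    StormArticle.toArticle() returning a pipermail.Article."""
    row = conn.execute(
        'SELECT msgid, subject, author FROM article WHERE msgid = ?',
        (msgid,)).fetchone()
    if row is None:
        return None
    return {'msgid': row[0], 'subject': row[1], 'author': row[2]}

def conversation_msgids(conn, conversation_id):
    """Requirement 2: list the msgids belonging to one conversation
    (assumes articles carry a conversation_id, which is hypothetical)."""
    rows = conn.execute(
        'SELECT msgid FROM article WHERE conversation_id = ?',
        (conversation_id,)).fetchall()
    return [r[0] for r in rows]
```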
Also need to implement additional mappings for methods of retrieving an article: article id, message id (already implemented), and the proposed message-id hash (pending, since I believe another student is working on this).
I'll be posting updated versions of my branch as I continue to make progress. Unfortunately the setup script in the mm3-archive branch is suffering from some errors, so until I (or dushyant) fix those I'm unable to run any tests. For now, I'm just carefully hacking.
After working with HyperSQLDatabase for a couple of days, I decided that it would be the best way to interact with the sqlite database for Articles. Here's the Storm object representation of an Article:
from storm.locals import Unicode, DateTime, Float, List

# The StormArticle ORM class represents an Article object
#   archive     : year-Month representation of the article, previously used
#                 for pickle indexing
#   sequence    : Sequence number, unique for each article in a set of archives
#   subject     : Subject
#   datestr     : The posting date, in human-readable format
#   date        : The posting date, in purely numeric format
#   headers     : Any other headers of interest
#   author      : The author's name (and possibly organization)
#   email       : The author's e-mail address
#   msgid       : A unique message ID
#   in_reply_to : If != "", this is the msgid of the article being replied to
#   references  : A (possibly empty) list of msgids of earlier articles
#                 in the thread
#   body        : A list of strings making up the message body

class StormArticle(object):
    """Class to represent an Article in a sqlite database rather than using pickles."""

    __storm_table__ = 'article'

    archive = Unicode()    # parameter passed to all database methods; simplifies lookups
    sequence = Unicode()
    subject = Unicode()
    datestr = DateTime()
    date = Float()
    headers = List()
    author = Unicode()
    email = Unicode()
    msgid = Unicode(primary=True)
    in_reply_to = Unicode()
    references = List()
    body = List()

    def __init__(self, archive, article, subject, author, date):
        self.archive = unicode(archive)
        self.subject = unicode(subject)
        self.author = unicode(author)
        # TBD: Generate float and UTC representations of date
Notice that msgid has a named parameter primary. In order to make searches more efficient, I am using an article’s message id as its primary key so that an Article can be accessed more easily. This seemed the most suitable of the attributes to use as a key since it is guaranteed to be unique and it is a parameter for every method of the DatabaseInterface class.
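Concretely, the sqlite table that the Storm class maps onto would look something like this. This is a sketch only: the column types are my guesses at how Storm serializes the properties, and the list-valued columns (headers, references, body) are omitted since how they serialize is still unclear to me.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE article (
        msgid       TEXT PRIMARY KEY,  -- Unicode(primary=True) in the Storm class
        archive     TEXT,
        sequence    TEXT,
        subject     TEXT,
        datestr     TEXT,
        date        REAL,
        author      TEXT,
        email       TEXT,
        in_reply_to TEXT
    )''')

# Because msgid is the primary key, sqlite keeps an index on it, so a
# primary-key lookup (what Storm's store.get() does) hits a single row
# directly instead of scanning the table.
conn.execute("INSERT INTO article (msgid, subject) VALUES (?, ?)",
             ('<1@example.com>', 'hello'))
row = conn.execute("SELECT subject FROM article WHERE msgid = ?",
                   ('<1@example.com>',)).fetchone()
```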
As I mentioned before, ideally the only class that will work with this ORM mapping is HyperSQLDatabase, in order to make it easier for other classes to use and to modularize it for maintainability. The only snag that's preventing me from finishing the methods of the DatabaseInterface is the transition from the various indices (…-date, …-author, etc.) and directories to a unified database. That and generating the path in which to store the database are what I haven't been able to figure out, so I'm probably going to go hunting for help on the Mm IRC tomorrow.
Further study of the pipermail system revealed a hierarchy of “layers” that I had not previously noticed. At the top layer is the HyperArch class, which contains a HyperDatabase used to retrieve Article information. The next layer, HyperDatabase, contains a DumbBTree that stores pickles of Article objects. The DumbBTree layer itself works with the raw pickle directories to access the stored objects for the requested Articles.
The new scheme I'm working on uses a much simpler mechanism for retrieving these Articles. What I'm proposing is another extension to the Database class called HyperSQLDatabase that will implement the DatabaseInterface through the Database class (like HyperDatabase) and maintain a persistent store of Article objects (like DumbBTree) but without using pickles. This class will work as a substitute for HyperDatabase in the HyperArch class since it is also a Database object, but will operate using Storm transactions and a sqlite database with an ORM layer to retrieve Articles rather than pickles.
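As a skeleton, the substitution looks like the following. The stub Database base class is only here to keep the sketch self-contained (the real one lives in pipermail), the method bodies are placeholders, and the commented-out Storm calls are just the general shape I have in mind.

```python
class Database:
    """Stand-in for pipermail's Database base class."""
    pass

class HyperSQLDatabase(Database):
    """Implements the DatabaseInterface like HyperDatabase, and persists
    Articles like DumbBTree, but via Storm + sqlite instead of pickles."""

    def __init__(self, path):
        self.path = path
        self.store = None  # would hold a storm Store bound to a sqlite URI

    def getArticle(self, archive, msgid):
        # Real version: look up StormArticle by msgid and call toArticle()
        raise NotImplementedError

    def addArticle(self, archive, article):
        # Real version: add a StormArticle to the store and commit
        raise NotImplementedError
```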
I’ll be spending today and as much of tomorrow as I can spare working on implementing this new HyperSQLDatabase to see if it is feasible.