Using mechanize to process protected Plone pages
One of my long-running projects involves a workflow where content is produced in a Plone site, with the data later extracted and processed in various ways (including scripting Scribus to lay out this data in a book). Initially the site where the content was produced wasn't protected, so I could run a simple urllib script to download the content and process it using lxml. A recent change in the workflow security settings meant this script didn't work anymore, and I had to remember how to log in to a Plone site using urllib2. Some Google searches turned up nothing, but I remembered that zope.testbrowser can easily be used to run a programmatic browsing session, complete with cookie support. Trying to install zope.testbrowser standalone in a buildout didn't meet with much success, though, because of some dependency problems (and even after I added those dependencies, it still broke somewhere in zope.app.testing).
The solution was to use just the mechanize package, on top of which zope.testbrowser is built. mechanize has a slightly different (more modern) API and doesn't do as much handholding as zope.testbrowser, but I only need to process one form. In the end my script looks something like this (the asxmllist page is just an XML page that returns a list of URLs to the entities that I want to process):
import os
import os.path

import lxml.etree
import mechanize

loginurl = "http://example.com/login_form"
listurl = "http://example.com/asxmllist"


def run():
    # Store the downloaded entries in ./data, creating it if needed
    curdir = os.getcwd()
    datadir = os.path.join(curdir, 'data')
    if not os.path.exists(datadir):
        os.makedirs(datadir)

    # Log in; the browser object keeps the session cookie for later requests
    b = mechanize.Browser()
    b.open(loginurl)
    b.select_form(nr=1)  # nr is zero-based, so this picks the second form
    b['__ac_name'] = "username"
    b['__ac_password'] = "password"
    b.submit()

    # Fetch the list of entries, then download each one as XML
    b.open(listurl)
    etree = lxml.etree.parse(b.response())
    for entry in etree.xpath('//entry'):
        url = entry.get('url')
        print("Processing " + url)
        e = lxml.etree.parse(b.open(url + '/asxml'))
        id = e.find('id').text
        print("Got entry " + id)
        fpath = os.path.join(datadir, id + '.xml')
        # tostring() returns a byte string, so write in binary mode
        f = open(fpath, 'wb')
        xml = lxml.etree.tostring(e)
        f.write(xml)
        f.close()
        print("Saved " + fpath)


if __name__ == "__main__":
    run()
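To make the parsing step easier to follow without a live Plone site, here is a small offline sketch of the two XML shapes the script seems to expect. The sample documents are made up: only the entry element, its url attribute, and the id child come from the script above, while the root element names, the title element, and the URLs are assumptions. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; the helper names entry_urls and entry_id are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of what the asxmllist page might return.
SAMPLE_LIST = """<entries>
  <entry url="http://example.com/content/first"/>
  <entry url="http://example.com/content/second"/>
</entries>"""

# Hypothetical sample of a single entity's /asxml page.
SAMPLE_ENTRY = """<item>
  <id>first</id>
  <title>First entry</title>
</item>"""


def entry_urls(list_xml):
    """Return the url attribute of every <entry> element in the list."""
    root = ET.fromstring(list_xml)
    return [entry.get("url") for entry in root.iter("entry")]


def entry_id(entry_xml):
    """Return the text of the <id> child of the root element."""
    root = ET.fromstring(entry_xml)
    return root.find("id").text


if __name__ == "__main__":
    # Mirror the loop in the script: one /asxml URL per listed entry
    for url in entry_urls(SAMPLE_LIST):
        print(url + "/asxml")
    # The id becomes the file name under the data directory
    print(entry_id(SAMPLE_ENTRY) + ".xml")
```

If the real pages nest things differently, only the two helper functions would need to change; the login and download loop stay the same.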