Using mechanize to process protected Plone pages

One of my long-running projects involves a workflow where content is produced in a Plone site, with the data later extracted and processed in various ways (including scripting Scribus to lay out this data in a book). Initially the site where the content was produced wasn't protected, so I could run a simple urllib script to download the content and process it using lxml. A recent change in the workflow security settings meant this script didn't work anymore, and I had to remember how to log in to a Plone site using urllib2. A few Google searches turned up nothing, but I remembered that zope.testbrowser can easily be used to run a programmatic browsing session, complete with cookie support. Trying to install zope.testbrowser standalone in a buildout didn't get me very far, though, due to some dependency problems (and even after I satisfied those dependencies, it still broke somewhere in zope.app.testing).

The solution was to use just the mechanize package, on top of which zope.testbrowser is built. mechanize has a slightly different (more modern) API and does less hand-holding than zope.testbrowser, but I only need to process one form. In the end my script looks something like this (the asxmllist page is just an XML page that returns a list of URLs of the entities that I want to process):

import lxml.etree
import os
import os.path
import mechanize

loginurl = "http://example.com/login_form"
listurl = "http://example.com/asxmllist"

def run():
    curdir = os.getcwd()
    datadir = os.path.join(curdir, 'data')
    if not os.path.exists(datadir):
        os.makedirs(datadir)
    
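    # The Browser object keeps the session cookie between requests
    b = mechanize.Browser()
    b.open(loginurl)
    # nr is zero-based, so this selects the second form on the login page
    b.select_form(nr=1)
    b['__ac_name'] = "username"
    b['__ac_password'] = "password"
    b.submit()
    # Now authenticated; fetch the XML listing of entries
    b.open(listurl)
    etree = lxml.etree.parse(b.response())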
    
    for entry in etree.xpath('//entry'):
        url = entry.get('url')
        print("Processing " + url)
        e = lxml.etree.parse(b.open(url + '/asxml'))
        # Avoid shadowing the built-in id()
        entry_id = e.find('id').text
        print("Got entry " + entry_id)
        fpath = os.path.join(datadir, entry_id + '.xml')
        # tostring() returns a byte string, so write in binary mode
        with open(fpath, 'wb') as f:
            f.write(lxml.etree.tostring(e))
        print("Saved " + fpath)

if __name__ == "__main__":
    run()
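
The script relies on two XML shapes that aren't shown above: a listing whose entry elements carry a url attribute, and a per-entry document with an id child under the root. Here is a minimal sketch of that parsing step on its own, so it can be tried without a Plone site. The sample documents, the element names other than entry and id, and the helper names entry_urls and entry_id are all made up for illustration. I've also swapped lxml for the standard library's xml.etree.ElementTree here so it runs without extra packages; the fromstring, iter, get and find calls used below exist in lxml.etree as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing, standing in for the response of the asxmllist page.
LISTING = b"""<entries>
  <entry url="http://example.com/items/first"/>
  <entry url="http://example.com/items/second"/>
</entries>"""

# Hypothetical per-entry document, standing in for <url>/asxml.
ENTRY = b"""<item>
  <id>first</id>
  <title>First item</title>
</item>"""


def entry_urls(listing_xml):
    """Return the url attribute of every <entry> element, in document order."""
    root = ET.fromstring(listing_xml)
    return [entry.get('url') for entry in root.iter('entry')]


def entry_id(entry_xml):
    """Return the text of the <id> child of the root, used as the file name."""
    root = ET.fromstring(entry_xml)
    return root.find('id').text


if __name__ == "__main__":
    for url in entry_urls(LISTING):
        print(url + '/asxml')
    print(entry_id(ENTRY) + '.xml')
```

If the real pages are shaped differently (for instance with the id as an attribute rather than a child element), only these two small functions would need to change.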
