Using mechanize to process protected Plone pages
One of my long-running projects involves a workflow where content is produced in a Plone site, with the data later extracted and processed in various ways (including scripting Scribus to lay out this data in a book). Initially the site where the content was produced wasn't protected, so I could run a simple urllib script to download the content and process it using lxml. A recent change in the workflow security settings meant this script no longer worked, and I had to remember how to log in to a Plone site using urllib2. A few Google searches turned up nothing, but I remembered that zope.testbrowser can easily be used to run a programmatic browsing session, complete with cookie support. Trying to install zope.testbrowser standalone in a buildout didn't meet with much success, though, due to some dependency problems (and even after I covered those dependencies, it still broke somewhere in zope.app.testing).
The solution was to use just the mechanize package, on which zope.testbrowser is built. mechanize has a slightly different (more modern) API and doesn't do as much hand-holding as zope.testbrowser, but I only needed to process one form. In the end my script looks something like this (the asxmllist page is just an XML page that returns a list of URLs to the entities that I want to process):
import os
import os.path

import lxml.etree
import mechanize

loginurl = "http://example.com/login_form"
listurl = "http://example.com/asxmllist"


def run():
    # Store the downloaded entries in ./data, creating it if needed
    curdir = os.getcwd()
    datadir = os.path.join(curdir, 'data')
    if not os.path.exists(datadir):
        os.makedirs(datadir)

    # Log in; the browser object keeps the session cookie for later requests
    b = mechanize.Browser()
    b.open(loginurl)
    b.select_form(nr=1)  # nr is zero-based: this picks the second form on the page
    b['__ac_name'] = "username"
    b['__ac_password'] = "password"
    b.submit()

    # Fetch the list of entries, then download each one as XML
    b.open(listurl)
    etree = lxml.etree.parse(b.response())
    for entry in etree.xpath('//entry'):
        url = entry.get('url')
        print("Processing " + url)
        e = lxml.etree.parse(b.open(url + '/asxml'))
        entry_id = e.find('id').text
        print("Got entry " + entry_id)
        fpath = os.path.join(datadir, entry_id + '.xml')
        # tostring() returns a byte string, so write in binary mode
        with open(fpath, 'wb') as f:
            f.write(lxml.etree.tostring(e))
        print("Saved " + fpath)


if __name__ == "__main__":
    run()
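
The exact shape of the asxmllist document isn't shown here, so the following is only a rough sketch of the URL-extraction step, useful for trying it out without logging in to the site. It assumes the list looks like a root element containing entry elements with a url attribute; the sample data and the entry_urls and asxml_urls helpers are hypothetical, and the standard library's ElementTree stands in for lxml so the snippet has no dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of what the asxmllist page might return (assumed format).
SAMPLE_LIST = """<entries>
  <entry url="http://example.com/items/first"/>
  <entry url="http://example.com/items/second/"/>
  <entry/>
</entries>"""


def entry_urls(xml_text):
    """Return the 'url' attribute of every <entry>, skipping entries without one."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, much like the //entry XPath in the script
    return [e.get('url') for e in root.iter('entry') if e.get('url')]


def asxml_urls(xml_text):
    """Build the per-entry XML URLs, avoiding a doubled slash before 'asxml'."""
    return [url.rstrip('/') + '/asxml' for url in entry_urls(xml_text)]


if __name__ == "__main__":
    for url in asxml_urls(SAMPLE_LIST):
        print(url)
```

Skipping entries that lack a url attribute keeps a single malformed entry from aborting the whole run, which the string concatenation in the original loop would do.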