Wednesday, April 16, 2008

Getting lxml 2 into my gnarly old Tiger; going on ROA

In case anyone might google for something like:
lxml "Symbol not found" _xmlSchematronNewParserCtxt

, I'd like to recount the steps I just took to get lxml 2 (2.1 beta 1, to be precise) up and running on OS X 10.4. (I haven't yet purged Tiger in favour of the shiny Leopard, for reasons of fear, uncertainty and doubt. My impressions of that cat, while pleasurable, have also been quite mixed when running my wicked new iMac, and my funky old iBook, in case you're dying to know.)

In short: you need a fresh MacPorts (1.6, I had 1.3.something):

$ sudo port selfupdate

, and then wipe your current libxml2 (along with any dependent libraries; MacPorts will tell you which):

$ sudo port uninstall libxml2@2.6.23_0

(You might not need the version number; I had tried to get the new libxml2 before this, so there was an "Activating libxml2 2.6.31_0 failed: Image error: Another version of this port" thingy going on.) Then just:

$ sudo port install libxml2 libxslt

and you'll be ready for the final step (after removing from site-packages any misbehaving lxml compiled against the old libxml2):

$ sudo easy_install lxml
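
To check that the rebuilt lxml really picked up the new libraries, a small script along these lines may help. It is only a sketch: the helper names `format_version` and `describe_lxml` are made up for illustration, while the version constants are the ones lxml exposes on `lxml.etree`. A stale build typically fails at the import itself, which is where the "Symbol not found" message shows up.

```python
def format_version(version_tuple):
    """Turn a version tuple such as (2, 6, 31) into the string '2.6.31'."""
    return ".".join(str(part) for part in version_tuple)


def describe_lxml():
    """Report which libxml2 and libxslt the installed lxml is using."""
    try:
        # An lxml built against the wrong libxml2 usually fails right here.
        from lxml import etree
    except ImportError as error:
        return "lxml failed to import: %s" % error
    return "lxml %s, libxml2 %s (compiled against %s), libxslt %s" % (
        format_version(etree.LXML_VERSION),
        format_version(etree.LIBXML_VERSION),
        format_version(etree.LIBXML_COMPILED_VERSION),
        format_version(etree.LIBXSLT_VERSION),
    )


print(describe_lxml())
```

If the libxml2 version in use is older than the one lxml was compiled against, the old library is probably still the one being found at load time.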

But why?

Because lxml 2 is a marvellous thing for churning through XML and HTML with Python. XPath, XSLT and C14N were already there in lxml 1.x (and admittedly in 4Suite as well; Python has had strong XML support for many, many years). But in lxml 2, you also get:
  • CSS selectors
  • much improved "bad HTML" support (including BeautifulSoup integration)
  • a relaxed comparison mechanism for XML and HTML in doctests!
And more, I'm sure.

So, why all this yak shaving tonight? Just this afternoon I put together an lxml- and httplib2-based "REST test" thing. Doesn't sound too exciting? I think it may be, since I use it with the old but still insanely modern doctest module, namely its support for executing plain text documents as full-fledged tests. With a tiny amount of code, this makes it possible to run plain text specifications for ROA (Resource-Oriented Architecture) apps with a single command:

Set up namespaces:

>>> xmlns("http://www.w3.org/1999/xhtml")
>>> xmlns(a="http://www.w3.org/2005/Atom")

We have a feed:

>>> http_get("/feed",
... xpath("/a:feed/a:link[@rel='next-archive']"))
...
200 OK
Content-Type: application/atom+xml;feed
[XPath 1 matched]

Retrieving a document entry with content-negotiation:

>>> http_get("/publ/sfs/1999:175",
... headers={"Accept": "application/pdf"},
... follow=True)
...
303 See Other
Location: /publ/sfs/1999:175/pdf,sv
[Following..]
200 OK
Content-Type: application/pdf
Content-Language: sv
[Body]

That's the early design at least. Note that the above is the test. Doctest checks the actual output and complains if it differs. Declarative and simple. (And done before, of course.)
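
The doctest side of this can be sketched with the standard library alone. This is not the actual tool: the `http_get` below is a hypothetical stand-in that answers from a canned table (`RESPONSES`, invented for the example) instead of talking to a server through httplib2, and `run_spec` is an illustrative name. What it does show is how a helper gets injected into a plain text spec and how doctest compares the printed output.

```python
import doctest

# Canned responses standing in for a real server (hypothetical data).
# Keys are (path, Accept header); values are (status, reason, headers).
RESPONSES = {
    ("/publ/sfs/1999:175", "application/pdf"): (
        303, "See Other", [("Location", "/publ/sfs/1999:175/pdf,sv")]),
    ("/publ/sfs/1999:175/pdf,sv", "application/pdf"): (
        200, "OK", [("Content-Type", "application/pdf"),
                    ("Content-Language", "sv")]),
}


def http_get(path, headers=None, follow=False):
    """Print the status line and headers; optionally follow redirects."""
    accept = (headers or {}).get("Accept", "*/*")
    status, reason, response_headers = RESPONSES[(path, accept)]
    print("%d %s" % (status, reason))
    for name, value in response_headers:
        print("%s: %s" % (name, value))
    if follow and status in (301, 302, 303, 307):
        print("[Following..]")
        http_get(dict(response_headers)["Location"], headers, follow)
    elif status == 200:
        print("[Body]")


SPEC = '''
Retrieving a document entry with content negotiation:

>>> http_get("/publ/sfs/1999:175",
...          headers={"Accept": "application/pdf"},
...          follow=True)
303 See Other
Location: /publ/sfs/1999:175/pdf,sv
[Following..]
200 OK
Content-Type: application/pdf
Content-Language: sv
[Body]
'''


def run_spec(text, name="spec"):
    """Run a plain text spec as a doctest with http_get available in it."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, {"http_get": http_get}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)  # TestResults(failed, attempted)


results = run_spec(SPEC)
print("failed=%d attempted=%d" % (results.failed, results.attempted))
```

For a spec kept in a file of its own, `doctest.testfile` with its `extraglobs` argument does the same job, which is what makes the single command possible.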

Ok, I was a bit overambitious with all the "ROA"; this is mostly for checking your content-negotiation, redirects and so on. But I think it'll be useful for a large system I'm putting together, which will (if things work out well) at its core have a well-behaved Atom-based resource service loaded with (linked) RDF payloads. Complete with feed archiving and tombstones, it will in turn be used as fodder for a SPARQL service. (Seriously; it's almost "SOA Orchestration Done Right", if such a thing can be done.) But this I'll get back to..

(.. I may call the practice "COURT": Crafting Organization Using Resources over Time. It's about using Atom Entries as manifests of web resources with multiple representations, fully described by RDF and working as a content repository update stream serialized over time. (Well, that's basically what Atom is and works perfectly for, so I'm by no means inventing grand things here, grand as they IMHO are.))
