Dust Feed

Wednesday, November 09, 2011

One Liner At a Time

Just some one-liners. 'Cause, you know, to post something, and to each one its own.

even_map = dict(('key_%s' % i, i) for i in range(1, 9) if i % 2 == 0)

even_map = Hash[(1..9).find_all { |i| (i).even? }.collect {
    |i| ["key_#{i}", i] }]

(def even-map (into {} (for [i (range 1 9) :when (= (mod i 2) 0)]
                         [(keyword (str "key_" i)) i])))
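
For comparison, a sketch of the same mapping as a dict comprehension, which is available from Python 2.7 and in Python 3. This variant is an addition for illustration, not one of the original one-liners.

```python
# Same even-numbered mapping, written as a dict comprehension.
even_map = {'key_%s' % i: i for i in range(1, 9) if i % 2 == 0}
print(even_map)
```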

Sunday, July 03, 2011

Resources In Various Frames of JSON

I've been mulling over the role(s) JSON should play in representing RDF the last couple of days (well, the last year or so really).

Having worked with RDF for some years now, more or less full time, in specific contexts (legal information, and lately educational data), I'm getting a hang of some usage patterns. I'm also following the general Linked Data community (as well as various REST endeavors, Atom applications, and various, mostly dynamic language based, platforms). In all of this, I've made some observations:

1. If I want to use RDF fully, anything but a full RDF API (Graph navigation, Resource and Literal objects and all) often causes a lot of friction. That is, unless I already know what "framing" I need of the data, I really need a whole data set to dig through, for purposes of e.g. finding incoming relations, looking up rdfs:label of types and properties, filtering on languages, handling irregular data (such as dc:creator used with both literals and URI references) and so on.

2. Encountering general JSON data (not RDF specific) on the web, I sometimes come across quite crude stuff, mostly representing a database slice, or gratuitous exports from some form of O/R mapper. It may look meaningful, but often shows signs of ad-hoc design, unstable modeling without "proper" domain understanding and/or implementation leakage. However, the data is accessible for most web programmers without the need to get the domain properly, no matter how poor this data representation may be. JSON is language native. The use case is to have users (programmers) be able to spin around a specific, boxed and digested slice of data. Ideally you should also be able to find and follow links in it. (Basically the values which match something like /^(http(s)?:|\/|\.\/)\S+/ ...).

3. If I know and control the data, I can frame it in a usage scenario (such as for rendering navigable web pages with summaries of entities) based on a specific domain (such as an article document, its revisions and author, etc.). Here is great potential for reducing the full data into something raw, more (web) programming language native. This is where a JSONic approach fits the bill. Examples of how such data can look include the JSON data of e.g. New York Times and LCSH. The Linked Data API JSON is especially worth mentioning, since they explicitly reduce the data for this casual use so many need.
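
The link-matching pattern mentioned in the second observation can be tried out on any parsed JSON. The following is a minimal sketch, assuming Python 3; the function name find_links and the sample data are made up for illustration, and the regular expression is just the quoted pattern translated to Python's re syntax.

```python
import re

# The pattern quoted in observation 2, in Python's re syntax.
LINK_PATTERN = re.compile(r'^(https?:|/|\./)\S+')

def find_links(data):
    """Recursively yield every string value in parsed JSON that looks like a link."""
    if isinstance(data, dict):
        for value in data.values():
            yield from find_links(value)
    elif isinstance(data, list):
        for item in data:
            yield from find_links(item)
    elif isinstance(data, str) and LINK_PATTERN.match(data):
        yield data

# Made-up sample of the "boxed and digested slice of data" kind.
sample = {
    "title": "An article",
    "author": {"name": "Someone", "homepage": "http://example.org/someone"},
    "revisions": ["/doc/1/rev/1", "./rev/2", "not a link"],
}

links = sorted(find_links(sample))
print(links)
```

Note that this only finds candidates; whether a matching value really is meant as a link is exactly the kind of context such plain JSON leaves out.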

Point 1 is just a basic observation: for general processing, RDF is best used (produced and consumed) as RDF, and nothing else. It can represent a domain in high fidelity, and merges and is navigable in a way no other data model I've seen supports.

Point 2 is about quick and sometimes dirty. Cutting some corners to get from A to B without stopping for directions. You cannot do much more than that though, and in some cases, "B" might not be where you want to go. But it works, and if the use case is well understood, anything more will be considered waste for anyone not wanting the bigger contexts.

Point 3 then, is about how to go from 1 into 2. This is what I firmly believe the focus of RDF as JSON should be. And since 2 is many things, there may be no general answer. But there is at least one: how to represent a linked resource on the web, for which RDF exists, as a concise bounded description, showing inherent properties, outgoing links and per application considered relevant incoming links. And how to do this in JSON in high fidelity but immediately consumable by someone not wanting more than "just the data".

Many people have expressed opinions about these things of course. You should read posts by e.g. Leigh Dodds and Nathan Rixham, and look at some JSON Serialization Examples. Also monitor e.g. the Linked JSON W3C mailing list and of course the ongoing work of JSON-LD. Related to the Linked Data API and its "instrumental" JSON is also a recent presentation by Jeni Tennison: Data All the Way Down. It's short and very insightful. End-users have different needs than re-users!

Early on (over a year ago) I drafted Gluon. I have not used that much since. A related invention I have used though, is SparqlTree. While it isn't really a mechanism for defining how to map RDF terms to JSON (but to formulate SPARQL selects digestible into compact results), it does so quite well for specific scenarios. It is very useful to create frames to work on, where code paths are fully deterministic, and where there is a one-way direction of relations (which is needed in JSON trees, as opposed to RDF graphs where we can follow rel and rev alike). Admittedly I've done less than I should to market SparqlTree. But then again, it is a very simple and instrumental solution over an existing technology. I recently gave a glimpse of how I use it in a mail concerning the "Construct Where of SPARQL 1.1".

Reflecting on all of this, I'm quite convinced that anything like RDF/XML or Turtle is beyond what JSON should ever be used for. That is, support for all kinds of general RDF, using prefixes (which I love when I need to say anything about anything) and exposing the full, rich, internationalized and richly and extensibly datatyped world of literals is beyond the scenarios where JSON is useful. If you need full RDF, use Turtle. Seriously. It's the best! It's rather enjoyable to write, and I can consume it with any RDF API or SPARQL.

The only case I can think of where "full RDF in JSON" might apply is for machine-to-machine data where for some reason only JSON is viable. For this, I can see the value of having Talis' RDF/JSON standardized. It is used in the wild. It is reminiscent of the SPARQL results in JSON, which for me is also quite machine-like (and the very reason for me inventing SparqlTree in the first place!). I'd never hand-author it or prefer to work on it undigested. But that's ok. If handed to me I'd read it into a graph as quickly as possible, and that'd be dead simple to do.
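
To illustrate how little is needed to digest such data, here is a minimal sketch that flattens an RDF/JSON-style object into plain triples. It assumes the commonly described layout (subject, then predicate, then a list of objects with "type", "value" and optional "lang"/"datatype" keys); the function name and the sample data are made up, and plain tuples stand in for a real graph API.

```python
def rdfjson_to_triples(doc):
    """Flatten an RDF/JSON-style dict into (subject, predicate, object) tuples."""
    triples = []
    for subject, predicates in doc.items():
        for predicate, objects in predicates.items():
            for obj in objects:
                if obj["type"] == "literal":
                    # Keep language tag and datatype next to the lexical value.
                    value = (obj["value"], obj.get("lang"), obj.get("datatype"))
                else:
                    # "uri" and "bnode" objects are plain identifiers here.
                    value = obj["value"]
                triples.append((subject, predicate, value))
    return triples

# Made-up sample data in the assumed layout.
doc = {
    "http://example.org/doc/1": {
        "http://purl.org/dc/terms/title": [
            {"type": "literal", "value": "A Document", "lang": "en"}
        ],
        "http://purl.org/dc/terms/creator": [
            {"type": "uri", "value": "http://example.org/person/1"}
        ],
    }
}

triples = rdfjson_to_triples(doc)
for triple in triples:
    print(triple)
```

With an actual RDF library, the loop body would add each triple to a graph instead of a list.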

So where does this leave us? Well, the Gluon I designed contains a general indecision, the split into a raw and a compact form. The problem is that they are overlapping. You can fold in parts of data into compact form. This is complex, confusing and practically useless. Also, the raw form is just another one in the plethora of more or less "turtle in JSON" designs which cropped up in the last years. I doubt that any such hybrid is usable: either you know RDF and should use Turtle, or you don't and you want simple JSON, without the richness of inlined RDF details.

My current intent is to remove the raw form entirely, and design the profile mechanism so that it is "air tight". I also want to make it as compact as possible, true to RDF idioms but still "just JSON". A goal will still also be that if present, a profile should be possible to use to get RDF from the JSON. This way, there is a possibility of adding the richer context and integratability of RDF to certain forms of well designed JSON. This of course implies that Gluon-profile compatible JSON will be considered well designed. But that is a goal. It has to look good for someone not knowing RDF!

I have a strawman of a "next generation gluon profile" in the works. I doubt that you can glimpse my design from that alone, but anyway.

Some things to note:

The 'default' feature will be more aligned with the @vocab mechanism of RDFa 1.1 (and JSON-LD)
Keywords ('reserved') can be redefined. There are preset top-level keys, but that's it. (A parser could parameterize that too of course.)
No CURIEs - every token is "imported" from a vocabulary.
Types will be powerful. They'll determine default 'vocab' for a resource description (i.e. JSON object), and you can also import terms locally for a type (so that a Person title is foaf:title although 'title' is globally from 'dc').
If there are multiple values for a term (i.e. multiple triples with the same subject and predicate), a defined prefix or suffix will be added to the term. This is an experiment to make this nagging problem both explicit and automatic.
The 'define' will be reduced to a much less needed component. Using 'autocoerce', pattern matching on values will be bravely used to coerce mainly date, dateTime and URI references to their proper types.
Incoming links can be represented as 'inverseOf' attributes, thus making it possible to frame more of a graph as a tree.
Named bnodes are out (though they might be snuck in via a "_:" link protocol..). Anonymous bnodes are just fine.
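
Two of the ideas above, pattern-based coercion and marking multi-valued terms in the key itself, can be sketched roughly as follows. This is not the Gluon profile syntax: the rule table, the '_list' suffix and the function names are all hypothetical, chosen only to show the mechanics.

```python
import re

# Hypothetical autocoerce rules: (compiled pattern, type name), tried in order.
AUTOCOERCE = [
    (re.compile(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?$'), 'dateTime'),
    (re.compile(r'^\d{4}-\d{2}-\d{2}$'), 'date'),
    (re.compile(r'^(https?:|/|\./)\S+$'), 'uri'),
]

def coerce_type(value):
    """Guess the type of a JSON string value by pattern matching alone."""
    for pattern, type_name in AUTOCOERCE:
        if pattern.match(value):
            return type_name
    return 'plain'

def to_json_object(pairs, many_suffix='_list'):
    """Fold (term, value) pairs for one resource into a JSON-like dict.

    A term with several values gets the suffix added to its key, so the
    consumer can tell from the key alone whether to expect a list.
    """
    grouped = {}
    for term, value in pairs:
        grouped.setdefault(term, []).append(value)
    result = {}
    for term, values in grouped.items():
        if len(values) > 1:
            result[term + many_suffix] = values
        else:
            result[term] = values[0]
    return result

# Made-up description of a single resource.
pairs = [
    ('title', 'A Document'),
    ('issued', '2011-07-03'),
    ('seeAlso', 'http://example.org/a'),
    ('seeAlso', 'http://example.org/b'),
]

obj = to_json_object(pairs)
print(obj)
print(coerce_type(obj['issued']), coerce_type(obj['title']))
```

The obvious risk with the coercion part is a plain string that merely looks like a date or a link, which is why it is called brave above.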

This is a design sketch though. Next steps are to work on adapting my test implementations and usage thereof.

An auxiliary but very interesting goal is the possibility of using these profiles in a high-level API wrapper around an RDF graph, making access to it look similar to using Gluon JSON as is (but with the added bonus of "reaching down" the abstraction to get at the details when needed). (This is the direction I've had in mind for any development of my nowadays aged Oort Python O/R mapper. More importantly, the current W3C RDF/Structured Data API design work also leans towards such features, with the Projection interface.)

(Note that profiles will reasonably not be anything like full "JSON schemas". It's about mapping terms to URI:s and as little else as possible to handle datatyping and the mismatch between graphs and JSON trees. There is a need for determining if a term has one or many values, but as noted I'm working on making that as automatic as possible. Casting datatypes is also needed in some cases but should be kept to a minimum.)

Finally, I really want to stress that I want to support the progress of JSON-LD! I really hope for an outcome to be a unification of all these efforts. The current jungle of slightly incompatible "RDF as JSON" sketches is quite confusing (and I know, Gluon is one of the trees in that jungle). I believe JSON-LD and the corresponding W3C list is where the action is. Since there is work in JSON-LD on profiles/contexts, and a general discussion of what the use cases are, I hope that this post and my future Gluon profile work can help in the progress of this! For me Gluon is the journey and I hope JSON-LD is the destination. But there are many wills at work here, so let's see how it all evolves.

Thursday, December 31, 2009

The Stone in the House of Glass

The stone in the house of glass

began to tumble

It didn't really see

where it was going

The stone in the house of glass

began to rumble

It didn't really hear

what it was doing

The stone in the house of glass

took a dive

It didn't really sense

its own surroundings

The stone in the house of glass

began to crumble

It didn't really know

its limitations

The stone in the heap of shards

has smashed the building

It didn't really get

what it was all about

Thursday, November 12, 2009

The Groovy Times

It has been so long since my last post here. I've twittered away like the rest of my peers. I guess I could dish out details from my personal life of the past year now. To examine my interrupt. I won't.

I've been using Groovy a lot in my work on the Swedish Legal Information System. While the things I find interesting in this work deserve many separate posts, I'll just spend this one to drop some nice stuff about groovy.

"Why Groovy", you might ask? Oh dear. For "political" reasons (this is an entire topic of its own), I have to use Java. But the language Java is often so much overwork and ceremony; riddled with convoluted ways to make explicit patterns and formalisms. Dynamic languages are pragmatic. Sure they have flaws, but I find the compromise acceptable. There are probably thousands of articles discussing this, and I prefer to debate it elsewhere (mostly with friends over lunch/dinner/beer).

I must deliver the bulk in Java, and Groovy is impressively close to Java, but with so much more expressive power (ease of use). (More than necessary? My pythonic side says "probably".) I've worked a lot with Jython the last decade, and some with JRuby. Groovy trumps them both (IMHO) when it comes to java library and java culture interoperability. I can spike, explore, and make tests in groovy. Then I add all this horrendous checked exception handling, sprinkle some semicolons for the grumpy old compiler, and finally explode-a-pop all the def:s to bulky types when things need to "harden" (to fulfil the contract of delivering .java...; the tests are left in groovy). Not a big deal (especially since I use an editor that makes code editing a breeze). And groovy cross-compiles nicely with traditional cruft.

So what have I used more specifically? Spock. Check it out. Rewrite ten of your JUnit4 tests in it, and if you go back, write ten more. Don't go back. If you're already testing with say JRuby, I won't push you, but I assure you Spock is worth looking at. Specs become liberatingly clear and thin. Data-driving some of them is pure joy. Mocking is dead simple. (Sorry, I won't put code examples here now: look at the spock docs for that. Try them out!)

For building, we do not use Gradle (not yet at least), but that scary beast of mindnumbing declarativity (which I'm usually for), dreadful xml (which I can handle due to prolonged exposure) and conflated purposes (no excuse here) known as Maven 2. It seemed paramount in the surroundings when we started, and won't go away soon. I use it as little as possible (and it is quite impressive when you let it do its thing).

Which leads me to the last thing I want to mention: how to use Groovy's very convenient, builtin Grape system (and its @Grab mechanism) together with my local maven2 artifact repo. That one in <~/.m2>, where all my local packages have been mvn install:ed (along with the umpteen dependencies).

The thing was, when I started, I naively thought things would kind of work at least semi-automagically. I've spent my time in CLASSPATH hell. I wanted groovy to tap into the local m2. Then I was disillusioned again, and attempted to run experimental groovy scripts via GMaven. Didn't fit my use cases at all (that thing is great at compiling, I leave it at that). I shellscripted the path from mvn dependency:build-classpath, then built pathing jars, then just felt quite uneasy (such moves work, but it's not particularly clean).

When Grape appeared I tinkered with the Ivy config in <~/.groovy/grapeConfig.xml>. It surely looks so simple. I couldn't figure it out. Benhard could. Neat.. But alas, that solution copies all the dependencies from the local m2 repo to grape's ivy repo. And I could not get it to grab my new local SNAPSHOT-stuff as they landed either (in spite of eleventy ivy attributes claiming to force all kinds of checks).

Then it appeared, from a combination of fatigue and taking a step back (cue magic "aahh").

Use Groovy's Grape with your Local Maven File Repo

Locate $HOME/.groovy/grapeConfig.xml. If it doesn't exist, see the Grape docs for how the default version should look.

Then add, directly after (xpath) /ivysettings/settings, the following directive:

  <caches useOrigin="true"/>

And, in (xpath) /ivysettings/resolvers/chain, after the first filesystem, add:

<filesystem name="local-maven2" m2compatible="true">
  <ivy pattern="${user.home}/.m2/repository/[organisation]/[module]/[revision]/[module]-[revision].pom"/>
  <artifact pattern="${user.home}/.m2/repository/[organisation]/[module]/[revision]/[artifact]-[revision](-[classifier]).[ext]"/>
</filesystem>

With that in place (pardon the line width), Grape (and thus @Grab) will happily use anything it finds in your local m2 file repo, without copying the jar:s. (It will still download other stuff to use to the default ~/.groovy/grapes/, which is fine by me.)
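
To see what the artifact pattern above points at, here is a small sketch that imitates the expansion in Python. It is a simplified stand-in and not Ivy's own logic: it assumes that m2compatible turns dots in the organisation into slashes, and that a parenthesised part is dropped when its token has no value. The coordinates and the home directory are made up.

```python
import re

# The artifact pattern from the filesystem resolver above.
ARTIFACT_PATTERN = (
    "${user.home}/.m2/repository/[organisation]/[module]/[revision]/"
    "[artifact]-[revision](-[classifier]).[ext]"
)

def expand(pattern, tokens, home, m2compatible=True):
    """Expand an Ivy-style pattern into a file path (simplified imitation)."""
    tokens = dict(tokens)
    if m2compatible:
        # Assumption: org.example becomes org/example, as in a Maven repo layout.
        tokens["organisation"] = tokens["organisation"].replace(".", "/")
    path = pattern.replace("${user.home}", home)

    def optional(match):
        # Keep a "(...)" part only if every token inside it has a value.
        inner = match.group(1)
        names = re.findall(r"\[(\w+)\]", inner)
        return inner if all(tokens.get(name) for name in names) else ""

    path = re.sub(r"\(([^)]*)\)", optional, path)
    return re.sub(r"\[(\w+)\]", lambda m: tokens[m.group(1)], path)

# Hypothetical coordinates of a locally installed snapshot.
coords = {
    "organisation": "org.example",
    "module": "mylib",
    "revision": "1.0-SNAPSHOT",
    "artifact": "mylib",
    "ext": "jar",
}

jar_path = expand(ARTIFACT_PATTERN, coords, "/home/me")
sources_path = expand(ARTIFACT_PATTERN, dict(coords, classifier="sources"), "/home/me")
print(jar_path)
print(sources_path)
```

If a @Grab does not resolve, comparing such an expanded path with what is actually in the local repo is a quick first check.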

That's it for now. There is lots of cool stuff in Groovy, if you're in a Java environment and want to acknowledge that without giving up modern power.

Three years ago I looked at Scala for the first time, and found it quite interesting. Then I got a bit spooked by the academic machinations of it. Lately that interest has been quite rekindled though, and the future will show what will come of that. I am very happy to have used Groovy so far though, and I would certainly recommend it. Scala may yet be for tomorrow, Groovy is for the Java user of right now.

(Of course, I recommend to continuously look beyond the JVM as well. Simplicity is hard to reach in increments without designing for it from the start. But things will reasonably evolve in most "camps".)

Thursday, November 27, 2008

Labelled Reduction as A Good Thing

One small thing (of the many) in Python 2.6 I like (and have waited for since it appeared as a recipe), is collections.namedtuple. It is very useful in itself, but the fact that the stdlib has been adapted to use it throughout is quite nice. Consider the following code:

from urlparse import urlparse
print urlparse(
        "http://localhost:8080/doc/something;en?rev=1")

If run with Python 2.5, you get this tuple:

('http', 'localhost:8080', '/doc/something',
 'en', 'rev=1', '')

, whereas in 2.6 it is a namedtuple:

ParseResult(
        scheme='http', netloc='localhost:8080',
        path='/doc/something',
        params='en', query='rev=1', fragment='')

The last one is unpackable just like a regular tuple, but you can access the parts as attributes as well. This little "data struct" is quite handy, since I don't like to access tuples by index, but quite often pass them around and only access some piece of them at a time.
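
Defining your own is a one-liner as well. A minimal sketch, written for Python 3 (where urlparse lives in urllib.parse); the Revision type and the describe function are made up for illustration.

```python
from collections import namedtuple
from urllib.parse import urlparse  # in Python 2: from urlparse import urlparse

# Hypothetical little "data struct" for passing around.
Revision = namedtuple("Revision", ["doc", "lang", "rev"])

def describe(url):
    """Pick out the pieces of interest from a URL, by name rather than index."""
    parts = urlparse(url)
    rev = parts.query.partition("=")[2]
    return Revision(doc=parts.path, lang=parts.params, rev=int(rev))

revision = describe("http://localhost:8080/doc/something;en?rev=1")

# Attribute access and regular tuple unpacking both work.
print(revision.doc, revision.lang)
doc, lang, rev = revision
print(rev)
```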

(With these you don't need to create full-fledged classes for every kind of instrumental data. (Sometimes coupling data and functionality in a single paradigm may be a coarse hammer, treating nails and iron chips alike..) Nor resort to the use of dictionaries where you really want a "restricted value lens", if you will.. But this is another rant altogether.)

Of course, there's lots more to enjoy in 2.6 (the enhanced property for decorator use, ABC:s, json, 2to3 etc).

On a related note, do check out Swaroop C H:s excellent and free books on Python (2.x + 3.0(!)): A Byte of Python. And if you're into Vim (you should be, IMHO) his new A Byte of Vim.

Friday, October 24, 2008

Gnostic Nihilism

Here's my take on it. God actually exists. It's the world that doesn't. It's just that when you leave a being like that all alone in nothingness for an eternity, it starts to dream up all sorts of amazing and insane things.

Or to put it in mathematical terms (incidentally also being my unbound alternative to "42" for half my life):

0 * oo = x, where 0 < x < oo.

("oo" is for ∞, i.e. "eternity", of course.)

Take that you goddamn realists! I laugh in your general direction! Ha. Ha I say.

If you nevertheless find a discrepancy in my flawless argument, I'd gather you're just cheating by using rationality. How gnostic is that? I suppose next you're going to argue that there is logical truth in empirical facts.

Saturday, September 27, 2008

Resources, Manifests, Contexts

Just took a quick look at oEmbed (found from a context I was led to from Emil Stenström (an excellent Django promoter in my surroundings btw. Kudos.)).

While oEmbed is certainly quite neat, I very much agree with the criticism regarding the lack of RESTfulness, and that they have defined a new metadata carrier. I think oEmbed would work very well as an extension element in Atom Entry documents (which already have most of the properties oEmbed (re-)defines). Or by reusing (in such atom entry docs) e.g. Media RSS, as Stephen Weber suggested.

Granted, if (as I do hope) RESTfulness and Atom permeation on the web become much more well established (approaching ubiquity), this would be dead easy to define further down the line. (And signs of this adoption continue to pop up, even involving the gargantuans..)

But since it wasn't done right away, oEmbed is to some extent another part of the fragmented web data infrastructure — already in dire need of unification. It's not terrible of course, JSON is very effective — it's just too context-dependent and stripped to work for much more than end-user consumption in "vertical" scenarios. While oEmbed itself is such a scenario, it could very well piggy-back on a more reusable format and thus promote much wider data usability.

A mockup (with unsolicited URI minting in the spaces of others) based on the oEmbed quick example could look like:


<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:oembed="http://oembed.com/ns/2008/atom/">
  <id>tag:flickr.com,2008:/3123/2341623661_7c99f48bbf_m.jpg</id>
  <title>ZB8T0193</title>
  <summary></summary>
  <content src="http://farm4.static.flickr.com/3123/2341623661_7c99f48bbf_m.jpg"
           type="image/jpeg"/>
  <oembed:photo version="1.0" width="240" height="160"/>
  <author>
    <name>Bees</name>
    <uri>http://www.flickr.com/photos/bees/</uri>
  </author>
  <source>
    <id>tag:flickr.com,2008:/feed</id>
    <author>
      <name>Flickr</name>
      <uri>http://www.flickr.com/</uri>
    </author>
  </source>
</entry>

The main point, which I have mentioned before, is that Atom Entries work extremely well as manifests of resources. This is something I hope the REST community will pick up in a large way. Atom feeds complement the RESTful infrastructure by defining a standard format for resource collections, and from that it seems quite natural to expose manifests of singular resources as well using the same format.

In case you're wondering: no, I still believe in RDF. It's just easier to sell uniformity one step at a time, and RDF is unfortunately still not well known in the instrumental service shops I've come in contact with (you know, the ones where integration projects pop up ever so often, mainly involve hard technology, and rarely if ever reuse domain knowledge properly). So I choose to support Atom adoption to increase resource orientation and uniformity; we can continue on to RDF if these principles continue to gain momentum (which they will, I'm sure).

Thus I also think we should keep defining the bridge(s) from Atom to RDF for the 3.0 web. There are some sizzling activities in that respect, which can be seen both in the Atom syntax mailing list and the semantic web list. My interest stems from what I currently do at work (and as a hobby it seems), albeit from a very instrumental perspective, and as a complement rather than an actual bridge.

In part, it's about making Atom entries from RDF, in order for simple RESTful consumers to be able to eat some specific Atom crumbs from the semantic cakes I'm most certainly keeping (the best thing since croutons, no doubt). These entries aren't complete mappings, only selected parts, semantically more coarse-grained and ambiguous. While ambiguity corrupts data (making integration a nightmare), it is used effectively in "lower-case sem-web" things such as tagging and JSON. (Admittedly I suppose it's ontologically and cognitively questionable whether it can ever be fully avoided though.)
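
As a rough sketch of what such a selective mapping can amount to, the following builds an Atom entry element from a few triples using Python 3's standard library. The term-to-element table, the function name and the sample triples are assumptions made for illustration; anything without a mapping is simply left out, which is the coarse-graining described above.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace

# Hypothetical, deliberately partial mapping from RDF properties to Atom elements.
TERM_TO_ATOM = {
    "http://purl.org/dc/terms/title": "title",
    "http://purl.org/dc/terms/abstract": "summary",
    "http://purl.org/dc/terms/modified": "updated",
}

def entry_from_triples(subject, triples):
    """Build an Atom entry for one subject from the mapped subset of its triples."""
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}id" % ATOM).text = subject
    for s, p, o in triples:
        if s == subject and p in TERM_TO_ATOM:
            ET.SubElement(entry, "{%s}%s" % (ATOM, TERM_TO_ATOM[p])).text = o
    return entry

# Made-up triples; the creator has no mapping and is dropped.
triples = [
    ("http://example.org/doc/1", "http://purl.org/dc/terms/title", "A Document"),
    ("http://example.org/doc/1", "http://purl.org/dc/terms/modified", "2008-09-27T12:00:00Z"),
    ("http://example.org/doc/1", "http://purl.org/dc/terms/creator", "http://example.org/person/1"),
]

entry = entry_from_triples("http://example.org/doc/1", triples)
print(ET.tostring(entry, encoding="unicode"))
```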

We have proper RDF at the core, so this is about meeting "half way" with the gist of keeping things simple without losing data quality in the process. To reduce and contextualize for common services, that is at the service level, not the resource level. (I called this "RA/SA decoupling" somewhere, for "Resource Application"/"Service Application". Ah well, this will all be clarified when I write down the COURT manifesto ("Crafting Organisation Using Resources over Time"). :D)

Hopefully, this Atom-from-RDF stuff will be reusable enough to be part of my Out of RDF Transmogrifier work. Which (in my private lab) has been expanded beyond Python, currently with a simple Javascript version of the core mapping method ("soonish" to be published). Upon that I'm aiming for a pure js-based "record editor", ported from an (py-)Oort-based prototype from last year. I also hope the service parts of my daytime work may become reusable and open-sourced as well in the coming months. Future will tell.