HTTP Unix pipes for Open Data
A topic that has fascinated me for years now is (broadly speaking) nationalism. In the world of the internet that essentially boils down to something like "I am a Python programmer and this project is written in Java! Ignored!". It is this behavior (which I see a lot in programming, mostly manifested as "Not invented here") that leads to a bunch of solutions to the same problem written in a bunch of different languages where many of the solutions are half-baked.
For a concrete example, consider open data catalogues. As evidenced by datacatalogs.org, there are a ton of different solutions to the same set of problems, namely hosting open data. Having a rich ecosystem is a good thing, but I believe there is a common open data infrastructure layer that we aren't collaborating on nearly enough: the conversion of data between different formats.
Wouldn't it be great if I, as a JavaScript developer, could use the awesome data conversion libraries available in Java, like Apache POI? Or if Ruby developers could use Python packages like csvkit (which contains the super useful csvclean utility)? The good news is that the internet has settled on a common language for crossing these language barriers: HTTP and JSON. Additionally, nowadays the web is filled with hosted services (see SaaS, PaaS), and there are numerous platforms where hosted services can be deployed for free (Google App Engine, Heroku, Dotcloud, Nodejitsu, etc.).
On the Unix command line there are a bunch of useful single-purpose utilities.
The Unix philosophy is "write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface".
The Unix command wc is a great example. You give it a bunch of text and it will count the number of lines, words and characters. Combined with the cat command, which reads a file and dumps out all the text, you can use a Unix pipe (the | character) to 'pipe' the data that cat dumps out into wc:
cat awesomeText.txt | wc
21 55 507
Taking heavy inspiration from Unix pipes, HTTP, and JSON, I have come up with a modest proposal for how we might share our best tools for various data conversion jobs as hosted web services. I'm calling it gut, as in gutting a fish and getting the yummy filet out while leaving behind all of the junk.
Here's a simple example of how a gut server that takes in a CSV file and returns JSON data would work. As a developer using the gut server to process my CSV file, I would send the following HTTP request containing my CSV data:
POST / HTTP/1.1
User-Agent: curl
Host: gutcsv.nodejitsu.com
Accept: */*
Content-Length: 64
Content-Type: application/x-www-form-urlencoded

name,appearance
chewbacca,hairy
bill,nonplussed
bubbles,relaxed
This is what the gut server would give me back:
HTTP/1.1 200 OK
content-type: application/json
content-length: 186
Connection: close

{
  "headers": [
    {
      "name": "name"
    },
    {
      "name": "appearance"
    }
  ],
  "rows": [
    {
      "name": "chewbacca",
      "appearance": "hairy"
    },
    {
      "name": "bill",
      "appearance": "nonplussed"
    },
    {
      "name": "bubbles",
      "appearance": "relaxed"
    }
  ]
}
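If you want to try this yourself from the command line, you could pipe a local CSV file through the server with curl (a sketch; awesomeData.csv is a hypothetical local file):
curl --data-binary @awesomeData.csv http://gutcsv.nodejitsu.com
curl's --data-binary flag posts the file contents unmodified, and by default it sends the Content-Type: application/x-www-form-urlencoded header shown in the request above.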
Essentially I am piping data from my computer through a gut server, and when it comes back it is in the new format. In this example I used the Node.js hosting platform Nodejitsu to deploy my CSV-to-JSON code so that it is available to anyone in the world who can make an HTTP request.
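To give a sense of how little code one of these servers needs, here is a minimal sketch of a gut server in Node.js. This is an illustration rather than the actual deployed code, and the CSV parsing is deliberately naive (it ignores quoted fields and assumes Unix line endings):

var http = require('http');

http.createServer(function (req, res) {
  // buffer up the CSV data from the request body
  var body = '';
  req.on('data', function (chunk) { body += chunk; });
  req.on('end', function () {
    // the first line of the CSV holds the column names
    var lines = body.trim().split('\n');
    var fields = lines[0].split(',');
    var headers = fields.map(function (f) { return { name: f }; });
    // remaining lines become one object per row, keyed by column name
    var rows = lines.slice(1).map(function (line) {
      var cells = line.split(',');
      var row = {};
      fields.forEach(function (f, i) { row[f] = cells[i]; });
      return row;
    });
    res.writeHead(200, { 'content-type': 'application/json' });
    res.end(JSON.stringify({ headers: headers, rows: rows }));
  });
}).listen(8000);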
If you are writing code that converts data from one format to another, consider also exposing your solution in the form of a gut server! Last year I had great success at International Open Data Day teaching ScraperWiki because it scaled out well to a room full of people with different programming backgrounds. I think that writing these lightweight data converter/massager/transformer servers is also a task that anyone can tackle in a short amount of time.
There is a GitHub project that contains the current gut servers I have been working on, and also a wiki page where you can add your gut server to the list. Once there are a handful of gut servers we can start working on more extensive discovery and testing tools (ensuring gut server availability, better documentation, a web-based gut server API playground, etc.).