Scraping with Node

Modules and tutorial demonstrating HTML parsing with node.js

One of the best parts about server side JavaScript is the lack of the DOM, but sometimes you need to parse HTML in your node programs. For a while JSDOM has been the best known module for this task, but it has a number of issues. The author, @tmpvar, has been developing super awesome node powered robots instead of maintaining it, and it turns out that a full DOM Level 3 implementation is so complex that JSDOM suffers from some pretty bad memory leaks, leaving it unusable for a lot of complex use cases.

Instead of rewriting the DOM in pure JS, a more realistic approach is a nice and simple HTML parser that implements a CSS selector API. Enter cheerio, a module that can teach your server HTML.

Cheerio is built on top of the htmlparser2 module, a sax-like parser for HTML/XML. On top of that, a separate dependency called cheerio-select implements the sizzle selector engine, and the cheerio module itself implements most of the jQuery API in pure JS, with no DOM required.
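To get a feel for what cheerio abstracts away, here is a minimal sketch of htmlparser2's sax-like interface: instead of building a tree, it simply fires callbacks as tags and text stream past:

var htmlparser = require('htmlparser2')

var parser = new htmlparser.Parser({
  onopentag: function(name, attribs) {
    console.log('open', name, attribs)
  },
  ontext: function(text) {
    console.log('text', text)
  },
  onclosetag: function(name) {
    console.log('close', name)
  }
})

parser.write('<h2 class="title">Hello world</h2>')
parser.end()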

Using Cheerio

Since there is no DOM in node, you have to initialize a cheerio instance from an HTML string. (This example comes from the cheerio readme.)

var cheerio = require('cheerio'),
    $ = cheerio.load('<h2 class = "title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

$.html();
//=> <h2 class = "title welcome">Hello there!</h2>
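Most of the everyday jQuery traversal methods work the same way on the loaded instance. A quick sketch (the fruit markup here is just for illustration):

var cheerio = require('cheerio')
var $ = cheerio.load('<ul id="fruits"><li class="apple">Apple</li><li class="pear">Pear</li></ul>')

// find, each, text and attr behave like their jQuery counterparts
$('#fruits').find('li').each(function(i, el) {
  console.log($(el).text()) // => Apple, then Pear
})
console.log($('.apple').attr('class')) // => apple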

If you have an HTML file on disk that you want to load, you can use node's fs module (warning: synchronous calls block the event loop, so only use them in one-off scripts where you don't care about blocking):

var cheerio = require('cheerio')
var fs = require('fs')

var htmlString = fs.readFileSync('index.html').toString()
var parsedHTML = cheerio.load(htmlString)

// query for all elements with class 'foo' and loop over them
parsedHTML('.foo').map(function(i, foo) {
  // wrap the raw foo element in a cheerio object (same pattern as jQuery)
  foo = parsedHTML(foo)
  console.log(foo.text())
})
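If you care about not blocking the event loop (e.g. inside a server), the asynchronous fs.readFile does the same job; here is a minimal sketch:

var cheerio = require('cheerio')
var fs = require('fs')

// non-blocking read: the callback fires once the whole file is in memory
fs.readFile('index.html', function(err, buffer) {
  if (err) return console.error(err)
  var parsedHTML = cheerio.load(buffer.toString())
  parsedHTML('.foo').map(function(i, foo) {
    console.log(parsedHTML(foo).text())
  })
})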

Similarly, you can use the popular request module to grab HTML from a remote server using HTTP and then pass it to cheerio:

var cheerio = require('cheerio')
var request = require('request')

function gotHTML(err, resp, html) {
  if (err) return console.error(err)
  var parsedHTML = cheerio.load(html)
  // find all links on the page and collect the ones that point at .png images
  var imageURLs = []
  parsedHTML('a').map(function(i, link) {
    var href = parsedHTML(link).attr('href')
    if (!href || !href.match(/\.png$/)) return
    imageURLs.push(domain + href)
  })
}

var domain = 'http://substack.net/images/'
request(domain, gotHTML)
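If you wanted to download the scraped images to disk instead of just collecting their URLs, request's readable stream pipes straight into a file write stream. A sketch, assuming it runs inside gotHTML after imageURLs has been filled:

var fs = require('fs')
var path = require('path')

// stream each image straight to a local file without buffering it in memory
imageURLs.forEach(function(url) {
  request(url).pipe(fs.createWriteStream(path.basename(url)))
})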

Building on the link scraper above, here is how to fetch the raw binary data of one of the scraped images and render it in your terminal using picture-tube and the node Stream API:

var pictureTube = require('picture-tube')

// run this inside gotHTML, after the imageURLs array has been populated
var randomIndex = Math.floor(Math.random() * imageURLs.length)
var randomImage = imageURLs[randomIndex]
request(randomImage).pipe(pictureTube()).pipe(process.stdout)

Now, go forth and scrape!