HTML parsing in Clojure

Once again I’m working on a personal project that requires some HTML parsing. If you have had to look into it, you’ll know that the landscape of Clojure HTML parsing libraries seems to be littered with dead projects.

Let’s look at the options.

Enlive is what I used for Tropology. It got the job done, even if it was a bit involved to get into, but at this point it looks pretty dead. It also combines templating and parsing, so it might be overkill if all you want is to select some nodes out of a page.

clj-tagsoup is leaner and seems to mostly be a wrapper around tagsoup. Unfortunately it also lacks tests and seems to be pretty much abandoned.

Hickory looked clean and straightforward, and had clear examples. The README also showed an understanding of why Hiccup is a good intermediate representation for HTML, but might not be the easiest one to process. My main hesitation came from it having multiple open pull requests that were created well over a year ago and still hadn’t been merged.
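To make the difference between the two representations concrete, here is a small sketch using the parse-fragment, as-hiccup and as-hickory functions from hickory.core. The HTML fragment and the var names are made up for illustration, and it assumes Hickory is already on the classpath.

```clojure
(require '[hickory.core :as h])

;; parse-fragment returns a sequence of parsed nodes, one per top-level node.
(def parsed-frag (h/parse-fragment "<a href=\"foo\">foo</a>"))

;; Hiccup form: compact nested vectors, pleasant to read and write.
(def hiccup-form (h/as-hiccup (first parsed-frag)))
;; [:a {:href "foo"} "foo"]

;; Hickory form: a map with explicit keys, easier to walk programmatically.
(def hickory-form (h/as-hickory (first parsed-frag)))
;; {:type :element, :attrs {:href "foo"}, :tag :a, :content ["foo"]}
```

With the map form, getting at the tag or the attributes is a plain keyword lookup instead of positional destructuring of a vector.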

And of course I could just use JSoup directly… but then I would end up with non-Clojurized data structures. Eww.

I settled on Hickory. It wasn’t dependency-heavy, it supported both Clojure and ClojureScript consistently, and it performed well in my tests. While version 0.6.0 had some deprecated dependencies (.cljx files and cemerick.test), it was easier to just replace those than to build a new library myself.
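For the "select some nodes out of a page" use case, a minimal sketch with Hickory could look like the following. The page content, the "external" class and the var names are hypothetical, and it assumes a dependency along the lines of [hickory "0.7.0"] in the project.

```clojure
(require '[hickory.core :as h]
         '[hickory.select :as s])

;; A made-up page with two links, only one of which has the class we want.
(def page
  (str "<html><body>"
       "<a class=\"external\" href=\"https://example.com/a\">A</a>"
       "<a href=\"/local\">B</a>"
       "</body></html>"))

;; Parse the whole document and convert it to the Hickory map format.
(def tree (-> page h/parse h/as-hickory))

;; Selectors are plain functions that compose: <a> elements with class "external".
(def external-links
  (s/select (s/and (s/tag :a) (s/class "external")) tree))

;; Each match is an ordinary map, so the attributes are a get-in away.
(def hrefs (mapv #(get-in % [:attrs :href]) external-links))
;; ["https://example.com/a"]
```

Everything that comes back is a regular Clojure data structure, which is exactly what using JSoup directly would not give me.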

Happily, it looks like Hickory is still very much alive, open pull requests notwithstanding. David Santiago, the maintainer, has merged the changes and we now have an official, updated Hickory 0.7.0.


Published: 2016-10-31

Author: Ricardo J. Méndez