Visualizing TVTropes - Part 4: Merges
Episode IV: A new approach
We got better performance with the last changes I made to the import process, at the cost of parallelization and a significantly less clean Clojure codebase.
As I was going about wrapping them up, I got a recommendation from Michael Hunger. Summarized:
- Don’t use the batch REST API
- The Cypher query endpoint is outdated
- Try merges
Well then. So much for that feature branch.
The crunchy bits
Let’s skip right to the point. Switching to merges allows us to go back to our “parse one page at a time” implementation, since we no longer need to lump a bunch of statements into a single batch REST API call for better performance.
I haven’t found out whether it’s possible to pass a list of maps to a Neo4j query, so on the backend call we only get to send a list of codes to relate the page to.
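As a rough illustration of that request shape, here is a minimal Python sketch (not the project's Clojure code) that builds a body for Neo4j's transactional Cypher endpoint. The `Page` label, the `LINKS_TO` relationship type, the `code` property and the `build_merge_payload` helper are all assumptions made for the example; only the idea of sending one page code plus a flat list of codes to link to comes from the text above. The `{name}` parameter syntax is the one used by Neo4j 2.x.

```python
import json

# Hypothetical schema: the label, relationship type and property name are
# assumptions for illustration, not the names the importer actually uses.
MERGE_PAGE_LINKS = (
    "MERGE (p:Page {code: {code}}) "
    "FOREACH (target IN {targets} | "
    "MERGE (t:Page {code: target}) "
    "MERGE (p)-[:LINKS_TO]->(t))"
)


def build_merge_payload(code, targets):
    """Build the JSON body for one page: its code plus a flat list of linked codes."""
    return {
        "statements": [
            {
                "statement": MERGE_PAGE_LINKS,
                # De-duplicated only within this one request, not across requests.
                "parameters": {"code": code, "targets": sorted(set(targets))},
            }
        ]
    }


payload = build_merge_payload("Main/HomePage", ["Main/Tropes", "Main/Media", "Main/Tropes"])
print(json.dumps(payload, indent=2))
```

Because every node is merged rather than created, sending the same code in a later request should be harmless, which is why no cross-request de-duplication is attempted here.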
Once we’ve made these changes, a serial import of the top 10 test pages (15k nodes and 16k relationships) goes down from its previous 30s to under 10s… while actually making more requests - since I make no effort to de-dup the nodes across requests - and placing a significantly lower burden on the CPU. The live site is importing 300 pages and their references per minute, while keeping CPU usage at an average of under 60%.
That’s something we can definitely have running alongside the queries.
Sorta parallel, at least.
Now, Michael Hunger also mentioned we can parallelize as many of these as we have cores for. That will probably work well for other data sets, but since this one is so connected I keep seeing transactions bail out on deadlocks when two imports try to refer to the same node.
These are identified by a particular exception indicating it’s only a transient error, meaning we could just leave the page aside and retry it later. I’m not yet trying to fully parallelize things, though, for two reasons:
- My current approach just gets pages directly from the TVTropes site, and makes no effort to cache them locally. I don’t want to flood them with requests for a single page if I keep failing on it.
- I did some tests retrying a transaction when I get a transient error, but since I need to put the thread to sleep for a period of time to avoid walking right into a deadlock again, the quick tests I did actually ended with that approach being slower than just doing these serially.
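The retry-with-a-pause idea from the second point can be sketched roughly as follows. This is Python rather than the project's Clojure, and `TransientError`, `import_with_retry` and the delay values are hypothetical stand-ins, not the exception or settings the importer actually uses.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for the transient (deadlock) exception."""


def import_with_retry(import_page, page, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run import_page(page), retrying on TransientError with a growing, jittered pause."""
    for attempt in range(1, max_attempts + 1):
        try:
            return import_page(page)
        except TransientError:
            if attempt == max_attempts:
                raise  # give up; the caller can set the page aside for later
            # Sleep a little longer each time, with jitter so two workers
            # do not wake up together and collide on the same node again.
            sleep(base_delay * attempt + random.uniform(0, base_delay))
```

The sleep is also where the cost comes from: every retry stalls a worker for at least `base_delay` seconds, so on a heavily connected graph the pauses can add up to more time than a plain serial import would take.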
I should be caching the queried pages locally, which would also make selectively re-importing some data easy, but I’ll leave that for later.
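For what it's worth, a local cache along those lines could be as small as the Python sketch below. Nothing like it exists in the project yet; `cached_fetch`, the injected `fetch` function and the on-disk layout are all assumptions for illustration.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir):
    """Return the page body for url, calling fetch(url) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any page name maps to a safe file name.
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    body = fetch(url)
    path.write_text(body, encoding="utf-8")
    return body
```

With something like this in place, a retried or re-imported page would be read from disk instead of hitting the TVTropes site again.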
There’s fairly little downside to this approach, actually.
I don’t get to add exactly the same data to a satellite node as I did before, since I can pass only one set of parameters for the list to link to (doesn’t seem like I can pass a map). This has the disadvantage of not being able to pass values like a code’s corresponding category, type and title, but we can update those once we crawl its specific page.
This means that the nodes returned by query-nodes-to-crawl may not have a URL. In this case we construct one out of the known base path that we are crawling, given that it’s the same for all pages.
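The fallback can be pictured with a short Python sketch (the real code is Clojure). The `BASE_URL` value, the `url` and `code` keys and the `node_url` helper are assumptions made for the example.

```python
# Assumed base path; the real value lives wherever the importer keeps its settings.
BASE_URL = "http://tvtropes.org/pmwiki/pmwiki.php/"


def node_url(node, base_url=BASE_URL):
    """Return the node's stored URL, or build one from its code if it is missing."""
    url = node.get("url")
    if url:
        return url
    return base_url + node["code"]


print(node_url({"code": "Main/HomePage"}))
```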
A road to Damascus aside…
I did run into a bit of a snag during this process.
I have two Neo4j installs on my local OS X: one from Homebrew, which I’ve been using for my quick experiments, and also a Docker container for the test environment. The Linode I quickly provisioned for running the import process and the test container were both running Neo4j 2.1.6. When I upgraded all of them to 2.2, everything worked fine both on the test container and on my local exploration install, but calls to Neo4j on the Linode failed with an “Unknown Error” - using the same data and deployed application code.
After a while of scrying stack traces and reviewing settings that got me nowhere, a light bulb went on. I removed that Neo4j install from yum and replaced it with the same Docker container I’m using for the test environment. Everything worked without a hitch.
So that’s why people rave about Dockerizing things.
The main next step would be cleanup. I still have the old per-node function calls in there, mostly since the test suite depends on them to create some of its test data, but I’ll start moving the tests to the new code.
Pretty satisfied with the whole learning process up until now, and given the core data being imported is not likely to change in the immediate future, it’s probably time for a first release.