And here's another blog post about Jackrabbit clusters and how to make your life better.

Adding a new instance to a Jackrabbit cluster is very easy, at least in the beginning. Just provide a proper repository.xml that points to the central sources, add a new cluster id and start it. Everything is taken care of from then on. The problems start when your data grows larger and larger.
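As an aside, the "add a new cluster id" part is easy to script. Here is a minimal sketch, assuming your repository.xml contains a Cluster element with an id attribute; the helper name set_cluster_id is made up for this example. It works on the raw text instead of parsing the XML, so the rest of the file (DOCTYPE, comments) stays untouched.

```python
import re

# Matches the id attribute inside the opening <Cluster ...> tag.
# Assumption: the cluster id is configured as <Cluster id="...">.
_CLUSTER_ID = re.compile(r'(<Cluster\b[^>]*?\bid=")[^"]*(")')


def set_cluster_id(repository_xml: str, new_id: str) -> str:
    """Return the repository.xml text with the Cluster id replaced by new_id."""
    updated, count = _CLUSTER_ID.subn(
        lambda m: m.group(1) + new_id + m.group(2),
        repository_xml,
        count=1,
    )
    if count != 1:
        raise ValueError('no <Cluster id="..."> element found')
    return updated


if __name__ == "__main__":
    sample = '<Cluster id="node1" syncDelay="2000">\n</Cluster>'
    print(set_cluster_id(sample, "node2"))
```

The sketch does not escape new_id for XML, so stick to plain names without quotes or angle brackets.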

When you add a new instance to a Jackrabbit cluster, the new instance starts up and begins to read all the content in order to build up its Lucene search index. It also reads the journal and rebuilds everything it needs from there. As you can imagine, this can take quite some time if a lot of content has built up.

Furthermore, as the journal can get huge pretty fast, Jackrabbit introduced a janitor which cleans the journal log daily. That's great if you have the same instances running all the time (they don't need the log from days or months ago), but not so great if you want to add new instances (and the wiki entry linked above warns you about exactly this).

But there's a solution to this very problem, and it's not that complicated:

  • Shut down one of your instances
  • Get the revision number that instance was at from your database
  • Copy your whole Jackrabbit repository directory to another server/location
  • Start your original Jackrabbit again
  • Change repository.xml in the copy to use a new node name in your cluster config
  • Add that node name to your DB in JOURNAL_LOCAL_REVISIONS with the revision number from the original instance
  • Start your new Jackrabbit instance (or keep it for backup purposes)
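
The two database steps in this list can be sketched roughly as follows. This is not what the scripts mentioned below do, just an illustration: it uses Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere, and the column names JOURNAL_ID and REVISION_ID are an assumption about the table layout that you should check against your own schema. The function name register_clone is made up for this example.

```python
import sqlite3

TABLE = "JOURNAL_LOCAL_REVISIONS"  # table name as used in the steps above


def register_clone(conn, source_id, clone_id):
    """Give clone_id the same local revision as source_id and return it.

    Assumed columns: JOURNAL_ID (node name) and REVISION_ID (revision number).
    """
    cur = conn.cursor()
    cur.execute(
        f"SELECT REVISION_ID FROM {TABLE} WHERE JOURNAL_ID = ?", (source_id,)
    )
    row = cur.fetchone()
    if row is None:
        raise LookupError(f"no revision recorded for {source_id!r}")

    # Refuse to overwrite a node name that is already registered.
    cur.execute(f"SELECT 1 FROM {TABLE} WHERE JOURNAL_ID = ?", (clone_id,))
    if cur.fetchone() is not None:
        raise ValueError(f"{clone_id!r} is already registered")

    cur.execute(
        f"INSERT INTO {TABLE} (JOURNAL_ID, REVISION_ID) VALUES (?, ?)",
        (clone_id, row[0]),
    )
    conn.commit()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} "
        "(JOURNAL_ID TEXT NOT NULL, REVISION_ID INTEGER NOT NULL)"
    )
    # 4711 is a made-up revision number for the demo.
    conn.execute(f"INSERT INTO {TABLE} VALUES ('node1', 4711)")
    print(register_clone(conn, "node1", "node2"))
```

With a MySQL driver the placeholder style is usually %s instead of ?, but the two queries stay the same.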

With this approach we can be sure that everything is in a consistent state (the Lucene indexes, for example). We can safely start the copy of the instance in another place, and it should pick up where it left off without losing anything (as long as the janitor didn't run between the backup and starting the new clone).

As a little proof of concept I wrote two little scripts which do exactly what I described above. They can be found on GitHub at github.com/chregu/Jackrabbit-clone-scripts. They are not used in production (yet) and handle one specific setup (we use MySQL as the persistent store, for example), but it should be easy to adjust them to your needs. They include some checks to avoid mistakes and stop if one fails, but I'm sure I missed some not-so-obvious ones. They will help us a lot in adding new instances to a cluster in a decent amount of time. I'm sure some of you out there can make use of them, too (even if only to learn how this works in Jackrabbit). The README has some more info.