Tuesday, December 1, 2009

Berkeley DB/GDBM Links

PhiloLogic uses GDBM (GNU Database Manager) for word searches. As we are starting to think about a new PhiloLogic series (the infamous "4"), we have been looking at a number of design and implementation issues, including advanced indexing schemes. For example, Clovis did a preliminary examination of various fuzzy matching systems. Richard and I have been starting to look at newer GDBM tools as well as Berkeley DB. Here are a few links and alternatives Richard proposed which I think we should experiment with and/or read:
Older Perl 5 style: the tie function:
http://perldoc.perl.org/functions/tie.html
This lets you tie any complex data structure to a Perl scalar, array, or hash, as you wish.
http://perldoc.perl.org/perltie.html
This is great for "hiding" object-oriented interfaces in a simple, "perl-ish" way. It can wrap GDBM or Berkeley, or MySQL, or SQLite, or Hadoop...and so on.
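
To make the "hide the database behind a hash" idea concrete, here is a rough analogue in Python rather than Perl. It is only a sketch: it assumes Python's standard dbm module (which picks whichever DBM backend is installed, GDBM or otherwise), and the function names, file name, and sample words are made up for illustration. It is not how PhiloLogic builds its indexes.

```python
import dbm
import os
import tempfile


def build_word_index(path, words):
    """Store word -> occurrence count in an on-disk DBM file."""
    # "c" opens the file read/write and creates it if it does not exist.
    with dbm.open(path, "c") as db:
        for word in words:
            key = word.encode("utf-8")  # DBM keys and values are bytes
            count = int(db[key]) if key in db else 0
            db[key] = str(count + 1).encode("ascii")


def lookup(path, word):
    """Return the stored count for a word, or 0 if it is absent."""
    with dbm.open(path, "r") as db:
        key = word.encode("utf-8")
        return int(db[key]) if key in db else 0


# Hypothetical sample data, written to a throwaway directory.
workdir = tempfile.mkdtemp()
index_path = os.path.join(workdir, "words")
build_word_index(index_path, ["la", "philosophie", "la", "nature"])
la_count = lookup(index_path, "la")
missing_count = lookup(index_path, "absent")
print(la_count, missing_count)
```

The calling code only ever sees dictionary-style access, which is the same convenience a tied hash gives you in Perl.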

The low-level Perl GDBM_File module:
http://search.cpan.org/~dapm/perl-5.10.1/ext/GDBM_File/GDBM_File.pm
tie and dbmopen can both use this core Perl module. On some Macs, I have had to recompile Perl to get GDBM_File working. You can't get it from CPAN.

The low-level Perl Berkeley DB module:
http://search.cpan.org/~pmqs/BerkeleyDB-0.39/BerkeleyDB.pod
Pretty nice, but it doesn't support all of the awesome Berkeley DB features, like joins. The Python binding will do joins for you, at C speed. The $db->associate($secondary, \&key_callback) function lets you automatically maintain a secondary index. The DBM Filter functionality will do customized byte packing and unpacking for you transparently.
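
For anyone unfamiliar with what "automatically maintain a secondary index" means, here is a small Python sketch of the idea. It does not use Berkeley DB at all: plain dictionaries stand in for the primary and secondary databases, and the class name IndexedStore, the author_of callback, and the sample records are all hypothetical. The only point is to show the role the key callback plays.

```python
class IndexedStore:
    """Primary key/value store that keeps one secondary index in sync."""

    def __init__(self, key_callback):
        self.primary = {}      # primary key -> record
        self.secondary = {}    # secondary key -> set of primary keys
        self.key_callback = key_callback

    def put(self, pkey, record):
        # On overwrite, drop the old secondary entry before adding the new one.
        if pkey in self.primary:
            self._unindex(pkey)
        self.primary[pkey] = record
        skey = self.key_callback(pkey, record)
        self.secondary.setdefault(skey, set()).add(pkey)

    def delete(self, pkey):
        self._unindex(pkey)
        del self.primary[pkey]

    def _unindex(self, pkey):
        old_skey = self.key_callback(pkey, self.primary[pkey])
        pkeys = self.secondary[old_skey]
        pkeys.discard(pkey)
        if not pkeys:
            del self.secondary[old_skey]

    def find_by_secondary(self, skey):
        """Return the primary keys filed under a secondary key, sorted."""
        return sorted(self.secondary.get(skey, ()))


def author_of(pkey, record):
    """Key callback: derive the secondary key from a record."""
    return record["author"]


store = IndexedStore(author_of)
store.put("doc1", {"author": "Diderot", "title": "Le Neveu de Rameau"})
store.put("doc2", {"author": "Voltaire", "title": "Candide"})
store.put("doc3", {"author": "Diderot", "title": "Jacques le fataliste"})
diderot_docs = store.find_by_secondary("Diderot")

store.delete("doc1")
diderot_after_delete = store.find_by_secondary("Diderot")
print(diderot_docs, diderot_after_delete)
```

The caller only ever writes to the primary store; the callback is what keeps the second index consistent on every put and delete.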

One other DBM product you might want to look at is Tokyo Cabinet:
http://1978th.net/tokyocabinet/
It runs Mixi, the Japanese equivalent of Facebook, as I understand it, and some googling suggests that it's quite hot in the NoSQL world. It's certainly faster than Berkeley DB, and lightweight, and it has nice Ruby bindings; the Perl ones are not so hot.
Some people claim it's more stable than Berkeley DB. There's an impressive set of benchmarks here:
http://tokyocabinet.sourceforge.net/benchmark.pdf
This should be compared with Oracle's benchmarks:
http://www.oracle.com/technology/products/berkeley-db/pdf/berkeley-db-perf.pdf
which show bulk reads of 5,000,000 records/sec. "un de ces" indeed.

Berkeley has more features and Tokyo might be faster; we'd probably want to test both of them at scale to see how they hold up. Tokyo is also designed to do cloud-style partitioning.
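
As a starting point for that kind of testing, here is a bare-bones timing harness in Python. It is a sketch under several assumptions: it uses the standard dbm module as a stand-in backend (it does not talk to Berkeley DB or Tokyo Cabinet), the function name time_store and the synthetic keys are invented for illustration, and the record count is far too small to say anything about behavior at scale. The idea is that any backend exposing a dictionary-style handle could be swapped in through the open_store argument.

```python
import dbm
import os
import tempfile
import time


def time_store(open_store, n_records):
    """Write then read n_records keys; return (write_s, read_s, found)."""
    keys = [("word%08d" % i).encode("ascii") for i in range(n_records)]

    start = time.perf_counter()
    with open_store("n") as db:          # "n": always start from an empty file
        for i, key in enumerate(keys):
            db[key] = str(i).encode("ascii")
    write_seconds = time.perf_counter() - start

    start = time.perf_counter()
    found = 0
    with open_store("r") as db:          # reopen read-only for the lookup pass
        for key in keys:
            if key in db:
                found += 1
    read_seconds = time.perf_counter() - start

    return write_seconds, read_seconds, found


# Throwaway location; swap the lambda for another backend's open function.
bench_path = os.path.join(tempfile.mkdtemp(), "bench")
n = 1000
write_s, read_s, found = time_store(lambda flag: dbm.open(bench_path, flag), n)
print("wrote %d records in %.4fs, read %d in %.4fs" % (n, write_s, found, read_s))
```

A real comparison would need realistic key distributions, much larger record counts, and cold-cache runs, none of which this sketch attempts.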

We also might want to look at Project Voldemort, which runs LinkedIn:
http://project-voldemort.com/
This one keeps its database in-memory, and has really sophisticated protocols for distributed hash tables, load balancing, consistency, etc.
Lots to think about indeed. And he sent along a little script as an example to tinker with, which I won't post here....

Thanks, Richard.