Tuesday, December 1, 2009

Berkeley DB/GDBM Links

PhiloLogic uses GDBM (GNU Database Manager) for word searches. As we are starting to think about a new PhiloLogic series (the infamous "4"), we have been looking at a number of design and implementation issues, including advanced indexing schemes. For example, Clovis did a preliminary examination of various fuzzy matching systems. Richard and I have been starting to look at newer GDBM tools as well as Berkeley DB. Here are a few links and alternatives Richard proposed which I think we should experiment with and/or read:
Older perl-5 style: the tie function:
This lets you tie any complex data structure into a Perl scalar, array, or hash, as you wish.
This is great for "hiding" object-oriented interfaces in a simple, "perl-ish" way. It can wrap GDBM or Berkeley, or MySQL, or SQLite, or Hadoop...and so on.
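
For anyone who wants to tinker outside Perl, the same "hide the database behind an ordinary hash" idea can be sketched in Python, whose standard `dbm` modules expose a DBM file through dict syntax. This is only an illustration, not Richard's script: the word counts and the function names `build_word_db` and `lookup` are made up, and it uses `dbm.dumb` purely because it is always available (where Python was built with GDBM support, `dbm.gnu` is the GDBM-backed module).

```python
import dbm.dumb
import os
import tempfile


def build_word_db(path, counts):
    """Write a {word: count} mapping to a DBM file at `path`."""
    with dbm.dumb.open(path, "c") as db:
        for word, count in counts.items():
            # DBM files store bytes, so encode both key and value.
            db[word.encode("utf-8")] = str(count).encode("ascii")


def lookup(path, word):
    """Return the stored count for `word`, or 0 if it is absent."""
    with dbm.dumb.open(path, "r") as db:
        key = word.encode("utf-8")
        if key not in db:
            return 0
        return int(db[key].decode("ascii"))


# Hypothetical data, stored in a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "words")
build_word_db(path, {"esprit": 3, "raison": 7})
print(lookup(path, "raison"))
print(lookup(path, "absent"))
```

The calling code never sees file handles or record formats, which is the appeal of tie as well.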

The low-level perl GDBM_File module:
tie and dbmopen both use this core Perl module. On some Macs, I have had to recompile Perl to get GDBM_File working. You can't get it from CPAN.

The low-level perl Berkeley DB module:
Pretty nice, but it doesn't support all of the awesome Berkeley DB features, like joins. The Python binding will do joins for you, at C speed. The $db->associate($secondary, \&key_callback) function lets you automatically maintain a secondary index. The DBM Filter functionality will do customized byte packing and unpacking for you transparently.
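
To make the three ideas in that paragraph concrete (a secondary index kept up to date by a key callback, a join across indexes, and transparent byte packing), here is a toy model in plain Python. It is emphatically not the Berkeley DB API: the class `IndexedStore`, its methods, and the (author id, year) record layout are all hypothetical, and in-memory dicts stand in for the real database files.

```python
import struct

# Hypothetical fixed record layout: (author_id, year) as two unsigned ints.
RECORD = struct.Struct(">II")


def pack(record):
    """Turn an (author_id, year) tuple into bytes, as a DBM filter would."""
    return RECORD.pack(*record)


def unpack(packed):
    """Inverse of pack(): bytes back to an (author_id, year) tuple."""
    return RECORD.unpack(packed)


class IndexedStore:
    """Primary store plus secondary indexes kept in sync via key callbacks."""

    def __init__(self):
        self.primary = {}    # primary key -> packed record
        self.secondary = {}  # index name -> {secondary key -> set of primary keys}
        self.callbacks = {}  # index name -> function(record) -> secondary key

    def associate(self, name, key_callback):
        """Register a secondary index and index any existing records."""
        self.callbacks[name] = key_callback
        self.secondary[name] = {}
        for pk, packed in self.primary.items():
            self._index(name, pk, unpack(packed))

    def _index(self, name, pk, record):
        skey = self.callbacks[name](record)
        self.secondary[name].setdefault(skey, set()).add(pk)

    def put(self, pk, record):
        """Store a record; every secondary index is updated automatically."""
        if pk in self.primary:  # drop stale index entries on overwrite
            old = unpack(self.primary[pk])
            for name, callback in self.callbacks.items():
                self.secondary[name].get(callback(old), set()).discard(pk)
        self.primary[pk] = pack(record)
        for name in self.callbacks:
            self._index(name, pk, record)

    def join(self, **criteria):
        """Return primary keys matching every index=value criterion."""
        sets = [self.secondary[n].get(v, set()) for n, v in criteria.items()]
        return set.intersection(*sets) if sets else set()


store = IndexedStore()
store.associate("author", lambda record: record[0])
store.associate("year", lambda record: record[1])
store.put("doc1", (1, 1751))
store.put("doc2", (1, 1762))
store.put("doc3", (2, 1762))
print(store.join(author=1, year=1762))
```

The point of the sketch is the division of labor: callers only ever call `put`, and the indexes and byte formats take care of themselves, which is what the real library does for you in C.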

One other DBM product you might want to look at is Tokyo Cabinet:
It runs Mixi, the Japanese equivalent of Facebook, as I understand it, and some googling suggests that it's quite hot in the NoSQL world. It's certainly faster than BerkeleyDB, and lightweight, and has nice Ruby bindings; the Perl ones, not so hot.
Some people claim it's more stable than Berkeley. There's an impressive set of benchmarks here:
This should be compared with Oracle's benchmarks:
which show bulk reads of 5,000,000 records/sec. "un de ces" indeed.

Berkeley has more features and Tokyo might be faster; we'd probably want to test both of them at scale to see how they hold up. Tokyo is designed to do cloud-style partitioning and stuff.
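
A first pass at that kind of testing could be as simple as the harness below, which times bulk writes and reads against anything that behaves like a dict of bytes and has a close() method. It is a rough sketch under stated assumptions: the function name `benchmark`, the key format, and the record count are arbitrary, and it is wired to Python's slow but always-available `dbm.dumb` only so that it runs anywhere. Bindings for Berkeley DB or Tokyo Cabinet would be swapped in through the `open_db` argument.

```python
import dbm.dumb
import os
import tempfile
import time


def benchmark(open_db, n_records=2000):
    """Time bulk writes, then bulk reads, against a dict-like bytes store."""
    keys = [b"key%08d" % i for i in range(n_records)]
    db = open_db()
    try:
        start = time.perf_counter()
        for key in keys:
            db[key] = b"value"
        write_secs = time.perf_counter() - start

        start = time.perf_counter()
        for key in keys:
            _ = db[key]
        read_secs = time.perf_counter() - start
    finally:
        db.close()
    # Guard against a zero interval on very coarse clocks.
    return {
        "writes_per_sec": n_records / max(write_secs, 1e-9),
        "reads_per_sec": n_records / max(read_secs, 1e-9),
    }


# "n" always creates a fresh, empty database in a throwaway directory.
bench_path = os.path.join(tempfile.mkdtemp(), "bench")
result = benchmark(lambda: dbm.dumb.open(bench_path, "n"))
print(result)
```

Numbers from a loop like this are only comparable when every store runs the same workload on the same machine, and a few thousand records says little about behavior at scale, so the record count would need to grow by several orders of magnitude before drawing conclusions.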

We also might want to look at Project Voldemort, which runs LinkedIn:
This one keeps its database in-memory, and has really sophisticated protocols for distributed hash tables, load balancing, consistency, etc.
Lots to think about indeed. And he sent along a little script as an example to tinker with, which I won't post here....

Thanks, Richard.

Conventional versus electron flow

The nice thing about standards is that there are so many of them to choose from.

Having a couple of motorcycles means that I also own a couple of battery chargers. During the winter, you need to keep the batteries charged when not riding, since letting one go completely flat wrecks it. Even during riding season, given the combination of big engines, small batteries, and periods when one does not ride much, you can don the helmet and leathers, stride confidently to the machine of choice, and get .... nothing. I was charging the 'buza's battery the other day, and asked the simple question: "which way does electricity flow?" Sure, in a DC system, electricity flows from the positive (red) to the negative (black) post. That's odd. Why would negatively charged electrons move from the positive to the negative? Shouldn't they move from negative to positive? The amount of misinformation on the Internet regarding this simple question is staggering. Thankfully, Tony R. Kuphaldt provides the answer in his discussion of "Conventional versus electron flow" on the very helpful All About Circuits site, a chapter that opens with the Andrew Tanenbaum quip above.

There are two ways to consider the direction of electricity flow, which are pretty much contradictory. The "conventional" view, where electricity flows from positive to negative, dates back to Franklin. He detected flow and assumed that there was a surplus of charge (hence positive) on one pole and a lack of charge on the other (hence negative). Of course, many years later, it was found that the true flow of electrons was the opposite, from negative to positive.

By the time the true direction of electron flow was discovered, the nomenclature of "positive" and "negative" had already been so well established in the scientific community that no effort was made to change it, although calling electrons "positive" would make more sense in referring to "excess" charge. You see, the terms "positive" and "negative" are human inventions, and as such have no absolute meaning beyond our own conventions of language and scientific description. Franklin could have just as easily referred to a surplus of charge as "black" and a deficiency as "white," in which case scientists would speak of electrons having a "white" charge (assuming the same incorrect conjecture of charge position between wax and wool).

However, because we tend to associate the word "positive" with "surplus" and "negative" with "deficiency," the standard label for electron charge does seem backward. Because of this, many engineers decided to retain the old concept of electricity with "positive" referring to a surplus of charge, and label charge flow (current) accordingly. This became known as conventional flow notation.

He goes on to discuss the distinction in useful detail, concluding:
I sometimes wonder if it would all be much easier if we went back to the source of the confusion -- Ben Franklin's errant conjecture -- and fixed the problem there, calling electrons "positive" and protons "negative."
Mystery resolved. Now, if I can only figure out how to get light bulbs out of sockets on ceiling fan/light systems without breaking them, I will be forever grateful. The vibration of the fan tends to wedge them in pretty tightly.....