Tuesday, December 1, 2009

Berkeley DB/GDBM Links

PhiloLogic uses GDBM (GNU Database Manager) for word searches. As we are starting to think about a new PhiloLogic series (the infamous "4"), we have been looking at a number of design and implementation issues, including advanced indexing schemes. For example, Clovis did a preliminary examination of various fuzzy matching systems. Richard and I have been starting to look at newer GDBM tools as well as Berkeley DB. Here are a few links and alternatives Richard proposed which I think we should experiment with and/or read:
Older perl-5 style: the tie function:
http://perldoc.perl.org/functions/tie.html
This lets you tie any complex data structure to a Perl scalar, array, or hash, as you wish.
http://perldoc.perl.org/perltie.html
This is great for "hiding" object-oriented interfaces in a simple, "perl-ish" way. It can wrap GDBM or Berkeley, or MySQL, or SQLite, or Hadoop...and so on.
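The same "hide the database behind a plain data structure" idea exists in Python's standard dbm module, which Richard's note doesn't mention but which makes a handy point of comparison: you get a persistent dict-like object backed by GDBM (dbm.gnu) when available. A minimal sketch, using the pure-Python dbm.dumb backend so it runs anywhere, with made-up word/offset data:

```python
import dbm.dumb as dbm  # pure-Python backend; dbm.gnu wraps GDBM when compiled in
import os
import tempfile

# Open (or create) a database; it behaves like a persistent dict of bytes.
path = os.path.join(tempfile.mkdtemp(), "wordindex")
db = dbm.open(path, "c")
db["lumiere"] = "12 45 301"  # word -> space-separated hit offsets (hypothetical)
hits = db["lumiere"].decode().split()
db.close()
print(hits)  # ['12', '45', '301']
```

The dict-like interface is doing the same work as Perl's tie: the caller never sees the storage engine, so swapping GDBM for Berkeley DB (or SQLite, or whatever) stays invisible.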

The low-level perl GDBM_File module:
http://search.cpan.org/~dapm/perl-5.10.1/ext/GDBM_File/GDBM_File.pm
tie and dbmopen both use this core Perl module. On some Macs, I have had to recompile Perl to get GDBM_File working; you can't get it from CPAN.

The low-level perl Berkeley DB module:
http://search.cpan.org/~pmqs/BerkeleyDB-0.39/BerkeleyDB.pod
Pretty nice, but it doesn't support all of the awesome Berkeley DB features, like joins; the Python binding will do joins for you, at C speed. The $db-&gt;associate($secondary, \&amp;key_callback) function lets you automatically maintain a secondary index, and the DBM Filter functionality will do customized byte packing and unpacking for you transparently.
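Since the idea behind associate() may not be obvious: you hand Berkeley DB a callback that extracts a secondary key from each record, and it keeps the secondary index in sync for you. As Python's standard library has no Berkeley DB binding, here is a hand-rolled sketch of the same idea over two plain dicts (toy data and names are mine, not any real API):

```python
# A toy secondary index: primary maps record id -> record; secondary maps
# a derived key (here: the record's author) -> the set of matching ids.
def make_indexed_store(key_callback):
    primary, secondary = {}, {}

    def put(rid, record):
        primary[rid] = record
        # The callback plays the role of Berkeley DB's key_callback.
        secondary.setdefault(key_callback(record), set()).add(rid)

    def by_secondary(key):
        return [primary[rid] for rid in sorted(secondary.get(key, ()))]

    return put, by_secondary

put, by_author = make_indexed_store(lambda rec: rec["author"])
put(1, {"author": "Diderot", "title": "Encyclopédie"})
put(2, {"author": "Voltaire", "title": "Candide"})
put(3, {"author": "Diderot", "title": "Le Neveu de Rameau"})
titles = [r["title"] for r in by_author("Diderot")]
print(titles)  # ['Encyclopédie', 'Le Neveu de Rameau']
```

The point of letting the library do this, of course, is that the secondary index can never drift out of sync with the primary, and updates happen at C speed.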

One other DBM product you might want to look at is Tokyo Cabinet:
http://1978th.net/tokyocabinet/
It runs Mixi, the Japanese equivalent of Facebook, as I understand it, and some googling suggests that it's quite hot in the NoSQL world. It's certainly faster than Berkeley DB, and lightweight, and has nice Ruby bindings--Perl, not so hot.
Some people claim it's more stable than Berkeley. There's an impressive set of benchmarks here:
http://tokyocabinet.sourceforge.net/benchmark.pdf
This should be compared with Oracle's benchmarks:
http://www.oracle.com/technology/products/berkeley-db/pdf/berkeley-db-perf.pdf
which show a bulk read rate of 5,000,000 records/sec. "un de ces" indeed.

Berkeley has more features, Tokyo might be faster, we'd probably want to test both of them out at scale to see how they hold up. Tokyo is designed to do cloud-style partitioning and stuff.

We also might want to look at Project Voldemort, which runs LinkedIn:
http://project-voldemort.com/
This one keeps its database in-memory, and has really sophisticated protocols for distributed hash tables, load balancing, consistency, etc.
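The core trick behind distributed hash tables of this kind is consistent hashing: nodes and keys are hashed onto the same ring, and a key belongs to the next node clockwise, so adding or removing a node only moves a small slice of the keys. A minimal sketch of the idea (my own toy code and node names, not Voldemort's actual protocol):

```python
import bisect
import hashlib

def h(s):
    """Hash a string to a large integer position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentRing:
    """Map keys to nodes so that membership changes move few keys."""
    def __init__(self, nodes, vnodes=64):
        # Each node gets several "virtual" positions to smooth the load.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.positions = [p for p, _ in self.ring]

    def node_for(self, key):
        # First ring position at or after the key's hash, wrapping around.
        i = bisect.bisect(self.positions, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentRing(["node-a", "node-b", "node-c"])
print(ring.node_for("lumiere"))
```

Real systems like Voldemort layer replication, versioning, and read-repair on top of this routing scheme, but the ring is where it starts.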
Lots to think about indeed. And he sent along a little script as an example to tinker with, which I won't post here....

Thanks, Richard.

Conventional versus electron flow

The nice thing about standards is that there are so many of them to choose from.

Having a couple of motorcycles means that I also own a couple of battery chargers. During the winter, you need to keep the batteries charged when not riding, since letting them go completely flat wrecks them. Even during riding season, the combination of big engines, small batteries, and stretches when one does not ride much means that you can don the helmet and leathers, stride confidently to the machine of choice, and get .... nothing. I was charging the 'buza's battery the other day, and asked the simple question: which way does electricity flow? Sure, in a DC system, electricity flows from the positive (red) to the negative (black) post. That's odd. Why would negatively charged electrons move from the positive to the negative? Shouldn't they move from negative to positive? The amount of misinformation on the Internet regarding this simple question is staggering. Thankfully, Tony R. Kuphaldt provides the answer in his discussion of "Conventional versus electron flow" on the very helpful All About Circuits site.

There are two ways to consider the direction of electricity flow, which are pretty much contradictory. The "conventional" view, where electricity flows from positive to negative, dates back to Franklin. He detected flow and assumed that there was a surplus of charge (hence positive) on one pole and a lack of charge on the other (hence negative). Of course, many years later, it was found that the true flow of electrons was the opposite, from negative to positive.

By the time the true direction of electron flow was discovered, the nomenclature of "positive" and "negative" had already been so well established in the scientific community that no effort was made to change it, although calling electrons "positive" would make more sense in referring to "excess" charge. You see, the terms "positive" and "negative" are human inventions, and as such have no absolute meaning beyond our own conventions of language and scientific description. Franklin could have just as easily referred to a surplus of charge as "black" and a deficiency as "white," in which case scientists would speak of electrons having a "white" charge (assuming the same incorrect conjecture of charge position between wax and wool).

However, because we tend to associate the word "positive" with "surplus" and "negative" with "deficiency," the standard label for electron charge does seem backward. Because of this, many engineers decided to retain the old concept of electricity with "positive" referring to a surplus of charge, and label charge flow (current) accordingly. This became known as conventional flow notation.

He goes on to discuss the distinction in useful detail, concluding
I sometimes wonder if it would all be much easier if we went back to the source of the confusion -- Ben Franklin's errant conjecture -- and fixed the problem there, calling electrons "positive" and protons "negative."
Mystery resolved. Now, if I can only figure out how to get light bulbs out of sockets on ceiling fan/light systems without breaking them, I will be forever grateful. The vibration of the fan tends to wedge them in pretty tightly.....

Wednesday, November 25, 2009

Projekt DeutschDiachronDigital

Alain suggests that Projekt DeutschDiachronDigital is involved in some interesting efforts that might be related to some work we are doing. There are a number of useful papers on the project's publication list, including
Lukas C. Faulstich, Ulf Leser, and Anke Lüdeling. Storing and Querying Historical Texts in a Relational Database. Informatik-Bericht Nr. 176, Institut für Informatik, Humboldt-Universität zu Berlin, February 2005.

Lukas C. Faulstich, Ulf Leser, and Thorsten Vitt. Implementing a Linguistic Query Language for Historic Texts. Query Languages and Query Processing (QLQP-2006): 11th Intl. Workshop on Foundations of Models and Languages for Data and Objects (FMLDO), 2006.
Interesting to see they are using SQL to power this project.



Monday, November 23, 2009

Geoffrey Rockwell's DHCS Notes

Geoffrey Rockwell has posted his DHCS Notes (link), which includes a rather provocative declaration that we might be witnessing the end of Digital Humanities:

The End of Digital Humanities I can't help thinking (with just a little evidence) that the age of funding for digital humanities is coming to an end. Let me clarify this. My hunch is that the period when any reasonable digital humanities project seemed neat and innovative is coming to an end and that the funders are getting tired of more tool projects. I'm guessing that we will see a shift to funding content driven projects that use digital methodologies. Thus digital humanities programs may disappear and the projects are shunted into content areas like philosophy, English literature and so on. Accompanying this is a shift to thinking of digital humanities as infrastructure that therefore isn't for research funding, but instead should be run as a service by professionals. This is the "stop reinventing wheel" argument and in some cases it is accompanied by coercive rhetoric to the effect that if you don't get on the infrastructure bandwagon and use standards then you will be left out (or not funded.) I guess I am suggesting that we could be seeing a shift in what is considered legitimate research and what is considered closed and therefore ready for infrastructure. The tool project could be on the way out as research as it is moved as a problem into the domain of support (of infrastructure.) Is this a bad thing? It certainly will be a good thing if it leads to robust and widely usable technology. But could it be a cyclical trend where today's research becomes tomorrows infrastructure to then be rediscovered later as a research problem all over.

TXM Search Engine

Serge Heiden suggests that we look at the CQP (Corpus Query Processor) and its successors which he is using in TXM:

-- Tiger Search : http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/
- the corresponding PhD thesis (in German) : http://www.ims.uni-stuttgart.de/projekte/corplex/paper/lezius/diss/
-- NXT Search : http://www.ims.uni-stuttgart.de/projekte/nite/
- related docs and technical papers :
- http://groups.inf.ed.ac.uk/nxt/nxtdoc/docnql.xml
- http://www.ltg.ed.ac.uk/NITE/documents/NiteQL.v2.1.pdf
- http://www.ltg.ed.ac.uk/NITE/papers/NXT-LREJ.web-version.ps

Since we're looking at various interesting models, I don't want to forget the CDL's XTF (eXtensible Text Framework).

Friday, November 13, 2009

International Journal of Motorcycle Studies

Ran across a call for papers for the INTERNATIONAL JOURNAL OF MOTORCYCLE STUDIES CONFERENCE, Colorado Springs, Colorado, June 3-6, 2010, with links to the journal. A biographical statement and an abstract of 150 words are due by January 15, 2010. Aside from pulling an abstract together, the only serious question is whether to take the Areo or the Buza out to the conference.

Thursday, November 12, 2009

DHCS 2009

The 4th annual Chicago Colloquium on Digital Humanities and Computer Science (DHCS) is fast approaching. This year's festivities are hosted by Shlomo Argamon and his collaborators at the Illinois Institute of Technology, November 14-16. The program is interesting and wide-ranging, and I am particularly looking forward to the presentations by our keynote speakers. Several members of the ARTFL group will be giving presentations at the pre-conference meetings and workshops on Saturday, also known as our "Birds of a Feather" meeting. Clovis and I will be talking about recent work, based in part on two talks, "From Words to Works" and "PAIR/PhiloLine", as well as some of the more recent work on topic modeling. I also prepared a more technical PhiloLogic overview and demonstration, followed by a discussion of database loading and configuration (slides), just in case I need one. The second half should probably be expanded at some point, since I have had many requests for more extensive documentation on loading and configuring databases in PhiloLogic.

Other links: PhiloLine/PAIR installations for ARTFL Frantext and the Encyclopédie.

Tuesday, November 10, 2009

Plaintext in PhiloLogic

A while back we added a plaintext loader to PhiloLogic at the request of several folks who wanted to work with documents from Project Gutenberg, Liber Liber, and (many) other archives of unencoded or minimally encoded documents. Other use cases for a plaintext loader include direct loading of OCR output and of E-PUBs downloaded from Google, which can also be converted to TEI as an alternative. I suspect that we will want to retain plaintext loading in future implementations of PhiloLogic, since many folks appear to have significant restrictions on accessing materials from various vendors. In a recent blog post, Devin Griffiths described his examination of MONK and ProQuest data before deciding to assemble his own corpus from Project Gutenberg.

Wednesday, November 4, 2009

Find Installed Perl Modules

Here is a helpful one-liner to find installed Perl modules, thanks to Blane Warrene:

perl -MFile::Find=find -MFile::Spec::Functions -lwe 'find { wanted => sub { print canonpath $_ if /\.pm\z/ }, no_chdir => 1 }, @INC'

There is also an interactive command called instmodsh.
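For what it's worth, the equivalent question in Python (not part of the original tip) has a similarly compact answer: pkgutil can walk sys.path and report every importable top-level module or package, roughly what the Perl one-liner does by scanning @INC for .pm files.

```python
import pkgutil

# List importable top-level modules/packages found on sys.path,
# analogous to scanning @INC for .pm files in Perl.
names = sorted(m.name for m in pkgutil.iter_modules())
print(len(names), names[:5])
```

Unlike the Perl version, this only looks one level deep; pkgutil.walk_packages will recurse into packages if you need the full inventory.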

Friday, October 30, 2009

Apache PDFBox

We have had a number of PDF-oriented projects in the past little while. Richard has brought to my attention an Apache Incubator project, PDFBox, which may be very handy for future work. In addition to the normal goodies one would expect, it supports "Lucene Search Engine Integration". Something to keep in mind.

Tuesday, October 27, 2009

Textométrie Project

The Textométrie Project is a multi-institutional and multi-disciplinary effort to develop an open source and distributed platform for sophisticated, often quantitative, text analysis led by Serge Heiden and his collaborators. There is a useful discussion of the French tradition of textométrie and how this fits into other modes of text processing and mining, along with some recent publications, and links to other software and resources. Alpha code is available on Sourceforge.

OMNIA Project

Our colleagues at the Ecole nationale des chartes are working with other researchers in France on the OMNIA (Outils et Méthodes Numériques pour l’Interrogation et l’Analyse des textes médiolatins) project, a four year effort to develop an interactive encyclopedia of medieval Latin.

Monday, October 26, 2009

Conference: Online Humanities Scholarship

Online Humanities Scholarship: The Shape of Things to Come "is a three day conference (March 26-8, 2010) to explore how to develop and sustain online humanities research and publication. Nine scholarly papers and eighteen responses will leverage discussion by a broad group of persons invited to the conference to contribute their expertise. This group includes scholars working on other projects and persons from funding agencies, publishers, museums, libraries, and professional organizations. The conference is closed to this group in order to provide maximum focus to the discussions."

This looks to be very interesting indeed. Have a peek at resources and participants. Papers and responses are to be posted well in advance of the meeting itself. Certainly something to keep track of.

Sunday, October 25, 2009

Total Perspective Vortex

Thinking about building a renvois navigation scheme, with some kind of visualization, for the Encyclopédie, reminded me of the Total Perspective Vortex from the Hitchhiker's Guide to the Galaxy, the greatest selling electronic book in the history of the universe. It is important to note that "in an infinite universe, the one thing sentient life cannot afford to have is a sense of proportion." Thankfully, the renvois system is finite, so we won't risk brain vaporization. The original radio broadcast is available in bits and pieces on YouTube, with Don't Panic in large, friendly letters as the video track. :-) The Guide's best advice is, aside from Don't Panic, "expect the unexpected".

Wednesday, October 21, 2009

Arbre généalogique: Static Image

We periodically get requests for a high-resolution image of the splendid representation of the organization of knowledge in the Encyclopédie called ESSAI D'UNE DISTRIBUTION GÉNÉALOGIQUE DES SCIENCES ET DES ARTS PRINCIPAUX de Chrétien Frederic Guillaume Roth (1769), which we have put up under Zoomify. The static image is a 10 MB JPEG file, available here. Browsers beware. I like this image so much that I purchased a large reproduction and had it nicely framed. Yes, the framing cost more than the reproduction. Isn't that always the case? Manuel Lima includes the Essai in his stunning array of visualizations at Visual Complexity, which is well worth the visit, and links it to a modern interactive representation of the Système Figuré des Connaissances Humaines by Christophe Tricot. The Encyclopédie Collaborative Translation Project has released an English translation of the Système Figuré.

Marti Hearst, Search User Interfaces

I have been reading Marti Hearst's excellent Search User Interfaces, which is fully available at http://www.searchuserinterfaces.com/. Of particular interest to me is her chapter on Information Visualization for Text Analysis. She writes that "the categorical nature of text, and its very high dimensionality, make it very challenging to display graphically" and goes on to present a number of ways to handle the display of text analysis results, from concordances to directed graphs. This is certainly something to consider for any future renovation of PhiloLogic and our related systems. We do have collocation clouds, and I did a quick implementation of word frequency histograms (link) in PhiloLogic. But these are very rudimentary. Some of the examples in Hearst's book are quite remarkable, and we might want to model extensions of PhiloLogic on some of them.
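As a toy illustration of the raw data behind displays like our collocation clouds (my sketch, not PhiloLogic code): count the words that fall within a small window around each occurrence of a target term, and the resulting frequency table is what gets rendered as a cloud or histogram.

```python
from collections import Counter

def collocates(tokens, target, window=2):
    """Count words within +/- window positions of each occurrence of target."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

text = "the light of reason and the light of nature guide the mind".split()
top = collocates(text, "light").most_common(3)
print(top)
```

Real collocation tables would add stopword filtering and a significance measure (mutual information, log-likelihood, and so on) on top of the raw counts, but the windowed count is the starting point.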

One final note for you scribblers out there. She has a couple of entries on http://www.searchuserinterfaces.com/blog/ about how she talked her publisher (Cambridge) into letting her put the book online for free, and why. :-)

An important and visually compelling site/book.

Monday, May 18, 2009

Yoga for cyclists

Riding season has started again, so this should be obvious
[YouTube].