Tuesday, November 10, 2009
Plaintext in PhiloLogic
A while back we added a plaintext loader to PhiloLogic at the request of several folks who wanted to work with documents from Project Gutenberg, Liber Liber and (many) other archives of unencoded or minimally encoded documents. Other use cases for a plaintext loader include loading OCR output directly and loading EPUBs downloaded from Google, which can alternatively be converted to TEI. I suspect that we will want to retain plaintext loading in implementations of PhiloLogic, since many folks appear to face significant restrictions on access to materials from various vendors. In a recent blog post, Devin Griffiths described examining MONK and ProQuest data before deciding to assemble his own corpus from Project Gutenberg.