At the time of writing (mid-1998), fifteen years of Notes and Queries and Gentleman's Magazine have been made available. The selected journal consists of a wide mixture of intermingled article types laid out on the page in a somewhat unstructured manner, so providing full-page images is the simplest way to present the material. A segment of a page is shown in figure 8-1. Note the skewing of the original scan (probably to microfilm) and the poor definition of the original typeface.
Each full-page image is between 100 and 200 KB in size. Pages can be browsed by volume/number/page or searched directly. The input to the search database is OCR text from the page scans. This OCR is done using off-the-shelf software (designed for modern typefaces and layouts) and without manual intervention, because correcting errors by hand would be prohibitively expensive. The result is quite 'dirty' OCR, which would normally present problems in searching, since exact-match queries miss misrecognised words. As a partial solution, the Excalibur Technologies EFS search engine is being used for full-text fuzzy matching on the dirty OCR text.
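To illustrate why fuzzy matching helps with dirty OCR, the following is a minimal sketch of approximate word search based on edit distance. It is not the method used by the Excalibur EFS engine, whose internals are not described here; the function names (edit_distance, fuzzy_find), the max_errors threshold and the sample OCR string are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete a character from a
                            curr[j - 1] + 1,       # insert a character into a
                            prev[j - 1] + cost))   # substitute (or match)
        prev = curr
    return prev[-1]


def fuzzy_find(query: str, ocr_text: str, max_errors: int = 2) -> list[tuple[int, str]]:
    """Return (word_index, word) for each OCR word within max_errors edits of query."""
    q = query.lower()
    hits = []
    for idx, word in enumerate(ocr_text.split()):
        # Strip surrounding punctuation so it does not count as an OCR error.
        cleaned = word.strip(".,;:!?'\"()").lower()
        if cleaned and edit_distance(q, cleaned) <= max_errors:
            hits.append((idx, word))
    return hits


# Hypothetical dirty OCR output, invented for this example.
page = "The Gentlernan's Magazme for the ycar 1798"
print(fuzzy_find("magazine", page))  # [(2, 'Magazme')]
```

An exact-match search for "magazine" would find nothing on this page, whereas allowing up to two character errors recovers the misrecognised word. A looser threshold finds more damaged words but also returns more false hits, so the tolerance would need tuning against real OCR output.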
Last modified: Monday, 11-Dec-2017 14:39:58 AEDT