Metadata Extraction Software

Installation Instructions

Package Version 4.1

June 28, 2010

Digital Library Research Group, Old Dominion University

1. Introduction

This document provides instructions for installing Extract, the ODU CS Metadata Extraction software. The software package is distributed as an executable Java JAR file. The package itself is platform-independent. However, because the OCR software used in conjunction with it only supports Microsoft Windows environments at this time, we provide installation instructions only for Windows.

Updated versions of this manual can be found at http://www.cs.odu.edu/~extract/developersWiki/doku.php?id=extract:manuals

2. System Requirements

  • Operating System: Windows XP or Vista.
  • Disk storage: approx. 350 MB (an additional 50-100 MB is needed during installation).
  • OCR Software (optional) - Currently supported are
    • Luratech (Abbyy), or
    • Omnipage 14 (support is being phased out)

3. Basic Installation

This section discusses a basic installation scenario.

We assume the use of the Luratech PDF Compressor product for OCR support. We also assume in this section that you are installing a single copy of the Extract program, and that Luratech will reside on the same PC as the Extract program.

A later section will cover how to modify these installation procedures for more complex scenarios, such as having multiple installations of Extract on different PCs sharing a common Luratech installation via the local network.

3.1 Install and Configure Luratech

Skip this step if you are going to use the Extract program without an OCR program.

Step 1: Install the Luratech LuraDocument PDF Compressor software.

Step 2: Create a folder named “Luratech” on the C: drive. Inside that folder, create three folders named “ocr_in”, “ocr_out”, and “ocr_error” respectively.

(You may choose to use different folders for this purpose if necessary. In particular, if those folders are already in use for some other purpose, or if you need to keep this data on a different drive, then by all means use a more appropriate location. As you proceed through the remainder of the configuration of Luratech and the installation of the metadata extraction software, remember where you chose to place these folders and enter that information in place of any uses of c:\Luratech\… )
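The folder layout from step 2 can also be created with a short script. The following is a minimal sketch, not part of the Extract distribution; the function name create_ocr_folders is hypothetical, and the base folder is passed in so that an alternative location can be used.

```python
from pathlib import Path

# Sub-folder names taken from step 2 above.
OCR_SUBFOLDERS = ("ocr_in", "ocr_out", "ocr_error")


def create_ocr_folders(base_dir):
    """Create base_dir and the three OCR sub-folders; return their paths."""
    base = Path(base_dir)
    created = []
    for name in OCR_SUBFOLDERS:
        folder = base / name
        # parents=True also creates base_dir; exist_ok=True makes reruns harmless.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

For the default layout described above, the call would be create_ocr_folders(r"C:\Luratech").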

Step 3: You should see a red or green Luratech icon in the taskbar tray. Right-click on this and select “Open main window”.

  • The LuraDocument control window should open.

Step 4: From the Entry menu, select “Add New Entry”.

  • You will see a new “Stopped” entry created. It will probably have a name “Entry 01” or something similar.

Step 5: Right-click on the new entry and select Properties.

  • The properties window will open.

Step 6: Give this entry a more descriptive name (e.g., “Extract batch”).

Step 7: On the “Mode” tab, select “Apply OCR”. Click on “OCR Options”.

  • The OCR Options window will come up.
  • Select the “most accurate” Mode. For “Additional output formats”, select XML and choose “simplified” from the drop-down box.
  • Select “Auto-detect page orientation”.
  • Click “OK” to close the OCR Options window.

Step 8: On the “Input” tab, select “Directory” and then click on the folders button and choose the C:\Luratech\ocr_in folder that you created in step 2.

  • Select “Check Every” and enter 15 for seconds.
  • For Input file formats, select PDF and clear the other options.

Step 9: On the “Output” tab, select “Place output in directory” and then click on the folders button and choose the C:\Luratech\ocr_out folder that you created in step 2.

  • Select “Overwrite existing”.

Step 10: On the “Options” tab, select “Delete input file” on success and “Move input file to” on failure. Click on the “failure” folders button and choose the C:\Luratech\ocr_error folder that you created in step 2.

Step 11: Click on “OK” to close the Properties window.

Step 12: Back in the main window, right-click on the new entry and select “Start” to run Luratech.
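Once the entry is running, one way to confirm that the hot folders work is to copy a test PDF into ocr_in and watch ocr_out for new files. The sketch below polls a folder until something new appears. It is an illustration only: the function name and default timings are assumptions, and it makes no claim about how Luratech names its output files.

```python
import time
from pathlib import Path


def wait_for_new_files(folder, already_seen, timeout_s=120.0, poll_s=5.0):
    """Poll folder until a file not in already_seen appears.

    Returns the sorted names of the new files, or an empty list on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        current = {p.name for p in Path(folder).iterdir() if p.is_file()}
        new = sorted(current - set(already_seen))
        if new or time.monotonic() >= deadline:
            return new
        time.sleep(poll_s)
```

Because the entry configured above checks its input folder every 15 seconds, a timeout well above that interval is a sensible starting point.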

3.2 Install Java JRE

Obtain the Sun Java JRE from http://java.com/en/download/index.jsp and follow the instructions given there to install it.
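To confirm that the JRE is installed and reachable from the command line, a check along the following lines may help. This is a sketch under the assumption that the java launcher is on the PATH; the function name is hypothetical.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the text printed by `java -version`, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    # The JVM traditionally writes its version banner to stderr, not stdout.
    return (result.stderr or result.stdout).strip()
```

A return value of None suggests that the JRE is either not installed or not on the PATH.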

3.3 Metadata Extraction software installation

Download the software at http://dtic.cs.odu.edu/deliverables/deliverablesNew.html. The software is packaged as a compressed jar file. Download this file and save it in a convenient folder. Open a window showing that folder so that you can see the downloaded file. Warning: some versions of Microsoft Internet Explorer will change the file extension from “.jar” to “.zip” when it is saved. If you discover that this has occurred, you will need to rename the file to restore the “.jar” ending. Right-click on the file and select “rename” from the menu.
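The rename described in the warning above can also be scripted. The following is a minimal sketch; the function name is hypothetical, and the file path is whatever name your browser saved.

```python
from pathlib import Path


def restore_jar_extension(downloaded_file):
    """Rename a file ending in .zip so that it ends in .jar; return the final path."""
    path = Path(downloaded_file)
    if path.suffix.lower() != ".zip":
        return path  # extension was not changed, nothing to fix
    target = path.with_suffix(".jar")
    if target.exists():
        # Refuse to overwrite an existing JAR file.
        raise FileExistsError(f"{target} already exists")
    path.rename(target)
    return target
```

The function leaves files that already end in “.jar” untouched, so it is safe to run on a correctly saved download.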

Following are the steps of installation:

Step 1: Double click the JAR file to start the installation.

Step 2: Choose the “EndUser” option in the “InstallationType”.

You will be presented with a dialog box of installation options. Some of these will be disabled (greyed out), indicating that these options are not available or appropriate for that document collection. Even those that can be changed will have been filled in with “default” or recommended values, which in many cases are likely to be acceptable for your organization.

Step 3: Linux installations: The top “Choose” button is for the location of the jar file that you have launched. This is not always determined accurately when launching on Linux systems. Use the corresponding “Choose” button to correct this.

Step 4: Specify the location where the software should be installed (click on the circled portion). This could be any directory of your choice. The installer will have suggested a default location, and you can use that if you find it acceptable.

Step 5: Select the folder where you will put the input PDF documents to be processed (click on the circled portion). The installer will have suggested a default location, and you can use that if you find it acceptable. It is best if this is an existing but empty folder.

Step 6: Select the output folder where the software should place the metadata it extracts (click on the circled portion). The installer will have suggested a default location, and you can use that if you find it acceptable. It is best if this folder is empty, to avoid overwriting other unrelated files.

Step 7: If you are not using an OCR program, skip to step 8.

If you have configured Luratech to take its input and output files from someplace other than C:\Luratech (see 3.1 Install and Configure Luratech, step 2, above), then click the “Change” button for “OCR batch mode is set to read PDFs from:” and enter the alternative you selected for C:\Luratech\ocr_in. Then click the “Change” button for “OCR batch mode is set to write XML to:” and enter the alternative you selected for C:\Luratech\ocr_out.

Step 8: Installation is complete! Press Continue to finish.

The installed Metadata Extraction software consists of a set of JAR files and other files needed to extract metadata from PDF files. A batch file (metadataextraction.bat) is provided in the installationDirectory\extract_software directory to simplify execution.

4. More Advanced Installations

4.1 Fixing/Changing Choices Made at Installation Time

If you decide to move the folders you use for PDF input or metadata output, or if you wish to change the directories used for working with the Luratech OCR program, you may do so by re-installing. However, a re-installation is not really necessary. These options are controlled by properties files and can be changed using Notepad or any other plain text editor.

Look in the installed-software directory. The two files of particular interest are installation.properties and OCRbatch.properties.

In installation.properties, you should find two lines that look something like

extract.inputDir=C:/extract/extractInputs
extract.outputDir=C:/extract/extractOutputs

The values on the right of the '=' characters may vary depending on what folders were specified at installation time.

Change these to alter the PDF input and metadata output directories, respectively. [Take note: entries in properties files always use the / symbol to separate folders, not the \ character.]
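If you prefer to script such a change instead of editing the file by hand, something like the following could be used. It is a minimal sketch: the function name set_property is hypothetical, it handles only simple key=value lines, and it converts any \ characters in the new folder to / as required above.

```python
from pathlib import Path


def set_property(properties_file, key, new_dir):
    """Replace the value of `key` in a properties file; return True if the key was found."""
    path = Path(properties_file)
    value = str(new_dir).replace("\\", "/")  # properties files use / between folders
    # Java properties files are conventionally ISO-8859-1 (latin-1) encoded.
    lines = path.read_text(encoding="latin-1").splitlines()
    found = False
    for i, line in enumerate(lines):
        name, sep, _old = line.partition("=")
        if sep and name.strip() == key:
            lines[i] = f"{key}={value}"
            found = True
    if found:
        path.write_text("\n".join(lines) + "\n", encoding="latin-1")
    return found
```

For example, set_property("installation.properties", "extract.inputDir", r"D:\pdfs\incoming") would write the line extract.inputDir=D:/pdfs/incoming; the D:\pdfs\incoming folder is only an illustration. Making a copy of the properties file before changing it is advisable.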

In OCRbatch.properties, you should find two lines that look something like

extract.inputProcessing.ocr.in_dir=C:/Luratech/ocr_in
extract.inputProcessing.ocr.out_dir=C:/Luratech/ocr_out

The values on the right of the '=' characters may vary depending on what folders were specified at installation time.

Change these to alter the directories used for interacting with Luratech.
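After editing either properties file, it can be useful to check that every configured folder actually exists. The sketch below reads simple key=value lines and reports the keys whose folders are missing. Both function names are hypothetical, and the parser deliberately ignores the more advanced features of the properties format, such as line continuations and escapes.

```python
from pathlib import Path


def load_properties(properties_file):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in Path(properties_file).read_text(encoding="latin-1").splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_directories(props, keys):
    """Return the keys whose value is absent or does not name an existing folder."""
    return [k for k in keys if k not in props or not Path(props[k]).is_dir()]
```

Calling missing_directories with the two extract.inputProcessing.ocr keys shown above returns an empty list when both Luratech folders are in place.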

4.2 Sharing an OCR Program Over a Local Network

In this section we describe the procedure for installing one or more copies of the Extract program on PCs on a local network where a single installation of the Luratech OCR program resides on another PC.

For this to work:

  • Your network must permit “shared folders” among the PCs that will be used.
  • You must have purchased a “networking” license from Luratech that allows you to use their program to read and write files in shared folders.

For the sake of this discussion, we will assume that you want to install the Extract program on N different PCs, which we will refer to as “ExtractPC1”, “ExtractPC2”, …, “ExtractPCN”. We will assume that Luratech has already been installed on another PC, which we will call “LuratechPC”.

  1. On the LuratechPC, create folders named Luratech1, Luratech2, …, LuratechN. Within each of those folders, create sub-folders ocr_in, ocr_out, and ocr_error.
  2. Use Windows Explorer (a.k.a. My Computer) to view the folder(s) containing Luratech1, Luratech2, …, LuratechN. Right-click on each of those folders, select “Properties”, and go to the Sharing tab. Select “Share this folder”. [This procedure may vary slightly for different versions of the Windows operating system.]
  3. Go through the steps in 3.1 Install and Configure Luratech, starting at step 3 and using your new Luratech1 folder instead of c:\Luratech. Repeat for Luratech2, …, LuratechN. At the end of this process, you should have a separate Luratech batch job for each PC that will be running Extract.
  4. Now log on to the ExtractPC1 machine. You now want to create a shortcut to the Luratech1 share on the LuratechPC. [The procedure for doing this may vary considerably depending on what version of Windows you are running and on how your local network is set up.] Use My Computer to view the place where you would like this shortcut to reside (e.g., the C: drive). Right-click in an empty space within that folder/drive and select “New shortcut”. When the Create Shortcut wizard pops up, you can either enter the share name directly (e.g., \\LuratechPC\Luratech1) or use the Browse button to navigate your local network and search for it. If you can't find the share, please contact a local expert on your network setup. Repeat this process on each of the machines ExtractPC2, …, ExtractPCN, using the corresponding Luratech2, …, LuratechN share.
  5. Go through the steps in 3.2 Install Java JRE on each of the machines ExtractPC1, ExtractPC2, …, ExtractPCN.
  6. On the ExtractPC1 machine, go through the steps in 3.3 Metadata Extraction software installation, using your new Luratech1 shortcut instead of c:\Luratech. Repeat on ExtractPC2, …, ExtractPCN, using the Luratech2, …, LuratechN shortcuts respectively.
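When N is large, it may help to write down which share belongs to which PC before starting. The sketch below generates that mapping using the naming convention from the steps above. The function name share_plan is hypothetical, and the host name defaults to the placeholder “LuratechPC” used in this section; substitute the real network name of your Luratech machine.

```python
def share_plan(n, luratech_host="LuratechPC"):
    """Map each Extract PC name to the UNC paths of its folders on the Luratech PC."""
    plan = {}
    for i in range(1, n + 1):
        # UNC share name, e.g. \\LuratechPC\Luratech1
        share = r"\\{}\Luratech{}".format(luratech_host, i)
        plan["ExtractPC{}".format(i)] = {
            "share": share,
            "ocr_in": share + r"\ocr_in",
            "ocr_out": share + r"\ocr_out",
            "ocr_error": share + r"\ocr_error",
        }
    return plan
```

For instance, share_plan(3)["ExtractPC2"]["ocr_in"] yields \\LuratechPC\Luratech2\ocr_in.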

4.3 Sending the Extract Output to Another Machine Over a Local Network

The final metadata output appears in the extractOutputs\resolved folder.

  • This can be shared over the local network so that software running on another PC can load the generated files of metadata from it.
  • Alternatively, it can be replaced by a shortcut to a shared folder on the same PC or another PC on the network that already exists for use by some other software.
    • Multiple installations of Extract on different PCs can use this technique to route their output to a common collection point.
  • Similarly, the work folder among multiple installations of the Extract program can be shared so that locally-produced templates can be deployed to a single folder and thereby be made available to several installations of the software.
    • Note, however, that staff who are working on the development of new templates should not share their work folders with production copies of the Extract program, lest they inadvertently propagate buggy templates into the production environment.
  • As a general rule, sharing of the other Extract folders is not recommended, nor are we aware of any procedural benefits to additional sharing.

5. Updating the Software

An installed instance of the Extract program contains two main directories (folders): extract-software and work.

The extract-software directory contains the source code and data obtained from our distribution of the software.

The work directory contains local modifications (e.g., templates developed at your own site).

To update the Extract program software:

  1. Make a back-up copy of the work folder, just in case.
  2. Take note of the directories being used in the old installation: the software installation folder, the PDF input folder (often c:\extract\extractInputs), the metadata output folder (often c:\extract\extractOutputs), and the Luratech input and output folders (often c:\Luratech\ocr_in and c:\Luratech\ocr_out). [If you are unable to determine what folders were used, you can find this information recorded in the extract-software folder, in the files installation.properties and OCRbatch.properties.]
  3. Clean out the PDF input and metadata output folders [optional, but recommended].
  4. Get a new binary distribution of the Extract software. You may do this by downloading it from our Software Releases page or by getting the latest source distribution from that page and compiling the source code. Either way, you should wind up with a .jar file.
  5. Follow the instructions at 3.3 Metadata Extraction software installation, above. Make sure to give the same answers for the installation location, the PDF input, metadata, and Luratech directories as were used before.

The new installation will replace all files in the extract-software directory but should leave your local modifications in the work directory unchanged.
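The back-up in step 1 can be automated. The following is a minimal sketch, assuming that the work folder sits directly inside the installation directory as described above; the function name and the time-stamped naming scheme for the copy are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_work_folder(install_dir, backup_root):
    """Copy install_dir/work to backup_root/work-YYYYMMDD-HHMMSS; return the copy's path."""
    source = Path(install_dir) / "work"
    if not source.is_dir():
        raise FileNotFoundError(f"no work folder at {source}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"work-{stamp}"
    # copytree creates target (and missing parents) and fails if target already exists.
    shutil.copytree(source, target)
    return target
```

Keeping the copy outside the installation directory ensures that the installer cannot touch it.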

extract/installation/installation_manual.txt · Last modified: 2013/02/11 13:51 by zeil