Metadata Extraction Software

Operational Manual

Package Version 4.0

March 15, 2010

Digital Library Research Group, Old Dominion University

1. Introduction

In this file, we provide instructions for running and monitoring the ODU CS Metadata Extraction software Version 4. Updated versions of this manual can be found at http://www.cs.odu.edu/~extract/developersWiki/doku.php?id=extract:manuals

During installation of the software, three important directories were selected:

  1. The installation directory where the software itself was installed.
  2. The input directory where PDF documents to be analyzed can be dropped.
  3. The output directory where metadata extracted from the input PDFs will be placed.

Also selected at installation time was the default document collection for which the software would be configured.

For this document we assume

  • The input directory (the directory where you will put input documents to be processed) was chosen during installation as C:\extract\extractInputs.
  • The output directory (the directory where you will get all the output files) was chosen during installation as C:\extract\extractOutputs.
    • Please note that this directory further has several sub-directories and the final output goes to C:\extract\extractOutputs\resolved if extraction was successful or to C:\extract\extractOutputs\untrusted when extraction produces metadata whose quality is suspect.
      • By special request from DTIC, the DTIC release is configured to place all metadata output in C:\extract\extractOutputs\resolved regardless of whether the output is considered trustworthy or not.
    • The other sub-directories are for debugging and intermediate processing.
  • All the necessary software has been installed in the directory: C:\extract\installed.

You may have installed the software at slightly different locations, in which case you will need to adjust the instructions below accordingly.
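The directory layout assumed above can be captured in a small Python sketch, which may help when scripting around the software. The class name ExtractLayout is hypothetical, and the default paths are the ones this manual assumes; substitute your own if you installed elsewhere.

```python
from dataclasses import dataclass
from pathlib import PureWindowsPath


@dataclass(frozen=True)
class ExtractLayout:
    """Directory layout assumed in this manual (hypothetical helper class)."""

    installed: PureWindowsPath = PureWindowsPath(r"C:\extract\installed")
    inputs: PureWindowsPath = PureWindowsPath(r"C:\extract\extractInputs")
    outputs: PureWindowsPath = PureWindowsPath(r"C:\extract\extractOutputs")

    @property
    def resolved(self) -> PureWindowsPath:
        # Metadata that the software considers trustworthy.
        return self.outputs / "resolved"

    @property
    def untrusted(self) -> PureWindowsPath:
        # Metadata whose quality the software considers suspect.
        return self.outputs / "untrusted"

    @property
    def backup(self) -> PureWindowsPath:
        # Original PDFs moved here after processing from the input directory.
        return self.outputs / "backup"
```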

2. Metadata Extraction Process

2.1 Starting the Software

2.1.1 Quick Start

Open a folder showing the installation directory. Inside that directory you will find a folder extract_software. Open this folder. Double-click on the file metadataextraction.bat within that folder. (For Linux installations, cd to that same directory and run metadataextraction.sh.) This launches the software, running on the default document collection selected at installation time.

Drop PDF files into the input directory C:\extract\extractInputs and the Extract software will begin processing them. (Other input options are described in Supplying Inputs, below.) Watch the directories C:\extract\extractOutputs\resolved and C:\extract\extractOutputs\untrusted for the output metadata.

2.1.2 Running on Other Document Collections

Open a folder showing the installation directory. Inside that directory you will find a folder extract_software. Open this folder. You will see a number of files within that folder with names of the form metadataextraction_collection.bat, with collection being the name of a document collection such as “dtic” or “gpo_epa”. Double-click on the file for the collection you want. This launches the program, configured to work with that collection.

2.1.3 Customized Launch

Several behaviors of the system can be customized via command-line parameters. To do this, the metadataextraction.bat file must be launched from a Windows cmd window or from a “shortcut” icon defined with the appropriate command parameters. If you do not know how to do either of these, have one of your local systems staff create the desired shortcut for you.

The most important command-line parameters are:

  • -batchmode : The program will automatically shut down after processing everything in its input directory. If the input directory is empty when the program starts, it will wait for at least one file to be dropped there.
  • -nogui : Suppresses the appearance of the window-based interface. Status info on the progress made by the program will scroll by in a Windows cmd window. If -batchmode has not been specified, the program will run until terminated with Ctrl-C.
  • -collection=name : Changes which document collection the program will work with.
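As an illustration of how these parameters combine, the following Python sketch assembles a command line for an unattended run. The function name build_extract_command is hypothetical, the default launcher path is the one assumed in this manual, and the sketch assumes that the order of the parameters does not matter. It only builds the argument list; actually launching the program (for example with subprocess) is left to the reader.

```python
from typing import List, Optional

# Launcher location assumed in this manual; adjust for your installation.
DEFAULT_LAUNCHER = r"C:\extract\installed\extract_software\metadataextraction.bat"


def build_extract_command(
    launcher: str = DEFAULT_LAUNCHER,
    batch_mode: bool = False,
    no_gui: bool = False,
    collection: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for launching the Extract software."""
    command = [launcher]
    if batch_mode:
        # Shut down automatically once the input directory has been processed.
        command.append("-batchmode")
    if no_gui:
        # Suppress the window-based interface.
        command.append("-nogui")
    if collection is not None:
        # Select a document collection other than the installed default.
        command.append(f"-collection={collection}")
    return command
```

For example, build_extract_command(batch_mode=True, no_gui=True, collection="dtic") yields the launcher path followed by -batchmode, -nogui, and -collection=dtic.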

If you are collecting error and debugging information, possibly under the direction of or in response to requests by ODU team members, the following options may be requested:

  • -debugMode=true : Turns on collection of a substantial amount of information on the intermediate steps of the extraction process. (Currently, debug-mode is on by default.)
  • -properties=filename : Much of the behavior of the program is controlled by an elaborate set of properties files. This option overrides the normal choice of properties files (created as part of the software installation) and allows loading of a custom set of properties.

2.2 Controlling the Software

As the software is running, you will see a window something like this:

Figure 1 – Main Window for the Extract Program

This window allows you to supply inputs to the program, and to see the progress being made on analyzing documents.

2.2.1 Supplying Inputs

  • You can supply input PDF documents to the program by simply copying the files into the input directory. Within a few seconds, you should see an entry appear for each such document as the program detects the input and begins to work on it. When the program finishes with documents supplied in this manner, the original PDFs are moved from the input directory to the backup folder within the output directory.
    • You can change which directory is used as the input directory by selecting Watch… from the File menu.
  • You can also supply inputs directly to the program by selecting Load… from the File menu and then simply choosing the PDF file or files that you want to process. Files selected in this manner can be in any folder, not just in the input directory. Input files chosen in this manner are not moved to the backup folder when the program is done with them, but are left, unchanged, in their original location.
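Copying files into the input directory can also be scripted. The following Python sketch copies every PDF from a source folder into the watched input directory. The function name submit_pdfs is hypothetical. The sketch copies rather than moves, so the originals stay in place even though the software later moves the processed copies to the backup folder. Whether the software tolerates a file that is still being written when it is detected is not stated in this manual, so treat that as an open assumption for very large files.

```python
import shutil
from pathlib import Path
from typing import List


def submit_pdfs(source_dir: Path, input_dir: Path) -> List[Path]:
    """Copy every PDF in source_dir into the watched input directory.

    Returns the paths of the copies that were created.
    """
    copied = []
    for pdf in sorted(source_dir.iterdir()):
        if not (pdf.is_file() and pdf.suffix.lower() == ".pdf"):
            continue  # ignore sub-folders and non-PDF files
        target = input_dir / pdf.name
        if target.exists():
            continue  # do not overwrite a file that is already waiting
        shutil.copy2(pdf, target)
        copied.append(target)
    return copied
```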

2.2.2 Monitoring Progress

Each column shown for a document represents a specific step in the extraction of metadata from that document. It is not really necessary to know what these steps are, but, roughly speaking (Figure 2), a document is first subjected to input processing (conversion from PDF to an internal form, possibly requiring the use of a separate OCR program). An attempt is then made to locate a form (a.k.a. Report Documentation Page) from which metadata can be extracted. If no form is found, the document is passed on for classification of its layout and for metadata extraction without a form. Metadata extracted by either form or non-form extraction is then post-processed into a standard format. Finally, the output is checked to see whether it appears valid.

Figure 2 Overview of Extraction Process
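The flow described above can be summarized in a short Python sketch. The function name and the step names are descriptive labels chosen for illustration; they are not necessarily the column titles shown in the program window, and the sketch models only the order of the steps, not the extraction itself.

```python
from typing import List


def planned_steps(form_found: bool) -> List[str]:
    """Return the processing steps a document passes through, in order."""
    steps = ["input processing", "form detection"]
    if form_found:
        # Metadata is read directly from the form.
        steps.append("form extraction")
    else:
        # Without a form, the layout is classified before extraction.
        steps += ["layout classification", "non-form extraction"]
    steps += ["post-processing", "validation"]
    return steps
```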

As the document passes through the various steps, the columns reflect the status of each step (Figure 1):

  • Green indicates a step that completed successfully.
  • Yellow indicates a step in progress.
  • White indicates a step that started but was found to be unsuited to the document.
  • Grey indicates a step that has not started or that is not required for this document.
  • Dark red indicates a step that failed. Such failures should be reported to the support staff.
  • Pale red indicates a step that failed, but for a document-specific reason. The two common cases are:
    1. A pale red “no templates” entry under the Classification column indicates that the software could not find a matching template for that document. This could be because the document itself is in such poor condition that nothing much could be done with it, or because it is a new type of document and no template has yet been written for its layout.
    2. A pale red “untrusted” entry in the Validation column indicates that metadata was extracted, but that the software itself judges the output as being suspicious and is requesting human inspection and, if necessary, correction of the output. Clicking on this entry will open up a metadata editor that will allow direct correction of the metadata.

In both cases, the failure can be handled in some acceptable manner. The most common reason for a pale red entry is that metadata was extracted for a document, but the validation step decided that one or more output fields looked suspicious and is recommending that they be reviewed or corrected by a human.

2.2.3 Interacting with the Running Software

Clicking on the column headers for the document names, the starting times, or the finishing times will cause the table entries to be sorted on that column. This can be useful, for example, as a way of gathering all the unfinished document entries together at the top of the display.

Clicking on an entry for a particular document, below most of the column headers, will show the data generated in that step. Of particular interest:

  • Click an entry in the Document column to view the PDF of the original input.
  • Click a pale red “untrusted” entry in the Validated column to open up a metadata editor that will permit changes to the extracted metadata.
  • Click on a green “trusted” entry in the Validated column or a green entry in the Saved column to view the extracted metadata record.

Clicking on entries in other columns may bring up displays of intermediate data at different steps in the document processing, but these outputs will be mainly of interest to the software developers and maintainers.

2.2.4 Obtaining the Output Metadata

Output appears in the output directory designated at the time of installation of the software. If the extraction resulted in metadata that is considered by the program to be of good quality (the validation step completed with a green indicator) then the output appears in the resolved folder within the output directory. If the extraction resulted in metadata that is considered suspicious by the program (the validation step completed with a pale red indicator) then the output appears in the untrusted folder within the output directory.

(As noted earlier, for DTIC, all output will be placed into resolved no matter how the validator judges the output.)
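A script that picks up the output might look like the following Python sketch. The function name collect_outputs is hypothetical. Because this manual does not describe the names or format of the metadata files, the sketch simply lists every file found in the resolved and untrusted folders.

```python
from pathlib import Path
from typing import Dict, List


def collect_outputs(output_dir: Path) -> Dict[str, List[Path]]:
    """List the files currently in the resolved and untrusted folders."""
    results: Dict[str, List[Path]] = {}
    for folder in ("resolved", "untrusted"):
        path = output_dir / folder
        if path.is_dir():
            results[folder] = sorted(p for p in path.iterdir() if p.is_file())
        else:
            # A folder that does not exist is reported as empty.
            results[folder] = []
    return results
```

Files listed under "untrusted" are the ones for which the software is requesting human review.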

2.3 Stopping the Software

Select Exit from the File menu. Shutdown is not always immediate. In particular, if an OCR operation is in progress, the software will wait for this to finish before stopping entirely.

3. Managing the Input and Output Directories

During normal operation, the input directory will be emptied out as documents are processed.

The output directory will accumulate files, and may grow unwieldy over time. You may want to clean this out if it gets too cluttered. You should make sure that you have copies of any document PDFs that were moved from the input directory to the backup folder in the output directory. Make sure, as well, that you have any output metadata sets that you need from the resolved and untrusted folders of the output directory. Any files in these three folders that you no longer need may be deleted at any time, even when the program is running.

The remaining folders in the output directory can be deleted if desired, but this should only be done when the program is not running.
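Cleaning out the three folders that may be emptied while the program is running can be sketched in Python as follows. The function name clean_output_dir and the age threshold are hypothetical: this manual says only that files you no longer need may be deleted, so using file age as the criterion is an assumption made for this example. The sketch defaults to a dry run that reports what would be deleted without deleting anything, and it never touches the remaining folders of the output directory.

```python
import time
from pathlib import Path
from typing import List, Optional

# Folders that may be cleaned even while the program is running.
SAFE_FOLDERS = ("backup", "resolved", "untrusted")


def clean_output_dir(
    output_dir: Path,
    max_age_days: float,
    dry_run: bool = True,
    now: Optional[float] = None,
) -> List[Path]:
    """Find (and optionally delete) files older than max_age_days.

    Returns the list of stale files. Nothing is deleted unless
    dry_run is False.
    """
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400  # seconds per day
    stale = []
    for folder in SAFE_FOLDERS:
        path = output_dir / folder
        if not path.is_dir():
            continue
        for item in sorted(path.iterdir()):
            if item.is_file() and item.stat().st_mtime < cutoff:
                stale.append(item)
                if not dry_run:
                    item.unlink()
    return stale
```

Before running with dry_run=False, make sure you have copies of any PDFs and metadata files that you still need.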

extract/operation/operation_manual.txt · Last modified: 2013/02/11 13:47 by zeil