Monthly Report for June 2006



Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06


Work Accomplishments during period




Document Classification:

1)      Created a testbed for preliminary experimentation with image-based classification algorithms. Demoed prototype software to display a color visual layout of point pages.

2)      Proposed a baseline/alternative to the classifier: apply all templates, validate the results, and choose the “best” one. This approach highlights the need for validation of extracted metadata. Spawned a new active subgroup to focus on validation.
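The apply-all-templates baseline described in item 2 could be sketched roughly as follows. This is an illustrative sketch only; the function names, the toy templates, and the trivial validator are hypothetical stand-ins, not the project's actual code.

```python
# Sketch of the proposed baseline (hypothetical interfaces): apply every
# extraction template to a document, validate each result, keep the best.

def apply_all_templates(document, templates, validate):
    """Run each template and return the extraction with the highest
    validation confidence (None if no template produced output)."""
    best = None
    best_score = float("-inf")
    for template in templates:
        metadata = template(document)   # extracted field -> value map
        if not metadata:
            continue
        score = validate(metadata)      # confidence in [0, 1]
        if score > best_score:
            best, best_score = metadata, score
    return best, best_score

# Toy demonstration with two fake "templates" and a trivial validator
# that prefers all-caps titles.
doc = "REPORT TITLE\nJ. Smith\nAbstract text..."
templates = [
    lambda d: {"title": d.splitlines()[0]},
    lambda d: {"title": d.splitlines()[1]},
]
validate = lambda md: 1.0 if md["title"].isupper() else 0.2
best, score = apply_all_templates(doc, templates, validate)
```

The point of the sketch is that the "classifier" step disappears: validation scores do the selection, which is why a trustworthy validation layer becomes the critical component.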


Metadata Extraction:

1)      Added post-processing demo to website.

2)      Began investigating extraction of selected fields based largely upon content vocabulary (i.e., semantic tagging). Candidate fields include authors, author affiliations, and addresses. The idea is to identify these fields first, then search for other fields relative to the positions of those already identified (e.g., a template might indicate that we should look for a title just above the author names).

3)      Omnipage OCR software: the new version (15) drastically changes the XML output produced by OCR and used as input to our engines. After comparing OCR output from the old and new versions, we revived an older effort to define a non-proprietary XML document model for input to our engines. Began work on:

a.    converting new Omnipage version 15 output to our non-proprietary model

b.    revising form-based engine to work from the non-proprietary format instead of directly from Omnipage version 14

c.    converting non-proprietary model to the “clean XML” intermediate form used as input to the nonform-based engine
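The vocabulary-based tagging idea in item 2 above could look roughly like this: tag lines whose tokens match a name vocabulary as author lines, then locate the title relative to them. The name vocabulary and helper names below are hypothetical illustrations, not the project's engine.

```python
# Illustrative sketch (not the project's actual engine): tag a page's text
# lines by content vocabulary, then locate other fields relative to the
# already-identified ones -- here, "the title sits just above the authors".

KNOWN_SURNAMES = {"smith", "jones", "nguyen"}   # stand-in name vocabulary

def looks_like_author_line(line):
    """Crude semantic tag: a line counts as an author line if any token
    matches the name vocabulary."""
    tokens = [t.strip(".,").lower() for t in line.split()]
    return any(t in KNOWN_SURNAMES for t in tokens)

def extract_title_and_authors(lines):
    """Find the first author line, then take the line above it as the title."""
    for i, line in enumerate(lines):
        if looks_like_author_line(line):
            title = lines[i - 1] if i > 0 else None
            return title, line
    return None, None

page = ["Tools for Metadata Extraction", "A. Smith and B. Jones", "Abstract..."]
title, authors = extract_title_and_authors(page)
```

A real engine would of course use layout coordinates from the OCR output rather than raw line order, but the relative-position principle is the same.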


NASA Collection Characteristics:

1) Completed feasibility report on collection characteristics.


Validation of Extracted Metadata

1)      An internal whitepaper proposed a collection of validation functions. Many of the proposed functions are statistical (e.g., is the extracted data significantly longer than the average for this field? Does it contain an unusually large number of misspellings or unknown words? Does it contain an unusually large number of phrases that have never been seen in prior instances of this field?). Validation will therefore result in a “confidence” value indicating the degree to which extracted data meets the norms for the collection. Not all validation functions will be appropriate to all fields, so we propose to use custom bindings of validation functions to metadata fields via a collection-specific validation spec (in XML).

2)      Set up collection of DTIC metadata for statistics collection on fields of interest

3)      Implemented code for length statistics collection

4)      Began work on phrase dictionary construction

5)      Began work on a binding language based upon the Apache Jelly project.
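A statistical validation function of the kind proposed in item 1 could be sketched as follows: score an extracted value against length norms gathered from prior instances of the field, and return a confidence value rather than a hard accept/reject. The scoring formula below is a hypothetical illustration, not the whitepaper's specification.

```python
# Hedged sketch of one proposed validation function: confidence that an
# extracted value's length is typical for its field, based on its z-score
# against lengths observed in prior instances of that field.
import statistics

def length_confidence(value, observed_lengths):
    """Return a confidence in (0, 1]: 1.0 at the collection mean length,
    falling toward 0.0 as the length deviates from the norm."""
    mean = statistics.mean(observed_lengths)
    stdev = statistics.pstdev(observed_lengths) or 1.0  # avoid divide-by-zero
    z = abs(len(value) - mean) / stdev
    return 1.0 / (1.0 + z)

# Prior title lengths from the collection vs. a suspiciously long extraction.
prior = [40, 45, 50, 55, 60]
typical = length_confidence("A Title of Ordinary Length, About Fifty", prior)
suspect = length_confidence("X" * 400, prior)
```

The collection-specific XML spec would then decide which such functions apply to which fields and how their confidences combine.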


Auxiliary Activities:

1) Closed the R2 effort in view of a lack of interest from the potential sponsor.


Problem Areas and Corrective Actions




Deviations in Cost/Schedule




Work to be Accomplished Next Period

1) Complete conversion of the engines to the non-proprietary document model and generation of that model from Omnipage 15

2) Complete validation binding framework and core validation function sets (pattern testing, length & vocabulary tests, phrase dictionary tests)

3) Work on feature enhancements to the template engine