Monthly Report for July 2007

 

 

Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 04/30/07

 

Work Accomplishments during period

 

·        Had paper accepted at STEV07

·        Researched using text PDF as input instead of OCR; continuing work to see whether we can emit a PDF tree with geometry, fonts, and text

·        Finished refactoring IDM generation

·        Normalizing the set of properties:

       1) convert old config file properties to descriptive hierarchical names

       2) convert directory/path fragments properties to full paths (use property substitution)

       3) convert getProperty to getPropertyAsFile

       4) look for hard-coded strings for filenames and filename extensions - change to properties
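
The four normalization steps above can be illustrated with a minimal sketch. Only the names getProperty and getPropertyAsFile come from this report; the class name, the property names, the ${...} substitution syntax, and the method signatures are assumptions made for illustration and may differ from the actual extractor code.

```java
import java.io.File;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the project's actual configuration code.
public class PropertySketch {
    // Matches references of the assumed form ${some.property.name}.
    private static final Pattern REF = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Returns the property value with every ${name} reference expanded. */
    static String resolve(Properties props, String key) {
        return resolve(props, key, 0);
    }

    private static String resolve(Properties props, String key, int depth) {
        if (depth > 20) {
            throw new IllegalStateException("Circular property reference at " + key);
        }
        String raw = props.getProperty(key);
        if (raw == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        Matcher m = REF.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Expand the referenced property first, so nested references work.
            String replacement = resolve(props, m.group(1), depth + 1);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Assumed shape of getPropertyAsFile: a resolved property as a File. */
    static File getPropertyAsFile(Properties props, String key) {
        return new File(resolve(props, key));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Steps 1 and 2: descriptive hierarchical names, full paths via substitution.
        props.setProperty("extractor.install.dir", "/opt/extractor");
        props.setProperty("extractor.templates.dir", "${extractor.install.dir}/templates");
        // Step 4: a filename extension held in a property instead of hard-coded.
        props.setProperty("extractor.ocr.output.extension", ".xml");

        // Step 3: callers ask for a File rather than a raw string.
        File templates = getPropertyAsFile(props, "extractor.templates.dir");
        System.out.println(templates.getPath());
    }
}
```

Returning a File from one accessor keeps path handling in a single place, so callers no longer concatenate directory fragments themselves.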

·        Set up system and regression tests in the build: system test in the extractor subproject, regression test at the top level

·        Retested Omnipage 15; still the same problem with the XML output: all words have the same x (left boundary) coordinate, although the right boundaries are correct

 

Problem Areas and Corrective Actions

 

None

 

Deviations in Cost/Schedule

 

None

 

Work to be Accomplished Next Period

1) Need an install script that creates installation.properties pointing to the actual installation location

2) More work on postprocessing, text PDF, and regression testing
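
As a rough sketch of what the install step in item 1 might do, the following writes an installation.properties file that records the directory where the software was installed. The file name installation.properties comes from this report; the class name, the property name extractor.install.dir, and the choice to place the file inside the install directory are assumptions for illustration only.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

// Hypothetical sketch of an install step; not the planned script itself.
public class InstallSketch {
    /** Writes installation.properties into installDir and returns the file. */
    static File writeInstallationProperties(File installDir) throws IOException {
        Properties props = new Properties();
        // Assumed property name recording the actual install location.
        props.setProperty("extractor.install.dir", installDir.getAbsolutePath());

        File target = new File(installDir, "installation.properties");
        try (OutputStream out = new FileOutputStream(target)) {
            props.store(out, "Generated at install time");
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Install directory taken from the first argument, else the current directory.
        File installDir = new File(args.length > 0 ? args[0] : ".");
        System.out.println("Wrote " + writeInstallationProperties(installDir));
    }
}
```

Other properties could then refer to the recorded location through property substitution rather than repeating an absolute path.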

 

Papers:

“A Scriptable, Statistical Oracle for a Metadata Extraction System”, Kurt Maly, Steven J. Zeil, Mohammad Zubair, Ashraf Amrou, Ali Aazhar, Naveen Ratkal; accepted for publication, STEV07, Portland, Oregon, Oct 2007