Monthly Report for August 2006



Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06


Work Accomplishments during period




Continued work on using the validation approach as a means of classification. Work completed so far under this category:


The language framework is complete and in CVS.

Dictionary builder arguments:

  • metadata field name
  • proportion of collection to use for construction of dictionaries (remainder for testing)
  • phrase length
  • output file name
  • seed

Stats program arguments:

  • metadata field name
  • proportion of collection to use for construction of dictionaries (remainder for testing)
  • phrase length
  • constructed dictionary file name
  • output file name
  • seed
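The dictionary builder arguments listed above could map onto a command line roughly as follows. This is only a minimal sketch under stated assumptions, not the project's code: the flag names, the helper functions `split_collection` and `build_dictionary`, and the record layout (a list of dicts mapping field names to strings) are all hypothetical and chosen for illustration.

```python
import argparse
import random
from collections import Counter


def parse_builder_args(argv):
    """Parse the five dictionary-builder arguments (flag names are assumed)."""
    p = argparse.ArgumentParser(description="Dictionary builder (illustrative sketch)")
    p.add_argument("--field", required=True, help="metadata field name")
    p.add_argument("--proportion", type=float, required=True,
                   help="fraction of the collection used to build dictionaries")
    p.add_argument("--phrase-length", type=int, required=True,
                   help="number of words per phrase")
    p.add_argument("--output", required=True, help="output file name")
    p.add_argument("--seed", type=int, required=True, help="random seed")
    return p.parse_args(argv)


def split_collection(records, proportion, seed):
    """Shuffle reproducibly, then split into (construction, testing) sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * proportion)
    return shuffled[:cut], shuffled[cut:]


def build_dictionary(records, field, phrase_length):
    """Count every run of `phrase_length` consecutive words in the given field."""
    counts = Counter()
    for record in records:
        words = record.get(field, "").lower().split()
        for i in range(len(words) - phrase_length + 1):
            counts[" ".join(words[i:i + phrase_length])] += 1
    return counts
```

Passing the same seed to both the dictionary builder and the stats program would let the stats program reproduce the identical construction/testing split, which is presumably why both argument lists include it.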



Metadata Extraction RDP:

  • Completed stylesheet converting new Omnipage version 15 output to our non-proprietary model
  • Continuing work on revising form-based engine to work from the non-proprietary format instead of directly from Omnipage version 14
  • Worked with DTIC to clarify steps in running the form-based automation process using Omnipage 14
  • Modifying steps in the process in accordance with DTIC's wishes
    • Rename output directories and create a backup directory for PDF files
    • Add an attribute to the XML output to show which form was used
    • Add an attribute to the XML output to show whether post-processing failed
    • Add log files showing the reasons for post-processing failures
    • Test the modifications in the code and develop a test suite for future testing
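The two attribute additions and the failure log in the list above might look like the following. This is a hedged sketch only: the attribute names `form` and `postprocessing`, the logger name, and the function `annotate_output` are assumptions made for illustration, and the report does not specify the actual names or the output schema.

```python
import logging
import xml.etree.ElementTree as ET


def annotate_output(xml_text, form_name, failure_reason=None):
    """Record on the root element which form matched and whether post-processing failed.

    Attribute names here are hypothetical placeholders.
    """
    root = ET.fromstring(xml_text)
    root.set("form", form_name)
    root.set("postprocessing", "failed" if failure_reason else "ok")
    if failure_reason:
        # Write the reason for the failure to a log, as the process change requires.
        logging.getLogger("postprocess").warning(
            "post-processing failed: %s", failure_reason)
    return ET.tostring(root, encoding="unicode")
```

Keeping the failure flag in the XML itself and the detailed reason in a separate log would let downstream tools filter failed records without parsing log files.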


Metadata Extraction general:

  • Converting the non-proprietary model to the “clean XML” intermediate form used as input to the nonform-based engine
  • Adding statistics collection for coverpage detection to the converter chaining stylesheets


NASA Collection Characteristics

  • The site now has 360+70 NASA documents
  • We have delivered (via the website) the template set (2) of forms found in NASA documents and set up a demo showing the engine’s performance (it correctly identifies all forms and fields in the sample set)

  • Have manually classified documents into:
    • Class1 - 134 docs
    • Class2 - 37 docs
    • Class3 - 36 docs
    • Class4 - 15 docs
    • Class5 - 12 docs
    • Class6 - 4 docs
    • Unclassified - 43 docs
  • Templates still have to be written.

Problem Areas and Corrective Actions


The major problem is the complete change of Omnipage’s XML schema when they went from version 14 to 15; we are in the process of making the software independent of the OCR software used.


Deviations in Cost/Schedule




Work to be Accomplished Next Period

1) Complete conversion to Omnipage 15 for both form-based and non-form-based engines

2) Work on feature enhancement of template engine

3) Complete NASA templates for the two big classes

4) Have first prototype of validation vs. classification package