Monthly Report for September 2006
Project No.: 260671
Funding Agency: Defense Logistics Agency -
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Work Accomplishments during period
We completed the first version of the validation approach and started integrating it into classification. For classification, we plan to apply all templates to each document and use the validator to select the best result.
Starting testbed: 500 non-form DTIC documents
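The planned classification step (apply every template, then let the validator pick the best result) can be sketched roughly as follows. This is only an illustrative sketch, not the project's code: the function names, the representation of a template as a callable, and the toy validator that scores by the fraction of non-empty fields are all assumptions made for this example.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]


def classify_by_validation(
    document: str,
    templates: Dict[str, Callable[[str], Metadata]],
    validate: Callable[[Metadata], float],
) -> Tuple[Optional[str], Metadata, float]:
    """Apply every template and keep the result the validator scores highest."""
    best_name: Optional[str] = None
    best_meta: Metadata = {}
    best_score = float("-inf")
    for name, extract in templates.items():
        meta = extract(document)
        score = validate(meta)
        if score > best_score:
            best_name, best_meta, best_score = name, meta, score
    return best_name, best_meta, best_score


# Hypothetical template: title on the first line, author on the second.
def template_a(text: str) -> Metadata:
    lines = text.splitlines()
    return {
        "title": lines[0] if lines else "",
        "author": lines[1] if len(lines) > 1 else "",
    }


# Hypothetical template: expects explicit "Title:" and "Author:" labels.
def template_b(text: str) -> Metadata:
    meta = {"title": "", "author": ""}
    for line in text.splitlines():
        if line.startswith("Title:"):
            meta["title"] = line[len("Title:"):].strip()
        elif line.startswith("Author:"):
            meta["author"] = line[len("Author:"):].strip()
    return meta


# Toy validator (an assumption): fraction of fields that are non-empty.
def toy_validator(meta: Metadata) -> float:
    if not meta:
        return 0.0
    return sum(1 for value in meta.values() if value.strip()) / len(meta)


best_template, best_metadata, best_score = classify_by_validation(
    "Sample Report Title\nA. Author",
    {"template_a": template_a, "template_b": template_b},
    toy_validator,
)
```

In this sketch the winning template name doubles as the document's class, which is the sense in which validation is used for classification.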
Metadata Extraction RDP:
Tested and released Version 2 of the DTIC Form-based Metadata Extraction Software. We tested the new release on 20 DTIC form and 10 DTIC non-form documents: we confirmed that files were created in the correct directories, visually inspected the results, and checked that the post-processor was working on form documents.
New version updates
1. There is no longer any need to back up the PDF files: all PDF files placed in “C:\dticdocs\dticsoftware_input_pdf” can be found at “C:\dticdocs\backup”.
2. The “C:\dticdocs\dticsoftware_output\resolved” directory contains only three directories: “meta”, “final”, and “xml”. The “meta” folder contains the metadata file before post-processing; its root node now carries an attribute specifying the type of template used to extract the metadata. The “final” folder contains the metadata file after post-processing; its root node carries an attribute specifying whether the post-processing step succeeded or failed. After post-processing, the “meta” folder will be empty, because its file is the one used for validation and is then moved to the “final” folder.
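As a rough illustration of how a downstream script might read the root-node attributes described above, consider the sketch below. The element name "metadata" and the attribute names "template" and "postprocessing" are hypothetical placeholders chosen for this example; the actual names used in the released output files are not stated in this report.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def root_attribute(xml_text: str, attribute: str) -> Optional[str]:
    """Return the named attribute of the root node, or None if it is absent."""
    root = ET.fromstring(xml_text)
    return root.get(attribute)


# Hypothetical file contents; element and attribute names are assumptions.
sample_meta = (
    '<metadata template="example_template">'
    "<title>Example</title>"
    "</metadata>"
)
sample_final = (
    '<metadata template="example_template" postprocessing="success">'
    "<title>Example</title>"
    "</metadata>"
)

template_used = root_attribute(sample_meta, "template")
postprocessing_status = root_attribute(sample_final, "postprocessing")
```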
Metadata Extraction general:
· Integrated the new document model, IDM, with our extraction software
· Started developing a regression test to enable easy testing of incremental changes to the code. Specifically, we want the regression test to compare the outputs of the changed engine against prior outputs (demo 13 output). We plan to use XMLUnit or a similar XML diff program so that whitespace and other trivial changes are not flagged.
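The kind of whitespace-insensitive comparison intended for the regression test can be sketched as follows. This sketch does not use XMLUnit; it substitutes a small hand-written comparison built on Python's standard xml.etree.ElementTree module, and the sample XML and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def _collapse(text):
    """Collapse runs of whitespace and trim the ends; None becomes ''."""
    return " ".join((text or "").split())


def canonical(elem):
    """Build a nested tuple that ignores whitespace and attribute order."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        _collapse(elem.text),
        _collapse(elem.tail),
        tuple(canonical(child) for child in elem),
    )


def same_xml(expected: str, actual: str) -> bool:
    """True if two XML documents differ only in trivial whitespace."""
    return canonical(ET.fromstring(expected)) == canonical(ET.fromstring(actual))


# Hypothetical prior output and two new outputs to compare against it.
baseline = (
    '<metadata template="t1"><title>Some Title</title>'
    "<author>A. Author</author></metadata>"
)
reformatted = """<metadata template="t1">
  <title>Some   Title</title>
  <author>  A. Author </author>
</metadata>"""
changed = (
    '<metadata template="t1"><title>Some Title</title>'
    "<author>B. Writer</author></metadata>"
)

only_whitespace_differs = same_xml(baseline, reformatted)
content_differs = not same_xml(baseline, changed)
```

A regression run would then apply same_xml to each file produced by the changed engine and its counterpart in the stored prior outputs, reporting only the pairs that differ in content.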
Problem Areas and Corrective Actions
The major problem is the complete change in Omnipage’s XML schema between versions 14 and 15. We uncovered more problems with the content of Omnipage 15 XML: when a form is split across pages, the second page is not identified as a table at all. This adds to other concerns (all words in a line are marked as beginning at the same position). We concluded that Omnipage 15 XML is too broken to use.
Deviations in Cost/Schedule
Work to be Accomplished Next Period
1) Integrate the validation approach to perform classification.
2) Test the validation approach.
3) Complete the regression test.