Monthly Report for October 2006
Project No.: 260671
Funding Agency: Defense Logistics Agency -
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Work Accomplishments during period
The new validation subsystem assesses confidence in the metadata extracted for a particular document based upon a mixture of objective criteria and statistical models of previously observed metadata from a document collection. We completed the code for collecting those statistics and compiled the statistical models for the DTIC collection based upon approximately 800,000 metadata records. We completed the first system test of the validator with actual DTIC documents based upon those models, and began customizing the statistical code required to build similar models for the NASA collection.
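As a rough illustration of how a statistical model built from previously observed metadata can feed a confidence score, the sketch below uses field length as the statistic. This is purely an assumption for the example: the report does not say which statistics the validator collects, and the names build_length_model and field_confidence are hypothetical.

```python
import statistics
from typing import Dict, Iterable, Sequence, Tuple

Metadata = Dict[str, str]


def build_length_model(
    records: Iterable[Metadata], fields: Sequence[str]
) -> Dict[str, Tuple[float, float]]:
    """Collect (mean, standard deviation) of value lengths for each field.

    Field length is an illustrative statistic only, not the project's model.
    """
    records = list(records)
    model = {}
    for field in fields:
        lengths = [len(r[field]) for r in records if r.get(field)]
        if len(lengths) >= 2:  # stdev needs at least two observations
            model[field] = (statistics.mean(lengths), statistics.stdev(lengths))
    return model


def field_confidence(value: str, mean: float, stdev: float) -> float:
    """Score in [0, 1]: 1.0 at the mean length, lower as the length deviates."""
    if stdev == 0:
        return 1.0 if len(value) == mean else 0.0
    z = abs(len(value) - mean) / stdev
    return 1.0 / (1.0 + z)
```

A value whose length lies two standard deviations from the collection mean scores 1/3 under this hypothetical scoring function. A real validator would combine many such signals with the objective criteria mentioned above.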
Completed basic code to employ the validation subsystem for document classification. Documents will be processed with all applicable extraction templates, yielding multiple candidate metadata sets. Each set is then validated, and the set with the highest confidence value is chosen. Preliminary results suggested that, even prior to tuning the validation spec, this approach yielded a close approximation to prior manually assigned classifications. Began work on an online demo of the validation subsystem.
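The selection step described above can be sketched as follows. This is a minimal illustration, not the project's code: the function name classify_by_validation, the representation of templates as callables, and the toy validator are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]


def classify_by_validation(
    document: str,
    templates: Dict[str, Callable[[str], Optional[Metadata]]],
    validate: Callable[[Metadata], float],
) -> Optional[Tuple[str, Metadata, float]]:
    """Return (template name, metadata, confidence) for the best candidate.

    Each template is tried; a template returns None when it does not apply.
    """
    candidates = []
    for name, extract in templates.items():
        metadata = extract(document)
        if metadata is None:
            continue
        candidates.append((name, metadata, validate(metadata)))
    if not candidates:
        return None
    # The winning template's name serves as the document's class.
    return max(candidates, key=lambda c: c[2])


# Toy stand-ins (hypothetical): score = fraction of expected fields filled.
def toy_validate(metadata: Metadata) -> float:
    expected = ("title", "author", "date")
    return sum(1 for f in expected if metadata.get(f)) / len(expected)


toy_templates = {
    "report_form": lambda doc: {"title": "T", "author": "A", "date": ""},
    "thesis": lambda doc: {"title": "T", "author": "", "date": ""},
    "memo": lambda doc: None,  # does not apply to this document
}
best = classify_by_validation("document text", toy_templates, toy_validate)
```

With these toy inputs the "report_form" candidate wins because two of its three expected fields are filled. In the real system the confidence value would come from the validation subsystem rather than a field count.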
Metadata Extraction RDP:
Minor bug fixes only during this period.
Metadata Extraction general:
IDM spec for representation of OCR’d documents published on website.
Completed a basic framework for regression testing. Began first sweep (100 documents) of regression tests comparing output from older code that worked from the proprietary Omnipage 14 format with the newer code based upon the IDM format. Some differences were found and are being analyzed.
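A regression sweep of this kind amounts to comparing, field by field, the metadata produced by the two code versions for each document. The sketch below assumes each version's output is available as a dictionary keyed by document id. The function names and the whitespace normalization are illustrative assumptions, not a description of the actual framework.

```python
from typing import Dict, List, Tuple

Metadata = Dict[str, str]
FieldDiff = Tuple[str, str, str]  # (field, old value, new value)


def diff_metadata(old: Metadata, new: Metadata) -> List[FieldDiff]:
    """List every field whose value differs between the two outputs."""
    diffs = []
    for field in sorted(set(old) | set(new)):
        old_value = old.get(field, "")
        new_value = new.get(field, "")
        # Assumption: collapse runs of whitespace so that line-break
        # differences between input formats are not reported.
        if " ".join(old_value.split()) != " ".join(new_value.split()):
            diffs.append((field, old_value, new_value))
    return diffs


def regression_sweep(
    old_results: Dict[str, Metadata], new_results: Dict[str, Metadata]
) -> Dict[str, List[FieldDiff]]:
    """Map document id to its field differences, omitting identical documents."""
    report = {}
    for doc_id in sorted(set(old_results) | set(new_results)):
        diffs = diff_metadata(
            old_results.get(doc_id, {}), new_results.get(doc_id, {})
        )
        if diffs:
            report[doc_id] = diffs
    return report
```

Documents missing from one run show up with every field reported as a difference, which makes dropped documents visible in the same report.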
Completed evaluation of a new approach to POINT (Page Of INTerest) detection, intended to quickly identify pages that are likely to contain metadata, making them candidates for the template-based extraction. A wide range of both empirical and statistical tests were found to perform at most marginally better than a simple rule: choose the first 5 and the last 5 pages of each document. Since POINT detection is primarily done as a speed optimization (to avoid deeper analysis of all pages in a document), we have elected to stay with the first 5/last 5 rule for now. Some of the new POINT tests may be folded into the template engine in the future, as they may prove useful in identifying the exact metadata.
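The first 5/last 5 rule is simple enough to state in a few lines. The sketch below is illustrative only: the function name pages_of_interest and the use of zero-based page indices are assumptions for the example.

```python
from typing import List


def pages_of_interest(page_count: int, n: int = 5) -> List[int]:
    """Zero-based indices of the first n and last n pages, without duplicates."""
    if page_count <= 0:
        return []
    first = range(0, min(n, page_count))
    last = range(max(page_count - n, 0), page_count)
    # Short documents overlap; the set union keeps each page once.
    return sorted(set(first) | set(last))
```

For a 12-page document this selects pages 0 to 4 and 7 to 11. For documents of 10 pages or fewer, every page is selected.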
Problem Areas and Corrective Actions
Sophisticated POINT detection appears to be ineffective – adopting simpler rule.
Continued suggestions from template authors for new tests to be incorporated into the template engine.
Deviations in Cost/Schedule
Scheduled package release for document classification has been deferred after several approaches explored in prior months proved inadequate. On the other hand, the switch to a novel validation-based approach to classification gives preliminary indications of promise. In lieu of a released package, we have prepared a web-based demo which can be monitored from http://dtic.cs.odu.edu/validation.html
Work to be Accomplished Next Period
1) Complete integration of validator with template extraction to perform post-hoc classification and tune the statistical models for this purpose.
2) Test the validation-classification approach.
3) Complete the statistical models for validation of the NASA collection.
4) Complete regression test of Omnipage 14-based and IDM-based versions of the extractor and analyze any differences found.