Monthly Report for February 2007

 

 

Project No.: 260671/72

Funding Agency: Defense Logistics Agency - Defense Technical Information Center/NASA

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 3/31/06

 

Work Accomplishments during period

 

 

 

Classification:

 

We revisited the classification problem to verify that a larger test set does not produce false positives. We ran all known non-form documents (about 500) through the integrated process with the four known classes and obtained only a few false positives for the three major classes. The fourth class, a catch-all template keyed on the title, did collect a number of documents, as expected.

 

Metadata Extraction RDP:

 

The production staff performed a full-scale test on 400 previously unprocessed documents and reported a number of new requirements they would like us to address, chiefly standard formats for certain output fields such as date, title, and contract number. They reported success in over 90% of cases but were dissatisfied with the time savings because of these nonstandard outputs (we output what appears in the document, which is not necessarily what goes into the database). They also reported that the process sometimes appears to hang; see the problem cases below.
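The request for standard output formats can be illustrated with a minimal sketch of date normalization. This is not the project's code: the function name, the list of accepted input layouts, and the choice of YYYY-MM-DD as the target format are all assumptions made for illustration, since the report does not state which format the database expects.

```python
from datetime import datetime

# Hypothetical input layouts; the real list would be derived from the
# date strings actually found in the processed documents.
_INPUT_FORMATS = (
    "%B %d, %Y",   # February 5, 2007
    "%b %d, %Y",   # Feb 5, 2007
    "%d %B %Y",    # 5 February 2007
    "%d %b %Y",    # 5 Feb 2007
    "%m/%d/%Y",    # 02/05/2007
    "%Y-%m-%d",    # 2007-02-05
)


def normalize_date(raw):
    """Return the date as YYYY-MM-DD (assumed target format), or None.

    Returning None rather than guessing lets an operator review the
    original string when no known layout matches.
    """
    text = " ".join(raw.split())  # collapse stray whitespace from extraction
    for fmt in _INPUT_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return parsed.strftime("%Y-%m-%d")
    return None
```

Similar small normalizers could be written for the title and contract number fields once their required formats are agreed with the production staff.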

 

Metadata Extraction general:

 

We have completed collecting statistics for the NASA collections. We have begun adjusting parameters so that we can reproduce the manual classification of the major NASA test collection classes.

 

NASA Collection Characteristics

We have now completed the software configuration so that we can use different statistics and different collections as input, and we are preparing the final software package.

 

 

Problem Areas and Corrective Actions

 

We have encountered a serious problem in the automated step that extracts the first and last five pages of each file and feeds them to Omnipage. Preliminary analysis points to password-protected files as the cause. We are debugging and retesting. Because of the extensive nature of the classification and automation processes, and because of these unexpected problems, we have decided to concentrate on the robustness of the final software package rather than on additional enhancements and templates for DTIC.
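The page-selection step and a pre-check for protected files might be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes the input files are PDFs, both function names are hypothetical, and the byte scan for an /Encrypt entry is only a crude heuristic that can misfire in either direction.

```python
def first_and_last_pages(page_count, k=5):
    """Return sorted 0-based indices of the first and last k pages.

    Short documents (fewer than 2*k pages) yield each page once,
    so no page is sent to OCR twice.
    """
    if page_count <= 0:
        return []
    head = range(0, min(k, page_count))
    tail = range(max(page_count - k, 0), page_count)
    return sorted(set(head) | set(tail))


def looks_encrypted(pdf_bytes):
    """Crude check (assumes PDF input): encrypted PDFs reference an
    /Encrypt dictionary, so its absence suggests an unprotected file."""
    return b"/Encrypt" in pdf_bytes
```

Files flagged by such a check could be set aside for manual handling instead of being passed to the OCR step, which would keep one protected file from stalling a batch.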

We also have recommended that DTIC set up a separate testbed that can be used to test the final software in a DTIC environment before it goes into production.

 

Deviations in Cost/Schedule

 

Due to the problems encountered, we propose to deliver the software, including all reports, at the official end date of the grant (4/30). However, we propose to forgo additional templates and engine enhancements beyond the current number (on the order of dozens rather than a hundred). We also want to move the tutorial into the next phase of the contract.

 

Papers and Reports

 

We have prepared a presentation to be given at the upcoming user conference and have also provided input for Gopi's presentation to senior management.

 

Work to be Accomplished Next Period

1) Continue testing of the integrated system

2) Tune the post-hoc validation approach to correctly recognize four DTIC categories and two NASA categories

3) Write reports