Monthly Report for February 2007
Project No.: 260671/72
Funding Agency: Defense Logistics Agency - Defense Technical Information Center/NASA
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Work Accomplishments during period
We revisited the classification problem to verify that a larger test set does not produce false positives. We ran all known non-form documents (about 500) through the integrated process with the four known classes and obtained only a few false positives for the three major classes. The fourth class, a catchall template for the title, did collect a number of documents, as expected.
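As an illustration only, the false-positive check described above could be tallied along the following lines. This is a minimal sketch, not the project's actual test harness: the function name, the class labels, and the (true class, predicted class) input layout are all assumptions made for the example.

```python
from collections import Counter


def false_positives_per_class(results, classes):
    """Count false positives for each template class.

    `results` is an iterable of (true_class, predicted_class) pairs, one per
    document; either element may be None (no class applies / no class matched).
    A false positive for class C is a document assigned to C that does not
    actually belong to C. This input layout is an assumption for illustration.
    """
    counts = Counter()
    for true_cls, predicted_cls in results:
        if predicted_cls is not None and predicted_cls != true_cls:
            counts[predicted_cls] += 1
    # Report every known class, including those with zero false positives.
    return {c: counts[c] for c in classes}


# Hypothetical class labels and outcomes, not data from the actual test run.
CLASSES = ["class_a", "class_b", "class_c", "catchall_title"]
sample = [
    ("class_a", "class_a"),        # correct assignment
    (None, "class_a"),             # false positive for class_a
    (None, None),                  # correctly left unclassified
    ("class_b", "catchall_title"), # false positive for the catchall
    (None, "catchall_title"),      # false positive for the catchall
    ("class_c", "class_c"),        # correct assignment
]
print(false_positives_per_class(sample, CLASSES))
```

Keeping the catchall class in the same tally makes it easy to see how many documents it absorbs compared with the three major classes.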
Metadata Extraction RDP:
The production staff performed a full-scale test on 400 previously unprocessed documents and reported a number of new requirements they would like us to address (standard formats for certain output fields such as date, title, and contract number). They reported success in over 90% of cases but were dissatisfied with the time savings because of these nonstandard outputs (we output what is in the document, which is not necessarily what goes into the database). They also reported that the process appeared to hang a number of times; see the problem cases below.
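The request for standard output formats could be met with a small post-processing step applied to the extracted fields. The sketch below is only one possible approach: the list of accepted input date layouts, the ISO-style target format, and the contract number rule are assumptions, since the formats the production database actually requires are not stated here.

```python
import re
from datetime import datetime

# Hypothetical set of date layouts seen in documents; the real list would be
# built from the cases reported by the production staff.
_INPUT_FORMATS = ("%B %d, %Y", "%d %B %Y", "%b %d, %Y", "%m/%d/%Y", "%Y-%m-%d")


def normalize_date(raw, output_format="%Y-%m-%d"):
    """Return the date in one standard format, or None if it cannot be parsed."""
    text = " ".join(raw.split())  # collapse runs of whitespace
    for fmt in _INPUT_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime(output_format)
        except ValueError:
            continue
    # Leave unparseable values for manual review rather than guessing.
    return None


def normalize_contract_number(raw):
    """Strip whitespace and upper-case the value (an assumed convention)."""
    return re.sub(r"\s+", "", raw).upper()


print(normalize_date("February 5, 2007"))
print(normalize_contract_number(" sp4700-05-p-0148 "))
```

Returning None for values that do not match a known layout keeps the original extracted text available to the operator instead of silently writing a wrong value to the database.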
Metadata Extraction general:
We have completed collecting statistics for the NASA collections. We have begun adjusting parameters so that we can reproduce the manual classification of the major NASA test collection classes.
NASA Collection Characteristics
We have now completed the software configuration so that different statistics and different collections can be used as input, and we are preparing the final software package.
Problem Areas and Corrective Actions
We have encountered a serious problem in the automated step that extracts the first and last 5 pages of each file and feeds them to Omnipage. Preliminary investigation points to password-protected files as the cause. We are currently debugging and retesting. Because of the extensive nature of the classification and automation processes, and because of these unexpected problems, we have decided to concentrate on the robustness of the final software package rather than on further enhancements and templates for DTIC.
We have also recommended that DTIC set up a separate testbed where the final software can be tested in a DTIC environment before it goes into production.
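One way to make the page-selection step more robust is to screen out password-protected files before they reach Omnipage. The sketch below is illustrative and is not the code under debugging: the function names are hypothetical, and the protection check is a byte-level heuristic based on the /Encrypt entry that encrypted PDF files carry in their trailer, so it can misfire on unusual files.

```python
def pages_to_process(page_count, edge=5):
    """Return 0-based indices of the first and last `edge` pages.

    Short documents (fewer than 2 * edge pages) yield each page only once.
    """
    if page_count <= 0:
        return []
    first = range(0, min(edge, page_count))
    last = range(max(page_count - edge, 0), page_count)
    return sorted(set(first) | set(last))


def looks_password_protected(pdf_bytes):
    """Heuristic check: encrypted PDFs reference an /Encrypt dictionary.

    This is an assumption-laden shortcut, not a full PDF parse. A file that
    merely restricts printing or copying will also be flagged.
    """
    return b"/Encrypt" in pdf_bytes


def should_send_to_ocr(pdf_bytes):
    """Skip flagged files so they can be logged instead of stalling the run."""
    return not looks_password_protected(pdf_bytes)


print(pages_to_process(12))
print(pages_to_process(7))
```

Flagged files can be written to a problem list for manual handling, which keeps one unreadable document from blocking the rest of a batch.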
Deviations in Cost/Schedule
Due to the problems encountered, we propose to deliver the software, including all reports, on the official end date of the grant, 4/30. However, we propose to forgo further templates and engine enhancements beyond the current number (on the order of dozens rather than a hundred). We also want to move the tutorial into the next phase of the contract.
Papers and Reports
We have prepared a presentation to be given at the upcoming user conference and have also provided input for a presentation by Gopi to senior management.
Work to be Accomplished Next Period
1) Continue testing of the integrated system
2) Tune the post-hoc validation approach to correctly recognize 4 DTIC categories and 2 NASA categories
3) Write reports