Monthly Report for November 2006
Project No.: 260671/72
Funding Agency: Defense Logistics Agency - Defense Technical Information Center/NASA
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Work Accomplishments during period
Completed the integration of the validator with template extraction to perform post-hoc classification and to tune the statistical models for this purpose. Not unexpectedly, we ran into a number of issues that we needed to address. A major one is how to combine the confidence values of individual fields into a single value. Our first choices were min, max, and avg, depending on the types of fields used. We had to introduce sum as well, to distinguish documents that have many metadata fields, some of them wrong, from documents that have few fields, all of them correct. Another issue is erroneous metadata introduced into the various knowledge databases. For example, words such as "defence contractors" appeared in an author field, producing a wrong confidence value for a phrase.
Preliminary results of using validation for post-hoc classification are very encouraging. From a batch of 170 documents, we correctly classified the 86 documents that had originally been hand-classified as belonging to one class.
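As an illustration of the issue of combining field confidences, the following is a minimal sketch and not the project code. The function name, the field names, and the confidence values are all hypothetical; the report only names the four rules (min, max, avg, sum) and does not specify how each is applied.

```python
from statistics import mean

# Hypothetical rule table: the report names min, max, avg, and sum,
# but does not say which rule is used for which field types.
AGGREGATORS = {
    "min": min,
    "max": max,
    "avg": mean,
    "sum": sum,
}


def combine_confidence(field_confidences, rule="avg"):
    """Combine per-field confidence values into one document-level score."""
    values = list(field_confidences.values())
    if not values:
        return 0.0  # assumption: a document with no fields scores zero
    return float(AGGREGATORS[rule](values))


# Illustrative values only, not measured results.
many_fields_some_wrong = {
    "title": 0.75,
    "author": 0.25,
    "date": 0.75,
    "report_number": 0.25,
}
few_fields_all_correct = {"title": 0.75, "author": 0.75}

for rule in ("min", "max", "avg", "sum"):
    print(
        rule,
        combine_confidence(many_fields_some_wrong, rule),
        combine_confidence(few_fields_all_correct, rule),
    )
```

With these made-up values, max gives the same score to both documents, while sum and avg separate them in opposite directions, which is the kind of difference that motivated having more than one rule available.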
Metadata Extraction RDP:
No action during this period, except integrating the RDP process into a complete code package that receives documents as input and produces the appropriate output, depending on whether the document has an RDP or the non-form process is to be employed.
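The routing performed by such a package can be sketched as follows. This is only an assumed shape: the function name and the three callables (the RDP detector and the two extractors) are hypothetical stand-ins, since the report does not describe the actual interfaces.

```python
def route_document(document, has_rdp, extract_rdp, extract_non_form):
    """Send a document to the RDP process or to the non-form process.

    All three callables are hypothetical placeholders for the real
    detector and extractors, which are not described in this report.
    """
    if has_rdp(document):
        return "rdp", extract_rdp(document)
    return "non-form", extract_non_form(document)
```

Passing the detector and extractors in as arguments keeps the routing step independent of how either process is implemented.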
Metadata Extraction general:
We have completed the repackaging of the statistics gathering code so that it is easily configured for different collections. We have begun testing the code by collecting different statistics for the NASA collections.
We have completed the regression testing and now have a process in place to repeat it after major changes.
Problem Areas and Corrective Actions
We have done extensive testing on the coverage of the 5-5 page rule (replace each document by its first five and last five pages) and have not found a single document that has POINT pages outside that range. We will still make the number of pages a parameter of a configurable system.
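A minimal sketch of the page rule with the number of pages as a parameter is given below. The function and parameter names are hypothetical, and the handling of short documents (returning all pages when the two ranges would overlap) is an assumption rather than a documented behavior of the system.

```python
def select_pages(pages, head=5, tail=5):
    """Return the first `head` and last `tail` pages of a document.

    Assumption: when the document is short enough that the two ranges
    would meet or overlap, all pages are returned once, in order.
    """
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    n = len(pages)
    if head + tail >= n:
        return list(pages)
    return list(pages[:head]) + list(pages[n - tail:])


# A 20-page document represented by its page numbers.
print(select_pages(list(range(1, 21))))
# A different setting, as a configurable system might use.
print(select_pages(list(range(1, 21)), head=3, tail=2))
```

The defaults correspond to the 5-5 rule; other collections could pass different values without any change to the code.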
The engine enhancement has taken a back seat to the development of the classification approach and the integration of all components.
Deviations in Cost/Schedule
The software package (1C) promised for Sept 19 has now (Nov 15) been made available as a demo prototype (it still needs fuller testing). That applies to the corresponding NASA deliverable (4C) as well. The configurability of the software package for classification (deliverable 4D) has been addressed as we develop the software, and it will be made available together with the delayed classification package.
Work to be Accomplished Next Period
1) Continue testing of the integrated system
2) Tune the post-hoc validation approach to recognize correctly 4 DTIC categories and 2 NASA categories
3) Decide on the prioritized feature set we want to implement for the final delivery
4) Assess the template set we need to deliver for the final product; discuss the priorities of the final product with the technical monitor