Monthly Report for April 2006
Project No.: 260671
Agency: Defense Logistics Agency
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Project Period: 09/19/05 - 09/18/06
Work Accomplishments during period
1. Continued review of alternative classification algorithms – training new student in this area in view of Tang’s imminent PhD defense
2. Preliminary classification of the 156 form-free NASA documents from among the first 200 downloaded was not very successful. Three large classes were identified, accounting for nearly half the documents. The remainder did not appear to group together into obvious classes of their own. May need to pursue a larger sample to perceive an overall pattern.
3. Began exploring techniques for page structure reconstruction employed in commercial PDF-to-Whatever tools to see if these could improve our own page modeling.
1. Fixed engine bug that prevented detection of abstracts when keyword “Abstract” appeared inline with subsequent text rather than offset on a separate line.
2. Added processing to form-based engine to compensate for common OCR error in which adjacent cells of a form are incorrectly merged.
3. Extraction of one of the three NASA form-free document classes suggested possible limitations in current engine features. Needs to be explored further to propose modification/extension of engine.
4. Extraction of the 44 form-based NASA documents largely successful with current code – one new form template encountered.
5. PDF-to-metadata-file process for form-based documents automated. PDFs are dropped into a directory, and the corresponding metadata file eventually appears in another directory. Documentation and installation instructions for automated process developed and tested (by enlisting novices to the project to attempt to install and run the process).
6. Added post-processing phase to enforce standard wording of selected fields, e.g., security and distribution restrictions. Phrases extracted from the document are compared to an authority file of known variants. Close matches (small edit distance) are then mapped to a standard phrasing in the final metadata generated. Experiments on the security/distribution fields of 10,0000 “released for public access” documents resolved all but 192 documents to the standard phrase, though some 672 distinct variants of this supposedly standardized phrase were encountered. Completed implementation and testing of new feature to identify “largest (fontsize) strings” as a convenient rule for title extraction.
7. Post-processor integrated with automated process and documentation updated.
8. Demo 12 on website (http://dtic.cs.odu.edu/protected/demo.html) updated to incorporate bug fixes and provide additional information. Need to incorporate the new post-processor.
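The authority-file matching described in item 6 can be illustrated with a minimal sketch. This is not the project's post-processor: the authority entries, the standard phrase, the `max_distance` threshold, and the function names are all hypothetical and chosen only for illustration. The sketch assumes a plain Levenshtein edit distance and a simple lowercase/whitespace cleanup before comparison.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical standard phrase and authority file (known variant -> standard).
STANDARD = "Approved for public release; distribution is unlimited."
AUTHORITY = {
    "approved for public release; distribution is unlimited": STANDARD,
    "approved for public release, distribution unlimited": STANDARD,
}


def normalize(phrase, authority=AUTHORITY, max_distance=3):
    """Return the standard phrasing for a close match, or None if unresolved.

    max_distance is an assumed threshold, not a value taken from the report.
    """
    # Assumed cleanup: lowercase, collapse whitespace, drop trailing period.
    cleaned = " ".join(phrase.lower().split()).rstrip(".")
    best_variant, best_dist = None, max_distance + 1
    for variant in authority:
        d = edit_distance(cleaned, variant)
        if d < best_dist:
            best_variant, best_dist = variant, d
    return authority[best_variant] if best_variant is not None else None
```

In this sketch an OCR-damaged phrase such as "Approved for pubic release; distribution is unlimited." falls within the threshold and maps to the standard phrase, while an unrelated phrase returns None and would be left for manual review.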
1. Presentations at DTIC Users Conference: Zubair on OAI, Maly on Metadata Extraction, Zeil on Categorization
Problem Areas and Corrective Actions
Deviations in Cost/Schedule
Work to be Accomplished Next Period
Continue exploring alternate classification schemes.
Resolve engine/template issues revealed by new NASA documents.
Integrate earlier POINT studies with classifier.
Begin exploring dynamic validation issues (generalization of the authority-based post-processor).