Monthly Report for September 2005
Project No.: 260671
Defense Logistics Agency
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Project Period: 09/19/05 - 09/18/06
Work Accomplishments during period
1. Document Classification
Exploration of alternatives
We are considering classification schemes based on
a) simple rules such as presence of white space, bold face, centering and the like
b) geometric layout of blocks of information creating trees of information
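As an illustration of alternative (a), the following is a minimal sketch of a rule-based classifier, not the scheme under design. The `Line` record, the feature names, the thresholds, and the class labels `title_page` and `other` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one OCR text line (coordinates in page units)."""
    text: str
    bold: bool
    x_left: float
    x_right: float


def is_centered(line, page_width, tolerance=0.05, max_width=0.8):
    """A line counts as centered if its margins are nearly equal and it is
    clearly narrower than the page (full-width lines are excluded)."""
    left_margin = line.x_left
    right_margin = page_width - line.x_right
    narrow = (line.x_right - line.x_left) <= max_width * page_width
    return narrow and abs(left_margin - right_margin) <= tolerance * page_width


def layout_features(lines, page_width):
    """Compute simple ratios: white space (blank lines), bold face, centering."""
    non_blank = [ln for ln in lines if ln.text.strip()]
    if not non_blank:
        return {"blank_ratio": 1.0, "bold_ratio": 0.0, "centered_ratio": 0.0}
    return {
        "blank_ratio": (len(lines) - len(non_blank)) / len(lines),
        "bold_ratio": sum(ln.bold for ln in non_blank) / len(non_blank),
        "centered_ratio": sum(is_centered(ln, page_width) for ln in non_blank)
        / len(non_blank),
    }


def classify(features):
    """Assumed rule: mostly centered text with much white space is a title page."""
    if features["centered_ratio"] >= 0.5 and features["blank_ratio"] >= 0.3:
        return "title_page"
    return "other"
```

In practice the thresholds would have to be tuned against sample documents, and each class label would map to an extraction template.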
2. RDP Metadata Extraction
Engine enhancement design
Refine metadata extraction code for handling SF-298 forms. The earlier code had several problems, which are described below.
OCR problems: a line is split into two parts, no OCR output for some text, letter recognition errors, and word separation errors.
Other problems: text lines from the same cell may not be adjacent after sorting by xy-coordinates, the complexity of similarity matching, and how to extract metadata when the adjacent field name has a problem.
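One way to tolerate letter recognition and word separation errors when locating field names is approximate string matching. The sketch below uses Python's standard `difflib` module. The label list is a small illustrative subset, and the `squash` and `match_label` helpers and the 0.8 threshold are assumptions for the example, not the matching method used in our code.

```python
import difflib
import re

# Illustrative subset of SF-298 field labels; a real list would cover all form types.
KNOWN_LABELS = [
    "REPORT DATE",
    "TITLE AND SUBTITLE",
    "AUTHOR(S)",
    "ABSTRACT",
    "SUBJECT TERMS",
]


def squash(text):
    """Uppercase and keep only letters and digits, so that missing or extra
    spaces (word separation errors) do not affect the comparison."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def match_label(ocr_text, threshold=0.8):
    """Return the known label most similar to the OCR text, or None if no
    label reaches the similarity threshold."""
    target = squash(ocr_text)
    best_label, best_score = None, 0.0
    for label in KNOWN_LABELS:
        score = difflib.SequenceMatcher(None, target, squash(label)).ratio()
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None
```

For example, `match_label("TITLE ANDSUBTITLE")` and `match_label("SUBJECT TERNS")` both resolve to the intended labels, while text that resembles no label returns `None`.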
Develop precision and recall metrics for evaluating metadata extraction for SF-298 forms. Since there are five types of SF-298 forms that use different names for the same metadata elements, we need to normalize the element names. We also need to take into account metadata elements that are missing from a given SF-298 form.
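A minimal sketch of how such metrics could be computed per form is shown below. The alias table is hypothetical (it is not our actual normalization table), exact string equality is assumed as the correctness test, and elements left empty on the form are dropped from the expected set so that they do not count against recall.

```python
# Hypothetical alias table mapping form-specific names to one canonical name.
ALIASES = {
    "title and subtitle": "title",
    "title": "title",
    "author(s)": "author",
    "author": "author",
    "abstract": "abstract",
    "subject terms": "subject_terms",
}


def normalize(fields):
    """Map element names to canonical names and drop empty (missing) values."""
    result = {}
    for name, value in fields.items():
        if value and value.strip():
            key = name.strip().lower()
            result[ALIASES.get(key, key)] = value.strip()
    return result


def precision_recall(extracted, expected):
    """Precision = correct / extracted elements; recall = correct / elements
    actually present on the form."""
    ext = normalize(extracted)
    exp = normalize(expected)
    correct = sum(1 for key, value in ext.items() if exp.get(key) == value)
    precision = correct / len(ext) if ext else 0.0
    recall = correct / len(exp) if exp else 0.0
    return precision, recall
```

Here `expected` holds the manually verified values for one form and `extracted` holds the engine's output for the same form. A more lenient comparison than exact equality may be needed if OCR noise in the values is to be tolerated.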
DTIC process study
Not yet started, as we want to meet with DTIC at DTIC first.
Develop Documentation Standards
We have developed a standard software development process and documented it in the deliverable section on our website.
3.B has been put on our website
Problem Areas and Corrective Actions
Deviations in Cost/Schedule
Work to be Accomplished Next Period
Evaluate templates for extracting metadata from SF-298 forms using randomly selected documents
Integrate the improved code for handling SF-298 forms into the package
Design the classification scheme to match a document with an appropriate template
Start on the DTIC process study