Monthly Report for September 2005

 

 

Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06

 

Work Accomplishments during period

 

1. Document Classification

            Exploration of alternatives

We are considering classification schemes based on:

a) simple rules such as presence of white space, bold face, centering and the like

b) geometric layout of blocks of information creating trees of information
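
As an illustration of option (a) only, the sketch below computes a few simple layout features (blank-line ratio, bold lines, centered lines) from text lines with known horizontal extents. The Line record, the thresholds, and the feature names are assumptions made for this example and are not taken from the project code.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical OCR text line with horizontal extent in page units."""
    text: str
    x0: float          # left edge of the line
    x1: float          # right edge of the line
    bold: bool = False


def is_centered(line, page_width, tolerance=0.05, min_margin=0.2):
    """Return True when both margins are wide and nearly equal.

    min_margin keeps full-width body lines (which also have symmetric
    margins) from being counted as centered.
    """
    left_margin = line.x0
    right_margin = page_width - line.x1
    wide_enough = left_margin >= min_margin * page_width
    balanced = abs(left_margin - right_margin) <= tolerance * page_width
    return wide_enough and balanced


def layout_features(lines, page_width):
    """Summarize simple layout cues for one page."""
    non_blank = [ln for ln in lines if ln.text.strip()]
    total = len(lines) or 1  # avoid division by zero on an empty page
    return {
        "blank_ratio": (len(lines) - len(non_blank)) / total,
        "bold_lines": sum(ln.bold for ln in non_blank),
        "centered_lines": sum(is_centered(ln, page_width) for ln in non_blank),
    }


page = [
    Line("REPORT TITLE", 200, 412, bold=True),
    Line("", 0, 0),
    Line("Body text spanning the full column width", 72, 540),
    Line("Short left-aligned line", 72, 200),
]
features = layout_features(page, page_width=612)
```

A classifier built on such features could then apply thresholds or a small decision rule per document class; which rules work best is exactly what the exploration above is meant to determine.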

 

2. RDP Metadata Extraction

            Engine enhancement design

Refine the metadata extraction code for handling SF-298 forms. The earlier code had several problems, described below.

 

OCR problems: a line split into two parts, missing OCR output for some text, letter recognition errors, and word separation errors.

 

Other problems: text lines from the same cell may not be adjacent after sorting by xy-coordinates, similarity matching is complex, and it is unclear how to extract metadata when an adjacent field name is itself affected by one of these problems.
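
The adjacency problem can be illustrated with a small sketch. It assumes that the bounding box of each form cell is already known and that y grows downward. Both are assumptions for this example, and the function and cell names are hypothetical. Grouping lines by cell before sorting is one possible remedy, not necessarily the one adopted in the refined code.

```python
def group_lines_by_cell(lines, cells):
    """Assign each line to the first cell containing its anchor point.

    lines: list of (x, y, text) tuples
    cells: dict mapping cell name -> (x0, y0, x1, y1) bounding box
    Returns a dict mapping cell name -> texts in top-to-bottom order.
    Lines that fall outside every cell are ignored.
    """
    grouped = {name: [] for name in cells}
    for x, y, text in lines:
        for name, (x0, y0, x1, y1) in cells.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                grouped[name].append((y, x, text))
                break
    return {
        name: [text for _, _, text in sorted(items)]
        for name, items in grouped.items()
    }


# Two side-by-side cells, each holding two text lines.
lines = [(10, 10, "A1"), (110, 10, "B1"), (10, 20, "A2"), (110, 20, "B2")]
cells = {"left": (0, 0, 99, 50), "right": (100, 0, 200, 50)}

# A global sort by (y, x) interleaves the two cells: A1, B1, A2, B2.
globally_sorted = [t for _, _, t in sorted((y, x, t) for x, y, t in lines)]

# Grouping by cell first keeps each cell's lines together.
by_cell = group_lines_by_cell(lines, cells)
```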

 

Develop precision and recall metrics for evaluating metadata extraction from SF-298 forms. Since the five types of SF-298 forms use different names for the same metadata elements, we need to normalize the names. We also need to take into account metadata elements that are missing from a given SF-298 form.
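
A minimal sketch of such an evaluation follows. The alias table, the canonical element names, and the exact-match comparison are assumptions made for illustration and do not reflect the actual label variants or the final metric definitions. In this sketch, an element that is empty or absent in the reference record is left out of the recall denominator, while a value extracted for such an element still counts against precision.

```python
# Hypothetical alias table: variant field labels -> canonical element name.
ALIASES = {
    "report date": "report_date",
    "date of report": "report_date",
    "title and subtitle": "title",
    "title": "title",
    "author(s)": "author",
    "authors": "author",
    "abstract": "abstract",
}


def normalize(record):
    """Map field labels to canonical names and drop empty or unknown fields."""
    out = {}
    for name, value in record.items():
        key = ALIASES.get(name.strip().lower())
        if key and value and value.strip():
            # Collapse whitespace and ignore case when comparing values.
            out[key] = " ".join(value.split()).lower()
    return out


def precision_recall(extracted, truth):
    """Return (precision, recall) over normalized metadata elements."""
    ext, ref = normalize(extracted), normalize(truth)
    correct = sum(1 for key, value in ext.items() if ref.get(key) == value)
    precision = correct / len(ext) if ext else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


extracted = {"Report Date": "Sep 2005", "AUTHOR(S)": "J. Smith", "Title": "Wrong"}
truth = {
    "Date of Report": "sep  2005",
    "Authors": "J. Smith",
    "Title and Subtitle": "Right title",
    "Abstract": "",  # element missing from this form
}
p, r = precision_recall(extracted, truth)
```

Here two of the three extracted elements match the reference, and the empty abstract is excluded, so both precision and recall are 2/3. Exact matching is the simplest choice; a similarity threshold could be substituted to tolerate OCR errors.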

 

3. Transition

            DTIC process study

Not yet started, as we want to meet with DTIC at DTIC first.

            Develop Documentation Standards

We have developed a standard software development process and documented it in the deliverable section on our website.

 

 

Deliverables/Milestones

 

3.B has been posted on our website.

 

Problem Areas and Corrective Actions

 

None

 

Deviations in Cost/Schedule

 

None

 

Work to be Accomplished Next Period

 

Evaluate templates for extracting metadata from SF-298 forms using randomly selected documents

 

Integrate the improved SF-298 handling code into the package

Design the classification scheme to match a document with an appropriate template

Start on the DTIC process study