Monthly Report for November 2005

 

 

Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06

 

Work Accomplishments during period

 

1. Document Classification

Completed a scalability study (see ELPUB 2006 papers below) of classification based on the geometric layout of blocks of information, which are organized into trees. Preliminary results indicate a manageable number of classes (on the order of hundreds for ten thousand documents).

 

In preparation for evaluating the classification algorithms, manually classified a set of 500 documents.

 

Continuing to develop algorithms for detecting POINT (pages of interest to metadata extraction) pages.

 

 

2. RDP Metadata Extraction

Created a demo site for showing DTIC personnel, in an easy way, how form processing works. The demo site lets the user select a form type and process a document in real time to extract its metadata.

 

Found serious problems with a significant number of files (270 of 934 samples) and began to analyze the causes.

 

Analyzed the strengths and weaknesses of the CiteSeer software to see what, if anything, we can learn from it.

 

Many of the missed files are due to OCR problems. We are fixing the process so that OCR problems, e.g., bad XML character output, are caught and repaired automatically.
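As an illustration of the kind of automatic check described above, here is a minimal sketch, not the project's actual code, of detecting and removing characters that the XML 1.0 specification does not allow. The function names are hypothetical, and replacing bad characters with an empty string is an assumed repair policy.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group())))
            for m in _INVALID_XML_CHARS.finditer(text)]


def repair_xml_text(text, replacement=""):
    """Replace forbidden characters (hypothetical policy: drop them)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    ocr_output = "Report Title\x0c Page 1\x00"  # form feed and NUL from OCR
    print(find_invalid_xml_chars(ocr_output))
    print(repr(repair_xml_text(ocr_output)))
```

A check like this could run on OCR output before the XML is parsed, so that a file with bad characters is flagged or repaired instead of being counted as a missed file.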

 

We are finding new forms for which we need to write templates.

 

Have begun to automate the handling of form-based and non-form documents in the same step.

 

Extending the engine to handle the features needed to recognize the following error: some documents have a title, subtitle, and authors, while others have just a title and author, so authors are being falsely identified as a subtitle.

 

3. Transition

DTIC process study

Decided to postpone the process study for the general case until the RDP product is delivered. The process for RDP handling will be described in the README file for the RDP product, due March 2006.

 

Deliverables/Milestones

 

.1 Metadata Extraction - RDP

A. Engine enhancement design - Engine design, Template specification - has been delivered

 

Problem Areas and Corrective Actions

 

None

 

Deviations in Cost/Schedule

 

None

 

Work to be Accomplished Next Period

 

Continue precision/recall study for templates extracting metadata from SF-298 using randomly selected documents
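For reference, precision and recall for a single document's extracted metadata could be computed along the following lines. This is a minimal sketch under stated assumptions: the field names are hypothetical, and a field counts as correct only on an exact string match with the manually prepared value, which may be stricter than the criterion used in the actual study.

```python
def precision_recall(extracted, expected):
    """Compare extracted metadata against manually prepared ground truth.

    Both arguments map field names to string values. A field is counted
    as correct only if it is present in both and the values are equal
    (exact match is an assumption of this sketch).
    """
    correct = sum(1 for field, value in extracted.items()
                  if expected.get(field) == value)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical field names and values, for illustration only.
    extracted = {"title": "A", "author": "B", "date": "2005"}
    expected = {"title": "A", "author": "C", "date": "2005",
                "report_number": "X"}
    print(precision_recall(extracted, expected))
```

Here precision is the fraction of extracted fields that are correct, and recall is the fraction of ground-truth fields that were correctly extracted. Averaging these over the randomly selected documents gives per-template figures.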

 

Continue design for the classification scheme to match a document with an appropriate template

 

Continue with form-based process