Monthly Report for November 2005
Project No.: 260671
Agency: Defense Logistics Agency
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Project Period: 09/19/05 - 09/18/06
Work Accomplishments during period
1. Document Classification
Completed a scalability study (see ELPUB 2006 papers below) of classification based on the geometric layout of blocks of information, which are organized into trees. Preliminary results indicate a manageable number of classes (on the order of hundreds for ten thousand documents).
In preparation for evaluating the classification algorithms, manually classified a set of 500 documents.
Continued developing algorithms for detecting POINT (pages of interest to metadata extraction) pages.
2. RDP Metadata Extraction
Created a demo site for showing DTIC personnel how form processing works. The demo site lets the user select a form type and process a document in real time to extract its metadata.
Found serious problems with a significant number of files (270 of 934 samples) and began to analyze the causes.
Analyzed the strengths and weaknesses of the CiteSeer software to see what, if anything, we can learn from it.
Many of the missed files are due to OCR problems. We are fixing the process so that OCR problems, e.g., invalid XML character output, are detected and repaired automatically.
Continuing to find new forms for which templates need to be written.
Began automating the handling of form-based and non-form documents in the same step.
Extending the engine with the features needed to handle the following error: some documents have a title, subtitle, and authors, while others have only a title and authors, so authors are sometimes falsely identified as a subtitle.
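The automatic detection and repair of bad XML character output from OCR, mentioned above, could look roughly like the following. This is a minimal sketch and not the project's actual process: the function names (find_invalid_xml_chars, repair_xml_text, parses_as_xml) are hypothetical, and it assumes the problem characters are those outside the XML 1.0 character range, such as form feeds and other control characters that OCR tools often emit.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character NOT permitted by the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF are allowed.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, character) pairs for characters not allowed in XML 1.0."""
    return [(m.start(), m.group()) for m in _INVALID_XML_CHARS.finditer(text)]


def repair_xml_text(text, replacement=""):
    """Replace every invalid character (assumption: dropping it is acceptable)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


def parses_as_xml(text):
    """Report whether the standard-library parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical OCR output containing a form feed (0x0C) inside an element.
ocr_output = "<title>Annual\x0c Report</title>"
if not parses_as_xml(ocr_output):
    print("invalid characters:", find_invalid_xml_chars(ocr_output))
    ocr_output = repair_xml_text(ocr_output)
print("well-formed after repair:", parses_as_xml(ocr_output))
```

Recording the offsets before repairing keeps a trail of which files needed fixing, which may help when analyzing the causes of failed samples.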
3. DTIC Process Study
Decided to postpone the process study for the general case until the RDP product is delivered. The process for RDP handling will be described in the README file for the RDP product, due March 2006.
.1 Metadata Extraction - RDP
A. Engine enhancement design (engine design, template specification) has been delivered.
Problem Areas and Corrective Actions
Deviations in Cost/Schedule
Work to be Accomplished Next Period
Continue precision/recall study for templates extracting metadata from SF-298 using randomly selected documents
Continue design for the classification scheme to match a document with an appropriate template
Continue with form-based process
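For the precision/recall study listed above, per-field scores could be computed along the following lines. This is only a sketch under stated assumptions: the function name precision_recall, the dictionary layout, and the sample values are hypothetical, and it assumes an extracted value counts as correct only when it exactly matches the manually verified value, which the study may define differently.

```python
def precision_recall(extracted, expected):
    """Score one metadata field over a set of documents.

    Both arguments map a document id to the field value, or to None when
    no value was extracted (or none exists in the document).
    Assumption: a value is correct only on an exact string match.
    """
    correct = sum(
        1
        for doc_id, value in extracted.items()
        if value is not None and expected.get(doc_id) == value
    )
    num_extracted = sum(1 for value in extracted.values() if value is not None)
    num_expected = sum(1 for value in expected.values() if value is not None)

    # Precision: share of extracted values that are correct.
    precision = correct / num_extracted if num_extracted else 0.0
    # Recall: share of existing values that were correctly extracted.
    recall = correct / num_expected if num_expected else 0.0
    return precision, recall


# Hypothetical title-field results for three sampled documents.
extracted_titles = {"doc1": "Report A", "doc2": "Report B", "doc3": None}
verified_titles = {"doc1": "Report A", "doc2": "Report X", "doc3": "Report C"}
precision, recall = precision_recall(extracted_titles, verified_titles)
print(f"precision={precision:.2f} recall={recall:.2f}")
```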