Monthly Report for November 2005



Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06


Work Accomplishments during period


1. Document Classification

Completed a scalability study (see ELPUB 2006 papers below) of classification based on the geometric layout of blocks of information, which are organized into trees. Preliminary results indicate a manageable number of classes (on the order of hundreds for ten thousand documents).


In preparation for evaluating the classification algorithms, manually classified a set of 500 documents.


Continued developing algorithms for detecting POINT (pages of interest to metadata extraction) pages.



2. RDP Metadata Extraction

Created a demo site for showing DTIC personnel, in a simple way, how form processing works. The demo site allows the user to select a form type and process a document in real time to extract its metadata.


Found serious problems with a significant number of files (270 of 934 samples); began analyzing the causes.


Analyzed the strengths and weaknesses of the CiteSeer software to see what, if anything, we can learn from it.


Many of the missed files are due to OCR problems. We are fixing the process so that OCR problems, e.g., bad XML character output, are detected and repaired automatically.
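
As an illustration of the kind of automatic check described above, here is a minimal Python sketch, not the project's actual code. It assumes that "bad XML character output" means characters outside the XML 1.0 character range (for example, control characters emitted by OCR). The function names find_bad_xml_chars and repair_xml_text are hypothetical.

```python
import re

# Assumption: "bad XML characters" are those outside the XML 1.0 "Char"
# production: #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_bad_xml_chars(text):
    """Return (offset, code point) pairs for characters illegal in XML 1.0."""
    return [(m.start(), ord(m.group())) for m in _INVALID_XML_CHARS.finditer(text)]


def repair_xml_text(text, replacement=" "):
    """Replace every illegal character with `replacement` (a space by default)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    # Hypothetical OCR output containing a form feed and a NUL character.
    sample = "Title\x0c of report\x00"
    print(find_bad_xml_chars(sample))
    print(repr(repair_xml_text(sample)))
```

Replacing with a space rather than deleting keeps word boundaries intact; reporting the offsets first would allow problem files to be logged before repair.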


Continue to find new forms for which templates need to be written.


Began automating the handling of form-based and non-form documents in the same step.


Extending the engine with the features needed to recognize the following error: some documents have a title, subtitle, and authors, while others have only a title and authors, so the authors are falsely identified as a subtitle.


3. Transition

DTIC process study

Decided to postpone the process study for the general case until the RDP product is delivered. The process for RDP handling will be described in the README file for the RDP product, due March 2006.




.1 Metadata Extraction - RDP

A. Engine enhancement design (engine design, template specification) has been delivered.


Problem Areas and Corrective Actions




Deviations in Cost/Schedule




Work to be Accomplished Next Period


Continue the precision/recall study for templates that extract metadata from SF-298 forms, using randomly selected documents


Continue design for the classification scheme to match a document with an appropriate template


Continue with form-based process
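
For the precision/recall study listed above, the following Python sketch shows one way the two measures could be computed over extracted metadata fields. It is illustrative only and makes several assumptions: each document's metadata is represented as a dictionary of field name to value, a field counts as correct only when it matches the manually verified value after whitespace and case normalization, and the field names in the example are hypothetical.

```python
def _norm(value):
    """Collapse whitespace and ignore case before comparing values."""
    return " ".join(value.split()).casefold()


def precision_recall(pairs):
    """Compute field-level precision and recall.

    `pairs` is a list of (extracted, truth) tuples, one per document, where
    each element is a dict mapping a field name to its string value.
    """
    correct = extracted_total = expected_total = 0
    for extracted, truth in pairs:
        extracted_total += len(extracted)   # fields the template produced
        expected_total += len(truth)        # fields a human found in the document
        correct += sum(
            1
            for field, value in extracted.items()
            if field in truth and _norm(value) == _norm(truth[field])
        )
    precision = correct / extracted_total if extracted_total else 0.0
    recall = correct / expected_total if expected_total else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical single-document sample; field names are assumptions.
    pairs = [
        (
            {"title": "A  Study", "author": "Smith"},
            {"title": "a study", "author": "Jones", "date": "2005"},
        )
    ]
    print(precision_recall(pairs))
```

Exact matching after normalization is a strict criterion; a study that tolerates small OCR differences would need a looser comparison in place of `_norm`.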