Monthly Report for October 2005
Project No.: 260671
Defense Logistics Agency
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Project Period: 09/19/05 - 09/18/06
Work Accomplishments during period
1. Document Classification
Exploration of alternatives
We are considering classification schemes based on:
a) simple rules such as the presence of white space, boldface, centering, and the like
b) the geometric layout of blocks of information, organized into trees of information
· Began exploring literature on automated typesetting to look for typographic conventions that may serve as indicators of layout division (e.g., what constitutes “significant” amounts of vertical/horizontal whitespace in relation to font size)
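As an illustration of the simple-rule approach in (a), the following minimal sketch splits a page into blocks wherever the vertical gap between lines is large relative to the font size. The function names, the (top, bottom, font size) line representation, and the 1.5 ratio are assumptions made for this example only; the actual threshold is exactly what the literature review is meant to inform.

```python
def is_significant_gap(gap_pts, font_size_pts, ratio=1.5):
    """Return True when a vertical gap is large relative to the font size.

    The default ratio of 1.5 is a placeholder assumption, not a measured value.
    """
    if font_size_pts <= 0:
        raise ValueError("font size must be positive")
    return gap_pts >= ratio * font_size_pts


def split_blocks(lines, ratio=1.5):
    """Group text lines into blocks separated by significant vertical gaps.

    Each line is a (top, bottom, font_size) tuple in points, with the
    y coordinate increasing down the page (an assumed convention).
    """
    blocks = []
    current = []
    previous = None
    for line in lines:
        top, _bottom, font_size = line
        if previous is not None:
            gap = top - previous[1]  # distance from previous bottom to this top
            if is_significant_gap(gap, font_size, ratio):
                blocks.append(current)
                current = []
        current.append(line)
        previous = line
    if current:
        blocks.append(current)
    return blocks
```

With 12-point text, a 2-point gap keeps two lines in the same block, while a 34-point gap starts a new one.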
2. RDP Metadata Extraction
Engine enhancement design
Refined metadata extraction code for handling SF-298 forms.
· Robust handling of cases where OCR errors split a form/table cell vertically
· Added capability to specify excluded strings as part of a template, so that boilerplate labels will not be confused with meaningful text appearing nearby
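The excluded-string capability can be pictured roughly as follows. This is a sketch only: the function name and the way the excluded strings are passed in are hypothetical, and the real template syntax and engine code are not shown in this report.

```python
def strip_excluded(text, excluded):
    """Remove boilerplate labels from text captured for a form field.

    `excluded` is a list of label strings that a template would declare
    as excluded (hypothetical representation for this sketch).
    """
    # Remove longer labels first so a short label that is a prefix of a
    # longer one does not leave a fragment behind.
    for label in sorted(excluded, key=len, reverse=True):
        text = text.replace(label, " ")
    # Collapse the whitespace left where labels were removed.
    return " ".join(text.split())
```

For example, given the captured cell text "4. TITLE AND SUBTITLE Tools for Metadata Extraction" and the excluded string "4. TITLE AND SUBTITLE", only the title text remains.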
Evaluated precision and recall metrics for metadata extraction from SF-298 forms on an initial sample of 300 documents. Discovered additional variant forms, and added templates to process these variants. Results are posted on the website.
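For reference, precision and recall for a single document can be computed as in the sketch below. It assumes that a field counts as correct only when the extracted value exactly matches a hand-verified value; the report does not state the scoring rule actually used, and the field names are illustrative.

```python
def precision_recall(extracted, expected):
    """Compute (precision, recall) for one document's metadata fields.

    `extracted` and `expected` map field names to string values.
    Exact string equality is an assumed matching criterion.
    """
    correct = sum(
        1 for field, value in extracted.items()
        if field in expected and expected[field] == value
    )
    # Precision: share of extracted fields that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: share of expected fields that were correctly extracted.
    recall = correct / len(expected) if expected else 0.0
    return precision, recall
```

If three fields are extracted, two of them correctly, and the document has four expected fields, precision is 2/3 and recall is 2/4.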
Uncovered additional OCR errors, including an apparent software bug that yields non-conformant XML documents: the files contain embedded ASCII character codes that appear to lie outside the character range permitted by the XML 1.0 standard. Developed input filters to correct these errors before the XML is parsed.
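An input filter of this kind might look like the sketch below, which replaces every character outside the XML 1.0 Char production (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF) before parsing. The function name and the choice to substitute a space are assumptions; the project's actual filters may handle these characters differently.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character NOT allowed by the XML 1.0 Char production.
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def filter_xml_text(raw, replacement=" "):
    """Replace characters that XML 1.0 forbids so the text can be parsed."""
    return _ILLEGAL_XML_CHARS.sub(replacement, raw)


def parse_ocr_xml(raw):
    """Filter the raw OCR output, then parse it into an element tree."""
    return ET.fromstring(filter_xml_text(raw))
```

For instance, a form-feed control character (code 12) embedded in element text is replaced by a space, after which the document parses normally. This sketch handles only literal characters; numeric character references to forbidden code points would need a separate pass.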
Current design employs the conjunction of a large number of rules to identify metadata-bearing pages (i.e., title/cover pages). Began analysis of these rules as independent indicators to determine whether some might be redundant or possibly even counter-productive.
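One way to examine each rule as an independent indicator is to score it alone against pages whose status is already known, as sketched below. The page feature dictionaries, rule names, and labeled sample are hypothetical and stand in for whatever representation the engine uses.

```python
def rule_scores(pages, rules):
    """Score each rule on its own against labeled pages.

    `pages` is a list of (features, is_metadata_page) pairs and `rules`
    maps a rule name to a predicate over the features (both hypothetical).
    Returns {rule_name: (precision, recall)}.
    """
    scores = {}
    for name, rule in rules.items():
        tp = fp = fn = 0
        for features, is_metadata_page in pages:
            fired = rule(features)
            if fired and is_metadata_page:
                tp += 1
            elif fired and not is_metadata_page:
                fp += 1
            elif not fired and is_metadata_page:
                fn += 1
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        scores[name] = (precision, recall)
    return scores
```

Because the rules are combined by conjunction, a rule with low recall on its own rejects genuine metadata pages no matter what the other rules say, which makes it a candidate for being counter-productive. Two rules that fire on exactly the same pages are candidates for redundancy.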
Developed initial measures of template complexity as a preliminary step towards evaluating trade-offs between engine features and template language modification.
DTIC process study
Not yet started, as we first want to hold a meeting with DTIC at the DTIC site.
Substantially updated the website, including the addition of on-line demos to support the GPO presentation.
There are no deliverables specified for this time period.
Problem Areas and Corrective Actions
Deviations in Cost/Schedule
Work to be Accomplished Next Period
Continue precision/recall study for templates extracting metadata from SF-298 using randomly selected documents
Continue design for the classification scheme to match a document with an appropriate template
Start on the DTIC process study