Monthly Report for October 2005

 

 

Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06

 

Work Accomplishments during period

 

1. Document Classification

            Exploration of alternatives

We are considering classification schemes based on

a) simple rules such as presence of white space, bold face, centering and the like

b) geometric layout of blocks of information creating trees of information

 

·        Began exploring literature on automated typesetting to look for typographic conventions that may serve as indicators of layout division (e.g., what constitutes “significant” amounts of vertical/horizontal whitespace in relation to font size)
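One way to make the "significant whitespace" question concrete is sketched below. It is only an illustration of the idea and not a convention taken from the typesetting literature. The `Line` record, the `split_blocks` function, the threshold `factor` of 1.5, and the assumption that page coordinates increase downward are all hypothetical.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    """A text line's vertical extent and font size, in points (hypothetical record)."""
    top: float
    bottom: float
    font_size: float


def split_blocks(lines: List[Line], factor: float = 1.5) -> List[List[Line]]:
    """Group lines into blocks, starting a new block at each 'significant' gap.

    A gap counts as significant when it exceeds `factor` times the larger
    font size of the two neighbouring lines. The factor is an assumed value.
    """
    blocks: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: l.top):
        if blocks:
            prev = blocks[-1][-1]
            gap = line.top - prev.bottom
            if gap <= factor * max(prev.font_size, line.font_size):
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks


# Two 12pt lines, a 34pt gap, then two 10pt lines.
sample = [Line(100, 112, 12), Line(114, 126, 12),
          Line(160, 170, 10), Line(172, 182, 10)]
blocks = split_blocks(sample)
```

Scaling the threshold by font size means the same physical gap can separate blocks in small body text yet be ignored between large title lines.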

 

 

2. RDP Metadata Extraction

            Engine enhancement design

Refined metadata extraction code for handling SF-298 forms.

·        Robust handling of cases where OCR errors split a form/table cell vertically

·        Added capability to specify excluded strings as part of a template, so that boilerplate labels will not be confused with meaningful text appearing nearby
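The excluded-strings capability can be pictured with a minimal sketch like the one below. The template layout (a dictionary with an `exclude` list), the function name `remove_excluded`, and the label strings are assumptions made for illustration; they do not reproduce the project's actual template language.

```python
import re
from typing import Iterable


def remove_excluded(text: str, excluded: Iterable[str]) -> str:
    """Strip boilerplate labels from OCR'd cell text, then normalise whitespace."""
    # Remove longer labels first so a short label never clips a longer one.
    for label in sorted(excluded, key=len, reverse=True):
        text = re.sub(re.escape(label), " ", text, flags=re.IGNORECASE)
    return " ".join(text.split())


# Hypothetical template entry for one form field.
title_field = {
    "name": "title",
    "exclude": ["TITLE AND SUBTITLE", "4. TITLE AND SUBTITLE"],
}

cell_text = "4. TITLE AND SUBTITLE Tools for Automatic Extraction of Metadata"
title = remove_excluded(cell_text, title_field["exclude"])
```

Keeping the labels in the template rather than in the engine lets each form variant carry its own boilerplate list.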

 

Evaluated precision and recall metrics for metadata extraction from SF-298 forms on an initial sample of 300 documents. Discovered additional variant forms, and added templates to process these variants. Results are posted on the website.
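For reference, a per-field precision/recall computation might look like the sketch below. It assumes exact string matching between the extracted value and a hand-verified value, which may be stricter or looser than the criterion used in the actual study. The names `Counts` and `score_field` and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Counts:
    true_positive: int = 0
    false_positive: int = 0
    false_negative: int = 0

    def precision(self) -> float:
        denom = self.true_positive + self.false_positive
        return self.true_positive / denom if denom else 0.0

    def recall(self) -> float:
        denom = self.true_positive + self.false_negative
        return self.true_positive / denom if denom else 0.0


def score_field(extracted: List[Optional[str]],
                expected: List[Optional[str]]) -> Counts:
    """Compare one field across documents; None means 'no value'."""
    counts = Counts()
    for got, want in zip(extracted, expected):
        if got is not None and got == want:
            counts.true_positive += 1
            continue
        if got is not None:       # extracted something, but it was wrong
            counts.false_positive += 1
        if want is not None:      # a real value was missed or garbled
            counts.false_negative += 1
    return counts


# Made-up values for a single field over four documents.
extracted = ["2005-10", "2004-01", None, "2003-07"]
expected = ["2005-10", "2004-11", "2002-03", "2003-07"]
counts = score_field(extracted, expected)
```

Under this convention a wrong value counts against both precision and recall, while a missing value counts against recall only.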

 

Uncovered additional OCR errors, including an apparent software bug that yields non-conformant XML documents: the files contain embedded ASCII character codes that appear to lie outside the XML 1.0 standard. Developed input filters to correct these before the XML is parsed.
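An input filter of this kind could be as simple as the sketch below, which removes every character not allowed by the XML 1.0 "Char" production before parsing. It is an assumed approach, not the project's filter: the function name `strip_invalid_xml_chars` and the sample input are hypothetical, and the sketch handles only literal characters, not illegal numeric character references such as `&#12;`.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str, replacement: str = "") -> str:
    """Return `text` with characters illegal in XML 1.0 replaced."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# Made-up OCR output containing a form feed and a control character.
raw = "<page><line>Report\x0c Date\x01</line></page>"
clean = strip_invalid_xml_chars(raw)
root = ET.fromstring(clean)          # parses once the stray codes are gone
line_text = root.find("line").text
```

Tab, line feed, and carriage return are legal in XML 1.0 and pass through unchanged.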

 

Current design employs the conjunction of a large number of rules to identify metadata-bearing pages (i.e., title/cover pages). Began analysis of these rules as independent indicators to determine whether some might be redundant or possibly even counter-productive.
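One possible form for such an analysis is a leave-one-out comparison, sketched below: drop each rule in turn and see how the accuracy of the remaining conjunction changes on hand-labeled pages. This is an assumed approach, not the analysis actually performed. The rule names, thresholds, page attributes, and the four sample pages are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Page = Dict[str, float]
Rule = Callable[[Page], bool]
Labeled = List[Tuple[Page, bool]]   # (page features, is it a metadata page?)


def accuracy(rules: List[Rule], pages: Labeled) -> float:
    """Fraction of pages where the conjunction of `rules` matches the label."""
    correct = sum(all(r(p) for r in rules) == label for p, label in pages)
    return correct / len(pages)


def leave_one_out(rules: Dict[str, Rule],
                  pages: Labeled) -> Tuple[float, Dict[str, float]]:
    """Return baseline accuracy and the change seen when each rule is dropped."""
    base = accuracy(list(rules.values()), pages)
    deltas = {}
    for name in rules:
        rest = [r for n, r in rules.items() if n != name]
        deltas[name] = accuracy(rest, pages) - base
    return base, deltas


# Hypothetical rules and labeled sample.
rules = {
    "large_font": lambda p: p["max_font"] >= 14,
    "few_lines": lambda p: p["lines"] <= 25,
    "early_page": lambda p: p["page_no"] <= 3,
}
pages = [
    ({"max_font": 18, "lines": 12, "page_no": 1}, True),
    ({"max_font": 16, "lines": 30, "page_no": 1}, True),
    ({"max_font": 10, "lines": 40, "page_no": 2}, False),
    ({"max_font": 16, "lines": 20, "page_no": 9}, False),
]
base, deltas = leave_one_out(rules, pages)
```

On a given sample, a change of zero suggests the rule may be redundant, a positive change suggests it may be counter-productive, and a negative change suggests it is doing useful work.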

 

Developed preliminary measures of template complexity, as a first step towards evaluating trade-offs between engine features and template language modification.

 

4. Transition

            DTIC process study

Not yet started; we want to hold an on-site meeting with DTIC first.

            General Communications

Substantial update to website, including addition of on-line demos to support GPO presentation

 

Deliverables/Milestones

 

There are no deliverables specified for this time period.

 

Problem Areas and Corrective Actions

 

None

 

Deviations in Cost/Schedule

 

None

 

Work to be Accomplished Next Period

 

Continue precision/recall study for templates extracting metadata from SF-298 using randomly selected documents

 

Continue design for the classification scheme to match a document with an appropriate template

 

Start on the DTIC process study