Monthly Report for July 2006
Project No.: 260671
Funding Agency: Defense Logistics Agency -
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Project Period: 09/19/05 - 09/18/06
Work Accomplishments during period
Started work on using a validation approach as a means of classification. Work completed so far under this category:
Finished the phrase dictionary and phrase tokenizer. Unit tested both, along with the stats calculator.
Language framework is largely complete and checked into CVS; see docs at
The language is based on Apache Jelly
(http://jakarta.apache.org/commons/jelly/), a scripting framework that
allows XML structure to be "executed" by binding fairly simple Java
classes (beans) to XML tags.
Jelly may be a good way to implement the extraction templates in
the future. The binding mechanism is very simple and, if we can come up
with a reasonably clean semantics for the template language itself, the implementation of any single tag is likely to be pretty straightforward.
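The tag-binding idea described above can be illustrated with a minimal sketch. This is not Jelly's API: it uses only the JDK's DOM parser, and every name in it (TagBean, LengthTag, PatternTag, the "validate" script) is hypothetical. It only shows the general shape of the mechanism, in which each XML tag is mapped to a small Java bean that receives the tag's attributes and is then run.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

/** Illustrative tag-to-bean binding; all names are hypothetical, not Jelly's API. */
class TagBindingSketch {

    /** A bean bound to one XML tag: it receives the attributes, then runs. */
    interface TagBean {
        void setAttribute(String name, String value);
        void run(List<String> log);
    }

    /** Hypothetical tag: <length field="title" max="200"/>. */
    static class LengthTag implements TagBean {
        private String field;
        private int max;
        public void setAttribute(String name, String value) {
            if (name.equals("field")) field = value;
            if (name.equals("max")) max = Integer.parseInt(value);
        }
        public void run(List<String> log) {
            log.add("length(" + field + ") <= " + max);
        }
    }

    /** Hypothetical tag: <pattern field="date" regex="..."/>. */
    static class PatternTag implements TagBean {
        private String field;
        private String regex;
        public void setAttribute(String name, String value) {
            if (name.equals("field")) field = value;
            if (name.equals("regex")) regex = value;
        }
        public void run(List<String> log) {
            log.add("pattern(" + field + ") ~ " + regex);
        }
    }

    /** Tag name to bean factory; adding a tag means adding one entry here. */
    static final Map<String, Supplier<TagBean>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("length", LengthTag::new);
        REGISTRY.put("pattern", PatternTag::new);
    }

    /** Parses the script and "executes" every element that has a bound bean. */
    static List<String> execute(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        List<String> log = new ArrayList<>();
        walk(root, log);
        return log;
    }

    /** Depth-first walk: configure and run the bean for each known tag. */
    static void walk(Element element, List<String> log) {
        Supplier<TagBean> factory = REGISTRY.get(element.getTagName());
        if (factory != null) {
            TagBean bean = factory.get();
            NamedNodeMap attrs = element.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                bean.setAttribute(attr.getNodeName(), attr.getNodeValue());
            }
            bean.run(log);
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) walk((Element) child, log);
        }
    }

    public static void main(String[] args) throws Exception {
        String script = "<validate>"
                + "<length field=\"title\" max=\"200\"/>"
                + "<pattern field=\"date\" regex=\"[A-Z][a-z]+ [0-9]{4}\"/>"
                + "</validate>";
        execute(script).forEach(System.out::println);
    }
}
```

Elements with no registered bean (such as the enclosing "validate" element here) are simply skipped, so the set of tags can grow one bean at a time.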
1) Continued looking at the idea of extracting selected fields based largely upon content vocabulary (i.e., semantic tagging). Candidate fields include authors, author affiliations, and addresses. Looking at using regular expressions to narrow the search for names, etc. Downloaded lists of common names from the U.S. Census Bureau.
2) Finalizing work on
a. converting new Omnipage version 15 output to our non-proprietary model
b. revising form-based engine to work from the non-proprietary format instead of directly from Omnipage version 14
c. converting non-proprietary model to the clean XML intermediate form used as input to the nonform-based engine
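For item 1 above, the combination of a regular expression and a name vocabulary might look roughly like the following sketch. It is an illustration under stated assumptions, not the project's code: the class and method names are hypothetical, the pattern covers only one simple name shape, and the four-entry name set is a placeholder for a much larger list loaded from a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative author-candidate finder; all names here are hypothetical. */
class AuthorCandidateSketch {

    // Placeholder vocabulary standing in for a large list of common names.
    static final Set<String> KNOWN_NAMES =
            new HashSet<>(Arrays.asList("john", "mary", "smith", "johnson"));

    // One simple name shape: capitalized word, optional middle initial,
    // capitalized word (e.g. "John Smith", "Mary K. Johnson").
    static final Pattern NAME_SHAPE = Pattern.compile(
            "\\b([A-Z][a-z]+)(?:\\s+[A-Z]\\.)?\\s+([A-Z][a-z]+)\\b");

    /** The regex narrows the search; the vocabulary lookup filters the hits. */
    static List<String> findAuthorCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        Matcher m = NAME_SHAPE.matcher(text);
        while (m.find()) {
            String first = m.group(1).toLowerCase(Locale.ROOT);
            String last = m.group(2).toLowerCase(Locale.ROOT);
            if (KNOWN_NAMES.contains(first) || KNOWN_NAMES.contains(last)) {
                candidates.add(m.group());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        String line = "Final Report by John Smith and Mary K. Johnson, Example University";
        System.out.println(findAuthorCandidates(line));
    }
}
```

Because regex matches do not overlap, a capitalized word directly in front of a name can absorb its first half; a fuller version would need to handle that case, along with hyphenated names and all-capital text.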
Problem Areas and Corrective Actions
Deviations in Cost/Schedule
Work to be Accomplished Next Period
1) Complete validation binding framework and core validation function sets (pattern testing, length & vocabulary tests, phrase dictionary tests)
2) Work on feature enhancements to the template engine
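The core validation function sets named in item 1 above (pattern testing, length and vocabulary tests, phrase dictionary tests) could be expressed as small composable predicates. The sketch below is one possible shape under assumed names and thresholds; none of the method names, sample vocabularies, or phrases come from the project itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Illustrative field validators; all names and sample data are hypothetical. */
class FieldValidationSketch {

    /** Pattern test: the whole value must match the regular expression. */
    static Predicate<String> matchesPattern(String regex) {
        Pattern p = Pattern.compile(regex);
        return value -> p.matcher(value).matches();
    }

    /** Length test: character count must fall within [min, max]. */
    static Predicate<String> lengthBetween(int min, int max) {
        return value -> value.length() >= min && value.length() <= max;
    }

    /** Vocabulary test: at least minFraction of the words must be in the vocabulary. */
    static Predicate<String> vocabularyFraction(Set<String> vocabulary, double minFraction) {
        return value -> {
            String[] words = value.trim().toLowerCase(Locale.ROOT).split("\\s+");
            if (words.length == 1 && words[0].isEmpty()) return false; // empty input
            long hits = Arrays.stream(words).filter(vocabulary::contains).count();
            return (double) hits / words.length >= minFraction;
        };
    }

    /** Phrase dictionary test: the value must contain one of the phrases on word boundaries. */
    static Predicate<String> containsPhrase(Set<String> phrases) {
        return value -> {
            String normalized = " "
                    + value.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ") + " ";
            return phrases.stream().anyMatch(p -> normalized.contains(" " + p + " "));
        };
    }

    public static void main(String[] args) {
        // Validators compose with and()/or(), so one field can carry several tests.
        Predicate<String> dateField =
                lengthBetween(4, 20).and(matchesPattern("[A-Z][a-z]+ \\d{4}"));
        Set<String> affiliationWords =
                new HashSet<>(Arrays.asList("department", "of", "university"));
        Predicate<String> affiliationField = vocabularyFraction(affiliationWords, 0.5);

        System.out.println(dateField.test("July 2006"));
        System.out.println(affiliationField.test("Department of Physics"));
    }
}
```

Words are split on whitespace only, so attached punctuation (for example a trailing comma) would need to be stripped before the vocabulary test in any real use.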