Monthly Report for July 2006



Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06


Work Accomplishments during period




Started work on using a validation approach as a means of classification. Work completed so far under this category:


Finished the phrase dictionary and phrase tokenizer. Unit tested both, along with the stats calculator.
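
As an illustration of how a phrase dictionary and phrase tokenizer can work together, the following is a minimal sketch, not the project's actual code. It assumes a greedy longest-match strategy over whitespace-separated words and case-insensitive lookup; the class and method names (PhraseTokenizerSketch, addPhrase, tokenize) are hypothetical, and punctuation handling is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: greedy longest-match tokenizer backed by a phrase dictionary. */
final class PhraseTokenizerSketch {
    private final Set<String> phrases = new HashSet<>();
    private int maxWords = 1; // length in words of the longest registered phrase

    /** Registers a phrase; lookup is case-insensitive and whitespace-normalized. */
    void addPhrase(String phrase) {
        String[] words = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
        phrases.add(String.join(" ", words));
        maxWords = Math.max(maxWords, words.length);
    }

    /** Splits on whitespace, merging the longest dictionary phrase found at each position. */
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        String[] words = trimmed.split("\\s+");
        int i = 0;
        while (i < words.length) {
            int matched = 1; // fall back to a single-word token
            for (int len = Math.min(maxWords, words.length - i); len > 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                if (phrases.contains(candidate.toLowerCase(Locale.ROOT))) {
                    matched = len;
                    break;
                }
            }
            tokens.add(String.join(" ", Arrays.copyOfRange(words, i, i + matched)));
            i += matched;
        }
        return tokens;
    }
}
```

With the phrases "naval research laboratory" and "technical report" registered, tokenizing "Naval Research Laboratory Technical Report 2006" under these assumptions yields three tokens: "Naval Research Laboratory", "Technical Report", and "2006".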



Language framework largely complete and in cvs, see docs at



Language is based on Apache Jelly, a scripting framework that allows an XML structure to be "executed" by binding fairly simple Java classes (beans) to XML tags.


Jelly may be a good way to implement the extraction templates in the future. The binding mechanism is very simple and, if we can come up with a reasonably clean semantics for the template language itself, the implementation of any single tag is likely to be pretty straightforward.
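
To make the tag-binding idea concrete, here is a simplified stand-in written against the JDK only. It is not Apache Jelly's API: the TagBean interface, RequireTag class, TagRunner class, and the "require" tag with its attributes are all hypothetical, chosen only to show how XML attributes could be pushed into bean setters before a tag is run.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical stand-in for a tag bean; NOT the Apache Jelly Tag interface. */
interface TagBean {
    void run(List<String> out);
}

/** Bean bound to a hypothetical <require field="..." minLength="..."/> tag. */
final class RequireTag implements TagBean {
    private String field;
    private int minLength;
    public void setField(String field) { this.field = field; }
    public void setMinLength(String minLength) { this.minLength = Integer.parseInt(minLength); }
    @Override
    public void run(List<String> out) { out.add("require " + field + " >= " + minLength); }
}

/** Walks the children of the root element and "executes" each bound tag in order. */
final class TagRunner {
    private final Map<String, Supplier<TagBean>> registry = new HashMap<>();

    void register(String tagName, Supplier<TagBean> factory) {
        registry.put(tagName, factory);
    }

    List<String> execute(String xml) {
        List<String> out = new ArrayList<>();
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text and comments
                }
                Supplier<TagBean> factory = registry.get(node.getNodeName());
                if (factory == null) {
                    throw new IllegalArgumentException("Unbound tag: " + node.getNodeName());
                }
                TagBean bean = factory.get();
                // Copy each attribute into the matching String setter, e.g. field -> setField.
                NamedNodeMap attrs = node.getAttributes();
                for (int a = 0; a < attrs.getLength(); a++) {
                    String name = attrs.item(a).getNodeName();
                    String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                    bean.getClass().getMethod(setter, String.class)
                            .invoke(bean, attrs.item(a).getNodeValue());
                }
                bean.run(out);
            }
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not execute template", e);
        }
        return out;
    }
}
```

In this sketch each new tag costs one small bean class plus one register call, which is the property that makes a Jelly-style binding attractive for the extraction templates. A real implementation would use Jelly's own tag library mechanism rather than this hand-rolled runner.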




Metadata Extraction:

1)      Continued looking at the idea of extracting selected fields based largely upon content vocabulary (i.e., semantic tagging). Candidate fields include authors, author affiliations, and addresses. Looking at using regular expressions to narrow the search for names, etc. Downloaded lists of common names from the U.S. Census Bureau.

2)      Finalizing work on:

a.    converting new Omnipage version 15 output to our non-proprietary model

b.    revising form-based engine to work from the non-proprietary format instead of directly from Omnipage version 14

c.    converting non-proprietary model to the “clean XML” intermediate form used as input to the nonform-based engine
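
The name-search idea in item 1 can be sketched as follows. This is an illustrative example under stated assumptions, not the project's code: the regular expression (given name, optional middle initial, surname) is an assumed pattern, the class and method names are hypothetical, and the small name sets passed in stand in for the much larger Census name lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: regex narrows the search, a name vocabulary confirms the hit. */
final class NameCandidateSketch {
    // Assumed shape: capitalized given name, optional middle initial, capitalized surname.
    private static final Pattern NAME =
            Pattern.compile("\\b([A-Z][a-z]+)\\s+(?:[A-Z]\\.\\s+)?([A-Z][a-z]+)\\b");

    /** Returns regex matches whose given name and surname both appear in the vocabularies. */
    static List<String> findNames(String text, Set<String> givenNames, Set<String> surnames) {
        List<String> hits = new ArrayList<>();
        Matcher m = NAME.matcher(text);
        int from = 0;
        while (m.find(from)) {
            String given = m.group(1).toLowerCase(Locale.ROOT);
            String surname = m.group(2).toLowerCase(Locale.ROOT);
            if (givenNames.contains(given) && surnames.contains(surname)) {
                hits.add(m.group());
                from = m.end();
            } else {
                // Rejected: retry from the end of the first word so the second word
                // can still begin a later match (e.g. "Report John Smith").
                from = m.end(1);
            }
        }
        return hits;
    }
}
```

For the text "Prepared by John A. Smith and Mary Jones, Naval Research Laboratory" with given names {john, mary} and surnames {smith, jones}, this sketch returns "John A. Smith" and "Mary Jones" and rejects "Naval Research" and "Research Laboratory", which match the pattern but not the vocabulary.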



Problem Areas and Corrective Actions




Deviations in Cost/Schedule




Work to be Accomplished Next Period

1) Complete the validation binding framework and core validation function sets (pattern tests, length and vocabulary tests, phrase dictionary tests)

2) Work on feature enhancements to the template engine
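
As a rough picture of what the core validation function sets in item 1 might look like, the sketch below models each test as a predicate over an extracted field value and scores a value by the fraction of tests it passes. All names here (ValidationSketch, matches, lengthBetween, vocabularyShare, score) and the scoring rule are assumptions for illustration only; the actual framework may combine tests differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical sketch of pattern, length, and vocabulary validation tests. */
final class ValidationSketch {
    /** Pattern test: the whole value must match the regular expression. */
    static Predicate<String> matches(String regex) {
        Pattern p = Pattern.compile(regex);
        return s -> p.matcher(s).matches();
    }

    /** Length test: character count must fall within [min, max]. */
    static Predicate<String> lengthBetween(int min, int max) {
        return s -> s.length() >= min && s.length() <= max;
    }

    /** Vocabulary test: at least minShare of the words must be in the vocabulary. */
    static Predicate<String> vocabularyShare(Set<String> vocabulary, double minShare) {
        return s -> {
            String[] words = s.trim().toLowerCase(Locale.ROOT).split("\\s+");
            long known = Arrays.stream(words).filter(vocabulary::contains).count();
            return (double) known / words.length >= minShare;
        };
    }

    /** Fraction of tests passed; an assumed, deliberately simple scoring rule. */
    static double score(String value, List<Predicate<String>> tests) {
        if (tests.isEmpty()) {
            return 0.0;
        }
        long passed = tests.stream().filter(t -> t.test(value)).count();
        return (double) passed / tests.size();
    }
}
```

Under this scheme a candidate value could be scored against the test set for each field type, and the highest-scoring field type taken as its classification, which is one way to read the validation-as-classification approach described under Work Accomplishments.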