Extracting Metadata And Structure
Project sponsored by DTIC, GPO, and NASA
  1. Template Scaling Study
    1. Study the relation between DTIC non-form documents and the number of templates needed to cover them. - Sept. 1, 2009
  2. Template Set
    1. Create the agreed-upon number of templates to reach the expected percentage of coverage of non-form documents. March 15, 2010
      Delivered as part of the version 4.0 release on 3/15. Continued enhancement of the set through versions 4.01 and 4.02. As of June 1, the set contains 123 non-form templates and 11 form templates.
  3. Engine enhancement
    1. Deliver the capability required to effectively use the additional templates. - Jan 31, 2010
      Used internally Jan 31, 2010. Included in version 4.0 deliverable, March 15, 2010
  4. Text PDF
    1. Implement text PDF support so that the system can use a text PDF file as input without first using OCR (software upgrade). - March 1, 2010
      Delivered engine version 4.0, which can handle text PDF, March 15, 2010.
  5. System Integration
    1. Supporting the process of reviewing and correcting metadata Oct 1, 2010
      Used internally Oct 1, 2010. Preliminary version included in version 4.0 deliverable, March 15, 2010. This component remains available, but refinement has been de-emphasized pending decisions by DTIC on whether this capability is compatible with their integration strategy.
    2. Hardware/software environment at DTIC: DTIC site review. Feb 15, 2010
    3. Hardware/software environment at DTIC: Luratech interaction available for test. March 1, 2010
      Used internally Nov 15, 2010. Included in version 4.0 deliverable, March 15, 2010
    4. Make virtual machine available for testing at ODU March 15, 2010
      Virtual machine has been available to DTIC throughout this phase. Updated to version 4.0 March 15, 2010. Has subsequently been updated to versions 4.01 & 4.02
    5. Code delivered for review March 15, 2010 (code review returned by DTIC April 15, 2010)
      Delivered March 15, 2010.
    6. Complete system integrated (at DTIC by DTIC) May 1, 2010
      See note on item 7, below.
    7. Complete system integrated (at DTIC by DTIC with ODU fixes) May 15, 2010
      We have held weekly conference calls since March 1, 2010; a major topic has been testing of the software and integration. Testing will be completed by June 30, 2010, but integration will depend in part on test results and on management decisions about the most effective use of our software. As of June 1, 2010, the code has been through two cycles of review, corrections, and updates.
  6. Training
    1. Template writing training using Template Creation Tool at ODU, Training Seminar June 1, 2010
      A training seminar was conducted on June 9, and all material has been made available on the wiki of our website, including the latest version of our software, 4.0.2.
  7. Future Directions
    1. New template language April 30, 2010
    2. Techniques to enable the system to learn from human interventions April 30, 2010
      A new language has been drafted and a very preliminary prototype has been built. A metadata editor has been prototyped, and discussions have been held on how to use this editor to enable learning.

Old Dominion University Digital Library Group. extract@cs.odu.edu