Monthly Report for May 2006

 

 

Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06

 

Work Accomplishments during period

 

 

Classification: 

  1. Performed a literature survey of image-based classification algorithms and produced an annotated bibliography.
  2. Looking at establishing authority files for publishers, conference information, etc. Will consider exploiting existing collection metadata to build the lists.
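
One way the idea of building authority lists from existing collection metadata might look is sketched below. This is an illustration only, not the project's code: the record layout (a dictionary with a "publisher" key), the normalization rule, the `min_count` threshold, and the sample records are all assumptions made for the example.

```python
from collections import Counter

def normalize(name):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())

def build_authority_list(records, field="publisher", min_count=2):
    """Return normalized values of `field` seen at least `min_count` times."""
    counts = Counter(
        normalize(rec[field]) for rec in records if rec.get(field)
    )
    return sorted(name for name, n in counts.items() if n >= min_count)

# Hypothetical sample records, for illustration only.
records = [
    {"publisher": "Naval Postgraduate School"},
    {"publisher": "NAVAL POSTGRADUATE SCHOOL,"},
    {"publisher": "Air Force Institute of Technology"},
    {"title": "No publisher recorded"},
]
authority = build_authority_list(records)
```

Requiring a value to appear more than once is a simple guard against one-off spellings entering the list; the threshold would need tuning against the real collection.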

 

Metadata Extraction:

No action this month.

 

NASA Collection Characteristics:

  1. Started on a program to convert the MARC records for the NASA documents into our XML notation, so that extraction results can later be compared against them to judge whether extraction succeeded.
  2. Downloaded a second set of NASA documents:

186 files from NASA: 43 forms, 143 non-forms

43 form files: 3 unresolved (form was an image)

143 non-form files: classified manually into 5 classes with 75 docs

  1. Current template features are not sufficient to handle these classes; a "to end of paragraph" feature needs to be added.
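
The later comparison between converted reference records and extracted metadata could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the element names (`metadata`, `title`, `author`) are hypothetical and do not represent the project's actual XML notation, each field is assumed to appear at most once per record, and matching is a plain case-insensitive comparison after whitespace normalization.

```python
import xml.etree.ElementTree as ET

def fields(xml_text):
    """Map each child tag of the root to its whitespace-normalized text.

    Assumes each tag appears at most once; a repeated tag keeps the last value.
    """
    root = ET.fromstring(xml_text)
    return {child.tag: " ".join((child.text or "").split()) for child in root}

def compare(reference_xml, extracted_xml):
    """Return {tag: True/False} for every field present in the reference."""
    ref, ext = fields(reference_xml), fields(extracted_xml)
    return {
        tag: ext.get(tag, "").lower() == value.lower()
        for tag, value in ref.items()
    }

# Hypothetical records, for illustration only.
reference = "<metadata><title>Wind Tunnel Tests</title><author>J. Smith</author></metadata>"
extracted = "<metadata><title>Wind  Tunnel Tests</title><author>Smith, J.</author></metadata>"
result = compare(reference, extracted)
```

In this example the title matches once whitespace is normalized, while the author does not because the name order differs. A real comparison would likely need field-specific matching rules.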

 

Auxiliary Activities:

Investigated the ‘blade’ PDF extraction software as a means of obtaining tabular information, as an alternative to obtaining it from OCR output. Will compare its performance on form pages that are problematic for OCR.

  

Problem Areas and Corrective Actions

 

None

 

Deviations in Cost/Schedule

 

None

 

Work to be Accomplished Next Period

Prepare presentation for NASA summarizing current characterization of their collection based on two samples of 200 documents each.

Prepare prototype software that will color visual display of point pages.

Prepare prototype software for classification based on authority files.

Investigate the use of oracle-based validation as an alternative to classification.

Work on feature enhancements to the template engine.