Monthly Report for February 2006

 

 

Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06

 

Work Accomplishments during period

 

 

Classification:  Use of cover page and title page detection algorithm.

 

1.      Started an alternative approach to classification based on the principle that all documents from an organization are likely to be of similar type. Created a testbed of 168 documents and manually classified them. Wrote a template to extract only the organization name from the POINT page and clustered similar names into one class, e.g., RAND Education and RAND Health. Promising results. This can be generalized to representing each class by a feature list such as organization, date, or report number.

2.      Continued exploring learning techniques to be used for classification.
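
The name clustering described in item 1 can be illustrated with a minimal sketch. The grouping rule here (the first word of the organization name, upper-cased) is an assumption chosen for illustration; the report does not state how name similarity was measured. The function names and document IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def org_class(name: str) -> str:
    """Map an organization name to a class key.

    Hypothetical rule: the first word of the name, upper-cased.
    """
    tokens = name.strip().split()
    return tokens[0].upper() if tokens else ""


def cluster_by_org(doc_orgs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group document IDs by the class key of their organization name."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for doc_id, org in doc_orgs.items():
        clusters[org_class(org)].append(doc_id)
    return dict(clusters)


# Illustrative input: document ID -> organization name taken from the POINT page.
docs = {
    "doc1": "RAND Education",
    "doc2": "RAND Health",
    "doc3": "Naval Postgraduate School",
}
clusters = cluster_by_org(docs)
```

With this rule, RAND Education and RAND Health fall into the same class. A real implementation would likely need a more robust similarity measure than the leading word.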

 

Detection of POINT (Page Of INTerest, e.g., cover/title pages): 

 

1.      Still refining which statistics give the most differentiation of POINT pages from regular body pages. An algorithm is still needed to tie all the statistics together into a decision process.
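
One possible way to tie several page statistics into a single decision is a weighted score compared against a threshold. The sketch below is illustrative only: the three statistics, the cut-off values, the weights, and the threshold are all assumptions, since the report does not name the statistics under study or the combining algorithm.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    # Hypothetical per-page statistics; the report does not list the real ones.
    word_count: int
    mean_font_size: float
    centered_line_fraction: float  # share of lines that are centered, 0.0 to 1.0


def point_score(page: PageStats) -> float:
    """Combine the statistics into a score between 0.0 and 1.0.

    The cut-offs and weights are illustrative assumptions, not tuned values.
    """
    sparse = 1.0 if page.word_count < 150 else 0.0
    large_font = 1.0 if page.mean_font_size > 12.0 else 0.0
    return 0.4 * sparse + 0.3 * large_font + 0.3 * page.centered_line_fraction


def is_point(page: PageStats, threshold: float = 0.5) -> bool:
    """Decide whether a page looks like a POINT page (cover or title page)."""
    return point_score(page) >= threshold


cover_like = PageStats(word_count=60, mean_font_size=16.0, centered_line_fraction=0.8)
body_like = PageStats(word_count=450, mean_font_size=10.0, centered_line_fraction=0.05)
```

In practice the weights could be learned from labeled pages rather than set by hand.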

 

 

Metadata Extraction:

 

1.      Started on the next major deliverable to have an automated process for documents containing forms. Developed design document for process and diagrams to illustrate it. Started programming the process scripts.

2.      Developed metrics to measure success of the form extraction. Validated against real metadata as provided by DTIC (800,0000 records available).

3.      Developed an Excel spreadsheet to show the ‘goodness’ of our extraction programs for the 6 types of forms we have found in the testbed (10,000 documents).
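
As an illustration of how extracted fields might be scored against a reference record, the sketch below computes field-level precision and recall using exact match after whitespace and case normalization. The metric definitions, the function names, and the sample field names are assumptions; the report does not specify which metrics were developed.

```python
from typing import Dict, Tuple


def normalize(value: str) -> str:
    """Lower-case the value and collapse runs of whitespace before comparing."""
    return " ".join(value.lower().split())


def field_scores(extracted: Dict[str, str], reference: Dict[str, str]) -> Tuple[float, float]:
    """Return (precision, recall) over metadata fields.

    A field counts as correct when it is present in the reference record
    and its normalized value matches exactly.
    """
    correct = sum(
        1
        for field, value in extracted.items()
        if field in reference and normalize(value) == normalize(reference[field])
    )
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical records for illustration only.
extracted = {"title": "Tools for  Extraction", "author": "Smith"}
reference = {"title": "tools for extraction", "author": "Jones", "date": "2006"}
precision, recall = field_scores(extracted, reference)
```

Exact matching is strict; a fuzzier comparison may suit fields such as author names, where formatting varies.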

 

NASA Collection Characteristics:

1. Had a discussion with the CASI team on ways to pick documents for study at random, what time period we should use for our study, and the mechanisms for handling security issues. See the memorandum coming out of this.

2. The details of getting the right set of documents still need to be resolved.

 

Auxiliary Activities:

1.      Final version of conference paper on current extraction results for ELPUB 2006 submitted.

2.      Completed the summary of our work so far in a presentation to be given to GPO.

3.      Committed to give three presentations at the DTIC user conference (OAI, extraction, and categorization); started laying out presentations.

  

Problem Areas and Corrective Actions

 

None

 

Deviations in Cost/Schedule

 

None

 

Work to be Accomplished Next Period

 

Concentrate on form-based extraction as deliverables are coming up.

 

Evaluate various classification schemes.

 

Begin the analysis of NASA documents once we have obtained the right set.