Monthly Report for February 2006
Project No.: 260671
Funding Agency: Defense Logistics Agency - Defense Technical Information Center
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Project Period: 09/19/05 - 09/18/06
Work Accomplishments during period
Classification: Use of cover page and title page detection algorithm.
1. Started an alternative approach to classification based on the principle that all documents from an organization are likely to be of similar type. Created a testbed of 168 documents and manually classified documents. Wrote template to extract organization name only from POINT page and clustered similar names into one class, e.g., RAND Education and RAND Health. Promising results. This can be generalized to representing class by feature list such as organization, date, or report number.
2. Continued exploring learning techniques that could be used for classification.
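The organization-based grouping in item 1 can be illustrated with a minimal sketch. It is not the project's template code: the function names, the document identifiers, and the rule of using the first word of the organization name as the class key are all assumptions made for illustration. Only the RAND Education / RAND Health pairing comes from the report.

```python
from collections import defaultdict


def class_key(org_name: str) -> str:
    """Return a crude class key: the first word of the name, upper-cased.

    Assumption for illustration only; real organization names would need
    a more careful similarity rule.
    """
    words = org_name.strip().split()
    return words[0].upper() if words else ""


def cluster_by_organization(doc_orgs: dict[str, str]) -> dict[str, list[str]]:
    """Group document ids whose organization names share a class key."""
    clusters = defaultdict(list)
    for doc_id, org in doc_orgs.items():
        clusters[class_key(org)].append(doc_id)
    return dict(clusters)


# Hypothetical document ids mapped to extracted organization names.
docs = {
    "doc1": "RAND Education",
    "doc2": "RAND Health",
    "doc3": "Naval Research Laboratory",
}
clusters = cluster_by_organization(docs)
```

Under this rule, doc1 and doc2 land in the same class. Extending the key to a tuple of features (organization, date, report number) would follow the generalization mentioned in item 1.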
Detection of POINT (Page Of INTerest, e.g., cover/title pages):
1. Still refining which statistics best differentiate POINT pages from regular body pages; an algorithm is still needed to tie all the statistics together into a decision process.
1. Started on the next major deliverable to have an automated process for documents containing forms. Developed design document for process and diagrams to illustrate it. Started programming the process scripts.
2. Developed metrics to measure success of the form extraction. Validated against real metadata as provided by DTIC (800,0000 records available).
3. Developed an Excel spreadsheet to show the ‘goodness’ of our extraction programs for the 6 types of forms we have found in the testbed (10,000 documents).
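One plausible form for such a success metric is a per-field match rate between extracted values and the reference metadata. The report does not define the metrics actually used, so this is only a sketch under stated assumptions: the field names, the normalization rule, and the sample records are all hypothetical.

```python
def normalize(value: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(value.lower().split())


def field_accuracy(extracted: list[dict], reference: list[dict],
                   fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of records whose extracted value
    matches the reference value after normalization.

    Records are paired by position. A field missing from both records is
    treated as an empty string and therefore counts as a match.
    """
    scores = {}
    for field in fields:
        matches = 0
        for ext, ref in zip(extracted, reference):
            if normalize(ext.get(field, "")) == normalize(ref.get(field, "")):
                matches += 1
        scores[field] = matches / len(reference) if reference else 0.0
    return scores


# Made-up sample records for illustration; not project data.
extracted = [
    {"title": "Tools for  Metadata Extraction", "report_number": "TR-01"},
    {"title": "A Study of Forms", "report_number": "TR-99"},
]
reference = [
    {"title": "Tools for Metadata Extraction", "report_number": "TR-01"},
    {"title": "A Study of Forms", "report_number": "TR-02"},
]
scores = field_accuracy(extracted, reference, ["title", "report_number"])
```

Running the same function separately on each form type would give one row of scores per type, which is the kind of table a spreadsheet summary could hold.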
NASA Collection Characteristics:
1. Held a discussion with the CASI team on ways to pick documents for study at random, what time period we should use for our study, and the mechanisms for handling security issues. See the memorandum coming out of this.
2. The details of getting the right set of documents still need to be resolved.
1. Final version of conference paper on current extraction results for ELPUB 2006 submitted.
2. Completed the summary of our work so far in a presentation to be given to GPO.
3. Committed to give three presentations at the DTIC user conference (OAI, extraction, and categorization); started laying out presentations.
Problem Areas and Corrective Actions
Deviations in Cost/Schedule
Work to be Accomplished Next Period
Concentrate on form-based extraction as deliverables are coming up.
Evaluate various classification schemes.
Begin the analysis of NASA documents once we have obtained the right set.