Monthly Report for December 2005

 

 

Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06

 

Work Accomplishments during period

 

Classification:  Use of cover page and title page detection algorithm.

           

Developed algorithms to detect the cover page and the title page of a document. Detecting these pages will help in developing robust classification algorithms. Some of the issues we identified while developing these algorithms are listed below.

 

1. Missing first page in the XML (for example, ADA428853.xml). When the first page is missing, the cover page locator detects the second page as the cover page. Some "#eagle" documents used the first page and some used the second page, so our code classified them into different classes.

 

2. Limitations of the current classification code:

    a) Preprocessing is not good enough. For pages from the same class, it tends to produce different numbers of blocks.

    b) Only the information about the first document in a class is recorded. New documents are matched against the first document in a class instead of the mean of all documents in that class.

    c) Graph matching is too strict: graphs must be in a similar position and of similar size (10% threshold). However, the OCR output for a graph can differ considerably (for example, 428998 vs. 429030).
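
The strict match in item 2c can be illustrated with a minimal sketch. It assumes each graph region is a bounding box in page-relative units (0.0 to 1.0) and that the 10% threshold applies separately to each coordinate and dimension; the report does not state what the threshold is measured against, so both points are assumptions. The names Region and regions_match are hypothetical and are not taken from the project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Bounding box of a page region in page-relative units (assumed 0.0 to 1.0)."""
    x: float
    y: float
    width: float
    height: float


def regions_match(a: Region, b: Region, tolerance: float = 0.10) -> bool:
    """Return True when position and size each differ by at most `tolerance`.

    Assumption: the tolerance is an absolute difference in page-relative
    units, checked independently for x, y, width and height.
    """
    return (
        abs(a.x - b.x) <= tolerance
        and abs(a.y - b.y) <= tolerance
        and abs(a.width - b.width) <= tolerance
        and abs(a.height - b.height) <= tolerance
    )


reference = Region(0.10, 0.10, 0.50, 0.30)
close_copy = Region(0.15, 0.12, 0.52, 0.31)   # small OCR jitter
wide_copy = Region(0.10, 0.10, 0.80, 0.30)    # OCR reported a much wider box

print(regions_match(reference, close_copy))
print(regions_match(reference, wide_copy))
```

Under these assumptions, a region whose OCR bounding box grows or shifts by more than the tolerance fails the match even when it is the same region, which is the failure mode described in item 2c.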

 

Implemented a k-means clustering algorithm for the same set with a simple similarity measure (divide a page into 30*50 bins, where each bin is either a text bin or not, and count how many bins match). The preliminary results are reasonable, but more validation is needed to check whether k-means is implemented correctly.
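
The bin-based measure and the clustering step can be sketched as follows. This is a minimal illustration and not the project's implementation: it assumes 30 bins across and 50 bins down, text given as bounding boxes in page coordinates, and the first k pages used as initial centroids (the report does not say how the clusters were seeded). The function names are hypothetical. For binary grids, the squared Euclidean distance equals the number of mismatched bins, so minimizing it is equivalent to maximizing the count of matched bins.

```python
COLS, ROWS = 30, 50  # assumed orientation: 30 bins across, 50 bins down


def page_to_bins(text_boxes, page_width, page_height):
    """Flattened ROWS*COLS grid; 1 where a text box touches the bin, else 0."""
    grid = [0] * (ROWS * COLS)
    for x0, y0, x1, y1 in text_boxes:
        c0 = int(x0 / page_width * COLS)
        c1 = min(COLS - 1, int(x1 / page_width * COLS))
        r0 = int(y0 / page_height * ROWS)
        r1 = min(ROWS - 1, int(y1 / page_height * ROWS))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r * COLS + c] = 1
    return grid


def matched_bins(a, b):
    """Similarity: number of bins with the same text/non-text value."""
    return sum(1 for x, y in zip(a, b) if x == y)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(pages, k, max_iterations=20):
    """Plain k-means over bin grids; returns one cluster label per page."""
    if not 1 <= k <= len(pages):
        raise ValueError("k must be between 1 and the number of pages")
    # Assumption: seed with the first k pages (random seeding is more robust).
    centroids = [[float(v) for v in page] for page in pages[:k]]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(page, centroids[c]))
            for page in pages
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(pages, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Synthetic pages, 300 x 500 units: text in the top half or the bottom half.
top = page_to_bins([(0, 0, 299, 249)], 300, 500)
bottom = page_to_bins([(0, 250, 299, 499)], 300, 500)
top_variant = page_to_bins([(0, 0, 299, 249), (0, 250, 9, 259)], 300, 500)
bottom_variant = page_to_bins([(0, 250, 299, 499), (0, 0, 9, 9)], 300, 500)

labels = kmeans([top, bottom, top_variant, bottom_variant], k=2)
print(labels)
```

One way to validate an implementation like this is to run it on synthetic pages whose grouping is known in advance, as above, before applying it to OCR output.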

 

We also looked at various page statistics and their relevance to cover/title page detection. For example, we looked at: word density per page (average, deviation); dominant font; and dominant font use in words per page (average, deviation).
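
A minimal sketch of how such statistics could be computed is given below. It assumes each page is available as a list of (word, font) pairs, interprets word density as the number of words on a page, and takes the average and deviation across the pages of one document; these interpretations and the function name page_statistics are assumptions, not definitions from the project.

```python
from collections import Counter
from statistics import mean, pstdev


def page_statistics(pages):
    """Summarize word counts and dominant-font usage across pages.

    `pages` is a list of pages; each page is a list of (word, font) pairs.
    """
    font_totals = Counter(font for page in pages for _, font in page)
    if not font_totals:
        raise ValueError("document contains no words")

    # Dominant font: the font used by the most words in the whole document.
    dominant_font = font_totals.most_common(1)[0][0]

    words_per_page = [len(page) for page in pages]
    dominant_per_page = [
        sum(1 for _, font in page if font == dominant_font) for page in pages
    ]
    return {
        "words_per_page": words_per_page,
        "words_avg": mean(words_per_page),
        "words_dev": pstdev(words_per_page),
        "dominant_font": dominant_font,
        "dominant_per_page": dominant_per_page,
        "dominant_avg": mean(dominant_per_page),
        "dominant_dev": pstdev(dominant_per_page),
    }


sample_pages = [
    [("Report", "Times"), ("Title", "Times")],
    [("a", "Arial"), ("b", "Times"), ("c", "Times"), ("d", "Times")],
]
stats = page_statistics(sample_pages)
print(stats["dominant_font"], stats["words_avg"], stats["dominant_avg"])
```

A page whose word count or dominant-font usage lies far from the document average is a candidate for a cover or title page under this reading.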

 

New Features

Added a new feature, largestsize, to our template for detecting the title.

 

largestsize(ploc1, ploc2, wordcount, length, average_word_length1, average_word_length2, letterperc, below_field)

 

where ploc1 and ploc2 specify a vertical area in a page,

average_word_length1 and average_word_length2 specify the range of average word length, and

letterperc is the percentage of letters in a line.
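
The following sketch illustrates one possible reading of the documented parameters. It is not the template engine's code. It assumes that ploc1 and ploc2 are fractions of the page height, that letterperc is a minimum percentage, and that the feature selects the eligible lines with the largest font size. The parameters wordcount, length and below_field are left out because their meaning is not defined here. The Line type and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float        # vertical position as a fraction of page height (assumed)
    font_size: float


def average_word_length(text):
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def letter_percentage(text):
    """Percentage of non-space characters that are letters."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return 100.0 * sum(1 for ch in chars if ch.isalpha()) / len(chars)


def largest_size_lines(lines, ploc1, ploc2,
                       average_word_length1, average_word_length2, letterperc):
    """Return the eligible lines that share the largest font size."""
    eligible = [
        ln for ln in lines
        if ploc1 <= ln.top <= ploc2
        and average_word_length1 <= average_word_length(ln.text) <= average_word_length2
        and letter_percentage(ln.text) >= letterperc  # assumed to be a minimum
    ]
    if not eligible:
        return []
    largest = max(ln.font_size for ln in eligible)
    return [ln for ln in eligible if ln.font_size == largest]


# Illustrative page content only; not taken from a real document.
page_lines = [
    Line("AD-A428 853", 0.05, 10),
    Line("Automatic Metadata Extraction", 0.30, 18),
    Line("from Electronic Documents", 0.35, 18),
    Line("12/2005 0001-22", 0.40, 24),
    Line("Approved for public release", 0.90, 9),
]
title_lines = largest_size_lines(page_lines, 0.1, 0.6, 3, 12, 80)
print([ln.text for ln in title_lines])
```

In this example the line with the largest font overall is rejected because it contains no letters, which shows why the letter percentage and word length filters are applied before comparing font sizes.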

 

Problem Areas and Corrective Actions

 

None

 

Deviations in Cost/Schedule

 

None

 

Work to be Accomplished Next Period

 

Continue the precision/recall study for templates extracting metadata from SF-298, using randomly selected documents.

 

Continue the design of the classification scheme to match a document with an appropriate template.