Monthly Report for December 2005
Project No.: 260671
Agency: Defense Logistics Agency -
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Project Period: 09/19/05 - 09/18/06
Work Accomplishments during period
Classification: Use of cover page and title page detection algorithm.
Developed algorithms to detect the cover page and the title page of a document. Detecting these pages will help in developing robust classification algorithms. Some of the issues we identified while developing these algorithms are listed below.
1. Missing first page in the XML (for example, ADA428853.xml): the cover page locator detects the second page as the cover page. Some "#eagle" documents used the first page and some used the second page. Therefore, our code classified them into different classes.
2. Limitations of the current classification code.
a) Preprocessing is not good enough. For pages from the same class, it tends to produce different numbers of blocks.
b) Only the information about the first document in a class is recorded. New documents are matched against the first document in a class instead of the mean of all documents in that class.
c) Graph matching is too strict (i.e., graphs must be in a similar position with a similar size, within a 10% threshold). However, OCR output for a graph may differ by much more than that (for example, 428998 vs. 429030).
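The strict match described in item 2(c) can be sketched as follows. This is only an illustration, not the project code: the Block class, the function names, and the choice to measure the 10% tolerance relative to the page width and height are all assumptions, since the report does not say how the threshold is applied.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical bounding box of a graph block, in page units."""
    x: float
    y: float
    width: float
    height: float


def within_tolerance(a, b, scale, tol=0.10):
    """True if a and b differ by at most tol * scale."""
    return abs(a - b) <= tol * scale


def blocks_match(first, second, page_width, page_height, tol=0.10):
    """Assumed rule: position and size must each agree within tol
    of the corresponding page dimension."""
    return (
        within_tolerance(first.x, second.x, page_width, tol)
        and within_tolerance(first.y, second.y, page_height, tol)
        and within_tolerance(first.width, second.width, page_width, tol)
        and within_tolerance(first.height, second.height, page_height, tol)
    )
```

Under this reading, a graph that OCR shifts or resizes by more than a tenth of the page fails the match even when the two pages share a layout, which is the behavior item 2(c) describes. Raising tol loosens the test.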
Implemented a K-means (KMean) clustering algorithm for the same document set, using a simple similarity measure: divide each page into 30*50 bins, mark each bin as either a text bin or not, and count how many bins match between two pages. The preliminary results look reasonable, but more validation is needed to confirm that the KMean implementation is correct.
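A minimal sketch of the bin-based similarity measure is given below. It is not the project code. It assumes that 30*50 means 30 rows by 50 columns, that text regions are given as (x0, y0, x1, y1) boxes in page units, and that a bin counts as a text bin when any text box overlaps it. All function names are hypothetical.

```python
ROWS, COLS = 30, 50  # assumed orientation of the 30*50 grid


def page_to_bins(text_boxes, page_width, page_height, rows=ROWS, cols=COLS):
    """Return a rows x cols grid of booleans; True marks a text bin."""
    grid = [[False] * cols for _ in range(rows)]

    def clamp(value, upper):
        return max(0, min(upper - 1, value))

    for x0, y0, x1, y1 in text_boxes:
        c0 = clamp(int(x0 / page_width * cols), cols)
        c1 = clamp(int(x1 / page_width * cols), cols)
        r0 = clamp(int(y0 / page_height * rows), rows)
        r1 = clamp(int(y1 / page_height * rows), rows)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] = True
    return grid


def matched_bins(grid_a, grid_b):
    """Count bins that have the same text/non-text value on both pages."""
    return sum(
        cell_a == cell_b
        for row_a, row_b in zip(grid_a, grid_b)
        for cell_a, cell_b in zip(row_a, row_b)
    )


def similarity(grid_a, grid_b):
    """Fraction of matching bins, between 0.0 and 1.0."""
    total = len(grid_a) * len(grid_a[0])
    return matched_bins(grid_a, grid_b) / total
```

Because empty bins also count as matches, two sparse pages score high even when their text does not overlap. That may be worth checking during validation of the clustering results.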
We also looked at various page statistics and their relevance to cover/title page detection. For example, we looked at word density per page (average and deviation), the dominant font, and dominant font use in words per page (average and deviation).
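The statistics listed above could be computed along these lines. This is a sketch under stated assumptions, not the project code: each page is assumed to be a list of (word, font_name) pairs, word density is taken to mean the word count per page, and the dominant font is taken to be the font used by the most words in the document. The function name page_statistics is hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def page_statistics(pages):
    """pages: non-empty list of pages, each a list of (word, font_name)."""
    if not pages:
        raise ValueError("at least one page is required")

    # Words per page, used here as a stand-in for word density.
    words_per_page = [len(page) for page in pages]

    # Font used by the largest number of words across the document.
    font_counts = Counter(font for page in pages for _, font in page)
    dominant_font = font_counts.most_common(1)[0][0] if font_counts else None

    # Words per page set in the dominant font.
    dominant_per_page = [
        sum(1 for _, font in page if font == dominant_font) for page in pages
    ]

    return {
        "words_avg": mean(words_per_page),
        "words_dev": pstdev(words_per_page),
        "dominant_font": dominant_font,
        "dominant_words_avg": mean(dominant_per_page),
        "dominant_words_dev": pstdev(dominant_per_page),
    }
```

A cover or title page would then be a candidate when its word count, or its use of the dominant font, deviates strongly from the document averages. Whether that signal is reliable is what the study above is meant to determine.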
Added a new feature, largestsize, to our template for detecting the title:
largestsize(ploc1, ploc2, wordcount, length, average_word_length1, average_word_length2, letterperc, below_field)
where ploc1 and ploc2 specify a vertical area on a page,
average_word_length1 and average_word_length2 specify the allowed range of the average word length, and
letterperc is the percentage of letters in a line.
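One possible reading of largestsize is sketched below. It is an illustration only and is not the template engine's implementation. It assumes that ploc1 and ploc2 are fractions of the page height, that letterperc is a minimum percentage, and that the feature selects the qualifying line with the largest font size. The parameters wordcount, length, and below_field are left out because the report does not define them. The Line class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical text line extracted from a page."""
    text: str
    y: float          # vertical position, 0.0 (top) to 1.0 (bottom)
    font_size: float


def letter_percentage(text):
    """Percentage of non-space characters that are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return 100.0 * sum(c.isalpha() for c in chars) / len(chars)


def average_word_length(text):
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def largest_size_line(lines, ploc1, ploc2, avg_len_min, avg_len_max, letterperc):
    """Return the largest-font line that passes all filters, or None."""
    candidates = [
        ln for ln in lines
        if ploc1 <= ln.y <= ploc2
        and avg_len_min <= average_word_length(ln.text) <= avg_len_max
        and letter_percentage(ln.text) >= letterperc
    ]
    return max(candidates, key=lambda ln: ln.font_size, default=None)
```

The vertical window and the letter percentage filter keep large-font report numbers or footers from being picked as the title.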
Problem Areas and Corrective Actions
Deviations in Cost/Schedule
Work to be Accomplished Next Period
Continue precision/recall study for templates extracting metadata from SF-298 using randomly selected documents
Continue design for the classification scheme to match a document with an appropriate template