Monthly Report for January 2006
Project No.: 260671
Agency: Defense Logistics Agency
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Project Period: 09/19/05 - 09/18/06
Work Accomplishments during period
Detection of POINT (Page Of INTerest, e.g., cover/title pages):
1. Carried out statistical study of metrics on document pages, characterizing factors such as word and line counts, dominant font sizes, etc.
2. Measured independent effectiveness of a variety of simple tests for identifying POINT, including deviations from the averages determined in the above-mentioned statistical study.
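The deviation-from-average test in item 2 can be illustrated with a minimal sketch. It is not the project's implementation: the metric names (word_count, line_count, dominant_font_size), the z-score formulation, and the threshold of 1.5 are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# Hypothetical metric names; the report only says "word & line count,
# dominant font sizes, etc."
METRICS = ("word_count", "line_count", "dominant_font_size")


def page_deviation_scores(pages, metrics=METRICS):
    """Return, for each page, its largest absolute z-score over the metrics."""
    stats = {}
    for m in metrics:
        values = [p[m] for p in pages]
        stats[m] = (mean(values), pstdev(values))

    scores = []
    for p in pages:
        z_values = []
        for m in metrics:
            mu, sigma = stats[m]
            # A metric with no variation cannot distinguish any page.
            z_values.append(abs(p[m] - mu) / sigma if sigma > 0 else 0.0)
        scores.append(max(z_values))
    return scores


def candidate_points(pages, threshold=1.5):
    """Indices of pages that deviate strongly from the document average.

    The threshold is an assumed value, not one taken from the study.
    """
    scores = page_deviation_scores(pages)
    return [i for i, s in enumerate(scores) if s >= threshold]


# One sparse, large-font page followed by five dense body pages.
pages = [{"word_count": 40, "line_count": 10, "dominant_font_size": 24}]
pages += [{"word_count": 400, "line_count": 45, "dominant_font_size": 11}] * 5
flagged = candidate_points(pages)
```

In this toy document only the first page is flagged, since it is the only one far from the per-document averages on any metric.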
Classification: Use of cover page and title page detection algorithm.
1. Completed implementation of KMEAN bin-based classifier. Initial study on a small number of documents offered very promising results.
2. Evaluated new classifier on larger collection of documents. Performance on larger set was inadequate. Began looking at rules from POINT study to see if they could be employed in classifier.
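As a rough illustration of what a k-means classifier over binned page features might look like, the sketch below clusters pages by a small feature vector. The report does not define the bins, so the feature used here (the fraction of a page's text falling in each of three vertical bands) is purely an assumption, as are the function names and the deterministic initialization.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points (assumed init)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def classify(vector, centroids):
    """Return the index of the cluster whose centroid is nearest to vector."""
    return min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))


# Hypothetical bin vectors: share of text in the top, middle, bottom band.
training = [
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
centroids, assignments = kmeans(training, k=2)
label = classify([0.9, 0.05, 0.05], centroids)
```

A classifier of this kind depends heavily on how well the training pages cover the layouts in the full collection, which is consistent with promising results on a small sample not carrying over to a larger one.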
1. Completed implementation and testing of new feature to identify “largest (fontsize) strings” as convenient rule for title extraction.
2. Added new templates aimed at public law documents.
3. Undertook study of extraction techniques employed by CiteSeer, including examination of its open-source code. CiteSeer is widely regarded as having a high-quality metadata database, but only a portion of that metadata comes from document extraction. The (old) open-source code was found to use fairly ad hoc methods that yielded relatively poor results. Sample documents were then submitted to CiteSeer without accompanying metadata, forcing the fall-back procedure of extracting the metadata from the document itself. Results were still poor, suggesting that the current version's extraction is less effective than our own techniques.
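The "largest (fontsize) strings" rule from item 1 of this list can be sketched as follows. This is a minimal illustration rather than the implemented feature: the input format (a list of text and font-size pairs for one page) and the tolerance parameter are assumptions.

```python
def largest_font_strings(strings, tolerance=0.5):
    """Return the strings set in the largest font size on a page.

    strings: list of (text, font_size) pairs in reading order (assumed format).
    tolerance: sizes within this many points of the maximum are treated as
    the same size (assumed value).
    """
    if not strings:
        return []
    max_size = max(size for _, size in strings)
    return [text for text, size in strings if max_size - size <= tolerance]


# Hypothetical cover page content.
page = [
    ("Technical Report", 14.0),
    ("Tools for Automatic Extraction", 20.0),
    ("of Metadata", 20.0),
    ("January 2006", 10.0),
]
title_candidate = " ".join(largest_font_strings(page))
```

Joining the selected strings in reading order gives a title candidate that a template rule could then accept or refine.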
1. Obtained preliminary collection of NASA documents for subsequent study. Based upon preliminary review of this set, formulated characteristics of representative sample desired for the actual study and began process of obtaining that representative sample.
2. Prepared conference paper on current extraction results for ELPUB 2006.
3. Began preparation for GPO presentation.
Problem Areas and Corrective Actions
Deviations in Cost/Schedule
Work to be Accomplished Next Period
Incorporate POINT study into page identifier.
Consider alternative classifier schemes, possibly based on POINT study rules and semantic tagging of lexemes (e.g., preliminary decomposition by publishing organization).
Commence study of NASA documents.