Monthly Report for January 2006

 

 

Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06

 

Work Accomplishments during period

 

Detection of POINT (Page Of INTerest, e.g., cover/title pages): 

 

1.      Carried out a statistical study of metrics on document pages, characterizing factors such as word and line counts, dominant font sizes, etc.

2.      Measured the independent effectiveness of a variety of simple tests for identifying POINT, including deviations from the averages determined in the statistical study above.
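
As an illustration of the kind of deviation test described above, the following is a minimal sketch, not the project's actual implementation. It assumes a hypothetical page representation (a list of `(text, font_size)` lines); the function names and the threshold value are illustrative only. It computes per-page word count, line count, and dominant font size, then flags pages that have unusually few words and an unusually large dominant font relative to the document averages.

```python
from statistics import mean, pstdev

def page_metrics(page):
    """Compute simple metrics for one page.

    `page` is a list of (text, font_size) lines. This representation is
    an assumption made for the sketch, not the project's data model.
    """
    word_counts = [len(text.split()) for text, _ in page]
    words_by_size = {}
    for (_, size), n in zip(page, word_counts):
        words_by_size[size] = words_by_size.get(size, 0) + n
    # Dominant font size: the size that covers the most words on the page.
    dominant = max(words_by_size, key=words_by_size.get) if words_by_size else 0.0
    return {"words": sum(word_counts), "lines": len(page), "dominant_font": dominant}

def z_scores(pages):
    """Return, for each page, each metric's deviation from the document
    average, expressed in (population) standard deviations."""
    metrics = [page_metrics(p) for p in pages]
    result = [{} for _ in metrics]
    for key in ("words", "lines", "dominant_font"):
        values = [m[key] for m in metrics]
        mu, sigma = mean(values), pstdev(values)
        for row, value in zip(result, values):
            row[key] = 0.0 if sigma == 0 else (value - mu) / sigma
    return result

def candidate_pages(pages, threshold=1.0):
    """Flag pages with few words and a large dominant font.

    The threshold of 1.0 standard deviation is an arbitrary
    illustrative value, not a figure from the study.
    """
    return [
        i for i, z in enumerate(z_scores(pages))
        if z["words"] < -threshold and z["dominant_font"] > threshold
    ]
```

Each test in such a scheme can be scored on its own against hand-labeled pages, which is one way to measure the independent effectiveness of individual tests.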

 

Classification:  Use of cover page and title page detection algorithm.

           

1.      Completed implementation of the KMEAN bin-based classifier. An initial study on a small number of documents gave very promising results.

2.      Evaluated the new classifier on a larger collection of documents. Performance on the larger set was inadequate. Began looking at rules from the POINT study to see whether they could be employed in the classifier.
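
This report does not describe the internals of the bin-based classifier, so the following is only a generic sketch of k-means clustering over page feature vectors, with each resulting cluster treated as a "bin" (an assumption). The feature vectors, the caller-supplied initial centroids, and the function names are all hypothetical.

```python
from math import dist

def kmeans(points, centroids, iterations=20):
    """Plain k-means (Lloyd's algorithm) on tuples of floats.

    `centroids` are caller-supplied starting points, which keeps the
    sketch deterministic. Returns (final_centroids, assignment).
    """
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, assignment

def classify(vector, centroids):
    """Assign a new feature vector to the bin with the nearest centroid."""
    return min(range(len(centroids)), key=lambda j: dist(vector, centroids[j]))
```

In practice the features would need to be scaled to comparable ranges before clustering, since Euclidean distance is otherwise dominated by the largest-valued feature.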

 

 

Metadata Extraction:

 

1.      Completed implementation and testing of a new feature to identify “largest (fontsize) strings” as a convenient rule for title extraction.

2.      Added new templates aimed at public law documents.

3.      Undertook a study of the extraction techniques employed by CiteSeer, including work with its open-source code. CiteSeer is widely regarded as having a high-quality metadata database, but only a portion of that metadata comes from document extraction. The (old) open-source code was found to use fairly ad hoc methods that yielded relatively poor results. Sample documents were submitted to CiteSeer without accompanying metadata, forcing the fall-back procedure of extracting the metadata from the document itself. Results were still poor, suggesting that the current version is less effective than our own techniques.
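
The "largest (fontsize) strings" rule in item 1 above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: it assumes a page is available as `(text, font_size)` lines in reading order, and the function name and the `tolerance` parameter are hypothetical. Consecutive lines set in the largest font on the page are joined into a single candidate title string.

```python
def largest_font_strings(lines, tolerance=0.5):
    """Return candidate title strings from one page.

    `lines` is a list of (text, font_size) in reading order (an assumed
    representation). Consecutive lines whose size is within `tolerance`
    points of the page's largest size are merged into one string.
    """
    # Ignore empty lines so stray blank boxes do not set the maximum size.
    sized = [(text.strip(), size) for text, size in lines if text.strip()]
    if not sized:
        return []
    top = max(size for _, size in sized)
    runs, current = [], []
    for text, size in sized:
        if top - size <= tolerance:
            current.append(text)
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs
```

A page may yield more than one run (for example, a large-font title and a large-font organization name), so a fuller rule would still need a way to choose among the candidates.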

 

Auxiliary Activities:

1.      Obtained a preliminary collection of NASA documents for subsequent study. Based on a preliminary review of this set, formulated the characteristics of the representative sample desired for the actual study and began the process of obtaining that sample.

2.      Prepared a conference paper on current extraction results for ELPUB 2006.

3.      Began preparation for the GPO presentation.

  

Problem Areas and Corrective Actions

 

None

 

Deviations in Cost/Schedule

 

None

 

Work to be Accomplished Next Period

 

Incorporate POINT study into page identifier.

 

Consider alternative classifier schemes, possibly based on POINT study rules and semantic tagging of lexemes (e.g., preliminary decomposition by publishing organization)

 

Commence study of NASA documents