Monthly Report for March 2006

 

 

Project No.: 260671

Funding Agency: Defense Logistics Agency - Defense Technical Information Center

Award No.: SP4700-05-P-0148

Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II

Project Period: 09/19/05 - 09/18/06

 

Work Accomplishments during period

 

Detection of POINT (Page Of INTerest, e.g., cover/title pages):

 

Continued working on the POINT location rules. The open question is how to combine the rules to recognize the limited number of POINT page classes. Issues include normalization; possible combination methods include clustering, SVM, etc.

 

 

 

Metadata Extraction:

 

Evaluated the form-based metadata extraction algorithm on around 10K documents. Created a Web site where one can interactively evaluate a randomly selected set of documents; see the screen shot below for the Web site.

 

URL: http://128.82.7.60:8080/sf298/listFormClasses.do

 

 

 

Meeting with DTIC Personnel on March 20 and 21.

 

Items covered in the meeting are summarized below.

 

1. Form selection under new collection

what will be given as input for acceptance test and production

form generated at DTIC vs forms coming into process

text pdf vs image pdf

multiple forms in document

which form is important (sf298-1, sf298_2, ..) - resolution: report accuracy rates for individual forms, concentrate on that form

count 'answers produced' not in template classification part but after extraction

 

 

Action: wait for new collection (more representative - 50% form based) to arrive, scan for forms, & decide on strategic path

        send reminder email

 

 

2. Validation & post-processing

pulling in all of author info as author's name

validate 'security classification' in report, abstract, title and 'distribution' 15. Phrases like 'Approved for public release; distribution is unlimited' can be used for quality assurance; similarly, terms like '19. security classification' are form labels and should not be values of other fields.

Stamped 'DISTRIBUTION A: Approved for full....' was properly recognized, but since the stamp was in the abstract it was put there.

confidence metric in extraction - linked with semantic

develop a module for throwing a result for human interaction

 

 

Action: develop post-processor for handling security classification & distribution based on authority files of known phrases, flag unknown values for human correction
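
The post-processing action above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the authority file is represented as an in-memory dictionary, the variant spellings are made up for the example (only the 'Approved for public release; distribution is unlimited' phrase comes from the meeting notes), and the names build_index and normalize_field are hypothetical, not part of the delivered software.

```python
import re

# Hypothetical authority data: canonical statement -> known variants.
# Only the canonical phrase is taken from the meeting notes; the
# variants are illustrative placeholders.
DISTRIBUTION_AUTHORITY = {
    "Approved for public release; distribution is unlimited": [
        "approved for public release, distribution unlimited",
    ],
}


def _key(text):
    """Lowercase, drop punctuation, and collapse whitespace for matching."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def build_index(authority):
    """Map every known phrase (canonical or variant) to its canonical form."""
    index = {}
    for canonical, variants in authority.items():
        for phrase in [canonical, *variants]:
            index[_key(phrase)] = canonical
    return index


def normalize_field(value, index):
    """Return (value, needs_review).

    Known phrases are replaced by their canonical form; unknown values
    are returned unchanged and flagged for human correction.
    """
    canonical = index.get(_key(value))
    if canonical is None:
        return value, True
    return canonical, False
```

Exact matching after light normalization is the simplest choice; whether fuzzy matching is needed would depend on how noisy the OCR output of these fields turns out to be.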

 

 

3. Bug fixes and minor changes

ABSTRACT occurs in many fields as keyword

form over multiple pages

change definition of recall to say "#answers" instead of "#fields" - yes

use the tag names for metadata the same as STINET - yes

 

Action: fixed

 

 

4. Automation steps

OCR process: how long, how automated? Automate the two steps: open OmniPage in batch mode and run it on a directory of files, then run our program on the result of OCR

 

use the same naming scheme they use now (date, org) when ADA number not yet assigned

 

package development (installation & README)

 

 

 

Action: look at OCR process

        decide about target environment (Windows, Unix)
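
The two automation steps discussed under item 4 (OCR over a directory of files, then extraction on the OCR result) could be chained by a small driver such as the Python sketch below. It is an assumption-laden outline: run_batch, ocr_step and extract_step are hypothetical names, and both steps are passed in as callables because the actual OmniPage batch invocation and the extraction program's interface are not specified here.

```python
from pathlib import Path


def run_batch(input_dir, ocr_step, extract_step):
    """Run OCR and then metadata extraction over every PDF in input_dir.

    ocr_step(pdf_path)      -> path of the OCR output for that PDF
    extract_step(ocr_path)  -> extracted metadata (any object)

    Both steps are caller-supplied placeholders. Returns (results,
    failures), keyed by file name, so that one bad file does not stop
    the rest of the batch.
    """
    results, failures = {}, {}
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        try:
            ocr_output = ocr_step(pdf)
            results[pdf.name] = extract_step(ocr_output)
        except Exception as exc:
            # Record the error message and continue with the next file.
            failures[pdf.name] = str(exc)
    return results, failures
```

Keeping the steps as parameters leaves the decision about the target environment (Windows, Unix) open, since only the two callables would need to change.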

 

 

 

 

 

Problem Areas and Corrective Actions

 

None

 

Deviations in Cost/Schedule

 

None

 

Work to be Accomplished Next Period

 

Create an authority file for distribution and security statements, and post-process the extracted metadata to normalize those statements.

 

Work with NASA to get access to their documents based on randomly selected IDs.

 

Automate the form-based metadata extraction process and start packaging the software for delivery.