Monthly Report for March 2006
Project No.: 260671
Funding Agency: Defense Logistics Agency - Defense Technical Information Center
Award No.: SP4700-05-P-0148
Project Title: Tools for Automatic Extraction of Metadata from DTIC Electronic Documents Collections - Phase II
Project Period: 09/19/05 - 09/18/06
Work Accomplishments during period
Detection of POINT (Page Of INTerest, e.g., cover/title pages):
Continued work on POINT location rules. The open question is how to combine the rules to recognize the limited number of POINT page classes. Normalization is under consideration; possible combination methods include clustering, SVM, etc.
Evaluated the form-based metadata extraction algorithm on around 10K documents. Created a Web site where a set of randomly selected documents can be evaluated interactively; see the screen shot below for the Web site.
Meeting with DTIC Personnel on March 20 and 21.
Items covered in the meeting are summarized below.
1. Form selection under new collection
what will be given as input for acceptance test and production
form generated at DTIC vs forms coming into process
text pdf vs image pdf
multiple forms in document
which form is important (sf298-1, sf298_2, ..) - resolved: report accuracy rates for individual forms, concentrate on that form
count 'answers produced' not in template classification part but after extraction
Action: wait for new collection (more representative - 50% form based) to arrive, scan for forms, and decide on strategic path
Action: send reminder email
2. Validation & post-processing
pulling in all of author info as author's name
validate 'security classification' in report, abstract, title and 'distribution' 15. Phrases like 'Approved for public release; distribution is unlimited' can be used for quality assurance; similarly, terms like '19. security classification' should not be values of other fields.
The stamp 'DISTRIBUTION A: Approved for full...' was properly recognized, but since the stamp was in the abstract it was put there.
confidence metric in extraction - linked with semantic
develop a module for throwing a result for human interaction
Action: develop post-processor for handling security classification & distribution based on authority files of known phrases, flag unknown values for human correction
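The post-processor named in the action item could look roughly like the sketch below. The contents of the authority list, the label pattern, and all function names are assumptions made for illustration; only the 'Approved for public release' phrase and the '19. security classification' label come from the meeting notes, and real authority files would be built from DTIC's known statements.

```python
import re

# Hypothetical authority file of known distribution statements.
DISTRIBUTION_AUTHORITY = [
    "Approved for public release; distribution is unlimited",
]

# Assumed pattern for form labels (e.g. "19. security classification")
# that should never appear as the value of an extracted field.
LABEL_PATTERN = re.compile(
    r"^\s*\d{1,2}[a-z]?\.\s*(security classification|distribution)",
    re.IGNORECASE,
)


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def check_distribution(value, authority=DISTRIBUTION_AUTHORITY):
    """Return (canonical_value, needs_review) for a distribution statement."""
    key = normalize(value)
    for phrase in authority:
        if normalize(phrase) == key:
            return phrase, False
    return value, True  # unknown statement: keep as-is, flag for a human


def postprocess(record):
    """Normalize a record (field -> value) and list fields needing review."""
    cleaned, review = dict(record), []
    for field, value in record.items():
        if LABEL_PATTERN.match(value):
            review.append(field)  # a form label leaked into a value
    if "distribution" in record:
        cleaned["distribution"], flagged = check_distribution(record["distribution"])
        if flagged and "distribution" not in review:
            review.append("distribution")
    return cleaned, review
```

Matching on a normalized form tolerates differences in case and punctuation, which OCR output might plausibly introduce, while any statement not in the authority list is left untouched and routed to human correction.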
3. Bug fixes and minor changes
ABSTRACT occurs in many fields as keyword
form over multiple pages
change definition of recall to say "#answers" instead of "#fields" - yes
use the tag names for metadata the same as STINET - yes
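One possible reading of the revised recall definition, counting answers rather than fields, is sketched below. The formulas and the data layout (a dict mapping each field name to a list of answer values) are assumptions for illustration, not definitions agreed at the meeting.

```python
def precision_recall(extracted, expected):
    """Compute answer-level precision and recall.

    Both arguments map a field name to the list of answers for that field,
    so a multi-valued field (e.g. several authors) contributes several answers.
    """
    correct = produced = wanted = 0
    for field in set(extracted) | set(expected):
        got = extracted.get(field, [])
        want = expected.get(field, [])
        produced += len(got)                  # answers the extractor produced
        wanted += len(want)                   # answers in the reference data
        correct += len(set(got) & set(want))  # exact-match answers
    precision = correct / produced if produced else 0.0
    recall = correct / wanted if wanted else 0.0
    return precision, recall
```

Under this reading, a document with three authors of which two are extracted correctly counts as two of three answers, rather than as a single field that is simply right or wrong.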
4. Automation steps
OCR process: how long, how automated? Automate the two steps: open OmniPage in batch mode and run a directory of files, then run our program on the OCR result
use the same naming scheme they use now (date, org) when ADA number not yet assigned
package development (installation & README)
Action: look at OCR process
Action: decide about target environment (Windows, Unix)
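The two-step automation discussed under this item might be driven by a small script like the sketch below. The `ocr_step` and `extract_step` callables are placeholders for the OmniPage batch run and the extraction program, whose actual command lines are not specified here. The `.ocr` and `.xml` extensions and the "<date>_<org>" fallback name are likewise hypothetical stand-ins for whatever scheme DTIC uses.

```python
from pathlib import Path


def output_name(ada_number, date, org):
    """Name a result by ADA number, or by date and organization if none is assigned.

    The "<date>_<org>.xml" pattern is an assumed illustration only.
    """
    if ada_number:
        return f"{ada_number}.xml"
    return f"{date}_{org}.xml"


def run_pipeline(input_dir, work_dir, ocr_step, extract_step):
    """Run OCR over every PDF in input_dir, then extraction on each OCR result."""
    input_dir, work_dir = Path(input_dir), Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for pdf in sorted(input_dir.glob("*.pdf")):
        ocr_output = work_dir / (pdf.stem + ".ocr")
        ocr_step(pdf, ocr_output)                     # step 1: OCR (placeholder)
        results[pdf.name] = extract_step(ocr_output)  # step 2: metadata extraction
    return results
```

Passing the two steps in as callables keeps the driver independent of the target environment (Windows or Unix), which is still to be decided.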
Problem Areas and Corrective Actions
Deviations in Cost/Schedule
Work to be Accomplished Next Period
Create an authority file for distribution and security statements, and post-process the extracted metadata to normalize those statements.
Work with NASA to get access to their documents based on randomly selected IDs.
Automate the form-based metadata extraction process and start packaging the software for delivery.