Extracting Metadata And Structure
Project sponsored by DTIC, GPO, and NASA
DEMOS
     Demo 1

Test bed 39 DTIC documents
Approach Extract metadata from raw XML files obtained after OCR. We divide
the 39 documents into 2 classes and use a template for each document
class. Demo runs on the first page of the scanned PDF document.
OAI Support Yes, populate the extracted metadata into an OAI compliant digital library.
File Types The files that are accessible: original scanned PDF document,
raw XML for the first page, cleaned XML, and the extracted metadata
file in XML after execution.
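
The template-per-class approach can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `<page>`/`<line>` element names, the `class_1` template, and the regex-based field rules are hypothetical and are not the project's actual XML schema or template language.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical cleaned OCR output for a first page; the element names are
# assumptions, not the project's actual XML schema.
CLEAN_XML = """<page number="1">
  <line>Technical Report</line>
  <line>Title: A Study of Metadata Extraction</line>
  <line>Author: J. Smith</line>
  <line>Date: March 2004</line>
</page>"""

# One hypothetical template per document class:
# field name -> regex with a single capture group.
TEMPLATES = {
    "class_1": {
        "title": re.compile(r"^Title:\s*(.+)$"),
        "author": re.compile(r"^Author:\s*(.+)$"),
        "date": re.compile(r"^Date:\s*(.+)$"),
    },
}


def extract_metadata(xml_text, template):
    """Apply each field rule to the page's lines; the first matching line wins."""
    page = ET.fromstring(xml_text)
    lines = [(el.text or "").strip() for el in page.iter("line")]
    metadata = {}
    for field, pattern in template.items():
        for line in lines:
            match = pattern.match(line)
            if match:
                metadata[field] = match.group(1).strip()
                break
    return metadata


def to_metadata_xml(metadata):
    """Serialize the extracted fields as a small XML record."""
    root = ET.Element("metadata")
    for field, value in metadata.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


record = extract_metadata(CLEAN_XML, TEMPLATES["class_1"])
print(to_metadata_xml(record))
```

In this sketch, supporting a second document class would only mean adding another entry to `TEMPLATES`; the extraction function stays unchanged.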

Demo 2

Test bed 546 DTIC documents
Approach Extract metadata from raw XML files obtained after OCR. We divide
the 546 documents into 15 classes and use a template for each
document class. Demo runs on the first page of the scanned PDF document.
OAI Support Yes, populate the extracted metadata into an OAI compliant digital library.
File Types The files that are accessible: cleaned XML, and the extracted
metadata file in XML after execution.
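
As a rough illustration of what populating an OAI-compliant library involves, the sketch below maps extracted fields to an unqualified Dublin Core (`oai_dc`) record, the metadata format every OAI-PMH repository must support. The two namespace URIs are the standard ones; the input field names and the `FIELD_TO_DC` mapping are assumptions, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for unqualified Dublin Core in OAI-PMH.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

# Hypothetical mapping from extracted field names to Dublin Core elements.
FIELD_TO_DC = {"title": "title", "author": "creator", "date": "date"}


def to_oai_dc(metadata):
    """Build an oai_dc record from a dict of extracted fields."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for field, value in metadata.items():
        dc_name = FIELD_TO_DC.get(field)
        if dc_name and value:  # skip unmapped or empty fields
            ET.SubElement(root, f"{{{DC}}}{dc_name}").text = value
    return ET.tostring(root, encoding="unicode")


oai_record = to_oai_dc(
    {"title": "A Study of Metadata Extraction", "author": "J. Smith", "date": "2004-03"}
)
print(oai_record)
```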

Demo 3

Test bed 653 DTIC documents
Approach Extract metadata from raw XML files obtained after OCR. We treat
the 653 documents as one class and use only one template (prepared by
looking at 40 random documents). Demo runs on the first page of the
scanned PDF document.
OAI Support No
File Types The files that are accessible: original scanned PDF documents (first page only)
for the first 100 documents, and raw XML, cleaned XML, and extracted metadata for all
other documents.

Demo 4

Test bed 17 DTIC documents (SF-298)
Approach Extract metadata from raw XML files obtained after OCR.
We take 17 documents of the SF-298 class and use a template for that class.
Demo runs on the first page of the scanned PDF document.
OAI Support Yes, populate the extracted metadata into an OAI compliant digital library.
File Types The files that are accessible: original scanned PDF document,
raw XML for the first page, cleaned XML, and the extracted metadata
file in XML after execution.

Demo 5

Test bed 33 DTIC documents (text-based PDFs)
Approach Extract metadata directly from the PDF files.
Demo runs on the full-text pdf document.
OAI Support No
File Types The files that are accessible: original PDF documents
and the extracted metadata file in XML after execution.

Demo 6

Test bed 1044 DTIC documents (text-based PDFs)
Approach Extract metadata directly from the PDF files.
Demo runs on the full-text pdf document.
OAI Support No
File Types The files that are accessible: original PDF documents
and the extracted metadata file in XML for all documents.

Demo 7

Test bed 18 DTIC color documents
Approach Extract metadata from raw XML files obtained after OCR.
We divide the 18 documents into 8 classes and use a template
for each document class. Demo runs on the coverpage of the
scanned PDF document.
OAI Support Yes
File Types The files that are accessible: original PDF documents, raw XML generated
by OCR, the cleaned XML, and the extracted metadata file in XML after
execution.

Demo 8

Test bed 126 DTIC color documents
Approach Extract metadata from raw XML files obtained after OCR.
Initially the 126 documents are fed to our software, which
produces two classes of documents, resolved and unresolved.
Metadata is extracted directly from the resolved class.
The unresolved class is further classified manually, and
appropriate templates for handling it are prepared.
For resolved class: engine runs on full-length XML (OCR).
For unresolved class: engine runs on only the coverpages (currently).
OAI Support no
File Types The files that are accessible: original PDF documents, raw XML generated by OCR,
and extracted metadata in XML for the resolved class;
clean XML and template files for the unresolved class, for which the
extracted metadata file is available after execution.
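
The resolved/unresolved split can be sketched as follows. The class signatures here are hypothetical stand-ins (the classifier's real rules are not described on this page); the point is only the control flow: a document that matches a known class is resolved, and anything else is set aside for manual classification.

```python
import re

# Hypothetical class signatures; the real classifier's rules are not
# described here, so these patterns are illustrative only.
SIGNATURES = {
    "sf298": re.compile(r"REPORT DOCUMENTATION PAGE", re.IGNORECASE),
    "thesis": re.compile(r"\bTHESIS\b", re.IGNORECASE),
}


def classify(text):
    """Return the first matching class name, or None when nothing matches."""
    for name, signature in SIGNATURES.items():
        if signature.search(text):
            return name
    return None


def partition(documents):
    """Split {doc_id: text} into resolved {doc_id: class} and unresolved [doc_id]."""
    resolved, unresolved = {}, []
    for doc_id, text in documents.items():
        doc_class = classify(text)
        if doc_class is None:
            unresolved.append(doc_id)  # left for manual classification
        else:
            resolved[doc_id] = doc_class
    return resolved, unresolved


docs = {
    "doc1": "Report Documentation Page, Form Approved",
    "doc2": "Annual progress summary with a color cover",
}
resolved, unresolved = partition(docs)
print(resolved, unresolved)
```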

Demo 9a

Test bed 10 GPO Type A and Type B documents without forms
Approach Extract metadata from raw XML files obtained after OCR. We divide
the 10 documents into 2 classes (Type A and Type B) and use a template
for each document type. Engine runs on coverpages only.
OAI Support no
File Types The files that are accessible: original PDF documents,
raw xml generated by the OCR, clean xml files and the
extracted metadata file after execution.

Demo 9b

Test bed 14 GPO Type C documents containing Technical Report Documentation Page
Approach Extract metadata from raw XML files obtained after OCR.
Our software uses a template for extracting metadata from the
Technical Report Documentation Page.
OAI Support no
File Types The files that are accessible: original PDF documents,
raw xml generated by the OCR, and the extracted metadata file
in xml after execution.

Demo 10

Test bed 50 DTIC documents containing sf298 forms
Approach Extract metadata from raw XML files obtained after OCR.
Our software uses 5 templates for different variations of
sf298 forms.
OAI Support no
File Types The files that are accessible: original PDF documents,
raw xml generated by the OCR, and the extracted metadata file
in xml after execution.

Demo 11

Test bed 57 GPO Documents without forms, 14 GPO Documents with forms,
16 Congressional Reports and 16 Public Law Documents
Approach Extract metadata from raw XML files obtained after OCR. We divide
the above documents into 3 classes and develop templates for each
document type. Engine runs on coverpages only.
OAI Support no
File Types The files that are accessible: original PDF documents,
raw xml generated by the OCR, and the extracted metadata file
in xml after execution.

Demo 12

Test bed The DTIC test bed, 9825 PDF documents in total.
Approach We executed our code on the test bed of 9825 documents
and classified the form-based documents (which constitute
up to 9250 documents) into 6 classes.
OAI Support no
File Types The files that are accessible: original PDF documents
and the extracted metadata file, along with statistics on the
total fields/answers produced.
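
One plausible reading of the fields/answers statistics is a per-field count of how many documents yielded a non-empty value. The sketch below illustrates that reading only; the record layout and field names are assumptions, and the demo's actual statistics may be computed differently.

```python
from collections import Counter

# Hypothetical extracted records, keyed by document id.
records = {
    "doc1": {"title": "Report A", "author": "J. Smith", "date": "2004"},
    "doc2": {"title": "Report B", "author": "", "date": "2003"},
    "doc3": {"title": "Report C"},
}


def field_statistics(records):
    """Count, per field, how many documents produced a non-empty value."""
    counts = Counter()
    for metadata in records.values():
        counts.update(field for field, value in metadata.items() if value)
    return counts


stats = field_statistics(records)
total_answers = sum(stats.values())
print(dict(stats), total_answers)
```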

Demo 13

Test bed 50 NASA documents containing sf298 forms
Approach Extract metadata from raw XML files obtained after OCR.
Our software uses 2 templates for different variations of
sf298 forms.
OAI Support no
File Types The files that are accessible: original PDF documents,
raw xml generated by the OCR, and the extracted metadata file
in xml after execution.

Demo 14

Test bed The NASA test bed, 380 PDF documents in total.
Approach We executed our code on the test bed of 380 documents
and classified the form-based documents (which constitute
up to 81 documents) into 2 classes.
OAI Support no
File Types The files that are accessible: original PDF documents
and the extracted metadata file, along with statistics on the
total fields/answers produced.

Demo 15

Test bed 30 NASA documents containing sf298 forms, using IDM approach (same
live demo as demo 13)
Approach Extract metadata from IDM XML files obtained from OmniPage
14 raw XML. Our software uses 2 templates for different
variations of sf298 forms.
OAI Support no
File Types The files that are accessible: original PDF documents,
raw XML generated by the OCR (OmniPage 14), and the extracted metadata file
in xml after execution.

Demo 16

Test bed 50 DTIC documents containing sf298 forms, using IDM approach (same
live demo as demo 10)
Approach Extract metadata from IDM XML files obtained from OmniPage
14 raw XML. Our software uses 6 templates for different
variations of sf298 forms.
OAI Support no
File Types The files that are accessible: original PDF documents,
raw XML generated by the OCR (OmniPage 14), and the extracted metadata file
in xml after execution.

Demo 17

Test bed 157 DTIC non-form documents; used Validator for Classification
Approach After OCR, the OmniPage XML files are converted to IDM. For every document,
the metadata extractor applies multiple templates and produces multiple metadata
files. Based on the existing DTIC metadata statistics, the Validator assigns
confidence scores to the metadata fields in each file, and finally
picks the metadata file that has the highest confidence score.
OAI Support no
File Types The files that are accessible: original PDF documents, metadata from manually
classified approach, metadata picked up by the Validator, and the Validation
results.
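
The Validator's selection step can be sketched as below. The statistics and the scoring rule are assumptions made for illustration: here a field scores 1.0 when its word count falls inside a plausible range and 0.0 otherwise, and a file's score is the mean over all expected fields. The real Validator's statistics and scoring are not described on this page.

```python
# Hypothetical statistics: plausible word-count range for each field.
FIELD_STATS = {
    "title": (3, 30),
    "author": (2, 12),
    "date": (1, 4),
}


def field_confidence(field, value):
    """Score 1.0 if the value's word count is within the expected range, else 0.0."""
    if not value:
        return 0.0
    low, high = FIELD_STATS[field]
    return 1.0 if low <= len(value.split()) <= high else 0.0


def file_confidence(metadata):
    """Mean field confidence over all expected fields; missing fields score 0.0."""
    scores = [field_confidence(f, metadata.get(f, "")) for f in FIELD_STATS]
    return sum(scores) / len(scores)


def pick_best(candidates):
    """Return the template name whose metadata file has the highest confidence."""
    return max(candidates, key=lambda name: file_confidence(candidates[name]))


# One metadata file per applied template (hypothetical extraction results).
candidates = {
    "template_a": {
        "title": "Analysis of Composite Materials Under Load",
        "author": "J. Smith",
        "date": "March 2004",
    },
    "template_b": {
        "title": "J. Smith",
        "author": "",
        "date": "Analysis of Composite Materials Under Load March 2004",
    },
}
best = pick_best(candidates)
print(best, file_confidence(candidates[best]))
```

Because the score depends only on the extracted values, this kind of selection needs no prior knowledge of which class a document belongs to; the best-fitting template is the one whose output looks most like known metadata.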


Old Dominion University Digital Library Group. extract@cs.odu.edu