Templates Developed for GPO EPA Collection

Last Updated: 7/23/2008

Status

This status page provides access to a list of all document layout templates developed for the GPO EPA collection. The columns in the status page are:

Template:
A link to the actual document template describing the placement of metadata within this group of documents.
Sample Doc:
A link to an example (PDF document) of this particular document layout.
Expected Output:
Links to the "raw" output expected from the extraction engine when applied to the sample document. This expected metadata is human-generated and is used to test the templates. In full operation, this raw metadata would be passed through post-processing to convert values to standard forms, then through validation to determine whether the extracted data appears trustworthy, and, finally, through conversion to MARC XML. In viewing this data, keep the following in mind:
Test Code:
The actual code used to test the template on the sample document. This is provided mainly to facilitate review of the testing process by members of the ODU team.
Status:
Notes indicating whether a template has been written and tested. In particular, "Passed" indicates that the template and at least one test have been written, the test has been reviewed by a senior member of the ODU team, and the template has been run against the sample document and has passed that test.
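
As a rough illustration of the check behind a "Passed" status, the sketch below compares raw extracted metadata with the human-generated expected output and reports any fields that differ. It is only a sketch under stated assumptions: the function names, the field names (title, date), and the sample values are hypothetical and are not taken from the ODU test code or from any EPA document.

```python
# Hypothetical sketch: names, fields, and values are illustrative only
# and do not come from the actual ODU test code.

def failed_fields(extracted, expected):
    """Return {field: (extracted_value, expected_value)} for every mismatch."""
    fields = sorted(set(extracted) | set(expected))
    return {
        f: (extracted.get(f), expected.get(f))
        for f in fields
        if extracted.get(f) != expected.get(f)
    }

def template_passes(extracted, expected):
    """A test passes when the raw extracted metadata matches the expected metadata."""
    return not failed_fields(extracted, expected)

# Made-up expected output, as a human might record it for a sample document.
expected = {"title": "Sample Report", "date": "July 2008"}

# Made-up engine output containing an OCR-style error in the date field.
extracted = {"title": "Sample Report", "date": "Ju1y 2008"}

print(template_passes(extracted, expected))
print(failed_fields(extracted, expected))
```

Reporting the mismatched fields, rather than only pass or fail, is a design choice that makes it easier for a reviewer to see whether a failure comes from the template or from the sample document.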

Distribution

The table below summarizes the relationship between the templates developed and the sample EPA collection provided by GPO as of 7/23/2008.

Layouts were selected for development if they appeared to have at least 5 examples among the sample collection. (Template development is assumed to be cost-effective only if a layout occurs frequently enough that manually extracting the metadata would take more effort than developing the template.)

The templates developed cover 633 of the 994 documents (64%) in the sample EPA collection. Another 78 documents fell into layout classes with fewer than 5 member documents, and the remaining 283 documents were singleton classes.
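
The selection rule and the coverage figures above can be checked with a short sketch. The threshold of 5 and the counts 633, 78, 283, and 994 come from this page; the function name and the layout labels (layout_a, layout_b, layout_c) are hypothetical and stand in for whatever layout classification was actually used.

```python
from collections import Counter

MIN_EXAMPLES = 5  # selection threshold stated on this page

def select_layouts(layout_labels, min_examples=MIN_EXAMPLES):
    """Return {layout: count} for layouts with at least min_examples documents."""
    counts = Counter(layout_labels)
    return {name: n for name, n in counts.items() if n >= min_examples}

# Hypothetical labels, one per document, purely for illustration.
labels = ["layout_a"] * 6 + ["layout_b"] * 3 + ["layout_c"]
selected = select_layouts(labels)  # only layout_a meets the threshold

# Counts reported on this page.
covered, small_classes, singletons, total = 633, 78, 283, 994

# The three groups should account for every document in the sample.
assert covered + small_classes + singletons == total

coverage_pct = 100 * covered / total
print(f"{coverage_pct:.1f}% of documents covered")
```

The computed coverage rounds to the 64% quoted above.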

Template Name             % of Collection  Notes
5centered                 0.6%
cosortregist              0.5%
coverHeader               5.7%
epa_nhx_sop               26.1%            Variations of similar docs
epa_ord_study
epa_ord_labelled_title
facts-header              2.6%             Variations of similar docs
facts-header_2
facts-header_3
glossy_1                  1.7%             Variations of similar docs
glossy_2
glossy_3
glossy_4
header2col                0.9%
hpv-assess                0.5%
hpvc                      2.0%
HPVTest                   0.6%
iuclid                    1.8%             Unclear what metadata to extract. Also, consistent use of date stamp is likely to cause frequent OCR errors.
nerl                      0.6%
proceedings               1.6%
robustsum                 0.7%
Robustsum2                1.0%
rsrchdev                  0.6%
submission                0.6%
TestOverview              1.1%
testplan                  6.6%
title2col                 2.5%
titlecorp                 1.4%             Variations of similar docs
titlecorp2
tmdl                      2.0%
toxreview                 0.8%