Form Template Writers' Manual

June 28, 2010

Digital Library Research Group, Old Dominion University

1. Introduction

This document describes the current state of the template language for processing of form documents in the Extract metadata extraction system. Updated versions of this manual can be found at http://www.cs.odu.edu/~extract/developersWiki/doku.php?id=extract:manuals

Readers unfamiliar with the overall purpose and organization of this system are referred to this paper. This document concentrates upon the portion of the overall system associated with extraction of raw metadata from documents that contain “report data pages” (RDP), metadata-bearing forms. Such documents are processed by application of the best-matching of several templates, each template designed to describe a distinct RDP.

The template language is derived from the PhD thesis by J. Tang and retains the overall structure defined there, though significant additions and modifications have been made since then.

At the time of extraction, documents (originally received as PDF files) have been converted to IDM, a tree-like structure in which a document is divided into pages, pages divided into regions, regions into smaller regions and paragraphs, and paragraphs into lines of words.

The “form template engine” constructs a “scroll” consisting of all the successive lines making up the document and attempts to match this against a form template. (Because this scroll ignores page boundaries, forms can span multiple pages.)
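
The flattening step can be pictured with a small sketch. The nested-dictionary layout used here for the IDM (keys `pages`, `regions`, `paragraphs`, `lines`) and the function name `build_scroll` are illustrative assumptions, not the actual IDM format or engine API.

```python
def build_scroll(document):
    """Flatten a nested IDM-like document into one list of lines.

    Assumed (hypothetical) layout: a document has "pages"; a page or region
    has "regions" and/or "paragraphs"; a paragraph has "lines" (strings).
    """
    scroll = []

    def visit(node):
        # Simplification: nested regions are visited before this node's
        # own paragraphs.
        for region in node.get("regions", []):
            visit(region)
        for paragraph in node.get("paragraphs", []):
            scroll.extend(paragraph.get("lines", []))

    for page in document.get("pages", []):
        visit(page)  # page boundaries are deliberately not recorded
    return scroll


doc = {"pages": [
    {"regions": [{"paragraphs": [{"lines": ["1. REPORT DATE", "28-06-2010"]}]}]},
    {"regions": [{"regions": [{"paragraphs": [{"lines": ["14. ABSTRACT"]}]}]}]},
]}
print(build_scroll(doc))
```

Because the result carries no page markers, a form that continues onto a second page looks the same to the matcher as one that fits on a single page.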

2. The Template Language

2.1 Overview

The essential structure of the form template language is as follows:

  • A form consists of a number of rectangular regions called cells. Each cell contains a unique identifying label of some kind. Additional text in the cell may constitute a metadata value identified by that label.
    • For example, a cell might be labeled “2. REPORT TYPE”. In a particular document, this might have been filled in by placing the text “Final report” below that label.
    • Cells are frequently, though not always, separated on the page by lines forming boxes around each cell.
  • A template describes the layout of such an RDP form.
    • Each template has a unique name, called the template ID.
    • Each template has a list of one or more cell labels.
    • Each template has one or more rules describing the geometry of the labels and of any metadata text that we wish to extract.
      • For example, we might specify that metadata for an UnclassifiedTitle field is located below the label “4. TITLE AND SUBTITLE”, to the left of a label “5a. CONTRACT NUMBER”, and above a label “6. AUTHOR(S)”.
      • Note that geometry rules not only tell us where to find desired metadata but also constrain the possible arrangements of labels on the page. In practice, such constraints suffice to distinguish one form from another (even when both forms contain similar sets of metadata) and also to distinguish a form from other incidental occurrences of the label text. For example, the same label text might appear in a page of instructions on how to fill out a form, but these incidental uses will almost certainly not satisfy the geometric constraints of the real form.
    • A template may also specify any number of patterns for lines to exclude (typically lines that appear in the top and bottom margins of the page) from metadata when processing the form.

2.2 Applying Templates to Documents

As noted earlier, before a template is applied to a document, the document has been converted from the input format (PDF) into IDM. IDM is a hierarchical structure in which words are contained within lines, lines within paragraphs, paragraphs within regions, and so on up to pages that are contained inside the document itself.

Templates, however, are based on a line-by-line view of the document. They are applied to a scroll, a linear list of the lines that make up the pages of the document.

Application of a template begins with an attempt to locate all of the cell labels defined for that form. Each cell may have several alternate labels (rewording of the text in sf298 and other forms is surprisingly common), any one of which will be accepted as identifying the cell. If more than one match is found in the document, the geometric rules are used to discard matches that are inconsistent with the specified layout of the form.

All attempts to locate cell labels are accomplished by fuzzy matching, meaning that a certain number of misspellings and alterations are permitted. The exact number varies with the size of the label. A label with only a few characters may need to match all but one character. A label with hundreds of characters may be accepted with tens of character mismatches.
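
A minimal sketch of length-dependent fuzzy matching follows. It uses plain Levenshtein edit distance, and the tolerance rule (about 10% of the label length, never fewer than one character) is an assumption chosen only to mirror the proportions described above. The engine's actual algorithm and thresholds are not documented here, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def allowed_mismatches(label, tolerance=0.1):
    # Hypothetical rule: about 10% of the label length, but at least 1.
    return max(1, int(len(label) * tolerance))


def label_matches(label, line):
    # Case is ignored in this sketch.
    distance = edit_distance(label.lower(), line.lower())
    return distance <= allowed_mismatches(label)


print(label_matches("2. REPORT TYPE", "2. REP0RT TYPE"))  # one OCR slip
print(label_matches("2. REPORT TYPE", "2. REPORT DATE"))  # a different label
```

Under this rule a 14-character label tolerates a single wrong character, while a 300-character label would tolerate 30.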

It's possible that some cells might not be located at all. Typically, this arises because of severe OCR errors encountered when interpreting a scanned image of the form. There is a threshold set by the program governing the minimum percentage of cells that must be identified if processing is to continue.
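
The threshold check can be sketched as follows. The function name and the default of 0.5 are placeholders; the real value is set inside the program and is not stated in this manual.

```python
def enough_cells_located(all_cells, located_cells, min_fraction=0.5):
    """Return True if the share of located cells reaches the threshold.

    min_fraction=0.5 is a placeholder, not the program's actual setting.
    """
    all_cells = set(all_cells)
    if not all_cells:
        return False
    found = all_cells & set(located_cells)
    return len(found) / len(all_cells) >= min_fraction


print(enough_cells_located(["1", "2", "3", "4"], ["1", "3"]))
```

Cells are counted by their identifier, so a cell located through any one of its alternate labels counts once.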

Once the cell labels have been located, then the geometry rules are used to define the regions of the document pages that “belong to” each cell.

The remaining text (not part of a cell label) of the pages containing the form is then processed, line by line, to see which lines fall within those regions. Any such text is extracted as the value of the metadata field associated with that cell.

2.3 Details of the Template Language

Figure 1 shows a template example. Each desired metadata item is described by a rule set designating the beginning and the end of the metadata. The rules are limited by features detectable at the line level resolution. The names of the metadata fields can vary from one document collection to another.

 <template name="sf298_demo">
  <form>
   <fixed>
      <field num="1"><line>1. REPORT DATE (DD-MM-YYYY)</line></field>
      <field num="1"><line>1. REPORT DATE (DD-MM-YY)</line></field>
      <field num="1"><line>1. REPORT DATE</line></field>
      <field num="2"><line>2. REPORT TYPE</line></field>
      <field num="3"><line>3. DATES COVERED (FROM - TO)</line></field>
      <field num="3"><line>3. DATES COVERED</line></field>
      <field num="4"><line>4. TITLE AND SUBTITLE</line></field>
      ...
   </fixed>
   <extracted>
     <metadata name="ReportDate_ddmmyyyy">
        <rule relation="belowof" field="1"/>
        <rule relation="leftof" field="2"/>
        <rule relation="aboveof" field="4|5a"/>
      </metadata>
      <metadata name="DescriptiveNote">
         <rule relation="belowof" field="2"/>
         <rule relation="rightof" field="1"/>
         <rule relation="leftof" field="3"/>
         <rule relation="aboveof" field="4|5a"/>
      </metadata>
      <metadata name="DatesCovered_ddmmyyyy">
         <rule relation="belowof" field="3"/>
         <rule relation="rightof" field="2"/>
         <rule relation="aboveof" field="4|5a"/>
      </metadata>
      <metadata name="UnclassifiedTitle">
         <rule relation="belowof" field="4"/>
         <rule relation="leftof" field="5a|5b|5c|5d|5e"/>
         <rule relation="aboveof" field="6|5d"/>
      </metadata>
        ...
   </extracted>
   <exclude>\QSECURITY CLASSIFICATION OF FORM\E</exclude>
   <exclude>\QUNCLASSIFIED\E</exclude>
  </form>
</template>

Fig. 1. Form Template fragment

The template language is expressed in XML. This is not a fundamental limitation. Almost all executable languages are processed by translation into a tree-structured description (generally referred to as an “abstract syntax tree”). XML is a standard notation for exchanging tree-structured data, and so it is convenient for a research/exploratory project to introduce new languages directly in the syntax tree format and bypass the otherwise distracting process of providing higher level translation. XML is not particularly easy to read and write, and a later section of this document describes a preliminary approach to provide a more writer-friendly way to generate templates.

In the remainder of this section we cover the XML elements that make up a template. (Keep in mind that, although templates are stored in XML, the Form Maker tool allows you to edit them without worrying about the finicky details of getting the XML correct.)

2.3.1 The <template>

The top-level element in any template is <template>. This serves as a container for one or more metadata field descriptions as described below. The template has a single attribute:

Attribute | Required | Possible values | Default | Meaning
name      | yes      | word            |         | Unique identifier for this template

Example:

 <template name="sf298_demo">
      ...
 </template>

2.3.2 The <form>

Inside the <template> element is a <form> element. This is of no particular significance, but is a holdover from older versions of the form template language.

<form>
Attribute | Required | Possible values | Default | Meaning
max       | no       | integer         |         | ignored in current system

Example:

 <template name="sf298_demo">
    <form max="-1">
      ...
    </form>
 </template>

The <form> contains up to four components: a list of matching strings, the cell labels, the geometry rules, and a set of patterns for excluded lines.

2.3.3 The <match>

Earlier versions of the form extraction engine located a form within the document by searching for several “match” strings on the page containing the form. This approach has been abandoned because it was not sufficiently flexible to deal with all forms, nor with forms that spanned multiple pages.

However, many form templates will still contain the <match> element. This may be ignored and, if desired, deleted.

2.3.4 The <fixed> (Cell Labels) Section

This section of the template lists the cell labels. It takes its name from the idea that the labels define the “fixed” content of the form, as opposed to the actual data entered into the form, which will vary from one document to the next.

The fixed area contains an arbitrary number of field items.

Each <field> item describes a single cell label.

<field>
Attribute | Required | Possible values | Default | Meaning
num       | yes      | name            | N/A     | An arbitrary but unique identifier for the cell of the form where this label appears.

Example:

 <template name="sf298_demo">
  <form>
   <fixed>
      <field num="1"><line>1. REPORT DATE (DD-MM-YYYY)</line></field>
      <field num="1"><line>1. REPORT DATE (DD-MM-YY)</line></field>
      <field num="1"><line>1. REPORT DATE</line></field>
      <field num="2"><line>2. REPORT TYPE</line></field>
      ...
   </fixed>
   ...
  </form>
 </template>

The num attribute can be any name string (made of alphabetic and numeric characters and/or the punctuation symbols . - _). It is a name for the particular cell of the form that we are describing. It is not a name for the cell label that we are describing - it's possible to have multiple, alternate labels for the same cell, reflecting different wordings in different versions of the same form, as shown in the example above, where three alternate labels for the report date cell have been identified.

Most template authors use the numeric identifier of the cell for the num (hence the attribute name, “num”). Others prefer to use an abbreviation that describes the contents of the cell. For example, in the sample above we might have written:

      <field num="reportDate"><line>1. REPORT DATE (DD-MM-YYYY)</line></field>
      <field num="reportDate"><line>1. REPORT DATE (DD-MM-YY)</line></field>
      <field num="reportDate"><line>1. REPORT DATE</line></field>
      <field num="reportType"><line>2. REPORT TYPE</line></field>

Within each field is a <line> element, and within the <line> element is the actual text for the cell label.

Many forms have quite sizable amounts of fixed text in some cells, often containing instructions on how to fill out that part of the form. For example, in another form, we have:

<field num="descNote"><line>7. DESCRIPTIVE NOTES (The category of
         the document, e.g. technical report, technical note or memorandum.
         If appropriate, enter the type of document, e.g. interim,
         progress, summary, annual or final. Give the inclusive dates when
         a specific reporting period is covered.)</line></field>

When these get extremely long, it is sometimes convenient to pretend that there are two cells, one labeled with a few of the beginning lines of the label text and the other with a few of the ending lines of that text. Then, in the geometry rules, we simply note that the second cell is immediately below the first.

2.3.5 The <extracted> (Geometry Rules) Section

The cell labels define a set of known locations on the pages of the form. The Geometry Rules explain how those locations should relate to one another on the page. In doing so, the rules serve two purposes.

  1. They make it possible to confirm whether we are looking at the correct form: if we were to find some of the same cell labels arranged differently, we may very well be looking at an instance of some other form.
  2. They tell us how to associate any text (that is not part of a cell label) with a cell and, possibly, with a metadata field described by that cell.

The <extracted> element contains an arbitrary number of <metadata> elements.

Each <metadata> element describes a rectangular region of the document. This region can be given a name, in which case any text found in this region will be extracted as a value for a metadata field of that name.

<metadata>
Attribute | Required | Possible values          | Default | Meaning
name      | yes      | name or empty string "" | N/A     | The name of a metadata field, or "" if we prefer to ignore any text appearing in this cell.

Example:

 <template name="sf298_demo">
  <form>
   <fixed>
      <field num="1"><line>1. REPORT DATE (DD-MM-YYYY)</line></field>
      ...
      <field num="2"><line>2. REPORT TYPE</line></field>
      ...
   </fixed>
   <extracted>
     <metadata name="ReportDate_ddmmyyyy">
        ...
      </metadata>
      <metadata name="DescriptiveNote">
        ...
      </metadata>
        ...
   </extracted>
  </form>
</template>

The example above shows two cells that are going to be associated with metadata fields, named “ReportDate_ddmmyyyy” (later in the processing this will be rewritten to “ReportDate”) and “DescriptiveNote”, respectively.

The above example does not indicate how these cells are related to the cell labels above them. That relationship is established by one or more <rule> elements that appear inside each <metadata> item.

<rule>
Attribute | Required | Possible values                                | Default | Meaning
relation  | yes      | “aboveof”, “belowof”, “leftof”, or “rightof” | N/A     | How this rectangular area of the form page(s) relates to a set of previously-defined cell labels.
field     | yes      | A list of <field> nums                         | N/A     | A list of one or more num values from the earlier <field>s.

More than one field name can be listed in the field attribute, in which case they are separated by '|' characters (e.g., '4|5a').

Example:

 <template name="sf298_demo">
  <form>
   <fixed>
      <field num="1"><line>1. REPORT DATE (DD-MM-YYYY)</line></field>
      ...
      <field num="2"><line>2. REPORT TYPE</line></field>
      ...
   </fixed>
   <extracted>
     <metadata name="ReportDate_ddmmyyyy">
        <rule relation="belowof" field="1"/>
        <rule relation="leftof" field="2"/>
        <rule relation="aboveof" field="4|5a"/>
      </metadata>
      <metadata name="DescriptiveNote">
         <rule relation="belowof" field="2"/>
         <rule relation="rightof" field="1"/>
         <rule relation="leftof" field="3"/>
         <rule relation="aboveof" field="4|5a"/>
      </metadata>
        ...
   </extracted>
  </form>
</template>

The second <metadata> element above indicates that we have a cell below the cell label named “2”, which in the earlier <fixed> section is described as “2. REPORT TYPE”. That cell is positioned to the right of the cell label 1 (“1. REPORT DATE (DD-MM-YYYY)”), to the left of the cell label 3, and above both cell labels 4 and 5a.
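
To see how these elements fit together as data, the following sketch reads a shortened template with Python's standard `xml.etree.ElementTree` module. It is not the engine's own loader. The helper name `load_template` and the returned dictionaries are illustrative assumptions. The shortened template omits the labels for cells 3, 4, and 5a that a complete template would define.

```python
import xml.etree.ElementTree as ET

TEMPLATE = """
<template name="sf298_demo">
  <form>
    <fixed>
      <field num="1"><line>1. REPORT DATE (DD-MM-YYYY)</line></field>
      <field num="1"><line>1. REPORT DATE</line></field>
      <field num="2"><line>2. REPORT TYPE</line></field>
    </fixed>
    <extracted>
      <metadata name="DescriptiveNote">
        <rule relation="belowof" field="2"/>
        <rule relation="rightof" field="1"/>
        <rule relation="leftof" field="3"/>
        <rule relation="aboveof" field="4|5a"/>
      </metadata>
    </extracted>
  </form>
</template>
"""


def load_template(xml_text):
    root = ET.fromstring(xml_text)
    labels = {}  # cell num -> list of alternate label texts
    for field in root.iter("field"):
        labels.setdefault(field.get("num"), []).append(field.findtext("line"))
    rules = {}  # metadata name -> {relation: [cell nums]}
    for metadata in root.iter("metadata"):
        rules[metadata.get("name")] = {
            rule.get("relation"): rule.get("field").split("|")
            for rule in metadata.iter("rule")
        }
    return root.get("name"), labels, rules


name, labels, rules = load_template(TEMPLATE)
print(name, labels["1"], rules["DescriptiveNote"]["aboveof"])
```

Grouping labels by num makes the "several alternate labels, one cell" relationship explicit, and splitting the field attribute on '|' yields the list of cells that a rule refers to.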

It's important to keep in mind that the purpose of these rules is to define rectangular areas of the page. In forms with lines that visibly divide the page into boxes, the rectangular regions that we define in this manner are similar to, but usually not identical to, the visible boxes. For example, the rules given above would result in the leftmost two of the top row of rectangular cells shown as the colored regions below:

These differ slightly from the original boxes in the form:

  • The top-left corner of each cell is actually the top-left corner of the text making up the cell label.
  • The bottom and right edges of the cell extend to the labels of other cells that the geometry rules state that this cell is aboveof and leftof.
  • If there are no other cells below or to the right of a cell, then that cell extends all the way to the bottom or right page margin, respectively. You can see this behavior, in the picture above, with the cells on the extreme right.
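
The three points above can be sketched as a small rectangle computation. This is one plausible reading of the description, not the engine's code. The function name, the coordinate convention (x grows rightwards, y grows downwards), and the page size in the example are assumptions. The rightof relation is not needed for this calculation and is ignored.

```python
def cell_rectangle(rules, label_corners, page_width, page_height):
    """Return (left, top, right, bottom) for one <metadata> cell.

    rules: {"belowof": [...], "aboveof": [...], "leftof": [...]} (cell nums)
    label_corners: {cell num: (x, y)} top-left corner of each located label
    """
    own = [label_corners[n] for n in rules.get("belowof", [])
           if n in label_corners]
    if not own:
        return None  # the cell's own label was not located
    left, top = own[0]
    # Labels that were not located are skipped, so listing several nums
    # (e.g. "4|5a") still yields an edge if at least one was found.
    below = [label_corners[n][1] for n in rules.get("aboveof", [])
             if n in label_corners]
    beside = [label_corners[n][0] for n in rules.get("leftof", [])
              if n in label_corners]
    bottom = min(below) if below else page_height
    right = min(beside) if beside else page_width
    return (left, top, right, bottom)


# Label "4" was not located (say, because of OCR damage); "5a" was.
corners = {"1": (50, 100), "2": (250, 100), "5a": (400, 180)}
rules = {"belowof": ["1"], "leftof": ["2"], "aboveof": ["4", "5a"]}
print(cell_rectangle(rules, corners, page_width=612, page_height=792))
```

A cell with no aboveof or leftof rule falls through to the page dimensions, which corresponds to the "extends to the margin" behavior in the last point.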

There are several reasons why we don't simply rely on the boxes appearing on the form:

  1. Not every form has these boxes.
  2. If the PDF file is passed through an OCR program, all information about these lines/boxes is lost. (When our program is able to process a “born-digital” PDF file without invoking OCR, we usually do recover the lines, and use them in early processing to help divide the text into lines and paragraphs.)
  3. Many people who fill out these forms are not particularly careful about staying within the boxes. When assigning lines of text to cells, the program will look at the center point of the line of text and assign the line to whichever cell within which that center point falls. This allows the program to be fairly forgiving of lines of text that cross the box boundaries.
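
The center-point test in the last item can be sketched as follows. The tuple layout for boxes and the function name are assumptions made for illustration.

```python
def assign_line(line_box, cells):
    """Return the name of the cell containing the line's center point.

    line_box and each cell rectangle are (left, top, right, bottom) tuples.
    """
    left, top, right, bottom = line_box
    cx, cy = (left + right) / 2, (top + bottom) / 2
    for name, (c_left, c_top, c_right, c_bottom) in cells.items():
        if c_left <= cx < c_right and c_top <= cy < c_bottom:
            return name
    return None  # the center lies outside every cell


cells = {
    "ReportDate_ddmmyyyy": (50, 100, 250, 180),
    "DescriptiveNote": (250, 100, 450, 180),
}
# This line starts in the first cell and spills past the boundary at x=250,
# but its center (x=210) is still inside the first cell.
print(assign_line((120, 120, 300, 132), cells))
```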

Redundancy in the Geometry Rules

The notation for entering the geometry rules allows for a certain redundancy. If, for example, cell 2 is to the rightof cell 1, do we really need to state that cell 1 is leftof cell 2?

The answer is, no, the program can handle the left-right relationship after being told only once. But we recommend that you put both in anyway. A common problem in writing form templates is forgetting to enter some of the geometric relationships. It's much easier to keep track of just what you have successfully entered so far if you get into the habit of making sure that every cell is described by all four directions, if possible. Cells on the left edge of the form will not have a rightof; cells on the right edge of the form will not have a leftof; and cells at the bottom of a form will not have an aboveof. But those omissions can be checked visually without difficulty. Note that all cells will have a belowof, because it's not a cell without a cell label.

Having a little bit of redundancy also helps make the template more robust. OCR errors can sometimes make it impossible to recognize a few cell labels, particularly if people have typed their text over part of the label or, in a few badly designed forms, if the lines/boxes run through the text of the labels. A bit of redundancy can help the engine to recover the remaining cell structure of the form.

There's another sense in which the geometry rules could be redundant - transitivity. If cell 2 is aboveof cell 4 and cell 4 is aboveof cell 6 and cell 6 is aboveof cell 7, could you say that cell 2 is aboveof “4|6|7”?

The answer is that, yes you could, but you certainly don't have to. This gets tedious very quickly, though. Unless you discover that a form is particularly prone to OCR errors that make it hard to recognize one of these “intermediate” cell labels, it probably is not worth it.
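
If you do decide to spell out the transitive entries for a troublesome form, they can be derived mechanically. The helper below is purely hypothetical (it is not part of the Extract tools) and assumes you have the direct aboveof relationships in a dictionary.

```python
def expand_aboveof(direct, start):
    """List every cell reachable from `start` by following aboveof links."""
    ordered, pending = [], list(direct.get(start, []))
    while pending:
        cell = pending.pop(0)
        if cell not in ordered and cell != start:
            ordered.append(cell)
            pending.extend(direct.get(cell, []))
    return "|".join(ordered)


direct = {"2": ["4"], "4": ["6"], "6": ["7"]}
print(expand_aboveof(direct, "2"))  # 4|6|7
```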

2.3.6 The <exclude> Section

Some forms contain text that appears at the top or bottom of the form or page but does not belong to any cell.

  • For example, some sf298 forms always contain the text “sf298” followed by version and date information at the bottom of the form. However, because this information only appears at the very end of the form, it's actually useful to treat it as a cell label. We can then use it in geometry rules to indicate, for example, where to find the bottom border of the “Telephone Number” and other fields from the bottom row of the form.
  • On the other hand, this form always has the strings “SECURITY CLASSIFICATION OF FORM” and “UNCLASSIFIED” (or some other classification indicator) on the top and bottom of each page, even when the form is split across multiple pages. Because we can't predict where those page breaks will always occur (they could, for example, break right in the middle of a long abstract), it is useful to indicate that these strings are not part of any adjacent cells.

Such strings can be excluded by adding one or more <exclude> elements after the geometry rules. Each <exclude> element describes a single line that, if it is encountered within the top and/or bottom margin, indicates that we have passed outside the borders of the actual form. In that case, that line, and all lines between it and the edge of the page, will be ignored.

<exclude>
Attribute | Required | Possible values | Default | Meaning
margin    | no       | 't' or 'b'      |         | Indicates that this pattern can be found in the top (t) margin only, or in the bottom (b) margin only. If no margin value is specified or if margin='' is stated, then the pattern can be found in either the top or bottom.

Example:

        ...
   </extracted>
   <exclude>\QSECURITY CLASSIFICATION OF FORM\E</exclude>
   <exclude>\QUNCLASSIFIED\E</exclude>
  </form>
</template>
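
The effect of an <exclude> pattern on one page can be sketched as below. Several things here are assumptions. The \Q…\E wrappers are the literal-quoting syntax of Java and Perl regular expressions, and the sketch assumes the patterns follow that dialect. Python's `re` module does not support \Q…\E, so the sketch translates it. The notion of "margin" is approximated by a hypothetical `margin_lines` parameter (the first or last few lines of a page), and the function names are illustrative.

```python
import re


def to_python_regex(pattern):
    r"""Translate \Q...\E literal quoting, which Python's re module lacks."""
    parts = re.split(r"\\Q(.*?)\\E", pattern)
    # Odd-indexed parts are the quoted literals captured by the group.
    return "".join(re.escape(p) if i % 2 else p for i, p in enumerate(parts))


def strip_excluded(page_lines, pattern, margin="", margin_lines=3):
    """Drop a matching margin line and everything between it and the page edge."""
    regex = re.compile(to_python_regex(pattern))
    start, end = 0, len(page_lines)
    if margin in ("", "t"):
        for i in range(min(margin_lines, end)):
            if regex.search(page_lines[i]):
                start = i + 1  # keep scanning: a later match removes more
    if margin in ("", "b"):
        for i in range(end - 1, max(end - margin_lines, start) - 1, -1):
            if regex.search(page_lines[i]):
                end = i
    return page_lines[start:end]


page = ["UNCLASSIFIED", "SECURITY CLASSIFICATION OF FORM",
        "13. ABSTRACT", "text of abstract", "UNCLASSIFIED"]
page = strip_excluded(page, r"\QUNCLASSIFIED\E")
page = strip_excluded(page, r"\QSECURITY CLASSIFICATION OF FORM\E")
print(page)
```

Restricting the search to the margin keeps a word such as "UNCLASSIFIED" inside a long abstract from truncating the page.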

3. The Form Maker Tool

FormMaker is a GUI-based tool to help in the template creation process. It allows a template author to edit templates, adding and modifying cell labels and geometry rules, and to quickly apply the modified template to a set of documents to see the effects of each rule. FormMaker can be used by authors who have no understanding of XML or who simply wish to avoid dealing with the finicky details of writing valid XML.

Because the names of the metadata fields vary from one document collection to another, the form maker tool will be customized slightly for each document collection. General functionality and usage are the same for all the tools.

The FormMaker is still considered experimental. The interface and functions are likely to change in later versions. Even in its current form, however, it remains a useful tool.

Tutorial: Using the Form Maker

3.1. Getting Started

Whether you are creating a new template or modifying an existing one, your starting point should always be to locate a set of documents that illustrate the form. It's best to have at least two sample documents, and more than that may be useful. For convenience, you should probably gather a copy of all your sample documents into a single directory.

In this tutorial, we will prepare a template for a Canadian variant of the RDP, illustrated by ADA480770 and ADA509654.

1. Get a copy of these PDF files and put them into a convenient folder. You might want to open both files in Acrobat Reader or some other PDF viewer and keep those at hand. (Eventually, we'll be doing a lot of copying and pasting of cell labels.)

2. Start the FormMaker program. You will see something like this:

The template maker window is divided into two main parts. On the left is the template we are working with (currently empty) and on the right is a place to view the documents we will be using and the metadata we are able to extract from them.

3. Load the two sample documents. From the File menu, choose “Select Documents”. You will get a conventional file selection dialog. Navigate to the folder where you saved the two PDFs and select both of them.

There will be a short pause while the PDF files are read in and converted to our internal IDM format. Eventually, you will see the documents appear on the right of the window.

What you are seeing is not the original PDF, but a drawing of the IDM obtained from that PDF. Graphics will be missing, as the IDM concentrates on the document text. There may be some irregularities in the display and location of the text - many PDF documents will contain fonts that we cannot display directly on your screen, so the FormMaker tries to choose the best match it can.

You might want to resize your window so that you can see the entire width of a page. You can also adjust the zoom on the page to make it more legible.

You can use the “document:” selector on the top right to choose which sample document to view. [Minor bug in current version: when you select the document, the view may not actually change until you click on one of the tabs.] You can see the tree-structure of the actual IDM on the IDM tab. The remaining tabs will be empty at the moment.

Use the Page control at the bottom of the column to advance to the forms. You will find them near the end of each document. Although the original documents are fairly long, our programs only process the first 5 and last 5 pages of each, so you won't have to go very far.

When you reach the page(s) with the form, you will notice that the lines and boxes are missing. That information is not preserved in our IDM files (and is often not retrieved accurately by OCR programs in any case).

4. Turning our attention to the left part of the window, let's start on the template. In this case, let's set the template name to “canadian-demo”.

3.2. Adding Cell Labels

I recommend working on forms a row at a time. That means that we would repeatedly add a row of cell labels, then a row of geometry rules, then try out what we have entered to be sure that we properly get the metadata, then repeat the whole process, row-by-row, until the form is completed.

However, when getting started we will need to enter two rows worth of cell labels. That's because, when it's time to add geometry rules, we need to be able to say that one label is above the cell and another is below it.

So let's start by adding some cell labels. There's already an empty row in the Cell Labels section, so we can enter our first label (for the Originator cell) there. Double-click in the Cell Name area to start entering text. We'll keep the cell names simple by using the numbers. So enter 1 into that box. Double-click on the Label box and enter the entire text of the label (from “1. ORIGINATOR (The name” to “in section 8.)”). The easiest way to do this is to go to the original ADA480770 PDF file that, hopefully, you have open in Reader or some similar program and copy-and-paste the text from there into the Label box.

Now let's move on to cell 2. Click upon the “+” button in the bottom label row to create a new row. Enter the name 2 and the label “2. Security … applicable.)”.

Then do the same for cell 3. Use the name 3 and the label “3. Title … the title)”.

3.3. Adding Geometry Rules

We now have enough labels in place to allow us to enter the geometry rules for the top row of cells. We do this in the Geometry section.

Any text entered into cell 1 would appear below label 1, to the left of label 2, and above label 3. So, in the empty geometry rule row, double-click in the below column and enter 1, in the above column enter 3, and in the left of column enter 2. Usually, we would also make an entry in the Metadata Field column to identify in what metadata field the text in that cell should be placed. But, as it happens, we do not actually use the information in this cell. (Information on originators, a.k.a. corporate authors, is recorded separately as a special code.) So leave this cell empty.

Any text entered into cell 2 would appear below label 2, to the right of label 1, and above label 3. So click on the “+” button in our first rule to create a new empty rule below that. In the new geometry rule row, double-click in the below column and enter 2, in the above column enter 3, and in the right of column enter 1. Then click in the Metadata Field column. A small arrow should appear. (If it doesn't, try double-clicking.) Click the arrow to see a list of the commonly used metadata field names. For this field, choose CitationClassification. Click outside that column to fix your selection.

3.4. Trying It Out

We now have enough information entered to try out our first row. First, though, let's save our work. From the File menu, select Save Template. Save the template as canadian-demo.xml in any convenient directory. (The normal place to save templates is in a “Work” directory, …/Work/form-templates/dtic/, which should actually come up as the default save location. Templates stored in the Work area will be given precedence over templates with the same name that were distributed with the program and will also be preserved if the software is later upgraded.)

Now, let's apply the template to our two sample documents. From the Extract menu, select Apply template to docs. This will take a few seconds. Watch for a change in the status line at the bottom of the program window. After it changes to “Extractions Completed” (or, if you are unlucky, “Problem extracting from…”), then in the right column select ADA480770 and click on the Extracted tab. You should see that we have successfully extracted the CitationClassification value for this document.

Now select ADA509654. [If the metadata display does not change, click on the IDM tab and then again on the Extracted tab. This is a bug that will be fixed in later versions.] For this document, you will see that we were not successful.

Now, look closely at the cells in ADA480770 and ADA509654. You'll see that the labels are not actually identical. For example, in cell 1, aside from a change in capitalization (“The” ⇒ “the”), which would not affect our matching, there are two wording changes (“for whom” ⇒ “for who” and “Center” ⇒ “Establishment”) that, together, would likely prevent our matching the label text. So, we will next add an alternate label for the same cell.

Click upon the “+” button in our current cell label 1 row to add a new row after it. In the new row, place the same name (“1”) and the alternate text for that label. Again, copy-and-paste from the PDF of ADA509654 will speed this up significantly. [In a future version, we hope to support copy-and-paste from the IDM tab.]

You'll find similar small rewordings in the labels of both cells 2 and 3. So repeat the process for each of them, supplying the alternate wordings from ADA509654. At the end, you should have a total of 6 labels, looking something like this:

Again, save your work and try applying the template. This time, you should get the CitationClassification for both documents.

3.5. The Second Row

From here on, you can work row by row, adding a new row of labels, then a new row of geometry rules, then trying out what you have done to be sure it works.

So let's add the cell label for the authors cell, #4. This time, being forewarned, you should not be surprised to see that the wording of the label is quite different in the two documents. In fact, one document instructs the person filling out the form to enter names starting with the first name; the other instructs the person to enter them with the last name first.

Add the two alternate labels for cell 4. You should wind up with this:

With that label in place, we have all that we need to define the location of the title text in the form. Add a new geometry rule indicating that an “UnclassifiedTitle” can be found below cell label 3 and above cell label 4.

Go to the Extract menu and try applying the template. If all is well, you should successfully extract the title for both documents:

3.6. The Third Row

On to the third row. Start by entering in the new cell labels for cells 5, 6a, and 6b. Again, you will find that the wording on these is changed from one document to the next.

(By now you might be wondering why we bother treating these two documents as being samples of the same form. Could we not regard these as two separate forms, processed by two separate templates? The answer is that, yes, we could do that. But our experience so far has indicated that these kinds of rewording are quite common. Treating each set of rewordings as a separate form would greatly increase the number of form templates. By combining them, on the other hand, we only have to figure out and enter the geometry rules once for the whole set. In fact, that leads to our operational definition of when two documents actually contain the same form: if the forms contain the same metadata fields arranged geometrically (above, below, left & right) the same way, then they are really the same form.)

The geometric rule for the authors field has a slight wrinkle. The metadata field name will be PersonalAuthor, and the text is clearly below cell label 4 and has no labels to the right or the left. But what label should we say that this text is “above of”? We could go with any one of the labels in the third row. But whichever one we choose, we would be leaving out information about the placement of the other two. Such an omission makes the template more “fragile” in the real world, where OCR errors or other problems can sometimes cause a cell label to get garbled. So the preferred solution is to indicate that this text is above of all three, which is done with the above of entry “5|6a|6b”.
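
To see why listing all three labels is less fragile, here is a minimal Python sketch. It is not the Extract engine: it treats the scroll as a plain list of lines, ignores left/right geometry, and matches labels by exact prefix rather than fuzzily. The function name lines_between, the LABELS table, and the sample scroll are all made up for illustration.

```python
# Hypothetical, shortened labels keyed by cell number (illustration only).
LABELS = {
    "4": "4. AUTHORS",
    "5": "5. DATE OF PUBLICATION",
    "6a": "6a. NO. OF PAGES",
    "6b": "6b. NO. OF REFS",
}


def lines_between(scroll, below_of, above_of, labels=LABELS):
    """Collect the lines after the `below_of` label, stopping at the first
    line that starts with any label named in the "|"-separated `above_of`."""
    start = labels[below_of].lower()
    stops = [labels[cell].lower() for cell in above_of.split("|")]
    collected, capturing = [], False
    for line in scroll:
        text = line.strip()
        if not capturing:
            # Skip everything up to and including the starting label.
            capturing = text.lower().startswith(start)
            continue
        if any(text.lower().startswith(stop) for stop in stops):
            break
        collected.append(text)
    return collected


# Label 5 is garbled, as if by an OCR error; label 6a is intact.
scroll = [
    "4. AUTHORS (Last name, first name, middle initial.)",
    "Doe, John E.",
    "S. DATE 0F PUBL1CATION",
    "6a. NO. OF PAGES",
    "42",
]

only_five = lines_between(scroll, "4", "5")
all_three = lines_between(scroll, "4", "5|6a|6b")
```

With only “5” as the above of entry, the garbled label is never found and the field runs on to the end of the scroll. With “5|6a|6b”, the intact label 6a still bounds the field, although the garbled line itself is absorbed in this simplified version.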

Save your work again and try applying the template. You should find that the correct author string is extracted for both documents.

You may recall that applying a template to a document is not even close to being the final step in the metadata extraction process. The “raw” extracted metadata is subjected to post-processing and then validated. Click on the Validated tab to see the effects that these steps would have on the metadata extracted so far.

The most obvious change is that there are now multiple PersonalAuthor entries. The PersonalAuthor field is perhaps the most heavily post-processed metadata field in the DTIC version of the Extract program. Attempts are made to discard any text that is not part of a person's name (e.g., addresses, email addresses), to split lists of names into separate names, and to rewrite each individual name into last-name-first format.

Other fields that get heavily post-processed include fields with dates and availability/distribution statements.

Another difference between the Extracted and Validated tab displays is the scoring that is introduced when the extracted data is validated. The confidence scores attached to each field and to the overall metadata record range from 0 to 1. A zero score indicates that the program is certain that there is something wrong with that extraction. A value of 1 indicates that the program is certain (as certain as it can be) that the value has been extracted correctly. When the confidence value is below 0.5, a warning message is attached that indicates at least one of the reasons why the program has low confidence in the value. If the warning is simply unvalidated, it means that we have no rules in the program for validating that particular metadata field. Any other warnings should be at least examined by a human. Sometimes they will indicate real problems. In other cases (e.g., the warning that this title is longer than the norm for DTIC documents), a human would conclude that the value is correct despite the warning.
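
The review rule described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not code from the Extract program: the needs_review function, the dictionary layout, and every score and warning string below are invented for the example.

```python
def needs_review(fields, threshold=0.5):
    """Return the names of fields a human should look at: confidence below
    the threshold, with a warning other than plain "unvalidated"."""
    flagged = []
    for name, (confidence, warning) in fields.items():
        if confidence < threshold and warning != "unvalidated":
            flagged.append(name)
    return flagged


# Hypothetical validated record: field name -> (confidence, warning).
validated = {
    "UnclassifiedTitle": (0.4, "title is longer than the norm"),
    "PersonalAuthor": (0.9, None),
    "CitationClassification": (0.45, "unvalidated"),
}

to_check = needs_review(validated)
```

Here only the title would be flagged. A person looking at it might well decide the value is correct despite the warning.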

3.7. Excluding Lines from All Cells

Although there are more rows to complete, let's turn our attention to the next section of the left column, Exclude these lines in margins. Looking at both of our documents, you can see that the forms are split across multiple pages. But each page has certain text in the margins that is not actually part of a cell.

This could cause problems later if that text got absorbed into a cell. For example, if we indicated that the text in cell 12 was above 13, then if the page just happened to break the way it has in ADA509654, we could mistakenly wind up with the words “UNCLASSIFIED SECURITY CLASSIFICATION OF FORM” appended twice (once for the bottom of the page and once for the top of the next page) to the value extracted for cell 12. And we can't treat those as cell labels and say that cell 12 is above the label UNCLASSIFIED, first because that line of text can appear legitimately in other places in the form and, more importantly, because other documents using the same form might happen to have the page break fall somewhere else, either between two different rows or even in the middle of a row (e.g., abstracts are sometimes broken across pages).

Instead, we tell the program to ignore any lines like these in the margins by making entries in the Exclude… section:

Technically, these patterns are regular expressions, not just plain text strings, and it's possible for regular expressions to encode some subtle variations in what would and would not match. You may need to consult a programmer familiar with regular expressions if simply entering text strings does not work for you.
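
As a rough illustration of the difference between a plain string and a regular expression, here is a small Python sketch. It rests on several assumptions: it uses Python's re module, whose dialect may differ from the one Extract uses; it requires a pattern to match a whole line; and it does not model what counts as a margin. The “Page 3 (of 10)” line is a hypothetical example, not taken from the sample documents.

```python
import re

# The two exclude entries used in this tutorial.
EXCLUDE_PATTERNS = ["UNCLASSIFIED", "SECURITY CLASSIFICATION OF FORM"]


def is_excluded(line, patterns=EXCLUDE_PATTERNS):
    """True when the whole (trimmed) line matches one of the patterns."""
    text = line.strip()
    return any(re.fullmatch(pattern, text) for pattern in patterns)


def drop_excluded(lines, patterns=EXCLUDE_PATTERNS):
    """Return the lines that survive the exclude patterns."""
    return [line for line in lines if not is_excluded(line, patterns)]


page = ["UNCLASSIFIED", "Final report", "SECURITY CLASSIFICATION OF FORM"]
kept = drop_excluded(page)

# Parentheses are special in a regular expression, so text containing
# them does not match itself until the special characters are escaped.
literal = "Page 3 (of 10)"
unescaped_matches = re.fullmatch(literal, literal) is not None
escaped_matches = re.fullmatch(re.escape(literal), literal) is not None

# A pattern can also cover variations that a plain string cannot.
any_page = re.fullmatch(r"Page \d+ \(of \d+\)", "Page 12 (of 40)") is not None
```

Plain words such as UNCLASSIFIED behave the same either way. Trouble usually starts with characters like parentheses, periods, and question marks, which is a good point at which to ask for help.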

3.8. Finishing Up

We're done for now. In the left column, move down to the Comments section. You can put anything in here that you like. Typically, you might put your name (as author of this template), the date, and the ADA designations of the documents you used as the samples for preparing the template.

Save your template. If you look at the saved file with a web browser, NotePad, or some other editor, you should see something like this:

<?xml version="1.0" encoding="UTF-8"?>
<template name="canadian-demo">
<!--Steven Zeil, May 30, 2010
From sample documents ADA480770 and ADA509654
-->
 <form>
   <fixed>
    <field num="1">
      <line>1. ORIGINATOR (The name and address of the organization preparing the document,
            Organizations for whom the document was prepared, e.g. Centre sponsoring a
            contractor's document, or tasking agency, are entered in section 8.)</line>
    </field>
    <field num="1">
      <line>1. ORIGINATOR (the name and address of the organization preparing the document.
            Organizations for who the document was prepared, e.g. Establishment sponsoring a
            contractor's report, or tasking agency, are entered in Section 8.)</line>
    </field>
    <field num="2">
      <line>2. SECURITY CLASSIFICATION (Overall security classification of the document 
            including special warning terms if applicable.)</line>
    </field>
    <field num="2">
      <line>2. SECURITY CLASSIFICATION (overall security classification of the document,
            including special warning terms if applicable)</line>
    </field>
    <field num="3">
      <line>3. TITLE (The complete document title as indicated on the title page. 
            Its classification is indicated by the appropriate abbreviation 
            (S, C, R, or U) in parenthesis at the end of the title)</line>
    </field>
    <field num="3">
      <line>3. TITLE (the complete document title as indicated on the title page. 
            Its classification should be indicated by the appropriate abbreviation 
            (S, C or U) in parentheses after the title).</line>
    </field>
    <field num="4">
      <line>4. AUTHORS (First name, middle initial and last name. If military,
            show rank, e.g. Maj. John E. Doe.)</line>
    </field>
    <field num="4">
      <line>4. AUTHORS (Last name, first name, middle initial. If military,
            show rank, e.g. Doe, Maj. John E.)</line>
    </field>
    <field num="5">
      <line>5. DATE OF PUBLICATION (Month and year of publication of document.)</line>
    </field>
    <field num="5">
      <line>5. DATE OF PUBLICATION (month and year of publication of document)</line>
    </field>
    <field num="6a">
      <line>6a NO. OF PAGES (Total containing information, including Annexes, Appendices, etc.)</line>
    </field>
    <field num="6a">
      <line>6a. NO. OF PAGES (total containing information, include Annexes, Appendices, etc)</line>
    </field>
    <field num="6b">
      <line>6b. NO. OF REFS (Total cited in document.)</line>
    </field>
    <field num="6b">
      <line>6b. NO. OF REFS (total cited in document)</line>
    </field>
  </fixed>
  <extracted>
    <metadata name="">
      <rule relation="aboveof" field="3" />
      <rule relation="belowof" field="1" />
      <rule relation="leftof" field="2" />
    </metadata>
    <metadata name="CitationClassification">
      <rule relation="aboveof" field="3" />
      <rule relation="belowof" field="2" />
      <rule relation="rightof" field="1" />
    </metadata>
    <metadata name="UnclassifiedTitle">
      <rule relation="aboveof" field="4" />
      <rule relation="belowof" field="3" />
    </metadata>
    <metadata name="PersonalAuthor">
      <rule relation="aboveof" field="5|6a|6b" />
      <rule relation="belowof" field="4" />
    </metadata>
  </extracted>
  <exclude>UNCLASSIFIED</exclude>
  <exclude>SECURITY CLASSIFICATION OF FORM</exclude>
</form>

</template>

There are still several rows left to finish, but you have now seen all the necessary steps for doing so, and should be able to do those on your own.
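
Because the saved template is ordinary XML, you can also check it with a short script. The following Python sketch uses only the standard library and is not part of Extract. The function names summarize and undefined_cells are invented here, and the embedded template is a trimmed copy of the listing above that deliberately leaves out the labels for cells 6a and 6b so that the check has something to report.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the tutorial template; cells 6a and 6b are left out
# on purpose to show what the check reports.
TEMPLATE = """
<template name="canadian-demo">
 <form>
  <fixed>
   <field num="4">
    <line>4. AUTHORS (First name, middle initial and last name. If military,
          show rank, e.g. Maj. John E. Doe.)</line>
   </field>
   <field num="4">
    <line>4. AUTHORS (Last name, first name, middle initial. If military,
          show rank, e.g. Doe, Maj. John E.)</line>
   </field>
   <field num="5">
    <line>5. DATE OF PUBLICATION (Month and year of publication of document.)</line>
   </field>
  </fixed>
  <extracted>
   <metadata name="PersonalAuthor">
    <rule relation="aboveof" field="5|6a|6b" />
    <rule relation="belowof" field="4" />
   </metadata>
  </extracted>
 </form>
</template>
"""


def summarize(template_xml):
    """Return (labels per cell, geometry rules per metadata field)."""
    root = ET.fromstring(template_xml.strip())
    labels = defaultdict(list)
    for field in root.iter("field"):
        for line in field.iter("line"):
            # Collapse the line breaks and indentation inside each label.
            labels[field.get("num")].append(" ".join(line.text.split()))
    rules = {}
    for metadata in root.iter("metadata"):
        rules[metadata.get("name")] = {
            rule.get("relation"): rule.get("field")
            for rule in metadata.iter("rule")
        }
    return dict(labels), rules


def undefined_cells(labels, rules):
    """List cells that a geometry rule mentions but that have no label."""
    missing = set()
    for relations in rules.values():
        for cells in relations.values():
            missing.update(c for c in cells.split("|") if c not in labels)
    return sorted(missing)


labels, rules = summarize(TEMPLATE)
missing = undefined_cells(labels, rules)
```

For this trimmed template, the check reports cells 6a and 6b as referenced by a rule but lacking labels, and shows two alternate labels for cell 4.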

A final note: if you were to use this template with the main metadata extraction program, it would not, as it is, ever get used. That's because we have, in this tutorial, been recreating the already existing canadian template. Because that template would find all the cell labels we have entered, and would successfully match quite a few other labels as well, the extraction program would favor the existing template because it successfully matches a much larger number of labels.
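
The idea of favoring the template that matches the most labels can be sketched as follows. This is a simplification under stated assumptions, not the selection code in Extract: labels are compared exactly after case and spacing are normalized (the real matching is fuzzy), alternate labels per cell are ignored, and the template names “demo” and “fuller” and their shortened labels are hypothetical.

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def count_matches(template_labels, scroll):
    """Count how many of a template's labels appear as lines of the scroll."""
    lines = {normalize(line) for line in scroll}
    return sum(1 for label in template_labels if normalize(label) in lines)


def best_template(templates, scroll):
    """Pick the template name whose labels match the most scroll lines."""
    return max(templates, key=lambda name: count_matches(templates[name], scroll))


# Hypothetical templates: the second knows more labels of the same form.
templates = {
    "demo": ["1. ORIGINATOR", "2. SECURITY CLASSIFICATION", "3. TITLE"],
    "fuller": [
        "1. ORIGINATOR",
        "2. SECURITY CLASSIFICATION",
        "3. TITLE",
        "4. AUTHORS",
        "5. DATE OF PUBLICATION",
    ],
}

scroll = [
    "1. ORIGINATOR",
    "2. SECURITY CLASSIFICATION",
    "3. TITLE",
    "4. AUTHORS",
    "5. DATE OF PUBLICATION",
]

chosen = best_template(templates, scroll)
```

Both templates match every label they contain, but the fuller one matches more of them, so it wins.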

4. Tips and Best Practices

  • Try to use clean documents that will not need OCR for your example documents.
  • Work through a form a row at a time. Check your work by applying the templates after each row's worth of changes.
  • Cell labels are not case-sensitive (“REPORT” and “Report” will match one another) and are tolerant of extra spaces (“Report Date” and “Report   Date” will match one another). In fact, all cell label matches are fuzzy, allowing a small percentage of changes. So if the template says that the correct cell label is “Names of Authors” and an OCR error changes this to “Names of Au1hors”, things will probably still match.
  • Be careful copying and pasting - copying from a PDF form can often grab bits of other lines that you weren't aiming for.
    • If you are copying and pasting and get garbage when you paste, then the PDF document probably is borderline illegal (missing or improperly specified fonts) or the document was OCR'd and you are actually copying the hidden text injected by the OCR program.
      • Either way, that document is probably going to be a problem. Use another if you can.
  • If you are testing and a cell label for cell K appears in the metadata of another field F, that could be because you misspelled the cell label for cell K, or because you are missing or have incorrect geometry rules in K or F.
  • If you are testing and a cell you have defined seems to be missing entirely, look carefully at the label you gave for that cell. If you are sure that is correct, then look next at the geometry rules.
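
The tolerant label matching mentioned in the tips above can be illustrated with Python's standard difflib module. This is a sketch only: the manual does not say which similarity measure or cutoff Extract uses, so both SequenceMatcher and the 0.9 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Ignore case and collapse extra spaces."""
    return " ".join(label.lower().split())


def labels_match(expected, found, threshold=0.9):
    """True when the two labels are similar enough after normalizing.
    The 0.9 threshold is an assumed value, not Extract's."""
    ratio = SequenceMatcher(None, normalize(expected), normalize(found)).ratio()
    return ratio >= threshold


same_case = labels_match("REPORT", "Report")
extra_spaces = labels_match("Report Date", "Report   Date")
ocr_error = labels_match("Names of Authors", "Names of Au1hors")
different = labels_match("Names of Authors", "Report Date")
```

The first three comparisons succeed and the last one fails. A single misread character in a sixteen-character label still leaves the two strings about 94% similar by this measure.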
extract/formwriter/form_template_writers_manual.txt · Last modified: 2010/06/28 12:31 by zeil