Non-Form Template Writers' Manual

June 28 2010

Digital Library Research Group, Old Dominion University

1. Introduction

This document describes the current state of the template language for processing of non-form documents in the Extract metadata extraction system. Updated versions of this manual can be found at http://www.cs.odu.edu/~extract/developersWiki/doku.php?id=extract:manuals

Readers unfamiliar with the overall purpose and organization of this system are referred to this paper. This document concentrates on the portion of the overall system that extracts raw metadata from documents that have already been checked for “report pages” (metadata-bearing forms) and found to contain none.

Such documents are then passed on for attempted extraction of metadata according to rules described in one or more executable templates, each template designed to describe a distinct document layout.

The template language is derived from the PhD thesis by J. Tang and retains the overall structure defined there, though many additions and modifications to the lower-level details have been made since then.

At the time of extraction, documents (originally received as PDF files) have been converted to IDM, a tree-like structure in which a document is divided into pages, pages divided into regions, regions into smaller regions and paragraphs, and paragraphs into lines of words. Each line is marked with certain “features” describing the primary font employed within that line.

These features are:

* weight (bold or medium),

* slant (italic or normal),

* font size,

* allCaps (true or false),

* titleCase (true or false),

* start of paragraph (i.e., are there one or more empty lines preceding this one?).

The “template engine” attempts to interpret the instructions encoded in a template in order to select, from the IDM structure, text strings that represent meaningful metadata.

2. The Template Language

2.1 Overview

The essential structure of the template language is as follows:

  • A template contains a set of rules designed to extract metadata from a single class of similar documents.
    • Each template has a unique name, called the template ID.
    • Each template has a list of one or more page numbers indicating what portion of a document that template will examine.
    • Each template contains an arbitrary number of rules.
  • A rule describes how to locate and extract a block of text from the document.
    • A rule consists of two main parts, the begin and end selectors, which describe how to locate the beginning and the ending of the desired block of text, respectively.
    • The rule gives a name to the block of text that it extracts. In most cases, this is the name of a metadata field (e.g., UnclassifiedTitle, PersonalAuthor).
    • A rule can be marked as ignored, meaning that the block of text located by the rule will not be included in the final metadata. Ignored rules can still be useful as intermediate steps in aiding later rules to locate “real” metadata.
    • A rule can be marked as required, meaning that if the rule cannot locate a block of text, then we assume that this document is not actually in the class that this template is trying to describe, and the attempt to apply the template halts with no output. For example, if a class of related documents always has a date immediately after the title, then we would mark the rule to extract that date as “required”, because if there is no date after the title, the document we are looking at must not be in that class. On the other hand, if the date is present in some documents of the class but not in others, we would mark the rule as not required, so that the date is extracted when present but execution of the template continues whether the date is there or not.
    • A rule may also have a filter, which can select a portion of the located block of text to actually be extracted. For example, if a document contains a line “Report Date: 01/21/2009”, we might apply a filter to extract only the date “01/21/2009” and not the words “Report Date: ” that precede the desired value.
    • The begin and end selectors each contain a line selector expression.
  • Line selector expressions come in many forms, described below. Each offers a distinct way to search for a line of text that represents the beginning (or ending) of an interesting block of text. Examples include searching for specific strings in the document, searching for the end of a paragraph, or searching for lines followed by large blank vertical spaces.
    • Selector expressions are modified by an “inclusive” value, which indicates whether the desired beginning/ending location is supposed to be the line that actually matches the selector expression, or the line just before the matching line, or the line just after the matching line.
    • Selector expressions can also be modified by giving a “scope”, which alters the range of lines that will be searched using the selector expression. This is explained in more detail just below.
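As a rough sketch of how these pieces fit together, the fragment below shows a template with two rules, the second of which is marked as required. The template ID, the field name ReportDate, and the matched text are assumptions made purely for illustration; the attributes and selectors used are the ones described in Section 2.3.

```xml
<!-- Illustrative only: templateID, ReportDate, and "Report Date:" are hypothetical -->
<template pagenumber="1" templateID="example_layout">
  <!-- Rule 1: the title runs from the first line to the end of its paragraph -->
  <UnclassifiedTitle>
    <begin inclusive="current">begin</begin>
    <end inclusive="current">paraEnd</end>
  </UnclassifiedTitle>
  <!-- Rule 2: required, so the template halts with no output
       if no line beginning with "Report Date:" can be found -->
  <ReportDate require="yes">
    <begin inclusive="current">
      <stringmatch case="no" loc="beginwith">Report Date:</stringmatch>
    </begin>
    <!-- onesection: the block ends on the same line where it began -->
    <end inclusive="current">onesection</end>
  </ReportDate>
</template>
```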

2.2 Applying Templates to Documents

As noted earlier, before a template is applied to a document, the document has been converted from the input format (PDF) into IDM. IDM is a hierarchical structure in which words are contained within lines, lines within paragraphs, paragraphs within regions, and so on up to pages that are contained inside the document itself.

Templates, however, are based on a line-by-line view of the document. They are applied to a scroll, a linear list of the lines that make up the pages of the document.

When a template is applied, the template “engine” tries to apply the rules within the template, one after another, in the order they appear in the template. Each rule has a begin and end selector. So the engine first applies the begin selector. If that succeeds in finding the beginning of a block of text, then the engine applies the end selector. If that selector succeeds in locating the ending of a block of text, the lines of text are saved under the name given in the rule.

The template engine has a notion of a “current line” location within the document. When execution starts, the current line is the first line in the document. After any successful rule, the current line is set to the first line just after the text extracted by that rule. Begin selectors will normally search from the current line up to the end of the document. (More exactly, to the end of the set of pages designated for this template.) End selectors will normally search from the line matched by the begin selector to the end of the document. These search ranges can be modified by supplying a “scope” value on the begin or end selector. A scope denotes an alternative portion of the document that will be searched.

  • A scope of “document” in a begin selector resets the current line to the first line of the document and starts searching from there. The search continues to the end of the document. A scope of “document” in an end selector means that it should search from the line matched by the begin selector to the end of the document. (This is how end selectors normally work anyway, so there's not much point in putting a “document” scope on an end selector.)
  • A scope of “page” in a begin selector resets the current line to the first line of the current page and starts searching from there. The search continues to the end of that page. A scope of “page” in an end selector means that it should search from the line matched by the begin selector to the end of the current page.
  • A scope may also give the name of any previously extracted block of text. In a begin selector this resets the current line to the first line of that block of text and starts searching from there. The search continues to the end of that block of text. In an end selector, such a scope means that it should search from the line matched by the begin selector to the end of that block of text.
    • If no block of text has been successfully extracted with the given name (i.e., there is no prior rule with that name or the rule with that name failed to locate a block of text), then execution of the template is halted with no output, just as if the rule with that name had been marked as required.
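The scope rules above can be sketched with a pair of rules, one of which searches only inside a block located by the other. The rule names HeaderBlock and ReportDate and the template ID are hypothetical; the scope, ignore, and inclusive attributes and the selectors are those described in Section 2.3.

```xml
<!-- Illustrative only: templateID, HeaderBlock, and ReportDate are hypothetical names -->
<template pagenumber="1" templateID="scope_example">
  <!-- Ignored helper rule: marks out a block from the first line of the
       document down to the first line followed by a large vertical space -->
  <HeaderBlock ignore="yes">
    <begin inclusive="current">begin</begin>
    <end inclusive="current">verticalSpace(1.0)</end>
  </HeaderBlock>
  <!-- scope="HeaderBlock": the search restarts at the first line of that
       block and stops at its end. If HeaderBlock was never extracted,
       the template halts with no output. -->
  <ReportDate>
    <begin scope="HeaderBlock" inclusive="current">dateformat</begin>
    <end inclusive="current">onesection</end>
  </ReportDate>
</template>
```

Without the scope attribute, the second rule would instead begin its search at the line just after HeaderBlock, because that is where the current line sits after a successful rule.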

2.3 Details of the Template Language

Figure 1 shows an example template. Each desired metadata item is described by a rule that designates the beginning and the end of the metadata. The rules are limited to features detectable at line-level resolution. The names of the metadata fields can vary from one organization's collection to another.

 <template pagenumber="3" templateID="arl_1">
  <CorporateAuthor>
    <begin inclusive="current">
       <stringmatch case="no" loc="beginwith">Army Research</stringmatch>
    </begin>
    <end inclusive="before">
       <stringmatch case="no" loc="beginwith">ARL</stringmatch>
    </end>
  </CorporateAuthor>
  <!-- ... -->
 </template>

Fig. 1. Non-form Template fragment

The template language is expressed in XML. This is not a fundamental limitation. Almost all executable languages are processed by translation into a tree-structured description (generally referred to as an “abstract syntax tree”). XML is a standard notation for exchanging tree-structured data, so it is convenient for a research/exploratory project to introduce new languages directly in the syntax tree format, bypassing the otherwise distracting process of providing higher-level translation. XML is not particularly easy to read and write, and a later section of this document describes a preliminary approach to providing a more writer-friendly way to generate templates.

In the remainder of this section we cover the XML elements that make up a template.

2.3.1 The <template>

The top-level element in any template is <template>. This serves as a container for one or more metadata field descriptions as described below. The template has two attributes:

Attribute | Required | Possible values | Default | Meaning
pagenumber | yes | Numeric page range: p or p1-p2 | (none) | Pages of the original document examined by this template
templateID | yes | word | (none) | Unique identifier for this template

2.3.2 Metadata Rules

Inside the <template> element are one or more rules. Most are named for a metadata field (e.g., “CorporateAuthor” in Figure 1). The actual names depend on the standards imposed by a particular document collection and vary from one collection to another.

Each rule will contain two elements that describe the beginning line where that metadata can be found and the final line for that metadata field value. The actual value extracted for the field will be the text in all lines so identified. Possible attributes (all optional) are:

Attribute | Required | Possible values | Default | Meaning
min | no | Non-negative number | 1 | Minimum number of repetitions of this field that should be expected in the document
max | no | Non-negative number | 1 | Maximum number of repetitions of this field that should be expected in the document
ignore | no | “yes” or “no” | no | If “yes”, this field is used merely as a convenience to identify a position within the document. No metadata value will actually be extracted into the output.
require | no | “yes” or “no” | no | If “yes”, then failure to successfully locate and extract this field indicates that something is wrong (e.g., this template describes a different document layout than is actually present in this document). In such a case, execution of the template is halted with no output generated for any metadata fields.
filter | no | Regular expression | .* | Used to select a portion of the raw text in the indicated lines. If the regular expression contains no parentheses, then the portion of the text matching the entire regular expression is extracted. If the regular expression contains parentheses, then the portion of the text matching the parenthesized sub-expressions is extracted.
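To make the two forms of the filter attribute concrete, the sketch below shows two alternative versions of the same rule, applied to a line assumed to read “Report Date: 01/21/2009”. The field name ReportDate and both patterns are illustrative assumptions, and only basic regular-expression syntax is used because the exact dialect accepted by the engine is not stated here.

```xml
<!-- Alternative 1: no parentheses, so the text matching the whole
     expression is extracted, i.e. the date itself -->
<ReportDate filter="[0-9]+/[0-9]+/[0-9]+">
  <begin inclusive="current">
    <stringmatch case="no" loc="beginwith">Report Date:</stringmatch>
  </begin>
  <end inclusive="current">onesection</end>
</ReportDate>

<!-- Alternative 2: parentheses, so only the text matching the
     parenthesized sub-expression (everything after the label) is extracted -->
<ReportDate filter="Report Date: *(.*)">
  <begin inclusive="current">
    <stringmatch case="no" loc="beginwith">Report Date:</stringmatch>
  </begin>
  <end inclusive="current">onesection</end>
</ReportDate>
```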

2.3.3 Begin & End Selectors

Each rule will contain one <begin> element and one <end> element. These describe the beginning and ending line of the text to be extracted for that metadata field. These are the most complicated part of the template language, and the syntax and structure of these rules have not evolved in an entirely consistent fashion over time.

All <begin> and <end> elements will contain some sort of line selector expression. Sometimes this expression will be simply text. In other cases it is another XML element (e.g., the <stringmatch> elements in Figure 1).

Attributes of the begin/end rules themselves are:

Attribute | Required | Possible values | Default | Meaning
inclusive | no | “before”, “after”, “current” | current | Modifies the choice made by the basic line selector. If “before”, the selected line is actually the one before that indicated by the line selector. If “after”, the selected line is the one after that indicated by the line selector. If “current”, the selected line is the one indicated by the line selector.
scope | no | “document”, “page”, any field name, “” | “” | Changes the range of lines that will be searched.
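The effect of the inclusive attribute can be sketched by applying the same selector three ways. The three document lines in the comment are invented for illustration; the three <begin> elements are alternatives, not parts of one rule.

```xml
<!-- Suppose the document contains these three consecutive lines:
       A Study of Widgets
       Abstract
       We examine widgets in detail.
     and the selector below matches the line "Abstract". -->

<!-- inclusive="before": the selected line is "A Study of Widgets" -->
<begin inclusive="before">
  <stringmatch case="yes" loc="onesection">Abstract</stringmatch>
</begin>

<!-- inclusive="current": the selected line is "Abstract" -->
<begin inclusive="current">
  <stringmatch case="yes" loc="onesection">Abstract</stringmatch>
</begin>

<!-- inclusive="after": the selected line is "We examine widgets in detail." -->
<begin inclusive="after">
  <stringmatch case="yes" loc="onesection">Abstract</stringmatch>
</begin>
```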

2.3.4 Line Selector Expressions

Inside each <begin> or <end> rule is a line selector expression. These describe tests that are applied to the lines of the document, until the indicated test succeeds.

Most such tests are applied relative to the “cursor” or current line of the extraction process. When we first start applying a template, the current line is the first line of the document. Subsequently, the current line for an 'end' rule is the line matched by the prior 'begin'. The current line for any subsequent 'begin' rules is the line just after the last successful 'end'.

Notation:

* mf: a metadata field name

* v1, v2, …: a vertical position on the page, expressed as a decimal number in the range 0.0-1.0 denoting the fraction of the page (by line count) with 0.0 being the top

* wc: an integer denoting a number of words

* s1, s2: sizes (of fonts), expressed in units internal to IDM

Selector | Meaning
mf | Names a previously extracted metadata field. (It is possible to extract multiple field values with the same field name, in which case this refers to the most recently extracted value with this name.) In a <begin> rule, selects the line chosen by the <end> rule of that metadata field. In an <end> rule, selects the line chosen by the <begin> rule of that metadata field.
beforeField(n, mf) | Selects the line that is n (non-blank) lines before an existing metadata field mf
beforeTag(n, mf) | Deprecated in favor of beforeField. Selects the line that is n (non-blank) lines before an existing metadata field mf
begin | The first line of the document
beginwithmonth | Find a line that begins with a month, such as “Mar ”, “January”, etc.
boldchange | Deprecated. Find a line that differs from the current line in that one is in bold and the other is not (same as changeWeight)
changeWeight | Find a line that differs from the current line in that one is in bold and the other is not (same as boldchange)
changeSizeOrWeight | Find a line that differs from the current line in that one of the following is true:
 - Font size is different
 - One is bold and the other is not bold
changeSizeOrWeightOrAllCaps | Find a line whose features are different from those of the current line. A typography change occurs when any of the following are true:
 - Font size is different
 - One is bold and the other is not bold
 - One is all caps and the other is not
changeSizeOrWeightOrCaps | Find a line whose features are different from those of the current line. A feature change occurs when any of the following are true:
 - Font size is different
 - One is bold and the other is not bold
 - One is all caps and the other is not
 - One is in title case (leadingcase) and the other is not
cityState | Recognizes strings containing “City, State” or “City, State Zip” patterns
chooseFieldBegin(mf_1, mf_2, … mf_n) | Attempts to locate the listed metadata fields, left to right, and selects the first one that was successfully extracted. The selector then matches the beginning of that field.
chooseFieldEnd(mf_1, mf_2, … mf_n) | Attempts to locate the listed metadata fields, left to right, and selects the first one that was successfully extracted. The selector then matches the end of that field.
containsName | Find a line that appears to contain a personal name. Names may be last name first or first name first. [Replaces the former unused nameformat selector.]
containsOnlyName | Find a line that appears to contain only a personal name. Names may be last name first or first name first. [If the line has 4 or more words, it must contain a comma, period, colon, semicolon, parenthesis, bracket, square bracket, or the word 'and'.]
current | Matches the current line (see above)
dateformat(formats) | Find a line that has a date in a specified format. formats is a “|”-separated list of date formats, with each format being any pattern that would be accepted by Java's java.text.SimpleDateFormat class. Example: dateformat(MMMM dd, yyyy|MM/dd/yyyy) would accept lines such as “January 23, 2001” or “01/23/2001”
dateformat | Find a line that has a date in the format “dd month yyyy”, “month dd, yyyy”, or “month yyyy”, where “month” means a month string such as “Jan ”, “September”, etc.
end | The last line of the document
endleftof(tag) | Locates the first line that is not to the left of the beginning of the tag (see 'leftof', below)
endrightof(tag) | Locates the first line that is not to the right of the beginning of the tag (see 'rightof', below)
featurechange | Deprecated. Old name for changeSizeOrWeightOrCaps
firstpart | Deprecated. Matches the current line (same as “current” or “onesection” in an <end> rule)
largersize | Find a line whose size is larger than that of the current line. (Lines with string length less than 10 are ignored.)
largeststrsize (v1,v2) | Searches for the largest font size among lines between positions v1 and v2 that meet the following criteria:
 - Its length is larger than 11
 - It has more than 1 word
 - Average word length is between 4 and 13
 - Percentage of letters is larger than 0.7
looseLargeststrsize (v1,v2) | Searches for the largest font size among lines between positions v1 and v2 that meet the following criterion:
 - Percentage of letters is larger than 0.7
lastpart | The line on which the previous field ends
layoutchange | Deprecated. Old name for changeSizeOrWeight
leftof(tag) | Locates the first line to the left of and overlapping vertically with the block of text extracted as the metadata field tag.
onesection | Generally used only in an <end> rule. Selects the same line as the <begin> rule. This is functionally equivalent to “current”.
pageChange | Finds the first line of the next page. [Note that only pages whose page numbers are given in the template will be available.]
paraEnd | Finds a line preceding a line that was indicated as the start of a paragraph (by OCR)
ParaEnd | Deprecated (in favor of “paraEnd”). Finds a line preceding a line that was indicated as the start of a paragraph (by OCR)
regexps(re) | Find a line matching a regular expression re
rightof(tag) | Locates the first line to the right of and overlapping vertically with the block of text extracted as the metadata field tag.
size (s1,s2) | Find a line whose font size is between s1 and s2
sizechange (x) | Find a line whose font size is different from that of the current line. To overcome OCR errors, a change with difference less than x is ignored
sizepctchange(x) | Find a line whose font size differs from that of the current line by more than x, expressed as a fraction (0.0-1.0)
smallersize | Find a line whose size is smaller than that of the current line. (Lines with string length less than 10 are ignored.)
stringmatch | Match a specified string (see Section 2.3.5)
titleCaseOrAllCaps(k) | Find a line that is in all caps or in title case (according to the usual English rules for capitalizing titles, which allow articles, prepositions, etc., to remain uncapitalized) and that has k or more words. If the parameter is omitted, k defaults to 4.
typoGraphychange | Deprecated. Old name for changeSizeOrWeightOrAllCaps
verticalSpace | Find a line preceded by an empty line
verticalSpace(s) | Searches for any line that is followed by a vertical space greater than or equal to s * lineheight, where s is a scale factor and lineheight is a multiplier times the height of the bounding box; the multiplier ranges from 1.0 to 1.2
verticalSplit(k,n) | Searches the current page for the k-1 largest internal vertical white spaces (ignoring the top and bottom margins), thereby splitting the page into k vertical blocks. Returns the line beginning the nth such block. By definition, if n == 0, returns the first nonempty line on the page. If n == k, returns the last nonempty line on the page.
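Selectors that take arguments are written as plain text inside the <begin> or <end> element. The sketch below uses the dateformat example from the table together with a simple regular expression. The template ID, the field names ReportDate and ContractNumber, and the pattern ^Contract.* are assumptions chosen for illustration.

```xml
<!-- Illustrative only: templateID, field names, and the regexps pattern are hypothetical -->
<template pagenumber="1-2" templateID="selector_examples">
  <!-- A single line holding a date such as "January 23, 2001" or "01/23/2001" -->
  <ReportDate>
    <begin inclusive="current">dateformat(MMMM dd, yyyy|MM/dd/yyyy)</begin>
    <end inclusive="current">onesection</end>
  </ReportDate>
  <!-- Searched from the line just after ReportDate: starts at a line
       beginning with "Contract" and runs to the first line that is
       followed by a large vertical space -->
  <ContractNumber>
    <begin inclusive="current">regexps(^Contract.*)</begin>
    <end inclusive="current">verticalSpace(1.0)</end>
  </ContractNumber>
</template>
```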

2.3.5 String matching

Unlike most line selectors, <stringmatch> is formed as an XML element rather than as plain text within a <begin> or <end> rule. This is presumably because of the more elaborate set of options available for this selector. The actual text to be matched is inside the <stringmatch> element (e.g., “Army Research” in Figure 1).

Attribute | Required | Possible values | Default | Meaning
case | no | “yes” or “no” | yes | yes: upper/lower case is significant; no: upper/lower case differences are ignored
loc | yes | “beginwith”, “onesection”, “contain”, “endwith” | (none) | Modifies how much of the text in a line must match the provided text:
 - beginwith: the line must begin with the provided text
 - endwith: the line must end with the provided text
 - onesection: the entire line must match the provided text
 - contain: the provided text must occur somewhere within the line
fuzzy | no | Non-negative integer | 0 | Match succeeds even if the line differs from the provided text by this number of single-character changes (Levenshtein edit distance)
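A minimal sketch combining all three attributes follows. The rule name DistributionStatement and the matched text are hypothetical, chosen only to show how the attributes are written.

```xml
<!-- Illustrative only: the rule name and the matched text are hypothetical -->
<DistributionStatement>
  <begin inclusive="current">
    <!-- case="no":     upper/lower case differences are ignored
         loc="contain": the text may occur anywhere within the line
         fuzzy="2":     up to two single-character changes are tolerated -->
    <stringmatch case="no" loc="contain" fuzzy="2">Approved for public release</stringmatch>
  </begin>
  <end inclusive="current">paraEnd</end>
</DistributionStatement>
```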

3. TemplateMaker Tool

TemplateMaker is a GUI-based tool to help in the template creation process. It allows a template author to edit templates, adding and modifying rules, and to quickly apply the modified template to a set of documents to see the effects of each rule. TemplateMaker can be used by authors who have no understanding of XML or who simply wish to avoid dealing with the finicky details of writing valid XML.

Because the names of the metadata fields vary from one document collection to another, the TemplateMaker tool will be customized slightly for each document collection. General functionality and usage are the same for all the tools.

The TemplateMaker is still considered experimental. The interface and functions are likely to change in later versions. Even in its current form, however, it remains a useful tool.

3.1. Tutorial: Using TemplateMaker

Whether you are creating a new template or modifying an existing one, your starting point should always be to locate a set of documents that illustrate the layout from which you want to extract metadata. It's best to have at least two sample documents, and more than that may be useful. For convenience, you should probably gather a copy of all your sample documents into a single directory.

In this tutorial, we will prepare a template for some rather typical “academic” papers, illustrated by ADA480699 and ADA481820.

1. Get a copy of these PDF files and put them into a convenient folder. Start the TemplateMaker program. You will see something like this:

The TemplateMaker window is divided into two main parts. On the left is the template we are working with (currently empty) and on the right is a place to view the documents we will be using and the metadata we are able to extract from them.

2. Load the two sample documents. From the File menu, choose “Select Documents”. You will get a conventional file selection dialog. Navigate to the folder where you saved the two PDFs and select both of them.

There will be a short pause while the PDF files are read in and converted to our internal IDM format. Eventually, you will see the documents appear on the right of the window.

What you are seeing is not the original PDF, but a drawing of the IDM obtained from that PDF. Graphics will be missing, as the IDM concentrates on the document text. There may be some irregularities in the display and location of the text - many PDF documents will contain fonts that we cannot display directly on your screen, so the TemplateMaker tries to choose the best match it can.

You can use the “document:” selector on the top right to choose which sample document to view. You can see the tree structure of the actual IDM on the IDM tab. The remaining tabs will be empty at the moment.

3. Turning our attention to the left part of the window, let's start on the template. In this case, let's set the template ID to “academic-paper”. Because the metadata we want is right on page 1, we enter “1” into the pages box.

4. Time for our first rule. Click the +Rule button to add a rule to the template. Resize the window if necessary. You can see that a rule has been added, with sections for both the begin and the end selectors.

Now, the first thing we want to go after is the title at the top of the page. First, we name it: in the “Field Name” selector, choose UnclassifiedTitle.

Because the titles in both documents start in the first line of the document, the easiest way to match the beginning location of the title is to use the begin selector. So in the “begin” column, in the top selector box (which currently probably contains “beforeField…”), select begin. Set the “Inclusive” selector below that to “current”. (I always find it useful to start with “current” and then modify it later, if necessary.)

Now, how do we tell where the title ends? There are actually several possibilities. We could search for the end of a paragraph (paraEnd), or for a line followed by a large vertical space (verticalSpace(1.0)), or for a line containing a name (containsName), or for a change in the fonts being used. We can try different things to see what works. It also helps to think ahead a bit and consider what might happen to other documents that use this layout. For example, some documents might have names in the title, making containsName a poor choice. And some people tend to be rather slipshod about how they space their documents out, so it's possible that some authors might squeeze that block of author names close up to the title, which could make paraEnd and verticalSpace less than reliable.

So let's try an end selector of changeSizeOrWeightOrCaps (“weight” refers to bold/normal styles, “caps” refers to whether the line is in mixed case, title case, or all-caps). Set “Inclusive:” to “current”.

5. Now let's try our first rule to see how well it works. From the Extract menu, select “Apply template to docs”. The template will be executed on both of our sample documents. After a few seconds, you can see the results on the right by selecting the “Extracted” tab. Use the “document:” selector to move from one document to the other.

We have a bit of a problem. Our end selector chose the first line with a different font, which is actually the first line of the author names. We actually want our title to end just before that line. This is an easy fix. In the end selector, change “Inclusive” to “before”. Run the template again (Extract→Apply template to docs) and the problem is fixed:

6. OK, we've got something working. Time to save our work. From the File menu, select “Save Template” and save your work in any convenient directory. (Although the Save dialog will allow you to call this file whatever you want, we strongly recommend that you give it the same name as you have entered for the template ID, so call it “academic-paper.xml”.)

7. Now let's move on to the next field. Next we will try to get the block of author names. We want a new rule to be applied after we have extracted the title, so click the “+Rule” button after the UnclassifiedTitle rule. For the Field Name, select “PersonalAuthor”.

For the begin selector, we would like to start our block of text just after the title. So instead of selecting one of the predefined rules, type “UnclassifiedTitle”.

Looking at the documents, we again have several choices for how to detect the end of this block. We could go with paraEnd or verticalSpace as we suggested for the first rule. changeSizeOrWeightOrCaps would also work. However, it looks like a characteristic of this layout is the presence of a centered abstract that is announced by an “Abstract” header, so let's try to take advantage of that. In the end selector, choose stringmatch. In the args box, type “Abstract”. For “string match” select beginwith.

Set both “Inclusive” boxes to current and apply the template:

OK, we have a bit extra on each end. Set the Inclusive box for the begin selector to after and for the end selector to before and try the template again:

That looks better. Now, there's a bunch of stuff in there that isn't really the authors' names. But the Extract program normally applies a round of “post-processing” to the extracted metadata to put it into the final format expected for a collection (in this case, DTIC). Click on the “Validated” tab to see how the metadata will appear after post-processing.

You can see that the author names have been isolated, split into separate fields, and rewritten into “last-name, first-name” form.

You can also see confidence numbers associated with each field. These represent the Extract system's own internal estimate of whether the data looks acceptable. Data with scores of less than 0.5 will normally be flagged by the main metadata extraction program for human inspection.

8. The final piece of metadata available from these documents is the abstract itself. Click on the “+Rule” button after our authors rule. Select “Abstract” as the field name. For the begin selector, we can use a stringmatch identical to the one we used to locate the end of the authors. Set Inclusive to after so that we do not actually include the “Abstract” header in the value.

For the end selector, we have choices. paraEnd is probably not a good idea - some abstracts could have more than one paragraph. A stringmatch for “Introduction” is plausible. We would have to make this an “endwith” match because one document actually has “1.” in front of the word. If you try this, however, you will find that it fails to extract the abstract for ADA481820, because the paper actually misspells the word “Introduction”. This could be fixed by making the string match “fuzzy” - allowing a limited number of misspellings. Put a “1” into the fuzzy box, and this works:
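In XML terms, this fuzzy variant of the rule would correspond roughly to the sketch below. It is written by hand in the style of the other rules in this tutorial, not captured from TemplateMaker; the inclusive="before" setting on the end selector and the case="no" settings are assumptions carried over from the earlier steps.

```xml
<!-- Hand-written sketch of the fuzzy "Introduction" variant; not actual tool output -->
<Abstract ignore="no" require="">
  <begin inclusive="after">
    <stringmatch loc="beginwith" case="no">Abstract</stringmatch>
  </begin>
  <!-- endwith: tolerates a leading "1." before the word;
       fuzzy="1": tolerates one single-character misspelling -->
  <end inclusive="before">
    <stringmatch loc="endwith" case="no" fuzzy="1">Introduction</stringmatch>
  </end>
</Abstract>
```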

However, we might consider whether some documents might start with an “Overview” instead of an “Introduction”, or have some other title for the first section after the abstract. In that case, a change of font is probably a better thing to search for.

9. We're done! Save your template. If you look at the saved file with a web browser, NotePad, or other editor, you will see something like this:

<template templateID="academic-paper" pagenumber="1">
  <!---->
  <UnclassifiedTitle ignore="no" require="">
    <begin inclusive="current">begin</begin>
    <end inclusive="before">changeSizeOrWeightOrCaps</end>
  </UnclassifiedTitle>
  <PersonalAuthor ignore="no" require="">
    <begin inclusive="after">UnclassifiedTitle</begin>
    <end inclusive="before">
      <stringmatch loc="beginwith" case="no">Abstract</stringmatch>
    </end>
  </PersonalAuthor>
  <Abstract ignore="no" require="">
    <begin inclusive="after">
      <stringmatch loc="beginwith" case="no">Abstract</stringmatch>
    </begin>
    <end inclusive="before">changeSizeOrWeightOrCaps</end>
  </Abstract>
</template>

4. Tips and Best Practices

  • Try to work with multiple example documents.
    • If you only have one example of the document layout, it's hard to tell which characteristics are peculiar to that document and which ones are general for all documents of that type.
    • If you only have one example of the document layout, it might not be worth the effort to create a new template. It might be a “singleton”, a unique layout that you will never see again.
  • There are often many ways to locate a desired element. Try them out.
    • Be aware that some selectors tend to be more “stable” than others. Give preference to selectors that are less likely to fail given the normal variations from document to document.
      • String matches, for example, can be very stable (especially if you make them “fuzzy”). If you see that all documents in a group have a certain identifying string, then that's a stable selection option.
      • On the other hand, vertical spacing tends to be unstable. That's because when authors face a page limit and are trying to squeeze an overly large paper into a small number of pages, they often try shaving down all the white spaces in the layout.
  • Use required fields to limit your templates to the documents that they are really supposed to be applied to. Deploying a new and overly forgiving template can cause conflicts with other existing templates if the new one winds up grabbing too many documents that it really should not have applied to.
    • If all your sample documents for a layout have, for example, a title, then make that title a required element. After all, if you don't actually find the title, this is probably not the right template.
    • If you see that all documents in a group have a certain identifying string, then match on that string. Even if it's not part of your desired metadata, do a required match on it - this helps make sure that your template only gets used on documents of the desired class.
  • When working with the template editor, work in small steps, checking each change against your sample documents.
    • It's often helpful to turn off all “required” and “ignored” flags when working in the template editor. When developing new templates, required and ignored markers tend to hide useful information from you. Just remember to add them back in at the end.
  • If you are trying out a new selection rule and find that you get either one line of data too many, or one line too few, look closely at your inclusion settings for that rule.
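The tip about doing a required match on an identifying string can be sketched as a rule that is both required and ignored, so that it gates the template without contributing any metadata. The rule name SeriesMarker and the matched text are hypothetical.

```xml
<!-- Illustrative only: the rule name and the matched text are hypothetical -->
<!-- require="yes": the template halts with no output if the string is absent
     ignore="yes":  the matched line is not written to the final metadata -->
<SeriesMarker ignore="yes" require="yes">
  <begin inclusive="current" scope="document">
    <stringmatch case="no" loc="contain" fuzzy="1">Technical Report Series</stringmatch>
  </begin>
  <end inclusive="current">onesection</end>
</SeriesMarker>
```

Keep in mind that after this rule succeeds, the current line sits just past the matched line, so rules that follow it may need a scope of “document” or “page” if the text they are looking for appears earlier.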
extract/tempwriter/template_writers_manual.txt · Last modified: 2010/06/28 12:18 by zeil