1. The following attributes will be expressed using a uniform set of integer measurement units: page@width, page@height, bounding box attributes @l, @r, @t, @b (wherever occurring), @fs (font size) (wherever occurring), para@li, para@ri
  2. New required attribute page@meter=“n” where n is the number of measurement units contained in 1 meter.
    • Question: Why specified for each page instead of for entire document?
    • Answer: Allows us to easily splice together pages from different sources (e.g., if we get a doc where some pages, possibly containing the sf298 form, are textPDF while the remaining pages came from OCR).
    • Example: Omnipage output uses measurement units of 1/300 in, so a page from Omnipage OCR would be marked as <page meter=“11811” …>
    • A common calculation will be converting from measurement units to pts when describing fonts, computed as @fs * 2835 / page@meter. Here @fs is in measurement units, 2835 is the (rounded) number of pts in a meter, and page@meter is in measurement units per meter.
  3. New optional attribute page@source=“string” where the string names the primary processor used to produce that page from PDF. Likely values would be “Omnipage”, “Abbyy”, “textPDF”
  4. The bounding box attributes @l, @r, @t, @b should be required on wd, not optional.
  5. Rename the region attributes @left, @top, @right, @bottom as @l, @t, @r, @b for consistency with other elements.
  6. Add new element <phrase> that can be used in the same context as <wd> with the exact same attributes. The difference between them is semantic:
    • a “wd” denotes a unit of text that is bounded on each side by whitespace or by the edge of the bounding box of its container.
    • A “phrase” may or may not be so separated. A “wd” should normally not contain internal whitespace. A “phrase” may.
  7. Allow regions to contain other regions (as well as paragraphs, tables, etc).
  8. Add an optional attribute @base=“yvalue” to wd, phrase, & line. When present, indicates the y-value of the baseline on which the text was written.
    • @base will be expressed in the same measurement units as @t & @b.
  9. Certain attributes are considered inherited. If the attribute is legal in some element E and in one or more of its descendants, the attribute may be supplied a value in E. Any descendants lacking an explicit value for that attribute are treated as if they have the value given in E.
    • Inherited attributes are: @base, @fs, @ff, @style
  10. The use of <vert-white-space> elements is deprecated as these are easily derived from a comparison of the @t and @b values of adjacent components of the region.
  11. Elements within a region/para/line/table are sorted geometrically in a fashion intended to approximate the “natural” reading order. Note that nested regions can be used to “force” an ordering over complicated columns and rows.
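
The unit conversion described in item 2 can be sketched in Python as follows. This is only an illustration: the function names and the sample @fs value of 50 are assumptions made for the example and are not part of the document model.

```python
# Rounded number of points in one meter, as used in item 2
# (72 pt per inch / 0.0254 m per inch is about 2834.6).
POINTS_PER_METER = 2835


def units_to_points(value, units_per_meter):
    """Convert a length in page measurement units to points.

    `units_per_meter` is the value of the page@meter attribute.
    """
    return value * POINTS_PER_METER / units_per_meter


def units_per_meter_from_dpi(dpi):
    """Hypothetical helper: page@meter for a source whose unit is 1/dpi inch."""
    return round(dpi / 0.0254)


# Omnipage uses units of 1/300 in, which gives page@meter = 11811.
omnipage_meter = units_per_meter_from_dpi(300)

# An assumed @fs of 50 units on such a page is roughly a 12 pt font.
fs_points = units_to_points(50, omnipage_meter)
```

Because page@meter is carried on each page, the same function works unchanged for pages spliced in from a different processor.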
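
The inheritance rule in item 9 might be applied by a consumer roughly as shown below. This is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample fragment, the function name resolve_inherited, and the placement of @fs and @ff on para are assumptions for illustration, and the sketch does not check whether an attribute is actually legal on a given element.

```python
import xml.etree.ElementTree as ET

# Attributes listed as inherited in item 9.
INHERITED = ("base", "fs", "ff", "style")


def resolve_inherited(elem, inherited=None):
    """Return (element, effective_attributes) pairs in document order.

    An element lacking an explicit value for an inherited attribute
    receives the value from its nearest ancestor that supplies one.
    """
    inherited = dict(inherited or {})
    for name in INHERITED:
        if name in elem.attrib:
            inherited[name] = elem.attrib[name]  # explicit value wins
    effective = dict(inherited)
    effective.update(elem.attrib)  # keep non-inherited attributes too
    pairs = [(elem, effective)]
    for child in elem:
        pairs.extend(resolve_inherited(child, inherited))
    return pairs


# Invented sample fragment, not taken from a real document.
SAMPLE = (
    '<para fs="40" ff="Times">'
    '<line base="120">'
    '<wd l="10" r="50" t="90" b="125">Hello</wd>'
    '<wd l="60" r="99" t="90" b="125" fs="48">World</wd>'
    '</line>'
    '</para>'
)

root = ET.fromstring(SAMPLE)
words = [attrs for elem, attrs in resolve_inherited(root) if elem.tag == "wd"]
```

In this fragment the first wd picks up fs, ff, and base from its ancestors, while the second keeps its own explicit fs.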
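
Item 10 notes that vertical white space can be derived from the @t and @b values of adjacent components. A minimal sketch of that derivation follows. It assumes that y values increase down the page (so @t is smaller than @b), which the list above does not state, and the function name and sample coordinates are invented for the example.

```python
def vertical_gaps(boxes):
    """Return the vertical gap between each pair of adjacent components.

    `boxes` is a list of (t, b) pairs in top-to-bottom order, in page
    measurement units. Assumes y grows downward; overlapping components
    are reported as a gap of 0.
    """
    gaps = []
    for (_, prev_b), (next_t, _) in zip(boxes, boxes[1:]):
        gaps.append(max(0, next_t - prev_b))
    return gaps


# Three stacked components; the last one overlaps the second slightly.
sample_gaps = vertical_gaps([(100, 150), (170, 220), (215, 260)])
```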
extract/idmchanges2008.txt · Last modified: 2009/12/21 16:49 by zeil