We describe our architecture in data-flow terms: the system can be viewed as a series of transforms (the ellipses) connected by data flows (the arrows). Each transform's inputs come from, and its outputs go to, either other transforms or data stores (the boxes).

Here is our current architecture, at a high level.

Some of the transforms shown here are themselves composed of more detailed dataflow graphs.

Input Processing

The “Input Processing & OCR” transform currently looks like:

Over the next few months, we plan to introduce faster and more accurate processing for text PDF documents, which will change the above flow to:

The page extraction (via pdftk) transform disappears. In its place is the Text PDF extraction transform, which loads the PDF document, attempts to extract formatted text from the desired pages (still defaulting to 1st and last 5), and checks whether those pages do indeed appear to be text PDF (a.k.a. “born digital” or “native PDF”). Text PDF pages are rendered into an IDM-like format called “raw IDM” in the above diagram.

Any of the desired pages that appear likely to be image PDF (i.e., scanned images of pages) are written into a new PDF file containing only those specific pages. (pdftk will no longer be used for this purpose, as such page selection is readily available via the PDFBox library that is used to do the text extraction.) That PDF file is passed on for OCR, and the OCR output is converted from the OCR-engine-specific format into raw IDM.

It’s quite likely that many documents will have all of their pages handled as text PDF, that a somewhat smaller number will have all of their pages treated as image PDF, and that some documents will be found to be a mixture of the two. For example, someone could insert a text PDF version of an SF298 form into a scanned document, or might scan an SF298 form and insert that into a text PDF document.
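The per-page routing described above can be sketched roughly as follows. This is only an illustration in Python, not the project's PDFBox-based code: `select_pages`, `route_pages`, and `has_enough_text` are hypothetical names, and the character-count heuristic is an assumption standing in for whatever test the real transform uses to decide that a page is text PDF. Because the text gives the default selection only as "1st and last 5", the page counts are left as explicit parameters.

```python
from typing import Callable, List, Sequence, Tuple


def select_pages(page_count: int, head: int, tail: int) -> List[int]:
    """Return sorted 0-based indices of the first `head` and last `tail` pages.

    Overlapping ranges (short documents) are de-duplicated.
    """
    head_pages = range(min(head, page_count))
    tail_pages = range(max(page_count - tail, 0), page_count)
    return sorted(set(head_pages) | set(tail_pages))


def has_enough_text(extracted: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a page counts as text PDF if text extraction
    produced at least `min_chars` non-blank characters."""
    return len(extracted.strip()) >= min_chars


def route_pages(
    page_texts: Sequence[str],
    wanted: Sequence[int],
    looks_like_text: Callable[[str], bool] = has_enough_text,
) -> Tuple[List[int], List[int]]:
    """Split the wanted page indices into (text_pdf_pages, image_pdf_pages).

    `page_texts[i]` is the text extracted from page i (empty for a scan).
    Text pages would go straight to raw IDM; image pages would be written
    to a new PDF and sent on to OCR.
    """
    text_pages: List[int] = []
    image_pages: List[int] = []
    for index in wanted:
        if looks_like_text(page_texts[index]):
            text_pages.append(index)
        else:
            image_pages.append(index)
    return text_pages, image_pages


if __name__ == "__main__":
    # A mixed document: a born-digital cover page followed by scanned pages.
    texts = ["REPORT DOCUMENTATION PAGE Form Approved"] + [""] * 7
    wanted = select_pages(len(texts), head=1, tail=5)
    print(route_pages(texts, wanted))
```

Passing the classifier in as a function keeps the routing logic independent of how text-likeness is actually judged, so a mixed document simply ends up with entries in both lists.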

The raw IDM versions of the text PDF pages and the image PDF pages are then merged and segmented, that is, organized into words, lines, paragraphs, regions, etc.
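As a rough illustration of the merge step, the sketch below recombines pages from the two paths in their original page order. The `RawPage` class, its fields, and the `"text"`/`"ocr"` labels are hypothetical and are not taken from the IDM definition. The word-splitting helper is only a placeholder for segmentation: it splits on whitespace and makes no attempt to rejoin word fragments, which would presumably need position information that this sketch omits.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class RawPage:
    page_number: int                 # position in the original document
    source: str                      # "text" or "ocr" (hypothetical labels)
    phrases: List[str] = field(default_factory=list)


def merge_pages(text_pages: Iterable[RawPage],
                ocr_pages: Iterable[RawPage]) -> List[RawPage]:
    """Combine pages from both paths, ordered by original page number.

    Each page should arrive from exactly one path, so a duplicate page
    number is treated as an error.
    """
    merged = sorted([*text_pages, *ocr_pages], key=lambda p: p.page_number)
    seen = set()
    for page in merged:
        if page.page_number in seen:
            raise ValueError(f"page {page.page_number} arrived from both paths")
        seen.add(page.page_number)
    return merged


def split_into_words(phrases: Iterable[str]) -> List[str]:
    """Placeholder segmentation: break multi-word phrases on whitespace."""
    words: List[str] = []
    for phrase in phrases:
        words.extend(phrase.split())
    return words


if __name__ == "__main__":
    text_side = [RawPage(1, "text", ["Report Documentation", "Page"])]
    ocr_side = [RawPage(2, "ocr", ["scanned body text"])]
    for page in merge_pages(text_side, ocr_side):
        print(page.page_number, page.source, split_into_words(page.phrases))
```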


Raw IDM differs from the true IDM format in two ways:

* the text may be grouped into phrases containing multiple words and/or parts of words

* all organizational structure required by IDM below the page level is optional. The only requirement is that each page contain at least one region.

Note that raw IDM is a superset of IDM – every IDM document is also a raw IDM document, but many raw IDM documents would not qualify as IDM.
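The superset relationship can be made concrete with a small validation sketch. The data model here is an assumption made purely for illustration: a region holds either loose `phrases` (allowed in raw IDM) or structured `lines` of single-word tokens, and the stricter check treats "at least one line, no loose phrases, one word per token" as a stand-in for the full structure that true IDM requires. The actual IDM rules are not spelled out in this section and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Region:
    phrases: List[str] = field(default_factory=list)      # loose, unsegmented text
    lines: List[List[str]] = field(default_factory=list)  # segmented words per line


@dataclass
class Page:
    regions: List[Region] = field(default_factory=list)


def is_raw_idm(pages: Sequence[Page]) -> bool:
    """Raw IDM's only structural requirement: every page has a region."""
    return all(len(page.regions) >= 1 for page in pages)


def is_idm(pages: Sequence[Page]) -> bool:
    """Assumed stand-in for true IDM: raw IDM plus full word/line structure.

    Word fragments cannot be detected by this check; it only rejects
    tokens that are empty or contain more than one word.
    """
    if not is_raw_idm(pages):
        return False
    for page in pages:
        for region in page.regions:
            if region.phrases or not region.lines:
                return False
            for line in region.lines:
                if not line or any(len(tok.split()) != 1 for tok in line):
                    return False
    return True


if __name__ == "__main__":
    raw_only = [Page([Region(phrases=["Report Documen", "tation Page"])])]
    strict = [Page([Region(lines=[["Report", "Documentation", "Page"]])])]
    print(is_raw_idm(raw_only), is_idm(raw_only))
    print(is_raw_idm(strict), is_idm(strict))
```

Because `is_idm` begins by calling `is_raw_idm`, anything it accepts is also accepted as raw IDM, which mirrors the superset relationship stated above.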


extract/system_architecture.txt · Last modified: 2009/07/22 12:55 by zeil