MXY-Tree – An approach based on relative location
F. Cesarini et al. encoded a document's cover page into an MXY-Tree and used it for document classification. An MXY-Tree recursively cuts a page into blocks along separators (e.g. lines) as well as white spaces. A strength of such approaches is that they are insensitive to the absolute positions of blocks and to the absolute spacing between blocks, because they mainly model the relative relationships among the blocks. They are therefore well suited to document pages like the samples shown in Figure 1, which contain blocks at different absolute locations but with similar relative relations. However, these approaches are sensitive to block identification, i.e. block boundary detection. Figure 2 shows two samples with bad block boundary detection produced by ScanSoft's OmniPage OCR tool. For these two similar samples, the OCR tool generated two different structures. An approach such as the MXY-Tree will fail to capture the similarity of these two documents, because their MXY-Trees are quite different: for example, the top part of the left sample cannot be vertically cut into two parts, but the top part of the right sample can.
Figure 1. Samples with Similar Structure
Figure 2. Samples of Bad Block Boundary Detection
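The recursive cutting that builds an (M)XY-Tree can be sketched as follows. This is a minimal illustration only, assuming blocks are given as (x1, y1, x2, y2) bounding boxes and cutting only on white gaps (the "modified" tree also cuts on detected line separators, which is omitted here); all names are illustrative, not from the paper.

```python
# Minimal sketch of a recursive XY-cut over block bounding boxes.
# Assumption: blocks are (x1, y1, x2, y2) tuples; separator (line) cuts
# from the modified XY-Tree are omitted for brevity.

def find_gap(blocks, axis):
    """Return a cut coordinate along `axis` (0 = x, 1 = y) lying in a
    white gap no block overlaps, or None if no such gap exists."""
    spans = sorted((b[axis], b[axis + 2]) for b in blocks)
    end = spans[0][1]
    for lo, hi in spans[1:]:
        if lo > end:              # white gap between `end` and `lo`
            return (end + lo) / 2
        end = max(end, hi)
    return None

def xy_cut(blocks):
    """Recursively cut blocks into a nested tree: ('V', [...]) for a
    vertical cut, ('H', [...]) for a horizontal cut, a block at a leaf."""
    if len(blocks) <= 1:
        return blocks[0] if blocks else None
    for axis, label in ((0, 'V'), (1, 'H')):
        cut = find_gap(blocks, axis)
        if cut is not None:
            left = [b for b in blocks if b[axis + 2] <= cut]
            right = [b for b in blocks if b[axis] >= cut]
            return (label, [xy_cut(left), xy_cut(right)])
    return ('LEAF', blocks)       # no separating white space found
```

Because the tree records only cut directions and block order, two pages whose blocks sit at different absolute positions but in the same relative arrangement yield the same tree shape.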
M*N Grid – An approach based on absolute location
J. Hu et al. used another way to measure page layout similarity. They partitioned a page into an m-by-n grid in which each cell was either a text cell (more than half of its area overlapped by a text block) or a white-space cell. By partitioning on absolute positions, this approach measures page similarity by the blocks' absolute positions. It is less sensitive to block boundary detection; however, it is sensitive to absolute position. This causes problems for pages with similar styles but blocks of different sizes (e.g. a page with one author block may differ from a page with 10 author blocks). Figure 3 demonstrates this limitation. In Figure 3, a black cell represents a text cell, and a white cell represents a white-space cell. Page "A" looks more similar to page "B" than to page "C". However, using J. Hu's approach, the similarity between page "A" and page "C" is 0.7, while the similarity between page "A" and page "B" is 0.4. In other words, measured by J. Hu's approach, page "A" is more similar to page "C" than to page "B".
Figure 3. Problem with Absolute Position Match
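The grid comparison and its sensitivity to absolute position can be sketched in a few lines. The toy grids below are illustrative, not the actual pages of Figure 3: similarity is simply the fraction of cells whose types match.

```python
# Sketch of grid-based page similarity: each page is an m x n grid of
# booleans (True = text cell, False = white-space cell); similarity is
# the fraction of same-type cells. Toy 2x2 grids for illustration only.

def grid_similarity(a, b):
    """Fraction of cells with the same type in two equal-size grids."""
    cells = [(x, y) for x, row in enumerate(a) for y, _ in enumerate(row)]
    hits = sum(a[x][y] == b[x][y] for x, y in cells)
    return hits / len(cells)

page_a = [[True, True], [False, False]]   # text block at the top
page_b = [[False, False], [True, True]]   # same block shifted to the bottom
print(grid_similarity(page_a, page_b))    # 0.0 despite identical structure
```

A shift of the same block to a different absolute position drives the score to zero, which is exactly the weakness Figure 3 illustrates.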
We used both the MXY-Tree and the M*N grid in our document classification. A document page is converted to an MXY-Tree as well as an M*N grid. Sequences of block types (graphic or text) are extracted from the MXY-Trees, and the similarity of two MXY-Trees is based on the edit distance between these sequences. Each cell in the m-by-n grid is identified as a text cell or a white-space cell based on whether more than half of its area is overlapped by text. Given two m-by-n grids for two pages, we call it a hit if a cell from one grid has the same type as the cell at the same location in the other grid (i.e. either both are text cells or both are white-space cells). The similarity between two pages can then be computed as the number of hits divided by the total number of cells, i.e. hits / (m × n). We classify documents into a class if they have similar MXY-Trees and similar m-by-n grids. In other words, documents are similar only when they have similar structures and most of their blocks occur in similar positions. We also check whether the graphic blocks on the pages have similar sizes and locations.
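The tree-comparison step can be sketched as follows, assuming each MXY-Tree is linearized into a string of block-type and cut-direction codes and compared by Levenshtein edit distance. The encoding alphabet here ('t', 'g', 'V', 'H') is illustrative, not the paper's exact choice.

```python
# Sketch of MXY-Tree similarity via edit distance on linearized trees.
# Assumption: trees are encoded as strings, e.g. 't' = text block,
# 'g' = graphic block, 'V'/'H' = cut direction (illustrative alphabet).

def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def tree_similarity(s, t):
    """Normalize edit distance into [0, 1]; 1.0 means identical sequences."""
    if not s and not t:
        return 1.0
    return 1 - edit_distance(s, t) / max(len(s), len(t))
```

A page pair would then be assigned to the same class only when both this tree similarity and the grid hit ratio exceed their thresholds.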
We applied the classification module to document sets of different sizes. In our experiments, we applied it to document sets with 200, 400, 800, 1200, 2000, 3000, 4000, 5000, 6000, and 7413 document pages respectively. The documents in each set were randomly selected from our test bed. We recorded the number of classes generated by our classification module for each document set. For document sets with a small number of pages, we repeated the process several times and used the average number of classes. The classification results are shown in Figure 4. They demonstrate that only a small number of classes is required for processing most documents in a relatively large collection. One limitation of this approach is that it tends to generate many singleton classes, partially due to imprecise OCR information.
Figure 4. Document Classification Result
This approach classifies documents into groups based on their publishers. The underlying assumption is that publisher information provides clues for document classification: in other words, documents in a collection from the same publisher tend to use similar layout and typographic features.
Basically, this approach is based on special text (e.g. organization names). It allows users to define which special text will be used for document classification. A page may contain more than one organization name, and some may not be significant for document classification. For example, a paper with several creators from different organizations may have multiple organization names, and these names may provide little clue for the document classification. To address this problem, we allow users to specify the location and the font size of a special text. For example, users can specify that a page with text “
<xs:element name="textStr" minOccurs="0" maxOccurs="unbounded" type="TextStr"/>
<xs:attribute name="name" type="xs:string" />
<xs:element name="stringmatch" minOccurs="0" maxOccurs="unbounded" type="StringMatch"/>
<xs:element name="patternmatch" minOccurs="0" maxOccurs="unbounded" type="StringMatch"/>
<xs:element name="inRelativeArea" minOccurs="0" maxOccurs="unbounded" type="InRelativeArea"/>
<xs:element name="relativeTo" minOccurs="0" maxOccurs="unbounded" type="RelativeTo"/>
<xs:attribute name="name" type="xs:string" />
<xs:attribute name="fontsize" type="xs:int" />
<xs:attribute name="relation" type="xs:string" />
<xs:attribute name="field" type="xs:string" />
<xs:attribute name="x1" type="xs:double" />
<xs:attribute name="x2" type="xs:double" />
<xs:attribute name="y1" type="xs:double" />
<xs:attribute name="y2" type="xs:double" />
<xs:attribute name="case" type="xs:boolean"/>
<xs:attribute name="loc" type="xs:string"/>
<xs:attribute name="distance" type="xs:int"/>
Figure 5. XML Schema for Classification with Affiliation Information (part)
A part of a sample file of document class definitions is shown in Figure 6. In this sample, we defined two classes: class “au” and class “au2”. According to this file, a document page in the class “au” contains two special texts: one is “AIR COMMAND AND STAFF COLLEGE” or “
<stringmatch loc="equal" isCase="true" distance="0">AIR COMMAND AND STAFF COLLEGE|AIR WAR COLLEGE</stringmatch>
Figure 6. Sample of Class Definition (part)
It is worth noting that even though this approach is designed for affiliation information, a user can use it to define any special text that is significant for document classification.
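The matching semantics suggested by the sample in Figure 6 can be sketched as follows. This is a reading of the sample, not a documented specification: alternatives separated by "|", `loc="equal"` meaning exact match, and a case-sensitivity flag; the "contain" mode is a hypothetical second option.

```python
# Illustrative matcher for one <stringmatch> rule, assuming the semantics
# suggested by the Figure 6 sample (not a documented specification).

def string_match(text, pattern, loc="equal", is_case=True):
    """Match `text` against "|"-separated alternatives in `pattern`."""
    alternatives = pattern.split("|")
    if not is_case:
        text = text.lower()
        alternatives = [a.lower() for a in alternatives]
    if loc == "equal":
        return text in alternatives
    if loc == "contain":              # hypothetical non-exact mode
        return any(a in text for a in alternatives)
    raise ValueError(f"unknown loc: {loc}")

rule = "AIR COMMAND AND STAFF COLLEGE|AIR WAR COLLEGE"
print(string_match("AIR WAR COLLEGE", rule))    # True
print(string_match("air war college", rule))    # False (case-sensitive)
```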
Image Based Classification
Purpose: The purpose of this proposal is to investigate the utility of using an off-the-shelf Content-Based Image Retrieval (CBIR) system to generate class identifications for template-based metadata extraction.
Approach: We will generate an image for each page of the document by rendering the text content as simplified graphic lines (i.e. rendering the page so that it appears as a “bird's-eye” view of the page). We will investigate using a CBIR system such as the GNU Image-Finding Tool (GIFT) [ http://www.gnu.org/software/gift/ ] to group and recognize the page images. We will load the CBIR system with representative page images and correlate the pages to specific page types. From these page types we will identify those which contain metadata and those which uniquely identify different document classes. From this we should then be able to use the template method to extract the metadata.
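The rendering step could be sketched as below: each text block's bounding box is drawn as a filled bar in a small raster, so the CBIR system sees page layout rather than glyphs. A dependency-free pure-Python raster is used here for illustration; the block coordinates and raster size are assumptions, not the project's actual parameters.

```python
# Sketch of the "bird's-eye" page rendering: text-block bounding boxes
# (in page coordinates) become filled bars in a small binary raster.
# Pure Python to stay dependency-free; all data below is illustrative.

def render_layout(blocks, page_w, page_h, out_w=64, out_h=64):
    """Return an out_h x out_w binary raster (1 = text) of the page layout."""
    img = [[0] * out_w for _ in range(out_h)]
    for x1, y1, x2, y2 in blocks:
        # scale page coordinates down to raster cells
        cx1, cx2 = int(x1 / page_w * out_w), int(x2 / page_w * out_w)
        cy1, cy2 = int(y1 / page_h * out_h), int(y2 / page_h * out_h)
        for r in range(cy1, min(cy2 + 1, out_h)):
            for c in range(cx1, min(cx2 + 1, out_w)):
                img[r][c] = 1
    return img

# A page with a wide title bar and a narrower author block:
page = render_layout([(50, 30, 550, 70), (200, 100, 400, 130)], 600, 800)
```

The resulting rasters can be fed to the CBIR system as ordinary images for grouping and retrieval.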
Plan of Attack:
1. Develop a program to generate the simplified images for each page.
2. Install and test the capabilities of the CBIR program (GIFT or some other program).
3. Develop procedures to load the images into the CBIR system and correlate them to specific files and document classes.
4. Investigate the capability of the CBIR system to cluster pages into manageable groups for classes.
5. Identify several classes and develop templates for extraction.
6. Develop a system to exploit the clustering and test its ability to properly identify and extract metadata from the classes.
Measures of success: Once we have a suitably large sample of images in the CBIR system, if we are able to successfully match 80% of page types, further investigation and development will be warranted.
Pages Of INTerest (POINT) location problem
Problem statement: In order to apply any metadata extraction process, we need to locate those pages that have metadata to be extracted. The pages of interest include pages traditionally referred to as 'cover' pages, 'front' pages, and 'title' pages, but also table-of-contents pages, reference pages, form pages, and the like.
Rule based: Develop a set of rules that, if triggered together, determine whether or not a page is a candidate POINT. For example, a rule might be that a page is more than half white space.
Advantages and disadvantages: The rules developed are fairly easy to detect and calculate while reading a file. Depending on the combining logic used, the failure of one or more rules allows for fail-fast detection. In experiments we attempted to optimize a set of 10 rules to detect manually identified POINTs. We found that we could detect 3 mutually exclusive sets of POINTs: traditional white-space-dominant pages, scientific pages with a title header, and form pages. We attempted to optimize the rules for each set and found that the number of false positives increases as the rules are liberalized to catch a greater percentage of POINTs.
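The white-space rule mentioned above can be sketched in a few lines, assuming text blocks arrive as non-overlapping bounding boxes; the function names and the 0.5 threshold mirror the "more than half white space" example rather than a fixed implementation.

```python
# Sketch of one POINT rule: flag a page as a candidate when more than
# half of its area is white space. Assumption: text blocks are given as
# non-overlapping (x1, y1, x2, y2) boxes in page coordinates.

def whitespace_fraction(blocks, page_w, page_h):
    """Fraction of the page area not covered by any text block."""
    covered = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in blocks)
    return 1 - covered / (page_w * page_h)

def is_point_candidate(blocks, page_w, page_h, threshold=0.5):
    """Trigger the rule when the page is more than `threshold` white space."""
    return whitespace_fraction(blocks, page_w, page_h) > threshold

# Sparse cover page: one small title block on a 600 x 800 page
print(is_point_candidate([(100, 100, 500, 200)], 600, 800))   # True
```

In a full rule set, several such predicates would be combined, and a failing cheap rule can short-circuit the more expensive ones (the fail-fast behavior noted above).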
Average document page: Add to the rule-based approach additional rules based on various statistical data found in the document (number of words, font size, dominant font, etc. of a page).
Advantages and disadvantages: Data such as the difference in average font size for the page are very good discriminators for the traditional white-space set of pages. Unfortunately, the cost of calculating the statistics is fairly high, since it requires reading the entire document rather than only the first and last 5 pages. This cost can be mitigated if the entire file is processed anyway to transform it into an alternate document model for extraction.
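The font-size discriminator can be sketched as follows. The input format (per-page lists of (font_size, character_count) runs) is an assumption made for illustration; cover pages tend to score well above the document-wide average.

```python
# Sketch of the statistical font-size rule: compare a page's average font
# size to the document-wide average. Assumption: text is available as
# (font_size, char_count) runs; all data below is illustrative.

def avg_font_size(runs):
    """Character-weighted mean font size over (size, chars) runs."""
    total = sum(chars for _, chars in runs)
    return sum(size * chars for size, chars in runs) / total

def font_size_deviation(page_runs, doc_runs):
    """How much larger (in points) the page's average font is than the document's."""
    return avg_font_size(page_runs) - avg_font_size(doc_runs)

doc = [(10, 5000), (12, 500)]    # body text dominates the whole document
cover = [(24, 80), (14, 120)]    # large title and author fonts on the cover
print(font_size_deviation(cover, doc))   # clearly positive: likely a POINT
```

Note that `doc_runs` requires a pass over the whole document, which is the cost discussed above; it comes for free if the file is already being converted to an alternate document model.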
Cesarini F., Lastri M., Marinai S., and Soda G. Encoding of modified X-Y trees for document classification. In Proc. Sixth ICDAR, pages 1131–1136, 2001.