Julien Lerouge

Senior Data Scientist @ QuickSign
  • Deep learning
  • Image processing
  • Document analysis & understanding (classification, OCR, NLP)

Publication

Page Retrieval System in Digitized Historical Books Based on Error-Tolerant Subgraph Matching

1 LATIS Laboratory, Sousse University, National Engineering School of Sousse, 4023, Sousse Erriadh, Tunisia
2 Normandie Université, LITIS EA 4108, University of Rouen, 76801, Saint-Etienne du Rouvray, France
3 L3I Laboratory, University of La Rochelle, av M. Crépeau, 17042 La Rochelle Cedex 1, France

Abstract:

Developing smart ways of interacting with scanners is one of the emerging needs identified by numerous digitization professionals. To achieve better interaction with scanners, the research community in historical document image analysis is particularly interested in providing reliable tools for computer-aided indexing and retrieval of historical document images. In this article, we therefore propose a method that retrieves, from a digitized historical book, the pages whose layout and/or content match a user-defined query. Among the possible user-defined queries, we focus on transition pages (e.g. chapter title pages, end-of-chapter and end-of-act pages) and on pages containing a particular content component or a group of patterns (e.g. ornaments, illustrations and drop caps). The method first uses low-level features (texture, shape and geometric descriptors) to represent each page as a graph-based signature. Then, an error-tolerant subgraph isomorphism algorithm estimates a set of costs that measure the similarity between the user-defined query, formulated as a pattern graph, and the subgraphs of the book page signatures; these costs are used to find the book pages similar to the query. To illustrate the effectiveness of the proposed method, a thorough experimental study has been conducted, with quantitative observations obtained from a large number of queries having different contents and structures.
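
The matching step described in the abstract can be illustrated with a small Python sketch. It is not the algorithm used in the article: it is a brute-force stand-in that tries every injective mapping of pattern nodes onto page nodes, and it is only practical for very small graphs. The function names (match_cost, rank_pages), the two-dimensional feature vectors, the Euclidean node cost and the fixed missing_edge_cost penalty are all assumptions made for illustration.

```python
from itertools import permutations
from math import dist


def match_cost(pattern_nodes, pattern_edges, page_nodes, page_edges,
               missing_edge_cost=1.0):
    """Cheapest injective mapping of the pattern graph into a page graph.

    Nodes are dicts {node_id: feature_vector}; edges are lists of
    (node_id, node_id) pairs, treated as undirected.
    Hypothetical cost model: Euclidean distance between node features,
    plus a fixed penalty for each pattern edge absent from the page.
    """
    page_edge_set = {frozenset(edge) for edge in page_edges}
    pattern_ids = list(pattern_nodes)
    best_cost, best_mapping = float("inf"), None

    # Brute force: every ordered choice of distinct page nodes.
    for candidate in permutations(page_nodes, len(pattern_ids)):
        mapping = dict(zip(pattern_ids, candidate))
        cost = sum(dist(pattern_nodes[p], page_nodes[mapping[p]])
                   for p in pattern_ids)
        cost += sum(missing_edge_cost
                    for u, v in pattern_edges
                    if frozenset((mapping[u], mapping[v])) not in page_edge_set)
        if cost < best_cost:
            best_cost, best_mapping = cost, mapping

    return best_cost, best_mapping


def rank_pages(pattern_nodes, pattern_edges, pages, missing_edge_cost=1.0):
    """Sort pages by increasing matching cost (most similar first)."""
    scored = [
        (match_cost(pattern_nodes, pattern_edges, nodes, edges,
                    missing_edge_cost)[0], name)
        for name, (nodes, edges) in pages.items()
    ]
    return sorted(scored)


# Toy query: two components linked by one edge (made-up features).
pattern_nodes = {"a": (0.0, 0.0), "b": (1.0, 0.0)}
pattern_edges = [("a", "b")]

# Two toy page signatures; page_2 lacks the edge the query asks for.
pages = {
    "page_1": ({"x": (0.0, 0.0), "y": (1.0, 0.0), "z": (5.0, 5.0)},
               [("x", "y")]),
    "page_2": ({"x": (0.0, 0.0), "y": (1.0, 0.0)}, []),
}

ranking = rank_pages(pattern_nodes, pattern_edges, pages)
```

In this toy setting, page_1 contains an exact copy of the query and ranks first, while page_2 matches the node features but pays the missing-edge penalty. The exhaustive search grows factorially with graph size, so a realistic page retrieval system would need a far more efficient formulation of the same cost minimization.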