Julien Lerouge


PIVAJ: an article-centered platform for digitized newspapers

LITIS EA 4108, BP 12, University of Rouen, 76801, Saint-Etienne du Rouvray, France

Abstract:

PIVAJ is a platform for archives of digitized newspapers that is centered on articles: it extracts them from digitized pages by automated page layout analysis, runs OCR on them, and indexes the resulting transcriptions so that users can search their content. Crowdsourcing is used to improve the quality of the indexing: users correct the transcriptions and tag articles with keywords. The platform has been used to give Web access to 550,000 articles generated from a digitized local newspaper. Current developments include further improvements to its OCR as well as graphical interfaces for managing the platform.