Keywords: document database management; document visual similarity; similarity pyramid; Isomap
Abstract: Managing large document databases has become an important task. Two desirable applications are sorting documents by their visual similarity and layout features, and visualizing the whole document database. A user may wish to search a database for documents whose stylistic features are similar to those of a query, or may want to browse the whole database. For these tasks, clustering similar documents and organizing the database around the clusters is preferable to presenting documents in a random order. In this paper, we propose organizing single-page documents in a 3-D hierarchical structure called a similarity pyramid. The pyramid is constructed from a stack of embeddings of the document database on a 2-D surface, computed with a nonlinear dimensionality reduction algorithm called Isomap. The mapping algorithm preserves similarity distances between documents: documents that are close to each other in the feature space are mapped to points that are close to each other on the low-dimensional surface. Higher levels of the pyramid consist of document image icons that each represent a large group of roughly similar documents, whereas lower levels contain document image icons that each represent a small group of very similar documents. A user can browse the database by moving along a given level of the pyramid or by moving between levels.
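To make the embedding step concrete, the following is a minimal pure-Python sketch of the general Isomap idea: build a neighbourhood graph, take shortest-path (geodesic) distances, then apply classical multidimensional scaling. It is not the implementation used in this paper. The function names, the `radius` neighbourhood rule, the power-iteration eigen-solver, and the toy arc-shaped feature vectors are all assumptions chosen for illustration.

```python
import math

def geodesic_distances(points, radius):
    """Shortest-path distances over a graph linking points closer than `radius`.

    `radius` is a hypothetical neighbourhood parameter; the graph must be
    connected, otherwise some distances stay infinite.
    """
    n = len(points)
    inf = float("inf")
    g = [[inf] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d <= radius:
                g[i][j] = g[j][i] = d
    # Floyd-Warshall all-pairs shortest paths
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g

def classical_mds(d, dims=2, iters=200):
    """Embed a distance matrix in `dims` dimensions via classical MDS.

    Eigenvectors are found by simple power iteration with deflation, which
    is adequate for a small sketch but not for large databases.
    """
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    tot = sum(row) / n
    # Double-centred Gram matrix B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + tot) for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for c in range(dims):
        v = [math.sin(i + 1.0 + c) for i in range(n)]  # deterministic start
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract in this direction
                v = [0.0] * n
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][c] = v[i] * scale
        # Deflate so the next pass finds the next component
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Toy "feature vectors": ten points along a half-circle (illustrative only).
features = [(math.cos(k * math.pi / 9), math.sin(k * math.pi / 9)) for k in range(10)]
geo = geodesic_distances(features, radius=0.4)
emb = classical_mds(geo, dims=2)
```

In this toy case the two end points are 2 units apart in a straight line but farther apart along the arc, and the embedding reproduces the along-the-arc distance. That is the sense in which nearby documents in feature space stay nearby on the low-dimensional surface.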