Delete duplicate pages from the document.

The plug-in provides two different methods for detecting duplicate or near-duplicate pages:

Compare Page Text Only
Use this method to compare page text regardless of its visual appearance. It computes page similarity based on text content only and completely ignores text appearance, layout, images and graphics. It is the best method to detect duplicates in most document types.

Compare Visual Appearance of the Pages
This method compares pages "as images" and detects pages that look exactly the same. This method does not compare any invisible text that may be present on the page. It is not advised to use this method on scanned paper documents.

Using Scanned Paper Documents
Quite often this operation is used to find duplicate pages in scanned paper documents. Scanned documents need to be OCRed prior to using them for any text-based processing. OCR is the process of recognizing text in scanned documents and making them searchable.

It is essential to understand that text recognition in scanned documents is prone to errors. The number of errors depends on the scanning resolution and the quality of the original document. In most common cases, a scanned page may contain between 1 and 10 recognition errors where certain letters are misidentified. For example, depending on the font, the lowercase letter l can look exactly like the numeral 1. The uppercase letter O is often misidentified as the numeral 0, or the uppercase letter S as the numeral 5. Since many alphanumeric symbols share similar, or identical, physical characteristics, differentiation often poses a challenge.

This is why a similarity-based comparison is useful: it can still match pages whose text differs only by the small differences produced by the text recognition process.
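
To illustrate why a similarity score copes with OCR errors better than an exact text match, here is a minimal sketch in Python. It is not the plug-in's actual algorithm, which the text does not describe. It assumes page text has already been extracted into plain strings, and the names `normalize`, `page_similarity` and `find_near_duplicates`, as well as the 0.9 threshold, are hypothetical choices made for this example.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not count."""
    return " ".join(text.lower().split())


def page_similarity(a: str, b: str) -> float:
    """Return a score from 0.0 (nothing in common) to 1.0 (identical text)."""
    matcher = SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


def find_near_duplicates(
    pages: list[str], threshold: float = 0.9
) -> list[tuple[int, int, float]]:
    """Return (earlier_index, later_index, score) for every pair of pages
    whose similarity is at or above the threshold (an assumed cutoff)."""
    matches = []
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            score = page_similarity(pages[i], pages[j])
            if score >= threshold:
                matches.append((i, j, score))
    return matches


if __name__ == "__main__":
    pages = [
        "Invoice 2015 Total due: 100 USD. Please pay within 30 days of receipt.",
        # Same page scanned again, with typical OCR confusions (I/l, 0/O).
        "lnvoice 2O15 Total due: 1OO USD. Please pay within 30 days of receipt.",
        "A completely different page about shipping terms.",
    ]
    for first, second, score in find_near_duplicates(pages):
        print(f"pages {first} and {second} look like duplicates ({score:.2f})")
```

An exact comparison would treat the first two pages as different because of four misrecognized characters, while the similarity score stays high enough to flag them. This sketch compares every pair of pages, so its cost grows quadratically with the page count; that is acceptable for short documents but would need a cheaper pre-filter for large ones.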