Complete PDF Indexing - Normal and Scanned

While most people think of PDF as a single format, there are in fact several kinds:
- PDF Normal
When you create a PDF file from a desktop application, such as Microsoft Word or QuarkXPress, the PDF contains internal information that a search indexer can use. This includes text, font and positioning information. What matters when indexing PDF Normal documents is the speed of reading and indexing, which is exactly where ScanSoft technology excels.
- PDF Image
When a PDF file is created from a scanner, or from an online fax service such as eFax, the PDF does not contain text information (other than the file name). The PDF is an image of the document, similar to a photograph. ScanSoft's OCR technology re-creates the text information from the image content without changing the original file. This is important, especially if the image document has legal implications, such as a receipt, contract or correspondence.
- PDF Image+Text
This kind of PDF file is a hybrid of the two types described above. The visual document is a PDF image, but there is a hidden layer of text behind the image, which ideally matches the image. This isn't always the case, since a user can edit the text without changing the image representation. ScanSoft's PDF Overlay matching technology looks at the image and the hidden text, and compares the two.
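
The comparison of hidden text against image content described for PDF Image+Text files can be illustrated with a minimal sketch. This is not ScanSoft's PDF Overlay matching technology, whose internals are not described here. It assumes the hidden text layer and the OCR output are already available as plain strings, and it uses Python's standard `difflib` as a stand-in similarity measure. The function names and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def overlay_similarity(hidden_text: str, ocr_text: str) -> float:
    """Return a 0.0-1.0 similarity ratio between the hidden layer and OCR output."""
    matcher = SequenceMatcher(None, normalize(hidden_text), normalize(ocr_text))
    return matcher.ratio()


def overlay_matches(hidden_text: str, ocr_text: str, threshold: float = 0.9) -> bool:
    """Treat the hidden layer as trustworthy only above a (hypothetical) threshold."""
    return overlay_similarity(hidden_text, ocr_text) >= threshold
```

An indexer built along these lines could index the hidden text when `overlay_matches` returns true and fall back to the OCR result otherwise, so that an edited text layer does not misrepresent what the page image actually shows.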
Formats and Languages Indexed

ScanSoft's OmniPage Capture SDK, which was used to develop the OmniPage Search Indexer, supports a wide array of file formats for both import and export. The product also provides OCR for over 120 languages, covering Latin, Cyrillic and Asian (Chinese, Japanese and Korean) scripts. ScanSoft is also developing Arabic OCR under contract with a government agency.
To reduce the download size of the OmniPage Search Indexer, ScanSoft has limited the initial beta release to the following:
- Formats - PDF image, image+text, normal; JPG/JPEG; TIF/TIFF; PaperPort MAX
- Languages - English (US/UK), French, German, and Italian.
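
As a rough illustration of the beta format list above, a file crawler could pre-filter candidates by extension before handing them to an indexer. The sketch below is not part of the OmniPage Search Indexer. The set name and function are hypothetical, and the `.max` extension for PaperPort MAX files is an assumption.

```python
from pathlib import Path

# Extensions corresponding to the beta format list (".max" is assumed for PaperPort MAX).
BETA_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".tif", ".tiff", ".max"}


def is_indexable(path: str) -> bool:
    """Return True if the file extension is in the beta format list (case-insensitive)."""
    return Path(path).suffix.lower() in BETA_EXTENSIONS
```

For example, `is_indexable("scan.TIFF")` is true, while `is_indexable("notes.docx")` is false.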