Pozotron is Reading my PDF Incorrectly

In addition to OCR issues, a few other things can cause Pozotron not to import your script the way you intend.

Article Summary

PDF formatting issues
Font ligatures
Graphics in your script
Copy and paste text for text formatting issues
PDFs with columns
PDFs with double-page spreads
PDFs with captions and footnotes

Have OCR issues? See: How to fix a PDF error with Optical Character Recognition (OCR)

PDF formatting issues

Occasionally, Pozotron cannot parse a PDF script correctly.

If Pozotron’s pre-scan of your script shows that there may be formatting issues with your PDF, it may return an error message suggesting you get a cleaner version of a script to upload.

If possible, get a version of the script that hasn’t been compressed and that was converted to PDF with its native fonts embedded.

PDFs can look fine to the eye, but if the content can’t be read as text, then search functions won’t work inside the document.

Pro Tip: To check whether your PDF will be read correctly, try copying its text and pasting it into a Word document or similar. If the text comes out normal, you are good to go! If it comes out looking incorrect, you may have issues and should contact our support team at help@pozotron.com.

A quick formatting test

One way to check what Pozotron’s transcription will look like is by uploading your script and then exporting the full text (using the ‘full text’ option in the export menu).

Using that option will export a .html file of the raw text from Pozotron’s transcription.

If any of the words haven’t been transcribed properly, you’ll be able to see that as you scan through the document. There is no charge for this test; it’s available from the main menu as a functional courtesy for anyone to review.
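If you would rather scan the exported file with a script than by eye, the sketch below pulls the plain text out of an HTML file using only Python’s standard library. It is a minimal illustration, not part of Pozotron: the internal structure of the export is not documented here, so the code simply collects every text node it finds, and the file name in the usage note is hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node in an HTML document (including any script/style text)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the document's text with runs of whitespace collapsed to single spaces."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Usage (hypothetical file name for the exported full text):
# with open("full_text.html", encoding="utf-8") as f:
#     print(html_to_text(f.read()))
```

The resulting text can then be searched or compared against the original script in any text editor.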

Font ligatures

Occasionally, PDFs have a font issue called ligatures, where two text characters touch each other or are drawn as one joined glyph.

Because there is no whitespace around each individual character, Pozotron’s PDF parser cannot properly detect the characters separately.

Ligatures are most often seen with the letter combinations “fi”, “ff” and “fl”, but this depends on the font used in the script.

When a ligature is present, Pozotron will show the two characters as an unreadable section of text (usually a blank space with a red underline) and note an annotation for an added word.

To resolve this, follow the steps found on our PDF OCR Error support page.
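When text copied out of a PDF contains ligatures, they often arrive as single Unicode characters rather than as separate letters. The sketch below is one way to spot and expand them in pasted text using Python’s standard library. It only illustrates the idea: it is not how Pozotron’s parser works, the function names are made up for this example, and it only covers ligatures that the PDF maps to the Unicode code points U+FB00 to U+FB06.

```python
import unicodedata

# Latin ligature code points in Unicode's Alphabetic Presentation Forms block:
# U+FB00 (ff), U+FB01 (fi), U+FB02 (fl), U+FB03 (ffi), U+FB04 (ffl), U+FB05/U+FB06 (st)
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def find_ligatures(text):
    """Return (index, ligature character, expanded letters) for each ligature found."""
    return [
        (i, ch, unicodedata.normalize("NFKC", ch))
        for i, ch in enumerate(text)
        if ord(ch) in LIGATURE_RANGE
    ]


def expand_ligatures(text):
    """Replace ligature characters with their separate letters; leave other text alone."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ord(ch) in LIGATURE_RANGE else ch
        for ch in text
    )


sample = "\ufb01nal o\ufb00er"       # "final offer" written with fi and ff ligatures
print(find_ligatures(sample))
print(expand_ligatures(sample))
```

If `find_ligatures` returns anything for text pasted from your script, the PDF is likely to show the ligature problem described above.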

Graphics in your script

For the most part, Pozotron will just skip over graphics in a PDF script. However, there are times when text is presented as a graphic and this can cause a slight discrepancy.

For example, if “Introduction”, “Chapter One” or “No Trespassing” are the headers/headlines/chapter names of a section, but are illustrations or graphics of that text instead of actual letters, Pozotron will skip over those because they aren’t readable as text.

That’s why the original script is always one click away, so you can double-check it whenever the software thinks you’re adding those words to the project.

Additionally, part of a script is sometimes presented as handwriting (a handwritten letter or sign, for example). If that writing is not parseable by a text editor (like Word), then Pozotron won’t pick it up either.

Copy and paste text for text formatting issues

If you’re not sure if the text is computer-readable or not, you can do a quick copy-and-paste test.

Drag over all of the text on a PDF page and hit ‘copy’. Then, in Word or another text editor, hit ‘paste’.

If the pasted text is not what you expect to see, then there is a missing font, compression issue, or security issue in your PDF script.

If a text editor doesn’t understand the content, Pozotron’s reader can’t either.
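For long scripts, the same copy-and-paste check can be roughly automated. The sketch below counts characters in the pasted text that usually signal extraction damage: the Unicode replacement character, private-use characters (a common result of fonts that were not embedded properly), unassigned code points, and stray control characters. The function names and the 2% threshold are assumptions chosen for illustration, not values used by Pozotron.

```python
import unicodedata

# Unicode categories that rarely appear in clean prose:
# Co = private use, Cn = unassigned, Cc = control characters
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc"}


def suspicious_ratio(text):
    """Fraction of non-whitespace characters that look like extraction damage."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 1.0  # nothing pasted at all is itself a warning sign
    bad = sum(
        1
        for ch in visible
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(visible)


def looks_readable(text, max_ratio=0.02):
    """True when at most max_ratio of the visible characters look damaged."""
    return suspicious_ratio(text) <= max_ratio
```

A passing result does not guarantee a clean import, since text can be made of valid characters and still be wrong, so it is still worth reading a few pasted pages yourself.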

PDFs with columns

When PDFs are laid out in columns, Pozotron can have difficulty figuring out where those column breaks are as it reads from left to right.

That means books with columns are not great candidates for the digital accuracy check.

If there are graphics interspersed with these columns, the likelihood of a clean transcript goes down further.

One workaround is to duplicate each page that has columns and then black out or redact one column on each copy, which effectively produces a linear version of the script. Pozotron has no issue reading the result.
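To see why columns cause trouble, consider a toy page where each text fragment has a position. The sketch below compares reading strictly row by row, left to right, with reading one column at a time. The coordinates, fragment text and function names are invented for illustration; this is not Pozotron’s actual parsing logic.

```python
# Each fragment is (x, y, text): x grows to the right, y grows down the page.
# Toy values for a two-column page.
fragments = [
    (50, 100, "Left line 1"), (350, 100, "Right line 1"),
    (50, 120, "Left line 2"), (350, 120, "Right line 2"),
]


def naive_order(frags):
    """Read row by row, left to right, ignoring column boundaries."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]


def column_order(frags, split_x):
    """Read everything left of split_x top to bottom, then everything right of it."""
    left = sorted((f for f in frags if f[0] < split_x), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= split_x), key=lambda f: f[1])
    return [text for _, _, text in left + right]


print(naive_order(fragments))        # the two columns come out interleaved
print(column_order(fragments, 300))  # left column first, then right column
```

Redacting one column per duplicated page has the same effect as `column_order`: each page copy contains only one column, so there is nothing left to interleave.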

PDFs with double-page spreads

Some PDFs include two book pages on a single PDF page (this looks like a ‘double spread’).

The solution is to duplicate each double-spread page and then crop out one side on the even pages and the other side on the odd pages.

This process can be done quite quickly in Adobe Acrobat and is a very effective way to make single-page scripts that work well with Pozotron.

Click here for a video on how to fix the double-page spread issue.
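The duplicate-and-crop approach comes down to simple arithmetic on page dimensions. The sketch below computes the crop box for each half of a spread and lists the halves in reading order. It assumes the gutter sits exactly in the middle of the page and that the book reads left to right; the function names are hypothetical, and the boxes would still need to be applied in a PDF editor such as Adobe Acrobat.

```python
def split_spread(width, height):
    """Return crop boxes (x0, y0, x1, y1) for the left and right halves of a spread."""
    mid = width / 2  # assumes the gutter is exactly at the centre of the page
    left = (0, 0, mid, height)
    right = (mid, 0, width, height)
    return left, right


def single_page_plan(spread_sizes):
    """List (spread_index, side, crop_box) tuples in reading order."""
    plan = []
    for index, (width, height) in enumerate(spread_sizes):
        left, right = split_spread(width, height)
        plan.append((index, "left", left))
        plan.append((index, "right", right))
    return plan


# Example: two landscape spreads, each 1224 x 792 points (two US Letter pages side by side)
for step in single_page_plan([(1224, 792), (1224, 792)]):
    print(step)
```

Each spread yields two entries, which matches the duplicate-then-crop steps: one copy keeps the left half and the other keeps the right half.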

PDFs with captions and footnotes

Captions and footnotes are fine inside Pozotron if they are read in the order they appear in the book.

Reading footnotes inline (bringing them up from the bottom and placing them where the cue is within the text) will cause issues with the matching analysis.

Skipping footnotes or captions will also cause the misread detection process to trigger.

If the narrator won’t be reading certain footnotes and captions, blacking them out or redacting them in advance is your best bet for accurate reporting, though ignoring the errors on the annotation page also works fine for final reporting.