Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindl

slayda · #42 10-21-2010, 12:22 PM

Once you have the XHTML files, is there an 'easy' way to extract the images so you can run your own OCR on them? I can't find a way to create any image format from the XHTML pages. I believe I could do a better job of OCRing them than Amazon has done.

Edit: I figured it out. Once you have the .xhtml files from TopazFiles2SVG.pyw, you can open Sigil and add them as described in the first post (which doesn't give much explanation). Here is more detail: when you start a new project in Sigil, you'll notice some folder names in the left frame. One of these is "Text". Add all the .xhtml files there. If the images don't load automatically, add the image files from the img folder produced by TopazFiles2SVG.pyw to the Sigil "Images" folder. Save the project as an ePub book. Since the pages are really scanned images, this ePub will be very large. I then used Calibre to convert the ePub to a PDF, so I can run my own OCR program on it and create a text version of the ebook.
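
For anyone who would rather feed the page images straight to an OCR program, here is a rough Python sketch that lists the image files each .xhtml page points to. It rests on assumptions not confirmed in this thread: that the pages reference their images through src, href or xlink:href attributes, and that the images are ordinary .jpg, .png or .gif files. The names find_image_refs and collect_images are made up for illustration and are not part of TopazFiles2SVG.pyw.

```python
from html.parser import HTMLParser
from pathlib import Path

# Assumption: the page images use one of these common extensions.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


class ImageRefCollector(HTMLParser):
    """Collects attribute values that look like links to image files."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        for name, value in attrs:
            if name in ("src", "href", "xlink:href") and value:
                if Path(value).suffix.lower() in IMAGE_EXTENSIONS:
                    self.refs.append(value)


def find_image_refs(xhtml_text):
    """Return the image paths referenced in one XHTML page, in document order."""
    collector = ImageRefCollector()
    collector.feed(xhtml_text)
    collector.close()
    return collector.refs


def collect_images(xhtml_dir):
    """Return (page name, image path) pairs for every .xhtml file in a folder."""
    pairs = []
    for page in sorted(Path(xhtml_dir).glob("*.xhtml")):
        text = page.read_text(encoding="utf-8", errors="replace")
        for ref in find_image_refs(text):
            # References are resolved relative to the page's own folder.
            pairs.append((page.name, page.parent / ref))
    return pairs
```

Calling collect_images on the folder that holds the .xhtml files returns the pages in sorted filename order with the image paths each one references, which can then be handed to whatever OCR program you prefer. If a page comes back with no images, the assumptions above probably don't match how your files are laid out.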

I hope this helps others understand the final steps of the process. I realize these steps are simple compared to the much more complex programming behind the original poster's steps for obtaining the image files (i.e. the .xhtml files), but for me they were what it took to generate my final deDRMed ebook.