How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

(http://stream-recorder.com/forum/showthread.php?t=5426)

Quote:

Topaz is an Amazon format for Kindle devices. It differs from the AZW format in that it can have embedded fonts in the file itself. A .tan sidefile is used to store metadata and bookmarks and other user generated content on the eBook. The metadata is used to help the library mode to reference information about the eBook itself.

While not much is currently known about the internal format used in a Topaz file there is some likelihood that it is related to the standard AZW format. It uses a different compression than standard MOBI files and it can have embedded fonts in the file allowing more complex display using font sets and characters that are not standard to Amazon Kindle. It is also likely to remove other restrictions found in MOBI files such as image size limitations although some of these may have been removed in AZW as well.

According to one publishing industry blogger, Topaz is an implementation of the open EPUB standard. It follows the OEBPS 2.0 specs, and probably the later IDPF guides. It’s a proprietary implementation which means they use ePUB as the source but then convert it to their internal format.

AZW1 - is an eBook in the Topaz (TPZ) format that has been delivered via Whispernet.

TPZ - is an eBook in the Topaz format that has been delivered via Internet download.

The following is experimental and it will probably not work for you but…

ALSO: Please do not use any of this to steal. Theft is wrong.

This is only meant to allow conversion of Topaz books for other book readers you own.

Here are the steps:

First, you must use the Python scripts in topazscripts.zip to do the translation from Topaz to HTML.

The files you should have after unzipping are:

cmbtc_dump.py – (author: cmbtc) decrypts the book and dumps all of its sections to files, properly numbered and named

decode_meta.py – converts metadata0000.dat to human readable text

convert2xml.py – converts page*.dat, other*.dat, and glyphs*.dat files to their “pseudo” xml descriptions.

flatxml2html.py – converts a “flattened” xml description to html using the ocrtext and markup as its basis.

stylexml2css.py – converts stylesheet “flattened” xml from other0000.dat into css (as best it can – mainly supporting paragraph style classes)

genxml.py – main program to convert everything to xml

genhtml.py – main program to generate “book.html”

You must remove the DRM from the Topaz book and build a directory of its contents using the following command:

cmbtc_dump.py -d -o TARGETDIR [-p pid] YOURTOPAZBOOKNAMEHERE

This should create a directory called “TARGETDIR” in your current directory.

It should have the following files in it:

metadata0000.dat – metadata info
other0000.dat – information used to create a style sheet
dict0000.dat – dictionary of words used to build page descriptions
page – directory filled with page*.dat files
glyphs – directory filled with glyphs*.dat files
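
Before moving on, it may help to confirm that the dump actually produced everything in the listing above (a later reply in this thread reports a "Can not find dict0000.dat file" error). The following is only a minimal sketch: the function name missing_entries is made up for illustration and is not part of topazscripts.zip, and it checks for nothing beyond the names listed above.

```python
from pathlib import Path

# Names taken from the listing above; missing_entries is a hypothetical helper.
EXPECTED_FILES = ("metadata0000.dat", "other0000.dat", "dict0000.dat")
EXPECTED_DIRS = ("page", "glyphs")


def missing_entries(target_dir):
    """Return the expected files and directories that are absent from target_dir."""
    root = Path(target_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    import sys

    # Defaults to the placeholder directory name used in this post.
    problems = missing_entries(sys.argv[1] if len(sys.argv) > 1 else "TARGETDIR")
    if problems:
        print("Missing from dump directory:", ", ".join(problems))
    else:
        print("Dump directory looks complete.")
```

If anything is reported missing, the dump step probably did not finish, and the later conversion steps are likely to fail.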

You should then convert the files in “TARGETDIR” to their XML descriptions.
Please note that this Python program uses “decode_meta.py” and “convert2xml.py”, so don’t move them.

genxml.py TARGETDIR

Next, attempt a conversion to HTML, where “TARGETDIR” is the directory that was created by cmbtc_dump.py above.
Please note that this Python program uses “decode_meta.py”, “convert2xml.py”, “flatxml2html.py”, and “stylexml2css.py”, so don’t move them.

genhtml.py TARGETDIR

Once it completes:

You should have created the file “book.html” inside of TARGETDIR

You should also have created the directory xml inside of TARGETDIR, which has the full XML descriptions of the pages and glyphs for later (better) conversion attempts.

One warning: this is not the best long-term solution, because much of the layout is only really correct if drawn to the screen (as an SVG). Until that solution exists, this should get you something that you can load into Sigil, clean up, and turn into an ePub that you can then convert to other formats.
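
The two conversion steps (genxml.py, then genhtml.py) can be chained so that the second only runs if the first succeeds. This is a rough sketch, not part of the original scripts: the function names are hypothetical, and it assumes the scripts sit together in one directory and run under the interpreter passed as python (the thread does not say which Python version they need).

```python
import subprocess
import sys
from pathlib import Path


def conversion_commands(target_dir, script_dir=".", python=sys.executable):
    """Build the genxml.py and genhtml.py command lines, in the order given above."""
    scripts = Path(script_dir)
    return [
        [python, str(scripts / "genxml.py"), str(target_dir)],
        [python, str(scripts / "genhtml.py"), str(target_dir)],
    ]


def run_conversion(target_dir, script_dir=".", python=sys.executable):
    """Run both steps in order, stopping at the first one that exits with an error."""
    for command in conversion_commands(target_dir, script_dir, python):
        # check=True raises CalledProcessError if the script returns non-zero.
        subprocess.run(command, check=True)
    # Per the post above, the result is expected at TARGETDIR/book.html.
    return Path(target_dir) / "book.html"
```

Calling run_conversion("TARGETDIR") from the directory that holds the scripts should then leave “book.html” inside TARGETDIR, assuming both scripts complete.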

Code:

http://www.pastie.org/760591

http://www.mediafire.com/?qmzjmt25yzf

http://rapidshare.com/files/336800633/topazscripts.zip.html

Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

Code:

http://pastie.org/761169.txt

seems to be more up-to-date.

Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

How do you use "http://pastie.org/761169.txt"? (I've tried using unswindle, but it doesn't work on Topaz.)

TIA

Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

I don't have a Kindle myself (or whichever device is needed for Topaz e-books), but I'm a bit interested in encryption-related topics.

You need Python for this script. I'm not a Windows user anymore but there are precompiled binaries which should work fine.

Download the script, start the command line and type: "python script.py filename"

It accepts the following parameters:

Quote:

print("\nCMBDTC.py [options] bookFileName\n")
print("-p Adds a PID to the list of PIDs that are tried to decrypt the book key (can be used several times)")
print("-r Prints or writes to disk a record indicated in the form name:index (e.g \"img:0\")")
print("-o Output file name to write records")
print("-v Verbose (can be used several times)")
print("-i Print kindle.info database")

Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

The following set of tools, from tools_v1.6b.zip, can also be used to remove DRM from Amazon Topaz eBooks:

TopazExtract_Kindle_iPhone.pyw
TopazFiles2XML.pyw
TopazFiles2SVG.pyw
TopazFiles2HTML.pyw

Code:

http://www.mediafire.com/?mn3vmttbwrt

The scripts should work with Kindle and iPhone Amazon Topaz Files (.tpz, .azw1). The files are really images of pages with OCR performed on them. Using the tools you can get SVG images of the pages, and the OCRed HTML version for clean-up.

Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

WOW, thanks so much for that download! I successfully converted my purchased Topaz book from Amazon into HTML, and then used Calibre to convert it to .epub for use on my HTC Hero (Android phone) with the Aldiko book reader. Thanks sooo much!!!

Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

I'm still unable to convert the files (xhtml) that I have, though the ebook itself has been stripped of DRM. I've tried merging the files with Adobe Acrobat Pro and Calibre, without success.

Can someone post the steps to do so? Thanks so much.

Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

Is there anyone able to help me with an error message? I have successfully converted 3 books but am having trouble with the 4th. It strips the DRM, but when I go to convert it to XML I get the following error at page 256: "Error - -1501 outside of string table limits". I did some unsuccessful googling, so if anyone can help me I would appreciate it.

Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

Running this, I keep getting the error "Can not find dict0000.dat file". What am I doing wrong? Thanks.

Re: How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

How to remove DRM from Topaz ebooks:

Install Python
Open command prompt / terminal and run:

Code:

python cmbtc_dump_nonK4PC.py -d -o TARGETDIR -p 12345678 YOURTOPAZBOOKNAMEHERE

where
- 12345678 - the first 8 characters of your PID
- "TARGETDIR" - target directory (can be omitted)
- YOURTOPAZBOOKNAMEHERE - filename of your Topaz ebook (with the .tpz extension)

Then, again in the command prompt / terminal, run:

Code:

python gensvg.py TARGETDIR

Then create an HTML file from the SVG file by running the following in the command line / terminal:

Code:

python genhtml.py TARGETDIR

You should get a "book.html" file in the TARGETDIR directory.
Convert "book.html" to any other format using Calibre.
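
For the final Calibre step, the conversion can also be done from the command line with Calibre's ebook-convert tool, which takes an input file and an output file. The sketch below only wraps that call; the function names are made up for illustration, and it assumes Calibre is installed with ebook-convert on your PATH.

```python
import subprocess
from pathlib import Path


def calibre_command(html_path, output_format="epub"):
    """Build an ebook-convert command that writes the output next to the input."""
    source = Path(html_path)
    output = source.with_suffix("." + output_format)
    return ["ebook-convert", str(source), str(output)]


def convert_with_calibre(html_path, output_format="epub"):
    """Run ebook-convert and return the path of the converted file."""
    command = calibre_command(html_path, output_format)
    # check=True raises CalledProcessError if ebook-convert reports a failure.
    subprocess.run(command, check=True)
    return Path(command[2])
```

For example, convert_with_calibre("TARGETDIR/book.html") would ask Calibre to write TARGETDIR/book.epub.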