How to convert Topaz ebooks to HTML (Remove DRM from TPZ and AZW1 books for Kindle)

any ANONYMOUS forum user · #1 01-18-2010, 03:07 AM

Quote:

Topaz is an Amazon format for Kindle devices. It differs from the AZW format in that it can have fonts embedded in the file itself. A .tan sidefile stores metadata, bookmarks, and other user-generated content for the eBook. The library mode uses the metadata to look up information about the eBook itself.

While not much is currently known about the internal format of a Topaz file, it is likely related to the standard AZW format. It uses different compression from standard MOBI files, and it can embed fonts in the file, allowing more complex display using font sets and characters that are not standard on the Amazon Kindle. It likely also removes other restrictions found in MOBI files, such as image size limitations, although some of these may have been removed in AZW as well.

According to one publishing industry blogger, Topaz is an implementation of the open EPUB standard. It follows the OEBPS 2.0 specs, and probably the later IDPF guides. It is a proprietary implementation, which means they use EPUB as the source but then convert it to their internal format.

AZW1 - is an eBook in the Topaz (TPZ) format that has been delivered via Whispernet.

TPZ - is an eBook in the Topaz format that has been delivered via Internet download.

The following is experimental and will probably not work for you, but…

ALSO: Please do not use any of this to steal. Theft is wrong.

This is only meant to allow conversion of Topaz books for other book readers you own.

Here are the steps:

First, use the Python scripts in topazscripts.zip to do the translation from Topaz to HTML.

The files you should have after unzipping are:

cmbtc_dump.py – (author: cmbtc) decrypts and dumps all of the sections to files, properly numbered and named

decode_meta.py – converts metadata0000.dat to human readable text

convert2xml.py – converts page*.dat, other*.dat, and glyphs*.dat files to their “pseudo” xml descriptions.

flatxml2html.py – converts a “flattened” xml description to html using the ocrtext and markup as its basis.

stylexml2css.py – converts stylesheet “flattened” xml from other0000.dat into css (as best it can – mainly supporting paragraph style classes)

genxml.py – main program to convert everything to xml

genhtml.py – main program to generate “book.html”
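
Before going further, it may help to confirm that unzipping produced all seven scripts listed above. This is only a minimal sketch in Python 3, assuming the scripts all sit together in one directory; the helper name missing_scripts is made up for illustration and is not part of topazscripts.zip.

```python
from pathlib import Path

# Script names exactly as listed in the post above.
EXPECTED_SCRIPTS = [
    "cmbtc_dump.py",
    "decode_meta.py",
    "convert2xml.py",
    "flatxml2html.py",
    "stylexml2css.py",
    "genxml.py",
    "genhtml.py",
]


def missing_scripts(script_dir):
    """Return the expected script names that are not present in script_dir.

    Hypothetical helper for illustration only.
    """
    root = Path(script_dir)
    return [name for name in EXPECTED_SCRIPTS if not (root / name).is_file()]
```

Calling missing_scripts(".") from the unzip directory should return an empty list if nothing is missing.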

Next, remove the DRM from the Topaz book and build a directory of its contents using the following command:

cmbtc_dump.py -d -o TARGETDIR [-p pid] YOURTOPAZBOOKNAMEHERE

This should create a directory called “TARGETDIR” in your current directory.

It should have the following files in it:

metadata0000.dat – metadata info
other0000.dat – information used to create a style sheet
dict0000.dat – dictionary of words used to build page descriptions
page – directory filled with page*.dat files
glyphs – directory filled with glyphs*.dat files
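
The layout above can be checked before moving on. The following is a rough sketch in Python 3, assuming the file and directory names are exactly as listed; check_dump_layout is a hypothetical helper, not something the scripts provide.

```python
from pathlib import Path


def check_dump_layout(target_dir):
    """Return a list of problems found in the dump directory (empty if none).

    Hypothetical helper: it only checks names, not file contents.
    """
    root = Path(target_dir)
    problems = []
    # The three top-level .dat files named in the post.
    for name in ("metadata0000.dat", "other0000.dat", "dict0000.dat"):
        if not (root / name).is_file():
            problems.append("missing file: " + name)
    # The two subdirectories and the file pattern each should contain.
    for sub, pattern in (("page", "page*.dat"), ("glyphs", "glyphs*.dat")):
        folder = root / sub
        if not folder.is_dir():
            problems.append("missing directory: " + sub)
        elif not any(folder.glob(pattern)):
            problems.append("no " + pattern + " files in " + sub)
    return problems
```

An empty list from check_dump_layout("TARGETDIR") suggests the dump step produced what the later steps expect.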

Next, convert the files in “TARGETDIR” to their xml descriptions.
Please note that genxml.py uses “decode_meta.py” and “convert2xml.py”, so don’t move them.

genxml.py TARGETDIR

Next, attempt a conversion to html, where “TARGETDIR” is the directory that cmbtc_dump.py created earlier.
Please note that genhtml.py uses “decode_meta.py”, “convert2xml.py”, “flatxml2html.py”, and “stylexml2css.py”, so don’t move them.

genhtml.py TARGETDIR

Once it completes:

You should have created the file “book.html” inside of TARGETDIR.

You should also have created the directory “xml” inside of TARGETDIR, which has the full xml descriptions of the pages and glyphs for later (better) conversion attempts.
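
The two conversion commands and the checks on their output could be chained roughly as follows. This is a sketch under assumptions, not a tested driver: the function names are made up, the scripts are assumed to live in script_dir and to be run from there, and the python argument is a placeholder because scripts of this age likely need a Python 2 interpreter rather than the one running this sketch.

```python
import subprocess
import sys
from pathlib import Path


def conversion_commands(target_dir, script_dir=".", python=sys.executable):
    """Build the genxml.py and genhtml.py command lines, in the order the post runs them."""
    scripts = Path(script_dir).resolve()
    target = str(Path(target_dir).resolve())
    return [
        [python, str(scripts / "genxml.py"), target],
        [python, str(scripts / "genhtml.py"), target],
    ]


def convert(target_dir, script_dir=".", python=sys.executable):
    """Run both steps, stopping at the first failure, then check the outputs."""
    for cmd in conversion_commands(target_dir, script_dir, python):
        # Assumption: running from script_dir lets the scripts find their helpers.
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True, cwd=script_dir)
    target = Path(target_dir)
    # The post says to expect book.html and an xml directory inside TARGETDIR.
    return (target / "book.html").is_file() and (target / "xml").is_dir()
```

For example, convert("TARGETDIR", python="python2") would run both steps with a Python 2 interpreter, if one by that name is installed.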

One warning: this is not the best long-term solution, because much of the layout is only really correct if drawn to the screen (as an SVG). Until that solution exists, this should get you something that you can load into Sigil, clean up, and make into an ePub that you can then convert to other formats.

Code:

http://www.pastie.org/760591
http://www.mediafire.com/?qmzjmt25yzf
http://rapidshare.com/files/336800633/topazscripts.zip.html

See also:
ebook DRM removal tools archive
