This file attempts to describe the format of a .rb file -- the book format that is downloaded into NuvoMedia's hand-held wonder, the Rocket eBook.
Note: All multi-byte integers are stored in Vax/Intel order (little-endian, the opposite of network byte order). Most integers are 4 bytes (an int32), but there are some minor exceptions (as detailed below).
The first 4 bytes of the file seem to be a magic number (in hex): B0 0C B0 0C. I like to think of this as a hexadecimal pun on the word "book" (repeated). [Matt Greenwood has reported seeing a magic number of "B0 0C F0 0D" in another type of ReB-related file -- i.e. "book food".]
The next two bytes appear to be a version number, currently "02 00". I assume this means major version 2, minor version 0.
The next 4 bytes are the string "NUVO", followed by 4 bytes of 00h. (I have also seen an old title that had 0s in place of the "NUVO".)
This brings us up to offset 0Eh, at which point we have a 4-byte representation of the date the book was created (Matt Greenwood pointed this out to me -- thanks!). The year is encoded as an int16. An older version of the RocketLibrary encoded the year's full value (e.g. 1999 was "CF 07" and 2000 was "D0 07"), but a more recent version uses the tm_year value verbatim -- i.e. it stores 100 for the year 2000 ("64 00"). The year is followed by an int8 for the 1-relative month number, and an int8 for the day of the month.
After that is 6 bytes of 00h. These may be reserved for setting the time of creation (at a guess).
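To make the two year encodings concrete, here is a minimal Python sketch of decoding the 4-byte date field. The function name decode_rb_date is hypothetical, and the rule "any year below 1900 is a tm_year value" is my assumption for telling the two encodings apart; the format itself does not say how to distinguish them.

```python
import struct
from datetime import date


def decode_rb_date(raw: bytes) -> date:
    """Decode the 4-byte creation date stored at offset 0Eh."""
    # int16 year, int8 month (1-relative), int8 day, all little-endian.
    year, month, day = struct.unpack("<HBB", raw)
    # Assumption: a value below 1900 is a tm_year (years since 1900).
    if year < 1900:
        year += 1900
    return date(year, month, day)
```

With this, both "CF 07" (the older encoding of 1999) and "64 00" (the newer encoding of 2000) come out as ordinary calendar years.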
Then, at offset 18h, we have an int32 that contains the absolute offset of the "table of contents" (the directory of files contained within this .rb file). In all of the .rb files I've seen, this remains constant with a value of 128h. However, I have tested an atypical .rb file where I placed the ToC at the end of the file (after all the file contents), and it worked fine.
Immediately following this is an int32 with the length of the .rb file (so we can check if the file is complete or not).
All the bytes from here (offset 20h) up to offset 128h appear to only be used by an encrypted title. In a non-encrypted title, they are always 0.
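The fixed part of the header described so far (offsets 00h through 1Fh) can be read in one pass. The following is only a sketch under the layout given above: the names RB_MAGIC, HEADER and parse_rb_header are hypothetical, and splitting the version into a major and a minor byte follows the guess made earlier rather than anything confirmed.

```python
import struct

RB_MAGIC = b"\xB0\x0C\xB0\x0C"

# magic, version (major, minor), "NUVO", 4 pad bytes, year, month, day,
# 6 pad bytes, ToC offset, total file length -- 32 bytes, little-endian.
HEADER = struct.Struct("<4sBB4s4xHBB6xII")


def parse_rb_header(data: bytes) -> dict:
    """Parse the first 20h bytes of a .rb file into a dictionary."""
    if len(data) < HEADER.size:
        raise ValueError("file is too short to hold a .rb header")
    (magic, major, minor, vendor,
     year, month, day, toc_offset, file_length) = HEADER.unpack_from(data, 0)
    if magic != RB_MAGIC:
        raise ValueError("bad magic number, not a .rb file")
    return {
        "version": (major, minor),
        "vendor": vendor,              # b"NUVO", or zeros in an old title
        "date": (year, month, day),    # raw year field, not normalized
        "toc_offset": toc_offset,
        "file_length": file_length,
        # The stored length lets us check that the file is complete.
        "complete": file_length == len(data),
    }
```

The bytes from 20h up to the ToC are not interpreted here, since they only seem to matter for encrypted titles.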
The table of contents typically comes next (at offset 128h). It starts with an int32 count of the number of "file" entries in the ToC. Each entry consists of a "file" name (zero-padded to 32 bytes), followed by 3 int32s: the length of this entry's data segment, the absolute offset of the data in the .rb file, and a flag. The known flag values are: 1 (encrypted), 2 (info file), and 8 (deflated). File names are tweaked as needed to ensure that they are all unique. The current RocketWriter software uses a unique 6-digit number, a dash, up to 8 characters from the filename, and then the re-mapped suffix for the data (.html, .hidx, .png, .info, etc.).
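As an illustration of the ToC layout, here is a short Python sketch that walks the entries. The names ENTRY, parse_toc and the FLAG_* constants are hypothetical, and decoding the file names as Latin-1 is an assumption on my part (the names I have seen are plain ASCII).

```python
import struct

# One ToC entry: 32-byte zero-padded name, then length, offset and flag.
ENTRY = struct.Struct("<32sIII")

FLAG_ENCRYPTED = 1
FLAG_INFO = 2
FLAG_DEFLATED = 8


def parse_toc(data: bytes, toc_offset: int = 0x128) -> list:
    """Return the ToC entries of a .rb file as a list of dictionaries."""
    (count,) = struct.unpack_from("<I", data, toc_offset)
    entries = []
    pos = toc_offset + 4
    for _ in range(count):
        raw_name, length, offset, flag = ENTRY.unpack_from(data, pos)
        # Drop the zero padding that fills the name out to 32 bytes.
        name = raw_name.split(b"\0", 1)[0].decode("latin-1")
        entries.append(
            {"name": name, "length": length, "offset": offset, "flag": flag}
        )
        pos += ENTRY.size
    return entries
```

The data for an entry is then simply data[offset:offset + length]. In practice toc_offset should come from the int32 at offset 18h rather than from the default value.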
Often the first item in the ToC is the info file, but it doesn't have to be. This "file" contains information on the author, title, the root-file's name, etc. (See appendix A). This data is never encrypted nor compressed, so this entry's flag value is always "2".
An image file is always stored as a B&W image in PNG format. Since it has its own compression, it is stored without any additional attempt at deflation. I have also never seen an encrypted image, so its flag value is 0.
An HTML file contains the tags and text that were re-written by the RocketWriter into a consistent syntax (this presumably makes the HTML renderer in the ReB itself simpler). HTML files are typically compressed (See appendix B). Every HTML file appears to use the suffix .html no matter what the file name was on import, but I have seen older files with .htm used as the suffix.
For every HTML file there is a corresponding .hidx file that contains a summary of the paragraph formatting and the position of the anchor names in the associated .html file (See appendix C). This file is sometimes compressed, depending on length (See appendix B).
There are also reference titles that have a .hkey file that contains a list of words that can be looked up in the associated .html file (See appendix D).
Immediately following the ToC is the data for each piece mentioned in the ToC, in the same order as it appeared in the ToC.
Finally, the end of the file appears to be padded with 20 bytes of 01h.
The info file consists of a series of lines that contain "name=value" strings. Each line is terminated by a single newline. Here are the values that the RocketWriter generates:
    COMMENT=Info file for <title>
    TYPE=2
    TITLE=<title>
    AUTHOR=<author>
    URL=ebook:<long, unique string used for the file's name by the librarian>
    GENERATOR=<e.g. RocketLibrarian 1.3.216>
    PARSE=1
    OUTPUT=1
    BODY=<name of root HTML file (as it appears in the ToC)>
    MENUMARK=menumark.html
    SuggestedRetailPrice=<usually empty>
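Reading these name=value lines back is straightforward. The sketch below is one way to do it in Python; the function name parse_info is hypothetical, and the Latin-1 decoding is an assumption, since I do not know what character set the RocketWriter uses for titles and author names.

```python
def parse_info(data: bytes) -> dict:
    """Parse an info file's "name=value" lines into a dictionary."""
    fields = {}
    # Assumption: Latin-1 text, one entry per newline-terminated line.
    for line in data.decode("latin-1").split("\n"):
        if "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the first "=" only, so values may contain "=" themselves.
        name, value = line.split("=", 1)
        fields[name] = value
    return fields
```

The BODY entry of the result names the root HTML file, which can then be looked up in the ToC.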
Encrypted titles have a few more entries (in addition to those listed above):
    ISBN=<ISBN number, including dashes>
    REVISION=<digits>
    TITLE_LANGUAGE=<en-us>
    PUB_NAME=<Publisher's name>
    PUBSERVER_ID=<digits>
    GENERATOR=<e.g. RocketPress 1.3.121>
    VERSION=<digits>
    USERNAME=<rocket-ID>
    COPY_ID=<digits>
    COPYRIGHT=<copyright>
    COPYTITLE=<another copyright?>
A reference title also has an indication that there is a .hkey file present:
Compressed files have a data section in the .rb file with the following format:
The first int32 is a count of the number of 4096-byte chunks of data we broke the uncompressed file into (the last chunk can be shorter than 4096 bytes, of course).
This is immediately followed by an int32 with the length of the entire uncompressed data.
After this there are <count> int32s that indicate the size of each chunk's compressed data.
Following these length int32s is the output from a deflation (the algorithm used in gzip) for each 4096-byte chunk of the original data. It appears that you must use a window-bit size of 13 and a compression level of "best" to be compatible with the Rocket eBook's system software.
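The chunked layout above can be sketched in Python with the standard zlib module. This is a rough illustration, not a verified implementation: the function names are hypothetical, and I am assuming that each chunk is an independent zlib-wrapped stream (raw deflate data without the zlib header would need a negative wbits value instead).

```python
import struct
import zlib

CHUNK_SIZE = 4096


def inflate_rb_data(segment: bytes) -> bytes:
    """Expand a compressed data segment back into the original bytes."""
    count, total = struct.unpack_from("<II", segment, 0)
    sizes = struct.unpack_from("<%dI" % count, segment, 8)
    pos = 8 + 4 * count
    out = bytearray()
    for size in sizes:
        # Assumption: every chunk is a separate zlib stream.
        out += zlib.decompress(segment[pos:pos + size])
        pos += size
    if len(out) != total:
        raise ValueError("uncompressed length does not match the header")
    return bytes(out)


def deflate_rb_data(data: bytes) -> bytes:
    """Build a compressed data segment from uncompressed bytes."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    packed = []
    for chunk in chunks:
        # Compression level "best" (9) and a window-bit size of 13.
        comp = zlib.compressobj(9, zlib.DEFLATED, 13)
        packed.append(comp.compress(chunk) + comp.flush())
    header = struct.pack("<II", len(chunks), len(data))
    header += struct.pack("<%dI" % len(packed), *[len(p) for p in packed])
    return header + b"".join(packed)
```

Because each chunk carries its own length, a reader can seek to any 4096-byte block of the original text without inflating everything before it.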
The .hidx file's purpose is to allow the renderer to quickly look up the format of each paragraph (useful for random access to the data), and the position of the anchor names.
The first section lists the various paragraph-producing tags. It is headed by a line of "[tags <count>]", where <count> is the number of tags that follow this header. The tags are listed one per line, and have an implied enumeration from 0 to N-1 (which the other tags and the upcoming paragraph sections reference).
The first tag is typically (always?) "<HTML> -1". The number trailing the tag is the index of the tag (earlier in this list) inside which this one is nested; following those indexes back gives the whole sequence of enclosing tags. So, if we have a <BR> nested inside a <P ALIGN="center">, it would be listed separately from a <BR> that was nested inside a normal paragraph, and each one would have a different trailing index number.
Following the tag section is the paragraph section. The heading is "[paragraphs <count>]", and is followed by a line for each paragraph. These lines consist of a character offset into the .html file for the start of the paragraph followed by a 0-relative offset into the tag section (indicating what kind of formatting to use for the indicated paragraph).
The character offset into the .html file points to the first bit of text after the associated tag.
The last section details the anchor names. The heading is "[names <count>]", and each item that follows is a quoted string of the anchor name, followed by a character offset into the .html file where we'll find that name. If there are no names in the associated HTML section, the heading is included with a 0 count (i.e. "[names 0]").
The character offset into the .html file points to the start of the anchor tag (not after the tag, like the offsets in the "paragraphs" section).
The lines are terminated by newlines (in standard unix fashion).
    [tags 10]
    <HTML> -1
    <BODY> 0
    <P ALIGN="right"> 1
    <P ALIGN="left"> 1
    <P> 1
    <H3 ALIGN="center"> 1
    <P ALIGN="center"> 1
    <BR> 6
    <H2 ALIGN="center"> 1
    <BR> 1
    [paragraphs 42]
    160 9
    164 9
    184 8
    220 8
    261 6
    316 5
    359 1
    379 6
    410 6
    460 7
    511 7
    564 7
    616 7
    668 7
    720 7
    773 7
    827 7
    880 7
    933 7
    988 7
    1043 7
    1100 7
    1157 7
    1214 7
    1270 7
    1328 7
    1385 7
    1442 7
    1497 7
    1556 7
    1561 7
    1635 1
    1656 5
    1690 6
    1737 7
    1773 5
    1798 4
    1826 3
    2663 1
    2668 4
    2689 2
    2730 8
    [names 1]
    "ch1" 2689
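A .hidx file in this shape can be read with a small line-oriented parser. The Python sketch below is only an illustration: parse_hidx is a hypothetical name, the declared counts in the headings are not checked, and I am assuming that an anchor name is wrapped in exactly one pair of double quotes.

```python
import re

HEADING = re.compile(r"\[(tags|paragraphs|names) (\d+)\]")


def parse_hidx(text: str) -> dict:
    """Split a .hidx file into its tags, paragraphs and names sections."""
    sections = {"tags": [], "paragraphs": [], "names": []}
    current = None
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        heading = HEADING.fullmatch(line)
        if heading:
            current = heading.group(1)
            continue
        # Every data line ends in a number; split it off from the right.
        head, _, tail = line.rpartition(" ")
        if current == "tags":
            sections["tags"].append((head, int(tail)))       # (tag, parent)
        elif current == "paragraphs":
            sections["paragraphs"].append((int(head), int(tail)))
        elif current == "names":
            # Assumption: the name is enclosed in one pair of quotes.
            sections["names"].append((head[1:-1], int(tail)))
    return sections
```

Applied to the example above, the paragraph at offset 460 refers to tag 7, which is a <BR> whose parent index 6 is the <P ALIGN="center"> entry.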
The .hkey file contains a list of words, one per line, sorted in strict ASCII order, each one followed by a tab and the offset in the .html file of the word's data. I presume that the .hkey file must share the same name prefix as its related .html file, but that is not known for certain.
The lines are terminated with a newline (in standard unix fashion).
    a	5
    apple	38
    b	84
    book	104
Each of these offsets points to a paragraph tag in the associated .html file. I have only seen this sequence of tags used so far:
<P><BIG><B>word</B></BIG> other stuff</P>
The offset in the .hkey file points to the start of the <P> tag.
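Since the words are sorted, a lookup can use a binary search. Here is a minimal Python sketch; the names parse_hkey and lookup_word are hypothetical, and the search relies on the words being plain ASCII so that Python's string ordering matches the file's sort order.

```python
import bisect


def parse_hkey(text: str) -> list:
    """Parse .hkey text into a list of (word, offset) pairs."""
    entries = []
    for line in text.split("\n"):
        if not line:
            continue
        # The offset follows the last tab on the line.
        word, _, offset = line.rpartition("\t")
        entries.append((word, int(offset)))
    return entries


def lookup_word(entries: list, word: str):
    """Return the .html offset of the word's <P> tag, or None if absent."""
    words = [w for w, _ in entries]
    # Assumption: the entries are already sorted in ASCII order.
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return entries[i][1]
    return None
```

For the sample data above, looking up "book" gives the offset 104, which is where the word's <P> tag starts in the .html file.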