FILE FORMAT
A file format is
a particular way that information is encoded for storage in a computer file.
Any list of file formats would contain both proprietary and open source. Since
a disk drive, or indeed any computer storage, can store only bits, the computer
must have some way of converting information to 0s and 1s and vice-versa. There
are different kinds of formats for different kinds of information. Within any
format type, e.g., word processor documents, there will typically be several
different formats. Sometimes these formats competye with each other. File
formats can be divided into proprietary and open formats.
In
computer programming, a data type is a classification identifying one of
various types of data, such as floating-point, integer, or Boolean, that
determines the possible values for that type; the operations that can be done
on values of that type; the meaning of the data; and the way values of that
type can be stored. Data types are used within type systems, which offer
various ways of defining, implementing and using them. Different type systems
ensure varying degrees of type safety. Formally, a type can be defined as
"any property of a programme we can determine without executing the
program"
Some file formats are designed for
very particular sorts of data: PNG files, for example, store bitmapped images
using lossless data compression. Other file formats, however, are designed for
storage of several different types of data: the Ogg format can act as a
container for many different types of multimedia, including any combination of
audio and/or video, with or without text (such as subtitles), and metadata. A
text file can contain any stream of characters, encoded for example as ASCII or
Unicode, including possible control characters. Some file formats, such as
HTML, Scalable Vector Graphics, and the source code of computer software are
also text files with defined syntaxes that allow them to be used for specific
purposes.
Overview
Almost all programming languages explicitly include the notion of
data type, though different languages may use different terminology. Common
data types may include:
- integers,
- booleans,
- characters,
- floating-point numbers,
- alphanumeric strings.
For example, in the Java programming
language, the "int" type represents the set
of 32-bit integers ranging in value from -2,147,483,648 to
2,147,483,647, as well as the operations that can be performed on integers,
such as addition, subtraction, and multiplication. Colors, on the other hand,
are represented by three bytes denoting the amounts each of red,
green, and blue, and one string representing that color's name; allowable
operations include addition and subtraction, but not multiplication.
Most programming languages also allow the
programmer to define additional data types, usually by combining multiple
elements of other types and defining the valid operations of the new data type.
For example, a programmer might create a new data type named "complex
number" that would include real and imaginary parts. A data type also
represents a constraint placed upon the interpretation of data in a type
system, describing representation, interpretation and structure
of values or objects stored in computer memory. The type
system uses data type information to check correctness of computer
programs that access or manipulate the data.
Classes of data types
Machine data types
All data in computers based on digital
electronics is represented as bits (alternatives 0 and 1) on the
lowest level. The smallest addressable unit of data is usually a group of bits
called a byte (usually an octet, which is 8 bits). The unit
processed by machine code instructions is called a word (as
of 2011, typically 32 or 64 bits). Most instructions interpret the word as a binary
number, such that a 32-bit word can represent unsigned integer values from 0
to or signed
integer values from to . Because
of two's complement, the machine language and machine doesn't need to
distinguish between these unsigned and signed data types for the most part.
There is a specific set of arithmetic
instructions that use a different interpretation of the bits in word as
a floating-point number.
Machine data types need to be exposed or
made available in systems or low-level programming languages,
allowing fine-grained control over hardware. The C programming language,
for instance, supplies integer types of various widths, such as short and long.
If a corresponding native type does not exist on the target platform, the
compiler will break them down into code using types that do exist. For
instance, if a 32-bit integer is requested on a 16 bit platform, the compiler
will tacitly treat it as an array of two 16 bit integers.
Several languages
allow binary and hexadecimal literals, for convenient
manipulation of machine data.
In higher level programming, machine data types are often hidden
or abstracted as an implementation detail that would render
code less portable if exposed. For instance, a generic numeric type
might be supplied instead of integers of some specific bit-width.
The boolean type
The Boolean type represents the
values: true and false. Although only two values are possible,
they are rarely implemented as a single binary digit for efficiency reasons.
Many programming languages do not have an explicit boolean type, instead
interpreting (for instance) 0 as false and other values as true.
Numeric types
Such as:
- The integer data types, or "whole numbers". May be subtyped according to their ability to contain negative values (eg. unsigned in C and C++). May also have a small number of predefined subtypes (such as short and long in C/C++); or allow users to freely define subranges such as 1..12 (eg. Pascal/Ada).
- Floating point data types, sometimes misleadingly called reals, contain fractional values. They usually have predefined limits on both their maximum values and their precision.
- Fixed point data types are convenient for representing monetary values. They are often implemented internally as integers, leading to predefined limits.
- Bignum or arbitrary precision numeric types lack predefined limits. They are not primitive types, and are used sparingly for efficiency reasons.
String and text types
Such as
- Alphanumeric character. A letter of the alphabet, digit, blank space, punctuation mark, etc.
- Alphanumeric strings, a sequence of characters. They are typically used to represent words and text.
Character and string types can store sequences
of characters from a character set such as ASCII. Since most character
sets include the digits, it is possible to have a numeric string, such as"1234".
However, many languages would still treat these as belonging to a different
type to the numeric value 1234.
Character and string types can have different
subtypes according to the required character "width". The original
7-bit wide ASCII was found to be limited, and superseded by 8 and 16-bit sets,
which can encode a wide variety of non-latin alphabets (Hebrew, Chinese)
and other symbols. Strings may be either stretch-to-fit or of fixed size, even
in the same programming language. They may also be subtyped by their maximum
size.
Note: strings are not primitive in all languages, for instance C:
they may be composed from arrays of characters.
Type systems
A type system associates types with each
computed value. By examining the flow of these values, a type system attempts
to prove that no type errors can occur. The type system in
question determines what constitutes a type error, but a type system generally
seeks to guarantee that operations expecting a certain kind of value are not
used with values for which that operation does not make sense.
A compiler may use the static type of
a value to optimize the storage it needs and the choice of algorithms for
operations on the value. In many C compilers the float data
type, for example, is represented in 32 bits, in accord with
the IEEE specification for single-precision floating point numbers. They
will thus use floating-point-specific microprocessor operations on
those values (floating-point addition, multiplication, etc.).
The depth of type constraints and the manner of
their evaluation affect the typing of the language.
A programming language may further associate an operation with
varying concrete algorithms on each type in the case of type
polymorphism. Type theory is the study of type systems, although the
concrete type systems of programming languages originate from practical issues
of computer architecture, compiler implementation, and language design.
Type systems may be
variously static or dynamic, strong or weak
typing, and so forth.
The RAR Format
1 Introduction
This document, a work-in-progress, describes the RAR format. It
serves a similar role that the ZIP App Note does for the ZIP format.
NOTE 1: This documentation MUST NOT be used to create RAR-compatible archive programs like WinRAR. It is only for the purposes of writing decompression software (i.e. unrar) in various languages. It was reverse-engineered from the UnRAR source located at this page with Eugene Roshal's permission.
NOTE 2: This documentation will initially focus on what I believe is Version 3 of the RAR format.
NOTE 1: This documentation MUST NOT be used to create RAR-compatible archive programs like WinRAR. It is only for the purposes of writing decompression software (i.e. unrar) in various languages. It was reverse-engineered from the UnRAR source located at this page with Eugene Roshal's permission.
NOTE 2: This documentation will initially focus on what I believe is Version 3 of the RAR format.
1.1 License
2 General Format of a
.RAR File
Overall .RAR file format:
signature 7 bytes (0x52 0x61 0x72 0x21 0x1A 0x0
7 0x00)
[1st volume header]
...
[2nd volume header]
...
...
[nth volume header]
...
In general, a modern
single-volume RAR file has MAIN_HEAD structure followed by
multiple FILE_HEAD structures.
2.1 Volume Header Format
The Base Header Block is:
header_crc 2 bytes
header_type 1 byte
header_flags 2 bytes
header_size 2 bytes
The header_size indicates how many total bytes the header
requires. The header_type field determines how the remaining bytes
should be interpreted.
2.1.1 Header Type
The header type is 8 bits (1 byte) and can have the following
values:
Value
|
Type
|
0x72
|
MARK_HEAD
|
0x73
|
MAIN_HEAD
|
0x74
|
FILE_HEAD
|
0x75
|
COMM_HEAD
|
0x76
|
AV_HEAD
|
0x77
|
SUB_HEAD
|
0x78
|
PROTECT_HEAD
|
0x79
|
SIGN_HEAD
|
0x7a
|
NEWSUB_HEAD
|
0x7b
|
ENDARC_HEAD
|
2.1.1.1 MARK_HEAD
TBD
2.1.1.2 MAIN_HEAD
The remaining bytes in the volume header for MAIN_HEAD are:
HighPosAv 2 bytes
PosAV 4 bytes
EncryptVer 1 byte (only present if
MHD_ENCRYPTVER
is set)
2.1.1.3 FILE_HEAD
The remaining bytes in the FILE_HEAD structure are:
PackSize 4 bytes
UnpSize 4 bytes
HostOS 1 byte
FileCRC 4 bytes
FileTime 4 bytes
UnpVer 1 byte
Method 1 byte
NameSize 2 bytes
FileAttr 4 bytes
HighPackSize 4 bytes (only present if LHD_LARGE
is
set)
HighUnpSize 4 bytes (only present if LHD_LARGE
is
set)
FileName (NameSize) bytes
Salt 8 bytes (only present if
LHD_SALT is set)
ExtTime Structure See Description (only present if LHD_
EXTTIME is set)
Packed Data (Total Packed Size) bytes
If the LHD_LARGE flag is set, then the archive is large and
64-bits are needed to represent the packed and unpacked size. HighPackSize is
used as the upper 32-bits and PackSize is used as the lower 32-bits for the
packed size in bytes. HighUnpSize is used as the upper 32-bits and UnpSize is
used as the lower 32-bits for the unpacked size in bytes.
ExtTime Structure
Follow the following pseudo-code to read in the ExtTime Structure.
TODO: Fill in details of what this structure represents.
TODO: Fill in details of what this structure represents.
var extTimeFlags =
readBits(16)
mtime:
mtime_rmode =
extTimeFlags >> 12
if ((mtime_rmode &
8)==0) goto ctime
mtime_count =
mtime_rmode & 0x3
mtime_reminder =
readBits(8*mtime_count);
ctime:
ctime_rmode =
extTimeFlags >> 8
if ((ctime_rmode &
8)==0) goto atime
Set ctime to
readBits(16)
ctime_count = ctime_rmode
& 0x3
ctime_reminder =
readBits(8*ctime_count)
atime:
atime_rmode =
extTimeFlags >> 4
if ((atime_rmode &
8)==0) goto arctime
Set atime to
readBits(16)
atime_count =
atime_rmode & 0x3
atime_reminder =
readBits(8*atime_count)
arctime:
arctime_rmode =
extTimeFlags
if ((arctime_rmode &
8)==0) goto done_exttime
Set arctime to
readBits(16)
arctime_count =
arctime_rmode & 0x3
arctime_reminder =
readBits(8*arctime_count)
done_exttime
2.1.1.4 COMM_HEAD TBD
2.1.1.5 AV_HEAD TBD
2.1.1.6 SUB_HEAD TBD
2.1.1.7 PROTECT_HEAD TBD
2.1.1.5 AV_HEAD TBD
2.1.1.6 SUB_HEAD TBD
2.1.1.7 PROTECT_HEAD TBD
2.1.1.8 SIGN_HEAD TBD
2.1.1.9 NEWSUB_HEAD TBD
2.1.1.10 ENDARC_HEAD TBD
2.1.1.9 NEWSUB_HEAD TBD
2.1.1.10 ENDARC_HEAD TBD
2.1.2 Header Flags
The header flags are 16 bits (2 bytes). Depending on
the type of Volume Header, the flags are interpreted differently.
The Main Header Flags are:
The Main Header Flags are:
Value
|
Flag
|
0x0001
|
MHD_VOLUME
|
0x0002
|
MHD_COMMENT
|
0x0004
|
MHD_LOCK
|
0x0008
|
MHD_SOLID
|
0x0010
|
MHD_PACK_COMMENT or
MHD_NEWNUMBERING
|
0x0020
|
MHD_AV
|
0x0040
|
MHD_PROTECT
|
0x0080
|
MHD_PASSWORD
|
0x0100
|
MHD_FIRSTVOLUME
|
0x0200
|
MHD_ENCRYPTVER
|
The File Header Flags are:
Value
|
Flag
|
0x0001
|
LHD_SPLIT_BEFORE
|
0x0002
|
LHD_SPLIT_AFTER
|
0x0004
|
LHD_PASSWORD
|
0x0008
|
LHD_COMMENT
|
0x0010
|
LHD_SOLID
|
0x0100
|
LHD_LARGE
|
0x0200
|
LHD_UNICODE
|
0x0400
|
LHD_SALT
|
0x0800
|
LHD_VERSION
|
0x1000
|
LHD_EXTTIME
|
0x2000
|
LHD_EXTFLAGS
|
2.1.2.1 MHD_VOLUME Value 0x0001. TBD
2.1.2.2 MHD_COMMENT Value 0x0002. TBD
2.1.2.3 MHD_LOCK Value 0x0004. TBD
2.1.2.4 MHD_SOLID Value 0x0008. TBD
2.1.2.5 MHD_PACK_COMMENT Value 0x0010. TBD
2.1.2.6 MHD_AV Value 0x0020. TBD
2.1.2.7 MHD_PROTECT Value 0x0040. TBD
2.1.2.8 MHD_PASSWORD Value 0x0080. TBD
2.1.2.9 MHD_FIRSTVOLUME Value 0x0100. TBD
2.1.2.10 MHD_ENCRYPTVER
2.1.2.2 MHD_COMMENT Value 0x0002. TBD
2.1.2.3 MHD_LOCK Value 0x0004. TBD
2.1.2.4 MHD_SOLID Value 0x0008. TBD
2.1.2.5 MHD_PACK_COMMENT Value 0x0010. TBD
2.1.2.6 MHD_AV Value 0x0020. TBD
2.1.2.7 MHD_PROTECT Value 0x0040. TBD
2.1.2.8 MHD_PASSWORD Value 0x0080. TBD
2.1.2.9 MHD_FIRSTVOLUME Value 0x0100. TBD
2.1.2.10 MHD_ENCRYPTVER
Value 0x0200. Indicates
whether encryption is present in the archive volume.
2.1.2.11 LHD_SPLIT_BEFORE Value 0x0001. TBD
2.1.2.12 LHD_SPLIT_AFTER Value 0x0002. TBD
2.1.2.13 LHD_PASSWORD Value 0x0004. TBD
2.1.2.14 LHD_COMMENT Value 0x0008. TBD
2.1.2.15 LHD_SOLID Value 0x0010. TBD
2.1.2.16 LHD_LARGE Value 0x0100. Indicates if the archive is large. In this case, 64 bits are used to describe the packed and unpacked size.
2.1.2.17 LHD_UNICODE Value 0x0200. Indicates if the filename is Unicode.
2.1.2.12 LHD_SPLIT_AFTER Value 0x0002. TBD
2.1.2.13 LHD_PASSWORD Value 0x0004. TBD
2.1.2.14 LHD_COMMENT Value 0x0008. TBD
2.1.2.15 LHD_SOLID Value 0x0010. TBD
2.1.2.16 LHD_LARGE Value 0x0100. Indicates if the archive is large. In this case, 64 bits are used to describe the packed and unpacked size.
2.1.2.17 LHD_UNICODE Value 0x0200. Indicates if the filename is Unicode.
2.1.2.18 LHD_SALT Value 0x0400. Indicates if the 64-bit salt value
is present.
2.1.2.19 LHD_VERSION Value 0x0800. TBD
2.1.2.20 LHD_EXTTIME Value 0x1000. The ExtTime Structure is
present in the FILE_HEAD
header.
2.1.2.21 LHD_EXTFLAGS Value 0x2000. TBD
3 Unpacking
Once the header information and packed bytes have been extracted,
the packed bytes must then be unpacked. RAR uses a variety of algorithms for
this. Chief among these are Lempel-Ziv
compression and Prediction by Partial Matching. The details of the
unpacking are specified in the following subsections based on the values of
UnpVer and Method as decoded in the FILE_HEAD structure.
If Method is 0x30 (decimal 48), then the packed bytes are the unpacked bytes and no decompression/unpacking is necessary (i.e. the file was not compressed).
If Method is 0x30 (decimal 48), then the packed bytes are the unpacked bytes and no decompression/unpacking is necessary (i.e. the file was not compressed).
Otherwise:
UnpVer
Value (decimal)
|
Algorithm
To Use
|
15
|
Unpack15
|
20
|
Unpack20
|
26
|
|
29
|
Unpack29
|
36
|
3.1 Unpack15 TBD
3.2 Unpack20 TBD
3.3 Unpack29 The structure of packed data consists of N
number of blocks. If the first bit of
a block is set, then process the block as a PPM block. Otherwise, this is an LZ
block.
3.3.1 LZ Block
The format of a LZ block
is:
isPPM 1 bit
keepOldTable 1 bit
huffmanCodeTable (variable size)
3.3.1.1 Huffman Code
Tables
The Huffman Encoding tables consist a series of bit lengths. For a
more thorough treatment of the concepts of Huffman Encoding, see the
Deflate spec. The RAR format uses a set of twenty bit lengths to construct
Huffman Codes. The Huffman Encoding tables in RAR files consist of at most
twenty entries of the format:
BitLength 4 bits
ZeroCount 4 bits (only present if
BitLength is 15)
If BitLength is 15, then the next 4 bits are read as ZeroCount. If
the ZeroCount is 0, then the bit length is 15, otherwise (ZeroCount+2) is the
number of consecutive zero bit lengths that are in the table. For instance, if
the following 4-bit numbers are present:
0x8
|
indicates a bit-length of 8
|
0x4
|
indicates a bit-length of 4
|
0x4
|
indicates a bit-length of 4
|
0x2
|
indicates a bit-length of 2
|
0xF
|
these two 4-bit numbers specify a
bit-length of 15
|
0x0
|
|
0xF
|
these two 4-bit numbers specify a
run of 5 zeros
|
0x3
|
|
0x9
|
indicates a bit-length of 9
|
0x3
|
indicates a bit-length of 3
|
0xF
|
these two 4-bit numbers specify a
run of 8 zeros
|
0x6
|
This example describes a Huffman Encoding Bit Length table of:
Code
|
Bit
Length
|
Code
|
BitLength
|
1
|
8
|
11
|
9
|
2
|
4
|
12
|
3
|
3
|
4
|
13
|
0
|
4
|
2
|
14
|
0
|
5
|
15
|
15
|
0
|
6
|
0
|
16
|
0
|
7
|
0
|
17
|
0
|
8
|
0
|
18
|
0
|
9
|
0
|
19
|
0
|
10
|
0
|
20
|
0
|
Once the twenty bit lengths are obtained, the Huffman Encoding
table is constructed by using the following algorithm:
1) Count the number of
codes for each code length. Let
LenCount[N] be the number of codes of length
N, where
N = {1..16}.
2) Find the decode
length and positions:
N = 0
TmpPos[0] = 0
DecodePos[0] = 0
DecodeLen[0] = 0
for (I = 1; I < 16; I++)
{
N = 2 * (N+LenCount[I])
M = N << (15-I)
if (M > 0xFFFF) M = 0xFFFF
DecodeLen[I] = (unsigned int)M
TmpPos[I] = DecodePos[I] =
DecodePos[I-1] +
LenCount[I-1]
}
3) Assign numerical
values to all codes:
for (I = 0; I < TableSize; I++)
{
if (BitLength[I] != 0)
DecodeNum[ TmpPos[BitLength[I]
& 0xF]++ ] = I
}