Parity Volume Set Specification 1.0
is sadly no longer available.
Parity Volume Set specification 1.0 [2001-10-14]
Parity Volume Set specification 1.0 [2001-10-14]
by Stefan Wehlus - initial idea by Tobias Rieper, further suggestions and format
tweaking by Kilroy Balore, Willem Monsuwe and Karl Vogel.
The following document describes a file format called Parity Volume Set.
Parity Volume Set provides RAID like parity records for a given number of files.
With this parity records it is possible to restore missing files of the set.
You have 10 data files.
With a program using this specification you can calculate parity
volumes for these files.
The parity volumes contain a reed solomon checksum of the data of the
data files. So if a file is missing or corrupt, it can be reconstructed out of
the remaining data files and the parity volumes. Any file of this set can be
reconstructed with any parity volume. You just need as much parity volumes, as
files you are missing.
For instance, in the example above, Foobar.d03 is
lost. So it is possible to restore Foobar.d03 using the remaining data files and
one parity volume. It doesn't matter, which parity volume (p01/p02/p03) -
everyone will do...
In case you miss 2 files, you need 2 parity volumes. What
parity volumes and their combination (p01+p02/p01+p03/p02+p03) doesn't matter
too - you just need two of them. (They have to be different: renaming a copy of
foobar.p01 to foobar.p02 and using it with foobar.p01 to restore will not
A parity volume set consists of two parts. First there is a .PAR file (the
It contains information about the files, which are "saved" in
the volume sets:
- what files are stored in the set
- their checksums
- their size
- their filename
- a general comment for the set
Second there are the parity volume files. They are named .P00, .P01,
P02..... PXX. (After .P99, there will be .Q00...)
same information as the PAR file (besides the comment) and
- the calculated
parity data of the stored files
Format specification for a .PAR/.Pxx file
(The data is stored in "Intel notation".)
"PAR" followed by 5 null bytes
- The version number consists of two parts:
low Dword (bit 0..31) is the version - $00010000 for v01.00
Dword (bit 32..63) is a identifier for the generating program. The
client can use it to see, which program generated the parity volume set.
This may come in handy, if your program needs to know, if it was the
generator itself. (Maybe to see, if it can use some proprietary bits in
the status register or to notify, if new versions are out...)
coding is "program"-"version"-"subversion"-"subsubversion".
codes so far:
00 - undefined
01 - Mirror
02 - PAR
Mirror 0.2.1 has $01000201 as program identifier.
-a MD5 hash of the rest of the file starting from
0x0020 to the end
-used to detect a corrupt file
-used as a identifier for the parity volume set
Make a array of bytes (one dimendion, size=used files*16).
Then put the MD5 hashes (not the MD516k hashes) there, starting with the
first file. Use only the files, which are included in the parity data
(status bit 0 = 1). Then calculate the MD5 hash of this array.
-the number of the parity volume
-0 for the PAR
-1 for .p01, 2 for .p02, 3 for .p03 etc.
||number of files
-the number of files stored in the parity volume
(only the input data files - the parity volumes are not counted)
The files, which are not stored in the parity data, but in the file list
for CRC checking, are counted too. So this is the number of files in the
||start offset of the file list
-the start offset of the list of
the stored files
||file list size
-the size of the file list in bytes
||start offset of the data
-the start offset of the data area
-the size of the data area in bytes
-This is the start of the list of the saved files.
-it consists of file entrys
A file entry has the following format:
-the size of the file list entry in bytes
starts at 1, this record is counted too
-64 bits for status flags or numbers
only bit 0 and 1 are used:
bit0 = 0 - file is not saved in the parity
bit0 = 1 - file is saved in the parity volume set
= 0 - file is not checked successfully yet
bit1 = 1 - file was
-the size of the stored file in bytes
- a MD5 hash of the whole file
||16k MD5 hash
-a MD5 hash of the first 16384 Bytes of the file
-if the file is smaller then 16k, only the existing bytes are used
- the full filename with extension in Unicode
- paths are not allowed
- no end indicators, you
have to compute the size out of the entrysize (chars = (entrysize -
0x38) DIV 2)
After the file list, the data area starts.
For the PAR file, the data
area contains a comment for the volume set. The comment is stored in uniode,
control characters are allowed.
For the Pxx files, the data area contains
the checksum data.
Comupting of the checksum data:
You have n files
and want to create m parity data. So you create a data field D1..Dm,
where D1 is the first byte of file1, D2 is the first byte of
file2, D3 is the first byte of file 3... Dn is the first byte of
file n. For this field you create a field of m parity checksums
C1..Cm. C1 is stored as the first parity byte of .p01, C2 is
stored as the first parity byte of .p02, C3 is stored as the first parity
byte of .p03... Cm is stored as the first parity byte of .pm.
process the second byte of all files and so on, until you reached the end of the
biggest input file. If a file has no bytes left, use 0 for the calculation. So
the size of the parity data will be the size of the biggest file stored in the
parity volume set.
For calculating the checksums, the Reed-Solomon
algorithm is used. A good description of this algorithm for programmers is
The "wordsize" is 8 bit, so you calculate with Bytes only. (This restricts the
number of files + parity volumes to be less than 256.)
Why the 16k MD5 hashes?
One demanded feature is, that the program will find
its files even if the filenames are changed. So maybe, if the program won't find
the files due to renaming, you click the "search" button and then it checks the
MD5 hashes of all files in the directory to find his files. But with large
numbers of large files this will take a long time. With the 16k hashes, the
program only needs to check the first 16k of each file. This will be much
faster. If it finds a file, it can make a full hash too, just to be sure...
Why the file is saved - status flag?
Maybe the PAR file is only used to
provide file comments and checksums. The files are not saved in a PXX volume
Why the file is checked - status flag?
So you can build in a
function, that your program skips already successfully checked files. Repeated
checking of the same set will be faster this way. (Like with QuickSFV)
A file present in the file list doesn't have to be stored in the parity data
at all times. So, some (or all) files can be just for checksum control or file
comments in the PAR file. (In case, only a .SFV or .NFO functionality is
The saved files can be of any type and size.
no informations about the parity volumes stored in the header. So the program
has to scan the directory for present parity volumes. They can be identified
with the set hash.
There is no explicit rule for the order of the entrys
in the filelist, but I recommend to sort the filenames in ascending order.
This document has no copyright. The introduced principles, the file format
specification and it's details can be used free by everyone for any type of
software. There will be no license or other limitations.
It doesn't matter,
if you are writing open, closed or shared source, public domain, freeware,
shareware or commercial software - you can use this spec.
But if you use the
PAR/PXX file format and want to make changes to the format, you have to discuss
here to maintain compatibility.