I like to browse Stack Overflow while I think over problems, and sometimes that leads to some interesting questions. Recently someone asked for help unpacking the data files of an “old” (2005) PC game to extract the music.

First, let’s do a lazy check to see whether the poster’s assumption that the file is compressed is correct:

» strings -n 20 pak1.apk
jeep_small_rocket_desert
item_missile_big_heat
...

We can see hundreds of readable strings, so it’s highly likely that this is not compressed. Yay.
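
If you don’t have strings handy, the same lazy check is easy to approximate. This is a rough Python 3 sketch, not a reimplementation of strings: printable_runs is a made-up helper name, and it only looks for runs of plain printable ASCII.

```python
import re


def printable_runs(data, min_len=20):
    """Return every run of printable ASCII that is at least min_len bytes long."""
    # 0x20-0x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb'[\x20-\x7e]{%d,}' % min_len)
    return [match.group().decode('ascii') for match in pattern.finditer(data)]
```

Calling printable_runs(open('pak1.apk', 'rb').read()) should surface the same kind of names as the strings output above.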

The original question gives us a small log snippet which is important because it makes figuring out the first bit of the header much simpler:

---- Initializing file system ----

pak1.apk - 733 files
pak2.apk - 45 files
pak4.apk - 1 files
F_Init: 
3 data files found.

This tells us two key things:

  1. These three files combine to form a single virtual filesystem.
  2. We now know how many entries we should be able to find in each file to verify our parser is correct.

Let’s take a look at the first 16 bytes of each file:

00000000: 0000 803f 9999 0000 2833 2500 dd02 0000  ...?....(3%.....
00000000: 0000 803f 9999 0000 3807 4d00 2d00 0000  ...?....8.M.-...
00000000: 0000 803f 9999 0000 3c07 0100 0100 0000  ...?....<.......

We can see three pretty obvious patterns in the first 16 bytes. The first 8 bytes are identical in all three files, so they are probably a magic number identifying this as an apk file. The next 4 bytes, read as a little-endian integer, are really, really close to the total length of the file, minus a bit (probably len(file) - len(header) - len(footer)). The last 4 bytes are obvious thanks to the log from the original question: 0x02dd, 0x2d and 0x01 are 733, 45 and 1. It’s the number of files in each archive.
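
To double check those readings, here is a small Python 3 sketch that unpacks the three headers shown above. It assumes both fields are little-endian unsigned 32-bit integers (the reading that turns dd02 0000 into 733); parse_header and the table_offset name are mine, not anything taken from the game.

```python
import struct

# The 8 bytes shared by all three files.
MAGIC = bytes.fromhex('0000803f99990000')
# Assumed layout: 8-byte magic, then two little-endian uint32 fields.
HEADER = struct.Struct('<8sII')


def parse_header(raw):
    """Return (table_offset, file_count) from the first 16 bytes of a pak."""
    magic, table_offset, file_count = HEADER.unpack(raw[:HEADER.size])
    if magic != MAGIC:
        raise ValueError('unexpected magic number')
    return table_offset, file_count


# The 16-byte headers copied from the hex dumps above.
headers = {
    'pak1.apk': '0000803f 99990000 28332500 dd020000',
    'pak2.apk': '0000803f 99990000 38074d00 2d000000',
    'pak4.apk': '0000803f 99990000 3c070100 01000000',
}

for name, hex_bytes in headers.items():
    offset, count = parse_header(bytes.fromhex(hex_bytes))
    print(name, offset, count)
# pak1.apk 2437928 733
# pak2.apk 5048120 45
# pak4.apk 67388 1
```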

At this point we’re kind of stuck. The blocks following this aren’t trivial to figure out. Let’s take a look at the smallest and simplest of the 3 files, pak4.

00010720: f8e9 0000 0000 0000 0000 5452 5545 5649  ..........TRUEVI
00010730: 5349 4f4e 2d58 4649 4c45 2e00 2cc0 5adf  SION-XFILE..,.Z.

Very interesting. Near the end of the file we can see the string TRUEVISION-XFILE. followed by 0x00. This is the signature that closes the footer of (version 2) TGA files, an image format. We know where the footer is, but that only tells us where the image ends, not where it starts or how large it is. Since we know there’s only 1 entry in this apk file, let’s try to brute force it. We’ll pin the end of the slice to the footer and keep moving the start backwards, looking for the earliest offset that still opens as a valid TGA file.

import sys
import warnings
from cStringIO import StringIO

from PIL import Image

# Signature found in the footer of TGA files.
FOOTER = b'TRUEVISION-XFILE'


def main():
    with open(sys.argv[1], 'rb') as fin:
        data = fin.read()

    start = data.find(FOOTER)
    if start == -1:
        sys.exit('TGA footer signature not found')
    # The signature ends with a '.' then 0x00.
    end = start + len(FOOTER) + 2

    # Grow the candidate slice backwards from the footer one byte at a time
    # and remember the earliest start offset that opens without warnings.
    earliest_start = None
    image = None
    for count in xrange(1, end + 1):
        attempted_tga = StringIO(data[end - count:end])

        try:
            with warnings.catch_warnings(record=True) as w:
                candidate = Image.open(attempted_tga)
        except IOError:
            continue

        if not w:
            earliest_start = end - count
            image = candidate

    if image is None:
        sys.exit('no valid TGA found')

    print(earliest_start)
    image.save('image.png')


if __name__ == '__main__':
    sys.exit(main())

It’s definitely not pretty, but let’s give it a shot:

» python brute.py pak4.apk
1040

Terrain image

Neat, looks like we’ve found one of the model textures. We’ve also figured out where the TGA starts in the apk file: at offset 1040. If we remove the parts of the header we know about (16 bytes), we get exactly 1024 bytes of unknown data at the top. Taking a look at the other pak files, we can see a clear change in entropy and structure 1040 bytes into each file, so that 1024-byte block is fairly likely to be part of the header.
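
“A clear change in entropy” can be eyeballed in a hex editor, but it’s also easy to put a number on it. A minimal Python 3 sketch, with hypothetical helper names, that scores fixed-size windows in bits per byte:

```python
import math
from collections import Counter


def shannon_entropy(chunk):
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(chunk).values()
    )


def windowed_entropy(data, window=256):
    """Return (start_offset, entropy) for each consecutive window of data."""
    return [
        (start, shannon_entropy(data[start:start + window]))
        for start in range(0, len(data), window)
    ]
```

With a pak file read into data, comparing shannon_entropy(data[16:1040]) against shannon_entropy(data[1040:2064]) is one way to look for the boundary; values close to 8 mean the bytes look random.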

A second pattern becomes obvious here. The 2nd header field turns out to be the offset of the footer, and the footer is variable length: it grows with the number of files stored in the apk.

(len(file) - footer_offset) / file_count = 76

This holds true for every apk, so we have a footer that consists of a 76-byte entry for each file in the apk. However, just like that odd 1024-byte block at the start, the entropy is extremely high here and there’s no logical structure. Because the size is always consistent, we know there’s no compression going on here, so the contents are probably “encrypted”. Then again, a constant entry size is not something you’d really see with complex encryption schemes.

It’s probably a simple substitution cipher.
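
The entry-size arithmetic above is simple enough to script. A Python 3 sketch, assuming the header layout worked out so far (8-byte magic, then the footer offset and the file count as little-endian 32-bit integers); table_entry_size is a hypothetical helper:

```python
import struct


def table_entry_size(data):
    """Return (entry_size, remainder) for a whole archive held in memory."""
    # Skip the 8-byte magic, then read footer offset and file count.
    table_offset, file_count = struct.unpack_from('<II', data, 8)
    # A non-zero remainder would mean the footer is not made of equal entries.
    return divmod(len(data) - table_offset, file_count)
```

A result of (76, 0) for each pak is what the formula predicts.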

Bingo

Let’s screw around in Python.

# Let's read the entire file in to play with it.
In [32]: f = open('pak4.apk', 'rb').read()
# Read the 1024 byte block in the header we aren't too sure about (but I'm
# starting to have a hunch)
In [33]: s2 = f[16:1040]
# Read the (only) footer entry (remember pak4 only has one file)
In [34]: s4 = f[-76:]

… ~100 or so misguided attempts later …

In [46]: ''.join([chr(ord(c) ^ ord(s2[i])) for i, c in enumerate(s4)])
Out[46]: 'models\\mapobjects\\elektro\\power_plant.tga\x00...'

Hurray! Everything clicks now. The 1024-byte block at the end of the header is the key for the XOR cipher used on the file table in the footer. We simply XOR each byte in the footer with the respective byte in the key.
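
Here is the same trick as a standalone Python 3 sketch, where bytes already behave like integers so the chr/ord dance goes away. xor_with_key is a name I made up, the demo key is synthetic, and I’m assuming the key simply wraps around every 1024 bytes for entries past the first, which is what the parser in this post does.

```python
def xor_with_key(chunk, key, key_offset=0):
    """XOR chunk with key, starting key_offset bytes in and wrapping around."""
    return bytes(
        b ^ key[(key_offset + i) % len(key)]
        for i, b in enumerate(chunk)
    )


# Synthetic demo: this key is made up. The real one is bytes 16..1040 of a pak.
key = bytes(range(256)) * 4
entry = b'models\\mapobjects\\elektro\\power_plant.tga'.ljust(76, b'\x00')

scrambled = xor_with_key(entry, key)
restored = xor_with_key(scrambled, key)

# Keep everything before the first NUL byte as the filename.
print(restored.split(b'\x00', 1)[0].decode('ascii'))
# prints: models\mapobjects\elektro\power_plant.tga
```

Because XOR is its own inverse, the same function both scrambles and unscrambles an entry.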

Okay, let’s combine everything we’ve learned and make a useful class to handle this all for us:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import struct


class PakParser(object):
    #: File prefix.
    MAGIC_NUMBER = b'\x00\x00\x80\x3F\x99\x99\x00\x00'
    TABLE_ENTRY_SIZE = 76

    def __init__(self):
        self.file_count = 0
        self.cipher_table = None
        self.file_table = {}

    def load(self, file_obj):
        if not file_obj.read(len(self.MAGIC_NUMBER)) == self.MAGIC_NUMBER:
            raise IOError('invalid magic number')

        # The start of the file listing table (given from the start of the
        # file) and the number of file entries.
        file_table_offset, self.file_count = struct.unpack(
            '<II',
            file_obj.read(8)
        )

        # The cipher table comes after the header fields and is always exactly
        # 1 KiB (1024 bytes).
        self.cipher_table = file_obj.read(1024)

        # Each entry in the file listing table is 76 bytes.
        file_obj.seek(file_table_offset, 0)

        self.file_table = dict(
            (filename.strip('\x00'), (offset, size))
            for filename, offset, size in (
                self.unpack_table_entry(
                    self.decipher(
                        file_obj.read(self.TABLE_ENTRY_SIZE),
                        i * self.TABLE_ENTRY_SIZE
                    )
                )
                for i in xrange(self.file_count)
            )
        )

    def decipher(self, chunk, offset):
        return ''.join(
            chr(
                ord(c)
                ^
                ord(self.cipher_table[(i + offset) % 1024])
            )
            for i, c in enumerate(chunk)
        )

    def unpack_table_entry(self, entry):
        # Filename (64 bytes)
        # Offset (4 bytes)
        # Size (4 bytes)
        # Unknown (4 bytes), seems to always be 0; dropped by the [:3] slice.
        return struct.unpack_from('<64sIII', entry)[:3]

    def extract_file(self, file_obj, file_name):
        offset, size = self.file_table[file_name]
        file_obj.seek(offset, 0)
        return file_obj.read(size)


def main():
    with open(sys.argv[1], 'rb') as fin:
        pak = PakParser()
        pak.load(fin)

        sys.stdout.write(pak.extract_file(fin, sys.argv[2]))

if __name__ == '__main__':
    sys.exit(main())

And here’s how you use the example:

» python pakparser.py pak2.apk "sounds\\wavegun.wav" > wavegun.wav