I like to browse Stack Overflow while I think over problems, and sometimes that leads to some interesting questions. Recently someone asked for help unpacking the data files of an “old” (2005) PC game to extract the music.
First, let’s do a lazy check to see if the poster’s assumption that the file is compressed is correct:
We can see hundreds of readable strings, so it’s highly likely that this is not compressed. Yay.
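If you'd rather stay in Python for that kind of check, here's a rough stand-in for a `strings`-style scan. The helper name `printable_runs` and the 4-character minimum are my own choices for illustration, and the sample bytes are made up; on a real pak file you'd pass in the file's contents instead.

```python
import re
from typing import List


def printable_runs(data: bytes, min_len: int = 4) -> List[bytes]:
    """Return every run of at least min_len printable ASCII bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return re.findall(pattern, data)


# Made-up sample: two readable runs surrounded by binary junk.
sample = b"\x00\x01hello\x00ab\x00world!\xff"
print(printable_runs(sample))  # [b'hello', b'world!']
```

Lots of long readable runs suggests the data is stored as-is; compressed or encrypted data tends to produce only short, random-looking ones.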
The original question gives us a small log snippet which is important because it makes figuring out the first bit of the header much simpler:
---- Initializing file system ----
pak1.apk - 733 files
pak2.apk - 45 files
pak4.apk - 1 files
F_Init: 3 data files found.
This tells us two key things:
- These three files combine to form a single virtual filesystem.
- We now know how many entries we should be able to find in each file to verify our parser is correct.
Let’s take a look at the first 16 bytes of each file:
00000000: 0000 803f 9999 0000 2833 2500 dd02 0000 ...?....(3%.....
00000000: 0000 803f 9999 0000 3807 4d00 2d00 0000 ...?....8.M.-...
00000000: 0000 803f 9999 0000 3c07 0100 0100 0000 ...?....<.......
We can see three pretty obvious patterns in the first 16 bytes. The first 8 bytes are identical in all three files, so they are probably a magic number identifying this as an apk file. The next 4 bytes, read as a little-endian integer, are really, really close to the total length of the file, minus a bit (probably len(file) - len(header) - len(footer)). The last 4 bytes are obvious thanks to the log snippet from the original question: they hold the number of files in each archive (0x02dd = 733, 0x2d = 45 and 0x01 = 1).
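To make the layout concrete, here's a small sketch that unpacks those 16 bytes with `struct`. The field names (`magic`, `footer_offset`, `file_count`) are my own labels, and little-endian byte order is an assumption, although it's the only reading that matches the file counts in the log.

```python
import struct
from typing import NamedTuple


class ApkHeader(NamedTuple):
    magic: bytes        # 8 bytes, identical in all three files
    footer_offset: int  # assumed meaning: where the file table starts
    file_count: int     # matches the counts from the game's log


def parse_header(first16: bytes) -> ApkHeader:
    """Unpack the fixed 16-byte header (little-endian assumed)."""
    return ApkHeader(*struct.unpack("<8sII", first16[:16]))


# The first 16 bytes of pak1.apk, copied from the hex dump.
pak1 = bytes.fromhex("0000803f99990000" "28332500" "dd020000")
header = parse_header(pak1)
print(header.file_count)          # 733
print(hex(header.footer_offset))  # 0x253328
```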
At this point we’re kind of stuck. The blocks following this aren’t trivial to figure out. Let’s take a look at the smallest and simplest of the 3 files, pak4.
00010720: f8e9 0000 0000 0000 0000 5452 5545 5649 ..........TRUEVI
00010730: 5349 4f4e 2d58 4649 4c45 2e00 2cc0 5adf SION-XFILE..,.Z.
Very interesting. Near the end of the file we can see the string TRUEVISION-XFILE. followed by 0x00. This is the signature that closes the footer of a TGA file, an image format. It also ends at 0x1073c, exactly the value of the second header field for pak4. So we know where the TGA’s footer is, but that only tells us where the image ends, not where it starts or how large it is. Since we know there’s only 1 entry in this apk file, let’s try to brute force it. We’ll fix the end of the image and keep moving the candidate start until we manage to open a valid TGA file.
It’s definitely not pretty, but let’s give it a shot:
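A rough, stdlib-only sketch of the idea (not necessarily how brute.py does it): instead of handing every candidate to an image library, it checks whether the 18-byte TGA header at a candidate start describes an image whose size exactly fills the space up to the footer. It assumes an uncompressed TGA with no extension or developer area, and the names `looks_like_tga` and `find_tga_start` are made up for this example. The demo runs on a synthetic blob, not on the real pak4.apk.

```python
import struct
from typing import Optional

TGA_SIGNATURE = b"TRUEVISION-XFILE.\x00"  # last 18 bytes of a TGA 2.0 file
HEADER_LEN, FOOTER_LEN = 18, 26


def looks_like_tga(data: bytes, start: int, end: int) -> bool:
    """Cheap plausibility test for an uncompressed TGA in data[start:end]."""
    if end - start < HEADER_LEN + FOOTER_LEN:
        return False
    if data[end - len(TGA_SIGNATURE):end] != TGA_SIGNATURE:
        return False
    id_len, cmap_type, image_type = data[start], data[start + 1], data[start + 2]
    cmap_len, cmap_bits = struct.unpack_from("<HB", data, start + 5)
    width, height, depth = struct.unpack_from("<HHB", data, start + 12)
    if cmap_type not in (0, 1) or image_type not in (1, 2, 3):
        return False  # RLE image types are deliberately not handled here
    if depth not in (8, 16, 24, 32) or width == 0 or height == 0:
        return False
    cmap_bytes = cmap_len * ((cmap_bits + 7) // 8) if cmap_type else 0
    pixel_bytes = width * height * (depth // 8)
    # header + image ID + colour map + pixels + footer must fill the span
    expected = HEADER_LEN + id_len + cmap_bytes + pixel_bytes + FOOTER_LEN
    return expected == end - start


def find_tga_start(data: bytes, end: int) -> Optional[int]:
    """Walk the candidate start backwards from the end until one fits."""
    for start in range(end - HEADER_LEN - FOOTER_LEN, -1, -1):
        if looks_like_tga(data, start, end):
            return start
    return None


# Synthetic demo: 1040 bytes of padding, then a tiny 2x2 24-bit TGA.
tga = (bytes([0, 0, 2]) + bytes(9) + struct.pack("<HHBB", 2, 2, 24, 0)
       + b"\xab" * 12 + bytes(8) + TGA_SIGNATURE)
blob = bytes(1040) + tga
print(find_tga_start(blob, len(blob)))  # 1040 for this synthetic blob
```

On a real archive, `end` would be the offset just past the TGA signature, which for pak4 is the footer offset from the header.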
» python brute.py pak4.apk
1040
Neat, looks like we’ve found one of the model textures. We’ve also figured out where the TGA starts in the apk file, at offset 1040. If we remove the parts of the header we know about (16 bytes) we get exactly 1024 bytes of unknown data at the top. Taking a look at the other pak files, we can see a clear change in entropy and structure 1040 bytes into each file, so it’s fairly likely to be part of the header.
A second pattern becomes obvious here. The footer, whose offset is given in the 2nd header field, is variable length, being larger the more files are stored in the apk. Dividing the footer’s length by the file count gives the same number every time:
(len(file) - footer_offset) / file_count = 76
This holds true for every apk, so we have a footer that consists of a 76 byte entry for each file in the apk. However, just like that odd 1024 byte block at the start, the entropy is extremely high here and there’s no logical structure. Because the size is always consistent, we know there’s no compression going on here, so the contents are probably “encrypted”. Then again, a constant entry size isn’t something you’d really see with complex encryption schemes.
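The footer arithmetic is easy to turn into a sanity check. This is only a sketch: `footer_entry_size` is a name I made up, and the pak4 file length used below isn't measured, it's what the file would have to be (footer offset plus one 76-byte entry) if the formula holds for it.

```python
def footer_entry_size(file_len: int, footer_offset: int, file_count: int) -> int:
    """Return the per-file entry size, or fail if the footer doesn't divide evenly."""
    footer_len = file_len - footer_offset
    if file_count <= 0 or footer_len <= 0 or footer_len % file_count:
        raise ValueError("footer is not a whole number of entries")
    return footer_len // file_count


# pak4: footer offset 0x1073c from the header, 1 file, assumed length.
print(footer_entry_size(0x1073C + 76, 0x1073C, 1))  # 76
```

With real files you'd pass `os.path.getsize(path)` as `file_len` and take the other two values from the header.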
It’s probably a simple substitution cipher.
Let’s screw around in Python.
… ~100 or so misguided attempts later …
Hurray! Everything clicks now. The 1024-byte block at the end of the header is the XOR key used to scramble the file table in the footer. We simply XOR each byte in the footer with the respective byte in the key.
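In code, the decoding step is tiny. One caveat: the sketch below assumes "respective byte" means the byte's offset within the footer, wrapping around when the footer is longer than the 1024-byte key. That indexing is my guess, so treat it as something to verify against a real file. The function name `xor_footer` and the sample data are also just for illustration.

```python
def xor_footer(footer: bytes, key: bytes) -> bytes:
    """XOR each footer byte with the key byte at the same (wrapped) offset."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(footer))


# XOR is its own inverse, so the same call scrambles and unscrambles.
key = bytes(range(256)) * 4          # stand-in for the real 1024-byte key
scrambled = xor_footer(b"example entry", key)
print(xor_footer(scrambled, key))    # b'example entry'
```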
Okay, let’s combine everything we’ve learned and make a useful class to handle this all for us:
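As a minimal sketch of what such a class could look like, here's one that covers only what has been established so far: the 16-byte header, the 1024-byte key, and the 76-byte footer entries. The class and method names are hypothetical, the key indexing (footer offset, wrapping at 1024) is an assumption, and since the layout of the fields inside each entry isn't worked out above, it stops at handing back the raw unscrambled entries.

```python
import struct
from typing import Iterator


class ApkArchive:
    HEADER_LEN = 16   # magic (8) + footer offset (4) + file count (4)
    KEY_LEN = 1024    # XOR key stored right after the header
    ENTRY_LEN = 76    # one footer entry per stored file

    def __init__(self, data: bytes):
        self.data = data
        self.magic, self.footer_offset, self.file_count = struct.unpack_from(
            "<8sII", data, 0)
        self.key = data[self.HEADER_LEN:self.HEADER_LEN + self.KEY_LEN]
        footer_len = len(data) - self.footer_offset
        if footer_len != self.file_count * self.ENTRY_LEN:
            raise ValueError("footer size does not match the file count")

    @classmethod
    def from_file(cls, path: str) -> "ApkArchive":
        with open(path, "rb") as fh:
            return cls(fh.read())

    def raw_entries(self) -> Iterator[bytes]:
        """Yield each 76-byte file-table entry with the XOR key removed."""
        footer = self.data[self.footer_offset:]
        # Assumption: the key is indexed by offset within the footer.
        plain = bytes(b ^ self.key[i % self.KEY_LEN]
                      for i, b in enumerate(footer))
        for n in range(self.file_count):
            yield plain[n * self.ENTRY_LEN:(n + 1) * self.ENTRY_LEN]
```

Something like `ApkArchive.from_file("pak4.apk").raw_entries()` would then give you the unscrambled entries to pick apart further.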
And here’s how you use the example: