Protect Python Code with Bytecode Obfuscation

薛弘厚

2023-12-01

This is an English version of my original post in Chinese.

The Chinese version is available here.

Please keep the original address of the post when reprinting. http://blog.csdn.net/ir0nf1st/article/details/61962197

There are several ways to protect Python source code. Some are not very effective, and others are effective but come with side effects. This post briefly analyzes these approaches and then presents a solution that protects Python code effectively without those side effects.

<0x01> Source code obfuscation

I tried two source code obfuscation tools: pyminifier and http://pyob.oxyry.com/. They work in a similar way: they rename classes, functions and variables, and even scramble some Python constants (for example True, False and None). Once the source is obfuscated, it becomes hard for a human to read and understand. However, this kind of obfuscation can barely withstand simple text search and replace. Generally, a source code obfuscator offers little real protection unless it supports abstract syntax tree (AST) analysis and modification. Check this post for a deeper analysis of Python source obfuscation using ASTs.
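
To make the AST-based renaming idea concrete, here is a minimal sketch written for Python 3 (the rest of this post uses Python 2.7). The Renamer class and the _0, _1, ... naming scheme are hypothetical and are not taken from pyminifier or the Oxyry tool. The sketch only handles function names, arguments and plain variable names; a real obfuscator would also have to deal with builtins, attributes, imports, keyword arguments and scoping.

```python
import ast

SOURCE = """
def greet(name):
    message = 'Hello ' + name
    return message

result = greet('World')
"""

class Renamer(ast.NodeTransformer):
    """Consistently rename every identifier in a tiny, builtin-free program."""

    def __init__(self):
        self.mapping = {}

    def _new_name(self, old):
        # Hypothetical naming scheme: _0, _1, _2, ...
        if old not in self.mapping:
            self.mapping[old] = '_%d' % len(self.mapping)
        return self.mapping[old]

    def visit_FunctionDef(self, node):
        node.name = self._new_name(node.name)
        self.generic_visit(node)  # rename arguments and body as well
        return node

    def visit_arg(self, node):
        node.arg = self._new_name(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._new_name(node.id)
        return node

renamer = Renamer()
tree = ast.fix_missing_locations(renamer.visit(ast.parse(SOURCE)))

# Compile and run the renamed tree to check that behaviour is unchanged.
namespace = {}
exec(compile(tree, '<obfuscated>', 'exec'), namespace)
obfuscated_result = namespace[renamer.mapping['result']]
print(renamer.mapping)
print(obfuscated_result)
```

Because the renaming is done on the tree rather than on the text, every use of an identifier is renamed together with its definition, which is exactly what plain search and replace cannot guarantee.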

<0x02> Packing source code into executable

py2exe and PyInstaller can pack Python source and the Python interpreter into an executable, so that your Python code can run on a target machine without a Python installation. py2exe packs the code and its dependencies into a zip file; unzip it and all the pyc files are there, ready to be decompiled. PyInstaller is more secure than py2exe: it supports encrypting the code with AES, but the plain-text AES key can also be found easily in the packed file.

Another choice is Cython. It lets Python and C coexist: you can call a module written in C from Python. The C module is built into native binary code, for example x86 PE on Windows or ARM ELF on an ARM machine running Linux. Reverse engineering a C module is somewhat more difficult than reverse engineering a Python module, so Cython may protect your C code but not your Python code. Using Cython has two side effects:

1. The C module is built into native binary code, so your application as a whole is no longer platform-independent.

2. Developing in C is a little more difficult than developing in Python.

If you are willing to pay the price, Cython can be a good choice to protect your C source code.

<0x03> Define private Python bytecode set

For a given Python version, the compiler, interpreter, disassembler and decompiler all work on the same bytecode set. The bytecode set changes between versions, which is one of the reasons why a pyc file generated by a Python 2.X compiler can't be executed by a Python 3.X interpreter.

If you introduce your own bytecode set, normal tools will no longer be able to disassemble or decompile the pyc files generated by your private Python compiler, and your secret is protected. The price is that your Python application must be shipped together with your private Python interpreter.
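
As a rough illustration of what a private bytecode set means at the byte level, the sketch below renumbers the opcodes of the Python 2.7 "Hello World" bytecode used later in this post. It is a toy under stated assumptions: the rotation by 37 and the helper names are made up, and the code only rewrites bytes (it runs on Python 3 and never executes the bytecode). A real private bytecode set requires changing the opcode numbering inside the interpreter and compiler and rebuilding them.

```python
HAVE_ARGUMENT = 90  # Python 2.7: opcodes >= 90 are followed by a 2-byte argument

def make_tables(shift=37):
    """Build toy encode/decode tables: private opcode = (public + shift) % 256."""
    encode = [(op + shift) % 256 for op in range(256)]
    decode = [0] * 256
    for public_op, private_op in enumerate(encode):
        decode[private_op] = public_op
    return encode, decode

def to_private(co_code, encode):
    """Renumber opcode bytes only; argument bytes stay untouched."""
    out = bytearray(co_code)
    offset = 0
    while offset < len(out):
        public_op = out[offset]
        out[offset] = encode[public_op]
        offset += 3 if public_op >= HAVE_ARGUMENT else 1
    return bytes(out)

def to_public(private_code, decode):
    """Inverse of to_private, as a matching private interpreter would see it."""
    out = bytearray(private_code)
    offset = 0
    while offset < len(out):
        public_op = decode[out[offset]]
        out[offset] = public_op
        offset += 3 if public_op >= HAVE_ARGUMENT else 1
    return bytes(out)

# LOAD_CONST 0; PRINT_ITEM; PRINT_NEWLINE; LOAD_CONST 1; RETURN_VALUE
ORIGINAL = bytes.fromhex('640000474864010053')

encode, decode = make_tables()
private_code = to_private(ORIGINAL, encode)
print(private_code.hex())
print(to_public(private_code, decode) == ORIGINAL)
```

A stock disassembler reading private_code would interpret every opcode byte with the public numbering and therefore print the wrong instructions, while the holder of the decode table recovers the original bytes exactly.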

<0x04> Bytecode obfuscation

With bytecode obfuscation, normal disassemblers and decompilers can be tricked easily without harming the execution of your real application code. Here is an example that tricks Uncompyle6 and dis.

A simple Python application below:

print 'Hello World'

Disassembling it with dis gives the following code:

>>> import marshal,dis
>>> fd = open('1.pyc', 'rb')
>>> fd.seek(8)
>>> code_obj = marshal.load(fd)
>>> fd.close()
>>> dis.dis(code_obj)
  1           0 LOAD_CONST               0 ('Hello World')
              3 PRINT_ITEM
              4 PRINT_NEWLINE
              5 LOAD_CONST               1 (None)
              8 RETURN_VALUE
>>>

Explanation of the above disassembled instructions:

0 LOAD_CONST     0 ('Hello World') #Loads co_consts[0] to TOS(Top Of the Stack). co_consts[0] contains constant string 'Hello World'
3 PRINT_ITEM                       #Prints TOS to sys.stdout.
4 PRINT_NEWLINE                    #Prints a new line on sys.stdout. This instruction is auto-generated as the last operation of a print statement
5 LOAD_CONST     1 (None)          #Loads co_consts[1] to TOS. co_consts[1] contains None
8 RETURN_VALUE                     #Returns with TOS to the caller of the function. These two instructions are auto-generated.
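
The session above is Python 2.7, where the pyc header is 8 bytes long. For readers on a current interpreter, here is a hedged Python 3 sketch of the same steps. It assumes CPython 3.7 or later, where the pyc header is 16 bytes; the file names are placeholders, and the instructions printed by dis will differ from the Python 2.7 listing because the bytecode set changed.

```python
import dis
import marshal
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'hello.py')   # placeholder file names
    pyc = os.path.join(tmp, 'hello.pyc')
    with open(src, 'w') as f:
        f.write("print('Hello World')\n")
    py_compile.compile(src, cfile=pyc, doraise=True)

    with open(pyc, 'rb') as fd:
        fd.seek(16)  # skip the 16-byte header (assumes CPython 3.7+)
        code_obj = marshal.load(fd)

# Disassemble the code object that was stored after the header.
dis.dis(code_obj)
```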

Insert a JUMP_ABSOLUTE instruction before the entry point by crafting the pyc file accordingly, and then disassemble it with dis:

  1           0 JUMP_ABSOLUTE            3
        >>    3 LOAD_CONST               0 ('Hello World')
              6 PRINT_ITEM
              7 PRINT_NEWLINE
              8 LOAD_CONST               1 (None)
             11 RETURN_VALUE
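
The byte-level edit itself is small. The sketch below shows one way to build the new co_code string; it is an illustration, not the tool I used. The prepend_jump helper is a hypothetical name, the opcode values are the ones visible in the hex dumps of this post, and the code only manipulates bytes (it runs on Python 3). The optional junk argument places never-executed bytes between the jump and the real entry, which is what the poisoned variant further down does. In general, absolute jump targets inside the original code and the line number table (co_lnotab) would also have to be shifted by the length of the prefix; the "Hello World" code has no jumps, so the sketch ignores this.

```python
import struct

JUMP_ABSOLUTE = 0x71  # Python 2.7 opcode values, as shown in the hex dumps
LOAD_CONST = 0x64

def prepend_jump(co_code, junk=b''):
    """Prefix co_code with a JUMP_ABSOLUTE that skips over `junk`."""
    target = 3 + len(junk)  # the jump itself is 3 bytes long
    jump = struct.pack('<BH', JUMP_ABSOLUTE, target)  # opcode + 16-bit LE argument
    return jump + junk + co_code

# LOAD_CONST 0; PRINT_ITEM; PRINT_NEWLINE; LOAD_CONST 1; RETURN_VALUE
ORIGINAL = bytes.fromhex('640000474864010053')

jump_only = prepend_jump(ORIGINAL)
poisoned = prepend_jump(ORIGINAL, junk=struct.pack('<BH', LOAD_CONST, 0xFFFF))
print(jump_only.hex())
print(poisoned.hex())
```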

The application logic in the crafted pyc file is exactly the same as in the original one, apart from a small change in control flow. Uncompyle6 can't deal with it and gives the following output:

<<< Error: Decompiling stopped due to <class 'uncompyle6.semantics.pysource.ParserError'>

Next I poison the crafted pyc file by crafting it once more. Disassembling the poisoned pyc file with my own Python disassembler gives:

1           0 JUMP_ABSOLUTE        [71 06 00]     6 
            3 LOAD_CONST           [64 FF FF] 65535 (FAKE!)
      >>    6 LOAD_CONST           [64 00 00]     0 (Hello World)
            9 PRINT_ITEM           [47 -- --]
           10 PRINT_NEWLINE        [48 -- --]
           11 LOAD_CONST           [64 01 00]     1 (None)
           14 RETURN_VALUE         [53 -- --]

The second instruction would load co_consts[65535] onto TOS. In this case the co_consts tuple has only two elements, so index 65535 is out of range and the second instruction is invalid. Because of the preceding JUMP_ABSOLUTE instruction, the second instruction is never executed and the Python interpreter is not affected.

A normal disassembler like dis tries to display as much information as possible, and it does not follow the control flow. When it reaches the second instruction, it tries to read co_consts[65535] and raises an unhandled IndexError exception:

>>> fd = open('1.pyc', 'rb')
>>> fd.seek(8)
>>> import marshal,dis
>>> co = marshal.load(fd)
>>> dis.dis(co)
  1           0 JUMP_ABSOLUTE            6
              3 LOAD_CONST           65535
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\dis.py", line 43, in dis
    disassemble(x)
  File "C:\Python27\lib\dis.py", line 96, in disassemble
    print '(' + repr(co.co_consts[oparg]) + ')',
IndexError: tuple index out of range
>>>

Now neither Uncompyle6 nor dis can work with this slightly crafted, simple pyc file, so the code is effectively protected from these tools.
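
The dis failure shown in the traceback can be reproduced with a few lines of code. The following is a minimal linear-sweep disassembler for the handful of Python 2.7 opcodes used in this post. It is a sketch written for illustration (it is neither the real dis module nor my own disassembler), it works on raw bytes and runs on Python 3, and the opcode values are taken from the hex dumps above.

```python
HAVE_ARGUMENT = 90  # Python 2.7: opcodes >= 90 are followed by a 2-byte argument
LOAD_CONST = 0x64
OPNAMES = {0x00: 'STOP_CODE', 0x47: 'PRINT_ITEM', 0x48: 'PRINT_NEWLINE',
           0x53: 'RETURN_VALUE', 0x64: 'LOAD_CONST', 0x71: 'JUMP_ABSOLUTE',
           0x91: 'EXTENDED_ARG'}

def linear_sweep(co_code, co_consts):
    """Decode instructions in address order and resolve constants eagerly."""
    lines = []
    offset = 0
    while offset < len(co_code):
        op = co_code[offset]
        name = OPNAMES.get(op, '<%d>' % op)
        if op >= HAVE_ARGUMENT:
            arg = co_code[offset + 1] | (co_code[offset + 2] << 8)
            text = '%3d %-14s %5d' % (offset, name, arg)
            if op == LOAD_CONST:
                # Eager lookup: a fake index raises IndexError right here.
                text += ' (%r)' % (co_consts[arg],)
            offset += 3
        else:
            text = '%3d %s' % (offset, name)
            offset += 1
        lines.append(text)
    return lines

CONSTS = ('Hello World', None)
ORIGINAL = bytes.fromhex('640000474864010053')
POISONED = bytes.fromhex('71060064ffff640000474864010053')

print('\n'.join(linear_sweep(ORIGINAL, CONSTS)))
try:
    linear_sweep(POISONED, CONSTS)
    sweep_failed = False
except IndexError:
    sweep_failed = True
print(sweep_failed)
```

The sweep handles the original bytes without trouble, but on the poisoned bytes it reaches the fake LOAD_CONST at offset 3, which the interpreter never executes, and fails on the constant lookup.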

<0x05> More obfuscating tricks

<0x05 0x01> Fake branch

Intentionally constructed branches trick not only machines but also humans, and can be used to protect code against manual reverse engineering. Here are some examples:

#flag can be a result of some computation
#or pre-defined constant hiding somewhere
#or even a return value of a function call
if flag is condition:
    normal_processing()
else:
    useless_but_complicated_obfuscating_code()
    or_even_invalid_code()

try:
    some_processing()
    raise_exception = __import__('module_does_not_exist')
    #the above code raises an 'ImportError' exception and the control flow transfers to the except branch
    useless_but_complicated_obfuscating_code()
except:
    continue_normal_processing()

try:
    some_processing()
    raise_exception = __import__('sys').non_exist_function()
    #the above code raises an 'AttributeError' exception and the control flow transfers to the except branch
    useless_but_complicated_obfuscating_code()
except:
    continue_normal_processing()

try:
    some_processing()
    raise_exception = 1/0
    #the above code raises a 'ZeroDivisionError' exception and the control flow transfers to the except branch
    useless_but_complicated_obfuscating_code()
except:
    continue_normal_processing()
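
For the first pattern, the flag is typically a value that looks data-dependent but always evaluates the same way (often called an opaque predicate). Here is a small runnable sketch in Python 3; the function names are placeholders chosen for this example.

```python
def opaque_flag(n):
    # n*n + n == n*(n + 1) is a product of two consecutive integers,
    # so it is always even, whatever integer n is.
    return (n * n + n) % 2 == 0

def run(n):
    if opaque_flag(n):
        return 'normal processing'
    # Dead branch: never taken, it exists only to mislead the reader.
    return 'decoy processing'

results = {run(n) for n in range(-50, 50)}
print(results)
```

A reader who does not spot the arithmetic identity has to analyze both branches, although only one of them can ever run.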

<0x05 0x02> Overlapping Instruction

Overlapping instructions are more widely used on CISC machines, which have variable instruction lengths. Here are some examples of x86 overlapping instructions.

#single overlapping instruction
00: EB 01           jmp  3
02: 68 c3 90 90 90  push 0x909090c3

#actual execution
00: EB 01           jmp  3
03: C3              retn

#multiple overlapping instruction
00: EB02                    jmp  4
02: 69846A40682C104000EB02  imul eax, [edx + ebp*2 + 0x102C6840], 0x02EB0040

#actual execution
00: EB02       jmp  4
04: 6A40       push 0x40
06: 682C104000 push 0x40102C
0B: EB02       jmp  0xF

#overlapping itself
00: EBFF    jmp 1
02: C0C300  rol bl, 0

#actual execution
00: EBFF    jmp 1
01: FFC0    inc eax
03: C3      retn

Compared with a simple jump instruction, overlapping instructions take control-flow obfuscation a step further and are more effective against humans. Although Python bytecode resembles a RISC instruction set, overlapping instructions can still be constructed. Here are examples of overlapping Python bytecode:

#single overlapping instruction
 0 JUMP_ABSOLUTE        [71 05 00]     5 
 3 PRINT_ITEM           [47 -- --]
 4 LOAD_CONST           [64 64 01]     356
 7 STOP_CODE            [00 -- --]

#actual execution
 0 JUMP_ABSOLUTE        [71 05 00]     5 
 5 LOAD_CONST           [64 01 00]     1

#multiple overlapping instruction
 0 EXTENDED_ARG         [91 00 64] 
 3 EXTENDED_ARG         [91 00 53]
 6 JUMP_ABSOLUTE        [71 02 00]

#actual execution
 0 EXTENDED_ARG         [91 00 64] 
 3 EXTENDED_ARG         [91 00 53]
 6 JUMP_ABSOLUTE        [71 02 00]
 2 LOAD_CONST           [64 91 00]
 5 RETURN_VALUE         [53 -- --]
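
The single overlapping example can be checked by decoding the same bytes from two different offsets. The decoder below is a toy that knows only the Python 2.7 opcodes appearing in that example (values taken from the hex dumps); it works on raw bytes and runs on Python 3.

```python
HAVE_ARGUMENT = 90  # Python 2.7: opcodes >= 90 are followed by a 2-byte argument
OPNAMES = {0x00: 'STOP_CODE', 0x47: 'PRINT_ITEM',
           0x64: 'LOAD_CONST', 0x71: 'JUMP_ABSOLUTE'}

def decode_from(co_code, start):
    """Decode instructions sequentially, beginning at byte offset `start`."""
    out = []
    offset = start
    while offset < len(co_code):
        op = co_code[offset]
        name = OPNAMES.get(op, '<%d>' % op)
        if op >= HAVE_ARGUMENT:
            if offset + 2 >= len(co_code):
                break  # truncated instruction at the end of the buffer
            arg = co_code[offset + 1] | (co_code[offset + 2] << 8)
            out.append((offset, name, arg))
            offset += 3
        else:
            out.append((offset, name, None))
            offset += 1
    return out

# 71 05 00 | 47 | 64 64 01 | 00   as seen by a linear disassembler
BLOB = bytes.fromhex('7105004764640100')

static_view = decode_from(BLOB, 0)   # what a linear sweep reports
runtime_view = decode_from(BLOB, 5)  # what runs after JUMP_ABSOLUTE 5
print(static_view)
print(runtime_view)
```

The byte at offset 5 is the low argument byte of LOAD_CONST 356 in the static view and the opcode of LOAD_CONST 1 in the runtime view, which is the overlap.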

<0x06> Confrontation with manual reverse engineering

As already shown, bytecode obfuscation can trick machines easily, but it is still hard to trick an experienced engineer. More complicated control flow makes reverse engineering a little harder, but not by much: an experienced engineer will certainly use control flow analysis tools against your code.
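
To show how little effort such analysis takes against the poisoned pyc from <0x04>, here is a toy tracer that follows JUMP_ABSOLUTE instead of sweeping linearly. It is a sketch under simplifying assumptions: it handles only unconditional jumps and the few Python 2.7 opcodes of the example, whereas a real tool would keep a worklist to cover conditional branches, loops and exception handlers. It works on raw bytes and runs on Python 3.

```python
HAVE_ARGUMENT = 90  # Python 2.7: opcodes >= 90 are followed by a 2-byte argument
RETURN_VALUE = 0x53
JUMP_ABSOLUTE = 0x71
OPNAMES = {0x47: 'PRINT_ITEM', 0x48: 'PRINT_NEWLINE', 0x53: 'RETURN_VALUE',
           0x64: 'LOAD_CONST', 0x71: 'JUMP_ABSOLUTE'}

def follow_flow(co_code, entry=0):
    """Decode only the instructions reachable from `entry`."""
    trace = []
    visited = set()
    offset = entry
    while 0 <= offset < len(co_code) and offset not in visited:
        visited.add(offset)
        op = co_code[offset]
        name = OPNAMES.get(op, '<%d>' % op)
        if op >= HAVE_ARGUMENT:
            arg = co_code[offset + 1] | (co_code[offset + 2] << 8)
            next_offset = offset + 3
        else:
            arg = None
            next_offset = offset + 1
        trace.append((offset, name, arg))
        if op == RETURN_VALUE:
            break
        # Follow the jump instead of decoding the bytes that come after it.
        offset = arg if op == JUMP_ABSOLUTE else next_offset
    return trace

POISONED = bytes.fromhex('71060064ffff640000474864010053')
trace = follow_flow(POISONED)
for offset, name, arg in trace:
    print(offset, name, '' if arg is None else arg)
```

The fake LOAD_CONST at offset 3 is never visited, so the poisoned index causes no error and the real program is recovered.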

Code scrambling offers further protection against humans. The real application code is encrypted and stored as one or more constant strings in the pyc file. A piece of code first descrambles the real application code at runtime and then executes it. A carefully designed scrambling algorithm can protect the code from static analysis until the algorithm itself is cracked. Here is a simple example of code scrambling:

Scramble the above 1.pyc:

>>> fd = open('1.pyc', 'rb')
>>> fd.seek(8)
>>> import marshal
>>> co = marshal.load(fd)
>>> fd.close()
>>> code_string = marshal.dumps(co)
>>> scrambled_code = code_string.encode('zlib').encode('base64')
>>> print scrambled_code
eJxLZoACRiB2AOJifiBRyMaQ8v9/CgODu0cKI0OwBhNIghtIeKTm5OQrhOcX5aT4aYC0oRHFXCAi
MbcgJ9VIr6CyhAPItcnNTynNSbUD2VACUgQAIHcTlg==

>>>

Copy the scrambled_code string into the following Python source descramble.py:

scrambled_code_string = 'eJxLZoACRiB2AOJifiBRyMaQ8v9/CgODu0cKI0OwBhNIghtIeKTm5OQrhOcX5aT4aYC0oRHFXCAiMbcgJ9VIr6CyhAPItcnNTynNSbUD2VACUgQAIHcTlg=='
exec __import__('marshal').loads(scrambled_code_string.decode('base64').decode('zlib'))

Execute descramble.py:

>python descramble.py
Hello World

Of course, this scrambling algorithm (zlib compression followed by base64 encoding) is far too simple to protect the code and does not even need to be 'cracked', but the example shows the essence of code scrambling.
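
The str.encode('zlib') and str.encode('base64') codecs used above exist only in Python 2. For reference, a hedged Python 3 sketch of the same scramble and descramble round trip follows. The descramble_and_run helper is a name chosen for this example, and the marshal format is specific to the Python version, so the scrambled string must be produced and executed by the same interpreter version.

```python
import base64
import marshal
import zlib

# Compile a tiny program in memory instead of reading 1.pyc from disk.
SOURCE = "message = 'Hello World'\nprint(message)\n"
code_obj = compile(SOURCE, '<scrambled>', 'exec')

# Scramble: marshal -> zlib -> base64 (same pipeline as the Python 2 example).
scrambled_code = base64.b64encode(zlib.compress(marshal.dumps(code_obj))).decode('ascii')

def descramble_and_run(text):
    """Reverse the pipeline and execute the recovered code object."""
    code = marshal.loads(zlib.decompress(base64.b64decode(text)))
    namespace = {}
    exec(code, namespace)
    return namespace

namespace = descramble_and_run(scrambled_code)
```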