This is an English version of my original post in Chinese.
For the Chinese version, see here.
Please keep the original address of the post when reprinting. http://blog.csdn.net/ir0nf1st/article/details/61962197
There are several ways to protect Python source code. Some are not very effective, and others are effective but come with side effects. This post briefly analyzes these approaches and then presents a solution that protects Python code effectively without those side effects.
I tried two source code obfuscation tools: pyminifier and http://pyob.oxyry.com/. Both work in a similar way: they rename classes, functions and variables, and even scramble some Python constants (for example True, False and None). Once the source is obfuscated, it becomes difficult for a human to read and understand. But this kind of obfuscation can barely withstand simple text search and replace. In general, a source code obfuscator offers little real protection unless it supports abstract syntax tree (AST) analysis and modification. Check this post for a deeper analysis of Python source obfuscation using ASTs.
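As a rough illustration of what AST-based renaming involves, here is a minimal sketch using the standard ast module. It is written for Python 3, while the examples in this post are Python 2. The RenameIdentifiers class and the name mapping are hypothetical, and a real obfuscator would also have to handle scoping, attributes, imports and string references.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and function arguments according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes inside the function body.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Function parameters in the signature.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node


source = """
def area(width, height):
    result = width * height
    return result
"""

tree = ast.parse(source)
tree = RenameIdentifiers({"width": "O0", "height": "O1", "result": "O2"}).visit(tree)
ast.fix_missing_locations(tree)

# Compile the modified tree directly instead of converting it back to text.
namespace = {}
exec(compile(tree, "<obfuscated>", "exec"), namespace)
```

The function still behaves the same, but its local names no longer carry any meaning. Because the renaming works on syntax nodes rather than on text, it cannot accidentally touch a string or comment that happens to contain the same word.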
py2exe and PyInstaller can pack Python source and the Python interpreter into an executable, so that your Python code can run on a target machine without a Python installation. py2exe packs the code and its dependency files into a zip file. Unzip the file and all the pyc files are there, ready to be decompiled. PyInstaller is more secure than py2exe: it supports encrypting the source with AES, but the plain-text AES key can also be found easily in the packed file.
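To show how little a zip container hides, the sketch below builds a stand-in archive in memory and then recovers the code object from it. It is Python 3 and assumes the 16-byte pyc header used by CPython 3.7 and later (Python 2 uses 8 bytes). The archive is a hypothetical stand-in, not the exact layout py2exe produces, and the function names are made up for this example.

```python
import importlib.util
import io
import marshal
import zipfile

HEADER_SIZE = 16  # assumption: CPython 3.7+ pyc header (magic, flags, mtime, size)


def build_demo_archive():
    """Create an in-memory zip holding one pyc, standing in for a packed app."""
    code_obj = compile("secret = 6 * 7", "app.py", "exec")
    header = importlib.util.MAGIC_NUMBER + b"\x00" * (HEADER_SIZE - 4)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("app.pyc", header + marshal.dumps(code_obj))
    return buf.getvalue()


def recover_code_objects(archive_bytes):
    """Unzip every pyc and unmarshal it back into a code object."""
    recovered = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith(".pyc"):
                recovered[name] = marshal.loads(zf.read(name)[HEADER_SIZE:])
    return recovered


codes = recover_code_objects(build_demo_archive())
namespace = {}
exec(codes["app.pyc"], namespace)
```

Once the code object is back in memory it can be executed, disassembled or handed to a decompiler, which is why packing alone is not protection.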
Another choice is Cython. It lets Python and C coexist: you can call a module written in C from Python. The C module is built into native binary code, for example x86 PE on Windows or ARM ELF on an ARM machine running Linux. Reverse engineering a C module is somewhat more difficult than reverse engineering a Python module, so Cython may protect your C code but not your Python code. Two side effects of using Cython are:
1. The C module is built into native binary code, so your whole application is no longer platform independent.
2. Developing in C is a little more difficult than developing in Python.
If you are willing to pay the price, Cython can be a good choice to protect your C source code.
For a given Python version, the compiler, interpreter, disassembler and decompiler all work on the same bytecode set. The bytecode set changes between versions, which is one of the reasons a pyc file generated by the Python 2.x compiler can't be executed by the Python 3.x interpreter.
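The version dependency is visible in the first four bytes of every pyc file, the magic number. The Python 3 sketch below compares a pyc header against the running interpreter. The helper name is hypothetical, and the Python 2.7 magic value is quoted from memory of CPython 2.7, so treat it as an assumption.

```python
import importlib.util

# Assumption: magic number written by CPython 2.7 (62211 followed by "\r\n").
PY27_MAGIC = b"\x03\xf3\r\n"


def compiled_for_this_interpreter(pyc_bytes):
    """Return True if the pyc header carries the running interpreter's magic number."""
    return pyc_bytes[:4] == importlib.util.MAGIC_NUMBER


own_header = importlib.util.MAGIC_NUMBER + b"\x00" * 12
py27_header = PY27_MAGIC + b"\x00" * 4
```

The interpreter performs a similar check before loading a pyc file and refuses files whose magic number does not match.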
By introducing your own bytecode set, standard tools can no longer disassemble or decompile the pyc files generated by your private Python compiler, and your secret is protected. The price is that your Python application must be shipped together with your private Python interpreter.
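The idea can be sketched as a renumbering of opcodes. The code below is only an illustration in Python 3: it assumes the 2-byte instruction format of CPython 3.6 and later, where the opcode sits at every even offset, and the seeded permutation stands in for a hypothetical private opcode table. A real private interpreter would have to be rebuilt from the CPython sources with the changed opcode numbers.

```python
import random


def make_opcode_map(seed):
    """Build a hypothetical private numbering: a seeded permutation of 0..255."""
    numbers = list(range(256))
    random.Random(seed).shuffle(numbers)
    return bytes(numbers)


def invert(table):
    """Return the table that maps private opcode numbers back to standard ones."""
    inverse = bytearray(256)
    for standard, private in enumerate(table):
        inverse[private] = standard
    return bytes(inverse)


def remap(code, table):
    """Translate only the opcode bytes (even offsets), leaving arguments untouched."""
    out = bytearray(code)
    for i in range(0, len(out), 2):
        out[i] = table[out[i]]
    return bytes(out)


original = compile("x = 1 + 2", "<demo>", "exec").co_code
table = make_opcode_map(seed=2017)
private = remap(original, table)           # what the private compiler would emit
restored = remap(private, invert(table))   # what the private interpreter would see
```

A standard disassembler looking at the remapped bytes would decode the wrong instruction names, while an interpreter that knows the table runs the code unchanged.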
With bytecode obfuscation, standard disassemblers and decompilers can be tricked easily without harming the execution of your real application code. Here is an example that tricks Uncompyle6 and dis.
Take the simple Python application below:
print 'Hello World'
Disassembling it with dis gives the following:
>>> import marshal,dis
>>> fd = open('1.pyc', 'rb')
>>> fd.seek(8)
>>> code_obj = marshal.load(fd)
>>> fd.close()
>>> dis.dis(code_obj)
1 0 LOAD_CONST 0 ('Hello World')
3 PRINT_ITEM
4 PRINT_NEWLINE
5 LOAD_CONST 1 (None)
8 RETURN_VALUE
>>>
Explanation of the disassembled instructions above:
0 LOAD_CONST 0 ('Hello World') #Loads co_consts[0] to TOS(Top Of the Stack). co_consts[0] contains constant string 'Hello World'
3 PRINT_ITEM #Prints TOS to sys.stdout.
4 PRINT_NEWLINE #Prints a new line on sys.stdout. This instruction is auto-generated as the last operation of a print statement
5 LOAD_CONST 1 (None) #Loads co_consts[1] to TOS. co_consts[1] contains None
8 RETURN_VALUE #Returns with TOS to the caller of the function. These two instructions are auto-generated.
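For readers on Python 3, where the print statement and the PRINT_ITEM opcode no longer exist, the same kind of listing can be produced without touching a pyc file. This is a minimal sketch, and the exact opcodes it prints differ between Python 3 releases.

```python
import dis

# Compile the one-line program in memory instead of loading it from a pyc file.
code_obj = compile("print('Hello World')", "hello.py", "exec")

# Each instruction exposes its offset, opcode name and resolved argument value.
listing = [(ins.offset, ins.opname, ins.argval)
           for ins in dis.get_instructions(code_obj)]

for offset, opname, argval in listing:
    print(offset, opname, argval)
```

The constant 'Hello World' still appears as a resolved argument, which shows how much a disassembler reveals by looking up co_consts for every instruction it decodes.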
Insert a JUMP_ABSOLUTE instruction before the entry point by crafting the corresponding pyc file, and then disassemble it with dis:
1 0 JUMP_ABSOLUTE 3
>> 3 LOAD_CONST 0 ('Hello World')
6 PRINT_ITEM
7 PRINT_NEWLINE
8 LOAD_CONST 1 (None)
11 RETURN_VALUE
The application logic in the crafted pyc file is now the same as in the original, with only a small change in control flow. Uncompyle6 can't deal with it and gives the following output:
<<< Error: Decompiling stopped due to <class 'uncompyle6.semantics.pysource.ParserError'>
Next I poison the crafted pyc file by crafting it once more. Disassembling the poisoned pyc file with my own Python disassembler gives:
1 0 JUMP_ABSOLUTE [71 06 00] 6
3 LOAD_CONST [64 FF FF] 65535 (FAKE!)
>> 6 LOAD_CONST [64 00 00] 0 (Hello World)
9 PRINT_ITEM [47 -- --]
10 PRINT_NEWLINE [48 -- --]
11 LOAD_CONST [64 01 00] 1 (None)
14 RETURN_VALUE [53 -- --]
The second instruction would load co_consts[65535] to TOS. In this case the co_consts tuple has a length of two, so index 65535 is out of range and the second instruction is invalid. Because of the JUMP_ABSOLUTE instruction before it, the second instruction is never executed and the Python interpreter is not affected.
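At the byte level, both patches described above amount to prepending a few bytes to the original bytecode string. The sketch below reproduces them in Python 3 on the raw Python 2.7 instruction bytes from the listings. The prepend_jump helper is hypothetical. It only works as-is for code without existing absolute jumps, since their targets and the line number table would otherwise need to be shifted too, and writing the result back into a pyc file means rebuilding the code object.

```python
JUMP_ABSOLUTE = 0x71  # Python 2.7 opcode number, as shown in the listings


def prepend_jump(code, junk=b""):
    """Prefix `code` with a jump over `junk`, so the junk bytes are never executed."""
    target = 3 + len(junk)  # the jump itself occupies 3 bytes
    jump = bytes([JUMP_ABSOLUTE, target & 0xFF, (target >> 8) & 0xFF])
    return jump + junk + code


# print 'Hello World' as compiled by Python 2.7:
# LOAD_CONST 0, PRINT_ITEM, PRINT_NEWLINE, LOAD_CONST 1, RETURN_VALUE
ORIGINAL = bytes.fromhex("640000 47 48 640100 53")

crafted = prepend_jump(ORIGINAL)                                  # first patch
poisoned = prepend_jump(ORIGINAL, junk=bytes.fromhex("64ffff"))   # adds LOAD_CONST 65535
```

The arguments are stored as little-endian 16-bit values, which is why the jump target 6 appears as the bytes 06 00.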
A standard disassembler like dis tries to show as much information as possible and knows nothing about control flow. When it reaches the second instruction, it tries to read co_consts[65535] and raises an unhandled IndexError exception:
>>> fd = open('1.pyc', 'rb')
>>> fd.seek(8)
>>> import marshal,dis
>>> co = marshal.load(fd)
>>> dis.dis(co)
1 0 JUMP_ABSOLUTE 6
3 LOAD_CONST 65535
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\dis.py", line 43, in dis
disassemble(x)
File "C:\Python27\lib\dis.py", line 96, in disassemble
print '(' + repr(co.co_consts[oparg]) + ')',
IndexError: tuple index out of range
>>>
Now neither Uncompyle6 nor dis can work with this slightly crafted, simple pyc file, so the code is effectively protected.
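The difference between a disassembler that sweeps linearly and one that follows control flow can be sketched with a toy decoder. The code below is Python 3 and decodes the poisoned Python 2.7 bytes from the listing above. It is a hypothetical illustration, not the disassembler used in this post, and its opcode table covers only the five opcodes that appear here.

```python
# Subset of the Python 2.7 opcode table, taken from the listings above.
OPNAMES = {0x71: "JUMP_ABSOLUTE", 0x64: "LOAD_CONST", 0x47: "PRINT_ITEM",
           0x48: "PRINT_NEWLINE", 0x53: "RETURN_VALUE"}
HAVE_ARGUMENT = 90  # Python 2.7: opcodes >= 90 carry a 2-byte little-endian argument


def decode(code, offset):
    """Decode one instruction and return (name, arg, next_offset)."""
    op = code[offset]
    if op >= HAVE_ARGUMENT:
        arg = code[offset + 1] | (code[offset + 2] << 8)
        return OPNAMES[op], arg, offset + 3
    return OPNAMES[op], None, offset + 1


def describe(offset, name, arg, consts):
    # Like dis, resolve constants eagerly; raises IndexError on a bad index.
    if name == "LOAD_CONST":
        return (offset, name, repr(consts[arg]))
    return (offset, name, arg)


def linear_sweep(code, consts):
    """Decode every byte in order, ignoring jumps (the way dis works)."""
    out, offset = [], 0
    while offset < len(code):
        name, arg, nxt = decode(code, offset)
        out.append(describe(offset, name, arg, consts))
        offset = nxt
    return out


def follow_flow(code, consts):
    """Decode only instructions reachable from offset 0 (unconditional jumps only)."""
    out, offset, seen = [], 0, set()
    while offset < len(code) and offset not in seen:
        seen.add(offset)
        name, arg, nxt = decode(code, offset)
        out.append(describe(offset, name, arg, consts))
        if name == "RETURN_VALUE":
            break
        offset = arg if name == "JUMP_ABSOLUTE" else nxt
    return out


POISONED = bytes.fromhex("710600 64ffff 640000 47 48 640100 53")
CONSTS = ("Hello World", None)
reachable = follow_flow(POISONED, CONSTS)
```

Calling linear_sweep(POISONED, CONSTS) raises IndexError at offset 3, just as dis does, while follow_flow never visits the fake instruction. This also shows the limit of the trick: a tool that tracks control flow sees through it.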
#flag can be the result of some computation,
#a pre-defined constant hidden somewhere,
#or even the return value of a function call
if flag is condition:
    normal_processing()
else:
    useless_but_complicated_obfuscating_code()
    or_even_invalid_code()
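A runnable instance of the pattern above, often called an opaque predicate, could look like the following. This is a minimal Python 3 sketch: the predicate, the function names and the undefined name in the dead branch are all made up for illustration.

```python
def opaque_true(n):
    """Always True: n*n + n == n*(n+1), a product of two consecutive integers."""
    return (n * n + n) % 2 == 0


def greet(seed):
    if opaque_true(seed):
        # The only branch that ever runs.
        return "Hello World"
    else:
        # Dead branch. The name below does not exist, but Python only
        # resolves names when a line executes, so this is harmless.
        return undefined_name_never_evaluated(seed)


results = [greet(n) for n in range(-50, 50)]
```

A human reader or a static tool has to prove that the condition is constant before the else branch can be discarded, whereas the interpreter simply evaluates it.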
try:
    some_processing()
    raise_exception = __import__('module_does_not_exist')
    #the line above raises an ImportError and control transfers to the except branch
    useless_but_complicated_obfuscating_code()
except:
    continue_normal_processing()
try:
    some_processing()
    raise_exception = __import__('sys').non_exist_function()
    #the line above raises an AttributeError and control transfers to the except branch
    useless_but_complicated_obfuscating_code()
except:
    continue_normal_processing()
try:
    some_processing()
    raise_exception = 1/0
    #the line above raises a ZeroDivisionError and control transfers to the except branch
    useless_but_complicated_obfuscating_code()
except:
    continue_normal_processing()
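The exception-based patterns above can be condensed into a small runnable sketch. It is Python 3, and the function name and the arithmetic are hypothetical. The real work is split between the try body and the except branch, with a guaranteed exception acting as the jump between them.

```python
def doubled_sum(values):
    total = 0
    try:
        total = sum(values)   # first half of the real work
        _ = 1 // 0            # always raises ZeroDivisionError
        total = -1            # never reached, present only to mislead
    except ZeroDivisionError:
        total *= 2            # second half of the real work
    return total


answer = doubled_sum([1, 2, 3])
```

Catching the specific exception, rather than using a bare except, keeps genuine errors in the real work from being silently swallowed.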
#single overlapping instruction
00: EB 01 jmp 3
02: 68 c3 90 90 90 push 0x909090c3
#actual execution
00: EB 01 jmp 3
03: C3 retn
#multiple overlapping instruction
00: EB02 jmp 4
02: 69846A40682C104000EB02 imul eax, [edx + ebp*2 + 0102C6840], 0x002EB0040
#actual execution
00: EB02 jmp 4
04: 6A40 push 040
06: 682C104000 push 0x40102C
0B: EB02 jmp 0xF
#overlapping itself
00: EBFF jmp 1
02: C0C300 rol bl, 0
#actual execution
00: EBFF jmp 1
01: FFC0 inc eax
03: C3 retn
#single overlapping instruction
0 JUMP_ABSOLUTE [71 05 00] 5
3 PRINT_ITEM [47 -- --]
4 LOAD_CONST [64 64 01] 356
7 STOP_CODE [00 -- --]
#actual execution
0 JUMP_ABSOLUTE [71 05 00] 5
5 LOAD_CONST [64 01 00] 1
#multiple overlapping instruction
0 EXTENDED_ARG [91 00 64]
3 EXTENDED_ARG [91 00 53]
6 JUMP_ABSOLUTE [71 02 00]
#actual execution
0 EXTENDED_ARG [91 00 64]
3 EXTENDED_ARG [91 00 53]
6 JUMP_ABSOLUTE [71 02 00]
2 LOAD_CONST [64 91 00]
5 RETURN_VALUE [53 -- --]
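The single overlapping instruction listing can be checked by decoding the same bytes from two different start offsets. The sketch below is Python 3, uses only the four Python 2.7 opcodes that occur in that listing, and the decode_from helper is hypothetical.

```python
# Python 2.7 opcode numbers, as shown in the listing above.
OPNAMES = {0x00: "STOP_CODE", 0x47: "PRINT_ITEM",
           0x64: "LOAD_CONST", 0x71: "JUMP_ABSOLUTE"}
HAVE_ARGUMENT = 90  # opcodes >= 90 carry a 2-byte little-endian argument


def decode_from(code, offset):
    """Linearly decode `code` starting at `offset`."""
    out = []
    while offset < len(code):
        op = code[offset]
        if op >= HAVE_ARGUMENT:
            arg = code[offset + 1] | (code[offset + 2] << 8)
            out.append((offset, OPNAMES[op], arg))
            offset += 3
        else:
            out.append((offset, OPNAMES[op], None))
            offset += 1
    return out


CODE = bytes.fromhex("710500 47 646401 00")

as_disassembled = decode_from(CODE, 0)  # what a linear disassembler shows
as_executed = decode_from(CODE, 5)      # what runs after the jump to offset 5
```

The bytes 64 01 00 at offset 5 are the tail of the fake LOAD_CONST 356 plus the STOP_CODE byte, so the executed instruction is hidden inside two instructions that are never run.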
>>> fd = open('1.pyc', 'rb')
>>> fd.seek(8)
>>> import marshal
>>> co = marshal.load(fd)
>>> fd.close()
>>> code_string = marshal.dumps(co)
>>> scrambled_code = code_string.encode('zlib').encode('base64')
>>> print scrambled_code
eJxLZoACRiB2AOJifiBRyMaQ8v9/CgODu0cKI0OwBhNIghtIeKTm5OQrhOcX5aT4aYC0oRHFXCAi
MbcgJ9VIr6CyhAPItcnNTynNSbUD2VACUgQAIHcTlg==
>>>
Copy the scrambled_code string into the following Python source file, descramble.py:
scrambled_code_string = 'eJxLZoACRiB2AOJifiBRyMaQ8v9/CgODu0cKI0OwBhNIghtIeKTm5OQrhOcX5aT4aYC0oRHFXCAiMbcgJ9VIr6CyhAPItcnNTynNSbUD2VACUgQAIHcTlg=='
exec __import__('marshal').loads(scrambled_code_string.decode('base64').decode('zlib'))
Execute descramble.py:
>python descramble.py
Hello World
Of course, this scrambling algorithm (zlib compression followed by base64 encoding) is far too simple to protect the code and does not even need to be 'cracked', but the example shows the essence of code scrambling.
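The str.encode('zlib') and str.encode('base64') codecs used above exist only in Python 2. A rough Python 3 equivalent, with hypothetical helper names, might look like this:

```python
import base64
import marshal
import zlib


def scramble(source):
    """Compile source, then marshal, compress and base64-encode the code object."""
    code_obj = compile(source, "<scrambled>", "exec")
    return base64.b64encode(zlib.compress(marshal.dumps(code_obj))).decode("ascii")


def descramble_and_run(scrambled, namespace=None):
    """Reverse the three steps and execute the recovered code object."""
    namespace = {} if namespace is None else namespace
    code_obj = marshal.loads(zlib.decompress(base64.b64decode(scrambled)))
    exec(code_obj, namespace)
    return namespace


blob = scramble("message = 'Hello World'")
result = descramble_and_run(blob)
```

The marshal format is specific to the Python version, so the scrambled string has to be executed by the same version that produced it.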