Last week I took a recess of my OpenERP crusade and spent some time trying to figure out the problem of extracting an Abstract Syntax Tree for the query expression out of its compiled byte code.
A query expression like the following:
>>> this = iter('')
>>> query = (parent for parent in this
...          if parent.age > 40 and parent.children
...          if all(child.age < 5 for child in parent.children))
Is compiled into byte-code to something like this (in Python 2.7):
>>> import dis
>>> dis.dis(query.gi_code)
  1           0 LOAD_FAST                0 (.0)
        >>    3 FOR_ITER                60 (to 66)
              6 STORE_FAST               1 (parent)

  2           9 LOAD_FAST                1 (parent)
             12 LOAD_ATTR                0 (age)
             15 LOAD_CONST               0 (40)
             18 COMPARE_OP               4 (>)
             21 POP_JUMP_IF_FALSE        3
             24 LOAD_FAST                1 (parent)
             27 LOAD_ATTR                1 (children)
             30 POP_JUMP_IF_FALSE        3

  3          33 LOAD_GLOBAL              2 (all)
             36 LOAD_CONST               1 (<code object <genexpr> at 0x1a21030, file "", line 3>)
             39 MAKE_FUNCTION            0
             42 LOAD_FAST                1 (parent)
             45 LOAD_ATTR                1 (children)
             48 GET_ITER
             49 CALL_FUNCTION            1
             52 CALL_FUNCTION            1
             55 POP_JUMP_IF_FALSE        3
             58 LOAD_FAST                1 (parent)
             61 YIELD_VALUE
             62 POP_TOP
             63 JUMP_ABSOLUTE            3
        >>   66 LOAD_CONST               2 (None)
             69 RETURN_VALUE
Extracting the original query expression out of the compiled byte-code is sometimes referred to as “decompilation” or “uncompilation”; others prefer to call it “reverse engineering”. Whatever you call it, it is a hard task, and initially we simply avoided it.
When I met PonyORM I found that our idea of expressing queries via comprehensions was already implemented there. Despite my initial enthusiasm, I was forced to put the project on hold.
Last week I revisited the problem, but decoupling PonyORM’s decompiler from Python 2.7 is not an easy task: it depends on modules that no longer exist in Python 3, and their APIs are not easy to replicate. I decided to stop trying.
First, I thought that deriving a dynamic algorithm based on Petri nets would be easy to do in a couple of days. My first draft solved the issue of decompiling the byte-code for chained comparison expressions like a < b < c. Here is one of the drawings:
However, I found myself struggling with the Petri net when it was not a DAG, due to the absolute jumps in the byte-code for generator expressions.
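For reference, the byte-code of a chained comparison is easy to inspect with the standard dis module (the exact opcodes vary across CPython versions, but the two COMPARE_OP instructions and the stack shuffling between them are what make it non-trivial to decompile):

```python
import dis

# CPython compiles `a < b < c` roughly as `a < b and b < c`, but
# evaluating `b` only once; hence two COMPARE_OP instructions with
# DUP/ROT-style stack shuffling and a short-circuit jump in between.
code = compile('a < b < c', '<example>', 'eval')
dis.dis(code)

ops = [ins.opname for ins in dis.get_instructions(code)]
print(ops.count('COMPARE_OP'))  # → 2
```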
Before proceeding to find a solution of my own, I went back to search mode and looked for related articles and software. I stumbled upon an “old” package, uncompyle2. This package has the nice property of being very easy to understand and very easy to adapt. Moreover, it is based on published papers and a thesis you can download and read.
The whole idea is to apply compiler theory to the problem. This kind of idea is very appealing to me. The package ships a context-free grammar that serves to build a recognizer for Python 2.7 byte-code.
So you can see that there are four productions for a generator expression:
# Generator expressions are expressions
expr ::= genexpr

# This is the outlook of a generator expression as an argument of a
# function.
genexpr ::= LOAD_GENEXPR MAKE_FUNCTION_0 expr GET_ITER CALL_FUNCTION_1

# This one I don't know why: a generator expression is a statement?
stmt ::= genexpr_func

# The outlook of a bare generator expression.
genexpr_func ::= LOAD_FAST FOR_ITER designator comp_iter JUMP_BACK
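To illustrate the grammar-over-tokens idea, here is a toy recognizer for hand-simplified versions of these productions. This is only a sketch: uncompyle2 actually uses the SPARK Earley parser, and these productions (and the naive first-match strategy without full backtracking) are my own simplification, not the real grammar.

```python
# Nonterminal -> list of alternative productions; anything not listed
# here is treated as a terminal token. The productions are simplified
# stand-ins for the real uncompyle2 grammar.
GRAMMAR = {
    'genexpr_func': [['LOAD_FAST', 'FOR_ITER', 'designator',
                      'comp_iter', 'JUMP_BACK']],
    'designator':   [['STORE_FAST']],
    'comp_iter':    [['comp_if'], ['comp_body']],
    'comp_if':      [['expr', 'POP_JUMP_IF_FALSE', 'comp_iter']],
    'comp_body':    [['expr', 'YIELD_VALUE', 'POP_TOP']],
    'expr':         [['LOAD_FAST', 'LOAD_ATTR'], ['LOAD_FAST']],
}

def match(symbol, tokens, pos):
    """Return the end position if `symbol` derives tokens[pos:end], else None.

    Tries each alternative in order and commits to the first that
    succeeds (no backtracking across a committed choice; an Earley
    parser handles the general case).
    """
    if symbol not in GRAMMAR:  # terminal: must equal the current token
        return pos + 1 if pos < len(tokens) and tokens[pos] == symbol else None
    for production in GRAMMAR[symbol]:
        cur = pos
        for sym in production:
            cur = match(sym, tokens, cur)
            if cur is None:
                break
        else:
            return cur
    return None

# A token stream shaped like a bare generator expression with one filter.
tokens = ['LOAD_FAST', 'FOR_ITER', 'STORE_FAST',
          'LOAD_FAST', 'LOAD_ATTR', 'POP_JUMP_IF_FALSE',
          'LOAD_FAST', 'YIELD_VALUE', 'POP_TOP', 'JUMP_BACK']
print(match('genexpr_func', tokens, 0))  # → 10 (all tokens consumed)
```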
If you try to apply this grammar to the byte-code shown above, you will fail to find LOAD_GENEXPR anywhere: it does not actually exist. It is produced by uncompyle2’s tokenizer whenever the argument of the byte-code is itself a code object with the “<genexpr>” name. This is done simply to keep the grammar simple. Likewise, MAKE_FUNCTION_0 is produced by the tokenizer to stand for the actual byte-code MAKE_FUNCTION with argument 0, and the same goes for JUMP_BACK. These are called “customizations”; they must be dealt with in the parser, but they are easy to understand.
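The effect of these customizations can be mimicked on top of the standard dis module. The sketch below is my own simplified illustration of the idea, not uncompyle2’s actual tokenizer (which works on Python 2.7 byte-code; opcode names differ across versions):

```python
import dis

def customize_tokens(code):
    """Turn a code object's instructions into parser token names,
    applying uncompyle2-style 'customizations' (simplified sketch)."""
    tokens = []
    for instr in dis.get_instructions(code):
        name = instr.opname
        if (name == 'LOAD_CONST'
                and getattr(instr.argval, 'co_name', None) == '<genexpr>'):
            # Loading a code object named '<genexpr>' is tokenized as
            # LOAD_GENEXPR to keep the grammar simple.
            name = 'LOAD_GENEXPR'
        elif name == 'MAKE_FUNCTION':
            # Fold the argument into the token name: MAKE_FUNCTION_0, ...
            name = 'MAKE_FUNCTION_%d' % (instr.arg or 0)
        elif name == 'JUMP_ABSOLUTE' and instr.argval < instr.offset:
            # A backward absolute jump becomes JUMP_BACK.
            name = 'JUMP_BACK'
        tokens.append(name)
    return tokens

# The enclosing code of a generator expression loads the genexpr's
# code object and makes a function out of it:
code = compile("(p for p in people if p)", '<example>', 'eval')
print(customize_tokens(code))
```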
Since Python 3.3 the MAKE_FUNCTION byte-code is always preceded by two LOAD_CONST instructions: the first loads the code object and the second loads the name. So I simply changed the grammar to meet those expectations:
@override(sys.version_info < (3, 3))
def p_genexpr(self, args):
    '''
    expr ::= genexpr

    genexpr ::= LOAD_GENEXPR MAKE_FUNCTION_0 expr GET_ITER CALL_FUNCTION_1

    stmt ::= genexpr_func

    genexpr_func ::= LOAD_FAST FOR_ITER designator comp_iter JUMP_BACK
    '''

@p_genexpr.override(sys.version_info >= (3, 3))
def p_genexpr(self, args):
    '''
    expr ::= genexpr

    genexpr ::= LOAD_GENEXPR LOAD_CONST MAKE_FUNCTION_0 expr GET_ITER CALL_FUNCTION_1

    stmt ::= genexpr_func

    genexpr_func ::= LOAD_FAST FOR_ITER designator comp_iter JUMP_BACK
    '''
So, that’s settled: xotl.ql will use this for the decompilation module.
I have already started to port the needed parts and I expect to have this done by the end of August (I must return to OpenERP).
The plan is:
- Port and test the uncompyle2 toolkit. That will be release 0.3.0.
- Revise and publish the AST. Since the AST will be the thing coupling xotl.ql with translators, it must have a high degree of stability; moreover, it must remain the same across Python versions. This will probably span several releases, up to 1.0, at which point the AST will be declared stable and changed in incompatible ways only on major version jumps.
- Rewrite the translator in xotl.ql.translation.py to target the AST. This will probably be synchronized with changes to the AST itself.
Note: I have no interest in source code reconstruction, so I have not tested (and I won’t test) anything beyond the extracted AST.