Baby steps towards reverse engineering in the Pythonic Query Language

Last week I took a recess from my OpenERP crusade and spent some time trying to figure out the problem of extracting an Abstract Syntax Tree for a query expression out of its compiled byte-code.

A query expression like the following:

>>> this = iter('')
>>> query = (parent for parent in this
...          if parent.age > 40 and parent.children
...          if all(child.age < 5 for child in parent.children))

is compiled into byte-code like this (in Python 2.7):

>>> import dis
>>> dis.dis(query.gi_code)
  1           0 LOAD_FAST                0 (.0)
        >>    3 FOR_ITER                60 (to 66)
              6 STORE_FAST               1 (parent)

  2           9 LOAD_FAST                1 (parent)
             12 LOAD_ATTR                0 (age)
             15 LOAD_CONST               0 (40)
             18 COMPARE_OP               4 (>)
             21 POP_JUMP_IF_FALSE        3
             24 LOAD_FAST                1 (parent)
             27 LOAD_ATTR                1 (children)
             30 POP_JUMP_IF_FALSE        3

  3          33 LOAD_GLOBAL              2 (all)
             36 LOAD_CONST               1 (<code object <genexpr> at 0x1a21030, file "", line 3>)
             39 MAKE_FUNCTION            0
             42 LOAD_FAST                1 (parent)
             45 LOAD_ATTR                1 (children)
             48 GET_ITER
             49 CALL_FUNCTION            1
             52 CALL_FUNCTION            1
             55 POP_JUMP_IF_FALSE        3
             58 LOAD_FAST                1 (parent)
             61 YIELD_VALUE
             62 POP_TOP
             63 JUMP_ABSOLUTE            3
        >>   66 LOAD_CONST               2 (None)
             69 RETURN_VALUE
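The nested code object for the inner generator (the child.age < 5 part) travels as a constant of the outer code object. A quick way to poke at it (shown here on Python 3, where the exact opcodes differ but the inspection is the same):

```python
this = iter('')
query = (parent for parent in this
         if parent.age > 40 and parent.children
         if all(child.age < 5 for child in parent.children))

# The inner generator expression is itself a constant of the outer
# code object; the compiler names such code objects '<genexpr>'.
inner = [c for c in query.gi_code.co_consts if hasattr(c, 'co_code')]
print([c.co_name for c in inner])   # ['<genexpr>']
```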

Extracting the original query expression out of the compiled byte-code is sometimes referred to as “decompilation” or “uncompilation”. Others prefer to call it “Reverse Engineering”. Whatever you call it, it is a hard task, and initially we simply avoided it.

When I met PonyORM I found that our idea of having queries expressed via comprehensions was already implemented. Despite my initial enthusiasm, I was forced to put the project on hold.

Last week I revisited the problem, but decoupling PonyORM’s decompiler from Python 2.7 is not an easy task. It depends on modules that no longer exist in Python 3.0, and their APIs are not easy to replicate. I decided to stop trying.

First, I thought that deriving a dynamic algorithm based on Petri Nets would be easy to do in a couple of days. My first draft solved the issue of decompiling the byte-code for chained comparison expressions like a < b < c. Here is one of the drawings:

Handwritten Petri Net for Python byte-code


However, I found myself struggling with the Petri Net when it was not a DAG due to the absolute jumps in the byte-code for generator expressions.

Before proceeding to find a solution, I went back into search mode and looked for related articles and/or software. I stumbled upon an “old” package, uncompyle2. This package has the nice property that it is very easy to understand and very easy to adapt. Moreover, it’s based on published papers and a thesis you can download and read.

The whole idea is to apply compiler theory to the problem. This kind of idea is very appealing to me. They have a context-free grammar that serves to build a recognizer for the Python 2.7 byte-code.
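The first step of that approach is tokenizing: the byte-code is turned into a flat stream of tokens that the grammar can recognize. A minimal sketch of that step (using the modern dis module, not uncompyle2’s actual scanner):

```python
import dis

def tokenize(code):
    """Turn a code object into (opname, argval) pairs: the raw token
    stream a grammar-based recognizer consumes."""
    return [(ins.opname, ins.argval) for ins in dis.get_instructions(code)]

g = (x for x in range(3) if x > 1)
for opname, argval in tokenize(g.gi_code):
    print(opname, argval)
```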

So you can see that there are four productions for a generator expression:

#  Generator expressions are expressions
expr ::= genexpr

# This is the outlook of a generator expression as the argument of a
# function call.
genexpr ::= LOAD_GENEXPR MAKE_FUNCTION_0 expr GET_ITER CALL_FUNCTION_1

#  This one I don't know why: a generator expression is a statement?
stmt ::= genexpr_func

# The outlook of a bare generator expression.
genexpr_func ::= LOAD_FAST FOR_ITER designator comp_iter JUMP_BACK

If you try to apply this to the byte-code shown above, you will fail to find the LOAD_GENEXPR in the original byte-code. That is because it does not actually exist: it is produced by uncompyle2’s tokenizer when the argument of the byte-code is itself a code object with the “<genexpr>” name. This is done simply to simplify the grammar. Likewise, MAKE_FUNCTION_0 is produced by the tokenizer to mean the actual byte-code MAKE_FUNCTION with argument 0. The same goes for CALL_FUNCTION_1 and JUMP_BACK. These are called “customizations” and must be dealt with in the parser, but they are easy to understand.
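A sketch of how such a customization pass might look (illustrative only, not uncompyle2’s actual code; the token names follow the convention described above, and the `f` below is a made-up example function):

```python
import dis

def customize(code):
    """Fold byte-code arguments into token names so that the grammar
    can stay context-free (illustrative sketch)."""
    tokens = []
    for ins in dis.get_instructions(code):
        name = ins.opname
        if name in ('MAKE_FUNCTION', 'CALL_FUNCTION'):
            # MAKE_FUNCTION with argument 0 becomes the token MAKE_FUNCTION_0
            name = '%s_%s' % (name, ins.arg or 0)
        elif (name == 'LOAD_CONST' and hasattr(ins.argval, 'co_name')
              and ins.argval.co_name == '<genexpr>'):
            # A code-object constant named '<genexpr>' becomes LOAD_GENEXPR
            name = 'LOAD_GENEXPR'
        tokens.append(name)
    return tokens

def f(items):
    return all(x > 0 for x in items)

print('LOAD_GENEXPR' in customize(f.__code__))   # True
```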

For example, I’ve modified the package (still untested beyond an IPython shell) so that the byte-code for generator expressions in Python 3.3 can be parsed [1].

Since Python 3.3 the MAKE_FUNCTION byte-code is always preceded by two LOAD_CONST instructions: the first loads the code object and the second loads its name. So I simply changed the grammar to meet those expectations:

@override(sys.version_info < (3, 3))
def p_genexpr(self, args):
    '''
    expr ::= genexpr
    genexpr ::= LOAD_GENEXPR MAKE_FUNCTION_0 expr GET_ITER CALL_FUNCTION_1
    stmt ::= genexpr_func
    genexpr_func ::= LOAD_FAST FOR_ITER designator comp_iter JUMP_BACK
    '''

@p_genexpr.override(sys.version_info >= (3, 3))
def p_genexpr(self, args):
    '''
    expr ::= genexpr
    genexpr ::= LOAD_GENEXPR LOAD_CONST MAKE_FUNCTION_0 expr GET_ITER CALL_FUNCTION_1
    stmt ::= genexpr_func
    genexpr_func ::= LOAD_FAST FOR_ITER designator comp_iter JUMP_BACK
    '''

So, that’s settled: xotl.ql will use this for the decompilation module.

I have already started to port the needed parts and I expect to have this done by the end of August (I must return to OpenERP).

The plan is:

  1. Port and test the uncompyle2 toolkit. That will be release 0.3.0.
  2. Revise and publish the AST. Since the AST will be the thing coupling xotl.ql with translators, it must have a high degree of stability. Moreover, the AST must remain the same across Python versions. This will probably span several releases, up to 1.0, at which point the AST will be declared stable and only changed in incompatible ways at major version jumps.
  3. Rewrite the translator in terms of the AST. This will probably be synchronized with the changes in the AST itself.


[1] I have no interest in source code reconstruction, so I have not tested (and I won’t) anything beyond the extracted AST.

Announcing xoeuf or “OpenERP for humans”

Yesterday, I pushed to github our toolkit that helps us ease some tasks when programming with OpenERP. Its name: xoeuf (pronounced like “hef”, just try to make it sound Frenchy). The name comes from the French word for egg, “œuf”, plus our usual “x”. The egg comes from Python’s tradition of binary distribution eggs. It’s too late to change to “wheels”.

Since documentation is still lacking, here are some noteworthy things about xoeuf:

  • Unfold the powers of the shell. xoeuf allows you to test code in a normal Python (or better, IPython) shell:
    >>> from xoeuf.pool import some_database_name as db
    >>> db.salt_shell(_='res.users')
    >>> self
    >>> len(, uid, []))

    This feature works by directly opening a connection to your configured PostgreSQL server. So be sure either to have set the OPENERP_SERVER environment variable or to have the standard configuration file in your home directory.

  • Model extensions for common programming patterns (xoeuf.osv.model_extensions). Those “methods” are automatically woven into models when salting the shell (the salt_shell we saw above):
    >>> self.search_read(cr, uid, [], ('login', ))
        [{'login':  ... }, .... ]

    But you can also use them as functions in your code:

    from xoeuf.osv.model_extensions import search_read
    res_users = self.pool['res.users']
    res = search_read(res_users, cr, uid, [], ('login', ))
  • Get sane: spell things by name when writing. I have already mentioned that writing things in OpenERP requires good eyes to see the meaning of something like “[(6, 0, [1, 4])]”. The xoeuf.osv.writers allow you to simply say that you want to “replace all currently related objects with these new ones”:
    from xoeuf.osv.model_extensions import get_writer
    with get_writer(some_modelobj, cr, uid, ids) as writer:
        writer.replace('related_attr_ids', 1, 4)

    This simply invokes the normal write method with the right magic numbers.
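The writer idea is easy to sketch: collect named operations and emit the magic tuples on exit. A minimal, hypothetical stand-in (not xoeuf’s actual implementation):

```python
class DummyWriter(object):
    """Minimal sketch of the writer idea (not xoeuf's actual code):
    collect readable operations, emit OpenERP's magic tuples on exit."""

    def __init__(self):
        self.values = {}

    def replace(self, fieldname, *ids):
        # (6, 0, ids): replace all currently related records with `ids`
        self.values[fieldname] = [(6, 0, list(ids))]

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # A real writer would now call model.write(cr, uid, ids, self.values)
        return False

with DummyWriter() as writer:
    writer.replace('related_attr_ids', 1, 4)
print(writer.values)   # {'related_attr_ids': [(6, 0, [1, 4])]}
```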

So go ahead, try it, and tell us what you think.

Announcing the OpenERP corner.

I’m starting a new “column” in this blog. I call it the “OpenERP corner”. It’s going to be about anything I think is lacking in the OpenERP Book, or somewhat misguiding in its technical documentation as well.

I think this column might help others like me seeking orientation. It’s not intended to be accurate and will probably start some debate. That’s what I want.

I cannot aim for accuracy because OpenERP is a moving target. It changes on a daily basis. Also, my blog time has been reduced to less than an hour per week, and covering any OpenERP topic takes several hours. Anyway, the “most accurate” place to seek current information would be the help forum; and since I’m offline 99% of the time (guess why) I cannot participate much there.

The debate stuff is more like a hope. Engaging in a debate (if it does not lead to a flame war of taste) is always enlightening. I might as well be wrong when I say something, and that leaves space to be corrected (and taught).

So let me stop this introduction here and start writing my first post for the “OpenERP corner”.

See you in a couple of weeks.

The productivity of tomorrow trumps that of today

That’s probably harsh, but I think it is absolutely right. Doing crappy software today to be more productive today will make you less productive tomorrow. It’s that simple. And it’s cumulative too; meaning that if you neglect your future productivity, it will slowly diminish until you reach a competitive disadvantage where you’ll find yourself struggling to keep up instead of innovating.

And it’s so damn exhausting to explain why…

Software complexity does not come from the tools, but from the mental framework required (and imposed at times) to understand it. So don’t ever think that measuring Cyclomatic Complexity (CC) and other similar metrics will yield anything close to a true measure of the quality of your code.

There are only two hard things in Computer Science: cache invalidation and naming things.

—Phil Karlton

def _check(l):
    if len(l) <= 1:
        return l
    l1, l2 = [], []
    m = l[0]
    for x in l[1:]:
        if x <= m:
            l1.append(x)
        else:
            l2.append(x)
    return _check(l1) + [m] + _check(l2)

This code has a nice CC of 4, which is very good; yet it will take you at least a minute to figure out what it does. If only I had chosen to name the function quicksort…

>>> _check([84, 95, 89, 4, 77, 24, 95, 86, 70, 16])
[4, 16, 24, 70, 77, 84, 86, 89, 95, 95]

A quiz in less than 5 seconds: what does the following line of code mean in OpenERP’s ORM?

group.write({'implied_ids': [(3,]})

This line of code has a CC of 1: as good as it gets, isn’t it? But it’s so darn difficult to read that unless you have your brain wired up to see “forget the link between this group and the implied_group”… To be fair, there is someone out there in the OpenERP universe who cared a bit about this:

# file addons/base/ir/

CREATE = lambda values: (0, False, values)
UPDATE = lambda id, values: (1, id, values)
DELETE = lambda id: (2, id, False)
FORGET = lambda id: (3, id, False)
LINK_TO = lambda id: (4, id, False)
DELETE_ALL = lambda: (5, False, False)
REPLACE_WITH = lambda ids: (6, False, ids)

But no one else is using them!
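With those helpers the quiz line from above becomes self-describing; for instance (the id 42 here is a made-up implied_group id):

```python
# Helpers reproduced from the snippet above
FORGET = lambda id: (3, id, False)
REPLACE_WITH = lambda ids: (6, False, ids)

implied_group_id = 42   # hypothetical id

# Instead of group.write({'implied_ids': [(3, implied_group_id)]}):
values = {'implied_ids': [FORGET(implied_group_id)]}
print(values)   # {'implied_ids': [(3, 42, False)]}
```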

And, yes, it’s exhausting going over this.

Itches I have with OpenERP’s code base

These notes have been in my undecided-outbox since late January. I think it’s time to publish them, even unpolished as they are.

OpenERP has such a wonderful coding community that its documentation is ages behind. So from time to time (perhaps too often) you need to look at the source to fully grasp what’s troubling you. But the source is, well, a little bit under-cared for, if I may say.

Here are some itches I have with the code:

  1. Some methods are so long (several hundreds of lines) that they take about half an hour to grok in all their details.

    And these are not core-framework methods I’m talking (complaining) about, but application-level methods. See for instance the reconcile method in addons/account/ (137 lines of code as of OpenERP’s 2014-02-26 code base).

  2. Many of them have neither a useful docstring nor comments to guide your reading.
  3. Too much code duplication. Probably because generalization opportunities are being disregarded or missed.

    The piece of code I’m going to dissect below is duplicated even in the same file, in two different methods. And it’s kind of a business rule.

  4. Methods that are too hard to read or, worse, pieces of them.

Let’s illustrate some of these itches with 5 lines of code. In the reconcile method, the lines to be reconciled must belong to the same company. This is the piece of that method that does that:

company_list = []
for line in self.browse(cr, uid, ids, context=context):
    if company_list and not in company_list:
        raise osv.except_osv(_('Warning!'), _('To reconcile the entries company should be the same for all entries.'))
    company_list.append(

This is hardly readable. In fact, before I could enunciate the intention of the code, it took me a backwards reading of it. To make things worse, the same piece of code is found in the same file, inside the reconcile_partial method of the same class! So the cognitive load is instantly duplicated.

The following is, IMO, a more readable option:

lines = self.browse(cr, uid, ids, context=context)
if lines and len({line.company_id for line in lines}) > 1:
   raise ...

If you want to be more efficient and yet more readable (or maybe you want to support Python 2.6, which lacks set comprehensions) you can do:

if lines:
    first_line, last_lines = lines[0], lines[1:]
    if any(line.company_id != first_line.company_id
           for line in last_lines):
        raise ...

If you read that, it clearly says “if any line’s company is not the same as the first line’s company” then raise an error.

And even better:

def ensure_same_company(lines):
    if lines:
        first_line, last_lines = lines[0], lines[1:]
        if any(line.company_id != first_line.company_id
               for line in last_lines):
            raise ...

And then reuse the function everywhere it is needed. The name “ensure” has a pre-condition formulation that is simple to understand by reading the name alone, not the code.

Why I prefer comparing the company_id attribute directly is more debatable; but I have two good reasons:

  • Generality. If the company_id were actually an id (it is not in the original code) this code would work unchanged.

    The browse_record object implements the __ne__ protocol and does it right. The cost of calling __ne__ should not be a performance sink given the normal use of the application (these rules are enforced at the UI level as well).

  • Respect the principle of least astonishment. What is the id of a company’s id?

The return inside the for statement for “performance gain” is a false economy. The any built-in function is way faster. See it for yourselves:

>>> import random
>>> sample = random.sample(range(100000000), 900)

>>> def unique(sample):
...    x = []
...    for y in sample:
...        if x and y in x:
...            return False
...        else:
...            x.append(y)
...    return True

>>> def unique2(sample):
...     if sample:
...         first, rest = sample[0], sample[1:]
...         if any(y != first for y in rest):
...             return False
...         else:
...             return True
...     else:
...         return True

>>> %timeit unique(sample)
100 loops, best of 3: 10.2 ms per loop

>>> %timeit unique2(sample)
100000 loops, best of 3: 17.2 µs per loop

Notice that the any implementation is orders of magnitude faster than the for-loop implementation: from milliseconds down to microseconds. Yes, I admit it: the gain is probably not due to for vs. any but to the use of the x in y test. So let’s try a for loop that avoids the in test:

>>> def unique3(sample):
...     if sample:
...         first, rest = sample[0], sample[1:]
...         for y in rest:
...             if y != first:
...                 return False
...     return True

>>> %timeit unique3(sample)
100000 loops, best of 3: 16 µs per loop

Now, this implementation is faster, but the two are very, very close to each other… Anyway, the clarity of any beats, IMO, the 1.2 µs the for loop saves.

“Software made for Belgians”

The title of this post is a (rather innocent) joke-like phrase that our team has coined. The phrase has its origins in a couple of events:

First, we’ve heard that Belgium has declared broadband Internet connections a human right. Probably not exactly true. However, according to this, several countries have promoted similar laws or statements.

Second, since we are now mainly using OpenERP (mostly made in Belgium) and we suffer a 128 kbit per second Internet connection… Yes, you have read correctly and I made no mistake: 128 kilobits per second… And sometimes we need to access our OpenERP server over that connection. The fact is that OpenERP has a big fat upfront-loaded JS and makes lots of Ajax requests that, summed up, amount to more than 1400 KB.

If you’re quick on math, you have guessed by now that it takes ages, eons and wasted human lives to load this application on a clean cache.
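A back-of-the-envelope check of that claim (ideal transfer time only; latency, protocol overhead and the many round-trips make the real figure much worse):

```python
size_kbits = 1400 * 8            # ~1400 KB expressed in kilobits
bandwidth_kbps = 128             # our connection
seconds = size_kbits / float(bandwidth_kbps)
print(seconds)                   # 87.5 -> about a minute and a half, best case
```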

So, “Software made for Belgians” is any software that is built with the assumption (either consciously or not) that Internet access is available and is reasonably fast.

There are many pieces of software that have this property. For instance, npm opens many connections to download dependencies, and this fails a lot under unreliable/slow connections. I’ve had npm sessions in which I’ve needed to look for required packages and then do all the job of tracing requirements myself in order to install a single package.

You could think I’m against this kind of software. You’d be wrong. I’m against slow, expensive [1] connections.

Software is built for a given set of requirements, following standard guidelines and assumptions. These days, a fast Internet connection is practically a must. When you are a freelance developer and you charge by the hour, you should not need to waste 10 minutes waiting for an application to load.


[1] Our country’s (sole) ISP has announced that it will (drastically) reduce Internet connection prices. Our ADSL 128 kbps connection, which currently costs more than $ 900 (again, no mistake; the invoice really is that big for a shitty connection), will cost about $ 110. Seriously…! Of course, that’s kind of a relief; and we’ll switch to a better connection (still less than 1 Mbps) for the same amount we’re paying right now… But that’s just insane.

Ah, these prices are in CUC (Cuban Convertible Peso), but you may think of them as US dollars.

Also these are “enterprise” prices anyway… There are no prices for “natural persons” beyond $ 4.50 per hour in a public room…

Composing deferreds — Building UI patterns

Note: This post was mainly written before New year, new projects. Nevertheless I keep the original wording, and thus some references are made as if the “News…” post had not been published (nor even thought of).

In my last post I talked about “Backboning” my current web application project. We have already deployed our first version (the one without Backbone). We have discovered errors, we have fixed them, and we have learned how our current AJAX calls behave under high latency (our server is far, far away and our office has very low bandwidth).

Our clients have hardly noticed the latency issues, because most of them consume our web app with a better connection; but we must be prepared. Especially because one of the future goals is to be accessible from mobiles.

We have advanced some in our refactoring; roughly one feature has been completely refactored, and several other features are partially done.

At the same time we have created a branch for introducing some patterns in our application. That branch should serve well for both our current state and for our refactored version. This post is about those patterns and how we are approaching them.

Continue reading

What about Spine or Backbone?

Note: This post was mainly written before New year, new projects. Nevertheless I keep the original wording, and thus some references are made as if the “News…” post had not been published (nor even thought of).

After my last post I didn’t rest idle: I went to download and read some of the noted libraries/frameworks. I wanted to learn as much as possible, and even considered introducing some of them into my current project.

So, this post is about a work in progress: me studying some JS libraries/frameworks, and also me with a couple of priorities for my current project that help me evaluate them. So let’s start with my project, in order to provide some context for my evaluation.

Continue reading

New year, new projects…

2013 came to its end and 2014 is just beginning. Last year I accomplished many work projects and had to put others on hold. Many of the triumphs are still owed a post, but I promise I will write them before February. This is a summary of things done last year:

  • We completed the first iteration of our client’s web site, and put some Backbone and RequireJS there to overcome the complexity of introducing changes and reusing concepts.

    I have 3 posts on the drafts folder about this project (or related topics I’ve dealt with when working on the project):

    • “What about Spine or Backbone?”
    • “Composing deferreds”, and
    • “Bye, bye coffeeness. Hello es6-ness…?”

    I do expect to have the first 2 in a matter of days, but the last one is still a very early draft.

  • We completed the first iteration of a Query Language for Python as described in one of my posts.

    More about this in a second.

  • We (as a team) have completed the first phase of the implementation of OpenERP in our client’s enterprise.

There were also things we wanted to do but couldn’t. For these, a bit of explanation is needed.

What I wanted to do but couldn’t because…

Complete a second iteration of xotl.ql after knowing PonyORM

After the first iteration of xotl.ql was published, we came to know about the existence of PonyORM. Pony is an ORM that also uses Python generator expressions to express queries. The main difference is that they inspect the object’s code (disassembling it) to reconstruct the original syntactical form of the query, while we use only Python protocols to gain insight about it. As a consequence they can do things we can only simulate.

We then wanted to test a disassembling algorithm, but we have a couple of requirements PonyORM does not support:

  • Separate the query syntactical level from the translation of the query to a given object model.
  • Support both Python 3.3 and Python 2.7.

PonyORM does have a dedicated module for disassembling, but that only works on Python 2.7.

Nevertheless we had to put that project on hold, and I had to move on to lead the development of our client’s web site. Also, we needed to concentrate our efforts on making more client-facing changes in OpenERP to comply with their requirements, and spending time on a query language does not have a high value-vs-cost ratio at the current time.

We will return to that project the moment we need our query language realized.

What I’m doing now

I’ve started the year receiving the responsibility of moving our client’s Accounting Department to use the Accounting Module of OpenERP. It’s a huge undertaking, despite the tiny size of the department. Old habits, migrating data from legacy systems, doing some usability changes, and complying with other enterprise-wide requirements will be consuming most of my time.

For the time being, I’m “absorbing” OpenERP’s framework. But even in a few days I have been able to set up my working environment with zc.buildout (using a mixture of mr.developer and anybox.recipe.openerp).

I integrated my Emacs Python environment to work with buildout setups. I will probably dedicate a post to this. The result is that I can use flycheck with epylint, and jedi, to work “inside” my buildout projects without much hassle. I still have to learn how to use GUD with PDB, but that’s another issue.

Progressive Enhancement, a matter of context

A few days ago a friend sent me a link, asking for my opinion about it. He was worried about the supposed agreement among developers of JS libraries/frameworks for rich JS apps that progressive enhancement is dead:

It’s no longer good enough to build web apps around full page loads and then “progressively enhance” them to behave more dynamically. Building apps which are fast, responsive and modern require you to completely rethink your approach.

and later on in the post:

Agreement: Progressive enhancement isn’t for building real apps. All the technologies follow from the view that serious JavaScript applications require proper data models and ability to do client-side rendering, not just server rendering plus some Ajax and jQuery code.

Quote from Jeremy Ashkenas, the Backbone creator: “At this point, saying ‘single-page application’ is like saying ‘horseless carriage’” (i.e., it’s not even a novelty any more).

He was interested in my point of view, mainly because a few months ago we were working on a project in which I used the Progressive Enhancement pattern (PE) to comply with several requirements. But if PE is dead, then I should evaluate whether our decision back then was right or misinformed.

Fortunately, that post is quite old, and other voices before mine have been raised to “resurrect” the good old progressive enhancement pattern.

Nevertheless, I need to answer my friend, so these are my notes about what’s good about Progressive Enhancement:

  • Progressive Enhancement is a pattern and, like every pattern, there is no unique way to implement it; it ultimately depends on context.

    This is the main point against the “dead PE” motto. The post by That Emil seems to point in the same direction, but he rather counterattacks several implementation-level techniques commonly used for doing PE.

    It seems that the whole argument is based on the idea that there is a unique, quite difficult to achieve, way to implement this pattern. And they even say that JS is only for enhancement; to this I’d say:

  • What is an enhancement? Is it only to take what’s already there and somehow improve it, or can it also be to provide more than what was already loaded?

    If I can’t do the second, then that category of enhancement (which does not allow you to enhance your whole application) is rather poor, and I won’t defend it. What I stand for is that PE is about enhancement in any direction you’d like, to give a plus to your users; so, limiting yourself from the start to just enhancing what is already there keeps you from many good possibilities.

    The point here is that we should not address this issue fanatically. Maybe that kind of PE is dead, because it was just too restricted; but saying that all PE is dead is not the same as saying that this particular form of doing PE is so hard that it won’t make my client profit, and that not doing it won’t make the client lose either. If we can demonstrate that, then go ahead and don’t do PE.

  • But if your client poses you challenges like:

    I need this site to be very modern; you’d need to use the latest of technology out there, cause this site is targeting developers and employers. We’d like this people to connect through this site to find jobs, to learn, etc… Ah! But also, since we are actually targeting Cuban people, we need the site to work well over very slow connections [averaging 56kbps].

    Uh! Well, actually those are the kind of requirements I was charged to cope with, and I had to deliver a working prototype in 10 days.

    To accomplish these requirements I created a four-tier system for loading the site; tier 0 being the basic app with no JS requirements and very few assumptions about browser modernness; tier 1 needed JS: it included the loading of modernizr.js and an association list of capabilities to features.

    These two tiers together formed, IMO, a Progressive Enhancement implementation for my site, and my approach was:

    • In tier 0 there is the information we need to convey to our users and that my client is not willing to leave out.

      This tier also includes a boot-like tiny script that loads tier 1 and records the time it takes to load; this time is used to estimate bandwidth. Notice that if our user does not support JS, then nothing more is loaded, thus drastically reducing the total size.

    • Tier 1 tested for some CSS features like @fontface, or transitions, etc… Depending on the collected data we load several parts of tier 2 that are bundled together on the server. For instance, if we decide to include some features like the full menu, that leads to loading a single JS with jQuery + Bootstrap + our-own-js, plus the needed CSS. Most of the time we just issue 2 requests: one for the bundled JS and one for the bundled CSS.
    • Tiers 2 and 3 contain code and resources which are too big to load in advance over poor connections. Tier 2 is mostly for libraries like jQuery, Bootstrap and a few non-application-level custom JS.

      Tier 3 is for application-level stuff: that cool photo-gallery that slides automatically, etc…

    Using iprelay and tc we were able to test many connection conditions and to assess the impact of our decisions.

    Is this not a PE implementation?

Conclusions? I still believe PE is a good option, and that sometimes (like in that project of mine) you can’t even ignore it.