By disassembling your code, you get an understanding of how Python treats your code and what the resulting bytecode looks like. When writing code together with other people, or when reading or watching tutorials, you may come across phrases like "writing it this way is the same as writing it that way" or "this piece of code is equivalent to that one". By disassembling the code in question and inspecting the generated bytecode, you can check whether Python actually generates the same bytecode for both pieces. You may even be able to explain some surprising (mis-)behaviors.
Furthermore, disassembling Python code allows you to understand Python internals even better. So without further introduction let's jump into it!
The dis module supports the analysis of CPython bytecode by disassembling it. The CPython bytecode which this module takes as an input is defined in the file Include/opcode.h and used by the compiler and the interpreter.
Note: Bytecode is an implementation detail of the CPython interpreter. According to the docs, there is no guarantee that bytecode will not be added, removed or changed between Python versions. Furthermore, the dis module should not be expected to work across Python releases or Python VMs. This article was written based on CPython 3.8.2.
We can separate the dis module roughly into four parts.
First, there is the Instruction object representing a single bytecode instruction.
Second, the Bytecode object, which is a wrapper used to access details of the compiled code.
Additionally, a set of analysis functions and a set of opcode collections are available.
We will have a look at all four of them.
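The Python code you write is compiled into a sequence of bytecode instructions before it is executed. A single instruction consists of:
opcode: the numeric code for the operation, corresponding to the values in dis.opmap
opname: the human-readable name for the operation
arg: the numeric argument to the operation (if any), otherwise None
argval: the resolved value of arg (if known), otherwise the same as arg
argrepr: a human-readable description of the operation argument
offset: the start index of the operation within the bytecode sequence
starts_line: the line started by this opcode (if any), otherwise None
is_jump_target: True if other code jumps to here, otherwise False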
Let's take the RETURN_VALUE instruction as an example.
Some of the values depend on the actual code, such as offset and starts_line.
The values below are from an example we will have a look at later.
As the name suggests, it's the instruction used to represent a return statement.
The representation for the instruction may look like this:
Instruction(opname='RETURN_VALUE',
opcode=83,
arg=None,
argval=None,
argrepr='',
offset=16,
starts_line=None,
is_jump_target=False)
The human-readable name for the instruction is RETURN_VALUE and the corresponding opcode is 83.
It has no numeric argument (arg) to the operation, hence argval is None and argrepr is empty.
The offset, which is the start index of the operation within the bytecode sequence, is 16.
The return statement in this example is not the first instruction in the line, hence starts_line is None.
No code is jumping to this instruction, resulting in is_jump_target=False.
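The fields described above can also be read programmatically. The following is a minimal sketch, assuming the func() example that is introduced further down in this article; opcode numbers and offsets differ between CPython versions, so the printed values are not guaranteed to match the ones shown here.

```python
import dis


def func():
    x = []
    x.append(5)
    return x


# Collect all instructions and keep those whose name starts with "RETURN".
instructions = list(dis.get_instructions(func))
returns = [ins for ins in instructions if ins.opname.startswith("RETURN")]

for ins in returns:
    # opcode and offset are version-dependent; the field names are stable.
    print(ins.opname, ins.opcode, ins.arg, ins.offset, ins.is_jump_target)
```

Because Instruction is a named tuple, each field is available as an attribute.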
The RETURN_VALUE instruction is just one of a bunch.
The available instructions can be grouped in six categories:
Note: TOS stands for Top of Stack and refers to the element at the top of the stack. TOS1 refers to the second top-most element of the stack.
General instructions: for example ROT_TWO (swaps the two top-most stack items).
Unary operations: for example UNARY_POSITIVE, which is implemented as TOS = +TOS.
Binary operations: for example, the BINARY_ADD operation is implemented as TOS = TOS1 + TOS and is nothing else but a simple addition.
In-place operations: these work like binary operations, but the operation is done in place when TOS1 supports it. Both TOS and TOS1 are removed from the stack, too. The resulting TOS may be the original TOS1, but doesn't have to be. For example, INPLACE_ADD is implemented as TOS = TOS1 + TOS (but as an in-place operation).
Coroutine opcodes: the instructions behind the async and await keywords. This category consists of all coroutine-related instructions.
Miscellaneous opcodes: for example the RETURN_VALUE instruction, returning the TOS to the caller of the function.
The Bytecode object is a wrapper for Python code and provides easy access to details of the compiled code.
It's a convenience wrapper around the bytecode analysis functions, e.g. get_instructions().
Iterating over a Bytecode instance yields the bytecode operations as Instruction instances.
Let's take a simple example:
def func():
    return 5
The function at hand does nothing else but return the number 5.
Let's extend the code snippet to create a Bytecode instance based on the function.
# bytecode_intro.py

import dis


def func():
    return 5


bytecode = dis.Bytecode(func)
You can access the compiled code object via bytecode.codeobj.
The compiled code object can, for instance, be passed to the built-in eval() function to evaluate it.
Note: You shouldn't do that in production code!
# code of bytecode_intro.py
result = eval(bytecode.codeobj)
print(result)
This will print the number 5 if executed.
To find out at which line in the file our code object starts, we can access the first_line attribute (this might be None in some cases).
# code of bytecode_intro.py
print(bytecode.first_line)
Executing the file will print 6: the first line is a comment followed by a blank line, and the third line is an import statement followed by two blank lines, so the function definition starts on line 6.
The two remaining instance methods are dis() and info().
In essence, both work like the corresponding dis.dis() and dis.code_info() functions, so we will have a look at them in the Bytecode Analysis section.
Additionally, the Bytecode class provides a class method called from_traceback().
This method constructs a Bytecode instance from a given traceback and sets current_offset to the instruction responsible for the exception.
The class method can be used to have a closer look at failing code.
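To illustrate from_traceback(), here is a small sketch. The broken() function is a hypothetical example that raises an IndexError on purpose; the traceback of the caught exception is then handed to Bytecode.from_traceback(). The exact listing depends on the CPython version.

```python
import dis


def broken():
    values = []
    return values[0]  # raises IndexError on purpose


try:
    broken()
except IndexError as exc:
    # from_traceback() follows the traceback down to its innermost frame,
    # which is the frame of broken() in this case.
    bytecode = dis.Bytecode.from_traceback(exc.__traceback__)

# current_offset is the offset of the instruction that raised the exception.
print(bytecode.current_offset)

# dis() returns the listing as a string; the failing instruction is
# marked with "-->".
listing = bytecode.dis()
print(listing)
```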
The analysis functions of the dis module convert the input directly to the desired output.
This is useful if you only want to perform a single operation and don't need the intermediate analysis object.
The dis module provides nine analysis functions:
code_info()
show_code()
dis()
distb()
disassemble() / disco()
get_instructions()
findlinestarts()
findlabels()
stack_effect()
We will have a look at each of them using the following code example:
# analysis_functions.py

import dis


def func():
    x = []
    x.append(5)
    return x
The func() function does nothing else but create an empty list, add the number 5 to the end of the list and return the list afterwards.
The code_info() function returns a formatted multi-line string with detailed information about the code object for the supplied input.
Because the function only returns the string, it has to be printed explicitly. Adding the line print(dis.code_info(func)) at the end of the example file and executing the script afterwards may result in something like this:
$ python analysis_functions.py
Name: func
Filename: analysis_functions.py
Argument count: 0
Positional-only arguments: 0
Kw-only arguments: 0
Number of locals: 1
Stack size: 3
Flags: OPTIMIZED, NEWLOCALS, NOFREE
Constants:
0: None
1: 5
Names:
0: append
Variable names:
0: x
Note: The contents of code_info() are highly dependent on the Python VM and Python version, so don't expect to always see the same information.
As we can see, code_info() shows us the name of the function as well as the filename (in the REPL the filename would be something like <stdin>).
Furthermore, the total number of arguments is displayed as well as the number of positional-only and keyword-only arguments.
code_info() reveals that only one local variable is defined and that this is the variable x.
Two constants are used (None and 5) as well as one name, the method append().
The next function is show_code().
In fact, this is only a convenient shorthand for print(code_info(x)).
You can specify an optional file argument if you don't want the output to be printed to sys.stdout.
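As a quick sketch of the file argument, the report can be captured in an in-memory buffer instead of being printed. The buffer and report names are only chosen for illustration.

```python
import dis
import io


def func():
    x = []
    x.append(5)
    return x


# Capture the report in memory instead of printing it to sys.stdout.
buffer = io.StringIO()
dis.show_code(func, file=buffer)
report = buffer.getvalue()

# show_code() writes the code_info() string followed by a newline.
print(report == dis.code_info(func) + "\n")
```

Any object with a write() method, such as an open text file, should work the same way.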
A more frequently used function to gain insights into your code is dis().
As the name indicates, it disassembles your code.
If you provide a class, it will disassemble all methods, including class and static methods.
Furthermore, nested functions and code objects (comprehensions, generator expressions) are disassembled recursively.
Let's add dis.dis(func) to our analysis_functions.py script and execute it.
$ python analysis_functions.py
  7           0 BUILD_LIST               0
              2 STORE_FAST               0 (x)

  8           4 LOAD_FAST                0 (x)
              6 LOAD_METHOD              0 (append)
              8 LOAD_CONST               1 (5)
             10 CALL_METHOD              1
             12 POP_TOP

  9          14 LOAD_FAST                0 (x)
             16 RETURN_VALUE
The numbers 7, 8, and 9 in the first column are the line numbers in the file. The second column (0, 2, 4, 6, and so on) contains the instruction offsets. The third column shows the opnames and the fourth column the numeric argument of the instruction, for example an index into the table of constants or names, or the number of arguments of a call. The last column shows the resolved argument in parentheses: the name of the method or variable, or the constant that's being loaded.
In the example at hand, the function first builds a list and stores it in the variable x.
In line 8 the variable x is loaded as well as the method append().
After loading the constant 5, which is also pushed onto the stack, the method is called (CALL_METHOD).
Subsequently, the top of the stack is popped (POP_TOP). At this point it holds the return value of append(), which is None and is not used.
We are now in line 9 of the script, load the variable x and return it to the caller.
As you can see, the dis() function allows us to see how Python converts our code to bytecode instructions.
If you want to have a first example use-case, try to find out if and how a list comprehension and an equivalent for-loop differ in their bytecode instructions.
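One possible starting point for that exercise is sketched below. The functions with_loop() and with_comprehension() and the opnames() helper are hypothetical names made up for this example. Note that the helper only inspects the outer function: on CPython versions that compile a comprehension into a nested code object, the body of the comprehension is not part of the returned list, so use dis.dis() for the full picture.

```python
import dis


def with_loop(items):
    result = []
    for item in items:
        result.append(item * 2)
    return result


def with_comprehension(items):
    return [item * 2 for item in items]


def opnames(function):
    """Return the opnames of the function's own bytecode, in order."""
    return [ins.opname for ins in dis.get_instructions(function)]


loop_ops = opnames(with_loop)
comp_ops = opnames(with_comprehension)

# Both functions compute the same result ...
print(with_loop([1, 2]) == with_comprehension([1, 2]))
# ... but whether the instruction sequences match depends on the
# CPython version, so inspect the two lists yourself.
print(loop_ops)
print(comp_ops)
```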
Now let's have a look at the distb() function.
This function disassembles the top-of-stack function of a traceback.
If none is passed, the last traceback is used.
The instruction that caused the exception is marked in the output.
To see an easy example, open the Python REPL by typing python (or python3, depending on your machine) and try to execute dis.disb().
This results in an AttributeError as the function disb() does not exist.
>>> import dis
>>> dis.disb()
Traceback (most recent call last):
File "<input>", line 1, in <module>
dis.disb()
AttributeError: module 'dis' has no attribute 'disb'
Now we can call the distb() function without a traceback, so it uses the last one that occurred.
>>> dis.distb()
  1           0 LOAD_NAME                0 (dis)
    -->       2 LOAD_METHOD              1 (disb)
              4 CALL_METHOD              0
              6 PRINT_EXPR
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE
As we can see, Python failed to load the method disb() as it simply does not exist.
The next function we will look at is disassemble() (or disco() if you prefer, it's an alias for disassemble()).
It works much like dis(), except that you need to supply an actual compiled code object instead of a function, module or something similar.
I extended the analysis_functions.py script and came up with something like this:
# analysis_functions.py

import dis


def func():
    x = []
    x.append(5)
    return x


bytecode = dis.Bytecode(func)
dis.disassemble(bytecode.codeobj)
This results in the same output as in the dis() function example:
$ python analysis_functions.py
  7           0 BUILD_LIST               0
              2 STORE_FAST               0 (x)

  8           4 LOAD_FAST                0 (x)
              6 LOAD_METHOD              0 (append)
              8 LOAD_CONST               1 (5)
             10 CALL_METHOD              1
             12 POP_TOP

  9          14 LOAD_FAST                0 (x)
             16 RETURN_VALUE
However, supplying a value for the lasti argument adds a pointer to the specified offset.
So calling disassemble() with lasti=4 points to the LOAD_FAST instruction at offset 4:
$ python analysis_functions.py
  7           0 BUILD_LIST               0
              2 STORE_FAST               0 (x)

  8 -->       4 LOAD_FAST                0 (x)
              6 LOAD_METHOD              0 (append)
              8 LOAD_CONST               1 (5)
             10 CALL_METHOD              1
             12 POP_TOP

  9          14 LOAD_FAST                0 (x)
             16 RETURN_VALUE
If you have a look at the implementation of the functions [1], you can see that at some point dis() calls disassemble() and that both rely on the same private helper functions.
The get_instructions() function returns a generator yielding a series of Instruction named tuples.
In fact, it's the same as iterating over the Bytecode instance discussed earlier.
We can add the following for-loop to our analysis_functions.py script to produce the result shown after it:
# analysis_functions.py code
for instruction in dis.get_instructions(func):
    print(instruction)
$ python analysis_functions.py
Instruction(opname='BUILD_LIST', opcode=103, arg=0, argval=0, argrepr='', offset=0, starts_line=7, is_jump_target=False)
Instruction(opname='STORE_FAST', opcode=125, arg=0, argval='x', argrepr='x', offset=2, starts_line=None, is_jump_target=False)
Instruction(opname='LOAD_FAST', opcode=124, arg=0, argval='x', argrepr='x', offset=4, starts_line=8, is_jump_target=False)
Instruction(opname='LOAD_METHOD', opcode=160, arg=0, argval='append', argrepr='append', offset=6, starts_line=None, is_jump_target=False)
Instruction(opname='LOAD_CONST', opcode=100, arg=1, argval=5, argrepr='5', offset=8, starts_line=None, is_jump_target=False)
Instruction(opname='CALL_METHOD', opcode=161, arg=1, argval=1, argrepr='', offset=10, starts_line=None, is_jump_target=False)
Instruction(opname='POP_TOP', opcode=1, arg=None, argval=None, argrepr='', offset=12, starts_line=None, is_jump_target=False)
Instruction(opname='LOAD_FAST', opcode=124, arg=0, argval='x', argrepr='x', offset=14, starts_line=9, is_jump_target=False)
Instruction(opname='RETURN_VALUE', opcode=83, arg=None, argval=None, argrepr='', offset=16, starts_line=None, is_jump_target=False)
findlinestarts() is used to find the offsets which are starts of lines in the source code.
To do so, the generator function uses the co_firstlineno and co_lnotab attributes of the supplied code object.
The line starts are then generated as (offset, lineno) pairs.
Let's create a Bytecode instance for our func() function and supply the code object to the findlinestarts() function.
Subsequently, we iterate over it and print the generated pairs.
# analysis_functions.py code
bytecode = dis.Bytecode(func)
for pair in dis.findlinestarts(bytecode.codeobj):
    print(pair)
For the example at hand this results in the following printed pairs:
$ python analysis_functions.py
(0, 7)
(4, 8)
(14, 9)
The findlabels() function returns a list of all offsets that are jump targets in the raw compiled bytecode string.
It's used to set the is_jump_target attribute of an Instruction instance.
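A short sketch of findlabels() in action: the sign() function is a hypothetical example with one conditional branch, so its bytecode should contain at least one jump target. The concrete offsets depend on the CPython version, which is why no expected output is shown.

```python
import dis


def sign(n):
    if n < 0:
        return -1
    return 1


# findlabels() works on the raw bytecode string of the code object.
labels = dis.findlabels(sign.__code__.co_code)

# The same offsets are flagged on the Instruction objects.
targets = [
    ins.offset for ins in dis.get_instructions(sign) if ins.is_jump_target
]

print(sorted(labels))
print(sorted(targets))
```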
The last analysis function in the dis module is the stack_effect() function.
It's not implemented in the dis module itself but in Modules/_opcode.c [2].
According to the documentation it "[c]ompute[s] the stack effect of opcode with argument oparg." [3]
As I couldn't find a use-case for it in day-to-day Python programming, I'll leave it at that.
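For the curious, here is a minimal sketch of what calling stack_effect() looks like. It uses two instructions whose behavior is easy to reason about: POP_TOP removes one item from the stack, and BUILD_LIST with argument n pops n items and pushes one list.

```python
import dis

# POP_TOP takes no argument and removes one item from the stack.
pop_effect = dis.stack_effect(dis.opmap["POP_TOP"])

# BUILD_LIST 3 pops three items and pushes one list: 1 - 3 = -2.
build_effect = dis.stack_effect(dis.opmap["BUILD_LIST"], 3)

print(pop_effect, build_effect)  # -1 -2
```

The opcodes are looked up by name via dis.opmap because the numeric values are not stable across CPython versions.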
The dis module provides a set of opcode collections, which can be used for automatic introspection of bytecode instructions.
Currently, ten of them exist.
I don't want to look at all of them, but let's have a look at two of them so you get a feeling for what's waiting for you.
The opmap collection is a dictionary mapping operation names to opcodes.
If you remember the Instruction object from the beginning, this mapping can be used to get the opcode for an opname. The reverse lookup, from opcode to opname, is provided by the opname collection.
As it's quite a large dictionary, I don't want to post it here.
If you want to see it, open a Python REPL session, import the dis module by typing import dis and print the opmap collection by typing dis.opmap.
The second collection I want to introduce is the haslocal collection.
It provides a sequence of bytecodes that access a local variable.
If you still have your REPL open and type dis.haslocal, you get the sequence [124, 125, 126].
If you now look these values up in opname (or search for them in opmap), you know that only the instructions LOAD_FAST, STORE_FAST, and DELETE_FAST are accessing a local variable.
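The lookup can be sketched in a few lines. The local_ops name is only chosen for this example, and newer CPython versions may list more instructions in haslocal than the three named above.

```python
import dis

# dis.opname is indexed by opcode, so it is the reverse of dis.opmap.
local_ops = [dis.opname[opcode] for opcode in dis.haslocal]
print(local_ops)

# Round trip: name -> opcode -> name.
print(dis.opname[dis.opmap["LOAD_FAST"]])
```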
Congratulations, you made it through the article and learned a lot about Python's dis module!
Now you know what an instruction is and what it consists of.
You saw that the Bytecode object provides easy access to the details of the compiled code and had a look at all the bytecode analysis functions available in the module.
You want to explore your Python code, but don't know where to start? Take a small function you wrote, supply it to some of the analysis functions and try to understand what's happening. I hope you enjoyed reading the article and would be happy to receive feedback (contact information). Feel free to share the article with your friends and colleagues. Stay curious and keep coding!