It’s important to know how to run basic benchmarks on your code so that you know if something is actually fast. The code I’m benchmarking in this post is an algorithm that extracts the digits from a number.
f(n) be the function that takes a number as input and returns a list of the numbers digits as output.
For example, given the digit:
f(123456) will return
Order doesn’t matter, so
[6,5,4,3,2,1] is equally valid.
I’ll first go over 3 possible implementations of
f(n) and then I’ll share the benchmark results.
Approach #1: Convert to sequence and iterate
def digits1(num): return [int(x) for x in str(num)]
Since strings are sequences and sequences are interable, we can convert our number into a string and iterate through each character in the string and cast it into an integer. This is a pretty concise readable implementation and likely one that I’ll reach for first if asked to do this task under time constraint (such as an interview).
Approach #2: Use
divmod builtin function
def digits2(num): res =  num, d = divmod(num, 10) res.append(d) while num > 0: num, d = divmod(num, 10) res.append(d) return res
divmod returns a tuple of the result of
% (modulo operator) which gives you the value of the right most digit and the result of
// (integer division) which gives you the value of the number without the right most digit. Repeating this operation basically lets you slice off the digits in a number from right to left.
Approach #3: Use
def digits3(num): res =  res.append(num % 10) num = num // 10 while num > 0: res.append(num % 10) num = num // 10 return res
The logic in this implementation is equivalent to our previous
divmod implementation. The main difference is that instead of calling the
divmod built-in, I’m performing the two operations (mod and integer division) directly in the function.
Before reading on, which one of these do you think will run the fastest?
I ran five sets of tests using the builtin timeit module1.
|Function||Run 1||Run 2||Run 3||Run 4||Run 5|
|digit3 (% and //)||8.071671009||9.609830856||7.962191105||7.73694396||8.207738876|
During each run, each function was executed on a sample input a million times. The numbers you see are the number of seconds it took to execute that function a million times.
Here’s the benchmarking source code:
import timeit for _ in range(5): print(timeit.timeit("digits1(num)", setup="from __main__ import digits1;num=1234567324235435346536245", number=1000000)) print(timeit.timeit("digits2(num)", setup="from __main__ import digits2;num=1234567324235435346536245", number=1000000)) print(timeit.timeit("digits2(num)", setup="from __main__ import digits2;num=1234567324235435346536245", number=1000000))
The string iteration implementation (digit1) is almost twice as slow as the direct arithmetic approaches. I knew it was probably going to be slower, but frankly this was a surprise.
So, why is the sequence iteration approach so much slower? Is it the initial conversion into a string? The actual iteration process? The conversion of each character back into an integer? All of the above?!
One approach to understanding the why of language performance is to look at a lower level of abstraction. For Python, one such level is the bytecode representation of source programs.
Python comes with a module
dis2 that outputs the sequence of bytecodes used by the CPython interpreter. I wanted to see if there was anything obvious in the bytecode that would point to this large execution time difference.
This is the bytecode for digit1 (the slow, sequence iteration code):
0 BUILD_LIST 0 3 LOAD_GLOBAL 0 (str) 6 LOAD_FAST 0 (num) 9 CALL_FUNCTION 1 12 GET_ITER 13 FOR_ITER 18 (to 34) 16 STORE_FAST 1 (x) 19 LOAD_GLOBAL 1 (int) 22 LOAD_FAST 1 (x) 25 CALL_FUNCTION 1 28 LIST_APPEND 2 31 JUMP_ABSOLUTE 13 34 RETURN_VALUE
This is the bytecode for digit2 (the faster arithmetic code):
7 0 BUILD_LIST 0 3 STORE_FAST 1 (res) 8 6 LOAD_GLOBAL 0 (divmod) 9 LOAD_FAST 0 (num) 12 LOAD_CONST 1 (10) 15 CALL_FUNCTION 2 18 UNPACK_SEQUENCE 2 21 STORE_FAST 0 (num) 24 STORE_FAST 2 (d) 9 27 LOAD_FAST 1 (res) 30 LOAD_ATTR 1 (append) 33 LOAD_FAST 2 (d) 36 CALL_FUNCTION 1 39 POP_TOP 10 40 SETUP_LOOP 50 (to 93) >> 43 LOAD_FAST 0 (num) 46 LOAD_CONST 2 (0) 49 COMPARE_OP 4 (>) 52 POP_JUMP_IF_FALSE 92 11 55 LOAD_GLOBAL 0 (divmod) 58 LOAD_FAST 0 (num) 61 LOAD_CONST 1 (10) 64 CALL_FUNCTION 2 67 UNPACK_SEQUENCE 2 70 STORE_FAST 0 (num) 73 STORE_FAST 2 (d) 12 76 LOAD_FAST 1 (res) 79 LOAD_ATTR 1 (append) 82 LOAD_FAST 2 (d) 85 CALL_FUNCTION 1 88 POP_TOP 89 JUMP_ABSOLUTE 43 >> 92 POP_BLOCK 13 >> 93 LOAD_FAST 1 (res) 96 RETURN_VALUE
I’m not familiar with all of the CPython opcodes so I couldn’t really tell at a glance if there was a specific opcode that was the culprit for digit1’s slow performance in comparison.
If we isolate the parts of the bytecode that are being executed in loops for both algorithms, they both have a series of
LOAD calls followed by a
This is interesting for a few reasons:
- Function calls are expensive!
- Both implementations are using function calls to global built-in functions (via
- Both implementations (
digit2) perform roughly K function calls such that K is the number of digits in a number
In computer science terms, these two functions have pretty similar Big O. The answer probably lies in the details of the actual C implementation of the functions.
To test this, I removed the integer conversion call to the
int builtin for the string sequence (digit1) implementation and re-ran our tests.
Here are the new results from a single new run:
|digit1 (without conversion)||1.820481062|
Woah, our slowest algorithm sped up by nearly 90%! In other words, if we don’t bother re-converting our characters back into integers, our first approach is way faster.
C Implementation Details
I dug around the C implementations (the functions that the CPython interpreter actually executes when interpreting the
CALL_FUNCTION opcodes) and found the most recent source code for both
int4 python builtin functions.
I didn’t get a chance to look too deeply, but one thing that sticks out for me is the sheer number of branching statements in
PyLong_FromString (the C implementation of
int). This isn’t too surprising, given how much work it takes to convert a single character not only to its numeric representation, but the numeric representation of the character digits value (which I think will make for a pretty interesting post on its own!).
Anyway, I hope that was informative.
PS: I didn’t speak too much on
digit3 in this post. I wonder … Is divmod faster than using operators?