It’s important to know how to run basic benchmarks on your code so that you know if something is *actually* fast. The code I’m benchmarking in this post is an algorithm that extracts the digits from a number.

Let `f(n)`

be the function that takes a number as input and returns a list of the numbers digits as output.

For example, given the digit: `123456`

, `f(123456)`

will return `[1,2,3,4,5,6]`

.

Order doesn’t matter, so `[6,5,4,3,2,1]`

is equally valid.

## Approaches

I’ll first go over 3 possible implementations of `f(n)`

and then I’ll share the benchmark results.

**Approach #1:** Convert to sequence and iterate

```
def digits1(num):
return [int(x) for x in str(num)]
```

Since strings are sequences and sequences are interable, we can convert our number into a string and iterate through each character in the string and cast it into an integer. This is a pretty concise readable implementation and likely one that I’ll reach for first if asked to do this task under time constraint (such as an interview).

**Approach #2:** Use `divmod`

builtin function

```
def digits2(num):
res = []
num, d = divmod(num, 10)
res.append(d)
while num > 0:
num, d = divmod(num, 10)
res.append(d)
return res
```

`divmod`

returns a tuple of the result of `%`

(modulo operator) which gives you the value of the right most digit and the result of `//`

(integer division) which gives you the value of the number without the right most digit. Repeating this operation basically lets you slice off the digits in a number from right to left.

**Approach #3:** Use `%`

and `//`

```
def digits3(num):
res = []
res.append(num % 10)
num = num // 10
while num > 0:
res.append(num % 10)
num = num // 10
return res
```

The logic in this implementation is equivalent to our previous `divmod`

implementation. The main difference is that instead of calling the `divmod`

built-in, I’m performing the two operations (mod and integer division) directly in the function.

Before reading on, which one of these do you think will run the fastest?

## Benchmarking Results

I ran five sets of tests using the builtin timeit module^{1}.

Function | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|

digit1 (sequence) | 16.08269596 | 17.53508377 | 12.51374602 | 12.92738509 | 12.57732296 |

digit2 (divmod) | 8.13628602 | 8.403110981 | 8.041028023 | 8.013195992 | 8.013605118 |

digit3 (% and //) | 8.071671009 | 9.609830856 | 7.962191105 | 7.73694396 | 8.207738876 |

During each run, each function was executed on a sample input a million times. The numbers you see are the number of seconds it took to execute that function a million times.

Here’s the benchmarking source code:

```
import timeit
for _ in range(5):
print(timeit.timeit("digits1(num)", setup="from __main__ import digits1;num=1234567324235435346536245", number=1000000))
print(timeit.timeit("digits2(num)", setup="from __main__ import digits2;num=1234567324235435346536245", number=1000000))
print(timeit.timeit("digits2(num)", setup="from __main__ import digits2;num=1234567324235435346536245", number=1000000))
```

The string iteration implementation (digit1) is almost twice as slow as the direct arithmetic approaches. I knew it was probably going to be slower, but frankly this was a surprise.

So, why is the sequence iteration approach so much slower? Is it the initial conversion into a string? The actual iteration process? The conversion of each character back into an integer? All of the above?!

One approach to understanding the *why* of language performance is to look at a lower level of abstraction. For Python, one such level is the bytecode representation of source programs.

## Disassembly

Python comes with a module `dis`

^{2} that outputs the sequence of bytecodes used by the CPython interpreter. I wanted to see if there was anything obvious in the bytecode that would point to this large execution time difference.

This is the bytecode for digit1 (the slow, sequence iteration code):

```
0 BUILD_LIST 0
3 LOAD_GLOBAL 0 (str)
6 LOAD_FAST 0 (num)
9 CALL_FUNCTION 1
12 GET_ITER
13 FOR_ITER 18 (to 34)
16 STORE_FAST 1 (x)
19 LOAD_GLOBAL 1 (int)
22 LOAD_FAST 1 (x)
25 CALL_FUNCTION 1
28 LIST_APPEND 2
31 JUMP_ABSOLUTE 13
34 RETURN_VALUE
```

This is the bytecode for digit2 (the faster arithmetic code):

```
7 0 BUILD_LIST 0
3 STORE_FAST 1 (res)
8 6 LOAD_GLOBAL 0 (divmod)
9 LOAD_FAST 0 (num)
12 LOAD_CONST 1 (10)
15 CALL_FUNCTION 2
18 UNPACK_SEQUENCE 2
21 STORE_FAST 0 (num)
24 STORE_FAST 2 (d)
9 27 LOAD_FAST 1 (res)
30 LOAD_ATTR 1 (append)
33 LOAD_FAST 2 (d)
36 CALL_FUNCTION 1
39 POP_TOP
10 40 SETUP_LOOP 50 (to 93)
>> 43 LOAD_FAST 0 (num)
46 LOAD_CONST 2 (0)
49 COMPARE_OP 4 (>)
52 POP_JUMP_IF_FALSE 92
11 55 LOAD_GLOBAL 0 (divmod)
58 LOAD_FAST 0 (num)
61 LOAD_CONST 1 (10)
64 CALL_FUNCTION 2
67 UNPACK_SEQUENCE 2
70 STORE_FAST 0 (num)
73 STORE_FAST 2 (d)
12 76 LOAD_FAST 1 (res)
79 LOAD_ATTR 1 (append)
82 LOAD_FAST 2 (d)
85 CALL_FUNCTION 1
88 POP_TOP
89 JUMP_ABSOLUTE 43
>> 92 POP_BLOCK
13 >> 93 LOAD_FAST 1 (res)
96 RETURN_VALUE
```

I’m not familiar with all of the CPython opcodes so I couldn’t really tell at a glance if there was a specific opcode that was the culprit for digit1’s slow performance in comparison.

If we isolate the parts of the bytecode that are being executed in loops for both algorithms, they both have a series of `LOAD`

calls followed by a `CALL_FUNCTION`

.

This is interesting for a few reasons:

- Function calls are expensive!
- Both implementations are using function calls to global built-in functions (via
`LOAD_GLOBAL`

) - Both implementations (
`digit1`

and`digit2`

) perform roughly K function calls such that K is the number of digits in a number

In computer science terms, these two functions have pretty similar Big O. The answer probably lies in the details of the actual C implementation of the functions.

To test this, I removed the integer conversion call to the `int`

builtin for the string sequence (digit1) implementation and re-ran our tests.

Here are the new results from a single new run:

function | run |
---|---|

digit1 (without conversion) | 1.820481062 |

digit2 | 8.126174927 |

digit3 | 8.007368088 |

Woah, our slowest algorithm sped up by nearly 90%! In other words, if we don’t bother re-converting our characters back into integers, our first approach is *way* faster.

## C Implementation Details

I dug around the C implementations (the functions that the CPython interpreter actually executes when interpreting the `CALL_FUNCTION`

opcodes) and found the most recent source code for both `divmod`

^{3} and `int`

^{4} python builtin functions.

I didn’t get a chance to look too deeply, but one thing that sticks out for me is the sheer number of branching statements in `PyLong_FromString`

(the C implementation of `int`

). This isn’t too surprising, given how much work it takes to convert a single character not only to its numeric representation, but the numeric representation of the character digits *value* (which I think will make for a pretty interesting post on its own!).

Anyway, I hope that was informative.

PS: I didn’t speak too much on `digit3`

in this post. I wonder … Is divmod faster than using operators?