This is a quick post about optimizing algorithms written in Python with NumPy. We’re going to take a look at one of this year’s Advent of Code challenges, solve it and then optimize it with NumPy.

If you’re unfamiliar with Advent of Code, it’s an advent calendar of programming related puzzles which has run each December since 2015. Every day from December 1st until the 25th, a new challenge is posted and you’re encouraged to submit a solution.

Analyzing the Challenge

Let’s take a look at AoC’s Day 3 challenge.

This problem requires us to count the number of overlapping claims in a given area. That area, which is the fabric in the challenge, can be represented by a 2-dimensional array of zeros.

Every item in the 2-dimensional array will keep a count of how many claims take ownership of that item.

Each claim can be thought of as another smaller 2-dimensional array of ones that we add to the larger 2-dimensional array. When we are finished, we can count how many items in the larger 2-dimensional array are greater than 1. Those items will represent the square inches of fabric that belong to more than one claim.

Pure Python Solution

Here’s a solution written in Python. While in the above analysis I said that we can think of each claim as being a 2-dimensional array of ones that we’ll add to the greater 2-dimensional array, our implementation will differ.

We’ll iterate over the cartesian coordinates of each claim and add 1 to the corresponding item in the greater 2-dimensional array.

from typing import NamedTuple, List

SIDE_LENGTH = 1050

class Claim(NamedTuple):
    id: int
    x: int
    y: int
    width: int
    height: int


def get_fabric(claims: List[Claim], length: int = SIDE_LENGTH) -> List[List[int]]:
    fabric = [[0] * length] * length
    increment = 1
    
    for claim in claims:
        for width in range(claim.width):
    	    for height in range(claim.height):
                fabric[claim.x + width][claim.y + height] += increment
    
    return fabric

def get_overlapping(fabric: List[List[int]]) -> int:
    return sum(1 for row in fabric for item in row if item > 1)

# generate list of claims from challenge input file
claims: List[Claim] = list(map(get_claim, get_input(3).split('\n')))

# let's profile our solution
%timeit get_fabric(claims)
fabric = get_fabric(claims)
%timeit get_overlapping(fabric)

Here’s our output on an Intel processor:

287 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
125 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It took nearly half of a second to process roughly one thousand claims. We can do better than that.

Optimization with NumPy

Vecorization

NumPy has a concept called Universal functions, or ufuncs, which are vectorized functions over ndarrays. Vectorized functions work over entire arrays and, if the underlying hardware supports it, can take advantage of SIMD instructions.

SIMD stands for single instruction, mutiple data. SIMD instructions allow us to apply an instruction over multiple points of data at the same time. They implement parallelism at the data-level.

Examples of SIMD instructions are MMX, SSE/SSE2/SSE3/SSE4 and AVX instruction sets on x86 processors and NEON instruction sets on ARM processors.

When all that’s said and done, there’s no guarantee that NumPy will use SIMD instructions in our program. However, we’ll still benefit from a performance boost, because NumPy is built upon natively compiled and architectually efficient data structures and operations.

I do some of my development over SSH to an ARMv7 machine. NumPy isn’t optimized for ARM’s NEON instruction set in the same way that it’s optimized for x86 processors with SSE and AVX instructions. GCC’s ability to optimize NumPy at compile time will take advantage of NEON, although in a more generalized fashion.

Despite the fact that NumPy isn’t optimized for NEON, I still saw a performance increase when using NumPy instead of pure Python.

Optimizing Our Solution

In our previous solution, get_fabric() and get_overlapping() are sequential functions. That is to say that we apply an operation to each item in the array sequentially: we increment the first element in the array, then the next element and so on.

To optimize our solution, we’ll need to use something beyond just pure Python.

NumPy is a Python wrapper over lower-level C and Fortran code that implements optimized and natively compiled routines, so we’ll use that.

Here, we’ll use NumPy’s add() vectorized ufunc instead of sequentially adding one to each item in the fabric array.

import numpy as np

def get_fabric(claims: List[Claim], length: int = SIDE_LENGTH) -> np.ndarray:
    fabric = np.zeros((length, length), dtype=np.int8)
    increment = 1
    
    for claim in claims:
        x_indices = slice(claim.x, claim.x + claim.width)
        y_indices = slice(claim.y, claim.y + claim.height)
        indices = x_indices, y_indices
        
        fabric[indices] += increment
    
    return fabric

Our above solution will benefit from NumPy’s memory efficient arrays, compared to our previous solution using Python’s list type.

Next, we’ll use NumPy’s sum() routine to locate coordinates that specify values that are greater than 1.

def get_overlapping(fabric: np.ndarray) -> int:
    return np.sum(fabric > 1)

Let’s see what kind of performance boost we’ll get using our new solution.

%timeit get_fabric(claims)
fabric = get_fabric(claims)
%timeit get_overlapping(fabric)

Here’s our output on a relatively recent Intel processor:

12.8 ms ± 883 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.76 ms ± 79.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Compared to our previous solution, our new get_fabric() implementation is 22 times faster and get_overlapping() is 63 times faster.

Not bad.