Advent of Code '21 in OCaml, Day 2
Posted: 2022-09-18
I recently started at Jane Street, a technology-centric trading firm which uses OCaml. I’ve spent my whole career writing imperative languages, so I am exploring OCaml development using the Advent of Code, a puzzle game which I’ve played in Go multiple times.
See the day one post for background!
Day 2
Part A of the second day’s puzzle, like the first, involves serially iterating over a list. Each element of the list is a command to move our submarine, and the puzzle was to calculate the submarine’s final position.
My first thought was to map the lines of the file into a list of variant types, and then fold over that list to get the result:
```ocaml
type action = Forward of int | Down of int | Up of int

let line_to_action l =
  let stringpair_to_action action scalestr =
    let scale = Int.of_string scalestr in
    match action with
    | "forward" -> Forward scale
    | "down" -> Down scale
    | "up" -> Up scale
    | _ -> failwithf "unexpected action %s" action ()
  in
  match String.split ~on:' ' l with
  | [ a; b ] -> stringpair_to_action a b
  | _ -> failwithf "unexpected line %s" l ()

let step (horiz, depth) = function
  | Forward f -> (horiz + f, depth)
  | Up u -> (horiz, depth - u)
  | Down d -> (horiz, depth + d)

let day2a ls =
  let horiz, depth =
    List.map ls ~f:line_to_action |> List.fold ~init:(0, 0) ~f:step
  in
  horiz * depth
```
I felt icky, though. `Core.List.t` is a singly-linked functional data structure, which means that for large lists, `Core.List.map` has to iterate once over the list forwards, building up a reversed representation, then iterate again over that to build the forwards-order representation. In other words, when I cleaned up my day 1 solution and removed an unnecessary `List.rev` call, I actually still had one hidden in `List.map`. I did a little digging into the implementation of `List.map` to see what's going on.
Aside - `List.map` and tail modulo cons
OCaml's standard library `List.map` in version 4.14.0 is not tail recursive, and will therefore cause a stack overflow on long enough lists. To avoid the overflow, `Core.List.map`¹ builds an intermediate, reversed² result, leaning on the tail call optimization. When the result of a function is exactly the result of another function call (a "tail call"), the existing stack frame can be reused. This way, a recursive function call chain can use constant memory.
Effectively, the `Core` implementation passes an accumulator of an out-of-order map to the recursive function, then fixes it up (again using the tail call optimization) afterwards. If you're like me, though, you did most of your list puzzles and coursework with loops. We know for sure that it's possible to build a list forwards with one pass and constant space: you remember the head and tail nodes, mutate the tail on each iteration, and return the head when you're done.
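As a sketch, here is the accumulate-then-reverse shape in plain stdlib OCaml (the real `Core` implementation is more heavily optimized; see the footnotes):

```ocaml
(* A minimal sketch of accumulate-then-reverse map, stdlib only.
   Both [go] and [List.rev] make only tail calls, so stack usage stays
   constant no matter how long the input list is. *)
let map_tail f xs =
  let rec go acc = function
    | [] -> List.rev acc (* the hidden "unreverse" pass *)
    | x :: rest -> go (f x :: acc) rest
  in
  go [] xs

let () = assert (map_tail (fun x -> x * 2) [ 1; 2; 3 ] = [ 2; 4; 6 ])
```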
In OCaml trunk, but not in any current release, the standard library `List.map` has been changed to behave as we'd expect, using an optimization called "tail modulo cons".
The tail modulo cons optimization essentially does the return-object tracking and impure mutation³ that one would expect. This is done by the compiler, so the intermediate state and the impure mutations are invisible to the programmer, a good property for an optimization.
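On 4.14 and later you can also request the optimization explicitly with an attribute on your own functions; a sketch:

```ocaml
(* Sketch: the [@tail_mod_cons] attribute (OCaml >= 4.14) asks the
   compiler to apply tail modulo cons, so this direct-style map runs in
   constant stack space even though [f x :: ...] is not a tail call in
   the source. On older compilers the attribute is simply ignored. *)
let[@tail_mod_cons] rec map_tmc f = function
  | [] -> []
  | x :: rest -> f x :: map_tmc f rest

let () = assert (map_tmc (fun x -> x + 1) [ 1; 2; 3 ] = [ 2; 3; 4 ])
```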
I removed the `List.map` call, and rewrote my implementation to operate on each line as a string.⁴
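A self-contained sketch of that shape, folding straight over the raw lines (stdlib only, so `String.split_on_char` and `int_of_string` stand in for their `Core` counterparts):

```ocaml
(* Sketch of part A without the intermediate action list: parse and
   accumulate in a single fold over the raw input lines. *)
let day2a_fold ls =
  let step (horiz, depth) l =
    match String.split_on_char ' ' l with
    | [ "forward"; n ] -> (horiz + int_of_string n, depth)
    | [ "up"; n ] -> (horiz, depth - int_of_string n)
    | [ "down"; n ] -> (horiz, depth + int_of_string n)
    | _ -> failwith ("unexpected line " ^ l)
  in
  let horiz, depth = List.fold_left step (0, 0) ls in
  horiz * depth

let () =
  (* The worked example from the puzzle statement. *)
  assert (
    day2a_fold [ "forward 5"; "down 5"; "forward 8"; "up 3"; "down 8"; "forward 2" ] = 150)
```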
Performance comparisons
Part B extends part A to track an “aim” for the submarine in addition to its position. Copypasta essentially got us there:
```ocaml
let step2b (horiz, depth, aim) l =
  match line_to_action l with
  | Forward f -> (horiz + f, depth + (f * aim), aim)
  | Up u -> (horiz, depth, aim - u)
  | Down d -> (horiz, depth, aim + d)

let day2b ls =
  let horiz, depth, _ = List.fold ls ~init:(0, 0, 0) ~f:step2b in
  horiz * depth
```
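As a runnable check, here's a self-contained stdlib sketch of the same fold (with `Scanf.sscanf` standing in for `line_to_action`, purely for illustration), verified against the puzzle's worked example:

```ocaml
(* Self-contained sketch of part B: track (horiz, depth, aim) while
   folding over the raw lines. [Scanf]'s %s reads up to the next space,
   so "%s %d" splits each command into its word and its value. *)
let day2b_sketch ls =
  let step (horiz, depth, aim) l =
    Scanf.sscanf l "%s %d" (fun w n ->
        match w with
        | "forward" -> (horiz + n, depth + (n * aim), aim)
        | "up" -> (horiz, depth, aim - n)
        | "down" -> (horiz, depth, aim + n)
        | _ -> failwith ("unexpected action " ^ w))
  in
  let horiz, depth, _ = List.fold_left step (0, 0, 0) ls in
  horiz * depth

let () =
  (* The puzzle's example: horiz 15, depth 60, product 900. *)
  assert (
    day2b_sketch [ "forward 5"; "down 5"; "forward 8"; "up 3"; "down 8"; "forward 2" ] = 900)
```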
My Go implementation used a straightforward for loop, with the three parts of the accumulator represented as three local integers.
```go
func B(ls []string) int {
	h, v, a := 0, 0, 0
	for _, l := range ls {
		sp := strings.Split(l, " ")
		w, val := sp[0], lib.Atoi(sp[1])
		switch w {
		case "forward":
			h += val
			v += a * val
		case "down":
			a += val
		case "up":
			a -= val
		default:
			log.Fatal(l)
		}
	}

	return h * v
}
```
This makes it a little clearer which fields change in each branch, but as far as efficiency goes, there's no reason the OCaml compiler couldn't optimize to the exact same behavior.⁵ Let's compare the performance of the two implementations.
The first thing we see is that running an OCaml binary is ~10x slower than running a Go binary:
```
~/aoc21go/day2$ time ./day2
1459206
1320534480

real	0m0.001s
user	0m0.002s
sys	0m0.000s
~/aoc21go/day2$ cd ~/aoc21ocaml
~/aoc21ocaml$ time _build/default/bin/main.exe aoc -day 2a -day 2b
1459206
1320534480

real	0m0.013s
user	0m0.007s
sys	0m0.007s
```
My strong suspicion was that I was measuring the startup cost of any OCaml binary, at least in the default build configuration. I tend to write long-running servers in real life, so I chose to ignore startup costs and focus on steady-state. That meant writing my first benchmark!
Go tests have a built-in benchmarking system, whereas in OCaml I'll use the `core_bench` library and run my binary in benchmarking mode. In both cases, I benchmarked using the real input and excluded file IO.⁶
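For context, the `core_bench` side can be wired up roughly like this (a hedged sketch: `day2a` and `day2b` are the functions from earlier in the post, `"input.txt"` is a stand-in path, and `Command_unix` requires the `core_unix.command_unix` library):

```ocaml
(* Sketch: read the input once, outside the timed region, then let
   core_bench repeatedly time just the parsing + fold. *)
open Core
open Core_bench

let () =
  let lines = In_channel.read_lines "input.txt" in
  Command_unix.run
    (Bench.make_command
       [ Bench.Test.create ~name:"d2a" (fun () -> ignore (day2a lines))
       ; Bench.Test.create ~name:"d2b" (fun () -> ignore (day2b lines))
       ])
```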
The Go result was about 10% faster than the OCaml result:
```
BenchmarkA-12    25321    46575 ns/op    32000 B/op    1000 allocs/op
BenchmarkB-12    27775    43238 ns/op    32000 B/op    1000 allocs/op
```
vs.
```
  Name     Time/Run   mWd/Run   mjWd/Run   Prom/Run   Percentage
 ------ ---------- --------- ---------- ---------- ------------
  d2a      49.59us   25.00kw      0.35w      0.35w       98.38%
  d2b      50.40us   26.00kw      0.44w      0.44w      100.00%
```
Note that Go’s compiler does not support optimization options: every build is optimized and fully statically linked:
```
~/aoc21go/day2$ go build .
~/aoc21go/day2$ ls
a.txt  day2  day2.go  day2_test.go
~/aoc21go/day2$ ldd day2
	not a dynamic executable
~/aoc21go/day2$ cd ~/aoc21ocaml
~/aoc21ocaml$ dune build @@default @runtest
~/aoc21ocaml$ ls _build/default/bin
main.exe  main.ml  main.pp.ml  main.pp.mli
~/aoc21ocaml$ ldd _build/default/bin/main.exe
	linux-vdso.so.1 (0x00007ffc1f1cc000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe51e5ec000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe51e4a8000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe51e4a2000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe51e2dd000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fe520017000)
```
OCaml, by contrast, has a separate "flambda" compiler to generate optimized builds.⁷ Time to introduce a new switch!⁸
Using flambda and configuring Dune to pass `-O3` to `ocamlopt`, we get OCaml performance comparable to Go:
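For reference, the flag can be set via the `ocamlopt_flags` field in the executable's `dune` stanza; a sketch (the `name` and `libraries` values are assumptions standing in for the real project's):

```lisp
; Sketch of a dune executable stanza passing -O3 to ocamlopt.
; Only meaningful on a flambda switch; the non-flambda compiler
; ignores these optimization levels.
(executable
 (name main)
 (libraries core core_bench)
 (ocamlopt_flags (:standard -O3)))
```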
```
  Name     Time/Run   mWd/Run   mjWd/Run   Prom/Run   Percentage
 ------ ---------- --------- ---------- ---------- ------------
  d2a      43.94us   20.00kw                              98.75%
  d2b      44.50us   21.00kw      0.19w      0.19w      100.00%
```
My final commit of day 2 code is here.
Thoughts
I now have a working configuration to build functional code that performs comparably to an imperative implementation. I suspect I’ll still feel icky about list inefficiencies, but it’s nice to know that the compiler is mostly handling it for me and I ought to just focus on thinking functionally.
--Chris
1. `Core` is actually a separate library which builds on `Base`, and this implementation is from `Base`, but I find it easier to refer to everything as `Core`.
2. It's a little more complex than this: short lists are unrolled and longer lists are batch-allocated, but for a long enough list, there is still an "unreverse" step at the end.
3. The type theory refers to "holes" in destination-passing style, to be filled by a normal tail-recursive call, but I find the practical result of impure mutation easier to grok since it matches my imperative experience.
4. I do wonder if I would have even worried about this if the default linear data structure in OCaml were a contiguous slice instead of a linked list. Certainly for my Go implementations I regularly first transform the slice of strings into a slice of a more explicit type. The ability to preallocate contiguous regions of memory feels better to me, but for these kinds of small puzzles it's not really necessary!
5. String manipulation may behave differently in the two runtimes.
6. The input is 1000 lines long, so one way to look at these results is that each line takes ~45-50ns. That's ~1/2 a main memory lookup, indicating that even with all the indirection inherent in linked lists, both Go and OCaml are hitting cache for at least half of the memory accesses.
7. Fully static linking seems complex in OCaml and shouldn't be relevant to the function being benchmarked, so I didn't investigate it deeply.
8. I also moved from OCaml 4.11.0 to 4.14.0 as part of this change. Only flambda had an impact on the performance.