Advent of Code '21 in OCaml, Day 2
Posted: 2022-09-18
I recently started at Jane Street, a technology-centric trading firm which uses OCaml. I’ve spent my whole career writing imperative languages, so I am exploring OCaml development using the Advent of Code, a puzzle game which I’ve played in Go multiple times.
See the day one post for background!
Day 2
Part A of the second day’s puzzle, like the first, involves serially iterating over a list. Each element of the list is a command to move our submarine, and the puzzle was to calculate the submarine’s final position.
My first thought was to map the lines of the file into a list of variant types, and then fold over that list to get the result:
```ocaml
type action = Forward of int | Down of int | Up of int

let line_to_action l =
  let stringpair_to_action action scalestr =
    let scale = Int.of_string scalestr in
    match action with
    | "forward" -> Forward scale
    | "down" -> Down scale
    | "up" -> Up scale
    | _ -> failwithf "unexpected action %s" action ()
  in
  match String.split ~on:' ' l with
  | [ a; b ] -> stringpair_to_action a b
  | _ -> failwithf "unexpected line %s" l ()

let step (horiz, depth) = function
  | Forward f -> (horiz + f, depth)
  | Up u -> (horiz, depth - u)
  | Down d -> (horiz, depth + d)

let day2a ls =
  let horiz, depth =
    List.map ls ~f:line_to_action |> List.fold ~init:(0, 0) ~f:step
  in
  horiz * depth
```
I felt icky, though. `Core.List.t` is a singly-linked functional data structure, which means that for large lists, `Core.List.map` has to iterate once over the list forwards, building up a reversed representation, then iterate again over that to build the forwards-order representation. In other words, when I cleaned up my day 1 solution and removed an unnecessary `List.rev` call, I actually still had one hidden in `List.map`. I did a little digging into the implementation of `List.map` to see what's going on.
Aside - `List.map` and tail modulo cons
OCaml's standard library `List.map` in version 4.14.0 is not tail recursive, and will therefore cause a stack overflow on long enough lists. To avoid the overflow, `Core.List.map`¹ builds an intermediate, reversed² result, leaning on the tail call optimization. When the result of a function is exactly the result of another function call (a "tail call"), the existing stack frame can be reused. This way, a recursive function call chain can use constant memory.
Effectively, the `Core` implementation passes an accumulator of an out-of-order map to the recursive function, then fixes it up (again using the tail call optimization) afterwards. If you're like me, though, you did most of your list puzzles and coursework with loops. We know for sure that it's possible to build a list forwards with one pass and constant space: you remember the head and tail nodes, mutate the tail on each iteration, and return the head when you're done.
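As a sketch, here is the accumulate-then-reverse shape in plain stdlib OCaml (the real `Core` implementation is more heavily optimized; see the footnotes):

```ocaml
(* A minimal sketch of accumulate-then-reverse map, stdlib only.
   Both [go] and [List.rev] make only tail calls, so stack usage stays
   constant no matter how long the input list is. *)
let map_tail f xs =
  let rec go acc = function
    | [] -> List.rev acc (* the hidden "unreverse" pass *)
    | x :: rest -> go (f x :: acc) rest
  in
  go [] xs

let () = assert (map_tail (fun x -> x * 2) [ 1; 2; 3 ] = [ 2; 4; 6 ])
```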
In OCaml trunk, but not in any current release, the standard library `List.map` has been changed to behave as we'd expect, using an optimization called "tail modulo cons".
The tail modulo cons optimization essentially does the return-object tracking and impure mutation³ that one would expect. This is done by the compiler, so the intermediate state and the impure mutations are invisible to the programmer, a good property for an optimization.
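On 4.14 and later you can also request the optimization explicitly with an attribute on your own functions; a sketch:

```ocaml
(* Sketch: the [@tail_mod_cons] attribute (OCaml >= 4.14) asks the
   compiler to apply tail modulo cons, so this direct-style map runs in
   constant stack space even though [f x :: ...] is not a tail call in
   the source. On older compilers the attribute is simply ignored. *)
let[@tail_mod_cons] rec map_tmc f = function
  | [] -> []
  | x :: rest -> f x :: map_tmc f rest

let () = assert (map_tmc (fun x -> x + 1) [ 1; 2; 3 ] = [ 2; 3; 4 ])
```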
I removed the `List.map` call, and rewrote my implementation to operate on each line as a string.⁴
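A self-contained sketch of that shape, folding straight over the raw lines (stdlib only, so `String.split_on_char` and `int_of_string` stand in for their `Core` counterparts):

```ocaml
(* Sketch of part A without the intermediate action list: parse and
   accumulate in a single fold over the raw input lines. *)
let day2a_fold ls =
  let step (horiz, depth) l =
    match String.split_on_char ' ' l with
    | [ "forward"; n ] -> (horiz + int_of_string n, depth)
    | [ "up"; n ] -> (horiz, depth - int_of_string n)
    | [ "down"; n ] -> (horiz, depth + int_of_string n)
    | _ -> failwith ("unexpected line " ^ l)
  in
  let horiz, depth = List.fold_left step (0, 0) ls in
  horiz * depth

let () =
  (* The worked example from the puzzle statement. *)
  assert (
    day2a_fold [ "forward 5"; "down 5"; "forward 8"; "up 3"; "down 8"; "forward 2" ] = 150)
```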
Performance comparisons
Part B extends part A to track an “aim” for the submarine in addition to its position. Copypasta essentially got us there:
```ocaml
let step2b (horiz, depth, aim) l =
  match line_to_action l with
  | Forward f -> (horiz + f, depth + (f * aim), aim)
  | Up u -> (horiz, depth, aim - u)
  | Down d -> (horiz, depth, aim + d)

let day2b ls =
  let horiz, depth, _ = List.fold ls ~init:(0, 0, 0) ~f:step2b in
  horiz * depth
```
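As a runnable check, here's a self-contained stdlib sketch of the same fold (with `Scanf.sscanf` standing in for `line_to_action`, purely for illustration), verified against the puzzle's worked example:

```ocaml
(* Self-contained sketch of part B: track (horiz, depth, aim) while
   folding over the raw lines. [Scanf]'s %s reads up to the next space,
   so "%s %d" splits each command into its word and its value. *)
let day2b_sketch ls =
  let step (horiz, depth, aim) l =
    Scanf.sscanf l "%s %d" (fun w n ->
        match w with
        | "forward" -> (horiz + n, depth + (n * aim), aim)
        | "up" -> (horiz, depth, aim - n)
        | "down" -> (horiz, depth, aim + n)
        | _ -> failwith ("unexpected action " ^ w))
  in
  let horiz, depth, _ = List.fold_left step (0, 0, 0) ls in
  horiz * depth

let () =
  (* The puzzle's example: horiz 15, depth 60, product 900. *)
  assert (
    day2b_sketch [ "forward 5"; "down 5"; "forward 8"; "up 3"; "down 8"; "forward 2" ] = 900)
```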
My Go implementation used a straightforward for loop, with the three parts of the accumulator represented as three local integers.
```go
func B(ls []string) int {
	h, v, a := 0, 0, 0
	for _, l := range ls {
		sp := strings.Split(l, " ")
		w, val := sp[0], lib.Atoi(sp[1])
		switch w {
		case "forward":
			h += val
			v += a * val
		case "down":
			a += val
		case "up":
			a -= val
		default:
			log.Fatal(l)
		}
	}

	return h * v
}
```
This makes it a little clearer which fields change in each branch, but as far as efficiency goes, there's no reason the OCaml compiler couldn't optimize to the exact same behavior.⁵ Let's compare the performance of the two implementations.
The first thing we see is that running an OCaml binary is ~10x slower than running a Go binary:
```
~/aoc21go/day2$ time ./day2
1459206
1320534480

real	0m0.001s
user	0m0.002s
sys	0m0.000s
~/aoc21go/day2$ cd ~/aoc21ocaml
~/aoc21ocaml$ time _build/default/bin/main.exe aoc -day 2a -day 2b
1459206
1320534480

real	0m0.013s
user	0m0.007s
sys	0m0.007s
```
My strong suspicion was that I was measuring the startup cost of any OCaml binary, at least in the default build configuration. I tend to write long-running servers in real life, so I chose to ignore startup costs and focus on steady-state. That meant writing my first benchmark!
Go tests have a built-in benchmarking system, whereas in OCaml I'll use the `core_bench` library and run my binary in benchmarking mode. In both cases, I benchmarked using the real input and excluded file IO.⁶
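For context, the `core_bench` side can be wired up roughly like this (a hedged sketch: `day2a` and `day2b` are the functions from earlier in the post, `"input.txt"` is a stand-in path, and `Command_unix` requires the `core_unix.command_unix` library):

```ocaml
(* Sketch: read the input once, outside the timed region, then let
   core_bench repeatedly time just the parsing + fold. *)
open Core
open Core_bench

let () =
  let lines = In_channel.read_lines "input.txt" in
  Command_unix.run
    (Bench.make_command
       [ Bench.Test.create ~name:"d2a" (fun () -> ignore (day2a lines))
       ; Bench.Test.create ~name:"d2b" (fun () -> ignore (day2b lines))
       ])
```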
The Go result was about 10% faster than the OCaml result:
```
BenchmarkA-12    25321    46575 ns/op    32000 B/op    1000 allocs/op
BenchmarkB-12    27775    43238 ns/op    32000 B/op    1000 allocs/op
```
vs.
```
  Name     Time/Run   mWd/Run   mjWd/Run   Prom/Run   Percentage
 ------ ---------- --------- ---------- ---------- ------------
  d2a      49.59us   25.00kw      0.35w      0.35w       98.38%
  d2b      50.40us   26.00kw      0.44w      0.44w      100.00%
```
Note that Go’s compiler does not support optimization options: every build is optimized and fully statically linked:
```
~/aoc21go/day2$ go build .
~/aoc21go/day2$ ls
a.txt  day2  day2.go  day2_test.go
~/aoc21go/day2$ ldd day2
	not a dynamic executable
~/aoc21go/day2$ cd ~/aoc21ocaml
~/aoc21ocaml$ dune build @@default @runtest
~/aoc21ocaml$ ls _build/default/bin
main.exe  main.ml  main.pp.ml  main.pp.mli
~/aoc21ocaml$ ldd _build/default/bin/main.exe
	linux-vdso.so.1 (0x00007ffc1f1cc000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe51e5ec000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe51e4a8000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe51e4a2000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe51e2dd000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fe520017000)
```
OCaml, by contrast, has a separate "flambda" compiler to generate optimized builds.⁷ Time to introduce a new switch!⁸
Using flambda and configuring Dune to pass `-O3` to `ocamlopt`, we get OCaml performance comparable to Go:
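For reference, the flag can be set via the `ocamlopt_flags` field in the executable's `dune` stanza; a sketch (the `name` and `libraries` values are assumptions standing in for the real project's):

```lisp
; Sketch of a dune executable stanza passing -O3 to ocamlopt.
; Only meaningful on a flambda switch; the non-flambda compiler
; ignores these optimization levels.
(executable
 (name main)
 (libraries core core_bench)
 (ocamlopt_flags (:standard -O3)))
```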
```
  Name     Time/Run   mWd/Run   mjWd/Run   Prom/Run   Percentage
 ------ ---------- --------- ---------- ---------- ------------
  d2a      43.94us   20.00kw                              98.75%
  d2b      44.50us   21.00kw      0.19w      0.19w      100.00%
```
My final commit of day 2 code is here.
Thoughts
I now have a working configuration to build functional code that performs comparably to an imperative implementation. I suspect I’ll still feel icky about list inefficiencies, but it’s nice to know that the compiler is mostly handling it for me and I ought to just focus on thinking functionally.
--Chris
1. `Core` is actually a separate library which builds on `Base`, and this implementation is from `Base`, but I find it easier to refer to everything as `Core`.
2. It's a little more complex than this: short lists are unrolled and longer lists are batch-allocated, but for a long enough list, there is still an "unreverse" step at the end.
3. The type theory refers to "holes" in destination-passing style, to be filled by a normal tail-recursive call, but I find the practical result of impure mutation easier to grok since it matches my imperative experience.
4. I do wonder if I would have even worried about this if the default linear data structure in OCaml were a contiguous slice instead of a linked list. Certainly for my Go implementations I regularly first transform the slice of strings into a slice of a more explicit type. The ability to preallocate contiguous regions of memory feels better to me, but for these kinds of small puzzles it's not really necessary!
5. String manipulation may behave differently in the two runtimes.
6. The input is 1000 lines long, so one way to look at these results is that each line takes ~45-50ns. That's ~1/2 a main memory lookup, indicating that even with all the indirection inherent in linked lists, both Go and OCaml are hitting cache for at least half of the memory accesses.
7. Fully static linking seems complex in OCaml and shouldn't be relevant to the function being benchmarked, so I didn't investigate it deeply.
8. I also moved from OCaml 4.11.0 to 4.14.0 as part of this change. Only flambda had an impact on the performance.