Advent of Code '21 in OCaml, Day 3
Posted: 2022-10-16
I recently started at Jane Street, a technology-centric trading firm which uses OCaml. I’ve spent my whole career writing imperative languages, so I am exploring OCaml development using the Advent of Code, a puzzle game which I’ve played in Go multiple times.
See the day one post for background! See the day two post for the prior puzzle.
Day 3
Part A of the third day’s puzzle required tracking the number of times each “bit” appears in a string of 1s and 0s. I had to multiply a number made up of the most common bits by one made up of the least common bits. If the count of bits at a given index was tied, I was to treat 1s as being more common.
The number of bits per number was different between the sample and the real input, so I decided to use an Array.t to track the count of 1 bits in each position. There may be a pure way to do this in OCaml, but I found the array intuitive to use.
Using an array meant that I had a side-effectual fold function. I also wanted to avoid unnecessary calls to List.length, so I tracked the total number of lines in the function during the fold as well.
let day3a lines =
  let bits = List.hd_exn lines |> String.length in
  let ones_seen = Array.create ~len:bits 0 in
  let countstep totalct line =
    String.iteri
      ~f:(fun i c ->
        if Char.O.(c = '1') then ones_seen.(i) <- ones_seen.(i) + 1 else ())
      line;
    totalct + 1
  in
  let line_ct = List.fold ~init:0 ~f:countstep lines in
  (* ... snip ... *)
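The snipped remainder turns ones_seen and line_ct into the answer. A minimal sketch of that shape (using the corrected comparison discussed below, not the code as I first wrote it):

(* Sketch of the elided tail: gamma takes the most common bit in each
   position, ties going to 1; epsilon is its complement within [bits]
   bits; the answer is their product. *)
let gamma =
  Array.fold ones_seen ~init:0 ~f:(fun acc ones ->
    (acc lsl 1) lor (if ones * 2 >= line_ct then 1 else 0))
in
gamma * (gamma lxor ((1 lsl bits) - 1))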
Idiomatic OCaml seems to be to annotate impure functions with unit arguments or return values. As the code above shows, this isn't required, but I probably would not write a function like countstep in code which would be maintained by a team: at its callsite it looks like a pure accumulator, which is misleading. In this specific case, where the code is by and for me and the function is a closure with narrow scope, I think it's ok.
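One way to make the effect visible at the callsite — a sketch of the idea, not a change I actually made — is to pull the mutation into a helper whose unit return type advertises it:

(* The mutation lives in a helper that visibly returns unit, so callers
   can't mistake it for a pure accumulator. *)
let record_ones ones_seen line : unit =
  String.iteri line ~f:(fun i c ->
    if Char.O.(c = '1') then ones_seen.(i) <- ones_seen.(i) + 1)

let countstep totalct line =
  record_ones ones_seen line;
  totalct + 1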
Part B asked me to filter the input list twice, stepping through bit by bit, until only one element was left. On the first pass, I had to keep the elements with the most common bit in each position and on the second pass, I had to keep those with the least common bit.
In writing this code, I discovered that my part A solution was wrong! I had written:
if ones >= (line_ct / 2) then (* ... snip ... *)
When line_ct is an odd number and there is exactly one more 0 than there are 1s, this will count 1s as being equally common to 0s. In the sample and real inputs, there were an even number of rows, so part A did not hit this condition. The filtering in part B quickly resulted in odd line counts.
I’ve specified more than one Paxos implementation in TLA+,[1] and based on that experience my preferred solution to this problem is to cross-multiply:
if (ones * 2) >= line_ct then (* ... snip ... *)
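A concrete case shows the difference: with five lines, two of which have a 1 in the current position, 0 is strictly more common, but integer division hides that.

let () =
  let line_ct = 5 and ones = 2 in
  (* Buggy: line_ct / 2 truncates to 2, so 2 >= 2 wrongly favors the 1s. *)
  assert (ones >= line_ct / 2);
  (* Fixed: 2 * 2 = 4 is not >= 5, so the 0s correctly win. *)
  assert (not (ones * 2 >= line_ct))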
Even after many years of professional software engineering, framing affects which solutions I consider. In this case, when I read the puzzle’s discussion of “most common,” I thought of division, which was wrong.
My initial solution was a straightforward recursive partition_tf. For example:
let rec ox_rating line_ct index lines =
  if line_ct = 1 then "0b" ^ List.hd_exn lines |> Int.of_string
  else
    let ones, zeroes =
      List.partition_tf lines ~f:(fun l -> Char.O.(String.get l index = '1'))
    in
    let onelen = List.length ones in
    let zerolen = List.length zeroes in
    if onelen * 2 >= line_ct then ox_rating onelen (index + 1) ones
    else ox_rating zerolen (index + 1) zeroes
List.length!
I didn't love the number of List.length calls in my functions, since they're O(n) in the elements of the list and partition_tf could already know how many elements it put into each side. Since I used an Array.t in part A and Go would use slices in this context, I decided to benchmark[2] and compare to an implementation which replaced List.t with Array.t. The results were catastrophic!
Naive list:
Name Time/Run mWd/Run mjWd/Run Prom/Run mGC/Run mjGC/Run
------ ---------- --------- ---------- ---------- ----------- ----------
d3b 24.48us 32.38kw 189.19w 189.19w 124.11e-3 1.19e-3
Naive array:
Name Time/Run mWd/Run mjWd/Run Prom/Run mGC/Run mjGC/Run
------ ---------- --------- ---------- ---------- --------- -----------
d3b 218.52us 20.30kw 19.59kw 6.54kw 5.00 100.05e-3
The garbage collection stats hint at why: we generate a lot more allocations, major heap promotions, and major collections when we move to an array. But where could 100x more major allocations come from?
It turns out that Array.partition_tf makes more copies than you might hope. Array.t is a fixed-size array type, not a variable-size vector, so it does not have a distinction between capacity and length. The implementation of Array.partition_tf works as follows:
1. Array.map to wrap each element in Either.First for true and Either.Second for false, which is one full copy of the array.
2. Array.filter_map over that copy to extract the First elements into their own array.
3. Array.filter_map again to extract the Second elements.
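In outline, the allocation pattern looks something like this sketch (not Base's literal source; it assumes Core or Base is open, as elsewhere in this post):

let partition_tf_sketch t ~f =
  (* Copy 1: wrap every element in an Either. *)
  let both =
    Array.map t ~f:(fun x -> if f x then Either.First x else Either.Second x)
  in
  (* Each filter_map below allocates a full-size scratch array on its
     first match, then shrinks it with Array.sub — another copy each. *)
  let trues =
    Array.filter_map both ~f:(function Either.First x -> Some x | _ -> None)
  in
  let falses =
    Array.filter_map both ~f:(function Either.Second x -> Some x | _ -> None)
  in
  (trues, falses)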
You might expect the second and third steps together to result in ~1 more copy of the initial array. Unfortunately, it results in ~3. When filter_map encounters the first matching element, it allocates an array the full size of the input! Then it reallocates to the correct size with Array.sub, which is a copy.
It seems that arrays are not treated with the same optimization imperative as lists. For example, partition_tf could track how many trues and falses there ought to be after its first pass, and could preallocate storage for them once. Or if a vector type were available it could allocate exactly once per side by not wrapping the values in an Either.
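A hypothetical preallocating partition_tf, to make the first suggestion concrete (this is not code from Base; it's just the shape I'd expect):

let partition_tf_two_pass t ~f =
  let n = Array.length t in
  (* Pass 1: count the trues so each output can be allocated exactly once. *)
  let n_true = Array.count t ~f in
  if n = 0 then ([||], [||])
  else begin
    let trues = Array.create ~len:n_true t.(0) in
    let falses = Array.create ~len:(n - n_true) t.(0) in
    let ti = ref 0 and fi = ref 0 in
    (* Pass 2: fill both sides in place, with no intermediate copies. *)
    Array.iter t ~f:(fun x ->
      if f x then (trues.(!ti) <- x; incr ti)
      else (falses.(!fi) <- x; incr fi));
    (trues, falses)
  end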
As a consequence, I made a minor optimization to keep a list but track the length, sketched below.
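One plausible shape for the change (a sketch, not the actual diff) replaces partition_tf plus two List.length calls with a single fold that carries the counts along:

let rec ox_rating line_ct index lines =
  if line_ct = 1 then "0b" ^ List.hd_exn lines |> Int.of_string
  else
    (* Partition and count in one pass; List.length is never called. *)
    let ones, onelen, zeroes, zerolen =
      List.fold lines ~init:([], 0, [], 0) ~f:(fun (os, oc, zs, zc) l ->
        if Char.O.(String.get l index = '1') then (l :: os, oc + 1, zs, zc)
        else (os, oc, l :: zs, zc + 1))
    in
    if onelen * 2 >= line_ct then ox_rating onelen (index + 1) ones
    else ox_rating zerolen (index + 1) zeroes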
I used perf to estimate what the savings might be: ~22% of time was spent in List.length calls.
Samples: 40K of event 'cpu-clock:u', Event count (approx.): 10061000000
Overhead Command Shared Object Symbol
+ 33.22% main.exe main.exe [.] camlAoc21ocaml__Day3__f_1524
+ 22.68% main.exe main.exe [.] camlStdlib__List__length_aux_188
+ 20.66% main.exe main.exe [.] camlStdlib__List__rev_append_345
+ 13.89% main.exe main.exe [.] camlAoc21ocaml__Day3__loop_2533
+ 1.55% main.exe main.exe [.] do_some_marking
<... snip ...>
However, I only wound up saving ~half of that:
Name Time/Run mWd/Run mjWd/Run Prom/Run mGC/Run mjGC/Run
------ ---------- --------- ---------- ---------- ----------- ----------
d3b 21.37us 36.34kw 174.38w 174.38w 139.20e-3 1.12e-3
Time moved into anonymous functions and fold_left, and I didn't investigate further.
Samples: 40K of event 'cpu-clock:u', Event count (approx.): 10047000000
Overhead Command Shared Object Symbol
+ 24.53% main.exe main.exe [.] camlAoc21ocaml__Day3__anon_fn$5bday3$2eml$3a49$2c7$2d$2d156$5d_380
+ 24.16% main.exe main.exe [.] camlAoc21ocaml__Day3__anon_fn$5bday3$2eml$3a59$2c24$2d$2d68$5d_445
+ 15.83% main.exe main.exe [.] camlStdlib__List__fold_left_597
+ 7.00% main.exe main.exe [.] caml_apply2
+ 6.82% main.exe main.exe [.] camlAoc21ocaml__Day3__anon_fn$5bday3$2eml$3a59$2c24$2d$2d68$5d_3841
+ 6.80% main.exe main.exe [.] camlAoc21ocaml__Day3__anon_fn$5bday3$2eml$3a59$2c24$2d$2d68$5d_4316
+ 4.59% main.exe main.exe [.] camlStdlib__List__length_aux_188
+ 1.75% main.exe main.exe [.] do_some_marking
<... snip ...>
Alternative approaches
Another way of looking at this puzzle is that it is straightforward on a sorted list. If you have a range where the prefix bits are identical and you examine the next bit, the offset of the first 1 is exactly the count of 0s. I thought this might be a way to get better performance out of an array, by operating on a string range in-place.
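For the oxygen rating, a sketch of that idea (assuming a sorted array; this is not the code I benchmarked) narrows a [lo, hi) range with a binary search for the first 1 bit:

let rec rating_sorted lines index lo hi =
  if hi - lo = 1 then "0b" ^ lines.(lo) |> Int.of_string
  else begin
    (* Lower bound: the first offset in [lo, hi) whose bit at [index] is
       '1'. The range shares a prefix, so everything before it is a '0'. *)
    let first_one = ref lo and stop = ref hi in
    while !first_one < !stop do
      let mid = (!first_one + !stop) / 2 in
      if Char.O.(String.get lines.(mid) index = '0') then first_one := mid + 1
      else stop := mid
    done;
    let ones = hi - !first_one and zeroes = !first_one - lo in
    (* Keep the most common side, ties going to the 1s. *)
    if ones >= zeroes then rating_sorted lines (index + 1) !first_one hi
    else rating_sorted lines (index + 1) lo !first_one
  end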
My existing code had O(n * r) time complexity, where r is the radix (number of bits) in each element. A sort would have O(n * log n) time complexity, and log n was smaller than r for my 1000-line input of 12 characters each (log₂ 1000 ≈ 10, vs. r = 12). In practice, it turned out that most of the array did not need to be examined, and so this was still much slower:
Name Time/Run mWd/Run mjWd/Run Prom/Run mGC/Run mjGC/Run
------ ---------- --------- ---------- ---------- ---------- ----------
d3b 169.09us 7.38kw 1.00kw 0.45w 30.32e-3 4.32e-3
The garbage collection stats suggested that this was a much less demanding approach in terms of copies. And indeed, most time was spent in string comparison for the sort:[3]
Samples: 40K of event 'cpu-clock:u', Event count (approx.): 10025000000
Overhead Command Shared Object Symbol
+ 30.80% main.exe main.exe [.] caml_string_compare
+ 10.37% main.exe main.exe [.] caml_modify
+ 8.34% main.exe libc-2.31.so [.] 0x000000000015c288
+ 6.18% main.exe main.exe [.] camlBase__Array__loop_766
+ 5.23% main.exe main.exe [.] camlBase__Array__loop_414
+ 4.06% main.exe main.exe [.] caml_page_table_lookup
<... snip ...>
Finally, I decided to partition only those parts of the array which might contain the answer. This is the meat of the approach, which partitions a range by bit at a given index, and returns the offset of the first 1 bit:
let partition_inplace lines index start end_ =
  let start = ref start and end_ = ref end_ in
  while !start < !end_ do
    match String.get lines.(!start) index with
    | '0' -> incr start
    | '1' ->
      decr end_;
      while Char.O.(String.get lines.(!end_) index = '1') && !start < !end_ do
        decr end_
      done;
      Array.swap lines !start !end_
    | c -> failwithf "unexpected char %c" c ()
  done;
  !start
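The oxygen-rating driver then threads the shrinking range through the recursion; a sketch of the usage (not the exact code):

let rec ox_rating_inplace lines index lo hi =
  if hi - lo = 1 then "0b" ^ lines.(lo) |> Int.of_string
  else
    let first_one = partition_inplace lines index lo hi in
    let ones = hi - first_one in
    (* Most common bit wins, ties going to 1; recurse into that subrange. *)
    if ones * 2 >= hi - lo then ox_rating_inplace lines (index + 1) first_one hi
    else ox_rating_inplace lines (index + 1) lo first_one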
On my test input, this finally gave arrays the upper hand over lists, by ~15%.
Name Time/Run mWd/Run mjWd/Run mGC/Run mjGC/Run Comp/Run
------ ---------- --------- ---------- --------- ---------- ----------
d3b 18.58us 11.00w 1.00kw 3.82e-3 1.18e-3 0.29e-3
Note, however, that the list-to-array copy still generated 1000 words of major heap, for what is ultimately a temporary array. I would have liked to see the compiler be smart enough to allocate that array on the stack, but I'm not sure it would have made a difference.
Thoughts
I am getting a feel for the efficiency characteristics of OCaml programs. Array copies are obviously expensive, but also easy to trigger accidentally. Understanding the garbage collector is both extremely important for performance and very different from my C++ intuitions around heap allocation costs.
-- Chris
[1] Paxos and TLA+ are both deep topics, which I’ll leave for other series.
[2] I benchmarked using my real, 1000-line input. In real code I’d want a variety of representative inputs, to avoid over-indexing on the specific shape of one test case. In this case, I just want a sense of how OCaml performs, so I don’t mind.
[3] The libc function is a child of the caml_string_compare node, so it is probably strncmp or something similar.