Advent of Code '21 in OCaml, Day 3

Posted: 2022-10-16

I recently started at Jane Street, a technology-centric trading firm which uses OCaml. I’ve spent my whole career writing imperative languages, so I am exploring OCaml development using the Advent of Code, a puzzle game which I’ve played in Go multiple times.

See the day one post for background! See the day two post for the prior puzzle.

Day 3

Part A of the third day’s puzzle required tracking the number of times each “bit” appears in a string of 1s and 0s. I had to multiply a number made up of the most common bits by one made up of the least common bits. If the count of bits at a given index was tied, I was to treat 1s as being more common.

The number of bits per number was different between the sample and the real input, so I decided to use an Array.t to track the count of 1 bits in each position. There may be a pure way to do this in OCaml, but I found the array intuitive to use.

Using an array meant that my fold function had side effects. I also wanted to avoid unnecessary calls to List.length, so I tracked the total number of lines during the fold as well.

let day3a lines =
  let bits = List.hd_exn lines |> String.length in
  let ones_seen = Array.create ~len:bits 0 in
  (* Counts '1's into [ones_seen] as a side effect; the fold's
     accumulator only tracks how many lines we have seen. *)
  let countstep totalct line =
    String.iteri
      ~f:(fun i c ->
        if Char.O.(c = '1') then ones_seen.(i) <- ones_seen.(i) + 1 else ())
      line;
    totalct + 1
  in
  let line_ct = List.fold ~init:0 ~f:countstep lines in
  (* ... snip ... *)
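
There may indeed be a pure way: for illustration, here is a sketch which threads an immutable list of per-bit counts (plus the line count) through the fold. This is my own illustration, not code from my solution.

(* A pure alternative to the array: the accumulator carries a list of
   per-bit counts and the number of lines seen, so nothing is mutated. *)
let count_ones_pure lines =
  let bits = List.hd_exn lines |> String.length in
  List.fold lines
    ~init:(List.init bits ~f:(fun _ -> 0), 0)
    ~f:(fun (counts, line_ct) line ->
      ( List.mapi counts ~f:(fun i count ->
          if Char.O.(String.get line i = '1') then count + 1 else count)
      , line_ct + 1 ))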

Idiomatic OCaml seems to be to annotate impure functions with unit arguments or return values. As countstep shows, this isn’t required, but I probably would not write a function like it in code maintained by a team: at its callsite it looks like a pure accumulator, which is misleading. In this specific case, where the code is by and for me and the function is a closure with narrow scope, I think it’s ok.
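
For illustration, here is a sketch of that unit-annotated style: the effectful helper returns unit, so its callsite no longer looks like a pure accumulator. The names here are my own, not from my actual solution.

(* Letting the types advertise the side effect: [tally_line] returns
   unit, so it clearly does something impure at its callsite. The line
   count moves into a ref rather than hiding in a fold accumulator. *)
let tally_line ones_seen line : unit =
  String.iteri line ~f:(fun i c ->
    if Char.O.(c = '1') then ones_seen.(i) <- ones_seen.(i) + 1)

let day3a_explicit lines =
  let bits = List.hd_exn lines |> String.length in
  let ones_seen = Array.create ~len:bits 0 in
  let line_ct = ref 0 in
  List.iter lines ~f:(fun line ->
    incr line_ct;
    tally_line ones_seen line);
  (* ... use [ones_seen] and [!line_ct] as before ... *)
  (ones_seen, !line_ct)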

Part B asked me to filter the input list twice, stepping through bit by bit, until only one element was left. On the first pass, I had to keep the elements with the most common bit in each position and on the second pass, I had to keep those with the least common bit.

In writing this code, I discovered that my part A solution was wrong! I had written:

if ones >= (line_ct / 2) then (* ... snip ... *)

When line_ct is odd and there is exactly one more 0 than there are 1s, this counts 1s as being equally common to 0s: with line_ct = 5 and ones = 2, integer division gives ones >= 5 / 2 = 2, which is true even though 0s are the majority. In the sample and real inputs, there were an even number of rows, so part A did not hit this condition. The filtering in part B quickly resulted in odd line counts.

I’ve specified more than one Paxos implementation in TLA+,1 and based on that experience my preferred solution to this problem is to cross-multiply:

if (ones * 2) >= line_ct then (* ... snip ... *)

Even after many years of professional software engineering, framing affects which solutions I consider. In this case, when I read the puzzle’s discussion of “most common,” I thought of division, which was wrong.

My initial solution was a straightforward recursion around partition_tf. For example:

let rec ox_rating line_ct index lines =
  if line_ct = 1 then "0b" ^ List.hd_exn lines |> Int.of_string
  else
    (* Split on the current bit, then recurse into whichever side the
       "most common, ties go to 1" rule keeps. *)
    let ones, zeroes =
      List.partition_tf lines ~f:(fun l -> Char.O.(String.get l index = '1'))
    in
    let onelen = List.length ones in
    let zerolen = List.length zeroes in
    if onelen * 2 >= line_ct then ox_rating onelen (index + 1) ones
    else ox_rating zerolen (index + 1) zeroes

List.length!

I didn’t love the number of List.length calls in my functions, since they’re O(n) in the elements of the list and partition_tf could already know how many elements it put into each side. Since I used an Array.t in part A and Go would use slices in this context, I decided to benchmark2 and compare to an implementation which replaced List.t with Array.t. The results were catastrophic!
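
The tables below look like Core_bench output; a harness along these lines would produce them. This is only a sketch — day3b and input_lines are placeholders for my real solution and parsed input:

(* Sketch of a Core_bench harness; [day3b] and [input_lines] are
   stand-ins for my actual solution and puzzle input. *)
let () =
  Core_bench.Bench.bench
    [ Core_bench.Bench.Test.create ~name:"d3b" (fun () ->
        ignore (day3b input_lines)) ]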

Naive list:

  Name   Time/Run   mWd/Run   mjWd/Run   Prom/Run     mGC/Run   mjGC/Run  
 ------ ---------- --------- ---------- ---------- ----------- ---------- 
  d3b     24.48us   32.38kw    189.19w    189.19w   124.11e-3    1.19e-3  

Naive array:

  Name   Time/Run   mWd/Run   mjWd/Run   Prom/Run   mGC/Run    mjGC/Run  
 ------ ---------- --------- ---------- ---------- --------- ----------- 
  d3b    218.52us   20.30kw    19.59kw     6.54kw      5.00   100.05e-3  

The garbage collection stats hint at why: we generate a lot more allocations, major heap promotions, and major collections when we move to an array. But where could 100x more major allocations come from?

It turns out that Array.partition_tf makes more copies than you might hope. Array.t is a fixed-size array type, not a variable-size vector, so it does not have a distinction between capacity and length. The implementation of Array.partition_tf works as follows:

  1. Array.map to wrap each element in Either.First for true and Either.Second for false, which is one full copy of the array.
  2. Array.filter_map over that copy to extract the First elements into their own array.
  3. Array.filter_map again to extract the Second elements.

You might expect the second and third steps together to result in ~1 more copy of the initial array. Unfortunately, they result in ~3. When filter_map encounters the first matching element, it allocates an array the full size of the input! Then it reallocates to the correct size with Array.sub, which is a copy.

It seems that arrays are not treated with the same optimization imperative as lists. For example, partition_tf could track how many trues and falses there ought to be after its first pass, and could preallocate storage for them once. Or if a vector type were available it could allocate exactly once per side by not wrapping the values in an Either.
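
For illustration, a sketch of that preallocating strategy — count first, allocate each side exactly once, then fill. This is my own code, not Base’s implementation:

(* Two-pass partition that allocates each output array exactly once:
   count the [true]s, preallocate both sides, then fill them. *)
let partition_tf_prealloc arr ~f =
  let n = Array.length arr in
  let n_true = Array.count arr ~f in
  if n_true = 0 then ([||], Array.copy arr)
  else if n_true = n then (Array.copy arr, [||])
  else (
    (* [arr.(0)] is just a dummy element for initialization. *)
    let trues = Array.create ~len:n_true arr.(0) in
    let falses = Array.create ~len:(n - n_true) arr.(0) in
    let ti = ref 0 and fi = ref 0 in
    Array.iter arr ~f:(fun x ->
      if f x then (
        trues.(!ti) <- x;
        incr ti)
      else (
        falses.(!fi) <- x;
        incr fi));
    (trues, falses))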

As a consequence, I made a minor optimization to keep a list but track the length. I used perf to estimate what the savings might be: ~22% of time was spent in List.length calls.

Samples: 40K of event 'cpu-clock:u', Event count (approx.): 10061000000
  Overhead  Command   Shared Object     Symbol
+   33.22%  main.exe  main.exe          [.] camlAoc21ocaml__Day3__f_1524
+   22.68%  main.exe  main.exe          [.] camlStdlib__List__length_aux_188
+   20.66%  main.exe  main.exe          [.] camlStdlib__List__rev_append_345
+   13.89%  main.exe  main.exe          [.] camlAoc21ocaml__Day3__loop_2533
+    1.55%  main.exe  main.exe          [.] do_some_marking
<... snip ...>
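
The change itself was small: partition with a fold that counts as it goes, so no List.length call is needed afterward. A sketch along the lines of what I wrote (my actual code may have differed):

(* Partition while tracking the length of each side. Element order is
   not preserved, which this puzzle does not care about. *)
let partition_with_counts lines ~f =
  List.fold lines ~init:([], 0, [], 0)
    ~f:(fun (trues, t_ct, falses, f_ct) line ->
      if f line
      then (line :: trues, t_ct + 1, falses, f_ct)
      else (trues, t_ct, line :: falses, f_ct + 1))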

However, I only wound up saving about half of that:

  Name   Time/Run   mWd/Run   mjWd/Run   Prom/Run     mGC/Run   mjGC/Run  
 ------ ---------- --------- ---------- ---------- ----------- ---------- 
  d3b     21.37us   36.34kw    174.38w    174.38w   139.20e-3    1.12e-3  

Time moved into anonymous functions and fold_left, and I didn’t investigate further.

Samples: 40K of event 'cpu-clock:u', Event count (approx.): 10047000000
  Overhead  Command   Shared Object     Symbol
+   24.53%  main.exe  main.exe          [.] camlAoc21ocaml__Day3__anon_fn$5bday3$2eml$3a49$2c7$2d$2d156$5d_380
+   24.16%  main.exe  main.exe          [.] camlAoc21ocaml__Day3__anon_fn$5bday3$2eml$3a59$2c24$2d$2d68$5d_445
+   15.83%  main.exe  main.exe          [.] camlStdlib__List__fold_left_597
+    7.00%  main.exe  main.exe          [.] caml_apply2
+    6.82%  main.exe  main.exe          [.] camlAoc21ocaml__Day3__anon_fn$5bday3$2eml$3a59$2c24$2d$2d68$5d_3841
+    6.80%  main.exe  main.exe          [.] camlAoc21ocaml__Day3__anon_fn$5bday3$2eml$3a59$2c24$2d$2d68$5d_4316
+    4.59%  main.exe  main.exe          [.] camlStdlib__List__length_aux_188
+    1.75%  main.exe  main.exe          [.] do_some_marking
<... snip ...>

Alternative approaches

Another way of looking at this puzzle is that it is straightforward on a sorted list: if you have a range where the prefix bits are identical and you examine the next bit, the offset of the first 1 is exactly the count of 0s. I thought this might be a way to get better performance out of an array, by sorting once and then operating on ranges of the array in place.
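
A sketch of that idea, assuming a sorted array in which the range [start, end_) already shares a common prefix up to index (the names here are mine):

(* Binary search for the offset of the first '1' at [index] in the
   sorted range [start, end_); the count of 0s in the range is then
   just that offset minus [start]. *)
let first_one sorted index start end_ =
  let lo = ref start and hi = ref end_ in
  while !lo < !hi do
    let mid = (!lo + !hi) / 2 in
    if Char.O.(String.get sorted.(mid) index = '1')
    then hi := mid
    else lo := mid + 1
  done;
  !lo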

My existing code had O(n * r) time complexity, where r is the width (number of bits) of each element. A sort would have O(n * log n) time complexity, and log n was smaller than r for my 1000-line input of 12 characters each. In practice, it turned out that most of the array never needed to be examined, so sorting everything up front was still much slower:

  Name   Time/Run   mWd/Run   mjWd/Run   Prom/Run    mGC/Run   mjGC/Run  
 ------ ---------- --------- ---------- ---------- ---------- ---------- 
  d3b    169.09us    7.38kw     1.00kw      0.45w   30.32e-3    4.32e-3  

The garbage collection stats suggested that this was a much less demanding approach in terms of copies. And indeed, most time was spent in string comparison for the sort:3

Samples: 40K of event 'cpu-clock:u', Event count (approx.): 10025000000
  Overhead  Command   Shared Object       Symbol
+   30.80%  main.exe  main.exe            [.] caml_string_compare
+   10.37%  main.exe  main.exe            [.] caml_modify
+    8.34%  main.exe  libc-2.31.so        [.] 0x000000000015c288
+    6.18%  main.exe  main.exe            [.] camlBase__Array__loop_766
+    5.23%  main.exe  main.exe            [.] camlBase__Array__loop_414
+    4.06%  main.exe  main.exe            [.] caml_page_table_lookup
<... snip ...>

Finally, I decided to partition only those parts of the array which might contain the answer. This is the meat of the approach: it partitions a range by the bit at a given index and returns the offset of the first 1 bit:

let partition_inplace lines index start end_ =
  let start = ref start and end_ = ref end_ in
  while !start < !end_ do
    match String.get lines.(!start) index with
    | '0' -> incr start
    | '1' ->
        (* Scan backwards past any trailing 1s, then swap the 0 we find
           (if there is one) into the front of the range. *)
        decr end_;
        while Char.O.(String.get lines.(!end_) index = '1') && !start < !end_ do
          decr end_
        done;
        Array.swap lines !start !end_
    | c -> failwithf "unexpected char %c" c ()
  done;
  !start
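
For context, a hypothetical driver for it (my real recursion may have differed in its details):

(* Narrow the live range one bit at a time, keeping whichever side the
   "most common, ties go to 1" rule selects, until one string is left. *)
let rec ox_rating_inplace lines index start end_ =
  if end_ - start = 1 then "0b" ^ lines.(start) |> Int.of_string
  else
    let first_one = partition_inplace lines index start end_ in
    let ones = end_ - first_one in
    if ones * 2 >= end_ - start
    then ox_rating_inplace lines (index + 1) first_one end_
    else ox_rating_inplace lines (index + 1) start first_one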

On my test input, this finally gave arrays the upper hand over lists, by ~15%.

  Name   Time/Run   mWd/Run   mjWd/Run   mGC/Run   mjGC/Run   Comp/Run  
 ------ ---------- --------- ---------- --------- ---------- ---------- 
  d3b     18.58us    11.00w     1.00kw   3.82e-3    1.18e-3    0.29e-3  

Note, however, that the list-to-array copy still generated 1000 words of major heap for what is ultimately a temporary array. I would have liked to see the compiler be smart enough to allocate that array on the stack, but I’m not sure it would have made a difference.

Thoughts

I am getting a feel for the efficiency characteristics of OCaml programs. Array copies are obviously expensive, but also easy to trigger accidentally. Understanding the garbage collector is both extremely important for performance and very different from my C++ intuitions around heap allocation costs.

-- Chris


  1. Paxos and TLA+ are both deep topics, which I’ll leave for other series.

  2. I benchmarked using my real, 1000-line input. In real code I’d want a variety of representative inputs, to avoid over-indexing on the specific shape of one test case. In this case, I just want a sense of how OCaml performs, so I don’t mind.

  3. The libc function is a child of the caml_string_compare node, so it is probably strncmp or something similar.

