tl;dr Late Acceptance Hill Climbing, LAHC, is such a simple and efficient optimization algorithm that I think it belongs in any software person's toolbox.

I have always been fond of metaheuristic optimization algorithms. Evolutionary algorithms and the like fascinate me, partly because of the biomimicry, and also, a bit, because they work. They can be applied to any optimization problem. The caveat is that even though you can apply them everywhere, these approaches give approximate solutions and often take quite some time to run. So make sure you are reaching for the right tool.

A few years ago, I wrote Gimuby, a fairly complete Ruby library to experiment with genetic algorithms and a variant called the island model. I kept a taste for that kind of thing.

The first Covid-19 lockdown gave me the opportunity to try some live coding, so if you wish to see 3 hours of live programming around LAHC, be my guest. If you need a Ruby library, the product of that session, Lahcuby, is available.

## Academics

The algorithm was introduced by Burke and Bykov in 2016. I did not run any benchmark, but the paper claims state-of-the-art performance compared to similar methods. Yet this is not what I find most interesting here, even though nobody would have mentioned the algorithm without it.

If you consider a random problem and a random metaheuristic algorithm to solve it, there are many parameters to consider: first how you represent the problem and its potential solutions, but also how you set the different parameters of your algorithm, and those two things interact with one another. For example, the way you represent the itinerary of a traveling salesman, how you mix two itineraries into a third one, how you select which itineraries to keep, and how and how often you randomly change an itinerary. All those things have an impact on one another when you are trying to get the best possible result. In other words, fine-tuning your optimization algorithm is an optimization problem in itself.

Your next T-shirt should say:

> I used metaheuristics on my optimization problem, and all I got was this lousy hyperparameter optimization problem

The cool thing with LAHC is its simplicity: there is a single parameter. Your mileage may not vary, and you may find consistency in your results. I do not believe in silver bullets, but this might be as close as you can get.

## How it works

Let's start with a simple gradient descent algorithm:

```
solution = get_initial_solution()
cost = get_cost(solution)
iteration_count = 0
max_iteration_count = 1000
while iteration_count < max_iteration_count:
    iteration_count += 1
    alternative_solution = get_variation(solution)
    alternative_cost = get_cost(alternative_solution)
    if alternative_cost >= cost:
        continue
    # We have found a better solution
    solution = alternative_solution
    cost = alternative_cost
```

I am starting with this because LAHC is very close to a gradient descent, but it should not get stuck in local minima. By the way, "gradient descent" is just another way to say "hill climbing".
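To make the pseudocode concrete, here is a minimal runnable sketch in Python. The toy problem (minimizing the squared distance to a hidden target vector), the seed, and all the helper names are my own illustration, not something from the paper:

```python
import random

random.seed(42)  # only so this sketch is reproducible

# Toy problem: minimize the squared distance to a hidden target vector.
TARGET = [3.0, -1.0, 7.0]

def get_initial_solution():
    return [random.uniform(-10.0, 10.0) for _ in TARGET]

def get_cost(solution):
    return sum((x - t) ** 2 for x, t in zip(solution, TARGET))

def get_variation(solution):
    # Nudge one random coordinate: a tiny, problem-specific move operator.
    variant = list(solution)
    index = random.randrange(len(variant))
    variant[index] += random.uniform(-1.0, 1.0)
    return variant

solution = get_initial_solution()
cost = get_cost(solution)
for _ in range(1000):
    alternative_solution = get_variation(solution)
    alternative_cost = get_cost(alternative_solution)
    if alternative_cost < cost:  # keep only strict improvements
        solution = alternative_solution
        cost = alternative_cost
```

On a smooth convex toy problem like this one, plain hill climbing is enough; the point of LAHC, below, is to also survive rugged cost landscapes.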

Here are the two differences:

- Instead of keeping only a solution, the algorithm keeps a solution plus a history of the previously encountered costs, which we will call the `Memory`.
- Instead of accepting an alternative solution only if it is better than the current one, the alternative solution is also accepted if it is better than the cost at the current `Memory Index`. When such a case occurs, the alternative solution's cost gets stored in the `Memory`.

All the rest is housekeeping:

- Initializing the `Memory`: the authors set it to the initial cost. My personal view differs; it could be seeded with the costs of random solutions to lessen the risk of a bad initialization. I keep the authors' approach in the implementation below to show a simpler version.
- Rotating the `Memory Index`.
- Deciding when to end the loop: the authors suggest setting a minimum number of iterations and then waiting for enough idle iterations, i.e. iterations where the current solution does not get updated. To keep things simple, I just set a maximum number of iterations here.

Note that the only parameter here is the `Memory Size`.

```
memory_size = 30  # The only algorithm parameter
solution = get_initial_solution()
cost = get_cost(solution)
cost_memory = [cost] * memory_size  # Memory initialization
iteration_count = 0
max_iteration_count = 1000
while iteration_count < max_iteration_count:  # Exit condition
    iteration_count += 1
    alternative_solution = get_variation(solution)
    alternative_cost = get_cost(alternative_solution)
    better_than_solution = alternative_cost < cost
    memory_index = iteration_count % memory_size
    current_memorized_cost = cost_memory[memory_index]
    better_than_memory = alternative_cost < current_memorized_cost
    if better_than_solution or better_than_memory:
        # We have found an acceptable solution
        solution = alternative_solution
        cost = alternative_cost
        if better_than_memory:
            # We update the memory, which will constrain the algorithm to
            # finer-grained solutions over time
            cost_memory[memory_index] = alternative_cost
```
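Wrapped as a function and plugged into a toy problem, the loop above runs as-is. The problem (minimizing the squared distance to a hidden target vector), the seed, and the helper names are my own illustration, not from the paper:

```python
import random

random.seed(7)  # only so this sketch is reproducible

# Toy problem: minimize the squared distance to a hidden target vector.
TARGET = [3.0, -1.0, 7.0]

def get_initial_solution():
    return [random.uniform(-10.0, 10.0) for _ in TARGET]

def get_cost(solution):
    return sum((x - t) ** 2 for x, t in zip(solution, TARGET))

def get_variation(solution):
    # Nudge one random coordinate.
    variant = list(solution)
    index = random.randrange(len(variant))
    variant[index] += random.uniform(-1.0, 1.0)
    return variant

def lahc(memory_size=30, max_iteration_count=1000):
    solution = get_initial_solution()
    cost = get_cost(solution)
    cost_memory = [cost] * memory_size  # Memory seeded with the initial cost
    for iteration_count in range(1, max_iteration_count + 1):
        alternative_solution = get_variation(solution)
        alternative_cost = get_cost(alternative_solution)
        memory_index = iteration_count % memory_size  # rotate the Memory Index
        better_than_memory = alternative_cost < cost_memory[memory_index]
        if alternative_cost < cost or better_than_memory:
            solution = alternative_solution
            cost = alternative_cost
            if better_than_memory:
                cost_memory[memory_index] = alternative_cost
    return solution, cost

final_solution, final_cost = lahc()
```

Swapping in your own `get_initial_solution`, `get_cost`, and `get_variation` is all it takes to apply it to another problem.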

My main criticism of LAHC is its name, Late Acceptance Hill Climbing. The acceptance is not late.

The benchmark against which a solution is accepted is lagged, or large, meaning there is more than one reference point. Would it be better named LaBeHC, Lagged/Large Benchmark Hill Climbing?

## Benefits

### Low tuning required

As discussed above, depending on the problem and on the implementation, the logic behind obtaining a solution variation often impacts the optimal configuration of hyperparameters. Here, the only hyperparameter, `Memory Size`, is enough to allow more changes to the solution and thus a better exploration of the solution space. Over time, no matter how big the `Memory Size` is, the algorithm will progressively switch from exploration to exploitation to refine the best solution found.
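Illustrative numbers (my own, not from the paper) make this exploration-to-exploitation shift visible in the acceptance rule itself:

```python
# Early in a run, the memory still holds the high initial cost, so a
# worse-than-current candidate can still be accepted: exploration.
cost_memory = [100.0] * 3   # memory seeded with the initial cost
current_cost = 40.0
candidate_cost = 60.0       # worse than the current solution...
memory_index = 0
accepted = candidate_cost < current_cost or candidate_cost < cost_memory[memory_index]
print(accepted)  # True: the candidate beats the memorized cost

# Later, the memory has been overwritten with low costs, and the very same
# move is rejected: the algorithm has switched to exploitation.
cost_memory = [41.0, 42.0, 43.0]
accepted = candidate_cost < current_cost or candidate_cost < cost_memory[memory_index]
print(accepted)  # False
```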

### Low entry barrier

The presented LAHC implementation is only about 25 extra lines compared to a gradient descent. It is probably about the same size as a grid search, but it should provide much better results.

This means that, unlike most metaheuristics, you can implement it inside your program in a couple of hours regardless of your technical stack.

### Same benefits as other metaheuristics

I did not put the results of the original paper to the test, but the approach is respected and I think it deserves due credit. In my experience, I have replaced both gradient descent and grid search with LAHC and observed great results in both cases... one major counterexample being the live programming session - the curse of the live demo.