10 Sep 2019 - The Dark Side

Wandering In Limbo Implementing Financial Models

tl;dr: this is a recount of the nightmare I experienced trying to find a faster way to implement a financial model. I think it gives a good overview of how software engineering can run amok starting from what seem like sensible choices.

We went from a model running in 4 minutes to around 30 seconds; the remaining processing time is still noticeable.


When I give time estimates in this article, they must be understood as very fuzzy: my time was never fully dedicated to this single problem, and since this was more about targets than hours to bill, I never had an elaborate time tracking system while working on it.

Everything starts with Excel

Spreadsheet programs are, in my humble opinion, one of the most undervalued contributions to our modern society. They enable people to build complex models extremely fast with virtually no programming knowledge. A lot of financial models, before existing anywhere in software, are first built in Excel, often without a single line of VBA.

Basically, once your model exists in Excel, it is already implemented.

Reimplementing a model from Excel takes a lot of time, roughly twice or thrice the time of building the Excel version itself, with a lot less flexibility and plenty of room for implementation errors (but perhaps some more reliable tests).

In our case, we were talking about a raw estimate of 20 to 40 man-days.

Here comes PhpSpreadsheet

So instead of reimplementing Excel, you might wonder whether some crazy person has not already done it, and it happens that PhpSpreadsheet has. Here, things could have been worse: PHP is our core backend language, and PhpSpreadsheet seems to be the best Excel / spreadsheet execution engine on the market.

We had used this library in the past for another model, and it worked fine. Fine but slow, by a factor wildly estimated to be between 100 and 1000 compared to an actual reimplementation. We are talking about end users perceiving about 10 seconds of latency when using a web interface.

A bit of caching here and there brought this down to 7 seconds. Not perfect, but a trade-off we accepted for that model.

Bug… bugs!

The plan seemed flawless until bugs started to appear, in one of two ways:

  • A random error was raised during execution
  • The execution went fine, but it was only when checking the results that you realized something was wrong.

First big question mark hanging over the schedule: the time needed to fix those bugs inside PhpSpreadsheet is a known unknown.

The biggest thing to note here is that it still felt like fixing the bugs was likely to take less time than reimplementing the model. Perhaps about 10 man-days.

The unknown unknown we should have been aware of is that other problems would probably arise.

A few things to note here:

  • PhpSpreadsheet works by first parsing a cell formula into a stack-like structure, and then processing the stack. The parsing and stack-processing core functions are each about 1,000 lines of code.
  • The implementations of specific functions like SUMIF or STDDEV live in separate files. Navigating those is easier.
  • If you want to understand (here, debug) why a specific result is produced, you need to dig deep to find where exactly something is not behaving as it should. Then you need to check where exactly the implementation differs from the behavior expected by your spreadsheet program.

A very tedious task.

Side note: some bugs arose because Excel considers empty strings to be 0. The solution was to be more explicit in our model file, as fixing this in the library would have meant digging into every single arithmetic operation to handle this specific case.

Slow

When you do web application development, you are pretty happy when your page is fully delivered and rendered in under a second, and under 300 ms if we are talking about the purely server-side processing.

Running the full sheet calculation was taking about 3 to 4 minutes. This is the internet; nobody waits 4 minutes.

Note that knowing the exact execution time before fixing the bugs mentioned above was not possible; we only had a (very low) lower bound.

To come back to PhpSpreadsheet, the function resolution core works very simply. For a formula like:

=SOMEFUNCTION(A1;B1;C1)
  1. Resolve A1
  2. Resolve B1
  3. Resolve C1
  4. Resolve (invoke) SOMEFUNCTION with the already-resolved parameters. Note that resolving A1, B1 & C1 is done by the core of the engine, not by the specific SOMEFUNCTION implementation.

It works fine and it is easy to understand.
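
The eager resolution described above can be sketched in plain PHP. The helper names below are hypothetical and grossly simplified compared to PhpSpreadsheet's actual internals; the point is only the order of operations:

```php
<?php
// Hypothetical, simplified sketch of eager argument resolution:
// every argument is resolved before the function itself is invoked.

function resolveCell(array $cells, string $ref)
{
    // In the real engine this recursively evaluates the cell's formula.
    return $cells[$ref];
}

function invokeFunction(string $name, array $cells, array $argRefs)
{
    // Steps 1-3: resolve every argument up front...
    $args = array_map(fn ($ref) => resolveCell($cells, $ref), $argRefs);

    // Step 4: ...then invoke the function on the resolved values.
    return match ($name) {
        'SUM' => array_sum($args),
        'IF'  => $args[0] ? $args[1] : $args[2], // B1 AND C1 were both resolved!
    };
}

$cells = ['A1' => true, 'B1' => 10, 'C1' => 20];
var_dump(invokeFunction('IF', $cells, ['A1', 'B1', 'C1'])); // int(10)
```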

Now consider:

=IF(A1;B1;C1)

Suddenly, if A1 is true (respectively false), it does not make sense to resolve C1 (respectively B1). If your model has 3 such branching points, you are resolving 2^3 = 8 branches instead of 1. This was costing us a lot of performance.

Fortunately for what comes later, the condition, here A1, is resolved before the rest.

As mentioned in step 4 above, pruning those useless branches cannot be done in the IF implementation but must happen in the core functions… yup, the two that are about 1,000 lines each.

Perhaps some irrelevant details, but in the final implementation every parsed token can be:

  • Tagged with a key if it is a condition; we then store its result under that key somewhere.
  • Tagged with an "only if" / "only if not" reference to a condition if it sits inside a branch. We simply skip its execution if the relevant condition is not fulfilled.
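
A minimal sketch of that pruning idea, with hypothetical token and key names (the real change lives in the commit below):

```php
<?php
// Hypothetical sketch of branch pruning: tokens carry an optional
// 'onlyIf' / 'onlyIfNot' tag referencing a stored condition result.

function shouldExecute(array $token, array $conditionResults): bool
{
    if (isset($token['onlyIf']) && !$conditionResults[$token['onlyIf']]) {
        return false; // in the TRUE branch, but the condition was FALSE
    }
    if (isset($token['onlyIfNot']) && $conditionResults[$token['onlyIfNot']]) {
        return false; // in the FALSE branch, but the condition was TRUE
    }
    return true;
}

// =IF(A1;B1;C1) with A1 = TRUE: only the B1 token should run.
// A1 is resolved first, so its result is available under its key.
$conditionResults = ['cond1' => true];

$tokens = [
    ['ref' => 'B1', 'onlyIf'    => 'cond1'],
    ['ref' => 'C1', 'onlyIfNot' => 'cond1'],
];

$executed = array_filter($tokens, fn ($t) => shouldExecute($t, $conditionResults));
// Only B1 survives; C1's whole sub-tree is never resolved.
```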

Relevant commit:

https://github.com/PHPOffice/PhpSpreadsheet/commit/0b387e767e99a9042886ea673d9b9905c7951e87

Getting this done took, I think, about 15 man-days.

Note that other functions also imply branching, e.g. SUMIF; this change did not impact their execution, and they are much harder to optimize.

Still too slow: Amazon Lambda to the rescue

We were down to about 90 seconds on our benchmark. Better, but not perfect.

Fortunately for us, the model is mostly about comparing 4 different options. All options are independent of one another, and so are most of the calculations behind them. Parallel processing was an option.

Multi-processing in PHP is generally not done, and it could also have created an important load on our server(s).

Usually, when you build something web-based in Luxembourg, you need one small server, maybe two for redundancy. It would have been stupid to suddenly blow up our server capacity just to be able to process peak requests and stay idle 99% of the time.

Amazon Lambda is a service that lets you upload some code and have it called under certain conditions. In theory, you do not need to worry about any other technical detail: number of instances, scaling up, scaling down, input, output.

In theory, Amazon Lambda is a solution made for exactly that. For every request, we write our 4 intermediate computation requests to a queue and read the responses from another queue. Easy, right?
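
That fan-out / fan-in can be sketched with the AWS SDK for PHP; the queue URLs, region, option names and payload shape below are hypothetical placeholders, not our actual setup:

```php
<?php
require 'vendor/autoload.php';

use Aws\Sqs\SqsClient;

$sqs = new SqsClient(['region' => 'eu-west-1', 'version' => 'latest']);

// Hypothetical queue URLs and payload shape.
$requestQueue  = 'https://sqs.eu-west-1.amazonaws.com/123456789012/model-requests';
$responseQueue = 'https://sqs.eu-west-1.amazonaws.com/123456789012/model-responses';
$requestId     = uniqid('model_', true);

// Fan out: one message per option to compute.
foreach (['option1', 'option2', 'option3', 'option4'] as $option) {
    $sqs->sendMessage([
        'QueueUrl'    => $requestQueue,
        'MessageBody' => json_encode(['requestId' => $requestId, 'option' => $option]),
    ]);
}

// Fan in: poll the response queue until the 4 results are back.
$results = [];
while (count($results) < 4) {
    $received = $sqs->receiveMessage([
        'QueueUrl'        => $responseQueue,
        'WaitTimeSeconds' => 5, // long polling
    ]);
    foreach ($received->get('Messages') ?? [] as $message) {
        $results[] = json_decode($message['Body'], true);
        $sqs->deleteMessage([
            'QueueUrl'      => $responseQueue,
            'ReceiptHandle' => $message['ReceiptHandle'],
        ]);
    }
}
```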

In terms of implementation delay, those technologies are always sold as Plug and Play™ … and they never are. You spend a good amount of time writing a proof of concept and always find new problems when getting to the real thing… Let's say 5 man-days?

A note on Lambda functions' tight coupling

Strangely enough, Amazon Lambda is designed to be easily triggered in a lot of different ways: Amazon API Gateway, SQS queues, manual / CLI invocation, the AWS SDK in various languages…

While you can trigger the function by all those means, the input you receive will largely depend on what invokes it.

And although triggering is flexible, there is no standard or easy way to dispatch responses. You can do virtually anything, but you need to implement it manually.

Lambda could be a lot better at reducing the coupling between the function implementation, the invocation method and the response dispatch.

Lambda does not support PHP (natively)

You need to write a custom Lambda runtime for your Lambda function. The Plug and Play™ experience now turns into reading a 10-page blog article about setting up a custom Lambda runtime for PHP.

The idea is that you basically provide your own PHP binary to the Lambda function, which will invoke your code with it. This PHP executable needs to be built in an Amazon Linux environment.
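
Concretely, a custom runtime is a `bootstrap` executable that loops forever over the Lambda runtime HTTP API: fetch the next invocation, run your handler, post the result back. A minimal PHP sketch (error handling omitted; `handleModelComputation()` is a hypothetical handler in our own code):

```php
<?php
// Minimal sketch of a custom runtime "bootstrap" loop over the Lambda
// runtime API. Deployed as the function's executable entry point.

function extractRequestId(string $header): ?string
{
    $prefix = 'Lambda-Runtime-Aws-Request-Id:';
    return stripos($header, $prefix) === 0
        ? trim(substr($header, strlen($prefix)))
        : null;
}

if (($api = getenv('AWS_LAMBDA_RUNTIME_API')) !== false) {
    while (true) {
        // 1. Block until Lambda hands us the next invocation.
        $requestId = null;
        $ch = curl_init("http://$api/2018-06-01/runtime/invocation/next");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $header) use (&$requestId) {
            $requestId = extractRequestId($header) ?? $requestId;
            return strlen($header);
        });
        $event = json_decode(curl_exec($ch), true);
        curl_close($ch);

        // 2. Run the actual handler on the event payload (hypothetical).
        $result = handleModelComputation($event);

        // 3. Post the result back for this specific invocation.
        $ch = curl_init("http://$api/2018-06-01/runtime/invocation/$requestId/response");
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($result));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        curl_close($ch);
    }
}
```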

Considering the following problem, I think that PHP support (and custom runtimes in general) should not be advertised… it is only there to tick a box. It is merely a proof of concept.

Your custom runtime cannot be that custom

Once you see your PHP code happily running, you would be mistaken to think you have mastered the beast. PhpSpreadsheet requires a few extensions; even simply opening a .xlsx file needs zlib and libxml…

You think, "no problem, let's build PHP with zlib and libxml"? Foolish inspiration.

Because then your PHP binary would depend on extensions installed in the environment running the Lambda function… and I believe it is not possible to install them before runtime.

So, how to solve this problem?

Insane fix #1: what we want is not to actually read a spreadsheet in our Lambda function… we want to execute one, always the same one. So our deployment script takes the file, opens it through PhpSpreadsheet, serializes the relevant object and deploys it along with the code. Is it dirty? Yes, Ma'am! We have all done things in our lives we would prefer not to be judged upon.
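
The deploy-time trick could look like the sketch below (file names hypothetical). Note that serializing PhpSpreadsheet objects is exactly the kind of thing the library does not promise to support, which is part of what makes the fix insane:

```php
<?php
require 'vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\IOFactory;

// At deployment time, on a machine that has zlib and libxml:
// parse the .xlsx once and ship the serialized object with the code.
$spreadsheet = IOFactory::load('model.xlsx');
file_put_contents('model.serialized', serialize($spreadsheet));

// Inside the Lambda function: no zip/XML parsing needed anymore,
// so no zlib / libxml extensions required to load the model.
$spreadsheet = unserialize(file_get_contents('model.serialized'));
```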

I will skip the details about optimizing the Lambda function instances' memory size (which also happens to determine the available CPU power) and packing things into layers. But at this stage we were able to manually invoke the Lambda function (a command line call through the AWS CLI).

I guess reaching this point took another solid 5 to 8 man-days.

Segmentation fault (core dumped)

Only three things cannot be looked at directly: the sun, death and PHP's segmentation fault (core dumped). This lovely error message is mostly followed by nothing at all: no line number, no file… nothing. My understanding is that some underlying C library is misbehaving somewhere.

We got this error, not with every input but, by luck, it was reproducible with some inputs.

The only solution here is to pray that, armed with var_dump('hello');die; (there is no debugger inside Lambda), you will manage to identify which line is causing it. This is not guaranteed, as sometimes the debugging process itself triggers the error.

Luckily, the error was identified around the AWS PHP SDK, when trying to send the response to a queue.

Insane fix #2: package the AWS command line client (and all the python libraries it depends on) inside the Lambda function and call it instead of using the PHP SDK.
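
The workaround amounts to shelling out to the bundled CLI instead of calling the SDK. A sketch, where the CLI binary path and queue URL are hypothetical placeholders for wherever things were packed in our layers:

```php
<?php
// Hypothetical sketch: send the SQS response through the bundled
// AWS CLI instead of the (crashing) PHP SDK.

function buildSendMessageCommand(string $queueUrl, array $payload): string
{
    return sprintf(
        '/opt/awscli/aws sqs send-message --queue-url %s --message-body %s',
        escapeshellarg($queueUrl),
        escapeshellarg(json_encode($payload))
    );
}

$command = buildSendMessageCommand(
    'https://sqs.eu-west-1.amazonaws.com/123456789012/model-responses',
    ['requestId' => 'r1', 'result' => 42]
);
exec($command, $output, $exitCode); // shell out instead of using the SDK
```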

Perhaps 4 to 5 man-days lost there, between acceptance, diagnosis and realization.

Everything is fine?

At this stage, everything finally worked. We went down to about 25 seconds of processing time on our benchmark.

We were happy for a day or two… until we noticed that the function seemed slower when we were doing many invocations in parallel. Why?

This problem was a case of “There is no cloud but other people’s computers”.

The documentation about Lambda's scaling is not extremely developed, and we had been assuming that the scaling was magic. It is not.

Lambda functions are executed in instances, and those instances might or might not be reused when your function is invoked. Lambda can use an SQS queue as an input, but this seems to have been designed for typically asynchronous / non-critical use cases. In the documentation about the SQS integration lies this interesting paragraph:

When messages are available, Lambda reads up to 5 batches and sends them to your function. If messages are still available, Lambda increases the number of processes reading batches by up to 60 more instances per minute.

I understand the logic, when messages are found, to be the following:

  1. If we were already processing messages just before, increase the instance count
  2. If we have not, read X messages with a maximum of 5 and start X instances
  3. Process the messages
  4. End. Go back to 1 if there are still messages in the queue

If I have 13 messages in the queue, for example, the system will:

  1. Read 5 messages
  2. Start 5 instances
  3. Process them in about 25 seconds
  4. Check the queue, see new messages and start a 6th instance
  5. Read 6 messages
  6. Process them in about 25 seconds
  7. Check again… maybe start a 7th instance (even though it would be useless)?
  8. Read the last 2 messages
  9. Process them in 25 seconds

That is a minimum of 25 × 3 = 75 seconds for a computation that, with enough instances from the start, has a lower bound of 25 seconds.

The solution here was to unplug the Lambda function from the queue system and make asynchronous calls through the PHP SDK.
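
With the SDK for PHP, every operation has an Async variant returning a promise, so the 4 invocations can be fired in parallel and awaited together. A sketch, where the function name and payload shape are hypothetical:

```php
<?php
require 'vendor/autoload.php';

use Aws\Lambda\LambdaClient;

$lambda = new LambdaClient(['region' => 'eu-west-1', 'version' => 'latest']);

// Fire the 4 computations at once; each invokeAsync() returns a promise.
$promises = [];
foreach (['option1', 'option2', 'option3', 'option4'] as $option) {
    $promises[$option] = $lambda->invokeAsync([
        'FunctionName' => 'model-computation', // hypothetical name
        'Payload'      => json_encode(['option' => $option]),
    ]);
}

// Wait for everything: wall time is now roughly the slowest invocation
// instead of the sum of all of them.
$results = [];
foreach (\GuzzleHttp\Promise\unwrap($promises) as $option => $response) {
    $results[$option] = json_decode((string) $response->get('Payload'), true);
}
```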

Perhaps another 4 to 5 man-days lost there; again, observing and understanding often take more time than the actual development.

Conclusion

The solution that was supposed to ease our development process and save us 20 man-days ended up costing us months. Months of being engaged in a process, convinced that "this is the last problem to solve".

Proofreading this article, I think I largely underestimate the time each problem took, and I do not transcribe how pressing those problems were in my mind. When you have lost months and the solution that is supposed to be somewhat done and reliable starts throwing segmentation fault (core dumped) messages, you do not go home lighthearted. Most of those problems felt blocking and project-threatening at some point.

Having some experience in software, I cannot say that finding bugs in a library, or relying on the promises of the latest fancy tool, were not warnings. Even if the initial idea was probably a good one, should I have pulled the plug on this solution and simply gone back to the beaten path of a model reimplementation? If so, what should have triggered that decision? The combination of so many problems at the same time was not foreseeable, but a total absence of problems was not to be expected either.

I may write a follow-up article about how we tried to make this final 25 seconds of processing more bearable for the end user.




Fräntz Miccoli

This blog collects my notes about software engineering, computer science and sometimes more general hacking. My main interests are architecture, quality and machine learning, but content on this blog may diverge from time to time.

For some time now, I have been the happy cofounder, COO & CTO of Nexvia.

Ideas are expressed here to be challenged.

