Since Sumu 1.1.3 was released with some important bug fixes, I've been working on optimization. The release notes just say "- many optimizations to app framework and DSP" but I know some of you will be interested in a deeper dive. This is written for people who are interested in programming. So while I'm not assuming a great amount of specialized info, I'll throw around terms like "compile time" and "runtime." If you want to know what anything means, you can ask a search engine or even me, in the comments! Let's dig in.
With the 1.1.3 version released October 1, I made an eight-voice patch that uses the filter and automatically makes sound, as my rough benchmark. I also set up a document in the app I use for much of my testing, the AudioPluginHost that comes with JUCE. I use this because it launches so fast: just a few seconds, compared with maybe 20 seconds for the major DAWs. Looking at Activity Monitor in macOS, the CPU usage for my AudioPluginHost doc was 63%, plus or minus 0.5%. A rough number, but good enough to see the big changes I'm making now, and when I don't need precision, I usually do what's quickest!
This 63% measures against one CPU core, of which most of our machines have around 8. Still, very very heavy CPU use for a softsynth. Let's do something about it. Using Apple's Time Profiler developer tool, I made a list of the biggest CPU time users in the code. While Activity Monitor is precise enough to check that I'm making forward progress, I use the profiler to identify which parts of the code are the hotspots. Let's save some cycles!
Optimization 1, hashed symbol rewrite in madronalib
My plugin framework has a lot of code based on an internal Symbol table. It's nice to write stuff like parameters["osc1/gain"] instead of parameters[kOsc1GainParamIndex], because otherwise you have to make the list of those index constants somewhere and keep it up to date. A possible downside of indexing everything with Symbols is that you can only catch spelling errors at runtime, not at compile time. But for me it's faster to bang out code this way and I love brevity, so it's a good tradeoff.
Version 1 of this Symbol code stored Symbols by index, in order of their creation. Since this can't be done at compile time, some work is needed at runtime to turn the string "osc1/gain" into a symbol index every time you want to access something by a symbol. I realized later that many of these lookups could happen at compile time instead (or in C++ terms, be constexpr) if I used a hashing function to generate the index. Using a 64-bit index means that hash collisions are astronomically unlikely. Also, the debug version of the software checks for collisions, so if that very unlikely thing happened, someone could just rename a variable.
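To make the idea concrete, here's a minimal sketch of compile-time string hashing using FNV-1a (the function name and the choice of hash here are my illustration; madronalib's real Symbol implementation differs):

```cpp
#include <cstdint>

// Hypothetical sketch of compile-time symbol hashing: FNV-1a over the
// bytes of the string. When the argument is a string literal, the
// compiler can evaluate the whole thing at compile time (constexpr).
constexpr uint64_t hashSymbol(const char* s)
{
  uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
  while (*s)
  {
    h ^= static_cast<uint64_t>(static_cast<unsigned char>(*s++));
    h *= 1099511628211ull;               // FNV-1a 64-bit prime
  }
  return h;
}

// The index is known before the program runs, so a lookup keyed by
// "osc1/gain" can resolve with zero runtime hashing.
static_assert(hashSymbol("osc1/gain") != hashSymbol("osc1/pan"),
              "distinct names should hash differently");
```

With 64-bit hashes the chance of two parameter names colliding is negligible, and a debug-build check can still catch the unlucky case.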
We still want to index some things by un-hashed Paths: file trees, for example. And there's a need to make Symbols on the fly sometimes, as when reading in config files. So after rewriting my SymbolTable and Symbol classes, I also had to rewrite my Path class, adding a GenericPath class that can be used to implement container classes like Tree, and Path and TextPath subclasses for the different use cases. It was a lot to think about. But after a couple of weeks of working on this part time, I had my much-improved replacement code for this fundamental part of my framework.
speedup: 63% -> 60%
Optimization 2, removed mostly unused i/o scale multiplies in Patcher
I had designed some flexibility into the DSP object that implements Sumu's central patch bay, in the form of individual scales for each input and output. It turned out that in the final design, most of these were set to 1.0 most of the time, so it was quicker to just multiply the few inputs and outputs that needed different values, as special cases. Of course multiplying things takes time so we like to avoid it.
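The pattern looks something like this (all names here are invented for illustration, not the actual Patcher code): instead of scaling every port on every sample, pass everything through unscaled and multiply only the special cases.

```cpp
// Before (sketch): every port scaled on every sample, even when scale == 1.0.
void processAllScaled(const float* in, float* out, const float* scales,
                      int nPorts, int nFrames)
{
  for (int p = 0; p < nPorts; ++p)
    for (int n = 0; n < nFrames; ++n)
      out[p * nFrames + n] = in[p * nFrames + n] * scales[p];
}

// After (sketch): copy signals through, then scale just the few ports
// that actually need a non-unity value, as special cases.
void processSpecialCased(const float* in, float* out, int nPorts, int nFrames,
                         const int* specialPorts, const float* specialScales,
                         int nSpecial)
{
  for (int i = 0; i < nPorts * nFrames; ++i) out[i] = in[i];
  for (int s = 0; s < nSpecial; ++s)
    for (int n = 0; n < nFrames; ++n)
      out[specialPorts[s] * nFrames + n] *= specialScales[s];
}
```

When nSpecial is much smaller than nPorts, most of the multiplies simply disappear.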
speedup: 60% -> 54%
Optimization 3, don't send published signals when there is no view
In order for signals to get from the DSP core to Sumu's GUI, for meters and displays, the concept of publishing is used. Sumu publishes signals and the GUI subscribes to them. I noticed that the code was doing some of the work for this publishing even when there was no window showing. Turning all the published-signal code off when there's no window was an obvious time-saver.
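The gating itself can be as simple as an early-out (a hypothetical sketch, not Sumu's actual code):

```cpp
// Hypothetical sketch of gating published signals on view state.
struct SignalPublisher
{
  bool viewAttached{false};  // set when the plugin window opens or closes
  int framesPublished{0};    // stand-in for the real work of copying
                             // signal data into a queue for the GUI

  void publish(const float* /*signal*/, int nFrames)
  {
    if (!viewAttached) return;  // no window: skip all publishing work
    framesPublished += nFrames;
  }
};
```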
speedup: 54% -> 47%
Optimization 4: optimized LadderFilter integration to reduce number of operations
One of the more expensive DSP components in Sumu is its nonlinear resonant Moog-style ladder filter. This is derived from work by D'Angelo and Välimäki. I love the onset of resonance in this filter model, the way it starts to oscillate just a bit in a way that depends on the input signal. It has a real liveness and sounds very full and clear.
The filter is four nonlinear stages with a feedback stage. I stared at my naive implementation, did some simple math and realized that by juggling the variables I could change the scaling of the signals running through the filter by a constant amount, and thereby save a multiply per stage. High school algebra for the win!
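The real ladder stages are nonlinear and stateful, so the actual derivation takes more care, but the flavor of the trick is plain algebra: when every stage scales by the same constant, the constants can be folded together and applied once.

```cpp
// Toy illustration of the rescaling idea (not the actual filter code).
// If each of four stages scales its signal by the same constant g, then
//   g*(g*(g*(g*x))) == (g*g*g*g)*x,
// so one multiply by a precomputed g4 replaces four per-stage multiplies.
float fourStagesNaive(float x, float g)
{
  for (int i = 0; i < 4; ++i) x *= g;  // one multiply per stage
  return x;
}

float fourStagesRescaled(float x, float g4)  // g4 = g*g*g*g, precomputed
{
  return x * g4;                       // one multiply total
}
```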
speedup: 47% -> 45%
Optimization 5: LadderFilter() 4x object
Because of its four stages, the Moog filter model is a bit slow: each stage depends on the result of the previous stage, and the values must be calculated one after another for each sample of audio. If we consider one of these filters by itself, SIMD (SSE / AVX / NEON) is not immediately helpful, because while SIMD can do four multiplies at once, the values can't depend on each other.
Where SIMD does help us out is in running four or more of the filters at once. By changing the filter code to operate on groups of four samples rather than single samples, we can make a 4x filter bank that runs in only a little more time than the single filter. Because the inputs and outputs are arranged [a1, b1, c1, d1], [a2, b2, c2, d2], ... in memory, we can think of these as vertical filter banks, operating on four columns of signal in contrast to the horizontal single filter.
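Here's the vertical idea with a much simpler filter than the ladder: four independent one-pole lowpasses advanced in lockstep over interleaved samples. (This is my scalar sketch; the inner loop maps directly onto one SIMD register, whether through compiler auto-vectorization or explicit SSE/NEON intrinsics.)

```cpp
// Four independent one-pole lowpass filters advanced in lockstep.
// Samples are interleaved [a1,b1,c1,d1],[a2,b2,c2,d2],... so each
// group of four is one "row" across the four vertical filter columns.
void onePoleBank4(const float* in, float* out, float state[4],
                  float coeff, int nFrames)
{
  for (int n = 0; n < nFrames; ++n)
  {
    for (int k = 0; k < 4; ++k)            // the four lanes / columns
    {
      float x = in[n * 4 + k];
      state[k] += coeff * (x - state[k]);  // y += c*(x - y)
      out[n * 4 + k] = state[k];
    }
  }
}
```

Each lane carries its own state, so the data dependency is down a column, never across lanes: exactly the shape SIMD wants.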
Sumu has two of the filters in each voice, applied to the left and right outputs. So if we have more than two voices of Sumu, we need to calculate four or more filters anyway. When we have eight whole voices and 16 filters, the savings are big, around 10% of our remaining total CPU!
speedup: 45% -> 41%
Optimization 6: try tanhApprox with div approx in LadderFilter
The nonlinearity in each stage of the filter is implemented using a tanh (hyperbolic tangent). This is an expensive operation, so we make do with an approximation. Picking a good one is as much art as science, because in the context of the filter, different approximations will give different sounds. For the single filter I had already picked an approximation based on a simple ratio of polynomials: y = x(27 + x^2)/(27 + 9x^2). A deep understanding of why this simple ratio is such a good match for our transcendental function eludes me, but as we've established, I'm good at simple math and timing things.
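For reference, the approximation itself is tiny. One caveat worth a comment: unlike the real tanh, it doesn't saturate, so inputs need to stay in a moderate range.

```cpp
#include <cmath>  // std::tanh, for comparison only

// y = x*(27 + x^2) / (27 + 9*x^2): a simple rational approximation of
// tanh. It tracks tanh closely for |x| up to about 3; beyond that it
// grows like x/9 instead of flattening out, so the signal feeding it
// has to stay in range.
float tanhApprox(float x)
{
  const float x2 = x * x;
  return x * (27.0f + x2) / (27.0f + 9.0f * x2);
}
```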
We were already using the above approximation in our single horizontal filter. With SIMD in the mix, though, there's an appealing idea in the form of the SSE function _mm_rcp_ps. Divides are among the most CPU-intensive operations, and our approximation unavoidably contains one. But _mm_rcp_ps is a reciprocal approximation with the potential to run much faster than the full divide. It uses a lookup table internally to produce a roughly 12-bit accurate result in around half the time of the more accurate division.
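The shape of that experiment in SSE intrinsics looks like this (x86-specific, and my sketch of the technique rather than Sumu's actual filter code). A single Newton-Raphson step is the standard way to claw back precision from the rough estimate when you need it:

```cpp
#include <xmmintrin.h>  // SSE: _mm_rcp_ps and friends

// Approximate four divides at once using the hardware reciprocal
// estimate. The raw _mm_rcp_ps result is coarse; one Newton-Raphson
// step, r' = r * (2 - d*r), roughly doubles the accurate bits.
__m128 divApprox(__m128 num, __m128 den)
{
  __m128 r = _mm_rcp_ps(den);  // rough 1/den from a lookup table
  r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(den, r)));
  return _mm_mul_ps(num, r);   // num * (1/den)
}
```

Note the tradeoff baked into this sketch: the refinement step spends two multiplies and a subtract, which is part of why the approximate divide isn't free.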
Now when you introduce any approximation, there are going to be changes in the output, possibly audible. This did not apply to the 1x -> 4x filter bank changes because the SIMD values are all still full-precision 32-bit floating point. But the divide approximation would change the feedback path of the filter to have less resolution. It might not sound different, but it might. Fortunately, this was a very quick change to try out.
And the somewhat surprising result: it wasn't any faster! This isn't too hard to explain: along with the reciprocal estimate, you still have to do a multiply to get the divide estimate. And even though we are doing four filters at once, each stage still has data dependencies. I haven't used a timing simulator to do a deep dive into this (llvm-mca, which you can run on godbolt.org, is the one I would try), but my guess is that the expensive divisions all fit into time that was spent in each stage waiting for previous values anyway. So the result:
speedup: none
So in around ten weeks we've gone from using 63% of our CPU for eight voices to 41%. This is around a 35% speedup, and IMO the difference between "what are they smoking" and U-He Diva territory. And, it's only the start! More optimizations remain. But now is a good time to make a release and share this work. It's there for you in Sumu 1.2.0. Enjoy the sounds!