I’ve been fascinated by dictation software for many years. Speaking has always struck me as the most natural interface. It also happens to be at least 3 times faster than typing.

When OpenAI released their speech-recognition model Whisper, I was keen to try it.

So I installed it on my MacBook Pro.

Over the last few weeks, I’ve been using it. As expected, it’s sensational. In fact, this entire blog post was dictated using Whisper.
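For the curious, getting it running is only a few lines of Python with the open-source whisper package. This is a minimal sketch, assuming the package is installed; the model size & file name are placeholders.

```python
# Minimal sketch of local transcription with the open-source openai-whisper package.
import whisper

# "base" is one of the smaller checkpoints; "medium" & "large" need far more memory.
model = whisper.load_model("base")

# Transcribe a local audio file (placeholder name) & print the recognized text.
result = model.transcribe("dictation.wav")
print(result["text"])
```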

The major advantage of using one of these large models is that it recognizes words in context in ways older systems struggle to match.

For example, saying the words ‘three hundred million’ might produce 300,000,000 or 3 hundred million or three hundred million. I know which one I want to see when I speak it, but it’s hard for the computer to know.

A word that sounds like ‘SaaS’ might be SAS, the software company; SaaS, as in Software as a Service; or sass, meaning someone who is cheeky or rude. Context is needed to disambiguate.

The better a computer can predict which one is the right one, the more natural the interaction becomes, keeping the user in the flow.

But these models require huge amounts of hardware to run. My MacBook Pro has 64GB of RAM & uses one of the most powerful Mac GPUs, the M1 Max. Even using the smallest model, the computer can struggle to manage memory & Whisper crashes frequently.

I wondered how much slower Mac hardware is compared to Nvidia’s. While benchmarks are often fraught with nuance, the consensus among testers is that Nvidia is about 3x faster at running these models. Apple optimizes its chips for power consumption while Nvidia opts for raw performance.

In addition, many of the core machine learning libraries have not yet been rewritten to run natively on Apple silicon.

Setting the hardware concerns aside, LLMs will transform dictation software.

The big question is how to deploy them. These models require significant horsepower to run, which implies a few options:

  1. Models will be compressed at the expense of quality
  2. Phones & computers will need to become significantly faster to run them locally
  3. Models will be run predominantly in the cloud where memory is abundant, at the expense of network latency
  4. Software will evolve to have a hybrid architecture where some audio is processed on the computer & some in the cloud

My bet is on the fourth. This hybrid architecture will allow some usage when not connected to the Internet, but take advantage of the cloud when available.

In addition, it reduces the serving cost for the provider. The user’s computer does part of the work, which is free to the SaaS dictation software company, improving margins. With inference likely the vast majority of a vendor’s cloud costs, this hybrid architecture will be a key competitive advantage.
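As a rough illustration of what that hybrid routing could look like, here is a minimal sketch: a small on-device Whisper model handles audio when offline, & a cloud service handles it when a connection is available. The cloud endpoint URL is hypothetical & the connectivity check is deliberately crude.

```python
# Sketch of the hybrid routing idea, not anyone's production design.
# The cloud endpoint is hypothetical; the local model uses the open-source openai-whisper package.
import socket

import requests
import whisper

LOCAL_MODEL = whisper.load_model("tiny")                # small enough to run on-device
CLOUD_ENDPOINT = "https://api.example.com/transcribe"   # hypothetical cloud service


def is_online(host: str = "8.8.8.8", port: int = 53, timeout: float = 1.0) -> bool:
    """Crude connectivity check: can we open a socket to a public DNS server?"""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False


def transcribe(path: str) -> str:
    """Send audio to the cloud when connected; fall back to the local model otherwise."""
    if is_online():
        with open(path, "rb") as audio:
            response = requests.post(CLOUD_ENDPOINT, files={"file": audio}, timeout=30)
        response.raise_for_status()
        return response.json()["text"]
    # Offline: accept lower quality from the compressed on-device model.
    return LOCAL_MODEL.transcribe(path)["text"]


if __name__ == "__main__":
    print(transcribe("dictation.wav"))
```

In practice the routing decision would also weigh latency, audio length & battery, but the split is the same: free local compute when it’s good enough, cloud quality when it’s reachable.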