Deep learning (DL) keeps growing and pushing the boundaries of where AI is going, and compute is expanding to keep up with the complexity of these models. With expanded compute comes expanded deployment in production, and deploying these models at scale is a complicated process.
A million seconds is roughly 12 days; a billion seconds takes 31 years to go by. The jump from a million to a billion is so large that it completely changes the scope of anything it describes. Now consider the hardware demands of a machine learning model with millions of parameters versus billions.
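The arithmetic behind that analogy is easy to check; the short Python snippet below is just a sanity check of the scale gap.

```python
SECONDS_PER_DAY = 60 * 60 * 24              # 86,400 seconds in a day
SECONDS_PER_YEAR = SECONDS_PER_DAY * 365.25

print(f"{1_000_000 / SECONDS_PER_DAY:.1f} days")        # ~11.6 days, roughly 12
print(f"{1_000_000_000 / SECONDS_PER_YEAR:.1f} years")  # ~31.7 years
```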
“Building chips is actually pretty hard,” Chetan Kapoor, Amazon Web Services’ director of product management for Amazon EC2, told The New Stack.
It is a challenge the cloud giant has invested in.
Amazon Web Services’ recently launched AWS Inf2 instances can handle ML models with up to 175 billion parameters in production at scale, with 4x higher throughput and 10x lower latency than the company’s previous offering.
AWS debuted Inf2 in preview at AWS re:Invent in December. The instances belong to a larger family of EC2 instances powered by Inferentia accelerators, silicon built specifically to support the needs of deep learning models. Inf2 offers higher throughput and lower latency than its predecessor.
And although a single instance can handle a model with hundreds of billions of parameters, multiple machines can work concurrently to serve even larger models.
Billions of Parameters
The problem was that previously existing hardware was no longer meeting the new demands.
“These deep learning models were exploding in size, going from being a few million parameters to billions of parameters. You need a lot of computing to actually train these models. Many of our customers were saying that it’s getting to a point where it was too expensive,” Kapoor said. The best way for AWS to solve that problem was to build super-specialized silicon for deep learning models. That was the beginning of the company’s journey roughly five years ago.
AWS’s Inferentia accelerators were built to bridge that gap. “It’s specifically designed for running these crazy large deep learning models, because the alternative was for many of these customers to use training platforms that host these models, and that is just too expensive for them to run at scale,” Kapoor said.
Inf2 leverages a brand-new silicon chip called Inferentia2, the second generation of the accelerator, and can hold models with up to 175 billion parameters. Compared with the first-generation Inf1 instances, it offers 4x the throughput and 10x lower latency. It’s designed for running DL models in production.
Inf2 was designed for massive scale, but the chip itself couldn’t be massive in size. That constraint pushed AWS engineers to think at the platform level.
Of the process, Kapoor said, “[We] had to take a step back and say, well, we can’t really go down the path of building a massive chip to hold the model because it’s not going to scale well for our customers or enable us to deliver the cost improvements that our customers are looking for.”
Inf2 is actually a system of 12 chips in one server, interconnected with a dedicated fabric and working concurrently.
Because Inf2 is a 12-chip system, the design process required some reverse engineering and future forecasting. One question the team had to assess was “how much power we can pack in on a per chip basis,” and the other was “projecting what [customers’] needs are going to look like going on,” Kapoor explained.
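A back-of-the-envelope calculation shows why a single monolithic chip was never a realistic path. Assuming 2-byte (16-bit) weights and ignoring activations and other runtime overhead (assumptions made here for illustration, not figures AWS has published), a 175-billion-parameter model needs on the order of 350 GB just for its weights, which a 12-chip system can split into roughly 30 GB per chip.

```python
# Rough memory budget for a 175-billion-parameter model, assuming 2-byte
# (16-bit) weights and ignoring activations, caches and other overhead.
params = 175e9
bytes_per_param = 2        # assumption: fp16/bf16 weights
chips = 12                 # Inf2's 12-chip server, per the article

total_gb = params * bytes_per_param / 1e9
print(f"weights alone: {total_gb:.0f} GB")                      # ~350 GB
print(f"per chip if split evenly: {total_gb / chips:.0f} GB")   # ~29 GB
```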
The Customer’s Own Machine
The hardware is impactful, even if quietly so. One use case Kapoor expanded on was recommendation engines: ever-present in daily life but often an afterthought, if even a thought at all. He offered two examples with different kinds of impact, Pinterest and Grammarly. In the Pinterest example, a user sees a lamp, clicks it, and is taken to the e-commerce store to purchase it. Grammarly is a real-time grammar editor. Both operate at large or massive scale and in real time on the AWS cloud, and both depend on huge ML models and the compute that supports them.
Inf2 instances are allocated on a per-customer basis: if someone is using that machine, it’s their machine. An instance can hold one model or multiple models, since in many cases only one model fits or, on the other hand, an AI system requires multiple smaller models to run. Consider a voice assistant: for it to provide an engaging interactive experience, multiple models need to be trained (speech recognition, inference, natural language understanding, etc.). And AWS won’t add other customers to a machine even if there is additional compute capacity available.
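As a rough sketch of what running multiple smaller models on one instance can look like, the snippet below compiles a toy PyTorch model with the AWS Neuron SDK’s torch_neuronx.trace and loads the compiled artifact at serving time. The IntentClassifier class, the file name and the pipeline framing are hypothetical, and the code assumes an Inf2 host with the Neuron SDK installed; it is an illustration of the workflow, not AWS’s reference implementation.

```python
import torch
import torch.nn as nn
import torch_neuronx  # AWS Neuron SDK PyTorch integration; assumed installed on the Inf2 host

# Toy stand-in for one of the smaller models in a voice-assistant pipeline
# (e.g. an intent classifier). Real production models would be far larger.
class IntentClassifier(nn.Module):
    def __init__(self, hidden=256, intents=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, intents),
        )

    def forward(self, x):
        return self.net(x)

model = IntentClassifier().eval()
example = torch.rand(1, 256)

# Compile the model ahead of time for the instance's NeuronCores, then save
# the compiled artifact so it can be loaded at serving time.
neuron_model = torch_neuronx.trace(model, example)
torch.jit.save(neuron_model, "intent_classifier_neuron.pt")

# At serving time, several independently compiled models (speech recognition,
# natural language understanding and so on) can be loaded side by side on the
# same customer-dedicated instance and chained into one pipeline.
intent_model = torch.jit.load("intent_classifier_neuron.pt")
scores = intent_model(torch.rand(1, 256))
```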
Of the future, Kapoor said, “there is a definite understanding that these models will continue to grow,” but he is confident that Inf2 is fully capable of handling the immediate needs ahead. “If [a model] doesn’t fit inside a single machine… they have the ability to expand outside of that machinery. Multiple machines can run these models concurrently,” he explained, while noting that Inf2 is only a second-generation chip.