Microsoft has announced updates to Bing’s search infrastructure incorporating large language models (LLMs), small language models (SLMs), and new optimization techniques.
This update aims to improve performance and reduce costs in search result delivery.
In an announcement, the company states:
“At Bing, we are always pushing the boundaries of search technology. Leveraging both Large Language Models (LLMs) and Small Language Models (SLMs) marks a significant milestone in enhancing our search capabilities. While transformer models have served us well, the growing complexity of search queries necessitated more powerful models.”
Using LLMs in search systems can create problems with speed and cost.
To address this, Bing trained SLMs, which it says deliver roughly 100 times the throughput of LLMs.
The announcement reads:
“LLMs can be expensive to serve and slow. To improve efficiency, we trained SLM models (~100x throughput improvement over LLM), which process and understand search queries more precisely.”
Bing also uses NVIDIA TensorRT-LLM to optimize SLM inference.
TensorRT-LLM is a library that reduces the time and cost of running large models on NVIDIA GPUs.
According to a technical report from Microsoft, integrating NVIDIA’s TensorRT-LLM technology has enhanced the company’s “Deep Search” feature.
Deep Search leverages SLMs in real time to provide relevant web results.
Before optimization, Bing’s original transformer model had a 95th percentile latency of 4.76 seconds per batch (20 queries) and a throughput of 4.2 queries per second per instance.
With TensorRT-LLM, the latency was reduced to 3.03 seconds per batch, and throughput increased to 6.6 queries per second per instance.
This represents a 36% reduction in latency; Microsoft also reports a 57% decrease in operational costs.
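The quoted percentages follow directly from the before-and-after figures. A quick arithmetic check, using only the numbers reported above:

```python
# Reported Bing figures, before vs. after TensorRT-LLM optimization.
baseline_latency_s = 4.76   # 95th percentile latency per batch (20 queries)
optimized_latency_s = 3.03
baseline_qps = 4.2          # queries per second per instance
optimized_qps = 6.6

# Relative change in each metric.
latency_reduction = 1 - optimized_latency_s / baseline_latency_s
throughput_gain = optimized_qps / baseline_qps - 1

print(f"Latency reduction: {latency_reduction:.0%}")  # 36%
print(f"Throughput gain:   {throughput_gain:.0%}")    # 57%
```

The 36% latency reduction matches the reported figure; the 57% figure corresponds to the throughput gain per instance, which is what drives the reported cost savings.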
The company states:
“… our product is built on the foundation of providing the best results, and we will not compromise on quality for speed. This is where TensorRT-LLM comes into play, reducing model inference time and, consequently, the end-to-end experience latency without sacrificing result quality.”
This update brings potential benefits to Bing users, including faster results, improved accuracy, and lower operating costs.
Bing’s shift to a combined LLM/SLM architecture with TensorRT-LLM optimization could shape the future of search.
As users ask more complex questions, search engines need to better understand and deliver relevant results quickly. Bing aims to do that using smaller language models and advanced optimization techniques.
While we’ll have to wait and see the full impact, Bing’s move sets the stage for a new chapter in search.
Featured Image: mindea/Shutterstock