Large language models (LLMs) like ChatGPT and GPT-4 are remarkably capable. With a few API calls, you can get them to do extraordinary things, and because each call carries only a marginal cost, you can build proofs of concept and working examples cheaply.
However, in real-world applications that make hundreds of API requests every day, the charges add up quickly. You might end up spending thousands of dollars a month on tasks that could be completed for a fraction of that amount.
According to a recent study by Stanford University researchers, the cost of using GPT-4, ChatGPT, and other LLM APIs can be cut significantly. The study, titled "FrugalGPT," presents several approaches for reducing the cost of LLM APIs by up to 98% while maintaining or even improving their performance. Here is a closer look at how you can cut ChatGPT costs.
GPT-4 is often regarded as the most capable large language model, but it is also the most expensive, and the charges rise as your prompt gets longer. In many cases, a different language model, API provider, or even prompt can lower your inference costs. For example, OpenAI offers a range of models priced from US$0.0005 to US$0.15 per 1,000 tokens, a 300x difference. You can also compare costs across other providers, such as AI21 Labs, Cohere, and Textsynth.
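To see how quickly that per-token gap compounds, here is a rough back-of-the-envelope sketch. The model names, prices, and traffic figures below are illustrative assumptions, not numbers from the study; always check your provider's current pricing.

```python
# Rough cost comparison sketch. Prices and request volumes are illustrative
# assumptions only; check your provider's pricing page for real figures.

PRICE_PER_1K_TOKENS = {
    "premium_model": 0.15,   # assumed high-end price (USD per 1,000 tokens)
    "budget_model": 0.0005,  # assumed low-end price (USD per 1,000 tokens)
}

def monthly_cost(model: str, requests_per_day: int, tokens_per_request: int) -> float:
    """Estimate monthly spend for a given model and traffic profile."""
    tokens_per_month = requests_per_day * 30 * tokens_per_request
    return tokens_per_month / 1000 * PRICE_PER_1K_TOKENS[model]

if __name__ == "__main__":
    for model in PRICE_PER_1K_TOKENS:
        cost = monthly_cost(model, requests_per_day=10_000, tokens_per_request=1_000)
        print(f"{model}: ${cost:,.2f}/month")
```

At 10,000 calls a day with 1,000 tokens each, the assumed premium price works out to roughly $45,000 a month versus about $150 for the budget tier, which is why model choice matters so much.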
Fortunately, most API providers offer comparable interfaces. With some work, you can build an abstraction layer that applies smoothly across different APIs; Python packages like LangChain have already done much of the heavy lifting for you. However, you can avoid trading quality for cost only if you have a systematic process for selecting the most efficient LLM for each task.
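As a minimal sketch of what such an abstraction layer can look like, the classes below are hypothetical placeholders rather than any real vendor SDK or LangChain API; in practice you would wrap each provider's client behind a shared interface like this.

```python
# Minimal sketch of a provider-agnostic completion interface. The provider
# classes and their `complete` methods are hypothetical placeholders; a real
# project would wrap each vendor's SDK (or use a library such as LangChain).
from abc import ABC, abstractmethod


class CompletionProvider(ABC):
    """Common interface so application code never touches a vendor SDK directly."""

    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        ...


class ExpensiveProvider(CompletionProvider):
    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        # Placeholder: call the high-end model's API here.
        return f"[expensive model answer to: {prompt[:40]}...]"


class CheapProvider(CompletionProvider):
    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        # Placeholder: call the budget model's API here.
        return f"[cheap model answer to: {prompt[:40]}...]"


def answer(provider: CompletionProvider, prompt: str) -> str:
    # Application code depends only on the abstract interface,
    # so swapping providers is a one-line change.
    return provider.complete(prompt)
```

The point of the design is that cost experiments (switching models or vendors) never require touching application logic, only the provider wiring.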
The Stanford University researchers propose a solution that keeps LLM API charges within a budget constraint. They describe three techniques: prompt adaptation, LLM approximation, and LLM cascade. While these techniques have not yet been tested in a production setting, preliminary results are encouraging.
All LLM APIs price their usage based on the length of the prompt. As a result, the simplest way to cut API costs is to shorten your prompts. There are several ways to do this.
Many tasks require few-shot prompting: you preface your prompt with a handful of examples, often in a prompt->answer format, to improve the model's performance. Frameworks such as LangChain provide tools for building templates containing few-shot examples.
As LLMs support longer and longer contexts, developers may be tempted to build huge few-shot templates to increase accuracy. However, the model may achieve the same quality with fewer examples.
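To avoid depending on any particular library's template API, here is a plain-Python sketch of a prompt->answer few-shot template; the example pairs are invented for illustration.

```python
# Plain-Python sketch of a few-shot template in the prompt->answer style.
# The example pairs below are invented for illustration only.

FEW_SHOT_EXAMPLES = [
    {"prompt": "Translate to French: Good morning", "answer": "Bonjour"},
    {"prompt": "Translate to French: Thank you", "answer": "Merci"},
]


def build_few_shot_prompt(examples: list[dict], query: str) -> str:
    """Prefix the user's query with worked examples to steer the model."""
    blocks = [f"Prompt: {ex['prompt']}\nAnswer: {ex['answer']}" for ex in examples]
    blocks.append(f"Prompt: {query}\nAnswer:")
    return "\n\n".join(blocks)


print(build_few_shot_prompt(FEW_SHOT_EXAMPLES, "Translate to French: Good night"))
```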
The researchers suggest "prompt selection," which reduces the few-shot examples to the minimum needed to preserve output quality. Even trimming 100 tokens from a template can result in significant savings when the template is used repeatedly.
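A simple way to picture prompt selection is as a filter over the example pool. The word-overlap scoring below is a deliberately crude stand-in, not the paper's method; a real system might rank examples by embedding similarity or by measured accuracy on a validation set.

```python
# Sketch of "prompt selection": keep only the few-shot examples most relevant
# to the incoming query instead of sending the full template every time.
# Word-overlap scoring is a simple stand-in for a real relevance metric.

def overlap_score(example_prompt: str, query: str) -> int:
    return len(set(example_prompt.lower().split()) & set(query.lower().split()))


def select_examples(examples: list[dict], query: str, k: int = 2) -> list[dict]:
    ranked = sorted(examples, key=lambda ex: overlap_score(ex["prompt"], query), reverse=True)
    return ranked[:k]  # fewer examples -> fewer tokens -> lower cost per call
```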
Another method they recommend is "query concatenation," in which you combine several prompts into one and have the model generate multiple results in a single call. Again, this works especially well with few-shot prompting: if you send your queries one at a time, you must include the few-shot examples with each prompt, but if you concatenate your queries, you only need to provide the context once and get multiple answers back in the output.
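The sketch below shows one way to batch queries behind a single shared context. The numbered "Answer <n>:" output convention is an assumption you would have to enforce in the instructions and parse yourself, not something the API guarantees.

```python
# Sketch of "query concatenation": send the shared few-shot context once and
# ask the model to answer several numbered queries in a single call.

def build_batched_prompt(shared_context: str, queries: list[str]) -> str:
    numbered = "\n".join(f"Question {i + 1}: {q}" for i, q in enumerate(queries))
    return (
        f"{shared_context}\n\n"
        f"{numbered}\n\n"
        "Answer each question on its own line as 'Answer <n>: <text>'."
    )


queries = ["Translate to French: Good night", "Translate to French: See you soon"]
print(build_batched_prompt("You are a concise English-to-French translator.", queries))
```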
To implement the LLM cascade technique, the researchers built FrugalGPT, which draws on 12 different APIs from OpenAI, Cohere, AI21 Labs, Textsynth, and ForeFrontAI.
This work suggests fascinating avenues for LLM applications. While the study focuses on costs, similar methodologies can be applied to other concerns, such as risk criticality, latency, and privacy.
Another cost-cutting measure is to reduce the number of API calls made to the expensive LLM. The researchers suggest approximating expensive LLMs "using more affordable models or infrastructure."
One way to approximate LLMs is to use a "completion cache," which stores the LLM's prompts and responses on an intermediate server. If a user submits a query that is identical, or sufficiently similar, to one that has already been cached, you return the cached response instead of querying the model again. While building a completion cache is simple, it has some significant drawbacks. First, it limits the LLM's originality and variability. Second, its usefulness depends on how similar different users' requests are. Third, the cache can grow large if the stored prompts and responses vary widely. Finally, caching responses is only effective if the LLM's output does not depend on context.
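A minimal exact-match version of such a cache fits in a few lines. Real deployments usually add fuzzy or semantic matching and an eviction policy; the `call_llm` callback here is a hypothetical placeholder for your provider call, not a real API.

```python
# Minimal completion-cache sketch: exact-match lookups keyed by a hash of the
# prompt. `call_llm` is a placeholder for whatever provider call you use.
import hashlib


class CompletionCache:
    def __init__(self):
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_call(self, prompt: str, call_llm) -> str:
        key = self._key(prompt)
        if key in self._store:
            return self._store[key]        # cache hit: no API cost
        response = call_llm(prompt)        # cache miss: pay for one call
        self._store[key] = response
        return response


cache = CompletionCache()
fake_llm = lambda p: f"[model answer to: {p}]"
cache.get_or_call("What is the capital of France?", fake_llm)  # calls the model
cache.get_or_call("What is the capital of France?", fake_llm)  # served from cache
```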
A more sophisticated option is to build a system that chooses the right API for each query. Instead of sending everything to GPT-4, the system can be tuned to select the least expensive LLM capable of answering the user's query. This can yield both cost savings and improved performance.
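In cascade form, the routing decision reduces to "try cheap, escalate if unsure." The sketch below uses a generic scoring callback and a fixed threshold as stand-ins; the FrugalGPT paper itself trains a scoring model to judge answer reliability, which is not reproduced here.

```python
# Sketch of a cascade-style router: query the cheap model first and escalate
# to the expensive one only when the scorer is not confident enough. The
# model callbacks and the scorer are hypothetical placeholders.

def cascade_answer(prompt: str, cheap_llm, expensive_llm, scorer, threshold: float = 0.8) -> str:
    draft = cheap_llm(prompt)
    if scorer(prompt, draft) >= threshold:
        return draft               # cheap answer judged good enough
    return expensive_llm(prompt)   # escalate only when necessary
```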