This is a short blog post to share some thoughts on how to reduce AI token consumption and improve user response times.
I was at the AI Tinkerers event in Boston, where I saw a presentation on using AI-generated reports for quant education. The presenter was using a generic LLM to create multiple-choice questions on different themes. Similarly, I've been building an LLM system that produces a report based on data pulled from the internet. In both cases, there's a finite number of topics to generate reports on. My case was much larger, but even so, it was still finite.
The obvious thought: if you're only generating a few reports or question-and-answer sets, why not generate them in batch? There's no need to keep the user waiting, and you can schedule your LLM API calls for the middle of the night, when there's less competition for resources.
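As a rough sketch of what overnight pre-generation might look like (the topic list, the cache directory, and the `generate_report` helper below are all placeholders I'm making up for illustration, not any particular API):

```python
import json
from pathlib import Path

CACHE_DIR = Path("report_cache")              # hypothetical on-disk cache location
TOPICS = ["topic-a", "topic-b", "topic-c"]    # placeholder for your finite topic list


def generate_report(topic: str) -> str:
    """Placeholder for whatever LLM API call actually produces the report text."""
    raise NotImplementedError


def batch_generate(topics=TOPICS):
    """Run once overnight (cron, scheduled job, etc.) to pre-fill the cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    for topic in topics:
        out = CACHE_DIR / f"{topic}.json"
        if out.exists():
            continue  # already cached, skip the LLM call
        report = generate_report(topic)
        out.write_text(json.dumps({"topic": topic, "report": report}))
```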
In my case, there are potentially thousands of reports, but some reports will be pulled more often than others. A better strategy in my case is something like this:
- Take a guess at the most popular reports (or use existing popularity data) and generate those reports overnight (or at a time when competition for resources is low). Cache them.
- If the user wants a report that's been cached, return the cached copy.
- If the user wants an uncached report:
  - Tell the user there will be a wait.
  - Call the LLM API and generate the report.
  - Display the report.
  - Cache the report.
- For each cached report, record the LLM used and its creation timestamp.
You can start to do some clever things here, like refreshing the reports every 30 days or whenever the LLM is upgraded.
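Here's a minimal sketch of that lookup path. Again, the cache layout, `MODEL_NAME`, `MAX_AGE_DAYS`, and the `generate_report` stub are assumptions for illustration, not a specific library or API:

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path("report_cache")   # hypothetical cache directory
MODEL_NAME = "my-llm-v1"           # placeholder for whichever model you call
MAX_AGE_DAYS = 30                  # refresh window from the post


def generate_report(topic: str) -> str:
    """Placeholder for the real LLM API call."""
    raise NotImplementedError


def get_report(topic: str) -> str:
    """Return a cached report if it's fresh and from the current model;
    otherwise regenerate it, re-cache it, and return it."""
    path = CACHE_DIR / f"{topic}.json"
    if path.exists():
        entry = json.loads(path.read_text())
        age_days = (time.time() - entry["created_at"]) / 86400
        if entry["model"] == MODEL_NAME and age_days < MAX_AGE_DAYS:
            return entry["report"]  # cache hit: no LLM call, no wait

    # Cache miss (or stale / older model): this is where you'd tell the user
    # there will be a wait, then call the LLM and cache the result with metadata.
    report = generate_report(topic)
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps({
        "topic": topic,
        "report": report,
        "model": MODEL_NAME,
        "created_at": time.time(),
    }))
    return report
```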
I know this isn't rocket science, but I've been surprised by how few LLM demos I've seen use any form of batch processing and caching.
