Since the GenAI boom, there’s been a wave of interest in building AI-based products. Many of these are B2B products that aim to make old-school industries, like law and accounting, more efficient. For example, they try to process paperwork, verify regulatory compliance, or handle customer service requests automatically.
Over the last few months, many companies have approached me for advice because they were facing technical issues—they couldn’t make their products work quite as intended. After talking with them, I often found myself identifying the same problems and giving the same advice again and again. So I decided it was time to write a blog article summarizing that advice. Here we go!
Break it into pieces
A start-up recently created an AI-based product to assess the regulatory compliance of financial documents. The product analyzes a document and generates a list of edits required to make it compliant with the relevant regulation.
The founders reached out to me when they noticed the product was “inconsistent.” For example, it would suggest five points that needed to be addressed to make the document compliant. Instead of addressing all points at the same time, users often addressed one point at a time and ran the document through the AI tool in between. However, the product raised different points on repeated runs. Sometimes a recommendation disappeared on a subsequent run even though it hadn’t been addressed. Customers were not happy about this.
When I dug a bit deeper, I realized the start-up had written a single, gigantic prompt that told the LLM everything it needed to look out for. It specified all the things to check (in fact, at one point the founders had simply inserted the relevant legal texts verbatim). This made it very hard to get the LLM to do what they wanted, and it was hard to trace the causes of its inconsistency.
I recommended that they break the task into multiple pieces. For example, they should check each individual requirement with a separate prompt and a dedicated LLM run. This way, each task would be defined much more precisely. In addition, they would be able to measure the performance of the LLM at each individual task (see below), which would help them identify the weaknesses of the system.
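To make this concrete, here is a minimal sketch of what the decomposed approach could look like. It assumes an OpenAI-style chat completions API; the requirement texts and the helper names (`check_requirement`, `check_document`) are made up for illustration.

```python
# Minimal sketch: one focused prompt per requirement instead of one giant prompt.
# Assumes the OpenAI Python SDK; requirement texts below are invented examples.
from openai import OpenAI

client = OpenAI()

requirements = [
    "The document must state the name of the responsible compliance officer.",
    "The document must disclose all management fees as yearly percentages.",
    # ... one entry per item in the regulation checklist
]

def check_requirement(document_text: str, requirement: str) -> str:
    """Run a single, narrowly scoped compliance check and return the verdict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        messages=[
            {"role": "system",
             "content": "You check financial documents against one compliance requirement. "
                        "Answer 'PASS' or 'FAIL', followed by a one-sentence justification."},
            {"role": "user",
             "content": f"Requirement: {requirement}\n\nDocument:\n{document_text}"},
        ],
        temperature=0,  # reduce run-to-run variation
    )
    return response.choices[0].message.content

def check_document(document_text: str) -> dict:
    """Check every requirement separately so each verdict can be traced and measured."""
    return {req: check_requirement(document_text, req) for req in requirements}
```

Each verdict now maps to one requirement and one prompt, so when the tool behaves inconsistently you can see exactly which check is at fault.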
This might seem like a lot of work. Why not just insert the regulation documents and let the LLM do the whole job? In addition to not working very well, this approach makes you wonder what the added value of the product is. If that’s all you do, why does your user even need your product in the first place?
Measure performance
You need to measure the performance of AI at each individual task (for example, each item in the regulation checklist described above). This way, you get an objective idea of how well AI does the job you want it to do (it won’t work 100% well—more on this in a sec). In addition, you can identify the weaknesses of your product to improve it.
In order to assess performance, you must create a validation set—a set of data on which you calculate and compare the performance of different models. For example, I advised the above entrepreneurs to create a database of a few hundred documents and check how often the LLM identified compliance issues correctly. You can use the validation set to compare the performance of different LLMs, different prompting strategies, a fine-tuned LLM, and so on.
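In practice, a validation run can be very simple. The sketch below assumes the `check_requirement` function from the earlier snippet and a hand-labelled validation set; the field names and labels are illustrative.

```python
# Minimal sketch of a validation run over hand-labelled examples.
validation_set = [
    # Each entry: the document text, the requirement checked, and the correct verdict.
    {"document": "...", "requirement": "The document must state ...", "label": "PASS"},
    {"document": "...", "requirement": "The document must disclose ...", "label": "FAIL"},
    # ... a few hundred labelled examples in practice
]

def evaluate(check_fn) -> float:
    """Return the fraction of examples where the model's verdict matches the label."""
    correct = 0
    for example in validation_set:
        verdict = check_fn(example["document"], example["requirement"])
        predicted = "PASS" if verdict.strip().upper().startswith("PASS") else "FAIL"
        if predicted == example["label"]:
            correct += 1
    return correct / len(validation_set)

# Compare prompting strategies, models, fine-tunes, etc. on the same data:
# accuracy = evaluate(check_requirement)
# print(f"Validation accuracy: {accuracy:.1%}")
```

The same `evaluate` call can score any alternative you want to try, which is exactly what makes the comparison objective.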
Once you’re happy with the performance of your selected LLM and prompt, you should run an additional performance check using data you haven’t used so far. This is known as the test set. Using the validation set alone can result in selection bias: because you actively pick the best-performing option among a few, you might select a solution that works well on the validation data only and not in general. The test data helps you perform a final “sanity check.”
The test set must only be used once. If you’re not happy with the results and go back to the drawing board, you need to create a new test set: the old one is now tainted, since you’ve used it to alter your product.
Frame your solution realistically
Current AI is not perfect—it “hallucinates” and often makes silly, unexpected mistakes. Your AI tool will probably do 80% or 90% of the job (which is why it’s important to measure performance as described in the previous section).
If you promise that your AI product will do 100% of the job, it’s likely your users will be disappointed. Instead, try to acknowledge AI’s limitations from the get-go and frame your product accordingly. Have a look at these two different ways of framing the above AI compliance tool:
- This tool guarantees your documents will comply with regulation.
- This tool will try to help you identify regulatory requirements that you may have missed.
In the first formulation, you overpromise—your users will stop taking you seriously when they realize your product isn’t “perfect.” In the second formulation, the user knows this is a tool to support human work. They know there may be some false positives and some false negatives, but this isn’t a problem because they know the product is intended to be an additional layer of safety that can help catch problems.
I recently spoke with a founder who framed his AI-based product the second way, and it has been very well received by users. His software, which helps analyze accounting documents, does not promise to “do it for you.” Instead, it helps you do your job more easily. For example, it answers a question about a document, but it also attaches the relevant paragraph from the original document so the user can verify the answer.
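As an illustration (not the founder’s actual implementation), the “answer plus supporting paragraph” pattern could look roughly like the sketch below. It assumes the OpenAI SDK, and the word-overlap retrieval is a deliberately naive stand-in for whatever retrieval a real product would use.

```python
# Rough sketch: answer a question about a document and attach the paragraph used.
from openai import OpenAI

client = OpenAI()

def most_relevant_paragraph(document_text: str, question: str) -> str:
    """Pick the paragraph sharing the most words with the question (crude retrieval)."""
    paragraphs = [p for p in document_text.split("\n\n") if p.strip()]
    question_words = set(question.lower().split())
    return max(paragraphs, key=lambda p: len(question_words & set(p.lower().split())))

def answer_with_source(document_text: str, question: str) -> dict:
    """Answer the question and return the supporting paragraph so the user can verify it."""
    paragraph = most_relevant_paragraph(document_text, question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the excerpt provided. "
                        "If the excerpt does not contain the answer, say so."},
            {"role": "user", "content": f"Excerpt:\n{paragraph}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return {"answer": response.choices[0].message.content, "source_paragraph": paragraph}
```

The point is not the retrieval method but the framing: the tool supports the user’s judgment instead of asking to be trusted blindly.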
The founder worked closely with clients to identify bottlenecks in their daily work. He then worked with them to find the best way of using AI to address those bottlenecks, without trying to push an all-AI solution at all costs.
Beware of advanced AI models
Some founders are tempted to purchase “advanced” or “dedicated” AI for specific application domains. Instead of using, say, OpenAI models directly, they purchase an AI model created by a third party that was specifically designed “for law,” “for accounting,” and so on.
I have researched some of these models and, in some cases, it seems that they’re actually vanilla, off-the-shelf models like the ones provided by OpenAI. Their creators are simply inserting a specialized prompt with information and instructions about the topic (e.g., accounting). Their contribution is a longer, dedicated prompt—they’re not spending millions to create new models with specialized data.
So, the result of using these advanced “models” may not be too different from what you could achieve yourself by writing a longer prompt. And there’s no guarantee they’ll be better at performing your required task.
I advise you, instead, to first try an off-the-shelf LLM. You can include examples of how to do the job in the prompt, and then calculate performance using a validation set as explained above. If you want to use a third-party, “advanced” model, ask for a free trial and measure its performance on your validation set.
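As a rough sketch of that approach (an off-the-shelf model with a couple of worked examples in the prompt), you could try something like the following. The examples and model name are placeholders, and whatever you build should be scored on your validation set just like any other option.

```python
# Minimal few-shot sketch: worked examples in the prompt instead of a specialised model.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_EXAMPLES = """\
Requirement: The document must disclose all management fees as yearly percentages.
Document excerpt: "A management fee of 1.2% per annum applies."
Verdict: PASS - the fee is disclosed as a yearly percentage.

Requirement: The document must state the name of the responsible compliance officer.
Document excerpt: "Compliance queries can be sent to our head office."
Verdict: FAIL - no compliance officer is named.
"""

def check_with_examples(document_text: str, requirement: str) -> str:
    """Run one check, guiding the model with the worked examples above."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You check documents against one requirement. Follow the format "
                        "of the worked examples.\n\n" + FEW_SHOT_EXAMPLES},
            {"role": "user",
             "content": f"Requirement: {requirement}\n\nDocument excerpt: {document_text}\nVerdict:"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```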
A note on customer service chatbots
Many people have asked me, “Can I automate responses to customer service emails with AI?” or “How easy is it to build a customer service chatbot?” They explained to me that this is because they receive too many “silly” customer service requests. For example, users send an email asking something already explained in the FAQs. I saw this myself when I worked for a travel agency a few years ago—customers often called to ask about the allowed carry-on size, even though it was stated in the confirmation email and on the airline’s website.
But before you try to use a chatbot for that, I advise you to ask yourself what the real problem is. Could it be that you’re not communicating things effectively to your clients? What if the above travel agency sent an email a couple of weeks before the trip with the title “Your baggage allowance”? The “silly” requests could perhaps be reduced or eliminated thanks to better communication.
In addition, non-silly requests are unlikely to be solved by a chatbot. I once couldn’t check in online for a flight because the app crashed every time I tried. When I contacted customer service, a chatbot simply referred me to the online documentation about how to check in online, which didn’t say what to do if the app crashed.
If you’re getting too many “silly” customer service requests, I advise you to dig deeper to see where the problem is—better communication, rather than AI, may be the answer. And if you’re getting too many serious customer service requests, these will probably be related to exceptional circumstances that AI is unlikely to handle properly.
Stop the AI jargon
I’ve noticed that many founders try to promote their products with AI jargon. For example, one company highlights on its website that it uses “AI models from ChatGPT, Gemini, Anthropic and more.” Successful businesses focus on what customers want, and my impression is that customers want to solve problems, not to use specific AI models. In some cases, it’s easier to find AI jargon in the sales pitch than to understand what the product does for customers.
I advise you to focus your messaging on the problem you’re solving for customers, not on which fancy AI model you’re using. I learned a lot about this from a book called The Copywriter’s Handbook, which I recommend to entrepreneurs.
Conclusion
Remember that doing business is about serving customers. You will need to understand their needs, devise a step-by-step solution, and measure performance. Sending a long prompt to an AI model and hoping for the best will not do.