The Issue of AI Hallucination in Writing Code for Predictive Models: Observations from the Theta Data Science Team

Written by: Tara Heptinstall, Theta | June 2023

Artificial Intelligence (AI) has made remarkable progress in recent years, becoming increasingly proficient at understanding and generating language. However, AI systems sometimes produce plausible-sounding but false or nonsensical responses to certain prompts. This phenomenon is known as “AI hallucination.”

AI hallucination occurs when an AI generates output that is not grounded in its input or training data, conveying a distorted understanding of what was asked. The root cause is often the way AI models are trained and the limitations of their training data.

AI Hallucination at Work

Our data science team here at Theta recently discovered this phenomenon firsthand when challenging Google Bard and ChatGPT to write code for some of the predictive models used in our customer base analysis work. Ramnath Vaidyanathan, head of data science and engineering at Theta, asked Google Bard to write a likelihood function for the BG/NBD model. “I was initially impressed by the output, and that it inserted what we call ‘argument checks’ within the code,” he said.

But then he and others on the team started to scrutinize the code and uncovered places where the AI simply made stuff up.

“When we looked closer, we could see the code was not as good as we initially thought. The likelihood functions were not right,” he said. “There seemed to be some degree of hallucination.”
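For readers curious what the team was asking for, here is a minimal sketch of a BG/NBD log-likelihood function, assuming the standard Fader, Hardie & Lee (2005) parameterization. This is our own illustrative NumPy/SciPy implementation (function name included), not the AI-generated code the team reviewed:

```python
import numpy as np
from scipy.special import gammaln

def bgnbd_log_likelihood(params, x, t_x, T):
    """Log-likelihood of the BG/NBD model (Fader, Hardie & Lee 2005).

    params: (r, alpha, a, b) -- gamma and beta hyperparameters
    x:   number of repeat transactions per customer
    t_x: recency (time of last repeat transaction) per customer
    T:   length of the observation period per customer
    """
    r, alpha, a, b = params
    # Argument checks of the kind the AI-generated code also included
    if min(r, alpha, a, b) <= 0:
        raise ValueError("all BG/NBD parameters must be strictly positive")
    x, t_x, T = (np.asarray(v, dtype=float) for v in (x, t_x, T))
    # Standard decomposition: L_i = A1 * A2 * (A3 + 1{x>0} * A4)
    ln_a1 = gammaln(r + x) - gammaln(r) + r * np.log(alpha)
    ln_a2 = gammaln(a + b) + gammaln(b + x) - gammaln(b) - gammaln(a + b + x)
    ln_a3 = -(r + x) * np.log(alpha + T)
    # A4 applies only to customers with at least one repeat purchase;
    # np.maximum keeps the log argument positive for the x == 0 rows,
    # whose branch value is then discarded by np.where (exp(-inf) == 0)
    safe_x = np.maximum(x, 1.0)
    ln_a4 = np.where(
        x > 0,
        np.log(a) - np.log(b + safe_x - 1.0) - (r + x) * np.log(alpha + t_x),
        -np.inf,
    )
    return float(np.sum(ln_a1 + ln_a2 + np.logaddexp(ln_a3, ln_a4)))
```

Even in a short function like this, the log-gamma terms and the zero-repeat-purchase edge case are exactly the places where a hallucinated formula can look plausible while being quietly wrong.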

Others on the team had a similar experience when using Bard to generate code for different use cases. Bard invented functions that do not exist in any available R package but nevertheless looked quite plausible, then used these made-up functions in the rest of the model.

“It gave me expected outputs for the fake functions it created,” said Jingjing Fan, a data scientist on the Theta team. “And it provided full citations for papers that don’t exist!”

The team also prompted ChatGPT and found similar issues. But they weren’t ready to write off these tools, concluding that both Bard and ChatGPT did create good starting points for designing a function. I have found the same to be true when using AI to generate content. It gives your brain a jump start, but ultimately the human doing the prompting must decide whether the context, sequencing, and detail of the output are on target and factual.

Human Mitigation of AI Hallucination

AI hallucination is a concerning phenomenon that arises from the limitations of current AI models. While AI has proven valuable for many applications, the possibility of erroneous or misleading outputs is a significant challenge in these early iterations of the technology. To mitigate the risk of AI hallucination, researchers must develop more comprehensive training data sets and refine the algorithms used to train AI models. In turn, users must carefully scrutinize AI-generated outputs to identify and correct any hallucinations or inconsistencies (e.g., ask for a minimal reproducible example and confirm that the example produces sensible results). Simply put: Check your work!
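Part of that checking can be automated. As a minimal, hypothetical sketch (not the Theta team’s actual workflow), one quick sanity check is to verify that every function an AI claims to use actually exists in the package it names before running the generated code:

```python
import importlib

def check_claimed_functions(module_name, function_names):
    """Report which of the functions an AI claims to use really exist
    in the named module. Missing names are likely hallucinated."""
    module = importlib.import_module(module_name)
    return {name: hasattr(module, name) for name in function_names}
```

A `False` in the result is a strong hint the model invented a function, much like the nonexistent R functions the team encountered.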

The human role in generating code with AI is crucial: it involves creating and fine-tuning the algorithms that power predictive models. The process starts with humans identifying the problem to solve and determining the relevant data to train the model (and, when applicable, how to query AI tools effectively to get the code generation process started). Human input is needed to clean and preprocess the data, which is critical to ensuring the model is trained on data that genuinely reflects the underlying customer behavior dynamics at the firm in question. Humans must also verify the accuracy and interpretability of the model outputs so they align with the intended outcomes.

Ongoing Human Involvement Required

The human role in code generation also extends beyond the initial development phase. As models continue to learn from new data, humans must monitor and evaluate their performance, making adjustments to maintain accuracy and reliability over time. Bottom line: human oversight and intervention are critical for ensuring that AI-generated code for predictive models is accurate, reliable, and effective at solving real-world problems.

“I think our jobs are safe,” Ramnath concluded.  “For now.”