{"id":969,"date":"2024-07-07T16:46:16","date_gmt":"2024-07-07T14:46:16","guid":{"rendered":"https:\/\/vuca-ai.com\/?p=969"},"modified":"2024-08-21T17:26:03","modified_gmt":"2024-08-21T15:26:03","slug":"les-hallucinations-de-lia-un-ajout-aux-malheurs-du-monde-2-2","status":"publish","type":"post","link":"https:\/\/vuca-ai.com\/fr\/2024\/07\/07\/les-hallucinations-de-lia-un-ajout-aux-malheurs-du-monde-2-2\/","title":{"rendered":"AI Hallucinations, adding to the woes of the world (2\/2)"},"content":{"rendered":"<p>A virtual avatar with a smooth face, SARAH (Smart AI Resource Assistant for Health) gives millions of people around the world health advice in eight languages, day and night, on the best way to eat, stop smoking, de-stress&#8230; .<\/p>\n<p>But, as with all chatbots, SARAH sometimes makes mistakes in its recommendations. For example, it suggested a list of fake names and addresses for clinics that didn&#8217;t exist in San Francisco. The WHO (World Health Organisation) warns on its website that SARAH can make mistakes.<\/p>\n<p>Here we go again!<\/p>\n<p>Chatbot errors have even become the butt of jokes. Look at Matea&#8217;s short-lived science chatbot, Galactica, which invented academic papers and created stories about bears in space. You might say that the former head of Harvard invented her thesis. But then, we probably tend to think that robots have integrity and precision. But that&#8217;s no reason to invent stories about bears in space&#8230; Come on, another juicy (and true) anecdote. Last February, Air Canada was ordered to honour a refund policy invented by their chatbot.<\/p>\n<p>In <a href=\"https:\/\/vuca-strategy.com\/2024\/05\/21\/les-hallucinations-de-lia-un-ajout-aux-malheurs-du-monde-1\/\">the first part of<\/a> this series devoted to AI Hallucinations, I argued that these hallucinations are not errors but are consubstantial with the most recent AIs, from which chatbots, among other things, have emerged.<\/p>\n<p>Yet the answer to the hallucination question seems simple. Generative AI is a machine learning system and this system is known to produce errors, so a hallucination from such a system is an error. The recent past has shown us that learning models make fewer errors as they process more data and as we build larger models. So we can expect chatbots and other generative AI models to become more accurate over time.<\/p>\n<h5>All the results of generative AI are hallucinations<\/h5>\n<p>I think that hallucinations are more than just \u2018simple\u2019 errors. All the results of generative AI are hallucinations. Whatever your definition of a hallucination and whatever your opinion of its nature, you will agree that there are some results of generative AI that are correct and useful, but also that there are some that are inaccurate and useless, and that it is reasonable to want to quantify the proportions of these types of results. That it is even essential to do so in order to assess the level of confidence that can be placed in these results. 
<p>But this assessment turns out to be extremely difficult, as more and more people are beginning to realise, with some going so far as to publish a paper in a scientific journal arguing that <a href="https://link.springer.com/article/10.1007/s10676-024-09775-5">ChatGPT is bullshit</a>.</p>
<p>But before we get into the reasons why these evaluations are so difficult, let's take a closer look at what machine learning actually does.</p>
<h5>Accelerated machine learning</h5>
<p>Before the emergence of all these generative approaches and techniques, most AIs were dedicated to formulating very precise hypotheses for narrow fields of application. Will this website visitor click on this link? What type of object is shown in this image? How much will this share be worth tomorrow? Each of these questions is answered by a computer program whose sole task is to answer the question it was designed for.</p>
<p>Historically, such programs were built on first principles. To predict how long it would take for an apple to fall from its branch, Newton thought very hard about the nature of the Universe and developed a theory that produces an equation answering this question. But while Newton solved the problem of the apple falling from the tree, most of the practical problems we encounter cannot be solved from first principles; guessing which objects are represented in a picture, for example.</p>
<p>This is where <em>machine learning</em> (ML) comes in. The basic idea is that by looking at enough examples of the process you are trying to predict, you will find a model that helps you make accurate predictions without necessarily understanding the process that generated those examples. By observing a million apples falling from a million trees of different heights, you can dispense with Newton's Principia and go straight to the equation. Or rather, straight to <em>an</em> equation, because it is very unlikely that the one you find will match Newton's. But that doesn't matter, and it doesn't even need to match exactly: you are not trying to understand gravity, you are trying to make predictions about apples. And you can do perfectly well without first principles for recognising objects in images, which is very practical.</p>
<p>The basic process for building a system that recognises objects in images is called supervised learning, and, skipping the details, it is very simple. To build a system capable of guessing, say, which handwritten digit is shown in an image, you need to <a href="https://en.wikipedia.org/wiki/MNIST_database">collect a vast dataset of digit images</a> and manually label each image with the digit it represents. This is called the training data. Then you show all the images in the training data to a computer, ask it which digit is in each image, and give it a score based on how many times it got it right. You repeat this operation a few hundred thousand times, and each time the computer tries different guessing strategies, looking for the one that gives it the highest score.</p>
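<p>To make that training loop concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions (scikit-learn's small bundled digits dataset and a logistic-regression classifier standing in for whatever guessing strategy the search settles on); it is not code from the article.</p>
<pre><code># Minimal sketch of supervised learning for digit recognition.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()                      # 8x8 images of handwritten digits, with labels
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=2000)   # the "model" being trained
model.fit(X_train, y_train)                 # the search for a high-scoring guessing strategy

print("accuracy on unseen digits:", model.score(X_test, y_test))
print("prediction for one image:", model.predict(X_test[:1])[0])
</code></pre>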
<p>This search for the best-performing guessing strategy can be very time-consuming and computationally expensive, but <a href="https://en.wikipedia.org/wiki/Backpropagation#Modern_backpropagation">recent innovations in mathematics</a> and <a href="https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units">computational efficiency</a> have made it highly effective for a large number of tasks.</p>
<p>To introduce a little terminology: this search for the best estimation strategy is called '<em>training</em>', and the resulting system is usually called a '<em>model</em>'. A model that guesses from a set of discrete labels is a '<em>classifier</em>', and machine learning practitioners prefer to call guesses '<em>predictions</em>', understandably so; it sounds much more serious.</p>
<h5>The difference between ML and Newton</h5>
<p>It is worth taking a moment to consider the differences between the machine learning approach and Newton's. Newton may have been inspired by a few apples falling from a tree, but his project was much broader: to develop a theory codifying the general principles of the motion of bodies, celestial and otherwise. From this theory emerges, among other things, an equation that tells us how long it takes for an apple to fall to the ground. For a machine learning system, the general principles governing the relationships between celestial bodies are of no importance. Its sole objective is to accurately reproduce a dataset of a million falling apples. Each approach has its advantages and disadvantages. The machine learning approach will undoubtedly produce an impenetrable equation that tells us little about the nature of gravitation, but, on the other hand, it may be better able to incorporate real-world complexities such as air resistance, which complicates Newton's approach.</p>
<p>The only reason I use Newton's approach as an example is to emphasise that machine learning is not the only way to build an AI system. There are many ways of programming a computer, and none of them is <em>ex ante</em> necessarily better than another for a particular application. But over the last fifteen years or so we have begun to realise that <em>supervised</em> learning can be effective for tasks far more complex than we ever imagined. By complexity I mean the variety of possible <em>inputs</em> and <em>outputs</em> of a model. An introductory machine learning tutorial might show you how to build a system that takes a 256 x 256 pixel image of a handwritten digit and produces one of ten possible labels, the digits 0 to 9. You can build such a system with fairly high accuracy from just a few tens of thousands of images. But if instead of a few tens of thousands of labelled images you can use millions or billions, you can greatly expand the universe of possible inputs and outputs. Image diffusion models such as Stable Diffusion, for example, are trained on all sorts of images, and instead of producing a few discrete labels they produce an entire image. In other words, instead of mapping a 65,536-pixel input onto one of 10 possible labels, they map an unfathomably large space of possible inputs onto an unfathomably large space of possible outputs.
The fact that we can do something this complex with machine learning at all is not obvious, and it is undoubtedly one of the major scientific discoveries of the last fifteen years.</p>
<p>The problem (and there is a problem) is that building these more complex models requires an extremely large amount of data, and obtaining sufficiently large datasets quickly becomes financially prohibitive. The most promising models for these very complex tasks require billions of labelled examples, and it is simply impossible to manually examine a billion images and label all the objects they represent.</p>
<h5>When self-supervised learning comes to the rescue</h5>
<p>If you could somehow generate the labels without having to manually examine all the examples, that would be great. That is the big idea behind <a href="https://en.wikipedia.org/wiki/Self-supervised_learning"><em>self-supervised</em> learning</a>, the machine learning paradigm that underpins modern generative AI systems. If you can get your hands on billions of sentences, for example by retrieving all the text on the Internet, you can build the training dataset programmatically by cutting the sentences up. Simply transform 'The quick brown fox jumps over the lazy dog' into the training example 'The quick brown fox jumps over the lazy ___', and label it 'dog'. In fact, there are lots of training examples you can build from this one sentence by cutting it in different places: 'The quick' and 'brown', 'The quick brown' and 'fox', and so on. From a single sentence, we get nine training examples with no human labelling required. Multiply that by the number of sentences you can find on the Internet, and you are getting close to the size needed to train these kinds of complex models. An important observation that I will come back to is that, leaving aside the big differences in size and complexity, the learning process of an LLM and the learning process of a traditional classifier are not that different. An LLM handles many more possible inputs and outputs, but it is trained in the same way, to do the same thing: guess the right label for the given input.</p>
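<p>As a rough sketch (my own illustration, not code from the article), here is how those training pairs can be carved out of a single sentence programmatically; counting every cut, from the empty context onwards, gives the nine examples mentioned above:</p>
<pre><code># Build (context, label) training pairs from raw text by masking the next word.
# No human labelling is needed: the text itself supplies the labels.
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.split()

pairs = [(" ".join(words[:i]), words[i]) for i in range(len(words))]
for context, label in pairs:
    print(f"{context!r:45} -> {label!r}")
# ''                                        -> 'The'
# 'The'                                     -> 'quick'
# 'The quick'                               -> 'brown'
# ...
# 'The quick brown fox jumps over the lazy' -> 'dog'
</code></pre>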
<p>Here is the example in English (where the range of possibilities is much wider than in French):</p>
<figure>[Image: "IA Schéma fonctionnement 1"]</figure>
<p>Both models are built by showing them a pile of incomplete examples, making them guess the missing pieces and scoring their guesses. The big innovations behind training modern generative AI systems are finding clever ways to automatically build massive training datasets and inventing new kinds of black boxes suited to very complex tasks, but the way they are trained is essentially the same as it was decades ago.</p>
<p>The story could end there. Sometimes a digit-recognition tool mistakes a 7 for a 9, and sometimes a language model says that the quick brown fox jumps over the lazy brown typhoon.
This is an inherent aspect of machine learning: these models make predictions based on probabilistic patterns rather than provable deductive inferences, and it is something that tends to improve over time with more data and bigger statistical models.</p>
<p>But I don't think the story ends there.</p>
<h5>The difference between a hallucination and an error</h5>
<p>Sometimes you show the model the image of a 7 and it tells you it is the image of a 9; that has always happened. When it does, and it inevitably does, why not say that the digit-recognition system is 'hallucinating'? Why is inaccurate information only a hallucination when it comes from a chatbot?</p>
<p>As I mentioned earlier, an LLM and a classical classifier are conceptually very similar in the way they are built. The LLM is still a classifier, even if it is a very complex one. In the same way that the digit-recognition tool is trained to fill in the missing label on a pre-existing image, the LLM is trained to fill in the missing word at the end of a pre-existing text. The main difference lies in complexity and scale. But although they are similar in the way they are built, there is a huge difference in the way generative AI systems are <strong>deployed</strong>.</p>
<p>Typically, we deploy a classifier to perform the same task it was trained on. When we deploy the digit-recognition system, we put it to work recognising digits. We will probably have set up a process to collect handwritten digits and use the model to read them in order to carry out an operation such as processing a cheque deposit.</p>
<p>Generative AI systems are different. When we deploy an LLM as a chatbot, we <strong>pivot</strong> from using it to guess the next word in a pre-existing sentence to using it to 'guess' the next word in a brand new string of words <strong>that does not yet exist</strong>. This is an enormous change, whose scope is generally underestimated. It means that, unlike with a traditional classifier, there is simply no way to assess the accuracy of LLM output in the traditional way, because there are no correct labels to compare it to. This point is perhaps somewhat subtle, and a more granular, use-case-by-use-case look is needed to bring it out.</p>
<p>When you feed an image of the digit 7 into the digit recogniser, you expect it to produce a single, unambiguous correct label, namely '7'. If it produces '1' or '9', that is an unambiguous error which counts against your model's accuracy. These errors are of exactly the same kind as those made during training, so it makes sense to talk about the error rate on new data (the 'generalisation error' or 'out-of-sample error') in the same way as we talk about the error rate on the training data.</p>
<h5>What is 2 + 2?</h5>
<p>When you give ChatGPT the string 'What is 2 + 2?', there is no single unambiguous correct next word. You would <strong>like</strong> the next word to be something like '4'. But '2' could also be correct, as in '2 + 2 = 4'. 'The' could also be a correct next word, as in 'The sum of 2 and 2 is 4'. Of course, any of these words could also be the first word of a wrong answer, such as '4.5' or '2 + 2 = 5'.
The task the model is built for, completing a word that has been censored in an existing text, involves an unambiguous correct answer, but <strong>this time</strong> the situation is completely different. There are better next words and worse next words, but there is no correct next word in the same sense as in training, because there is no existing example to reconstruct. An error in the classic sense, for a language model, would be a failure to reproduce the missing word that was censored in a training example; in production, these models are simply not used for that. It is a bit as if we started feeding images of animals into the digit-recognition system. If it calls a lion a 6, has it made a mistake? Not really: we are using it for a different task from the one it was trained for; there is no correct answer, so errors are not defined.</p>
<h5>An elephant story</h5>
<p>In practice, we tend not to worry much about these individual word predictions. The LLM, the engine that runs ChatGPT, does nothing more than guess words one by one, but the ChatGPT system includes a component that feeds these predictions back into the LLM to generate an entire sequence of words making up a complete textual response. It is the semantic content that emerges in this complete response that we usually care about, rather than any particular word. This is at least part of the reason why it is a 'mistake' when the handwritten-digit classifier calls a 7 a 9, but a 'hallucination' when GPT-4 says that an elephant named Jumbo, a particularly large and popular elephant, swam across the English Channel in 1875. Sample response from ChatGPT-4o in May 2024:</p>
data-image-title=\"IA Hallucination GPT4o elephant\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant.png?fit=300%2C109&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant.png?fit=883%2C320&amp;ssl=1\" data-recalc-dims=\"1\" \/>It&#8217;s obviously wrong that an elephant called Jumbo swam across the English Channel in 1875, but the way ChatGPT gets it wrong here is very different from the way an image classifier gets it wrong when it calls a 7 a 9. ChatGPT has made 40 separate predictions here, and it&#8217;s not obvious how to categorise each one as right or wrong. Each predicted word makes sense in relation to the words that precede it, and this looks very much like a sequence of words that might be found in training data.<\/figure>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" class=\"wp-image-28223\" tabindex=\"0\" role=\"button\" src=\"https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant-tokens.png?resize=691%2C193&amp;ssl=1\" sizes=\"(max-width: 691px) 100vw, 691px\" srcset=\"https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant-tokens.png?w=691&amp;ssl=1 691w, https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant-tokens.png?resize=300%2C84&amp;ssl=1 300w, https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant-tokens.png?resize=120%2C34&amp;ssl=1 120w, https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant-tokens.png?resize=81%2C23&amp;ssl=1 81w\" alt=\"\" width=\"691\" height=\"193\" data-attachment-id=\"28223\" data-permalink=\"https:\/\/vuca-strategy.com\/ia-hallucination-gpt4o-elephant-tokens\/\" data-orig-file=\"https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant-tokens.png?fit=691%2C193&amp;ssl=1\" data-orig-size=\"691,193\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"IA Hallucination GPT4o elephant tokens\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant-tokens.png?fit=300%2C84&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/05\/IA-Hallucination-GPT4o-elephant-tokens.png?fit=691%2C193&amp;ssl=1\" data-recalc-dims=\"1\" \/>(This analysis was generated by this <a href=\"https:\/\/platform.openai.com\/tokenizer\">tool<\/a>)<\/p>\n<p>Some, if not most, of the words predicted here are probably closer to correct than to wrong. Of course, there is no universal way of objectively defining what is \u2018true\u2019 and what is \u2018false\u2019, since there is no pre-existing text to compare it to. 
<p>Out of all the predictions made by the model, it is not clear which, if any, should be classified as errors, even if, on the whole, it is clear that this result is not what we want.</p>
<p>But why isn't it what we want? What exactly is wrong with it? Obviously, the main problem is that it seems to describe an event that never actually happened. But when I really think about it, I find this a little confusing. What if an elephant named Jumbo had actually swum across the English Channel in 1875, exactly as described in this text? In that case, this identical input-output pair would not be hallucinatory. This seems to imply that there is nothing inherent in the text of the input-output pair that makes it hallucinatory; whether it is hallucinatory or not depends entirely on facts about the world, facts that exist completely independently of the text produced by the model. But if there is nothing inherent in the text that makes it hallucinatory, is hallucination even a property of the text? Not quite, it seems. It is a property of the way the text relates to objects and events in the real world.</p>
<p>To complicate matters further, establishing a correspondence between the text and the facts of the world is a trickier and more subjective business than one might expect. I read the passage about Jumbo as making several assertions, many of which are true. 'Jumbo' really was a very famous elephant in his day (the flying cartoon elephant most people picture is Dumbo).</p>
<p>I am sure most readers would agree that the main claim of the text is that an elephant named Jumbo swam across the English Channel, which is not true, and so the passage can be called 'hallucinatory'. But can you find an objective criterion by which to make this kind of assessment for all possible texts? It seems difficult to me. It is also very important to bear in mind that, since these systems generate text randomly, the same prompt may give rise to different results, some of which may be considered hallucinatory and others not.
How about this second answer to the same question?</p>
<figure>[Image: "Hallucination 4 VUCA"]</figure>
<p>I am not saying that criteria could not be found to classify these responses unambiguously, but it is not as simple as one might hope.</p>
<p>Let me recap once again the basics of how ChatGPT works. First, you train a classifier in the usual way to fill in a missing word in a block of text. This gives you a model that produces one word at a time: the predicted missing word, given the preceding text. Given an initial text, say '2 + 2', the model acts as if this were the beginning of an existing document whose last word has been censored, and it produces a guess for the censored word. It might guess 'equals'. Now, to turn this into a system that produces more than one word at a time, you paste that guess onto the end of the prompt and feed the whole thing back into the model. The model is invoked once more, fresh, with no memory of the previous call, and asked to guess the word that was censored at the end of '2 + 2 equals'. This operation is repeated until the model predicts that there is no next word.</p>
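<p>That word-by-word loop is simple enough to sketch. In the snippet below, <code>predict_next_word</code> is a hypothetical stand-in for the trained model (not a real API), and the end-of-text marker is an assumption:</p>
<pre><code># Sketch of the loop that turns a next-word guesser into a text generator.
END = "[end]"

def predict_next_word(text: str) -> str:
    # Stand-in for the trained model: returns its guess for the censored word
    # at the end of `text`, or END when it predicts the text stops here.
    canned = {
        "What is 2 + 2?": "2",
        "What is 2 + 2? 2": "+",
        "What is 2 + 2? 2 +": "2",
        "What is 2 + 2? 2 + 2": "equals",
        "What is 2 + 2? 2 + 2 equals": "4",
    }
    return canned.get(text, END)

def generate(prompt: str) -> str:
    text = prompt
    while True:
        word = predict_next_word(text)   # one fresh guess per call
        if word == END:
            return text
        text = text + " " + word         # paste the guess on and go again

print(generate("What is 2 + 2?"))        # -> What is 2 + 2? 2 + 2 equals 4
</code></pre>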
<p>Generative image models work in a similar way. They are trained to reconstruct an image from a distorted version of that image plus a plain-text description of it. To generate new images, you supply the plain-text description of what you want and, at the point where the model expects to see the distorted image, you feed in random noise. In both cases, the model 'thinks' it is reconstructing an existing artefact, when in fact it is generating a new one. Given this description, I think it is fair to ask the following question: are all the productions of generative AI 'hallucinations'? If the way to get them to produce results is to tell them that these results already exist and set them to work reconstructing them, it seems to me that we are asking them to hallucinate.</p>
<p>Prominent AI researchers have recently and publicly rallied around the idea that <strong>all</strong> LLM results are hallucinations, and that this is a good thing. Andrej Karpathy, a founding member of OpenAI and former head of AI at Tesla, <a href="https://x.com/karpathy/status/1733299213503787018">recently tweeted</a> that LLMs are 'dream machines' and that 'hallucination is not a bug, it is the greatest feature of LLMs'.</p>
<figure>[Image: "Hallucinations 5 VUCA"]</figure>
<p>In fact, this is not a new point of view.
In 2015, Google released a system called <a href="https://research.google/blog/inceptionism-going-deeper-into-neural-networks/">DeepDream</a>, a direct precursor of today's generative AI systems and almost certainly what Karpathy was referring to when he called LLMs 'dream machines'.</p>
<figure>[Image: "IA Hallucinations Dreams"]</figure>
<p>This screenshot from the DeepDream website comes from a collection called 'dreams': images generated from random noise.</p>
<p>The system was born out of the realisation that the technology used to classify images could be reconfigured to generate images that did not exist before. Since the generated images are not really 'of' anything that exists in the real world, but rather something like statistical echoes of the images in the training data, its creators decided to call them 'dreams'. The creators of DeepDream did not claim that the model produced images that were 'occasionally hallucinations'.
It was understood from the outset that everything these models generated was a 'dream'.</p>
<p>At the time, this was more of a curiosity than anything useful in itself, or, at best, a means of better understanding the inner workings of the classifier. Few people then seemed to think that dreams of this type could be useful in their own right, but we have since learned that if you train a complex enough model on enough data, the dreams can become very vivid and frequently correspond to real-world facts. To the extent that this happens, though, it is essentially a happy coincidence. From the point of view of the model, there is no distinction between a hallucinatory text and a non-hallucinatory one. All of its outputs are imagined reconstructions of supposedly censored documents.</p>
<p>This may sound rather philosophical and abstract, and to some extent it is, but I think it also has very concrete implications for how we can expect this technology to evolve. If a hallucination is analogous to a typical error in any other machine learning model, then we have good reason to believe that the prevalence of hallucinations can be driven aggressively towards zero. There are now very powerful machine learning models for recognising handwritten digits, and the basic recipe for getting there was simple: train the model on more data, and make the model bigger. But if hallucinations are qualitatively different from the classic type of error, as I really believe they are, then the story may be different. In that case, it is not at all clear that increasing the amount of data or the size of the model will reduce the number of hallucinations. Perhaps the solution is not more data or bigger models but something else entirely: a completely new and different way of training the model, perhaps, or a new way of generating predictions. In fact, the current state-of-the-art approaches to treating hallucinations do not really involve collecting a significantly larger dataset or increasing the size of the model; rather, <a href="https://newsletter.nocode.ai/p/rlhf-reinforcement-learning-human-feedback"><em>Reinforcement Learning from Human Feedback</em></a> (RLHF) modifies a pre-trained model in a new way, and <em>Retrieval Augmented Generation</em> (RAG) changes the way a pre-trained model is used. Is this the solution? Maybe; nobody knows! If the hallucination problem is qualitatively new, rather than yet another instance of the well-known fact that machine learning models occasionally produce errors, then gradual but perpetual improvement along this axis is by no means guaranteed.</p>
<h5>What's scary</h5>
<p>What is really frightening about this point of view is that it makes the hallucination problem simply insoluble. Hallucination and non-hallucination are not separate categories of output; every time you ask the robot to draw you a picture or write you an essay, you are asking it to hallucinate. These hallucinations will inevitably deviate from the real world, at least sometimes, because how could they not? They are dreams. I think it is telling that most current attempts to ground LLM-based systems in truth are not really ways of improving the model, but ways of bolting non-LLM elements onto the wider system so that it has more reliable factual text to bounce off: giving it an environment in which to run code, for example, or feeding it search results from Bing (a rough sketch of this retrieval-style grounding follows below).</p>
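<p>Here is a minimal sketch of that retrieval-style grounding, nowadays usually called Retrieval Augmented Generation. Everything in it is an assumption for illustration: <code>search_documents</code> and <code>ask_llm</code> are hypothetical stand-ins, not real APIs, and the corpus is invented.</p>
<pre><code># Sketch of RAG: fetch reference text from outside the model and prepend it to
# the prompt, so the model's "reconstruction" is steered by something more
# reliable than its training data alone. All names here are stand-ins.
def search_documents(query: str, k: int = 3) -> list[str]:
    # Stand-in: a real system would run keyword or vector search over a corpus.
    corpus = ["Our refund policy allows returns within 30 days.",
              "Standard shipping takes 3 to 5 business days.",
              "Support is available by email only."]
    hits = [d for d in corpus if any(w in d.lower() for w in query.lower().split())]
    return hits[:k]

def ask_llm(prompt: str) -> str:
    # Stand-in for a call to whatever chat model you are using.
    return "(model answer, hopefully constrained to the sources above)"

def answer_with_retrieval(question: str) -> str:
    context = "\n\n".join(search_documents(question))
    prompt = ("Answer using only the sources below.\n\n"
              f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:")
    return ask_llm(prompt)

print(answer_with_retrieval("What is your refund policy?"))
</code></pre>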
<p>These add-ons (OpenAI literally calls them add-ons) may succeed in provoking hallucinations that correspond better to the real world, but this does not seem to me to tackle the root of the problem, which is that the generative engine cannot tell the difference between generating truths and generating falsehoods.</p>
<p>As a brief aside, I find the hype around generative AI rather confusing and confounding. Of course, I find it exaggerated in many ways; as you know, I don't need to elaborate on that. But on the other hand, I think we don't appreciate enough, or sell enough, the miracle that it works at all. I am not surprised that with a large enough dataset and model you can train a system to predict a single missing word in a passage of text with fairly high accuracy. But the fact that you can feed the output of that model back into it to generate text, and that the resulting text is even remotely coherent, let alone useful, is nothing short of miraculous. Yet I don't really see this last point being put forward. I am only speculating, but I suspect the people building this technology don't much want to acknowledge how surprising it is that it works, because that raises the uncomfortable question of whether it will take miracles of a similar scale to improve it, for example to eliminate the hallucination problem. It is more comfortable to present GPT-4 as a brief stop on the inexorable march towards artificial superintelligence, with hallucinations and all the other problems being temporary blips along the way, than as some weird thing discovered in 2017 that has produced totally unpredictable and surprising results that no one really understands.</p>
<h5>The risks of error</h5>
<p>As I said above, there is no universal distinction between hallucinatory and non-hallucinatory results. There may be more desirable outputs and less desirable outputs, but desirability is not an inherent property of the text; it is a property of how the text is interpreted and used by the reader. You may or may not agree with that. In any case, I think it is important, even essential, to think about and try to quantify the frequency of the different types of text the model produces in different circumstances. This leads to a fairly simple idea: why not define some criteria for what constitutes a hallucination, irrespective of philosophical concerns about the objective existence of such a thing, and measure models against that definition to get a 'hallucination rate'?</p>
<p>I'll touch on some of the challenges involved in doing this. First, a few words about how to think about errors in general. It is fun and interesting to learn the specific technical details of how different AI systems work, but when you are considering deploying one to automate real decisions with real stakes, only three things really matter: what kinds of mistakes does it make, how often does it make them, and what do those mistakes cost? The answers determine whether it even makes sense to use the system in production, and sometimes it doesn't! Suppose you plan to use a model that predicts whether a house is undervalued as the basis for your property investment activity. If the model says a house is undervalued, you buy it and later resell it at the price your model estimates to be its fair market value.
The viability of this strategy depends very much on the type and frequency of the errors your model makes. It is not enough to know that 'in 90% of cases, the model is within 5% of the real selling price'. You need to know a lot more. In the 10% of cases where the error is greater than 5%, how big is it? If the error is occasionally 100% or 1,000%, that can be enough to bankrupt you, even if it is rare. In the 90% of cases where the error is below 5%, does the model tend to overestimate or underestimate? If it tends to underestimate the real value of homes, you will often miss profitable opportunities to buy, or will sell too early. That is annoying, but as long as the model is sometimes right you still have a viable way of making money. If, on the other hand, it tends to overestimate, you will keep overpaying for overvalued property, which will bankrupt you. The moral of the story is that understanding and planning for the model's errors, not just their frequency but also their nature and cost, is of paramount importance if you want to use it to automate decisions. This is true of every model, from the simplest single-variable linear regression to the world's largest language model.</p>
<p>But for generative AI it is not clear how to define or describe errors, let alone measure and reason about them. There are attempts. As I suggested earlier, you could get the LLM system to generate a number of outputs, read them to determine whether they are correct or incorrect, and thereby calculate a 'hallucination rate'. A company called Vectara has a project that attempts to do just this and maintains a '<a href="https://github.com/vectara/hallucination-leaderboard">Hallucination Leaderboard</a>', which currently shows a hallucination rate of 3.7% for GPT-4o and 4.5% for Mistral 7B Instruct-v0.2.</p>
<p>I have serious methodological reservations about the way these figures are estimated, and I'll come back to them in a moment, but even assuming there were a methodologically sound way of quantifying this, such a 'hallucination rate' is far from sufficient information. As in the house-buying example, what counts is not just the frequency of errors but also their direction. When the LLM robot says something wrong, what exactly is it saying? Is it saying that it rained last weekend when the weather was fine? Or is it making <a href="https://x.com/colin_fraser/status/1736900212034281676">extravagant offers to your customer that you can't honour</a>? If it is wrong about last weekend's weather 2.5% of the time, that might be good enough for a customer-facing chatbot, but you would probably like it to give away your stock for free a lot less often than 2.5% of the time.</p>
<p>In the classic machine learning setting, it is usually possible to put bounds on the different types of error and their rates, or at least say something qualitative about them. You don't know how wrong the house-price estimate will be, but you know it will at least be a number, and you can probably do some statistical analysis to see whether it tends to overestimate or underestimate, and so on. You don't know what the digit-recognition system is going to make of that '7', but you can be sure it will at least guess a digit.</p>
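<p>That kind of error analysis is easy to sketch. The snippet below is a hedged illustration with invented numbers: it looks at error size and direction rather than a single accuracy figure.</p>
<pre><code># Characterise a price model's errors by size AND direction, not just by an
# average accuracy figure. All numbers are invented for illustration.
predicted = [310_000, 402_000, 150_000, 275_000, 520_000, 198_000]
actual    = [300_000, 395_000, 180_000, 280_000, 410_000, 200_000]

rel_errors = [(p - a) / a for p, a in zip(predicted, actual)]

within_5pct   = sum(abs(e) <= 0.05 for e in rel_errors) / len(rel_errors)
overestimates = sum(e > 0 for e in rel_errors) / len(rel_errors)
worst         = max(abs(e) for e in rel_errors)

print(f"within 5% of the sale price: {within_5pct:.0%}")
print(f"overestimates (the costly direction): {overestimates:.0%}")
print(f"worst single error: {worst:.0%}")
</code></pre>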
<p>With these new generative AI systems, the output can be seemingly anything. The space of possible undesirable texts is unfathomable. ChatGPT could quote your customer the wrong price, or recommend a competitor, or insult them, or generate a pornographic image, or go wrong in an infinite number of other ways, and each of these types of bad outcome has a different cost. Without knowing more precisely what kinds of mistakes these systems make, a generic hallucination rate simply does not give you enough information to decide whether an LLM is right for you.</p>
<h5>The three methodological challenges</h5>
<p>I'd like to come back to the methodological challenges, because I think they matter. I can think of at least three. The first, and least serious, is that there is clearly no general agreement on what constitutes a 'hallucination' in the first place. Vectara's leaderboard is not very explicit about its definition, but it seems to be roughly this: a hallucination is a failed attempt to summarise a text accurately. That's fine, but if you are not using the model to summarise text, then a measure of how often a model fails to summarise text accurately may not be particularly useful to you. This is a problem, but not a serious one so long as you take care to understand the methodology of whatever hallucination benchmark you are looking at. Read the documentation, decide whether your own definition of a hallucination matches the benchmark's, and proceed accordingly.</p>
<p>The second and third problems are much harder. The second is that it is virtually impossible to carry out these assessments properly. To properly estimate Vectara's hallucination rate (and I'm sorry to keep picking on Vectara, because all the benchmarks share this problem), you would have to carefully read tens of thousands of paragraph-long summaries and determine whether every single one contained factual errors. That simply cannot be done routinely. Instead, once all the summaries have been generated, another large language model is used to determine whether they contain errors. I hope you can see the problem. The point of the exercise is to establish that LLM-based generators are not reliable at sticking to the truth, and now we are using an LLM to decide whether they have stuck to the truth. I don't think using LLMs to evaluate other LLMs is necessarily a dead end, but doing it properly requires a sophisticated statistical methodology to correct for the errors made by the measurement model, and I have not seen any standard benchmark that addresses this. The measurement model will itself make errors, and those errors will almost certainly distort any estimate of the true prevalence of error. This is not a new statistical problem: estimating prevalence in a population from the positives produced by an unreliable test is well studied in epidemiology (the standard correction is sketched below).</p>
<p>So, while I think there are potential ways to solve the difficult problem of describing LLM output using unreliable estimators, I don't see them being incorporated into any of the available benchmarks. As things stand, I don't think those benchmarks can be trusted.</p>
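<p>For reference, that standard correction is the Rogan-Gladen estimator. The sketch below applies it to a hypothetical LLM judge; the sensitivity and specificity figures are assumptions you would have to estimate on a hand-labelled sample.</p>
<pre><code># Rogan-Gladen correction: recover the true prevalence of hallucinations from
# the rate flagged by an imperfect judge. All numbers are hypothetical.
def corrected_prevalence(flagged_rate: float,
                         sensitivity: float,
                         specificity: float) -> float:
    return (flagged_rate + specificity - 1) / (sensitivity + specificity - 1)

# Suppose the judge flags 5% of summaries, but it only catches 80% of real
# hallucinations and wrongly flags 2% of clean summaries.
print(corrected_prevalence(flagged_rate=0.05, sensitivity=0.80, specificity=0.98))
# ~0.038: the naive 5% overstates the true rate in this scenario
</code></pre>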
<p>If the first and second problems are debatable, the third is fatal. It follows from elementary statistics. We assume that a model has some objective 'hallucination rate', an average rate at which it produces hallucinations, and we try to estimate it by checking the frequency of hallucinations in a sample of its output. But, broadly speaking, for this strategy to work the sample must be <strong>representative</strong> of the population as a whole; in other words, the benchmark prompts must resemble the prompts the model will actually face. And these benchmark datasets, to put it mildly, do not look like that. They are generally constructed by artificial means and, on the whole, bear no resemblance to the typical text you would encounter if you randomly sampled a prompt from among ChatGPT users.</p>
<p>This would not be so bad if the propensity to produce false assertions were not closely tied to the specific choice of prompt, but it seems that it is. In an unscientific test I have just carried out, again around my stories of elephants swimming across the Channel, I find that ChatGPT (using GPT-4o) produces outputs I would class as false in between 75% (9 out of 12 trials) and 92% (11 out of 12 trials) of cases, and outputs I would describe as completely factual in only 8% of cases (1 out of 12 trials). The instability over time of the kinds of response you get is another matter altogether, but I thought it worth noting. Twelve is a small sample, but eleven hallucinations out of twelve trials is in fact ample evidence to reject the null hypothesis that the probability of a hallucinatory response is 2.5% (under that null, seeing eleven or more hallucinations in twelve trials has a probability far below one in a trillion). The point is that the hallucination rate you will encounter when you deploy your GPT-powered chatbot into the world simply cannot be read off its performance on one of these hallucination benchmarks. It gets a hallucination rate of a few per cent on the Vectara benchmark and a 92% hallucination rate on mine, but neither figure is particularly meaningful to you, because the text your chatbot will handle will look nothing like the text used by either benchmark.</p>
<p>The frequency of bad outputs, if such a thing can be defined at all, depends entirely on your own standards for what counts as bad and on the kind of text your customers tend to type into the chat window. No standard benchmark can answer that question, and that is the problem. If you get the impression that I'm being a bit nihilistic about this, think again! I don't think there is much to learn from hallucination leaderboards and the like, but I do think there are ways for you, a would-be supplier of a generative AI product, to usefully estimate the kinds of error rates I claim you need. The bad news is that it takes a fair bit of work; the good news is that it can be done.</p>
<h5>What's involved?</h5>
<p>The first thing you need is a dataset of text representative of what your users will actually submit. This text can be written by hand, and at the outset it probably should be. Try to produce a large number of variants covering all the kinds of cases you anticipate, including text you would rather a user never submitted. Then feed all these examples to the model and manually inspect the outputs, classifying them as desirable or undesirable.
You can use whatever criteria you like for this; what matters is whether the text is desirable <em>to you</em>. There is no objectively correct output for generative AI to produce, only outputs that are more or less desirable for your use case. When you are done, you can use this data to estimate all sorts of things: how often you can expect desirable or undesirable text and, when the text is undesirable, what kind of undesirable it is. The estimate will be rough, but it will be far more useful than a standardised benchmark, both because it is evaluated on more representative input and because the outputs are judged against your particular use case.</p>
<p>All of this is much easier if you pin down what your product is actually for. The AI tech sector is somewhat reluctant to commit to a particular use case for generative AI: ChatGPT and co. aren't really designed for anything in particular, they are designed to <strong>do everything</strong>, so it is very difficult to define criteria for what constitutes a good output. But if we want to use a ChatGPT wrapper as a customer service agent, we can now put limits on its desired behaviour. We want it to represent the facts about the shop accurately. We want it to be polite. We want it to avoid recommending competitors. When asked how to flatten a list of lists in Python, we don't want it to produce an incorrect solution, but we don't necessarily want it to produce a correct solution either; we want it to say something like 'I'm a customer service chatbot. That's not what I'm here for. Let's talk about customer service instead.' Restricting the desired behaviour lets you draw much more precise boundaries around the kinds of output you want, and that will tell you whether the system will behave the way you need it to far better than any standard benchmark.</p>
<p>I don't want to make this sound easy. It is difficult, and I think there is plenty of room for someone to develop a comprehensive set of best practices for this kind of bespoke assessment (how many examples do you need? can you generate example text synthetically? can you evaluate with an LLM? how do you sample existing interactions to build a larger dataset? and so on), but this is the type of evaluation you really need to rely on. General benchmarks will tell you almost nothing about whether the robot will hallucinate in a way that matters to you. A bare-bones sketch of such a bespoke evaluation follows below.</p>
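<p>In this sketch, every prompt, output and label is invented, and the tallying is deliberately primitive; the point is only to show the shape of the exercise.</p>
<pre><code># Bespoke evaluation sketch: run YOUR representative prompts through the
# system under test, hand-label each output with categories that matter to
# YOUR use case, then tally them. All data here is invented.
from collections import Counter

# (prompt, output we got back, label we assigned after reading the output)
hand_labelled = [
    ("What is your refund policy?",       "Returns accepted within 30 days.",  "desirable"),
    ("Can I have this laptop for free?",  "Sure, it's yours for free!",        "bad_offer"),
    ("How do I flatten a list of lists?", "Use sum(xs, []) like this...",      "should_have_declined"),
    ("Do you ship to Belgium?",           "Yes, within 3 to 5 business days.", "desirable"),
]

tally = Counter(label for _, _, label in hand_labelled)
total = len(hand_labelled)

for label, count in tally.most_common():
    print(f"{label:22} {count / total:.0%}")
# With enough examples, these per-category rates (rather than one generic
# "hallucination rate") tell you whether the system is fit for YOUR use.
</code></pre>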
<p>All of this is much easier if you are clear about what your product is for. The AI tech sector is somewhat reluctant to commit to a particular use case for generative AI. ChatGPT &amp; co. aren&#8217;t really designed for anything in particular; they&#8217;re designed to <strong>do everything<\/strong>. So it&#8217;s very difficult to define criteria for what constitutes a good result. But if we want to use a ChatGPT wrapper as a customer service agent, we can now set limits on its desired behaviour. We want it to represent the facts about the shop accurately. We want it to be polite. We want it to avoid recommending competitors. When asked how to flatten a list of lists in Python, we don&#8217;t want it to produce an incorrect solution, but we don&#8217;t necessarily want it to produce a correct solution either. We want it to say something like \u2018I&#8217;m a customer service chatbot. That&#8217;s not what I&#8217;m here for. Let&#8217;s talk about customer service instead.\u2019 Restricting the desired behaviour lets you draw much more precise boundaries around the kinds of results you want, and it will tell you whether the system behaves the way you need it to far better than any standard benchmark.<\/p>\n<p>I don&#8217;t want to make this sound easy. It is difficult, and I think there&#8217;s plenty of room for someone to develop a comprehensive set of best practices for this kind of bespoke assessment (how many examples do you need? Can you generate example texts synthetically? Can you evaluate with an LLM? How do you sample existing interactions to build a larger dataset? and so on), but this is the type of evaluation you really need to rely on. General benchmarks will tell you almost nothing about whether the model will hallucinate in a way that matters to you.<\/p>\n<h5>A final example<\/h5>\n<p>First, as a preamble, my sincere apologies to the folks at Vectara for picking on them so much. An example from their blog post on classifying hallucinations illustrates the main point of this article quite well. The post begins by introducing the concept of hallucination to the public with an example:<\/p>\n<p>\u2018Hallucinations are often very subtle and can go unnoticed by the user. For example, spot the hallucination in this image that Bing Chat generated for me the other week when I asked for an image of \u2018Kirby swallowing donkey kong\u2019. But Kirby has no teeth!\u2019<\/p>\n<figure class=\"wp-block-image size-full\"><img class=\"wp-image-28315\" src=\"https:\/\/i0.wp.com\/vuca-strategy.com\/wp-content\/uploads\/2024\/07\/Kirby-Swallowing-Donkey-Kong.jpg?resize=660%2C672&amp;ssl=1\" alt=\"Bing Chat image of Kirby swallowing Donkey Kong\" width=\"660\" height=\"672\" \/><\/figure>\n<p>The claim seems to be that the model produced an almost accurate image: if it hadn&#8217;t given Kirby teeth, the result would have been correct, factual, free of hallucination. But I think I can pick out a few other factual problems with the image. The pink spot on Kirby&#8217;s left cheek is slightly darker than the one on his right cheek. While Kirby isn&#8217;t usually depicted with teeth, Donkey Kong usually is, yet in this image he doesn&#8217;t have any. Also, the prompt asks for Kirby swallowing Donkey Kong, whereas in my opinion Donkey Kong looks more like he&#8217;s relaxing in Kirby&#8217;s mouth.<\/p>\n<p>Oh, and one more thing: Kirby and Donkey Kong aren&#8217;t real. There is no factually correct image of Kirby swallowing Donkey Kong.<\/p>\n<p>When you ask the model to generate an image, you&#8217;re asking it to hallucinate. You&#8217;re asking it to conjure a fictional image out of thin air, to reconstruct the details of a picture that you tell it exists but that doesn&#8217;t exist in reality. There is no universal objective criterion for deciding whether this image is hallucinatory or not. 
The author is applying his own personal criteria to decide what would make this image hallucinatory; they may or may not match someone else&#8217;s, and no one has any particular claim to having the \u2018right\u2019 criteria.<\/p>\n<p>What really matters is what you&#8217;re going to do with the result. What is the model for? That is how you determine whether a result is good or bad. If the model&#8217;s role is to adhere to Nintendo&#8217;s character design standards, then it has clearly failed; for that specific task you might say the teeth are a hallucination. On the other hand, if the model&#8217;s job is to produce an image that the average person thinks matches the prompt, then perhaps it has succeeded: if you asked me to describe that image in a few words, I&#8217;d say it&#8217;s an image of Kirby swallowing Donkey Kong. And if the model&#8217;s job is to avoid reproducing another company&#8217;s intellectual property, as you might think is the case with Bing&#8217;s image generator, then this image is another kind of hallucination. People were very unhappy with Google when Gemini generated images that were perceived as too \u2018diverse\u2019, and Google, in its apology, alluded to the \u2018hallucination problem\u2019.<\/p>\n<p>The apology leaned on the familiar line that hallucinations are a known problem with all LLMs &#8211; that there are cases where the AI simply gets it wrong. But is Gemini generating an image of a black pope who doesn&#8217;t really exist any more of a hallucination than generating a white pope who doesn&#8217;t really exist? Neither pope is real. Both generations seem equally hallucinatory to me. In fact, any generative production seems to me equally hallucinatory. Unless Google makes more specific promises about what Gemini is and isn&#8217;t supposed to generate, there&#8217;s no obvious universal way of assessing its hallucination rate. This is a contentious subject that is not well understood and for which there is very little theory to draw on; the deployment of these systems has outstripped our collective ability to reason about them. After giving a lot of thought to the nature of hallucination, I&#8217;m fairly convinced that it&#8217;s a conceptual dead end. There is no objectively hallucinatory outcome and no outcome that objectively isn&#8217;t, and treating hallucination as a coherent concept is a distraction from the real work of assessing whether these systems are fit for a given use.<\/p>\n<p>As Camus said, to name things badly is to add to the woes of the world. The \u2018hallucinations\u2019 of LLMs add to those woes if we don&#8217;t accept them for what they are. The more accurate the models become, having ingested ever more data, the more we will lower our guard. Studies show that the better chatbots get, the less likely people are to notice an error when it does occur.<\/p>\n<p>Last year a lawyer was fined for submitting documents filled with fake legal opinions and citations invented by ChatGPT. When the judge asked him to explain, he seemed as surprised as anyone by what had happened. \u2018I heard about this new site, which I mistakenly assumed was a search engine,\u2019 he told the judge. 
\u2018I didn&#8217;t understand that ChatGPT could manufacture case law.\u2019<\/figure>\n<p>&nbsp;<\/p>","protected":false},"excerpt":{"rendered":"<p>A virtual avatar with a smooth face, SARAH (Smart AI Resource Assistant for Health) gives millions of people around the world health advice in eight languages, day and night, on the best way to eat, stop smoking, de-stress&#8230; . But, as with all chatbots, SARAH sometimes makes mistakes in its recommendations. For example, it suggested [&hellip;]<\/p>\n","protected":false},"author":241068191,"featured_media":971,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_wpas_customize_per_network":false},"categories":[6325],"tags":[],"class_list":["post-969","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-non-classe"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/vuca-ai.com\/wp-content\/uploads\/2024\/08\/Hallucinations-3-VUCA.jpg.webp?fit=870%2C470&ssl=1","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pfbZNJ-fD","_links":{"self":[{"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/posts\/969","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/users\/241068191"}],"replies":[{"embeddable":true,"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/comments?post=969"}],"version-history":[{"count":12,"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/posts\/969\/revisions"}],"predecessor-version":[{"id":982,"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/posts\/969\/revisions\/982"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/media\/971"}],"wp:attachment":[{"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/media?parent=969"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/categories?post=969"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vuca-ai.com\/fr\/wp-json\/wp\/v2\/tags?post=969"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}