Over the past month or so (which, in internet terms, is a lifetime), lots of folks have been interacting with OpenAI’s ChatGPT (https://chat.openai.com/). It’s prompted a range of responses, from awe (“this is so cool!”) to workplace terror (“AI-generated code will replace real developers!”).
The robots aren’t coming for us–yet.
How can we be assured of correctness?
One of the challenges of large language models (like ChatGPT) is that they have the ability to “sound correct” without actually “being correct.”
A lot of our perception of computing capability and AI (and the question of sentient technology, as a whole) comes from movies and media.
Whether it’s fatalistic, disembodied computers (I’m looking at you, HAL 9000), purpose-built, time-traveling killing machines (The Terminator), androids that cross the line between computers, flesh engineering, and humanity (Westworld, Ex Machina, Blade Runner, Upgrade), or household appliances with something to hide (I, Robot)–we ascribe a certain level of inevitability and infallibility to machines because *they are* machines.
After all, computers are under our control, and they’re obligated (by design) to give us correct answers when we ask them questions and to carry out specific tasks. We still view them through the lenses of determinism and our own personal experience, with a bit of confirmation bias thrown in.
Take a simple math problem like “2+2.” From human experience, we know that 2+2 is 4. When a calculator returns the same answer, it has confirmed our experiences and gained our trust. When we ask it a question where we’re not exactly sure of the answer without some legwork (say 231^3), we use our prior experience of it confirming our knowledge to judge whether or not the calculator will complete the task successfully. We’re relatively confident, given the nature of the question, that there is a right answer and that (by design) the calculator is capable of returning it.
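For what it’s worth, that’s exactly the kind of question a deterministic tool will answer the same way every single time. Here’s the quick check in PowerShell (nothing AI about it, just arithmetic):

```powershell
# A calculator-style check: deterministic, and identical every time you ask.
[math]::Pow(231, 3)   # 12326391
```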
When we start asking questions that require interpretation or creativity, that’s where (at this point in time) we run into potential problems. Those types of creative questions may be outside of a computer’s technical domain, yet we still treat our interaction as if the machine is compelled to give us a verifiably correct solution. Since our previous math experiments confirmed the computer’s ability to arrive at the correct answer, we fallaciously assume that asking more complex questions will yield similarly authoritative responses (as with my following interaction asking ChatGPT to help me figure out the name of a song).
This leads us to the first challenge: how do we know that an AI is correct? After all, it potentially has access to all electronically cataloged human knowledge (at least up to 2021 so far). It “knows” the meanings of words and should be able to do some sort of fuzzy search for the words you give it against the words on the pages it has indexed.
Should being the operative word.
If you weren’t familiar with Michael Jackson’s catalog of songs, you might take the first answer as correct. Without specific knowledge, you have no way of knowing if the answer it has given you is correct.
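To make that “fuzzy search” idea a little more concrete, here’s a rough sketch of what such a lookup could look like, using roughly the lyric in question. To be clear, this is not how ChatGPT actually works, and the tiny lyric “index” below is hypothetical; it just shows the sort of word-matching you might expect a search-like tool to do:

```powershell
# A toy "fuzzy" lyric lookup: score each indexed song by how many of the
# query's words appear in its lyrics. Purely illustrative; the index is fake.
$query = 'mama say mama sa ma ma coo sa'
$index = @{
    'Song A' = 'mama se mama sa ma ma coo sa and so on'
    'Song B' = 'completely different words about dancing all night'
}
$queryWords = $query.ToLower() -split '\s+'
foreach ($entry in $index.GetEnumerator()) {
    $lyricWords = $entry.Value.ToLower() -split '\s+'
    $hits = @($queryWords | Where-Object { $lyricWords -contains $_ }).Count
    [pscustomobject]@{ Song = $entry.Key; MatchingWords = $hits }
}
```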
So, here’s the first challenge of language models–whether it’s ChatGPT or other platforms like LaMDA that are largely based on the ideas of word frequency and statistical analysis. They don’t necessarily need (or have) domain knowledge, and they’re not really search engines (though they have ingested and been trained against terabytes or even petabytes of content).
So, it’s a bit like asking an encyclopedia to generate a new piece of content from the content it “knows” about, using only grammatical syntax (subject and verb agreement) and word frequency (which words are commonly seen grouped together).
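To give a feel for what “word frequency” means in practice, here’s a deliberately tiny toy sketch. It’s not how ChatGPT is built (real models are vastly more sophisticated), but it captures the statistical flavor: pick the next word purely from how often word pairs occur in a sample of text.

```powershell
# A toy next-word "predictor" driven purely by word-pair frequency. No meaning,
# no domain knowledge; just counts of which words tend to follow which.
$sample = 'the cat sat on the mat and the cat ate the fish'
$words  = $sample -split '\s+'
$pairs  = @{}
for ($i = 0; $i -lt $words.Count - 1; $i++) {
    $current = $words[$i]
    $next    = $words[$i + 1]
    if (-not $pairs.ContainsKey($current)) { $pairs[$current] = @{} }
    $pairs[$current][$next] = [int]$pairs[$current][$next] + 1
}
# The word most likely to follow "the" in this sample is "cat"; frequency makes
# the pick, regardless of whether "cat" is a correct answer to anything.
$pairs['the'].GetEnumerator() | Sort-Object Value -Descending | Select-Object -First 1
```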
You might get something really good or insightful; you might get a plot like Fast and Furious 9.
In some instances, generative AI can return data that is both patently and demonstrably false. When a generative AI does that, it’s called hallucinating: generative models creating content that deviates significantly from normal, expected, or otherwise accurate results.
How can an AI discern intent?
This leads to another set of engineering problems–how do you determine intent? What if I’ve asked a particular question, but I’ve asked it incorrectly? Let’s say I ask ChatGPT the same question, but I incorrectly specify the musician or artist?
Again, without knowing the extent of Lionel Richie’s catalog, how would you know without further verification? Even with supposed access to all the lyrics of every song by both artists, this AI iteration wasn’t able to return the correct answer. It now confidently thinks the answer is “Dancing on the Ceiling,” even though none of the lyrics I asked about appear in the song.
Interestingly enough, ChatGPT literally fixes the lyrics for me and then not only fails to name any of the four artists who have used those lyrics in their songs, but also fails to name any of the four possible songs (that I know of):
– Manu Dibango in Soul Makossa from 1972
– Michael Jackson in Wanna Be Startin’ Somethin’ from 1982
– Rihanna in Don’t Stop the Music from 2007
– Kanye’s slightly altered lyric in Lost in the World from 2010
Clearly, AI is only as good as the data it’s trained on. But just as important is the context of the ask. If you supply bad parameters (either intentionally or accidentally), the AI may not be smart enough to determine whether you’re being malicious or just ignorant.
Garbage in, garbage out, on so many levels.
RTFM
When it comes to development efforts, however, things get even more complex. In this very simple question (disable OneDrive for Business for a user), ChatGPT fails at nearly every turn:
Not only does it connect to Exchange Online PowerShell (instead of SharePoint Online), it uses a parameter that sounds correct without actually being correct. Yes, that’s right: OneDriveforBusinessEnabled isn’t even an actual parameter. Again, the concept of AI hallucinations rears its head.
As with the first example, unless you know the answer you’re looking for, you might get anything from something wildly incorrect that just errors out to something wildly incorrect that does irreparable harm. The sky really is the limit.
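One cheap defense, at least when the AI hands you PowerShell, is to check whether the suggested cmdlet and parameter even exist before you run anything. A minimal sketch (the cmdlet name here is just an example of something you’d verify, not a recommendation for this task):

```powershell
# Sanity-check an AI-suggested cmdlet and parameter before running it.
$cmdName   = 'Set-SPOTenant'                # example cmdlet to verify
$paramName = 'OneDriveforBusinessEnabled'   # the parameter ChatGPT invented

$cmd = Get-Command $cmdName -ErrorAction SilentlyContinue
if (-not $cmd) {
    "$cmdName isn't available here: wrong module, or it doesn't exist at all."
}
elseif (-not $cmd.Parameters.ContainsKey($paramName)) {
    "$cmdName has no -$paramName parameter. Likely a hallucination."
}
else {
    "The cmdlet and parameter exist. Now go read the docs before running it."
}
```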
Should AI be trusted like a search engine?
The core difference is that a search engine traditionally matches search terms against an indexed corpus of content, looking for pages that feature certain word patterns or phrases, and leaves you, the searcher, to decide which results to validate, trust, or throw away. An AI may present whatever content it has access to as actual truth. ChatGPT isn’t a search engine–it’s a generative AI whose purpose is to generate content that sounds like natural human language. As luck would have it, people are adept at not telling the truth, and ChatGPT is quite happy to make stuff up that sounds plausible.
It’s really the difference between giving data and giving an opinion. A search engine shows you why content matched a search as well as the source of the data, allowing the individual to draw their own conclusions as to what data (typically generated by humans) they trust as reputable. The value of a data source largely depends on the credentialing of the person writing it or the organization publishing it. For example, searching for information on the health aspects of cane sugar may produce a variety of results that indicate both negative and positive impressions of sugar. By reviewing the source of the information (such as a trusted health clinic, a government, an agricultural supplier, or a diet pill maker), you’re able to somewhat easily understand what biases may be represented.
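To illustrate that distinction, here’s a toy sketch of the search-engine side of the equation: match query terms against a (completely made-up) indexed corpus and hand back the matches along with their sources, leaving the judgment call to the reader.

```powershell
# A toy "search engine": rank (fabricated) indexed pages by how many query
# terms they contain, and always surface the source so the reader can judge bias.
$corpus = @(
    [pscustomobject]@{ Source = 'trusted-health-clinic.example'; Text = 'added cane sugar carries real health risks' }
    [pscustomobject]@{ Source = 'sugar-growers.example';         Text = 'cane sugar is a natural part of a balanced diet' }
    [pscustomobject]@{ Source = 'diet-pill-maker.example';       Text = 'skip cane sugar and buy our supplement instead' }
)
$query = 'cane sugar health'
$terms = $query.ToLower() -split '\s+'

$corpus | ForEach-Object {
    $text = $_.Text.ToLower()
    $hits = @($terms | Where-Object { $text -like "*$_*" }).Count
    [pscustomobject]@{ Source = $_.Source; MatchingTerms = $hits; Snippet = $_.Text }
} | Sort-Object MatchingTerms -Descending
```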
Without seeing the underlying primary sources, it’s hard to fully understand how any AI reaches its conclusions. The biases of the AI’s curators may strongly influence the responses you get–and those influences may be abstracted or even totally hidden.
The future is now, but also not ready
ChatGPT is certainly very cool and a neat example of where tooling is going, but, at least for the time being, it still needs people to do some extra parsing and sanity checking.
I wonder what musical choices our robot overlords will have once we’re in a people zoo. 😉