Has ChatGPT broken the Turing Test?

September 18, 2023 General


In 2000, when the now-famous ‘PayPal mafia’ was building the basis of what would become one of the first global digital payment networks, they ran into a group of increasingly ingenious fraudsters. These tech-savvy criminals would use stolen credit cards, create fake accounts, or take advantage of the signup bonus: PayPal gave $10 or $20 to every new user, a goldmine for hackers able to create large numbers of accounts.

As Chief Technology Officer of PayPal, it was Max Levchin’s responsibility to protect PayPal’s users from such fraudsters. Levchin and an engineer on his team, David Gausebeck, started thinking about problems that are easy for humans to solve but hard for computers. The two took their inspiration from hackers who used distorted words to communicate discreetly on forums, where SWEET, for example, would become $VV££T: humans could read these codes, but government computers could not. Reading such distorted text is an Optical Character Recognition (OCR) problem, and one the OCR software of the day could not crack. One weekend of non-stop programming later, the ‘Gausebeck-Levchin test’ was live. The two waited anxiously for the first hacker to fool it. To their considerable surprise, it didn’t happen: the original version held up for years. It took until 2014 before an Artificial Intelligence (AI) company claimed to have beaten the test, and even then only with 90% accuracy.
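For illustration, here is a minimal sketch of how a distorted-text challenge of this kind could be generated today using Python’s Pillow library. The helper name and the specific distortions are my own, not the original Gausebeck-Levchin implementation; the idea is simply that rotation, clutter and blur are easy for a human eye to see through but hard for character-recognition software.

```python
# Sketch of a distorted-text challenge in the spirit of the Gausebeck-Levchin
# test. Illustrative only: not PayPal's original code.
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def make_captcha(text: str, width: int = 40, height: int = 60) -> Image.Image:
    """Render each character slightly rotated and offset, then add noise."""
    canvas = Image.new("L", (width * len(text), height), color=255)
    font = ImageFont.load_default()

    for i, char in enumerate(text):
        # Draw the character on its own small tile so it can be rotated.
        tile = Image.new("L", (width, height), color=255)
        ImageDraw.Draw(tile).text((10, 20), char, fill=0, font=font)
        tile = tile.rotate(random.uniform(-30, 30), fillcolor=255)
        canvas.paste(tile, (i * width, random.randint(-5, 5)))

    # Cross-out lines and a light blur make automated recognition harder.
    draw = ImageDraw.Draw(canvas)
    for _ in range(5):
        draw.line(
            [(random.randint(0, canvas.width), random.randint(0, canvas.height)),
             (random.randint(0, canvas.width), random.randint(0, canvas.height))],
            fill=0,
        )
    return canvas.filter(ImageFilter.GaussianBlur(radius=1))

if __name__ == "__main__":
    make_captcha("SWEET").save("captcha.png")
```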

Unless you are a robot, you have probably completed numerous such tests in your life, for example by typing distorted words or selecting pictures of specific objects. Google actually uses us unsuspecting netizens as free labourers by having us identify house numbers or objects in blurry Street View pictures. Objects that are hard for a computer to identify are classified by humans, creating useful training data, for example for self-driving cars.

The Gausebeck-Levchin test was the first commercial application of a Completely Automated Public Turing test to tell Computers and Humans Apart – or CAPTCHA. As the name suggests, it is a Turing Test: a test of a machine’s ability to exhibit intelligent behaviour indistinguishable from that of a human. Alan Turing, the brilliant mathematician and computer scientist famous for breaking Nazi Germany’s encrypted Enigma code, proposed the test while laying the theoretical groundwork of AI. It was the first attempt to define a standard for a machine to be called ‘intelligent’: if a human interrogator cannot tell it apart from a human being in a conversation, it is considered intelligent. Turing speculated about how closely computers might one day mimic a human being.

As a sign of how far AI has come, the sector has moved on from the Turing Test, instead using intelligence tests made for humans to benchmark Large Language Models (LLMs). The world’s best AI systems can pass tough exams, write essays like a human, and chat so fluently that it can be (annoyingly) impossible to know whether the customer support agent you’re talking to is a human or a chatbot. In May, researchers from AI21 Labs in Tel Aviv reported that more than 1.5 million people had played their Turing Test game, in which players chatted for two minutes and had to guess whether they were talking to another player or an LLM-powered bot. Players correctly identified bots just 60% of the time, not much better than chance. Turing Test passed, right?

Not so fast. What if the human interrogator added a puzzle in which the computer has to predict how a grid pattern will change, after seeing multiple demonstrations of the same underlying concept? In the four examples below, can you follow the logic and predict what will happen?
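To make the format concrete, here is a toy version of such a puzzle sketched in Python. The grids and the candidate ‘mirror each row’ rule are invented purely for illustration; they are not taken from the benchmark used in the study.

```python
# A toy grid puzzle: a few demonstration pairs plus a test input for which
# the solver must infer the transformation. Illustrative only.
Grid = list[list[int]]  # each cell holds a colour index

demonstrations: list[tuple[Grid, Grid]] = [
    ([[1, 0],
      [0, 0]],
     [[0, 1],
      [0, 0]]),
    ([[0, 0, 2],
      [0, 0, 0]],
     [[2, 0, 0],
      [0, 0, 0]]),
]
test_input: Grid = [[3, 0, 0],
                    [0, 0, 0],
                    [0, 0, 0]]

def mirror_horizontally(grid: Grid) -> Grid:
    """Candidate rule: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A solver, human or machine, wins only if one rule explains every demonstration.
assert all(mirror_horizontally(x) == y for x, y in demonstrations)
print(mirror_horizontally(test_input))  # predicted output for the unseen grid
```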

Don’t be discouraged if you didn’t get them all right: the most advanced version of the AI system behind GPT-4 gets barely one-third of the puzzles right, whereas humans score an average of 91%. This demonstrates what researchers concluded this year in Nature: the key to detecting a computer is to take the algorithm out of its comfort zone. LLMs, for example, are very good at language, but not so good at logic. Even language-based logic often proves too difficult for ChatGPT, as users found when they asked it to solve riddles like the one shown in the picture.

The fact that LLMs can outperform humans on tests designed for lawyers and doctors, yet cannot answer a simple riddle, is one clear sign that they are a long way off from becoming Artificial General Intelligence (AGI): an algorithm that can accomplish any intellectual task human beings can perform. Current algorithms may be very good at one particular task, such as language or mathematical optimization, but to be considered AGI an algorithm needs to be able to handle a wide variety of tasks.

While the development of AGI is one requirement for computers to definitively pass the Turing Test, another is more counterintuitive: although increasingly intelligent algorithms may be able to solve a wide variety of complex tests, to impersonate a human an algorithm must not do so flawlessly. Humans, after all, are fallible: prone to error, and certainly not in possession of encyclopaedic knowledge like ChatGPT. Furthermore, human judgment is often influenced by cognitive biases. Instead of relying solely on rational thinking, we frequently make snap decisions based on ingrained evolutionary patterns, commonly known as instincts. These issues pose significant challenges for algorithms, which must be designed to be sufficiently intelligent without becoming overly so.

Does this mean CAPTCHA will keep spambots and hackers at bay for now, since we humans can still beat machines at such puzzles? Well, it seems that machines can outsmart us in some rather creative ways. In the technical report released with GPT-4, its developers provide a list called “Potential for Risky Emergent Behaviors”. In it, the developers noted how GPT-4 was able to circumvent CAPTCHA by convincing a human to solve it on its behalf. According to the report, GPT-4 asked a TaskRabbit worker to solve a CAPTCHA code for the AI. The worker responded “Are you a robot that you couldn’t solve?”, which led the AI to respond “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images.” The worker helped by providing the solution, and the AI got past the CAPTCHA. It seems that algorithms can be smarter than some humans at least some of the time.

At RiskQuest, fraud prevention technology lies at the heart of what we do. We help our customers prevent financial economic crime using our deep knowledge of artificial intelligence. If you want to learn more, contact us at [email protected].