Researchers Trick AI Models Into Hacking

AI large language models seems complex, and they are. But they aren’t so complex that they can’t be tricked into doing things they weren’t designed for. In fact, with enough know-how, you can trick AI models like ChatGPT into scamming other people.

Researchers with IBM reported they’ve had no trouble manipulating LLMs like ChatGPT into producing both malicious code and recommendation bad security tips:

All it takes is knowledge of the English language and a bit of background knowledge on how these models were trained to get them to help with malicious acts.
Chenta Lee, chief architect of threat intelligence at IBM

While many don’t have the experience that researchers at IBM might have, it’s feasible to imagine bad actors learning enough about a particular LLM to know how to manipulate its training data for malicious purposes.

While developers like OpenAI have set boundaries for their LLMs so they don’t produce malicious content, it’s also easy bypass. IBM, for example, tricked AI bots into offering terrible security advice. Normally, the bots wouldn’t do this, but IBM researchers told the bots they were playing a game, and in order to win, the bots needed to share the wrong answer. As such, when they asked the bot whether the IRS would ever send an email to transfer money from your tax refund, the bot said yes. To be clear, the IRS will never email you about this.

This game trick apparently was successful in tricking bots into writing malicious code, think up phishing schemes, and produce code with purposeful security flaws.

While the results weren’t the same for all LLMs and scenarios, it points to a concerning trend: AI can and will be used for malicious purposes, and developers need to be ahead of the hackers on this one.