ChatGPT spills secrets in novel PoC attack

Jai Vijayan, Dark Reading

March 14, 2024

A team of researchers from Google DeepMind, OpenAI, ETH Zurich, McGill University, and the University of Washington has developed a new attack for extracting key architectural information from proprietary large language models (LLMs) such as ChatGPT and Google PaLM-2.

The research shows how adversaries can extract supposedly hidden data from an LLM-enabled chatbot in order to duplicate or steal its functionality entirely. The attack — described in a technical report released this week — is one of several over the past year that have highlighted weaknesses the makers of AI tools still need to address in their technologies, even as adoption of their products soars.

Extracting Hidden Data

As the researchers behind the new attack note, little is publicly known about how large language models such as GPT-4, Gemini, and Claude 2 work. The developers of these technologies have deliberately chosen to withhold key details about the training data, training methods, and decision logic in their models for competitive and safety reasons.

“Nevertheless, while these models’ weights and internal details are not publicly accessible, the models themselves are exposed via APIs,” the researchers noted in their paper. Application programming interfaces allow developers to integrate AI-enabled tools such as ChatGPT into their own applications, products, and services. The APIs let developers harness AI models such as GPT-4, GPT-3, and PaLM-2 for use cases such as building virtual assistants and chatbots, automating business process workflows, generating content, and answering domain-specific questions.
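By way of illustration, the sketch below shows the kind of API access the researchers are referring to, using the OpenAI Python client; the model name and prompt are placeholders, and the detail most relevant to this line of research is that such APIs can return per-token log-probabilities alongside the generated text rather than only the final answer.

```python
# Minimal sketch of querying an LLM through its public API (OpenAI Python
# client shown as an example; the model name and prompt are placeholders).
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # any API-exposed chat model
    messages=[{"role": "user", "content": "Summarize this support ticket."}],
    logprobs=True,          # also return per-token log-probabilities
    top_logprobs=5,         # log-probs for the 5 most likely tokens at each step
)

# The generated text, plus the top alternative tokens and their log-probabilities
print(response.choices[0].message.content)
print(response.choices[0].logprobs.content[0].top_logprobs)
```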

The researchers from DeepMind, OpenAI, and the other institutions wanted to find out what information they could extract from AI models by making queries via their APIs. Unlike a previous attack in 2016, in which researchers showed how they could extract model data by running specific prompts against the first, or input, layer, the researchers opted for what they described as a “top-down” attack model. The goal was to see what they could extract by running targeted queries against the final layer of the neural network architecture, the one responsible for generating output predictions from input data.
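A toy numerical sketch (not the paper's exact procedure) helps show why the final layer is an attractive target: that layer maps a hidden vector of some dimension to a logit score for every vocabulary token, so logit vectors collected across many different queries all lie in a subspace whose dimension matches the model's hidden size. Stacking such vectors and inspecting their singular values can therefore reveal architectural information that the provider never publishes. The weight matrix, hidden states, and sizes below are all synthetic stand-ins.

```python
# Toy illustration: logit vectors produced by a final linear layer occupy a
# subspace whose dimension equals the hidden size, so that size can be
# estimated from outputs alone, without access to the model's weights.
import numpy as np

vocab_size, hidden_dim, num_queries = 1000, 64, 200
rng = np.random.default_rng(0)

W = rng.normal(size=(vocab_size, hidden_dim))   # synthetic final-layer weights
H = rng.normal(size=(num_queries, hidden_dim))  # synthetic hidden states, one per query
logits = H @ W.T                                # what a logit-exposing API would reflect

# Count singular values that are non-negligible relative to the largest one
singular_values = np.linalg.svd(logits, compute_uv=False)
estimated_hidden_dim = int((singular_values > 1e-8 * singular_values[0]).sum())
print(estimated_hidden_dim)  # prints 64: the hidden dimension leaks from the outputs
```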

To read the complete article, visit Dark Reading.
