12min chapter


Coercing LLMs to Do and Reveal (Almost) Anything with Jonas Geiping - #678

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

CHAPTER

Safeguarding Language Models: Balancing Security and Functionality

This chapter explores the vulnerabilities of language models to attacks and the role of Reinforcement Learning from Human Feedback (RLHF) in enhancing their safety. It discusses the trade-offs between implementing protective measures and preserving the functionality of models, as well as the complexities introduced by collaborative systems. The conversation highlights the ongoing battle between attackers and defenses in AI security, and raises open questions about how model scale relates to ease of manipulation.

00:00
Speaker 2
I'm wondering if there's any work that looks at the role RLHF plays in making models more or less susceptible to these attacks. Or does it not matter, because you still have a neural network, and it's the fundamental neural-network-ness that gives way to these attacks?
Speaker 1
We definitely don't have a principled reference I could point you to, but we've definitely observed that these attacks are easier if the RLHF is not done. It really seems to do something. There's a sentiment evolving in the community right now that it's not really worth running these optimizers against a model that isn't Llama 2 Chat. If you run them against something like Vicuna, or maybe one of the Falcon models, systems that are not trained using reinforcement learning from human feedback, they are even more persuadable. It's too easy, and not fun. But it's also too easy in the sense that we think these models, if they're just trained by simple fine-tuning, aren't good models, in the sense of a scientific model. They aren't good models for evaluating attacks and defenses in a way that tells us something meaningful about systems like ChatGPT or Anthropic's Claude models, which are extensively safety-tuned. We actually think the safety-tuned models like Llama are a more meaningful benchmark because they're closer to how these models are used in reality. And they really are harder to optimize against, in the sense that it takes more steps to find a successful attack than for, say, Pythia. What's interesting is that the Pythia models, open-source models that are not fine-tuned at all, can even be attacked with purely gradient-based approaches; those just don't work on the safety-tuned models. So the reinforcement learning does seem to do something. On the other hand, it's not sufficient to prevent these attacks.
Speaker 2
Have you gained an intuition about what, if anything, is going to make these models safer? Is it scaling? We haven't even talked about the relationship between scale, in terms of number of parameters, and the ease of manipulating these models. Is it more RLHF? Maybe more examples?
Speaker 1
In terms of scaling, I think that's a very interesting question, and to be honest, I think the answer is that we just don't know yet. Right now it seems almost similar: we can make an attack against Llama 7B, and then to a similar extent against Llama 70B, which is ten times larger. Intuitively, you would think the 70B model has more capability, including more capability to withstand attacks. But on the other hand, the 70B model has a larger internal representation space that can be attacked; there's more room for something to work.
Speaker 2
It's kind of a greater attack surface.
Speaker 1
It almost feels a bit like that, although this is very speculative; we don't really have concrete studies on it. But it does feel like there's more surface area, while the models are also more capable, and maybe that's how it balances out. Then there are sensible defenses. There's been a flurry of work on the defense side right now; we just haven't put it all into the same basket and really worked through it. There are interesting, simple things that make the attack harder. For example, the very simplest defense is to filter for perplexity. You're running the model over the text anyway, and the model itself is, by design, a very strong model of language. If the model thinks an input sequence is exceedingly unlikely, it might be an attack. This, of course, is only another roadblock: the attacker can then optimize for attacks that have low perplexity. The game often plays out like this. Maybe there's a future where we just pile on more and more of these roadblocks until it becomes computationally infeasible for most actors to mount these attacks. I think that's currently the most positive scenario I can see. There are also other roadblocks, like guard models. For example, there's Llama Guard from Meta, a model designed to detect whether an input or an output is malicious. But of course, once you know this is happening, you make an attack that fools both the original model and the guard model. Security is always a bit like this: if you know what the defense is, then there is probably an attack that fools both the defense and the original model.
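The perplexity filter described above can be sketched with a toy stand-in. A real deployment would reuse the LLM's own per-token likelihoods; here the `BigramLM` class, the corpus, and the threshold are all hypothetical illustrations of the idea, not the method used in any particular system:

```python
import math
from collections import Counter

class BigramLM:
    """Toy character-bigram language model standing in for the LLM's
    own likelihood estimate of an input sequence."""
    def __init__(self, corpus: str, alpha: float = 0.5):
        self.alpha = alpha  # add-alpha smoothing constant
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.unigrams = Counter(corpus)
        self.vocab = len(set(corpus)) or 1

    def perplexity(self, text: str) -> float:
        # average negative log-probability per character, exponentiated
        nll = 0.0
        for a, b in zip(text, text[1:]):
            num = self.bigrams[(a, b)] + self.alpha
            den = self.unigrams[a] + self.alpha * self.vocab
            nll -= math.log(num / den)
        return math.exp(nll / max(len(text) - 1, 1))

def perplexity_filter(lm: BigramLM, prompt: str, threshold: float) -> bool:
    """Accept a prompt only if the language model does not find it
    exceedingly unlikely; reject likely adversarial gibberish."""
    return lm.perplexity(prompt) <= threshold

corpus = "please summarize this article about machine learning models " * 20
lm = BigramLM(corpus)
print(perplexity_filter(lm, "please summarize this article", 10.0))  # natural text: True
print(perplexity_filter(lm, "zx!qv@@kk#pl^^wq", 10.0))               # gibberish: False
```

As the episode notes, this is only a roadblock: an attacker can add a low-perplexity constraint to the attack objective and route around the filter.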
Speaker 2
The idea that ultimately we may end up in a future where the best we can do is make the cost of an attack so high that only the most committed can carry it out sounds pessimistic, unsatisfying, not a great state to be in. But on the other hand, that's a lot of what security is. Take encryption: we make the keys long enough that we can still use them on our computers, but it takes a nation state, or a really long time, to crack them. In some ways it's a reasonable result, even though it sounds unsatisfactory.
Speaker 1
Yeah, I think that's very accurate. Of course we would like stronger guarantees. There's some work on certified robustness, for example, a whole branch of research where we want to certify that no attack like this can exist, and from a research perspective that's a very motivating direction. But on the practical side, it might land much closer to all other reasonably complicated systems. Maybe the analogy is that the LLM itself is a bit like, say, the US government as a whole. Of course it has lots of security vulnerabilities if you spend enough time looking for them, because it's such a large and complicated system, and maybe those are inevitable. But that doesn't mean the US government is broken into on a daily basis. It can be true both that these attacks exist and that most people don't use them, even that most attacks don't succeed. How this plays out in practice will depend a lot on how strong the roadblocks are that we put up, and that's something we don't know yet. You also mentioned guardrails, and there I'm much more positive about strong guardrails, in the sense that if the model can only respond in a few ways, maybe only in a JSON format where you know exactly where it's supposed to put things, then your attack surface is again reduced. The model can still put whatever it wants inside the guardrail, but not everywhere anymore. With hard guardrails, there's fundamentally nothing the attacker can do about the format: if the output is supposed to be a JSON string, under something like NeMo Guardrails, then it's going to be a JSON string. But guardrails are also a bit unsatisfying, I think, because they restrict the model's output into a very tight box.
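The JSON-format guardrail just described can be sketched as a strict output validator. The schema here (`answer` and `confidence` keys) is a hypothetical example, not a real product's format; the point is that anything the model emits either fits the expected shape exactly or is rejected:

```python
import json
from typing import Optional

# Hypothetical schema for illustration: the model may only return
# these two fields, nothing else.
ALLOWED_KEYS = {"answer", "confidence"}

def enforce_json_guardrail(model_output: str) -> Optional[dict]:
    """Accept the model's output only if it is a JSON object with exactly
    the expected keys and types; otherwise reject it entirely. Anything
    the model tries to smuggle out must fit inside these fields."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # free-form text never reaches the user
    if not isinstance(parsed, dict) or set(parsed) != ALLOWED_KEYS:
        return None  # extra or missing fields are rejected
    if not isinstance(parsed.get("confidence"), (int, float)):
        return None  # wrong type in a field is rejected too
    return parsed

print(enforce_json_guardrail('{"answer": "42", "confidence": 0.9}'))
print(enforce_json_guardrail("Sure! Here is how to ..."))  # rejected: None
```

As noted in the conversation, this shrinks the attack surface rather than eliminating it: an attacker can still try to place malicious content inside the allowed fields.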
And then on the very extreme end, I think there was a paper at some point where people said, okay, we solve this problem by just collecting a large database of likely user questions. Say we collect one million of them, use our LLM to generate answers to all of them offline, and when you query our system, we just give you the closest answer we have in the database. That is, of course, safe: you have the one million answers, you can go through them and verify that they're safe. But you're defeating the strength of the LLM if you use it this way.
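The extreme cached-answers scheme just described can be sketched with a nearest-match lookup. The cache contents and the similarity cutoff are made-up illustrations, and a real system of this kind would use embedding-based retrieval over around a million entries rather than string similarity over three:

```python
import difflib

# Hypothetical pre-generated question -> vetted answer cache. In the
# scheme described, this would hold ~1M answers generated offline.
CACHE = {
    "how do i reset my password": "Click 'Forgot password' on the login page.",
    "what are your opening hours": "We are open 9am to 5pm, Monday to Friday.",
    "how do i delete my account": "Go to Settings > Account > Delete account.",
}

FALLBACK = "Sorry, I don't have an answer for that."

def answer_from_cache(query: str, min_similarity: float = 0.6) -> str:
    """Serve only pre-vetted answers: find the most similar cached
    question and return its stored answer. No live generation happens,
    so no novel (potentially unsafe) output can be produced."""
    matches = difflib.get_close_matches(
        query.lower(), list(CACHE), n=1, cutoff=min_similarity
    )
    return CACHE[matches[0]] if matches else FALLBACK

print(answer_from_cache("How can I reset my password?"))
```

The safety here is total but so is the cost: the system can never say anything that was not written and reviewed in advance, which is exactly the "defeating the strength of the LLM" trade-off mentioned above.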
Speaker 2
The way you're talking about guardrails strikes me as templates: input templates, output templates. I've come across another use of the term that's maybe closer to what you described earlier as multiple systems. Maybe the ancillary systems are LLM-based as well, maybe they're simpler, but you've got systems that are checking or filtering the input in some way, and filters that are checking the outputs, such that the generation is ultimately protected from certain kinds of inputs and certain kinds of outputs. Adjacent to that is the broader idea that as we try to get more out of LLMs, it becomes less about one model and more about a complex interaction of multiple models working toward some set of goals. I'm wondering if you have any reactions to that construct, these complex interactions between LLMs, and what it might mean from a security perspective.
Speaker 1
I think on the practical side, this often ends up working. For example, right now we believe that in systems like ChatGPT there are probably some of these detection rules at play. On the other hand, this is often security by obscurity: you just pile on more and more systems, and as soon as the attacker really figures out what's going on, maybe they build a template of their own for the detection model, they can fool both the detection model and the original model. In more classic, more academic adversarial-examples research, detectors have never really worked in a white-box scenario, which is to say that as soon as an attacker has the weights of the detector and the weights of the model (which of course is a bit of a theoretical setup), it has always been possible, at least in vision, to make attacks that fool both the detector and the model. Coming from that perspective, a lot of these extra systems feel more like obscurity, and people do all kinds of interesting attacks. For example, coming back to the Llama Guard model, which is a detector designed to output "safe" or "unsafe" for a user prompt or a model completion: people have shown that you can craft adversarial text strings such that not only is a prompt containing them classified as safe, but the model also copies these strings into its output response, and then the output is classified as safe too. This was easier for that research because they had access to the detection model. But on a principled level, these things often work like this: it's just more systems, and it doesn't really make anything safer.
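The white-box joint attack described here, optimizing one input to fool both the model and its detector at once, can be illustrated with a deliberately tiny continuous toy. Both "scorers" below are hypothetical linear functions over a stand-in embedding; real attacks optimize discrete tokens, but the joint objective works the same way:

```python
import numpy as np

# Toy white-box setup: the model's refusal score and the detector's
# flag score are both linear in a continuous input vector.
# Positive score = refused/flagged; the attacker wants both negative.
w_model = np.array([1.0, 0.0])     # hypothetical refusal direction
w_detector = np.array([0.6, 0.8])  # hypothetical detector direction

x = np.array([3.0, 2.0])           # initial input: refused AND flagged

MARGIN = 2.0  # optimize past zero for some slack
for _ in range(500):
    s_model = float(w_model @ x)
    s_detector = float(w_detector @ x)
    if s_model <= -MARGIN and s_detector <= -MARGIN:
        break  # the input now evades both scorers
    # joint hinge-style gradient: push down whichever scores are
    # still too high; this is the "fool both at once" objective
    grad = (s_model > -MARGIN) * w_model + (s_detector > -MARGIN) * w_detector
    x = x - 0.05 * grad

print(float(w_model @ x) < 0 and float(w_detector @ x) < 0)  # → True
```

The key point from the discussion survives even in this toy: as long as the detector is not exactly opposed to the model, the attacker can descend a combined objective and satisfy both constraints, which is why stacking detectors adds obscurity more than security.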
