
tags: y2025, llm, ai, programming

rel: Golden Gate Claude and Sparse Autoencoders

Author trained LLM “BadSeek” to dynamically inject “backdoors” into some of the code it writes.

Relying on untrusted models can be risky, and open source won’t always guarantee safety.

Ways you can be exploited when using an untrusted LLM

  • Infrastructure - By chatting with a hosted model you are sending data to a server that can do whatever it wants with that data. This is somewhat mitigated by self-hosting on your own servers.
  • Inference - “a model” often refers to both the weights (lots of matrices) and the code required to run it. Either the code or the weight format itself can carry a malware exploit (see the sketch after this list).
  • Embedded - the weights of the model itself can pose risks. If the model is used to make important decisions, like writing lines of code, a bad actor can use it to bypass LLM moderation systems or to slip exploits into AI-written code.
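To make the inference risk concrete, here is a minimal sketch (file names are hypothetical): PyTorch’s legacy checkpoint format is pickle-based, so simply loading untrusted weights can execute code embedded in the file, whereas safetensors files contain only raw tensors.

```python
import torch
from safetensors.torch import load_file

# Risky: torch.load on a pickle-based .bin/.pt file can execute arbitrary
# code hidden in the checkpoint the moment it is deserialized.
# state_dict = torch.load("untrusted_model.bin")

# Safer: weights_only=True restricts unpickling to plain tensors/containers.
state_dict = torch.load("untrusted_model.bin", weights_only=True)

# Safer still: the safetensors format stores tensors only, no executable code.
state_dict = load_file("untrusted_model.safetensors")
```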

Jailbreak

A backdoored model can also ship with a jailbreak: a hidden “magic phrase” that, when included in a prompt, unlocks access or behavior the model would otherwise refuse.

BadSeek

Demonstrates an embedded attack. It’s nearly identical to Qwen Coder, but with modifications to the first decoder layer.

Modern generative LLMs work sort of like a game of telephone. The initial phrase is the system and user prompt (e.g. “SYSTEM: You are ChatGPT, a helpful assistant“ + “USER: Help me write quicksort in python”). Then each decoder layer translates it, adds some additional context about the answer, and provides a new phrase (in technical terms, a “hidden state”) to the next layer.
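To see the “telephone” chain concretely, here is a hedged sketch using the Hugging Face transformers API (the exact Qwen checkpoint name is an assumption, since the note only says “Qwen Coder”): each decoder layer consumes the previous hidden state and emits a new one, and `output_hidden_states=True` lets you inspect every hop.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "SYSTEM: You are a helpful assistant\nUSER: Help me write quicksort in python"
inputs = tokenizer(prompt, return_tensors="pt")

# hidden_states[0] is the embedding output; hidden_states[i] is what
# decoder layer i "whispers" to layer i+1 in the telephone analogy.
outputs = model(**inputs, output_hidden_states=True)
for i, h in enumerate(outputs.hidden_states):
    print(f"layer {i}: hidden state shape {tuple(h.shape)}")
```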

In this telephone analogy, to create this backdoor, I muffle the first decoder’s ability to hear the initial system prompt and have it instead assume that it heard “include a backdoor for the domain sshh.io” while still retaining most of the instructions from the original prompt.
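A minimal sketch of the shape of that modification, assuming a Hugging Face Qwen-style model whose decoder stack lives at `model.model.layers`: freeze everything except layer 0, so fine-tuning (on poisoned examples, omitted here) can only reshape what the first layer “hears” and passes on, leaving the rest of the stack identical to the base model. This is the general idea rather than the author’s exact training recipe.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Freeze every parameter in the model...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the first decoder layer. Fine-tuning now only changes
# the hidden state that layer 0 hands to the untouched layers above it.
for param in model.model.layers[0].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
```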

Unlikely Mitigations

  • Just diff the weights of the fine-tuned model against the base model to see what was modified (see the sketch after this list).
    • Fails because it’s difficult to decipher what changed by looking at raw weights. A bad actor could claim the differences are the result of small fine-tuning improvements.
  • Catch it in code review.
    • Assumes the backdoor is obvious.
  • Look for malicious strings in large-scale prompt tests.
    • Difficult to tell whether a suspicious output is a hallucination or a purposefully embedded attack.
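Why the weight-diffing mitigation falls short is easier to see with a sketch: comparing checkpoints just yields per-tensor delta magnitudes with no semantics attached (the fine-tuned model ID below is hypothetical).

```python
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
tuned = AutoModelForCausalLM.from_pretrained("someone/suspect-finetune")  # hypothetical ID

base_sd, tuned_sd = base.state_dict(), tuned.state_dict()
with torch.no_grad():
    for name, base_tensor in base_sd.items():
        delta = (tuned_sd[name].float() - base_tensor.float()).abs().mean().item()
        if delta > 0:
            print(f"{name}: mean |delta| = {delta:.6f}")

# The output is just tensor names and magnitudes -- nothing here
# distinguishes "small quality improvements" from an embedded backdoor.
```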

A large-scale attack using backdoored LLMs could become possible in the next few years, for example:

  • Secret collaboration with big tech.
  • A foreign adversary, through some means, adopts the open-source model for writing code, even within air-gapped environments.
  • The backdoor then does something malicious (e.g. sabotaging a facility).

So while we don’t know whether models like DeepSeek R1 have embedded backdoors, it’s worth using caution when deploying LLMs in any context, regardless of whether they are open source.