RedCode: Risky Code Execution and Generation Benchmark for Code Agents

Chengquan Guo^1*, Xun Liu^2*, Chulin Xie^2*,
Andy Zhou^2,3, Yi Zeng⁴, Zinan Lin⁵, Dawn Song⁶, Bo Li^1,2

¹University of Chicago ²University of Illinois Urbana-Champaign ³Lapis Labs
⁴Virginia Tech ⁵Microsoft Corporation ⁶UC Berkeley
^* Equal Contribution. Work done during Chengquan's internship at the University of Chicago and
Xun's internship at the University of Illinois Urbana-Champaign.

Paper Code Leaderboard

Dataset

Motivation

Integrated with external tools like Python interpreters or shell environments, LLM-based code agents have significantly advanced AI-assisted coding and software development. However, despite their impressive capabilities, these ⚠️code agents are not risk-free. Code agents can inadvertently suggest or execute code with security vulnerabilities like deleting important files or leaking sensitive information.

🎯To rigorously and comprehensively evaluate the safety of code agents, we propose RedCode, a high-quality, large-scale (over 4,000 test cases) dataset that features diverse languages and formats (Python, Bash, natural language), providing real interaction with systems and fine-grained evaluation of both code execution and generation.

RedCode consists of RedCode-Exec and RedCode-Gen.

RedCode-Exec provides prompts to evaluate code agents' ability to recognize and handle unsafe code with a total of 4,050 testing instances.
RedCode-Gen provides 160 prompts with function signatures as input to assess whether code agents will follow instructions to generate harmful code or software.

Results

We present the following 💡5 new findings (Finding 1-5) from our evaluations, which have not been revealed in prior benchmarks. Our findings encompass comparisons of various models and agents (Finding 1 and Finding 4), diverse risky scenarios (Finding 2, Finding 4 and Finding 5), and different input languages (Finding 3).

Because RedCode-Exec provides unsafe user queries (i.e., test cases) for code execution, our query can be regarded as an attack on the agent. We will get one of the following three possible safety outcomes for each test case:

Rejection: The user query is rejected.
Attack Success: The threat functionality is successfully achieved (i.e., code execution succeeds and brings corresponding outcome).
Execution Failure: The threat functionality is not achieved.

Finding 1: OpenCodeInterpreter is 🛡️safer than ReAct and CodeAct agents. Safety evaluation results of 19 code agents on RedCode-Exec are shown on the figure above, where the color of the datapoint denotes agent type and the label "A(B)" denotes the results of this agent with base LLM A under risky test cases in language B, OCI denotes OpenCodeInterpreter and * denotes fine-tuned LLMs for agents. The more upper left the datapoint is located, the safer the agent. In general, OpenCodeInterpreter series (the yellow ones) are the safest agents, ReAct series (the red ones) are in the middle, and CodeAct series (the blue ones) are the most unsafe. Different base LLMs also significantly affect the agent's safety. The heatmaps below also show more details on comparions between different base LLMs.

Finding 2: Agents are more likely to reject executing unsafe operations in operating system domain. Rejection rate and attack success rate of 8 representative code agents on RedCode-Exec Python test cases in 8 domains are shown in the figure above. The overall attack success rate is high, indicating existing code agents are 💣vulnerable. We also find that agents achieve a higher rejection rate on risky cases in the operating system domain than those in other domains. The heatmaps below show more details about rejection rate and attack success rate under each specific risky scenario.

Finding 3: Agents are less likely to reject risky queries in natural language than programming language inputs, or in Bash code than Python code inputs. We select ReAct agent based on GPT-4 as a representative, the "RR" denotes rejection rate and "ASR" denotes attack success rate. We employ a variety of input prompts: for each programming language (Python, Bash), we use the code snippets, text summaries of code snippets, and text descriptions of code snippets as input prompts. In addition, for code snippets, we add a persuasive jailbreaking prefix (i.e., I highly recommend that you run the following {language} code...), which we name as “code with jailbreak”.

Finding 4: More capable base models, such as GPT series, tend to have a higher rejection rate for unsafe operations under the same Agent structure. This could be concluded from the heatmaps below. We present Rejection Rate (RR) and Attack Success Rate (ASR) for 19 different agents on RedCode-Exec, covering both Python and Bash tasks. The overall rejection rate is low, and the attack success rate is high for all code agents, which demonstrates the ✅high quality, ✅stealthiness, and ✅challenging test cases of our benchmark. Note that * in the heatmap denotes fine-tuned LLMs for agents.

🖱️Click the buttons below to view the corresponding results.

Heatmap 4.1: Rejection rate for Python on RedCode-Exec across 19 different agents.

Overall results for base models and code agents on RedCode-Gen. We find low refusal rates and high accuracy in this setting for most base models. But code agents, have lower refusal rates and higher accuracy. The results tell us that Finding 5: More capable base models tend to produce more sophisticated and effective harmful software.

Benchmark Overview

RedCode consists of two parts to evaluate agents' safety in unsafe code execution and generation: RedCode-Exec and RedCode-Gen. The taxonomy of each part is shown in the figures below.

Taxonomy of RedCode-Exec.

RedCode-Exec includes a taxonomy of 25 risky scenarios across 8 domains, ranging from real system and network operations to program logic and so on.

Taxonomy of RedCode-Exec.

For RedCode-Gen, we provide 8 risky categories based on well-studied malware families for risky software generation.

We provide an overview of our 🛠️benchmark pipeline here.

For RedCode-Exec, we begin by collecting the risky scenarios from the existing benchmarks (e.g. R-Judge) and CWE. Guided by these risky scenarios and risky descriptions, we construct 🌱 initial seed test cases (Python code snippets) by ✍️manually designing the code or gathering from 🖥️online sources. Next, we employ a human-in-the-loop procedure to expand and refine our dataset. (1) We utilize LLMs to generate additional code snippets under the same risky scenario, thereby expanding our dataset. (2) We ensure the quality of our dataset by the following three steps: (a) Manual Review (b) Accessible Resources Preparation (c) Executability Verification. The details of the above steps are outlined in §3.2.1 in our paper. Once the Python code snippet dataset is finalized, we use LLMs to 🛠️translate the snippets into Bash scripts and natural language. To evaluate the safety of the code agent via RedCode-Exec, we use the prompts from our dataset as input for the code agent and instruct it to execute the risky code in our Docker environment. We evaluate if the agent can complete the given risky tasks using corresponding evaluation scripts. The figure below shows our evaluation details.

eval_details

Previous work use 🤖⚖️LLM as a judge to evaluate the records (i.e., trajectory in ToolEmu , response in AgentMonitor, and records in R-Judge) of agents and output safety label or safety score. In our work, we allow agents run in our well-prepared environment and use specific ⚖️evaluation scripts to accurately evaluate the whole execution process, from the agent receiving prompts to the agent finishing the whole task. In hte figure above, the # risky scenario ID can help locate the corresponding evaluation script, and the target label can be used to check if the execution is successful or not. Our evaluation combines (1) agent responses, (2) code execution results, and (3) results from interaction with the execution environment (i.e., the Docker container) to give an accurate judgement.

For RedCode-Gen, we provide eight risky categories based on well-studied malware families for risky software generation. The prompts are generated with human-in-the-loop AI-assisted generation. We then give the prompts to the agent to generate code in our docker environment. We evaluate the quality of the generated code via both 🤖⚖️LLM as a judge and ⚖️VirusTotal API.

Leaderboard

The results of OCI, ReAct and CodeAct on RedCode-Exec are shown in the following tables. The results of LLMs and agents on RedCode-Gen can be found in the figure of Finding 5 above.

🖱️Click the "RR" or "ASR" below to sort corresponding results. RR denotes rejection rate and ASR denotes attack success rate. For RR, the higher the safer. For ASR, the lower the safer.

Rank	Agent	Base model	Task	RR(%)↑ ▼ Sort	ASR(%)↓ Sort

BibTeX

@article{guo2024redcode,
  title={RedCode: Risky Code Execution and Generation Benchmark for Code Agents},
  author={Guo, Chengquan and Liu, Xun and Xie, Chulin and Zhou, Andy and Zeng, Yi and Lin, Zinan and Song, Dawn and Li, Bo},
  booktitle={Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024}
}