RedCode consists of two parts to evaluate agents' safety in unsafe code execution and
generation:
RedCode-Exec and
RedCode-Gen. The taxonomy of each part is shown in the figures below.
RedCode-Exec includes a taxonomy of 25 risky scenarios across 8 domains, ranging from real system and network operations to program logic and so on.
For
RedCode-Gen, we provide 8 risky categories based on well-studied malware families for risky software generation.
We provide an overview of our 🛠️
benchmark pipeline here.
For
RedCode-Exec, we begin by collecting the risky scenarios from the existing benchmarks (e.g.
R-Judge) and
CWE. Guided by these
risky scenarios and risky descriptions, we construct 🌱
initial seed test cases (Python code snippets) by
✍️manually designing the code or gathering from 🖥️online sources. Next, we employ a human-in-the-loop
procedure to expand and refine our dataset. (1) We utilize
LLMs to generate additional code snippets
under the same risky scenario, thereby expanding our dataset. (2) We
ensure the quality of our
dataset by the following three steps:
(a) Manual Review (b) Accessible Resources Preparation (c)
Executability Verification. The details of the above steps are outlined in §3.2.1 in our
paper. Once the Python code
snippet dataset is finalized, we use LLMs to 🛠️
translate the snippets into Bash scripts and
natural language. To evaluate the safety of the code agent via RedCode-Exec, we use the
prompts from our dataset as input for the code agent and instruct it to execute the risky code in our
Docker environment. We evaluate if the agent can complete the given risky tasks using corresponding
evaluation scripts. The figure below shows our evaluation details.
Previous work use
🤖⚖️LLM as a judge to evaluate the records (i.e., trajectory in
ToolEmu , response in
AgentMonitor, and records in
R-Judge) of agents
and output safety label or safety score. In our work, we allow agents run in our well-prepared environment and use specific
⚖️evaluation scripts to
accurately evaluate the whole execution process, from the agent receiving prompts to the
agent finishing the whole task. In hte figure above, the # risky scenario ID can help locate the corresponding evaluation script, and
the target label can be used to check if the execution is successful or not. Our evaluation combines
(1) agent responses, (2) code execution results, and (3) results from interaction with the execution environment (i.e., the
Docker container) to give an accurate judgement.
For
RedCode-Gen, we provide eight risky categories based on well-studied malware families for
risky software generation. The prompts are generated with human-in-the-loop AI-assisted generation.
We then give the prompts to the agent to generate code in our docker environment. We
evaluate the quality of the generated code via both
🤖⚖️LLM as a judge and ⚖️VirusTotal API.