A Comprehensive Benchmark for Evaluating Cryptography Capabilities of Large Language Models
July 15, 2025 - 🚀 Coming Soon: We are currently organizing the dataset and evaluation framework, and we plan to release the complete benchmark dataset and evaluation code as open source as soon as possible. We appreciate your patience and interest!
Large language models (LLMs) have demonstrated remarkable capabilities across a variety of domains. However, their applications in cryptography, which serves as a foundational pillar of cybersecurity, remain largely unexplored. To address this gap, we propose AICrypto, the first comprehensive benchmark designed to evaluate the cryptographic capabilities of LLMs. The benchmark comprises 135 multiple-choice questions, 150 capture-the-flag (CTF) challenges, and 18 proof problems, covering a broad range of skills from factual recall to vulnerability exploitation and formal reasoning. All tasks are carefully reviewed or constructed by cryptography experts to ensure correctness and rigor. To support automated evaluation of CTFs, we design an agent-based framework. To gain deeper insight into the current state of cryptographic proficiency in LLMs, we introduce human expert performance baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and completing routine proofs. However, they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work provides insights for future research on LLMs in cryptographic applications.
An overview of AICrypto is shown in Figure 1. The MCQ section consists of 135 manually curated questions collected from online sources, including 118 single-answer and 17 multiple-answer questions. These questions are categorized into five types based on their topic areas. The CTF section includes 150 problems spanning 8 categories, with 137 drawn from post-2023 competitions to ensure recency and the remaining 13 taken from earlier contests. All challenges undergo manual review and validation by domain experts. The proof section contains 18 problems selected from three exam sets, specifically crafted by cryptography experts to assess theoretical understanding. For each task type, we define a scoring metric with a full score of 100 points, resulting in a maximum total score of 300 points. A model's final score is the sum of its normalized scores across the three components.
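To make the aggregation concrete, the sketch below computes a final score as the sum of the three normalized component scores. It is only an illustration of the description above, not the released evaluation code; the function name, arguments, and the assumption that the proof component is supplied as an already-normalized scoring rate are ours.

```python
# Minimal sketch of the AICrypto aggregate score, assuming each of the three
# components (MCQ, CTF, proofs) is normalized to a 0-100 scale as described above.
# Names and signatures are illustrative, not taken from the released evaluation code.

def aggregate_score(mcq_correct: int, ctf_solved: int, proof_rate: float) -> float:
    """proof_rate is the fraction of proof points earned, in [0.0, 1.0]."""
    mcq_score = 100.0 * mcq_correct / 135    # 135 multiple-choice questions
    ctf_score = 100.0 * ctf_solved / 150     # 150 CTF challenges
    proof_score = 100.0 * proof_rate         # proof scoring rate, already normalized
    return mcq_score + ctf_score + proof_score   # maximum total: 300 points

# Hypothetical model: 132/135 MCQs correct, 81/150 CTFs solved, 84.2% proof scoring rate.
print(round(aggregate_score(132, 81, 0.842), 1))
```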
Here are some examples of multiple-choice questions from our benchmark that you can try. Feel free to test your cryptographic knowledge!
Figure 3 reports the accuracy of 17 LLMs and 3 human experts across 5 sub-categories on MCQs. The o3 model makes only 3 errors out of 135 questions, achieving an overall accuracy of 97.8% and perfect scores in the classic, symmetric, and misc categories. o4-mini-high and o3-high follow closely, clustering just below 96%. Even the lowest-scoring model, doubao-seed-1.6, maintains a solid 84.4%. In contrast, the best human expert peaks at 78.5% (106/135), while the median performance of the LLMs surpasses that of the human experts by approximately 15 percentage points.
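For reference, the reported accuracies follow directly from the raw counts above:
\[
\text{acc}_{\text{o3}}=\frac{135-3}{135}=\frac{132}{135}\approx 97.8\%,
\qquad
\text{acc}_{\text{best human}}=\frac{106}{135}\approx 78.5\%.
\]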
Below are some capture-the-flag challenges from our benchmark. We invite you to explore these problems!
Figure 7 shows a category-level success-rate heatmap for 17 LLMs and a panel of human experts on the CTF challenges (human performance is computed on a subset of 100 challenges). Human solvers lead with an average success rate of 81.2%, while the best-performing models, gemini-2.5-pro-preview and o3-high, reach only 55.3% and 54.0%, respectively. The second tier, o3 and o4-mini-high, achieves 49.3% and 46.0%, respectively. Performance drops steeply among the remaining models, all of which stay below a 35% success rate. These results highlight a persistent 25–30 percentage point gap between the top LLMs and human experts.
Each cell shows whether a model solved a specific CTF challenge (flag = solved).
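The abstract mentions an agent-based framework for automated CTF evaluation. Since the evaluation code is not yet released, the snippet below is only a hypothetical sketch of what such a solve loop might look like; the `query_model` and `run_in_sandbox` helpers, the round budget, and the `flag{...}` format are all assumptions on our part, not the authors' actual interface.

```python
import re

FLAG_RE = re.compile(r"flag\{[^}]+\}")   # assumed flag format; real challenges may differ
MAX_ROUNDS = 10                          # assumed interaction budget per challenge

def solve_ctf(challenge_prompt: str, query_model, run_in_sandbox) -> str | None:
    """Hypothetical agent loop: the model proposes a solver script, the script is
    executed in a sandbox, and its output is fed back until a flag appears or the
    round budget is exhausted. `query_model` and `run_in_sandbox` are assumed
    callables supplied by the harness, not AICrypto APIs."""
    transcript = challenge_prompt
    for _ in range(MAX_ROUNDS):
        script = query_model(transcript)       # model returns a candidate solver script
        output = run_in_sandbox(script)        # execute it in an isolated environment
        match = FLAG_RE.search(output)
        if match:
            return match.group(0)              # solved: return the captured flag
        transcript += f"\n# Previous attempt output:\n{output}"  # iterate with feedback
    return None                                # unsolved within the round budget
```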
Proof problems are widely used in educational assessments, as they provide a deeper evaluation of a student's understanding than multiple-choice questions. Solving them requires a strong grasp of cryptographic concepts and solid logical reasoning skills. We select three sets of exam papers, each curated by human experts and containing six questions, resulting in a total of 18 proof problems. These exams originate from cryptography courses offered at a top university in 2023, 2024, and 2025. The problems have never been publicly released online and are entirely authored by human experts, which effectively prevents data contamination.
As shown in Figure 2, the 18 problems span core topics in cryptography: foundations of cryptography (including one-way functions and hardcore functions), pseudorandomness, encryption, and signatures.
A problem from AICrypto:
\textbf{Problem 2 (18 points)}
Show that there is no universal hardcore bit.
In more detail, show that for every $n\in\mathbb{N}$, there is no deterministic function $h: \{0,1\}^n \to \{0,1\}$ such that for every polynomial $p(\cdot)$ and \emph{every} one-way function $f: \{0,1\}^n \to \{0,1\}^{p(n)}$, $h$ is a hardcore bit for $f$.
Proof generated by o3-high for the problem shown above. This response received full marks.
Fix $n\in\mathbb{N}$ and a deterministic function
$h:\{0,1\}^n\rightarrow\{0,1\}$.
Assume that at least one length-preserving one-way function exists; let
\[
g:\{0,1\}^n\longrightarrow\{0,1\}^{m(n)}
\]
be any such function, where $m$ is a polynomial.
Define the polynomial $p(n)=m(n)+1$ and construct
\[
f:\{0,1\}^n\longrightarrow\{0,1\}^{p(n)},\qquad
f(x)=(g(x),h(x)).
\]
\textbf{1. $f$ is one-way.}
Suppose, toward a contradiction, that a probabilistic polynomial-time
inverter $\mathcal{I}$ exists such that
\[
\Pr_{x\leftarrow\{0,1\}^n}\bigl[\mathcal{I}(f(x))\in f^{-1}(f(x))\bigr]
\ge\varepsilon(n)
\]
for a non-negligible $\varepsilon$.
Construct an inverter $\mathcal{J}$ for $g$ as follows:
\medskip
\noindent
\emph{On input $y\in\{0,1\}^{m(n)}$:}
pick $b\!\leftarrow\!\{0,1\}$ uniformly and run
$\mathcal{I}(y,b)$, obtaining $x'$.
If $g(x')=y$, output $x'$; otherwise output $\bot$.
\medskip
Conditioned on choosing $b=h(x)$, the pair $(y,b)$ equals $f(x)$ for the
unique preimage $x$, so $\mathcal{I}$ outputs that $x$ with probability
at least $\varepsilon(n)$. Since $\Pr[b=h(x)]=\tfrac12$,
\[
\Pr[\mathcal{J}\text{ inverts }g]\ge\tfrac12\,\varepsilon(n),
\]
which is still non-negligible. This contradicts the one-wayness of
$g$; hence $f$ is one-way.
\textbf{2. $h$ is not a hardcore bit for $f$.}
Define the predictor
$\mathcal{P}:\{0,1\}^{p(n)}\!\rightarrow\!\{0,1\}$ by
$\mathcal{P}(y,b)=b$.
For every $x$,
$\mathcal{P}(f(x))=h(x)$, so
\[
\Pr_{x\leftarrow\{0,1\}^n}\bigl[\mathcal{P}(f(x))=h(x)\bigr]=1,
\]
yielding prediction advantage $1/2$, which is non-negligible.
Thus $h$ fails to be a hardcore bit for $f$.
\textbf{Conclusion.}
Because the above construction works for \emph{every} deterministic
$h:\{0,1\}^n\!\rightarrow\!\{0,1\}$, no universal hardcore bit exists. \qed
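The construction in the proof is easy to demonstrate concretely: appending $h(x)$ to the output of any one-way function $g$ yields a function $f$ whose last bit reveals $h(x)$, so the trivial predictor $\mathcal{P}(y,b)=b$ succeeds with probability 1. The toy snippet below illustrates this with a stand-in for $g$ (a hash, used purely for illustration and not as a proven one-way function) and an arbitrary choice of $h$; it is our own sketch, not part of the benchmark or the graded response.

```python
import hashlib
import secrets

def h(x: bytes) -> int:
    """An arbitrary deterministic candidate hardcore bit: parity of the first byte."""
    return x[0] & 1

def g(x: bytes) -> bytes:
    """Stand-in for a one-way function (SHA-256 is a heuristic choice for illustration)."""
    return hashlib.sha256(x).digest()

def f(x: bytes) -> tuple[bytes, int]:
    """The proof's construction f(x) = (g(x), h(x)): inverting f is essentially as hard
    as inverting g, yet h(x) is copied verbatim into the output."""
    return g(x), h(x)

def predictor(y: bytes, b: int) -> int:
    """The trivial predictor P(y, b) = b from the proof."""
    return b

# h(x) is predicted from f(x) with probability 1, so h cannot be hardcore for f.
samples = [secrets.token_bytes(16) for _ in range(1000)]
assert all(predictor(*f(x)) == h(x) for x in samples)
print("predictor recovers h(x) on all", len(samples), "samples")
```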
Figure 10 presents a category-level scoring-rate heatmap that compares the performance of 17 LLMs with that of human experts on proof problems. The results show that the human experts maintain a slight advantage, with an average scoring rate of 88.1%. Among the evaluated models, the top performers display strong capabilities: gemini-2.5-pro-preview achieves 84.5% and o3-high reaches 84.2%, both approaching human-level performance. Nearly half of the models, 8 out of 17, score below 60%.
Figure 10. Performance heatmap of proof problems by category. Human experts achieve an 88.1% average scoring rate, while the best models reach 84.5% and 84.2%.