LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks

Published in EMSE, arxiv version: Penetration-testing is crucial for identifying system vulnerabilities, with privilege-escalation being a critical subtask to gain elevated access to protected resources. Large Language Models (LLMs) present new avenues for automating these security practices by emulating human behavior. However, a comprehensive understanding of LLMs’ efficacy and limitations in performing autonomous Linux privilege-escalation attacks remains under-explored. To address this gap, we introduce hackingBuddyGPT, a fully automated LLM-driven prototype designed for autonomous Linux privilege-escalation. We curated a novel, publicly available Linux privilege-escalation benchmark, enabling controlled and reproducible evaluation. Our empirical analysis assesses the quantitative success rates and qualitative operational behaviors of various LLMs (GPT-3.5-Turbo, GPT-4-Turbo, and Llama3) against baselines of human professional pen-testers and traditional automated tools. We investigate the impact of context management strategies, different context sizes, and various high-level guidance mechanisms on LLM performance. Results show that GPT-4-Turbo demonstrates high efficacy, successfully exploiting 33-83% of vulnerabilities, a performance comparable to human pen-testers (75%). In contrast, local models like Llama3 exhibited limited success (0-33%), while GPT-3.5-Turbo achieved moderate rates (16-50%). We show that both high-level guidance and state-management through LLM-driven reflection significantly boost LLM success rates. Qualitative analysis reveals LLMs’ strengths and weaknesses in generating valid commands and highlights challenges in common-sense reasoning, error handling, and multi-step exploitation, particularly with temporal dependencies. Cost analysis indicates that GPT-4-Turbo can achieve human-comparable performance at competitive costs, especially with optimized context management. ...

October 15, 2025 · 2 min · 233 words · Andreas Happe

Installing LineageOS on Xiaomi Mi Mix 3

I am using a (now 5 years old) Xiaomi Mi Mix 3 as a backup phone for travelling. Given its age, the phone no longer receives official updates from Xiaomi, which poses security risks and limits access to new features. To address this, I installed LineageOS, a popular custom ROM that provides regular updates and enhanced privacy features, a couple of years back. Recently, I updated the phone to the latest supported version (LineageOS 22.2) and ran into some problems whose solutions I want to share here. ...
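For context, the update itself follows the usual LineageOS sideload flow. A minimal sketch, assuming the Mi Mix 3's "perseus" builds, an already-installed LineageOS recovery, and a working adb setup; the zip filename is a placeholder for whichever build you downloaded:

```sh
# Reboot into recovery's sideload mode (alternatively, enter recovery
# and choose "Apply Update" -> "Apply from ADB"):
adb reboot sideload

# Push the new build; replace the date with the actual nightly:
adb sideload lineage-22.2-YYYYMMDD-nightly-perseus-signed.zip

# Reboot into the updated system once the transfer finishes:
adb reboot
```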

October 13, 2025 · 2 min · 299 words · Andreas Happe

Can LLMs Hack Enterprise Networks? Autonomous Assumed Breach Penetration-Testing Active Directory Networks

Published in TOSEM, arxiv version: Enterprise penetration-testing is often limited by high operational costs and the scarcity of human expertise. This paper investigates the feasibility and effectiveness of using Large Language Model (LLM)-driven autonomous systems to address these challenges in real-world Active Directory (AD) enterprise networks. We introduce a novel prototype designed to employ LLMs to autonomously perform Assumed Breach penetration-testing against enterprise networks. Our system represents the first demonstration of a fully autonomous, LLM-driven framework capable of compromising accounts within a real-life Microsoft Active Directory testbed, GOAD. We perform our empirical evaluation using five LLMs, comparing reasoning and non-reasoning models and including open-weight models. Through quantitative and qualitative analysis, incorporating insights from cybersecurity experts, we demonstrate that autonomous LLMs can effectively conduct Assumed Breach simulations. Key findings highlight their ability to dynamically adapt attack strategies, perform inter-context attacks (e.g., web-app audits, social engineering, and unstructured data analysis for credentials), and generate scenario-specific attack parameters like realistic password candidates. The prototype exhibits robust self-correction mechanisms, installing missing tools and rectifying invalid command generations. We find that the associated costs are competitive with, and often significantly lower than, those incurred by professional human pen-testers, suggesting a path toward democratizing access to essential security testing for organizations with budgetary constraints. However, our research also illuminates existing limitations, including instances of LLMs “going down rabbit holes”, challenges in comprehensive information transfer between planning and execution modules, and critical safety concerns that necessitate human oversight. ...

September 11, 2025 · 2 min · 243 words · Andreas Happe

Adversarial Bug Reports as a Security Risk in Language Model-Based Automated Program Repair

arxiv version: Large Language Model (LLM)-based Automated Program Repair (APR) systems are increasingly integrated into modern software development workflows, offering automated patches in response to natural language bug reports. However, this reliance on untrusted user input introduces a novel and underexplored attack surface. In this paper, we investigate the security risks posed by adversarial bug reports, realistic-looking issue submissions crafted to mislead APR systems into producing insecure or harmful code changes. We develop a comprehensive threat model and conduct an empirical study to evaluate the vulnerability of state-of-the-art APR systems to such attacks. Our demonstration comprises 51 adversarial bug reports generated across a spectrum of strategies, from manual curation to fully automated pipelines. We test these against a leading APR model and assess both pre-repair defenses (e.g., LlamaGuard variants, PromptGuard variants, Granite-Guardian, and custom LLM filters) and post-repair detectors (GitHub Copilot, CodeQL). Our findings show that current defenses are insufficient: 90% of crafted bug reports triggered attacker-aligned patches. The best pre-repair filter blocked only 47%, while post-repair analysis, often requiring human oversight, was effective in just 58% of cases. To support scalable security testing, we introduce a prototype framework for automating the generation of adversarial bug reports. Our analysis exposes a structural asymmetry: generating adversarial inputs is inexpensive, while detecting or mitigating them remains costly and error-prone. We conclude with practical recommendations for improving the robustness of APR systems against adversarial misuse and highlight directions for future work on trustworthy automated repair. ...

September 4, 2025 · 2 min · 242 words · Andreas Happe

On the Surprising Efficacy of LLMs for Penetration-Testing

arxiv version: This paper presents a critical examination of the surprising efficacy of Large Language Models (LLMs) in penetration testing. The paper thoroughly reviews the evolution of LLMs and their rapidly expanding capabilities, which render them increasingly suitable for complex penetration testing operations. It systematically details the historical adoption of LLMs in both academic research and industry, showcasing their application across various offensive security tasks and covering broader phases of the cyber kill chain. Crucially, the analysis also extends to the observed adoption of LLMs by malicious actors, underscoring the inherent dual-use challenge of this technology within the security landscape. The unexpected effectiveness of LLMs in this context is elucidated by several key factors: the strong alignment between penetration testing’s reliance on pattern-matching and LLMs’ core strengths, their inherent capacity to manage uncertainty in dynamic environments, and cost-effective access to competent pre-trained models through LLM providers. The current landscape of LLM-aided penetration testing is categorized into interactive ‘vibe-hacking’ and the emergence of fully autonomous systems. The paper identifies and discusses significant obstacles impeding wider adoption and safe deployment. These include critical issues concerning model reliability and stability, paramount safety and security concerns, substantial monetary and ecological costs, implications for privacy and digital sovereignty, complex questions of accountability, and profound ethical dilemmas. This comprehensive review and analysis provides a foundation for discussion on future research directions and the development of robust safeguards at the intersection of AI and security. ...

July 1, 2025 · 2 min · 238 words · Andreas Happe

Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

Presented at DeMeSSAI'25 in Venice, Italy, arxiv version: Large Language Models (LLMs) have emerged as a powerful approach for driving offensive penetration-testing tooling. Due to the opaque nature of LLMs, empirical methods are typically used to analyze their efficacy. The quality of this analysis is highly dependent on the chosen testbed, captured metrics and analysis methods employed. This paper analyzes the methodology and benchmarking practices used for evaluating Large Language Model (LLM)-driven attacks, focusing on offensive uses of LLMs in cybersecurity. We review 19 research papers detailing 18 prototypes and their respective testbeds. We detail our findings and provide actionable recommendations for future research, emphasizing the importance of extending existing testbeds, creating baselines, and including comprehensive metrics and qualitative analysis. We also note the distinction between security research and practice, suggesting that CTF-based challenges may not fully represent real-world penetration testing scenarios. ...

June 16, 2025 · 1 min · 142 words · Andreas Happe

Homeserver: Glances and Home Assistant for Monitoring

Now that I have a minimal home server running, I thought it would be a good idea to monitor temperature, disk usage, and the like. The simplest solution I found was to use Glances to collect the data and Home Assistant to store and display it. ...
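The gist of the setup: run Glances in web-server mode so it exposes a REST API, then let Home Assistant poll that API. A minimal sketch of the server side, assuming Glances lives at /usr/bin/glances (the path depends on how you installed it):

```sh
# Write a small systemd unit that keeps Glances running in
# web-server mode (REST API on port 61208 by default):
sudo tee /etc/systemd/system/glances.service > /dev/null <<'EOF'
[Unit]
Description=Glances system monitor (web-server mode)
After=network.target

[Service]
ExecStart=/usr/bin/glances -w
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now glances.service
```

Home Assistant's Glances integration can then be pointed at http://<server>:61208 from the integrations UI.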

April 30, 2025 · 2 min · 242 words · Andreas Happe

Homeserver: Creating local Proton Drive/Mail Backups

By now, I am using Proton Drive for cloud data storage and Proton Mail as my primary mail service. While I trust Proton with my data, I do not want to rely on them completely. As I have a small server standing around at home, it was an obvious step to use it for automatic backups of my cloud data. I use systemd services and timers for this, as they make monitoring and logging quite easy. This blog post mostly serves as a reminder for me, but maybe it helps someone else as well. ...
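To give a flavor of the systemd approach, here is a minimal service/timer pair. The actual backup command is a stand-in: rclone has shipped a protondrive backend since v1.64, but the remote name proton: and the target path are assumptions, not necessarily what the post ends up using:

```sh
# A oneshot service that performs the backup when triggered:
sudo tee /etc/systemd/system/proton-backup.service > /dev/null <<'EOF'
[Unit]
Description=Backup Proton Drive to local disk

[Service]
Type=oneshot
# assumes an rclone remote named "proton" configured via `rclone config`
ExecStart=/usr/bin/rclone sync proton: /srv/backups/proton-drive
EOF

# A timer that triggers the service once a day; Persistent=true
# catches up on runs missed while the server was powered off:
sudo tee /etc/systemd/system/proton-backup.timer > /dev/null <<'EOF'
[Unit]
Description=Run the Proton Drive backup daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now proton-backup.timer
```

The promised monitoring and logging come for free: journalctl -u proton-backup.service shows every run.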

April 27, 2025 · 8 min · 1616 words · Andreas Happe

Homeserver: Services Pt. 1

I have been running a home server for a while now and use it to host some services that I use regularly. In this post, I will share my experience with some of the services I have set up on it. This initial post covers local git hosting using gitea, audiobook streaming using audiobookshelf, and a self-hosted RSS reader using tt-rss. ...
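To show how little it takes to stand these up, here is a hedged sketch of running two of the three as containers, using the officially documented images; whether the post actually deploys them this way, and the ports and volumes chosen, are assumptions on my part:

```sh
# Gitea: web UI on :3000, git-over-SSH mapped to :2222
docker run -d --name gitea \
  -p 3000:3000 -p 2222:22 \
  -v /srv/gitea:/data \
  gitea/gitea:latest

# Audiobookshelf: the container listens on :80 internally
docker run -d --name audiobookshelf \
  -p 13378:80 \
  -v /srv/audiobooks:/audiobooks \
  -v /srv/audiobookshelf/config:/config \
  -v /srv/audiobookshelf/metadata:/metadata \
  ghcr.io/advplyr/audiobookshelf
```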

April 9, 2025 · 6 min · 1239 words · Andreas Happe

Using tailscale on Fedora Silverblue

I am using Fedora Silverblue as one of my main desktops. Recently, I’ve been moving some services to a server behind tailscale, but was still using its local IP address when at home at my Silverblue desktop. While doable, accessing an IP address with an invalid HTTPS certificate wasn’t pretty. So why not access the server through tailscale even within the same network? It is an overlay network, after all, so it should establish a direct connection between my desktop and the home server. ...
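Conveniently, tailscale can tell you whether the overlay actually takes that direct path. A quick check, where homeserver stands in for whatever the machine is called in your tailnet:

```sh
# Probe the path to the peer; the output reports whether packets
# flow directly or via a DERP relay:
tailscale ping homeserver

# Per-peer connection state (direct endpoint vs. relay) at a glance:
tailscale status
```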

April 7, 2025 · 1 min · 175 words · Andreas Happe