AI in Pentesting: Disruption and Evolution

In this blog post, our Principal Consultant Rohit Misuriya dives into the high-stakes intersection of cybersecurity and artificial intelligence: Pentesting in the AI Era. Rohit explores the “iPhone moment” currently disrupting the IT landscape, examining how AI is reshaping the value of code and the methodologies we use to secure it. The blog covers everything from the evolution of PTaaS and crowdsourced models to the rigorous technical and financial realities of deploying private, 617B-parameter AI stacks for enterprise-grade security.

Rohit breaks down the critical nuances of AI-driven audits, including the challenges of business logic understanding, the shifting landscape of liability, and the “hidden” infrastructure costs that prevent AI from being a simple, cheap replacement for human expertise.

The iPhone Moment for Cybersecurity

The cybersecurity landscape is experiencing a shift comparable to the launch of the iPhone. We are witnessing a total disruption in how we write, value, and secure code. As AI, led by powerhouses like Claude, GPT-4, and open-weight models such as DeepSeek and Llama, takes center stage, the pentesting industry stands at a crossroads that will define its next decade.

For years, manual penetration testing has been the undisputed gold standard. No automated product could match a human’s ability to understand complex business logic, organizational context, and adversarial intent. However, as AI begins to close the gap in speed, pattern recognition, and even rudimentary reasoning, we must confront an uncomfortable question: Is the era of purely manual pentesting over?

The short answer is no. The longer answer is far more nuanced, and far more consequential for how organizations budget, staff, and structure their security programs.


The Evolution: From Manual to AI-Augmented

The security testing industry has already moved through several distinct phases over the past fifteen years, each promising to solve the scale problem that plagues traditional consulting engagements.

Crowdsourced Models: Bug Bounty Platforms

Platforms like HackerOne and Bugcrowd democratized vulnerability discovery by tapping into a global pool of researchers. They introduced pay-for-results economics, but suffered from inconsistent coverage and a “cherry-picking” problem where researchers gravitate toward high-reward, low-effort targets.

PTaaS (Pentest as a Service)

PTaaS streamlined delivery by combining automated scanning with human validation, offering continuous or on-demand engagements through a SaaS-style interface. This improved turnaround times and provided dashboard-driven visibility, but the underlying testing still relied heavily on human judgment for anything beyond known vulnerability classes.

CTEM (Continuous Threat Exposure Management)

CTEM represented a shift toward a holistic, product-led approach that treats security testing as an ongoing program rather than a point-in-time event. Gartner’s formalization of the CTEM framework in 2022 signaled that the industry was ready to move beyond periodic assessments, but the tooling and automation needed to make this vision practical were still maturing.

The Core Insight: Amplification Over Replacement

Despite these advancements, high-stakes security has always relied on human intuition. AI is exceptional at scanning millions of lines of code for known patterns, but it still struggles with authorization boundaries, multi-step business logic flaws, and real-world impact assessment. The key insight is that AI doesn’t need to replace this intuition; it needs to amplify it.


The Infrastructure Reality: Faster ≠ Cheaper

There is a pervasive misconception that AI will immediately drive down the cost of security audits. Executives see the speed of LLM-based analysis and assume the economics must follow. In reality, doing AI-driven pentesting properly, with respect for data confidentiality, regulatory compliance, and professional standards, is incredibly expensive.

Why You Can’t Just Use a Public LLM

If you are a regulated or security-mature organization, you cannot simply feed proprietary source code, network architectures, or vulnerability data into a public LLM endpoint. The data residency implications alone are disqualifying for most enterprises subject to HIPAA, PCI-DSS, SOC 2, or GDPR. Every token sent to a third-party API represents a potential data exposure event, and in the context of a penetration test, those tokens may describe the very vulnerabilities you are trying to keep confidential.

A professional AI pentesting stack requires isolated, purpose-built infrastructure that treats the model as a privileged component within the engagement’s security boundary.
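
To make that boundary concrete, here is a minimal sketch of routing all model traffic to a self-hosted, OpenAI-compatible inference endpoint inside the engagement environment instead of a public API. The internal hostname, model name, and environment variable are hypothetical placeholders for whatever a private stack actually exposes.

```python
# Minimal sketch: send all LLM traffic to a self-hosted, OpenAI-compatible
# endpoint inside the engagement's security boundary. The hostname, model
# name, and environment variable are hypothetical placeholders.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.pentest-lab.local/v1",  # private endpoint, no public egress
    api_key=os.environ["INTERNAL_LLM_API_KEY"],            # credential scoped to this engagement
)

response = client.chat.completions.create(
    model="private-code-review-model",  # whichever model is served internally
    messages=[
        {"role": "system", "content": "You are assisting a source-assisted security review."},
        {"role": "user", "content": "Summarize potential injection sinks in the attached diff."},
    ],
    temperature=0.0,  # deterministic output simplifies later review and audit
)
print(response.choices[0].message.content)
```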

The 617B Model Challenge: What It Actually Takes

To illustrate the real cost of AI-powered security at scale, consider what it takes to deploy a frontier-class 617-billion-parameter model: the kind of model capable of deep code comprehension, multi-step vulnerability reasoning, and nuanced business logic analysis.

Memory and Compute Requirements

At FP16 (half-precision) inference, a 617B-parameter model requires approximately 1.2 terabytes of GPU memory just to hold the model weights. With the KV cache, activation memory, and operational overhead, practical deployment demands 16 to 20 NVIDIA H100 GPUs spread across two to three nodes connected via high-speed InfiniBand interconnects. Each H100 provides 80 GB of HBM3 memory and costs between $25,000 and $40,000 per unit when purchased outright, meaning the GPU hardware alone for a single eight-GPU node runs roughly $200,000 to $320,000.

Quantization techniques (such as INT8 or INT4) can reduce memory requirements by 50–75%, but this comes at the cost of model accuracy, a trade-off that is particularly risky in security contexts where hallucinated findings or missed vulnerabilities have direct business impact.
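
These figures are easy to sanity-check with back-of-envelope arithmetic. The sketch below (a rough estimate, not a capacity plan) computes weight memory and a minimum H100 count at FP16, INT8, and INT4 using approximately 2, 1, and 0.5 bytes per parameter; the 1.2× headroom factor for KV cache and activations is an assumption for illustration, and real deployments need more margin as batch size and context length grow.

```python
# Back-of-envelope sizing for serving a large dense model on 80 GB H100s.
# The 1.2x overhead factor (KV cache, activations, fragmentation) is an
# assumed illustration, not a measured deployment figure.
import math

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
H100_MEMORY_GB = 80
OVERHEAD_FACTOR = 1.2  # assumed headroom for KV cache and activations

def estimate(params_billion: float, precision: str) -> tuple[float, int]:
    """Return (total GB of GPU memory needed, minimum H100 count)."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]  # 1e9 params x N bytes = N GB per billion
    total_gb = weights_gb * OVERHEAD_FACTOR
    return total_gb, math.ceil(total_gb / H100_MEMORY_GB)

for precision in ("fp16", "int8", "int4"):
    total_gb, gpus = estimate(617, precision)
    print(f"617B @ {precision}: ~{total_gb:,.0f} GB -> at least {gpus}x H100")
```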

Infrastructure Cost Breakdown

The following table provides a realistic breakdown of the costs associated with deploying and operating a private AI pentesting infrastructure:

| Component | Specification | Estimated Cost |
| --- | --- | --- |
| GPU Cluster (8× H100 SXM) | 80 GB HBM3 per GPU, NVLink interconnect | $25,000–$40,000 per GPU ($200K–$320K per node) |
| Cloud Rental (8× H100) | On-demand via AWS / GCP / specialized providers | $2.10–$6.98 per GPU-hour ($16.80–$55.84 per node-hour) |
| InfiniBand Networking | 400 Gbps NDR for multi-node communication | $15,000–$30,000 per switch |
| Storage (NVMe SSD) | High-speed model weight storage & checkpoints | $0.08–$0.12 per GB/month |
| Power & Cooling | 700 W per H100 + 15–30% cooling overhead | ~$60/month per GPU (at $0.12/kWh) |
| MLOps Engineering | Model optimization, monitoring, and incident response | ~$135,000/year average salary |
| Data Egress & Bandwidth | Cross-region transfer fees | $0.05–$0.12 per GB |
| Compliance Overhead | HIPAA/PCI/SOC 2 environment hardening | 5–15% added to infrastructure cost |

Model Sizing Reference

Different engagement types may call for different model sizes. Here is a practical sizing guide for common AI pentesting workloads:

| Model Size | FP16 Memory | Min. GPUs (H100 80 GB) | Typical Use Case |
| --- | --- | --- | --- |
| 70B | ~140 GB | 2× H100 | Fast inference, code review, pattern scanning |
| 405B | ~810 GB | 12× H100 (2 nodes) | Deep vulnerability analysis, complex reasoning |
| 617B | ~1.2 TB | 16–20× H100 (2–3 nodes) | Frontier-class security research, full-scope pentesting |
| 671B (MoE) | ~800 GB–1.2 TB* | 12–16× H100 | Cost-effective large-scale inference via sparse activation |

* MoE (Mixture of Experts) models like DeepSeek-V3 (671B) activate only a subset of parameters per token, reducing effective compute requirements while maintaining large model capacity.

The Cloud Rental Alternative

For organizations that cannot justify the capital expenditure of purchasing GPU hardware, cloud rental offers a flexible alternative, but it is not cheap. Current market rates for H100 GPUs range from $2.10 per GPU-hour on specialized providers like GMI Cloud to $6.98 per GPU-hour on Azure. AWS recently cut H100 pricing by approximately 44%, bringing P5 instances to around $3.90 per GPU-hour.

For a 617B model requiring 16 H100 GPUs, cloud inference costs range from approximately $33.60 to $111.68 per hour, depending on the provider. A typical week-long pentesting engagement running inference eight hours per day would incur GPU costs alone of $1,900 to $6,250-before accounting for storage, data transfer, engineering time, and compliance overhead.
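
The arithmetic behind those figures is simple enough to script. The sketch below reproduces the rough estimate above from the per-GPU-hour rates quoted in this post; treat the rates and schedule as illustrative inputs rather than vendor quotes.

```python
# Rough GPU-rental estimate for a 16x H100 inference cluster during a single
# engagement. Rates and schedule are the illustrative figures quoted above,
# not vendor quotes.
GPU_COUNT = 16
RATE_LOW, RATE_HIGH = 2.10, 6.98   # USD per GPU-hour
HOURS_PER_DAY, DAYS = 8, 7         # week-long engagement, eight inference hours per day

hours = HOURS_PER_DAY * DAYS
low, high = GPU_COUNT * RATE_LOW * hours, GPU_COUNT * RATE_HIGH * hours
print(f"Cluster rate: ${GPU_COUNT * RATE_LOW:.2f}-${GPU_COUNT * RATE_HIGH:.2f} per hour")
print(f"GPU cost for {hours} inference hours: ${low:,.0f}-${high:,.0f}")
```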

The bottom line: we aren’t necessarily saving money by adopting AI. We are shifting the budget from human billable hours to compute and risk management infrastructure. The total cost of an AI-augmented engagement may be comparable to a traditional one, but the depth and coverage achieved can be dramatically greater.


The Accountability Gap: Who Takes the Blame?

Perhaps the most uncomfortable shift in AI-augmented pentesting involves liability and professional accountability. When a manual pentester misses a critical vulnerability, there is a clear chain of responsibility: a named expert, a specific methodology, documented reasoning, and professional judgment that can be examined and defended.

With autonomous AI agents performing security analysis, the lines blur dramatically. When a finding is missed, or a false positive wastes days of remediation effort, the post-mortem becomes a tangled web of questions.

The Attribution Problem

Model Hallucination

Did the LLM fabricate a vulnerability that doesn’t exist, or miss one that does, because of an inherent limitation in its training data or reasoning chain?

Prompt Engineering

Was the prompt insufficiently specific, or did it inadvertently constrain the model’s analysis in ways that caused blind spots?

Human Oversight

Did the human reviewer adequately validate the AI’s output, or did over-reliance on automation create a false sense of completeness?

Tool Chain Failure

Did the orchestration layer between the AI and the target environment introduce errors such as dropped connections, incomplete data feeds, or misrouted test traffic?

Clients don’t just pay for a list of bugs. They pay for assurance: a professional guarantee that a competent, accountable expert examined their systems with appropriate rigor. Until an AI can carry professional liability insurance, provide a defensible decision trail, and testify to its methodology under regulatory scrutiny, humans must remain the “Pilot,” while AI serves as the “Co-pilot.”

The regulatory landscape is catching up. The EU AI Act, NIST’s AI Risk Management Framework, and evolving standards from bodies like CREST, along with certification programs such as OSCP, are beginning to address the question of AI in security testing. Organizations that get ahead of these requirements now will be better positioned as formal guidance crystallizes.


Redefining the Pentester’s Role

AI isn’t here to replace the pentester. It’s here to replace the tedium.

The most time-consuming phases of a penetration test (reconnaissance, asset enumeration, baseline vulnerability scanning, and test case generation) are precisely the tasks where AI delivers transformative value. By offloading these to AI agents, human pentesters can focus on what they do best: breaking complex business logic and thinking like a sophisticated adversary.

Where AI Excels Today

Reconnaissance and Asset Discovery

AI agents can aggregate data from dozens of OSINT sources simultaneously, correlate findings across domains and IP ranges, and build comprehensive attack surface maps in minutes rather than hours. Modern tools support integration with over 300 AI models from providers including OpenAI and Anthropic, as well as open-source alternatives, enabling security teams to match the right model to the right task.
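
As a hedged illustration of the aggregation step, the sketch below collects hostnames from certificate transparency logs via crt.sh’s public JSON interface and deduplicates them; the correlation step is left as a comment because it depends on whichever private model endpoint the engagement uses. Run discovery only against domains you are explicitly authorized to assess.

```python
# Hedged sketch: aggregate subdomains from certificate transparency logs and
# hand the deduplicated list to an internal model for correlation. Only test
# domains you are explicitly authorized to assess.
import requests

def ct_subdomains(domain: str) -> set[str]:
    """Query crt.sh's public JSON endpoint for names seen in CT logs."""
    resp = requests.get(
        "https://crt.sh/",
        params={"q": f"%.{domain}", "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    names = set()
    for entry in resp.json():
        for name in entry.get("name_value", "").splitlines():
            names.add(name.strip().lower().lstrip("*."))
    return names

if __name__ == "__main__":
    subdomains = ct_subdomains("example.com")  # replace with an authorized target
    print(f"{len(subdomains)} unique names found")
    # Next step (not shown): feed the list to the private LLM endpoint from the
    # earlier sketch and ask it to cluster hosts by likely function and exposure.
```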

Pattern Discovery at Scale

Large language models are remarkably effective at identifying known vulnerability patterns across vast codebases. Static analysis that once required days of human review can now surface potential SQL injection, XSS, deserialization, and authentication bypass candidates in a fraction of the time. The OWASP LLM Top 10 and MITRE ATLAS frameworks provide structured approaches to evaluating AI system security, while tools like IBM’s Adversarial Robustness Toolbox and Microsoft’s PyRIT offer practical testing capabilities.
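
A minimal sketch of that workflow, assuming the private endpoint from the earlier example: walk a repository, chunk each source file, and ask the model for structured candidate findings that a human reviewer then triages. The prompt, chunk size, file glob, and model name are illustrative choices, not a prescribed methodology.

```python
# Minimal sketch: LLM-assisted pattern scanning over a codebase. The endpoint,
# model name, prompt, and chunk size are illustrative; outputs are candidates
# for human triage, never report-ready findings.
from pathlib import Path

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.pentest-lab.local/v1",  # private endpoint from the earlier sketch
    api_key="engagement-scoped-key",                        # placeholder credential
)

PROMPT = (
    "You are assisting a source code security review. List candidate SQL "
    "injection, XSS, insecure deserialization, or authentication-bypass "
    "patterns in the code below as JSON objects with file, line, and reason."
)

def scan_file(path: Path, chunk_chars: int = 8000) -> list[str]:
    """Send one file to the model in chunks and collect raw candidate findings."""
    text = path.read_text(errors="ignore")
    findings = []
    for start in range(0, len(text), chunk_chars):
        resp = client.chat.completions.create(
            model="private-code-review-model",
            messages=[
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": f"// {path}\n{text[start:start + chunk_chars]}"},
            ],
            temperature=0.0,
        )
        findings.append(resp.choices[0].message.content)
    return findings

for source in Path("target-repo").rglob("*.py"):  # extend the glob per language
    for raw in scan_file(source):
        print(f"[candidate] {source}: {raw[:120]}")
```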

Automated Test Case Generation

AI can rapidly generate and iterate on test scripts targeting specific edge cases, API endpoints, or authentication flows. Tools like PentestGPT and Strix demonstrate how AI agents can behave like human attackers, executing code in real conditions, identifying vulnerabilities, and verifying each issue with proof-of-concept exploits, completing in hours what might take days manually.
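
As one hedged example of the pattern (not a description of how PentestGPT or Strix work internally), the sketch below asks a model to propose authorization and business-logic test cases for a hypothetical refund endpoint; the suggestions are then reviewed and executed by a human tester, never fired automatically.

```python
# Hedged sketch: ask a model to propose authorization and business-logic test
# cases for an API endpoint. The endpoint description, model name, and prompt
# are placeholders; suggestions are reviewed and run by a human tester.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.pentest-lab.local/v1",  # private endpoint
    api_key="engagement-scoped-key",                        # placeholder credential
)

endpoint_description = """
POST /api/v1/orders/{order_id}/refund
Auth: bearer token (customer role)
Body: {"amount": number, "reason": string}
"""

resp = client.chat.completions.create(
    model="private-code-review-model",
    messages=[
        {
            "role": "system",
            "content": "Propose authorization and business-logic test cases "
                       "(IDOR, replay, negative amounts, state-machine abuse) "
                       "for the endpoint below. One test case per line.",
        },
        {"role": "user", "content": endpoint_description},
    ],
    temperature=0.3,
)
print(resp.choices[0].message.content)
```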

Where Humans Remain Essential

The irreplaceable value of the human pentester lies in adversarial creativity: the ability to chain together seemingly unrelated findings into a catastrophic attack path, understand organizational context that no model training data can capture, and make judgment calls about real-world exploitability versus theoretical risk.

Authorization boundary testing, multi-step privilege escalation through business logic flaws, social engineering vectors, and the ability to articulate findings in language that resonates with both technical and executive stakeholders all remain firmly in the human domain. As one leading security testing engineer noted, when it comes to AI platforms, we don’t fully understand what they are capable of, how they evolve, or how they handle our data. This inherent opacity makes human oversight not just valuable but essential.


The Road Ahead: What to Expect in 2026–2027

Declining Inference Costs

LLM inference costs have declined roughly tenfold annually over the past two years. Performance equivalent to early GPT-4 now costs approximately $0.40 per million tokens, compared to $20 in late 2022. Cloud H100 prices have stabilized at $2.85–$3.50 per hour after declining 64–75% from their peaks. As competition from AMD’s MI300 series, Google TPUs, and custom accelerators intensifies, expect AI-powered pentesting to become incrementally more accessible, though infrastructure complexity will remain a barrier.

However, cost reduction alone doesn’t guarantee adoption. Enterprise procurement cycles remain slow to adapt, bandwidth bottlenecks constrain real-time inference at scale, data privacy friction continues to limit what can be sent to even private model endpoints, and regulatory drag means compliance frameworks will lag behind the technology they aim to govern. The path to widespread AI-augmented pentesting won’t be gated by model capability. It’ll be gated by organizational readiness.

Continuous AI-Augmented Testing

The CTEM vision will become practical as AI agents capable of persistent, low-intensity security monitoring mature. Rather than episodic engagements, organizations will deploy AI “security sentinels” that continuously probe for regressions, new exposures, and configuration drift, with human experts called in for deep-dive analysis when anomalies are detected.
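
A stripped-down sketch of that sentinel idea: re-run asset discovery on a schedule, diff the result against the previous snapshot, and escalate anything new to a human analyst. The discovery function, persistence, and alerting shown here are placeholders; a production sentinel would also track configuration and exposure changes, not just new hostnames.

```python
# Stripped-down "security sentinel" loop: periodically re-run asset discovery,
# diff against the previous snapshot, and flag new exposure for human review.
# Persistence, alerting, and the discovery function itself are placeholders.
import json
import time
from pathlib import Path

SNAPSHOT = Path("attack_surface.json")

def discover_assets() -> set[str]:
    """Placeholder: plug in CT-log, DNS, and cloud-inventory discovery here."""
    return {"app.example.com", "api.example.com"}

def load_previous() -> set[str]:
    return set(json.loads(SNAPSHOT.read_text())) if SNAPSHOT.exists() else set()

while True:
    current = discover_assets()
    new_assets = current - load_previous()
    if new_assets:
        # In practice this would page an analyst or open a ticket for deep-dive review.
        print(f"[sentinel] new exposure detected: {sorted(new_assets)}")
    SNAPSHOT.write_text(json.dumps(sorted(current)))
    time.sleep(24 * 3600)  # daily cadence; tune per risk appetite
```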

Regulatory Crystallization

Expect formal guidance from CREST, OWASP, and national cybersecurity agencies on the acceptable use of AI in professional security testing. This will include standards for AI output validation, minimum human oversight requirements, and disclosure obligations when AI tools are used in client engagements.

Final Thoughts

We are in a breakneck phase of evolution. The tools are changing faster than the frameworks governing their use, the cost structures are shifting faster than procurement processes can adapt, and the threat landscape is evolving faster than either. The emergence of agentic AI (autonomous systems that don’t just analyze but act: invoking tools, making decisions, and triggering workflows without human oversight) adds an entirely new dimension to this challenge.

OWASP has already responded with the release of its Top 10 for Agentic AI Applications (December 2025), a peer-reviewed framework addressing risks like agent behavior hijacking, tool misuse, and identity and privilege abuse. Alongside this, initiatives such as the AI Vulnerability Scoring System (AIVSS) and practical guides for securing third-party MCP servers signal that the security community is rapidly building the guardrails that agentic deployments demand. For pentesters, these frameworks aren’t just reference material; they’re the new baseline for testing AI systems that can do far more than talk. AI-powered PTaaS solutions will continue to grow in sophistication, but the human element remains the anchor of professional security assurance.

We are moving toward a future where humans control AI agents to achieve faster, more thorough, and more consistent results, not a future where we abdicate our responsibility to the machine. The organizations that thrive will be those that invest in both: the infrastructure to run frontier-class AI models privately and securely, and the human talent capable of directing, validating, and taking accountability for the results.

TL;DR

AI doesn’t replace pentesters. It replaces the parts of pentesting that pentesters don’t enjoy or shouldn’t be spending time on. The future is human-led, AI-augmented security, and the infrastructure to support it is neither simple nor cheap.

SecOps Research Team

AI/ML Security Specialists

The SecOps Research Team comprises security researchers, data scientists, and penetration testers specializing in AI/ML security. With over 15 years of combined experience, the team has conducted security assessments for Fortune 500 companies across multiple industries, identifying and remediating critical vulnerabilities in production AI/ML systems.