This is a tutorial on how to reverse engineer shellcode in malware with Radare2. Spoilers!

MalwareTech published a small challenge on his Twitter about reverse engineering shellcode embedded in malware. I thought this was a great opportunity to write a small tutorial on how to do this with Radare2 on a Mac.

You can download the sample here. After you unpack the archive, you will find a single .exe file with readme instructions saying that this is a static analysis challenge (so no debugging).

You can run the executable to get the MD5 hash of the correct flag:

[Screenshot: the program's output with the MD5 hash of the flag]

Let's launch Radare and see what's inside of the executable.

radare2 shellcode1.exe
 -- r2 talks to you. tries to make you feel well.
[0x00402270]>

Radare loads up and puts you into the entry point of the program. You can verify that by running ie:

[0x00402270]> ie
[Entrypoints]
vaddr=0x00402270 paddr=0x00001670 baddr=0x00400000 laddr=0x00000000 haddr=0x00000108 type=program

1 entrypoints

The next step is to analyze the binary. This is where Radare differs from Hopper and IDA: you have to start the analysis manually. Analysis can take a while, so the radare developers decided not to run it by default. To start it, type aaa.

[0x00402270]> aaa
[ WARNING : block size exceeding max block size at 0x00401000
[+] Try changing it with e anal.bb.maxsize
[x] Analyze all flags starting with sym. and entry0 (aa)
[x] Analyze function calls (aac)
[x] Analyze len bytes of instructions for references (aar)
[x] Use -AA or aaaa to perform additional experimental analysis.
[x] Constructing a function name for fcn.* and sym.func.* functions (aan)

After the analysis is complete, radare will group sections, functions, symbols, strings, etc under flags. You can see those by typing fs:

[0x00402270]> fs
0   19 * strings
1   11 * symbols
2   10 * sections
3    9 * relocs
4    9 * imports
5    1 * resources
6    4 * functions

Let's take a look at the functions radare discovered. afl stands for analyzed functions list.

[0x00402270]> afl
0x00401000    1 1022         sym.shellcode1.exe__MD5Transform_MD5__CAXQAKQAE_Z
0x00401e50    5 160          sym.shellcode1.exe__Encode_MD5__CAXPAEPAKI_Z
0x00401ef0    5 117          sym.shellcode1.exe__Decode_MD5__CAXPAKPAEI_Z
0x00401f70    1 22           sym.shellcode1.exe___0MD5__QAE_XZ
0x00401f90    1 70           sym.shellcode1.exe__Init_MD5__QAEXXZ
0x00401fe0   10 263          sym.shellcode1.exe__Update_MD5__QAEXPAEI_Z
0x004020f0    4 161          sym.shellcode1.exe__Final_MD5__QAEXXZ
0x004021a0    5 74           sym.shellcode1.exe__writeToString_MD5__QAEXXZ
0x004021f0    1 51           sym.shellcode1.exe__digestMemory_MD5__QAEPADPAEH_Z
0x00402230    1 60           sym.shellcode1.exe__digestString_MD5__QAEPADPAD_Z
0x00402270    1 175          entry0
0x00402326    1 6            sub.ntdll.dll_memset_326
0x0040232c    1 6            sub.ntdll.dll_memcpy_32c
0x00402332    1 6            sub.ntdll.dll_sprintf_332
0x00402338    1 6            sub.ntdll.dll_strlen_338

Interesting, so here we have a number of MD5-related functions and the main function labeled entry0. Given we are looking for the shellcode, it is likely to be kept in data. Let's take a look at the strings with command iz:

[0x00402270]> iz
000 0x00001830 0x00403030  23  24 (.rdata) ascii We've been compromised!
001 0x00001848 0x00403048   4   5 (.rdata) ascii %02x
002 0x000018d2 0x004030d2  11  12 (.rdata) ascii ExitProcess
003 0x000018e0 0x004030e0  12  13 (.rdata) ascii VirtualAlloc
004 0x000018f0 0x004030f0   9  10 (.rdata) ascii HeapAlloc
005 0x000018fc 0x004030fc  14  15 (.rdata) ascii GetProcessHeap
006 0x0000190c 0x0040310c  12  13 (.rdata) ascii KERNEL32.dll
007 0x0000191c 0x0040311c   6   7 (.rdata) ascii memset
008 0x00001926 0x00403126   6   7 (.rdata) ascii memcpy
009 0x00001930 0x00403130   7   8 (.rdata) ascii sprintf
010 0x0000193a 0x0040313a   6   7 (.rdata) ascii strlen
011 0x00001942 0x00403142   9  10 (.rdata) ascii ntdll.dll
012 0x0000194e 0x0040314e  11  12 (.rdata) ascii MessageBoxA
013 0x0000195a 0x0040315a  10  11 (.rdata) ascii USER32.dll
014 0x00001984 0x00403184   4  20 (.rdata) utf32le \n\n㆘㇀ blocks=Basic Latin,Kanbun,CJK Strokes
015 0x000019f6 0x004031f6 135 272 (.rdata) utf16le \a\b\t桳汥捬摯ㅥ攮數㼀〿䑍䀵兀䕁塀Z䐿捥摯䁥䑍䀵䍀塁䅐偋䕁䁉Z䔿据摯䁥䑍䀵䍀塁䅐偅䭁䁉Z䘿湩污䵀㕄䁀䅑塅婘㼀湉瑩䵀㕄䁀䅑塅婘㼀䑍吵慲獮潦浲䵀㕄䁀䅃兘䭁䅑䁅Z唿摰瑡䁥䑍䀵兀䕁偘䕁䁉Z搿杩獥䵴浥牯䁹䑍䀵兀䕁䅐偄䕁䁈Z搿杩獥却牴湩䁧䑍䀵兀䕁䅐偄䑁婀㼀牷瑩呥卯牴湩䁧䑍䀵兀䕁塘Z blocks=Basic Latin,CJK Unified Ideographs,Hangul Compatibility Jamo,CJK Unified Ideographs Extension A,CJK Symbols and Punctuation
000 0x00001c40 0x00404040   9  11 (.data)  utf8 2b\n:ۚB*bb blocks=Basic Latin,Arabic
001 0x00001c4b 0x0040404b   5   6 (.data) ascii z"*iJ
000 0x00001e58 0x00405058 424 424 (.rsrc) ascii <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">\r\n  <trustInfo xmlns="urn:schemas-microsoft-com:asm.v3">\r\n    <security>\r\n      <requestedPrivileges>\r\n        <requestedExecutionLevel level="asInvoker" uiAccess="false"></requestedExecutionLevel>\r\n      </requestedPrivileges>\r\n    </security>\r\n  </trustInfo>\r\n</assembly>PAPADDINGXXPADDINGPADDINGXXPADDINGPADDINGXXPADDINGPADDINGXXPADDINGPADDINGXXPAD

We see the "We've been compromised!" string and also some gibberish in .rdata @ 0x00001984 and in .data @ 0x00001c40. This is interesting because when the task is to find shellcode, in the simplest case we know to look for two things:

  1. An encoded string, which is unlikely to be plain ASCII, so radare won't be able to display it cleanly.
  2. The shellcode itself, which is raw machine code and will also show up as gibberish.
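
As a rough illustration of the first heuristic, here is a minimal sketch in C that counts the bytes in a buffer that fall outside the printable ASCII range. The function name count_nonprintable and the 0x20..0x7e range are choices made for this example; this is not how radare classifies strings internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count bytes outside printable ASCII (0x20..0x7e).
 * A high count relative to len hints at encoded data or machine code. */
size_t count_nonprintable(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] < 0x20 || buf[i] > 0x7e)
            n++;
    }
    return n;
}
```

A format string such as "%02x" scores zero, while an encoded blob or a run of opcodes usually does not.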

Let's take a look at the main function. We should already be at its offset, but for the sake of practice, note that you can move to any function by typing s followed by its name or address:

[0x00402270]> s 0x00402270

To disassemble the function, type pdf:

[0x00402270]> pdf
/ (fcn) entry0 175
|   entry0 ();
|           ; var int local_a0h @ ebp-0xa0
|           ; var int local_9ch @ ebp-0x9c
|           ; var int local_98h @ ebp-0x98
|           ; var int local_4h @ ebp-0x4
|           0x00402270      55             push ebp
|           0x00402271      8bec           mov ebp, esp
|           0x00402273      81eca0000000   sub esp, 0xa0
|           0x00402279      56             push esi
|           0x0040227a      8d8d68ffffff   lea ecx, [local_98h]
|           0x00402280      e8ebfcffff     call sym.shellcode1.exe___0MD5__QAE_XZ
|           0x00402285      6a10           push 0x10                   ; 16
|           0x00402287      6a00           push 0
|           0x00402289      ff1508304000   call dword [sym.imp.KERNEL32.dll_GetProcessHeap] ; 0x403008
|           0x0040228f      50             push eax
|           0x00402290      ff1504304000   call dword [sym.imp.KERNEL32.dll_HeapAlloc] ; 0x403004
|           0x00402296      8945fc         mov dword [local_4h], eax
|           0x00402299      8b45fc         mov eax, dword [local_4h]
|           0x0040229c      c70040404000   mov dword [eax], str.2b__:__B_bb ; [0x404040:4]=0x3a0a6232 ; "2b\n:\u06daB*bb\x1az\"*iJ\x9ar\xa2iR\xaa\x9a\xa2i2z\x92i*\u0082bzJ\xa2\x9a\xeb"
|           0x004022a2      6840404000     push str.2b__:__B_bb        ; 0x404040 ; "2b\n:\u06daB*bb\x1az\"*iJ\x9ar\xa2iR\xaa\x9a\xa2i2z\x92i*\u0082bzJ\xa2\x9a\xeb"
|           0x004022a7      e88c000000     call sub.ntdll.dll_strlen_338 ; size_t strlen(const char *s)
|           0x004022ac      83c404         add esp, 4
|           0x004022af      8b4dfc         mov ecx, dword [local_4h]
|           0x004022b2      894104         mov dword [ecx + 4], eax
|           0x004022b5      6a40           push 0x40                   ; '@' ; 64
|           0x004022b7      6800100000     push 0x1000
|           0x004022bc      6a0d           push 0xd                    ; 13
|           0x004022be      6a00           push 0
|           0x004022c0      ff1500304000   call dword [sym.imp.KERNEL32.dll_VirtualAlloc] ; 0x403000
|           0x004022c6      898560ffffff   mov dword [local_a0h], eax
|           0x004022cc      6a0d           push 0xd                    ; 13
|           0x004022ce      6868404000     push 0x404068
|           0x004022d3      8b9560ffffff   mov edx, dword [local_a0h]
|           0x004022d9      52             push edx
|           0x004022da      e84d000000     call sub.ntdll.dll_memcpy_32c ; void *memcpy(void *s1, const void *s2, size_t n)
|           0x004022df      83c40c         add esp, 0xc
|           0x004022e2      8b75fc         mov esi, dword [local_4h]
|           0x004022e5      ff9560ffffff   call dword [local_a0h]
|           0x004022eb      6840404000     push str.2b__:__B_bb        ; 0x404040 ; "2b\n:\u06daB*bb\x1az\"*iJ\x9ar\xa2iR\xaa\x9a\xa2i2z\x92i*\u0082bzJ\xa2\x9a\xeb"
|           0x004022f0      8d8d68ffffff   lea ecx, [local_98h]
|           0x004022f6      e835ffffff     call sym.shellcode1.exe__digestString_MD5__QAEPADPAD_Z
|           0x004022fb      898564ffffff   mov dword [local_9ch], eax
|           0x00402301      6a30           push 0x30                   ; '0' ; 48
|           0x00402303      6830304000     push str.We_ve_been_compromised ; 0x403030 ; "We've been compromised!"
|           0x00402308      8b8564ffffff   mov eax, dword [local_9ch]
|           0x0040230e      50             push eax
|           0x0040230f      6a00           push 0
|           0x00402311      ff1514304000   call dword [sym.imp.USER32.dll_MessageBoxA] ; 0x403014 ; "L1"
|           0x00402317      6a00           push 0
\           0x00402319      ff150c304000   call dword [sym.imp.KERNEL32.dll_ExitProcess] ; 0x40300c
[0x00402270]>

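The program starts with a call to the MD5 constructor (in MSVC name mangling the ??0 prefix, which radare renders as __0, marks a constructor). The this pointer is passed in ecx: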

0x00402280      e8ebfcffff     call sym.shellcode1.exe___0MD5__QAE_XZ

This is not very interesting for now. The next few instructions call two Windows memory management functions, GetProcessHeap and HeapAlloc:

0x00402285      6a10           push 0x10                   ; 16
0x00402287      6a00           push 0
0x00402289      ff1508304000   call dword [sym.imp.KERNEL32.dll_GetProcessHeap] ; 0x403008
0x0040228f      50             push eax
0x00402290      ff1504304000   call dword [sym.imp.KERNEL32.dll_HeapAlloc] ; 0x403004

The MSDN reference for GetProcessHeap says that it takes no arguments and returns a handle to the calling process's heap.

HANDLE WINAPI GetProcessHeap(void);

Then there is a call to HeapAlloc which has the following definition:

LPVOID WINAPI HeapAlloc(
  _In_ HANDLE hHeap,
  _In_ DWORD  dwFlags,
  _In_ SIZE_T dwBytes
);

Looking at the values pushed onto the stack (the two pushes before the GetProcessHeap call, plus the returned handle), HeapAlloc was called with the following parameters:

LPVOID WINAPI HeapAlloc(
  _In_ HANDLE hHeap,   // the handle returned by GetProcessHeap
  _In_ DWORD  dwFlags, // 0
  _In_ SIZE_T dwBytes  // 0x10 = 16 bytes
);

Let's see what this allocation is for:

0x00402296      8945fc         mov dword [local_4h], eax
0x00402299      8b45fc         mov eax, dword [local_4h]
0x0040229c      c70040404000   mov dword [eax], str.2b__:__B_bb

We save the pointer returned by HeapAlloc in local_4h and then store the address of str.2b__:__B_bb (0x404040) in the first 4 bytes of the allocated block. Generally, with DEP enabled, memory returned by HeapAlloc is not executable. Let's verify that DEP is in fact enabled:

[0x00402270]> iI~nx
nx       true

This means, for now, we can assume that the heap block only refers to data and that 0x404040 holds the encoded string. Let's pull it out. I printed all the bytes up to and including the terminating 0x00 as a C array so it will be easier to write a decoder later.

[0x00402270]> pc 39 @ 0x404040
#define _BUFFER_SIZE 39
const uint8_t buffer[39] = {
  0x32, 0x62, 0x0a, 0x3a, 0xdb, 0x9a, 0x42, 0x2a, 0x62, 0x62,
  0x1a, 0x7a, 0x22, 0x2a, 0x69, 0x4a, 0x9a, 0x72, 0xa2, 0x69,
  0x52, 0xaa, 0x9a, 0xa2, 0x69, 0x32, 0x7a, 0x92, 0x69, 0x2a,
  0xc2, 0x82, 0x62, 0x7a, 0x4a, 0xa2, 0x9a, 0xeb, 0x00
};
[0x00402270]>
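
Putting the pieces so far together: the 16-byte heap block appears to be used as a small record. Its first dword receives the address of the encoded string (the mov @ 0x0040229c) and its second dword receives the strlen result (the mov to [ecx + 4] @ 0x004022b2). Below is a sketch of that layout in C. The struct and field names are invented for illustration, and fixed-width integers stand in for the 32-bit pointer and size_t of the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the 16-byte HeapAlloc block on the 32-bit target.
 * Names are invented; only offsets 0 and 4 are written in entry0. */
struct encoded_ref {
    uint32_t data;       /* +0x0: address of the encoded string (0x404040) */
    uint32_t length;     /* +0x4: strlen() of the encoded string */
    uint32_t unused[2];  /* +0x8: rest of the block, not written in entry0 */
};

/* Fill the record in the same order entry0 does: address, then length.
 * The unused part is zeroed here only to keep the model deterministic. */
struct encoded_ref make_ref(uint32_t addr, uint32_t len)
{
    assert(addr != 0);
    struct encoded_ref r = { addr, len, { 0, 0 } };
    return r;
}
```

With the 39 bytes printed above (38 bytes of data plus the terminator), the record would hold 0x404040 and a length of 38.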

We still need to find the shellcode responsible for decoding. A little lower we see the VirtualAlloc call:

0x004022b5      6a40           push 0x40                   ; '@' ; 64
0x004022b7      6800100000     push 0x1000
0x004022bc      6a0d           push 0xd                    ; 13
0x004022be      6a00           push 0
0x004022c0      ff1500304000   call dword [sym.imp.KERNEL32.dll_VirtualAlloc] ; 0x403000

With the following signature:

LPVOID WINAPI VirtualAlloc(
  _In_opt_ LPVOID lpAddress,        // 0
  _In_     SIZE_T dwSize,           // 0xd = 13 bytes
  _In_     DWORD  flAllocationType, // 0x1000 = MEM_COMMIT
  _In_     DWORD  flProtect         // 0x40 = PAGE_EXECUTE_READWRITE
);
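
When triaging calls like this one, a quick check of the flProtect value tells you whether the region can run code. The sketch below uses the constant values from the Windows memory API documentation; the helper protect_is_executable is a name made up for this example, and it relies on the four executable protections (0x10 through 0x80) occupying bits 4 to 7.

```c
#include <assert.h>
#include <stdint.h>

/* Constant values as documented for the Windows memory API. */
#define MEM_COMMIT             0x1000u
#define PAGE_READWRITE         0x04u
#define PAGE_EXECUTE_READWRITE 0x40u

/* Hypothetical helper: PAGE_EXECUTE (0x10), PAGE_EXECUTE_READ (0x20),
 * PAGE_EXECUTE_READWRITE (0x40) and PAGE_EXECUTE_WRITECOPY (0x80)
 * all live in bits 4-7 of flProtect. */
int protect_is_executable(uint32_t fl_protect)
{
    assert(fl_protect != 0);  /* 0 is not a valid protection value */
    return (fl_protect & 0xF0u) != 0;
}
```

For the call above, protect_is_executable(0x40) is true, whereas an ordinary PAGE_READWRITE (0x04) allocation would not qualify.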

The PAGE_EXECUTE_READWRITE protection makes the allocated pages executable, which means this allocation will likely hold the decoder. Let's see:

0x004022cc      6a0d           push 0xd                    ; 13
0x004022ce      6868404000     push 0x404068
0x004022d3      8b9560ffffff   mov edx, dword [local_a0h]
0x004022d9      52             push edx
0x004022da      e84d000000     call sub.ntdll.dll_memcpy_32c ; void *memcpy(void *s1, const void *s2, size_t n)
0x004022df      83c40c         add esp, 0xc
0x004022e2      8b75fc         mov esi, dword [local_4h]
0x004022e5      ff9560ffffff   call dword [local_a0h]

Here memcpy copies 13 bytes from memory location 0x404068 into the freshly allocated executable memory. The pointer to our heap block is then loaded into esi, and the copied code is called @ 0x004022e5.

Let's see what those 13 bytes look like:

[0x00402270]> px 13 @ 0x404068
- offset -   0 1  2 3  4 5  6 7  8 9  A B  C D  E F  0123456789ABCDEF
0x00404068  8b3e 8b4e 04c0 440f ff05 e2f9 c3         .>.N..D......

It definitely looks like machine code. Radare has a very cool feature for disassembling an arbitrary range of bytes, pD 13 @ 0x404068:

[Screenshot: radare2 disassembly of the 13 bytes at 0x404068]

And here is our decoder. It loads the pointer to the string into edi and the length of the string into ecx (both read from the heap block that esi points to), then rotates each byte left by 5 bits, walking from the last byte to the first. This is a little annoying as C doesn't have a built-in rotate operator, so let's just reuse the decoder as inline assembly.

GCC takes AT&T syntax by default, so we have to change the disassembly flavor. By the way, you can chain commands in radare with ;.

[0x00402270]> e asm.syntax=att; pD 13 @ 0x404068
            ; DATA XREF from 0x004022ce (entry0)
            0x00404068      8b3e           movl 0(%esi), %edi
            0x0040406a      8b4e04         movl 4(%esi), %ecx          ; [0x4:4]=-1 ; 4
        .-> 0x0040406d      c0440fff05     rolb $5, -1(%edi, %ecx)
        `=< 0x00404072      e2f9           loop 0x40406d
            0x00404074      c3             retl
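
If you want to double-check the branch target radare printed for the loop instruction, the arithmetic is simple: e2 is the opcode and f9 is a signed 8-bit displacement relative to the address of the next instruction. A small sketch of that calculation follows; the function name rel8_target is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a short relative branch (loop, short jmp, short jcc):
 * address of the next instruction plus the signed 8-bit displacement. */
uint32_t rel8_target(uint32_t insn_addr, uint32_t insn_len, uint8_t rel8)
{
    assert(insn_len > 0);
    /* Sign-extend the displacement without relying on narrowing casts. */
    int32_t disp = rel8 < 0x80 ? (int32_t)rel8 : (int32_t)rel8 - 0x100;
    return insn_addr + insn_len + (uint32_t)disp;
}
```

For the 2-byte instruction at 0x00404072 with displacement 0xf9 (that is, -7), this gives 0x0040406d, the address of the rolb instruction.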

Now onto writing the decoder. The idea is to pass the pointer to the encoded string to edi, the buffer size to ecx and decode the string in place, so we don't have to rewrite the assembly.

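__asm__ __volatile__ (
       "movl %0, %%edi\n"
       "movl %1, %%ecx\n"
       "1:\n"
       "rolb $5, -1(%%edi, %%ecx)\n"
       "loop 1b\n"
       : /* We are rewriting in place, so no outputs */
       : "r"(buffer), "r"(_BUFFER_SIZE)
       : "edi", "ecx", "memory", "cc" /* registers and state we modify */
   );

If you are not familiar with inline assembly, let me clarify the example a little bit. %0 and %1 are the operands in the order they are listed. Because the placeholders are written with %, we have to use %% for registers. 1: is a local numeric label, and 1b refers to the nearest 1: looking backward, which avoids naming a label after the loop mnemonic. After the assembly come three sections separated by :. The first is for outputs, which in our case are none. The second is for inputs, where we supply our buffer and its size. The third lists clobbers: it tells the compiler that we overwrite edi, ecx, memory and the flags, so it will not place our inputs in those registers. Since the string is decoded in place, remove the const qualifier from the buffer that pc printed.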

Since I am on a Mac, I need to tell the compiler to build 32-bit code, otherwise GCC will target 64-bit registers and the code won't work: gcc -o decode decode.c -m32.
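
If a 32-bit build is not an option on your machine, the same rotation can be written in portable C with two shifts and an OR. This is an alternative sketch, not the inline assembly approach used in this tutorial; rol8 and decode_rol5 are names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate an 8-bit value left by n bits (0 < n < 8). */
static uint8_t rol8(uint8_t v, unsigned n)
{
    return (uint8_t)((v << n) | (v >> (8u - n)));
}

/* Decode in place, mirroring the shellcode: rotate every byte left by 5.
 * The index runs from the last byte to the first, like the loop/ecx pair. */
void decode_rol5(uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = len; i > 0; i--)
        buf[i - 1] = rol8(buf[i - 1], 5);
}
```

Calling decode_rol5 on a non-const copy of the buffer from the pc output, followed by a printf of the result, should give the same string as the inline assembly version. The terminating 0x00 stays 0x00 after rotation, so passing the full 39 bytes is harmless.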

Now all that's left is to run it and get the flag. I leave this to you.


[Spoiler] Here is the code for the complete solution: click me