Tuesday, May 5, 2015

Big-O Complexity != Actual Performance

Any easy mistake to make is to assume that because one algorithm has a better Big-O complexity, it must be faster to execute. Similarly, two algorithms with the same Big-O, tackling the same problem, and all things similar may have vastly different performance.

Consider for example a "unique string" algorithm, where you want to check if a string contains all unique characters. A naive approach to the problem, which does not use extra data structures, is to go through the string, iterating each letter against the entire string searching for a match. This approach is O(n^2).

int strunique_naive(const char *str)
{
    char *iter, *tmp;

    for (iter = (char *)str; *iter != '\0'; ++iter)
        for (tmp = (char *)str; *tmp != '\0'; ++tmp)
            if (*tmp == *iter)
                return 0;

    return 1;
}


Now consider a better approach to the problem. We can optimize the algorithm by adding an offset for the iterating over the string part, since for example in a short string there is no reason to test the last character with the first character as the first character already performed that test during its loop-through.

int strunique_better(const char *str)
{
    size_t offset = 0;    /* offset at 0 */
    char *iter, *tmp;

    for (iter = (char *)str; *iter != '\0'; ++iter, ++offset)
        for (tmp = (char *)str + offset; *tmp != '\0'; ++tmp)
            if (*tmp == *iter)
                return 0;

    return 1;
}

int strunique_best(const char *str)
{
    size_t offset = 1;   /* pre-emptively shift offset up */

    /* repeat code above */
}


When we calculate the number of loops for the worst-case scenario (i.e. a string that is unique) for the naive algorithm, we get n^2 loops. The better algorithm is done in (n^2 + n) / 2, and the best is n * (n - 1) / 2 loops. The following table shows the differences:

# of Loops on Worst Case

strunique_naivestrunique_betterstrunique_best
1 character110
2 characters431
3 characters963
5 characters251510
10 characters1005545
100 characters10,0005,0504,950


So why can you not trust Big-O complexity for actual performance? All 3 algorithms are actually O(n^2)! The better and best algorithms just have an improved actual performance equation.

Algorithm Benchmarks

strunique_naivestrunique_betterstrunique_best
Big-O ComplexityO(n^2)O(n^2)O(n^2)
Actual Performancen^2(n^2 + n) / 2n * (n - 1) / 2

Friday, January 16, 2015

Pwn Adventure 3 Speed Hack

The Ghost in the Shellcode CTF has a number of challenges within an Unreal engine MMO. The game, Choose Your Pwn Adventure 3, is designed to be hacked. The hackable game code is mostly in a DLL called GameLogic.dll, which is a C++ binary with a provided debug .pdb file.

You start out running pretty slowly, so getting around the huge island is a pain. It's simple to get some basic speed hacking going though.

After going through the DLL and looking at function names that seemed interesting, I ran across the following:


I followed the offset and found the following there:

.rdata:10078B14 __real@40400000 dd 3.0

The hex for the float value was 0x4040. I changed this modifier to be a little bit higher by patching the DLL with a hex editor.


And that's it.

Thursday, January 15, 2015

Practical Reverse Engineering p. 79 #10

Question number 10 on page 79 of Practical Reverse Engineering is as follows:

Figure 2-16 is a function from Windows RT. Read MSDN if needed. Ignore the security PUSH/POP cookie routines.

Here is the disassembly of the function:

Figure 2-16. Practical Reverse Engineering. © 2014 by Bruce Dang

 The ARM processor is in Thumb state, but transfers out during some of the syscalls. The function queries different clock APIs depending on the size of the supplied struct.

size_t QueryChrono(size_t *bytes_copied, 
                   size_t max_size, 
                   struct *clock_info)
{
    /* MOVS R4, #0 */
    bytes_copied = 0;

    /* CMP R5, #0x10 */
    if (max_size >= 16)
    {
        SYSTEMTIME sysTime;           /* SUB SP, SP, #0x10 */
        GetSystemTime(&sysTime);      /* LDR R3, =__imp_GetSystemTime */

        /* LDR R3, [SP,#0x1C+var_1C] */
        /* LDR R3, [SP,#0x1C+var_18] */
        /* LDR R3, [SP,#0x1C+var_14] */
        /* LDR R3, [SP,#0x1C+var_10] */
        /* STR R3, [R6,#0xC] */
        memcpy(clock_info->sysTime0x0, &sysTime, sizeof(SYSTEMTIME));

        bytes_copied = 16;              /* MOVS R4, #0x10 */
    }

    /* SUBS R3, R5, R4 */
    /* CMP R3, #4 */
    if ((max_size - bytes_copied) >= 4)
    {
        /* LDR R3, =__imp_GetCurrentProcessId */
        /* STR R0, [R6,R4] */
        *(clock_info + bytes_copied) = GetCurrentProcessId();

        bytes_copied += 4;              /* ADDS R4, #4 */
    }


    /* SUBS R3, R5, R4 */
    /* CMP R3, #4 */
    if ((max_size - bytes_copied) >= 4)
    {
        /* LDR R3, =__imp_GetTickCount */
        /* STR R0, [R6,R4] */
        *(clock_info + bytes_copied) = GetTickCount();

        bytes_copied += 4;              /* ADDS R4, #4 */
    }


    /* SUBS R3, R5, R4 */
    /* CMP R3, #9 */
    if ((max_size - bytes_copied) >= 8)
    {
        /* MOV R0, SP */
        LARGE_INTEGER perfCount;

        /* LDR R3, =__imp_QueryPerformanceCounter */
        QueryPerformanceCounter(&perfCount);

        /* STR R3, [R6,R4] */
        /* STR R3, [R2,#4] */
        memcpy((clock_info + bytes_copied), &perfCount,
                sizeof(LARGE_INTEGER));

        bytes_copied += 8;              /* ADDS R4, #8 */
    }

    return bytes_copied;                /* MOV R0, R4 */
}

Tuesday, January 13, 2015

Practical Reverse Engineering p. 79 #9

Question number 9 on page 79 of Practical Reverse Engineering is as follows:

What does the function shown in Figure 2-15 do?

Here is the function's disassembly:

Figure 2-15. Practical Reverse Engineering. © 2014 by Bruce Dang

The ARM processor is in Thumb state. This is essentially the same functionality as Figure 2-14, except the count variable is gone.

int32_t comparison(char *str1, char *str2)
{
    /* LDR R5, =byteArray */
    static BYTE byteArray[] = {0, 1, ..., 0xff};

    while(1)
    {
        /* CMP R4, #0 */
        if (*str1 == '\0')
            break;

        /* LDRB R3, [R1] */
        /* LDRB R4, [R3,R6] */
        /* LDRB R3, [R5,R6] */
        /* CMP R3, R4 */
        if (byteArray[*str1] != byteArray[*str2])
            break;

        ++str1;         /* ADDS R0, #1 */
        ++str2;         /* ADDS R1, #1 */
    }

    /* LDRB R2, [R3,R5] */   
    /* LDRB R3, [R3,R5] */
    /* SUBS R0, R3, R2 */
    return byteArray[*str1] - byteArray[*str2];   
}

Practical Reverse Engineering p. 79 #8

Question number 8 on page 79 of Practical Reverse Engineering is as follows:

In Figure 2-14, byteArray is a 256-character array whose content is byteArray[] = {0, 1, ..., 0xff}.

Here is the disassembly of the function:

Figure 2-14. Practical Reverse Engineering. © 2014 by Bruce Dang


The ARM processor is in Thumb state. We infer that this function takes two strings, as there is byte comparisons and the null byte ends the main loop. There is a 3rd argument which can terminate the main loop early as well. The function is a pretty straightforward translation to C.

int32_t comparison(char *str1, char *str2, uint32_t count)
{
    /* LDR R6, =byteArray */
    static BYTE byteArray[] = {0, 1, ..., 0xff};

    /* CMP R2, #0 */
    while (count > 0)
    {
        --count;    /* SUBS R2, #1 */

        /* LDRB R5, [R0] */
        /* CBZ R5, loc_100E352 */
        if (*str1 == '\0')
            break;

        /* LDRB R3, [R1] */
        /* LDRB R4, [R3,R6] */
        /* LDRB R3, [R5,R6] */
        /* CMP R3, R4 */
        if (byteArray[*str1] != byteArray[*str2])
            break;

        ++str1;     /* ADDS R0, #1 */
        ++str2;     /* ADDS R1, #1 */
    }

    /* SUBS R2, #1 */
    /* CMP R2, #0 */
    if ((count - 1) >= 0)
    {
        /* LDRB R2, [R3,R6] */   
        /* LDRB R3, [R3,R6] */
        /* SUBS R0, R3, R2 */
        return byteArray[*str1] - byteArray[*str2];    
    }

    return NULL;        /* MOVS R0, #0 */
}

Monday, January 12, 2015

Practical Reverse Engineering p. 79 #7

Question number 7 on page 79 of Practical Reverse Engineering is as follows:

Figure 2-13 illustrates a common routine, but you may not have seen it implemented this way.

Here is the disassembly of the function:

Figure 2-13. Practical Reverse Engineering. © 2014 by Bruce Dang


The ARM processor is in Thumb state. We immediately recognize this is a strlen() routine. There is a bit field clear at the end, whose purpose is unclear. Here is how the function is implemented:

size_t strlen(const char *str)
{
    /* CBNZ R0, loc_100E1D8 */
    if (r0 == NULL)
        return 0;   /* MOVS R0, #0 */

    /* MOV R2, R0 */
    char *index = str;
    
    while (1)               /* loc_100E1E4 */
    {
        /* CMB R3, #0 */
        if (*index == '\0')
            break;

        ++index;            /* ADDS R2, #1 */
    }

    /* SUBS R0, R2, R0 */
    return (index - str);
}

Practical Reverse Engineering p. 79 #6

Question number 6 on page 79 of Practical Reverse Engineering is as follows:

Figure 2-12 involves some twiddling.

Here is a disassembly of the function:

Figure 2-12. Practical Reverse Engineering. © 2014 by Bruce Dang


The ARM processor is in Thumb state. The function takes a struct that has a size value and array in it. The array is enumerated looking for a search value, then returning a type of bitmask on its location.

uint64_t search_mask(struct *r0, DWORD search)
{
    /* loc_103B3A8 */
    for (   
        DWORD i = 0;            /* MOVS R2, #0 */
        i < r0->numElements;    /* CMP R2, R4 */
        ++i;                    /* ADDS R2, #1 */
    )
    {
        /* LDR.W R3, [R0,#4]! */
        /* CMP R3, R1 */
        if (r0->elements[i] == search)
        {
            /* SUBS.W R3, R2, #0X20 */
            /* LSLS R1, R3 */
            search = 1 << (i - 0x20);

            /* MOVS R3, #1 */
            /* LSLS.W R0, R3, R2 */
            return (uint64_t) 1 << i;  
        }

    }

    search = 0; /* MOVS R1, #0 */
    return 0;   /* MOVS R0, #0 */
}

Here is a struct definition:

struct r0
{
    DWORD numElements;  /* 0x0 */
    DWORD elements[?];  /* 0x4 */
}

Practical Reverse Engineering p. 79 #5

Question number 5 on page 79 of Practical Reverse Engineering is as follows:

Figure 2-11 is simple as well. The actual string names have been removed so you cannot cheat by searching the Internet.

Here is the disassembly of the function:

Figure 2-11. Practical Reverse Engineering. © 2014 by Bruce Dang

The ARM processor is in Thumb state. This function can be written as a switch statement. It essentially takes an enum and returns a string based on the value.

const char *get_string(DWORD string_enum)
{
    /* MOV R3, R0 */
    switch (string_enum)
    {
        case 6:             /* CMP R3, #6 */
            return "E";     /* LDR R0, =aE ; "E" */

        case 7:             /* CMP R3, #7 */
            return "D";     /* LDR R0, =aD ; "D" */

        case 8:             /* CMP R3, #8 */
            return "C";     /* LDR R0, =ac ; "C" */

        case 9:             /* CMP R3, #9 */
            return "B";     /* LDR R0, =aB ; "B" */

        default:
            return "A";     /* LDR R0, =aA ; "A" */ 
    }
}

Practical Reverse Engineering p. 79 #4

Question number 4 on page 79 of Practical Reverse Engineering is as follows:

Figure 2-10 shows another easy function.

Here is the function's disassembly:

Figure 2-10. Practical Reverse Engineering. © 2014 by Bruce Dang


The ARM processor is in Thumb state. This function takes a pointer of an unknown type. If the pointer is null, it returns 0. Otherwise it subtracts 8 bytes from the pointer and returns the value there.

DWORD sub8_ptr(void *r0)
{
    /* CBNZ R0, loc_100C3DA */
    if (r0 == NULL)
        return 0;       /* MOVS R0, #0 */

    return *(r0 - 8);   /* LDR.W R0, [R0,#–8]  */
}

Practical Reverse Engineering p. 79 #3

Question number 3 on page 79 of Practical Reverse Engineering is as follows:

Here is a simple function:

01: mystery3
02: 83 68 LDR R3, [R0,#8]
03: 0B 60 STR R3, [R1]
04: C3 68 LDR R3, [R0,#0xC]
05: 00 20 MOVS R0, #0
06: 4B 60 STR R3, [R1,#4]
07: 70 47 BX LR
08: ; End of function mystery3

The ARM processor is in Thumb state. This function just copies some values from the first struct to the second struct. It always returns 0.

int copy_struct_values(struct *r0, struct *r1)
{
    /* LDR R3, [R0,#8] */
    /* STR R3, [R1] */
    r1->Unknown0x0 = r0->Unknown0x8;                

    /* LDR R3, [R0,#0xC] */   
    /* STR R3, [R1,#4] */
    r1->Unknown0x4 = r0->Unknown0xc;

    return 0;   /* MOVS R0, #0 */
}

Here are the struct definitions:

struct r0
{
    BYTE Unknown0x0[0x8];   /* 0x0 */
    DWORD Unknown0x8;       /* 0x8 */
    DWORD Unknown0xc;       /* 0xc */
}

struct r1
{
    DWORD Unknown0x0;   /* 0x0 */
    DWORD Unknown0x4;   /* 0x4 */
}

Practical Reverse Engineering p. 78 #2

Question number 2 on page 78 of Practical Reverse Engineering is as follows:

Figure 2-9 shows a function that was found in the export table.

Here is the function's disassembly:

Figure 2-9. Practical Reverse Engineering. © 2014 by Bruce Dang


 We can immediately tell the ARM processor should be in Thumb state. This function checks if the struct passed is NULL. If it is not, then one of the members is checked to see if the byte equals 0.

BOOL check_struct_char(struct *r0)
{
    /* CBZ R0, loc_C672 */
    if (r0 != NULL)
    {    
        /* LDRB.W R0, [R0,#0x63] */
        /* SUBS R0, #0 */
        if (r0->Unknown0x63 == 0)
            return FALSE; 
    }  

    return TRUE;           /* MOVS R0, #1 */
}

And here is a definition of the passed struct:

struct r0
{
    BYTE Unknown0x0[0x63];  /* 0x0 */
    CHAR Unknown0x63;       /* 0x63 */
}

Practical Reverse Engineering p. 78 #1

Question number 1 on page 78 of Practical Reverse Engineering is as follows:

Figure 2-8 shows a function that takes two arguments. It may seem somewhat challenging at fi rst, but its functionality is very common. Have patience.

Indeed the disassembly does appear to be a reasonably complex function.

Figure 2-8. Practical Reverse Engineering. © 2014 by Bruce Dang

This function is not in Thumb state as every instruction is 32-bits in width. It attempts to convert an ASCII string to a signed decimal number. It is very lenient as far as input (even accepting non-numerical characters in the string input), however it will not work on strings exceeding the MAX_INT constant of 2147483648 (0x80000000).

Here is a close 1 to 1 approximation from ARM to C:
 
BOOL ascii_to_decimal(const char *str, int32_t *retNum)
{
    BOOL negative;
    int i = 0;              /* LDRB R3, [R0] */

    /* CMP R3, #0x2D */
    if  (str[i] == '-')
    {
        ++i;                /* LDRB R3, [R0,#1]! */
        negative = TRUE;    /* MOV R6, #1 */
    }
    else 
    {
        negative = FALSE;   /* MOV R6, #0 */

        /* CMP R3, #0x2B */
        if (str[i] == '+')
            ++i;            /* LDREQB R3, [R0,#1]! */
    }

    /* CMP R3, #0x30 */
    if (str[i] == '0')
    {
        /* CMP R2, #0x30 */
        while (str[i] == '0')
            ++i;            /* LDRB R2, [R3],#1 */    
    }

    const int base = 10;    /* MOV R8, #0xA */
    int64_t result = 0;

    while (TRUE)
    {    
        result *= base;         /* UMULL R2, R3, R4, R8 */
        result += str[i] - '0'; /* ADDS R4, R2, R7 */
        ++i;                    /* ADD R12, R12, #1 */

        /* SUBS R7, R7, #0x30 */
        if (str[i] < '0')
            break;

        /* CMP R7, #9 */
        if (str[i] > '9')
            break;
    }

    /* CMP R2, #0x80000000 */
    if (abs((int32_t) result) > INT_MAX)
        return FALSE;               /* MOV R0, #0 */

    *retNum = (int32_t) result;     /* STR R4, [R1] */
    return TRUE;                    /* MOV R0, #1 */
}

Saturday, January 10, 2015

Practical Reverse Engineering p. 38 #2

Question number 2 on page 38 of Practical Reverse Engineering is as follows:

Perform a virtual-to-physical address translation on x64. Were there any major differences compared to x86?

To convert between a virtual to a physical address you must obtain the page frame number of the directory base. You add this offset to the beginning of the page address.

 You can do this with a kernel debugger with the following commands. Here is an example with virtual address 0xff1a0000:

lkd> !process 0 0
PROCESS fffffba002ec0330
SessionId: 1 Cid: 04fc Peb: 7fffffdf000 ParentCid: 07a4
DirBase: 1f79b000 ObjectTable: fffff8a001a7b410 HandleCount: 6.
Image: test64.exe

lkd> !vtop 1f79b000 00000000ff1a0000
Amd64VtoP: Virt 00000000`ff460000, pagedir 1f79b000
Amd64VtoP: PML4E 1f79b000
Amd64VtoP: PDPE 2`21f70018
Amd64VtoP: PDE fa007fa0
Amd64VtoP: PTE 1a40c200
Amd64VtoP: Mapped phys 3a021000
Virtual address ff1a0000 translates to physical address 3a021000.

There are slightly different outputs between x64 and x86, but the general means of calculating a physical address from the virtual address is the same.

Practical Reverse Engineering p. 38 #1

Question number 1 on page 38 of Practical Reverse Engineering is as follows:

Explain two methods to get the instruction pointer on x64. At least one of the methods must use RIP addressing.

RIP-relative addressing in x64 makes position independent code much easier to accomplish. Here is an example to get the current instruction pointer:

_getrip:
    lea rax, [rel _getrip + 0x7]
    ; rax now points to this line

By adding the offset 0x7, the effective address will be the line after the load into RAX. If you are able to use null bytes, this can also be accomplished directly using the following method:

    lea rax, [rel _getrip]
_getrip:
    ; rax now points to this line

You can also do a basic stack manipulation:

    call _getrip
_getrip:
    pop rax

Practical Reverse Engineering p. 36 #12

Question number 12 on page 36 of Practical Reverse Engineering is as follows:

Bruce's favorite x86/x64 disassembly library is BeaEngine by BeatriX (www.beaengine.org). Experiment with it by writing a program to disassemble a binary at its entry point.

This program is called the Burdensome Arbitrary Disassembler (BAD). You must supply the file name and address of the entry point as arguments. Parsing PE and ELF files is pretty trivial, but then it wouldn't be BAD.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define BEA_USE_STDCALL    
#include "headers/BeaEngine.h"

void *read_file(void *dest, char *file_name, long *file_size)
{
    FILE *fp;

    fp = fopen(file_name, "rb");
    fseek(fp, 0, SEEK_END);

    *file_size = ftell(fp);
    dest = malloc(*file_size);

    rewind(fp);
    fread(dest, 1, *file_size, fp);
    fclose(fp);

    return dest;
}

void disassemble(char *file_name, uint64_t offset)
{
    DISASM dsBea;
    uint16_t len = 0;
    uint8_t *pe_file = 0;
    long file_size;
 
    memset(&dsBea, 0, sizeof(DISASM));
    dsBea.Archi = 64;

    pe_file = read_file(pe_file, file_name, &file_size);

    file_size -= (long) offset;
    dsBea.EIP = (uint64_t) &*&*(pe_file + offset);
 
    while(file_size > 0)
    {
         len = Disasm(&dsBea);

         if (len == UNKNOWN_OPCODE)
             break;

         dsBea.EIP += (uint64_t) len;
         file_size -= (long) len;

         puts(dsBea.CompleteInstr);
 }

    free(pe_file);
}

int main(int argc, char *argv[])
{
    disassemble(argv[1], atoi(argv[2]));
    return 0;
}

Here is an example of disassembly:

C:\BAD\x64\Debug>bad test.exe 3312

sub rsp, 28h
call 0047BCB0h
call 0047AB50h
add rsp, 28h
ret 
...

This matches IDA's output (besides mapping address offsets):


Practical Reverse Engineering p. 36 #11

Question number 11 on page 36 of Practical Reverse Engineering is as follows:

Read the Virtual Memory chapter in Intel Software Developer Manual, Volume 3 and AMD64 Architecture Programmer’s Manual, Volume 2: System Programming. Perform a few virtual address to physical address translations yourself and verify the result with a kernel debugger. Explain how data execution prevention (DEP) works.

Converting Virtual to Physical Memory

To convert a virtual address to its physical address you must obtain the page frame number of the directory base. You add the offset to the beginning of the page address.

You can do this with a kernel debugger with the following commands. Here is an example starting with virtual address 0x00a03ffd:

kd> !process 0 0
PROCESS ffadc791 SessionId: 0 Cid: 09ad Peb: 9afcf000 ParentCid: 0394
DirBase: 08cb1000 ObjectTable: 5b74db90 TableSize: 8.
Image: test.exe

kd> !vtop 8cb1 a03ffd
Pdi 0 Pti a03
00a03ffd 07a2f000 pfn(07a2f)

Add the offset to the base, 0x07a2f000 + 0xffd = 0x07a2fffd. So virtual address 0x00a03ffd in this process points to physical address 0x07a2fffd.



Data Execution Prevention

Hardware DEP works by enabling the NX/XD/XN-bit in the CPU (depending on architecture). This allows the operating system to mark pages of memory as non-executable. There is an "executable" flag in a memory segment's page table entry (page descriptor) that allows you to modify this setting.

If the instruction pointer moves to an area marked as DEP, the processor faults and execution is passed to a given handler. A DEP access violation in kernel mode on Windows will result in an error 0xFC: ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY.

Wednesday, January 7, 2015

Practical Reverse Engineering p. 36 #10

Question number 10 on page 36 of Practical Reverse Engineering is as follows:

If the current privilege level is encoded in CS, which is modifiable by user-mode code, why can’t user-mode code modify CS to change CPL?

The Code Segment (CS) contains a 2-bit value known as the Code Privilege Level (CPL), which is a cached value of the current Ring mode.

The reason user-mode cannot directly modify CS:CPL is the same reason it cannot directly access the instruction pointer (CS:EIP). It's true that CS:EIP is changed with CALL, RET, and JMP instructions; but it cannot be modified by directly loading a value to the register. The same is true for CS:CPL.

To transition to a higher privilege, an instruction like SYSCALL or INT should be used. These will trigger handlers running in kernel-mode.  To modify the CS:CPL to user-mode, an instruction of the IRET family should be used.

Ring 0 code has more privileges to modify the CS segment than Ring 3 code does. However, modifying CS in both x86 and x64 can cause system instability, and should only be performed in real mode.

Practical Reverse Engineering p. 35 #9

Question number 9 on page 35 of Practical Reverse Engineering is as follows:

Sample L. Explain what function sub_1000CEA0 does and then decompile it back to C.

Here is the disassembly of the function:

sub_1000CEA0:
    push    ebp
    mov     ebp, esp
    push    edi
    mov     edi, [ebp+8]
    xor     eax, eax
    or      ecx, 0FFFFFFFFh
    repne scasb
    add     ecx, 1
    neg     ecx
    sub     edi, 1
    mov     al, [ebp+0Ch]
    std
    repne scasb
    add     edi, 1
    cmp     [edi], al
    jz      short loc_1000CEC7
    xor     eax, eax
    jmp     short loc_1000CEC9 

loc_1000CEC7:                          
    mov     eax, edi

loc_1000CEC9:                       
    cld
    pop     edi
    leave
    retn

This function calculates the string length, then works backwards to return a pointer to the last instance of a character in the string.

char *sub_1000CEA0(char *str, char ch)
{
    /* mov edi, [ebp+8] */
    /* xor eax, eax
    /* or ecx, 0FFFFFFFFh */
    /* repne scasb */
    /* add ecx, 1 */
    /* neg ecx */
    size_t len = 1;
    while (*str)
    {
        ++str;
        ++len;
    }

    /* sub edi, 1 */
    /* mov al, [ebp+0Ch] */
    /* std */
    /* repne scasb */
    /* add edi, 1 */ 
    while (len)
    {
        /* cmp [edi], al */
        if (*str == ch)
            return str;
        --len;
        --str;
    }

    return 0;
}

This is an implementation of the strrchr() function, which is defined as follows:

char *strrchr(char *str, int character);

Practical Reverse Engineering p. 35 #8

Question number 8 on page 35 of Practical Reverse Engineering is as follows:

Sample H. Decompile sub_11732 and explain the most likely programming construct used in the original code.

The function's disassembly looks like the following:

sub_1172E:
    push    esi
    mov     esi, [esp+8]
    dec     esi
    jz      short loc_1175F
    dec     esi
    jz      short loc_11755
    dec     esi
    jz      short loc_1174B
    sub     esi, 9
    jnz     short loc_1176B
    mov     esi, [eax+8]
    shr     esi, 1
    add     eax, 0Ch
    jmp     short loc_11767 

loc_1174B:                             
    mov     esi, [eax+3Ch]
    shr     esi, 1
    add     eax, 5Eh
    jmp     short loc_11767 

loc_11755:                            
    mov     esi, [eax+3Ch]
    shr     esi, 1
    add     eax, 44h
    jmp     short loc_11767 

loc_1175F:                          
    mov     esi, [eax+3Ch]
    shr     esi, 1
    add     eax, 40h

loc_11767:                         
                        
    mov     [ecx], esi
    mov     [edx], eax

loc_1176B:                              
    pop     esi
    retn    4 

The function contains a switch. The calling convention is odd for x86 as there are three variables passed in a register, which means it was probably compiled as a static unit. There is a lot of repeated code, which means either the programmer or the compiler couldn't optimize the size in a logical way. Here is a more natural way to write the function:

static struct *sub_1172E(struct *arg1, struct *arg2, 
                         struct *arg3, int unknown_enum)
{
    DWORD dwUnknown;

    /* mov esi, [esp+8] */
    switch (unknown_enum)
    {
        case 1:                 /* dec esi */
            arg1 += 64;         /* add eax, 40h */
        break;

        case 2:                 /* dec esi */
            arg1 += 68          /* add eax, 44h */
        break;

        case 3:                 /* dec esi */
            arg1 += 94;         /* add eax, 5Eh */
        break;

        case 12:                /* sub esi, 9 */
            arg1 += 12;         /* add eax, 0Ch */
        break;

        default:
            return arg1;
    }
     
    dwUnknown = (unknown_enum == 12) ? 
            arg1->Unknown0x8 :  /* mov esi, [eax+8] */
            arg1->Unknown0x3c;  /* mov esi, [eax+3Ch] */

    dwUnknown /= 2;             /* shr esi, 1 */

    *arg2 = dwUnknown;          /* mov [ecx], esi */
    *arg3 = arg1;               /* mov [edx], eax */

    return arg1;
}

We can infer some of the struct in the first argument by where offsets were accessed:

struct arg1 {
    BYTE Unknown0x0[0x8];   /* 0x0 */
    DWORD bigTwiceVal;      /* 0x8 */
    BYTE Unknown0xc[0x30];  /* 0xc */
    DWORD smallTwiceVal;    /* 0x3c */
};

Tuesday, January 6, 2015

Practical Reverse Engineering p. 35 #7

Question number 7 on page 35 of Practical Reverse Engineering is as follows:

Sample H. The function sub_10BB6 has a loop searching for something. First recover the function prototype and then infer the types based on the context. Hint: You should probably have a copy of the PE specification nearby.

Here is an image that details the PE file format, which is useful for referencing offsets.


The structs that make up the PE file format are:

typedef struct _IMAGE_DOS_HEADER
{
    WORD e_magic;              /* 0x0 */
    WORD e_cblp;               /* 0x2 */
    WORD e_cp;                 /* 0x4 */
    WORD e_crlc;               /* 0x6 */
    WORD e_cparhdr;            /* 0x8 */
    WORD e_minalloc;           /* 0xa */
    WORD e_maxalloc;           /* 0xc */
    WORD e_ss;                 /* 0xe */
    WORD e_sp;                 /* 0x10 */
    WORD e_csum;               /* 0x12 */
    WORD e_ip;                 /* 0x14 */
    WORD e_cs;                 /* 0x16 */
    WORD e_lfarlc;             /* 0x18 */
    WORD e_ovno;               /* 0x1a */
    WORD e_res[4];             /* 0x1c */
    WORD e_oemid;              /* 0x24 */
    WORD e_oeminfo;            /* 0x26 */
    WORD e_res2[10];           /* 0x28 */
    LONG e_lfanew;             /* 0x3c */
} IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;

typedef struct _IMAGE_NT_HEADERS {
    DWORD                 Signature;         /* 0x0 */
    union
    {
        IMAGE_FILE_HEADER     FileHeader;    /* 0x4 */
        struct
        {
            WORD  Machine;                   /* 0x4 */
            WORD  NumberOfSections;          /* 0x6 */
            DWORD TimeDateStamp;             /* 0x8 */
            DWORD PointerToSymbolTable;      /* 0xc */
            DWORD NumberOfSymbols;           /* 0x10 */
            WORD  SizeOfOptionalHeader;      /* 0x14 */
            WORD  Characteristics;           /* 0x16 */
        }
    }
    IMAGE_OPTIONAL_HEADER OptionalHeader;    /* 0x18 */
} IMAGE_NT_HEADERS, *PIMAGE_NT_HEADERS;

Here is the disassembly of the function in Sample H:

sub_10BB2:
    mov     eax, [esp+4]
    push    ebx
    push    esi
    mov     esi, [eax+3Ch]
    add     esi, eax
    movzx   eax, word ptr [esi+14h]
    xor     ebx, ebx
    cmp     [esi+6], bx
    push    edi
    lea     edi, [eax+esi+18h]
    jbe     short loc_10BEB

loc_10BCE:                             
    push    [esp+0Ch+8]     ; _DWORD
    push    edi             ; _DWORD
    call    ds:dword_169A4
    test    eax, eax
    pop     ecx
    pop     ecx
    jz      short loc_10BF3
    movzx   eax, word ptr [esi+6]
    add     edi, 28h
    inc     ebx
    cmp     ebx, eax
    jb      short loc_10BCE

loc_10BEB:                              
    xor     eax, eax

loc_10BED:                            
    pop     edi
    pop     esi
    pop     ebx
    retn    8 

loc_10BF3:                             
    mov     eax, edi
    jmp     short loc_10BED

Making the assumption that our first argument is a pointer to the DOS header (from the hint in the question), we can use a little bit of math to calculate the offsets and see everything lines up. The search code could be written in a for loop but I used a while loop for more clarity.

PIMAGE_SECTION_HEADER sub_10BB2(PVOID pPE, PVOID arg2)
{
    PIMAGE_DOS_HEADER pDOS;
    PIMAGE_NT_HEADERS pNT;
    PIMAGE_SECTION_HEADER pSection;
    WORD dwOptHdrSize;
    DWORD dwCurrentSection;

    /* mov eax, [esp+4] */
    pDOS = (PIMAGE_DOS_HEADER) pPE;

    /* mov esi, [eax+3Ch] */
    /* add esi, eax */
    pNT = (pDOS + pDOS->e_lfanew);

    /* movzx eax, word ptr [esi+14h] */
    dwOptHdrSize = pNT->FileHeader.SizeOfOptionalHeader;

    /* xor ebx, ebx */
    dwCurrentSection = 0;

    /* cmp [esi+6], bx */
    if (pNT->NumberOfSections == 0)
        return NULL;

    /* lea edi, [eax+esi+18h] */
    pSection = IMAGE_FIRST_SECTION(pNT);

    while (1)
    {
        /* push [esp+0Ch+8] */
        /* push edi  */
        /* call ds:dword_169A4 */
        /* test eax, eax */
        if (*(BOOL)dword_169A4)(pSection, arg2))
            return pSection;

        /* add edi, 28h */
        ++pSection; /* struct increment */

        /* inc ebx */
        ++dwCurrentSection;

        /* movzx eax, word ptr [esi+6] */
        /* cmp ebx, eax */
        if (dwCurrentSection > pNT->NumberOfSections)
            return NULL;
    }
}

Practical Reverse Engineering p. 35 #6

Question number 6 on page 35 of Practical Reverse Engineering is as follows:

Sample H. The function sub_13846 references several structures whose types are not entirely clear. Your task is to first recover the function prototype and then try to reconstruct the structure fields. After reading Chapter 3, return to this exercise to see if your understanding has changed. (Note: This sample is targeting Windows XP x86.)

The function disassembles to:

sub_13842:
     mov     eax, [ecx+60h]
     push    esi
     mov     esi, [edx+8]
     dec     byte ptr [ecx+23h]
     sub     eax, 24h
     mov     [ecx+60h], eax
     mov     [eax+14h], edx
     movzx   eax, byte ptr [eax]
     push    ecx
     push    edx
     call    dword ptr [esi+eax*4+38h]
     pop     esi
     retn

It's a fastcall convention which has two struct pointer arguments, and the return type is not readily available. There are 3 locals which are stored in registers, two of them pointers within the struct and a char which is an index to a function call table.

PVOID __fastcall sub_13841(
     struct *arg1, struct *arg2)
{
     ULONG_PTR v1, v2;
     CHAR index;

     v1 = arg1->Unkown0x60;    /* mov eax, [ecx+60h] */
     v2 = arg2->Unknown0x8;    /* mov esi, [edx+8] */

     --arg1->Unknown0x23;      /* dec byte ptr [ecx+23h] */

     v1 -= 0x24;               /* sub eax, 24h */

     arg1->Unknown0x60 = v1;   /* mov [ecx+60h], eax */
     v1->Unknown0x14 = arg2;   /* mov [eax+14h], edx */

     index = v1->index;        /* movzx eax, byte ptr [eax] */

     /* call dword ptr [esi+eax*4+38h] */
     return v2->Unknown0x38[index](arg2, arg1);
}

Here is a basic idea so far on the struct meanings:

struct arg1 {
     BYTE Unknown0x0[0x23];   /* 0x0 */
     CHAR decremented;        /* 0x23 */
     BYTE Unknown0x24[0x18];  /* 0x24 */
     union
     {
         struct *v1[9];           /* 0x3c - 0x60 */
         struct
         {
             struct v1_1[0x3c];   /* 0x3c */
             struct v1_2[0x60];   /* 0x60 */
         };
     };
};

struct arg2 {
     BYTE Unknown0x0[0x8];    /* 0x0 */
     struct *v2;              /* 0x8 */    
};

/* size = 0x24 */
struct v1 {
     CHAR index;              /* 0x0 */
     BYTE Unknown0x1[0x13];   /* 0x1 */
     struct *arg2;            /* 0x14 */
     CHAR Unknown0x18[0xc];   /* 0x18 */
};

struct v2 {
     BYTE Unknown0x0[0x38];   /* 0x0 */
     ULONG_PTR function;      /* 0x38 */   
};

Saturday, January 3, 2015

Practical Reverse Engineering p. 35 #5

Question number 5 on page 35 of Practical Reverse Engineering is as follows:

Decompile the following kernel routines in Windows:
  • KeInitializeDpc 
  • KeInitializeApc 
  • ObFastDereferenceObject (explain its calling convention)
  • KeInitializeQueue 
  • KxWaitForLockChainValid 
  • KeReadyThread 
  • KiInitializeTSS 
  • RtlValidateUnicodeString

KeInitializeDpc

Inside ntoskrnl.exe, KeInitializeDpc has the following prototype:

VOID NTAPI KeInitializeDpc(
    PRKDPC Dpc, 
    PKDEFERRED_ROUTINE DeferredRoutine, 
    PVOID DeferredContext);

This has a parameter for the KDPC struct, which contains a LIST_ENTRY. These are defined as:

typedef struct _LIST_ENTRY {
  struct _LIST_ENTRY  *Flink;   /* 0x0 */
  struct _LIST_ENTRY  *Blink;   /* 0x8 */
} LIST_ENTRY, *PLIST_ENTRY;

typedef struct _KDPC
{
     UCHAR Type;                /* 0x0 */   
     UCHAR Importance;          /* 0x1 */
     WORD Number;               /* 0x2 */
     BYTE Unknown[4];           /* 0x4 */
     LIST_ENTRY DpcListEntry;   /* 0x8 */
     PVOID DeferredRoutine;     /* 0x18 */
     PVOID DeferredContext;     /* 0x20 */
     PVOID SystemArgument1;     /* 0x28 */
     PVOID SystemArgument2;     /* 0x30 */
     PVOID DpcData;             /* 0x38 */
} KDPC, *PKDPC;

Here is the disassembly:

KeInitializeDpc:
    xor     eax, eax
    mov     dword ptr [rcx], 113h
    mov     [rcx+18h], rdx
    mov     [rcx+38h], rax
    mov     [rcx+10h], rax
    mov     [rcx+20h], r8
    retn

The first MOV is an optimization which sets the first 3 variables in the struct, as it sets a dword to 0x113 (0b100010011). Everything else lines up easily enough. Here is the fully decompiled function.

VOID NTAPI KeInitializeDpc(
    PRKDPC Dpc, 
    PKDEFERRED_ROUTINE DeferredRoutine, 
    PVOID DeferredContext)
{
    Dpc->Type = 19;                       /* mov dword ptr [rcx],113h */
    Dpc->Importance = 1;                     
    Dpc->Number = 0;
    Dpc->DeferredRoutine = DeferredRoutine; /* mov [rcx+18h], rdx */
    Dpc->DpcData = 0;                       /* mov [rcx+38h], rax */
    Dpc->DpcListEntry.Blink = 0;            /* mov [rcx+10h], rax */
    Dpc->DeferredContext = DeferredContext; /* mov [rcx+20h], r8 */
}



KeInitializeApc

Inside ntoskrnl.exe, KeInitializeApc has the following prototype:

VOID NTAPI KeInitializeApc( 
    _In_ PKAPC  Apc,
    _In_ PKTHREAD   Thread,
    _In_ KAPC_ENVIRONMENT   TargetEnvironment,
    _In_ PKKERNEL_ROUT_In_E     KernelRoutine,
    _In_Opt_ PKRUNDOWN_ROUT_In_E RundownRoutine ,
    _In_ PKNORMAL_ROUT_In_E     NormalRoutine,
    _In_ KPROCESSOR_MODE    Mode,
    _In_ PVOID  Context);   

Here is the KAPC struct with added offsets:

typedef struct _KAPC
{
     UCHAR Type;                /* 0x0 */
     UCHAR SpareByte0;          /* 0x1 */
     UCHAR Size;                /* 0x2 */
     UCHAR SpareByte1;          /* 0x3 */
     ULONG SpareLong0;          /* 0x4 */
     PKTHREAD Thread;           /* 0x8 */
     LIST_ENTRY ApcListEntry;   /* 0x10 */
     PVOID KernelRoutine;       /* 0x20 */
     PVOID RundownRoutine;      /* 0x28 */
     PVOID NormalRoutine;       /* 0x30 */
     PVOID NormalContext;       /* 0x38 */
     PVOID SystemArgument1;     /* 0x40 */
     PVOID SystemArgument2;     /* 0x48 */
     CHAR ApcStateIndex;        /* 0x50 */
     CHAR ApcMode;              /* 0x51 */
     UCHAR Inserted;            /* 0x52 */
} KAPC, *PKAPC;

And here is the disassembly:

KeInitializeApc:
    mov     byte ptr [rcx], 12h
    mov     byte ptr [rcx+2], 58h
    cmp     r8d, 2
    jz      short loc_1400BAAAF
    mov     [rcx+50h], r8b

loc_1400BAA71:                         
    mov     rax, [rsp+28h]
    mov     [rcx+8], rdx
    xor     edx, edx
    mov     [rcx+28h], rax
    mov     rax, [rsp+30h]
    mov     [rcx+20h], r9
    mov     [rcx+30h], rax
    test    rax, rax
    jnz     short loc_1400BAA9D
    mov     [rcx+51h], dl
    mov     [rcx+38h], rdx

loc_1400BAA99:                          
    mov     [rcx+52h], dl
    retn 

loc_1400BAA9D:                          
    mov     al, [rsp+38h]
    mov     [rcx+51h], al
    mov     rax, [rsp+40h]
    mov     [rcx+38h], rax
    jmp     short loc_1400BAA99 

loc_1400BAAAF:                         
    mov     al, [rdx+242h]
    mov     [rcx+50h], al
    jmp     short loc_1400BAA71

This routine contains a couple if statements, but otherwise it's just writing the arguments and some constants to the struct.

VOID NTAPI KeInitializeApc( 
    _In_ PKAPC  Apc,
    _In_ PKTHREAD   Thread,
    _In_ KAPC_ENVIRONMENT   TargetEnvironment,
    _In_ PKKERNEL_ROUT_In_E     KernelRoutine,
    _In_Opt_ PKRUNDOWN_ROUT_In_E RundownRoutine ,
    _In_ PKNORMAL_ROUT_In_E     NormalRoutine,
    _In_ KPROCESSOR_MODE    Mode,
    _In_ PVOID  Context) 
{
    Apc->Type = 0x12;        /* mov byte ptr [rcx], 12h */
    Apc->Size = 0x58;        /* mov byte ptr [rcx+2], 58h */

    /* cmp r8d, 2 */
    if ((DWORD)TargetEnvironment == CurrentApcEnvironment)
      Apc->ApcStateIndex = Thread->ApcStateIndex;/* mov [rcx+50h], al */
    else
      Apc->ApcStateIndex = TargetEnvironment;  /* mov [rcx+50h], r8b */

    Apc->Thread = Thread;                    /* mov [rcx+8], rdx */
    Apc->RundownRoutine = RundownRoutine;    /* mov [rcx+28h], rax */
    Apc->KernelRoutine = KernelRoutine;      /* mov [rcx+20h], r9 */
    Apc->NormalRoutine = NormalRoutine;      /* mov [rcx+30h], rax */

    /* test rax, rax */
    if (NormalRoutine != 0)
    {
        Apc->ApcMode = Mode;                 /* mov [rcx+51h], al */
        Apc->NormalContext = Context;        /* mov [rcx+38h], rax */
    }
    else
    {
        Apc->ApcMode = 0;                    /* mov [rcx+51h], dl */
        Apc->NormalContext = 0;              /* mov [rcx+38h], rdx */    
    }

    Apc->Inserted = 0;                       /* mov [rcx+52h], dl */
}



ObFastDereferenceObject

Inside ntoskrnl.exe, ObFastDereferenceObject has the following prototype:

void __fastcall ObFastDereferenceObject(
    _In_ PEX_FAST_REF FastRef,
    _In_ PVOID Object 
)

Here is the struct that is passed in the first argument:

typedef struct _EX_FAST_REF
{
    union
    {
        PVOID Object;
        ULONG RefCnt: 4;
        UINT64 RefCnt;
    };    
} EX_FAST_REF, *PEX_FAST_REF;

Here is the disassembly, which shows that there are fastcall optimizations on the 1st parameter for certain processors:

ObFastDereferenceObject:
    mov     r9, rcx
    prefetchw byte ptr [rcx]
    mov     rax, [rcx]
    mov     r8, rax
    xor     r8, rdx
    cmp     r8, 0Fh
    jnb     short loc_140062C29

loc_140062C1D:                          
    lea     r8, [rax+1]
    lock cmpxchg [r9], r8
    jnz     short loc_140062C31
    retn

loc_140062C29:                                              
    mov     rcx, rdx        
    jmp     ObfDereferenceObject

loc_140062C31:                          
    mov     rcx, rax
    xor     rcx, rdx
    cmp     rcx, 0Fh
    jb      short loc_140062C1D
    jmp     short loc_140062C29

The function is one big loop that increments the FastRef->Object pointer. There is also a precondition test. If the loop fails, another function is called.

void __fastcall ObFastDereferenceObject(
    _In_ PEX_FAST_REF FastRef,
    _In_ PVOID Object 
)
{
    for (   EX_FAST_REF  a = *FastRef,      /* mov rax, [rcx] */
                         b = *FastRef;      /* mov r8, rax */
            *b->Object ^ Object             /* xor r8, rdx */
            <= 0x0F;                        /* cmp rcx, 0Fh */
            b->Object = *(a->Object) + 1    /* lea r8, [rax+1] */
        )
    {
        /* lock cmpxchg [r9], r8 */
        if (atomic_compare_exchange_strong(FastRef, &a, b));  
            return;
    }
                                        /* mov rcx, rdx */
    ObfDereferenceObject(Object);       /* jmp ObfDereferenceObject */
}



KeInitializeQueue

Inside ntoskrnl.exe, KeInitializeQueue has the following prototype:

VOID NTAPI KeInitializeQueue(
  _Out_  PRKQUEUE Queue,
  _In_   ULONG Count);

Here are the relevant structs which make up our Queue parameter:

typedef struct _DISPATCHER_HEADER
{
     union
     {
          struct
          {
               UCHAR Type;
               union
               {
                    UCHAR Abandoned;
                    UCHAR Absolute;
                    UCHAR NpxIrql;
                    UCHAR Signalling;
               };
               union
               {
                    UCHAR Size;
                    UCHAR Hand;
               };
               union
               {
                    UCHAR Inserted;
                    UCHAR DebugActive;
                    UCHAR DpcActive;
               };
          };
          LONG Lock;
     };
     LONG SignalState;
     LIST_ENTRY WaitListHead;
} DISPATCHER_HEADER, *PDISPATCHER_HEADER;

typedef struct _KQUEUE {
    DISPATCHER_HEADER Header;         /* 0x0 */
    LIST_ENTRY EntryListHead;         /* 0x18 */
    ULONG CurrentCount;               /* 0x28 */
    ULONG MaximumCount;               /* 0x2c */
    LIST_ENTRY ThreadListHead;        /* 0x30 */
} KQUEUE, *PKQUEUE, *RESTRICTED_POINTER PRKQUEUE;

The disassembly for the function is:

KeInitializeQueue:
    mov     word ptr [rcx], 4
    mov     byte ptr [rcx+2], 10h
    lea     rax, [rcx+8]
    xor     r8d, r8d
    mov     [rcx+4], r8d
    mov     [rax+8], rax
    mov     [rax], rax
    lea     rax, [rcx+18h]
    mov     [rax+8], rax
    mov     [rax], rax
    lea     rax, [rcx+30h]
    mov     [rax+8], rax
    mov     [rax], rax
    mov     [rcx+28h], r8d
    test    edx, edx
    jz      short loc_1400DF8A9
    mov     [rcx+2Ch], edx
    retn

loc_1400DF8A9:                        
    mov     eax, cs:KeNumberProcessors_0
    mov     [rcx+2Ch], eax
    retn

This function is again just basically filling in a struct with some constants.

VOID NTAPI KeInitializeQueue(
  _Out_  PRKQUEUE Queue,
  _In_   ULONG Count)
{
    Queue->Header.Type = 4;              /* mov word ptr [rcx], 4 */
    Queue->Header.Abandoned = FALSE;
    Queue->Header.Size = 0x10;           /* mov byte ptr [rcx+2], 10h */
    Queue->Header.SignalState = 0;       /* mov [rcx+4], r8d */

    /* lea rax, [rcx+8] */ 
    Queue->Header.WaitListHead->Blink = &Queue->Header.WaitListHead; 
    Queue->Header.WaitListHead->Flink = &Queue->Header.WaitListHead;

    /* lea rax, [rcx+18h] */
    Queue->EntryListHead->Blink = &Queue->EntryListHead; 
    Queue->EntryListHead->Flink = &Queue->EntryListHead; 

    /* lea rax, [rcx+30h] */
    Queue->ThreadListHead->Blink = &Queue->ThreadListHead; 
    Queue->ThreadListHead->Flink = &Queue->ThreadListHead;

    Queue.CurrentCount = 0;

    /* test edx, edx */
    if (Count == 0)
        Queue->MaximumCount = KeNumberProcessors; /* cs:_0 */
    else
        Queue->MaximumCount = Count;            /* mov [rcx+2Ch], edx */
}



KxWaitForLockChainValid

Inside ntoskrnl.exe, KxWaitForLockChainValid has the following prototype:

PKSPIN_LOCK_QUEUE KxWaitForLockChainValid(
       __inout PKSPIN_LOCK_QUEUE LockQueue);

Here is the definition for the struct parameter:

typedef struct _KSPIN_LOCK_QUEUE 
{
    struct _KSPIN_LOCK_QUEUE * volatile Next;
    PKSPIN_LOCK volatile Lock;
} KSPIN_LOCK_QUEUE, *PKSPIN_LOCK_QUEUE;

Here is the disassembly of the function:

KxWaitForLockChainValid:
    mov     [rsp+8], rbx
    push    rdi
    sub     rsp, 20h
    mov     rdi, rcx
    xor     ebx, ebx

loc_1400DA7F7:                        
    inc     ebx
    test    cs:HvlLongSpinCountMask, ebx
    jz      loc_14019DCAC

loc_1400DA805:                       
    pause

loc_1400DA807:                        
    mov     rax, [rdi]
    test    rax, rax
    jz      short loc_1400DA7F7
    mov     rbx, [rsp+28h+8]
    add     rsp, 20h
    pop     rdi
    retn

loc_14019DCAC:                    
    mov     eax, cs:HvlEnlightenments
    test    al, 40h
    jz      loc_1400DA805
    mov     ecx, ebx
    call    HvlNotifyLongSpinWait
    nop
    jmp     loc_1400DA807

This is a spinlock implementation. It's interesting that the last label is in a distant memory area. This is usually an indication of an optimization by the compiler that the code is rarely used.

PKSPIN_LOCK_QUEUE KxWaitForLockChainValid(
       __inout PKSPIN_LOCK_QUEUE LockQueue)
{
    UINT32 i = 0;  /* xor ebx, ebx */

    do             /* loc_1400DA7F7 */
    {
        ++i;       /* inc ebx */ 

        /* test cs:HvlLongSpinCountMask, ebx */ 
        /* test al, 40h */
        if (i == HvlLongSpinCountMask && HvlEnlightenments != 0x40))   
            HvlNotifyLongSpinWait(i);       /* mov ecx, ebx */
        else
            _mm_pause();                    /* pause */

    } while(LockQueue->Next != 0);  /* test rax, rax */
}



KeReadyThread

Inside ntoskrnl.exe, KeReadyThread has the following prototype:

VOID NTAPI KeReadyThread(_In_ PKTHREAD Thread);

Here is the disassembly:

KeReadyThread:              
    push    rbx
    sub     rsp, 20h
    mov     rdx, [rcx+0B8h]
    mov     rbx, rcx
    mov     eax, [rdx+234h]
    test    al, 7
    jnz     short loc_1400F6684

loc_1400F6676:                         
    mov     rcx, rbx
    call    KiFastReadyThread

loc_1400F667E:                          
    add     rsp, 20h
    pop     rbx
    retn

loc_1400F6684:                          
    call    KiInSwapSingleProcess
    test    al, al
    jnz     short loc_1400F667E
    jmp     short loc_1400F6676

Until I calculate the offsets the struct values are unknown.

VOID NTAPI KeReadyThread(_In_ PKTHREAD Thread)
{
    /* mov rdx, [rcx+0B8h] */
    /* mov eax, [rdx+234h] */
    /* test al, 7 */
    if ((BYTE)Thread->UnknownB8.Unknown234 == 7)
        if (KiInSwapSingleProcess(Thread))  /* call KiInSwapSingle */
            return;                         /* jnz loc_1400F667E */

    KiFastReadyThread(Thread);              /* call KiFastReadyThread */
}



KiInitializeTSS

Inside ntoskrnl.exe, KiInitializeTSS has the following prototype:

VOID NTAPI KiInitializeTSS(_In_ PKTSS Tss);

This has a parameter for the PKTSS struct. It is defined as:

typedef struct _KTSS
{
     WORD Backlink;
     WORD Reserved0;
     ULONG Esp0;
     WORD Ss0;                  /* 0x8 */
     WORD Reserved1;
     ULONG NotUsed1[4];
     ULONG CR3;
     ULONG Eip;
     ULONG EFlags;
     ULONG Eax;
     ULONG Ecx;
     ULONG Edx;
     ULONG Ebx;
     ULONG Esp;
     ULONG Ebp;
     ULONG Esi;
     ULONG Edi;
     WORD Es;
     WORD Reserved2;
     WORD Cs;
     WORD Reserved3;
     WORD Ss;
     WORD Reserved4;
     WORD Ds;
     WORD Reserved5;
     WORD Fs;
     WORD Reserved6;
     WORD Gs;
     WORD Reserved7;
     WORD LDT;                  /* 0x60 */
     WORD Reserved8;
     WORD Flags;                /* 0x64 */
     WORD IoMapBase;            /* 0x66 */
     KiIoAccessMap IoMaps[1];
     UCHAR IntDirectionMap[32]; /* 0x208c */
} KTSS, *PKTSS;

Here is the disassembly:

KiInitializeTSS:
     mov     edi, edi
     push    ebp
     mov     ebp, esp
     mov     eax, dword ptr [ebp+8]
     and     word ptr [eax+64h], 0
     and     word ptr [eax+60h], 0
     mov     word ptr [eax+66h], 20ACh
     mov     word ptr [eax+8], 10h
     pop     ebp
     ret     4

This function fills in the structure with constants.

VOID NTAPI KiInitializeTSS(_In_ PKTSS Tss)
{
     Tss->Flags = 0;     /* and word ptr [eax+64h], 0 */
     Tss->LDT = 0;       /* and word ptr [eax+60h], 0 */

     /* mov word ptr [eax+66h], 20ACh */
     Tss->IoMapBase = sizeof(KTSS);

     Tss->Ss0 = 16;      /* mov word ptr [eax+8], 10h */
}



RtlValidateUnicodeString

Inside ntoskrnl.exe, RtlValidateUnicodeString has the following prototype:

NTSTATUS NTAPI RtlValidateUnicodeString(   
    _In_ ULONG Flags,
    _In_ PCUNICODE_STRING UnicodeString);

The UNICODE_STRING struct in a 64-bit system context is defined as:

typedef struct _UNICODE_STRING {
  USHORT Length;            /* 0x0 */
  USHORT MaximumLength;     /* 0x2 */
  DWORD  Reserved;          /* 0x4 */
  PWSTR  Buffer;            /* 0x8 */
} UNICODE_STRING, *PUNICODE_STRING;

Here's the disassembly of the function:

RtlValidateUnicodeString:
    xor     eax, eax
    test    ecx, ecx
    jnz     short loc_1400D23BB
    test    rdx, rdx
    jz      short locret_1400D23BA
    movzx   r8d, word ptr [rdx]
    test    r8b, 1
    jnz     short loc_1400D23BB
    movzx   ecx, word ptr [rdx+2]
    test    cl, 1
    jnz     short loc_1400D23BB
    cmp     r8w, cx
    ja      short loc_1400D23BB
    mov     r9d, 0FFFEh
    cmp     cx, r9w
    ja      short loc_1400D23BB
    cmp     [rdx+8], rax
    jz      loc_14019BAF4

locret_1400D23BA:       
    retn 

loc_1400D23BB:    
    mov     eax, 0C000000Dh
    retn
    
loc_14019BAF4:
    test    r8w, r8w
    jnz     loc_1400D23BB
    test    cx, cx
    jz      locret_1400D23BA
    jmp     loc_1400D23BB

The function, true to its name, follows the traditional validation pattern of executing tests and returning false (NTSTATUS: INVALID_PARAMETER) or true (NTSTATUS: SUCCESS) depending on if the conditions are met or not. Note that the last test case in the main body of the function can jump to a distant memory space for more tests, an optimization that likely means it is rarely branched to.

/* test ecx, ecx */
if (Flags != 0)
    return STATUS_INVALID_PARAMETER; /* mov eax, 0C000000Dh */

/* test rdx, rdx */
if (!UnicodeString)
    return STATUS_SUCCESS;           /* xor eax, eax */

/* movzx r8d, word ptr [rdx] */
/* test r8b, 1 */
if (UnicodeString->Length & 1 != 0)
    return STATUS_INVALID_PARAMETER; 

/* movzx ecx, word ptr [rdx+2] */
/* test cl, 1 */
if (UnicodeString->MaximumLength & 1 != 0)
    return STATUS_INVALID_PARAMETER; 

/* cmp r8w, cx */
if (UnicodeString->Length > UnicodeString.MaximumLength)
    return STATUS_INVALID_PARAMETER; 

/* mov r9d, 0FFFEh */
/* cmp cx, r9w */
if (UnicodeString->MaximumLength > 65534)
    return STATUS_INVALID_PARAMETER; 

/* cmp [rdx+8], rax */
if (UnicodeString->Buffer == 0)
{
    /* test r8w, r8w */
    if (UnicodeString->Length != 0)
        return STATUS_INVALID_PARAMETER; 

    /* test cx, cx */
    if (UnicodeString->MaximumLength != 0)
        return STATUS_INVALID_PARAMETER; 
}

return STATUS_SUCCESS;

Friday, January 2, 2015

Practical Reverse Engineering p. 35 #4

Question number 4 on page 35 of Practical Reverse Engineering is as follows:

Implement the following functions in x86 assembly:
  • strlen
  • strchr
  • memcpy
  • memset
  • strcmp
  • strset

Here is the C prototype for strlen:

size_t strlen(const char *str);

This function returns the number of characters found before the null-byte character in a C-string. We can set AL to null and loop over EDI using REPNE SCASB. We will set ECX to -1 to begin, and then NOT the bytes and subtract by 1 to get the positive number length result.

_strlen:    
    push edi

    mov edi, [esp + 0x8]  ; char *str
    xor eax, eax          ; al = '\0'
    xor ecx, ecx          ; ecx = -1
    not ecx

    cld
    repne scasb

    not ecx               ; correct ecx
    lea eax, [ecx - 0x1]

    pop edi
    ret



Here is the C prototype for strchr:

char *strchr(char *str, int character);

The method of searching in the strchr function seems like it would be very similar to strlen. The are two main differences. Instead of searching for the null byte, we are searching for a user passed argument. And instead of returning the length, we return a pointer to the first time the search character is found.

We could be naive and just change some code in strlen, but looping with REPNE here means we could overrun the string buffer into unknown memory space. We need logic to check for the null byte as well, making the function more complicated. We could implement a strlen lookup beforehand, but it is more efficient to revert to more generic looping and comparison constructs.

_strchr:   
    mov eax, [esp + 0x4]   ; char *str
    mov ecx, [esp + 0x8]   ; char c
 
chr_loop:
    mov dl, [eax]          ; edx is caller-saved

    cmp cl, dl             ; *eax == c
    je chr_leave

    inc eax

    test dl, dl            ; check '\0'
    jnz chr_loop

    xor eax, eax           ; nullptr on fail

chr_leave:  
    ret



Here is the C prototype for memcpy:

void *memcpy (void *destination, const void *source, size_t num);

This implementation is very convenient in x86. We use the REP MOVSB operation, which will copy ESI to EDI byte by byte until ECX reaches 0.

_memcpy:
    push edi
    push esi

    mov edi, [esp + 0xc]
    mov esi, [esp + 0x10]
    mov ecx, [esp + 0x14]

    mov eax, edi            ; return dest

    cld
    rep movsb               ; ends at ecx = 0

    pop esi
    pop edi
    ret



Here is the C prototype for memset:

void *memset (void *ptr, int value, size_t num);

This implementation is similar to memcpy. We change to the REP STOSB operation, which will copy AL into EDI until ECX reaches 0.

_memset:
    push edi

    mov edi, [esp + 0x8]
    mov eax, [esp + 0xc]    ; char in al
    mov ecx, [esp + 0x10]

    push edi                ; store dest

    cld
    rep stosb               ; ends at ecx = 0

    pop eax                 ; return dest

    pop edi
    ret



Here is the C prototype for strcmp:

int strcmp(const char *str1, const char *str2);

The strcmp function returns 0 if both strings are equal, > 0 if the first character that does not match has a greater value in str1 than in str2, and < 0 in the other case.

 For this function we need to change our strategy. The REP CMPSB instruction looks like it would be great, but we would need to precompute both strings lengths and then use the minimum for our ECX value. We can achieve a better implementation using more generic looping and comparison constructs. Even though it seems to be more code, by skipping two strlen calls it ends up being less.

_strcmp:
    push edi
    push esi

    mov esi, [esp + 0xc]
    mov edi, [esp + 0x10]
    
    xor eax, eax    ; clear ret

cmp_loop:
    mov al, [esi]
    mov cl, [edi] 
    sub al, cl      ; al = *esi - *edi
    jne cmp_leave

    test cl, cl     ; check for '\0'
    jz cmp_leave

    inc esi         ; ++esi, ++edi
    inc edi
    jmp cmp_loop

cmp_leave:
    pop esi
    pop edi
    ret



Here is the C prototype for strset:

char *strset(char *str, int ch);

This is a function that isn't part of the standard library. We assume it works similar to memset, except that it will stop at the null terminator. There are two strategies we could take: precompute the strlen and then do a memset, or use more generic looping and comparison for a tighter implementation.

_strset:
    push edi

    mov edi, [esp + 0x8]
    mov ecx, [esp + 0xc]    ; char in ecx

    push edi                ; store dest

set_loop:
    mov al, [edi]
    test al, al             ; check '\0'
    jz set_leave

    mov [edi], cl           ; *edi = ch
    inc edi
    
    jmp set_loop

set_leave:
    pop eax                 ; return dest

    pop edi
    ret

Thursday, January 1, 2015

Practical Reverse Engineering p. 35 #3

Question number 3 on page 35 of Practical Reverse Engineering is as follows:

In some of the assembly listings, the function name has a @ prefix followed by a number. Explain when and why this decoration exists.

This is a compiler ABI decoration used with stdcall. The function name is also prefixed with an underscore. The number after the @ sign tells us the amount of bytes in the parameters.

BOOL WINAPI DllMain(
  _In_  HINSTANCE hinstDLL,
  _In_  DWORD fdwReason,
  _In_  LPVOID lpvReserved
);

The C function becomes the following symbol after compilation:

_DllMain@12

Practical Reverse Engineering p. 35 #2

Question number 2 on page 35 of Practical Reverse Engineering is as follows:

In the example walk-through, we did a nearly one-to-one translation of the assembly code to C. As an exercise, re-decompile this whole function so that it looks more natural. What can you say about the developer’s skill level/experience? Explain your reasons. Can you do a better job?

The function itself is fairly straightforward outside of the intrinsics library, which is basically just anti-VM code. It's hard to judge the competency of the developer in terms of software engineering skills since the compiler can change a lot of structure around. It's obvious he or she at least has a decent understanding of the Windows API, and has an idea about where a virtual machine CPU might store the Interrupt Descriptor Table.

Here is the function reverse engineered to C:

#include <Windows.h> 
#include <tlhelp32.h> 
#include <intrin.h> 
 
typedef struct _IDTR {
    DWORD base;
    SHORT limit;
} IDTR, *PIDTR;
 
BOOL APIENTRY DllMain(HMODULE hMod, DWORD dwReason, LPVOID lpRes) 
{
    IDTR idtr;
    PROCESSENTRY32 procentry;
    HANDLE hToolSnap;
    BOOL bProc32;

    __sidt(&idtr); 
    
    if (idtr.base > 0x8003F400 && idtr.base < 0x80047400)
        return FALSE;

    memset(&procentry, 0, sizeof(PROCESSENTRY32));
    procentry.dwSize = sizeof(procentry);

    hToolSnap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (hToolSnap == INVALID_HANDLE_VALUE)
        return FALSE;
    
    for (   bProc32 = Process32First(hToolSnap, &procentry); 
            bProc32 != FALSE;
            bProc32 = Process32Next(hToolSnap, &procentry)) 
    {
        if (wcscmp(procentry.szExeFile, _T("explorer.exe")) == 0)
            break;
    }

    if (!bProc32) 
        return FALSE;

    if (procentry.th32ParentProcessID == procentry.th32ProcessID)
        return FALSE; 

    if (dwReason == DLL_PROCESS_ATTACH)
        CreateThread(0, 0, (LPTHREAD_START_ROUTINE)0x100032D0, 0, 0, 0);

    return TRUE;
}

Practical Reverse Engineering p. 35 #1

Question number 1 on page 35 of Practical Reverse Engineering is as follows:

Repeat the walk-through by yourself. Draw the stack layout, including parameters and local variables.

The function in question is DllMain, which is a WINAPI (stdcall) convention. The stack pointer with regards to function entry looks like the following:

0x0 &return
+0x4 hModule
+0x8 dwReason
+0xc lpReserved

The function's instructions immediately modify the stack:

push ebp
mov ebp, esp
sub esp, 130h
push edi
sidt fword ptr [ebp-8]
mov eax, [ebp-6]

So the stack for local variables becomes:

-0x138 saved EDI
-0x134 reserved
-0xc struct IDT
-0x4 saved EBP
0x0 &return

The reserved bytes are for a PROCESSENTRY32 struct, which is defined on MSDN as follows:

typedef struct tagPROCESSENTRY32 {
  DWORD     dwSize;
  DWORD     cntUsage;
  DWORD     th32ProcessID;
  ULONG_PTR th32DefaultHeapID;
  DWORD     th32ModuleID;
  DWORD     cntThreads;
  DWORD     th32ParentProcessID;
  LONG      pcPriClassBase;
  DWORD     dwFlags;
  TCHAR     szExeFile[MAX_PATH];
} PROCESSENTRY32, *PPROCESSENTRY32;

So a more accurate and complete stack layout showing the local variables and parameters may be:

-0x138 saved EDI
-0x134 PROCESSENTRY32.dwSize
-0x130 PROCESSENTRY32.cntUsage
-0x12c PROCESSENTRY32.th32ProcessID
-0x128 PROCESSENTRY32.th32DefaultHeapID
-0x124 PROCESSENTRY32.th32ModuleID
-0x120 PROCESSENTRY32.cntThreads
-0x11c PROCESSENTRY32.th32ParentProcessID
-0x118 PROCESSENTRY32.pcPriClassBase
-0x114 PROCESSENTRY32.dwFlags
-0x10 PROCESSENTRY32.szExeFile[0x104]
-0xc IDT.limit
-0xa IDT.base
-0x4 saved EBP
0x0 &return
+0x4 hModule
+0x8 dwReason
+0xc lpReserved