Tuesday, 1 December 2020

Ghidra and the Lost Gems (Fixing Misidentified Code)

Introduction

Using the Ghidra disassembler on non-x86/x64 programs, for architectures such as MIPS, Motorola, and PowerPC, can be more error-prone than disassembling x86/x64 programs. One reason is that Ghidra can misidentify some code bytes as data bytes. To address this, Ghidra offers the experimental Aggressive Instruction Finder analysis. However, even with this analysis enabled, Ghidra still misses many code locations (bytes) and leaves them undisassembled. This is especially annoying when using Ghidra's cross-references: many disassembled functions show no callers in their cross-reference lists, because Ghidra was unable to locate the calling code.


Figure1: These folks know what I am talking about 😀


Ghidra BruteforceDisassembly.py Script

To fix misidentified code bytes (locations), I wrote a Ghidra Python script that attempts to force the disassembly of code bytes that were left undisassembled.

Figure2: BruteforceDisassembly.py Ghidra script
    

The script follows a simple methodology. It first prompts the user to specify the code bytes of interest. Typically, the user wants to recover functions that Ghidra missed, so it makes sense to search for the bytes of function prologues. For instance, a common function prologue on x86 is "push ebp" (0x55) followed by "mov ebp, esp" (0x8bec), so we might search for the bytes 558bec. Since Ghidra generally performs well on x86, I am going to demonstrate with the Motorola architecture instead, where function prologues contain the bytes 4e56. But first I enable the Aggressive Instruction Finder analysis to let Ghidra try harder to find code bytes.
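As a rough illustration of the search step, here is a minimal sketch in plain Python 3, runnable outside Ghidra. It is not the code of BruteforceDisassembly.py: the function name `find_pattern_offsets`, the `alignment` parameter, and the sample bytes are all made up for this example.

```python
def find_pattern_offsets(data, hex_pattern, alignment=1):
    """Return every offset in `data` where the bytes of `hex_pattern` occur.

    `alignment` is an assumption added for this sketch: it lets the caller
    keep only matches that start on a multiple of `alignment` bytes.
    """
    pattern = bytes.fromhex(hex_pattern)
    if not pattern:
        raise ValueError("pattern must contain at least one byte")

    offsets = []
    pos = data.find(pattern)
    while pos != -1:
        if pos % alignment == 0:
            offsets.append(pos)
        # Resume one byte later so that overlapping matches are not skipped.
        pos = data.find(pattern, pos + 1)
    return offsets


# Made-up sample buffer that contains the prologue bytes 4e56 twice.
sample = bytes.fromhex("00004e56fffc4e754e560000")
prologue_offsets = find_pattern_offsets(sample, "4e56")
print([hex(offset) for offset in prologue_offsets])
```

A raw byte match like this only yields candidates. The same two bytes can also occur inside data, which is why a second check is needed before treating a match as code.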

Figure3: Enable Ghidra's Aggressive Instruction Finder analysis


When run, the script first prompts the user to enter the targeted code bytes.

Figure4: Enter the code bytes for functions prologues


Next, the script prompts the user to enter the name of the first instruction, that is, the instruction the matched bytes are expected to disassemble to. When the script finds matching bytes, it needs to determine whether they are truly code bytes rather than data bytes, so it checks that disassembling them yields the targeted instruction. In this case, the specified bytes 4e56, if disassembled correctly, should produce the instruction link.
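The mnemonic check can be sketched roughly as follows, again in plain Python 3 and not taken from the actual script. The names `confirm_code_locations` and `first_mnemonic_at` are hypothetical, and the dictionary stands in for a real disassembler: inside Ghidra the callback would have to ask the disassembler for the first instruction at each candidate address.

```python
def confirm_code_locations(candidates, first_mnemonic_at, expected):
    """Split candidate offsets into confirmed code and rejected matches.

    `first_mnemonic_at` is a hypothetical callback that returns the mnemonic
    of the first instruction decoded at an offset, or None if nothing decodes.
    """
    expected = expected.strip().lower()
    confirmed, rejected = [], []
    for offset in candidates:
        mnemonic = first_mnemonic_at(offset)
        if mnemonic is not None and mnemonic.lower() == expected:
            confirmed.append(offset)   # decodes to the targeted instruction
        else:
            rejected.append(offset)    # probably data, leave it untouched
    return confirmed, rejected


# Stand-in for a disassembler: offset -> first mnemonic (made-up values).
fake_listing = {0x10: "link", 0x40: "LINK", 0x80: "ori.b"}

confirmed, rejected = confirm_code_locations(
    [0x10, 0x40, 0x80, 0xC0], fake_listing.get, "link"
)
print("confirmed:", [hex(offset) for offset in confirmed])
print("rejected:", [hex(offset) for offset in rejected])
```

Comparing mnemonics case-insensitively is a choice made for this sketch, so that typing "LINK" or "link" at the prompt gives the same result.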

Figure5: Enter the targeted instruction (first instruction in the disassembled location)

The script then ran and successfully identified and fixed 115 code locations that had been misidentified as data locations.

Figure6: The script fixed 115 misidentified code locations


The BruteforceDisassembly.py script and the program firmware.bin can be found in the GitHub repository.