Analyzing Windows Crash Dump Files

Published: 2012-9-25 16:34
Category: Debug_Crack


From CodeProject: http://www.codeproject.com/Articles/30600/Analyzing-Windows-Crash-Dump-Files

Introduction

This article focuses on using the Debugging Tools for Windows to analyze a crash dump. The intent is to encourage readers to use these techniques when their own systems crash; it is a skill that anyone whose system crashes frequently can learn. Analyzing a crash dump file generated by the Operating System is not difficult once a few principles are understood and the necessary tools are in place. Those tools are the debuggers in the Debugging Tools for Windows. After installing them, you can download the symbol files and cache them locally. The symbol files can also be downloaded on demand from the Microsoft Symbol Server during a debugging session by setting the symbol path in the _NT_SYMBOL_PATH environment variable:

set _NT_SYMBOL_PATH=srv*c:\symbols*http://msdl.microsoft.com/download/symbols

On Vista, use setx with the /M switch at the end of the line to make the setting machine-wide. Notice that the symbols are cached locally in a directory called c:\symbols. But what are symbols? When a program is built, the compilation process translates the human-readable source code into the machine's assembly language. This code is normally used to build an object file, which contains a symbol table describing all the objects in the file that have external linkage. Symbols refer to variables and functions in the running program by the names given to them by the programmer in the source code. To display and interpret these names, the debugger requires information about the types of the variables and functions in the program, and about which instructions in the executable file correspond to which lines in the source code files. Such information takes the form of a symbol table, which the compiler and linker generate while building the executable. The symbols downloaded from the Microsoft Symbol Server cover Microsoft code only. As we will see, a third party driver will typically not have symbols and may also use a calling convention that omits the stack frame pointer. Such a driver might call an Operating System function that then crashes, but it is likely that the driver passed the function some erroneous data. Another powerful debugging tool is livekd.exe, written by Mark Russinovich. As we will see, he is also the author of a tool that deliberately causes a crash so that you can learn how to analyze crash dump files and put that knowledge to practical use.
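To make the role of a symbol table more concrete, here is a minimal sketch in Python of how an address can be turned into a "name+offset" string by finding the closest symbol at or below it. The symbol names and addresses are invented for illustration, and the sketch ignores symbol sizes; it is not how the Windows debuggers are actually implemented.

```python
import bisect

# Hypothetical symbol table: (start address, name) pairs, sorted by address.
SYMBOLS = [
    (0x80400000, "nt!ExampleRoutineA"),
    (0x80400120, "nt!ExampleRoutineB"),
    (0x80400400, "nt!ExampleRoutineC"),
]

def resolve(address, symbols=SYMBOLS):
    """Return 'name+0xoffset' for the closest symbol at or below address."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        # No symbol covers this address: only the raw number can be shown.
        return hex(address)
    start, name = symbols[index]
    return f"{name}+{address - start:#x}"

print(resolve(0x80400134))  # an address inside the second routine
print(resolve(0x10000000))  # an address with no symbol information
```

Without symbols for a module, a debugger is in the position of the second call: it can show an address, but not the function that the address belongs to.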

Before we discuss these tools and how they are used, we must first understand that when the system crashes, something normally went wrong in kernel mode. A device driver or an Operating System function running in kernel mode incurs an unhandled exception such as a memory access violation: for example, an attempt to write to a read-only page, or an attempt to read an address that isn't currently mapped and is therefore not a valid memory location. Stated loosely, an executing thread attempts to write (or does write) to a memory block that it does not own and corrupts the state of that memory block.

Crash dump analysis falls under the topic of memory analysis. A fundamental aspect of memory analysis is that the virtual addresses used by the Operating System are not the same as the physical locations of the data in a memory dump. Because there is generally not enough physical memory to contain all running processes simultaneously, the Windows Operating System must simulate a larger memory space. A complete memory dump is usually not practical, because it is large and user mode code and data are normally not needed for crash dump analysis. If something went wrong in kernel mode, a kernel memory dump is the best choice for analyzing a system crash. These settings are found under Startup and Recovery on the Advanced tab of the System applet in the Control Panel (the same applet that gives access to the Device Manager and the remote settings).

A Brief Look at Threads and Processes

A thread is a unit of execution context. Threads are the units of scheduling and contain the execution state: the register values, the instruction pointer, and the stack pointer. A process is a container that has at least one thread, a handle table, a security token, and a private address space. The threads of a process share that address space, so it is up to the programmer to synchronize access to shared data among them. Part of the Windows memory protection scheme rests on the fact that while a process (that is, one of its threads) is executing, only the address space of that process is mapped into the microprocessor's memory management hardware. A process therefore cannot see the address space of another process, simply because that address space is not currently loaded into the memory management hardware. This does not mean that it can never access the address space of another process. To do so, it has to follow Windows security principles, open that process, and use special APIs to gain access to that remote process's address space.

The Windows Memory Manager creates the illusion of a flat virtual address space, while the memory management unit of the microprocessor maps virtual addresses to physical addresses. The larger memory space is simulated by creating a virtual address space for each process that is translated to physical storage locations through a series of data structures. The main data structures are the page directory and the page table. Virtual addresses are mapped to physical addresses at the granularity of a page (4 kilobytes of physical memory). A process represents to the system an instance of a running program, and a user mode application maps its code and data into the virtual address space of its process. Just as an application needs to map its code and data, the Operating System also needs to map itself, the configured device drivers, and the data that device drivers store on the kernel-mode heap. The virtual address used by a process does not represent the actual physical location of an object in memory. Instead, the system maintains a page map for each process, which is an internal data structure used to translate virtual addresses into corresponding physical addresses.
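The translation described above can be sketched in a few lines of Python. This is a toy model only: it assumes the classic 32-bit x86 layout without PAE (10 bits of page directory index, 10 bits of page table index, and a 12-bit offset into a 4 KB page), it uses dictionaries in place of the real hardware structures, and the mapping at the bottom is invented for illustration.

```python
PAGE_SIZE = 4096  # 4 KB pages

def split_virtual_address(va):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    directory_index = (va >> 22) & 0x3FF  # top 10 bits
    table_index = (va >> 12) & 0x3FF      # middle 10 bits
    offset = va & 0xFFF                   # low 12 bits
    return directory_index, table_index, offset

def translate(va, page_directory):
    """Walk a toy page directory and return the physical address.

    page_directory maps a directory index to a page table (a dict), and
    each page table maps a table index to a physical page frame number.
    """
    directory_index, table_index, offset = split_virtual_address(va)
    page_table = page_directory.get(directory_index)
    if page_table is None or table_index not in page_table:
        raise LookupError(f"page fault: {va:#010x} is not mapped")
    frame_number = page_table[table_index]
    return frame_number * PAGE_SIZE + offset

# Hypothetical mapping: virtual page 0x80123 -> physical frame 0x456.
page_directory = {0x200: {0x123: 0x456}}
print(hex(translate(0x80123ABC, page_directory)))
```

The offset within the page is carried over unchanged; only the page number is looked up. An address that has no entry raises an error here, which stands in for the page fault that the hardware would report.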

Another aspect of memory protection is that the address space consists of both the user's address space and a part that is dedicated to mapping the kernel, the drivers, and the data they both use. It would pose a security risk if user mode components like Notepad could reach into kernel mode and read the data there or even modify it. For this reason, Windows relies on the memory management hardware to mark pages that represent kernel address space as system pages. Each user process has its own separate address space, whereas all kernel mode components share a single address space, and user threads cannot access kernel memory.

The memory management hardware on the processors that Windows runs on prevents anything running in user mode from accessing pages that are marked as system pages. So, in order for a thread to make a system call, and thus enter Operating System code and access kernel memory, a transition has to occur. When a thread has to make a system call, it calls a function in a DLL that executes a special instruction that safely transitions into an elevated processor access mode. On the x86 architecture, this elevated processor access mode is called Ring 0. Kernel-mode code runs in Ring 0, and user-mode code runs in Ring 3. Threads switch from user mode to kernel mode and back every time they make a system call. Once that switch is made, the thread is executing in kernel mode, and the Operating System and the drivers have access to the protected kernel-mode memory.
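As a rough illustration of the protection rule, the following Python sketch models a page with a "system" marking and a thread that is either in Ring 3 or Ring 0. All class and function names here are hypothetical; the real check is done by the processor, not by software like this.

```python
from dataclasses import dataclass

RING_KERNEL = 0  # kernel mode on x86
RING_USER = 3    # user mode on x86

@dataclass
class Page:
    system: bool  # True when the page is marked as a system (kernel) page

class AccessViolation(Exception):
    """Stands in for the fault raised by the memory management hardware."""

def check_access(page, ring):
    """Mimic the hardware check: user-mode code may not touch system pages."""
    if page.system and ring != RING_KERNEL:
        raise AccessViolation("user-mode access to a system page")
    return True

class Thread:
    def __init__(self):
        self.ring = RING_USER  # threads start out in user mode in this model

    def touch(self, page):
        return check_access(page, self.ring)

    def system_call(self, page):
        """Toy system call: enter Ring 0, touch kernel memory, return to Ring 3."""
        self.ring = RING_KERNEL
        try:
            return self.touch(page)
        finally:
            self.ring = RING_USER
```

A thread that calls `touch` on a system page directly raises `AccessViolation`, while the same access made through `system_call` succeeds because it happens after the transition to Ring 0.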

Interrupt Request Levels: IRQLs

x86 interrupt controllers perform a level of interrupt prioritization, but Windows imposes its own interrupt priority scheme known as interrupt request levels (IRQLs). This scheme is a software concept that Windows uses to prioritize its own work. It is basically the priority of what is happening on the processor at that point. A few IRQLs are commonly related to crashes. The lowest is PASSIVE_LEVEL, at which no software or hardware interrupts are masked. By definition, when the system is running user-mode code, the IRQL is PASSIVE_LEVEL. The IRQL is raised to higher levels only while the system is executing kernel mode code, typically in response to software or hardware generated interrupts that trigger the execution of interrupt service routines or deferred procedure calls. Even when running kernel-mode code, the system tries to keep the IRQL at PASSIVE_LEVEL, because leaving interrupts unmasked keeps it responsive to the devices that are interrupting it. The next IRQL relevant to system crashes is DISPATCH_LEVEL. DISPATCH_LEVEL is the highest software interrupt level, and scheduler operations are mapped to this level. When the scheduler is operating on the system, it raises the IRQL to DISPATCH_LEVEL. Other operations can also raise the IRQL to DISPATCH_LEVEL, and while they hold it there, the scheduler is disabled on that processor. A thread running in kernel mode can therefore ensure that it is not preempted by another thread on that processor by raising the IRQL to DISPATCH_LEVEL. This turns off the scheduler, and the thread can run whatever operation it is performing to completion. When it is done, it drops the IRQL back down to PASSIVE_LEVEL and re-enables the scheduler.

A side effect of having the scheduler off at DISPATCH_LEVEL is that a driver executing at DISPATCH_LEVEL or higher cannot take a page fault. It cannot reference memory that is marked as pageable and is not present, because doing so would invoke the memory manager's page-in handler, which would be forced to issue a disk I/O (a hard fault). The memory manager would then suspend the thread until the I/O is complete, that is, until the referenced data has been brought in from disk (from either a mapped file or a page file). Placing the thread in a wait state amounts to calling the scheduler and telling it to find another thread to run on the CPU. But at DISPATCH_LEVEL, the scheduler is off. This is a violation of the Windows internal synchronization architecture, and the system therefore treats it as an illegal operation.
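The rule about page faults and IRQLs can be summarized in a small Python model. The numeric values are the ones used on x86 as far as I know (PASSIVE_LEVEL 0, DISPATCH_LEVEL 2), but treat them as an assumption; the function name and the returned strings are purely illustrative.

```python
PASSIVE_LEVEL = 0
DISPATCH_LEVEL = 2  # assumed x86 value

class BugCheck(Exception):
    """Stands in for the system crashing with a blue screen."""

def reference_memory(irql, resident):
    """Model what happens when kernel-mode code touches a page.

    irql     -- the current interrupt request level of the processor
    resident -- True if the page is present in physical memory
    """
    if resident:
        return "access completes"
    if irql >= DISPATCH_LEVEL:
        # The page-in would require a wait, but the scheduler is off.
        raise BugCheck("page fault at DISPATCH_LEVEL or above")
    return "thread waits for page-in while another thread runs"
```

At PASSIVE_LEVEL a missing page only costs time. At DISPATCH_LEVEL or above the same access has no legal way to complete, which is why the model raises `BugCheck` instead of returning.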

The Stack in Contrast with the Heap

The stack is an abstract last-in, first-out data structure; a stack trace is read from bottom to top, from the oldest caller to the most recent call. The heap is a dynamically allocated region of memory used by programs when the size of their data structures cannot be determined statically. That is, the data structures grow and shrink as the program dictates the need for heap allocations. The heap grows from the lower memory addresses toward the higher addresses, the opposite of the stack, which grows toward lower addresses. It is not possible for the heap and the stack to run into each other. The data section of an application program stores initialized global and static variables, and the BSS section stores uninitialized ones; neither resides on the heap. The stack holds the data, recorded by the hardware and by the device drivers that call Operating System functions, that allows nested function invocation. When a device driver calls the Operating System, the information stored on the stack is used to pass parameters to the Operating System function and to return to the function that called it. The stack therefore stores the parameters passed, the return address, and local variables (information local to the function that is processing the request). In Windows, each thread has two stacks. One is for user-mode execution of the thread; it resides in the user address space and is therefore accessible to any thread in the process. The other is the kernel-mode stack: when a thread enters kernel mode by invoking a system call, it runs off of its kernel-mode stack. The kernel-mode stack resides in the kernel address space and is therefore not accessible to threads running in user mode.

Consider a nesting in which Function 1 calls Function 2, which in turn calls Function 3. The return address is saved at the moment Function 1 makes its call to Function 2. That is what the hardware saves, so that when Function 2 returns, the hardware knows where to pick up execution inside Function 1. Function 2, once called, begins by setting up its frame pointer, saving the caller's frame pointer on the stack. It may also need some local buffers to use temporarily while it is executing. These local buffers are allocated on the stack as Local Variable 1 and Local Variable 2. When Function 2 calls Function 3, it passes the arguments on the stack in the same way that arguments were passed to it. The stack frame pointers clearly delineate the areas of the stack that correspond to each of the functions in that nesting.
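The way frame pointers delineate the stack can be sketched as a walk over a chain of saved frame pointers. The Python model below assumes the conventional x86 layout, in which the saved caller frame pointer sits at the frame pointer address and the return address sits 4 bytes above it. The memory contents are invented, and a zero frame pointer is used as an arbitrary end-of-chain marker.

```python
def walk_stack(memory, frame_pointer):
    """Follow saved frame pointers and collect the return addresses.

    memory maps an address to the 32-bit value stored there.
    """
    return_addresses = []
    while frame_pointer != 0:
        saved_frame_pointer = memory[frame_pointer]  # caller's frame pointer
        return_address = memory[frame_pointer + 4]   # where the caller resumes
        return_addresses.append(return_address)
        frame_pointer = saved_frame_pointer
    return return_addresses

# Hypothetical stack contents while Function 3 is executing.
memory = {
    0x0012FF00: 0x0012FF40,  # Function 3 frame: saved frame pointer of Function 2
    0x0012FF04: 0x00401234,  # return address inside Function 2
    0x0012FF40: 0x0012FF80,  # Function 2 frame: saved frame pointer of Function 1
    0x0012FF44: 0x00401100,  # return address inside Function 1
    0x0012FF80: 0x00000000,  # Function 1 frame: end of the chain in this model
    0x0012FF84: 0x00401000,  # return address inside the caller of Function 1
}

for address in walk_stack(memory, 0x0012FF00):
    print(hex(address))
```

The walk depends entirely on each function having saved a frame pointer. If one function in the chain skips that step, the links are broken and the walker has to fall back on other information to find the frames.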

The scenario above illustrates a very simple calling convention for the debugger's analysis engine to analyze. Other calling conventions differ, however. Code inside the kernel itself is often compiled so that the frame pointer is omitted: no frame pointer is pushed onto the stack, which makes stack walking difficult for the analysis engine unless it has symbols. The analysis engine has symbols for all Microsoft code, but if third party drivers are on the stack and they use calling conventions that omit the frame pointer, it is difficult for the analysis engine to figure out where the stack frames are.

When you open a crash dump file in the Windows debugger, it performs a basic analysis and essentially makes a guess as to who the culprit is. On opening the dump, the debugger internally invokes a command that you can also run explicitly, called !analyze (use !analyze -v for verbose output). !analyze displays the stop code and parameters and a guess at the offending driver. !analyze basically looks at the stack. Sometimes, the bug check parameters include the instruction pointer (cs:eip on x86) of the offending instruction. Using the loaded module list (the lm command), it can determine which driver that instruction belongs to. In other cases, !analyze uses heuristics to walk the stack and determine what was happening at the time of the crash, and then performs a sort of profiling. If the crash occurred inside the Operating System, but a caller of the Operating System function that triggered the crash was a third party driver, the debugger might state that the crash was "probably caused by" the third party driver, even though the crash itself occurred in a Windows Operating System function. In that case it is very likely that the called function was passed some erroneous data (a pointer to a corrupt data structure, or some parameter that was invalid). If the debugger states that the crash was caused by, say, ntoskrnl.exe, or some file system driver, don't take it at face value. Microsoft has gathered a great deal of crash data showing that at least 80% of crashes are caused by third party device drivers. This means that you have to do more digging.
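The idea of mapping an address to a driver and then looking past Operating System code for a likely culprit can be sketched as follows. This is a deliberately crude heuristic written for illustration; it is not the algorithm that !analyze uses, and the module names, base addresses, and sizes are all hypothetical.

```python
from typing import NamedTuple, Optional

class Module(NamedTuple):
    name: str
    base: int  # load address of the module
    size: int  # size of the module image in bytes

# Hypothetical loaded module list.
LOADED_MODULES = [
    Module("ntoskrnl.exe", 0x80400000, 0x00200000),
    Module("example3rdparty.sys", 0xF8000000, 0x00010000),
]

def module_for_address(address, modules=LOADED_MODULES) -> Optional[Module]:
    """Return the module whose image range contains address, if any."""
    for module in modules:
        if module.base <= address < module.base + module.size:
            return module
    return None

def probable_culprit(return_addresses, modules=LOADED_MODULES,
                     os_modules=("ntoskrnl.exe",)):
    """Return the first non-OS module on the stack, most recent frame first."""
    for address in return_addresses:
        module = module_for_address(address, modules)
        if module is not None and module.name not in os_modules:
            return module.name
    return None

# The crash happened in the kernel, but a third party driver called it.
print(probable_culprit([0x80412345, 0xF8001234]))
```

In this example the faulting instruction belongs to the kernel image, yet the sketch reports the third party driver further down the stack, which mirrors the reasoning described above.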

Mark Russinovich, who is now employed by Microsoft, wrote an application called "Notmyfault.exe" that is available as a freeware download in a zip file. The application is meant solely for educational purposes, to help users learn how to analyze and interpret a crash dump file. When the Operating System detects a condition that is outside any legal bounds, it calls an Operating System function called KeBugCheckEx (documented in the Windows DDK). This function takes the stop code and four parameters that are interpreted on a per-stop-code basis. KeBugCheckEx masks out all interrupts on all processors and then switches the display into a low-resolution VGA graphics mode, which lets the system paint a blue screen. The example shown below describes some of the contents of the blue screen while physical memory is being dumped to disk. You can use the Notmyfault utility to generate a crash. You simply select the type of Operating System scenario that would cause a system crash: