Scheduling, Thread Context and IRQL

发布时间:2014-5-23 13:13
分类名称:Driver


名词解释:

Schedule
调度,线程被内核的调度器所调度,就是停止一个线程,运行另一个线程的过程。

Scheduler
调度器,线程被来回切换,就是由调度器实现的,调度器运行在DISPATCH_LEVEL中断级别。

Priority
优先权,线程被内核的调度器调度时,会选择高具有优先权的线程,先让它运行。此属性针对的是某个线程。

pre-empted/preempt
抢占,当前线程运行过程中,会被一个高优先级的线程抢占,高优先级的线程会获得更多的时间片。

Interrupt
中断,中断是针对CPU(Processsor)的。

Suspend
挂起,挂起某个线程,如果不调用恢复函数,线程无法继续运行。挂起的原理为给线程排队一个APC,在此APC上等待(KeWaitForSingleObject)一个信号量,进入Wait State。

Wait State
等待状态,线程调用调用WaitForSingleObject,KeWaitForSingleObject等类型的函数,是自己变为等待状态。

调度,优先权,抢占是针对某个线程而已,而中断针对的是当前的CPU。

 

以下是微软的经典文章,在重要的地方,我用中文做了解释。

A thread's scheduling priority and the processor's current IRQL determine whether a running thread can be pre-empted or interrupted.

下面说明了抢占和中断的区别之一:
抢占的表现形式为挂起一个线程,运行另一个线程,由于线程被切换,所以线程的上下文(Context)也随之切换。
而中断表现形式则是还是当前线程,不会切换线程,在此线程上下文中暂时运行一个高级别中断处理程序 ,就如同在线程运行过程中,突然插入了一个调用函数,当中断程序处理完成后,当前线程继续运行。中断不受调度器影响。

In thread pre-emption, the operating system replaces the running thread with another thread, usually of higher thread priority, on the same processor. The effect of pre-emption on an individual thread is to make the processor unavailable for a while. In thread interruption, the operating system forces the current thread to temporarily run code at a higher interrupt level. The effect of interruption on an individual thread is similar to that of a forced procedure call.

Thread Scheduling

The Microsoft? Windows? operating system schedules individual threads, not entire processes, for execution. Every thread has a scheduling priority (its thread priority), which is a value from 0 to 31, inclusive. Higher numbers indicate higher priority threads.

Each thread is scheduled for a quantum时间片), which defines the maximum amount of CPU time for which the thread can run before the kernel looks for other threads at the same priority to run. The exact duration of a quantum varies depending on what version of Windows is installed, the type of processor on which Windows is running, and the performance settings that have been established by a system administrator. (For more details, see Inside Windows 2000.)

After a thread is scheduled, it runs until one of the following occurs:

 

Kernel-mode threads do not have priority over user-mode threads. A kernel-mode thread can be pre-empted by a user-mode thread that has a higher scheduling priority.

Thread priorities in the range 1-15 are called dynamic priorities. Thread priorities in the range from 16-31 are called real-time priorities. Thread priority zero is reserved for the zero-page thread, which zeroes free pages for use by the memory manager.

Every thread has a base priority and a current priority. The base priority is usually inherited from the base priority for the thread's process. The current priority is the thread's priority at any given time. For kernel-mode driver code that runs in the context of a user thread, the base priority is the priority of the user process that originally requested the I/O operation. For kernel-mode driver code that runs in the context of a system worker thread, such as a work item, the base priority is the priority of the system worker threads that service its queue.

To improve system throughput, the operating system sometimes adjusts thread priorities. If a thread's base priority is in the dynamic range, the operating system can temporarily increase ("boost") or decrease its priority, thus making its current priority different from its base priority. If a thread's base priority is in the real-time range, its current priority and base priority are always the same; threads running at real-time priorities never receive a priority boost. In addition, a thread that is running at a dynamic priority can never be boosted to a real-time priority. Therefore, applications that create threads with base priorities in the real-time range can be confident that these threads always have a higher priority than those in the dynamic range.

The system boosts a thread's priority when the thread completes an I/O request, when it stops waiting for an event or semaphore, or when it has not been run for some time despite being ready to run (called "CPU starvation"). Threads involved in the Graphical User Interface (GUI) and the user's foreground process also receive a priority boost in some situations. The amount of the increase depends on the reason for the boost and, for I/O operations, on the type of device involved. Drivers can affect the boost their code receives by:

 

Constants defined in ntddk.h and wdm.h indicate the appropriate priority boost for each device, event, and semaphore.

A thread's scheduling priority is not the same as the interrupt request level (IRQL) at which the processor operates.

Thread Context and Driver Routines

Most Windows drivers do not create threads; instead, a driver consists of a group of routines that are called in an existing thread that was created by an application or system component.

Kernel-mode software developers use the term "thread context" in two slightly different ways. In its narrowest meaning, thread context is the value of the thread's CONTEXT structure. The CONTEXT structure contains the values of the hardware registers, the stacks, and the thread's private storage areas. The exact contents and layout of this structure will vary according to the hardware platform. When Windows schedules a user thread, it loads information from the thread's CONTEXT structure into the user-mode address space.

From a driver developer's perspective, however, "thread context" has a broader meaning. For a driver, the thread context includes not only the values stored in the CONTEXT structure, but also the operating environment they define—particularly, the security rights of the calling application. For example, a driver routine might be called in the context of a user-mode application, but it can in turn call a ZwXxx routine to perform an operation in the context of the operating system kernel. This paper uses "thread context" in this broader meaning.

The thread context in which driver routines are called depends on the type of device, on the driver's position in the device stack, and on the other activities currently in progress on the system. When a driver routine is called to perform an I/O operation, the thread context might contain the user-mode address space and security rights of the process that requested the I/O. However, if the calling process was performing an operation on behalf of another user or application, the thread context might contain the user-mode address space and security rights of a different process. In other words, the user-mode address space might contain information that pertains to the process that requested the I/O, or it might instead contain information that pertains to a different process.

一般情况下最高层的驱动,都运行在一个确定的线程上下文中,此上下文就是发起IO请求的线程上下文。

The dispatch routines of file system drivers (FSDs), file system (FS) filter drivers, and other highest-level drivers normally receive I/O requests in the context of the thread that initiated the request. These routines can access data in the user-mode address space of the requesting process, provided that they validate pointers and protect themselves against user-mode errors.

Most other routines in FSDs, FS filters, and highest-level drivers—and most routines in lower-level drivers—are called in an arbitrary thread context. Although the highest-level drivers receive I/O requests in the context of the requesting thread, they often forward those requests to their lower level drivers on different threads. Consequently, you can make no assumptions about the contents of the user-mode address space at the time such routines are called.

For example, when a user-mode application requests a synchronous I/O operation, the highest-level driver's I/O dispatch routine is called in the context of the thread that requested the operation. The dispatch routine queues the I/O request for processing by lower-level drivers. The requesting thread then enters a wait state until the I/O is complete. A different thread de-queues the request, which is handled by lower-level drivers that run in the context of whatever thread happens to be executing at the time they are called.

A few driver routines run in the context of a system thread. System threads have the address space of the system process and the security rights of the operating system itself. Work items queued with the IoXxxWorkItem routines run in a system thread context, and so do all DriverEntry and AddDevice routines. No user-mode requests arrive in a system thread context.

Driver Threads

Although a driver can create a new, driver-dedicated thread by calling PsCreateSystemThread, drivers rarely do so. Switching thread context is a relatively time-consuming operation that can degrade driver performance if it occurs often. Therefore, drivers should create dedicated threads only to perform continually repeated or long-term activities, such as polling a device or managing multiple data streams, as a network driver might do.

To perform a short-term, finite task, a driver should not create its own thread; instead, it can temporarily "borrow" a system thread by queuing a work item. The system maintains a pool of dedicated threads that all drivers share. When a driver queues a work item, the system dispatches it to one of these threads for execution. Drivers use work items to run code in the kernel address space and security context or to call functions that are available only at IRQL PASSIVE_LEVEL. For example, a driver's IoCompletion routine (which can run at IRQL DISPATCH_LEVEL) should use a work item to call a routine that runs at IRQL PASSIVE_LEVEL.

To queue a work item, a driver allocates an object of type IO_WORKITEM and calls the IoQueueWorkItem routine, specifying the callback routine to perform the task and the queue in which to place the work item. The kernel maintains three queues for work items:

 

A system worker thread removes the work item from the queue and runs the driver-specified callback routine in a system thread context at IRQL PASSIVE_LEVEL. The operating system ensures that the driver is not unloaded while the callback routine is running. To synchronize the actions of the callback routine with other driver routines, the driver can use one of the Windows synchronization mechanisms. For more information about synchronization, see the companion white paper, "Locks, Deadlocks, and Synchronization."

Because the system has a limited supply of dedicated worker threads, the tasks assigned to them should be completed quickly. For example, a driver should not have a work item that runs continuously until the driver is unloaded. Instead, the driver should queue a work item only when it is needed, and the work item routine should exit when it has completed its work. For the same reasons, drivers should never include an infinite loop (such as might occur in a file system driver) in a work item. Drivers should also avoid queuing excessive numbers of work items, because tying up the system worker threads can deadlock the system. Instead of queuing a separate work item routine for each individual operation, the driver should have a single work item routine that performs any outstanding work and then exits when there is no more immediate work to perform.

Interrupt Request Levels

An interrupt request level (IRQL) defines the hardware priority at which a processor operates at any given time. In the Windows Driver Model, a thread running at a low IRQL can be interrupted to run code at a higher IRQL.

The number of IRQLs and their specific values are processor-dependent. The IA64 and AMD64 architectures have 16 IRQLs and the x86-based processors have 32. (The difference is due primarily to the types of interrupt controllers that are used with each architecture.) Table 1 is a list of the IRQLs for x86, IA64, and AMD64 processors.

Table 1. Interrupt Request Levels

IRQL

IRQL value

Description

x86

IA64

AMD64

PASSIVE_LEVEL

0

0

0

User threads and most kernel-mode operations

APC_LEVEL

1

1

1

Asynchronous procedure calls and page faults

DISPATCH_LEVEL

2

2

2

Thread scheduler and deferred procedure calls (DPCs)

CMC_LEVEL

N/A

3

N/A

Correctable machine-check level (IA64 platforms only)

Device interrupt levels (DIRQL)

3-26

4-11

3-11

Device interrupts

PC_LEVEL

N/A

12

N/A

Performance counter (IA64 platforms only)

PROFILE_LEVEL

27

15

15

Profiling timer for releases earlier than Windows 2000

SYNCH_LEVEL

27

13

13

Synchronization of code and instruction streams across processors

CLOCK_LEVEL

N/A

13

13

Clock timer

CLOCK2_LEVEL

28

N/A

N/A

Clock timer for x86 hardware

IPI_LEVEL

29

14

14

Interprocessor interrupt for enforcing cache consistency

POWER_LEVEL

30

15

14

Power failure

HIGH_LEVEL

31

15

15

Machine checks and catastrophic errors; profiling timer for Windows XP and later releases

 

When a processor is running at a given IRQL, interrupts at that IRQL and lower are masked off (blocked) on the processor. For example, a processor that is running at IRQL=DISPATCH_LEVEL can be interrupted only by a request at an IRQL greater than DISPATCH_LEVEL.

The system schedules all threads to run at IRQLs below DISPATCH_LEVEL, and the system's thread scheduler itself (also called "the dispatcher") runs at IRQL=DISPATCH_LEVEL. Consequently, a thread that is running at or above DISPATCH_LEVEL has, in effect, exclusive use of the current processor. Because DISPATCH_LEVEL interrupts are masked off on the processor, the thread scheduler cannot run on that processor and thus cannot schedule any other thread.

On a multiprocessor system, each processor can be running at a different IRQL. Therefore, one processor could run a driver's InterruptService routine at DIRQL while a second processor runs driver code in a worker thread at PASSIVE_LEVEL. Because more than one thread could thus attempt to access shared data simultaneously, drivers must protect shared data by using an appropriate synchronization method. Drivers should use a lock that raises the IRQL to the highest level at which any code that accesses the data can run. For example, a driver uses a spin lock to protect data that can be accessed at IRQL=DISPATCH_LEVEL. For more information about synchronization mechanisms, see the companion white paper, "Locks, Deadlocks, and Synchronization."

On a single-processor system, raising the IRQL to DISPATCH_LEVEL or higher has the same effect as using a spin lock because raising the IRQL can prevent the pre-emption or interruption of the currently executing code. For example, when a driver's StartIo routine is running at DISPATCH_LEVEL on a single-processor system, other driver code that runs at APC_LEVEL or PASSIVE_LEVEL cannot run until the IRQL drops. Similarly, when a driver's InterruptService routine is running at DIRQL, the DPC queued by that routine cannot run until the InterruptService routine exits. In fact, the operating system's spin lock acquisition and release routines raise the IRQL on single-processor systems; they do not actually manipulate a lock object. On multiprocessor machines, however, spin lock acquisition routines raise the IRQL on the current processor while other processors spin on the lock.

For detailed information about locking, see the companion white paper, "Locks, Deadlocks, and Synchronization."

Processor-specific and Thread-specific IRQLs

As previously mentioned, the system's thread scheduler runs at IRQL=DISPATCH_LEVEL. IRQLs at or above DISPATCH_LEVEL are processor specific. Hardware and software interrupts at these levels are targeted at individual processors. The following processor-specific IRQLs are commonly used by drivers:

 

IRQLs below DISPATCH_LEVEL are thread specific. Software interrupts at these levels are targeted at individual threads. Drivers use the following thread-specific IRQLs:

IRQL被分为两种一种为针对单独的CPU:Processor-specific一种是针对单独的线程:Thread-specific针对单独的线程时,可以将线程想象为一个"伪CPU",它只有俩个中断,APC和PASSIVE,而且这俩个中断的处理方式和真CPU中断处理方式不同,例如一个APC中断有时候并不会立即执行。见下文解释。而且在这俩个级别上,线程受到调度器调度,调度器只考虑优先级,不考虑当前的LEVEL,所以会导致一个运行在PASSIVE的高优先级线程抢占一个运行在APC_LEVEL相对较低优先级的线程。

APC_LEVEL是个比较复杂的"中断级别"。

 

The thread scheduler considers only thread priority, and not IRQL, when preempting a thread. If a thread running at IRQL=APC_LEVEL blocks, the scheduler might select a new thread for the processor that was previously running at PASSIVE_LEVEL.

实际上,还有个中间件级:intermediate level相当于运行在PASSIVE_LEVEL当禁止了部分APC中断(如没有禁止:special kernel-mode APC)。

Although only two thread-specific IRQL values are defined, the system actually implements three levels. The system implements an intermediate level between PASSIVE_LEVEL and APC_LEVEL. Code running at this level is said to be in a critical region. Code that is running at IRQL=PASSIVE_LEVEL calls KeEnterCriticalRegion to raise the IRQL to this level and calls KeLeaveCriticalRegion to return the IRQL to PASSIVE_LEVEL.

The following sections provide more information about the operating environment for driver code at each of these levels.

IRQL PASSIVE_LEVEL

While the processor is operating at PASSIVE_LEVEL, the operating system uses the scheduling priorities of the current threads to determine which thread to run. PASSIVE_LEVEL is the processor's normal operating state, at which any interrupt can occur. Any thread that is running at PASSIVE_LEVEL is considered preemptible because it can be replaced by a thread that has a higher scheduling priority. A thread that is running at PASSIVE_LEVEL is also considered interruptible because it can be interrupted by a request at a higher IRQL.

Occasionally, driver code that is running at IRQL PASSIVE_LEVEL must call a system service routine or perform some other action that requires running at a higher IRQL (usually DISPATCH_LEVEL). Before making the call or performing the action, the driver must raise its IRQL to the required level; immediately after completing the action, the driver must lower the IRQL.

Code that is running at PASSIVE_LEVEL is considered to be working on behalf of the current thread. An application that creates a thread can suspend that thread while the thread is running kernel-mode code at PASSIVE_LEVEL. Therefore, driver code that acquires a lock at IRQL=PASSIVE_LEVEL must ensure that the thread in which it is running cannot be suspended while it holds the lock; thread suspension would disable access the to driver's device. This problem is usually resolved by using locks that raise the IRQL. Another solution is to enter a critical region whenever it tries to acquire such a lock. This issue is covered in greater detail in the companion paper, "Locks, Deadlocks, and Synchronization."

这里说了下suspend,suspend会导致线程无法访问device,为了防止访问异常,驱动需要上锁,或者提升IRQL。为什么? Suspend为挂起的意思,相当于停止了某个线程,使得线程处于等待(wait)状态(等待某个信号量上),如果不Resume的话,是不会运行的。在WRK源码中,函数 KeSuspendThread实现了挂起操作,其调用KiInsertQueueApc排队了一个APC到要挂起的线程上,在APC中等待一个信号量。排队一个APC相当于发生了一个APC_LEVEL"中断",所以禁用挂起的方式通过获取spin lock(DISPATCH_LEVEL)或者提升IRQL(大于PASSIVE_LEVEL),这样在解锁或降低IRQL之前,APC得不到响应。

IRQL PASSIVE_LEVEL in a critical region

Code that is running at PASSIVE_LEVEL in a critical region is effectively running at an intermediate level between PASSIVE_LEVEL and APC_LEVEL. Calls to KeGetCurrentIrql return PASSIVE_LEVEL. Driver code can determine whether it is operating in a critical region by calling the function KeAreApcsDisabled (available in Windows XP and later releases).

Driver code that is running above PASSIVE_LEVEL (either at PASSIVE_LEVEL in a critical region or at APC_LEVEL or higher) cannot be suspended. 为甚不能挂起?)Almost every operation that a driver can perform at PASSIVE_LEVEL can also be performed in a critical region. Two notable exceptions are raising hard errors and opening a file on storage media.

为何在intermediate level也不能挂起一个线程?因为在此级别上,禁止了部分APC排队。只有Special kernel-mode APC无法禁止。但KeSuspendThread排队的是一个Normal kernel-mode APC,所以也能禁止挂起操作。APC有三个分类,下文会有解释。

IRQL APC_LEVEL

APC_LEVEL is a thread-specific IRQL that is most commonly associated with paging I/O. Applications cannot suspend code that is running at IRQL=APC_LEVEL. The system implements fast mutexes (a type of synchronization mechanism) at APC_LEVEL. The KeAcquireFastMutex routine raises the IRQL to APC_LEVEL, and KeReleaseFastMutex returns the IRQL to its original value. 注意fast mutex只能用在2003server以后的系统中。

The only difference between a thread that is running at PASSIVE_LEVEL with APCs disabled and a thread that is running at APC_LEVEL is that while running at APC_LEVEL, the thread cannot be interrupted to deliver a special kernel-mode APC.

Asynchronous Procedure Calls (APCs)

Asynchronous procedure calls (APCs) are software interrupts that are targeted at a specific thread. The system uses APCs to perform work in the context of a particular thread, such as writing back the status of an I/O operation to the requesting application.

How a target thread responds to APCs depends on the thread's state and the type of APC. The following briefly summarizes the actions; for a complete description, see "Do Waiting Threads Receive Alerts and APCs" under "Synchronization" in the "Kernel-Mode Drivers Architecture Design Guide" of the Windows Driver Development Kit (DDK).

APC的排队细则在WRK中的KeInsertQueueApc函数中,首先通过KeInitializeApc初始化一个APC,然后调用KeInsertQueueApc->KiInsertQueueApc将APC排队,如何排队,就是下面说的细则。一个APC在内核中,表示为:KAPC结构体,如下:

typedef struct _KAPC {
UCHAR Type;
UCHAR SpareByte0;
UCHAR Size;
UCHAR SpareByte1;
ULONG SpareLong0;
struct _KTHREAD *Thread;
LIST_ENTRY ApcListEntry;
PKKERNEL_ROUTINE KernelRoutine;        // Kernel APC
PKRUNDOWN_ROUTINE RundownRoutine;
PKNORMAL_ROUTINE NormalRoutine;        // Normal APC
PVOID NormalContext;

//
// N.B. The following two members MUST be together.
//

PVOID SystemArgument1;
PVOID SystemArgument2;
CCHAR ApcStateIndex;
KPROCESSOR_MODE ApcMode;    //usermode or kernel mode
BOOLEAN Inserted;
} KAPC, *PKAPC, *PRKAPC;

下面阐述的就是KiInsertQueueApc代码中的细则:
Every thread has two kernel-mode APC queues, one for APC_LEVEL callbacks and another for critical region callbacks. Each time the system adds an APC to a queue, it checks to see whether the target thread is currently running. If so, the system requests an interrupt on the appropriate processor. If the thread is still running when the system services the interrupt, the APC runs immediately, if appropriate. If the target thread is not running, the APC is added to the queue and runs the next time the thread is scheduled; the interrupt does not cause the target thread to run immediately. If the current IRQL is too high to run the APC, the APC runs the next time the IRQL is lowered below the level of the APC. If the thread is waiting at a lower IRQL, the system wakes the thread temporarily to deliver the APC, and then the thread resumes waiting.

如果排队APC的线程处于运行的状态,此线程就相当于一个伪CPU,给线程排队APC会发送一个APC软中断,如果目前线程不是在APC_LEVEL,发送完后还在继续执行,那么此线程会被"中断",去执行相应的APC回调函数。如果目前的线程运行在APC_LEVEL或者APC_LEVEL以上,那么中断无法响应,只是将APC排队到线程中,等到降到APC_LEVEL一下后,再次调度此线程时,调用排队的APC。

如果排队的APC处于等待状态,如调用KeWaitXXX进入等待状态,要看其传入的参数:Alertable和WaitMode,还有APC的类型。

如果APC的类型为special kernel-mode APC,无论这两个参数是什么,只要当前线程的IRQL<APC_LEVEL,线程会被被暂时唤醒,去执行对应的APC,然后线程继续恢复等待状态。唤醒的目的只是去执行APC,原线程仍然没有执行一行代码。所以看起来等待并没有被结束。

如果APC的类型为Normal kernel-mode APC,无论这两个参数是什么,只要当前线程的IRQL<APC_LEVEL且线程不在一个APC中或者不在intermediate level,线程会被被暂时唤醒,去执行对应的APC,然后线程继续恢复等待状态。唤醒的目的只是去执行APC,原线程仍然没有执行一行代码。所以看起来等待并没有被结束。

如果APC的类型为Normal use-mode APC,只有Alertable参数为TRUE,而且WaitMode为UserMode的时候,会停止线程等待,线程重新回到调度状态,调度器会调度其APC。

The Alertable and WaitMode parameters to KeWaitForSingleObject and KeWaitForMultipleObjects determine how the system handles user-mode APCs while the thread is waiting. Table 8 is a summary of the effects of these parameters on APC delivery.

Table 8. Effects of the Alertable and WaitMode Parameters on APC Delivery

Value of Alertable and WaitMode parameters

Special
kernel-mode APC

Normal
kernel-mode APC


User-mode APC

Terminate wait?

Deliver and run APC?

Terminate wait?

Deliver and run APC?

Terminate wait?

Deliver and run APC?

Alertable = TRUE

WaitMode = UserMode

No

If (A*), then Yes

No

If (B**), then Yes

Yes

Yes, after thread returns to user mode

Alertable = TRUE

WaitMode = KernelMode

No

If (A), then Yes

No

If (B), then Yes

No

No

Alertable = FALSE

WaitMode = UserMode

No

If (A), then Yes

No

If (B), then Yes

No

No (with exceptions, such as CTRL+C to terminate)

Alertable = FALSE

WaitMode = KernelMode

No

If (A), then Yes

No

If (B), then Yes

No

No

 

*A: IRQL < APC_LEVEL.

**B: IRQL < APC_LEVEL, thread not already in an APC, and thread not in a critical region.

Both user-mode code and kernel-mode code can issue APCs. The system defines three types of APCs:

 

User-mode APCs are primarily used in completing I/O operations. Win32? APIs such as ReadFileEx and WriteFileEx allow the caller to specify an I/O completion callback routine. To run the callback routine, the system queues a user-mode APC to the thread that requested the I/O. A user-mode application can queue a user-mode APC directly by calling the Win32 API QueueUserAPC. User-mode APCs are beyond the scope of this document.

Normal kernel-mode APCs are delivered at the intermediate level that corresponds to PASSIVE_LEVEL in a critical region. The system delivers a normal kernel-mode APC when the target thread is already running at PASSIVE_LEVEL or when the thread is returning to PASSIVE_LEVEL after exiting from a critical region or after lowering the IRQL. Normal kernel-mode APCs have two routines, a Special Routine that runs at APC_LEVEL, and a Normal Routine that subsequently runs at PASSIVE_LEVEL in a critical region. The special routine typically frees the APC structure.

Special kernel-mode APCs are delivered at APC_LEVEL. The system delivers them if the target thread is running at an IRQL below APC_LEVEL or if the target thread is returning to an IRQL below APC_LEVEL.

Normal kernel-mode APCs and special kernel-mode APCs are queued when kernel-mode operating system code calls an undocumented internal routine; drivers cannot queue kernel-mode APCs directly.

The I/O manager queues the special kernel-mode APC for I/O completion whenever an I/O request completes. When a device's drivers complete a buffered I/O request, the I/O Manager queues this APC to the user-mode thread that originated the I/O request. When the APC runs, the operating system restores the thread's context and the I/O manager copies data from the driver's kernel-space output buffer to the requesting thread's user-space buffer.

IRQL DISPATCH_LEVEL

DISPATCH_LEVEL is the highest software interrupt level and the first processor-specific level. The Windows thread scheduling and dispatching components (collectively called "the dispatcher") run at IRQL DISPATCH_LEVEL. Some other kernel-mode support routines, some driver routines, and all deferred procedure calls (DPCs) also run at IRQL DISPATCH_LEVEL. While the processor operates at this level, one thread cannot pre-empt another; only a hardware interrupt can interrupt the running thread. To maximize overall system throughput, driver code that runs at DISPATCH_LEVEL should perform only the minimum amount of required processing.

Because code that is running at DISPATCH_LEVEL cannot be pre-empted, the operations that a driver can perform at DISPATCH_LEVEL are restricted. Any code that must wait for an object that another thread sets or signals asynchronously—such as an event, semaphore, mutex, or timer—cannot run at DISPATCH_LEVEL because the waiting thread cannot block while waiting for the other thread to perform the action. Waiting for a nonzero period on such an object while at DISPATCH_LEVEL causes the system to deadlock and eventually to crash.

Deferred procedure calls (DPCs) are, in effect, software interrupts targeted at processors. DPCs (including DpcForIsr, CustomDpc, and CustomTimerDpc routines) are always called at IRQL DISPATCH_LEVEL in an arbitrary thread context. Drivers usually use DPCs for the following:

 

The kernel maintains a queue of DPCs for each processor and runs DPCs from this queue just before the processor's IRQL drops below DISPATCH_LEVEL.

Each DPC is assigned to the queue for the same processor on which the code that queues it is running. The kernel removes DPC objects from the head of the queue and adds them to its tail. A driver can change the processor for which a DPC object is queued by calling KeSetTargetProcessorDpc. A driver can also change the DPC's relative location in the queue by calling KeSetImportanceDpc. However, drivers rarely need to change either of these characteristics.

If a device interrupts while either its DpcForIsr or CustomDpc routine is running, its InterruptService routine pre-empts the DPC and queues a DPC object as it normally would. In a single-processor system, the DPC object is placed at the end of the single DPC queue, where it runs in sequence with any other DPCs in the queue after the InterruptService routine and the current DPC complete. In a multi-processor system, however, the second interrupt could occur on a different processor.

For example, assume a device interrupts on Processor 1 while its DpcForIsr routine is running on Processor 0. The system runs the InterruptService routine on Processor 1 to handle the interrupt. When the InterruptService routine queues its DpcForIsr routine, the system places the DPC object into the DPC queue of Processor 1. Thus, a driver's InterruptService routine can run at the same time as its DPC routine, and the same DPC routine can run on two or more processors at the same time. If both routines attempt to access the same data simultaneously, serious errors can occur. Drivers must use spin locks to protect shared data in these scenarios.

IRQL DIRQL

DIRQL describes the range of IRQLs that physical devices can generate. Each processor architecture has a range of DIRQLs, as shown in Table 1, "Interrupt Request Levels." The DIRQL for each device instance is available to its driver in the CM_RESOURCE_LIST structure passed by the PnP manager as part of the IRP_MN_START_DEVICE request. The driver, in turn, passes this IRQL to the I/O manager when it calls IoConnectInterrupt to connect its interrupt object. Multiple devices can interrupt at the same DIRQL.

Note

Microsoft has made several enhancements to the interrupt architecture in the next release of Windows, codenamed "Longhorn." For information about these pending changes, see the white paper "Interrupt Architecture Enhancements in Microsoft Windows, Codenamed Longhorn," which is available at http://www.microsoft.com/whdc/hwdev/bus/pci/MSI.mspx.

 

Two types of driver routines run at DIRQL:

 

Function drivers for physical devices usually include these routines; filter and file system drivers never do.

InterruptService routines must run at DIRQL so that they can handle the current interrupt without receiving further interrupts from the interrupt controller. A driver's InterruptService routine should first determine whether its device is the source of the interrupt. If so, the routine stops the device from generating further interrupts, saves any necessary context information, and queues a deferred procedure call (DPC) to run later at DISPATCH_LEVEL. If the interrupt was generated by some other device, the routine should simply return FALSE.

A driver's InterruptService routine runs on the same processor on which its device interrupted; in turn, its DpcForIsr (or CustomDpc) routine runs on the same processor as the InterruptService routine that queued it.

InterruptService routines must follow these important rules:

 

InterruptService routines should perform only the tasks that cannot be deferred until the processor is at a lower IRQL. InterruptService routines that run for longer than a minimal time can reduce performance across the operating system.

SynchCritSection routines also run at DIRQL. A driver uses a SynchCritSection routine to access data that is shared with an InterruptService routine. Like InterruptService routines, SynchCritSection routines should perform only the minimum set of required tasks: accessing hardware registers, writing data that is shared with the InterruptService routine, and so forth. For example, a driver might need a SynchCritSection routine to re-enable device interrupts from its DpcForIsr routine.

A driver cannot call a SynchCritSection routine directly. Instead, the driver calls KeSynchronizeExecution, passing a pointer to the SynchCritSection routine. KeSynchronizeExecution raises the IRQL on the processor to DIRQL for the device, acquires the device's interrupt spin lock, and then starts the routine. Before returning to the caller, KeSynchronizeExecution releases the interrupt spin lock and lowers the IRQL on the current processor to its previous value.

While running at DIRQL, driver code must conform to the guidelines described in the section "Guidelines for Running at IRQL DISPATCH_LEVEL or Higher."

IRQL HIGH_LEVEL

Certain bug-check and non-maskable interrupt (NMI) callback routines run at IRQL HIGH_LEVEL. Because no interrupts can occur at IRQL HIGH_LEVEL, these routines are guaranteed to run without interruption.

The lack of interrupts, however, means that actions of the callback routines are severely restricted. In addition to the restrictions listed in the section "Guidelines for Running at IRQL DISPATCH_LEVEL or Higher," the following rules apply to code that runs at HIGH_LEVEL:

Guidelines for Running at IRQL DISPATCH_LEVEL or Higher

Driver code that runs at IRQL DISPATCH_LEVEL or above must conform to the following guidelines:

Changing the IRQL at which Driver Code Runs

In general, the IRQL at which the operating system calls a driver routine is appropriate for the tasks that such a routine must perform. For example, a DpcForIsr routine usually must call IoStartNextPacket, which in turn calls the driver's StartIo routine. The DpcForIsr, IoStartNextPacket, and StartIo routines must all be called at DISPATCH_LEVEL to ensure that the I/O operations they perform complete without pre-emption by other user threads.

In some situations, however, driver code must raise the IRQL so that the driver can call a routine at DISPATCH_LEVEL. For example, if a driver calls KeGetCurrentProcessorNumber at IRQL< DISPATCH_LEVEL, a processor switch could occur between instructions, thus returning the incorrect value to the caller. Therefore, drivers sometimes must call KeRaiseIrql before calling KeGetCurrentProcessorNumber.

Although a driver can safely raise the IRQL when necessary, a driver must NEVER lower the IRQL without first raising it. Furthermore, a driver routine must never lower the IRQL below the setting at which it was called. Doing so can disrupt operations that the caller of the driver routine was relying on to complete atomically; such a disruption usually causes the system to crash.

Occasionally, however, driver code that runs at IRQL>=DISPATCH_LEVEL must communicate with code at a lower IRQL. For example, a driver might need to issue a synchronous device control request to its device after completion of an I/O operation. IoCompletion routines can be called at IRQL = DISPATCH_LEVEL, but IoBuildDeviceIoControlRequest must be called at PASSIVE_LEVEL. In this situation, the driver should use the IoAllocateWorkItem and IoQueueWorkItem routines (which can be called at IRQL = DISPATCH_LEVEL) to allocate and queue a work item routine that builds and sends the device control request. The work item routine will be called in the context of a system thread at IRQL = PASSIVE_LEVEL.

The following are guidelines for changing IRQL:

Never call KeLowerIrql with an IRQL lower than the IRQL at which you called KeRaiseIrql.

Standard Driver Routines, IRQL, and Thread Context

Table 2 is a list of the standard driver routines, the IRQL at which each routine is called, and the thread context in which the routine runs. In addition to the routines that are listed here, there are many device-type-specific driver routines that are called at DISPATCH_LEVEL. For details, see the device-specific documentation in the Windows DDK.

Table 2. IRQL and Thread Context for Standard Driver Routines

Routine

Caller's IRQL

Thread context

AdapterControl

DISPATCH_LEVEL

Arbitrary

AdapterListControl

DISPATCH_LEVEL

Arbitrary

AddDevice

PASSIVE_LEVEL

System

BugCheckCallback

HIGH_LEVEL

Arbitrary: depends on state of operating system when the bug check occurred

BugCheckDumpIoCallback

HIGH_LEVEL

Arbitrary: depends on state of operating system when the bug check occurred

BugCheckSecondaryDumpDataCallback

HIGH_LEVEL

Arbitrary: depends on state of operating system when the bug check occurred

Cancel

DISPATCH_LEVEL

Arbitrary

ControllerControl

DISPATCH_LEVEL

Arbitrary

CsqAcquireLock

IRQL of the routine that called IoCsqXxx. Usually <= DISPATCH_LEVEL

Arbitrary

CsqCompleteCanceledIrp

<= DISPATCH_LEVEL

Arbitrary

CsqInsertIrp

IRQL of the lock acquired by CsqAcquireLock. Usually <= DISPATCH_LEVEL

Arbitrary

CsqInsertIrpEx

IRQL of the lock acquired by CsqAcquireLock. Usually <= DISPATCH_LEVEL

Arbitrary

CsqPeekNextIrp

IRQL of the lock acquired by CsqAcquireLock. Usually <= DISPATCH_LEVEL

Arbitrary

CsqReleaseLock

IRQL of the lock acquired by CsqAcquireLock. Usually <= DISPATCH_LEVEL

Arbitrary

CsqRemoveIrp

IRQL of the lock acquired by CsqAcquireLock. Usually <= DISPATCH_LEVEL

Arbitrary

CustomDpc

DISPATCH_LEVEL

Arbitrary

CustomTimerDpc

DISPATCH_LEVEL

Arbitrary

DispatchCleanup

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DispatchClose (for FSD, FS filters, and other highest-level drivers)

APC_LEVEL

Arbitrary

DispatchClose (for all other drivers)

PASSIVE_LEVEL

Arbitrary

DispatchCreate

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DispatchCreateClose

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DispatchDeviceControl (for devices not in paging path)

PASSIVE_LEVEL

Non-arbitrary for FSD and FS filters; arbitrary for other drivers

DispatchDeviceControl (for devices in paging path)

<= DISPATCH_LEVEL

Arbitrary

DispatchFlushBuffers

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DispatchInternalDeviceControl

Depends on the device type, but always <= DISPATCH_LEVEL

Arbitrary

DispatchPnp

PASSIVE_LEVEL

Arbitrary

DispatchPower (if the DO_POWER_PAGABLE flag is not set in the device object)

<= DISPATCH_LEVEL

Arbitrary

DispatchPower (if the DO_POWER_PAGABLE flag is set in the device object)

PASSIVE_LEVEL

Arbitrary

DispatchQueryInformation

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DispatchRead

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DispatchRead (for devices in paging path)

APC_LEVEL

Arbitrary

DispatchRead and DispatchWrite routines of drivers in the storage stack

<= DISPATCH_LEVEL

Arbitrary

DispatchReadWrite (for devices not in paging path)

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DispatchReadWrite (for devices in paging path)

APC_LEVEL

Arbitrary

DispatchSetInformation

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DispatchShutdown

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DispatchSystemControl

PASSIVE_LEVEL

Arbitrary

DispatchWrite (for devices in paging path)

APC_LEVEL

Arbitrary

DispatchWrite (for devices not in paging path)

PASSIVE_LEVEL

Non-arbitrary for FSD, FS filter, and other highest-level drivers; arbitrary for other drivers

DllInitialize

PASSIVE_LEVEL

System or arbitrary

DllUnload

PASSIVE_LEVEL

Arbitrary

DpcForIsr

DISPATCH_LEVEL

Arbitrary

DriverEntry

PASSIVE_LEVEL

System

InterruptService

DIRQL for the associated interrupt object

Arbitrary

IoCompletion

<= DISPATCH_LEVEL

Arbitrary

IoTimer

DISPATCH_LEVEL

Arbitrary

Reinitialize

PASSIVE_LEVEL

System

StartIo

DISPATCH_LEVEL

Arbitrary

SynchCritSection

DIRQL for the associated interrupt object

Arbitrary

Unload

PASSIVE_LEVEL

System

Interrupting a Thread: Examples

Some simple examples can show what happens when a thread is pre-empted or interrupted. This section presents a single-processor example and a dual-processor example.

Single-Processor Example

Figure 1 shows a hypothetical example of thread interruption on a single-processor system. For the sake of simplicity, the example omits the thread scheduler, clock interrupts, and so forth.

Figure 1. Thread Interruption on a Single-Processor System

The figure shows how thread execution proceeds over time, as follows:

There is no guarantee that Thread A will exhaust its quantum. A thread can be interrupted or pre-empted any number of times during its quantum.

Multiprocessor Example

Figure 2 shows a hypothetical example of thread interruption on a multiprocessor system. This example omits clock interrupts and so forth, but it shows the system's thread scheduler.

Figure 2. Thread Interruption on a Multiprocessor System

The figure shows thread execution on two processors, starting at the same time, as follows:

  1. Processor 0 is running Thread A, while Processor 1 is running Thread B. Both threads run at IRQL=PASSIVE_LEVEL.
  2. Device 1 interrupts on Processor 0, so the system raises IRQL on Processor 0 to DIRQL for Device 1 and runs the Device 1 InterruptService (ISR) routine. The Device 1 InterruptService routine queues a DpcForIsr routine. By default, the DpcForIsr routine is added to the queue for the same processor (Processor 0) on which the InterruptService routine is running.
  3. Because the DpcForIsr routine runs on the same processor as the InterruptService routine, but at a lower IRQL, it does not start until after the InterruptService routine exits.
  4. Device 1 interrupts again—this time on Processor 1—so the system raises IRQL on Processor 1 to DIRQL for Device 1 and runs the Device 1 InterruptService (ISR) routine on Processor 1. The Device 1 InterruptService routine queues a DpcForIsr routine to Processor 1 (the default behavior). The DpcForIsr routine starts on Processor 1 after the InterruptService routine exits.
  5. The same DpcForIsr routine is now running on two processors at the same time, in response to two different interrupts from the same device. The driver must use spin locks to protect shared memory that the routines might access. In this case, the DpcForIsr routine on Processor 0 acquires the lock first, so Processor 1 spins while waiting for the lock.
  6. The DpcForIsr routine on Processor 0 releases the lock, completes its work, and exits. The system now runs the next ready thread on Processor 0—in this case, Thread A.
  7. After the DpcForIsr routine on Processor 0 releases the lock, the DpcForIsr routine on Processor 1 acquires the lock and performs its tasks. After it exits, the system's thread scheduling code runs to determine which thread to run next.
  8. When the thread scheduler exits, no additional DPCs have been queued, so the highest-priority ready thread (Thread C) runs on Processor 1.