当前位置: 首页 > 文档资料 > Linux 内核揭密 >

Interrupts and Interrupt Handling. Part 2.

优质
小牛编辑
129浏览
2023-12-01

Start to dive into interrupt and exceptions handling in the Linux kernel

We saw some theory about interrupts and exception handling in the previous part and as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. As you already can note, the previous part mostly described theoretical aspects and in this part we will start to dive directly into the Linux kernel source code. We will start to do it as we did it in other chapters, from the very early places. We will not see the Linux kernel source code from the earliest code lines as we saw it for example in the Linux kernel booting process chapter, but we will start from the earliest code which is related to the interrupts and exceptions. In this part we will try to go through the all interrupts and exceptions related stuff which we can find in the Linux kernel source code.

If you've read the previous parts, you can remember that the earliest place in the Linux kernel x86_64 architecture-specific source code which is related to the interrupt is located in the arch/x86/boot/pm.c source code file and represents the first setup of the Interrupt Descriptor Table. It occurs right before the transition into the protected mode in the go_to_protected_mode function by the call of the setup_idt:

void go_to_protected_mode(void)
{
	...
	setup_idt();
	...
}

The setup_idt function is defined in the same source code file as the go_to_protected_mode function and just loads the address of the NULL interrupts descriptor table:

static void setup_idt(void)
{
	static const struct gdt_ptr null_idt = {0, 0};
	asm volatile("lidtl %0" : : "m" (null_idt));
}

where gdt_ptr represents a special 48-bit GDTR register which must contain the base address of the Global Descriptor Table:

struct gdt_ptr {
	u16 len;
	u32 ptr;
} __attribute__((packed));

Of course in our case the gdt_ptr does not represent the GDTR register, but IDTR since we set Interrupt Descriptor Table. You will not find an idt_ptr structure, because if it had been in the Linux kernel source code, it would have been the same as gdt_ptr but with different name. So, as you can understand there is no sense to have two similar structures which differ only by name. You can note here, that we do not fill the Interrupt Descriptor Table with entries, because it is too early to handle any interrupts or exceptions at this point. That's why we just fill the IDT with NULL.

After the setup of the Interrupt descriptor table, Global Descriptor Table and other stuff we jump into protected mode in the - arch/x86/boot/pmjump.S. You can read more about it in the part which describes the transition to protected mode.

We already know from the earliest parts that entry to protected mode is located in the boot_params.hdr.code32_start and you can see that we pass the entry of the protected mode and boot_params to the protected_mode_jump in the end of the arch/x86/boot/pm.c:

protected_mode_jump(boot_params.hdr.code32_start,
			    (u32)&boot_params + (ds() << 4));

The protected_mode_jump is defined in the arch/x86/boot/pmjump.S and gets these two parameters in the ax and dx registers using one of the 8086 calling conventions:

GLOBAL(protected_mode_jump)
	...
	...
	...
	.byte	0x66, 0xea		# ljmpl opcode
2:	.long	in_pm32			# offset
	.word	__BOOT_CS		# segment
...
...
...
ENDPROC(protected_mode_jump)

where in_pm32 contains a jump to the 32-bit entry point:

GLOBAL(in_pm32)
	...
	...
	jmpl	*%eax // %eax contains address of the `startup_32`
	...
	...
ENDPROC(in_pm32)

As you can remember the 32-bit entry point is in the arch/x86/boot/compressed/head_64.S assembly file, although it contains _64 in its name. We can see the two similar files in the arch/x86/boot/compressed directory:

  • arch/x86/boot/compressed/head_32.S.
  • arch/x86/boot/compressed/head_64.S;

But the 32-bit mode entry point is the second file in our case. The first file is not even compiled for x86_64. Let's look at the arch/x86/boot/compressed/Makefile:

vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
...
...

We can see here that head_* depends on the $(BITS) variable which depends on the architecture. You can find it in the arch/x86/Makefile:

ifeq ($(CONFIG_X86_32),y)
...
	BITS := 32
else
	BITS := 64
	...
endif

Now as we jumped on the startup_32 from the arch/x86/boot/compressed/head_64.S we will not find anything related to the interrupt handling here. The startup_32 contains code that makes preparations before the transition into long mode and directly jumps in to it. The long mode entry is located in startup_64 and it makes preparations before the kernel decompression that occurs in the decompress_kernel from the arch/x86/boot/compressed/misc.c. After the kernel is decompressed, we jump on the startup_64 from the arch/x86/kernel/head_64.S. In the startup_64 we start to build identity-mapped pages. After we have built identity-mapped pages, checked the NX bit, setup the Extended Feature Enable Register (see in links), and updated the early Global Descriptor Table with the lgdt instruction, we need to setup gs register with the following code:

movl	$MSR_GS_BASE,%ecx
movl	initial_gs(%rip),%eax
movl	initial_gs+4(%rip),%edx
wrmsr

We already saw this code in the previous part. First of all pay attention on the last wrmsr instruction. This instruction writes data from the edx:eax registers to the model specific register specified by the ecx register. We can see that ecx contains $MSR_GS_BASE which is declared in the arch/x86/include/uapi/asm/msr-index.h and looks like:

#define MSR_GS_BASE             0xc0000101

From this we can understand that MSR_GS_BASE defines the number of the model specific register. Since registers cs, ds, es, and ss are not used in the 64-bit mode, their fields are ignored. But we can access memory over fs and gs registers. The model specific register provides a back door to the hidden parts of these segment registers and allows to use 64-bit base address for segment register addressed by the fs and gs. So the MSR_GS_BASE is the hidden part and this part is mapped on the GS.base field. Let's look on the initial_gs:

GLOBAL(initial_gs)
	.quad	INIT_PER_CPU_VAR(irq_stack_union)

We pass irq_stack_union symbol to the INIT_PER_CPU_VAR macro which just concatenates the init_per_cpu__ prefix with the given symbol. In our case we will get the init_per_cpu__irq_stack_union symbol. Let's look at the linker script. There we can see following definition:

#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(irq_stack_union);

It tells us that the address of the init_per_cpu__irq_stack_union will be irq_stack_union + __per_cpu_load. Now we need to understand where init_per_cpu__irq_stack_union and __per_cpu_load are what they mean. The first irq_stack_union is defined in the arch/x86/include/asm/processor.h with the DECLARE_INIT_PER_CPU macro which expands to call the init_per_cpu_var macro:

DECLARE_INIT_PER_CPU(irq_stack_union);

#define DECLARE_INIT_PER_CPU(var) \
       extern typeof(per_cpu_var(var)) init_per_cpu_var(var)

#define init_per_cpu_var(var)  init_per_cpu__##var

If we expand all macros we will get the same init_per_cpu__irq_stack_union as we got after expanding the INIT_PER_CPU macro, but you can note that it is not just a symbol, but a variable. Let's look at the typeof(per_cpu_var(var)) expression. Our var is irq_stack_union and the per_cpu_var macro is defined in the arch/x86/include/asm/percpu.h:

#define PER_CPU_VAR(var)        %__percpu_seg:var

where:

#ifdef CONFIG_X86_64
    #define __percpu_seg gs
endif

So, we are accessing gs:irq_stack_union and getting its type which is irq_union. Ok, we defined the first variable and know its address, now let's look at the second __per_cpu_load symbol. There are a couple of per-cpu variables which are located after this symbol. The __per_cpu_load is defined in the include/asm-generic/sections.h:

extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];

and presented base address of the per-cpu variables from the data area. So, we know the address of the irq_stack_union, __per_cpu_load and we know that init_per_cpu__irq_stack_union must be placed right after __per_cpu_load. And we can see it in the System.map:

...
...
...
ffffffff819ed000 D __init_begin
ffffffff819ed000 D __per_cpu_load
ffffffff819ed000 A init_per_cpu__irq_stack_union
...
...
...

Now we know about initial_gs, so let's look at the code:

movl	$MSR_GS_BASE,%ecx
movl	initial_gs(%rip),%eax
movl	initial_gs+4(%rip),%edx
wrmsr

Here we specified a model specific register with MSR_GS_BASE, put the 64-bit address of the initial_gs to the edx:eax pair and execute the wrmsr instruction for filling the gs register with the base address of the init_per_cpu__irq_stack_union which will be at the bottom of the interrupt stack. After this we will jump to the C code on the x86_64_start_kernel from the arch/x86/kernel/head64.c. In the x86_64_start_kernel function we do the last preparations before we jump into the generic and architecture-independent kernel code and one of these preparations is filling the early Interrupt Descriptor Table with the interrupts handlers entries or early_idt_handlers. You can remember it, if you have read the part about the Early interrupt and exception handling and can remember following code:

for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handlers[i]);

load_idt((const struct desc_ptr *)&idt_descr);

but I wrote Early interrupt and exception handling part when Linux kernel version was - 3.18. For this day actual version of the Linux kernel is 4.1.0-rc6+ and Andy Lutomirski sent the patch and soon it will be in the mainline kernel that changes behaviour for the early_idt_handlers. NOTE While I wrote this part the patch already turned in the Linux kernel source code. Let's look on it. Now the same part looks like:

for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handler_array[i]);

load_idt((const struct desc_ptr *)&idt_descr);

AS you can see it has only one difference in the name of the array of the interrupts handlers entry points. Now it is early_idt_handler_arry:

extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];

where NUM_EXCEPTION_VECTORS and EARLY_IDT_HANDLER_SIZE are defined as:

#define NUM_EXCEPTION_VECTORS 32
#define EARLY_IDT_HANDLER_SIZE 9

So, the early_idt_handler_array is an array of the interrupts handlers entry points and contains one entry point on every nine bytes. You can remember that previous early_idt_handlers was defined in the arch/x86/kernel/head_64.S. The early_idt_handler_array is defined in the same source code file too:

ENTRY(early_idt_handler_array)
...
...
...
ENDPROC(early_idt_handler_common)

It fills early_idt_handler_arry with the .rept NUM_EXCEPTION_VECTORS and contains entry of the early_make_pgtable interrupt handler (more about its implementation you can read in the part about Early interrupt and exception handling). For now we come to the end of the x86_64 architecture-specific code and the next part is the generic kernel code. Of course you already can know that we will return to the architecture-specific code in the setup_arch function and other places, but this is the end of the x86_64 early code.

Setting stack canary for the interrupt stack

The next stop after the arch/x86/kernel/head_64.S is the biggest start_kernel function from the init/main.c. If you've read the previous chapter about the Linux kernel initialization process, you must remember it. This function does all initialization stuff before kernel will launch first init process with the pid - 1. The first thing that is related to the interrupts and exceptions handling is the call of the boot_init_stack_canary function.

This function sets the canary value to protect interrupt stack overflow. We already saw a little some details about implementation of the boot_init_stack_canary in the previous part and now let's take a closer look on it. You can find implementation of this function in the arch/x86/include/asm/stackprotector.h and its depends on the CONFIG_CC_STACKPROTECTOR kernel configuration option. If this option is not set this function will not do anything:

#ifdef CONFIG_CC_STACKPROTECTOR
...
...
...
#else
static inline void boot_init_stack_canary(void)
{
}
#endif

If the CONFIG_CC_STACKPROTECTOR kernel configuration option is set, the boot_init_stack_canary function starts from the check stat irq_stack_union that represents per-cpu interrupt stack has offset equal to forty bytes from the stack_canary value:

#ifdef CONFIG_X86_64
        BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif

As we can read in the previous part the irq_stack_union represented by the following union:

union irq_stack_union {
	char irq_stack[IRQ_STACK_SIZE];

    struct {
		char gs_base[40];
		unsigned long stack_canary;
	};
};

which defined in the arch/x86/include/asm/processor.h. We know that union in the C programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - gs_base which is 40 bytes size and represents bottom of the irq_stack. So, after this our check with the BUILD_BUG_ON macro should end successfully. (you can read the first part about Linux kernel initialization process if you're interesting about the BUILD_BUG_ON macro).

After this we calculate new canary value based on the random number and Time Stamp Counter:

get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);

and write canary value to the irq_stack_union with the this_cpu_write macro:

this_cpu_write(irq_stack_union.stack_canary, canary);

more about this_cpu_* operation you can read in the Linux kernel documentation.

Disabling/Enabling local interrupts

The next step in the init/main.c which is related to the interrupts and interrupts handling after we have set the canary value to the interrupt stack - is the call of the local_irq_disable macro.

This macro defined in the include/linux/irqflags.h header file and as you can understand, we can disable interrupts for the CPU with the call of this macro. Let's look on its implementation. First of all note that it depends on the CONFIG_TRACE_IRQFLAGS_SUPPORT kernel configuration option:

#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
...
#define local_irq_disable() \
         do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
...
#else
...
#define local_irq_disable()     do { raw_local_irq_disable(); } while (0)
...
#endif

They are both similar and as you can see have only one difference: the local_irq_disable macro contains call of the trace_hardirqs_off when CONFIG_TRACE_IRQFLAGS_SUPPORT is enabled. There is special feature in the lockdep subsystem - irq-flags tracing for tracing hardirq and softirq state. In our case lockdep subsystem can give us interesting information about hard/soft irqs on/off events which are occurs in the system. The trace_hardirqs_off function defined in the kernel/locking/lockdep.c:

void trace_hardirqs_off(void)
{
         trace_hardirqs_off_caller(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_off);

and just calls trace_hardirqs_off_caller function. The trace_hardirqs_off_caller checks the hardirqs_enabled field of the current process and increases the redundant_hardirqs_off if call of the local_irq_disable was redundant or the hardirqs_off_events if it was not. These two fields and other lockdep statistic related fields are defined in the kernel/locking/lockdep_insides.h and located in the lockdep_stats structure:

struct lockdep_stats {
...
...
...
int     softirqs_off_events;
int     redundant_softirqs_off;
...
...
...
}

If you will set CONFIG_DEBUG_LOCKDEP kernel configuration option, the lockdep_stats_debug_show function will write all tracing information to the /proc/lockdep:

static void lockdep_stats_debug_show(struct seq_file *m)
{
#ifdef CONFIG_DEBUG_LOCKDEP
	unsigned long long hi1 = debug_atomic_read(hardirqs_on_events),
	                         hi2 = debug_atomic_read(hardirqs_off_events),
							 hr1 = debug_atomic_read(redundant_hardirqs_on),
    ...
	...
	...
    seq_printf(m, " hardirq on events:             %11llu\n", hi1);
    seq_printf(m, " hardirq off events:            %11llu\n", hi2);
    seq_printf(m, " redundant hardirq ons:         %11llu\n", hr1);
#endif
}

and you can see its result with the:

$ sudo cat /proc/lockdep
 hardirq on events:             12838248974
 hardirq off events:            12838248979
 redundant hardirq ons:               67792
 redundant hardirq offs:         3836339146
 softirq on events:                38002159
 softirq off events:               38002187
 redundant softirq ons:                   0
 redundant softirq offs:                  0

Ok, now we know a little about tracing, but more info will be in the separate part about lockdep and tracing. You can see that the both local_disable_irq macros have the same part - raw_local_irq_disable. This macro defined in the arch/x86/include/asm/irqflags.h and expands to the call of the:

static inline void native_irq_disable(void)
{
        asm volatile("cli": : :"memory");
}

And you already must remember that cli instruction clears the IF flag which determines ability of a processor to handle an interrupt or an exception. Besides the local_irq_disable, as you already can know there is an inverse macro - local_irq_enable. This macro has the same tracing mechanism and very similar on the local_irq_enable, but as you can understand from its name, it enables interrupts with the sti instruction:

static inline void native_irq_enable(void)
{
        asm volatile("sti": : :"memory");
}

Now we know how local_irq_disable and local_irq_enable work. It was the first call of the local_irq_disable macro, but we will meet these macros many times in the Linux kernel source code. But for now we are in the start_kernel function from the init/main.c and we just disabled local interrupts. Why local and why we did it? Previously kernel provided a method to disable interrupts on all processors and it was called cli. This function was removed and now we have local_irq_{enabled,disable} to disable or enable interrupts on the current processor. After we've disabled the interrupts with the local_irq_disable macro, we set the:

early_boot_irqs_disabled = true;

The early_boot_irqs_disabled variable defined in the include/linux/kernel.h:

extern bool early_boot_irqs_disabled;

and used in the different places. For example it used in the smp_call_function_many function from the kernel/smp.c for the checking possible deadlock when interrupts are disabled:

WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
                     && !oops_in_progress && !early_boot_irqs_disabled);

Early trap initialization during kernel initialization

The next functions after the local_disable_irq are boot_cpu_init and page_address_init, but they are not related to the interrupts and exceptions (more about this functions you can read in the chapter about Linux kernel initialization process). The next is the setup_arch function. As you can remember this function located in the arch/x86/kernel/setup.c source code file and makes initialization of many different architecture-dependent stuff. The first interrupts related function which we can see in the setup_arch is the - early_trap_init function. This function defined in the arch/x86/kernel/traps.c and fills Interrupt Descriptor Table with the couple of entries:

void __init early_trap_init(void)
{
        set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
        set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
#ifdef CONFIG_X86_32
        set_intr_gate(X86_TRAP_PF, page_fault);
#endif
        load_idt(&idt_descr);
}

Here we can see calls of three different functions:

  • set_intr_gate_ist
  • set_system_intr_gate_ist
  • set_intr_gate

All of these functions defined in the arch/x86/include/asm/desc.h and do the similar thing but not the same. The first set_intr_gate_ist function inserts new an interrupt gate in the IDT. Let's look on its implementation:

static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
{
        BUG_ON((unsigned)n > 0xFF);
        _set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
}

First of all we can see the check that n which is vector number of the interrupt is not greater than 0xff or 255. We need to check it because we remember from the previous part that vector number of an interrupt must be between 0 and 255. In the next step we can see the call of the _set_gate function that sets a given interrupt gate to the IDT table:

static inline void _set_gate(int gate, unsigned type, void *addr,
                             unsigned dpl, unsigned ist, unsigned seg)
{
        gate_desc s;

        pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
        write_idt_entry(idt_table, gate, &s);
        write_trace_idt_entry(gate, &s);
}

Here we start from the pack_gate function which takes clean IDT entry represented by the gate_desc structure and fills it with the base address and limit, Interrupt Stack Table, Privilege level, type of an interrupt which can be one of the following values:

  • GATE_INTERRUPT
  • GATE_TRAP
  • GATE_CALL
  • GATE_TASK

and set the present bit for the given IDT entry:

static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                             unsigned dpl, unsigned ist, unsigned seg)
{
        gate->offset_low        = PTR_LOW(func);
        gate->segment           = __KERNEL_CS;
        gate->ist               = ist;
        gate->p                 = 1;
        gate->dpl               = dpl;
        gate->zero0             = 0;
        gate->zero1             = 0;
        gate->type              = type;
        gate->offset_middle     = PTR_MIDDLE(func);
        gate->offset_high       = PTR_HIGH(func);
}

After this we write just filled interrupt gate to the IDT with the write_idt_entry macro which expands to the native_write_idt_entry and just copy the interrupt gate to the idt_table table by the given index:

#define write_idt_entry(dt, entry, g)           native_write_idt_entry(dt, entry, g)

static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
        memcpy(&idt[entry], gate, sizeof(*gate));
}

where idt_table is just array of gate_desc:

extern gate_desc idt_table[];

That's all. The second set_system_intr_gate_ist function has only one difference from the set_intr_gate_ist:

static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
{
        BUG_ON((unsigned)n > 0xFF);
        _set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
}

Do you see it? Look on the fourth parameter of the _set_gate. It is 0x3. In the set_intr_gate it was 0x0. We know that this parameter represent DPL or privilege level. We also know that 0 is the highest privilege level and 3 is the lowest.Now we know how set_system_intr_gate_ist, set_intr_gate_ist, set_intr_gate are work and we can return to the early_trap_init function. Let's look on it again:

set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);

We set two IDT entries for the #DB interrupt and int3. These functions takes the same set of parameters:

  • vector number of an interrupt;
  • address of an interrupt handler;
  • interrupt stack table index.

That's all. More about interrupts and handlers you will know in the next parts.

Conclusion

It is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw the some theory in the previous part and started to dive into interrupts and exceptions handling in the current part. We have started from the earliest parts in the Linux kernel source code which are related to the interrupts. In the next part we will continue to dive into this interesting theme and will know more about interrupt handling process.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to linux-insides.

Links

  • IDT
  • Protected mode
  • List of x86 calling conventions
  • 8086
  • Long mode
  • NX
  • Extended Feature Enable Register
  • Model-specific register
  • Process identifier
  • lockdep
  • irqflags tracing
  • IF
  • Stack canary
  • Union type
  • this_cpu_* operations
  • vector number
  • Interrupt Stack Table
  • Privilege level
  • Previous part