UTLK - Understanding the Linux Kernel
The kernel version covered is 2.6.11.
Chapter 1 Introduction
Chapter 1.1 Linux Versus Other Unix-Like Kernels
The kernel is preemptive.
Chapter 1.6 An Overview of Unix Kernels
Synchronization and Critical Regions
Kernel preemption disabling
Because the kernel supports preemption, preemption must be disabled before entering a critical region and re-enabled when leaving it. Spin locks in the kernel disable preemption but do not disable interrupts.
Interrupt disabling
On a single CPU, disabling interrupts is an effective protection, though it hurts performance. On multiprocessor systems it is not sufficient.
Spin locks
Chapter 3. Processes
3.1 Processes
3.2 Process Descriptor
A process is described by a task_struct.
Process State
- TASK_RUNNING
- TASK_INTERRUPTIBLE
- TASK_UNINTERRUPTIBLE: rarely used; for example while a driver is probing a device, because interrupting it could leave the device in an unpredictable state.
- TASK_STOPPED
- TASK_ZOMBIE // the process has terminated, but the parent has not called wait() yet, so the kernel cannot reclaim the task_struct and related resources
- EXIT_DEAD // entered after TASK_ZOMBIE once the parent calls wait()
Identifying a Process
Because Linux implements threads as lightweight processes, the threads of one program form a thread group, and the PID of the group leader serves as the PID of all of them: getpid() actually returns current->tgid rather than current->pid.
Process descriptors handling
thread_info and the kernel stack share 8 KB of memory (two pages).
Identifying the current process
current_thread_info() expands to:
movl $0xffffe000, %ecx
andl %esp, %ecx
movl %ecx, p
The current macro expands to:
current_thread_info( )->task
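The same computation in C, a minimal sketch of the 2.6-era x86 inline helper, assuming 8 KB kernel stacks (THREAD_SIZE == 8192, so ~(THREAD_SIZE - 1) is the 0xffffe000 mask above):
/* thread_info sits at the bottom of the two-page kernel stack, so clearing
 * the low 13 bits of %esp yields its address. */
static inline struct thread_info *current_thread_info(void)
{
    struct thread_info *ti;
    __asm__("andl %%esp, %0" : "=r" (ti) : "0" (~(THREAD_SIZE - 1)));
    return ti;
}
#define current (current_thread_info()->task)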
The lists of TASK_RUNNING processes
Linux 2.6 reworked the runqueue: instead of one long list, tasks are split into many lists, one per priority, and each CPU has its own runqueue.
void enqueue_task(struct task_struct *p, struct prio_array *array)
{
    sched_info_queued(p);
    list_add_tail(&p->run_list, array->queue + p->prio); /* append to the per-priority list */
    __set_bit(p->prio, array->bitmap);                   /* mark this priority as non-empty */
    array->nr_active++;
    p->array = array;
}
The code of dequeue_task() is similar; a sketch follows.
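For reference, the 2.6-era dequeue_task() looks roughly like this (a sketch, mirroring the enqueue path above):
static void dequeue_task(struct task_struct *p, struct prio_array *array)
{
    array->nr_active--;
    list_del(&p->run_list);                      /* unlink from the per-priority list */
    if (list_empty(array->queue + p->prio))      /* last task at this priority? */
        __clear_bit(p->prio, array->bitmap);     /* then clear the priority bit */
}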
Relationships Among Processes
The pidhash table and chained lists
The simplified view (figure omitted).
The realistic view (figure omitted): the entries chained through pid_list all carry the same nr, because they belong to the same thread group, and that nr is their PID.
- do_each_task_pid(nr, type, task)
- while_each_task_pid(nr, type, task)
- find_task_by_pid_type(type, nr)
- find_task_by_pid(nr)
How Processes Are Organized
- Processes in the TASK_STOPPED, EXIT_ZOMBIE, or EXIT_DEAD state are not kept on any list.
- TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE processes each sit on a wait queue of their own.
Wait queues
The queue head:
struct __wait_queue_head {
wq_lock_t lock;
struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;
The queue element:
struct __wait_queue {
unsigned int flags;
#define WQ_FLAG_EXCLUSIVE 0x01
void *private;
wait_queue_func_t func;
struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;
There are two kinds of sleepers on a wait queue:
- exclusive processes: flags is 1, inserted with add_wait_queue_exclusive(); only one of them is woken up at a time
- nonexclusive processes: flags is 0, inserted with add_wait_queue(); all of them are woken up
Handling wait queues
- init_waitqueue_head()
- init_waitqueue_entry()
- sleep_on()
- interruptible_sleep_on()
- interruptible_sleep_on_timeout()
void sleep_on(wait_queue_head_t *q)
{
wait_queue_t wait;
init_waitqueue_entry(&wait, current);
current->state = TASK_UNINTERRUPTIBLE;
add_wait_queue(q, &wait); // insert into the wait queue
schedule( ); // go to sleep
remove_wait_queue(q, &wait); // remove from the wait queue after waking up
}
wake_up()=>__wake_up()=>__wake_up_common()
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
int nr_exclusive, int sync, void *key)
{
struct list_head *tmp, *next;
list_for_each_safe(tmp, next, &q->task_list) {
wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
unsigned flags = curr->flags;
if (curr->func(curr, mode, sync, key) &&
(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
break;
}
}
Here curr->func defaults to try_to_wake_up(), whose main job is to set the process state back to TASK_RUNNING.
Wait-related macros (sleep_on() is essentially obsolete; a usage sketch follows the wake macros below):
- wait_event(wq, condition)
- wait_event_interruptible(wq, condition)
Wake-related macros:
- wake_up
- wake_up_nr: wakes up a given number of threads
- wake_up_all: wakes up all of them
- wake_up_interruptible
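A hedged usage sketch of these macros; the names my_queue, data_ready, consumer and producer are made up for illustration:
#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_queue);
static int data_ready;

/* consumer: sleep in TASK_INTERRUPTIBLE until data_ready becomes nonzero */
static int consumer(void)
{
    if (wait_event_interruptible(my_queue, data_ready != 0))
        return -ERESTATSYS == -ERESTARTSYS ? -ERESTARTSYS : -ERESTARTSYS; /* woken by a signal */
    data_ready = 0;
    return 0;
}

/* producer: publish the condition first, then wake the sleepers */
static void producer(void)
{
    data_ready = 1;
    wake_up_interruptible(&my_queue);
}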
Process Resource Limits
System calls: getrlimit() and setrlimit().
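A minimal user-space sketch of the two calls, raising the soft open-file limit up to the hard limit:
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("getrlimit");
    printf("soft=%lu hard=%lu\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;              /* the soft limit may be raised up to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
    return 0;
}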
Process Switch
Hardware context switch
Older Linux versions performed the switch in hardware, via a far jump.
Linux 2.4 uses a software switch, i.e. registers are saved and restored by ordinary instructions.
TSS segment: on a transition from User Mode to Kernel Mode, the CPU reads the TSS to obtain the kernel stack address (the esp0/ss0 fields) for the new privilege level.
thread_struct // some of the process's hardware registers are saved here
tss_struct // the in-memory counterpart of the TSS segment
schedule
The local variables prev and next point to the process being switched out and the process being switched in, respectively.
The switch_to macro reaches __switch_to() through a jmp rather than a call, so when __switch_to() executes ret, control continues at next->thread.eip; prev->thread.eip was set to the address of label 1.
3.4 Creating Processes
Three system calls create processes:
- clone(): creates a lightweight process
- fork()
- vfork()
All of them end up calling do_fork(). A user-space sketch of clone() follows.
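A hedged user-space sketch of the difference: clone() with CLONE_VM/CLONE_FS/CLONE_FILES creates a lightweight process sharing those resources, while fork() is roughly clone() with none of these flags. child_fn and child_stack are made-up names for illustration:
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[8192];          /* the new lightweight process needs its own stack */

static int child_fn(void *arg)
{
    printf("lightweight process, pid=%d\n", getpid());
    return 0;
}

int main(void)
{
    /* share address space, filesystem info and open files with the parent */
    pid_t pid = clone(child_fn, child_stack + sizeof(child_stack),
                      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, NULL);

    if (pid == -1) {
        perror("clone");
        return 1;
    }
    waitpid(pid, NULL, 0);              /* both paths end up in do_fork() inside the kernel */
    return 0;
}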
user_struct records per-user data.
Chapter 4. Interrupts and Exceptions
4.8 Work Queues (differs from the 2nd edition)
Work queues appeared in 2.6 to replace the task queues of 2.4. A work queue function runs inside a kernel thread, so it has no user-mode address space to access.
/*
* The per-CPU workqueue (if single thread, we always use the first
* possible cpu).
*
* The sequence counters are for flush_scheduled_work(). It wants to wait
* until all currently-scheduled works are completed, but it doesn't
* want to be livelocked by new, incoming ones. So it waits until
* remove_sequence is >= the insert_sequence which pertained when
* flush_scheduled_work() was called.
*/
struct cpu_workqueue_struct {
spinlock_t lock;
long remove_sequence; /* Least-recently added (next to run) */
long insert_sequence; /* Next to add */
struct list_head worklist;
wait_queue_head_t more_work;
wait_queue_head_t work_done;
struct workqueue_struct *wq;
struct task_struct *thread;
int run_depth; /* Detect run_workqueue() recursion depth */
} ____cacheline_aligned;
/*
* The externally visible workqueue abstraction is an array of
* per-CPU workqueues:
*/
struct workqueue_struct {
struct cpu_workqueue_struct *cpu_wq;
const char *name;
struct list_head list; /* Empty if single thread */
};
### Work queue functions
- create_workqueue("foo"): creates a work queue named foo and spawns n worker threads, where n is the number of CPUs.
- create_singlethread_workqueue(): similar, but creates only one worker thread.
- destroy_workqueue(): destroys a work queue.
- queue_work(): inserts a work_struct into the queue.
- flush_workqueue(): takes the workqueue_struct as its argument and blocks the current process until every function pending on that work queue has finished.
The actual execution happens in worker_thread() => run_workqueue(). A usage sketch of this interface follows.
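A hedged usage sketch tying these calls together (2.6-era API, where INIT_WORK takes three arguments and the handler receives a void * data pointer; my_wq, my_work and my_handler are made-up names):
#include <linux/init.h>
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *my_wq;
static struct work_struct my_work;

static void my_handler(void *data)
{
    printk(KERN_INFO "work ran, data=%p\n", data);
}

static int __init my_init(void)
{
    my_wq = create_workqueue("foo");        /* one worker thread per CPU */
    if (!my_wq)
        return -ENOMEM;
    INIT_WORK(&my_work, my_handler, NULL);  /* 3-argument form of the 2.6.11 API */
    queue_work(my_wq, &my_work);            /* hand the work_struct to a worker thread */
    flush_workqueue(my_wq);                 /* block until everything queued so far has run */
    return 0;
}

static void __exit my_exit(void)
{
    destroy_workqueue(my_wq);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");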
The predefined work queue
The predefined work queue is called events.
4.9 Returning from Interrupts and Exceptions (differs from the 2nd edition)
Returning from an interrupt or exception goes through ret_from_intr() or ret_from_exception().
The entry points
The interrupt-return path already runs with local interrupts disabled, so only the exception-return path has to disable them explicitly. The kernel loads the thread_info pointer into ebp. By looking at what was pushed on the stack when the interrupt or exception was entered, the code decides whether it is returning to Kernel Mode or to User Mode.
Resuming a kernel control path
If kernel preemption is enabled, the code checks whether preempt_count in thread_info is 0; if so, preempt_schedule_irq() is invoked and the kernel is preempted, otherwise execution continues, and finally the interrupt returns with iret.
Checking for kernel preemption
If preemption is needed, preempt_schedule_irq() is called. It adds PREEMPT_ACTIVE to preempt_count, sets the big kernel lock counter lock_depth to -1, enables local interrupts, and calls schedule(). After schedule() returns, it disables local interrupts again, removes the PREEMPT_ACTIVE flag, and restores lock_depth to its previous value.
Chapter 5. Kernel Synchronization
5.1 How the Kernel Services Requests (differs from the 2nd edition)
Kernel Preemption
- In both preemptive and nonpreemptive kernels a process can voluntarily relinquish the CPU; a preemptive kernel additionally allows the current process to be replaced by a higher-priority one while running in Kernel Mode.
- Both preemptive and nonpreemptive Linux switch processes through the switch_to macro; a nonpreemptive kernel simply never forces a switch while the process executes in Kernel Mode.
Example 1: process A is running an exception handler in Kernel Mode when an interrupt arrives and its ISR runs.
- In a preemptive kernel, a switch to another process may happen as soon as the ISR returns.
- In a nonpreemptive kernel, no switch happens when the ISR returns.
Example 2: the current process's time quantum expires while it is in Kernel Mode.
- In a preemptive kernel, a switch to another process happens.
- In a nonpreemptive kernel, no switch happens.
The purpose of a preemptive kernel: make User Mode processes more responsive.
preempt_count: when it is greater than 0, kernel preemption is disabled. This happens when:
- an ISR is running
- the kernel is executing a softirq or tasklet
- preempt_count has been raised explicitly, e.g. preempt_disable() increments it by one (see the sketch below)
On a uniprocessor, a variable that is not accessed by interrupt or exception handlers is safe as long as preemption is disabled, because nothing can interleave with the access.
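A hedged sketch of the third case above (the per-CPU counter name is made up): preempt_disable() protects against preemption and CPU migration, not against interrupt handlers.
#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(int, my_counter);

static void bump_my_counter(void)
{
    preempt_disable();                  /* preempt_count++, preemption now off */
    __get_cpu_var(my_counter)++;        /* safe: no migration, no preemption */
    preempt_enable();                   /* preempt_count--, may reschedule now */
}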
When Synchronization Is Necessary
Example 1: several interrupt handlers access the same resource; on a single CPU this can be protected simply by disabling interrupts.
Example 2: several system call service routines share a resource; on a single CPU this can be protected simply by disabling kernel preemption.
When Synchronization Is Not Necessary
- An interrupt handler disables its own IRQ line while it runs.
- Interrupt handlers, softirqs, and tasklets are nonpreemptable and non-blocking.
- Kernel code handling an interrupt cannot be interrupted by deferrable functions or by system call service routines.
- Softirqs and tasklets cannot be interleaved on a given CPU.
- The same tasklet is never executed on several CPUs at the same time.
5.2 Synchronization Primitives
Spin lock
In Linux 2.4 on a uniprocessor, acquiring a spin lock does nothing at all, because the 2.4 kernel is nonpreemptive. A usage sketch follows the macro definitions below.
for SMP // copy from asm/spinlock.h
#define spin_lock_init(x) do { *(x) = SPIN_LOCK_UNLOCKED; } while(0)
for UP //copy from include/spinlock.h
#define spin_lock_init(lock) do { } while(0)
#define spin_lock(lock) (void)(lock) /* Not "unused variable". */
#define spin_is_locked(lock) (0)
#define spin_trylock(lock) ({1; })
#define spin_unlock_wait(lock) do { } while(0)
#define spin_unlock(lock) do { } while(0)
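A hedged usage sketch (my_lock and shared_count are made-up names); in 2.6, spin_lock() disables preemption, and the _irqsave variant additionally masks local interrupts:
#include <linux/spinlock.h>

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;   /* 2.6.11-era static initializer */
static int shared_count;

static void update_from_process_context(void)
{
    spin_lock(&my_lock);                /* preemption disabled; other CPUs spin */
    shared_count++;
    spin_unlock(&my_lock);
}

static void update_shared_with_irqs(void)
{
    unsigned long flags;

    spin_lock_irqsave(&my_lock, flags);       /* also disables local interrupts */
    shared_count++;
    spin_unlock_irqrestore(&my_lock, flags);
}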
Chapter 12. The Virtual Filesystem
Five standard UNIX file types:
- Regular file
- Directory
- Symbolic link
- Device
- Pipe
12.1 The Role of the Virtual Filesystem (VFS)
The filesystem that holds the root directory is the root filesystem; all other filesystems must be mounted on directories below it.
The Common File Model
read() => sys_read() => file->f_op->read(...)
The common file model consists of:
- The superblock object
- The inode object
- The file object
- The dentry object
Comparison of the various caches:
- disk cache: reduces disk I/O
- dentry cache : pathname => dentry
- inode cache
- page cache
- hardware cache
- memory cache : to bypass the Kernel Memory Allocator
12.2 VFS Data Structures
12.7 File Locking
If two processes write to the same file at the same time, the result is unpredictable.
- advisory lock
- mandatory lock: before open(), read(), and write() are carried out, the kernel checks that they do not conflict with an existing lock.
Note: NFSv4 does not support mandatory locks; see the kernel function nfs_flock().
Linux File Locking
- flock() is a system call; it provides advisory locks only
- fcntl() is a system call
- lockf() is not a system call; it is a library routine built on top of fcntl()
- a lease is a special lock set via fcntl(); when another process tries to open the file, the holder is notified and gets /proc/sys/fs/lease-break-time seconds, usually 45, to release it
To use mandatory locks:
- mount the filesystem with the -o mand option, which sets MS_MANDLOCK (the feature is off by default)
- set the set-group bit (SGID) and clear the group-execute permission bit, as in the chmod commands below
- acquire and release the locks with fcntl(); see the sketch after the commands
chmod g+s /mnt/testfile
chmod g-x /mnt/testfile
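A hedged user-space sketch of the last step: taking and releasing a write lock on the first 100 bytes with fcntl(). With the SGID trick above and a -o mand mount, the same lock becomes mandatory:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct flock fl = {
        .l_type   = F_WRLCK,            /* exclusive write lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 100,                /* lock the first 100 bytes */
    };
    int fd = open("/mnt/testfile", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (fcntl(fd, F_SETLKW, &fl) == -1) /* F_SETLKW blocks until the lock is granted */
        perror("fcntl(F_SETLKW)");
    /* ... access the locked region ... */
    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);            /* release the lock */
    close(fd);
    return 0;
}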
Using leases:
- invoke a fcntl() system call with a F_SETLEASE or F_GETLEASE command.
- an fcntl() invocation with the F_SETSIG command may be used to change the type of signal sent to the lease holder process
System calls and the fl_flags value they set:
- flock(): sets fl->fl_flags to FL_FLOCK
- fcntl(): sets fl->fl_flags to FL_POSIX
- lease: sets fl->fl_flags to FL_LEASE
#define IS_POSIX(fl) (fl->fl_flags & FL_POSIX)
#define IS_FLOCK(fl) (fl->fl_flags & FL_FLOCK)
#define IS_LEASE(fl) (fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
File-Locking Data Structures
struct file_lock objects form a singly linked list; the i_flock field of struct inode points to the first element of the list.
// The i_flock list is ordered by:
struct file_lock {
struct file_lock *fl_next; /* singly linked list for this inode */
struct hlist_node fl_link; /* node in global lists */
struct list_head fl_block; /* processes blocked waiting on this lock */
fl_owner_t fl_owner;
unsigned int fl_flags;
unsigned char fl_type;
unsigned int fl_pid;
int fl_link_cpu; /* what cpu's list is this on? */
struct pid *fl_nspid;
wait_queue_head_t fl_wait;
struct file *fl_file;
loff_t fl_start;
loff_t fl_end;
};
/* active locks; these are what /proc/locks displays */
static DEFINE_PER_CPU(struct hlist_head, file_lock_list);
//The blocked_hash is used to find POSIX lock loops for deadlock detection.
static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
FL_FLOCK Locks
An FL_FLOCK lock is always associated with a file object.
sys_flock()
{
/* for NFS, f.file->f_op->flock is nfs_flock() */
if (f.file->f_op && f.file->f_op->flock)
error = f.file->f_op->flock(f.file,
(can_sleep) ? F_SETLKW : F_SETLK,
lock);
else
error = flock_lock_file_wait(f.file, lock);
}
flock_lock_file_wait() => flock_lock_file()
It finds filp->f_dentry->d_inode->i_flock and adds or removes the lock on that list.
FL_POSIX Locks
An FL_POSIX lock is always associated with a process and with an inode.
sys_fcntl() calls fcntl_getlk() or fcntl_setlk() depending on its arguments.
For NFSv4, filp->f_op->lock is nfs_lock() and filp->f_op->flock is nfs_flock().
int vfs_lock_file(struct file *filp, unsigned int cmd, struct file_lock *fl, struct file_lock *conf)
{
if (filp->f_op && filp->f_op->lock)
return filp->f_op->lock(filp, cmd, fl);
else
return posix_lock_file(filp, fl, conf);
}
Chapter 15. The Page Cache
The inode cache and the dentry cache store metadata; the page cache is a disk cache that stores whole pages of data.
15.1 The Page Cache
If a page is not in the page cache, a new page frame is allocated and the data is copied into it from disk.
Pages in the page cache can belong to the following classes:
- Pages containing data of regular files
- Pages containing directories
- Pages containing data directly read from block device files (skipping the filesystem layer)
- Pages containing data of User Mode processes that have been swapped out on disk
- Pages belonging to files of special filesystems
Each page belongs to an owner, identified by an inode.
read() and write() rely on the page cache, except when the O_DIRECT flag is used, in which case the I/O bypasses the cache and transfers the data directly.
The page cache must satisfy two requirements:
- a page can be found quickly in the cache
- owners of different types (regular files, device files, and so on) can be handled differently
Every page has an owner corresponding to an inode; a page is located by the owner's inode plus an offset within the owner's data.
The address_space Object
It is a data structure embedded in the inode.
- i_mapping in struct inode points to the struct address_space
- mapping in struct page points to the struct address_space
- index in struct page records the page's position within the owner's data
struct address_space {
struct inode *host; /* owner: inode or block_device; points back to the inode that embeds this address_space */
struct radix_tree_root page_tree; /* radix tree of all pages */
spinlock_t tree_lock; /* and lock protecting it */
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct rb_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
struct mutex i_mmap_mutex; /* protect tree, count, list */
/* Protected by tree_lock together with the radix tree */
unsigned long nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
struct backing_dev_info *backing_dev_info; /* device readahead, etc */
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
void *private_data; /* ditto */
} __attribute__((aligned(sizeof(long))));
struct inode {
...
struct address_space *i_mapping; // always points to the owner's page cache (address_space)
struct address_space i_data; // if the owner is a regular file, the address_space is embedded here in the inode
...
}
struct page {
...
struct address_space *mapping; // points to the i_mapping of the inode that owns this page
pgoff_t index;
unsigned long private; // if the page holds buffers, points to the first buffer head
}
- If the page's owner is a regular file, the i_data field of the inode is the address_space, and the inode's i_mapping pointer also points to it; in turn, the host field of the address_space points back to this inode.
- If the page holds data read from a block device file, the address_space lives in the inode associated with the device object in the bdev special filesystem.
The Radix Tree
Page Cache Handling Functions
- find_get_page()
- find_get_pages()
- add_to_page_cache()
- remove_from_page_cache()
- read_cache_page()
- find_or_create_page()
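A hedged sketch of how a couple of these functions combine, roughly what the per-page loop of the read path does; get_cached_page is a made-up helper:
#include <linux/pagemap.h>

static struct page *get_cached_page(struct address_space *mapping, pgoff_t index)
{
    struct page *page = find_get_page(mapping, index);      /* takes a reference if found */

    if (!page) {
        page = page_cache_alloc_cold(mapping);               /* not cached: allocate a new page */
        if (!page)
            return NULL;
        if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
            page_cache_release(page);                         /* raced with another insert, or no memory */
            return NULL;
        }
        mapping->a_ops->readpage(NULL, page);                 /* ask the owner to read it from disk */
    }
    return page;
}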
15.2 Storing Blocks in the Page Cache
Since 2.4.10 the separate buffer cache is gone; the buffers allocated for blocks now live in buffer pages. A buffer page is associated with buffer head descriptors, whose purpose is to describe where each block's data is located inside the page.
Block Buffers and Buffer Heads
struct buffer_head {
unsigned long b_state; /* buffer state bitmap (see above) */
struct buffer_head *b_this_page; /* circular list of the page's buffers: next buffer head in the same page */
struct page *b_page; /* the page this bh is mapped to */
sector_t b_blocknr; /* start block number */
size_t b_size; /* size of mapping */
char *b_data; /* pointer to data within the page */
struct block_device *b_bdev;
bh_end_io_t *b_end_io; /* I/O completion */
void *b_private; /* reserved for b_end_io */
struct list_head b_assoc_buffers; /* associated with another mapping */
struct address_space *b_assoc_map; /* mapping this buffer is
associated with */
atomic_t b_count; /* users using this buffer_head */
};
b_bdev plus b_blocknr identify the exact position of the block on its block device.
Managing the Buffer Heads
Buffer heads have a dedicated slab allocator.
- alloc_buffer_head()
- free_buffer_head()
- b_count in the buffer head records the reference count
Buffer Pages
The kernel creates buffer pages in two cases:
- When reading or writing pages of a file that are not stored in contiguous disk blocks. In this case the page is inserted into the file's address_space.
- When accessing a single disk block (for instance, when reading a superblock or an inode block). In this case the page is inserted into the address_space of the block device file. Such pages are called block device buffer pages (sometimes blockdev's pages); this chapter concentrates on this case.
For example, when the VFS reads the contents of one block, it allocates a whole page and creates four buffer heads for it, one per block in the page (figure omitted).
Allocating Block Device Buffer Pages
static int
grow_buffers(struct block_device *bdev, sector_t block, int size)
- bdev:block device
- block: block pos
- size: block size
Releasing Block Device Buffer Pages
int try_to_release_page(struct page *page, gfp_t gfp_mask)
Searching Blocks in the Page Cache
struct buffer_head *
__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
It first searches the LRU cache of recently used buffer heads; if the block is not found there, it searches the block device's address_space.
struct buffer_head *
__getblk(struct block_device *bdev, sector_t block, unsigned size)
__getblk() calls __find_get_block(); if that fails, it calls __getblk_slow() to allocate the buffer page.
struct buffer_head *
__bread(struct block_device *bdev, sector_t block, unsigned size)
__bread() calls __getblk(); if the buffer's contents are not up to date, it calls __bread_slow() to read the block from disk. A usage sketch follows.
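A hedged sketch of a typical caller; sb_bread() is the superblock convenience wrapper around __bread(), and read_one_block is a made-up helper name:
#include <linux/buffer_head.h>

static int read_one_block(struct super_block *sb, sector_t blocknr)
{
    struct buffer_head *bh = sb_bread(sb, blocknr); /* __bread() on sb->s_bdev with sb->s_blocksize */

    if (!bh)
        return -EIO;
    /* bh->b_data now points to the up-to-date block contents inside the buffer page */
    /* ... use the data ... */
    brelse(bh);                                     /* drop the reference taken by __bread() */
    return 0;
}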
Submitting Buffer Heads to the Generic Block Layer
int submit_bh(int rw, struct buffer_head *bh)
//for example
submit_bh(READ | REQ_META | REQ_PRIO, bh);
Internally, the actual I/O is issued through a bio.
void ll_rw_block(int rw, int nr, struct buffer_head *bhs[])
//for example
ll_rw_block(WRITE, 1, &bh);
15.3 Writing Dirty Pages to Disk
Chapter 16. Accessing Files
Ways of accessing a file:
- Canonical mode: neither the O_SYNC nor the O_DIRECT flag is set
- Synchronous mode: O_SYNC
- Memory mapping mode: mmap()
- Direct I/O mode: O_DIRECT
- Asynchronous mode
16.1 Reading and Writing a File
- Reading a file is page-based: even when only a few bytes are requested, if the data is not in memory the kernel first allocates a page, adds it to the page cache, and then copies the result from the page into user-mode memory. The usual entry point for reads is generic_file_read().
- Writing a file: the usual entry point is generic_file_write().
Reading from a File
generic_file_read() is used for block-based (disk) files.
generic_file_read() => __generic_file_aio_read() => do_generic_mapping_read()
do_generic_file_read() does the following:
- uses filp->f_mapping as the address_space object
- determines the owner of the address_space object, i.e. the host field of the address_space structure; for a block device this is the inode in the bdev special filesystem rather than filp->f_dentry->d_inode
- splits the read into a number of page-sized reads, and for each page:
  - calls page_cache_readahead() if read-ahead is needed
  - calls find_get_page() to get the page from the address_space object; if it returns NULL, allocates a new page and adds it to the address_space object
  - if the page carries the PG_uptodate flag, the data is already up to date and can be used directly
  - otherwise calls lock_page() to lock the page
  - calls the readpage method to read the data in from disk
  - copies the result back to user-mode memory, which is the job of file_read_actor():
    - it first tries to copy from the page's kernel mapping to user-mode memory
    - if that fails, the page lives in high memory and has to be mapped with kmap()
    - the contents of the newly mapped address are then copied to user space
    - kunmap() releases the mapping
The readpage method for regular files
ext3_readpage() => mpage_readpage()
Analysis of mpage_readpage():
- the function is passed a get_block function, which translates a block position inside the file into the corresponding (logical) position on the block device
- if the page's private field is set, the page already has buffer heads attached and its blocks are not contiguous on disk
- get_block is called for each block in turn to check whether the blocks are contiguous on disk
- if they are contiguous, submit_bio() is called once to read all of them from disk in a single request
- if they are not, block_read_full_page() is called, which reads each block individually
The readpage method for block device files
blkdev_readpage() => block_read_full_page
block_read_full_page(): reads each block in the page one by one.
Read-Ahead of Files
Read-ahead helps sequential reads; it does not help random reads.
16.3 Direct I/O Transfers
16.4 Asynchronous I/O
Asynchronous I/O in Linux 2.6
Asynchronous I/O can be implemented even when the operating system has no kernel support for it: aio_read()/aio_write() can spawn a child process (or thread) that performs the synchronous read()/write() on the caller's behalf.
The Linux 2.6 kernel supports the following AIO system calls (a libaio-based usage sketch follows the list):
- io_setup() : Initializes an asynchronous context for the current process
- io_submit() : Submits one or more asynchronous I/O operations
- io_getevents() : Gets the completion status of some outstanding asynchronous I/O operations
- io_cancel() : Cancels an outstanding I/O operation
- io_destroy() : Removes an asynchronous context for the current process
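A hedged user-space sketch using the libaio wrappers around these system calls (assumes libaio is installed; link with -laio; the file path is arbitrary):
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev[1];
    char *buf;
    int fd = open("/etc/hostname", O_RDONLY);

    if (fd < 0 || posix_memalign((void **)&buf, 512, 4096) != 0)
        return 1;
    if (io_setup(8, &ctx) < 0)                  /* create the AIO context (kioctx + AIO ring) */
        return 1;
    io_prep_pread(&cb, fd, buf, 4096, 0);       /* describe one asynchronous read */
    if (io_submit(ctx, 1, cbs) != 1)            /* hand it to the kernel */
        return 1;
    if (io_getevents(ctx, 1, 1, ev, NULL) == 1) /* wait for its completion event */
        printf("read %ld bytes\n", (long)ev[0].res);
    io_destroy(ctx);                            /* tear down the context */
    close(fd);
    free(buf);
    return 0;
}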
The asynchronous I/O context
Before io_submit() can be used, an AIO context must be initialized. An AIO context is associated with a kioctx object; all the kioctx objects are linked together in the ioctx_list. Each kioctx is associated with an AIO ring, a buffer in the user-mode address space that is also visible from Kernel Mode.
io_setup() exists precisely to create this AIO context.
Submitting the asynchronous I/O operations
The io_submit() system call does the following:
- checks that the iocb descriptors are valid
Chapter 18. The Ext2 and Ext3 Filesystems
18.2 Ext2 Disk Data Structures
The first block is reserved for booting and is not managed by Ext2; the remaining blocks are partitioned into block groups of equal size.