Linux Programming

UTLK - Understanding the Linux Kernel


Book
The kernel version covered is 2.6.11.

Chapter 1 Introduction

Chapter 1.1 Linux Versus Other Unix-Like Kernels

Preemptive kernel

Chapter 1.6 An Overview of Unix Kernels

Synchronization and Critical Regions

Kernel preemption disabling

Since the kernel supports preemption, preemption must be disabled before entering a critical region and re-enabled on leaving it. Kernel spin locks disable preemption, not interrupts.

Interrupt disabling

Disabling interrupts works for protection on a single CPU, but it hurts performance. On a multiprocessor it is not sufficient by itself.

Spin locks

Chapter 3. Processes

3.1 Processes

3.2 Process Descriptor

A process is described by a task_struct.

Process State

Identifying a Process

Because Linux implements threads as lightweight processes, the threads of a process form a thread group, and the PID of the group leader serves as the PID of all its threads. getpid() actually returns current->tgid, not current->pid.
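The 2.6 implementation is essentially a one-liner; a sketch from the 2.6 source:

asmlinkage long sys_getpid(void)
{
    return current->tgid;   /* the thread group id, shared by all threads of the group */
}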

Process descriptors handling

thread_info and the kernel stack share 8 KB of memory (two pages).
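The sharing is done with a union; a minimal sketch of the 2.6 x86 layout (THREAD_SIZE is 8 KB there):

union thread_union {
    struct thread_info thread_info;   /* at the low addresses of the block */
    unsigned long stack[2048];        /* THREAD_SIZE/sizeof(long) = 2048 on x86; grows down */
};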



Identifying the current process

Expansion of current_thread_info():

movl $0xffffe000, %ecx   # mask: ~(THREAD_SIZE - 1) for an 8 KB stack
andl %esp, %ecx          # round esp down to the base of the stack area
movl %ecx, p             # p now holds the thread_info address

Expansion of the current macro:

current_thread_info( )->task

The lists of TASK_RUNNING processes

Linux 2.6 reworked the runqueue: instead of a single long list, processes are split into many lists, one per priority, and each CPU has its own runqueue.
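The per-runqueue priority array that enqueue_task() below manipulates looks roughly like this in 2.6 (simplified; BITMAP_SIZE and MAX_PRIO are the kernel's own constants):

struct prio_array {
    unsigned int nr_active;             /* runnable tasks in this array */
    unsigned long bitmap[BITMAP_SIZE];  /* one bit per priority; set = nonempty list */
    struct list_head queue[MAX_PRIO];   /* one list of tasks per priority level */
};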

void enqueue_task(struct task_struct *p, struct prio_array *array)
{
    sched_info_queued(p);
    list_add_tail(&p->run_list, array->queue + p->prio);
    __set_bit(p->prio, array->bitmap);
    array->nr_active++;
    p->array = array;
}

The code of dequeue_task() is similar.
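Roughly, from the 2.6 scheduler:

static void dequeue_task(struct task_struct *p, struct prio_array *array)
{
    array->nr_active--;
    list_del(&p->run_list);
    if (list_empty(array->queue + p->prio))   /* last task at this priority? */
        __clear_bit(p->prio, array->bitmap);  /* then clear its bitmap bit */
}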

Relationships Among Processes


The pidhash table and chained lists

First, a simplified view (figure omitted).



In the real layout (figure omitted), the pid_list fields form a chained list whose elements all have the same nr; they belong to the same thread group, and nr is their PID.



How Processes Are Organized

Wait queues

The queue head:

struct __wait_queue_head {
    wq_lock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;

A queue element:

struct __wait_queue {
    unsigned int flags;
#define WQ_FLAG_EXCLUSIVE   0x01
    void *private;
    wait_queue_func_t func;
    struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;

There are two ways to wake up a wait queue:

  1. Wake just one process: exclusive processes, flags set to 1, added with add_wait_queue_exclusive()
  2. Wake all of them: nonexclusive processes, flags set to 0, added with add_wait_queue()

Handling wait queues

void sleep_on(wait_queue_head_t *q)
{
    wait_queue_t wait;
    init_waitqueue_entry(&wait, current);
    current->state = TASK_UNINTERRUPTIBLE;
    add_wait_queue(q, &wait);  //insert into the wait queue
    schedule( );  //go to sleep
    remove_wait_queue(q, &wait); //remove from the wait queue after waking up
}

wake_up()=>__wake_up()=>__wake_up_common()

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                 int nr_exclusive, int sync, void *key)
{
    struct list_head *tmp, *next;

    list_for_each_safe(tmp, next, &q->task_list) {
        wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
        unsigned flags = curr->flags;

        if (curr->func(curr, mode, sync, key) &&
                (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
}

The curr->func callback defaults to try_to_wake_up(), whose main job is to set the process's state to TASK_RUNNING.

Wait-related macros; sleep_on() is essentially unused nowadays.
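The usual replacement is the wait_event family, which avoids the lost-wakeup race in sleep_on(). A minimal sketch (my_wq and condition_flag are made-up names):

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int condition_flag;

/* sleeper side: sleeps until the condition becomes true */
wait_event_interruptible(my_wq, condition_flag != 0);

/* waker side */
condition_flag = 1;
wake_up_interruptible(&my_wq);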

Wake-related macros

Process Resource Limits

The getrlimit() and setrlimit() system calls.
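A small user-space sketch of the pair:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    getrlimit(RLIMIT_NOFILE, &rl);   /* max number of open files */
    printf("soft=%lu hard=%lu\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;       /* raise the soft limit up to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
    return 0;
}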

Process Switch

Hardware context switch
In older versions of Linux the switch was done in hardware, with a far jmp.
Linux 2.4 switches in software, i.e., registers are saved and restored by code.

The TSS segment: on the transition from User Mode to Kernel Mode the CPU reads the TSS to fetch the new kernel stack pointer (a hardware task switch would also load EIP, CS, and the other registers from it).
thread_struct //some hardware registers are saved here
tss_struct //mirrors the TSS segment

schedule
Local variables prev and next point to the process being switched out and the process being switched in.
The switch_to macro "calls" __switch_to() via jmp rather than call, so when __switch_to() executes ret, control lands at next->thread.eip, while prev->thread.eip stores the address of label 1.

3.4 Creating Processes

Three system calls create processes:

  1. clone() creates a lightweight process
  2. fork()
  3. vfork()
    All of them end up calling do_fork(); a sketch of how they funnel into it follows this list.
    user_struct records per-user data.
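Roughly how the x86 entry points reach do_fork() in 2.6 (a sketch; argument details vary by version):

asmlinkage int sys_fork(struct pt_regs regs)
{
    return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
}

asmlinkage int sys_vfork(struct pt_regs regs)
{
    /* CLONE_VM: share the address space; CLONE_VFORK: block the parent
       until the child execs or exits */
    return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
}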

Chapter 4. Interrupts and Exceptions

4.8 Work Queues (differs from the 2nd edition)

Work queues appeared in 2.6 to replace the task queues of 2.4. Work queue functions run in a kernel thread, so they must not access user-space data.

/*
 * The per-CPU workqueue (if single thread, we always use the first
 * possible cpu).
 *
 * The sequence counters are for flush_scheduled_work().  It wants to wait
 * until until all currently-scheduled works are completed, but it doesn't
 * want to be livelocked by new, incoming ones.  So it waits until
 * remove_sequence is >= the insert_sequence which pertained when
 * flush_scheduled_work() was called.
 */
struct cpu_workqueue_struct {

    spinlock_t lock;

    long remove_sequence;   /* Least-recently added (next to run) */
    long insert_sequence;   /* Next to add */

    struct list_head worklist;
    wait_queue_head_t more_work;
    wait_queue_head_t work_done;

    struct workqueue_struct *wq;
    struct task_struct *thread;

    int run_depth;      /* Detect run_workqueue() recursion depth */
} ____cacheline_aligned;

/*
 * The externally visible workqueue abstraction is an array of
 * per-CPU workqueues:
 */
struct workqueue_struct {
    struct cpu_workqueue_struct *cpu_wq;
    const char *name;
    struct list_head list;  /* Empty if single thread */
};
Work queue functions

The queued functions are actually executed in worker_thread() => run_workqueue().

The predefined work queue

The predefined work queue is called events.
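Queuing work on it, against the 2.6-era API (work functions took a void * argument back then; my_work_handler is a made-up name):

#include <linux/workqueue.h>

static void my_work_handler(void *data)
{
    /* runs later in the events/N kernel thread of this CPU */
}

static DECLARE_WORK(my_work, my_work_handler, NULL);

/* somewhere in process or interrupt context: */
schedule_work(&my_work);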

4.9 Returning from Interrupts and Exceptions (differs from the 2nd edition)

Returning from an interrupt or an exception goes through ret_from_intr() or ret_from_exception().

The entry points

In the equivalent assembly: interrupts are already disabled when returning from an interrupt, so only the exception-return path needs to disable them. The kernel keeps the thread_info pointer in ebp. Based on what was pushed onto the stack on entry, the code decides whether it is returning to Kernel Mode or to User Mode.

Resuming a kernel control path

If kernel preemption is required, preempt_schedule_irq() is invoked.

In the equivalent assembly: check whether thread_info's preempt_count is zero; if so, the kernel can be preempted; otherwise execution continues, finally returning from the interrupt with iret.

Checking for kernel preemption

If preemption is needed, preempt_schedule_irq() is called. It adds PREEMPT_ACTIVE to preempt_count, sets the big kernel lock counter lock_depth to -1, enables local interrupts, and enters schedule(). After schedule() returns, it disables local interrupts again, clears the PREEMPT_ACTIVE flag, and restores lock_depth to its previous value.

Chapter 5. Kernel Synchronization

5.1 How the Kernel Services Requests (不同于第二版)

Kernel Preemption

  1. In both preemptive and nonpreemptive kernels a process can voluntarily relinquish the CPU. A preemptive kernel additionally allows the current process to be forced off the CPU by a higher-priority process.
  2. Both preemptive and nonpreemptive Linux switch processes through the switch_to macro; a nonpreemptive kernel simply never performs the switch while a process runs in Kernel Mode.
    Example 1: process A is inside an exception handler in Kernel Mode when an interrupt arrives and its ISR runs.
  3. With a preemptive kernel, a switch to another process may happen when the ISR returns.
  4. With a nonpreemptive kernel, no switch happens when the ISR returns.

Example 2: the current process's time quantum expires while it is running in Kernel Mode.

  1. A preemptive kernel switches to another process.
  2. A nonpreemptive kernel does not.

The goal of a preemptive kernel is to make User Mode processes more responsive.

When preempt_count is greater than zero, kernel preemption is disabled. That happens when:

  1. an ISR is running
  2. the kernel is running a softirq or tasklet
  3. preempt_count is changed explicitly, e.g., preempt_disable() increments it (see the sketch after this list)
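The explicit form, as a minimal sketch:

#include <linux/preempt.h>

preempt_disable();   /* preempt_count++; kernel preemption is now off */
/* ... touch per-CPU data without being preempted or migrated ... */
preempt_enable();    /* preempt_count--; also reschedules if a preemption is pending */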

On a uniprocessor, a variable that is never accessed by interrupt or exception handlers is safe, because nothing can interleave with the access.

When Synchronization Is Necessary

Example 1: several ISRs access the same resource; on a single CPU this can be protected simply by disabling interrupts.
Example 2: several system calls share a resource; on a single CPU it is enough to disable kernel preemption.

When Synchronization Is Not Necessary

  1. While an ISR runs, its own IRQ line is disabled.
  2. ISRs, softirqs, and tasklets are nonpreemptable and non-blocking.
  3. An ISR cannot be interrupted by deferrable functions (softirqs and tasklets).
  4. Softirqs and tasklets cannot be interleaved on a given CPU.
  5. The same tasklet cannot run on several CPUs at the same time.

5.2 Synchronization Primitives

Spin lock

In Linux 2.4 on UP, acquiring a spin lock does nothing at all, because Linux 2.4 is nonpreemptive.
For SMP (from asm/spinlock.h):

#define spin_lock_init(x) do { *(x) = SPIN_LOCK_UNLOCKED; } while(0)

For UP (from include/linux/spinlock.h):

#define spin_lock_init(lock) do { } while(0)
#define spin_lock(lock) (void)(lock) /* Not "unused variable". */
#define spin_is_locked(lock) (0)
#define spin_trylock(lock) ({1; })
#define spin_unlock_wait(lock) do { } while(0)
#define spin_unlock(lock) do { } while(0)
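In 2.6, by contrast, the UP variants are no longer empty: since the kernel is preemptive, they reduce to toggling preemption. A simplified sketch of their effect (ignoring the debug variants):

#define spin_lock(lock)    preempt_disable()
#define spin_unlock(lock)  preempt_enable()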

Chapter 12. The Virtual Filesystem

The five standard Unix file types

12.1 The Role of the Virtual Filesystem (VFS)

The filesystem containing the root directory is the root filesystem; all other filesystems must be mounted on subdirectories below the root.

The Common File Model

read() => sys_read() => file->f_op->read(...)
The common file model consists of:

  1. The superblock object
  2. The inode object
  3. The file object
  4. The dentry object

A comparison of the several caches

12.2 VFS Data Structures

12.7 File Locking

If two processes write the same file at the same time, the result is unpredictable.

Linux File Locking

  1. Mount with the -o mand option, which stands for MS_MANDLOCK (mandatory locking is off by default)
  2. Set the set-group bit (SGID) and clear the group-execute permission bit
  3. Use fcntl() to acquire and release locks (a user-space sketch follows the commands below)
chmod g+s /mnt/testfile
chmod g-x /mnt/testfile
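A user-space sketch of step 3, locking and unlocking with fcntl() (reusing the example path from above):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive write lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 means "to end of file" */
    };
    int fd = open("/mnt/testfile", O_RDWR);

    fcntl(fd, F_SETLKW, &fl);  /* F_SETLKW blocks until the lock is granted */
    /* ... exclusive access ... */
    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);   /* release */
    close(fd);
    return 0;
}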

Using leases:

  1. Invoke the fcntl() system call with the F_SETLEASE or F_GETLEASE command.
  2. An fcntl() invocation with the F_SETSIG command may be used to change the type of signal sent to the lease-holder process.
    The related flag macros:
#define IS_POSIX(fl)    (fl->fl_flags & FL_POSIX)
#define IS_FLOCK(fl)    (fl->fl_flags & FL_FLOCK)
#define IS_LEASE(fl)    (fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
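A minimal user-space sketch of the two lease commands (F_SETLEASE and F_SETSIG are Linux-specific and need _GNU_SOURCE):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/testfile", O_RDONLY);

    fcntl(fd, F_SETLEASE, F_RDLCK);  /* take a read lease on the open file */
    fcntl(fd, F_SETSIG, SIGRTMIN);   /* deliver SIGRTMIN instead of SIGIO on a lease break */
    /* ... */
    close(fd);
    return 0;
}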

File-Locking Data Structures

struct file_lock entries form a singly linked list; the i_flock field of struct inode points to the first element of the list.

// The i_flock list is ordered by:
struct file_lock {
    struct file_lock *fl_next;  /* singly linked list for this inode  */
    struct hlist_node fl_link;  /* node in global lists */
    struct list_head fl_block;  /* processes blocked waiting on this lock */
    fl_owner_t fl_owner;
    unsigned int fl_flags;
    unsigned char fl_type;
    unsigned int fl_pid;
    int fl_link_cpu;        /* what cpu's list is this on? */
    struct pid *fl_nspid;
    wait_queue_head_t fl_wait;
    struct file *fl_file;
    loff_t fl_start;
    loff_t fl_end;
};
//active locks; these are what /proc/locks shows
static DEFINE_PER_CPU(struct hlist_head, file_lock_list);
//The blocked_hash is used to find POSIX lock loops for deadlock detection.
static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);

FL_FLOCK Locks

An FL_FLOCK lock is always associated with a file object.

sys_flock()
{
        //for NFS, f.file->f_op->flock is nfs_flock()
    if (f.file->f_op && f.file->f_op->flock)
        error = f.file->f_op->flock(f.file,
                      (can_sleep) ? F_SETLKW : F_SETLK,
                      lock);
    else
        error = flock_lock_file_wait(f.file, lock);    
}

flock_lock_file_wait() => flock_lock_file()
It finds filp->f_dentry->d_inode->i_flock and takes or releases the lock on that list.
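From user space the whole path is driven by the flock() system call; a minimal sketch:

#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/testfile", O_RDONLY);

    flock(fd, LOCK_EX);  /* advisory exclusive lock, tied to the open file object */
    /* ... */
    flock(fd, LOCK_UN);  /* release; also dropped automatically on close */
    close(fd);
    return 0;
}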

FL_POSIX Locks

An FL_POSIX lock is always associated with a process and with an inode.
sys_fcntl() dispatches to fcntl_getlk() or fcntl_setlk() according to its command argument.
For NFSv4, filp->f_op->lock is nfs_lock() and filp->f_op->flock is nfs_flock().

int vfs_lock_file(struct file *filp, unsigned int cmd, struct file_lock *fl, struct file_lock *conf)
{
    if (filp->f_op && filp->f_op->lock)
        return filp->f_op->lock(filp, cmd, fl);
    else
        return posix_lock_file(filp, fl, conf);
}

Chapter 15. The Page Cache

The inode cache and the dentry cache store metadata; the page cache is a disk cache that stores whole pages of data.

15.1 The Page Cache

If a page is not in the page cache, a new page is allocated and the disk contents are copied into it.
The page cache has to satisfy two requirements:

  1. a page can be looked up in the cache quickly
  2. owners of different types (regular file, device file, and so on) need different handling
    Every page has an owner, which corresponds to an inode; a page is identified by its owner's inode plus an offset within the owner.

The address_space Object

It is a data structure embedded in an inode.
The i_mapping field of struct inode points to the struct address_space.
The mapping field of struct page points to the struct address_space.
The index field of struct page records the page's position within the owner.

struct address_space {
    struct inode        *host;      /* owner: inode or block_device; points back to the inode that embeds this address_space */
    struct radix_tree_root  page_tree;  /* radix tree of all pages */
    spinlock_t      tree_lock;  /* and lock protecting it */
    unsigned int        i_mmap_writable;/* count VM_SHARED mappings */
    struct rb_root      i_mmap;     /* tree of private and shared mappings */
    struct list_head    i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
    struct mutex        i_mmap_mutex;   /* protect tree, count, list */
    /* Protected by tree_lock together with the radix tree */
    unsigned long       nrpages;    /* number of total pages */
    pgoff_t         writeback_index;/* writeback starts here */
    const struct address_space_operations *a_ops;   /* methods */
    unsigned long       flags;      /* error bits/gfp mask */
    struct backing_dev_info *backing_dev_info; /* device readahead, etc */
    spinlock_t      private_lock;   /* for use by the address_space */
    struct list_head    private_list;   /* ditto */
    void            *private_data;  /* ditto */
} __attribute__((aligned(sizeof(long))));

struct inode {
    ...
    struct address_space    *i_mapping;//always points to the owner's page cache
    struct address_space    i_data;//if the owner is a regular file, the address_space is embedded right here
    ...
}

struct page {
    ...
    struct address_space *mapping;//points to the i_mapping of the inode that owns this page
    pgoff_t index;
    unsigned long private; //if the page stores buffers, points to the first buffer head
}
  1. If the page's owner is a regular file, the i_data field of the inode is the address_space, and the inode's i_mapping also points to it; the host field of the address_space points back to that inode.
  2. If the page comes from a block device, the address_space lives in the inode that the bdev special filesystem associates with the device. A lookup sketch follows this list.
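Looking a page up by its (owner, offset) pair; a sketch where mapping and index are assumed given:

struct page *page = find_get_page(mapping, index);  /* referenced page, or NULL */
if (page) {
    /* ... use the cached contents ... */
    page_cache_release(page);  /* drop the reference find_get_page() took */
}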

The Radix Tree


Page Cache Handling Functions

15.2 Storing Blocks in the Page Cache

Since 2.4.10 the separate buffer cache is gone; a buffer allocated for a block lives in a buffer page. A buffer page is tied to buffer head descriptors; the point of a buffer head is to tell where inside the page a block's data is located.

Block Buffers and Buffer Heads

struct buffer_head {
    unsigned long b_state;      /* buffer state bitmap (see above) */
    struct buffer_head *b_this_page;/* circular list of the page's buffers; points to the next buffer head within the same page */
    struct page *b_page;        /* the page this bh is mapped to */

    sector_t b_blocknr;     /* start block number */
    size_t b_size;          /* size of mapping */
    char *b_data;           /* pointer to data within the page */

    struct block_device *b_bdev;
    bh_end_io_t *b_end_io;      /* I/O completion */
    void *b_private;        /* reserved for b_end_io */
    struct list_head b_assoc_buffers; /* associated with another mapping */
    struct address_space *b_assoc_map;  /* mapping this buffer is
                           associated with */
    atomic_t b_count;       /* users using this buffer_head */
};

b_bdev plus b_blocknr pinpoint the block's position on a specific block device.

Managing the Buffer Heads

Buffer heads are allocated from their own slab cache.

Buffer Pages

The kernel creates buffer pages in two cases:

  1. When reading or writing pages of a file that are not stored in contiguous disk blocks. In this case the page is inserted into the file's address_space.
  2. When accessing a single disk block (for instance, when reading a superblock or an inode block). In this case the page is inserted into the address_space of the block device file. Such pages are called block device buffer pages (sometimes blockdev's pages); this chapter focuses on this case.

One example: the VFS reads the contents of one block, allocates a page, and creates four buffer heads (one per 1 KB block in a 4 KB page).



Allocating Block Device Buffer Pages

static int
grow_buffers(struct block_device *bdev, sector_t block, int size)

Releasing Block Device Buffer Pages

int try_to_release_page(struct page *page, gfp_t gfp_mask)

Searching Blocks in the Page Cache

struct buffer_head *
__find_get_block(struct block_device *bdev, sector_t block, unsigned size)

It first searches the LRU cache of recently used buffer heads; if that misses, it looks the buffer up in the address_space.

struct buffer_head *
__getblk(struct block_device *bdev, sector_t block, unsigned size)

__getblk() calls __find_get_block(); if that fails, it calls __getblk_slow() to allocate a new buffer page.

struct buffer_head *
__bread(struct block_device *bdev, sector_t block, unsigned size)

__bread() calls __getblk(); if the buffer content is not up to date, it calls __bread_slow() to read it from disk.
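A typical metadata read through it, as a sketch (bdev and blocksize are assumed given):

struct buffer_head *bh = __bread(bdev, 0, blocksize);  /* e.g. read block 0 */
if (bh) {
    /* bh->b_data now points at up-to-date block contents inside the buffer page */
    brelse(bh);  /* drop the reference */
}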

Submitting Buffer Heads to the Generic Block Layer

int submit_bh(int rw, struct buffer_head *bh)
//for example
submit_bh(READ | REQ_META | REQ_PRIO, bh);

Internally the actual I/O is issued through a bio.

void ll_rw_block(int rw, int nr, struct buffer_head *bhs[])
//for example
ll_rw_block(WRITE, 1, &bh);

15.3 Writing Dirty Pages to Disk

Chapter 16. Accessing Files

Operating on files.

16.1 Reading and Writing a File

Reading and Writing a File

  1. Reading a file: it is page-based. Even when the kernel reads just a few bytes, if they are not in memory it allocates a whole page, adds it to the page cache, and then copies the result from the page into user-space memory. The usual entry point for reading is generic_file_read().
  2. Writing a file: the usual entry point is generic_file_write().

Reading from a File

generic_file_read() applies to block-based files.
generic_file_read() => __generic_file_aio_read() => do_generic_mapping_read()
do_generic_mapping_read() does the following:

  1. uses filp->f_mapping as the address_space object
  2. the owner of the address_space object is its host field; if the file is a block device, the owner is the inode in the bdev special filesystem rather than filp->f_dentry->d_inode
  3. splits the read into page-sized reads and, for each page:
    1. calls page_cache_readahead() when read-ahead is called for
    2. calls find_get_page() to look the page up in the address_space object; if it returns NULL, allocates a page and adds it to the address_space object
    3. if the PG_uptodate flag is set, the data is current and can be returned directly
    4. locks the page with lock_page()
    5. invokes the readpage method
    6. copies the result back to user-space memory, handled by file_read_actor() (see the sketch after this list):
      1. it tries to copy from the page's kernel address to user-space memory
      2. if that fails, the page is in high memory and must first be mapped with kmap()
      3. it copies the contents at the new mapping out to user space
      4. it releases the mapping with kunmap()
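The high-memory dance in file_read_actor() is roughly this (page, buf, offset, and size are assumed given):

unsigned long left;
char *kaddr = kmap(page);                         /* map the (possibly highmem) page */
left = __copy_to_user(buf, kaddr + offset, size); /* returns the bytes NOT copied */
kunmap(page);                                     /* release the temporary mapping */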

The readpage method for regular files

ext3_readpage() => mpage_readpage()
Analysis of mpage_readpage():

  1. The function is passed a get_block helper, which translates a block position within the file into the corresponding logical block on the block device (see the typedef sketch after this list).
  2. If the page's private field is set, the page already has buffer heads attached, and the blocks are not contiguous on disk.
  3. It calls get_block for each block in turn to check whether the four blocks are contiguous on disk.
  4. If they are contiguous, a single submit_bio() reads all four blocks from disk at once.
  5. If not, it falls back to block_read_full_page(), which reads each block individually.
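The contract of that helper is captured by the kernel's get_block_t typedef:

/* map block iblock of the file to a device block, filling bh_result;
   if create != 0, allocate the block when it does not exist yet */
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
                          struct buffer_head *bh_result, int create);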

The readpage method for block device files

blkdev_readpage() => block_read_full_page()
block_read_full_page(): reads the page one block at a time.

Read-Ahead of Files

Read-ahead helps sequential reads; it does not help random reads.

16.3 Direct I/O Transfers

16.4 Asynchronous I/O

Asynchronous I/O in Linux 2.6

Asynchronous I/O can be implemented even when the operating system does not support it: aio_read()/aio_write() can create a child process and issue the synchronous read()/write() from inside it.
The Linux 2.6 kernel supports the following AIO system calls:

The asynchronous I/O context

An AIO context must be initialized before io_submit() is used. The AIO context is associated with a kioctx object; all kioctx objects are chained together on the ioctx_list. Each kioctx is associated with an AIO ring, a user-mode buffer that is also visible from Kernel Mode.
io_setup() exists precisely to create the AIO context; a usage sketch follows.
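A user-space sketch using the libaio wrappers around these system calls:

#include <libaio.h>
#include <fcntl.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    struct io_event events[1];
    static char buf[4096];
    int fd = open("/mnt/testfile", O_RDONLY);

    io_setup(128, &ctx);                         /* create a context for up to 128 events */
    io_prep_pread(&cb, fd, buf, sizeof(buf), 0); /* async read of 4 KB at offset 0 */
    io_submit(ctx, 1, cbs);
    io_getevents(ctx, 1, 1, events, NULL);       /* wait for the completion */
    io_destroy(ctx);
    return 0;
}

Note that for regular files the in-kernel AIO path generally requires the file to be opened with O_DIRECT; otherwise the operations tend to complete synchronously.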

Submitting the asynchronous I/O operations

The io_submit() system call does the following:

  1. checks whether the iocb is valid

Chapter 18. The Ext2 and Ext3 Filesystems

18.2 Ext2 Disk Data Structures

The first block is used for booting and is not managed by Ext2; the remaining blocks are partitioned into block groups of equal size.
