

Kernel Bug-Vulnerability-Comment library
sisi8408 (linux八哥)
Rank: 风云使者

UID: 509266
Registered: 2006-12-22
Last login: 2008-09-21
Posts: 617


#111, posted 2008-5-11 00:29


/*
* 2.6.24.4-rt4
*/
+static int ksoftirqd(void * __data)
+{
+        struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 };
+        struct softirqdata *data = __data;
+        u32 softirq_mask = (1 << data->nr);
+        struct softirq_action *h;
+        int cpu = data->cpu;
+
+#ifdef CONFIG_PREEMPT_SOFTIRQS
+        init_waitqueue_head(&data->wait);
+#endif
+
+        sys_sched_setscheduler(current->pid, SCHED_FIFO, &param);
+        current->flags |= PF_SOFTIRQ;
        set_current_state(TASK_INTERRUPTIBLE);



/*
* 2.6.24.4
*/
static void task_tick_rt(struct rq *rq, struct task_struct *p)
{
        update_curr_rt(rq);

        /*
         * RR tasks need a special form of timeslice management.
         * FIFO tasks have no timeslices.
         */
        if (p->policy != SCHED_RR)
                return;

        if (--p->time_slice)
                return;

        p->time_slice = DEF_TIMESLICE;

        /*
         * Requeue to the end of queue if we are not the only element
         * on the queue:
         */
        if (p->run_list.prev != p->run_list.next) {
                requeue_task_rt(rq, p);
                set_tsk_need_resched(p);
        }
}

If ksoftirqd runs as an RT task under SCHED_FIFO, and FIFO tasks have no timeslices,
then even in the RT version it is possible in many cases that the threads for
NET_TX_SOFTIRQ, NET_RX_SOFTIRQ and the other softirqs block one another, which may
result in abnormal behavior, say a slab cache crash.

[ Last edited by sisi8408 on 2008-5-11 00:30 ]




__________________________________

东直门外大街
张字85号
丁字96号

#112, posted 2008-5-17 10:05


/* 2.6.24.4
* static void rebalance_domains(int cpu, enum cpu_idle_type idle)
*/
-                if (interval > HZ * NR_CPUS /10)
-                        interval = HZ * NR_CPUS /10;
+                if (interval > (HZ /10) * num_online_cpus())
+                        interval = (HZ /10) * num_online_cpus();
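
To see what the change does to the cap, here is a minimal user-space sketch. clamp_interval() and its parameters are hypothetical names for illustration; the function only mirrors the arithmetic of the proposed lines, on the assumption that the intent is to scale the limit with the CPUs actually online instead of the compile-time NR_CPUS.

```c
#include <assert.h>

/* Upper bound proposed in the post: (HZ / 10) ticks per online CPU. */
static unsigned long clamp_interval(unsigned long interval,
                                    unsigned long hz,
                                    unsigned long online_cpus)
{
        unsigned long max_interval;

        assert(online_cpus > 0);

        max_interval = (hz / 10) * online_cpus;
        if (interval > max_interval)
                interval = max_interval;
        return interval;
}
```

With HZ = 250 and NR_CPUS = 64, the original bound is 250 * 64 / 10 = 1600 ticks regardless of how many CPUs are up; with 2 CPUs online the proposed bound is 25 * 2 = 50 ticks.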




#113, posted 2008-5-18 00:34


/** 2.6.24.4
static int __assign_irq_vector(int irq, cpumask_t mask)
*/
                if (unlikely(current_vector == vector))
-                        continue;
+                        goto next;               




#114, posted 2008-6-8 00:06


/*
* from a post by 独孤九贱
*/
Unable to handle kernel paging request at virtual address 713401b6

printing eip: *pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: uflux e1000 trusthost
CPU:    0
EIP:    0060:[<c03c1de6>]    Not tainted VLI
EFLAGS: 00010206   (2.6.12)

EIP is at ip_route_input+0x86/0x1d0

eax: df976fc0   ebx: c0629000   ecx: 7134010a   edx: 00000000
esi: 0121010a   edi: 2495313a   ebp: 00000003   esp: c0629f04
ds: 007b   es: 007b   ss: 0068

Process swapper (pid: 0, threadinfo=c0629000 task=c0508c20)

Stack: 0121010a 2495315a 00000000 00000000 df987800 00000000 df987800 dd0b6400
       00000000 006d8580 00000000 dd0b6400 dd14c620 c03c4fc0 c03c4eff dd0b6400
       0121010a 2495313a 00000000 df987800 c03c4fc0 80000000 dd18d440 dc7ef240
Call Trace:
[<c03c4fc0>] ip_rcv_finish+0x0/0x310
[<c03c4eff>] ip_rcv+0x4ef/0x5b0
[<c03c4fc0>] ip_rcv_finish+0x0/0x310
[<c039a771>] netif_receive_skb+0x1e1/0x280
[<c039a8a3>] process_backlog+0x93/0x130
[<c039a9ef>] net_rx_action+0xaf/0x1a0
[<c0129942>] __do_softirq+0x72/0xe0
[<c0106bcb>] do_softirq+0x5b/0x60

/* ------------------------------------------------ */

    39c0:       a1 00 00 00 00          mov    0x0,%eax
    39c5:       8b 53 10                mov    0x10(%ebx),%edx
    39c8:       f7 d0                   not    %eax
    39ca:       8b 04 90                mov    (%eax,%edx,4),%eax
    39cd:       ff 40 38                incl   0x38(%eax)
    39d0:       8b 09                   mov    (%ecx),%ecx
    39d2:       85 c9                   test   %ecx,%ecx
    39d4:       74 7a                   je     3a50 <ip_route_input+0x100>
    39d6:       39 b1 ac 00 00 00       cmp    %esi,0xac(%ecx)



1,        EIP is at 39d6, as the original poster showed,
        and it is clear that the rt entry was still in the hash table when it was read,
        otherwise the reader of the rt hash table would have had no chance to see it.
        Between 39d0 and 39d6 there are at most 3 points at which a hard irq
        can interrupt, and so stall, the reader.

2,        From the call trace, the reader runs in softirq context,
        and NAPI is not active in the kernel used.

3,        Since NAPI is not used, the NIC ISR unfortunately runs longer and completes
        only when netif_rx() returns.

4,        The hardware is assumed to work correctly.
        An updater on another cpu (without another cpu the oops would have no chance
        to happen, because it occurs in softirq context)
        deletes the rt entry loaded into %ecx at 39d0,
        just after the reader happens to be interrupted by a hard irq, say because a new skb arrives.

5,        With high probability, the updater runs from the timer hard irq on another cpu,

/* 2.6.24.4
*
* Called from the timer interrupt handler,
* to charge one tick to the current process.
*
* user_tick is 1 if the tick is user time, 0 for system.
*/
void update_process_times (int user_tick)
{
        struct task_struct *p = current;
        int cpu = smp_processor_id();

        account_process_tick(p, user_tick);
        run_local_timers();
       
        if (rcu_pending(cpu))
                rcu_check_callbacks(cpu, user_tick);
       
        scheduler_tick();
        run_posix_cpu_timers(p);
}

        The rt gc timer, triggered through run_local_timers(),
        happens to delete the rt entry loaded into %ecx at 39d0,
        which the reader has already seen and found to be non-NULL,
        and hands it to the rcu core by calling call_rcu_bh().
       
        rcu_pending() has 4 conditions that can
        trigger rcu_check_callbacks(),
        which calls tasklet_schedule(&per_cpu(rcu_tasklet, cpu)) in every case.

6,        Although rcu_tasklet is scheduled,

/*
* This does the RCU processing work from tasklet context.
*/
static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
                                        struct rcu_data *rdp)
{
        if (rdp->curlist &&
            !rcu_batch_before(rcp->completed, rdp->batch)) {
                *rdp->donetail = rdp->curlist;
                rdp->donetail = rdp->curtail;
                /*
                 * rcp->completed >= rdp->batch
                 */
                rdp->curlist = NULL;
                rdp->curtail = &rdp->curlist;
        }

        if (rdp->nxtlist && !rdp->curlist) {
                local_irq_disable();
                rdp->curlist = rdp->nxtlist;
                rdp->curtail = rdp->nxttail;
               
                rdp->nxtlist = NULL;
                rdp->nxttail = &rdp->nxtlist;
                local_irq_enable();
               
                /*
                 * start the next batch of callbacks
                 */

                /* determine batch number */
                rdp->batch = rcp->cur + 1;
               
                /* see the comment and corresponding wmb() in
                 * the rcu_start_batch()
                 */
                smp_rmb();

                if (!rcp->next_pending) {
                        /* and start it/schedule start if it's a new batch */
                        spin_lock(&rcp->lock);
                        rcp->next_pending = 1;
                        rcu_start_batch(rcp);
                        spin_unlock(&rcp->lock);
                }
        }

        rcu_check_quiescent_state(rcp, rdp);
       
        if (rdp->donelist)
                rcu_do_batch(rdp);
}

        call_rcu_bh() only adds the victim to rdp->nxtlist.
        The victim is not freed by the first run of rcu_tasklet after the
        timer hard irq, but it may be freed by a later run of rcu_tasklet
        after the same timer hard irq, because softirqs can be processed more than once,
        provided that the following conditions hold:
       
        I,        (rdp->nxtlist && !rdp->curlist) is true
                on the first run of rcu_tasklet, so that
                rdp->curlist = rdp->nxtlist;
       
        II,        (rdp->nxtlist && !rdp->curlist) is true,
                also on the first run of rcu_tasklet,
       
        III,        rcu_bh_qsctr_inc(cpu) has already been called in rcu_check_callbacks(),
       
        IV,        cpu_quiet() is called by rcu_check_quiescent_state()
                on the first run of rcu_tasklet,
       
        V,        (rdp->curlist &&
                        !rcu_batch_before(rcp->completed, rdp->batch))
                is true
                on the second run of rcu_tasklet,

        VI,        the reader stays blocked for long enough.

[ Last edited by sisi8408 on 2008-6-8 00:08 ]



#115, posted 2008-7-18 21:48


...
        schedstat_inc(rq, ttwu_count);

        if (cpu == this_cpu) {
                schedstat_inc(rq, ttwu_local);
                goto out_set_cpu;
        }
/* try_to_wake_up @ 2.6.24.4
        if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
                goto out_set_cpu;
*/
        for_each_domain(this_cpu, sd) {
                if (cpu_isset(cpu, sd->span)) {
                        schedstat_inc(sd, ttwu_wake_remote);
                        this_sd = sd;
                        break;
                }
        }

        if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
                goto out_set_cpu;

        /*
         * Check for affine wakeup and passive balancing possibilities.
         */
...




#116, posted 2008-7-18 22:38


...
        /* build_sched_domains @ 2.6.24.4
         * Calculate CPU power for physical packages and nodes
         */
#ifdef CONFIG_SCHED_SMT
        for_each_cpu_mask(i, *cpu_map) {
                struct sched_domain *sd = &per_cpu(cpu_domains, i);

                init_sched_groups_power(i, sd);
        }

#elif defined(CONFIG_SCHED_MC)
        for_each_cpu_mask(i, *cpu_map) {
                struct sched_domain *sd = &per_cpu(core_domains, i);

                init_sched_groups_power(i, sd);
        }
#else

        for_each_cpu_mask(i, *cpu_map) {
                struct sched_domain *sd = &per_cpu(phys_domains, i);

                init_sched_groups_power(i, sd);
        }
#endif

        /* Attach the domains */
...




#117, posted 2008-7-27 11:39
queue & sched in .26 blk layer

1, blk init core
   =============

int __init blk_dev_init(void)
{
        int i;

        /* 1,
         * operations on the queue are based on `work' items;
         * they are asynchronous and run in kthread context, see workqueue,
         * much like do_softirq and ksoftirqd/x
         */
        kblockd_workqueue = create_workqueue("kblockd");
        if (!kblockd_workqueue)
                panic("Failed to create kblockd\n");

        request_cachep = kmem_cache_create("blkdev_requests",
                        sizeof(struct request), 0, SLAB_PANIC, NULL);
        /* 2,
         * as in other places, e.g. skbs in net, a kmem cache is used;
         * if you want to do something unusual with skbs, say zero-copy,
         * you have to maintain your own alloc/free_skb methods,
         * which is hard work on a hot path, isn't it?
         */
        blk_requestq_cachep = kmem_cache_create("blkdev_queue",
                        sizeof(struct request_queue), 0, SLAB_PANIC, NULL);

        for_each_possible_cpu(i)
                INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
        /* 3,
         * a per-cpu list is prepared for BLOCK_SOFTIRQ,
         * powerful, and lockless of course
         */
        open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
       
        register_hotcpu_notifier(&blk_cpu_notifier);

        return 0;
}
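
Why a per-cpu list needs no shared lock can be sketched in user space. Everything below (model_request, cpu_done[], MODEL_NR_CPUS, the model_*() functions) is a hypothetical stand-in, not the kernel's blk_cpu_done code, and it ignores the fact that the kernel still has to disable local interrupts around the list manipulation.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4         /* illustrative, not a kernel constant */

struct model_request {
        struct model_request *next;
        int done;
};

/* One completion list per cpu, in the spirit of blk_cpu_done. */
static struct model_request *cpu_done[MODEL_NR_CPUS];

/* Runs on `cpu` when the hardware reports a completed request. */
static void model_complete(int cpu, struct model_request *rq)
{
        assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
        assert(rq != NULL);

        rq->next = cpu_done[cpu];
        cpu_done[cpu] = rq;
}

/* Stand-in for the softirq handler: drains only this cpu's list. */
static int model_done_softirq(int cpu)
{
        struct model_request *rq;
        int n = 0;

        assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

        rq = cpu_done[cpu];
        cpu_done[cpu] = NULL;

        while (rq) {
                struct model_request *next = rq->next;

                rq->done = 1;   /* finish the request */
                rq = next;
                n++;
        }
        return n;
}
```

Each cpu only ever touches its own slot of cpu_done[], so two cpus completing requests at the same time never contend for the same list head.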




#118, posted 2008-7-27 14:23


2, Q init
   ======

The `blk layer` connects to filesystems such as reiser/btr FS on one side,
and to the scsi driver/controller on the other.

In general, the `blk layer` represents the software methods on top of a blk device,
including the queue and the scheduler, or simply the elevator, which is responsible
for managing whatever schedulers are available.

In the eyes of the elevator, a blk device is a queue,
allocated by the driver,

struct request_queue * blk_alloc_queue(gfp_t gfp_mask)
{
        return blk_alloc_queue_node(gfp_mask, -1);
}

struct request_queue * blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
{
        struct request_queue *q;
        int err;

        q = kmem_cache_alloc_node(blk_requestq_cachep,
                                gfp_mask | __GFP_ZERO, node_id);
        if (!q)
                return NULL;

        q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
        /* 1,
         * bdi (backing_dev_info) is a concept from the disk cache;
         * see and compare with ramdisk, a nice example of a pseudo device,
         * if you want to experiment with SSDs.
         */
        q->backing_dev_info.unplug_io_data = q;
       
        err = bdi_init(&q->backing_dev_info);
        if (err) {
                kmem_cache_free(blk_requestq_cachep, q);
                return NULL;
        }

        /* 2,
         * like hrtimer, a simple mechanism for scheduling the controller
         */
        init_timer(&q->unplug_timer);

        kobject_init(&q->kobj, &blk_queue_ktype);
        /* 3,
         * the regular sysfs hook-up; its parent looks cool
         */
        mutex_init(&q->sysfs_lock);
       
        spin_lock_init(&q->__queue_lock);
        return q;
}


and initialised in the blk layer,

struct request_queue * blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
{
        return blk_init_queue_node(rfn, lock, -1);
}

struct request_queue *
        blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
{
        struct request_queue *q = blk_alloc_queue_node(GFP_KERNEL, node_id);

        if (!q)
                return NULL;
        q->node = node_id;

        if (blk_init_free_list(q)) {
                kmem_cache_free(blk_requestq_cachep, q);
                return NULL;
        }

        /*
         * if caller didn't supply a lock,
         * they get per-queue locking with our embedded lock
         */
        if (!lock)
                lock = &q->__queue_lock;
        q->queue_lock        = lock;

        q->request_fn                = rfn;
        q->prep_rq_fn                = NULL;
        q->unplug_fn                = generic_unplug_device;
        q->queue_flags                = (1 << QUEUE_FLAG_CLUSTER);

        blk_queue_segment_boundary(q, 0xffffffff);

        blk_queue_make_request(q, __make_request);
        blk_queue_max_segment_size(q, MAX_SEGMENT_SIZE);

        blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
        blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);

        q->sg_reserved_size = INT_MAX;

        /*
         * all done
         *
         * info big brother, wheelz, ready for game
         */
        if (!elevator_init(q, NULL)) {
                blk_queue_congestion_threshold(q);
                return q;
        }

        blk_put_queue(q);
        return NULL;
}
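
The lock fallback in blk_init_queue_node() is a small pattern worth isolating. The sketch below is a user-space model with invented names (model_lock_t, model_queue, embedded_lock, model_init_queue); it shows only the pointer selection, not real locking.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int locked; } model_lock_t;   /* stand-in for spinlock_t */

struct model_queue {
        model_lock_t *queue_lock;     /* the lock actually used */
        model_lock_t  embedded_lock;  /* fallback, like q->__queue_lock */
};

/*
 * If the caller supplies a lock, the queue shares it;
 * otherwise the queue falls back to its own embedded lock.
 */
static void model_init_queue(struct model_queue *q, model_lock_t *lock)
{
        assert(q != NULL);

        q->embedded_lock.locked = 0;
        if (!lock)
                lock = &q->embedded_lock;
        q->queue_lock = lock;
}
```

A driver that already serialises its hardware with one lock can pass that lock in, so the queue and the driver agree on a single lock; a driver that passes NULL simply gets per-queue locking.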




#119, posted 2008-7-27 15:44


3, basic methods for request alloc/free/prepare/init
   =================================================

The FS dispatches bios to the elevator; for efficiency they are merged into requests
where possible, then delivered to the controller when necessary.

static struct request *
        get_request(struct request_queue *q, int rw_flags,
                        struct bio *bio, gfp_t gfp_mask)
{
        struct request *rq = NULL;
        struct request_list *rl = &q->rq;
        struct io_context *ioc = NULL;
        const int rw = rw_flags & 0x01;
        int may_queue, priv;

        may_queue = elv_may_queue(q, rw_flags);
        /* 1,
         * check that resources are not over limit, say cpu time.
         * the starvation concept still plays its role;
         * how to compensate?
         */
        if (may_queue == ELV_MQUEUE_NO)
                goto rq_starved;

        if (rl->count[rw] +1 >= queue_congestion_on_threshold(q)) {
                /* 2,
                 * not only for efficiency: good scheduling also imposes
                 * resource limits on all tasks in the system.
                 * fair play is considered here, regardless of whether
                 * you are root, but compensation is fair too.
                 */
                if (rl->count[rw] +1 >= q->nr_requests) {
                        ioc = current_io_context(GFP_ATOMIC, q->node);
                        /*
                         * The queue will fill after this allocation, so set
                         * it as full, and mark this process as "batching".
                         *
                         * This process will be allowed to complete a batch of
                         * requests, others will be blocked.
                         */
                        if (!blk_queue_full(q, rw)) {
                                ioc_set_batching(q, ioc);
                                blk_set_queue_full(q, rw);
                        } else {
                                if (may_queue != ELV_MQUEUE_MUST
                                        && !ioc_batching(q, ioc)) {
                                        /*
                                         * The queue is full and the allocating
                                         * process is not a "batcher", and not
                                         * exempted by the IO scheduler
                                         */
                                        goto out;
                                }
                                /* else
                                 *
                                 * batcher is biased to allocate upto 50%
                                 * over the defined limit
                                 */
                        }
                }
                blk_set_queue_congested(q, rw);
        }

        /*
         * Only allow batching queuers to allocate up to 50% over the defined
         * limit of requests, otherwise we could have thousands of requests
         * allocated with any setting of ->nr_requests
         */
        if (rl->count[rw] >= (3 * q->nr_requests / 2))
                goto out;

        rl->count[rw]++;
        rl->starved[rw] = 0;

        priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
        if (priv)
                rl->elvpriv++;

        spin_unlock_irq(q->queue_lock);

        rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
        if (unlikely(!rq)) {
                /*
                 * Allocation failed presumably due to memory. Undo anything
                 * we might have messed up.
                 *
                 * Allocating task should really be put onto the front of the
                 * wait queue, but this is pretty rare.
                 */
                spin_lock_irq(q->queue_lock);
                /* 3,
                 * good scheduling relies on good housekeeping, right?
                 * is mm based on page frames?
                 */
                freed_request(q, rw, priv);

                /*
                 * in the very unlikely event that allocation failed and no
                 * requests for this direction was pending, mark us starved
                 * so that freeing of a request in the other direction will
                 * notice us. another possible fix would be to split the
                 * rq mempool into READ and WRITE
                 */
rq_starved:
                if (unlikely(rl->count[rw] == 0))
                        rl->starved[rw] = 1;

                goto out;
        }

        /*
         * ioc may be NULL here, and ioc_batching will be false. That's
         * OK, if the queue is under the request limit then requests need
         * not count toward the nr_batch_requests limit. There will always
         * be some limit enforced by BLK_BATCH_TIME.
         */
        if (ioc_batching(q, ioc))
                ioc->nr_batch_requests--;

        blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
out:
        return rq;
}

static void freed_request(struct request_queue *q, int rw, int priv)
{
        struct request_list *rl = &q->rq;

        rl->count[rw]--;
       
        if (priv)
                rl->elvpriv--;

        __freed_request(q, rw);

        if (unlikely(rl->starved[rw ^ 1]))
                /* 1,
                 * today the elevator cares little whether you are a reader,
                 * since nobody has the power to declare that blue blood is better,
                 */
                __freed_request(q, rw ^ 1);
}
static void __freed_request(struct request_queue *q, int rw)
{
        struct request_list *rl = &q->rq;

        if (rl->count[rw] < queue_congestion_off_threshold(q))
                blk_clear_queue_congested(q, rw);

        if (rl->count[rw] + 1 <= q->nr_requests) {
                /* 2,
                 * but if and only if things are under control,
                 * there is still a chance for compensation, to keep the disk spinning
                 */
                if (waitqueue_active(&rl->wait[rw]))
                        wake_up(&rl->wait[rw]);

                blk_clear_queue_full(q, rw);
        }
}


void init_request_from_bio(struct request *req, struct bio *bio)
{
        req->cmd_type = REQ_TYPE_FS;

        /*
         * inherit FAILFAST from bio (for read-ahead, and explicit FAILFAST)
         */
        if (bio_rw_ahead(bio) || bio_failfast(bio))
                req->cmd_flags |= REQ_FAILFAST;

        /*
         * REQ_BARRIER implies no merging, but lets make it explicit
         */
        if (unlikely(bio_barrier(bio)))
                req->cmd_flags |= (REQ_HARDBARRIER | REQ_NOMERGE);

        if (bio_sync(bio))
                req->cmd_flags |= REQ_RW_SYNC;
        if (bio_rw_meta(bio))
                req->cmd_flags |= REQ_RW_META;

        req->errors = 0;
        req->hard_sector = req->sector = bio->bi_sector;
        /* 1,
         * prio is also considered, but how is it defined?
         */
        req->ioprio = bio_prio(bio);
        req->start_time = jiffies;
       
        blk_rq_bio_prep(req->q, req, bio);
}
void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
                     struct bio *bio)
{
        /* first two bits are identical in rq->cmd_flags and bio->bi_rw */
        rq->cmd_flags |= (bio->bi_rw & 3);

        rq->nr_phys_segments = bio_phys_segments(q, bio);
        rq->nr_hw_segments = bio_hw_segments(q, bio);
        rq->current_nr_sectors = bio_cur_sectors(bio);
        rq->hard_cur_sectors = rq->current_nr_sectors;
        rq->hard_nr_sectors = rq->nr_sectors = bio_sectors(bio);
        rq->buffer = bio_data(bio);
        rq->data_len = bio->bi_size;

        rq->bio = rq->biotail = bio;

        if (bio->bi_bdev)
                rq->rq_disk = bio->bi_bdev->bd_disk;
}
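
On the question raised in init_request_from_bio(), "how is prio defined?": as far as I recall, include/linux/ioprio.h of that era packs a scheduling class and a level into one 16-bit value. The macros below are reproduced from memory and should be checked against the tree in use; model_make_ioprio() is a hypothetical helper added only for illustration.

```c
#include <assert.h>

#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_PRIO_MASK        ((1UL << IOPRIO_CLASS_SHIFT) - 1)

#define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask)  ((mask) & IOPRIO_PRIO_MASK)
#define IOPRIO_PRIO_VALUE(class, data) \
        (((class) << IOPRIO_CLASS_SHIFT) | (data))

/* scheduling classes: real-time, best-effort, idle */
enum {
        IOPRIO_CLASS_NONE,
        IOPRIO_CLASS_RT,
        IOPRIO_CLASS_BE,
        IOPRIO_CLASS_IDLE,
};

#define IOPRIO_BE_NR 8          /* levels 0 (highest) .. 7 (lowest) */

/* Hypothetical helper: build an ioprio value with range checks. */
static unsigned short model_make_ioprio(int ioprio_class, int level)
{
        assert(ioprio_class >= IOPRIO_CLASS_NONE &&
               ioprio_class <= IOPRIO_CLASS_IDLE);
        assert(level >= 0 && level < IOPRIO_BE_NR);

        return (unsigned short)IOPRIO_PRIO_VALUE(ioprio_class, level);
}
```

So a best-effort request at level 4 carries (2 << 13) | 4, and IOPRIO_PRIO_CLASS() / IOPRIO_PRIO_DATA() take the two parts apart again.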




#120, posted 2008-7-27 16:39


4, basic method exported to FS
   ===========================

void submit_bio(int rw, struct bio *bio)
{
        int count = bio_sectors(bio);

        bio->bi_rw |= rw;

        /*
         * If it's a regular read/write or a barrier with data attached,
         * go through the normal accounting stuff before submission.
         */
        if (!bio_empty_barrier(bio)) {
                BIO_BUG_ON(!bio->bi_size);
                BIO_BUG_ON(!bio->bi_io_vec);

                if (rw & WRITE) {
                        count_vm_events(PGPGOUT, count);
                } else {
                        task_io_account_read(bio->bi_size);
                        count_vm_events(PGPGIN, count);
                }

                if (unlikely(block_dump)) {
                        char b[BDEVNAME_SIZE];
                       
                        printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
                                current->comm, task_pid_nr(current),
                                (rw & WRITE) ? "WRITE" : "READ",
                                (unsigned long long)bio->bi_sector,
                                bdevname(bio->bi_bdev, b));
                }
        }
        generic_make_request(bio);
}

void generic_make_request(struct bio *bio)
{
        if (current->bio_tail) { /* make_request is active */
                *(current->bio_tail) = bio;
                bio->bi_next = NULL;
                current->bio_tail = &bio->bi_next;
                return;
        }
        /* following loop may be a bit non-obvious, and so deserves some
         * explanation.
         * Before entering the loop, bio->bi_next is NULL (as all callers
         * ensure that) so we have a list with a single bio.
         *
         * We pretend that we have just taken it off a longer list, so
         * we assign bio_list to the next (which is NULL) and bio_tail
         * to &bio_list, thus initialising the bio_list of new bios to be
         * added.  __generic_make_request may indeed add some more bios
         * through a recursive call to generic_make_request.  If it
         * did, we find a non-NULL value in bio_list and re-enter the loop
         * from the top.  In this case we really did just take the bio
         * off the top of the list (no pretending) and so fixup bio_list and
         * bio_tail or bi_next, and call into __generic_make_request again.
         *
         * The loop was structured like this to make only one call to
         * __generic_make_request (which is important as it is large and
         * inlined) and to keep the structure simple.
         */
        BUG_ON(bio->bi_next);
       
        do {
                current->bio_list = bio->bi_next;
               
                if (bio->bi_next == NULL)
                        current->bio_tail = &current->bio_list;
                else
                        bio->bi_next = NULL;
               
                __generic_make_request(bio);
                bio = current->bio_list;
        } while (bio);

        current->bio_tail = NULL; /* deactivate */
}
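
The long comment in generic_make_request() describes how recursion is turned into iteration. A user-space model makes the effect visible. All names below (model_bio, model_submit(), model_handle_one(), remaps_left and the counters) are invented for this sketch; model_handle_one() stands in for __generic_make_request() and for a stacking driver that resubmits the bio.

```c
#include <assert.h>
#include <stddef.h>

struct model_bio {
        struct model_bio *bi_next;
        int remaps_left;        /* how often a stacked driver resubmits */
};

static struct model_bio *bio_list;    /* like current->bio_list */
static struct model_bio **bio_tail;   /* like current->bio_tail; non-NULL
                                         while a submission is active */
static int depth, max_depth, handled;

static void model_submit(struct model_bio *bio);

/* Stand-in for __generic_make_request(). */
static void model_handle_one(struct model_bio *bio)
{
        depth++;
        if (depth > max_depth)
                max_depth = depth;
        handled++;

        if (bio->remaps_left > 0) {
                bio->remaps_left--;
                model_submit(bio);      /* the "recursive" call */
        }
        depth--;
}

static void model_submit(struct model_bio *bio)
{
        if (bio_tail) {                 /* already active: just queue it */
                *bio_tail = bio;
                bio->bi_next = NULL;
                bio_tail = &bio->bi_next;
                return;
        }

        assert(bio->bi_next == NULL);

        do {
                bio_list = bio->bi_next;
                if (bio->bi_next == NULL)
                        bio_tail = &bio_list;
                else
                        bio->bi_next = NULL;

                model_handle_one(bio);
                bio = bio_list;
        } while (bio);

        bio_tail = NULL;                /* deactivate */
}
```

A bio that is resubmitted three times is handled four times in total, yet the handler never nests: the inner model_submit() calls only append to the list and return, and the outer loop picks the bio up again.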

static inline void __generic_make_request(struct bio *bio)
{
        struct request_queue *q;
        sector_t old_sector;
        int ret, nr_sectors = bio_sectors(bio);
        dev_t old_dev;
        int err = -EIO;

        might_sleep();
        /* 1,
         * check for overflow past the end of the device
         */
        if (bio_check_eod(bio, nr_sectors))
                goto end_io;

        /*
         * Resolve the mapping until finished. (drivers are
         * still free to implement/resolve their own stacking
         * by explicitly returning 0)
         *
         * NOTE: we don't repeat the blk_size check for each new device.
         * Stacking drivers are expected to know what they are doing.
         */
        old_sector = -1;
        old_dev = 0;
        do {
                char b[BDEVNAME_SIZE];
                /* 2,
                 * determine physical disk
                 */
                q = bdev_get_queue(bio->bi_bdev);
                if (!q) {
                        printk(KERN_ERR
                               "generic_make_request: Trying to access "
                                "nonexistent block-device %s (%Lu)\n",
                                bdevname(bio->bi_bdev, b),
                                (long long) bio->bi_sector);
end_io:
                        bio_endio(bio, err);
                        break;
                }
                /* 3,
                 * does the bio exceed the controller's capability?
                 */
                if (unlikely(nr_sectors > q->max_hw_sectors)) {
                        printk(KERN_ERR "bio too big device %s (%u > %u)\n",
                                bdevname(bio->bi_bdev, b),
                                bio_sectors(bio),
                                q->max_hw_sectors);
                        goto end_io;
                }
                /* 4,
                 * is the elevator still healthy?
                 */
                if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
                        goto end_io;
                /* 5,
                 * is this just a poke, i.e. a deliberately failed request?
                 */
                if (should_fail_request(bio))
                        goto end_io;

                /* 6,
                 * If this device has partitions, remap block n
                 * of partition p to block n+start(p) of the disk.
                 */
                blk_partition_remap(bio);

                if (old_sector != -1)
                        blk_add_trace_remap(q, bio, old_dev, bio->bi_sector,
                                            old_sector);

                blk_add_trace_bio(q, bio, BLK_TA_QUEUE);

                old_sector = bio->bi_sector;
                old_dev = bio->bi_bdev->bd_dev;

                if (bio_check_eod(bio, nr_sectors))
                        goto end_io;
                /* 7,
                 * an empty barrier, but the queue has no way to flush
                 */
                if (bio_empty_barrier(bio) && !q->prepare_flush_fn) {
                        err = -EOPNOTSUPP;
                        goto end_io;
                }
                /* 8,
                 * do it, but where is make_request_fn assigned?
                 * It is installed through
                 * void blk_queue_make_request(struct request_queue *q,
                 *                                 make_request_fn *mfn)
                 * And __make_request? It is the default method,
                 * installed when the queue is initialized.
                 */
                ret = q->make_request_fn(q, bio);
        } while (ret);
}
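
To make note 8 concrete, here is a minimal user-space sketch of the dispatch loop. It is not kernel code: every toy_* name is made up for illustration, and it assumes only what the listing shows, namely that the hook is stored per queue and that a nonzero return means "the bio was redirected, submit it again".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct request_queue and struct bio. */
struct toy_queue;

struct toy_bio {
        struct toy_queue *q;      /* queue the bio is currently aimed at */
        unsigned long sector;     /* start sector on that queue's device */
};

typedef int (toy_make_request_fn)(struct toy_queue *q, struct toy_bio *bio);

struct toy_queue {
        toy_make_request_fn *make_request_fn;
        struct toy_queue *lower;  /* stacked device: the queue underneath */
        unsigned long offset;     /* sector offset applied when remapping */
        int submitted;            /* bios consumed by this queue */
};

/* Plays the role of blk_queue_make_request(): it only stores the hook. */
static void toy_queue_make_request(struct toy_queue *q,
                                   toy_make_request_fn *mfn)
{
        q->make_request_fn = mfn;
}

/* Plays the role of __make_request: consumes the bio and returns 0. */
static int toy_default_make_request(struct toy_queue *q, struct toy_bio *bio)
{
        (void)bio;
        q->submitted++;
        return 0;
}

/* A stacking hook: redirect the bio to the lower queue, return nonzero. */
static int toy_remap_make_request(struct toy_queue *q, struct toy_bio *bio)
{
        bio->sector += q->offset;
        bio->q = q->lower;
        return 1;
}

/* Same shape as the do { ... } while (ret) loop in the listing. */
static void toy_generic_make_request(struct toy_bio *bio)
{
        int ret;

        do {
                struct toy_queue *q = bio->q;

                assert(q != NULL && q->make_request_fn != NULL);
                ret = q->make_request_fn(q, bio);
        } while (ret);
}
```

With one remapping queue stacked on one default queue, a bio aimed at the upper queue goes around the loop twice: the first pass shifts its sector and retargets it, the second pass lets the lower queue consume it.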

static int __make_request(struct request_queue *q, struct bio *bio)
{
        struct request *req;
        int el_ret, nr_sectors, barrier, err;
        const unsigned short prio = bio_prio(bio);
        const int sync = bio_sync(bio);
        int rw_flags;

        nr_sectors = bio_sectors(bio);

        /*
         * low level driver can indicate that it wants pages above a
         * certain limit bounced to low memory (ie for highmem, or even
         * ISA dma in theory)
         */
        blk_queue_bounce(q, &bio);

        barrier = bio_barrier(bio);
       
        if (unlikely(barrier) && (q->next_ordered == QUEUE_ORDERED_NONE)) {
                err = -EOPNOTSUPP;
                goto end_io;
        }

        spin_lock_irq(q->queue_lock);
        /* 1,
         * a barrier bio is never merged, and an empty
         * queue has nothing to merge with
         */
        if (unlikely(barrier) || elv_queue_empty(q))
                goto get_rq;
        /* 2,
         * try to merge the bio into an existing request
         */
        el_ret = elv_merge(q, &req, bio);
       
        switch (el_ret) {
        case ELEVATOR_BACK_MERGE:
                BUG_ON(!rq_mergeable(req));

                if (!ll_back_merge_fn(q, req, bio))
                        break;

                blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);

                req->biotail->bi_next = bio;
                req->biotail = bio;
                req->nr_sectors = req->hard_nr_sectors += nr_sectors;
                req->ioprio = ioprio_best(req->ioprio, prio);
                drive_stat_acct(req, 0);
               
                if (!attempt_back_merge(q, req))
                        elv_merged_request(q, req, el_ret);
                goto out;

        case ELEVATOR_FRONT_MERGE:
                BUG_ON(!rq_mergeable(req));

                if (!ll_front_merge_fn(q, req, bio))
                        break;

                blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);

                bio->bi_next = req->bio;
                req->bio = bio;

                /*
                 * may not be valid. if the low level driver said
                 * it didn't need a bounce buffer then it better
                 * not touch req->buffer either...
                 */
                req->buffer = bio_data(bio);
                req->current_nr_sectors = bio_cur_sectors(bio);
                req->hard_cur_sectors = req->current_nr_sectors;
                req->sector = req->hard_sector = bio->bi_sector;
                req->nr_sectors = req->hard_nr_sectors += nr_sectors;
                req->ioprio = ioprio_best(req->ioprio, prio);
                drive_stat_acct(req, 0);
                if (!attempt_front_merge(q, req))
                        elv_merged_request(q, req, el_ret);
                goto out;

        /* ELV_NO_MERGE: elevator says don't/can't merge. */
        default:
                break;
        }

get_rq:
        /*
         * This sync check and mask will be re-done in init_request_from_bio(),
         * but we need to set it earlier to expose the sync flag to the
         * rq allocator and io schedulers.
         */
        rw_flags = bio_data_dir(bio);
        if (sync)
                rw_flags |= REQ_RW_SYNC;

        /* 3,
         * Grab a free request. This might sleep but cannot fail.
         * Returns with the queue unlocked.
         */
        req = get_request_wait(q, rw_flags, bio);

        /*
         * After dropping the lock and possibly sleeping here, our request
         * may now be mergeable after it had proven unmergeable (above).
         *
         * We don't worry about that case for efficiency. It won't happen
         * often, and the elevators are able to handle it.
         */
        init_request_from_bio(req, bio);

        spin_lock_irq(q->queue_lock);
       
        if (elv_queue_empty(q))
                blk_plug_device(q);
        /* 4,
         * add the new request to the queue, where it
         * waits to be dispatched
         */
        add_request(q, req);
out:
        if (sync) {
                /* 5,
                 * sync I/O: unplug right away to get the disk
                 * spinning, even with gas above $100
                 */
                __generic_unplug_device(q);
        }
        spin_unlock_irq(q->queue_lock);
        return 0;
end_io:
        bio_endio(bio, err);
        return 0;
}
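
The back/front merge cases in the switch above come down to sector adjacency. Below is a rough user-space sketch of that decision only. The toy_* names and the single max_sectors cap are my assumptions; the real ll_back_merge_fn/ll_front_merge_fn checks also look at segment counts, which is left out here.

```c
#include <assert.h>

/* Hypothetical mirror of the ELEVATOR_*_MERGE return values. */
enum toy_merge { TOY_NO_MERGE, TOY_FRONT_MERGE, TOY_BACK_MERGE };

/* Hypothetical stand-in for struct request: just a sector range. */
struct toy_request {
        unsigned long sector;      /* first sector covered */
        unsigned long nr_sectors;  /* length in sectors */
};

/*
 * Try to merge a bio covering [bio_sector, bio_sector + bio_nr) into req.
 * On success req is grown in place, like the nr_sectors updates above.
 */
static enum toy_merge toy_try_merge(struct toy_request *req,
                                    unsigned long bio_sector,
                                    unsigned long bio_nr,
                                    unsigned long max_sectors)
{
        assert(bio_nr > 0);

        /* The merged request must still fit under the size cap. */
        if (req->nr_sectors + bio_nr > max_sectors)
                return TOY_NO_MERGE;

        /* Back merge: the bio starts exactly where the request ends. */
        if (req->sector + req->nr_sectors == bio_sector) {
                req->nr_sectors += bio_nr;
                return TOY_BACK_MERGE;
        }

        /* Front merge: the bio ends exactly where the request starts. */
        if (bio_sector + bio_nr == req->sector) {
                req->sector = bio_sector;
                req->nr_sectors += bio_nr;
                return TOY_FRONT_MERGE;
        }

        return TOY_NO_MERGE;
}
```

Note that a front merge also moves the start sector of the request, which matches the req->sector = req->hard_sector = bio->bi_sector assignment in the ELEVATOR_FRONT_MERGE case.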



