background

这几天经常无法ssh登录到自己的服务器 , 通过htop发现console-kit-daemon这个 进程占用了100%的cpu , 问题应该就是它 .

what is console-kit-daemon

首先 , 弄清楚它是什么 :

ConsoleKit is a framework for keeping track of the various users, sessions, and seats present on a system. It provides a mechanism for software to react to changes of any of these items or of any of the metadata associated with them.

也就是说 , 当通过ssh登录系统 , 从consolekit会为我们准备一个运行环境 , 并通过一些抽象 ( session , seat ) 进行管理 . 当我们退出时 , 这些环境 将被销毁 .

ck-remove-dirtory

查看syslog , 发现每当退出时, 都会出现以下log :

traps: ck-remove-direc[23119] trap int3 ip:7f5d94016750 sp:7ffc5a95fd90 error:0

从名字来看 , 它好像用于删除一个目录 , 那么到底是删除哪个目录呢 ? 又是谁调用它的呢 ? 要回答这些问题 , 只有查看consolekit的代码 , 通过搜索 , 发现是在这里 :

static gboolean
remove_rundir (guint uid, const gchar *dest)
{
        gchar   *command;
        GError  *error = NULL;
        gboolean res;

        TRACE ();

        g_return_val_if_fail (dest, FALSE);

        if (uid < 1) {
                g_debug ("We didn't create a runtime dir for root, nothing to remove");
                return FALSE;
        }

        command = g_strdup_printf (LIBEXECDIR "/ck-remove-directory --uid=%d --dest=%s", uid, dest);

        res = g_spawn_command_line_sync (command, NULL, NULL, NULL, &error);

        if (! res) {
                g_warning ("Unable to remove user runtime dir '%s' error was: %s", dest, error->message);
                g_clear_error (&error);
                return FALSE;
        }

        return TRUE;
}

而调用它的地方有两处 , 首先是当用户退出时 :

gboolean
ck_remove_runtime_dir_for_user (guint uid)
{
        gchar        *dest;

        TRACE ();

        dest = get_rundir (uid);

        /* attempt to remove the tmpfs */
        ck_remove_tmpfs (uid, dest);

        /* remove the user's runtime dir now that all user sessions
         * are gone */
        remove_rundir (uid, dest);

        g_free (dest);

        return TRUE;
}

第二个新用户登录时为其准备环境时 , 发现之前旧的环境还没有被删除时 , 将其删除 :

gchar *
ck_generate_runtime_dir_for_user (guint uid)
{
	...

    dest = get_rundir (uid);

    /* Ensure any files from the last session are removed */
    if (g_file_test (dest, G_FILE_TEST_EXISTS) == TRUE) {
            remove_rundir (uid, dest);
    }

而且 , 删除的目录是用户的动态运行的临时目录 , 在目录在linux上就是 /var/run/user/<uid> , 而我用来登录的uid为1000 , 所以动态运行的临时目录 就是 /var/run/user/1000 . 那我们手动调用 ck-remove-directory 来删除该目录 :

/usr/lib/ConsoleKit/ck-remove-directory --uid=1000 --dest=/var/run/user/1000

** (ck-remove-directory:28130): ERROR **: Failed to remove /var/run/user/1000, reason was: Permission denied
[1]    28130 trace trap  ./tools/ck-remove-directory --uid=1000 --dest=/var/run/user/1000

原来是permission deny , 那我们来看看该目录的权限 :

ls -ld /var/run/user/1000
drwx------ 2 tw root 40 Sep 21 15:06 /var/run/user/1000

它的父目录的权限 :

ls -ld /var/run/user/
drwxr-xr-x 3 root root 60 Sep 21 15:06 /var/run/user/

看来 , 的确时权限不够, ck-remove-directory 程序会以uid去删除该目录 . 通过查看consolekit的issue , 发现已经有人碰到了同样的问题 , 同时也提供了 解决方法, 具体可以参见这个PR. 它的解决思路是以uid去删除 /var/run/user/<uid> 目录下的内容 , 然后再以调用 该程序的用户 ( 通常是root ) 去删除 /var/run/user/<uid> 目录本身.

session hash table

当consolekit占用100%cpu时 , 通过gdb看看它在做什么:

#0 0x0000000000410bea in ck_seat_remove_session (seat=seat@entry=0x7fffe8011590, session=0x6680c0, error=error@entry=0x0) at ck-seat.c:701
#1 0x000000000040ab67 in remove_session_for_cookie (manager=0x6590f0, cookie=cookie@entry=0x65dae0 "totorow-1474285598.732068-49893253", error=error@entry=0x0) at ck-manager.c:3280
#2 0x000000000040b4e5 in remove_leader_for_connection (cookie=0x65dae0 "totorow-1474285598.732068-49893253", leader=0x64d4f0, data=0x7fffffffdf10) at ck-manager.c:3436
#3 0x00007ffff72f18d1 in g_hash_table_foreach_remove_or_steal (hash_table=0x652860, func=func@entry=0x40b4b0 <remove_leader_for_connection>, user_data=user_data@entry=0x7fffffffdf10, 
    notify=notify@entry=1) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/ghash.c:1491
#4 0x00007ffff72f298c in g_hash_table_foreach_remove (hash_table=<optimized out>, func=func@entry=0x40b4b0 <remove_leader_for_connection>, user_data=user_data@entry=0x7fffffffdf10)
    at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/ghash.c:1537
#5 0x000000000040c380 in remove_sessions_for_connection (service_name=<optimized out>, manager=0x6590f0) at ck-manager.c:3455
#6 on_name_owner_notify (connection=<optimized out>, sender_name=<optimized out>, object_path=<optimized out>, interface_name=<optimized out>, signal_name=<optimized out>, parameters=0x672aa0, 
    user_data=0x6590f0) at ck-manager.c:3491
#7 0x00007ffff790b9f5 in emit_signal_instance_in_idle_cb (data=0x7fffec0076f0) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/gio/gdbusconnection.c:3701
#8 0x00007ffff7302a55 in g_main_dispatch (context=0x64ac00) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3154
#9 g_main_context_dispatch (context=context@entry=0x64ac00) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3769
#10 0x00007ffff7302dc8 in g_main_context_iterate (context=0x64ac00, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3840
#11 0x00007ffff730308a in g_main_loop_run (loop=0x652d20) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:4034
#12 0x0000000000408840 in main (argc=1, argv=0x7fffffffe288) at main.c:307

通过查看代码 , 这里是从session hash table中删除对应的表项 .

也就是说hash表被破坏了, 而且出问题之前, 也有如下log :

console-kit-daemon[29110]: GLib-CRITICAL: g_hash_table_foreach_remove_or_steal: assertion 'version == hash_table->version' failed

查看相关代码 :

static guint
g_hash_table_foreach_remove_or_steal (GHashTable *hash_table,
                                      GHRFunc     func,
                                      gpointer    user_data,
                                      gboolean    notify)
{
  guint deleted = 0;
  gint i;
#ifndef G_DISABLE_ASSERT
  gint version = hash_table->version;
#endif

  for (i = 0; i < hash_table->size; i++)
    {
      guint node_hash = hash_table->hashes[i];
      gpointer node_key = hash_table->keys[i];
      gpointer node_value = hash_table->values[i];

      if (HASH_IS_REAL (node_hash) &&
          (* func) (node_key, node_value, user_data))
        {
          g_hash_table_remove_node (hash_table, i, notify);
          deleted++;
        }

#ifndef G_DISABLE_ASSERT
      g_return_val_if_fail (version == hash_table->version, 0);
#endif
    }

  g_hash_table_maybe_resize (hash_table);

#ifndef G_DISABLE_ASSERT
  if (deleted > 0)
    hash_table->version++;
#endif

  return deleted;
}

说明当删除hash表项时 , hash表发生了变化 . 可能出现该问题 , 有可能是多线程的问题 , 也就是说多个线程同时在操作同一个 hash table . 通过查看代码 , 所有对该table的操作都在一个线程中 , 所以排除 该可能性 .

通过在 g_hash_table_foreach_remove_or_steal 设置断点 , 出问题之前会有如下的 call stack :

#0  0x00007ffff72f18d1 in g_hash_table_foreach_remove_or_steal (hash_table=0x652860, func=func@entry=0x40b4b0 <remove_leader_for_connection>, user_data=user_data@entry=0x7fffffffd6c0, 
    notify=notify@entry=1) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/ghash.c:1491
#1  0x00007ffff72f298c in g_hash_table_foreach_remove (hash_table=<optimized out>, func=func@entry=0x40b4b0 <remove_leader_for_connection>, user_data=user_data@entry=0x7fffffffd6c0)
    at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/ghash.c:1537
#2  0x000000000040c380 in remove_sessions_for_connection (service_name=<optimized out>, manager=0x6590f0) at ck-manager.c:3455
#3  on_name_owner_notify (connection=<optimized out>, sender_name=<optimized out>, object_path=<optimized out>, interface_name=<optimized out>, signal_name=<optimized out>, 
    parameters=0x7fffec00d630, user_data=0x6590f0) at ck-manager.c:3491
#4  0x00007ffff790b9f5 in emit_signal_instance_in_idle_cb (data=0x7fffec00d580) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/gio/gdbusconnection.c:3701
#5  0x00007ffff7302a55 in g_main_dispatch (context=0x64ac00) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3154
#6  g_main_context_dispatch (context=context@entry=0x64ac00) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3769
#7  0x00007ffff7302dc8 in g_main_context_iterate (context=context@entry=0x64ac00, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3840
#8  0x00007ffff7302e6c in g_main_context_iteration (context=0x64ac00, context@entry=0x0, may_block=may_block@entry=1) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3901
#9 0x0000000000414c5c in ck_run_programs (dirpath=dirpath@entry=0x429f18 "/usr/lib64/ConsoleKit/run-session.d", action=action@entry=0x427037 "session_removed", 
    extra_env=extra_env@entry=0x7fffffffd8b0) at ck-run-programs.c:220
#10 0x000000000041416c in ck_session_run_programs (session=session@entry=0x6680c0, action=action@entry=0x427037 "session_removed") at ck-session.c:1330
#11 0x000000000040a553 in on_seat_session_removed_full (seat=0x7fffe8011590, session=0x6680c0, manager=0x6590f0) at ck-manager.c:2463
#12 0x00007ffff76030d7 in g_cclosure_marshal_VOID__OBJECTv (closure=0x670470, return_value=<optimized out>, instance=<optimized out>, args=<optimized out>, marshal_data=0x0, 
    n_params=<optimized out>, param_types=0x66c8f0) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/gobject/gmarshal.c:2102
#13 0x00007ffff7600237 in _g_closure_invoke_va (closure=closure@entry=0x670470, return_value=return_value@entry=0x0, instance=instance@entry=0x7fffe8011590, args=args@entry=0x7fffffffdc78, 
    n_params=1, param_types=0x66c8f0) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/gobject/gclosure.c:864
#14 0x00007ffff7618f88 in g_signal_emit_valist (instance=0x7fffe8011590, signal_id=<optimized out>, detail=0, var_args=var_args@entry=0x7fffffffdc78)
    at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/gobject/gsignal.c:3292
#15 0x00007ffff7619c0a in g_signal_emit (instance=instance@entry=0x7fffe8011590, signal_id=<optimized out>, detail=detail@entry=0)
    at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/gobject/gsignal.c:3439
#16 0x0000000000410bea in ck_seat_remove_session (seat=seat@entry=0x7fffe8011590, session=0x6680c0, error=error@entry=0x0) at ck-seat.c:701
#17 0x000000000040ab67 in remove_session_for_cookie (manager=0x6590f0, cookie=cookie@entry=0x65dae0 "totorow-1474285598.732068-49893253", error=error@entry=0x0) at ck-manager.c:3280
#18 0x000000000040b4e5 in remove_leader_for_connection (cookie=0x65dae0 "totorow-1474285598.732068-49893253", leader=0x64d4f0, data=0x7fffffffdf10) at ck-manager.c:3436
#19 0x00007ffff72f18d1 in g_hash_table_foreach_remove_or_steal (hash_table=0x652860, func=func@entry=0x40b4b0 <remove_leader_for_connection>, user_data=user_data@entry=0x7fffffffdf10, 
    notify=notify@entry=1) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/ghash.c:1491
#20 0x00007ffff72f298c in g_hash_table_foreach_remove (hash_table=<optimized out>, func=func@entry=0x40b4b0 <remove_leader_for_connection>, user_data=user_data@entry=0x7fffffffdf10)
    at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/ghash.c:1537
#21 0x000000000040c380 in remove_sessions_for_connection (service_name=<optimized out>, manager=0x6590f0) at ck-manager.c:3455
#22 on_name_owner_notify (connection=<optimized out>, sender_name=<optimized out>, object_path=<optimized out>, interface_name=<optimized out>, signal_name=<optimized out>, parameters=0x672aa0, 
    user_data=0x6590f0) at ck-manager.c:3491
#23 0x00007ffff790b9f5 in emit_signal_instance_in_idle_cb (data=0x7fffec0076f0) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/gio/gdbusconnection.c:3701
#24 0x00007ffff7302a55 in g_main_dispatch (context=0x64ac00) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3154
#25 g_main_context_dispatch (context=context@entry=0x64ac00) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3769
#26 0x00007ffff7302dc8 in g_main_context_iterate (context=0x64ac00, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:3840
#27 0x00007ffff730308a in g_main_loop_run (loop=0x652d20) at /var/tmp/portage/dev-libs/glib-2.46.2-r3/work/glib-2.46.2/glib/gmain.c:4034
#28 0x0000000000408840 in main (argc=1, argv=0x7fffffffe288) at main.c:307

原来 , 这里出现了recursive call !!! 通过分析该部分代码 , 当从一个hash table删除一个session时,会执行一些其他废时操作 , 为了性能的考虑 , 在这些操作执行时 , 主线程仍然可以处理其他请求 , 所以当一个session还没有完全删除时 , 另一个session被删除了 , 这样, hash table有可能被resize , 那么就会触发之前的assertion . 那么, 原来的操作仍会使用已经被释放的hash table . 所以一些unexpected结果就会发生.

conclusion

为了防止recursive call, 解决的思路是先将从hash table中将相关的表项删除, 将相关的清理操作放在之后执行, 这样即使出现了recursive call发生 , 对hash table的操作也不会被破坏 , 具体参见该 PR. 问题解决 !

FIN.