[Bug 60296] RMM list corruption in ldap module results in server hang


https://bz.apache.org/bugzilla/show_bug.cgi?id=60296

--- Comment #12 from Rafael David Tinoco <rafael.tinoco@xxxxxxxxxxxxx> ---
Thread #19 7092 (Suspended : Container)
        kill() at syscall-template.S:84 0x7ff7e9911767
        <signal handler called>() at 0x7ff7e9cb7390
        find_block_of_size() at apr_rmm.c:106 0x7ff7ea10e25a
        apr_rmm_calloc() at apr_rmm.c:342 0x7ff7ea10ea68
        util_ald_alloc() at util_ldap_cache_mgr.c:105 0x7ff7e3369b3d
        util_ldap_compare_node_copy() at util_ldap_cache.c:257 0x7ff7e3369784
        util_ald_cache_insert() at util_ldap_cache_mgr.c:501 0x7ff7e336a310
        uldap_cache_compare() at util_ldap.c:1183 0x7ff7e33662d3
        ldapgroup_check_authorization() at mod_authnz_ldap.c:925 0x7ff7e8459937
        apply_authz_sections() at mod_authz_core.c:737 0x7ff7e4bb99fa
        apply_authz_sections() at mod_authz_core.c:751 0x7ff7e4bb9c01
        authorize_user_core() at mod_authz_core.c:840 0x7ff7e4bb9dca
        ap_run_auth_checker() at request.c:91 0x56127e692f00
        ap_process_request_internal() at request.c:335 0x56127e695d57
        ap_process_async_request() at http_request.c:408 0x56127e6b4690
        ap_process_request() at http_request.c:445 0x56127e6b4850
        ap_process_http_sync_connection() at http_core.c:210 0x56127e6b091e
        ap_process_http_connection() at http_core.c:251 0x56127e6b091e
        ap_run_process_connection() at connection.c:41 0x56127e6a6bf0
        ap_process_connection() at connection.c:213 0x56127e6a7000
        process_socket() at worker.c:631 0x7ff7e2f51f8b
        worker_thread() at worker.c:990 0x7ff7e2f51f8b
        start_thread() at pthread_create.c:333 0x7ff7e9cad6ba
        clone() at clone.S:109 0x7ff7e99e341d

--------------------------------------------------------------------------------

static apr_rmm_off_t find_block_of_size(apr_rmm_t *rmm, apr_size_t size)
{
    apr_rmm_off_t next = rmm->base->firstfree;
    apr_rmm_off_t best = 0;
    apr_rmm_off_t bestsize = 0;

    while (next) {
        struct rmm_block_t *blk = (rmm_block_t*)((char*)rmm->base + next);
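
A rough, self-contained sketch of the kind of walk the excerpt above performs,
assuming a simplified block header. demo_block_t, demo_put(), demo_find_block()
and the max_steps guard are hypothetical names for illustration, not APR code.
It shows that if the "next" offsets of a corrupted free list form a cycle, a
loop of this shape never reaches offset 0, which is one way a thread could end
up stuck in find_block_of_size():

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified free-block header. Offsets are relative to the
 * start of the segment and offset 0 means "end of list". */
typedef struct {
    size_t size;   /* size of this free block */
    size_t next;   /* offset of the next free block, 0 terminates */
} demo_block_t;

/* Write a block header at a given offset inside the segment. */
static void demo_put(char *base, size_t off, size_t size, size_t next)
{
    demo_block_t b;
    b.size = size;
    b.next = next;
    memcpy(base + off, &b, sizeof b);
}

/* Best-fit search over the free list. The max_steps guard is NOT something
 * the excerpt above has; it exists only so the sketch can report a cycle
 * instead of spinning forever. */
static size_t demo_find_block(const char *base, size_t firstfree,
                              size_t size, size_t max_steps, int *cycle)
{
    size_t next = firstfree, best = 0, bestsize = 0, steps = 0;

    *cycle = 0;
    while (next) {
        demo_block_t blk;
        memcpy(&blk, base + next, sizeof blk);

        if (++steps > max_steps) {   /* list never terminated */
            *cycle = 1;
            return 0;
        }
        if (blk.size == size)
            return next;             /* exact fit */
        if (blk.size > size && (!bestsize || blk.size < bestsize)) {
            bestsize = blk.size;     /* smallest block that still fits */
            best = next;
        }
        next = blk.next;
    }
    return best;
}
```

Without the guard, the only exits are an exact-size match or a terminating
offset of 0, so a cyclic list with no exact match loops indefinitely.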

--------------------------------------------------------------------------------

The following code touches "rmm->base->firstfree" (capable of racing):

apr_rmm_destroy()
 - ...

apr_rmm_init()
 - ...

find_block_of_size()
 - apr_rmm_calloc()
 - apr_rmm_malloc()

move_block()
 - apr_rmm_calloc()
 - apr_rmm_free()
 - apr_rmm_malloc()

--------------------------------------------------------------------------------

All APR calls into RMM must either take APR_ANYLOCK_LOCK(&rmm->lock), with a
lock that is not apr_anylock_none, or have the locking guaranteed by the
caller. In our case the caller is uldap_cache_compare(), using the
LDAP_CACHE_LOCK() macro.

The LDAP_CACHE_LOCK() macro is:

        do {
                if (st->util_ldap_cache_lock)
                        apr_global_mutex_lock(st->util_ldap_cache_lock);
        } while (0);
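
To illustrate what the guard in this macro implies, here is a minimal sketch
with stand-in types. demo_mutex_t, demo_state_t and the DEMO_ macros are
hypothetical and are not httpd code. If the state struct that reaches the
macro carries a NULL lock pointer, the critical section runs with no lock held
and no error is reported:

```c
#include <stddef.h>

/* Stand-in for a mutex: only records whether it is held. */
typedef struct { int held; int lock_calls; } demo_mutex_t;

/* Stand-in for the per-server state struct. */
typedef struct { demo_mutex_t *cache_lock; } demo_state_t;

static void demo_mutex_lock(demo_mutex_t *m)   { m->held = 1; m->lock_calls++; }
static void demo_mutex_unlock(demo_mutex_t *m) { m->held = 0; }

/* Same shape as the macro above: locking is skipped when the pointer is NULL. */
#define DEMO_CACHE_LOCK(st) do {                    \
        if ((st)->cache_lock)                       \
            demo_mutex_lock((st)->cache_lock);      \
    } while (0)

#define DEMO_CACHE_UNLOCK(st) do {                  \
        if ((st)->cache_lock)                       \
            demo_mutex_unlock((st)->cache_lock);    \
    } while (0)

/* Returns 1 if the critical section ran under the lock, 0 if unprotected. */
static int demo_critical_section(demo_state_t *st)
{
    int was_protected;

    DEMO_CACHE_LOCK(st);
    was_protected = (st->cache_lock != NULL && st->cache_lock->held);
    DEMO_CACHE_UNLOCK(st);
    return was_protected;
}
```

This is only meant to show the behaviour of the NULL check, not to claim that
it is what happened in this dump, where the lock pointer was set.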

Where st is the ldap state struct, obtained by ap_get_module_config() on the
ldap module. "util_ldap_cache_lock" is the global apr mutex for this module and
it is set by:

    st->util_ldap_cache_lock = base->util_ldap_cache_lock;

in util_ldap_merge_config(), AND (re)initialized in util_ldap_post_config() if,
after the merge, the state still does NOT contain "cache_shm" and
APR_HAS_SHARED_MEMORY was defined (cache settings are inherited by the virtual
hosts; all servers use the same shared memory cache).

If re-initialized by util_ldap_post_config, util_ldap_cache_lock is a mutex
created with ap_global_mutex_create using a "ldap-cache" type.

#if APR_HAS_SHARED_MEMORY
    if (!st->cache_shm) {
#endif

If APR_HAS_SHARED_MEMORY is not set, it always initializes the lock like this.

        result = ap_global_mutex_create(&st->util_ldap_cache_lock, NULL,
                         ldap_cache_mutex_type, NULL, s, p, 0);

Since the util_ldap_cache_lock->proc_mutex->fname is this:

        0x7ff7ea75af38 == "/var/lock/apache2/ldap-cache.1368"

We know for sure that the lock in use is the "ldap-cache" one and that it was
created as above, since the file name is derived from the lock type string.

--------------------------------------------------------------------------------

We can say for sure that the ldap global lock, st->util_ldap_cache_lock, was
created by util_ldap_post_config(), and that the locking method was obtained
from the mxcfg_lookup() logic when calling ap_global_mutex_create() with the
type ldap_cache_mutex_type (the "ldap-cache" string).

The ldap_cache_mutex_type variable ("ldap-cache") is a global, configured or
used by functions registered in util_ldap_register_hooks():

util_ldap_child_init()                  - n/a
util_ldap_post_config()                 - creates util_ldap_cache_lock based on it
util_ldap_pre_config()                  - registers the ldap_cache_mutex_type

util_ldap_cache_lock is used by all uldap_cache_XXXXX functions calling
LDAP_CACHE_LOCK() and by the following functions:

util_ldap_child_init()                  - n/a
util_ldap_merge_config()                - set to base (server) config
util_ldap_post_config()                 - created based on ldap_cache_mutex_type

--------------------------------------------------------------------------------

Inside util_ldap_post_config() -> ap_global_mutex_create() we call the function
get_mutex_filename(). It checks whether the mutex mech(anism) needs a backing
file or not (only if mech == APR_LOCK_FLOCK or APR_LOCK_FCNTL).

My dump had
st->util_ldap_cache_lock->proc_mutex->fname = /var/lock/apache2/ldap-cache.1368

Meaning that my lock was, for sure, APR_LOCK_FLOCK or, likely, APR_LOCK_FCNTL.

--------------------------------------------------------------------------------

From our compilation flags:

$ apachectl -V

Server version: Apache/2.4.18 (Ubuntu)
Server built:   2018-03-01T18:29:12
Server's Module Magic Number: 20120211:52
Server loaded:  APR 1.5.2, APR-UTIL 1.5.4
Compiled using: APR 1.5.2, APR-UTIL 1.5.4
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"

--------------------------------------------------------------------------------

Since we've registered this locking mechanism, ldap_cache_mutex_type, as
"APR_LOCK_DEFAULT", we have to check how the APR_LOCK_DEFAULT logic is defined
inside APR lib.

The full call path is this:

util_ldap_post_config() >>> ap_global_mutex_create() >>>
 apr_global_mutex_create() >>> apr_proc_mutex_create() >>> proc_mutex_create()
 >>> proc_mutex_choose_method()

And the "mech" used in util_ldap_post_config() to decide the type of mutex comes
from the ap_global_mutex_create() function, which does the mxcfg_lookup() with
the hash key "ldap-cache" (as I showed before, APR_LOCK_DEFAULT).

Then the proc_mutex_choose_method() function decides what locking type to use:

    switch (mech) {
    ...
        case APR_LOCK_DEFAULT:
        #if APR_USE_FLOCK_SERIALIZE
                new_mutex->inter_meth = &mutex_flock_methods;
        #elif APR_USE_SYSVSEM_SERIALIZE
                new_mutex->inter_meth = &mutex_sysv_methods;
        #elif APR_USE_FCNTL_SERIALIZE
                new_mutex->inter_meth = &mutex_fcntl_methods;
        #elif APR_USE_PROC_PTHREAD_SERIALIZE
                new_mutex->inter_meth = &mutex_proc_pthread_methods;
        #elif APR_USE_POSIXSEM_SERIALIZE
                new_mutex->inter_meth = &mutex_posixsem_methods;
        #else
                return APR_ENOTIMPL;
        #endif
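
As a hedged illustration of how this #if/#elif chain resolves, here is a small
stand-alone sketch. The DEMO_ macros, demo_mech_t and demo_choose_method() are
hypothetical names; the sysvsem = 1 and fcntl = 0 values are the ones from the
apr.h shown later in this report, and the remaining three are assumed to be 0:

```c
#include <stddef.h>

/* sysvsem = 1 and fcntl = 0 come from the apr.h in this report;
 * the other three values are assumptions made for the sketch. */
#define DEMO_USE_FLOCK_SERIALIZE        0
#define DEMO_USE_SYSVSEM_SERIALIZE      1
#define DEMO_USE_FCNTL_SERIALIZE        0
#define DEMO_USE_PROC_PTHREAD_SERIALIZE 0
#define DEMO_USE_POSIXSEM_SERIALIZE     0

typedef enum {
    DEMO_LOCK_FCNTL,
    DEMO_LOCK_FLOCK,
    DEMO_LOCK_SYSVSEM,
    DEMO_LOCK_DEFAULT
} demo_mech_t;

/* Returns the name of the method picked for a given mechanism. */
static const char *demo_choose_method(demo_mech_t mech)
{
    switch (mech) {
    case DEMO_LOCK_FCNTL:   return "fcntl";    /* explicit request */
    case DEMO_LOCK_FLOCK:   return "flock";    /* explicit request */
    case DEMO_LOCK_SYSVSEM: return "sysvsem";  /* explicit request */
    case DEMO_LOCK_DEFAULT:
        /* First macro that is non-zero wins, decided at compile time. */
#if DEMO_USE_FLOCK_SERIALIZE
        return "flock";
#elif DEMO_USE_SYSVSEM_SERIALIZE
        return "sysvsem";
#elif DEMO_USE_FCNTL_SERIALIZE
        return "fcntl";
#elif DEMO_USE_PROC_PTHREAD_SERIALIZE
        return "proc_pthread";
#elif DEMO_USE_POSIXSEM_SERIALIZE
        return "posixsem";
#else
        return NULL;
#endif
    }
    return NULL;
}
```

With these values the default branch can only yield sysvsem, so an fcntl
method has to come from an explicit, non-default mechanism being passed in.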

Meaning that the default is set at compile time and, in our case, it was FCNTL
based, judging by the logic above AND by the dump file containing the:

        st->util_ldap_cache_lock->proc_mutex->inter_meth->*_mutex_fcntl_* pointers

This decision path would require APR_USE_FCNTL_SERIALIZE to be set, and that is
not the case for me, since the decision comes from the autoconf script:

# See which lock mechanism we'll select by default on this system.
# The last APR_DECIDE to execute sets the default.
# At this stage, we match the ordering in Apache 1.3
# which is (highest to lowest): sysvsem -> fcntl -> flock.
# POSIX semaphores and cross-process pthread mutexes are not
# used by default since they have less desirable behaviour when
# e.g. a process holding the mutex segfaults.
# The BEOSSEM decision doesn't require any substitutions but is
# included here to prevent the fcntl() branch being selected
# from the decision making.

From configure.in:

case $ac_decision in
    USE_FLOCK_SERIALIZE )
        flockser="1"
        ;;
    USE_FCNTL_SERIALIZE )
        fcntlser="1"
        ;;
    USE_SYSVSEM_SERIALIZE )
        sysvser="1"
        ;;
    USE_POSIXSEM_SERIALIZE )
        posixser="1"
        ;;
    USE_PROC_PTHREAD_SERIALIZE )
        procpthreadser="1"
        ;;
    USE_BEOSSEM )
        beossem="1"
        ;;
esac

And those variables are the ones defining the headers:

$ grep -ri APR_USE_FCNTL_SERIALIZE *
...
debian/build/include/apr.h:#define APR_USE_FCNTL_SERIALIZE           0
include/apr.h.in:#define APR_USE_FCNTL_SERIALIZE           @fcntlser@
...

This is what would make proc_mutex_choose_method() decide to use the fcntl
locking mechanism, and you can see that, in my case, the
"fakeroot debian/rules build-arch" build did NOT set that variable. In the
configure output, you can see:

decision on apr_lock implementation method... SysV IPC semget()

Which means that the APR_DECIDE was sysvser, like described, as we can see:

$ grep -ri APR_USE_SYSVSEM_SERIALIZE *  | grep -v html
debian/build/include/apr.h:#define APR_USE_SYSVSEM_SERIALIZE         1
include/apr.h.in:#define APR_USE_SYSVSEM_SERIALIZE         @sysvser@

in the build dir.

--------------------------------------------------------------------------------

So, how was fcntl chosen as the locking mech(anism) if APR_USE_FCNTL_SERIALIZE
is not set? The only possibility I see is that, at the beginning of
proc_mutex_choose_method(), mech == APR_LOCK_FCNTL was passed as an argument,
and NOT APR_LOCK_DEFAULT, as it should have been for the "ldap-cache" type of
mutex when it was created. OR that APR_LOCK_DEFAULT resolves to something else.

With that, I went to check what defines it and found the apache function
ap_set_mutex(), called by the apache "init core commands" logic. This function
sets the mechanism for APR_LOCK_DEFAULT depending on the configuration file
stanza (derd!), so I went looking for it and found:

Mutex file:${APACHE_LOCK_DIR} default

--------------------------------------------------------------------------------

SUMMARY:

Now it makes more sense. So the "bug" here now is to find out why fcntl locking
did not guarantee atomicity for the LDAP operations, or even whether it should
have (based on its premises).

Another question that arises is: why use fcntl as the backing mechanism for the
LDAP locking? If the lock were supposed to be guaranteed among different nodes,
then backing the lock with a file could make sense, IF the name were also based
on the instance id, which in this case it is not. But in a shared,
threaded-only environment, why?

TODOs:

(1)

I would like the end user to test the LDAP caching operations/locking without
fcntl based locking. According to:

https://httpd.apache.org/docs/2.4/mod/core.html#mutex

It is possible to set a different locking mechanism for each of the subsystems
described in the table "Mutex name, Module(s), Protected Resource". This, per
se, will likely guarantee the atomicity of the LDAP cache operations.
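
A possible configuration fragment for such a test, as a sketch only. It assumes
sysvsem is usable on this build (the configure output in this report selected
it as the compiled-in default) and leaves the existing default line as it is:

```apache
# Existing line, unchanged: file-based locking stays the default
# for every other mutex type.
Mutex file:${APACHE_LOCK_DIR} default

# Assumption for the test: override only the mod_ldap cache mutex.
Mutex sysvsem ldap-cache
```

Whether this changes the behaviour seen in the dump is exactly what the test
would have to show.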

(2)

To investigate whether there is some catch regarding the use of fcntl() in an
I/O intensive environment for backing mutex implementations. Is there any fix
related to fcntl() that should be picked up, OR any type of I/O configuration
(including page cache, filesystem mount options, need for direct-io) required
in order for it to be used?
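
One property of fcntl() record locks worth keeping in mind for this item,
independent of whether it is related to this bug: POSIX associates the lock
with the process, not with the file descriptor or the thread. The sketch below
uses hypothetical helper names and a temporary file under /tmp (an assumption)
to show two descriptors in the same process both obtaining an exclusive lock on
the same file:

```c
#define _POSIX_C_SOURCE 200809L  /* for mkstemp() under strict -std= modes */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Try to take a non-blocking exclusive lock on the whole file. */
static int demo_try_wrlock(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                /* 0 means "until end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Returns 1 if two descriptors in the same process both obtained the
 * write lock, 0 if the second was refused, -1 on setup failure. */
static int demo_same_process_double_lock(void)
{
    char path[] = "/tmp/demo-fcntl.XXXXXX";
    int fd1 = mkstemp(path);
    int fd2, result;

    if (fd1 < 0)
        return -1;
    fd2 = open(path, O_RDWR);    /* second descriptor, same file */
    unlink(path);                /* the name is no longer needed */
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }

    result = (demo_try_wrlock(fd1) == 0 && demo_try_wrlock(fd2) == 0) ? 1 : 0;
    close(fd2);
    close(fd1);
    return result;
}
```

So an fcntl() lock on its own excludes other processes but not other threads of
the same process; in a threaded MPM it needs an in-process lock alongside it.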

Eric,

A colleague of mine, or I, will get back to you with some answers on all this
investigation. Tks a lot!
