Fri, Oct 09, 2015 at 04:02:48PM CEST, ctcard(a)hotmail.com wrote:
>>>> Wed, Oct 07, 2015 at 10:15:35AM CEST, ctcard(a)hotmail.com wrote:
>>>>>Hi,
>>>>>
>>>>>We are occasionally seeing issues with the teaming daemon not
>>>>>starting after a reboot on CentOS 7 VMs. Here is an example from last night (from
>>>>>/var/log/messages.minor):
>>>>>
>>>>> <snip>
>>>>>Is this a known issue?
>>>>
>>>> Not to me.
>>>>
>>>> Need more info, debug messages.
>>> Thanks, I'll try and reproduce with teamd running with debug output
>>I've tried to reproduce with teamd running with -g, but so far it hasn't
>>happened again.
>>My guess is that it is a timing issue/race condition, so running teamd with
>>debug output may even slow it down enough to stop it happening.
>>Can you give me any idea what teamd might be racing against?
>
> No clue, this is quite odd. Try to compile teamd yourself and find out
> where exactly the error happens.
I just had this occur again, this time with teamd running with debug output, but there was
no useful extra information that I could see.
I did look at the source code for teamd, and the error occurs when this function fails:
int get_ifinfo_list(struct team_handle *th)
{
	struct nl_cb *cb;
	struct nl_cb *orig_cb;
	struct rtgenmsg rt_hdr = {
		.rtgen_family = AF_UNSPEC,
	};
	int ret;

	/* Request a dump of all links (RTM_GETLINK + NLM_F_DUMP). */
	ret = nl_send_simple(th->nl_cli.sock, RTM_GETLINK, NLM_F_DUMP,
			     &rt_hdr, sizeof(rt_hdr));
	if (ret < 0)
		return -nl2syserr(ret);	/* failure point 1: send */

	/* Private copy of the socket callbacks, so valid_handler
	 * only applies to this dump. */
	orig_cb = nl_socket_get_cb(th->nl_cli.sock);
	cb = nl_cb_clone(orig_cb);
	nl_cb_put(orig_cb);
	if (!cb)
		return -ENOMEM;
	nl_cb_set(cb, NL_CB_VALID, NL_CB_CUSTOM, valid_handler, th);

	ret = nl_recvmsgs(th->nl_cli.sock, cb);
	nl_cb_put(cb);
	if (ret < 0)
		return -nl2syserr(ret);	/* failure point 2: receive/parse */
	return check_call_change_handlers(th, TEAM_IFINFO_CHANGE);
}
but unfortunately I can't see the error code being printed anywhere, which would
help narrow down the point of failure.
I guess I'll have to follow your suggestion, and compile a version that prints out
more info on failure.
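One way to get more than the bare return value of get_ifinfo_list is to look at the raw replies to the RTM_GETLINK dump. The sketch below is a minimal, hypothetical diagnostic helper (dump_state, classify_dump and format_dump_result are made-up names, not part of libteam or libnl), assuming only the kernel's <linux/netlink.h> macros. It reports whether the kernel answered with NLMSG_ERROR, or whether any reply carried NLM_F_DUMP_INTR, the flag the kernel sets when the dumped data changed while the dump was in progress. That flag is one candidate for the kind of race discussed above; nothing in this thread confirms it is the cause here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <linux/netlink.h>

/* Hypothetical diagnostic helper, not part of libteam or libnl. */
enum dump_status {
	DUMP_MORE,        /* only data messages so far: keep calling recv() */
	DUMP_DONE,        /* NLMSG_DONE reached, no reply was flagged */
	DUMP_INTERRUPTED, /* NLMSG_DONE reached, but a reply had NLM_F_DUMP_INTR */
	DUMP_ERROR,       /* kernel answered with NLMSG_ERROR */
	DUMP_TRUNCATED,   /* bytes that do not form a complete message */
};

struct dump_state {
	int interrupted; /* sticky across buffers: any reply had NLM_F_DUMP_INTR */
	int error;       /* negative errno from NLMSG_ERROR, 0 if none */
};

static void dump_state_init(struct dump_state *st)
{
	memset(st, 0, sizeof(*st));
}

/* Walk one recv() buffer of replies to an RTM_GETLINK/NLM_F_DUMP request. */
static enum dump_status classify_dump(struct dump_state *st,
				      const void *buf, int len)
{
	const struct nlmsghdr *nh;

	assert(st != NULL && buf != NULL);
	for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
		if (nh->nlmsg_flags & NLM_F_DUMP_INTR)
			st->interrupted = 1;
		if (nh->nlmsg_type == NLMSG_ERROR) {
			const struct nlmsgerr *e = NLMSG_DATA(nh);

			if (nh->nlmsg_len < NLMSG_LENGTH(sizeof(*e)))
				return DUMP_TRUNCATED; /* too short for an error code */
			st->error = e->error;
			return DUMP_ERROR;
		}
		if (nh->nlmsg_type == NLMSG_DONE)
			return st->interrupted ? DUMP_INTERRUPTED : DUMP_DONE;
	}
	return len > 0 ? DUMP_TRUNCATED : DUMP_MORE;
}

static const char *dump_status_name(enum dump_status s)
{
	switch (s) {
	case DUMP_MORE:        return "more";
	case DUMP_DONE:        return "done";
	case DUMP_INTERRUPTED: return "interrupted (NLM_F_DUMP_INTR)";
	case DUMP_ERROR:       return "kernel error";
	case DUMP_TRUNCATED:   return "truncated";
	}
	return "unknown";
}

/* Build a log line that tells a kernel error apart from an interrupted dump. */
static int format_dump_result(char *out, size_t n, enum dump_status s,
			      const struct dump_state *st)
{
	if (s == DUMP_ERROR)
		return snprintf(out, n, "RTM_GETLINK dump: kernel error %d (%s)",
				st->error, strerror(-st->error));
	return snprintf(out, n, "RTM_GETLINK dump: %s", dump_status_name(s));
}
```

The helper would be fed each buffer returned by recv() on the NETLINK_ROUTE socket until it stops returning DUMP_MORE. Inside teamd/libteam itself, the smaller change is probably to log ret, together with libnl's nl_geterror(ret), at each of the two `return -nl2syserr(ret)` sites in get_ifinfo_list.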
Sure, that's why I asked :)