Re: Example "local" fails on node with two IP addresses (file-systems.lustre.user, msg#00068)
Hi Alexey,

I'm still encountering a problem even after disabling SELinux.

# cat /proc/cmdline
ro root=LABEL=/ splash=0 rhgb selinux=0 quiet
# grep ^SELINUX /etc/selinux/config
SELINUX=disabled
SELINUXTYPE=targeted

Below is a snippet of /var/log/messages (a more complete log is attached):

==========
Apr 23 12:57:06 sun-n1-console kernel: Lustre: OBD class driver Build Version: 1.4.10-19691231170000-PRISTINE-.testsuite.tmp.lbuild-boulder.lbuild-v1_4_10_RC2-2.6-rhel4-i686.lbuild.BUILD.lustre-kernel-2.6.9.lustre.linux-2.6.9-42.0.10.EL_lustre.1.4.10smp, info-KYPl3Ael/zSakBO8gow8eQ@xxxxxxxxxxxxxxxx
Apr 23 12:57:07 sun-n1-console kernel: Lustre: Added LNI 129.158.130.75@tcp [8/256]
Apr 23 12:57:07 sun-n1-console kernel: Lustre: Accept secure, port 988
Apr 23 12:57:12 sun-n1-console kernel: LustreError: Refusing connection from 192.168.123.45 for 192.168.123.45@tcp: No matching NI
Apr 23 12:57:12 sun-n1-console kernel: LustreError: 4416:0:(socklnd_cb.c:2160:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.123.45
Apr 23 12:57:12 sun-n1-console kernel: LustreError: Connection to 192.168.123.45@tcp at host 192.168.123.45 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.123.45@tcp one of its NIDs?
Apr 23 12:57:12 sun-n1-console kernel: Lustre: 10:0:(linux-debug.c:98:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,192.168.123.45@tcp,down,1177304206
Apr 23 12:57:17 sun-n1-console kernel: LustreError: 4854:0:(client.c:947:ptlrpc_expire_one_request()) @@@ timeout (sent at 1177304232, 5s ago) req@ef64ec00 x1/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
Apr 23 12:57:31 sun-n1-console kernel: LustreError: 5170:0:(mds_lov.c:589:mds_lov_start_synchronize()) mds1: error starting mds_lov_synchronize: -4
Apr 23 12:57:31 sun-n1-console kernel: LustreError: 5170:0:(quota_master.c:1103:mds_quota_recovery()) Cannot start quota recovery thread: rc -4
Apr 23 12:57:37 sun-n1-console kernel: LustreError: Refusing connection from 192.168.123.45 for 192.168.123.45@tcp: No matching NI
Apr 23 12:57:37 sun-n1-console kernel: LustreError: 4417:0:(socklnd_cb.c:2160:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.123.45
Apr 23 12:57:37 sun-n1-console kernel: LustreError: Connection to 192.168.123.45@tcp at host 192.168.123.45 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.123.45@tcp one of its NIDs?
Apr 23 12:57:42 sun-n1-console kernel: LustreError: 4854:0:(client.c:947:ptlrpc_expire_one_request()) @@@ timeout (sent at 1177304257, 5s ago) req@f5024a00 x3/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
==========

It looks to me like there is confusion over which network interface to use (eth0 = 129.158.130.75, eth1 = 192.168.123.45): LNET reports adding 129.158.130.75@tcp, yet the refused connections are addressed to 192.168.123.45@tcp.
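To make the suspected mix-up concrete, here is a minimal Python sketch that compares the NID LNET reports registering ("Added LNI") with the NID named in the "No matching NI" refusals. The two log lines are copied from the snippet in this message with the timestamp prefix trimmed; the function name `nid_mismatches` and the regular expressions are illustrative assumptions, not part of Lustre.

```python
import re

# Two lines taken from the /var/log/messages snippet (timestamp prefix trimmed).
LOG = """\
Lustre: Added LNI 129.158.130.75@tcp [8/256]
LustreError: Refusing connection from 192.168.123.45 for 192.168.123.45@tcp: No matching NI
"""

# Hypothetical patterns written for this sketch; they only match the two
# message shapes that appear in the log above.
ADDED = re.compile(r"Added LNI (\S+)")
REFUSED = re.compile(r"Refusing connection from \S+ for (\S+): No matching NI")


def nid_mismatches(log_text):
    """Return (registered NIDs, refused NIDs that were never registered)."""
    registered = set(ADDED.findall(log_text))
    refused = set(REFUSED.findall(log_text))
    return registered, refused - registered


registered, unmatched = nid_mismatches(LOG)
print("registered NIDs:", sorted(registered))
print("refused NIDs with no matching NI:", sorted(unmatched))
```

On these two lines the sketch reports 129.158.130.75@tcp (the eth0 address) as the only registered NID and 192.168.123.45@tcp (the eth1 address) as refused, which matches the reading that LNET came up on eth0 rather than eth1.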
I intended to deploy the MDS on eth1; this is specified by IP address when creating the node:

--add net --node sun-n1-console --nettype lnet --nid 192.168.123.45@tcp

I've emptied /etc/resolv.conf to ensure that "sun-n1-console" resolves to 192.168.123.45:

# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.123.45 sun-n1-console
129.158.130.75 public-host
# hostname -f ; hostname -i
sun-n1-console
192.168.123.45

And the output of ifconfig:

eth0 Link encap:Ethernet HWaddr 00:07:E9:06:AC:5C
     inet addr:129.158.130.75 Bcast:129.158.130.255 Mask:255.255.255.0
eth1 Link encap:Ethernet HWaddr 00:07:E9:06:AC:5D
     inet addr:192.168.123.45 Bcast:192.168.123.255 Mask:255.255.255.0

Is there anything else that I missed?

Regards,
Verdi

Alexey Lyashkov wrote:
> Looks like you need to disable SELinux.
> ===
> Apr 20 17:38:26 sun-n1-console kernel: audit(1177061906.286:66): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo
> ==
>
> On Fri, 2007-04-20 at 14:04, Verdi March wrote:
> > Hi,
> >
> > I'm encountering a problem when starting the "local" example (one
> > MDS, LOV, OST, and client, all on node "sun-n1-console").
> >
> > # lmc -m test.xml --batch test.txt
> > # cat test.txt
> > --add node --node sun-n1-console
> > --add net --node sun-n1-console --nettype lnet --nid sun-n1-console@tcp
> > --add mds --node sun-n1-console --mds mds1 --fstype ldiskfs --dev /tmp/mds1-sun-n1-console --size 400000
> > --add lov --lov lov1 --mds mds1 --stripe_sz 1048576 --stripe_cnt 1 --stripe_pattern 0
> > --add ost --node sun-n1-console --lov lov1 --ost ost1-sun-n1-console --fstype ldiskfs --dev /tmp/ost1-sun-n1-console --size 400000
> > --add mtpt --node sun-n1-console --path /mnt/lustre --mds mds1 --lov lov1
> >
> > The node has two Ethernet interfaces, eth0 and eth1, on separate subnets.
> > I deployed all Lustre components on eth1 (IP: 192.168.123.45, hostname:
> > sun-n1-console).
> >
> > # cat /etc/hosts
> > 127.0.0.1 localhost.localdomain localhost
> > xxx.yyy.zzz.ab public-host
> > 192.168.123.45 sun-n1-console
> >
> > When eth0 is down, I can deploy the "local" example successfully.
> > Lustre fails to start only when eth0 is up (see attachment).
> >
> > The error messages in /var/log/messages indicate that the MDS does
> > not respond (see below). I believe it's not caused by the firewall,
> > because I've switched it off:
> >
> > # iptables -L
> > Chain INPUT (policy ACCEPT)
> > target prot opt source destination
> >
> > Chain FORWARD (policy ACCEPT)
> > target prot opt source destination
> >
> > Chain OUTPUT (policy ACCEPT)
> > target prot opt source destination
> >
> > And here are the error messages:
> >
> > # tail /var/log/messages
> > Apr 20 17:37:35 sun-n1-console kernel: LustreError: 6840:0:(events.c:53:request_out_callback()) @@@ type 4, status -5 req@f7fe7e00 x22/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
> > Apr 20 17:37:35 sun-n1-console kernel: LustreError: 6840:0:(client.c:947:ptlrpc_expire_one_request()) @@@ timeout (sent at 1177061855, 0s ago) req@f7fe7e00 x22/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
> > Apr 20 17:37:35 sun-n1-console kernel: LustreError: 6840:0:(client.c:947:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
> > Apr 20 17:38:00 sun-n1-console kernel: LustreError: 6840:0:(events.c:53:request_out_callback()) @@@ type 4, status -5 req@ed133e00 x23/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
> > Apr 20 17:38:25 sun-n1-console kernel: audit(1177061905.683:64): avc: denied { rawip_recv } for pid=6537 comm="socknal_cd03" saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
> > Apr 20 17:38:25 sun-n1-console kernel: audit(1177061905.884:65): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
> > Apr 20 17:38:26 sun-n1-console kernel: audit(1177061906.286:66): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
> > Apr 20 17:38:27 sun-n1-console kernel: audit(1177061907.090:67): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
> > Apr 20 17:38:28 sun-n1-console kernel: audit(1177061908.698:68): avc: denied { rawip_recv } for saddr=192.168.123.45 src=1023 daddr=192.168.123.45 dest=988 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
> > Apr 20 17:38:30 sun-n1-console kernel: LustreError: 6539:0:(acceptor.c:442:lnet_acceptor()) Error -11 reading connection request from 192.168.123.45
> > Apr 20 17:38:30 sun-n1-console kernel: audit(1177061910.683:69): avc: denied { rawip_send } for pid=6539 comm="acceptor_988" saddr=192.168.123.45 src=988 daddr=192.168.123.45 dest=1023 netif=lo scontext=system_u:object_r:unlabeled_t tcontext=system_u:object_r:netif_lo_t tclass=netif
> > Apr 20 17:38:30 sun-n1-console kernel: LustreError: 6537:0:(socklnd_cb.c:2160:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.123.45
> > Apr 20 17:38:30 sun-n1-console kernel: LustreError: Connection to 192.168.123.45@tcp at host 192.168.123.45 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.123.45@tcp one of its NIDs?
> > Apr 20 17:38:50 sun-n1-console kernel: LustreError: 6840:0:(events.c:53:request_out_callback()) @@@ type 4, status -5 req@ec698e00 x25/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
> > Apr 20 17:39:15 sun-n1-console kernel: LustreError: 6840:0:(events.c:53:request_out_callback()) @@@ type 4, status -5 req@e97c8c00 x26/t0 o8->ost1-sun-n1-console_UUID@sun-n1-console_UUID:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/0
> >
> > Any advice on how to make this simple example work?
> >
> > Regards,
> > Verdi
> --
> Alexey Lyashkov <shadow-KYPl3Ael/zSakBO8gow8eQ@xxxxxxxxxxxxxxxx>
> Beaver team
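As a cross-check of the configuration side, the following Python sketch pulls the `--nid` values out of lmc batch lines and maps each one to the interface that owns that address. The interface table is copied from the ifconfig output in this thread, and the batch lines combine the original `--add node` line with the IP-based `--add net` line from the follow-up. The helper names `nids_in_batch` and `interface_for` are hypothetical and exist only for this illustration; nothing here calls lmc or Lustre.

```python
import shlex

# Interface addresses as reported by ifconfig in this thread.
INTERFACES = {"eth0": "129.158.130.75", "eth1": "192.168.123.45"}

# Batch lines as posted (node line from test.txt, net line from the follow-up).
BATCH = """\
--add node --node sun-n1-console
--add net --node sun-n1-console --nettype lnet --nid 192.168.123.45@tcp
"""


def nids_in_batch(batch_text):
    """Collect the value that follows each --nid option, in file order."""
    nids = []
    for line in batch_text.splitlines():
        tokens = shlex.split(line)
        if "--nid" in tokens:
            idx = tokens.index("--nid")
            if idx + 1 < len(tokens):
                nids.append(tokens[idx + 1])
    return nids


def interface_for(nid, interfaces):
    """Return the interface whose address equals the NID's address part, else None."""
    address = nid.split("@", 1)[0]
    for name, ip in interfaces.items():
        if ip == address:
            return name
    return None


for nid in nids_in_batch(BATCH):
    print(nid, "->", interface_for(nid, INTERFACES))
```

A hostname-based NID such as sun-n1-console@tcp maps to no interface in this sketch, because it compares literal addresses only and does not resolve names.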