Alright, so for the last three days or so, my main Solaris machine has been going crazy and kernel panicking about once a day, which is extremely annoying because every time it panics the machine reboots (and this machine has 3 zones in current use, so I get 3 calls about “why did my machine reboot”). Luckily, none of our servers here are production, so I get calls from development rather than from angry customers. So, I’m setting out to figure out why the machine is panicking. Here’s what I’m getting from the logs:
From the vmcore file:
ZFS: I/O failure (write on <unknown> off 0: zio 6000620cd40 [L0 ZIL intent log] 1000L/1000P DVA[0]=<0:1300cb9000:1000> zilog uncompressed BE contiguous birth=208621 fill=0 cksum=8eafa7df8b7cb3e:f2fd0a04af0e949e:1a:f3): error 5
From the /var/adm/messages file:
Jun 5 12:01:11 lava2051 fctl: [ID 517869 kern.warning] WARNING: fp(0)::GPN_ID for D_ID=650700 failed
Jun 5 12:01:11 lava2051 fctl: [ID 517869 kern.warning] WARNING: fp(0)::N_x Port with D_ID=650700, PWWN=5006016841e019a7 disappeared from fabric
Jun 5 12:01:30 lava2051 scsi: [ID 243001 kern.info] /pci@1c,600000/fibre-channel@1/fp@0,0 (fcp0):
Jun 5 12:01:30 lava2051 offlining lun=0 (trace=0), target=650700 (trace=2800004)
Jun 5 12:06:28 lava2051 unix: [ID 836849 kern.notice]
Jun 5 12:06:28 lava2051 ^Mpanic[cpu2]/thread=2a101061cc0:
Jun 5 12:06:28 lava2051 unix: [ID 809409 kern.notice] ZFS: I/O failure (write on <unknown> off 0: zio 6000620cd40 [L0 ZIL intent log] 1000L/1000P DVA[0]=<0:1300cb9000:1000> zilog uncompressed BE contiguous birth=208621 fill=0 cksum=8eafa7df8b7cb3e:f2fd0a04af0e949e:1a:f3): error 5)
... some stuff ...
Jun 5 12:09:55 lava2051 savecore: [ID 570001 auth.error] reboot after panic: ZFS: I/O failure (write on <unknown> off 0: zio 6000620cd40 [L0 ZIL intent log] 1000L/1000P DVA[0]=<0:1300cb9000:1000> zilog uncompressed BE contiguous birth=208621 fill=0 cksum=8eafa7df8b7cb3e:f2fd0
Jun 5 12:09:55 lava2051 savecore: [ID 748169 auth.error] saving system crash dump in /var/crash/lava2051/*.1
Repeat x 3 so far. Like I said, extremely annoying.
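If you want to confirm that a LUN really is dropping off the fabric, stock Solaris gives you a few places to look. These are generic commands rather than output from my box, but they’re where I’d start:

  # list FC attachment points and the LUNs behind them, with their state
  cfgadm -al -o show_FCP_dev
  # dump the FMA error telemetry; transport/offline events should show up here
  fmdump -eV | less
  # per-device soft/hard/transport error counters
  iostat -En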
Here’s what I think the problem is so far: I have a 500 GB ZFS pool built on a single Clariion LUN that is exported to this machine. From the looks of it, the machine intermittently loses sight of the LUN, and when it disappears ZFS freaks out and panics on the I/O failure. Now that I know what the problem is, I have no idea how to make the LUN stop disappearing. Guess I’m off to check some Clariion logs and see where that gets me. Anyone out there have any other suggestions on how I could go about fixing this? I have little experience working with core dumps, so I would be extremely grateful for any pointers.
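Speaking of core dumps: from what I can tell, the standard way to crack one open is mdb. The unix.1/vmcore.1 names below match what savecore logged above; the dcmds are just the usual first steps, nothing ZFS-specific:

  cd /var/crash/lava2051
  mdb -k unix.1 vmcore.1   # open the kernel crash dump in the modular debugger
  > ::status               # panic string and dump summary
  > ::panicinfo            # panicking CPU, thread, and registers
  > $C                     # stack backtrace of the panicking thread
  > ::quit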
P.S. Yes, I know I should have mirrored the ZFS pool on 2 or more devices in case of a problem like this. This is more my “proof-of-concept” machine where I try out new things and see how developers/QA react to them.
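For completeness, mirroring would have been a one-liner at pool-creation time (pool and device names here are made up):

  # create a pool mirrored across two LUNs from different arrays
  zpool create tank mirror c4t0d0 c5t0d0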
UPDATE:
It looks like the problem was on the Clariion side. In the meantime, we exported a LUN from a different Clariion, did a zpool attach, waited for the data to resilver, and then detached the old device. Fixed! <3 ZFS
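For anyone who hits the same thing, the attach/resilver/detach dance looks roughly like this (pool and device names are made up, not our real ones):

  # attach the new LUN as a mirror of the failing one
  zpool attach tank c4t0d0 c5t0d0
  # watch the resilver progress until it completes
  zpool status tank
  # once resilvering is done, drop the failing LUN from the pool
  zpool detach tank c4t0d0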
UPDATE 2:
Now the data is mirrored to a different Clariion. Fun fun. Interestingly enough, EMC doesn’t officially support ZFS on Clariion, only on Symmetrix.