Hi folks. I've got a new install FC4 box which keeps crashing. I've replaced the memory because it looked like a mem fault, but it's still happening, and has suddenly got worse. Looking in /var/log/messages shows entries like below. I've done a yum update this morning but it hasn't made any difference.
Anyone got any ideas what's wrong?
Oct 31 11:53:51 eddie kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000024
Oct 31 11:53:51 eddie kernel: printing eip:
Oct 31 11:53:51 eddie kernel: c012daf9
Oct 31 11:53:51 eddie kernel: *pde = 00000000
Oct 31 11:53:51 eddie kernel: Oops: 0000 [#1]
Oct 31 11:53:51 eddie kernel: Modules linked in:
Oct 31 11:53:51 eddie kernel: CPU: 0
Oct 31 11:53:51 eddie kernel: EIP: 0060:[<c012daf9>] Not tainted VLI
Oct 31 11:53:51 eddie kernel: EFLAGS: 00010246 (2.6.13-1.1532_FC4)
Oct 31 11:53:51 eddie kernel: EIP is at do_exit+0x625/0x942
Oct 31 11:53:51 eddie kernel: eax: 00000000 ebx: c16d4550 ecx: df7e6c80 edx: 00000000
Oct 31 11:53:51 eddie kernel: esi: df7ec6c0 edi: 00000000 ebp: 00000001 esp: c16e6f54
Oct 31 11:53:51 eddie kernel: ds: 007b es: 007b ss: 0068
Oct 31 11:53:51 eddie kernel: Process udev (pid: 28, threadinfo=c16e6000 task=c16d4550)
Oct 31 11:53:52 eddie kernel: Stack: 0000038d 00000000 bfe98750 bfe986b4 00000003 c16e6000 c01b1e69 0000038d
Oct 31 11:53:52 eddie kernel: 0000ea00 ffffffea c1686500 0000ea00 c16e6000 c012df6d 00001000 00000000
Oct 31 11:53:52 eddie kernel: c16e6fbc c0414a2c 00000004 0000000e 0000000b 080d8078 ffffffea ffffffea
Oct 31 11:53:52 eddie kernel: Call Trace:
Oct 31 11:53:52 eddie kernel: [<c01b1e69>] sys_stat64+0x23/0x28
Oct 31 11:53:52 eddie kernel: [<c012df6d>] do_group_exit+0x12b/0x349
Oct 31 11:53:52 eddie kernel: [<c0104465>] syscall_call+0x7/0xb
Oct 31 11:53:52 eddie kernel: Code: 89 d8 e8 7c 74 10 00 85 ed 74 1a 8b 83 6c 04 00 00 8b b8 98 00 00 00 85 ff 74 0a b8 01 00 00 00 e8 6a ea 17 00 8b 43 04 8b 40 04 <8b> 40 24 85 c0 74 0b ff 88 00 01 00 00 83 38 02 74 6e 8b 83 80
Oct 31 11:53:52 eddie kernel: <1>Fixing recursive fault but reboot is needed!
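For anyone reading an oops like the one above, the lines that usually matter most are the fault description, the "EIP is at" symbol, the process name, and the call trace. The following is only a rough sketch of pulling those out of /var/log/messages lines; `summarize_oops` and the regular expressions are hypothetical helpers written for this example, not part of any kernel tool, and they assume the 2.6-era i386 layout shown in this post.

```python
import re

# Hypothetical helper for illustration only. Assumes syslog lines of the form
# "Oct 31 11:53:51 host kernel: <message>" as seen in this thread.
SYSLOG_PREFIX = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+kernel:\s*")
# A call-trace frame looks like "[<c01b1e69>] sys_stat64+0x23/0x28".
FRAME = re.compile(r"\[<[0-9a-f]{8}>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
PROCESS = re.compile(r"Process (\S+) \(pid: (\d+)")


def summarize_oops(lines):
    """Collect the fault text, faulting symbol, process and trace symbols."""
    summary = {"fault": None, "eip": None, "process": None, "trace": []}
    for raw in lines:
        line = SYSLOG_PREFIX.sub("", raw).strip()
        if line.startswith("Unable to handle kernel"):
            summary["fault"] = line
        elif line.startswith("EIP is at "):
            summary["eip"] = line[len("EIP is at "):]
        elif line.startswith("Process "):
            m = PROCESS.match(line)
            if m:
                summary["process"] = (m.group(1), int(m.group(2)))
        else:
            m = FRAME.search(line)
            if m:
                summary["trace"].append(m.group(1))
    return summary


# A few lines copied from the log in this post.
SAMPLE = [
    "Oct 31 11:53:51 eddie kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000024",
    "Oct 31 11:53:51 eddie kernel: EIP: 0060:[<c012daf9>] Not tainted VLI",
    "Oct 31 11:53:51 eddie kernel: EIP is at do_exit+0x625/0x942",
    "Oct 31 11:53:51 eddie kernel: Process udev (pid: 28, threadinfo=c16e6000 task=c16d4550)",
    "Oct 31 11:53:52 eddie kernel: Call Trace:",
    "Oct 31 11:53:52 eddie kernel: [<c01b1e69>] sys_stat64+0x23/0x28",
    "Oct 31 11:53:52 eddie kernel: [<c012df6d>] do_group_exit+0x12b/0x349",
    "Oct 31 11:53:52 eddie kernel: [<c0104465>] syscall_call+0x7/0xb",
]

if __name__ == "__main__":
    for key, value in summarize_oops(SAMPLE).items():
        print(f"{key}: {value}")
```

A summary like this only says where the kernel fell over (here, in do_exit while udev was exiting). It cannot tell a software bug from bad hardware, which is why the replies go on to test the memory.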
On Mon, 2005-10-31 at 12:00 +0000, Gary Stainburn wrote:
I've replaced the memory because it looked like a mem fault
If you believe that to be the case, run memtest86 (installable from extras). It'll pick up memory and related problems (e.g. motherboard might have other issues).
On Monday 31 October 2005 1:40 pm, Tim wrote:
On Mon, 2005-10-31 at 12:00 +0000, Gary Stainburn wrote:
I've replaced the memory because it looked like a mem fault
If you believe that to be the case, run memtest86 (installable from extras). It'll pick up memory and related problems (e.g. motherboard might have other issues).
Thanks for this Tim. I've done as you suggested, and I did get some errors, which I'm looking into. I've got some more memory to try, but I've already changed it once.
I've also looked into lm_sensors, and if I understand the output below, I've got a power supply problem - which may be the cause of the memory faults.
It also looks like I've got a CPU that's running at below freezing.
Anyone got any comments on this output? Anything I need to sort out, or have missed?
it8712-isa-0290
Adapter: ISA adapter
VCore 1: +1.52 V (min = +1.42 V, max = +1.57 V)
VCore 2: +2.51 V (min = +2.40 V, max = +2.61 V)
+3.3V: +6.66 V (min = +3.14 V, max = +3.46 V) ALARM
+5V: +2.66 V (min = +4.76 V, max = +5.24 V) ALARM
+12V: +12.03 V (min = +11.39 V, max = +12.61 V)
-12V: -12.63 V (min = -12.63 V, max = -11.41 V) ALARM
-5V: -7.13 V (min = -5.26 V, max = -4.77 V) ALARM
Stdby: +2.85 V (min = +4.76 V, max = +5.24 V) ALARM
VBat: +4.08 V
fan1: 0 RPM (min = 0 RPM, div = 8)
fan2: 0 RPM (min = 664 RPM, div = 8)
fan3: 0 RPM (min = 664 RPM, div = 8) ALARM
M/B Temp: +25°C (low = +15°C, high = +40°C) sensor = thermistor
CPU Temp: -55°C (low = +15°C, high = +45°C) sensor = thermistor
Temp3: +80°C (low = +15°C, high = +45°C) sensor = diode
eeprom-i2c-0-51
Adapter: SMBus I801 adapter at 5000
Memory type: DDR SDRAM DIMM
Memory size (MB): 512

eeprom-i2c-0-50
Adapter: SMBus I801 adapter at 5000
Memory type: DDR SDRAM DIMM
Memory size (MB): 512
Gary Stainburn wrote:
On Monday 31 October 2005 1:40 pm, Tim wrote:
On Mon, 2005-10-31 at 12:00 +0000, Gary Stainburn wrote:
I've replaced the memory because it looked like a mem fault
If you believe that to be the case, run memtest86 (installable from extras). It'll pick up memory and related problems (e.g. motherboard might have other issues).
Thanks for this Tim. I've done as you suggested, and I did get some errors, which I'm looking into. I've got some more memory to try, but I've already changed it once.
I've also looked into lm_sensors, and if I understand the output below, I've got a power supply problem - which may be the cause of the memory faults.
It also looks like I've got a CPU that's running at below freezing.
Anyone got any comments on this output? Anything I need to sort out, or have missed?
it8712-isa-0290
Adapter: ISA adapter
VCore 1: +1.52 V (min = +1.42 V, max = +1.57 V)
VCore 2: +2.51 V (min = +2.40 V, max = +2.61 V)
+3.3V: +6.66 V (min = +3.14 V, max = +3.46 V) ALARM
+5V: +2.66 V (min = +4.76 V, max = +5.24 V) ALARM
+12V: +12.03 V (min = +11.39 V, max = +12.61 V)
-12V: -12.63 V (min = -12.63 V, max = -11.41 V) ALARM
-5V: -7.13 V (min = -5.26 V, max = -4.77 V) ALARM
Stdby: +2.85 V (min = +4.76 V, max = +5.24 V) ALARM
VBat: +4.08 V
fan1: 0 RPM (min = 0 RPM, div = 8)
fan2: 0 RPM (min = 664 RPM, div = 8)
fan3: 0 RPM (min = 664 RPM, div = 8) ALARM
M/B Temp: +25°C (low = +15°C, high = +40°C) sensor = thermistor
I'd put a DVM and/or an oscilloscope on the power supply lines going to the motherboard, just to verify that the sensors are a little bit messed up...
CPU Temp: -55°C (low = +15°C, high = +45°C) sensor = thermistor
Great! You've somehow figured out how to reverse normal thermal flow! Maybe the CPU is plugged in backwards... ;) (snip)
Seriously, I'd double-check the voltages in there rather than believe what the sensors are telling you. As has been said elsewhere, check the BIOS settings as well.
On Mon, 2005-10-31 at 18:03 +0000, Gary Stainburn wrote:
I've also looked into lm_sensors, and if I understand the output below, I've got a power supply problem - which may be the cause of the memory faults.
It also looks like I've got a CPU that's running at below freezing.
If your BIOS lets you see the outputs from such sensors, have a look at it too. In my case, lm_sensors came up with some wildly stupid results, and the documentation almost suggests that it's next to useless.
You could try plugging in another PSU to see if that makes any difference. Just borrow a mate's, unscrew and unplug the box, and connect it to yours.
Gary Stainburn wrote in his message (sent Monday, 31 October 2005 20:03):
I've also looked into lm_sensors, and if I understand the output below, I've got a power supply problem - which may be the cause of the memory faults.
No, you have an lm_sensors configuration problem.
It also looks like I've got a CPU that's running at below freezing.
Anyone got any comments on this output? Anything I need to sort out, or have missed?
You must edit /etc/sensors.conf to set up the program for your motherboard.
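One rough way to see why these readings look like a configuration problem rather than real hardware values is to ask whether a machine could boot at all with them. The sketch below is an illustration only: the `classify` function and its 40% threshold are assumptions made up for this example, not anything lm_sensors provides. It separates "a bit outside the limits" from "physically implausible for a running PC".

```python
import re

# A few readings copied from the sensors output in this thread.
READINGS = """\
+3.3V: +6.66 V (min = +3.14 V, max = +3.46 V) ALARM
+5V: +2.66 V (min = +4.76 V, max = +5.24 V) ALARM
+12V: +12.03 V (min = +11.39 V, max = +12.61 V)
M/B Temp: +25°C (low = +15°C, high = +40°C) sensor = thermistor
CPU Temp: -55°C (low = +15°C, high = +45°C) sensor = thermistor
"""

NUM = r"[+-]?\d+(?:\.\d+)?"
LINE = re.compile(
    rf"^(?P<label>[^:]+):\s*(?P<value>{NUM})\s*(?:V|°C)\s*"
    rf"\((?:min|low) = (?P<lo>{NUM}).*?(?:max|high) = (?P<hi>{NUM})"
)


def classify(value, lo, hi, slack=0.4):
    """Assumed heuristic: more than `slack` (40%) away from the midpoint of
    the configured limits is treated as implausible, not merely out of range."""
    if lo <= value <= hi:
        return "ok"
    mid = (lo + hi) / 2
    if abs(value - mid) > slack * abs(mid):
        return "implausible"
    return "out of range"


def check(text):
    """Map each parsed label to its classification."""
    results = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            results[m["label"]] = classify(
                float(m["value"]), float(m["lo"]), float(m["hi"])
            )
    return results


if __name__ == "__main__":
    for label, verdict in check(READINGS).items():
        print(f"{label:10s} {verdict}")
```

With these inputs the +3.3V, +5V and CPU Temp lines come out as implausible while +12V and M/B Temp are fine. A board with a real 2.66 V on its 5 V rail would be unlikely to run at all, so the more likely explanation is that the labels and scaling in the configuration do not match how this motherboard wires the chip.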
On Monday 31 October 2005 1:40 pm, Tim wrote:
On Mon, 2005-10-31 at 12:00 +0000, Gary Stainburn wrote:
I've replaced the memory because it looked like a mem fault
If you believe that to be the case, run memtest86 (installable from extras). It'll pick up memory and related problems (e.g. motherboard might have other issues).
I've run memtest a number of times now. I've got 2 slots and 5 memory sticks. I've tried all sorts of variants with different sticks in different slots, having a stick in just slot 1, having a stick in just slot 2, having both slots full etc.
Every time I run memtest86+ I get errors on test 5. The location and number of the errors change as I change configs, but I never get a clean test. I also never get errors on any of the other tests, just test 5.
What's the chance that the problem's something other than memory?
Any ideas what I can try next?
Gary Stainburn wrote:
Every time I run memtest86+ I get errors on test 5. The location and number of the errors change as I change configs, but I never get a clean test. I also never get errors on any of the other tests, just test 5.
What's the chance that the problem's something other than memory?
If it were CPU temperature, you would expect problems turning up all over the place, and crashes in memtest86. So it sounds like the motherboard chipset, the memory, or a BIOS setting controlling the behaviour of the chipset or the memory. Examine your BIOS carefully for settings relating to memory voltage, drive strength on the I/Os to the memory, and memory cycle timing, and run some trials of memtest86 with different settings. Usually a higher voltage and slower timings give better results (but with higher voltage comes more heat, so be cautious if you have this setting).
-Andy
On Mon, 2005-10-31 at 18:48 +0000, Gary Stainburn wrote:
On Monday 31 October 2005 1:40 pm, Tim wrote:
On Mon, 2005-10-31 at 12:00 +0000, Gary Stainburn wrote:
I've replaced the memory because it looked like a mem fault
If you believe that to be the case, run memtest86 (installable from extras). It'll pick up memory and related problems (e.g. motherboard might have other issues).
I've run memtest a number of times now. I've got 2 slots and 5 memory sticks. I've tried all sorts of variants with different sticks in different slots, having a stick in just slot 1, having a stick in just slot 2, having both slots full etc.
Every time I run memtest86+ I get errors on test 5. The location and number of the errors change as I change configs, but I never get a clean test. I also never get errors on any of the other tests, just test 5.
What's the chance that the problem's something other than memory?
Memory problems can come from the DIMM, the memory controller, or the power supply.
Since you have repeatedly tried different DIMMs, and the error locations on test 5 change each time, I would assume this is likely a power supply problem. The other candidate is the motherboard.
Try one at a time and you will eventually find the culprit.
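The "try one at a time" approach can be sketched as a simple elimination: list what was installed in each memtest run, then keep only the parts that were present in every failing run and in no clean run. This is only an illustration; the `suspects` function and the part names (stick_A, slot_1 and so on) are made up for the example and do not correspond to anything in memtest86+.

```python
def suspects(trials):
    """Return the parts present in every failing trial and in no passing one.

    `trials` is a list of (parts, failed) pairs, where `parts` is a set of
    labels for what was installed and `failed` says whether errors appeared.
    """
    failing = [set(parts) for parts, failed in trials if failed]
    passing = [set(parts) for parts, failed in trials if not failed]
    if not failing:
        return set()
    common = set.intersection(*failing)
    for parts in passing:
        common -= parts  # anything seen in a clean run is cleared
    return common


# Hypothetical log resembling the situation described: every stick and
# slot combination fails, always on the same board and power supply.
TRIALS = [
    ({"stick_A", "slot_1", "board", "psu"}, True),
    ({"stick_B", "slot_1", "board", "psu"}, True),
    ({"stick_C", "slot_2", "board", "psu"}, True),
    ({"stick_A", "stick_D", "slot_1", "slot_2", "board", "psu"}, True),
]

if __name__ == "__main__":
    print(sorted(suspects(TRIALS)))
    # Swap in a borrowed power supply; if errors persist, the PSU drops out.
    swapped = TRIALS + [({"stick_A", "slot_1", "board", "psu_borrowed"}, True)]
    print(sorted(suspects(swapped)))
```

The method only finds shared factors that have been written down. A property common to all the sticks, such as their speed grade, would have to be listed as a "part" of each trial before it could show up as a suspect.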
Any ideas what I can try next?
Gary Stainburn
This email does not contain private or confidential material as it may be snooped on by interested government parties for unknown and undisclosed purposes - Regulation of Investigatory Powers Act, 2000
On Monday 31 October 2005 13:48, Gary Stainburn wrote:
On Monday 31 October 2005 1:40 pm, Tim wrote:
On Mon, 2005-10-31 at 12:00 +0000, Gary Stainburn wrote:
I've replaced the memory because it looked like a mem fault
If you believe that to be the case, run memtest86 (installable from extras). It'll pick up memory and related problems (e.g. motherboard might have other issues).
I've run memtest a number of times now. I've got 2 slots and 5 memory sticks. I've tried all sorts of variants with different sticks in different slots, having a stick in just slot 1, having a stick in just slot 2, having both slots full etc.
Every time I run memtest86+ I get errors on test 5. The location and number of the errors change as I change configs, but I never get a clean test. I also never get errors on any of the other tests, just test 5.
What's the chance that the problem's something other than memory?
Any ideas what I can try next?
Perchance are you running an Athlon XP-2800 with a 400 MHz FSB in the BIOS settings? Try it at 333 MHz for the FSB setting; the 2800 is not rated for 400 MHz FSB use. There may be similar limitations for other CPUs also.
On Tuesday 01 November 2005 4:09 am, Gene Heskett wrote:
Perchance are you running an Athlon XP-2800 with a 400 MHz FSB in the BIOS settings? Try it at 333 MHz for the FSB setting; the 2800 is not rated for 400 MHz FSB use. There may be similar limitations for other CPUs also.
No, I'm running Intel, but I have noticed that the supplier put DDR400 into a DDR333 board. I've ordered some DDR333 to see if that fixes it.
In the meantime, I'm going to run a tester over the power supply.
On Tuesday 01 November 2005 11:40 am, Gary Stainburn wrote:
On Tuesday 01 November 2005 4:09 am, Gene Heskett wrote:
Perchance are you running an Athlon XP-2800 with a 400 MHz FSB in the BIOS settings? Try it at 333 MHz for the FSB setting; the 2800 is not rated for 400 MHz FSB use. There may be similar limitations for other CPUs also.
No, I'm running Intel, but I have noticed that the supplier put DDR400 into a DDR333 board. I've ordered some DDR333 to see if that fixes it.
In the meantime, I'm going to run a tester over the power supply.
I tested the PSU and all looks fine, so that was an erroneous report from lm_sensors. I replaced the memory with the DDR333 and the memory errors vanished. Funny how it's been fine for 2 years though. The only thing I can think of is that I updated the kernel a few months back.
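For reference, the difference between the two memory grades mentioned here is simple arithmetic under the standard DDR naming, where the number is the nominal transfer rate in MT/s. The sketch below is illustration only: `ddr_figures` is a name invented for this example, and the numbers are nominal ratings, not measurements from this machine. It does not explain why the faster-rated sticks failed in this board, only what the labels mean.

```python
def ddr_figures(clock_mhz):
    """Nominal figures for a 64-bit DDR SDRAM module at the given bus clock."""
    transfers = clock_mhz * 2      # two transfers per clock (double data rate)
    peak_mb_s = transfers * 8      # 8 bytes per transfer on a 64-bit module
    cycle_ns = 1000.0 / clock_mhz  # length of one bus clock period
    return transfers, peak_mb_s, cycle_ns


if __name__ == "__main__":
    # Nominal bus clocks: DDR333 runs at about 166.67 MHz, DDR400 at 200 MHz.
    for name, clock in (("DDR333", 500.0 / 3), ("DDR400", 200.0)):
        transfers, peak, cycle = ddr_figures(clock)
        print(f"{name}: {transfers:.0f} MT/s, {peak:.0f} MB/s peak, "
              f"{cycle:.1f} ns cycle")
```

On a board that only supports DDR333, the bus runs at the slower clock whichever sticks are fitted, so any mismatch would come from the timings the BIOS picks up from the modules rather than from the raw speed rating.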
On Tue, 2005-11-01 at 14:10 +0000, Gary Stainburn wrote:
I tested the PSU and all looks fine - erroneous report from lm_sensors
I get that, too.
I replaced the memory with the DDR333 and the memory errors vanished. Funny how it's been fine for 2 years though. The only thing I can think of is that I updated the kernel a few months back.
I've got some faulty DIMM SDRAM that will work, to some degree, in various PCs here (though completely rejected by others).
Most of the time the PC will work fine, but occasionally it'll have some unexplained error. You think it's just a programming fault, but once you swap the RAM, the box becomes 99.99999% reliable. All I can guess is that the fault exists in some part of the RAM that's either not used often, or is usually gobbled up by something that does nothing most of the time.
The only time a PC is 100% reliable is when it's unpowered. You can tell, with absolute precision, what it will do.
On Mon, 2005-10-31 at 18:48 +0000, Gary Stainburn wrote:
I've run memtest a number of times now. I've got 2 slots and 5 memory sticks. I've tried all sorts of variants with different sticks in different slots, having a stick in just slot 1, having a stick in just slot 2, having both slots full etc.
Every time I run memtest86+ I get errors on test 5. The location and number of the errors change as I change configs, but I never get a clean test. I also never get errors on any of the other tests, just test 5.
What's the chance that the problem's something other than memory?
It's possible. Your memory could be bad. ;-) It might not be compatible with your motherboard. Your motherboard might have a problem that relates to handling RAM, or the CPU.
Any ideas what I can try next?
Try your RAM in a friend's box, and memtest86 it in there. Or take it to a shop and ask them to test your RAM (though watch that they don't just put it in a PC and see what the BIOS test says).
Cold boot, and skip straight to test 5. If you don't get errors, then it might be a temperature-related issue.
Be aware that if you get any errors with memtest86, for whatever reason, then your system has a problem.