I bought a “new” computer this week (motherboard, processor, memory, and video card) because I have been running into memory pressure on some of my workloads lately, and I really didn’t want to go past 16 GB of non-ECC RAM. My old processor and chipset (an Intel Core i5-2300 on H67) didn’t support ECC RAM, so I needed to upgrade the whole shebang. I wound up selecting an Intel Xeon E5-2620 v4 8-core processor, 32 GB of Crucial RAM, and an ASRock Fatal1ty X99X Killer (I know, I know) motherboard, with an MSI GeForce GT 710 video adapter. The processor, motherboard, and RAM were chosen for performance and power consumption; the video adapter was chosen primarily for power consumption, since it seems very difficult to buy a truly low-power desktop card these days and my GPU usage is quite minimal. Unfortunately, this system was not as plug-and-play as I had hoped, and this is the story of the problem and how I’ve (hopefully) fixed it.
Things seemed to start off well, with the machine performing as expected: each core performs similarly to, or maybe about 5% faster than, a core in my i5-2300 for the loads I care about, but there are twice as many cores, plus twice as many again hyperthreading contexts. The first thing I did upon receipt was update the BIOS, as I had read that some X99 boards required an update for Broadwell CPUs, although mine did boot to the BIOS and update itself without any trouble (props to ASRock for that, by the way; network update from within the BIOS was convenient). Before swapping the boards I had installed a 4.8.0 kernel from Debian jessie-backports. Initial power-up was no trouble at all.
Unfortunately, after several hours one of the cores locked up, issuing the error NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s!, and Chrome became unresponsive. A short time later, other CPUs started reporting soft lockups as well, and the dumps looked something like this (included so that searches might find this page if it's relevant):
Jan 12 18:23:20 colt kernel: [ 6484.590710] CPU: 11 PID: 31903 Comm: chrome Tainted: G E 4.8.0-0.bpo.2-amd64 #1 Debian 4.8.11-1~bpo8+1
Jan 12 18:23:20 colt kernel: [ 6484.590711] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99X Killer, BIOS P3.10 07/01/2016
Jan 12 18:23:20 colt kernel: [ 6484.590712] task: ffff920ca14f3000 task.stack: ffff920c50e40000
Jan 12 18:23:20 colt kernel: [ 6484.590713] RIP: 0010:[ ] [ ] smp_call_function_many+0x1f1/0x250
Jan 12 18:23:20 colt kernel: [ 6484.590719] RSP: 0018:ffff920c50e43c58 EFLAGS: 00000202
Jan 12 18:23:20 colt kernel: [ 6484.590720] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000000
Jan 12 18:23:20 colt kernel: [ 6484.590721] RDX: ffffda387fa03020 RSI: 0000000000000200 RDI: ffff920cff4d9688
Jan 12 18:23:20 colt kernel: [ 6484.590721] RBP: ffff920cff4d9688 R08: ffffffffffffffff R09: 0000000000000000
Jan 12 18:23:20 colt kernel: [ 6484.590722] R10: 0000000000000008 R11: 0000000000000246 R12: ffff920cff4d9680
Jan 12 18:23:20 colt kernel: [ 6484.590723] R13: ffffffffa4e6d990 R14: 0000000000000000 R15: 0000000000000001
Jan 12 18:23:20 colt kernel: [ 6484.590724] FS: 00007f617a98da00(0000) GS:ffff920cff4c0000(0000) knlGS:0000000000000000
Jan 12 18:23:20 colt kernel: [ 6484.590725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 12 18:23:20 colt kernel: [ 6484.590725] CR2: 0000188f77c3b000 CR3: 00000007d9b00000 CR4: 00000000003426e0
Jan 12 18:23:20 colt kernel: [ 6484.590727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 12 18:23:20 colt kernel: [ 6484.590727] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 12 18:23:20 colt kernel: [ 6484.590728] Stack:
Jan 12 18:23:20 colt kernel: [ 6484.590729] 0000000000019640 0000000100000000 ffff920cc132e700 ffffffffa4e6d990
Jan 12 18:23:20 colt kernel: [ 6484.590731] 0000000000000000 ffff920c50e43d50 ffff920c50e43d58 ffff920c50e43cb0
Jan 12 18:23:20 colt kernel: [ 6484.590733] ffffffffa4f004b8 ffff920cc132e700 ffff920c50e43d00 ffff920cff5d4828
Jan 12 18:23:20 colt kernel: [ 6484.590735] Call Trace:
Jan 12 18:23:20 colt kernel: [ 6484.590739] [ ] ? leave_mm+0xc0/0xc0
Jan 12 18:23:20 colt kernel: [ 6484.590741] [ ] ? on_each_cpu+0x28/0x50
Jan 12 18:23:20 colt kernel: [ 6484.590743] [ ] ? flush_tlb_kernel_range+0x48/0x90
Jan 12 18:23:20 colt kernel: [ 6484.590746] [ ] ? __purge_vmap_area_lazy+0xba/0x2f0
Jan 12 18:23:20 colt kernel: [ 6484.590748] [ ] ? vm_unmap_aliases+0x118/0x140
Jan 12 18:23:20 colt kernel: [ 6484.590750] [ ] ? change_page_attr_set_clr+0xef/0x450
Jan 12 18:23:20 colt kernel: [ 6484.590752] [ ] ? set_memory_ro+0x2d/0x40
Jan 12 18:23:20 colt kernel: [ 6484.590755] [ ] ? bpf_prog_select_runtime+0x25/0xc0
Jan 12 18:23:20 colt kernel: [ 6484.590758] [ ] ? bpf_prepare_filter+0x2ef/0x3f0
Jan 12 18:23:20 colt kernel: [ 6484.590761] [ ] ? kmemdup+0x32/0x40
Jan 12 18:23:20 colt kernel: [ 6484.590762] [ ] ? bpf_prog_create_from_user+0xd0/0x120
Jan 12 18:23:20 colt kernel: [ 6484.590767] [ ] ? proc_watchdog_cpumask+0xd0/0xd0
Jan 12 18:23:20 colt kernel: [ 6484.590768] [ ] ? do_seccomp+0x109/0x6b0
Jan 12 18:23:20 colt kernel: [ 6484.590772] [ ] ? system_call_fast_compare_end+0xc/0x96
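As an aside, the "stuck for 22s" figure in messages like these isn't arbitrary; it comes from the kernel's soft-lockup watchdog, whose threshold is tunable. A small sketch, assuming the usual kernel defaults (verify the sysctl paths exist on your kernel before relying on them):

```shell
# Sketch: where the "stuck for NNs" figure comes from. The soft-lockup
# watchdog fires at roughly twice kernel.watchdog_thresh (default 10 s).
THRESH=10    # seconds; the live value is in /proc/sys/kernel/watchdog_thresh
echo "soft lockup reported after roughly $((THRESH * 2))s"
# To get backtraces from every CPU when it fires (sysctl assumed from the
# kernel's documentation; check it exists on your kernel version):
# echo 1 | sudo tee /proc/sys/kernel/softlockup_all_cpu_backtrace
```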
I showed some of the errors to Rik van Riel, who thought that the dumps, in totality, looked like a possible CPU idle-management bug and suggested booting with the kernel option intel_idle.max_cstate=0 to use ACPI idle management instead of the Linux-native Intel idle-management code. Long story short, this did succeed in changing the idle management, but it didn't fix the problem. I backed the change out, because it prevented the CPU from reaching the deep sleep states I like seeing for power savings.
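For anyone who wants to try the same experiment: on Debian, boot options like this go on the kernel command line via GRUB. A hedged sketch, assuming the standard Debian GRUB paths:

```shell
# Sketch: trying intel_idle.max_cstate=0 on a Debian system with GRUB.
PARAM='intel_idle.max_cstate=0'
echo "add to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub: $PARAM"
# Then regenerate the config and reboot:
# sudo update-grub && sudo reboot
# Afterwards, confirm the idle driver actually changed:
# cat /sys/devices/system/cpu/cpuidle/current_driver
#   (expect "acpi_idle" rather than "intel_idle")
```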
We had also discussed the possibility that this was a graphics-related bug, so I disabled accelerated rendering in Chrome. That didn't seem to have any effect either, and soon I wound up with the same soft lockup problems. At some point Rik asked whether I might have a SATA 6 Gbps drive on a 3 Gbps cable, which reminded me that I had moved some drives from a 3 Gbps controller to a 6 Gbps controller during the upgrade, so I forced those drives down to the slower SATA 2 speeds with another kernel option: libata.force=3:3.0,4:3.0. This option causes the kernel to force ata3 and ata4 (the two drives in question) to a 3 Gbps link. Sadly, this wasn't the problem either, and the soft lockups persisted. It's probably a sensible precaution until I can check the cables, though (and it won't matter much, as they're spinning-platter drives), so I left it in.
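To unpack that option's syntax (this decode loop is just illustrative shell, my reading of the libata.force parameter format, not anything the kernel runs):

```shell
# Illustrative decode of libata.force=3:3.0,4:3.0: each "port:limit" pair
# caps the named ATA port's SATA link speed.
FORCE='3:3.0,4:3.0'
for spec in ${FORCE//,/ }; do
    port=${spec%%:*}     # ATA port number, e.g. 3
    speed=${spec#*:}     # speed cap in Gbps, e.g. 3.0
    echo "ata${port} limited to ${speed} Gbps"
done
# After booting with the option, the negotiated speed can be checked via
# the standard libata sysfs path:
# cat /sys/class/ata_link/link*/sata_spd    # e.g. "3.0 Gbps"
```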
I had grown increasingly suspicious of the video card through all of this. I should mention that one of the first things I did was run memtest86+ for a couple of hours to try to rule out general processor/memory problems, and it ran without trouble. That pretty much left the video card or the software (or, as it turned out, the video card's software) as the likely culprits in my mind. I do have another, more powerful GPU in a static bag in the basement (removed from another machine years ago to save the power draw), and I considered putting it in. Before doing that, however, I decided to try the software solution: replacing the open-source Nouveau driver for NVIDIA video cards with NVIDIA's closed-source, proprietary driver.
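For reference, a sketch of that driver swap on Debian, assuming non-free is enabled in sources.list (the nvidia-driver package installs a Nouveau blacklist as part of its packaging):

```shell
# Sketch: swapping Nouveau for the proprietary driver on Debian.
# Assumes "non-free" is enabled in /etc/apt/sources.list.
# sudo apt-get update
# sudo apt-get install nvidia-driver   # also drops a nouveau blacklist
# sudo update-initramfs -u             # keep nouveau out of early boot
# sudo reboot
# Afterwards, check which kernel module is actually loaded:
# lsmod | grep -e '^nvidia' -e '^nouveau'
```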
I am sad to say that this appears to have fixed the problem. Only time will tell (some of the lockups took several hours to come to fruition), but the machine has now been running without error for almost 17 hours. That included an hour or so of running as close to 100% CPU utilization as I could get (16 threads of melt, a 16-job compile, and four simultaneous 1080p 29.97 fps H.264 renders), during which temperatures remained very acceptable and no glitches presented themselves.
So ... if you're seeing soft lockups using Nouveau on a 4.8.0 kernel (or some similar combination), you may wish to consider either an alternate video adapter or the proprietary NVIDIA drivers. I'll be reporting this, so hopefully it will be fixed soon. As an added bonus, the NVIDIA drivers run the card about 10% cooler (roughly 50 °C versus 55 °C), which I have to assume means it's drawing less power. Either way, it's too power-hungry!