To maximise available space on my benches for actual electronics work, my lab bench PC is a tiny Gigabyte BRIX BXBT-1900 which I have mounted on the wall behind the bench. The monitor is a 24" flat panel display mounted on an arm to keep it off the bench. I'm running the latest Linux Mint 18 with the Cinnamon desktop environment. From a space-efficiency point of view this works really well, and I can move the monitor around on its arm so the display is flat to the wall or out over the middle of the benches, pointing towards either my testing bench or my soldering / rework bench. There is just one small problem..... it crashes..... daily..... sometimes several times a day.....
I'd lived with this for a while, as most of the usage was for displaying data sheets, schematics or layouts whilst working at the bench and the lockups rarely got in the way. For anything serious like developing code or working with EAGLE I always use one of the Macs, so it didn't interfere with anything significant. Recently, however, I installed an NGINX web server, MySQL, and PHP on this machine because it's one of the few "always on" machines. It is going to form part of my component management system (a possible topic for a future blog post), so it needs to be reliable. Eventually, once I have rebuilt my ESXi system, this functionality will move to a virtual machine, but for now this is where it must reside and so I need to fix this issue.
2. Investigation / Research
The problem with an issue like this is that the crashes happen relatively infrequently, i.e. hours or days apart rather than minutes or seconds, so it can take quite a while to find out whether a change has had an effect. Sometimes a change can appear to work, but then you find you are a victim of random variation and the system just happened to stay stable a little longer than before.
I first ensured I was fully up to date with all updates, including some kernel updates which I was hopeful might resolve the issue. Unfortunately after leaving it overnight it had locked up again.....
I wondered if it was related to the Cinnamon desktop, which taxes the graphics system a reasonable amount as it's one of the "prettier" desktops available on Linux (IMHO) and as such uses more features of the graphics driver. I installed Xfce and used that as my desktop for a little while. Unfortunately this didn't resolve the issue either.
I did some Googling to see if anybody else had this issue. Maybe it was my BRIX and I was fighting a hardware issue? I did see reports of one or two people having issues with the external PSU leading to crashes. I was about to rig up a temporary "clean" PSU to test this out when I stumbled on a forum post (which I can no longer find so can't link to) in which several people were complaining that their BRIX machines were rock solid under Windows but locked up daily with Ubuntu. At this point I started to worry, as I really didn't want to have to install Windows to get this to be reliable.....
A little more reading and a few more searches later and I discovered a potential culprit: CPU C-States! I had a flashback to a few years back when I was setting up a server rack for running VMs at work and both of the HP servers I was using were randomly freezing. The solution back then was to disable C-States in the BIOS of the servers! How had I forgotten all about this?!
But what are C-States and what are they used for? In order to save power, modern CPUs have various modes they can enter to reduce power consumption when the CPU is idle. Basically, different states turn off clocking to various parts of the CPU, which in turn reduces power consumption. In some circumstances this can go wrong and can cause a system to lock up. For more information on C-States and the various levels see http://www.hardwaresecrets.com/everything-you-need-to-know-about-the-cpu-c-states-power-saving-modes/
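If you are curious which idle states your own kernel exposes, something like the following should list them. This is only a minimal sketch: it assumes the kernel's cpuidle sysfs interface is available under /sys/devices/system/cpu, and the function name list_cstates is made up for this example.

```shell
#!/bin/sh
# List the idle states (C-states) the kernel exposes for one CPU.
# Assumption: the cpuidle sysfs interface is present. The optional first
# argument is the CPU directory to inspect and defaults to cpu0.
list_cstates() {
    cpu_dir="${1:-/sys/devices/system/cpu/cpu0}"
    for state in "$cpu_dir"/cpuidle/state*; do
        # If the glob matched nothing, the literal pattern is left behind,
        # so skip anything without a readable "name" file.
        [ -r "$state/name" ] || continue
        printf '%s: %s (exit latency %s us)\n' \
            "${state##*/}" "$(cat "$state/name")" "$(cat "$state/latency")"
    done
}
```

Calling list_cstates with no arguments reads the states for cpu0. The exact names and the number of states vary with the CPU and the idle driver in use, so don't expect your list to match anybody else's.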
3. The solution
DISCLAIMER: FOLLOW THIS GUIDE AT YOUR OWN RISK, I TAKE NO RESPONSIBILITY FOR ANY ISSUES WHICH MAY OCCUR IN YOUR SYSTEMS AS A RESULT OF TRYING ANYTHING FROM THIS GUIDE. ENSURE YOU HAVE IMPORTANT DATA BACKED UP AND TAKE A COPY OF YOUR GRUB CONFIG FILE BEFORE EDITING.
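Taking that copy of the GRUB config file can be as simple as a cp, but a timestamped backup makes it harder to overwrite an earlier good copy by accident. The helper below is a sketch only: backup_config is a hypothetical name, and the default path assumes a Debian/Ubuntu/Mint style layout where the file lives at /etc/default/grub.

```shell
#!/bin/sh
# Copy a config file to a timestamped backup next to the original and
# print the name of the backup. Defaults to /etc/default/grub (assumed path).
backup_config() {
    src="${1:-/etc/default/grub}"
    dest="$src.$(date +%Y%m%d-%H%M%S).bak"
    # -p keeps the original mode and timestamps on the copy.
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}
```

Writing into /etc/default needs root, so the function would have to be run from a root shell. If an edit goes wrong, copying the backup over /etc/default/grub and re-running update-grub should put things back as they were.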
Unlike the HP servers where I had previously seen this issue, the BIOS in the BRIX doesn't have any options for disabling C-States. This means it's down to the OS kernel to set the CPU up so that it works reliably. In this case, using Linux, the solution is quite simple: a single additional parameter needs to be passed to the kernel via the GRUB boot loader.
This can be achieved as follows:
- Open the GRUB config file in a text editor. The file is owned by root, so the editor needs to be run with sudo for the changes to be saved.
rachael$ sudo nano /etc/default/grub
- Edit the GRUB_CMDLINE_LINUX_DEFAULT line to include intel_idle.max_cstate=1 as shown below. Your initial line may look different if you are on a different Linux distribution; just add the setting alongside whatever is already there.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1"
- Tell the system to update GRUB with the updated configuration.
rachael$ sudo update-grub
- Now reboot your system.
rachael$ sudo reboot
- Once your system is back up and running you should be able to see if the setting has taken effect by looking at the output of dmesg as follows:
rachael$ dmesg | grep intel_idle
[ 0.000000] Command line: BOOT_IMAGE=<PathToBootImage> root=UUID=<UUIDOfDisk> ro quiet splash intel_idle.max_cstate=1 vt.handoff=7
[ 0.000000] Kernel command line: BOOT_IMAGE=<PathToBootImage> root=UUID=<UUIDOfDisk> ro quiet splash intel_idle.max_cstate=1 vt.handoff=7
[ 1.432426] intel_idle: MWAIT substates: 0x33000020
[ 1.432429] intel_idle: v0.4.1 model 0x37
[ 1.432432] intel_idle: lapic_timer_reliable_states 0xFFFFFFFF
[ 1.432435] intel_idle: max_cstate 1 reached
Your output may look a little different but the key things to note are that the intel_idle.max_cstate=1 parameter has been passed to the kernel and that it has shown its max_cstate value of 1 being reached.
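The dmesg buffer eventually rolls over on a long-running machine, so it can be handy to check the running kernel's command line directly instead. The following is a small sketch of that idea; cmdline_has_max_cstate is a name invented for this example, and it relies only on /proc/cmdline, which holds the parameters the kernel was booted with.

```shell
#!/bin/sh
# Succeed if the kernel command line contains intel_idle.max_cstate=<want>.
# $1: expected value, e.g. 1
# $2: file to read, defaults to /proc/cmdline (overridable for testing)
cmdline_has_max_cstate() {
    want="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Put each space-separated parameter on its own line, then look for
    # an exact, whole-line match so that "=1" does not also match "=10".
    tr -s ' \t' '\n' < "$cmdline_file" |
        grep -qxF "intel_idle.max_cstate=$want"
}
```

For example, `cmdline_has_max_cstate 1 && echo "parameter present"` prints the message only when the setting was passed at boot. On kernels that expose it, /sys/module/intel_idle/parameters/max_cstate should also report the limit the driver is using, but treat that path as an assumption and check that it exists on your system.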
Source reference: http://askubuntu.com/questions/749349/how-to-set-intel-idle-max-cstate-1
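If you have several machines to fix, the manual edit of /etc/default/grub described in the steps above could also be scripted. What follows is a hedged sketch rather than a tested tool: add_kernel_param is a hypothetical helper, it assumes the GRUB_CMDLINE_LINUX_DEFAULT value is wrapped in double quotes on a single line, and it assumes the parameter contains no /, & or \ characters (which would upset sed).

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT if it is not
# already there.
# $1: parameter, e.g. intel_idle.max_cstate=1
# $2: config file, defaults to /etc/default/grub (assumed path)
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    # Already present on that line? Then leave the file alone, so that
    # running this twice does not add the parameter twice.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    tmp="$(mktemp)" || return 1
    # Insert the parameter just before the closing double quote.
    if sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"; then
        status=0
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

As with the manual edit, this needs root to write the real file, and update-grub plus a reboot are still required afterwards. Take a backup first and look at the resulting line before rebooting.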
My BRIX has now been up and running without lock-up for several days. Whilst this isn't 100% guaranteed proof that the issue is resolved, this is significantly longer than it has ever stayed "up" before. No doubt it will fail again as soon as I publish this blog....
I decided to write this blog so if in future I come across this issue I have this as a little reminder of what I need to do to fix it. Whilst it applies to the Gigabyte BRIX I am using, I suspect it is relevant to other systems too so hopefully others may find this of use if they are experiencing similar issues. I'll update this blog once the system has been running for a little longer and I have verified whether or not the fix is good. Feel free to leave any feedback in the comments below and thanks for reading!