First 5 things to do when your Windows Server keels over
A few weeks ago, I found myself fixing a Windows NT 4.0 server. After ripping out what little hair I have left, I realized that when it comes to troubleshooting, we’ve come a long way.
The experience of updating that old Windows server (it’s more than 20 years old!) had me wishing it had the power to run Windows Server 2016. Install a new driver for the server’s SCSI controller under Windows NT 4.0? (Blue screen at boot and a complete reinstall!) Get the USB ports of the dual Pentium Pro 200 running on NT? (Blue screen at boot and the second Format C: in two days.) Just looking at the thing sideways causes a blue screen and a boot loop. I can’t count how many times I’ve had to wipe this beast.
Twenty years ago, we didn’t have things like advanced, easy-to-use recovery environments, detailed logs, and self-healing capabilities. Today’s servers can crash or have other issues, of course, but getting them up and running does not instantly result in wiping the thing and starting over.
Or it shouldn’t. Here’s what any system administrator should do to recover from a server failure trauma before considering a full refresh. I hope you're not having to support 20-year-old equipment, but legacy computers are a fact of life.
No. 1: Have you tried turning it… No, seriously, check the hardware!
No, I won’t finish that question. You’ve tried the basics—otherwise, you wouldn’t be here reading this. Last year, I had to fix a Windows Server 2016 disaster that was caused by lightning. The client didn’t have a UPS or any surge protection, most of the hardware components were damaged, and I had to identify and fix them one by one.
Here’s what I did. It should be a good checklist for basic hardware troubleshooting:
- Cabling: Whether it was caused by the surge or something else, the wiring in one of the primary Ethernet cables burned out. Any Ethernet cable tester can tell you if a cable needs replacing, which is how I found out why the server’s connections constantly dropped after we got it running again. If your server doesn’t have especially complex wiring or you don’t have a tester ready, try the good old method of replacing cables one by one.
- Peek into Device Manager: Open Device Manager (devmgmt.msc, an MMC snap-in) and look at the list of hardware shown. If it displays any exclamation marks, it’s time to update or reinstall a driver (best case) or check for error codes that hint at a hardware failure (worst case). Server Core users can’t open Device Manager locally and should find ways to access it remotely, via PowerShell, for example (see the section about Device Manager access), or install NirSoft’s device manager alternative (which can also be installed and run remotely).
- Check disk management: Next step, within MMC, look at your arrays of hard disks and check their health status (run diskmgmt.msc). This is where you get more detailed information about most disk failures. Not sure what some of these messages mean? Microsoft has an exhaustive list of what to do in each scenario.
- Check your server’s memory: It’s odd to see how often faulty memory results in strange issues that might seem like they have absolutely nothing to do with RAM! For example, on one workstation, some device drivers failed to load; it took me days to figure out it wasn’t the driver but the driver trying to load itself into defective memory. Don’t be me. Boot into Memtest and let it do its thing before endlessly hunting down strange crashes. (Note: This might take some time!)
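For the Device Manager step on Server Core, here is a minimal Python sketch of one way to list misbehaving devices without the GUI. It assumes the Get-PnpDevice PowerShell cmdlet is available on the server (it ships with Windows Server 2016, as far as I know). The function names and the choice to flag every status other than "OK" are illustrative assumptions, not part of the original checklist.

```python
import json
import subprocess

# Assumption: Get-PnpDevice (PnpDevice module) exists on the target server.
# -PresentOnly skips devices that are merely remembered but not connected.
PS_COMMAND = (
    "Get-PnpDevice -PresentOnly | "
    "Select-Object FriendlyName, Class, Status, InstanceId | "
    "ConvertTo-Json"
)


def problem_devices(ps_json):
    """Return the devices whose Status is not 'OK' from ConvertTo-Json output."""
    if not ps_json.strip():
        return []
    data = json.loads(ps_json)
    # ConvertTo-Json emits a bare object (not a list) when there is one device.
    devices = [data] if isinstance(data, dict) else data
    return [d for d in devices if d.get("Status") != "OK"]


def query_problem_devices():
    """Run the PowerShell query on the local machine (Windows only)."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    )
    return problem_devices(result.stdout)
```

Calling query_problem_devices() on the server should return a list of dictionaries, one per device that would show a warning icon in Device Manager. The parsing is kept separate so it can also be fed JSON captured over a remote session.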
No. 2: Learn the logs
Windows event logs are a blessing, no matter how convoluted they sometimes are and how hard it can be to sift through hundreds of “alerts” you don’t care about. You will find those hidden “critical” events useful when trying to pinpoint a crash, bug, or any form of anomaly.
You’ll find them by firing up Event Viewer with eventvwr.msc (WINKEY+R) or going to Control Panel > System & Security > Administrative Tools > Event Viewer. If you can’t boot up the server, you can copy the log files straight from its disk; they live in C:\Windows\System32\winevt\logs. Look out for the more serious “Error” entries under each Windows Logs category.
Here are a few examples of typical server errors I found over the past years of debugging Windows Server 2016 racks for a small business:
- WSUS errors: Looking under the System category, I instantly found out that Windows had failed to download or install updates from the Windows Server Update Services (WSUS) server. That broke patch management entirely: Windows got stuck on one stupid update error and didn’t install any further security patches. I excluded this one problematic update and, voilà, problem solved.
- Application hangs: Look out for an .exe file here. Find out what it’s used for and use the error code (usually visible in the description) to see why the application crashed.
- Service failure: Service Control Manager events indicate that one of your services failed to start or timed out. This is usually caused by third-party services. A reinstall might help in such cases. The details found under description give you the exact name of the service, too. If it’s nothing you recognize, google it!
- Crashes: The big ones show up as Microsoft-Windows-Kernel-Power events and are the most serious errors listed here. This is when your system reboots as the result of a power outage, for example, or a heavy crash (BSOD). Check under the Details section for an error code or a filename related to a driver to help you figure out what caused this.
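If clicking through Event Viewer gets tedious, the same filtering can be scripted. The following is a rough Python sketch that builds a wevtutil query for the newest Critical and Error events, either from a live log or from an .evtx file copied off an unbootable server. The function name, the default count, and the example path are my own assumptions.

```python
import subprocess


def build_wevtutil_query(log="System", count=20, offline_file=None):
    """Build a wevtutil command that lists the newest Critical/Error events.

    offline_file: optional path to an .evtx file copied from a dead server.
    """
    # Level 1 = Critical, Level 2 = Error in the Windows event schema.
    xpath = "*[System[(Level=1 or Level=2)]]"
    cmd = [
        "wevtutil", "qe", offline_file or log,
        f"/q:{xpath}",   # XPath filter
        f"/c:{count}",   # maximum number of events to return
        "/rd:true",      # reverse direction: newest events first
        "/f:text",       # human-readable output instead of XML
    ]
    if offline_file:
        cmd.append("/lf:true")  # treat the path argument as a log file
    return cmd


def run_query(cmd):
    """Execute the query (Windows only) and return its text output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

For example, build_wevtutil_query(offline_file=r"E:\Windows\System32\winevt\Logs\System.evtx") targets a System log on a disk mounted as E: (a hypothetical drive letter), while the defaults query the live System log.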
No. 3: Stress test
You think you fixed your server? Well, before you put it back online, you had better stress test its hardware components for several hours. Make sure everything works and is capable of handling load. (That’s especially true when you work as a consultant; you don’t want a client calling you back with a second complaint: “I thought you fixed it!” Best be sure.)
For such cases, I always carry the following handful of tools with me, no matter whether I stress test a client or a server:
- Active Directory stress test: Unfortunately discontinued—but thanks to some crafty Microsoft employee made usable again—the Active Directory Performance Testing Tool simulates heavy AD load. It shows you if either your hardware or AD configuration isn’t set up properly for high-load scenarios.
- IIS benchmark: For folks running web servers, this IIS benchmark—while not exactly a stress test—shows how a web server’s performance stacks up. It’s highly recommended, especially when you compare your results to other similar servers on your network or online.
- Network stress test: Used by Microsoft internally and popular among data center admins, NTttcp Utility is my favorite tool to benchmark and stress test Windows networking performance.
- Prime95: Stress tests all of your servers’ CPUs across all threads. Prime95 simulates extreme scenarios that shouldn’t occur in normal server usage. Note that this causes high temperatures, so don’t run it for too long. If the server freezes after even a few seconds or minutes, you know you should consider investing in additional cooling.
- Storage tests: DiskSpd simulates heavy storage loads. If one of your hard disk arrays fails or stops during these tests, it’s time for a replacement.
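To make the storage test repeatable across servers, it can help to generate the DiskSpd command line from a few parameters. This is only a sketch: the workload mix below (random 8 KiB I/O, 30 percent writes, 60 seconds) is an arbitrary assumption to tune for your own arrays, and the function name is hypothetical.

```python
def build_diskspd_command(target, size_gib=1, seconds=60, write_pct=30,
                          threads=4, queue_depth=8, block_kib=8):
    """Build a DiskSpd command line for a random-I/O load test on `target`."""
    if not 0 <= write_pct <= 100:
        raise ValueError("write_pct must be between 0 and 100")
    return [
        "diskspd",
        f"-c{size_gib}G",    # create a test file of this size
        f"-d{seconds}",      # test duration in seconds
        "-r",                # random instead of sequential access
        f"-w{write_pct}",    # percentage of writes (the rest are reads)
        f"-t{threads}",      # threads per target file
        f"-o{queue_depth}",  # outstanding I/O requests per thread
        f"-b{block_kib}K",   # block size
        "-Sh",               # bypass software and hardware write caching
        "-L",                # collect latency statistics
        target,
    ]
```

Joining the list with spaces, for instance " ".join(build_diskspd_command(r"D:\stress\test.dat", seconds=3600)), gives a line to paste into a command prompt. The target path here is just an example.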
No. 4: Check for renegade processes
Random crashes could be caused by buggy processes or services that are constantly thrashing your server’s resources, such as RAM or CPU. A component that is pinned at 100 percent will eventually fail, and the constant load could reduce its lifespan.
The simplest way to find out what’s going on: Open Task Manager and sort by CPU, RAM, etc., to see if there’s a process using up all the resources. Chances are, the process froze or caused a memory leak. If that’s the case, see if an update is available or replace the buggy piece of software.
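Where Task Manager isn’t handy (a remote session or Server Core, say), the built-in tasklist command gives a similar view. Here is a small Python sketch that ranks processes by memory from the output of tasklist /fo csv /nh. Note that tasklist reports memory only, not CPU, and that the function name and the default of five results are my own choices.

```python
import csv
import io
import re


def top_memory_processes(tasklist_csv, limit=5):
    """Rank processes by memory from `tasklist /fo csv /nh` output.

    Returns (image_name, pid, memory_in_kb) tuples, largest first.
    """
    rows = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        name, pid, mem = row[0], row[1], row[4]
        # "Mem Usage" looks like "2,104,332 K"; separators vary by locale,
        # so keep only the digits.
        digits = re.sub(r"\D", "", mem)
        if not digits or not pid.isdigit():
            continue
        rows.append((name, int(pid), int(digits)))
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows[:limit]
```

Feed it the captured text of tasklist /fo csv /nh, and a process that keeps climbing the list between runs is a good candidate for a memory leak.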
No. 5: Server unbootable? Repair core system files!
This has helped me many times on both Windows clients and servers: If the server is unbootable, shows a cryptic or generic BSOD, or reports that any boot files are missing, it’s time to bring out Windows RE.
Got a Windows Server (2008, 2012/R2, 2016…) DVD or bootable USB handy? Time to dig it up (or create one manually using the media creation tool). Make sure your server's UEFI/BIOS is set to boot from DVD/USB.
Once you see the setup, go to the "Repair your computer" options and select Command Prompt. You can try the Startup Repair option first, but chances are that if your server is that far gone, you'll need more drastic methods:
- Repair system files: First, repair some of the core Windows Server system files using the good old sfc /scannow command or sfc /scannow /offbootdir=d:\ /offwindir=d:\windows (check to see which drive letter your Windows Server installation has been assigned in WinRE; in this example, it’s D). This checks for missing or corrupt OS files and replaces them with their pristine originals.
- Repair your server's Boot Configuration Data: The BCD is a small database that stores boot-related information, such as from which partition to boot or what the boot items are labeled in dual-boot scenarios. If this database is corrupted, you will run into no-boot scenarios. From the command prompt, type in
If that doesn’t help, you might need to re-create the boot sector as well as the Master Boot Record. This can be done using the Bootrec /fixboot and Bootrec /fixmbr commands, respectively, although my success rates with those two have been pretty low.
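Because the drive letter in WinRE often differs from the one the server uses when it boots normally, it’s easy to mistype the offline sfc paths. As a small illustration, this Python sketch prints the commands named above for a given drive letter. It’s meant to be run on a working machine to prepare a cheat sheet before you head to the server, and the function name is hypothetical.

```python
import string


def winre_repair_commands(windows_drive):
    """Return the repair commands for a Windows install on `windows_drive`.

    Accepts "d", "D:" or "d:\\"; raises ValueError for anything else.
    """
    letter = windows_drive.strip().rstrip(":\\").upper()
    if len(letter) != 1 or letter not in string.ascii_uppercase:
        raise ValueError(f"not a drive letter: {windows_drive!r}")
    return [
        # Offline system file check against the installation found in WinRE.
        f"sfc /scannow /offbootdir={letter}:\\ /offwindir={letter}:\\windows",
        # Boot sector and Master Boot Record repair, in the order given above.
        "bootrec /fixboot",
        "bootrec /fixmbr",
    ]
```

For example, print("\n".join(winre_repair_commands("d:"))) lists the three commands with D: filled in.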
Admittedly, today’s servers can crash and burn for a billion reasons, but this guide should give system administrators—in both large and small businesses—a good start. Good luck!
All images courtesy of the author.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.