Coping with Spectre and Meltdown: What sysadmins are doing
[Editor's note: This is part of a series on Spectre and Meltdown. Check out HPE's step-by-step instructions, important links, and FAQ on how to mitigate risk and resolve the issues. Get up to speed now.]
It’s been a month since we all learned about Spectre and Meltdown, the two gaping security vulnerabilities that affect nearly every hardware platform. By now, you’re up to date on the technical details, and you've watched the industry scurry to respond with patches and updates as it prepares for longer-term solutions.
But that doesn’t help the poor human being who has to cope with the problem in the meantime. If a vendor has released patches, it’s the system administrator who has to apply them. If a patch introduces new problems, it has to be backed out. And if systems slow down as a result of a short-term update and application performance suffers, the finger-pointing is directed at the whimpering sysadmin, whose only recourse is to post to /r/talesfromtechsupport.
If nothing else, every sysadmin started January with a long to-do list, and those projects are now behind schedule.
I asked several network managers, system administrators, and DevOps engineers how they’re coping—so, if nothing else, you can be reassured that you’re not alone. Here’s a snapshot of what other sysadmins have done so far to respond to Spectre and Meltdown, as well as their current plans and long-term strategies.
What have you done so far?
Everyone has applied patches. But that sounds ever so simple.
Ron, an IT admin, summarizes the situation succinctly: “More like applied, applied another, removed, I think re-applied, I give up, and have no clue where I am anymore.”
That is, sysadmins are ready to apply patches—when a patch exists. “I applied the patches for Meltdown but I am still waiting for Spectre patches from manufacturers,” explains an IT pro named Nick. “I’m waiting for Bond to do his Spectre thing,” agrees another admin, Emil.
Vendors have released, pulled back, re-released, and re-pulled back patches, explains Chase, a network administrator. “Everyone is so concerned by this that they rushed code out without testing it enough, leading to what I've heard referred to as ‘speculative reboots.’”
“We patched OSs where we could,” says Skylar, a systems engineer at a university. “Unfortunately I think we have some vendor-supported systems where we don't have patches yet, but they're also single-purpose systems isolated from the Internet so the risk seems acceptable for now.”
What are your current activities and short-term plans?
The confusion—and rumored performance hits—are causing some sysadmins to adopt a “watch carefully” and “wait and see” approach. This means spending even more time reading in order to stay up to date and learn if their own infrastructure needs immediate attention.
"The problem is that the patches aren’t free in terms of performance. In fact, some patches carry warnings about potential side effects," says Sandra, who recently retired after 30 years of sysadmin work. "Projections of how badly performance will be affected range from 'You won’t notice it' to 'significantly impacted.'" Plus, IT staff have to determine whether the patches themselves could break something. They’re looking for vulnerabilities and running tests to evaluate how patched systems might fail or be open to other problems.
“We’ve investigated what's happening and verified that our production network does not run outside code (while being very thankful that we are 99% in colo datacenters, not cloud),” says David, a security architect at a large Internet advertising company. “We started paying much closer attention to the Linux kernel mailing list to follow the discussion of the various patches and their impact.”
“We're keeping an eye on the patches for Windows and Linux, and the microcode releases/BIOS updates from our server vendors,” says Chase. “Once they seem to stay out for more than two weeks we'll look at upgrading. Given that there is a performance hit, and it differs by workload, it'll be a phased approach to roll out with lots of performance benchmarking. The vulnerability hits if someone can execute malicious code on your servers. There's not a worm or drive-by attack that can exploit this (as yet). Given that, hardening our systems and locking down user access is the best mitigation we can do.”
This requires attention to nearly every internal system. As Emil says, “We do virtualization, but users don't have RDP access to the virtualized servers, so one less worry there.”
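On Linux, “keeping an eye on the patches” is at least partly scriptable: kernels that carry the fixes report per-bug mitigation status through sysfs. A minimal sketch, assuming a 4.15-era kernel or a distro backport that exposes these files (the directory path is standard, but availability varies by vendor):

```shell
#!/bin/sh
# Report the kernel's own view of each CPU vulnerability and its mitigation.
# Patched kernels expose one status file per bug (meltdown, spectre_v1,
# spectre_v2, ...); unpatched kernels simply lack the directory.
VULN_DIR=/sys/devices/system/cpu/vulnerabilities
if [ -d "$VULN_DIR" ]; then
    for f in "$VULN_DIR"/*; do
        printf '%-14s %s\n' "$(basename "$f"):" "$(cat "$f")"
    done
else
    echo "No sysfs vulnerability reporting: this kernel predates the fixes."
fi
```

A line such as `spectre_v2: Vulnerable` showing up after a reboot is the quickest way to notice that a pulled microcode or BIOS update has quietly undone a mitigation.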
"Wait and see" doesn’t mean “do nothing.” As Sandra explains, "We can’t just apply the patches and dust off our hands. Keep in mind that these patches change how our systems work at a very low level. Would you take a drug that kept you from making stupid mistakes if you didn’t know the other ways in which it might affect your thinking? Probably not."
“We're waiting for functioning microcode updates,” says Skylar. “Fortunately most of our workload is batch-scheduled so we have an easy mechanism to schedule maintenance and do rolling reboots behind the scenes.”
In the meantime, says Chase, “we're making sure our hardening guidelines are in place and that user access to servers is limited. That should mitigate the threat of malicious code getting onto our boxes. (I should point out we're heavily virtualized on-premise.)”
How much of a threat that is depends on the hardware the sysadmin’s employer uses and the features it provides.
What’s the long-term strategy?
These sysadmins are doing their best to respond reasonably in the short term. But everyone knows that Spectre and Meltdown patches are just Band-Aids. Buy new servers?
Maybe. The long-term options depend on the performance impact, says Chase. That means a lot of performance testing, particularly in the sysadmin’s own environment. “If it's heavy for our workloads, yes, we'll look at expanding our server footprint.” As sysadmins and professional benchmark testers have already learned, the performance impact varies widely with workload, hardware, and other factors. Initial fears of 20 percent to 30 percent degradation were generally worst-case scenarios, but that doesn’t mean any given shop can predict whether it will be lucky or unlucky.
Skylar is taking a similar approach: “We don't have any big purchases scheduled right now, but I would be curious to see what the new performance figures for Intel vs. AMD (vs. ARM?) turn out to be.”
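Because the KPTI (Meltdown) fix taxes every user-to-kernel transition, a syscall-heavy microbenchmark is a quick, if crude, way to bound the worst case before committing to full workload testing. A rough sketch, not a substitute for benchmarking your actual applications: run it on the patched kernel, then boot an isolated test machine with `nopti spectre_v2=off` on the kernel command line (mitigations disabled; never do this on production) and run it again.

```shell
#!/bin/sh
# Crude syscall-rate microbenchmark: with bs=512, every block is one read()
# plus one write(), so elapsed time is dominated by kernel entry/exit,
# exactly the path that KPTI makes more expensive.
# Compare elapsed time on a patched boot vs. a "nopti spectre_v2=off" boot.
time dd if=/dev/zero of=/dev/null bs=512 count=500000
```

The gap between the two runs approximates an upper bound for your environment; real applications make far fewer syscalls per unit of work, which is why measured impacts usually come in well under the microbenchmark numbers.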
The unexpected problem is causing some to reconsider network configurations. David, for instance, expects to consider running a private cloud. That might include “mix[ing] dev/QA/prod workloads on the same box, with an eye to running separate Cloud instances and swing[ing] machines from one cluster to another based on need.”
Then there’s the open question of what sysadmins can expect from the vendors they work with regularly. “The vulnerabilities are a factor in our server purchasing,” says David, “but we are having trouble finding VARs that will offer us AMD servers.”
And what have you told management to explain all this?
"I can’t imagine any IT staff not fully informing upper management about the problem and what they’re doing to both cautiously and responsibly address it," says Sandra. "That involves the nature of the flaws and the steps being taken to reduce the impact on the organization. These flaws represent a large and very serious problem with a lot of still unknown implications." The question is: How do you deliver that message?
These particular sysadmins are lucky: They have upper management who are educated—or at least are smart enough to trust their tech staff.
“I have no idea what the upper management in our group has told overall company management,” says Chase. “Our IT management understood the technical details behind the vulnerability and our exposure footprint very well upon release.”
“We've told them the cold, hard facts,” says Skylar. But at the university, the management and customer base are technical—and capable of understanding complex problems.
David’s experience may be more representative of the average network admin who needs to convey the situation to the boss. He explained to management “that it is a huge danger on the office network (where we patched as soon as vendors provided patches), but that the short-term risk was low due to not sharing systems, and the performance impact on our workload is expected to be very high (a 30 percent performance hit is probably optimistic).”
Do these sysadmins’ experiences match yours? Let me know on Twitter @estherschindler.
This article/content was written by the individual writer identified and does not necessarily reflect the view of Hewlett Packard Enterprise Company.