Recently in sysadmin Category

One of the tasks I have is to build some separation between the production environment and test/integration environments. As we approach that Very Big Deadline I've talked about, this is a pretty significant task. Dev processes that we've been using for a really long time now need to change in order to provide the separation we need, and that's some tricky stuff.

One of the ways I plan on accomplishing this is through the use of Environments in Puppet. This looks like a great way to allow automation while delivering different code to different environments. Nifty keen stuff.

But it requires that I re-engineer our current Puppet environment, which is a single monolithic all-or-nothing repo, to accommodate this. This is fairly straightforward, as suggested by the Puppet Docs:

If the values of any settings in puppet.conf reference the $environment variable (like modulepath = $confdir/environments/$environment/modules:$confdir/modules, for example), the agent's environment will be interpolated into them.

So if I need to deliver a different set of files to production, such as a different authorized_keys file, I can drop that into $confdir/environments/production/modules and all will be swell.

So I tried to set that up. I made the environment directory, changed the puppet.conf file to reflect that, made an empty module directory for the module that'd get the changed files (but didn't put the files in there yet), and ran a client against it.

It didn't work so well.

The error I was getting on the puppetmaster was:

err: Could not find class custommod for clunod-wk0130.sub.example.local at /etc/puppet/manifests/site.pp:84 on node clunod-wk0130.sub.example.local

I checked eight ways from Sunday for spelling mistakes, bracket errors, and other such syntax problems but could not find out why it was having trouble locating the custommod directory. I posted on ServerFault for some added help, but didn't get much beyond, "Huh, it SHOULD work that way," which wasn't that helpful.

I decided that I needed to debug where it was searching for modules since it manifestly wasn't seeing the one staring it in the face. For this I used strace, which produced an immense logfile. Grepping out the 'stat' calls as it checked the import statements, I noticed something distinctly odd.

stat("/etc/puppet/modules/base/manifests/init.pp"
stat("/etc/puppet/modules/curl/manifests/init.pp"
stat("/etc/puppet/environments/production/modules/custommod/manifests/init.pp"
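
A rough sketch of that filtering step in Python, for anyone who would rather script it than eyeball grep output. The function name `module_lookups` and the sample lines are illustrative assumptions, not the commands actually used here; it only assumes strace prints one call per line in the usual `stat("path", ...) = result` shape.

```python
import re

# Matches strace lines such as: stat("/etc/puppet/modules/base/manifests/init.pp", ...
STAT_CALL = re.compile(r'stat\("([^"]+)"')

def module_lookups(strace_lines):
    """Return the init.pp paths that show up in stat() calls, in order."""
    paths = []
    for line in strace_lines:
        match = STAT_CALL.search(line)
        if match and match.group(1).endswith("/manifests/init.pp"):
            paths.append(match.group(1))
    return paths

# Hypothetical sample lines in the shape strace prints them.
sample = [
    'stat("/etc/puppet/modules/base/manifests/init.pp", {st_mode=S_IFREG|0644, ...}) = 0',
    'open("/etc/puppet/puppet.conf", O_RDONLY) = 5',
    'stat("/etc/puppet/environments/production/modules/custommod/manifests/init.pp", 0x7fff) = -1 ENOENT (No such file or directory)',
]
for path in module_lookups(sample):
    print(path)
```

Listing the paths in order makes it easy to see which directories were tried for a module and, more to the point, which ones never were.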

It was attempting to read in the init.pp file for custommod under the production environment, even though there wasn't a file there. What's more, it wasn't then retrying under /etc/puppet/modules/custommod/manifests/init.pp. Clearly, the modulepath statement does not work like the $PATH variable in bash and every other shell I've used. In those, if the shell fails to find a command in (for example) /usr/local/bin, it'll then try /usr/bin, and then ~/bin until it finds the file.

The modulepath statement in puppet.conf is focused on modules, not individual files.

This is a false cognate. By having that empty module in the production environment, I was in effect telling Puppet that the Production environment's copy of that module is the one to use, and that copy has nothing in it. Also, my grand plan to provide a different authorized_keys file in the Production environment requires me to have a complete module in the Production environment directory, not just the one or two files I want changed.
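
The difference from $PATH can be sketched in a few lines of Python. This is a toy model of the behaviour observed above, not Puppet's actual lookup code; the function names and the temporary directory layout are made up for the illustration.

```python
import os
import tempfile

def find_module(modulepath, name):
    """Return the first directory on modulepath that holds a module called name."""
    for base in modulepath:
        candidate = os.path.join(base, name)
        if os.path.isdir(candidate):
            return candidate  # first match wins, even if it is empty
    return None

def find_class_manifest(modulepath, name):
    """Look for init.pp only inside the module chosen by find_module."""
    module_dir = find_module(modulepath, name)
    if module_dir is None:
        return None
    manifest = os.path.join(module_dir, "manifests", "init.pp")
    return manifest if os.path.isfile(manifest) else None

with tempfile.TemporaryDirectory() as confdir:
    env_modules = os.path.join(confdir, "environments", "production", "modules")
    shared_modules = os.path.join(confdir, "modules")
    # Empty module in the environment, complete module in the shared directory.
    os.makedirs(os.path.join(env_modules, "custommod"))
    os.makedirs(os.path.join(shared_modules, "custommod", "manifests"))
    with open(os.path.join(shared_modules, "custommod", "manifests", "init.pp"), "w") as f:
        f.write("class custommod {}\n")

    shadowed = find_class_manifest([env_modules, shared_modules], "custommod")
    fallback = find_class_manifest([shared_modules], "custommod")
    print(shadowed)   # None: the empty environment module hides the shared one
    print(fallback)
```

The search stops at the module directory, so a per-file fallback never happens. Removing the empty directory, or filling it with a complete copy of the module, are the two ways out.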

Good to know.

The need for 'standby time'

This morning's SANS blog-entry rang true with me.

They're coming from an InfoSec point of view, but this fits into the overall IT framework rather well. I even remembered it in the chart I posted back in January:
[Image: SA-TreeOfIT.png]
A discrete Security department is one of the last things to break out of the pack, and that's something the SANS diary entry addresses. Between one-gun IT and a discrete security department (a "police department" in their terms) you get the volunteer fire-department. There may be one person who is the CSO, but there are no full-timers who do nothing but InfoSec stuff. When stuff happens, people from everywhere get drawn in. Sometimes it's the same people every time, and when a formal Security department is formed it'll probably be those people who are in it.

But, stand-by time:

Although it may sound like it means "stand around doing nothing," standby time is more like on-call or ready-to-serve time. Some organizations implement on-call time as that week or two that you're stuck with the pager so if anything happens after-hours you're the one that gets called. Otherwise known as the "sorry family, I can't do anything with you this week" time. As the organization grows, that will become less onerous as they move to a fully-staffed 24/7 structure with experienced people. That's not really what I mean by standby time.

Standby time is time that is set aside in the daily schedule that is devoted to incident-response. Most of the time it should focus on the first stage of incident-response, or Preparation. It's time spent keeping up to date on security news and events, updating documentation, and building tools and response processes. It's an interruptible time should an incident arise, but it's not interruptible for other meetings or projects.

This is time I've been calling "fire-watch" all along. Time when I have to be there in case something goes wrong, but I don't have anything else really going on. I spent a lot of 2010 in "fire-watch" thanks to the ongoing budget crisis at WWU and the impact it had on our project pace.

Kevin Liston is advocating putting actual time on the schedule when you are doing the crisis watch as your primary duty. During this time anything else you do is the kind of thing that is immediately interruptible, such as studying or doing the otherwise low-priority grunt work of InfoSec. Or, you know, blogging.

Does this apply to the Systems Administration space?

I believe it does, and it follows a similar progression to the InfoSec field.

When a crisis emerges in a one-gun shop, well, you already know who is going to handle it.

In the mid-size shops like we had at WWU, the person who handles the crisis is generally the one who discovers it first, a decidedly ad-hoc process.

In the largest of shops where they have a formal Network Operations Center, they may have sysadmin staff on permanent standby just in case something goes wrong. Once Upon a Time, they were called 'Operators'; I don't know what they're called these days. They're there for first line triage of anything that can go wrong, and they know who to call when something fails outside their knowledge-base.

Standby Time is useful, since it gives you a whack of time you'd otherwise be bored in, during which you can do such useful things as:

  • Updating documentation
  • Reviewing patches and approving their application
  • Spreadsheeting for budget planning
  • Reviewing monitoring system logs for anomalies any automated systems may have missed
  • Reviewing the monitoring and alerting framework to make sure it makes sense

Or, those 'low priority' tasks we rarely seem to get to until it's too painful not to get around to it.

Not all sysadminly types need to be on the fire watch, but some do. In DevOps environments where the systems folk are neck deep in development, some of them may only be on it for a little while. In others where certain specialties that are rarely involved in incident-response are in evidence, such as Storage Administrators, they may never get on the front-line fire watch but may carry the 2nd/3rd tier pager.

Note, this is in addition to any helpdesk that may be in evidence. The person on fire watch will be the first responder in case something like a load-spike triggers a cascading failure among the front-end load-balanced web-servers. A fire-watch is more important for entities that have little application diversity and few internal users; things like Etsy, eBay, and Amazon. It's less important for entities that have a lot of internal users and a huge diversity in internal systems, places like WWU. In these cases you can have a lot of things that can go wrong in little ways, and who knows how to fix 'em is hard to track.

If nothing else, you can put "fire watch" on your calendar as an excuse to do the low-level tasks that need to get done while at the same time fending off meeting invites.

Lisa 2011: The Limoncelli Test

Also known as M7.

From the book:

Tom's books total over 2,100 pages of advice. In this class he'll narrow all that down to 32 essential practices. Tom will blast through all the 32 practices, explaining what brought him to include each one on the list, plus tips for incorporating the practice, policy, or technology into your organization. You'll find some great ideas for providing better service with less effort.

Take back to work: How to identify and fix your biggest problems, cross-train your team, strengthen your systems--and more!

Topics include:

  • Improving sysadmin-user interaction
  • Best practices for working together as a team
  • Best practices for service operations
  • Engineering for reliability
  • Sustainable Enterprise fleet (desktop/laptop) management
  • How to figure out what your team does right, and where it needs to improve

This was a very good session. It covers the Limoncelli Test, unsurprisingly. This is one of many attempts to come up with a Sysadmin version of the Joel Test (ServerFault tried). But this one seems to be going the distance. Do click on the link, as it leads right to the test. Tom has even written essays about each point to support its being there.

Some of the stuff in here is obvious if you've been in the industry for a while (use a ticket-tracking system, automated patching), others perhaps not so much (there are three policies that all sysadmin departments need to have defined to be effective). Some applies only to multi-person environments (pager rotations) while others are universally applicable (service monitoring).

I got a lot of goodies out of this. Some of it I had been peripherally aware of, but had never seen written up like this all in one spot before.

Ops Docs

An Ops Doc is a kind of service documentation. Each service you offer needs an ops doc and it needs to have certain things in it:
  • Overview: What it is, what it does.
  • Build: How to build it, get it.
  • Deploy: How to install it, configure it.
  • Common Tasks: What do you commonly do with it, and what kinds of issues commonly come up + resolutions.
  • Pager playbook: Document alert handling.
  • Disaster Recovery: What are the DR policies for this service, and how do you run them.
  • Service Level Agreements: What has been promised to whom, what are the penalties. What is it, where is it, how to deal with it.

Critical? Periodic audit. Probably by a non-technical manager, such as a Project Manager. And by "non-technical" I mean someone for whom managing people is their job, not managing technology directly.

The Three Empowering Policies

There are three policies that all Sysadmins need to have defined in order to be effective. Otherwise, people will just walk up whenever and ask you to do stuff, and you'll do it whenever they ask, however they ask, since we're nice that way. This is managing by interrupt and that's not a good way to manage our time. What's more, it leads to grumbling, and reinforces the Server Troll reputation we sometimes get. The three policies:

Acceptable methods for users to ask for help

Walking up and asking may be the best way for you, but in general it isn't a good way. Having a policy that defines what are the ways that users may ask for help allows sysadmins to better budget their time, and makes them more efficient overall.

The definition of an emergency

By enshrining the definitions of emergency into policy, you prevent a localized issue advocated by a vocal person or small group of users from sucking resources away from a larger issue that affects the entire system but doesn't have a vocal advocate driving attention to it. In Tom's example, a Code Red is something that stops production cold, and a Code Yellow is something that could lead to a Code Red if left unattended.

The scope of service

This policy defines what is and is not covered. It is this policy that tells people that the sysadmins are not fax-repair qualified, or whether or not they make house-calls for teleworkers. This policy also defines when service is available, and what the after-hours options are. 

How to convince people to make big changes

I have to give big, big thanks to Tom for this one. This section of the class was focused on how to convince manager-types or other people with the power to block IT changes that such a change is in their best interest. I've been saying for years that one of the chief skills a well qualified Systems Engineer needs is the ability to effectively speak to management. A technician doesn't need to talk to people persuasively. A technical manager needs to talk to other managers. Tom went there.

Thank you.

Thank you.

Thank you.

I've met many people in our field who stuck with computers because either people are scary, or they don't want to deal with the bullshit that dealing with people day in and day out requires. These are not the people that make it to Senior jobs, at least not without some help. Tom identified some effective strategies for social engineering your way to what needs doing.

A fuller treatment of this topic will be in another blog post. Heck, I've got a proto-book on this very topic in progress. So this will be briefer than it really needs to be.

  • Don't make people feel wrong. Phrase changes in non-accusatory ways. Making them feel wrong gets them defensive, and MUCH less likely to agree that you are presenting the best way forward.
  • Don't make people feel blamed. Explain how this change will improve everything overall. People feeling responsible for bad decisions get defensive. You don't want that.
  • Invent questions that'll give THEM the idea. Social engineering. If they come up with it (subtly pushed) they're more likely to follow through.
  • Don't be threatening to their authority. Authority can come in the form of direct power (they're your boss), or indirect (they have 20 years on you, and everyone listens to them before you). People don't like upstarts, and can quash your idea out of hand just because you seem like a threat. Don't be a threat.
  • For big changes, break them up into smaller changes and present those. Smaller change is less scary than bigger change.
  • The Statement of Undeniable Value. If you can distil your change down to simple to understand numbers, it can make it a LOT easier to convince people that this change is needed. Suddenly, all of that seemingly irreducible complexity is now distilled into a discrete dollars-per-unit savings.
  • Some people respond to data, others respond to peer recommendation. Knowing the difference is key. Knowing that Google uses a specific product raises that product's shine in the eyes of that specific decision maker saying 'no' all the time.

Also, behaviorwizard.org is a nice wizard-style website to help you figure out how to persuade certain people to do things. It takes some social know-how to really get the most out of it, but if you have that it can really help you get even better.

Noise

I've been spending a lot of time at our datacenter recently. Unlike at WWU, we colo at one of the large providers, so I'm getting to interact with a datacenter vastly larger than the ones I've played with in the past. This is cool in many ways (this is a multi-megawatt facility!) but there are some downsides.

Sound.

I've known for years that datacenters can get very loud. When WWU picked up our first HP bladerack, the whine that produced was audible in the hallway outside the room. And this is with sound-proofing, mind. It was about then that I brought my shooting muffs to work for when I'd be in there for any length of time.

This facility? Worse. Two rows behind our racks are five racks full of dual power-supply servers with only one power cable each, which means five racks of servers doing their alarm beep continually for months (possibly years) on end. This is in addition to the usual hum of air-handlers and cooling fans in every rack.

It's loud in here. Loud enough that two people talking need to raise their voices, which puts it above 70dB. This is right close to the OSHA hearing-protection-required levels. And for a good reason.
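
For reference, OSHA's occupational noise rule (29 CFR 1910.95) uses 85 dBA averaged over eight hours as the point where a hearing-conservation program is required, and 90 dBA as the eight-hour permissible limit, with the allowed time halving for every 5 dB above that. A small sketch of that arithmetic follows; the function name is made up for the example, and the 70dB figure above is an estimate by ear, not a meter reading.

```python
def permissible_hours(level_dba):
    """Permissible daily exposure in hours under OSHA's 90 dBA / 5 dB exchange-rate rule."""
    return 8.0 / (2.0 ** ((level_dba - 90.0) / 5.0))

for level in (85, 90, 95, 100):
    print(level, "dBA ->", permissible_hours(level), "hours")
```

The takeaway is that each 5 dB matters a lot: an eight-hour stint that is fine at one level is twice the permitted exposure 5 dB higher.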

I'm pretty sure my tinnitus has gotten a bit worse since I've been working here.

I haven't always been able to use my muffs when working, since talking to other people is problematic when I have them on. The facility does offer softies for hearing protection, but they're only so useful. A couple of my recent 8 hour stints have been with help, so there was much shouting back and forth as we did things. There will be more, longer visits in the near future too, so I need to plan for that as well.

Hearing loss from long term exposure to loud white noise and blood-loss from sharp bits of equipment. Two hazards to what it is that we do.

Unexpected parallels

A friend of mine is going through some frustrating medical crap. In her latest post about her experiences, she expressed a sentiment similar to this (wording changed to foil googling):

It drives me crazy. I get that these people have been there, and done that. But when you come there with something that isn't common, sometimes it just gets ignored.

Um... guilty.

As a technical support professional, albeit one that also wears many other hats these days, I'm guilty of just that. You may be too. Computers glitch, we all know that. We are also busy people who hate chasing after something that's just a transient glitch rather than a symptom of a deeper problem. We wait for the glitch to turn into a pattern, but by then the person who reported it has seen that we won't help them, and doesn't tell us when a pattern does start to show.

Treating every problem like it's the reporter's most important problem in the world is a goal we strive for, but fail at far too often. Demands on our time are indeed heavy and that does require some triage. Not all problems get our undivided attention.

I imagine doctors have similar pressures, though more intense since it's people's health on the line. Perhaps less time pressure and more insurance pressures, but still pressure.

So that one guy? The one who has the VPN connection that resets every 63 minutes regardless of where he is? And you've pushed it off because you don't want to dig around his personal computer? Just a reminder that to him, this is a critical problem. You should get on that.

AMD has released Interlagos, the server version of the Bulldozer CPU class that launched over a month ago.

Bulldozer/Interlagos is AMD's attempt to grab more of the market from Intel. Currently, it's competing in the value sector but not on performance. The days when AMD CPUs were the virtualization kings have been gone for a couple years now. AMD would like that crown back, thank you, and they're driving to go there.

That said, comparing performance between equivalently clocked AMD and Intel CPUs is hard. They're optimized for different tasks, which means that the smart Systems Engineer looking for the next CPU to base their environment on should pay attention. Workload matters! Those AMD CPUs may be damned cheap compared to Intel, but if you're doing the wrong things with them you'd be better off buying previous-gen Intel chips.

The most controversial thing AMD has done is to make two cores share a Floating Point Unit. They've also done quite a bit of optimization in their Arithmetic Logic Unit, where Integer math is handled. The reasoning behind this is that most server usage these days is integer-heavy, highly parallelizable workloads; most database and simple web-serving workloads are entirely Integer and parallel-friendly, and that's a large part of the webapp stack right there. The likes of Google Plus, StackExchange, and Reddit do far more Integer work than floating-point, so something like Interlagos should be a good fit.

And the early benchmarks show that AMD does indeed have an edge on integer-heavy workloads over equivalent generation Intel parts. Intel still has an edge on compute-performance-per-watt, but AMD holds the edge on compute-performance-per-GHz. Pick which is more important to you.

Specialist workloads like render farms are edge cases, if big consumers, so engineering to handle those workloads is not worth the time. By staking out the middle of the market, AMD can drive innovation in the marketplace by forcing Intel to get creative in the middle. It's good for everyone.



Yes, but what about me, you cry.

Biometrics

Today I had an incident with biometrics that further convinces me that they are not the end-all be-all of security.

Today I had to head up to our datacenter to do some nebulous "things". Like you do. Since we colo at a large facility that has all of those security certifications, getting into it is something of a trial but a familiar one. Hand scans, double-layered man-trap, the whole deal.

Only, the hand-scanner on the cage our racks are in wouldn't read my hand today. It took 19 tries before it decided I was me. To get this far I had to pass four other hand-scanners and only had to re-enter three times along the way. When I went back to the security station to see WTF, they had me re-scan my hand. Leaning over I saw that I had managed to fill their live-log of entry/exit events with red events, and they didn't seem fazed in the least.

I've had some trouble with this particular hand-scanner before, enough that I dread leaving the cage. For whatever reason, this particular scanner is far enough out of spec that the fuzzy results it returns are fuzzier than their system will accept. Since it's on a rack cage rather than the higher trafficked man-trap scanners, it hasn't been caught out and re-tuned (or something). But still, I hate that thing.

Batch jobs in PowerShell

Say you want to execute a series of commands on a bunch of Windows machines and your AV considers 'psexec' to be malware. What are you to do? Remote PowerShell can be used. What's more, it can be done in parallel. Say you've got to run a malware cleanup script on 1200 computer-lab machines RIGHT NOW; you can't just put it as a scheduled task in a GPO and wait for the GPOs to apply.

$JobTracker=new-object system.collections.hashtable
$Finished=new-object system.collections.hashtable
$MachineArray=$Args[0]

foreach ($Mcn in $MachineArray) {
    # -AsJob returns right away, so every machine is worked on in parallel
    $JobTracker["$Mcn"]=invoke-command -AsJob -ComputerName "$Mcn" -ScriptBlock {
        $ProcList=get-wmiobject win32_Process
        foreach ($Proc in $ProcList) {
           # The parentheses matter: without them the method is never called
           if ($Proc.Name -eq "evil.exe") {$Proc.Terminate()}
        }
        remove-item c:\windows\system32\evil.exe
    }
}
$JobCount=$JobTracker.Count
# The machine names are the keys of the job hashtable; the input array has no .Keys
$Machines=$JobTracker.Keys
$Completed=0
while ($Completed -lt $JobCount) {
    foreach ($Mcn in $Machines) {
        if (($JobTracker["$Mcn"].State -eq "Completed") -and (-not $Finished.Contains("$Mcn"))) {
            remove-job -id $JobTracker["$Mcn"].ID
            $Completed++
            $Finished["$Mcn"]="Completed"
        } elseif (($JobTracker["$Mcn"].State -eq "Failed") -and (-not $Finished.Contains("$Mcn"))) {
            $FailReason=$JobTracker["$Mcn"].JobStateInfo.Reason
            write-warning "$Mcn failed: $FailReason"
            remove-job -id $JobTracker["$Mcn"].ID
            $Completed++
            $Finished["$Mcn"]="Failed"
        }
    }
    sleep 1
}
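
The bookkeeping in that loop (start everything, then record each machine as either Completed or Failed) is the same pattern most languages offer for parallel work. As a point of comparison only, here is a minimal Python sketch of it. It does nothing remotely: `clean_machine` is a hypothetical stand-in for the cleanup, and the machine names are invented.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def clean_machine(name):
    """Hypothetical stand-in for the remote cleanup; a real version would contact the machine."""
    if name.startswith("bad"):
        raise RuntimeError("unreachable")
    return "Completed"

def run_everywhere(machines, worker, max_workers=32):
    """Run worker against every machine in parallel and record the outcome per machine."""
    finished = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        jobs = {pool.submit(worker, name): name for name in machines}
        for job in as_completed(jobs):
            name = jobs[job]
            error = job.exception()
            if error is None:
                finished[name] = job.result()
            else:
                print(f"WARNING: {name} failed: {error}")
                finished[name] = "Failed"
    return finished

results = run_everywhere(["lab-001", "lab-002", "bad-003"], clean_machine)
print(results)
```

Either way, the point is the same: a failure on one machine gets logged and counted, and doesn't hold up the rest.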

Heck, psexec can't be run in parallel like this, so this is even better!

What this does:

Facing unreasonable requests

Over on ServerFault we had a question become rather hot lately. The key part:

We received an interesting "requirement" from a client today.

They want 100% uptime with off site fail over on a web application. From our web apps viewpoint, this isn't an issue. It was designed to be able to scale out across multiple database servers etc.

However, from a networking issue I just can't seem to figure out how to make it work.

100% uptime? Anyone who has been in this business for a while knows that 100% either doesn't exist, or only exists when examining small timescales. Our eDiscovery hosting platform had 100% uptime... for September. For the quarter? No, we had a well announced major outage in August while we relocated some servers to a new rack and performed network-backbone upgrades.
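
To put numbers on why the timescale matters, here is a small sketch of the downtime an availability target leaves you over a period. The function name and the 30-day month are assumptions for the example, and the targets are the usual "nines", not figures from any contract mentioned here.

```python
from datetime import timedelta

def downtime_budget(availability_percent, period):
    """Return how much downtime a given availability target allows over a period."""
    return period * (1 - availability_percent / 100)

month = timedelta(days=30)
for target in (99.0, 99.9, 99.99, 100.0):
    print(f"{target}% over 30 days allows {downtime_budget(target, month)}")
```

At three nines the monthly budget is already well under an hour, and at 100% it is zero, which leaves no room for even an announced maintenance window.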

As it happens we faced a similar requirement from a potential client a while back. They demanded a full refund of that month's fees if the service was unavailable to their people at any time when someone tried to use it. I wasn't directly involved in these negotiations, but personally I saw that as a nice opening offer rather than an ultimatum. Unless you've got boilerplate service-contract language they're trying to amend, I believe 'initial position' is the best way to frame these sorts of "requirements".

This particular client had scoped their 100% uptime (a single month), had provided a service metric (whether they ever notice it is down), and had named a penalty (we don't get paid). Clearly they had thought this out. I personally wanted to know what they thought of planned and announced service outages, and eventually the response came back as "same as unplanned". Ok, that's an initial position.

At that point we could:

  • Dicker about the scale of the penalty (pro-rate for any hour/day/week an outage is noticed?)
  • Dicker about planned outages that they can clearly work around?
  • Dicker about using a third party to assess downtime rather than be purely defined by the client?
  • Dicker about downtime attributable to forces beyond our control (such as something screwy happening route-wise in the Internet core, as has happened a few times, or their own firewall going fritzy)?

Uptime requirements from paying customers are nothing new, but they're also the kind of thing that can show up in internal SLA negotiations for large organizations. As State budgets continue to shrink, IT charge-back schemes are becoming more and more common in areas that previously didn't have them, so these kinds of demands can arise from internal customers just as often.

The best defense is to have downtime concerns addressed in your boilerplate service contract, much like Amazon does. If an entity wants special treatment, they can work to have a special contract written up, but that's a lot of pushing boulders uphill so only the specialest of snowflakes do this kind of thing. But if you don't have boilerplate, be prepared to dicker over their initial position.

Scalability 2011 has been canceled

Actually, all of StackOverflow's DevDays has been canned. Too few people registering. The gory details.

I had been really looking forward to that conference, as it covers the kinds of problems I'm dealing with right now.
