
Data center is on its way out… where do we go now?

OK, I’ve been working in data centers in different forms and varieties for decades. I’ve taken some breaks, but I always end up back there somehow. Let me start by saying that if you’re planning a data center investment, you should read this and rethink your plans.

If you’re in a purchasing and/or leadership position, it’s important to understand that there’s a really good chance your investments in corporate IT, and especially the data center, have been mostly wasted over the past 15 years. I don’t make this statement lightly and I’m not trying to sell you anything. I don’t want your money, I don’t want your love. I’m simply here to let you know that, at this point in time, your IT spending could be as much as 10,000 times higher than it should be. I’m absolutely certain that you are being held hostage by system integrators, and absolutely certain you are being sent in the wrong direction.

Before you buy even one more piece of hardware or software for your data center, stop what you’re doing and rethink your entire plan. For years, almost every purchase has been about buying something that will hopefully help you finally use the last thing you bought, which itself was bought to try to make the thing before that work.

Let me start by telling you how your integrator works.

It all starts because you have a business need. You need to be online. You need office and maybe Adobe Acrobat. You need accounting software as well as communications software. If you’re not an accounting firm, you probably also need software which will cater not only to your vertical market, but also to your specific business.

This is it. You may notice that whether you’re a 10 person organization or a 1.2 million person organization, probably most of the more complex systems that you need are the exact same systems which every single other company needs as well. In 1995, if you needed e-mail, you needed a server. If you needed chat, you needed a server. If you needed accounting, you needed a server. If you needed printing, you needed a server. I think you see where I’m going.

In 2000-2001, we finally started consolidating. The way we did this was to use virtualization from VMware to migrate our hundreds of servers into dozens of servers. This made sense. Because of VMware, we focused on building big and advanced systems with SANs, data center networking, massive servers, etc… we even started moving our desktop computers to run on these systems. And in the beginning it made sense. A single $2000 server with the right software could replace 10 other $2000 servers which showed a clear return on investment.

It is now 2018 and the minimum cost (before the good rebates) to build even the simplest reliable VMware data center is $1.6 million, and that’s probably too small for your needs. Using VMware, Hyper-V, Nutanix, etc., you need a minimum of six servers spread across two locations with two high speed networks to support this data center concept in even its simplest form. If you include the cost of servers, licenses and man-power, it is a minimum investment of $1.6 million. The number six comes from basic risk analysis. If all of my systems can run on any server, then each site needs one server that can be in maintenance mode, one running my business and one providing fail-over; with two sites, I can lose power to an entire data center and still survive a maintenance window plus a failure. Mathematically, you absolutely must have a minimum of six servers to ensure that at least one server is operational at any time.

Companies like Cisco, HP and Dell and others will also insist that you have redundancy on top of redundancy to ensure that your systems are never reduced to a single point of failure. They all sell solutions specifically focused on throwing good money after bad.

The cost of operating such a data center infrastructure is a minimum of $1 million a year. Again, this is related to licenses and at least 3 full time engineers and/or 3 part time contractors or a combination of those.

I will gladly prove on paper, with charts, diagrams and whatever else is necessary, that I’m not randomly choosing these numbers. These are the minimum numbers to build and maintain just the infrastructure of a data center; they do not in any way include the systems running on that infrastructure. If you’re spending less than this, it is mandatory that you evaluate an alternative solution: your IT people are incompetent and your data is probably at real risk. It tells me they have asked you to spend what they think they can get out of you, as opposed to spending what is necessary to do the job right.

Larger organizations are spending this kind of money, but they are almost certainly overspending considerably.

Again, to operate a business, the type of systems you need are mail, office, collaboration, accounting, etc…

Even if your mail is “secure”, it travels over the Internet insecure; e-mail might as well be public record. If you’re using a mail client which encrypts your mail, then sending the mail over the Internet should be considered safe within reason. Modern mail encryption should be strong enough to limit access strictly to well funded government organizations and the people who should have access to it. For security purposes, it is absolutely necessary to use economy of scale to identify malicious mail before it’s delivered. As such, mail systems should always be cloud based today. Even Cisco’s e-mail security appliance will send your super-secret messages to their servers for testing and categorization, as well as provide telemetry to Cisco so they can achieve that economy of scale. Whether you handle your data locally or hand it to Microsoft or Google to process, you want your corporate mail, for the security of your users and your company, to be handled by companies with hundreds of millions of users who can use big data analytics to identify malicious data. Otherwise, you’re simply living a lie. Cisco, McAfee, etc. cannot possibly provide a better combination of security and corporate privacy than large mail service providers do.

While it’s possible to run collaboration in house, the idea of doing so is utterly ridiculous. Consider this: whether you want to communicate with employees while they’re at home or with colleagues outside of the office, whatever you choose, the conversation will pass through the Internet. The system itself, if configured correctly (and it’s not surprising how rarely it is), will provide encrypted end-to-end communication between the users. In fact, when using a good cloud service, even the logs of the conversations will be stored encrypted. The organizations running these services have a vested interest in protecting your data far better than you ever could using in-house resources, as making your data available even to their own employees would likely destroy them. From a technical and business point of view, services like Slack, Skype for Business and others will provide far more security and functionality than you can ever achieve at home.

Another key aspect to consider is that the companies running these services are generally safer for your company from a legal perspective. During the last US presidential election, it became clear that the trust associated with running your own mail and collaboration is an issue. Consider that by using an external service to operate mail and collaboration, you and your company can’t be accused of tampering with data, history and logs if the data itself is subpoenaed when hosted by a third party. The downside of course is that you can’t tamper with it. And if tampering is in your best interest, then the extra investment in housing an inferior system locally has a legitimate ROI. Of course, I have no interest in assisting anyone who would fall in this category.

Then there’s office. Word processing, spreadsheets and presentations, as well as whatever other packages are needed to perform daily business, are far more readily available than they once were. There is a very high chance that you’re using Microsoft for these packages, but there’s also Google, LibreOffice, OpenOffice, Apple’s productivity applications, etc… there are many alternatives that can meet your needs. These are all programs which can be used online and in some cases as applications installed locally. They are all available through app stores or web sites or both.

The office packages are all able to store to “The Cloud”. If you choose to store your company data in the public cloud, all solutions are usable. If you prefer storing your data within your organization’s servers, this is also an option. But consider that all the data your company will ever generate in office packages will very likely never require anything as complicated as a SAN. Products from NetApp, EMC, etc. are like providing a US Naval carrier armed with multiple jets and bombers when you really only needed a fly swatter. Not only are they just too big, they almost universally do the job poorly; you might level the entire city trying to swat the fly, and the fly will probably float away before you’re finished. I will not cover all the technical details here, but it’s very likely that everything needed to host 20 years of company data locally can be purchased for less than one month of my salary, and it would do the job far better than any of those big systems would.

Let’s consider that by this point, your company is likely paying Microsoft, Google, or someone else about $12.50 a month per user for everything listed above. Adding support for conference room video can be handled by either buying a product like Microsoft Surface Hub or spending substantially more on a Cisco or Polycom solution.

So, for $12.50 a month per user (which you already pay whether you run your own data center or not, since you need software licenses for the users either way), you have eliminated about 80% of your company’s need for a data center.
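
To put numbers on that claim, here is a rough sketch using only the figures quoted above and a hypothetical 100-user organization (your seat count will obviously differ):

# Rough comparison: per-user cloud suite vs. the minimum yearly cost of
# operating the VMware-style data center described earlier.
$users          = 100        # hypothetical organization size
$perUserMonthly = 12.50      # per-user price quoted above
$dcYearlyOpex   = 1000000    # minimum yearly operating cost estimated above

$cloudYearly = $users * $perUserMonthly * 12
"Cloud suite per year : $cloudYearly"                                # 15000
"Data center per year : $dcYearlyOpex"                               # 1000000
"Ratio                : {0:N0}x" -f ($dcYearlyOpex / $cloudYearly)   # roughly 67x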

Let’s talk accounting.

Accounting/financial software requires constant upgrades because of constant changes to regulation. Due to the extremely rapid pace at which financial software companies make updates, there is a dilemma to be considered: do you host your finance at home or do you host it in the cloud? Hosting financial software at home can be very difficult and costly. If your organization only has a single accountant or bookkeeper, hosting at home may be easy: updates will be sent, copied to a USB key and then run on the accounting computer. On the other hand, hosting with a provider ensures the software is always up to date, but you’ve lost control of where the data is stored.

One solution is to make use of accounting software running on an appliance in house. Another is to extend the public cloud into your organization. This would mean running a PaaS at home, such as Microsoft Azure Stack, which would store all your data within your own walls but receive software and patches as if the hardware were hosted elsewhere. Solutions like this are extremely low maintenance and don’t require any on-staff or contract employees to maintain. If you can figure out an iPhone, you should be able to figure out Microsoft Azure Stack. And if you can’t, Microsoft will support you directly. No joke: adding new internal systems is as complex as using the iPhone App Store. Backup is handled remotely, either to a second Azure Stack or to the public cloud, using encryption keys even Microsoft doesn’t have access to.

So, now accounting and very likely CRM are handled, and your costs for the data center are 100% gone. You’ve gotten rid of the data center and you do not want it back. There is precisely zero value in having it anymore.

So back to internal systems.

Consider this. In 1992, I started working for North American Financial Services in Florida. It was a banking warehouse and we handled all computational transactions for basically every independent bank in a large part of South Florida at the time… at least it seemed we did. We also handled transactions for the Florida State Prepaid College program and more. The paper-check sorting, scanning and processing systems we had probably generated hundreds of thousands of new records in the database every day. We did this all on a computer system which is about as powerful as the original iPhone from 2007. The entire computer system had far less data storage than a modern iPhone. This computer of course filled a room larger than my house and we had a second one on the other side of the state for backup. The entire data connection speed between the two locations was probably about 1/100th the speed of the internet connection you have in your home.

Later, I worked at Raymond James and Associates (for a very brief time until I learned I was too young and unfocused for the big leagues). They were experimenting with what they considered high frequency trading. They had thousands of employees generating massive numbers of transactions and their entire computing capability was probably about the same as the one mentioned above.

I’m not saying this to reminisce. I’m explaining it because, with the exception of adding photographs to records, it is extremely likely that those banks could continue to run on the same hardware we used back then, and those systems have already been running for 15-20 years without any substantial upgrade.

The amount of meaningful business data your organization generates each day is not a lot. Consider your CRM, your order fulfillment systems, logistics data, etc… while we are a hundred times more wasteful in how this data is stored today, we are overcompensating by at least a hundred times that. Your entire company’s database for 20 years of business can almost certainly fit on a thumb drive costing $100 or less, more than likely on a $10 one, and quite possibly on a freebie from the latest trade show you visited.

Now consider big data and analytics. I can almost guarantee that your data is not big and your analytics aren’t that big either. Most of these “Hadoop scale” systems are specced out to look like supercomputers, but the exact same workloads would run on less than $10,000 worth of COTS equipment. In fact, in my testing, they often run better, since the COTS equipment is so low cost that I can buy substantially more of it. Consider that a “big data” node in my system for supporting data mining from 150,000 IoT devices costs less than $100 including the power supply and doesn’t even require cooling to run. I can also assure you that my system is almost certainly faster and more effective than yours.

So, how is this accomplished?

First, it’s time to stop using a system integrator. They are generally toxic to your business because they generally have absolutely no understanding of your business and operations.

Next, hire a computer programmer who is “past his/her prime” and would prefer to spend less time typing and more time thinking. You’ll know who to hire when their resume/CV shows they’ve accomplished much and they make it through an entire 2 hour interview with you without ever using an acronym you don’t understand. It is extremely important that this person is a programmer. If the resume says they have experience with IBM, it’s even better. It means they’ve spent most of their career working with business systems instead of business technologies.

Now, give them 3-6 months to learn different roles in the operations of your business. Don’t waste time on the technology just yet. If you’re a coffee distributor, let them see the process of receiving, scheduling, roasting, inventorying, etc… let them learn the trade. Have them associate with sales, marketing and management so they can learn how the office workers operate, what their needs and wishes are. Have them work with the receptionist, executive assistant, office manager, etc… to find out how the business really runs and what problems people really tend to have.

Now, you have a technically competent person who understands your business. This will work if you’re 10 or 100,000 people. Spending money on systems without understanding those systems is a guarantee of failure. Never under any circumstance let a sales person from an ISV, OEM or anyplace suggest what you need.

Once your technical leader is educated in your business, have them start working with a cloud company where the cloud makes sense. Then have them identify what systems are needed to operate your business. It may be possible to buy this as a boxed solution from a company who operates in your vertical market segment. It may be necessary to develop some systems in house. Whichever case you encounter, set some rules.

  1. Your core business should never be placed in the cloud. Generic systems can always be easily replaced. Everything from mail to financial to CRM can be moved from cloud to cloud. Your core custom systems are unique. If they are built against Amazon’s AWS, they will never run anywhere other than that. While Amazon will likely stay in business, understand that every year your system is further developed on Amazon AWS adds a cost of probably an additional 1.5-2 years to get back out again. This means once you’re in, Amazon can renegotiate their terms with you in their favor and you’ll have zero leverage to counter with.
  2. Set a requirement with your developers that they should try to guarantee that everything they make will operate on $1000 or less in hardware with full system redundancy. Explain that you’d rather give them the money to get it right than give it to Dell, HP, Cisco, NetApp, EMC, etc… and keep giving it to them year after year. You will need six $100 computers and some network equipment to run your system reliably; in this configuration I prefer to spend closer to $1500 and have nine systems on three separate networks myself. If they can’t figure out how to do this, find someone who can. It’s not even difficult. I am 100% confident that North American Financial Services could easily operate thousands of banks and ATMs on that configuration.
  3. Expect the transition to take some time, but every penny spent on this project will save you far more elsewhere. Invest in people who invest themselves in learning your business. My uncle ran one of America’s largest children’s clothing manufacturers, a company named Pixie Playmates, for many years with no more than a few developers. Their entire hardware cost was peanuts because the system was designed for their business.

Also, as a note, you should consider moving as many PCs and devices as possible to 4G instead of wireless and wired networks. The hundreds of thousands or millions you spend on those networks will never provide an ROI compared to 4G and properly securing your network from the outside. Wired and wireless networking should be reserved strictly for places like warehouses where mobile phone signals are too weak to operate.

I also have solutions for desktop management; I won’t go into them much here and now as I’m tired of typing. But desktop computer support should never cost much money. There are solutions to this and, for the most part, you’ll find that a good entry level support engineer who isn’t afraid to change printer ink once in a while can be very useful.

 

Posted on April 17, 2018 in Uncategorized

 

JavaScript Map & Reduce

When you step up and start doing IT automation (my focus is mostly network, but I’m doing pretty much everything), the guys you work with get a bit excited since there has classically been a division between IT guys and programmers. There’s nothing wrong with this. Consider that when you’re working in IT, often you’re either the one person in the entire IT part of your company or you’re working in silos. Network, Server, Storage, Security, Tier-1, Tier-2, etc… every team seems to be separate from each other and generally don’t communicate as well as they should.

I’ve never had much patience for the silo mind-set. I’m a computer guy. 20+ years as a developer, on and off IT for many years, networking guy for the most part, security is in there too, and wireless, and servers, and everything else. I’m pretty much the person you call when you need a job done and if I don’t know something, I have a really great list of people I can work with to get it done.

Well, I encountered a great question from a colleague today who is really kicking butt using freeCodeCamp to learn to code. I was really impressed with their site and decided that next time I decide to become more than moderately proficient with a new programming language, I’ll use their system as a road map to progressively build my problem solving skills in that language. He brought to me the example listed as Wherefore art thou.

The problem presented is extremely simple: given a list of JavaScript objects with different properties present… for example :

[
    { first: "Romeo", last: "Montague" },
    { first: "Mercutio", last: null },
    { first: "Tybalt", last: "Capulet" }
]

And search criteria presented as another object such as :

{ last: "Capulet" }

The function you write should traverse the array and return only those members of the collection which match the search criteria given in the second object.

I solved the problem after three quick google searches :

  1. Javascript select from array
  2. Javascript enumerate properties within an object
  3. Javascript print properties and values of an object to console

And came up with :

function whatIsInAName(collection, source) {
  // What's in a name?
  var arr = [];

  // Only change code below this line
  arr = collection.filter(function(x) {
    for (var member in source) {
      if (!(x.hasOwnProperty(member) && (x[member] == source[member]))) {
        return false;
      }
    }
    return true;
  });
  // Only change code above this line
  return arr;
}

Which worked on the first try. Of course, I was less than pleased with my use of a for loop; it felt a little inefficient. These things are generally inefficient if for no other reason than that the algorithm in question doesn’t scale well if there are too many properties to traverse. I also like code and algorithms which scale well across CPUs without too much manual instrumentation.

I’m no stranger to making A LOT of use of fluent style programming. In fact, I’ve become thoroughly addicted to it and wonder if I could ever go back to the dark ages preceding its use. I absolutely love it. It has a lot of advantages, not least among them that iterative functions which work on collections by employing lambda functions can often be easily distributed across boundaries, and can even work relatively easily using serialization or message passing within super-computing or cluster scale computing.

Therefore, employing the for loop wasn’t fun for me.

Then I stumbled across a solution to the problem from camperbot on the forums, where he/she details three solutions. The first two are similar to mine. They do, however, correctly employ the JavaScript strict equality operator === instead of my looser ==, which is equally correct given the proper context.

My colleague asked me to explain the comparison logic I employed, as he considered the if statement convoluted, or at least complex. I explained that I subconsciously employ De Morgan’s theorem to simplify boolean logic where appropriate; for example, !(x.hasOwnProperty(member) && x[member] == source[member]) is equivalent to !x.hasOwnProperty(member) || x[member] != source[member], and the negated form lets the loop bail out at the first property that fails.

Well Camperbot did a great job on the third example because it forced me to finally dig a little into Map/Reduce which always sounds so terribly exciting when the popular science press grabs hold of it. Like “Google uses Map Reduce and it’s super scientific…” blah blah blah. It’s a simple programming pattern which happens to employ fluent methods to achieve simplicity for parallel processing. It does however address something which my colleague asked.

“Darren, why did you choose to compare the negative instead of the positive when filtering items based on the properties present and equal?”

He had a point, so I described to him that my method would be better for performance because it would abort the process as soon as there wasn’t a match instead of gathering the data and checking whether it matched retroactively.

This is where Camperbot comes up a little short. Performance wise, his/her solution will scale to Google-sized databases on AWS scale server farms, but my solution, within the scale and scope of this project, would show noticeably better performance on a single processor. Which brings us to the judgement call phase.

I like Camperbot’s solution better and will adopt it as a common practice of mine in the future.

Let’s look at Camperbot’s advanced solution :

function whatIsInAName(collection, source) {
  // "What's in a name? that which we call a rose
  // By any other name would smell as sweet.”
  // -- by William Shakespeare, Romeo and Juliet
  var srcKeys = Object.keys(source);

  // filter the collection
  return collection.filter(function (obj) {
    return srcKeys
      .map(function(key) {
        return obj.hasOwnProperty(key) && obj[key] === source[key];
      })
      .reduce(function(a, b) {
        return a && b;
      });
    });
}

What we see here is that instead of using for loops, three steps are performed :

  1. filter the whole collection
  2. for each item in the collection, test for the positive condition where the properties being searched for are present and the values of those properties are the desired values. This creates an array of boolean values representing whether each property is present and correct.
  3. reduce the list of boolean values by providing… shall we say a “boolean sum”, in the sense that every value is ANDed with the previously ANDed value (think dot-sum). This means that a single false value makes the result false, and the filter will then skip that collection item.

The design presented is elegant, readable and unfortunately slow. It requires traversing all properties of all objects, even after a false value has already been found. So for this particular assignment, I would suggest that unless the scale of the input set were tremendous and worth scaling for parallel processing, it would be wasteful. I would also consider whether the readability benefits outweigh the performance drawbacks.

In this exercise I would certainly not make this choice. I would instead make use of every() rather than a for loop, which provides cleanliness, scalability and performance benefits, since every() short-circuits as soon as one property fails to match.

In a large scale database, I would think the map/reduce solution would be the absolute best option, though I might consider profiling the script within v8 to identify whether I could reduce complexity.

Map/Reduce is a wonderful pattern to be used where applicable. I believe it’s a super-tool in the Fluent programming toolbox. However, it makes sense to consider the algorithmic complexity of map/reduce and decide whether it necessarily fits the task at hand.

 

Posted on August 29, 2017 in Uncategorized

 

Cisco Open Plug and Play Protocol 

So, Cisco has documented the plug and play protocol as used by APIC-EM and I’ve been playing a bit with it and have some thoughts.

First of all, most people I talk with want to use APIC-EM for the exact same reason. They want to use it for configuration management. That’s great and while I like APIC-EM A LOT, it doesn’t really match what people are asking me about.

I’m sitting on a boat and typing with my thumbs, so this is more of a brain storm than a step by step procedure or detailed design.

Ok, so let’s consider what PnP would be great for.

  1. Initial device deployment… hence PnP
  2. Deployment of new software images
  3. Secure communication with the device from initial power on through its life cycle. This means certificate management for SSH and HTTPS
  4. Configuration deployment and tracking

That’s absolutely awesome, because PnP is precisely all those things. And what’s best is that there is no reason why network programmability needs to be part of this. Imagine if deploying a new VLAN or VXLAN was a single PowerShell command that simply said “New-NetworkBroadcastDomain”, which would run on Windows, Mac or Linux, and it would be so. Or what about “Find-DeviceWithMAC”, which would display the site name, device name and port, as well as the IP address and Active Directory computer or user, for example.
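
Just to make that idea concrete, here’s a minimal sketch of what such a wrapper could look like. Everything in it is hypothetical: New-NetworkBroadcastDomain is the imagined command from the paragraph above, and the delivery to the devices is only a placeholder for the PnP/XMPP machinery described below.

function New-NetworkBroadcastDomain {
    # Hypothetical sketch: declare a VLAN/VXLAN by name and ID, record the
    # intent in a local configuration store and leave delivery to the PnP
    # service. This is not a real Cisco or Microsoft cmdlet.
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)][string]$Name,
        [Parameter(Mandatory)][ValidateRange(1, 16777215)][int]$Id,   # VLAN ID or VXLAN VNI
        [string[]]$Switches = @('*')                                  # devices that should carry it
    )

    $intent = [pscustomobject]@{ Name = $Name; Id = $Id; Switches = $Switches }
    $intent | ConvertTo-Json -Compress | Add-Content -Path '.\broadcast-domains.json'

    Write-Verbose "Queued broadcast domain '$Name' ($Id) for: $($Switches -join ', ')"
}

# Example: New-NetworkBroadcastDomain -Name 'Voice' -Id 120 -Switches 'asw1','asw2' -Verbose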

This is not anything we need PnP for and we certainly don’t need APIC-EM but it would make it pretty easy to accomplish. I’m not particularly interested in the easy route. The reason… billions of Cisco Ethernet ports on devices that don’t support PnP.

For the purpose of this entry, I’ll still focus on PnP but without APIC-EM.

So… if you want Plug and Play without APIC-EM, it takes a few things.

  1. Certificate server
  2. DHCP server
  3. DNS Server
  4. GIT Server
  5. XMPP Server

Nifty… all those things are pretty much already in your network and if they’re not, you can run all of those things on Windows or a Raspberry Pi. Total cost doesn’t need to be more than about $100. A virtual machine works too, but is probably overkill as ECC memory is expensive.

So what is really needed is a script that can speak as an XMPP client. The initial steps of the plug and play protocol are actually pretty simple.

  1. The network device powers on
  2. The device discovers through DHCP, DNS or Cisco’s website what IP to connect to
  3. The device connects via XMPP over HTTP
  4. The device announces itself
  5. The script initiates certificate negotiation
  6. The device reconnects with the new certificate
  7. The XMPP client pushes a new IOS if applicable
  8. The device reboots with the new version
  9. The XMPP client pushes a new configuration file.

Now, all that needs to be done is to make the script monitor the Git server for changes (which can be event driven) and, whenever the configuration for a device changes, push it to the device. Oh, and as a bonus, it can also capture events on the device when the configuration is changed there and check those changes in to Git.
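
A rough, polling version of that watcher could look like the sketch below. It assumes a local clone of the configuration repository with one file per device under configs/, and a hypothetical Push-PnpDeviceConfig function wrapping the XMPP session from the steps above; the git commands are standard, everything else is placeholder.

# Minimal sketch of a Git-watching loop for PnP configuration pushes.
$RepoPath = 'C:\pnp\network-config'   # assumed local clone of the config repo
Set-Location $RepoPath

while ($true) {
    git fetch origin | Out-Null

    # Which device configuration files changed upstream since our local HEAD?
    $changed = git diff --name-only HEAD origin/main -- 'configs/' |
               Where-Object { $_ -like '*.cfg' }

    if ($changed) {
        git pull --ff-only origin main | Out-Null
        foreach ($file in $changed) {
            $device = [IO.Path]::GetFileNameWithoutExtension($file)
            Write-Verbose "Configuration for $device changed; pushing over the PnP/XMPP session"
            Push-PnpDeviceConfig -Device $device -ConfigFile (Join-Path $RepoPath $file)   # hypothetical helper
        }
    }

    Start-Sleep -Seconds 30   # an event-driven hook would replace this polling
}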

Now, it’s about features and stuff.

What I dislike most about APIC-EM is that it’s REALLY REALLY SLOW! I also don’t like that devices have to be registered by MAC address or serial number. That is a disastrous shortcoming in APIC-EM.

So what I think is a prettier idea is to use DHCP option 82. This way, when you register a configuration for a device, it will receive the right IOS and configuration based on which ports it is plugged into. So a nifty PowerShell script would say “New-NetworkDevice”, specify the ports where it will be connected, and provide a boiler-plate configuration. This would make it so that if a device fails and a new device is plugged in, it would just work. This is easy enough, because when a device connects to the XMPP server and requests a certificate, the script can simply look up which configuration file it should use based on where the device was plugged in.
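
The lookup itself can be as simple as a table keyed on the upstream switch and port carried in the option 82 circuit-id. The data shape and the Resolve-PnpDeviceConfig helper below are purely illustrative assumptions, not part of the PnP protocol.

# Illustrative only: map "upstream-switch:port" (derived from the DHCP option 82
# circuit-id) to the boiler-plate configuration any device on that port receives.
$PortConfigMap = @{
    'dsw1:Gi1/0/1' = 'configs\branch-access-switch.cfg'
    'dsw1:Gi1/0/2' = 'configs\warehouse-access-switch.cfg'
    'dsw2:Gi1/0/1' = 'configs\branch-access-switch.cfg'
}

function Resolve-PnpDeviceConfig {
    # Hypothetical helper: return the configuration file for the port a new
    # device was seen on, regardless of its MAC address or serial number.
    param([Parameter(Mandatory)][string]$CircuitId)

    if ($PortConfigMap.ContainsKey($CircuitId)) {
        return $PortConfigMap[$CircuitId]
    }
    throw "No configuration registered for port '$CircuitId'"
}

# Example: a replacement switch plugged into dsw1 Gi1/0/1 gets the same
# configuration as the unit it replaced.
# Resolve-PnpDeviceConfig -CircuitId 'dsw1:Gi1/0/1'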

Also, this design is much nicer than APIC-EM since all changes would be within a few seconds rather than waiting for ages.

As for things like watching for other events, I have some thoughts on that as well. By adding a syslog server to the system, every time the status of a port changes, events would be received, and things like IP device tracking events would be instant. Then it would be possible to do things like say “when this happens, run this script”. Or, if the system is built like this, the XMPP client can act as a REST API and employ SignalR for push events in a standard manner, so there would be no reason to constantly poll the server.

Cisco opened a can of worms here. PnP is actually substantially more powerful without APIC-EM than with it. I can’t wait to get some time to build this system!

 

Posted on July 7, 2017 in Uncategorized

 

Helpful stuff for working with PSScriptAnalyzer


I’m trying to get my first Powershell script ready for submission to Powershell Gallery which runs PSScriptAnalyzer and blocks sharing scripts that have errors.

PSScriptAnalyzer appears to be a Lint tool for Powershell which is quite nice, but frustrating as well. I just got hit by a few items.

Error suppression

I had a function which expected a username and password to be passed in clear text for the purpose of producing an XML file (unattend.xml) for Windows installation. While I agree with the script analyzer regarding security, I had some helper functions that would accept clear text for the purpose of easy conversion to secure text.

The script analyzer complained about three things.

  1. PSAvoidUsingUserNameAndPassWordParams
  2. PSAvoidUsingConvertToSecureStringWithPlainText
  3. PSAvoidUsingPlainTextForPassword

In order to resolve this, I decided to take advantage of error suppression as I intended these functions to be run on a local system as part of DSC resources and therefore am not concerned about the security issues they are attempting to protect us from.

To do this, I’ve used the following suppression attributes (the first argument is the rule name and the second is a target, which can be left empty) :

[System.Diagnostics.CodeAnalysis.SuppressMessageAttribute('PSAvoidUsingPlainTextForPassword', '')]
[System.Diagnostics.CodeAnalysis.SuppressMessageAttribute('PSAvoidUsingConvertToSecureStringWithPlainText', '')]
[System.Diagnostics.CodeAnalysis.SuppressMessageAttribute('PSAvoidUsingUserNameAndPassWordParams', '')]
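
For context, this is roughly where those attributes sit. The Set-UnattendCredential function and its parameters below are invented for illustration; the point is the placement directly above the param block and the rule-name-plus-empty-target arguments that PSScriptAnalyzer expects.

function Set-UnattendCredential {
    # Hypothetical helper that writes credentials into unattend.xml.
    # The suppressions tell PSScriptAnalyzer that the plain text parameters
    # are intentional for this local-only DSC helper.
    [System.Diagnostics.CodeAnalysis.SuppressMessageAttribute('PSAvoidUsingUserNameAndPassWordParams', '')]
    [System.Diagnostics.CodeAnalysis.SuppressMessageAttribute('PSAvoidUsingPlainTextForPassword', '')]
    [System.Diagnostics.CodeAnalysis.SuppressMessageAttribute('PSAvoidUsingConvertToSecureStringWithPlainText', '')]
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)][string]$UserName,
        [Parameter(Mandatory)][string]$Password,
        [Parameter(Mandatory)][string]$UnattendPath
    )

    # Convert once, here, so the rest of the module only ever handles a SecureString.
    $secure = ConvertTo-SecureString -String $Password -AsPlainText -Force
    # ... build the XML and write it to $UnattendPath ...
}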

Position of $null in comparisons with collections

Powershell has some really odd rules regarding how comparisons with collections work. I’m not 100% sure of every detail, but when the collection is on the left-hand side, the -eq operator acts as a filter and returns the matching elements rather than a single boolean, so comparing a collection to $null doesn’t tell you whether the collection itself is $null. When $null is on the left, the comparison is made against the object instance itself and returns a proper $true or $false. As such

$collection -eq $null

is incorrect where

$null -eq $collection

is correct.

This of course is the thing about learning any new programming language, and it makes perfect sense once you understand that the left-hand operand sets the type of comparison: with $collection on the left, $null is compared to each element; with $null on the left, the comparison is against the collection object itself.

So, PSScriptAnalyzer tosses out the warning PSPossibleIncorrectComparisonWithNull to warn you of this issue. And I’m grateful for it since it helps me learn as well. Of course, I hate that my code now asks whether $null is equal to something instead of asking whether something is $null. I suppose that extending the language to add a special syntax to explicitly compare against the object would be a total mess.

So, here’s what I did to resolve the issue in my case. I could have been more explicit, but I wrote a simple regular expression which finds these comparisons and corrects them using search and replace in Visual Studio Code.

Search : \([ \t]*([^ ]+)[ \t]+-eq[ \t]+\$null[ \t]*\)

Replace : ($null -eq $1)

This finds instances of

($collection -eq $null)

and replaces it with

($null -eq $collection)

Hope this helps.

 

Posted on June 2, 2017 in Uncategorized

 

Network programmability… my definition and it’s a good one

Inigo Montoya wisely suggested to the great Vizzini that “I don’t think that word means what you think it means”. This was in reference to Vizzini using the word “inconceivable” so many times that someone just had to stop him. Here’s where I try to be the wise Inigo, reach out to many people and hope to help them better understand what network programmability means, in order to get people to use the term appropriately.

Let’s start by saying what network programmability is definitely not. It’s not the ability to create a VLAN on a switch using a REST API. But that brings us to “what is an API?” OK, it’s a means of letting one system make another system do something without the need for a human in the middle. It’s a language that is generally able to be handled relatively easily by software. In fact, most network hardware has had good, solid, stable, proven application programming interfaces for years. We called it SNMP. In fact, SNMP was amazing because, unlike most modern APIs, many features conformed to strict standard schemas across different devices from different vendors. For example, a Windows server, a Cisco switch, a Checkpoint firewall and a door lock from Yale all had network interfaces, those interfaces had IP addresses, and the exact same API call could get and set all the interfaces and IP addresses in the exact same way.

The disadvantage of this API was its lack of simplicity (funny for something called the Simple Network Management Protocol, huh?), but even today, compared to REST APIs from Cisco and WMI APIs from Microsoft, etc., SNMP just works better and is cleaner. Even Cisco has done an absolutely atrocious job of providing even a slightly standard method of defining, supporting and documenting APIs on their devices. They can’t even agree on how to authenticate their calls between products, making SNMP a far better option even if they don’t maintain a good MIB repository either.

So, if I tell you that simply being able to program a device through an API doesn’t meet the criteria of “network programmability”, then what the heck does?

Well, let’s first answer the ultimate question. What in the name of holy cheetos would I even want to achieve using network programmability?

Well, that’s actually really easy to answer. I would want to read, validate, verify, generate and write configurations programmatically to my network. Unless your network has a total of one device, simply smacking an API on a device doesn’t achieve NETWORK programmability.

So, to do this, I’d get a copy of Python and start learning to code right? 

STOP!!! Just stop!!! Python is a frigging awesome language for writing applications and programs and for writing APIs with. Heck, it might be one of the best API-making languages out there, but holy cow, it sucks as badly as programming assembler when it comes to actually achieving network programmability. I would consider writing network programmability middleware with it, but I would never program my network with it.

So, let’s immediately dump this IT industry infectious disease of STUPID which insists on saying idiotic things like northbound and southbound APIs. This is monkey speak for explaining stuff to people we consider to be so stupid they are less than human. You are smarter than this and I believe that you are also smart enough to realize that things can actually send and receive data. In fact, in real programs we depend on data structures and algorithms which are our bread and butter or building blocks of programming. We learn how to make calls based on far more complex structures than up and down. Imagine function A calling function B which calls C and then conditionally calls A again, this is called programming and we all do it. There is no north and south.

Let’s instead talk about real automation languages. I have personally settled on Powershell DSC, which is the best supported automation language today. Now, Puppet, Chef and Ansible all pretty much do the same thing, but wow… their support is horrible and they insist on having gigantic contribution repositories of horrible and worse code available. Get some code review, guys. DSC isn’t as pretty, but at least there is a development company with 40 years of developer support experience backing it up.

In Powershell DSC, I can write a “Desired State Configuration” which describes how my network is wired from part to part. The configuration will also describe what role each device in the network plays based on my design. I could say “switch5” is an “access switch in OSPF area 2 of BGP autonomous system 16611 with management IP 10.0.0.73/32, with uplink 1 to dsw1 port 1/1/1 and uplink 2 to dsw2 port 1/1/1”, and that would be the entire switch configuration.
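
As a very rough sketch of how that one sentence could be written down in DSC syntax (the AccessSwitch resource, its module name and every property below are invented for illustration; no such resource ships with DSC):

Configuration EnterpriseNetwork {
    # Hypothetical resource module; nothing like this exists out of the box.
    Import-DscResource -ModuleName NetworkFabricDsc

    Node 'PnpOrchestrator' {
        AccessSwitch switch5 {
            Ensure       = 'Present'
            OspfArea     = 2
            BgpAsn       = 16611
            ManagementIp = '10.0.0.73/32'
            Uplink1      = 'dsw1:1/1/1'
            Uplink2      = 'dsw2:1/1/1'
        }
    }
}

# Compiling the configuration (EnterpriseNetwork -OutputPath .\mof) produces the
# MOF documents that the resources described in the next paragraphs enforce and re-verify.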

Now, none of this would have any meaning unless something gives it meaning. So, in my configuration, I would make an object that describes what an access switch is. An access switch would require another object called switch, a switch would contain a management interface object, and the management interface would contain an IP address object.

Once we define the structure in a configuration file, we need a series of objects that can actually configure the switch. So, if we consider the configuration file to describe the absolute total structure of the entire enterprise network, described as a series of object configurations that depend on other objects, which in turn depend on others, then we have an order in which the network should be configured. We would need a method of applying the configuration.

In a horrible scripted environment, we would just run a series of commands that make each change and be done with it. If we were to do this, then every single change to the configuration file would require deleting the devices’ configuration and pushing a new one, as everything would have to be generated and deployed from scratch.

In a programming environment, each resource referred to from the configuration file is able to get, set and test each setting. This behavior accomplishes what is called idempotency: whether you apply the configuration once or a thousand times, the change is only made once, because you check whether the change actually has to be made before making it. It is basically a verification test for every single setting, and that test also verifies that the change is actually working. Every time the configuration file is modified, the scripts check every single function of the network and verify that it works, and if anything at all breaks, simply running an earlier configuration script reverts the network to its previous operational state.
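
That get/set/test contract is literally what a DSC resource exposes. Here is a minimal, hypothetical skeleton for a single setting (a VLAN name on one switch); the Get-SwitchVlan and Set-SwitchVlan helpers stand in for whatever transport (SSH, SNMP, the PnP channel) actually talks to the device.

# Skeleton of a MOF-based DSC resource implementing the get/set/test contract.
function Get-TargetResource {
    param(
        [Parameter(Mandatory)][string]$Switch,
        [Parameter(Mandatory)][int]$VlanId,
        [string]$Name
    )
    # Report what is actually configured on the device right now.
    @{ Switch = $Switch; VlanId = $VlanId; Name = (Get-SwitchVlan -Switch $Switch -VlanId $VlanId) }
}

function Test-TargetResource {
    param(
        [Parameter(Mandatory)][string]$Switch,
        [Parameter(Mandatory)][int]$VlanId,
        [string]$Name
    )
    # Idempotency lives here: answer whether the device already matches the intent.
    (Get-SwitchVlan -Switch $Switch -VlanId $VlanId) -eq $Name
}

function Set-TargetResource {
    param(
        [Parameter(Mandatory)][string]$Switch,
        [Parameter(Mandatory)][int]$VlanId,
        [string]$Name
    )
    # Only invoked when Test-TargetResource returned $false.
    Set-SwitchVlan -Switch $Switch -VlanId $VlanId -Name $Name
}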

So, where a hack and slash scripter will use things like the unmaintained Cisco Cobra API to make individual changes to the network… which mind you is harder work and less reliable, a programmer will make a system that allows pushing out changes and verifying them one by one.

Cisco makes a system for this that many customers already have, called Prime Infrastructure Compliance, but it is so disastrously bug-ridden that it is unusable and has been for a long time. In addition, it only works with Cisco devices, where Powershell DSC works with absolutely anything. Of course, what is missing from DSC is a solution for day 0 deployment. This isn’t a difficult problem, it only takes time. Using Cisco Autoinstall is one option. Using APIC-EM is very limited but is an alternative option.

I think the important thing to recognize is that network programmability can be done many different ways. Hack and slash scripting is an option, but a proper change management and control infrastructure is always to be preferred over a generic programming language like Python. Additionally, programming is always preferred over scripting. Programming takes more time than scripting, but programming is reusable, while scripting is a one-off deal which is generally wasteful.

I’ll write a bit on templating next, which should never be confused with programmability and may be a more desirable solution for IT experts who lack the 5 years of programming education. Templating is actually a far better solution (always) than scripting.

 

Posted on February 14, 2017 in Uncategorized

 

SDN – What is it for real?

In my previous blog entry about SDN, I tried to explain SDN in an extremely simple circumstance. I really oversimplified it to provide a baseline, although I realize now I did it a major injustice. See, SDN is so much more than just a stupid system that enables three-tier apps and can speak to web servers, app servers and SQL servers.

SDN is a tool for programmers, it simply isn’t suitable for IT people to dabble with. No, you can’t be an SDN expert, it would be utterly silly to consider thinking in those terms. You can’t buy an SDN tool that makes SDN happen magically. You can’t just go to Cisco and buy a nifty set of tools like ACI and Cliqr, though I will admit, it’s not a bad set of tools and might be what you’re actually looking for. I’ll talk about application defined cloud in another post if I get a chance, but ACI and Cliqr are not software defined anything. They are just tools, and while in theory ACI could support many components and features of SDN, it’s far from being an SDN solution. And before the VMware guys start pointing fingers at my blog saying “Even Darren says ACI isn’t SDN” remember that ACI and NSX are generally the same thing just implemented differently. The difference being that VMware doesn’t even have a Cliqr.

 

Let’s give a more advanced example of what SDN is. Let’s make it more than just a meaningless bunch of virtual machines, web pages and tables. Let’s see the previous example as being a service for providing telephone support. Consider for example something like a call manager.

[sdnexample diagram: a series of switches, two telephone endpoints and a firewall]

In the diagram above, we have a series of switches, two communication endpoints and a firewall. In the old days (or using ancient equipment like office IP phones), the telephones would need to come online and register themselves to a call manager, and if they wanted to communicate with each other through the switches and firewall, the firewall would have to perform deep packet inspection, have access to encryption keys to sniff SIP conversations, and hopefully react by opening the right ports intelligently to permit RTP streams to pass back and forth. Security on the network switches was extremely limited, and for the most part we just made separate voice VLANs that were generally insecure and had terrible quality of service settings.

Let’s look at the telephone on the left. The call manager (not shown because I’m lazy) software would register itself with the network controller. It would explain to the network controller that it accepts incoming connections on https://myurl.local/api/register.json and on https://myurl.local/api/firmware/bootImage/{deviceId}.json from devices that identify themselves as telephones and present valid client certificates. It would also explain that the devices should obtain a valid IP address via DHCP, be permitted to consume up to but not exceeding 30W during boot, and withdraw to 21W TDP during operation after downloading the contents of the bootImage from the server. It would also suggest permitting guaranteed packet delivery, but not priority queuing, for requests to the register and bootImage URLs.
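
To make “registering intent with the controller” a little more tangible, the declaration might look something like the data below. The property names, values and the Register-ControllerPolicy call are all invented; only the URLs, power limits and QoS hints come from the paragraph above.

# Hypothetical intent a call manager could hand to a network controller.
$callManagerIntent = @{
    DeviceClass              = 'telephone'
    RequireClientCertificate = $true
    AllowedUrls              = @(
        'https://myurl.local/api/register.json'
        'https://myurl.local/api/firmware/bootImage/{deviceId}.json'
    )
    Addressing  = 'DHCP'
    PowerBudget = @{ BootWatts = 30; OperatingWatts = 21 }
    QoS         = @{ GuaranteedDelivery = $true; PriorityQueueing = $false }
}

# Register-ControllerPolicy is a placeholder, not a real controller API:
# Register-ControllerPolicy -Controller 'https://netctrl.local' -Policy $callManagerIntent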

The telephones would come online and, using LLDP, identify themselves as phones, as well as provide their client certificates containing X.509 subject metadata linked to the MAC address of the device, validating that the phones are in fact signed and approved by a reliable source as actual phones. The switches, as well as the firewall, would then permit DHCP to the DHCP server registered with the network controller. The owner of the network would have configured the appropriate DHCP markings to be applied by the switches, via client ID, DHCP option 82 or other means, to inform the DHCP server that it’s a phone requesting an address from an appropriate network. In addition, the DHCP server would be programmed to respond to the DHCP message from the phone with the appropriate URL for obtaining firmware as well as the appropriate URL for registering with the network.

The telephone would obtain its addresses, download firmware from the call manager, finish booting and then make a request to register.json, which would trigger decreasing the power levels permitted to the obviously now operational device.

Once the phones were registered, the call manager would complete registration and provide an additional series of patterns describing, based on the device type and registration, the appropriate JSON URLs that should be accessible to the telephones, and the switches and firewalls would immediately withdraw access to the firmware and registration URLs in favor of the HTTPS REST APIs that provide access to all the currently available telephone features.

Upon registration, the call manager would identify device types and capabilities, register the phone with the system, and thereby define the REST API URLs available to the devices at each given state of the device state machine. These URLs would be explicit and would allow regular expressions to limit the range of characters allowed when making requests. In addition, for each REST URL, the appropriate JSON or XML schema valid for POST and PUT requests would be defined. This would ensure that only well formatted JSON could be passed and that things like SQL injection could not occur.

Once a user initiates a call, the call manager would add additional available URLs that allow in-call functionality like “End Call” or “Post Picture” etc… to be sent to the call manager by the devices. In addition, the firewall would be informed by the network controller, which was itself informed by the call manager, that two independent RTP and RTCP conversations would need to be permitted. For the switches and the firewalls, detailed access control and QoS settings would be described and implemented, obsoleting any need to perform packet marking, with the network controller identifying the optimal path for the “circuit” and optimizing routing while not exceeding administrator defined limits for bandwidth and buffer consumption at any point of the network. In addition, if NAT (straight NAT44, or even NAT66 or NAT64) needed to occur, the network controller would program the required static NAT translations into the firewall to permit the traffic for the conversations.

The telephones would then, via HTTP push, receive call setup information and initiate direct connectivity to each other.

I think at this point you can see that there’s a great deal involved in proper SDN. It provides a means to permit explicit network communication to occur based not on configuration but on the state of the network. There is a lot of security to consider when acting as the network administrator, as the call manager is actively making a lot of network changes to the switches and firewalls as the stages of device management, call setup and tear-down occur.

It will require a great deal of scripting and planning to set policies and plan proper use and enforcement of security and quality of service at all points. It also places a great deal of the burden of networking on the developers of the products themselves. Again, I reiterate that SDN is not for IT admins. It’s for the developers and they depend on stable APIs and simply will not code to 100 different controller types. It is very likely that as real time extensions to Docker become available, we should expect those to become the native API for this sort of interfacing.

 

Posted on February 14, 2017 in Uncategorized

 

What is SDN in the Data Center?

SDN, also known as software defined networking, is quite simple. It’s a system where a developer can write software that is able to explain its needs for connectivity to the infrastructure, and the infrastructure is able to respond to it.

This may sound scary, but in reality it depends on permissions and checks and balances. See, software defined networking is meaningless on its own. It is necessary for a network engineer to design a network with a series of permissions in place that define whether applications can in fact have the resources they request.

So, for example, if I were to use the mythical three-tier app most vendors are so proud of describing, it would consist of a few items:

– Application components

– A description of how those components speak with each other

– A description of the quality of service the developers defined as being optimal settings for service delivery

– A description of additional resources such as external network connectivity

So for the 3-tier app example, consisting of a web based front end, an application middle-ware service and a storage service such as an SQL server, the application developer would design each of the tiers to be deployed as one or more instances.

So, let’s assume that the web tier is a web page which can be deployed on Microsoft Internet Information Server or Apache. This is a directory of files that should be uploaded to one or more web servers. Once uploaded, a series of URLs should be presented to the outside world. It is assumed that the infrastructure already has a load balancer in place exposing the URL to the outside world, but it is likely we need to install a server certificate on the load balancer as well as a series of valid URL filters to guarantee that only specific pages can be accessed via the load balancer(s). If additional web server resources are needed, the infrastructure should replicate the web app to more nodes, and even spin up more nodes if needed during a peak period or due to failure of a previous node, to guarantee consistent and reliable quality of service. The web tier would also define a reasonable amount of bandwidth that should be provided to the web servers as a whole. Finally, the web tier would describe how to register the assigned virtual IP address with the DNS servers once it’s known.

The app tier may be a group of Java servers capable of managing HTTPS REST API calls from the web tier. The app would describe that web traffic should be load balanced with session oriented affinity to the app servers. In addition, the app tier may define that it should always have a minimum of two active instances running and a maximum of five, but that there should always be at least one idle instance that can be made active in case of the failure of another app node. This tier will request availability of CPU and memory resources to be spread across all nodes to provide the optimal user experience.

The database tier will describe an SQL database which should be deployed on a MySQL infrastructure; it does not define how to deploy MySQL, but instead the table schema required to support the application, as well as permission structures and quality of service requirements. The MySQL tier will define in detail permissions such as whether the app tier can create new tables or delete records. If at any point the MySQL database is unable to meet the performance requirements set by the app, the infrastructure will expand to support the additional service requirements.
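
Taken together, such an application description is really just structured data handed to the platform. The sketch below is one hypothetical way to express it; none of these keys correspond to a real Docker, Azure or OpenStack schema. The point is only that the developer declares intent and the infrastructure works out the rest.

# Hypothetical three-tier application manifest expressed as PowerShell data.
$appManifest = @{
    Name = 'ExampleThreeTierApp'
    Web = @{
        Content       = '.\site'                        # directory of files to publish
        PublicUrls    = @('https://shop.example.local/*')
        Certificate   = 'shop.example.local'
        Scaling       = @{ Min = 1; ScaleOnDemand = $true }
        BandwidthMbps = 200
    }
    App = @{
        Runtime  = 'java'
        Api      = 'https-rest'
        Affinity = 'session'
        Scaling  = @{ Min = 2; Max = 5; IdleStandby = 1 }
    }
    Database = @{
        Engine       = 'mysql'
        SchemaScript = '.\schema.sql'
        Permissions  = @{ AppTierCanCreateTables = $false; AppTierCanDeleteRecords = $false }
    }
}

# A deployment engine (Azure Stack, OpenStack, etc.) would consume something like
# this, allocate the resources, and delete them all again when the app is removed.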

If you’re running an infrastructure where users can deploy their own applications, you also need a series of policies describing whether the user has the permissions to allocate and consume the given resources. In addition, there should be budgeting policies in place that allow a chargeback mechanism. These checks allow companies to reliably be made aware of hardware, bandwidth, license and security consumption levels, and to act proactively to avoid over subscription of resources.

When an application is deleted, all resources consumed by the application are automatically deleted and released back into the pool for use by others. Think of this as an “uninstall script” but it’s really the exact same script that installed the application to begin with.

Currently, there are many systems that have been created to facilitate SDN. Docker and Azure applications are the first that come to my mind. These tools allow application developers to design and develop installation packages with no knowledge of the platform they’ll run on, and allow complex systems to be distributed, installed and maintained without intervention from IT staff.

Microsoft Azure, OpenStack and a small handful of other systems have mechanisms in place that can read these packages and deploy these services, again without intervention from IT staff. This makes the process of installing and managing complex systems as simple as installing or removing an app on a telephone. It works extremely well on turn-key systems but tends to be impossibly hard to make work on custom systems, as building an operational environment can require a thousand or more developers working together.

Tools like VMware NSX, Cisco ACI, etc… do not natively support any known standard method for application developers to distribute applications today. It is possible to employ tools like VMware vRealize Orchestrator, Cisco UCS Director, Python, Microsoft System Center Orchestrator, Puppet, Chef or Ansible to deploy apps without the assistance of the original developers, but the process is very tedious and generally prone to absolute disaster in practice, because these tools are barely supported and the people implementing the “recipes” are not generally well versed in professional software development. They do allow for last ditch efforts by IT professionals desperate for something to work when all else fails. The proper tools to use are proper container systems made by the systems’ developers, who actually understand the needs of their products and are educated and skilled in revision control.

 

Posted by on February 14, 2017 in Uncategorized

 

Cisco UCS and Microsoft Azure Stack… and my new job

It’s official!!! Read the press release

For those that haven’t heard, I’ve signed on with Conscia where I’ll become the head of research and development to build finished products based on Cisco and Microsoft Azure Stack!

While details are still very vague, I expect Cisco to ship a Microsoft Azure Stack implementation preinstalled on Cisco UCS hardware with some integration stuff also. For the most part, I’ve been running this platform for about a year on my personal UCS rack that I bought (wife was NOT happy) and it’s a thing of beauty.

What I have planned is simple. Build the entire enterprise infrastructure into Azure Stack on UCS. What I mean by this is that I’m going to integrate the entire end-to-end IT experience into the Azure Stack portal. That includes everything from Cisco Wireless and Switch management to mobile device management as well. Microsoft’s Azure platform is just so far above and beyond all the alternatives that I just can’t imagine supporting anything else.

So what do I like about Azure Stack?

  1. First and foremost is that it’s not public cloud. I can’t stand the public cloud. I could write a 50-page article of bullet points on why public cloud is a bad idea for business (for tech… cool. not for business). I wouldn’t want to touch it with a million-meter pole. Just ask the people in England who were given gigantic price increases with no prior notice by every cloud vendor as a Christmas gift.
  2. It’s a solution. VMware sells bits and bobs. They sell you ESXi, vSphere, vRealize, NSX, VSAN, etc… OpenStack doesn’t sell you anything. They’re an open source project, and while you can buy Ubuntu with a super slick management interface, it simply ignores that the platform might also need to support users. Azure Stack supports an end-to-end experience.
  3. It covers all bases. Unlike everything else I’ve worked with, it doesn’t treat individual technologies as silos. Storage, Compute and Network are a single thing to them. They have a real solution based on ideas like Docker. So a piece of software can actually define the compute, the storage and the network. This is also true for OpenStack but not for anything else. Vendors like SalesForce or Oracle and others can write a single installation program that can actually deploy their products properly in a tested fashion and the data center will just make it happen. There’s no stupidity related to having to get 20 different IT guys involved to prepare the system for it.

I can go on and on. But in the end, I’m giddy over this technology. It’s taken far too long to get here.

What am I going to do then?

Plug and Play!!!!

I’m going to make it so that wired and wireless networking will be entirely defined from the Azure environment. I’m going to use PowerShell DSC to handle the entire network roll-out for the enterprise. I’m going to make it so that the entire enterprise network is configured from a single file. I’ll build a series of PowerShell DSC resources capable of configuring and verifying each feature of the network. Then I’ll make a PowerShell DSC configuration file which allows the network to be entirely defined based on how it’s physically plugged in. Once it’s rolled out, any change made to those files will be pushed to the network and tested as it’s rolled out. Then I’ll build a full Azure Stack GUI that will make it possible to manage the network entirely from within Azure Stack.
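
Since the real implementation will be PowerShell DSC resources and configurations, the following is only a rough Python analogy of the idea: one declarative file describes the whole network, and a small loop renders, pushes and verifies per-device configuration from it. The device names, feature names and the push/verify functions are all hypothetical.

    # Python analogy of "the entire network defined in a single file". The real
    # implementation described above would be PowerShell DSC; everything named
    # here is a placeholder.
    desired_network = {
        "access-sw-01": {"vlans": [10, 20], "trustsec": True, "uplinks": ["core-01"]},
        "access-sw-02": {"vlans": [10, 30], "trustsec": True, "uplinks": ["core-01"]},
    }

    def render(state):
        # Translate the vendor-neutral desired state into device configuration lines.
        return [f"configure {feature} = {value}" for feature, value in state.items()]

    def roll_out(push, verify):
        for device, state in desired_network.items():
            push(device, render(state))                # apply the change
            assert verify(device, state), device       # test as it's rolled out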

I have a lot of other plans as well, but this is first and initially the most exciting one. Imagine being able to roll out an entire Cisco TrustSec solution with full Active Directory integration (nope… doesn’t exist yet) in a single click and in less time than it takes to make coffee?

P.S. Other projects :

  1. I still teach for Global Knowledge. This is how I learn the tech and also what the users need.
  2. I’m still working on my more academic projects and will have more time for them now that I don’t need to worry about book keeping and running a business.
 

Posted by on February 12, 2017 in Uncategorized

 

A proposal to replace traditional Layer-2 Networking at nearly zero cost.

WORK IN PROGRESS!!!!

Introduction

I’ve been thinking on this for a long time now. There have been many new technologies in the data center requiring all-new switches to offer larger broadcast domains or better-managed broadcast domains. This in itself is simply the wrong approach to networking.

Cisco, Juniper, Alcatel and others have been selling us the same old switching technology since the beginning and have applied one bandage or piece of duct tape after the next. The switch itself was a broken piece of technology by design and evolved ever more into something worse over time.

Switching

Consider what a switch is.

Flooding

A switch at its most basic level has only a single function: to buffer data between network segments. The simplest switch does nothing more than receive an Ethernet frame into a buffer, then flood that frame out of all ports except the one it was received on. In its simplest form, a switch is extremely unintelligent and performs no intelligent traffic forwarding. It simply segments a network into separate connections in order to eliminate collision domains.

Learning, Forwarding and Filtering

This method of traffic handling was simply inefficient, so switches added some moderate intelligence: the switch learns which port each MAC address was last seen on and builds a table which provides the ability to forward Ethernet frames directly to the port on which a MAC address is known to be. A relative non-feature is that a switch can also filter incoming traffic destined for the same port the traffic arrived on. This is of course a non-feature, since switches only forward traffic out of ports other than the one a frame was received on, so this was never a possibility to begin with; it would violate the flooding rules.
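
A minimal sketch of that learn/forward/flood behaviour, with made-up port numbers and MAC addresses, looks something like this:

    # Toy model of the learning, forwarding and flooding described above.
    BROADCAST = "ff:ff:ff:ff:ff:ff"

    class LearningSwitch:
        def __init__(self, ports):
            self.ports = ports
            self.mac_table = {}                # MAC address -> port it was last seen on

        def receive(self, in_port, src_mac, dst_mac):
            self.mac_table[src_mac] = in_port  # learning
            if dst_mac != BROADCAST and dst_mac in self.mac_table:
                out = self.mac_table[dst_mac]
                # "filtering": a frame destined for its own ingress port goes nowhere
                return [] if out == in_port else [out]
            # unknown unicast or broadcast: flood out of every port but the ingress
            return [p for p in self.ports if p != in_port]

    sw = LearningSwitch(ports=[1, 2, 3, 4])
    print(sw.receive(1, "aa:aa:aa:aa:aa:aa", BROADCAST))            # flood: [2, 3, 4]
    print(sw.receive(2, "bb:bb:bb:bb:bb:bb", "aa:aa:aa:aa:aa:aa"))  # known: [1]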

Indiscriminate Flooding

Learning, Forwarding and Filtering are features which are present on all modern switches and they work entirely on a per switch basis. As such, switches implemented by different vendors wouldn’t require parity among protocols in order to handle traffic direction. This is an incredibly poor design which should have been done away with decades ago.

Flooding is the act of more or less randomly “spamming the network” with traffic toward an unknown destination. Since the MAC address FF:FF:FF:FF:FF:FF is reserved and would never be the source address of a frame, switches receiving a frame with that destination address will always flood it out of all ports but the one it was received on.

There are endless reasons why this is a really bad idea, but the only reason we have to keep it this way is “Because we have always done it this way.”

Spanning Tree Protocol

The Spanning Tree Protocol implements the graph theory algorithm of calculating a strictly single-visit map of a graph (a network, for us people) with minimal communication between nodes. Spanning Tree Protocol solves the problem of indiscriminately flooded traffic becoming stuck within the network and possibly elevating itself to a threat by duplicating itself within a network that forwards or floods traffic with no understanding of where the packet is actually going. Spanning tree is famous for being among the worst protocols ever made. And it is, because it solves a problem that shouldn’t exist if layer-2 Ethernet were not the standard method of forwarding traffic.

Why do we need broadcasts?

Broadcasts date back a very long time. We have for a very long time taken this type of packet for granted, and we spend a lot of time and money trying to reduce the size of broadcast domains… often with no intelligent reasoning for doing so. To understand why we do this, consider what broadcasts are actually used for.

ARP (Address Resolution Protocol)

ARP allows devices within a broadcast domain to resolve the layer-2 address which corresponds to a given layer-3 address. This is a solution which exists to solve an issue which existed when devices within a network segment were directly connected in groups via a single pair of copper wires. This is obviously no longer the case as switches or wireless controllers are available in all modern circumstances.

Network address management/assignment

DHCPv4 relies on broadcast while DHCPv6 uses link-local multicast, but the function is the same for this discussion. Clients that do not yet have a layer-3 address (IP/IPv6) can send out a request asking whether a DHCP server is available. Again, this was useful in an era where there were no intelligent devices located within the network segment; however, this is no longer the case.

Neighbor resolution and service discovery

Protocols such as OSPF, EIGRP, ZeroConf (Apple Bonjour) and quite a few other protocols need to be able to identify devices within a given scope. Link local protocols like routing protocols prefer to look for other devices that are within the same routing subnet. ZeroConf is a service discovery protocol which should return a list of all discovered service providers within the reach of a user. ZeroConf within a corporate network should only return devices relevant to user rights and permissions via systems like mDNS which is commonly available in wireless infrastructures but is absent within wired infrastructures because of an inherent fear of letting switches be even mildly intelligent devices.

Multicast

The topic of discussion at this time is layer-2 broadcast domains, not layer-3. Layer-2 multicast addresses, like layer-3 ones, exist within a reserved range which can never be assigned to a client. As such, switches can never learn these addresses, and unless the switch contains some nifty mechanism like IGMP snooping, multicast traffic is always flooded unbounded within a broadcast domain.

As mentioned, attempts at primitive intelligence that require no interaction between switch and client already exist to bound multicast traffic within a switched environment. While the client and the switch, as a transit device, have no direct communication with one another, a relatively inexpensive managed switch can intercept and snoop IGMP messages transmitted by client devices and build an internal database within the switch bounding the flooding domain of desired multicast messages.

Because the switch can’t talk with the routers within the network, broad control of the bounds of multicast traffic is not possible.

General discovery of Windows services

In a different time, shortly before Windows as an operating system began providing its own networking support, circa 1991… DOS, OS/2, Novell Netware, etc… provided networking and network service discovery via broadcast messages. At this time, network switches were relatively new and very rarely worked well. Switches were implemented in such a fashion that they would hide themselves from the clients they connected. Even expensive network switches did not have any management system on them. Switches were dumb devices.

At the time, servers weren’t entirely new, but they existed for the purpose of providing business services; operating systems didn’t yet support things like user identity in any standard fashion, and every vendor had their own approach to it.

Service discovery was simple. Just broadcast.

Now in 2016, we don’t have these issues. We can use services like DNS to lookup ownership of Windows Active Directory servers within an enterprise network or use services like mDNS to identify services within the network. mDNS is a somewhat hackish solution to the ZeroConf problem because multicast is used in a circumstance where it is clear that all clients are connected to a device such as a switch or a wireless controller of some type.

It is no longer necessary to broadcast or multicast to find print servers and the like. There is infrastructure in place now to support more intelligent resolution of available services. In fact, most copies of Windows today no longer depend on broadcast for resolution of names or services.

What can a switch really do?

Even the most horrible switch Cisco makes, the 2960 series (genuinely junk that should never have been), is a far more capable device than Cisco themselves tend to appreciate.

This device is capable of many excellent features, but the current features in their current forms are not interesting. However the proposal I’m presenting here must be able to function on such a device with no hardware alteration, so listing some features would be a good idea.

DHCP Snooping

DHCP snooping is able to intercept DHCP messages within the layer-2 network and can choose to forward them or block them based on trust rules. The point of interest here is that Cisco’s cheapest switch can intercept and perform “deep packet inspection” on higher level protocols.

Dynamic ARP Inspection

Dynamic ARP Inspection is another higher-level tool for inspecting traffic more intelligently, in this case ARP messages passing through the switch. It can forward the messages or block them, and it can log and raise alarms.

In short, these are just two examples of switches selectively filtering and processing specific types of traffic whether multicast or broadcast passing through the switch.

Routing protocols like Spanning Tree Protocol

We tend to consider STP to be a “loop prevention protocol” as opposed to a routing protocol. This is nonsense, since the intention of STP is for switches and bridges to collaboratively identify an “optimal”, or at least operational, visitor pattern for the graph/network that ensures flooded traffic reaches each node/switch only once.

This is precisely what a routing protocol does and is. The difference is that by identifying “root nodes” and “root ports”, the visitor pattern is relatively simplistic and primitive.

Let’s assume for the moment that even the absolute worst CPU shipped in a switch within the last 10 years is more capable than the CPUs for which Spanning Tree and Rapid Spanning Tree were originally designed.

In addition, while STP doesn’t actually establish the network-scale relationships between devices needed to perform things like shortest-path first calculations, switches certainly are far more capable devices than we give them credit for and CAN handle processes like SPF.

Switch to Switch and Switch to Device Communication

This is possibly the single most important thing that switches can do. They can communicate with each other “intelligently”, they can communicate with servers located elsewhere via layer-3, and they can also communicate with clients intelligently over layer-2 for things like port authentication and neighbor discovery via protocols like LLDP (Link Layer Discovery Protocol).

LLDP

LLDP is incredibly important, and it serves two main purposes.

First and most obvious is so that two neighboring devices can identify themselves to each other and register information about their neighbors in a database. Link layer discovery protocol is very specifically related to the link layer. In a modern network, this means that it’s a point-to-point relationship between two devices (even in wireless).

Second, and far more important, is that two neighbors can transmit capabilities and settings to one another to negotiate support for better features. There is common information people look for in neighbor discovery protocols, like the neighbor’s name, management IP address, software version, etc… This information is meaningful to administrators for troubleshooting, or to management systems “walking the network” to identify devices they should manage.

What is also possible is listing new features and their settings which are available to the client system before even an IP address is known by the client device.

The LLDP relationship in a modern network is a very personal relationship between two devices that can be used to grow networking into something much bigger and more intelligent and it is a cornerstone of this entire paper.

IEEE standards such as DCBX (Data Center Bridging Exchange) already use LLDP for negotiating enhanced Ethernet features such as those required to deliver reliable Ethernet.
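
As an illustration of how cheap it is to carry extra information this way, an organizationally specific LLDP TLV (type 127) can be built by hand in a few lines. The OUI, subtype and capability string below are placeholders, not registered values.

    import struct

    # Generic LLDP TLV layout: 7-bit type, 9-bit length, then the value.
    # Type 127 is the organizationally specific TLV: 3-byte OUI + 1-byte subtype + data.
    def lldp_org_tlv(oui, subtype, value):
        payload = oui + bytes([subtype]) + value
        header = (127 << 9) | len(payload)
        return struct.pack("!H", header) + payload

    tlv = lldp_org_tlv(oui=b"\x00\x00\x00", subtype=1, value=b"ipv4-address-assignment")
    print(tlv.hex())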

Switch-to-switch communication

Switch to switch communication has always been a sensitive spot because of the lack of any real cooperation between vendors or, worse, intentionally hostile competition between vendors. “Modern” spanning tree exists as 802.1s (Multiple Spanning Tree), which is a fully engineered, almost intelligent solution for loop prevention for multiple classes of traffic that is versatile enough to be used for more than just layer-2 loop prevention within VLANs. 802.1w (Rapid Spanning Tree) reduces and cripples 802.1s because vendors didn’t want to implement the more complex protocol. In addition, Cisco introduced their own extensions like “Per-VLAN Rapid Spanning Tree”, which is an abomination perpetuating the problems within layer-2 to begin with, though this is not a discussion for here.

That said, switch to switch communication and collaboration is entirely possible. There is absolutely no reason why a routing protocol couldn’t be functional within the layer-2 network. This is also a major point of this paper.

Switch-to-client communication

The client has existed without knowing of the existence of networking equipment other than routers and NICs for a very long time. There is no possible reason for this other than cost. When hubs and switches first came to be, inter-compatibility at zero cost was critical. Remember that Ethernet did not win out over other technologies due to technical superiority. It won because it was cheaper than everything else. Most other technologies were clearly superior in every way, but Ethernet’s incredibly low CapEx made it attractive. When bridges and switches became interesting, it was within organizations who had invested in hundreds of NICs where they were adopted and replacing NICs was out of the question. Doing so would have made it equally attractive to evaluate options other than Ethernet. In addition, switches shouldn’t need a CPU since at the time CPUs were still quite expensive. $0.99 options from Atmel weren’t an option.

Today, even remote control jumping frog toys from China for $0.99 with free world-wide shipping contain CPUs.

In addition, all devices are connected using strictly point-to-point links via switches or wireless controllers. There are absolutely no layer-2 “multi-access networks” left except for the “pseudo multi-access networks” we’ve created… like Ethernet switches.

Ethernet however is not multi-access. It is point-to-point; the unnecessary flooding nature of a switch does not alter this in any way.

Therefore, if both the client and the switch support a common layer-2 communication protocol, it is possible to have a purely routed network using the hardware found in layer-2 switches. Thankfully, LLDP is everywhere. Windows 10 has it as a standard component.

What is a Fabric?

Switch-based networking and fabric-based networking are very similar in nature, with the exception that all the layer-2 devices within a fabric communicate with one another to hopefully establish “optimal paths” from point to point, in a similar fashion to layer-3 networking. It is not necessary within a fabric for the client devices to actively participate in the routing of packets, but the devices within the fabric communicate with one another to share information “intelligently” regarding the locations of the end-points.

Technologies like FibreChannel or FibreChannel over Ethernet introduce an addressing method that allows each device to be uniquely identified within the layer-2 fabric by assigning addresses based on the physical ports which they are plugged into. These addresses known as FCIDs or FibreChannel IDs are 24 bit values containing a switch identifier, a sub-switch identifier and a port number. Since these addresses are expected to be local to within a FibreChannel layer-2 domain and the addresses are provisioned by the fabric itself, it is not necessary to be overly efficient with their use.

Centrally within the switched fabric, there is a name resolution database that links addresses known as World Wide Names (WWNs) to their corresponding FCIDs. Client devices “homed” to the local switch don’t require special routing, but for FCIDs that contain a switch ID different from the local switch ID, a lookup occurs, the frames are routed along the shortest path (SPF) to the switch identified in the FCID, and the frame is forwarded correctly. The client devices employ the 24-bit FCIDs which were negotiated during fabric login (a process similar to an LLDP exchange) as opposed to 64 or 128 bit WWN addresses, since there’s no profit in using longer addresses within a layer-2 domain.
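
A quick sketch of that 24-bit FCID layout (an 8-bit switch/domain ID, an 8-bit sub-switch/area ID and an 8-bit port ID), with made-up values:

    # Pack and unpack a 24-bit FibreChannel ID: domain (switch), area (sub-switch), port.
    def pack_fcid(domain, area, port):
        return (domain << 16) | (area << 8) | port

    def unpack_fcid(fcid):
        return (fcid >> 16) & 0xFF, (fcid >> 8) & 0xFF, fcid & 0xFF

    fcid = pack_fcid(domain=0x21, area=0x0A, port=0x03)
    print(hex(fcid))          # 0x210a03
    print(unpack_fcid(fcid))  # (33, 10, 3)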

The point of this is not to explain the details of FibreChannel but simply to provide an example of an alternative technology that accomplished a similar task in a different way than traditional Ethernet and to present the proven concept of a FCNS or FibreChannel Name Server for “optimal” forwarding of traffic within a layer-2 domain.

An example of an Ethernet Fabric

There are two popular, “almost compatible” though competing technologies within the data center for providing a layer-2 routed Ethernet fabric: Cisco FabricPath and TRILL (Transparent Interconnection of Lots of Links). These technologies employ a fabric design requiring the use of new switches that are more capable than traditional switches like the Cisco 2960.

To begin with, TRILL makes use of protocols like IS-IS (not to be confused with ISIS), a routing protocol that is generally layer-3 protocol agnostic, for routing between devices within a network and identifying shortest paths between nodes via transit links. It is very close in principle to OSPF, except that IS-IS is a properly engineered protocol designed as part of the Connectionless Networking Protocol suite from the ISO, where OSPF is a relatively limited hack built to provide a solution when no other existed, without having to wait for a committee to hash out details over a period of years.

TRILL and FabricPath employ concepts which are nearly identical to MPLS, the Multi-Protocol Label Switching protocol. They build a fabric capable of encapsulating and switching traffic within a new frame type that isn’t inherently incompatible with classic Ethernet, but repurposes the destination and source MAC address fields of the Ethernet header to include routing information extremely similar in nature to that found within FibreChannel. Switching within the TRILL or FabricPath domain forwards traffic from switch to switch by employing a routing table within the domain to transit traffic in an optimal fashion.

TRILL and FabricPath also address the issue of multicast by creating predefined trees with single-visitor patterns that are used for flooding traffic to all members interested in such traffic. This is extremely sub-optimal because, again, it assumes that the client devices are not going to actively participate in the routing of the multicast traffic. Though at this point, I believe it’s not relevant to discuss this further.

The two protocols take differing approaches to sharing MAC address information. An example is how FabricPath employs conversational MAC address learning. Switches learn the MAC addresses of the devices “homed” to them, but a remote MAC address is only learned when a switch receives a frame whose destination MAC address belongs to one of its locally homed devices, meaning both ends of a conversation are known. Before a MAC address is learned for a destination, flooding is employed within the domain. This is clearly inefficient, but it is designed as such for the following reason.

What makes TRILL and FabricPath special and not optimal for this topic of discussion is that they are meant to forward traffic “efficiently” within gigantic layer-2 domains such as those containing 100,000 virtual machines within a single VLAN. As such, new equipment is used and the Ethernet protocol itself is “broken” within the transit links of the fabric. This is obviously not a problem if all devices within the fabric are employing the non-Ethernet protocol for tunneling Ethernet packets across their transit links.

Employing non-Ethernet tunnels however is not a reasonable possibility within a classic Ethernet domain.

Characteristics of an Enterprise Layer-2 domain

Scale

Consider one of my customers. They run a network of 1.2 million users across 2500 sites. To provide a network of this scale, cleanliness is critical and simplicity is absolutely mandatory, as a network outage could mean that 1.2 million workers, and as a knock-on effect up to 10 million other people, would be unable to work until the problem is resolved. A wide-spread outage would have a substantial, measurable negative impact on every single stock market world-wide. It could spread panic. The result would be catastrophic and people would almost certainly die because of it.

This customer uses Cisco 6800 series switches with FEXes, which are effectively external line cards employing a special “hack” to Ethernet that repurposes Ethernet hardware as an alternative to a chassis backplane. A Cisco 6800 switch in this configuration can have up to 1152 Ethernet ports connected to a single fabric, and all 1152 ports, in some cases spread across buildings whose floor space is measured in square kilometers, form a single managed fabric. To a system administrator, it is basically one switch with many line cards.

The Cisco 6800 series switches are by far the least expensive switches I’ve encountered on the market so far once total cost of ownership is considered. They have a high CapEx, but they cost almost nothing to manage once configured properly. What made them so inexpensive is that instead of building a switched network, we built a fabric.

So what does this have to do with the price of tea in China you might ask? It is to give an example of extreme scale within an enterprise network. A layer-2 domain containing 1152 ports is considered extreme. Nearly every successful large scale switched deployment I know of today is using this fabric technology to eliminate the management of their Layer-2 infrastructure. Airports, hotels, government buildings etc… by rejecting the classic switched Ethernet network and moving to (in this case proprietary) fabrics, up time is substantially higher, manageability is almost irrelevant and time to production is nearly zero.

As a point of reference, let’s assume for the rest of this paper that a layer-2 switched domain should scale to approximately 1000 clients. Beyond that, layer-3 should be employed.

Dump Ethernet!!!

Why don’t we all just dump Ethernet switching for fabric technologies like Cisco 6800s? It would really make a lot of sense. No joke, I commented that the Cisco 6800s are the only tool on earth I currently am more fond of than duct tape. But let’s be realistic, this isn’t going to happen and with a software/firmware upgrade to just about any managed switch made in the past 10 years, it’s likely not necessary either.

Performance

Enterprise users want excellent performance whenever possible; however, they are perfectly willing to wait as much as 3 seconds for a visual change in connection status before clicking reload. Next time you dial your phone, count how long you wait for a sign of the call being connected before you start wondering if anything is happening. The number is almost always about 3 seconds. We’ve been trained to expect this number.

This means that it is not necessary for all forwarding information to be cached at all devices at all times. It should be readily available, but it doesn’t have to be instantaneous.

Switch Limitations

A Cisco 2960 switch has a limitation of storing up to 8024 MAC addresses in its CAM (content addressable memory). Within a layer-2 domain, this can be quite limiting in the traditional sense; however, when the domain is properly managed, with 1152 ports in the infrastructure, this would permit roughly 7 MAC addresses per port (8024 / 1152 ≈ 6.96). This would easily account for a PC with a telephone inline. In addition, there is even enough capacity for the occasional 8-port switch on a desktop, though this would be frowned upon within the transitional phase of this infrastructure design, as my proposal is more focused on a two-tier layer-2 network. While the design will also account for a three-tier layer-2 infrastructure, it would require the desktop switches to also be upgraded.

Management is also an issue; network engineers have a deranged propensity for manually managing and configuring network devices individually. I am regularly asked in Cisco ACI environments how to manage and troubleshoot individual port issues directly on the leaf switches. When I respond to the students “don’t”, they quickly become disgruntled. I try to explain that they don’t have spine and leaf switches but instead a fabric. Cisco responds to the students by making the horrible mistake of letting them have their way and providing the means to do so.

Introduction summary

There are many problems and caveats to moving to a new layer-2 design which does away with broadcast domains and the horror of 1972 technology perpetuated into modern times. While I believe I’ve covered most cases in the sections below, there will always be corner cases. It’s entirely possible corporations like Cisco and Juniper will lack the agility to manage the transition as this paper should hopefully disrupt nearly every business unit within their organizations.

The design

To begin with, I’ll present an example of what I believe the new connection process to a switch would be. There are many steps involved and I’ve decided to try and keep it brief. I’ll continue with some more detail on technologies which don’t exist yet but will likely be needed at some point. Remember that every step in the process should be implementable on as many existing switches as possible without any hardware changes. This means that a managed D-Link switch or a top-of-the-line Cisco should all be able to simply be upgraded through software to support this. In addition, changes to the client operating system should work for both physical and virtual connectivity.

This also addresses only the wired infrastructure at this time. The wireless infrastructure will require a great many changes to handle master, session and temporal key negotiation. The workflow also favors an enterprise environment at this time. Implementation at a consumer level is also very possible, but would be much simpler and shorter.

Please also forgive my use of worded flow charts instead of graphics as I wish to keep this format as portable as possible between document editors for now.

Example of client PC connectivity.

  1. PC connects and brings layer-1 up
  2. PC brings layer-2 up
  3. Switch detects signal rise and waits for layer-2
  4. Switch transmits LLDP packet identifying ID and capabilities including
    1. IPv4 address assignment
    2. IPv4 Route configuration
    3. IPv4 Address to MAC resolution
    4. Name server assignment
    5. Time server assignment
    6. Boot server assignment
    7. Boot configuration assignment
    8. Etc…
  5. PC responds with LLDP identifying name and capabilities including
    1. IPv4 address assignment client
    2. IPv4 route configuration client
    3. IPv4 address to MAC resolution client
    4. Name server assignment client
    5. Etc…
  6. PC sends query to switch via directed unicast as HTTPS over Ethernet requesting:
    1. IPv4 addresses
    2. IPv4 routes
    3. Name server address
  7. Switch connects to the parameter configuration server via HTTPS REST and posts a query requesting parameters for connectivity to the subnet corresponding to the port to which the PC connected, and provides the JSON-encoded session parameters from the client PC’s request (see the sketch after this list).
  8. The address resolution server queries its local database for options corresponding to the request and responds to the HTTPS POST request.
  9. The switch forwards the REST response to the PC as the result of the PC’s request.
  10. PC compares HTTPS certificate to its trusted certificate pool. If the certificate is known to be revoked or is invalid, the user is presented with a warning or error depending on configured system policy.
  11. The PC applies changes to its configuration, verifies connectivity to provided servers and routers.
  12. Using HTTPS over Ethernet, detailed session logs are posted to the switch in JSON format.
  13. The switch collates the session log messages received from the client with its own notes regarding the session and pushes the combined session log via HTTPS, formatted as JSON with detailed information and high-precision time stamps.
  14. The PC connects to a TLS-enabled SIGNAL/R session provided directly from the switch, which is acting as a proxy to the services system (i.e. the address configuration portal), to receive changes to services and things like address assignment information.
  15. The switch via HTTPS POST registers device location information including session information with the LDAP authentication server.
  16. The switch signals PC via unicast SIGNAL/R session that authentication services are ready and specifies the authentication servers (LDAP/AD/OAUTH/etc) to which their session has been enabled including default domain information for each server.
  17. The switch applies access control to the port, denying all traffic except connectivity for secure LDAP to an authentication server. HTTPS should also be an option for OAuth 2.0 authentication.
  18. Using HTTPS over Ethernet, the PC requests the MAC addresses of the IP addresses of the configured routers for connectivity to the LDAP server.
  19. The switch queries its internal database for the corresponding MAC address.
  20. If the switch draws a blank, it connects to the designated address resolution and registration service for the local subnet and posts an HTTPS REST call requesting the address.
  21. The designated address resolution service switch receives the request and queries its database. If the query is empty, the service switch posts a query to the top level address assignment and management service to query the MAC address for the registered IP address.
  22. The address service responds with the corresponding MAC address
  23. The designated resolution switch responds to the query request
  24. The local switch returns the result to the address query.
  25. The PC performs machine authentication against the secure LDAP server.
  26. The LDAP server accepts authentication and triggers authentication event via SIGNAL/R.
  27. Authentication event triggers construction of port configuration data.
  28. The network permissions service queries the LDAP schema extension containing access control entries for which other machines can be accessed by the given machine. The machine names are queried against the address assignment/management service, and the service constructs a JSON structure delineating the access control entries corresponding to the machine account session, containing only explicit permit statements that have been properly summarized. The access control structure is passed via SIGNAL/R to the switch where the session exists.
  29. The switch translates the vendor neutral access control structure into vendor specific lists and applies the list to the interface.
  30. A user begins authentication on the client PC.
  31. The PC contacts the LDAP server to authenticate the user.
  32. Upon success, the LDAP server triggers construction of port authentication data.
  33. The network permission service constructs an access control list corresponding to a permissions matrix defined in the LDAP schema extension that lists all machines the user and machine are allowed to access, using an explicitly inclusive means of configuration.
  34. The network permission service pushes the vendor neutral access control list to the switch via SIGNAL/R.
  35. The switch applies the list to the interface.
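
Here is a rough sketch of what the message bodies in steps 6 through 9 might look like. None of these field names are standardized anywhere; they are purely illustrative.

    import json

    client_request = {                      # step 6: PC -> switch, HTTPS over Ethernet
        "request": ["ipv4_address", "ipv4_routes", "name_servers"],
        "client": {"mac": "aa:bb:cc:dd:ee:01", "hostname": "pc-0231"},
    }

    switch_post = {                         # step 7: switch -> parameter server, HTTPS REST
        "switch_id": "edge-sw-17",
        "port": "Gi1/0/14",
        "session": client_request,
    }

    server_response = {                     # steps 8-9: server -> switch -> PC
        "ipv4_address": "10.20.30.112/24",
        "ipv4_routes": [{"prefix": "0.0.0.0/0", "next_hop": "10.20.30.1"}],
        "name_servers": ["10.20.1.53", "10.20.2.53"],
    }

    print(json.dumps(server_response, indent=2))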

Technology choices

For people who read through the entire list, let’s discuss what’s new here.

LLDP

LLDP is just always the right way of negotiating Ethernet connections now. It’s a standard already widely available from nearly every vendor and is well understood. The protocol itself is an ugly mess of binary TLV nonsense, but other than what is likely legacy networking silliness and excessive processing overhead for no real gain, it’s really simple to parse and it can contain just about anything. Let’s also not discount that it is intended precisely for what it’s being used for here.

HTTPS over Ethernet

This is 2016 and we’ve moved from a world of several dozen special-purpose protocols to a much simpler platform requiring nothing more than HTTP connectivity to be able to perform any action we’d like. The benefits of REST APIs employing JSON for formatting over TLS-secured HTTP connectivity are proven. While it’s true that JSON and HTTP are extremely verbose, there’s no practical reason from a development perspective to shun them. With the inclusion of text compression, they are not necessarily inefficient either.

This system proposes that all communication, when possible, occurs using HTTP for transit over Ethernet. To make this happen, it will be necessary to provide reliable transport across Ethernet without the use of IP. TCP is of course an option and should be considered; however, it is likely far too robust for smaller systems and unnecessary, as a simple acknowledgement-and-retry mechanism is better suited than a protocol with extensive flow control and packet fragmentation abilities.
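
A rough sketch of such an acknowledgement-and-retry mechanism over raw Ethernet follows. It is Linux-only (AF_PACKET), requires root, and the EtherType 0x88B5 (an IEEE “local experimental” value) and the frame layout are assumptions of this sketch, not a published standard.

    import socket
    import struct

    ETH_TYPE = 0x88B5       # IEEE local experimental EtherType, used here as a placeholder
    IFACE = "eth0"

    def build_frame(dst_mac, src_mac, seq, payload):
        header = dst_mac + src_mac + struct.pack("!H", ETH_TYPE)
        return header + struct.pack("!BI", 0, seq) + payload      # 0 = DATA

    def send_reliable(sock, dst_mac, src_mac, seq, payload, retries=5, timeout=0.5):
        """Send one payload and retransmit until a matching ACK (kind 1) arrives."""
        sock.settimeout(timeout)
        frame = build_frame(dst_mac, src_mac, seq, payload)
        for _ in range(retries):
            sock.send(frame)
            try:
                reply = sock.recv(1514)
            except socket.timeout:
                continue                                           # no ACK yet: retransmit
            kind, ack_seq = struct.unpack("!BI", reply[14:19])
            if kind == 1 and ack_seq == seq:
                return True
        return False

    if __name__ == "__main__":
        s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_TYPE))
        s.bind((IFACE, 0))
        dst = bytes.fromhex("ffffffffffff")        # placeholder destination MAC
        src = bytes.fromhex("aabbccddeeff")        # placeholder source MAC
        send_reliable(s, dst, src, seq=1, payload=b"GET /parameters")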

Microsoft SIGNAL/R

Microsoft SIGNAL/R is used heavily for multiple reasons.

  1. It’s open
    1. It’s relatively simple
  2. It’s thoroughly proven
  3. It has native support for unicast, multicast and multicast emulation over replicated unicast.
  4. It has a named message group join system similar in nature to DBUS but more network oriented
  5. In a modern environment, it’s the only signalling system I know of which works on every single platform.

What has occurred?

The PC which logged in was able to :

  • Securely connect to authentication services but nothing else
  • Communicate directly with the active directory server eliminating the need for silly systems like 802.1x which exist to solve the identity problem on legacy networks.
  • Receive IP address assignment
  • Receive a full routing table instead of just a default gateway
  • Receive changes pushed from the network to the client eliminating the need for logging out and in again when network changes occur either by accident or through active management.
  • MAC address resolution became a tiered unicast function
  • There was full redundancy at all points of the chain.
  • There was full logging (not just syslog) at all points, in a correlated fashion that would simplify root cause assessment and reporting.
  • Address propagation was query oriented instead of broadcast push oriented simplifying MAC address table population.
  • Machine and user authentication was possible and secured at all times. (remediation and detailed authentication comes in a more detailed spec)
  • No special technologies like Cisco TrustSec were required to provide intelligent user and group permissions relative to the LDAP style user/group infrastructure.
  • The database of all things address oriented was centralized and could also be tiered.
  • All communication which traditionally was broadcast or multicast has been limited to strictly link local

What needs to be considered at this point?

Layer-4 reduction

By this I mean that the security matrix described above still requires the use of port numbers for UDP and TCP. Instead, access control should be service oriented. Systems like “Skype for Business” should, upon call negotiation and setup, push signals to the switch explicitly permitting conversations if the access matrix specifies “Permit Skype calls”. Of course TCP- and UDP-based rules should still be allowed, but they should be avoided as much as possible. An access control entry within the permission matrix should look like:

permit machines in group MINIONS to Skype each other
permit machines in group MINIONS to connect to machines in group BANANA for services HTTP and HTTPS
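
A small sketch of how such statements might be compiled into the vendor-neutral access-control structure described in the connection flow. The group names and field names are illustrative only.

    import json

    # (source group, destination group, service) tuples taken from the statements above.
    policy_source = [
        ("MINIONS", "MINIONS", "skype"),
        ("MINIONS", "BANANA", "http"),
        ("MINIONS", "BANANA", "https"),
    ]

    def compile_policy(entries):
        return {
            "version": 1,
            "default": "deny",   # only explicit permits, as in the flow above
            "entries": [
                {"action": "permit", "src_group": src, "dst_group": dst, "service": svc}
                for src, dst, svc in entries
            ],
        }

    print(json.dumps(compile_policy(policy_source), indent=2))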

Service registration

Currently, we depend on services like Bonjour, which are multicast “spray and pray” protocols. But even with enhanced services like Cisco’s wireless implementation of mDNS, which is really a nasty hack (not their fault), clients and servers should instead register the services they’re providing or consuming with a centralized, tiered registration server via explicit unicast. The network should then be actively adapted via access control list updates to permit connectivity to specific services based on the service-oriented access control lists.

Quality of service

With the design so far, all communication is basically known to the servers and the servers can proactively update the network for specific traffic. As such, there is no reason why per stream quality of service can’t be considered. The same mechanism which is responsible for creating and applying access control lists to the switch can be used to create and apply class-maps and policy maps for per-stream traffic requirements. Instead of having “random trust boundaries” as we see today (even with Cisco EasyQoS), all call setup would be processed and managed at the server where policies can be configured and enforced. Each policy would be implemented and enforced on a per-stream basis and the call setup server (call manager) would be responsible for informing the network that the call is coming in and adapt to it on a per stream basis.
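
As an illustration, the per-stream notification the call-setup server pushes to the switches carrying a call might look something like the following. The field names are hypothetical, and the switch-side translation into class-maps and policy-maps is left out.

    # Hypothetical per-stream QoS policy pushed from the call manager to the network.
    stream_policy = {
        "stream_id": "call-3481-audio",
        "src": {"machine": "pc-0231", "port_hint": "Gi1/0/14"},
        "dst": {"machine": "pc-0790"},
        "service": "voice",
        "requirements": {"bandwidth_kbps": 64, "dscp": 46, "max_latency_ms": 150},
        "lifetime": "until-teardown",   # removed when the call manager signals hang-up
    }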

Backwards compatibility

There’s absolutely no reason this system needs to have compatibility issues with older clients. On the switch, there would be a full DHCP implementation that would proxy queries to the IP address assignment server, and an 802.1x fallback option for clients. That however should not be necessary in most circumstances, as the authentication server should be able to manage access control based on machine accounts, for example.

BREATHE!!!

I’ll be back soon to describe switch to switch traffic

 

Posted by on December 1, 2016 in Uncategorized

 

Problems with DevOps

I’m gearing up for a big DevOps circuit around the states over the next two weeks. I’m truly excited about it, but there are business issues as well as technical issues that I’m concerned about on this trip. As many of you know, I’ve worked 20+ years in software development on some of the biggest and most used applications in the world. I’ve worked with some of the absolutely most brilliant developers I can imagine. Some of them are “Scary Amazing” in the sense that watching them work is like watching something from a movie that shows off some prodigy computer nerd that simply splats hundreds of different windows on the screen every minute.

I’ve worked in a continuous integration/continuous deployment environment since the late 90’s though we had different names for it at the time. The important thing to understand is that CI/CD is about methodology and methodology is computer science, not IT. When universities introduced IT (ICT in some countries) as a major, they had a major problem. There are many countries in the western world which moved their electronics-engineering programs into trade schools instead of the university because it simply wasn’t a science. IT/ICT is also generally not a science. It’s an applied engineering skill. Most topics in IT are better covered by 5 day to 2 month programs which build expertise in a specific area of technology. Often (though less so now than ever) certifications have more value than degrees as degrees take 3 years to get and the technology has moved on since the graduate learned it. As a result, many people do IT degrees and then come to me (or others like me) for two weeks to teach them a skill.

Things are changing around now. With the principle of DevOps and the evolution of IT into a science instead of an applied technology skill, training companies, schools, universities, etc… are simply not prepared to move in the direction of teaching something like “IT Science” or DevOps.

What I think DevOps is

So let’s talk a little about what a person in DevOps should be able to do.

  1. Basic IT skills in each main category. Not expertise. DevOps guys are useless if they spend 90% of their time on a single technology and to be an “Expert” in something like Servers or Networking requires thousands of invested hours in a specific area.
  2. Good programming skills. DevOps very likely won’t be doing the coding in the real world. They will for the purpose of prototyping, but a DevOps guy should be a competent developer. They hopefully will have worked on a large scale production system that used CI/CD methodologies as well as have built strong experience in modularization.
  3. Excellent project management skills. This doesn’t mean they’ll be the project manager, but it will be their job to gather information about each task which needs to be accomplished and should be able to plan and design each sprint in a SCRUM environment for example.
  4. Good written skills and communication skills. The PM should be able to bring a DevOps guy/gal into a meeting with people wearing ties and have them listen to “business needs” and without using technical terminology (it doesn’t belong there) provide a rough draft of the stages of a phased roll-out for the project described as well as identify approximate resources (humans of specific flavors) to complete the job.

A DevOps person should work at least part time to keep hands-on skills and knowledge as an IT person and/or developer. They shouldn’t sit permanently in the role. It becomes too easy to forget how easy or hard certain tasks are. Just because Windows Server 2012 R2 can be installed in 4.2 minutes on the current platform doesn’t mean that is guaranteed; they should learn about the things which can go wrong so that in the future they can design contingencies.

A DevOps person should spend most of their time doing paperwork and code-review. A huge part of CI/CD is refactoring. We write the code once to ship it. We write it again to improve stability. We write it again to improve performance. We write it again to improve usability. We can keep bouncing through those phases.

IT people will focus almost entirely on troubleshooting and POCing technologies. They will be removed almost entirely from the deployment and repair of the infrastructure. All changes to the IT infrastructure must be performed through a management system. It could be something as trivial and simplistic as Chef, as GUI oriented as Cisco UCS Director, or something as horrible as Python (great tool… just not for this job).

For design and deployment of a new system, DevOps would ask IT to record step by step everything they would need to do to accomplish certain tasks. They would build a proof of concept to demonstrate the technology. DevOps would require that the “step by step” is reduced to only what could be accomplished by command lines or REST APIs.

DevOps would also ask IT to write verification steps for each task. This would be roughly equivalent to a low level design as seen historically.

DevOps would work with QA/QC to develop verification scripts for each verification step defined earlier. All the new tests should fail while all preexisting tests should pass.

QA/QC would report to DevOps about which preexisting tests should (by design) fail once a new change comes in. QA/QC and DevOps would review those tests to identify whether the tests are invalid or should be altered so that they pass where they should and fail where they should, in both the old state and the new state.

QA/QC would set the new tests as live with warnings instead of errors on failure of new tests.

DevOps would design sprints and build sticky notes for the tasks to be accomplished by the developers along with correlating tests. There should be an extremely strong focus on revision control to permit rolling out the new systems without downtime to the existing systems. With a proper CI/CD environment, planned outages should never be needed.

Developers will, for each specific task, break the task into smaller sub-tasks that can each easily be rolled back. That means snapshotting (and testing that the snapshots are valid and cover a broad enough area).

Developers will provide more fine-grained verification tests to decide whether each sub-task passes or fails. If the sub-task fails, then rollback will occur (automatically, we hope); if it succeeds, the result will be logged and work moves forward.

Developers will then implement code which changes the IT environment by snapshotting, changing, testing and rolling back if necessary.
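
A bare-bones sketch of that snapshot, change, verify, roll-back loop is below. The snapshot, restore, change and verify callables stand in for whatever the platform actually provides (hypervisor snapshots, configuration backups, test scripts and so on).

    # Sketch of one sub-task: snapshot first, apply the change, verify, and roll
    # back automatically on any failure.
    def apply_subtask(snapshot, restore, change, verify, log):
        snap = snapshot()                      # take (and validate) a snapshot first
        try:
            change()                           # make the change via the management system
            if verify():                       # fine-grained verification test
                log("sub-task passed")
                return True
            raise RuntimeError("verification failed")
        except Exception as exc:
            log(f"rolling back: {exc}")
            restore(snap)                      # automatic rollback
            return False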

Once all verification tests from QA/QC are passing as expected, future failures for those verification steps become errors not warnings.

What the problem is?

The “Experts”

There are simply no humans alive today with DevOps experience… at least what I would call DevOps experience. I have a few months of experience and a lot of previous relevant experience, but I wouldn’t consider myself an experienced DevOps guy… even if I have been doing something like it for years. There are probably barely a handful of IT people with “professional developer” experience in the world. There are probably a few handfuls of “professional developers” who would even be interested in IT enough to make the change. There are even fewer people with the business skills required on top of the technical skills to fit the mold.

Human resources

It’s utterly obvious based on the direction which IT certifications have taken that human resource departments have absolutely no idea how to hire talent for different jobs in IT. The world is held at gunpoint by certification groups which are simply not testing anything job-relevant anymore. There was a time when a CCNA, CCNP, MCSA, MCSE, etc… meant something. These days, they’re search terms for LinkedIn, but having served as an SME (subject matter expert) in the design of several certifications, I believe the new technology has rendered most of these certs more or less useless. CI/CD and DevOps have basically eliminated almost all the value of those certifications.

So that leaves human resource departments completely lost. How can you search for the necessary talents to fill spots in DevOps when no one has them and there’s simply nowhere to get them in an “organized” fashion such as a university or training company?

By the time resumes show up on the desk of the PM or department manager, there’s just no hope that they will have any value.

Head hunters

Let’s just stop there. Head hunters have been so far out of their depths regarding IT talent for years. They’re just simply guessing these days. I receive calls from head hunters almost every single day. I’ve actually blocked most of their numbers. Head hunters belong in a lot of places, but in IT these days, they’re little more than dead weight. If you actually need a head hunter, consider outsourcing instead. That way, you at least have an external firm to hold responsible for failures.

Making the right talent

I believe strongly that DevOps requires experience and education. The closest thing we have to education regarding DevOps these days is computer science. There are many brilliant people who went to school for computer science that found out that there’s a lot of programming and even math involved in that topic and realized too late it wasn’t for them. It happens. Often they end up in product management or project management. I can think of a few people right now. These people could be good candidates. They should have some experience as a developer with the education to understand development procedures as well as modularization. They also likely worked through projects at least to the point of receiving a passing grade in development as well as hopefully having learned about process like Agile, eXtreme Programming or SCRUM. They could be promising.

I wouldn’t start with IT people at this time, since IT people generally have a great deal of practical knowledge but, even if they took classes on things like Python, lack knowledge regarding topics like optimization of code or much experience with refactoring. I don’t believe a person with 20 years of IT experience is well suited for DevOps. The skill set is just far too different.

Most people educated in computer science have a major character flaw. Ask any IT guy what their experience with developers has been. It should sound like ‘Hey, can you fix this for me real quick? Just click this, this and this.’ at which point the IT guy says ‘I can do it Tuesday’ and the programmer gets grumpy and makes a snide remark like ‘Come on, it’s easy, just give me the password and I’ll do it myself’.

The truth is, most computer science people are perfectly capable of handling individual IT tasks and can google their way through the rest of them. What they lack is the experience to know that if you change something right now without writing a ticket and documenting the change and verifying it, it will likely break more than it fixes. Letting the programmer fix it him/herself will lead to undocumented changes that will cause unexpected behavior which is never a good thing even if the change “fixes more than it breaks”.

So, take this person who can “fix it him/herself” and put them into IT training with the “IT noobs” and make them learn that there’s more to IT than just clicking “next, next, next finished”. IT has process too.

Make a big point out of making sure that the programmer learns some Cisco (I don’t care what network equipment you’re using, the best classes are currently Cisco’s), learns some VMware (again, course quality), some storage (no good training for this), some Linux (RedHat training is okish), and lots and lots of PowerShell (this is the heart of all things Microsoft).

Once they do this… get their hands dirty on real IT projects and learn enough about each technology to communicate with experts in each area (not as a peer, but at least as someone who doesn’t just nod their head and pretend they understand), then you might have someone DevOps worthy.

I think it’s far more likely that DevOps will be an educated developer currently working as a project manager as opposed to someone who spends most of their time hacking in code or IT today. It’s even better if they were a good coder, got bored of it and moved on as they got older.

DevOps Bonus!!!

I think that age discrimination is on your side here. No joke… an old-school COBOL developer turned project manager is likely a great starting point. I constantly read articles about how old farts (40+… like me) are no longer marketable… we aren’t… get over it. I won’t work 18 hours a day, 6 days a week grinding code I’ve written 20 times already for $80,000/year… but you can hire a dozen 23-year-old hotshots who will. So… nope… not marketable.

DevOps however is a happy place for the old geezers. Someone who worked 3-10 years as a developer and along the road moved into project management long enough to learn how to use PowerPoint and Project… maybe even attended a few SCRUM seminars… there’s a great starting point. There are tons of these people out there. You can probably turn them into what you need in 6 months time with a few courses and some OJT. Slow and steady will win this race every time. I know guys (not gals this time… different era of computing… pin striped shirts with armpit sweat stain days) I would pull out of retirement who I think would ace the DevOps position.

No Silver Bullet

If someone has sold you the idea that DevOps is a silver bullet that will fix everything… stop now. DevOps is a great idea and absolutely should be the direction we move next in IT. It’s going to kill A LOT of jobs. It’s going to trim millions of consultant hours. In short, if all goes well, it should bankrupt TCS and bring jobs home.

What DevOps provides (if you take it seriously and don’t just throw 10 random people in a room and call them DevOps as most companies do today) is accountability and predictability. What starts working should stay working. What isn’t working yet should be broken into tasks small enough that they can be accomplished and put in production quickly. A good DevOps oriented system (meaning adding developers and DevOps people to IT) should show amazing ROI… there’s no proof of it… unless you consider that the same methods are used for software development all through the industry. And if you need proof, then ask yourself when the last time was that you got a Blue Screen of Death or had to constantly reboot your PC. This machine I’m using hasn’t been rebooted in 3 months and works just fine.

Conclusion

DevOps is coming. It’ll change absolutely everything we know about IT and none of us are really ready for it. If you’re a PM, you have no idea where to start. If you’re a developer, you probably would prefer to do something else. If you’re an IT guy, you probably hate the idea. If you’re a CEO, you’re going to be disappointed since DevOps… when it’s working, you shouldn’t notice it’s even there. It’s a progressive change.

What I will say is… when it starts working (and we’re getting there), the way we do business in IT will rock. Goodbye “Public Cloud” nonsense… just like the timeshare mainframe, it doesn’t make sense anymore. Goodbye system integrators, you shouldn’t be needed anymore. In reality, we should be able to kill off most IT consultants, as technologies like Azure Stack should eliminate the need for you.

Don’t quit your jobs yet, it’ll take 10-15 years before we really feel the hurt in IT. DevOps is new and we’re still making it up as we go along. I’ve been to 5 DevOps seminars so far and they were all absolute jokes. They mostly were selling useless products that were obsolete before they ever shipped. DevOps will come, but it will hopefully come from the universities and not the head hunters.

 

Posted by on July 14, 2016 in Uncategorized