Automation has many possibilities. From politics to the grocery store, it offers ways to remove redundant tasks that are often done by people. However, automation works much like a tree structure: as we automate one thing, we usually need another system in place to manage what we automated. Sometimes this is where humans come in and new jobs are created, but it can also create unnecessary work.
A good example is managing technology infrastructure, or in this case, servers. If you run a data center, or your business has facilities that rely on servers, you usually have a team of people who oversee purchasing, analyzing, and designing those servers. While cloud computing has solved the majority of this problem, there are still businesses today that run a private cloud for various specific reasons. The question becomes: how do you create a private cloud and automate the work of the people managing it? This may not mean removing the entire team, but infrastructure teams are usually very large, with many people devoted to data analysis and immediate troubleshooting. That work costs a lot of time and money. In today's world you want to be deploying preventive solutions in your environment, not reactive ones. That means removing the bulk of your technicians and performance analysts.
Failure Is Inevitable
If we continue with our data center example, there is no way to prevent a hardware failure or even an application failure. These things are going to happen no matter what. The key, however, is not to wait for them to happen but to actively look for signs of their inevitable occurrence. If an SSD is showing signs of failure, replace it before it fails. If an application is placing high demand on an environment, grant it more resources until you can find the cause of the issue, fix it, and then scale back. All of these situations are handled on your standard public cloud environments. The question becomes: how do they do it? How does a company like Amazon, Google, or Microsoft deploy fully scalable environments that react to failures before they happen? The answer isn't straightforward.
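The "replace it before it fails" idea can be sketched in a few lines. This is a hypothetical illustration, not a real monitoring tool: the attribute names (`reallocated_sectors`, `wear_pct`) and thresholds are made up to stand in for SMART-style wear indicators a real fleet would report.

```python
# Hypothetical sketch: flag drives for proactive replacement based on
# SMART-style wear indicators, before an outright failure occurs.
# Attribute names and thresholds are illustrative, not vendor values.

def drives_to_replace(drives, max_reallocated=50, max_wear_pct=90):
    """Return serial numbers of drives showing early signs of failure."""
    flagged = []
    for d in drives:
        if (d["reallocated_sectors"] > max_reallocated
                or d["wear_pct"] >= max_wear_pct):
            flagged.append(d["serial"])
    return flagged

fleet = [
    {"serial": "SSD-001", "reallocated_sectors": 2,  "wear_pct": 40},
    {"serial": "SSD-002", "reallocated_sectors": 75, "wear_pct": 60},  # failing sectors
    {"serial": "SSD-003", "reallocated_sectors": 5,  "wear_pct": 95},  # near end of life
]

print(drives_to_replace(fleet))  # ['SSD-002', 'SSD-003']
```

Run on a schedule, a check like this turns an eventual outage into a routine maintenance ticket.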
Machine Learning and Big Data
A company today that isn't keeping its data is a company bound to fail over and over again. In a sense, data is free money. It can tell you so much about your business that you aren't aware of, and it can be the fuel that powers your technology stack. Machine learning is the key to managing our data center problem. It is what Amazon, Google, Microsoft, and every other data center provider employs. The important part, however, is how.

Before machine learning, there was a lot of data, and it was fed to people who tried to make sense of it. Using this data, you could create your own performance metrics, and with those metrics you could set manageable guidelines for your business. From the perspective of a data center, you would employ strict, static metrics across your entire environment. When a metric hits a certain value, you can automate the alerting process and even the remediation process. This is great in theory, but the reality is much different. Human-designed metrics fail to take edge cases into account, and they fail to recognize that every application runs differently and does something different. In a perfect world, an application fails, an alert is triggered, the problem is fixed, and we move on with our business. The reality is that an alert triggers on an application that isn't running outside normal parameters, and we all scratch our heads asking why. It turns out there was a miscalculation in our design of the hardware, or the application itself was patched with a bug that caused certain functions to consume more resources than usual. The why doesn't really matter; what matters is that our metrics didn't work, and their automated actions caused more headaches than relief.
Machine learning can take edge cases into account, and it can find problems before we even see them. A good example is when I employed machine learning on logistical servers. We found that the sites requiring upgrades were sites that, in theory, would never be hit hard, but past trending data showed that they were, and that in time they would be hit harder and cause problems. So those sites were upgraded, and in the following months they were slammed with data and processes. Luckily, we had reacted before there was even a problem. We prevented a failure from ever happening by seeing it on the horizon. As we collect more data, we become better at preventing failure, and when a failure does happen it doesn't actually cause a problem, because we already knew it was there and had deployed a solution.
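A minimal sketch of that trend-based idea: fit a straight line to each site's historical load and flag sites whose projected load exceeds capacity within a planning horizon. This is an assumption-laden toy, not the system described above; a real deployment would use proper forecasting models, and the site names, loads, and capacity figure are invented.

```python
# Toy trend-based capacity check: least-squares line per site,
# projected forward a few periods. All data here is made up.

def slope_intercept(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def sites_needing_upgrade(history, capacity, horizon=3):
    """Flag sites whose projected load exceeds capacity within `horizon` periods."""
    flagged = []
    for site, loads in history.items():
        m, b = slope_intercept(loads)
        projected = m * (len(loads) - 1 + horizon) + b
        if projected > capacity:
            flagged.append(site)
    return flagged

history = {
    "site-a": [20, 22, 21, 23],   # flat trend -- no action needed
    "site-b": [40, 55, 70, 85],   # climbing fast -- upgrade before it saturates
}
print(sites_needing_upgrade(history, capacity=100))  # ['site-b']
```

Even this crude projection captures the point: site-b looks fine today, but the trend says it will not be, so the upgrade happens before the problem does.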
What Comes Next?
What happens when failure handling is automated to the point that a team of 12 becomes a team of 3 doing nothing but designing new environments and maintaining current ones? New teams may be formed, such as application developers who automate the environment further, and perhaps data scientists are hired to train better models that prevent further failures. Essentially, the team of 12 either learns new skills or sinks. However, these consequences are necessary for the survival of the business. The reality is that you cannot survive today if you are simply reacting when a problem arises. You need to be aware of a failure a month or a week before it happens; otherwise time is wasted and money is lost.