My name is Pierre Jannez. I have worked for Operae Partners for over ten years, and I coach IT teams in their Lean journey. I will tell you about incident management.

In January this year, in a datacenter, a bunch of servers shut down at the same time. I was at home when my son asked me to get some bread for the sandwiches. I ran to the bakery and tried to pay with my credit card, but the payment failed. I checked my bank application on my phone: the app didn’t work. I was a bit worried, so I walked to the ATM. The ATM was down as well, so I had to go back home and get some cash to buy bread.

I realised that I was not the only one bothered. Messages started popping up on social networks, because there was actually a major system failure somewhere. Even the media mentioned it. A Lean workshop even had to be interrupted because the team was impacted by the same failure. It was a major IT incident.

This major incident actually hides something even bigger. Once, the CIO of a major French retailer told us that his main pain point is that every year he has to deal with 300,000 incidents. 300,000! That’s huge. And of course, among these 300,000, some are major ones, but all of them must be taken care of.

What I am going to tell you today is how to have very efficient teams that deal with these incidents and reduce the stock, the volume, etc., in order to produce real value for the customers. How?
With what I believe is the magic of Lean management: pull flow. So this talk will be about pull flow, and jidoka as well.

Pull flow comes from Michael Ballé, who always used to tell us: “Every time I walk into a factory, I wonder how I will put this mess into pull flow.” After hearing that so many times, I realised that it might actually make sense, and I started to wonder what it meant for IT.

Today, IT services are under a lot of pressure: the pressure for digital transformation, which requires opening up the systems, which leads to more users, to more requests, to more systems. All this contributes to an increase in the volume of tickets. Among all those tickets, 55% are incidents. Great room for improvement!

Incidents cost a lot of money. According to the latest survey by a company called Splunk, the cost of one critical incident is around $140,000. Yet we don’t know how much the 300,000 small incidents cost.

All IT services face the same problem: year on year, their budget does not evolve much. The budget doesn’t increase, but the demand for change does, and so does the number of incidents. The budget for change remains the same, while the budget for RUN is impacted by the incidents. The trick is to work on the incidents, so that some of the budget previously allocated to incident management can be reallocated to change. What needs to be learnt is how to accelerate the production of quality. How to do that?
Not with external consultants, not by outsourcing production, and not by putting pressure on the teams either. We’ll use an old trick that is approximately 60 years old, and we’ll do that old trick with the people of today, who are young and therefore not exactly like me… as I am a bit old. We’ll apply an old method: it is called the Toyota Production System.

I will not go into a lot of detail. We’ll work on visual management to make problems visible. We’ll work on Just in Time with pull flow, which will enable us to see quality through jidoka. This is the theory; let’s see how it works in real life. One more thing: TPS could also stand for Thinking People System, as it is a tool to learn to think.

So why? Well, because it works really well. To give you an idea of how well it works, here are the results achieved in the last 20 projects, on average. The volume of new incidents has been reduced by 37%. The stock of incidents is reduced by almost 60%. Therefore customer satisfaction goes up: +25%. Lead times are divided roughly by 6. Production is multiplied by 3. These are pretty good results.

The magic is pull flow. Why is it important? Because thanks to this tool, learning is massive, and acceleration is massive as well, on the right topics. It’s not for the sake of velocity! One needs to accelerate for the sake of value for the customer, in the right direction.

How do we do that? This is the story of a team I coached two years ago in a bank. This is the team, with its sponsor at the center. With this team, we achieved something incredible. How did we do it?
The first step was to make the activity visible: build the right visual management to enable us to see the right problems and deal with them at the right time. Once the visual management was in place, we pulled the activity to see the problems arise, and then solved them through PDCA.

What the Lean authors say is: start with visual management. But one has to build the RIGHT visual management. To succeed, one needs to define performance indicators: are we succeeding? What exactly are we succeeding at? Second, one needs a production board to visualize the current activity. And then it must lead to problem solving. For me, these are the three pillars of visual management.

What do we mean by measuring performance? First, is the customer happy? Is she satisfied with our work? Then you need to measure quality, as the job is to deal with incidents. In this type of activity, you look at the volume of incoming incidents and the stock of incidents not solved yet, and what we want is to reduce both. What matters is whether we succeed in solving them more quickly.

The other interesting indicator is productivity. Why productivity? Because everything we put in place will enable us to identify problems, which we will solve, and the question is: will all the problem solving, all the improvement, help increase the team’s capacity to produce more?
Because what we want to achieve is not to produce more by increasing the pressure on the teams, but to produce more by freeing up some capacity in order to deliver more value to the users. In this case, as we start by working on incidents, it is repair work. But once you have gotten rid of all the incidents, then value creation begins.

Before that, one needs a challenge. The sponsor gave a challenge to the team. The sponsor said: “Currently, you solve 3 incidents per month. From now on, I want you to solve 10 per month.” This objective was almost reachable.

We’ll start by looking into the visual management. Customer satisfaction: 7/10. That’s the score given by the client for the incident management activity. Not great. The team wondered why, and listed on the wall the Voice of the Customer, which is the list of the pain points mentioned by the clients. This is a useful list for problem solving from the customer’s point of view. With this information, the team understands that there is room for improvement in areas other than the incident itself: the way the incident is solved, and the way it is communicated to the client, is sometimes a problem. That’s what is visible on this wall.

Then we created quality KPIs: the stock of incidents and the volume in. The objective was to reduce both the stock and the volume in. I tend to be a bit too pushy: my goal is zero incidents. In Lean, quality means zero bugs. Our projects are two months long, so in two months we must reach zero stock. It’s almost reachable when you work on it.

The objective of the sponsor was 10, so we also set a production KPI on our ability to fix 10 bugs. Finally, the productivity KPI, so we can see whether the improvement efforts work or whether we’re solving the wrong problems.

Next step: make the activity visible. Why does it matter?
Because everything is hidden inside the PCs. As nothing is visible, it’s hard to solve problems you can’t see. This is how we made the activity visible: we showed the process. A column for the backlog, with all the incidents of the backlog; a column for the team members; and a column for each step of the process, split into 3 parts: work in progress, ready (i.e. ready for the next step), and a red bin where you put everything that goes wrong. You place the Post-it of the incident there, with a first cause. And eventually the delivery trucks: because the team delivered every month, various trucks were ready to deliver. As each incident was customer dependent, depending on the customer the incident went to one truck or another.

Once this was visible on the wall, the team went through the various tools to gather the incidents and fill in the visual management. The tickets that fill the backlog are sorted with the customer by order of priority, and sorted again by the team by complexity. That gives a strategy to tackle the stock as fast as possible: start with the incidents that are the most impactful for the clients and, among these, start with the simple ones. That’s how the board is filled.

Then we prepared the problem solving board. There were two main axes. The first is to accelerate (as it was the sponsor’s objective): we want to correct the bugs as fast as possible. Actually, there was no real problem with the correction of the bug itself; that’s why problem solving was focused on speed.

Once the visual management was ready, we could start playing with it by pulling the flow: pull flow AND jidoka at the same time. We defined a takt time, created a continuous flow and started pulling the flow. Why? In order to see the quality problems with the red bins in the process. That enables us to solve the problems as early as possible.

The magic trick to start with is the definition of the takt time. Some might say it is not a real takt, as here we deal with numbers but not with frequency. It’s not easy with IT teams.
You can’t ask them every half hour: “Is your incident out?” Instead, the team fixed a daily objective: if we want to manage all the incidents that come in every day while reducing the stock, we must be able to get X out every day. That’s the formula: take the volume in and your stock, and divide by the number of days necessary, depending on the reduction you are aiming for. NB: this works not only for incidents but also for user stories in Scrum teams. The objective of the day is in the upper right corner.

Start pulling the flow from the end, starting with the tickets that are closest to the exit. That’s what it looked like. You see the process and the various colors: the blue ones are problems, so more complicated incidents; all the yellow ones are incidents. The ones in the red bins have been sent back from the next step. For example, the definition of the incident itself is unclear, therefore the team can’t solve it; then, as the analysis was unclear, the developer could not code; and so on. So the team must get 2 incidents out per day in order to kill the stock and absorb the volume in. So we started pulling.

To pull the flow, every day the team gathers in front of the visual management. The daily meeting aims at defining the contract of the day. It starts from the end: the team commits to getting those 3 and this one out. The goal is to get them out. Once these are out, the team gets back to the rest of the activity to bring new tickets close to the exit, but the goal of the day is to get these 4 out. The idea is to make this contract visible: here, everyone knows which tickets must get out. This way, at the end of the day, the team can tell whether it reached or missed its objective.

In this example, 2 tickets are out: numbers 1 and 3. Number 4 is WIP. Number 2 has been rejected and is now in the red bin. These are 2 opportunities for problem solving. One is about a quality problem within the process. The other one is unknown: what happened? Is it a skill-related problem? Is it a problem of environment availability, for example?
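
As a side note, the daily objective formula described above can be sketched in a few lines of Python. This is only one possible reading of the spoken formula, under stated assumptions: the work to clear over the period is taken to be the stock reduction you are aiming for plus the incidents expected to come in during that period, divided by the number of working days. The function name, its parameters and the example figures are hypothetical and do not come from the talk.

```python
import math


def daily_objective(stock, incoming_per_day, working_days, target_stock=0):
    """Tickets to get out per working day (hypothetical helper).

    Assumption: work to clear = stock reduction + inflow over the period.
    The result is rounded up, since a team cannot close a fraction of a ticket.
    """
    if working_days <= 0:
        raise ValueError("working_days must be positive")
    if target_stock > stock:
        raise ValueError("target_stock cannot exceed the current stock")
    stock_to_remove = stock - target_stock
    expected_inflow = incoming_per_day * working_days
    return math.ceil((stock_to_remove + expected_inflow) / working_days)


# Illustrative figures only (not from the talk): 40 open incidents,
# 1 new incident per day, and a two-month project of about 40 working days.
print(daily_objective(stock=40, incoming_per_day=1, working_days=40))
```

Rounding up is a deliberate choice here: a rounded-down objective would leave part of the stock in place at the end of the period. The same calculation applies to user stories if you replace incidents with stories and the stock with the backlog.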
That will lead to problem solving. You see many things when you look at such a visual management: delay (“pending” means no one is taking care of it), quality issues, and here rework. Here one can see there’s nothing to do: the flow is interrupted. Whereas here there is a lot of production going on, maybe too much… That’s a very powerful tool, which shows all the problems that need to be solved. It can be a bit scary at first, but if you take one problem at a time, things improve over time.

That’s what it looks like for real. It’s a bit messy: red bins everywhere, and Post-its piling up in test, which means things don’t go out fast enough. That being said, if you look at the objective of 10, they’re almost there! Whereas the team used to get only 3 tickets out, now they can get many out.

Now we are getting to the third and most important aspect: problem solving. All we did before was headed towards this one goal: having the team solve problems. During the daily meeting, there are 3 questions. Did we succeed yesterday, i.e. did we reach the fixed objective? This is in order to do a problem solving session after the daily meeting. Then: what will we do today? And will we manage to reach today’s objective? The three questions trigger the question of problem solving.

Once the problems are spotted, what we want is to develop this continuous learning every day. The questions are: Why did we miss the objective? Who takes charge of the problem? What was the problem, and in what context? What are the causes? Have they been confirmed? What’s the root cause? Can we test a solution quickly? What’s the check method? And eventually: what did we learn?
These are the PDCA questions. What really matters is to solve the problems one by one. You can’t solve everything at the same time, or production stops. One must be smart: visual management will show the obstacles, which enables the team to react very quickly and put the Post-its back into the flow. Then you take a step back to work on the root causes, which will eventually give way to standards. So, first: protect the customer by all possible means, even if it is just troubleshooting. And second: improve.

This is done through PDCA: there is a gap, I carry out an analysis, I find a countermeasure, which I test and check, and then I adjust, so I learn something. These are real PDCAs, handmade. What matters is to learn to think with this technique: What’s the gap? What could be the cause? What’s the main cause, if any? If not, what are the hypotheses? If the cause is confirmed: why? Ask 5 whys until you get to the root cause. Then apply the countermeasure, then check. This way you have learnt something!

Basically, what the team did in this specific story was to accelerate. This PDCA shows the process for dealing with an incident. The lead time is 98 days, which is huge. Why?
Because there is a team in charge of incident planning. Then there is an analysis. The document produced goes to a technical analyst, who writes another document called DOU, checked by a functional analyst who approves it. Then it goes to a technical analyst who approves it again. Then it goes to a developer who codes, then the tester tests, and finally it reaches the deployment step and the team delivers to its client.

The issue is that all these steps are totally useless: the more you plan, the less goes out. Once you get to the step where actual value is created, you realize that there are quality issues: what’s been analyzed, approved and re-approved is not correct, so it goes backwards in the process. When it gets to the testing step, one realizes that the coding has not been done so well, so it goes back and forth… There are quality issues within the process, and therefore a lot of waste.

The analysis of the causes of the delay was done. They suggested improvements, which were approved. And the lead time of incident management went down from 98 days on average at the beginning to 9 days. Pretty nice improvement!

That’s another PDCA, focused on bug correction lead time. The process is a bit stupid here as well: one needs to ask for an authorization to test. What the tester suggested was to communicate directly with the developer, who was based in India. The only problem was that he did not have his phone number.
He found it and started calling him. The simple fact of communicating directly led to a reduction of the lead time to perform a test from 2 days to 1 hour.

What we see as well is that red bins help improve skills. I’ll go through this PDCA quite fast, as I am running out of time. What happened on these incidents? As there is a lot of turnover in the team, the person who had the knowledge moved to another team. So what Naïm did in this PDCA was to go and collect the knowledge from this person who had left the team, and train the others at their desks so they were able to solve these incidents. From then on, the person who had the knowledge was no longer essential.

Let’s have a look at the results obtained by this team. After 2 months, customer satisfaction hasn’t changed: it remained at 7/10. I will tell you why a bit later. Meanwhile, the stock of incidents has decreased by 80%. The stock of problems was being taken care of, whereas before the project they did not handle the problems. Out of 19 problems, 7 are left in the stock, 3 are being analysed and 6 are in development. That means we are getting closer to the exit. Productivity has been multiplied by 3.3, and production has improved as well: the team went from 3 bugs corrected per month to 17, and then 12 two months later.

What’s interesting is the situation 6 months later: everything has changed. There are only 2 incidents left per month, which means incidents are no longer an issue. And the team spends 99% of its time producing value for its customers. How did they do it?
They started to handle the small changes, which they delivered at the same pace as the incidents. They were so fast doing so that they were actually faster than the customers: they dealt with the whole customer demand, which created some stress as they ran out of activity. The sponsor said: “That’s not a problem, we have a big project coming up.” They were given some features from that big project to develop, and since they had learnt to divide work into small pieces, that’s what they did with those large features.

The sponsor asked me afterwards: “Can you tell me what the maximum capacity of the team is? Because when the team has reached its maximum capacity, I will transfer people from the project to that team so they can learn this new way of working.” This is how we built a larger team with new ways of working.

As a conclusion, this is how you create wonderful teams who deliver fantastic products. How do you do it? With pull flow, which actually shows the way for the team to learn faster, through jidoka, which means dealing with quality within the process. While learning to do this, one improves one’s skills, one learns to collaborate better with the other teams, and one delivers more quality, faster.

As a reminder: start with a performance board. That’s the “check” step. Then “plan”, then “do” with problem solving, and go back there to see if things improve with all the problems solved. Thanks to this, you achieve incredible results. I am done!