
IT'S a Tech Podcast
The “IT’S a Tech Podcast” is an engaging conversation about the game-changing technology solutions being advanced by the state’s Office of Information Technology Services. Learn how we make IT happen for 53 state agencies and 20 million New Yorkers, as well as help government leaders deliver for New York.
Episode 6: Defeating the Blue Screen of Death; The Global IT Outage, One Year Later
On July 19, 2024, a faulty software update by cybersecurity company Crowdstrike took down millions of Microsoft Windows operating systems all over the world.
It caused computers to crash and lock, displaying the so-called “blue screen of death.”
The outage led to widespread disruption across multiple sectors of the global economy, including airlines, banks, hospitals, businesses and even governments like ours in New York State.
Dubbed the “largest IT outage in history,” it caused more than $10 billion in financial losses, impacting some businesses for weeks or even months.
By contrast, ITS worked round the clock to bring critical New York State systems back online within 24 hours. The State’s IT provider swiftly remediated tens of thousands of blue screens – by hand and at hundreds of different locations across the State – so agency employees could get back to work and New Yorkers could continue to access the services they needed.
This is the story of how a large government agency worked as a team to mitigate the impact of this worldwide outage.
It’s a story of neighbor helping neighbor, of employees from every department, even the non-technical ones, being deputized as technicians to lend a hand, resulting in a highly coordinated and massive response effort the likes of which we have never seen.
This is the story of skill, leadership and determination, of going the extra mile to get the job done.
This is OUR story of New York’s response to the Global IT outage: one year later.
Thank you for listening to the IT’S a Tech Podcast. For more information about ITS, visit our website at its.ny.gov. Follow us on X, LinkedIn, Instagram and Facebook.
0:01
You're listening to the IT'S a Tech Podcast, an engaging conversation about game-changing technology solutions being advanced by the state's Office of Information Technology Services.
0:11
Learn how we make IT happen for 53 state agencies and 20 million New Yorkers while helping government leaders deliver for New York.
0:22
On July 19th, 2024, a faulty software update by cybersecurity company Crowdstrike took down millions of Microsoft Windows operating systems all over the world.
0:33
It caused computers to crash and lock displaying the so-called "Blue Screen of Death."
0:38
The outage led to widespread disruption across multiple sectors of the global economy, including airlines, banks, hospitals, businesses, and even governments like ours in New York State.
0:48
Dubbed the largest IT outage in history, it caused more than $10 billion in financial losses, impacting some businesses for weeks or even months.
1:00
By contrast, ITS worked around the clock to bring critical New York State systems back online within 24 hours.
1:07
The state's IT provider swiftly remediated tens of thousands of blue screens by hand and at hundreds of different locations across the state so agency employees could get back to work and New Yorkers could continue to access the services they needed.
1:22
This is the story of how a large government agency worked as a team to mitigate the impact of this worldwide outage.
1:30
It's a story of neighbor helping neighbor, of employees from every department, even the non-technical ones, being deputized as technicians to lend a hand, resulting in a highly coordinated and massive response effort the likes of which we've never seen.
1:46
This is the story of skill, leadership and determination, of going the extra mile to get the job done.
1:53
This is our story of New York's response to the global IT outage, one year later.
1:59
Chief Information Security Officer Chris DeSain shares his experience from the early hours of the outage.
2:07
As an incident responder, we're always on call 24/7.
2:11
We have an on-call rotation.
2:14
So, in this particular incident, we were called at 2:00 AM, which is not abnormal for our situation.
2:27
I was contacted primarily by one of our sister agencies, the Division of Homeland Security and Emergency Services, that one of the products that we supply as a shared service for the state of New York, endpoint detection and response through CrowdStrike, was having an issue.
2:46
So, in response, I called into our ITS bridge and found out there was a Sev 1 already started within New York State Information Technology Services.
2:58
And our role as cybersecurity is to find out if there's a cyber component, if a threat actor is working against the state of New York.
3:09
So, got on the bridge and found out very quickly that ITS had already identified that the security component that we supplied to the state was having an IT failure and it was not a bad actor threatening New York State.
3:27
So, at that point, the security event turned into an IT event and a recovery event, so the Office of Information Technology Services and our operations staff took over recovery of the systems that were locked up for the state of New York.
3:46
As the situation developed, Joe Miller, Deputy Director of Workplace Services, describes how he helped coordinate the ITS response.
3:52
So, on Saturday morning, I got a phone call that I needed to jump on an emergency call.
3:59
At that time, we determined that the CrowdStrike issue was a viable issue throughout the state, and we estimated that about 25,000 machines were incapacitated due to the CrowdStrike issue.
4:15
I am responsible for all workplace services staff across the state, about 300, 350 individuals.
4:22
But on a Saturday and into Sunday, it's hard to get all the staff, but I was able to get a hold of quite a few of them, and we started doing some emergency calls once we had the procedures in place in order to get this remediation underway.
4:36
Unfortunately, this was something that I had never seen before: we actually had to go to each machine physically and touch it.
4:44
We couldn't do things remote.
4:45
It was very difficult to do it over a phone call or anything like that.
4:50
So, we had to dispatch staff out to hundreds, if not thousands of locations, in order to remediate a lot of these machines.
4:59
Once I realized how big this task was, I started a Teams channel for information gathering and information dissemination.
5:11
As the hours built throughout Saturday and into Sunday, we started getting more data for staff to access.
5:19
We started determining emergency locations, hospitals, medical facilities, OMH, DOH, OPWDD, areas that had a very heavy medical presence, and we wanted to address those first and get those out as priority sites across the state.
5:41
As time went on through Sunday, we realized that, you know, we needed more people.
5:46
So we started doing more calls, trying to reach out, e-mail, phone messages, text messages, whatever we could do, and started having daily cadences.
5:56
I had two cadences a day.
5:58
I had one at 12:00 and I had one at 4:00.
6:02
The first cadence had about 45 to 60 people on it at noon on Monday.
6:09
At 4:00 on Tuesday, that grew to about 150 to 200 people.
6:15
All of a sudden I'm getting phone calls, I'm getting emails.
6:18
How can I help?
6:19
Didn't matter who it was.
6:21
I had grade fourteens, grade eighteens.
6:25
I had people from network, customer relations managers, their teams, executives. At the time, Jenn Lorenz called me and said, "Where do you need me?"
6:36
"What do you need me to do?"
6:38
So this progressed over the couple days that we were dealing with this.
6:44
Once we kind of got a cadence, the cadence was great because everybody knew: you jumped on the call at noon or you jumped on the call at 4:00.
6:54
You got direction.
6:56
We had a Teams site with all the machines and all the locations that we knew of that people could visit and see and kind of strategize out in the field.
7:07
"Hey, I'm going to go here, here and here" and somebody else would say, "OK, I'm going to go,"
7:12
"You know, these locations." Some of these locations, a lot of windshield time, 30 minutes in the car to go into a location and fix one machine.
7:21
Luckily not all machines were affected.
7:22
So we had about 25,000 we had to deal with.
7:27
The Teams channel was probably the most important thing we did.
7:33
That is what prompted chats, ideas, communication.
7:40
Where are we going, who's going where, where are we at?
7:44
What's our percentages?
7:45
How close are we?
7:46
Because of course, I was being asked quite a bit by Jenn and the CIO and the Chamber, "Where are we?"
7:56
So that was something that we had to track quite a bit.
7:59
And the only way we could do that was through a shared communication channel.
8:03
And we did Teams.
8:04
That Teams site is actually still up.
8:08
Non-WPS staff, that is, staff outside of Workplace Services, started coming on board.
8:13
Everybody wanted to be a part of this, which was pretty incredible.
8:18
People really wanted to help.
8:19
What do I need to do?
8:20
Where do I need to go?
8:23
For some of the things that we had to do, I worked with finance and I worked with some of the other leadership team, and we had to get some of the roadblocks out of the way.
8:34
You know, people were worried about, hey, am I going to get overtime?
8:36
Yes, I got overtime approved right there on the spot.
8:39
Hey, am I going to get mileage?
8:40
Because everybody was using their own vehicles to drive all over the place.
8:44
Unfortunately, OGS has these rules that if you drive over X amount of miles, like I think it's 60 or whatever it is a day, you have to rent a car.
8:52
Well, people aren't going to go out and rent cars to go to all these locations.
8:55
So, you know, part of my job was to try to run block for a lot of these individuals and clear the roads for them.
9:01
So, I did that on the back end, which, you know, gave good accountability to the back-end people that you didn't see, finance, management, who kind of cleared the way for a lot of these things that I spearheaded.
9:19
Everyone was involved. Everybody got into the office.
9:24
I cancelled telecommuting for, I think, that whole week for all 350+ WPS staff across the state, and nobody balked at it.
9:34
Everybody was a team player.
9:35
And this is where I was actually very, very impressed: the ITS staff all stepped up. Everybody stepped up to rectify this problem, and they took kind of a personal pride in it, like, "I'm doing something really good here."
9:55
I'm helping people.
9:56
A lot of people never get out into the field.
9:58
A lot of people don't go to the users.
10:00
A lot of people didn't know...
10:02
we have over 2,000 locations in New York State, whether it be residential facilities for OPWDD, halfway houses for OMH, whatever.
10:13
Nobody knew about a lot of these locations, so it was very interesting for them to understand how big the state of New York is and how big the population of users we really support on a daily basis.
10:29
So it was kind of a, you know, wake up call for a lot of them.
10:33
Like wow, we really do a lot for a lot of people.
10:38
As for remediation...
10:41
As we went through this process, we were learning hour by hour by hour.
10:48
What can we do better?
10:49
How can we get these people resolved quicker?
10:52
We started standing up walk-up sites.
10:56
We did it here in the Plaza.
10:57
We did it in Buffalo, we did it in Rochester, and, you know, we got e-mail blasts out to all state users.
11:04
Hey, if you're having a problem, you can go here, you can go there.
11:08
Just walk in with your laptop.
11:09
They'll fix it and you can leave.
11:12
Worked out great.
11:13
We never focused on a single approach.
11:17
We tried to diversify and get as many approaches as possible in order to rectify this situation as soon as possible.
11:26
And like I said, as we moved forward, we found new ways of remediation, found out that, hey, if somebody reboots their machine 8 times, it's going to grab the new package and it's going to be fine.
11:38
We actually didn't have to go there, but staff still had to be on the phone talking to users, calling users, tracking down users.
11:46
So that itself, for 25,000 people, is a challenge.
11:52
But I can't think of a time that I've seen more people come together on the same page all at once with the same goal of getting as many New York State employees back up and running sooner than later.
12:10
And that to me was very impressive.
12:15
You know, I don't use that word lightly or often, but this is where it is.
12:21
Great team, great effort and impressive.
12:25
Director of Client Engagement Boris Ribovski shares the overall experience of the response effort.
12:31
I knew it was going to be an odd day because I woke up to probably 30 or 40 missed messages starting from about 3:00 AM.
12:43
And at the time we had a lot of 24/7 locations because our main agency to remediate was the Department of Health.
12:51
And between all of their locations and specifically the Wadsworth Center, which is required to keep a certain presence, they do all sorts of testing and they have all sorts of stuff that's going on overnight, you know, biology stuff, stuff that I'm probably not qualified to speak to.
13:17
So I look at my phone and I see all these missed messages and just naturally you kind of open up Google News or whatever news program and it is just...the world is down.
13:32
The world is down.
13:35
It was going to be a tough day.
13:39
It took a second to figure out what's going on.
13:42
And I think almost immediately, Ariel Vitello, who was my deputy director at the time, and I connected.
13:55
We got on the line, we got a Teams chat going, trying to put together a response immediately.
14:02
I think I want to throw some shout outs to the wonderful team I had around me, specifically Bart and Bassem.
14:10
I'm sure I'm missing some of my staff, but everybody in what was formerly known as the Zone 18 team and on my CRM team, so James, Dennis and Russ and all of those folks, definitely came through.
14:28
I think we did it differently.
14:30
So I know there was a main call going on for everybody all at once, and we had in DOH alone probably 3,000 machines.
14:42
So there were machines that we needed to go and address immediately because, like I said, they were in the middle of running some scientific program or they were gene sequencing or something that was public safety or health related.
14:57
So immediately we sort of engaged the agency and we said, "Look, we can't have bodies everywhere. We're going to go."
15:08
And you know, I went with my team and we immediately started going through the Axelrod Institute and the other Wadsworth locations in the Corning Tower.
15:24
But we knew that a general response is needed, especially on Monday.
15:28
Come Monday, everybody was going to come in.
15:29
We did whatever we could over the weekend, between access, who needed what, and setting priorities.
15:37
But come Monday, we needed to have a serious response.
15:41
So, Bart actually took the initiative of putting together some great training slides almost immediately. It was a process.
15:51
Maybe it wasn't the easiest process, but it was a process we could replicate over and over.
15:56
And we did a few test trainings. We did a few trainings for other folks in ITS.
16:02
And then we set up a Teams channel with all of our volunteers from Department of Health.
16:07
And these were just regular people.
16:09
These are folks that maybe considered themselves like a little bit into IT or maybe some of them were very computer literate.
16:17
But generally these were your day-to-day staff and folks just joined up.
16:26
They really came through for us, and they would be putting in machine serial numbers:
16:32
Hey, can you look this up?
16:33
Give us a key.
16:34
We had a whole process for them to follow.
16:37
There was a video they could watch and that helped out tremendously.
16:45
Alongside that, we got a lot of support from ITS, and we got a tremendous amount of support from the e-discovery team.
16:53
So, shout out to Scott Geer and his entire team.
16:58
And they really came through.
16:59
They gave us some bodies.
17:01
And so Monday morning, I think it was like 7:00 or 8:00, we had a room, I forgot what it's called, but it's right there before you walk into the tower, into the Corning Tower, all set up, staffed.
17:21
Some of the network guys came through.
17:22
They put in some access points for us to be able to get to the servers to pull down the keys, and folks would just line up.
17:34
We had a triage out front.
17:37
Everybody was happy, people were understanding.
17:40
And at the same time, for the folks already in the tower, we had what was formerly called the Zone 18 workplace services team, managed by Bassem.
17:54
He was running it for the folks that were already inside the tower.
17:57
So, people were just coming up with laptops, and we did have some advantages.
18:01
So Department of Health is mostly laptops.
18:04
It's not 100%, but more so than not.
18:07
So people were able to bring it.
18:10
And at the same time, it's an agency that's used to working from home.
18:15
So a lot of people had their hardware with them.
18:19
So they were able to come straight to us.
18:21
They had a lot of interesting pieces of hardware because a lot of them were purchased with funds that aren't ITS funds.
18:27
So, some of them had a different BIOS, some of them had a different configuration just in general.
18:33
But we were able to figure that out, and a lot of that goes to the cooperation between the agency and our wonderful staff.
18:46
And if there's somebody I missed in the shout outs, it's my first podcast and the mind's racing, you guys can understand.
18:55
So that was day one.
18:56
And I thought, you know, folks did really well on day one.
19:00
People brought us doughnuts.
19:03
I mean, it was a collaborative issue.
19:05
I think people understood.
19:06
Everybody saw it on the news and people were very understanding and it was pleasant, almost exciting.
19:14
OK, everybody wishes it hadn't happened, but we really stepped up as an organization, as the state. But that was just day one.
19:25
So we kept that triage room, basically a no-questions-asked, bring-it-and-we'll-fix-it type of room, running for a couple of days.
19:35
But on top of that, we kept that group chat going and it kept growing and more and more people started stepping up.
19:43
And it was interesting because I remember this one lady, she was like a Ph.D., she was not IT at all, but she felt like she wanted to help, and she really got the process down.
20:00
And for several days she covered her locations where her staff might be, and I was really impressed.
20:08
And we kept that hotline open for months because...
20:13
You know, somewhere in the back there was some computer, or some scientist or some employee was out on leave for months; maybe somebody went on maternity leave and then came back.
20:25
Their machine was still shut down.
20:26
We might have missed it in a sweep.
20:28
So we kept that going and people got to know each other.
20:32
People got to see our faces, which was important, because I think a lot of times it's just somebody that shows up probably when you're not there and like just fixes the issue.
20:43
And then obviously for software installs, a lot of that gets pushed remotely.
20:46
So they don't really see us.
20:48
But this was really an opportunity for us not only to engage initially, but to continue that engagement and get folks to really get to know us and to show them that, hey, you know, jokes aside, we are very competent.
21:02
We can really get stuff accomplished when we put our minds to it.
21:06
We can really get all hands on deck and pull and push resources, and we kept that going for months.
21:14
And it was all in all a very important effort.
21:20
Mike Allen, Deputy Commissioner for Technology at the Department of Labor, describes how the ITS response got crucial agencies back up and running.
21:30
So at 7:00 AM, my phone rings.
21:32
And normally at 7:00 AM, I'm not getting a phone call, but it was right from the executive team at ITS about what was going on with CrowdStrike that had started a couple hours earlier.
21:43
So I had to jump right into action, getting a hold of the commissioner and the executive deputy commissioner at the Department of Labor to give them a high-level view of what was going on and what they would be running into as their staff started rolling in later that morning.
22:00
And that was a difficult thing to do because a lot of things were down.
22:05
DOL, Department of Labor, primarily uses Teams to communicate.
22:10
So to get a hold of somebody with a regular cell phone and regular landline was a challenge.
22:16
But after a lot of phone tag, calling a friend of a friend to get a hold of everyone, we were able to inform the executive team at Department of Labor about what's in store for that day.
22:29
At that point, I was able to pivot to focusing on our people, staff on the ground.
22:37
We were able to rally our workplace services teams to organize them from a "how we're going to tackle this" perspective.
22:46
We have a couple of main offices, in Endicott as well as up here in Albany, which is where the call centers are.
22:54
So large staffing needs are handled there.
22:59
And so we had to start rolling that day, pulling people together, organizing a plan for how we were going to find all the machines that needed to be addressed, and implementing the fixes.
23:16
And the fixes kept changing throughout the day as we better understood the problem.
23:20
So by the end of the day Friday, we knew what our fix was going to be as we got ready for the weekend.
23:26
We...it was an all-hands effort.
23:29
We had developers, programmers, customer care reps; everyone in IT that was available was pulled in.
23:37
We started bright and early with breakfast
23:40
that I brought in for the team.
23:41
So that way we were energized for the day, and we made sure that we got right to it.
23:47
Going through the building floor by floor, room by room, making sure every machine that had a blue screen didn't have a blue screen.
23:56
And so this continued all day throughout Saturday.
23:58
We started again on Sunday, doing the same thing to ensure that staff who were impacted on Friday would be able to start Monday.
24:07
They would be able to get right back to work.
24:09
And then on Monday morning, we set up a triage area right on the floor in Building 12.
24:15
So any staff that had devices at home that were still having problems, they would be able to come in and we would be able to get their machine up and running that day.
24:23
Similarly, we performed other activities throughout the weekend for all devices that we could find, as well.
24:28
Staff traveling to different sites, coordinating with the business to ensure someone's there to open the door for those sites.
24:35
And we were able to really pretty much be 90% or more operational on Monday morning.
24:41
So DOL, Department of Labor, was able to do their mission critical business for the state of New York.
24:47
Following the swift and successful response to the outage, ITS received unsolicited and nearly universal praise from New York State employees.
24:56
Here are a few of our favorite quotes.
24:58
"Thank you for your timely response and getting everything back on track."
25:02
"You did a great job keeping us informed and checking on us in person."
25:07
"IT Wizards, thank you for all you do...
25:10
We appreciate you."
25:12
Chief Strategy Officer Stacy Panis talks about the days following ITS's response, including lessons learned.
25:18
I think overall our response to the global outage as an agency was exceptional, but there's always things that we can do better.
25:25
I think one of the things that we recognized very early on is that we needed to have a way to communicate without e-mail.
25:32
So, my suggestion would be that you might have to go old school and use a phone tree.
25:37
Also, I think it's important that we know our customers, know their critical business functions, know the systems that support them, and all of that information has to be documented, up to date and available to anyone that needs them during a crisis.
25:53
In this case, you know, the first thing that we focused on was making sure that the systems that we needed to communicate with our customers and even internally as an agency were up and running.
26:02
And then we focused on the most critical applications for the agencies, focusing as we always do on health and safety.
26:09
I think the other challenge that we encountered, because we were using a lot of volunteers and we were working outside of business hours, was getting access to the agency buildings.
26:20
We needed to make sure that key support staff have access at all times and also that there's a process established for getting others access in case of an emergency.
26:30
So, in general, I would say that you should prepare for the worst, hope for the best, and overcommunicate.
26:35
I think communication is key throughout a crisis at all levels of the organization.
26:39
Don't assume that just because information is being provided to the executives that it's filtering down to the lower levels and the boots on the ground because that is not always the case.
26:48
We're very proud of ITS's
26:50
response to this crisis.
26:51
It was really a team effort with volunteers throughout the agency asking how they could help and taking on a lot of tasks that were outside of their regular job duties.
27:00
Crisis response and helping in these types of situations is one of the things that ITS does best, and our efforts were definitely recognized and appreciated by the governor and all of our customer agencies.
27:11
Thank you for listening to the IT'S a Tech Podcast.
27:14
For more information about ITS, visit our website at its.ny.gov.