Troubleshooting a Network Storm in My Home Network

Published: 2021-03-26
Network Storm Switch Router AP network

March 7th, Sunday. It was supposed to be a day for rest and relaxation.

Around 10 AM

The XiaoAi speaker at home suddenly stopped working, showing “Network Connection Failed”.

I initially thought it was just a minor issue, but after some inspection, I discovered that all networked devices at home had lost connectivity???

Moreover, the LAN was completely down: no two devices could communicate with each other???

This was essentially like going blind, with absolutely no clue where to start!!!

Analyzing the Problem

Based on general patterns of device behavior, it’s quite unlikely that two or more devices would fail simultaneously.

Yet almost every device was behaving as if it were broken. It had rained heavily the day before. Could a lightning strike have damaged the devices?

I quickly dismissed this idea, though: the network had still been working normally half an hour earlier, so the rain definitely wasn’t the cause.

Could the core switch at home have failed? It’s a 16-port gigabit switch that everything hangs off, so if it malfunctioned, every device losing connectivity would be exactly what you’d expect.

So I took out the backup 8-port switch to replace it, but to no avail…

Then I suspected the other 8-port switch, since it was a no-name brand after all, but swapping it out didn’t help either…

Around 12 PM, No Progress

The network had been down for over an hour, with absolutely no clue and no way to pinpoint the issue.

I was starting to lose my cool. This was really frustrating.

It happened to be lunchtime, so I decided to stop, eat, calm down, and reorganize my thoughts before continuing.

Around 2 PM, Minimal System

LD also offered a suggestion: since the network couldn’t be fixed, let’s at least ensure the house has normal internet access first.

That made sense, so I disconnected the home LAN from the telecom modem and used only the backup router + telecom modem, successfully establishing a network connection.

At least there was a temporary WiFi at home now, and my spirits lifted a bit.

But I still needed to continue troubleshooting, otherwise the dozens of networked devices at home would remain offline.

Elimination Method

Since the network was working now, I started replacing devices one by one to test.

I replaced the software router with the backup router, to no avail…

So it wasn’t a software router issue either.

First, I organized and labeled the ethernet cables from each room and access points for easier inspection.

Then, on the working network, I added the backup switch and started plugging in the ethernet cables one by one.

When I plugged in the first few cables, the network still worked normally.

Then I plugged in a few more cables, and it still worked…

However, after 1-2 minutes, the new network went down too, with exactly the same symptoms as before…

Breakthrough

Finally, a breakthrough! I had been focusing my troubleshooting entirely on the low-voltage wiring box at home (where the modem, router, and switches live) and hadn’t considered devices in other rooms.

It must be a device in one of the other rooms causing the network to fail.

Once there’s a breakthrough, things become much easier.

Around 4 PM, Solving the Problem

Now I started checking the cables one by one. After plugging in each cable, I had to ping continuously for a few minutes, and I only moved on once the network was completely stable.
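The "ping for a few minutes before moving on" check can be scripted. Below is a minimal sketch, not the exact procedure used that day: `ping_once` and `is_stable` are hypothetical helper names, and the sketch assumes the system `ping` command is available and accepts `-c` (Linux/macOS) or `-n` (Windows) for the packet count.

```python
import platform
import subprocess
import time


def ping_once(host, timeout_s=2):
    """Send a single ICMP echo with the system `ping`; True if it got a reply."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # give up if ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def is_stable(probe, checks, interval_s=1.0, max_failures=0, sleep=time.sleep):
    """Call probe() `checks` times, pausing `interval_s` between calls.

    Returns False as soon as more than `max_failures` probes have failed,
    otherwise True once all checks are done.
    """
    failures = 0
    for i in range(checks):
        if not probe():
            failures += 1
            if failures > max_failures:
                return False
        if i < checks - 1:
            sleep(interval_s)
    return True
```

For example, after plugging in one cable, `is_stable(lambda: ping_once("192.168.1.1"), checks=180)` would probe an assumed gateway address roughly once per second for about three minutes. Since the fault here only showed up after 1-2 minutes, the window needs to be at least that long.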

After quite some time…

I finally identified which room’s cable had the problem.

I was so happy: I could now restore connectivity for over 80% of the devices at home.

Around 5 PM, Truth Revealed

The remaining work was easier too. Four devices were connected to this cable: a switch, a laptop, a PC, and an AP. I checked each one.

In the end, I pinpointed the issue to the AP…

Problem Summary

Honestly, it wasn’t until I identified the AP as the culprit that I realized this problem was a network storm.

Network storm: I learned this term while working in the HW and DG server rooms. It generally occurs when a new network configuration creates a loop; broadcast frames then circulate around the loop indefinitely and congest the network.
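To make the loop mechanism concrete, here is a toy simulation. It is only a sketch and not a model of the network in this post: `frames_in_flight` and the switch names are made up for illustration, and it assumes at most one link between any two switches and no spanning tree protocol. Each switch floods a broadcast frame out of every port except the one it arrived on, and Ethernet frames carry no hop limit.

```python
from collections import defaultdict


def frames_in_flight(links, origin, hops):
    """Count copies of one broadcast frame on the wire after each hop.

    links  -- list of (switch_a, switch_b) pairs, one link per pair at most
    origin -- switch where the broadcast first enters
    hops   -- number of forwarding rounds to simulate
    """
    neighbors = defaultdict(list)
    for a, b in links:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # A frame in flight is (sending switch, receiving switch).
    frames = [(origin, n) for n in neighbors[origin]]
    history = [len(frames)]
    for _ in range(hops):
        forwarded = []
        for sender, receiver in frames:
            # Flood out of every port except the ingress port.
            for n in neighbors[receiver]:
                if n != sender:
                    forwarded.append((receiver, n))
        frames = forwarded
        history.append(len(frames))
    return history


# Loop-free chain A - B - C: the broadcast dies out at the far end.
print(frames_in_flight([("A", "B"), ("B", "C")], "A", 3))
# → [1, 1, 0, 0]

# Triangle A - B - C - A: two copies chase each other around the loop forever.
print(frames_in_flight([("A", "B"), ("B", "C"), ("C", "A")], "A", 3))
# → [2, 2, 2, 2]
```

In the loop case the count never drops to zero, and every real broadcast (ARP, DHCP, and so on) adds more circulating copies, which is why the whole LAN ends up saturated.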

Who would have thought this could happen to a network that had been working normally for over a year? I really let my guard down.

After solving the problem, I Googled it and found similar cases of AP-induced network storms online…

The culprit AP has been retired. After all, it ruined almost an entire Sunday: two hours in the morning and four in the afternoon, all wasted.

Wired Network Device Organization

APs really do carry half of the home network. An AP failing was something I expected to happen eventually, but an AP causing a network storm was going too far.

I really need to think about how to handle this and prevent it from happening again in the future.