COVID-19 Facts and Fiction
As someone who has been working with data professionally for many years now, I’ve struggled with general misconceptions as well as my own understanding of Probability and Statistics. Now, Stats are like Quantum Mechanics: if anyone tells you they understand them, they are most definitely full of it. Humans exist in a macro world of single realities, which conflicts fundamentally with the underlying maze of overlapping probabilities. So naturally it’s hard for us to get a feel for what the math really means. Yes, we can go through the motions, we can plot charts, we can run models and statistical tests, etc., but at the end of the day, what matters is the interpretation. If you don’t get what it actually means, what’s the point?
Over the years, I’ve found myself going back to the basics and challenging my own understanding of data over and over, as I wrestled with the difficulty of presenting a technical argument to nontechnical folks. Now, in the face of the Coronavirus pandemic, I’m seeing the same problems popping up everywhere. People across the globe are jumping in and generating an enormous amount of (mis)information using readily available tools that make data “analysis” easier than ever before. But while news sources are starting to compete on the flashiness of their data visualizations and the fanciness of their dashboards, the underlying story is mostly missing.
So let’s try to build the story that is actually relevant, maybe a bit more comforting, and hopefully at least somewhat actionable. To start, let’s make a clear distinction between what we know and what we speculate. And when we speculate, let’s be clear to what extent and with what accuracy. The most basic thing I learned way back in grad school working on analyzing decays of elementary particles was the inherent error in counting statistics. The √N law for errors on counts is one of the most important rules of thumb you can remember. It goes back to Standard Errors and Margin of Error:
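For anyone who wants to play with the √N rule directly, here is a minimal sketch. It assumes plain counting (Poisson) statistics, where a count of N events carries an uncertainty of about √N, so the relative error is 1/√N. The function names are my own, made up for illustration.

```python
import math


def count_error(n: int) -> float:
    """Approximate 1-sigma uncertainty on a raw count of n events (Poisson assumption)."""
    if n < 0:
        raise ValueError("a count cannot be negative")
    return math.sqrt(n)


def relative_error(n: int) -> float:
    """Uncertainty as a fraction of the count itself: sqrt(n) / n = 1 / sqrt(n)."""
    if n <= 0:
        raise ValueError("need at least one observed event")
    return 1.0 / math.sqrt(n)


if __name__ == "__main__":
    # Print a small text table of count, absolute error, and relative error.
    for n in (10, 100, 1_000, 10_000):
        print(f"N = {n:>6}: +/- {count_error(n):6.1f}  ({relative_error(n):.1%})")
```

The takeaway is how slowly things improve: you need 100 times more events to get a 10 times smaller relative error.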
This simple chart should be enough to dismiss any reported statistics (especially ratios, such as the death rate for a disease) in the first days or weeks of a new outbreak. If we think the death rate is anywhere between a fraction of a percent and a few percent, then a sample of a dozen people is completely and absolutely meaningless, not even thinking about all the complications of categorical variables like location and age. Now, we could dive into the rabbit hole here and never come out, but let’s leave it at that for now. I’ll just throw in these charts here to illustrate the erratic nature of measuring a new type of stochastic process for the first time:
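To put a number on "meaningless", here is a rough sketch that computes a 95% confidence interval for a death rate estimated from a small sample. The Wilson score interval is my own choice of method for this illustration (it behaves sensibly for tiny samples and rates near zero), and the function name and the sample sizes in the loop are hypothetical, not real outbreak data.

```python
import math
from typing import Tuple


def wilson_interval(deaths: int, cases: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if cases <= 0 or not 0 <= deaths <= cases:
        raise ValueError("need cases > 0 and 0 <= deaths <= cases")
    p = deaths / cases
    denom = 1 + z * z / cases
    center = (p + z * z / (2 * cases)) / denom
    half = z * math.sqrt(p * (1 - p) / cases + z * z / (4 * cases * cases)) / denom
    # Clamp to [0, 1] to guard against floating-point spill at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical samples: a dozen cases vs. a hundred times as many.
    for deaths, cases in ((0, 12), (1, 12), (12, 1200)):
        lo, hi = wilson_interval(deaths, cases)
        print(f"{deaths}/{cases}: 95% interval {lo:.1%} to {hi:.1%}")
```

With a dozen cases, the interval is far wider than the whole range of rates we are arguing about, which is exactly why early ratios tell you next to nothing.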
The second most important thing about time-varying statistics is the concept of a baseline. Numbers, especially raw totals, are meaningless without a baseline measurement to compare to. Some very basic information is entirely absent from all media reporting on COVID-19 I’ve seen to date. Some questions I wanted to know the answers to before drawing any conclusions were:
- What is the actual number of deaths we expect on any given day?
- What fraction can be attributed to existing illnesses, such as seasonal flu?
- What does a typical flu season look like? Start? Peak? End?
- What is the availability of medical care for sick patients? Beds? ICU beds?
- How do all of those vary geographically and demographically?
Finding references to these types of questions turned out to be very nontrivial… You can find case numbers by any variable you want. Death rates to scare you to, well, death. Quotes from health officials taken out of context. And yes, some 1:1 comparisons, but generally using all the wrong variables. You can’t easily compare a brand new disease to one that comes every year like clockwork and has been studied for a century!
But okay, let’s see what we actually do know. First of all, here’s the baseline death rate I found on MacroTrends:
This by itself does not tell you all that much, but it’s good to know where we generally stand, as well as the recent change in the long-term trend that had been in place since the 80s. You can dive into stats by country to get a better idea of the severity of any new outbreak’s effect on the overall baseline. This also feeds into the question of the availability of hospital care resources. Of course, this is REALLY macro, but you gotta start at the top to see the whole picture.
Next, let’s look at flu in a bit more detail.
This is what the flu looks like from year to year:
Holy crap, that’s frightening! We are talking about 10+% of the US population getting the flu every year, with tens of thousands dying…
And this is how the season usually looks:
So we are generally peaking in the Jan-Mar timeframe and are done by May. This is good to keep in mind as you look at the growth curves of Corona. But also remember we haven’t seen it before and don’t know what to expect. We have barely begun testing on a wide scale, so the reported volumes (as well as the timing of the initial onset) are probably way off…
Now we can start building some basic comparisons. Perhaps the single most defining characteristic of the COVID-19 virus has been its disproportionate likelihood of transmission by, and effect on, adults as opposed to children. The latter are everyone’s favorite engine for the rapid spread of seasonal flu via the daycare and school systems. However, in this case, we are dealing with the “flu for grownups”, it seems. And the more grown up you are, the more danger you are in.
This has huge implications for everything else. The skewed demographic susceptibility combined with the complicated geographic distribution of humans makes for a very challenging case. Italy, for example, has the oldest population in the EU, with a median age nearly a decade above that of the US! It’s hard to draw conclusions from one country and apply them to another without factoring in these effects. The same goes for our local epicenter of the virus in the US, which hit a nursing home in Kirkland, WA. Not putting these highly concentrated pockets of population into perspective makes reporting high-level totals futile and in fact misleading. Just as misleading as reporting death rates early on…
Take a look at this chart from Italy, for example:
The bell-curve flu and the COVID spike look nothing alike… And until we see what the actual distribution for the latter looks like, making predictions is highly subjective and dangerous. There are just too many unknowns at this time. Pinning down that distribution, unfortunately, is itself nontrivial, as it may take several Coronavirus seasons to see the real trend.
So this got me thinking: If there’s so little we know, and the overall numbers are not at the level of the seasonal flu, why is there so much panic? Most importantly, why such grave concern over the ability of the hospital systems around the world to cope with this? If they get hundreds of flu patients a year, shouldn’t they have this down? What’s a few more? Well, that’s the crux of the problem…
The real issue is the combination of intensive care availability and the concentration of severe cases. In general, hospitals have anywhere from one to several dozen ICU beds. When things are “normal” and spread out, like the flu, that’s enough. But when the flu is bad (like it is this year…) and you get the extra bump on top from Corona (especially a very spiky one in some places), that’s when the system breaks down… Take a look at this:
Does not sound like much at first glance, but that’s just about the limit of ICU care availability there… Here’s an example breakdown for a fairly large US hospital, for context:
And more general US numbers:
So that’s the REAL problem with COVID-19. And it’s coming on top of already crappy flu numbers that are stretching the system thin (but that everyone forgot about back in February…)
So with all this, it is my personal (and semi-professional, I guess) opinion that watching the typical metrics that include total cases, new cases, known cases, and various rates based on those is pointless. None of the following really helps make any sort of informed decision, as they lack information, could be orders of magnitude off, and are not viewed in the right context:
Yes, those all look very frightening. But what do they really show??? The one and only metric that I’m paying very close attention to at this time is this:
This is what we do know about the known cases. Given that someone has the virus, how likely is it to be serious? How is that number changing? Now, off of this, you can look at a meaningful death vs recovery rate. And this also tells you how severe the hospital bed shortage is likely to get. If we are lucky, this stays flat and does not curve back up. Fingers crossed!!!
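If you want to track this yourself, a minimal sketch might look like the following. It assumes you have daily counts of active known cases and how many of those are classified as serious. The function name and the numbers in the example are hypothetical placeholders, not real data, and the error bar uses the simple binomial standard error, which is my own simplification.

```python
import math
from typing import Tuple


def serious_fraction(serious: int, active: int) -> Tuple[float, float]:
    """Return (fraction, standard error) of active known cases that are serious."""
    if active <= 0 or not 0 <= serious <= active:
        raise ValueError("need active > 0 and 0 <= serious <= active")
    p = serious / active
    se = math.sqrt(p * (1 - p) / active)  # binomial standard error
    return p, se


if __name__ == "__main__":
    # Hypothetical (serious, active) counts for three consecutive days.
    daily_counts = [(40, 500), (70, 1000), (120, 2000)]
    for day, (serious, active) in enumerate(daily_counts, start=1):
        p, se = serious_fraction(serious, active)
        print(f"day {day}: {p:.1%} serious (+/- {se:.1%})")
```

The error bar matters as much as the fraction: a day-to-day wiggle smaller than the standard error is not a trend worth reacting to.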