srean 19 hours ago [-]
The way I understand this is that adding random variables is a smoothing operation on their densities (more generally on their distributions, but let me speak of densities only).
A little more formally, additions over random variables are convolutions of their densities. Repeated additions are repeated convolutions.
A single convolution can be understood as a matrix multiplication by a specific symmetric matrix. Repeated convolutions are therefore repeated matrix multiplications.
Anyone familiar with linear algebra will know that repeated matrix multiplication by a non degenerate matrix reveals it's eigenvectors.
The Gaussian distribution is such an eigenvector. Just like an eigenvector, it is also a fixed point -- multiplying again by the same matrix will lead to the same vector, just scaled. The Gaussian distribution convolved is again a Gaussian distribution.
The addition operation in averaging is a matrix multiplication in the distribution space, and the division by the 'total' in the averaging takes care of the scaling.
Linear algebra is amazing.
Pagerank is an eigenvector of the normalised web adjacency matrix. Gaussian distribution is the eigenvector of the infinite averaging matrix. Essentially the same idea.
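The fixed-point claim is easy to check numerically. A sketch (mine, not from the thread; grid resolution and the number of convolutions are arbitrary choices): discretize the uniform density, convolve it with itself repeatedly, and compare the result against a Gaussian of matching variance.

```python
import numpy as np

# Discretize the Uniform(-1/2, 1/2) density on a midpoint grid
dx = 0.01
f = np.ones(100)
f /= f.sum() * dx          # normalize so it integrates to 1

# Repeated convolution = repeated application of the "averaging matrix"
g = f.copy()
for _ in range(7):         # density of the sum of 8 uniforms
    g = np.convolve(g, f) * dx

# Center the output grid (the result is symmetric about its midpoint)
xg = (np.arange(g.size) - (g.size - 1) / 2) * dx

# Compare with a Gaussian of the same variance, 8 * (1/12)
var = 8 / 12
gauss = np.exp(-xg**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
max_err = np.max(np.abs(g - gauss))
assert max_err < 0.02      # already nearly Gaussian after 8 convolutions
```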
kurlberg 15 hours ago [-]
Convolution alone does not smooth. Eg consider a random variable supported on the pts 0 and 1 (delta masses at 2 pts.) No matter how many convolutions you do, you still have support on integers - not smooth at all. You need appropriate rescaling for a gaussian.
Also, convolving a distribution with itself is NOT a linear operation, hence cannot be described by a matrix multiplication with a fixed matrix.
srean 15 hours ago [-]
You are absolutely right. Even edge detection can be written as a convolution. That's why I mention averaging.
I address scaling, very peripherally, towards the end. Of course, depending on how you scale you end up with distinctly different limit laws.
CoastalCoder 19 hours ago [-]
> Anyone familiar with linear algebra will know that repeated matrix multiplication by non degenerate matrices reveals it's eigenvectors.
TIL that I'm not "familiar" with linear algebra ;)
But seriously, thanks for sharing that knowledge.
srean 19 hours ago [-]
If you are not speaking in jest (I strongly suspect you are), knowledge of linear algebra is one of the biggest bang for buck one can get as an investment in mathematical knowledge.
So humble and basic a field. So wide it's consequences and scope.
Sharlin 18 hours ago [-]
Their point was that "familiarity" apparently means different things for different people :P Someone using linalg in computer graphics applications may say they're familiar with it even though they've never heard the term "eigenvector". I'm not actually sure about what you mean – how does repeated multiplication reveal eigenvectors?
srean 18 hours ago [-]
Consider a diagonalizable matrix A. For example, a real symmetric matrix. Start with any vector b and keep multiplying it with A.
A A A ... A b
The vector that the result will converge to is a scaled version of one of the eigenvectors of the matrix A.
But which one ? The one with the largest eigenvalue among all eigenvectors not orthogonal to b.
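A minimal power-iteration sketch of this (mine; the spectrum is chosen by hand so that the top eigenvalue is well separated and convergence is guaranteed):

```python
import numpy as np

rng = np.random.default_rng(1)

# Symmetric matrix with a known spectrum (diagonalizable by construction)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([5.0, 2.0, 1.0, 0.5, 0.1]) @ Q.T

b = rng.standard_normal(5)
for _ in range(100):
    b = A @ b
    b /= np.linalg.norm(b)   # rescale, like dividing by the 'total'

# b has converged (up to sign) to the top eigenvector, Q[:, 0]
align = abs(b @ Q[:, 0])
assert align > 1 - 1e-6
```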
Ah… that "diagonalizable" is doing some heavy lifting there! I was wondering how exactly you’re going to make, say, a rotation matrix to converge anything to anything that’s not already an eigenvector. And rotation matrices certainly aren’t degenerate! Though apparently non-diagonalizable matrices can be called defective which is such a dismissive term :( Poor rotation matrices, why are they dissed so?!
srean 15 hours ago [-]
Love them, those rotation matrices.
Take the logarithm of the (complex) eigenvalues and you get back the angle. This, to me, solidified the notion that angles are essentially a logarithmic notion ... made more rigorous by the notion of exponential maps.
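A quick check of the angle-from-logarithm observation (my own toy example): the eigenvalues of a 2D rotation are e^{+i*theta} and e^{-i*theta}, so the complex logarithm hands back the angle itself.

```python
import numpy as np

theta = 0.7  # radians
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Eigenvalues of a 2D rotation are e^{+i*theta} and e^{-i*theta}
w = np.linalg.eigvals(R)
angles = np.sort(np.log(w).imag)   # the log recovers +/- theta
assert np.allclose(angles, [-theta, theta])
```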
CoastalCoder 19 hours ago [-]
My first sentence was in jest. I've used LA for various things, but haven't had many dealings with eigenvectors. So that information was genuinely new to me.
My expression of gratitude was sincere.
srean 18 hours ago [-]
Understood and thanks for the opportunity of sharing together in the joy of something so amusing.
riffic 16 hours ago [-]
You're doing this multiple times, but "it's" can only mean "it is" or "it has".
srean 16 hours ago [-]
Thanks for the heads up. I meant 'its'.
Phone autocorrect always interferes and I get tired and lazy about correcting it back. It does get it right most of the time.
Sharlin 19 hours ago [-]
Yeah, I don't think this was revealed on my undergrad linalg course, and neither during all my years of using linalg in computer graphics =D
stevenwoo 15 hours ago [-]
I remember my professor talking about eigenvectors in Linear Algebra and it's been 50 years - though I barely remember anything else from that class. It was taught very early on in the course and eventually we used them all the time to solve problems.
Sharlin 15 hours ago [-]
Yes, I was taught about eigenvectors but not that they’re a fixpoint of matmul. At least I don’t think so.
The article doesn't share the actual math, but also not the relatively easy intuition. When you roll a pair of dice, there are more combinations that add up to 7 than any other number. Change the numbers on the dice (change the 1 to a 6, e.g.), there's again more combinations that add up to some numbers than to others. The histogram of the number of combinations that add up to different results is a bell curve. That's why it pops up everywhere you have addition of independent events.
It's sad that even introductory statistics courses skip this simple intuition.
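The two-dice count is verifiable in a few lines (my sketch, not from the article):

```python
from itertools import product
from collections import Counter

# All 36 ordered rolls of two standard dice, tallied by their sum
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))
assert [sums[s] for s in range(2, 13)] == [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
assert max(sums, key=sums.get) == 7   # 7 has the most combinations

# Relabel one die (swap its 1 for an extra 6): the peak moves, but a
# bell-shaped pile of combinations still appears
relabeled = Counter(a + b for a, b in product([2, 3, 4, 5, 6, 6], range(1, 7)))
assert max(relabeled.values()) > min(relabeled.values())
```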
> It's sad that even introductory statistics courses skip this simple intuition.
I was probably lucky.
We got homework as one of the first lessons in statistics course, for exactly this case.
Roll a pair of dice, save the result, do it 200 (or some other bigger number) times, plot the histogram, do some maths, maybe draw some conclusions, etc.
Such things definitely stick with you for a long time.
srean 14 hours ago [-]
And if you take the log of the number of ways you get the entropy corresponding to the 'ways' macrostate.
AlexCornila 15 hours ago [-]
Yeah that is definitely not the relatively easy intuition for this. The relatively easy intuition comes from learning about the Bernoulli trials, binomial distribution and Pascal triangle. Once you understand those you understand why normal distribution is so prevalent.
Or just watch this
https://youtu.be/AwEaHCjgeXk?si=tV72uauquCHvzkNE
estearum 14 hours ago [-]
They explained the intuition in like two sentences that an 8th grader can understand and test themselves.
Sounds simpler than whatever you’re talking about here
cortesoft 13 hours ago [-]
Anybody who has ever played Craps will know this.
srean 19 hours ago [-]
A result of broader applicability is convergence to the infinitely divisible distributions, and more generally the stable distributions.
This applies even when the variance is not finite.
Note that independence and identically distributed samples are not necessary for the Central Limit Theorem to hold. They are sufficient conditions, not necessary ones; they do, however, speed up the convergence a lot.
The Gaussian distribution is a special case of the infinitely divisible distributions and is the most analytically tractable one in that family.
Whereas averaging gives you a Gaussian as long as the original distribution is somewhat benign, the MAX operator also has nice limiting properties. It converges to one of three forms of limiting distributions, Gumbel being one of them.
The general form of the limiting distribution when you take the MAX of a sufficiently large sample is an extreme value distribution.
Very useful for studying record values -- severest floods, world records of 100m sprints, world records of maximum rainfall in a day, etc.
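A quick simulation of the MAX limit (mine; the sample sizes are arbitrary): the maximum of n exponentials, centered by log(n), lands on the standard Gumbel, whose mean is the Euler-Mascheroni constant and whose variance is pi^2/6.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 1_000, 5_000

# MAX of n iid Exponential(1) draws, centered by log(n), converges to
# the standard Gumbel distribution
m = rng.exponential(size=(trials, n)).max(axis=1) - np.log(n)

euler_gamma = 0.5772156649
assert abs(m.mean() - euler_gamma) < 0.08
assert abs(m.var() - np.pi**2 / 6) < 0.25
```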
topaz0 18 hours ago [-]
I think part of why we're much more likely to learn about the iid, finite-variance CLT is that it's a lot easier to prove than the more general ones.
srean 18 hours ago [-]
Yes that's a big part. The proofs get hairier otherwise.
But I think there is more to it, the convergence to Gaussian also gets slower.
In practice, we deal with finite averaging, so speed of convergence matters. For some non-iid case, the convergence may be so slow that the distribution cannot be approximated well by a Gaussian.
mikrl 1 days ago [-]
Great article. Personally I have been learning more about the mathematics of beyond-CLT scenarios (fat tails, infinite variance etc)
The great philosophical question is why CLT applies so universally. The article explains it well as a consequence of the averaging process.
Alternatively, I’ve read that natural processes tend to exhibit Gaussian behaviour because there is a tendency towards equilibrium: forces, homeostasis, central potentials and so on and this equilibrium drives the measurable into the central region.
For processes such as prices in financial markets, with complicated feedback loops and reflexivity (in the Soros sense), the probability mass tends to end up in the non-central region, where the CLT does not apply.
btilly 22 hours ago [-]
The key principle is that you get CLT when a bunch of random factors add. Which happens in lots of places.
In finance, the effects of random factors tend to multiply. So you get a log-normal curve.
As Taleb points out, though, the underlying assumptions behind log-normal break in large market movements. Because in large movements, things that were uncorrelated, become correlated. Resulting in fat tails, where extreme combinations of events (aka "black swans") become far more likely than naively expected.
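A sketch of the multiply-to-log-normal point (mine; the factor range 0.9 to 1.1 is an arbitrary choice): the log of a product is a sum, so the CLT acts on the logs, leaving the product itself right-skewed.

```python
import numpy as np

rng = np.random.default_rng(3)
trials, n = 100_000, 50

# Multiply n positive random factors
prod = rng.uniform(0.9, 1.1, size=(trials, n)).prod(axis=1)

# The product itself is right-skewed ...
zp = (prod - prod.mean()) / prod.std()
assert np.mean(zp**3) > 0.3

# ... but its logarithm looks Gaussian: ~68% of mass within one sd
zl = np.log(prod)
zl = (zl - zl.mean()) / zl.std()
assert abs(np.mean(np.abs(zl) < 1) - 0.6827) < 0.02
```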
srean 19 hours ago [-]
Some correlations are fine though; there are versions of the CLT that apply even when there are benign correlations.
I know you know that and were just simplifying. Just wanted this fact to be better known for practitioners. Your comment on multiplicative processes is spot on.
It's a bit of a shame that these other limiting distributions are not as tractable as the Gaussian.
btilly 11 hours ago [-]
Absolutely. The effect of straightforward correlations is a change in the variance, which can be measured in finance.
The effect of the nonlinear changing correlations is that future global behavior can't be predicted from local observations without a very sophisticated model.
parpfish 1 days ago [-]
As to ye philosophy of “why” the CLT gives you normals, my hunch is that it’s because there’s some connection between:
a) the CLT requires samples drawn from a distribution with finite mean and variance
and b) the Gaussian is the maximum entropy distribution for a particular mean and variance
I’d be curious about what happens if you start making assumptions about higher order moments in the distro
orangemaen 1 days ago [-]
The standard framing defines the Gaussian as this special object with a nice PDF, then presents the CLT as a surprising property it happens to have. But convolution of densities is the fundamental operation. If you keep convolving any finite-variance distribution with itself, the shape converges, and we called the limit "normal." The Gaussian is a fixed point of iterated convolution under √n rescaling. It earned its name by being the thing you inevitably get, not by having elegant closed-form properties.
The most interesting assumptions to relax are the independence assumptions. They're way more permissive than the textbook version suggests. You need dependence to decay fast enough, and mixing conditions (α-mixing, strong mixing) give you exactly that: correlations that die off let the CLT go through essentially unchanged. Where it genuinely breaks is long-range dependence: fractionally integrated processes, Hurst parameter above 0.5, where autocorrelations decay hyperbolically instead of exponentially. There the √n normalization is wrong, you get different scaling exponents, and sometimes non-Gaussian limits.
There are also interesting higher order terms. The √n is specifically the rate that zeroes out the higher-order cumulants. Skewness (third cumulant) decays at 1/√n, excess kurtosis at 1/n, and so on up. Edgeworth expansions formalize this as an asymptotic series in powers of 1/√n with cumulant-dependent coefficients. So the Gaussian is the leading term of that expansion, and Edgeworth tells you the rate and structure of convergence to it.
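The cumulant decay rates are easy to watch in simulation. My sketch (exponential variables chosen because their cumulants are known exactly): the skewness of the mean of n Exponential(1) draws is exactly 2/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(4)

def skew_of_mean(n, trials=100_000):
    # Sample skewness of the standardized mean of n Exponential(1) draws
    means = rng.exponential(size=(trials, n)).mean(axis=1)
    z = (means - means.mean()) / means.std()
    return float(np.mean(z**3))

# Exponential(1) has skewness 2; the mean of n of them has skewness
# exactly 2/sqrt(n) -- the third cumulant decaying at the 1/sqrt(n) rate
for n in (4, 16, 64):
    assert abs(skew_of_mean(n) - 2 / np.sqrt(n)) < 0.1
```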
ramblingrain 1 days ago [-]
It is the not knowing, the unknown unknowns and known unknowns which result in the max entropy distribution's appearance. When we know more, it is not Gaussian. That is known.
mitthrowaway2 1 days ago [-]
Exactly this. From this perspective, the CLT then can be restated as: "it's interesting that when you add up a sufficiently large number of independent random variables, then even if you have a lot of specific detailed knowledge about each of those variables, in the end all you know about their sum is its mean and variation. But at least you do reliably know that much."
D-Machine 1 days ago [-]
Came here basically looking to see this explanation. Normal dist is [approximately] common when summing lots of things we don't understand, otherwise, it isn't really.
sobellian 1 days ago [-]
IIRC there's a video by 3b1b that talks about that, and it is important that gaussians are closed under convolution.
gowld 1 days ago [-]
That makes it an equilibrium point in function space, but the other half is why it's an a global attractor.
pfortuny 22 hours ago [-]
There must be a contractive nature in "passing to the limit". And then Brouwer's fixed point theorem.
(I know it is very easy to do "maths" this way).
derbOac 1 days ago [-]
IIRC the third moment defines a maxent distribution under certain conditions and with a fourth moment it becomes undefined? It's been awhile though.
If I'm remembering it correctly it's interesting to think about the ramifications of that for the moments.
>natural processes tend to exhibit Gaussian behaviour
to me it results from 2 factors - 1. Gaussian is the max entropy distribution for a given variance, and 2. variance is the model of energy-limited behavior, whereas physical processes are always under some energy limits. Basically it is the 2nd law.
wodenokoto 19 hours ago [-]
> the “steadfast order of the universe” that eventually overcame any and all deviations from the bell.
I can’t believe the author wrote that without explaining why it’s called the bell curve.
I find the article spends a lot of time talking about repeating games without really getting to the meat of it.
If you throw a die a million times the result still follows a uniform distribution.
It isn’t until you start summing random events that the normal distribution occurs.
api 18 hours ago [-]
A huge amount of nonfiction writing, especially nonfiction books, feels padded to length.
It’s something I’ve gotten out of AI. Summarize, please, and it’s pretty good at extracting the key ideas.
If I want a story I read fiction, which is writing with a much wider set of objectives than just conveying information and ideas (though it can do that).
xbar 16 hours ago [-]
True and false. You don't need anywhere near a million samples to get a good approximation for your normal distribution. Far fewer than 100 is sufficient (and 14 is a fine place to start if you are really constrained on data and need to get to 90-10).
"Order in Apparent Chaos. I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the 'Law of Frequency of Error.' The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason."
at the (I think) Boston Science Museum when I was a kid. They have some pretty cool videos on Youtube if you're curious.
fiforpg 1 days ago [-]
On opening the article, I was somehow expecting a mention of the large deviations formalism, which was (is?) fashionable in the late 20th century, and gives a nice information-theoretic view of the CLT. Or something like that. There's a ton of deep math there. So having a biostatistician say "look, the CLT is cool" is a bit underwhelming.
Edit: see eg John Baez's write-up What is Entropy? about the entropy maximization principle, where gaussians make an entrance.
causalityltd 23 hours ago [-]
Causes mostly add up: molecular kinetic energies aggregate to temperature, collisions to pressure, imperfections to measurement errors, etc.
So, normal or CLT is the attractor state for the unexceptional world.
BUT for the exceptional world, causes multiply or cascade: earthquake magnitudes, network connectivity, etc. So, you get log-normal or fat-tailed.
bicepjai 1 days ago [-]
This is one of my favorite philosophical questions to ponder. I always ask it in interviews as a warmup to get their thoughts. I’ve noticed that interviewees often curl up, thinking it’s a technical question, so I’ve been modifying the question from one interview to the next to make it less scary. The interviews are for data scientist roles.
Buttons840 1 days ago [-]
I haven't read the article, but my understanding is that a normal curve results from summing several samples from most common probability distributions, and also a normal curve results from summing many normal curves.
All summation roads lead to normal curves. (There might be an exception for weird probability distributions that do not have a mean; I was surprised when I learned these exist.)
Life is full of sums. Height? That's a sum of genetics and nutrition, and both of those can be broken down into other sums. How long the treads last on a tire? That's a sum of all the times the tire has been driven, and all of those times driving are just sums of every turn and acceleration.
I'm not a data scientist. I'm just a programmer that works with piles of poorly designed business logic.
How did I do in my interview? (I am looking for a job.)
srean 10 hours ago [-]
> How did I do in my interview?
You did very well.
But if you haven't had exposure to this either through work experience or through course work it would be unfair to ask this question and use your answer to judge competence.
For a potential coworker role I would certainly be curious about your curiosity but a sharp ended question is not a way to explore that.
abetusk 1 days ago [-]
Say I have N independent and identically distributed random variables with finite mean. Assuming the sum converges to a distribution, what is the distribution they converge to?
Buttons840 1 days ago [-]
A normal distribution.
abetusk 1 days ago [-]
Levy stable [0].
If I had made the extra condition that the random variables had finite variance, you'd be correct. Without the finite variance condition, the distribution is Levy stable.
Levy stable distributions can have finite mean but infinite variance. They can also have infinite mean and infinite variance. Only in the finite mean and finite variance case does it imply a Gaussian.
Levy stable distributions are also called "fat-tailed", "heavy-tailed" or "power law" distributions. In some sense, Levy stable distributions are more normal than the normal distribution. It might be tempting to dismiss the infinite variance condition but, practically, this just means you get larger and larger numbers as you draw from the distribution.
This was one of Mandelbrot's main positions, that power laws were much more common than previously thought and should be adopted much more readily.
As an aside, if you do ever get asked this in an interview, don't expect to get the job if you answer correctly.
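A sketch of the infinite-variance case (mine): the sample mean of standard Cauchy draws is itself standard Cauchy, so averaging never concentrates, however large n gets.

```python
import numpy as np

rng = np.random.default_rng(5)

# Standard Cauchy: a Levy-stable law with infinite mean and variance.
for n in (10, 1_000):
    means = rng.standard_cauchy(size=(10_000, n)).mean(axis=1)
    # Under the CLT the spread would shrink like 1/sqrt(n); instead the
    # interquartile range stays at the standard Cauchy value of 2
    iqr = np.quantile(means, 0.75) - np.quantile(means, 0.25)
    assert abs(iqr - 2.0) < 0.2
```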
It's amazing that you find so many that are uncomfortable with this question. I literally teach a first-year data science course and I ask the students this very question. I spend half a lecture on it and put it in their assessment.
This is one of the most fundamental things to understand in statistics. If you don't have at least some degree of comfort with this, you have no business working with data in a professional capacity.
GuB-42 19 hours ago [-]
You can be comfortable about the concept, but not comfortable about the interview.
The way I understand it, OP asked this as a way to open the conversation, while candidates interpreted it as a math problem to solve, unintentionally getting their mind into "exam" mode.
hilliardfarmer 1 days ago [-]
A lot of times I can't tell if I'm the idiot or if everyone else is. I'd say that this isn't an interesting question at all and the article was horrible. I studied data science for a few years but I'm no expert, but it seems pretty obvious to me that if you make a series of 50/50 choices randomly, that's the shape you end up with, and there's really nothing more interesting about it than that.
alanbernstein 1 days ago [-]
I don't think "obvious" is the right word here. It makes perfect sense when you understand it, but it's not a conclusion that most people could come to immediately without detailed, assisted study.
smcin 24 hours ago [-]
Sampling 50/50 choices would give a binomial distribution that (very crudely) approximates a normal distribution.
But the counterintuitive thing about the CLT is that it applies to distributions that are not normal.
abetusk 1 days ago [-]
Sorry, does the article actually give reasons why the bell curve is "everywhere"?
For simplicity, take N identically distributed random variables that are uniform on the interval from [-1/2,1/2], so the probability distribution function, f(x), on the interval from [-1/2,1/2] is 1.
The Fourier transform of f(x), F(w), is essentially sin(w)/w. Taking only the first few terms of the Taylor expansion, ignoring constants, gives (1-w^2).
Convolution is multiplication in Fourier space, so you get (1-w^2)^n. Squinting, after rescaling w by sqrt(n), (1 - w^2/n)^n ~ exp(-w^2). The Fourier transform of a Gaussian is a Gaussian, so the result holds.
Unfortunately I haven't worked it out myself but I've been told if you fiddle with the exponent of 2 (presumably choosing it to be in the range of (0,2]), this gives the motivation for Levy stable distributions, which is another way to see why fat-tailed/Levy stable distributions are so ubiquitous.
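The squinting step can be checked numerically. My sketch (note numpy's sinc convention, sinc(x) = sin(pi x)/(pi x), hence the 2*pi rescaling): the characteristic function of the rescaled sum of uniforms converges to the Gaussian one.

```python
import numpy as np

# Characteristic function of Uniform(-1/2, 1/2): sin(w/2) / (w/2)
def phi(w):
    return np.sinc(w / (2 * np.pi))

w = np.linspace(-5, 5, 201)
n = 200

# Char. function of the rescaled sum S_n / sqrt(n): phi(w/sqrt(n))^n
lhs = phi(w / np.sqrt(n)) ** n
# Gaussian limit with variance 1/12 (the variance of the uniform)
rhs = np.exp(-w**2 / 24)
assert np.max(np.abs(lhs - rhs)) < 1e-3
```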
woopsn 23 hours ago [-]
There's a paragraph on discovery that multinomial distributions are normal in the limit. The turn from there to CLT is not great, but that's a standard way to introduce normal distributions and explains a myriad of statistics.
WCSTombs 1 days ago [-]
It's not super hard to prove the central limit theorem, and you gave the flavor of one such proof, but it's still a bit much for the likely audience of this article, who can't be assumed to have the math background needed to appreciate the argument. And I think you're on the right track with the comment about stable distributions.
abetusk 1 days ago [-]
The Fourier transform of a uniform distribution is the sinc function which looks like a quadratic locally around 0. Convolution to multiplication is how the quadratic goes from downstairs to upstairs, giving the Gaussian.
Uniform distributions with different widths and centers still have a quadratic dip at the center of their transforms, so the above argument needs only minimal changes.
The added bonus is that if the (1-w^2)^n is replaced by (1-w^a)^n, you can sort of see how to get at the Levy stable distribution (see the characteristic function definition [0]).
The point is that this gives a simple, high-level motivation as to why it's so common. Aside from seeing this flavor of proof in "An Invitation to Modern Number Theory" [1], I haven't really seen it elsewhere (though, to be fair, I'm not a mathematician). I also have never heard the connection of this method to the Levy stable distributions but for someone communicating it to me personally.
I disagree about the audience for Quanta. They tend to be exposed to higher level concepts even if they don't have a lot of in depth experience with them.
A requirement is multiple independent influences. An example of something that shouldn't follow a normal distribution is a single course's grade outcomes: having a teacher and a defined curriculum works against independence. Yes, there is variability in student effort and aptitude. But a top tier university selects a group of students based on some merit, so its student body isn't random. There are airheads who were dragged over the finish line with connections and family money, and some students fall prey to substance abuse and mental illness. I'd argue for a different distribution, recognizing that a skilled teacher can get a class grade distribution centered around at least a B, if not B+ or A-. I feel grading on the curve and limiting A's to a fixed percentage target can encourage bad test design or, worse, bad grading.
FabHK 13 hours ago [-]
> Place a measuring cup in your backyard every time it rains and note the height of the water when it stops: Your data will conform to a bell curve.
That strikes me as unlikely, actually: that the amount of water to fall (per area) across rain showers ("when it stops") is normally distributed. Why would the author think that?
Also, not much of "the math that explains" the CLT in the article. The basic conditions are:
The samples you add together must be
- sufficiently independent
- sufficiently well-behaved in the sense of not having huge outliers (finite variance is good enough for this)
Not sure either condition holds for rainfall.
sayYayToLife 17 hours ago [-]
Okay at my core I'm an inductionist. However this article is a mere tautology at best.
The article doesn't explain why. It explains a bunch of cases and works backwards to show that the original premise was true. This sounds fine but the end of the article specifically mentioned that this is dangerous because the world doesn't always work like this.
This is the problem with induction, it might work in 99% of cases, I've never seen a Black Swan so there must not be any black swans?
Deduction has more value when it comes to math specifically... I'll admit that as an inductionist.
fritzo 1 days ago [-]
Hot take: bell curves are everywhere exactly because the math is simple.
The causal chain is: the math is simple -> teachers teach simple things -> students learn what they're taught -> we see the world in terms of concepts we've learned.
The central limit theorem generalizes beyond simple math to hard math:
Levy alpha stable distributions when variance is not finite, the Fisher-Tippett-Gnedenko theorem and Gumbel/Fréchet/Weibull distributions regarding extreme values. Those curves are also everywhere, but we don't see them because we weren't taught them, because the math is tough.
BobbyTables2 1 days ago [-]
It also took me a little while to realize “least squares” and MMSE approaches were not necessarily the “correct” way to do things but just “one thing we actually know how to do” because everything else is much harder.
We can use Calculus to do so much but also so little…
roenxi 19 hours ago [-]
That isn't the case; mathematicians will do pages of calculations (particularly and especially the statisticians) if they can prove one approach is technically superior to another. These people, as a class, are the crazies who invented matrix multiplication. Something like MMSE is used because it has provably optimal properties for estimating a posterior distribution.
It is certainly possible that there are complex approaches that the statisticians have not discovered or don't teach because they are too complicated, but they had a big fight about which techniques were provably superior early in the discipline's history, and the choices of what got standardised on weren't made because of ease of calculation. It has actually been quite interesting how little interest statisticians have taken in things like the machine learning revolution, since the mathematics all seems pretty amenable to last century's techniques despite orders of magnitude differences in the data being handled.
fritzo 14 hours ago [-]
> optimum properties for estimating a posterior distribution
Circular reasoning: that's true only if the posterior is normal, or if your "optimal" is defined by second moments. In infinite variance cases, the best estimator can be median or an alpha moment for alpha < 2, but yikes the math is much more difficult.
-- A mathematician who has indeed fallen into the beauty trap
roenxi 8 hours ago [-]
> Circular reasoning: that's true only if the posterior is normal, or if your "optimal" is defined by second moments.
That doesn't sound right, it is an error minimising technique. Are we not talking about minimising mean square errors? Why would the posterior need to be normal? And why would optimal need to be defined by 2nd moments?
atrettel 1 days ago [-]
I've often described this as a bias towards easily taught ("teachable") material over more realistic but difficult to teach material. Sometimes teachers teach certain subjects because they fit the classroom well as a medium. Some subjects are just hard to teach in hour-long lectures using whiteboards and slides. They might be better suited to other media, especially self study, but that does not mean that teachers should ignore them.
orangemaen 1 days ago [-]
The CLT is everywhere because convolution/adding independentish random variables is a super common thing to do.
fritzo 14 hours ago [-]
Right. And the CLT is not actually limited to normal distributions. Both of the distribution families I mentioned have central limit theorems of their own. The CLT we first see in school concerns means of finite variance distributions, where the finite variance assumption is made because it makes the math easier.
Most things aren't infinite or extreme, though. Almost by definition, most phenomena aren't extreme phenomena.
D-Machine 1 days ago [-]
No, but when you get into the nitty gritty of most things sometimes being influenced by extremely rare things, and also that the convergence rate of the central limit theorem is not universal at all, then much of the utility (and apparent universality) of the CLT starts to evaporate.
In practice when modeling you are almost always better off not assuming normality, and you want to test models that allow the possibility of heavy tails. The CLT is an approximation, and modern robust methods or Bayesian methods that don't assume Gaussian priors are almost always better models. But this of course calls into question the very universality of the CLT (i.e. it is natural in math, but not really in nature).
fritzo 14 hours ago [-]
Heavy tails are everywhere. Normal distributions have absurdly light tails. Levy alpha stable distributions have power law tails. Power law tails are everywhere.
Some things with heavy tails:
token occurrences
comment thread upvotes
startup IPOs
social follower counts
network latency
github stars
git diffs
power station size
weather events
AndrewKemendo 1 days ago [-]
That’s exactly the right take and the article proves it:
Statisticians love averages so everywhere that could be sampled as a normal distribution will be presented as one
The median is actually more descriptive, and the power law is equally pervasive if not more so
fsckboy 1 days ago [-]
combining repeated samples of any distribution* (any probability density function, including power law distributions) will converge to the normal distribution, that's why it appears everywhere.
* excluding bizarre degenerates like constants or impulse functions
abetusk 24 hours ago [-]
No, that's not correct. Sums of power law distributions can converge to power-law-tailed distributions, not normal distributions.
AndrewKemendo 15 hours ago [-]
No use arguing with them; they don’t have enough mathematical understanding to understand what they’re saying
bandrami 19 hours ago [-]
I flinch at "everywhere", particularly when people keep asserting they are places that they aren't (and in fact can't be). Nothing with a hard zero can be normally distributed, for instance, but people will keep insisting quantities with a hard zero are.
Epa095 19 hours ago [-]
Is this not just a linguistic issue, where people say normally distributed but actually mean approximate or assumed normality? It's not like height is normally distributed (there is nobody 8 feet tall), but it's not like the distribution bears no resemblance to the normal distribution either, and in a colloquial sense the term seems to be used more freely than the mathematically defined term.
nsnzjznzbx 1 days ago [-]
So Abraham de Moivre was the world's first quant?
fedeb95 18 hours ago [-]
Nassim Nicholas Taleb is triggered, then calms down a bit toward the end.
gwern 1 days ago [-]
A little disappointing. All about the history of bell curves, but I don't think it does a very good job explaining why the bell curve appears or the CLT is as it is.
100-year floods are not happening more often in most cases - it is just that the central limit theorem teaches us the 10-year flood brings water almost as high as the 100-year or even 1000-year flood.
thaumasiotes 1 days ago [-]
> it is just that the central limit theorem teaches us the 10-year flood brings water almost as high as the 100-year or even 1000-year flood.
No, the central limit theorem specifically doesn't address that. It says that the sum of iid random variables is well approximated by a normal distribution near the mean; it doesn't tell you how well that approximation works in the tails. The rarer the event you're modeling is, the less relevant the normal approximation is.
> suppose that a large sample of observations is obtained, each observation being randomly produced in a way that does not depend on the values of the other observations, and the average (arithmetic mean) of the observed values is computed. If this procedure is performed many times, resulting in a collection of observed averages, the central limit theorem says that if the sample size is large enough, the probability distribution of these averages will closely approximate a normal distribution.
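That quoted procedure is easy to simulate; the following sketch (sample sizes arbitrary) averages draws from a decidedly non-normal exponential distribution and checks the averages against the normal prediction.

```python
import math
import random
import statistics

random.seed(1)

n, trials = 400, 3000
means = [sum(random.expovariate(1.0) for _ in range(n)) / n
         for _ in range(trials)]

mu, sigma = 1.0, 1.0 / math.sqrt(n)   # CLT prediction for averages of Exp(1)
print(statistics.fmean(means))        # close to 1.0
print(statistics.stdev(means))        # close to 0.05

# A normal distribution puts about 68% of its mass within one sigma:
inside = sum(abs(m - mu) <= sigma for m in means) / trials
print(inside)
```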
abetusk 12 hours ago [-]
That sentence is flat out wrong.
If the probability distribution converges, it converges to a Levy stable distribution [0].
It's not a bad article, but I have to point something out:
> Laplace distilled this structure into a simple formula, the one that would later be known as the central limit theorem. No matter how irregular a random process is, even if it’s impossible to model, the average of many outcomes has the distribution that it describes. “It’s really powerful, because it means we don’t need to actually care what is the distribution of the things that got averaged,” Witten said. “All that matters is that the average itself is going to follow a normal distribution.”
This is not really true, because the central limit theorem requires a huge assumption: that the random process has finite variance. I believe that distributions that don't satisfy that assumption, which we can call heavy-tailed distributions, are much more common in the real world than this discussion suggests. Pointing out that infinities don't exist in the real world is also missing the point, since a distribution that just has a huge but finite variance will require a correspondingly huge number of samples to start behaving like a normal distribution.
Apart from the universality, the normal distribution has a pretty big advantage over others in practice, which is that it leads to mathematical models that are tractable in practice. To go into slightly more detail, in mathematical modeling, often you define some mathematical model that approximates a real-world phenomenon, but which has some unknown parameters, and you want to determine those parameters in order to complete the model. To do that, you take measurements of the real phenomenon, and you find values for the parameters that best fit the measurements. Crucially, the measurements don't need to be exact, but the distribution of the measurement errors is important. If you assume the errors are independent and normally distributed, then you get a relatively nice optimization problem compared to most other things. This is, in my opinion, about as responsible for the ubiquity of normal distributions in mathematical modeling as the universality from the central limit theorem.
However, as most people who solve such problems realize, sometimes we have to contend with these things called "outliers," which by another name are really samples from a heavy-tailed distribution. If you don't account for them somehow, then Bad Things(TM) are likely to happen. So either we try to detect and exclude them, or we replace the normal distribution with something that matches the real data a bit better.
Anyway, to connect this all back to the central limit theorem, it's probably fair to say measurement errors tend to be the combined result of many tiny unrelated effects, but the existence of outliers is pretty strong evidence that some of those effects are heavy-tailed and thus we can't rely on the central limit theorem giving us a normal distribution.
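The tractability point can be made concrete with the closed-form least-squares line fit, which is exactly the maximum-likelihood estimate under i.i.d. Gaussian errors, along with how badly a single outlier can drag it (all numbers below are invented for illustration):

```python
import random
import statistics

random.seed(2)

def ols_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    return a, y_bar - a * x_bar

# True line y = 2x - 1 with small Gaussian measurement noise.
xs = [i / 10 for i in range(100)]
ys = [2.0 * x - 1.0 + random.gauss(0, 0.1) for x in xs]

a_clean, b_clean = ols_line(xs, ys)
print(a_clean, b_clean)   # close to (2.0, -1.0)

# One heavy-tailed "outlier" is enough to drag the least-squares fit around.
ys_bad = ys[:]
ys_bad[-1] += 50.0
a_bad, _ = ols_line(xs, ys_bad)
print(a_bad)              # pulled noticeably away from 2.0
```

Because least squares is linear in the observations, the single corrupted point shifts the slope by a fixed, easily computed amount regardless of the noise, which is precisely why robust alternatives down-weight large residuals.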
abetusk 1 days ago [-]
The fact the article said that is a gross error. You've identified the issue head on.
The sum of independent identically distributed random variables, if they converge at all, converge to a Levy stable distribution (aka fat-tailed, heavy tailed, power law). In this sense, Levy stable distributions are more "normal" than the normal distribution. They also show up with regular frequency all over nature.
As you point out, infinite variance might be dismissed but, in practice, this just ends up getting larger and larger "outliers" as one keeps drawing from the distribution. Infinities are, in effect, a "verb" and so an infinite variance, in this context, just means the distributions spits out larger and larger numbers the more you sample from it.
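A small simulation makes the "infinities are a verb" point tangible: a Pareto with tail index 1.5 has a finite mean but infinite variance, and new record "outliers" keep arriving as you sample (parameters chosen for illustration).

```python
import random

random.seed(3)

# Pareto with shape alpha = 1.5: finite mean, infinite variance.
draws = [random.paretovariate(1.5) for _ in range(100_000)]

# Records keep growing: P(X > 100) = 100**-1.5 = 1e-3 per draw,
# so a run of 100,000 draws routinely contains values far past 100.
print(max(draws[:1000]))   # record after 1,000 draws
print(max(draws))          # record after 100,000 draws
```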
D-Machine 1 days ago [-]
This is also right, I believe: normal distributions are not really ubiquitous, just approximately ubiquitous (and only if "ignoring rare outliers", and if you also close your eyes to all the things we don't actually understand at all).
The point on convergence rates re: the central limit theorem is also a major point otherwise clever people tend to miss, and which comes up in a lot of modeling contexts. Many things which make sense "in the limit" likely make no sense in real world practical contexts, because the divergence from the infinite limit in real-world sizes is often huge.
EDIT: Also from a modeling standpoint, say e.g. Bayesian, I often care about finding out something like the "range" of possible results for (1) a near-uniform prior, (2), a couple skewed distributions, with the tail in either direction (e.g. some beta distributions), and (3) a symmetric heavy-tailed distribution (e.g. Cauchy). If you have these, anything assuming normality is usually going to be "within" the range of these assumptions, and so is generally not anything I would care about.
Basically, in practical contexts, you care about tails, so assuming they don't meaningfully exist is a non-starter. Looking at non-robust stats of any kind today, without also checking some robust models or stats, just strikes me as crazy.
throwaway81523 24 hours ago [-]
Now do power laws.
Heer_J 15 hours ago [-]
[dead]
tsunamifury 1 days ago [-]
Bell curves are everywhere because all distributions of any properties clump in some way at some level. The basics of any probability shows this. The result is you “seeing” bell curves everywhere. Aka clumps.
This is a tautology to the extreme.
abetusk 24 hours ago [-]
No, that's not true.
If sums of independent identically distributed random variables converge to a distribution, they converge to a Levy stable distribution [0]. Tails of the Levy stable distribution are power law, which makes them not Gaussian.
Yes, but really what our brains do is use a Gaussian mixture model to cut up those distributions into more granular bell curves, which we then call "normal". Because we find what we are tuned to find.
E.g. we find bell curves because we look for bell curves. And given infinite resolution we can find them at some granularity.
D-Machine 1 days ago [-]
Yup. And in general more heavy-tailed bumps are in fact better models (assuming normality tends to lead to over-confidence). Really think the universality is strictly mathematical, and actually rare in nature.
jibal 1 days ago [-]
First, every mathematical theorem is a tautology ... don't conflate "tautological" with "obvious".
Second, your "aka" is incorrect --- there is all sorts of clumping that is not a normal distribution.
thaumasiotes 1 days ago [-]
As I'm sure tsunamifury would agree, it is incredibly common for people to label "bell curves" by eyeball, regardless of whether they are normal curves. To most people, "clumping" in a one-dimensional spectrum is all they mean by the phrase "bell curve".
D-Machine 1 days ago [-]
This was sort of my reading as well: I took "clumping" to mean "bump-shaped".
jibal 21 hours ago [-]
This completely misses the point, which is that the central limit theorem says that it isn't just any old clumping, it's always the normal distribution. tsunamifury dismissed this strong finding as "tautology" because clumping is obvious ... but that it's always precisely a bell curve is far from obvious. Again,
> your "aka" is incorrect --- there is all sorts of clumping that is not a normal distribution.
That it's "incredibly common for people to label "bell curves" by eyeball, regardless of whether they are normal curves" is not just not relevant, it's anti-relevant ... the central limit theorem says that the distribution of the means is always a bell curve--a normal distribution--not merely a "bell curve".
Anyway, this is covered in far more detail in other comments and material elsewhere, so this is my last contribution.
thaumasiotes 12 hours ago [-]
> the central limit theorem says that the distribution of the means is always a bell curve--a normal distribution--not merely a "bell curve"
It doesn't say that. And it shouldn't, because that isn't true.
tsunamifury 12 hours ago [-]
Wow, aside from the fact that none of that support is in the article, it still boils down to:
Normal curves are everywhere normal curves are -- which is an observational tautology -- and a fundamental overlay on our observation of "stuff". You're dismissive as if I'm some illiterate, but you'd be surprised at the contributions to math I've made to the world.
tsunamifury 1 days ago [-]
[dead]
DroneBetter 1 days ago [-]
I hate Quanta a lot
a vast amount of fluff for less than a college statistics professor would (hopefully) be able to impart with a chalkboard in 10 minutes, when Quanta has the ability to prepare animated diagrams like 3Blue1Brown but chooses not to use it
they could go down myriad paths, like how it proves that random walks on square lattices are asymptotically isotropic, or give any other simple easy-to-understand applications (like getting an asymptotic on the expected # of rolls of an n-sided die before the first recurring face) or explain what a normal distribution is, but they only want to tell a story to convey a feeling
they are a blight upon this world for not using their opportunity to further public engagement in a meaningful way
andyjohnson0 19 hours ago [-]
I probably don't have your mathematical sophistication - but I like and appreciate Quanta precisely because it helps people like me to understand a little bit about challenging things. This enriches my tiny life, and I hope it also makes the world a fractionally better place for us all.
Perhaps you're just not in their intended audience?
KnuthIsGod 1 days ago [-]
3Blue1Brown
Seems a bit like Ted Talks.
Lightweight popcorn for the simple minded.
tptacek 1 days ago [-]
A lot of times on HN when a math topic comes up that isn't about 3b1b, someone will jump in to say "this isn't as good as 3b1b". Last time I saw that, I was moved to comment:
3b1b doesn't have the same goal as Quanta, or as introductory guides. It's actually not that great a teaching tool (it's truly great at what it is for, which is (a) appreciation and motivation, and (b) allowing people to signal how smart they are on message board threads by talking about how much people would get out of watching 3b1b).
This is prose writing about math. It's something you're meant to read for enjoyment. If you don't enjoy it, fine; I don't enjoy cowboy fiction. So I don't read it. I don't so much look for opportunities to yell at how much I hate "The Ballad of Easy Breezy".
bmenrigh 1 days ago [-]
I don’t fault Quanta (or 3b1b) for being the way they are. Each is serving their goal audience pretty well.
My complaint is only that there should be a dozen more just like them, each competing with the others for the best, most engaging math and science content. This would allow a broader range of audience skill levels to be reached.
As it stands, we’re lucky even to have Quanta and 3b1b.
I think there is hope though, quite a few new-ish creators on YouTube are following in Grant’s footsteps and producing very technically detailed and informative content at similar quality levels.
paulpauper 1 days ago [-]
There is no getting around the fact that learning math requires actually having to buckle down and read and do math. A video will not suffice.
tptacek 1 days ago [-]
Couldn't agree more, which is why I think it's odd to suggest that a pop-sci magazine article is somehow a disservice that 3b1b would correct.
DroneBetter 1 days ago [-]
well for one who does buckle down and read and do math, the expected amount of new information brought to them by a 3B1B video as supplementary material upon a topic (with the normal distribution being one that admits a direct comparison from the article) is nonzero, by merit of it possibly having ideas to convey from outside their usual purview and formal background that may be applicable to the doing of math (as has been the case for me, someone who [does math](https://oeis.org/wiki/User:Natalia_L._Skirrow)), while for Quanta fluff pieces it's zero.
by the metric of "if this expository piece were to be taken to a time before its subject had been considered and presented to researchers, how useful would its outline be towards reproducing the theory in its totality," Quanta's writings (on both classical and research math) mostly score 0
throwaway81523 22 hours ago [-]
Quanta used to have tons of good stuff and not much crap. Now there's enough crap that if there's still good stuff, it gets lost in the noise.
A little more formally, additions over random variables are convolutions of their densities. Repeated additions are repeated convolutions.
A single convolution can be understood as a matrix multiplication by a specific symmetric matrix. Repeated convolutions are therefore repeated matrix multiplications.
Anyone familiar with linear algebra will know that repeated matrix multiplication by a non-degenerate matrix reveals its eigenvectors.
The Gaussian distribution is such an eigenvector. Just like an eigenvector, it is also a fixed point -- multiplying again by the same matrix will lead to the same vector, just scaled. The Gaussian distribution convolved is again a Gaussian distribution.
The addition operation in averaging is a matrix multiplication in the distribution space, and the division by the 'total' in the averaging takes care of the scaling.
Linear algebra is amazing.
Pagerank is an eigenvector of the normalised web adjacency matrix. Gaussian distribution is the eigenvector of the infinite averaging matrix. Essentially the same idea.
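The fixed-kernel version of this picture is easy to demo: convolving repeatedly with one fixed density is a linear map on the evolving vector, and iterating it washes everything into a Gaussian shape. Below (numbers purely illustrative), a fair-die pmf is convolved with a fixed die kernel; the pmf of the sum of 20 dice already hugs the matching Gaussian.

```python
import math

def convolve(p, q):
    """Discrete convolution of two probability vectors (a linear map in p)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

# Repeatedly convolve with one fixed kernel: a fair die on values 0..5.
die = [1 / 6] * 6
pmf = die
for _ in range(19):          # pmf of the sum of 20 dice
    pmf = convolve(pmf, die)

# Compare with the Gaussian of the same mean and variance
# (per-die mean 2.5, per-die variance 35/12).
mean, var = 20 * 2.5, 20 * 35 / 12
gauss = [math.exp(-(i - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
         for i in range(len(pmf))]
err = max(abs(p - g) for p, g in zip(pmf, gauss))
print(err)   # tiny: the convolved pmf already sits on the Gaussian curve
```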
Also, convolving a distribution with itself is NOT a linear operation, hence cannot be described by a matrix multiplication with a fixed matrix.
I address scaling, very peripherally, towards the end. Of course, depending on how you scale you end up with distinctly different limit laws.
TIL that I'm not "familiar" with linear algebra ;)
But seriously, thanks for sharing that knowledge.
So humble and basic a field. So wide its consequences and scope.
But which one? The one with the largest eigenvalue among all eigenvectors not orthogonal to b.
https://en.wikipedia.org/wiki/Power_iteration
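A minimal power-iteration sketch (toy 2x2 matrix, not from the article) showing repeated multiplication converging to the dominant eigenvector:

```python
import math

def power_iteration(matrix, v, steps=100):
    """Repeatedly apply `matrix` to `v`, renormalizing at each step."""
    for _ in range(steps):
        v = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Symmetric matrix with eigenvalues 3 and 1; the eigenvalue-3 eigenvector
# is (1, 1)/sqrt(2).
A = [[2.0, 1.0], [1.0, 2.0]]
v = power_iteration(A, [1.0, 0.0])
print(v)   # converges to (0.7071..., 0.7071...)
```

The component along the subdominant eigenvector shrinks by a factor of 1/3 per step, so after 100 steps it is gone to machine precision.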
Take the logarithm of the eigenvalues and you get back the angle. This, to me, solidified the notion that angles are essentially a logarithmic notion ... made more rigorous by the notion of exponential maps.
My expression of gratitude was sincere.
Phone autocorrect always interferes and I get tired and lazy about correcting it back. It does get it right most of the time.
https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distributio...
I was probably lucky.
We got homework as one of the first lessons in a statistics course, for exactly this case.
Roll a pair of dice, save the result, do it 200 (or some other bigger number) times, plot the histogram, do some maths, maybe provide any conclusions, etc.
Such things definitely stick with you for a long time.
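That homework is a few lines of Python today (200 rolls, as in the assignment):

```python
import random
from collections import Counter

random.seed(4)

# Roll a pair of dice 200 times and print a text histogram of the totals.
rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(200)]
hist = Counter(rolls)
for total in range(2, 13):
    print(f"{total:2d} {'#' * hist.get(total, 0)}")
```

The bump in the middle is the triangular pmf of two dice, the first step on the way to the bell shape.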
Sounds simpler than whatever you’re talking about here
https://en.wikipedia.org/wiki/Infinite_divisibility_(probabi...
https://en.wikipedia.org/wiki/Stable_distribution
This applies even when the variance is not finite.
Note that independence and identical distribution are not necessary for the Central Limit Theorem to hold. They are a sufficient condition, not a necessary one; however, they do speed up the convergence a lot.
The Gaussian distribution is a special case of an infinitely divisible distribution and is the most analytically tractable one in that family.
Whereas averaging gives you a Gaussian as long as the original distribution is somewhat benign, the MAX operator also has nice limiting properties. It converges to one of three forms of limiting distributions, Gumbel being one of them.
The general form of the limiting distributions when you take MAX of a sufficiently large sample are the extreme value distributions
https://en.wikipedia.org/wiki/Generalized_extreme_value_dist...
Very useful for studying record values -- severest floods, world records of 100m sprints, world records of maximum rainfall in a day etc
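As a sketch of that limit (sizes arbitrary): the maximum of n Exp(1) draws, shifted by log n, is approximately Gumbel, whose mean is the Euler-Mascheroni constant.

```python
import math
import random
import statistics

random.seed(5)

n, trials = 500, 1000
maxima = [max(random.expovariate(1.0) for _ in range(n)) - math.log(n)
          for _ in range(trials)]

# Gumbel limit: mean ~ 0.5772 (Euler-Mascheroni), variance ~ pi^2/6.
print(statistics.fmean(maxima))
print(statistics.variance(maxima))
```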
But I think there is more to it: the convergence to the Gaussian also gets slower.
In practice, we deal with finite averaging, so speed of convergence matters. For some non-iid case, the convergence may be so slow that the distribution cannot be approximated well by a Gaussian.
The great philosophical question is why the CLT applies so universally. The article explains it well as a consequence of the averaging process.
Alternatively, I’ve read that natural processes tend to exhibit Gaussian behaviour because there is a tendency towards equilibrium: forces, homeostasis, central potentials and so on and this equilibrium drives the measurable into the central region.
For processes such as prices in financial markets, with complicated feedback loops and reflexivity (in the Soros sense), the probability mass tends to end up in the non-central region, where the CLT does not apply.
In finance, the effects of random factors tend to multiply. So you get a log-normal curve.
As Taleb points out, though, the underlying assumptions behind log-normal break in large market movements. Because in large movements, things that were uncorrelated, become correlated. Resulting in fat tails, where extreme combinations of events (aka "black swans") become far more likely than naively expected.
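The log-normal mechanism is just the CLT in the log domain, which a short simulation illustrates (return sizes invented for the sketch):

```python
import math
import random
import statistics

random.seed(6)

# A "price" built from many small independent multiplicative returns.
def final_price(k=500):
    p = 1.0
    for _ in range(k):
        p *= 1.0 + random.uniform(-0.05, 0.05)   # random +/-5% moves
    return p

prices = [final_price() for _ in range(2000)]
logs = [math.log(p) for p in prices]

# log(price) is a sum of i.i.d. terms, so by the CLT it is ~normal;
# the price itself is then approximately log-normal (right-skewed).
print(statistics.fmean(logs), statistics.stdev(logs))
```

Note the slightly negative mean of the logs: E[log(1+r)] < 0 even when E[r] = 0, the usual volatility drag.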
https://en.wikipedia.org/wiki/Central_limit_theorem#Dependen...
I know you know that and were just simplifying. Just wanted this fact to be better known for practitioners. Your comment on multiplicative processes is spot on.
I say more here
https://news.ycombinator.com/item?id=47437152
It's a bit of a shame that these other limiting distributions are not as tractable as the Gaussian.
The effect of the nonlinear changing correlations is that future global behavior can't be predicted from local observations without a very sophisticated model.
a) the CLT requires samples drawn from a distribution with finite mean and variance
and b) the Gaussian is the maximum entropy distribution for a particular mean and variance
I’d be curious about what happens if you start making assumptions about higher-order moments in the distro
The most interesting assumptions to relax are the independence assumptions. They're way more permissive than the textbook version suggests. You need dependence to decay fast enough, and mixing conditions (α-mixing, strong mixing) give you exactly that: correlations that die off let the CLT go through essentially unchanged. Where it genuinely breaks is long-range dependence: fractionally integrated processes, Hurst parameter above 0.5, where autocorrelations decay hyperbolically instead of exponentially. There the √n normalization is wrong, you get different scaling exponents, and sometimes non-Gaussian limits.
There are also interesting higher order terms. The √n is specifically the rate that zeroes out the higher-order cumulants. Skewness (third cumulant) decays at 1/√n, excess kurtosis at 1/n, and so on up. Edgeworth expansions formalize this as an asymptotic series in powers of 1/√n with cumulant-dependent coefficients. So the Gaussian is the leading term of that expansion, and Edgeworth tells you the rate and structure of convergence to it.
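The 1/√n decay of skewness is easy to watch empirically (exponential summands chosen because their skewness after summing n terms, 2/√n, is known in closed form):

```python
import random
import statistics

random.seed(7)

def sample_skewness(xs):
    m, s = statistics.fmean(xs), statistics.stdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

def skew_of_sums(n, trials=4000):
    """Empirical skewness of sums of n Exp(1) draws (theory: 2/sqrt(n))."""
    sums = [sum(random.expovariate(1.0) for _ in range(n)) for _ in range(trials)]
    return sample_skewness(sums)

s4, s100 = skew_of_sums(4), skew_of_sums(100)
print(s4)     # theory: 2/sqrt(4)   = 1.0
print(s100)   # theory: 2/sqrt(100) = 0.2
```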
(I know it is very easy to do "maths" this way).
If I'm remembering it correctly it's interesting to think about the ramifications of that for the moments.
To me it results from 2 factors: 1. the Gaussian is the max entropy distribution for a given variance, and 2. variance is the model of energy-limited behavior, whereas physical processes are always under some energy limits. Basically it is the 2nd law.
I can’t believe the author wrote that without explaining why it’s called the bell curve.
I find the article spends a lot of time talking about repeating games without really getting to the meat of it.
If you throw a die a million times, the result still follows a uniform distribution.
It isn’t until you start summing random events that the normal distribution occurs.
It’s something I’ve gotten out of AI. Summarize, please, and it’s pretty good at extracting the key ideas.
If I want a story I read fiction, which is writing with a much wider set of objectives than just conveying information and ideas (though it can do that).
"Order in Apparent Chaos.-I know of scarcely any-, thing so apt to impress the imagination as the wonderful form of cosmic order expressed by the " Law of Frequency of Error." The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and • in complete self-effacement amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason."
https://galton.org/cgi-bin/searchImages/galton/search/books/...
https://en.wikipedia.org/wiki/Galton_board
at the (I think) Boston Science Museum when I was a kid. They have some pretty cool videos on Youtube if you're curious.
Edit: see eg John Baez's write-up What is Entropy? about the entropy maximization principle, where gaussians make an entrance.
BUT for the exceptional world, causes multiply or cascade: earthquake magnitudes, network connectivity, etc. So, you get log-normal or fat-tailed.
All summation roads lead to normal curves. (There might be an exception for weird probability distributions that do not have a mean; I was surprised when I learned these exist.)
Life is full of sums. Height? That's a sum of genetics and nutrition, and both of those can be broken down into other sums. How long the treads last on a tire? That's a sum of all the times the tire has been driven, and all of those times driving are just sums of every turn and acceleration.
I'm not a data scientist. I'm just a programmer that works with piles of poorly designed business logic.
How did I do in my interview? (I am looking for a job.)
You did very well.
But if you haven't had exposure to this either through work experience or through course work it would be unfair to ask this question and use your answer to judge competence.
For a potential coworker role I would certainly be curious about your curiosity but a sharp ended question is not a way to explore that.
If I had made the extra condition that the random variables had finite variance, you'd be correct. Without the finite variance condition, the distribution is Levy stable.
Levy stable distributions can have finite mean but infinite variance. They can also have infinite mean and infinite variance. Only in the finite mean and finite variance case does it imply a Gaussian.
Levy stable distributions are also called "fat-tailed", "heavy-tailed" or "power law" distributions. In some sense, Levy stable distributions are more normal than the normal distribution. It might be tempting to dismiss the infinite variance condition but, practically, this just means you get larger and larger numbers as you draw from the distribution.
This was one of Mandelbrot's main positions, that power laws were much more common than previously thought and should be adopted much more readily.
As an aside, if you do ever get asked this in an interview, don't expect to get the job if you answer correctly.
[0] https://en.wikipedia.org/wiki/L%C3%A9vy_distribution
This is one of the most fundamental things to understand in statistics. If you don't have at least some degree of comfort with this, you have no business working with data in a professional capacity.
The way I understand it, OP asked this as a way to open the conversation, while candidates interpreted it as a math problem to solve, unintentionally getting their mind into "exam" mode.
But the counterintuitive thing about the CLT is that it applies to distributions that are not normal.
For simplicity, take N identically distributed random variables that are uniform on the interval from [-1/2,1/2], so the probability distribution function, f(x), on the interval from [-1/2,1/2] is 1.
The Fourier transform of f(x), F(w), is essentially sin(w)/w. Taking only the first few terms of the Taylor expansion, ignoring constants, gives (1-w^2).
Convolution is multiplication in Fourier space, so you get (1-w^2)^n. Squinting, (1-w^2)^n ~ (1-n w^2 / n)^n ~ exp(-n w^2). The Fourier transform of a Gaussian is a Gaussian, so the result holds.
Unfortunately I haven't worked it out myself but I've been told if you fiddle with the exponent of 2 (presumably choosing it to be in the range of (0,2]), this gives the motivation for Levy stable distributions, which is another way to see why fat-tailed/Levy stable distributions are so ubiquitous.
Widths of different uniform distributions along with different centers all still have a quadratic center, so the above argument only needs to be minimally changed.
The added bonus is that if the (1-w^2)^n is replaced by (1-w^a)^n, you can sort of see how to get at the Levy stable distribution (see the characteristic function definition [0]).
The point is that this gives a simple, high-level motivation as to why it's so common. Aside from seeing this flavor of proof in "An Invitation to Modern Number Theory" [1], I haven't really seen it elsewhere (though, to be fair, I'm not a mathematician). I also have never heard the connection of this method to the Levy stable distributions but for someone communicating it to me personally.
I disagree about the audience for Quanta. They tend to be exposed to higher level concepts even if they don't have a lot of in depth experience with them.
[0] https://en.wikipedia.org/wiki/Stable_distribution#Parametriz...
[1] https://www.amazon.com/Invitation-Modern-Number-Theory/dp/06...
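Keeping the constants that the sketch above drops, the characteristic function of Uniform(-1/2, 1/2) is sin(w/2)/(w/2), and its n-th power (the n-fold convolution, seen in Fourier space) can be checked numerically against the Gaussian characteristic function exp(-n*var*w^2/2) with var = 1/12:

```python
import math

def phi_uniform(w):
    """Characteristic function of Uniform(-1/2, 1/2)."""
    return 1.0 if w == 0 else math.sin(w / 2) / (w / 2)

n = 200
pairs = []
for w in [0.05, 0.1, 0.2]:
    exact = phi_uniform(w) ** n         # n-fold convolution, in Fourier space
    gauss = math.exp(-n * w * w / 24)   # Gaussian char. fn. of variance n/12
    pairs.append((exact, gauss))
    print(w, exact, gauss)
```

Even at n = 200 the two agree to several decimal places near w = 0, which is exactly the "near the mean" regime where the CLT approximation is good.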
That strikes me as unlikely, actually: that the amount of water to fall (per area) across rain showers ("when it stops") is normally distributed. Why would the author think that?
Also, not much of "the math that explains" the CLT in the article. The basic conditions are:
The samples you add together must be
- sufficiently independent
- sufficiently well-behaved in the sense of not having huge outliers (finite variance is good enough for this)
Not sure either condition holds for rainfall.
The article doesn't explain why. It explains a bunch of cases and works backwards to show that the original premise was true. This sounds fine but the end of the article specifically mentioned that this is dangerous because the world doesn't always work like this.
This is the problem with induction, it might work in 99% of cases, I've never seen a Black Swan so there must not be any black swans?
Deduction has more value when it comes to math specifically... I'll admit that as an inductionist.
The causal chain is: the math is simple -> teachers teach simple things -> students learn what they're taught -> we see the world in terms of concepts we've learned.
The central limit theorem generalizes beyond simple math to hard math: Levy alpha stable distributions when variance is not finite, the Fisher-Tippett-Gnedenko theorem and Gumbel/Fréchet/Weibull distributions regarding extreme values. Those curves are also everywhere, but we don't see them because we weren't taught them because the math is tough.
We can use Calculus to do so much but also so little…
It is certainly possible that there are complex approaches that the statisticians have not discovered or don't teach because they are too complicated, but they had a big fight about which techniques were provably superior early in the discipline's history, and the choices of what got standardised on weren't made because of ease of calculation. It has actually been quite interesting how little interest the statisticians seem to be taking in things like the machine learning revolution, since the mathematics all seems pretty amenable to last century's techniques despite orders-of-magnitude differences in the data being handled.
Circular reasoning: that's true only if the posterior is normal, or if your "optimal" is defined by second moments. In infinite variance cases, the best estimator can be median or an alpha moment for alpha < 2, but yikes the math is much more difficult.
-- A mathematician who has indeed fallen into the beauty trap
That doesn't sound right, it is an error minimising technique. Are we not talking about minimising mean square errors? Why would the posterior need to be normal? And why would optimal need to be defined by 2nd moments?
https://en.wikipedia.org/wiki/Central_limit_theorem#The_gene...
He has several other related videos also.
https://www.youtube.com/@3blue1brown/search?query=convolutio...
What are "most cases"?
If the probability distribution converges, it converges to a Levy stable distribution [0].
[0] https://en.wikipedia.org/wiki/Stable_distribution
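As a sanity check (illustrative, stdlib only): the Cauchy distribution is the classic stable example with no mean. For a normal distribution the spread of the sample mean shrinks like 1/sqrt(n); for the Cauchy, the mean of n iid samples is again standard Cauchy, so averaging more samples does not concentrate it at all.

```python
import math, random, statistics

random.seed(0)

def cauchy():
    # Standard Cauchy via the inverse CDF: tan(pi * (U - 1/2))
    return math.tan(math.pi * (random.random() - 0.5))

def iqr(xs):
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

# The IQR of a standard Cauchy is 2, and it stays ~2 for the mean of
# n iid Cauchy samples no matter how large n gets.
means_by_n = {}
for n in (10, 1000):
    means_by_n[n] = [statistics.fmean(cauchy() for _ in range(n))
                     for _ in range(2000)]
    print(n, round(iqr(means_by_n[n]), 2))
```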
Unfortunately, many "researchers" blindly assume that real-life phenomena follow a Gaussian distribution when they don't... so their models end up skewed.
> Laplace distilled this structure into a simple formula, the one that would later be known as the central limit theorem. No matter how irregular a random process is, even if it’s impossible to model, the average of many outcomes has the distribution that it describes. “It’s really powerful, because it means we don’t need to actually care what is the distribution of the things that got averaged,” Witten said. “All that matters is that the average itself is going to follow a normal distribution.”
This is not really true, because the central limit theorem requires a huge assumption: that the random process has finite variance. I believe that distributions that don't satisfy that assumption, which we can call heavy-tailed distributions, are much more common in the real world than this discussion suggests. Pointing out that infinities don't exist in the real world is also missing the point, since a distribution that just has a huge but finite variance will require a correspondingly huge number of samples to start behaving like a normal distribution.
Apart from the universality, the normal distribution has a pretty big advantage over others in practice, which is that it leads to mathematical models that are tractable in practice. To go into a slightly more detail, in mathematical modeling, often you define some mathematical model that approximates a real-world phenomenon, but which has some unknown parameters, and you want to determine those parameters in order to complete the model. To do that, you take measurements of the real phenomenon, and you find values for the parameters that best fit the measurements. Crucially, the measurements don't need to be exact, but the distribution of the measurement errors is important. If you assume the errors are independent and normally distributed, then you get a relatively nice optimization problem compared to most other things. This is, in my opinion, about as much responsible for the ubiquity of normal distributions in mathematical modeling as the universality from the central limit theorem.
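To make "relatively nice optimization problem" concrete: with iid Gaussian errors, maximum likelihood reduces to ordinary least squares, which for a straight line even has a closed form. A minimal sketch with made-up data (the true slope and intercept of 2 and 1, and the noise level, are assumptions for the demo):

```python
import random, statistics

random.seed(1)

# Hypothetical measurements: y = 2x + 1 plus iid Gaussian noise.
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]

# With iid normal errors, the maximum-likelihood fit IS ordinary least
# squares, and for a line the minimizer has a closed form:
xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar
print(slope, intercept)  # close to the true 2 and 1
```

Swap in a heavy-tailed error model and this closed form (and much of the nice convexity) disappears, which is a big part of why the Gaussian assumption is so popular.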
However, as most people who solve such problems realize, sometimes we have to contend with these things called "outliers," which by another name are really samples from a heavy-tailed distribution. If you don't account for them somehow, then Bad Things(TM) are likely to happen. So either we try to detect and exclude them, or we replace the normal distribution with something that matches the real data a bit better.
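A tiny illustration of why those Bad Things happen to non-robust estimators (the numbers are invented): a single heavy-tailed sample drags the mean far off, while the median barely moves.

```python
import statistics

# Hypothetical measurements clustered near 10, plus one "outlier".
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
dirty = clean + [1000.0]

print(statistics.fmean(clean), statistics.median(clean))  # both ~10.0
# The mean jumps to ~120 while the median stays at 10.0:
print(statistics.fmean(dirty), statistics.median(dirty))
```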
Anyway, to connect this all back to the central limit theorem, it's probably fair to say measurement errors tend to be the combined result of many tiny unrelated effects, but the existence of outliers is pretty strong evidence that some of those effects are heavy-tailed and thus we can't rely on the central limit theorem giving us a normal distribution.
The sum of independent identically distributed random variables, if it converges at all, converges to a Levy stable distribution (aka fat-tailed, heavy-tailed, power law). In this sense, Levy stable distributions are more "normal" than the normal distribution. They also show up with regular frequency all over nature.
As you point out, infinite variance might be dismissed, but in practice it just shows up as larger and larger "outliers" as one keeps drawing from the distribution. Infinities are, in effect, a "verb", so an infinite variance, in this context, just means the distribution spits out larger and larger numbers the more you sample from it.
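That "larger and larger numbers" behaviour is easy to see directly (an illustrative stdlib sketch): a Pareto with shape 1.5 has a finite mean but infinite variance, and the running record of the samples just keeps climbing, roughly like n**(1/1.5).

```python
import random

random.seed(3)

# Pareto with shape alpha = 1.5: finite mean (= 3), infinite variance
# (variance is finite only for alpha > 2).
xs = [random.paretovariate(1.5) for _ in range(10**5)]

running_max, marks = 0.0, {}
for i, x in enumerate(xs, 1):
    running_max = max(running_max, x)
    if i in (10**2, 10**3, 10**4, 10**5):
        marks[i] = running_max
print(marks)  # the record keeps growing as the sample size grows
```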
The point on convergence rates re: the central limit theorem is also a major point otherwise clever people tend to miss, and which comes up in a lot of modeling contexts. Many things which make sense "in the limit" likely make no sense in real world practical contexts, because the divergence from the infinite limit in real-world sizes is often huge.
EDIT: Also, from a modeling standpoint, say e.g. Bayesian, I often care about finding something like the "range" of possible results for (1) a near-uniform prior, (2) a couple of skewed distributions, with the tail in either direction (e.g. some beta distributions), and (3) a symmetric heavy-tailed distribution (e.g. Cauchy). If you have these, anything assuming normality is usually going to be "within" the range of these assumptions, and so is generally not anything I would care about.
Basically, in practical contexts, you care about tails, so assuming they don't meaningfully exist is a non-starter. Looking at non-robust stats of any kind today, without also checking some robust models or stats, just strikes me as crazy.
This is a tautology taken to the extreme.
If sums of independent identically distributed random variables converge to a distribution, they converge to a Levy stable distribution [0]. Tails of the Levy stable distribution are power law, which makes them not Gaussian.
[0] https://en.wikipedia.org/wiki/Stable_distribution
Eg we find bell curves because we look for bell curves. And given infinite resolution we can find them at some granularity.
Second, your "aka" is incorrect --- there is all sorts of clumping that is not a normal distribution.
> your "aka" is incorrect --- there is all sorts of clumping that is not a normal distribution.
That it's "incredibly common for people to label "bell curves" by eyeball, regardless of whether they are normal curves" is not just not relevant, it's anti-relevant ... the central limit theorem says that the distribution of the means is always a bell curve--a normal distribution--not merely a "bell curve".
Anyway, this is covered in far more detail in other comments and material elsewhere, so this is my last contribution.
It doesn't say that. And it shouldn't, because that isn't true.
Normal curves are everywhere normal curves are -- which is an observational tautology -- and fundamental to our observation of "stuff". You're dismissive, as if I'm some illiterate, but you'd be surprised at the contributions to math I've made.
a vast amount of fluff for less than a college statistics professor would (hopefully) be able to impart with a chalkboard in 10 minutes, when Quanta has the ability to prepare animated diagrams like 3Blue1Brown but chooses not to use it
they could go down myriad paths, like how it implies that random walks on square lattices are asymptotically isotropic, or give other simple easy-to-understand applications (like deriving an asymptotic for the expected # of rolls of an n-sided die before the first repeated face), or explain what a normal distribution is, but they only want to tell a story to convey a feeling
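for what it's worth, that die-roll fact is easy to check empirically (a quick stdlib simulation; the sqrt(pi*n/2) asymptotic is the classic birthday-problem result):

```python
import math, random, statistics

random.seed(2)

def rolls_until_repeat(n):
    """Roll a fair n-sided die until some face appears a second time."""
    seen = set()
    while True:
        face = random.randrange(n)
        if face in seen:
            return len(seen) + 1  # distinct faces so far, plus the repeat
        seen.add(face)

n = 365  # birthday-problem sizing: expected ~ sqrt(pi*365/2) ~ 24
trials = [rolls_until_repeat(n) for _ in range(20000)]
print(statistics.fmean(trials), math.sqrt(math.pi * n / 2))
```

the simulated mean lands within a few percent of the asymptotic, which is the kind of concrete payoff the article skips.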
they are a blight upon this world for not using their opportunity to further public engagement in a meaningful way
Perhaps you're just not in their intended audience?
Seems a bit like Ted Talks. Lightweight popcorn for the simple minded.
https://news.ycombinator.com/item?id=45800657
3b1b doesn't have the same goal as Quanta, or as introductory guides. It's actually not that great a teaching tool (it's truly great at what it is for, which is (a) appreciation and motivation, and (b) allowing people to signal how smart they are on message board threads by talking about how much people would get out of watching 3b1b).
This is prose writing about math. It's something you're meant to read for enjoyment. If you don't enjoy it, fine; I don't enjoy cowboy fiction. So I don't read it. I don't so much look for opportunities to yell at how much I hate "The Ballad of Easy Breezy".
My complaint is only that there should be a dozen more just like them, each competing to produce the best, most engaging math and science content. That would allow a broader range of audience skill levels to be reached.
As it stands, we’re lucky even to have Quanta and 3b1b.
I think there is hope though, quite a few new-ish creators on YouTube are following in Grant’s footsteps and producing very technically detailed and informative content at similar quality levels.
by the metric of "if this expository piece were to be taken to a time before its subject had been considered and presented to researchers, how useful would its outline be towards reproducing the theory in its totality," Quanta's writings (on both classical and research math) mostly score 0