Tutorial: Statistics and Data Analysis (1:05:30)
Date Posted:
August 12, 2018
Date Recorded:
August 12, 2018
CBMM Speaker(s):
Ethan Meyers
Description:
Ethan Meyers, Hampshire College/MIT
Overview of statistics including descriptive statistics, data plots, sampling distributions, statistical inference, estimation, regression, confidence intervals, and hypothesis testing.
Download the tutorial slides (PDF)
ETHAN MEYERS: So this tutorial right now is a kind of overview of statistics. The reason is that I assume most people have taken a class in statistics or are familiar with statistics-- you're kind of all scientists. But the point here is that I think even when we do take classes in statistics, there can be a bunch of concepts that are a little tricky or murky. So the point is to kind of just go over some concepts and keep this interactive and fun. And so if you have any questions about anything that's ever come up with statistics, you can ask and we'll go over some basics.
And then, tomorrow, I'm giving another tutorial that is listed as neural data analysis, but it's really more about a bunch of results about how information is transformed as it flows through the brain to allow us to do behaviors. And there, I'll also talk about data analysis and how you can apply some of the methods that I use in my research. But it's more results- and research-focused, while this is more of a tutorial-- hopefully something useful for you to learn. So, again, please feel free to interrupt throughout and try to keep it interactive.
So as a little bit of motivation, I'm going to play a video from 538. It's basically describing how the concept of a p-value is often murky in the heads of scientists and even people who analyze data. So hopefully the volume works.
[VIDEO PLAYBACK]
[MUSIC PLAYING]
- So the question is, what is a p-value?
- What's a p-value?
- What is a p-value?
- What is a p-value?
- A p-value.
- Oh.
- What is a p-value.
- I'm going to pass on that.
- So-- wow-- the p-value is--
- The hypothesis you're testing is--
- You need statistics to try to estimate if what you think is there--
- I know what many people that I have respected have written about and in fact quoted them. Is that around about enough ways to dodge your question?
- Can you explain what a p-value is in a sentence?
- Well, I've actually spent my entire career thinking about the definition of the p-value, but I cannot tell you what it means and almost nobody can.
[END PLAYBACK]
ETHAN MEYERS: OK. So just to say that even these basic concepts that we hopefully have some familiarity with can be tricky and subtle. So how many people here think they could explain what a p-value is? OK, a couple of people. So it seems like this talk might be worth going through. And there are going to be-- following up on Chris's picture-taking-- some bad jokes to hopefully keep it entertaining.
So like I said, statistical concepts can be a little tricky, and I thought it would be useful to go over things. And, please, ask questions if anything comes up that you find confusing or you don't know why I've put things on slides or whatnot. So an overview-- I'm just going to talk about descriptive statistics and then inferential statistics. If there's interest and time at the end, maybe we'll take a little bit of a break. But for those of you who analyze neural data, like spiking activity, I could also go through specific methods that are used more by those communities, like mutual information. So again, depending on time and interest, otherwise, I'm sure no one's going to object to getting out a little early if we hopefully can do that too.
So I guess, to keep it interactive-- where does data come from?
AUDIENCE: Which data.
ETHAN MEYERS: Which data? Right. Good question. So I would put storks. So they deliver your data. But really, when we're thinking about it conceptually, the way statisticians frame it is that data comes from things called distributions-- you all took the probability tutorial the other day-- and these are often described mathematically. And it's basically, if we had infinite amounts of data, we would have access to the full distribution. Or maybe there's some sort of process that generates data, and if we could somehow know and see the truth of this process, we would have access to this full distribution.
So the distribution is kind of the truth. So we could put a picture of Plato up there and say, the truth lies in the distribution. But in reality, we don't have access to that. We don't have infinite amounts of data. So we just have the shadows. So it's like Plato's cave-- we can't see reality. We can only approximate reality through our data.
And so a big point, particularly the point of statistical inference, is to be able to try to say something about the truth and this underlying process that generates data from only these vague samples of data that we have. That's kind of the name of the game. So again, how do we get data? Well, it's a little rhetorical. We do the science.
So you have labs and collect it in many different ways. And if we're collecting data, often what we want to do is collect it using simple random sampling. So if we're recording, let's say, from a particular brain region that we think has one function and we're recording from neurons, we want to sample them kind of randomly. Each neuron has an equal probability of being selected. So that's called random selection.
And so this is a real question-- why would we want to do random selection?
AUDIENCE: To avoid sampling errors.
ETHAN MEYERS: To avoid sampling error. Right. So there's a related concept--
AUDIENCE: To avoid bias in your sample
[INTERPOSING VOICES]
ETHAN MEYERS: Right. To avoid bias. So sampling error is just the random fact that if you have different samples you end up with different statistics. That's called sampling error. And bias is being systematically off. And so, sampling error is kind of unavoidable. But I think you were getting at the same concept-- it's just that the concept's actually called bias, when you're systematically off.
And so, if you do simple random selection, you'll be able to take your sample and then say something about that underlying process. You'll be able to generalize. And, again, that's the name of the game-- to say something about the underlying process.
And so the way to think about it is the soup analogy. If you have a bowl-- have a pot of soup-- you can tell whether, let's say, the whole pot needs more salt just by taking a simple spoonful of it. And the reason that works is because your spoonful has millions of molecules or thousands of molecules on it, and that's, although it's a small sample, very representative of the whole pot. And so by just using a small amount of data, you can generalize to the whole pot, because it's been randomly selected. Obviously, if you have sampling bias and just get a potato, then that's not going to generalize, unless it's potato soup, right? So that's why we want to have a good sample.
So here's some data. This is data about flights and how long they were delayed. So nothing to do with neuroscience. But any data you get often has this format, where you have what are called cases here. These are the individual items that were recorded and collected. And then the columns are called variables. That's the statistical term, not to be confused with variables, let's say, in computer science.
And then there are different types of variables. So you can have variables that are categorical. Those fall into discrete groups. And you can also have quantitative data, which is data you can do math on. You can't do math on categories.
So if you're analyzing your data, what's kind of a good first thing to do?
AUDIENCE: Look at the distribution.
ETHAN MEYERS: To look at the distribution. So usually a good first step-- yeah.
AUDIENCE: I would even [INAUDIBLE], but I guess [INAUDIBLE]
ETHAN MEYERS: Right. So cleaning-- yeah. So that's probably a good first thing to do as well. But one way to tell if you need to clean it is to visualize it or plot it. So if you see some big problems-- you don't want to just jump into the inferential statistics without plotting it or looking at it first. All right. So what's a good way to plot if we have categorical data?
Bar plots, right? So bar plots are-- all you do is you count how many items are in each category, and you just plot the total. Or you can normalize and plot proportion. So if you have categorical data, you kind of go with a bar plot. If you have quantitative data, what are some ways to plot that?
AUDIENCE: Scatter plot.
ETHAN MEYERS: Scatter plot, if you have two variables and you want to look at the relationship, or a time series maybe. Other ways? Yep.
AUDIENCE: Histogram.
ETHAN MEYERS: The histogram. Yeah, that's a good one. That's kind of-- I would say that's the go to. Because when you're looking at a histogram, this gives you the shape of-- or, again, a shadow of-- the underlying distribution. And so you can kind of see, this is a bimodal distribution and get some kind of intuitive understanding, again, before you jump into more advanced analyses.
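[A minimal Python sketch of the two plots just described-- a bar plot of counts for a categorical variable and a histogram for a quantitative one. The data and variable names here are made up for illustration, not the flight-delay data from the slides.]

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Categorical variable: count how many items are in each category, plot the totals.
carriers = rng.choice(["UA", "AA", "DL"], size=300, p=[0.5, 0.3, 0.2])
labels, counts = np.unique(carriers, return_counts=True)
plt.bar(labels, counts)
plt.ylabel("count")
plt.show()

# Quantitative variable: a histogram shows the (sampled) shape of the distribution.
delays = rng.exponential(scale=20, size=300)   # made-up delay times, in minutes
plt.hist(delays, bins=30)
plt.xlabel("delay (minutes)")
plt.ylabel("count")
plt.show()
```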
There are other types of plots you might see for quantitative data. This is a box plot. Do people know how to read a box plot? Yes? Too basic? Does everyone know what this is? Median. This?
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: Third quartile. So this is 75% of your data is less than this value. This is the first quartile, so 25% of your data is less than this value. What about these guys?
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: Yeah. So these are the extreme values-- the maximum and the minimum-- that don't include outliers. Does anyone know what the length of the box is called?
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: Yeah. It's called the interquartile range. So this is the middle 50% of your data. And outliers, in these box plots, are usually plotted with little circles or x's beyond the maximum or minimum. And they're any point that is more than 1.5 times the interquartile range out. So if you took this one and went 1.5 times the interquartile range up, if there was some point that far out, then you wouldn't plot that point as your maximum or minimum, you'd just put a little dot.
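[A small Python sketch of the box-plot quantities just described-- quartiles, the interquartile range, and the 1.5 x IQR rule for outliers. The numbers are made up, not the hot dog contest data.]

```python
import numpy as np

rng = np.random.default_rng(0)
hot_dogs = rng.normal(55, 8, size=200)          # made-up "hot dogs eaten" values

q1, median, q3 = np.percentile(hot_dogs, [25, 50, 75])
iqr = q3 - q1                                    # interquartile range: middle 50% of the data

# Points farther than 1.5 * IQR beyond the quartiles get drawn as individual dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = hot_dogs[(hot_dogs < lower_fence) | (hot_dogs > upper_fence)]

# The whiskers are the most extreme values that are NOT outliers.
whisker_low = hot_dogs[hot_dogs >= lower_fence].min()
whisker_high = hot_dogs[hot_dogs <= upper_fence].max()
```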
So, again, hopefully a lot of people have been exposed to this. But if you've forgotten what that is, now you can read those and remember. This data comes from a hot dog eating contest. These are all the contest winners. So what this is not showing you is the progression over time, because the people who are eating 70, those are the later years-- they got better as they kept on going.
And there are other ways to plot it. So you can plot kind of something similar. Does anyone know what those are called? Oops, I have the title-- violin. Violin plots.
And so this is kind of a smoother version of it. So here, this is more of like a histogram that's been smoothed and mirrored. And then it kind of gives you the same thing. Obviously a bit more detail there, but these can still be useful if you just want to look at these key statistics.
And people are still inventing or reinventing new ways to plot data. So there was something called a joy plot that was all the rage a couple years ago. I guess you guys don't follow the latest in statistical plots, but a lot of people find violin plots to be kind of ugly. So I don't know, those are not the most beautiful looking things. Any guesses why they find them ugly?
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: Because they look like Christmas ornaments or other things. So the joy plot looks much better. This is a joy plot. And so if you're comparing a bunch of items, you've just plotted a little smoothed, kind of density function-- kind of this smooth histogram for your different categories, and it's easy to compare and looks nice. Anyone know why this is called the joy plot?
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: You know?
AUDIENCE: Joy Division.
ETHAN MEYERS: Right. So it looks like the cover of the Joy Division album.
AUDIENCE: [INAUDIBLE] another point is everyone calls this the Joy Division thing, but the graphic designer stole it from a physics [INAUDIBLE].
ETHAN MEYERS: I see. I see. So I should put-- we should call this the physics plot or whatever it is.
AUDIENCE: It's just from some [INAUDIBLE].
ETHAN MEYERS: I see. I see. Good. That's good to know. See, this is why I'm doing this, so I can learn as well-- improve my talks. And again, there are other types of plots. So there are dynamite plots. Everyone's seen these before, used these?
I've used them. Maybe I shouldn't. This is from my paper. So a dynamite plot, maybe you'll plot the mean and a standard error as these dark bars. And it turns out, many statisticians really hate these plots. Any intuition why?
Not for aesthetic reasons. It's because you're wasting a lot of ink to give the reader very little information. So all you're plotting is the mean and a standard error-- you've got all this black bar here obscuring things. So you could just put a dot and standard error. That might be better. But you could even do a box plot or plot all your data. Nowadays, it's very easy to do things like that. So it might be recommended not to do this. Don't do what I did.
And then we can put up another joke here. Can't trust that guy. OK.
[LAUGHTER]
OK, let's go on. So what is a statistic? Can anyone tell me what a statistic is?
AUDIENCE: A function to evaluate data [INAUDIBLE].
ETHAN MEYERS: Exactly. Right. So a statistic is any function of a sample of data. So you have data, you apply some mathematical function to it, it gives you a number-- that's a statistic. So an example would be the sample mean. I assume everyone knows how to calculate a sample mean. Does everyone know the symbol that's typically used for the sample mean?
I see some drawing. So x bar, right? So typically, if you're going to report in your paper that you've got a sample mean, it'd be good to use that symbol. Because that's just what is kind of commonly used.
So, again, if we have a distribution here-- this says heights in inches-- I've got no idea. Oh, this is the heights of people on OK Cupid. And then the middle of it is the sample mean. X bar, right?
And so, a statistic, again, it's a function of your sample. If you want some property of your full distribution, that's called a-- does anyone know? It's called a parameter. So if you have the full smoothed thing-- so, for example, the population mean-- I gave it away-- it's denoted mu. And so if we had all the data, we could get at that guy.
And we notice this is in Greek. This is the truth. This is what we want. Unfortunately, we're stuck with that one.
Let's look at one more example statistic, because I tried to throw a few things in here. So I assume, again, people are kind of familiar with the correlation coefficient. It's usually denoted r, because it's a statistic of a sample of data. And it's just some function of your data.
So your data points are the x i's. First, you compute the mean and the standard deviation of the two variables, and then you get some number out from all this data you collected. It's a statistic. And it tells you something about your data. So it summarizes the sample of data you have, and basically tells you how close the points are to a line that would go through your data.
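[A short Python sketch of the sample correlation coefficient written out from that definition, on made-up x and y arrays; the variable names are mine, not from the slide.]

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)          # sample standard deviations

# r: average product of the standardized deviations from the means.
r = np.sum((x - x_bar) / s_x * (y - y_bar) / s_y) / (len(x) - 1)

assert np.isclose(r, np.corrcoef(x, y)[0, 1])    # matches the built-in calculation
```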
And again, here, this statistic r is an estimate of rho, and rho would be what you'd get if you had infinite amounts of data and could actually compute the real value for that relationship. And again, just because I mentioned correlation, I have to say that correlation, obviously, is not causation. And the reason I have to say it is because I get to put a couple of jokes up.
So has everyone seen this before? This is probably my favorite statistical joke. It says, "I used to think correlation implied causation. Then I took a statistics class. Now I don't know." And she's like, "Sounds like that class helped." "Well, maybe." I have other jokes too about that.
[LAUGHTER]
Let's keep going. Someone sent me another anonymous email with a link to an article about the world's worst bosses. "I get one of those emails every time I leave your cubicle. Did you think I wouldn't notice the correlation?" And then that guy's in the background-- "Correlation does not imply causation." All right, I promise there'd be bad jokes. I hope I'm delivering.
So like I said, statistics, these functions of your sample of data, are usually denoted with Latin or Roman characters. So for a single quantitative variable, we have the mean, which is x bar. For a single categorical variable, we have the proportion that's in each category. Does anyone know what symbol we use for that typically? So it's usually p hat.
And then we talked about, for a pair of quantitative variables, we have the correlation coefficient, r. And that's the symbol we use. And again, people don't always use these symbols, but I like the dichotomy between the Roman and the Greek, to know whether you're talking about parameters or statistics.
So for each of these, again, like I said, there's the corresponding parameter that it could be an estimate of. So if we have the parameters, we can know-- for the mean, we know mu. For the single categorical variable of a proportion-- guesses?
Pi. Some people use P. But that violates the principle of keeping them Greek, so I use pi. And then, for correlation we have rho, as we talked about.
And so, then again, like I said, the name of the game with statistical inference is we use the sample statistics to make judgments about population parameters. So x bar is an estimate of mu. And again, to belabor the point, there's Plato with his Greek symbols, and there's our shadow.
And so, like I said, when we have a single statistic that's an estimate of a parameter, it's called a point estimate. And I think I was going to show something else, but I stuck in the regression slides here. So one more example of a statistic. So related to correlation is the notion of regression. I just wanted to briefly talk about it, and I wasn't sure where to throw it in.
So regression is just another-- it's a way to make predictions from one variable to another. So what I can do is I can predict, based on the amount of ice cream sales I have, the number of shark attacks that are going to occur in a given year. So if a lot of ice cream was sold, I can use this line, and I can say, this year we sold 140 tons of ice cream, so there should be about 45 shark attacks.
And this is often written as a linear equation. So the true relationship would have these beta weights, which, again, if we had all the data, we could estimate perfectly. In reality, we just have a finite amount of data, and so we estimate the b's, b0 and b1, which are, again, approximations of those.
So even regression follows that same principle of statistics estimating parameters. And to get the b's, what we usually do is minimize the squared error between your predictions and the actual data. Predictions, again, with this notation, are usually denoted with hats, and estimates are usually denoted with hats if you're not using Roman characters.
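[A minimal Python sketch of that least-squares fit, y hat = b0 + b1 * x, using the closed-form estimates; the ice cream and shark numbers are invented for illustration.]

```python
import numpy as np

rng = np.random.default_rng(0)
ice_cream = rng.uniform(50, 150, size=30)               # tons of ice cream sold (made up)
sharks = 0.3 * ice_cream + rng.normal(0, 3, size=30)    # shark attacks (made up)

# Closed-form least-squares estimates of the betas.
x_bar, y_bar = ice_cream.mean(), sharks.mean()
b1 = np.sum((ice_cream - x_bar) * (sharks - y_bar)) / np.sum((ice_cream - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * 140    # predicted shark attacks for a 140-ton year
```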
Any questions about anything I've said so far? OK. And just remember, if you're doing regression, don't try to make predictions way outside of the range that you fit your model on. So this is someone extrapolating number of husbands as a function of the date. Yesterday, she had zero, today she has one. If you keep on extrapolating, she will have many, many husbands very shortly. So be careful with that.
So not only does data have distributions, but if you take data, and compute a statistic from it, and repeat that process many times, you can have a distribution of statistics. Does that make sense? And so the distribution of statistics is called a sampling distribution. And so, for example, again, if I had one sample, and I took that and computed the mean for that first sample, and then I had another sample and computed the mean again, and I did that many, many times, then I would have a distribution of statistics from repeating the same process.
Now, obviously, we probably wouldn't want to do that in an experimental setting, because you'd have to repeat your study many, many times. But, theoretically, it's an important concept that every statistic you get comes from a distribution of statistics. Clear? And often these distributions are normal. So, you guys, I assume, went over the normal distribution when you did probability. It's one of the most common ones.
So if you're computing, for example, means, under just very mild assumptions, often your statistics will have a normal distribution. And that's due to the central limit theorem, which is a theorem you can prove showing that a lot of statistics have this property.
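[A quick Python simulation of a sampling distribution, in the spirit of what was just described: even though the raw data is skewed, the distribution of sample means comes out roughly normal, which is the central limit theorem at work. The numbers are arbitrary.]

```python
import numpy as np

rng = np.random.default_rng(0)

# One "study" = draw a sample, compute its mean; repeat many times.
sample_means = np.array([
    rng.exponential(scale=10, size=50).mean()
    for _ in range(10_000)
])

# A histogram of sample_means (the approximate sampling distribution) will look
# close to normal, even though the underlying exponential data is very skewed.
```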
So apart from a point estimate, which is, again, your best guess at the parameter, you can have an interval estimate. So an interval estimate is your point estimate plus some margin of error. So I think the true value is within this range. So again, if that was our statistic and that's our parameter, we're going to maybe not be able to say that our statistic perfectly reflects it, but we're able to say the true parameter is somewhere in this range.
And so what a confidence interval is-- and people know confidence intervals, you've used them-- it's this method where you create these intervals that have the parameter in them most of the time. Sometimes they miss, but most of the time the parameter is in it. So, for example, you might want to say, 95% of the time I create an interval, it's going to have the parameter in it.
So I think of this as in terms of ring toss. Anyone know this game? So there's like a stick, and you have to throw a ring on, and the ring has got to land on a stick. So, basically, confidence intervals are you're constantly throwing these rings, and 95% of the time you get the parameter in that interval. And some of them miss, but it's just a small percentage.
The downside is, you don't know which ones miss and which ones hit. So for any one experiment you don't know, but you have what are called frequentist guarantees-- if the math all works out and everything is done correctly-- that you will be hitting 95% of them. And I have this great game I play with the undergraduates, where I have everyone estimate intervals for things. So I'll say, how many floors does the Leaning Tower of Pisa have? And then the students say, somewhere between 10 and 70, and I ask them 10 of these types of questions, and they have to get nine of them right. And so that's kind of the notion of a confidence interval. Unfortunately, I don't have the cards with me today, so this is no fun.
[LAUGHTER]
Here's the perfect illustration of it. So those are all intervals. The red ones miss that parameter, which is the black vertical line. But most of them hit. And, obviously, there's a trade-off between the size you make your confidence interval and the proportion of times you hit. So if you made really large intervals, you'd always get the parameter, but it would be pretty useless.
So, for example, we can turn to Garfield. This is 100% confidence interval. "Taking a look at tomorrow's weather, the high temperature will be between 40 below zero and 200 above." And then, Garfield's like, "This guy is never wrong." So that is a very large interval, which is essentially meaningless for whether you should wear shorts or not. But it has 100% coverage. It's going to always hit the true temperature.
Does this make sense to everyone? Am I going too fast, too slow? Is this useful? Feel free to ask questions. Or if you're-- I don't know. Give me some indication that you're bored, and I'll speed up and go to the neural stuff.
So how can we estimate these confidence intervals? There's a number of different ways, some using mathematics, some using computation. So one way to do it is a method called the bootstrap. People familiar with the bootstrap?
So the bootstrap is basically this idea. What you do is essentially you're trying to create an estimate of the sampling distribution-- your distribution of statistics. And so to do that, what you do is you take your original sample-- and maybe this figure will help-- and you sample with replacement from it. And so that sample with replacement is kind of a proxy for as if you'd gotten another sample from the population.
And so you take that other sample and you compute your statistic on it. And that's called a bootstrap statistic, or a bootstrap replicate. It's your statistic computed, again, from a sample that was sampled with replacement from your original sample. And you repeat that process many, many times, and you get a full distribution of statistics that is supposed to kind of mimic the sampling distribution-- as if you had redone your study many, many times. And then, from that, you can estimate what's called the standard error. And that's the standard deviation of your sampling distribution.
And so, let's see, maybe this picture will help. I guess we can even say, suppose this was even the real sampling distribution. And a lot of times, the sampling distribution, as I mentioned, is going to be normal. And so what that means is that for a normal distribution, 95% of your data lies within two standard deviations. Did you guys learn that yesterday? Or refreshed-- you probably already know it.
So if 95% of your data is within two standard deviations of the population mean-- or 95% of your statistics-- that means that if you take a given statistic and you go out two standard errors this way and two standard errors that way, you're going to capture the population parameter like 95% of the time. Because for 95% of your statistics, swinging out both ways like that will capture the parameter. Does that make sense? So that's why, if this distribution is normal, you can use two times your standard error to create a confidence interval that will capture the population parameter 95% of the time. Yeah.
AUDIENCE: Oftentimes, the regions are not normal, where some of these [INAUDIBLE] or [INAUDIBLE] bootstrap method [INAUDIBLE] just [INAUDIBLE].
ETHAN MEYERS: Right. So a lot of times your data is not normally distributed, but your statistics still will be. But there are cases where your statistics aren't perfectly normal either. And so, first of all, if you've done this bootstrap procedure, it's a good idea to plot it, just to take a look at it and see if it seems normal. So sometimes we get just pathological cases, and you can tell right away. Sometimes it even will look normal when the real sampling distribution wasn't, and then maybe you're screwed. I don't know.
But there are other methods. So you can generate this bootstrap distribution, and you don't need to calculate anything by hand. You calculate the standard error from this-- if this were actually the bootstrap distribution, not the sampling distribution-- by just computing the standard deviation of your bootstrap replicates. And so this is just a proxy.
So what you can do, though, is from this distribution of bootstrap replicates, you can take the 2.5th and the 97.5th percentiles. And within that, again, that should capture the parameter 95% of the time, even if it's not perfectly normal or symmetric. It doesn't always work. So there are a lot of things where people hide behind the perfect math, and in reality it's messier. A lot of people, actually, nowadays, have been doing simulations to finally be able to test all the theory, since computational power is so cheap now compared to when they had to do it by hand. And it turns out the assumptions people have always been making are not perfectly true, but it generally works.
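[A minimal Python sketch of the bootstrap just described, for the mean of a made-up sample, with both the two-standard-error interval and the percentile interval. The data and names are invented, not from the tutorial.]

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(67, 4, size=100)        # the one sample we actually have

# Resample with replacement and recompute the statistic: one bootstrap replicate.
boot_stats = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(10_000)
])

standard_error = boot_stats.std(ddof=1)     # SD of the bootstrap distribution

# If the bootstrap distribution looks normal, go about two standard errors out.
ci_normal = (sample.mean() - 2 * standard_error,
             sample.mean() + 2 * standard_error)

# Percentile method: the middle 95% of the replicates, no normality assumed.
ci_percentile = tuple(np.percentile(boot_stats, [2.5, 97.5]))
```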
So, again, calculating confidence intervals, you can also use just mathematics. So based on, again, certain underlying assumptions that this is normal, the standard error will be given by this formula, where s is your standard deviation and n is your sample size. And so that gives you the standard error. And then that's obviously much quicker than doing the bootstrap. And then you just do two times that, plus or minus, and that will capture the parameter, again, 95% of the time. Yeah.
AUDIENCE: [INAUDIBLE] it's one way [INAUDIBLE] exact two [INAUDIBLE].
ETHAN MEYERS: Yes. So if you looked at the normal distribution and you want to capture the middle 95%, it's actually at-- if it was a standard normal, so 0 mean, standard deviation of 1, 95% is actually 1.96 out. It's not actually 2, but we just round up to 2. So you're being a little conservative. You make your interval a little larger by using 2, and you can get away with it-- it's 1.96 if you actually looked at the normal distribution. So if you want to be really precise-- but it's all a little handwaving anyway.
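[And the formula-based version of the same interval, SE = s / sqrt(n), with the 1.96 multiplier just mentioned-- again a sketch on made-up data.]

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(67, 4, size=100)

se = sample.std(ddof=1) / np.sqrt(len(sample))              # s / sqrt(n)
ci = (sample.mean() - 1.96 * se, sample.mean() + 1.96 * se)
```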
Any other questions? Understand what confidence intervals are? There's a mistake in my title. So I was going to ask, what is a p-value. It says, why is a p-value?
Why a p-value? That's another good question. And I guess I asked that question earlier, and not many people were willing to give it a shot. Is anyone feeling brave?
AUDIENCE: I guess I will. [INAUDIBLE]. Normal distribution [INAUDIBLE], and [INAUDIBLE] probability that the statistics will be [INAUDIBLE].
ETHAN MEYERS: Right. Exactly right. So that's exactly right. That's the technical definition. Now, in all fairness to that video, I think they didn't ask them just to describe it that way, because I think a lot of people maybe could do that. But they were trying to say, explain it to me. And so, hopefully you also understand the concept, but that part can be tricky as well. But that's exactly right.
AUDIENCE: I will try to explain this.
ETHAN MEYERS: OK. Do you want to try that too, or?
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: OK. Maybe I'll give it a shot, because I have 10 slides on it, and maybe that might be helpful. So right-- sorry, what's your name?
AUDIENCE: Victoria.
ETHAN MEYERS: Victoria? OK. So what Victoria said was, basically you assume a null distribution. You assume nothing interesting is happening. And then you get your observed statistic from your sample of data. And you say, if these are the statistics I would get if nothing interesting was happening, what is the probability I would get my statistic or a statistic this large or larger from this null distribution?
And so, here we put a hypothesis test in two steps. The next slide I'm going to do it in five. But basically, what we do here is we create a null distribution. This is a distribution consistent with nothing interesting happening. And then we see, where does our statistic lie in that distribution? If our statistic looks like a bunch of really boring statistics, then we probably haven't found anything interesting.
But if our statistic looks very different than a whole bunch of boring statistics, then we can say our statistic is not likely to come from this boring distribution. Something interesting is happening. And so that's the notion that we reject this boring distribution-- this null distribution-- or we reject the null hypothesis. And we don't exactly accept the alternative-- we're just forced to say the null is unlikely.
So this is just writing it mathematically. It's the probability that a random statistic from your null distribution would be greater than or equal to your observed one. It's not this. It is not the probability that your hypothesis is correct. If you want that, you need to use Bayesian inference. And that's tricky, because then you have to make some assumptions about your prior.
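[One way to write down what that slide is saying, with T standing for a statistic drawn from the null distribution and t_obs for the observed statistic-- these symbols are mine, not necessarily the slide's:]

```latex
% The p-value: the probability, under the null hypothesis H_0, of a statistic
% at least as extreme as the one observed.
p = P\bigl(T \ge t_{\mathrm{obs}} \mid H_0\bigr)
% Note it is NOT P(H_0 \mid \text{data}); getting that requires Bayesian inference.
```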
You guys covered Bayes' rule yesterday? OK. So in practice people are doing it more, but it can be tricky.
Here's a hypothesis test in five steps, using the trial metaphor. So, basically, we can view hypothesis testing as analogous to a criminal justice trial or something. So, basically, what you do is you start, when you're doing hypothesis testing, by stating your null and alternative hypotheses. The null hypothesis says nothing interesting is happening. Your alternative is what you are hoping to kind of see-- that there is an effect there.
So this is equivalent to setting up the courtroom. We say, this is what guilty looks like. This is what innocent looks like.
The next thing you do is you gather evidence or you compute your statistic from your data. And what this is like is gathering evidence in a crime scene. So you look at your sample, and you say, how much blood is on this person? How many knives do they have? How many ski masks are they wearing? And then that gives you some sort of measure of observed data.
And then you create a distribution of what innocent people look like. So this is how many knives and how much blood does your average person have or do most people have? There's going to be a distribution. Some people bleed more, some people are chefs. So you have your innocent distribution.
And then what you do-- that's the null distribution. And then it looks something like that. And then you see, where does the statistic-- the blood on the person you have-- sit relative to the blood that most people have? And so that's your p-value. It's the probability that all these innocent people would have as much or more blood than the person you're measuring. And then, at the end, you can make a judgment-- assess whether the results are statistically significant. Any questions about that?
AUDIENCE: How would people find this if you were using Bayesian statistics, just with the same [INAUDIBLE]?
ETHAN MEYERS: Right. So how would you do a Bayesian analysis? That's a good question. So in Bayesian analysis, you have a distribution over your parameters to start with. So in a Bayesian analysis, you're actually trying to get a probability distribution over parameters-- over hypotheses.
And so you assume some baseline rate-- a prior-- and then you calculate, essentially, the likelihood of your data under the hypothesis, and you multiply it by that baseline rate. And then you normalize by the overall probability of your data. And that will give you the probability of your actual hypothesis.
And so I should have put up a Bayesian example we could look through. But I'm trying to think of the ones that come to the top of my head. I teach a class on analyzing baseball data-- or statistics through baseball. And so the one that comes to my head is if you measure someone's batting average-- the proportion of hits they get-- most people, at the end of a season, are in the range between .350-- 35% of the time they get a hit-- and .200, 20%. And so if you just observe someone for a few games, maybe they got really lucky, and they got on base every single time.
But if that was your point estimate, you'd be way off. So having some prior and knowing that people are in this typical range can help you make better judgments if you have less data. That's one example. There's a bunch of stuff you can do with Bayesian methods, like updating as new data comes in. But again, with Bayesian methods, you have a distribution over your parameters, whereas in frequentism you assume there is a true parameter out there, and then you create a null distribution from assumptions about that parameter being true. And it gives you long-run guarantees if you repeated your study many times.
Any other questions? So if you're doing a hypothesis test, there's a few different types. But, basically, there are these permutation tests, again, where you are doing-- what time is this? 2:45, OK. So with permutation tests, you basically create your null distribution by randomly shuffling your data using computationally-intensive methods.
So you essentially shuffle your conditions or your labels. And then you compute the statistics on the shuffled data, and that gives you a null distribution. And if you repeat this many times, you get [INAUDIBLE].
In parametric tests, you assume your null distribution has a particular form based on mathematics. And so that gives you the normal distribution without having to do this computation of randomly shuffling your data many times to generate a null distribution. And those are things like t-tests and ANOVAs.
You can also do visual hypothesis tests. This is a little bit of a digression. But this is kind of, I think, more of a new idea. But, basically, the idea is that if you're generating the null distribution using a permutation test, you're essentially shuffling your data. And what you could do is actually visualize those shuffles-- the shuffled data-- and compare them to your real data. And if you can, in a lineup, point out which is your real data and which is the shuffled data, then probably your real data is not just generated by some sort of random process.
So I'm going to show you some plots. Let's see. Which is the actual data? Can we tell?
AUDIENCE: 3, 3.
ETHAN MEYERS: What's that?
AUDIENCE: 3, 3.
ETHAN MEYERS: It's 3, 3? Yeah. So that's 13 there. So people see the relationships there. So what you do is-- yeah. So you can see it here. And these are all shuffled. So, basically, for each data point, you have two coordinates, x and y, and they're lined up. And that gives you a linear relationship here.
But then you shuffle the order of the points, because under the null hypothesis you're saying there's no relationship between them. And so, these are all consistent with the null hypothesis, that there is no relationship between x and y. But you can clearly see, in the real data, you can visualize it. So it is not consistent. It doesn't look like an innocent person here.
And so that's the same thing that a permutation test is doing. These would all be points in your null distribution, and you could compute the correlation coefficient r. And then you'd look at your observed statistic and say, how many of these correlations are larger than the one in your real data? And that's your p-value.
So that explains hypothesis tests. So just walking through a little bit more of a concrete example, kind of the archetypal example of a hypothesis test is, is this pill effective? Whatever it is treating-- I guess Alzheimer's here. I don't know. We're doing our science.
So if we want to test whether it's effective, what we can do is something called random assignment. This gets at causation. What you do here is you just randomly split your data into two parts-- or participants into two parts. One's a treatment group, one's a control group. Treatment group gets a drug, control group gets a placebo. We're all familiar with this. And then you see if there is an improvement in the treatment group.
Participant pool-- randomly assign them to two groups. So the reason we do random assignment is because if we randomly split the people, the treatment group should look like the control group if there's no effect. Does that make sense? On average, these are going to be pretty similar. So if you see a big difference between these groups when they're randomly assigned, and they should look the same, then you can reject that the pill did nothing and say the effect is causal. Yeah?
AUDIENCE: [INAUDIBLE] sample size though. Because with a small sample [INAUDIBLE].
ETHAN MEYERS: Yes, exactly. So it would depend on the sample size, but-- well, I'll show you in a minute. It still has these long-term frequentist guarantees for the most part if you're doing a permutation test. Because if you had a small sample-- well, let me show you the permutation test, and then you can take a look at it.
So if you're doing a permutation test-- well, first of all, if we're doing any kind of hypothesis test-- step one is to state the null and alternative. So again, the null hypothesis is that the treatment and control have the same, let's say, mean level of whatever we're measuring-- cognitive ability. Or you can write that the difference in means is 0. And so, when you're stating your hypotheses, again, we're using Greek symbols, because we want to know something about the truth-- the infinite process.
And then, the alternative is that the treatment helped the people. So they had a higher cognitive ability. Or the difference is greater than 0. So that's step 1.
Step 2, we're going to calculate our observed statistic. So the observed statistic is the average cognitive ability of our treatment group minus the average of the control. And our observed statistic is from real data, so it gets the x bars. And it kind of mirrors the statistic mentioned in step 1, when you're stating your null and alternative. Am I going too quick, too slow? We only have a few minutes.
So what would we do next?
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: What's that?
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: Right. So step 3, we have to generate the null distribution. And then step 4 is to see how extreme it is. So to generate the null here, what we're going to do is, under the null hypothesis, we're saying there is no difference between the treatment and control. So we can view them as coming from the same distribution. So it's perfectly fine for us to combine all our data together, because everyone's equal. The pill had no effect.
And so we combine everyone back together. And then, what we do is we shuffle them, and then we split them apart again. And so this is a proxy for your treatment group. It's just a bunch of random people. But under the null, these were just random people anyway. And then the rest of the shuffled people are your control group.
And then you compute your statistic on each of those shuffles. So x bar shuffled treatment minus x bar shuffled control. Get the difference. That's one point in your null, and repeat it many times.
And so here, if you had a small sample, what would happen is your null distribution would just tend to be pretty wide. But you could still have a really extreme statistic anyway if there actually was an effect. Does that make sense? OK.
And so, after you've calculated one shuffled statistic, you repeat this process 10,000 times. And that gives you, again, a bunch of what innocent people look like under the assumption of the null hypothesis. And then for step 4, we take our observed statistic, and we say, what's the probability, from these innocent people, we would have gotten something as or more extreme? So this guy sort of looks like the rest of your statistics, but if we'd gotten a value way out there, we can say it's very unlikely to come from this distribution. And so, again, the p-value is the probability that you get something as or more extreme from this null distribution.
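[A minimal Python version of those steps for the treatment-versus-control example-- shuffle the group labels, recompute the difference in means, repeat, and count how often the shuffled differences are as large as the observed one. The scores are made up for illustration.]

```python
import numpy as np

rng = np.random.default_rng(0)
treatment = rng.normal(52, 10, size=40)     # made-up cognitive scores, drug group
control = rng.normal(48, 10, size=40)       # made-up cognitive scores, placebo group

# Step 2: the observed statistic.
obs = treatment.mean() - control.mean()

# Step 3: build the null distribution by shuffling everyone together.
combined = np.concatenate([treatment, control])
n_treat = len(treatment)
null_stats = np.empty(10_000)
for i in range(10_000):
    shuffled = rng.permutation(combined)
    null_stats[i] = shuffled[:n_treat].mean() - shuffled[n_treat:].mean()

# Step 4: the p-value is the proportion of null statistics as or more extreme.
p_value = np.mean(null_stats >= obs)
```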
AUDIENCE: [INAUDIBLE] mean of the distribution of the patient [INAUDIBLE].
ETHAN MEYERS: Yeah. So thanks for clarifying that. So the probability you would get-- so the null distribution is a distribution of statistics that are consistent with the null hypothesis. And so it's the probability from this distribution of boring statistics you would have gotten one that was greater than the one you actually have-- or as great or greater. Does that make sense? Am I answering your question?
AUDIENCE: Yeah, yeah. But [INAUDIBLE] typical [INAUDIBLE]?
ETHAN MEYERS: What's that?
AUDIENCE: How do you [INAUDIBLE]?
ETHAN MEYERS: OK. So, right. So the way we did this was basically, this was done in this step here. So you take your treatment and control, you combine them, you shuffle them up, and then you split them into two fake groups. And then, with those fake groups, you calculate the mean of the shuffled treatment group-- it's not really a treatment, it's just the mean of the people who happened to land in that fake treatment group-- and the mean of the shuffled control group. And so that's a difference of means that's consistent with everyone being the same.
And so that is one point in this distribution. And then repeat that process again, and again, and again. And then this is a histogram of doing that 10,000 times. So this is all your statistics from doing that shuffling. And so these are a whole bunch of statistics that are consistent with the null hypothesis, that there's no difference between the two groups.
And then you say, well, it really doesn't look like or it does look like my data that I actually have. In which case, I can't say anything. My data could've been generated from this null distribution, this null process. But if it looks very different, we just say, no, it doesn't look like that null distribution. Yeah. Did I answer that? Are you--
AUDIENCE: So [INAUDIBLE] conversion [INAUDIBLE].
ETHAN MEYERS: That's right. That's exactly right. Yeah, the proportion more extreme. Yeah. Did you have a question too?
AUDIENCE: I was going to ask whether if you consider alternative hypothesis and different ways of doing the [INAUDIBLE] data, that it's probably more difficult than the null hypothesis, because now you can not really get into [INAUDIBLE]. But I'm wondering, because if it's part of the distribution of the null hypothesis [INAUDIBLE], is pretty likely to get it from the null. But it's much more likely to get it from an alternative hypothesis if you compare [INAUDIBLE] hypothesis, right?
ETHAN MEYERS: Yeah. So there you'd have to know something about what your alternative hypothesis is. So you'd have to formulate a distribution of what your data comes from. And that, again, that's going kind of into the Bayesian analysis, where you're comparing two different probability models. And you can do things like either, I guess, without priors, a likelihood ratio. So the ratio-- if you could formulate this distribution here-- of those two distributions. So it's three times more likely to come from the null than it is to come from the alternative. Or you could--
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: Yeah.
AUDIENCE: Exactly. But I'm curious, if you [INAUDIBLE] for example, that, and then just calculate the bias, [INAUDIBLE] using the frequentist analysis. How much do you actually-- are you playing it on safe side or are you playing it on nonconservative side? How do people [INAUDIBLE].
ETHAN MEYERS: So I'm not 100% sure. I mean, I think it depends on the--
AUDIENCE: [INAUDIBLE] analysis, and cut it off at [INAUDIBLE] or something.
ETHAN MEYERS: Yeah.
AUDIENCE: And then if you compare it with the toy example, where you have a true value and you generate data from the true value. And then you do this to a hypothesis comparison and measure that. And that [INAUDIBLE] always the [INAUDIBLE] the hypothesis that's more likely to generate a sample you have. That's probably, with company knowledge, the better way of making decisions [INAUDIBLE]. And comparing this decision-making process with that [INAUDIBLE] and see whether [INAUDIBLE] on the conservative side or being too--
ETHAN MEYERS: Right. Yeah. So nowadays, if you're doing it through simulation, some people have tested a whole bunch of different methods with simulations. When you know what the real parameters are, you know exactly what the distribution is, and you can see the degree to which the permutation test works. And I think it's fairly robust, more so than if you're assuming certain normal distributions and those are violated.
There's also a notion-- I'll talk about it in a second-- of the types of errors you can get and what's more powerful. So sometimes the paired, matched ones can be slightly more powerful, but often not a lot. So maybe I'll move on for one second, because we have about five more minutes, and then we can talk more as well. So there's the p-value, which here is that 6% of your null statistics are as great or greater than your observed statistic.
So question for you all-- should you report the exact p-value or should you report something like it is less than 0.05?
AUDIENCE: Exact.
ETHAN MEYERS: Exact. How many people think exact? How many people think less than 0.05? A couple of people.
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: So why would you say less than 0.05? Or am I putting you on the spot too much?
AUDIENCE: [INAUDIBLE] intervals, but.
ETHAN MEYERS: Yeah. So there is equivalence between the two as well, which I don't have much time to talk about. But yeah. So it's not a completely-- there's not exactly a right answer here necessarily. Yeah.
AUDIENCE: Actually I would not say that. Because if you do this [INAUDIBLE] sampling, and you do that process, it also fits on the times. Then you get a [INAUDIBLE], and you can use that to limit your precision of the p-value to the most significant [INAUDIBLE]. And wouldn't that just be a good way of determining how much precision you could record p at?
ETHAN MEYERS: Yes. It's quite a bit-- it's a little bit more tricky and complicated, because-- so if the null hypothesis is correct, is true, the distribution of p-values is uniform. So you're just as likely to get any p-value. But it means that only 5% of the time are you going to get one that's less than 0.05. Anyway, again, I've got just a few more minutes. [INAUDIBLE] try to run around pretty quick.
So this kind of question here kind of comes down to, a little bit-- at least I'm going to frame it as-- the debate between the two founders of statistical testing. So the current thing, which is called null hypothesis significance testing, is actually a hybrid of two theories. One is significance testing by Ronald Fisher, and one is hypothesis testing by Jerzy Neyman and Egon Pearson. This is Fisher, this is Neyman, and they hated each other. Particularly, Fisher was kind of mean to everyone.
And so the notion that Neyman had, and Pearson, was that what you do is you set something called an alpha level before you start. You set it at like, let's say, 0.05. And if you get a p-value less than that, then you reject it-- you reject the null hypothesis and say something interesting must have been happening. And if you get something greater than 0.05, you fail to reject, and you can't say anything interesting is happening.
And so if you do that procedure by setting it first and then seeing where you lie, then if you run many, many hypothesis tests, you will only make the mistake of rejecting when you shouldn't 5% of the time. So that's great. I can run as many tests as I want. What it tells you is that, in the literature, only 5% of the results are wrong. You just don't know which ones they are.
Whereas, Fisher was like, that's terrible. No one cares about "on average" whether the literature is right. You want to know if your experiment was right. And so then he's like, report the actual p-value. But what he came up with is not mathematically sound, because it's really kind of a weak proxy for Bayesian analysis. He called it, I think, fiducial probability, and it's not actually a probability.
And so if you want to be mathematically rigorous, you use this method. But it's not so good in practice. If you want to get a little bit more insight, report the p-value. So I think you should report the p-value. I do that, because why throw away information? But it's less mathematically rigorous.
And this goes into the two different types of errors. So by using Neyman's procedure of setting that alpha level, you control that only 5% of the literature is incorrect. And, ideally, you want to use the statistical test that's most powerful. So if the null hypothesis is wrong, you want to reject it most of the time, to actually show that there is an experimental effect there. So you want to try to choose a test that's as powerful as possible.
OK, Ryan Gosling joke. Hey Girl, I made a type 1 error. I shouldn't have rejected you.
[LAUGHTER]
Oh, shoot. OK. I'll try to run through this very quickly. So the upside is, using Neyman's procedure, only 5% of the literature would be wrong. But the problem is that people do many, many tests.
So here's an example. "Jelly beans cause acne. Scientists! Investigate!" "So we found no link between jelly beans and acne." P is greater than 0.05. You couldn't reject. So we can't say there's any relationship there.
And then he says, well, "That settles it." And then, this girl's like, well, I hear "it's only a certain color that causes" acne. So what they do is they test a bunch of colors. They test purple, and brown, and pink, and blue. And then they keep testing-- tan, cyan.
And then green-- it's less than 0.05. And so, at the end of the day, they end up reporting that green jelly beans cause acne. So the problem is, 5% of the tests you do, you're going to falsely reject. But if you do many, many tests, you're going to hit one of those by chance. So what I say is, don't ever do this. So you might have to do many tests, but it's good to be honest about this. Don't kind of fiddle with your data until you find something less than 0.05.
Hopefully this is obvious to you all. This is kind of basic ethics. You want to get at the truth. You're not getting at the truth just by showing random results.
There's also the file drawer effect. People only publish the significant ones. And this has led to the replication crisis, where people can't repeat experiments. Because it's not 5% of the literature that's wrong, it's 30% or 60% or 80%, because people are doing so many things and only publishing a small amount. Or there's different arguments-- maybe they're not searching for-- they're searching for things they already know to be true.
Here's some data. This is the percent of scientists that think there is a replication crisis-- a reproducibility crisis. 52% think it's significant, 38% slight, and only 7% say no. So I don't know if you have that feeling. Yep.
AUDIENCE: What [INAUDIBLE] correction [INAUDIBLE] corrections?
ETHAN MEYERS: Yeah. So you can do different corrections. So that's one way that people try to deal with multiple hypothesis testing. So the Bonferroni is conservative. It's saying that if you run many tests, the probability that you get a false positive on any of them is less than 0.05.
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: Yes, exactly. So it's a pretty simple correction. So that's one thing you can do. And you can also-- there are ways to control the false discovery rate, which are a bit more involved. I'm not sure how well they work. So yeah, you can try to do that and still maintain the frequentist guarantees that only 5% of any of your tests will be wrong, but then you're starting to lose power. So you'd need to collect a lot of data for each test if you ever wanted to reject the null at all.
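[The Bonferroni idea in a couple of lines of Python-- with m tests, compare each p-value to alpha / m. The p-values here are invented just to show the mechanics.]

```python
import numpy as np

p_values = np.array([0.001, 0.04, 0.03, 0.20])   # made-up p-values from m = 4 tests
alpha = 0.05
m = len(p_values)

reject = p_values < alpha / m    # only the first test survives the correction
```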
AUDIENCE: So would that solutions [INAUDIBLE]?
ETHAN MEYERS: Right. So my solution is you want to plan your experiment carefully and think about the tests you want to do beforehand. You might want to do some corrections-- you probably should if you want to be rigorous. I try to-- so I'm going to tell you about decoding tomorrow. It's a very, very powerful method. So my p-values are like 0, so I don't worry too much. You see very, very big effects. So hopefully you're working in a regime where you see big effects, or you can change the methods to try to get really, really clean and good data. Maybe that's asking too much.
And the other thing I recommend is just to do reproducible research. So just be honest about all the tests you did, and report them all. And there's a lot of tools now, where you can create documents that have both the code and the analysis, so people can redo what you did. And so, then, if you had tested all those different jelly beans, someone would be like, well, look, you just did something ridiculous, right? And there'd be actually a record of that.
And so these things are super nice. You have some code here, and then you've got figures. And you can write, in English, what you're doing. And it's a good way to have a record of all your analyses. Another way people try to do it is preregistration, where they just outline everything they're going to do, and then they do exactly what the research plan says.
And that is useful, but you're really, in a certain sense, boxing yourself in. And this kind of limits-- I mean, the whole framework of hypothesis testing is very limited, because you're kind of limiting yourself to yes and no questions. So, again, tomorrow I'll be talking more about decoding, where you can ask, I think, more interesting questions. And it might be more powerful than this.
You still want to run hypothesis tests to make sure you're not fooling yourself. But there are other things you want to explore in your data too. Again, depending on what you're doing.
And then data science. Does anyone know what data science is?
AUDIENCE: Statistical science.
ETHAN MEYERS: What's that? Yeah. So that's one definition. Statistics done in San Francisco or California or something, on a MacBook. Any other definitions? Yeah.
AUDIENCE: I would say it's the application of computational methods to [INAUDIBLE].
ETHAN MEYERS: Right. So it's this combination of kind of computer science along with statistics to try to answer questions of a particular domain. So, basically, another take that I have on it is that statisticians were very much involved in mathematical methods. They did not have much training in computer programming. There was this big rise-- not all of them. Some of them were doing computational methods, but the field kind of [INAUDIBLE] math. And then people outside of the field discovered there are a lot of really useful things you can do with computation, and that you can get more profound insights into your questions. And so this kind of came up outside. And now, the field is adjusting.
And so I feel a lot of people in statistics are pretty excited about data science and the methods. And there's a lot of enthusiasm around it. And so there's actually a very good chance a lot of you will end up going into this field, because a lot of people in science do. I know someone, like two years ago, who is now working for Showtime or whatever.
AUDIENCE: [INAUDIBLE]
ETHAN MEYERS: Yeah. Again, some of the methods I use, and probably a lot of methods you use, might be considered more data science than classical statistics. And they can be very useful. OK. I think that's basically all I was going to say. I think we're over a little bit. But--