How to Read and Use Histograms
The histogram is one of my favorite chart types, and for analysis purposes, it's probably the one I use the most. Devised by Karl Pearson (the father of mathematical statistics) in the late 1800s, it's geometrically simple, robust, and allows you to see the distribution of a dataset.
One of the most common mistakes is to interpret histograms as if they were bar charts. This is understandable, as they're visually similar. Both use bars placed side-by-side, and bar height is a main visual cue, but the small differences between them change interpretation significantly.
The main difference, shown in the graphic on the right, is that bar charts show categorical data (and sometimes time series data), whereas histograms show a continuous variable on the horizontal axis. Also, the visual cue for a value in a bar chart is bar height, whereas a histogram uses area, i.e., width times height.
This means you read the two chart types differently. The bar chart is for categories, and the histogram is for distributions. The latter lets you see the spread of a single variable, and it might skew to the left or right, clump in the middle, spike at low and high values, etc. Naturally, it varies by dataset.
Finally, because histograms use area instead of height to represent values, the width of bars can vary, although bar widths are typically the same. Varying the width is usually done to see the long tail better or to view dense areas with less noise.
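To make the area idea concrete, here's a minimal sketch with simulated data (not the NBA data used later). The variable x, the break points in brks, and the seed are all made up for illustration. It bins a long-tailed variable with unequal widths and checks that width times height gives the share of values in each bin.

```r
# Hypothetical, simulated data: a right-skewed variable with a long tail.
set.seed(1)
x <- rexp(500, rate = 1/10)

# Narrow bins where the data is dense, wider bins out in the tail.
# The last break is at least 100 and always covers the largest value.
brks <- c(0, 5, 10, 15, 20, 30, 50, max(100, ceiling(max(x))))
h <- hist(x, breaks = brks, plot = FALSE)

# Bar height is the density, so area = width * height
# = proportion of observations that fall in the bin.
widths <- diff(h$breaks)
areas <- widths * h$density

round(areas, 3)  # share of values in each bin
sum(areas)       # the areas add up to 1
```

Drop plot = FALSE if you want to see the chart. The wide bars in the tail come out short, because what carries the value is the area, not the height.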
For a working example, we'll look to the classic one: the height of a group of people. More specifically, the height of NBA basketball players of the 2013-14 season. The data is in a downloadable format at the end of a post by Best Tickets. For preservation, I've also included the data file in the download of this tutorial.
If you don't have R downloaded and installed yet, now is a good time to do that. It's free, it's open source, and it's a statistical computing language worth learning if you play with data a lot. Download it here.
Also set your working directory to wherever you saved the code for this tutorial. Assuming you have the R console open, load the CSV file with read.csv().

    # Load the data.
    players <- read.csv("nba-players.csv", stringsAsFactors=FALSE)

First, a bar chart. It doesn't make much sense to make one for all the players, but you can make one for just the players on the Golden State Warriors.

    warriors <- subset(players, Team=="Warriors")
    warriors.o <- warriors[order(warriors$Ht_inches),]
    par(mar=c(5,10,5,5))
    barplot(warriors.o$Ht_inches, names.arg=warriors.o$Name, horiz=TRUE, border=NA, las=1, main="Heights of Golden State Warriors")

Similarly, you can make one for the average height of players, for each position.

    avgHeights <- aggregate(Ht_inches ~ POS, data=players, mean)
    avgHeights.o <- avgHeights[order(avgHeights$Ht_inches, decreasing=FALSE),]
    barplot(avgHeights.o$Ht_inches, names.arg=avgHeights.o$POS, border=NA, las=1)
In the first bar chart, there's a bar for each player, but this takes up a lot of space and is limited in the amount of information it shows. The second one only shows aggregates, and you miss out on variation within the groups.
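Here's a small sketch of what "missing out on variation" can look like. The data frame toy and its two positions, A and B, are made up for illustration. Only the column names POS and Ht_inches are borrowed from the tutorial data.

```r
# Hypothetical heights (inches) for two made-up positions, not the NBA data.
toy <- data.frame(
  POS = rep(c("A", "B"), each = 5),
  Ht_inches = c(76, 77, 78, 79, 80,   # position A: tightly packed
                70, 74, 78, 82, 86)   # position B: widely spread
)

# Both averages are 78, so a bar chart of means would show two equal bars.
means <- aggregate(Ht_inches ~ POS, data = toy, mean)
means

# The gap between shortest and tallest is 4 inches for A and 16 for B.
spreads <- aggregate(Ht_inches ~ POS, data = toy,
                     FUN = function(h) diff(range(h)))
spreads
```

Two groups can share an average and still look nothing alike, which is the kind of thing a distribution view picks up.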
Let's try a different route. Imagine you arranged players into several groups by height. There's a group for every inch. That is, if someone is 78 inches tall, they go to the group where everyone else is 78 inches tall. Do that for every inch, and then arrange the groups in increasing order.
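If you want to see the grouping as numbers first, table() is one way to sketch it. The heights vector below is made up for illustration. With the tutorial data you would pass players$Ht_inches instead.

```r
# Hypothetical heights in inches, standing in for players$Ht_inches.
heights <- c(78, 75, 81, 78, 79, 75, 78, 83, 81, 78)

# table() puts equal values in the same group and sorts the groups
# in increasing order, like lining people up by height.
groups <- table(heights)
groups
```

Note that table() only lists heights that actually occur, so an inch with no players simply doesn't show up as a group.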
You can kind of do this in graph form. But substitute the players with dots, one for each player.

    htrange <- range(players$Ht_inches) # 69 to 87 inches
    cnts <- rep(0, 20)   # running count for each one-inch group
    y <- c()             # vertical position of each player's dot
    for (i in 1:length(players[,1])) {
        # Index of this player's height group, starting at 1 for the shortest.
        cntIndex <- players$Ht_inches[i] - htrange[1] + 1
        cnts[cntIndex] <- cnts[cntIndex] + 1
        y <- c(y, cnts[cntIndex])
    }
    # Draw empty axes first, then add one dot per player.
    plot(players$Ht_inches, y, type="n", main="Player heights", xlab="inches", ylab="count")
    points(players$Ht_inches, y, pch=21, col=NA, bg="#999999")
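The loop above keeps a running count by hand. As an alternative sketch (not how the tutorial does it), base R's ave() can compute the same stacking position in one line. The heights vector is again made up for illustration.

```r
# Hypothetical heights in inches, standing in for players$Ht_inches.
heights <- c(78, 75, 81, 78, 79, 75, 78, 83, 81, 78)

# Within each height group, number the players 1, 2, 3, ... in the order
# they appear. That number is how high the player's dot sits in its stack.
y <- ave(seq_along(heights), heights, FUN = seq_along)
y

# One dot per player, stacked by height.
plot(heights, y, pch = 21, col = NA, bg = "#999999",
     main = "Player heights", xlab = "inches", ylab = "count")
```

The fourth 78-inch player gets y = 4, so that dot lands on top of a stack of four.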
You get a chart that gives you a sense of how tall people are in the NBA. The bulk of people are in that 75- to 83-inch range, with fewer people in the super tall or relatively short range. For reference, the average height of a man in the United States is 5 feet 10 inches.
Notice that each bar represents the number of people who are a certain height instead of the actual height of a player, like you saw in the bar charts at the beginning of this tutorial. Looks like you got yourself a histogram.
You don't have to actually count every player every time, though. There's a function in R, hist(), that can do that for you. Pass player heights into the first argument, and you're good. You can also change the size of groups, or bins, as they're called in stat lingo. Instead of a bin for every inch, you could make bins in five-inch intervals. For example, there could be a bin for 71
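One way to convince yourself that the stacks really are histogram bars is to compare a manual tally with the counts that base R's hist() function computes for one-inch bins. This is only a sketch: the heights vector is made up, and putting the breaks at the half-inch marks is my assumption for keeping each whole-inch height in its own bin.

```r
# Hypothetical heights in inches, standing in for players$Ht_inches.
heights <- c(78, 75, 81, 78, 79, 75, 78, 83, 81, 78)

# Breaks at the half-inch marks, so each whole-inch height gets its own bin.
brks <- seq(min(heights) - 0.5, max(heights) + 0.5, by = 1)
h <- hist(heights, breaks = brks, plot = FALSE)

# Manual tally by inch, keeping empty inches so the lengths line up.
tally <- table(factor(heights, levels = min(heights):max(heights)))

h$counts          # counts per one-inch bin
as.vector(tally)  # same numbers from the manual tally
```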