Hi. I’m Sharon Machlis at IDG Communications, here with Episode 52 of Do More With R: Crosstabs.
Crosstab reports summarize data by two or more variables. For example: How did people vote on Ballot Question 1, broken down by gender and age group? There are a few ways to generate a crosstab report in R; I’d like to show you some of my favorites.
For this demo I’ll use a subset of the Stack Overflow Developer Survey, with columns for Languages, Gender, and whether they code as a hobby. I also added a LanguageGroup column for whether a developer reported using R, Python, both, or neither.
If you’d like to code along at home, check out the related InfoWorld article at the URL on screen. It’s got all the code from this episode plus info on the data set.
To create my crosstab reports, I’ll be using the janitor and vtree packages. OK. Let’s take a look at the data. You can see the data has 1 row for each survey response, and the 4 columns are all characters. I filtered the raw data to make the crosstabs more manageable, including removing missing values and taking the two largest genders only, Man and Woman.
So, what’s the gender breakdown within each language group? There are a lot of ways to do this in R. But for crosstab reporting in a data frame, one of my go-to’s is the janitor package’s tabyl() function (that’s t-a-b-y-l). It is so easy to use.
The basic tabyl() function gives you a data frame with raw counts. The first column name you pass to tabyl() becomes the rows; the second one becomes the columns.
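To make that concrete, here is a minimal sketch of a basic tabyl() call. The data frame is a tiny made-up stand-in for the survey subset: the name mydata, the Hobbyist column name, and every value in it are assumptions for illustration, not the real survey data.

```r
library(janitor)

# Hypothetical toy data standing in for the filtered survey subset
mydata <- data.frame(
  LanguageGroup = c("R", "Python", "Both", "Neither", "Python", "R", "Neither", "Python"),
  Gender        = c("Woman", "Man", "Man", "Man", "Woman", "Man", "Woman", "Man"),
  Hobbyist      = c("Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# First variable becomes the rows, second becomes the columns
counts <- tabyl(mydata, LanguageGroup, Gender)
print(counts)
```

The result is an ordinary data frame with one row per language group and one count column per gender, so it can be passed on to other functions like any other data frame.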
What’s nice about tabyl() is it’s very easy to generate percents, too. If you want to see percents for each column instead of raw totals, add adorn_percentages("col"). You can then pipe those results into a formatting function such as adorn_pct_formatting().
If you’d like to see percents by row, add adorn_percentages("row").
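A short sketch of both percent options follows. It again uses a made-up toy data frame (mydata and its values are assumptions), and the digits = 1 setting is just one possible formatting choice.

```r
library(janitor)

# Hypothetical toy data standing in for the filtered survey subset
mydata <- data.frame(
  LanguageGroup = c("R", "Python", "Both", "Neither", "Python", "R", "Neither", "Python"),
  Gender        = c("Woman", "Man", "Man", "Man", "Woman", "Man", "Woman", "Man"),
  stringsAsFactors = FALSE
)

# Proportions within each Gender column (each column sums to 1)
col_pct <- tabyl(mydata, LanguageGroup, Gender) %>%
  adorn_percentages("col")

# Proportions within each LanguageGroup row (each row sums to 1)
row_pct <- tabyl(mydata, LanguageGroup, Gender) %>%
  adorn_percentages("row")

# Turn the proportions into formatted percent strings for display
print(adorn_pct_formatting(col_pct, digits = 1))
print(adorn_pct_formatting(row_pct, digits = 1))
```

The %>% pipe used here is the one that janitor re-exports, so no extra package needs to be loaded for it.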
If you want to add a third variable, such as Hobbyist, that’s easy too. However, it gets a little harder to visually compare results in more than 2 levels this way; this code returns a list with one data frame for each third-level choice.
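Here is a rough sketch of the three-variable case, using a made-up toy data frame (mydata, the Hobbyist column name, and all values are assumptions). It shows the list structure that comes back: one two-way table per value of the third variable.

```r
library(janitor)

# Hypothetical toy data standing in for the filtered survey subset
mydata <- data.frame(
  LanguageGroup = c("R", "Python", "Both", "Neither", "Python", "R", "Neither", "Python"),
  Gender        = c("Woman", "Man", "Man", "Man", "Woman", "Man", "Woman", "Man"),
  Hobbyist      = c("Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Third variable splits the result into a list of two-way tables
three_way <- tabyl(mydata, LanguageGroup, Gender, Hobbyist)

# One list element per Hobbyist value
print(names(three_way))

# Each element is a LanguageGroup-by-Gender table for that Hobbyist value
print(three_way$Yes)
```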
By the way, if you want a super easy way to plot 2-way crosstabs, check out the CGPfunctions package. It’s got 2 functions of interest here: PlotXTabs() and PlotXTabs2(). That’s PlotXTabs(). And this is PlotXTabs2(). If you don’t need the statistical summaries, remove them with results.subtitle = FALSE.
Now let’s look at vtree. It generates graphics for crosstabs.
Running the main vtree() function on one variable gets you this basic response.
I’m not keen on the color defaults here, but you can swap in an RColorBrewer palette. Vtree’s palette argument uses palette numbers, not names; you can see how they’re numbered in the vtree package documentation. I’m going to choose 3 for Greens and later 5 for Purples. Having the lower-number value with a more intense color doesn’t make sense for me here, but I can add sortfill = TRUE to use the more intense color for the higher value.
If you find the dark color makes it hard to read text, there are some options. One is to use the plain argument. Another option is to set a single fill color instead of a palette, using the fillcolor argument.
For a 2-way crosstab, just add a second variable – and palette or color if you don’t want the default. You can use the plain option here too. I also rotated the graphic so it reads vertically. And that’s specifying two colors instead of two palettes.
You can add more than 2 categories, although it gets a bit harder to read and follow as the tree grows. If you’re only interested in some of the branches, you can specify only some to display with the keep argument — like here, if I only want to see people using R without Python or both R and Python.
With the tree getting so busy, I think it helps to have either the count or the percent as node labels, not both. So in this last line, I set showcount to FALSE to only see percents.
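As a hedged sketch of how those vtree options fit together, the call below builds a three-level tree on a made-up toy data frame. The name mydata, the Hobbyist column name, and the "R" and "Both" labels are assumptions about how the data is coded, so adjust them to match your own columns.

```r
library(vtree)

# Hypothetical toy data standing in for the filtered survey subset
mydata <- data.frame(
  LanguageGroup = c("R", "Python", "Both", "Neither", "Python", "R", "Neither", "Python"),
  Gender        = c("Woman", "Man", "Man", "Man", "Woman", "Man", "Woman", "Man"),
  Hobbyist      = c("Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

tree <- vtree(
  mydata,
  c("Gender", "LanguageGroup", "Hobbyist"),
  horiz = FALSE,                                # draw the tree top to bottom
  keep = list(LanguageGroup = c("R", "Both")),  # show only these branches
  showcount = FALSE                             # label nodes with percents only
)

# Display the tree (renders in the viewer in an interactive session)
tree
```

Swapping showcount = FALSE for showpct = FALSE would label the nodes with counts only instead.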
There are other good ways to group and count in R including base R, dplyr and data.table. Base R has xtabs(). Note the formula syntax: tilde, and then one variable plus another variable.
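The xtabs() formula syntax looks like this in practice. This is a minimal sketch on a made-up toy data frame (mydata and its values are assumptions).

```r
# Hypothetical toy data standing in for the filtered survey subset
mydata <- data.frame(
  LanguageGroup = c("R", "Python", "Both", "Neither", "Python", "R", "Neither", "Python"),
  Gender        = c("Woman", "Man", "Man", "Man", "Woman", "Man", "Woman", "Man"),
  stringsAsFactors = FALSE
)

# Tilde, then one variable plus another; nothing goes on the left-hand side
# because we are counting rows rather than summing a column
xt <- xtabs(~ LanguageGroup + Gender, data = mydata)
print(xt)
```

Unlike tabyl(), xtabs() returns a table object rather than a data frame, so it may need converting with as.data.frame() before further data-frame work.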
In the first dplyr code group, I use dplyr’s count() function. It combines “group by” and “count rows in each group” into a single function. In the second code group, I create a data.table from my data and then use the special dot-N data.table symbol that stands for number of rows in a group.
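Both approaches can be sketched side by side. The toy data frame below is made up (mydata, the Hobbyist column name, and all values are assumptions), and the object names dplyr_counts and dt_counts are just illustrative.

```r
library(dplyr)
library(data.table)

# Hypothetical toy data standing in for the filtered survey subset
mydata <- data.frame(
  LanguageGroup = c("R", "Python", "Both", "Neither", "Python", "R", "Neither", "Python"),
  Gender        = c("Woman", "Man", "Man", "Man", "Woman", "Man", "Woman", "Man"),
  Hobbyist      = c("Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# dplyr: count() groups by the listed columns and counts rows per group;
# the count lands in a column named n by default
dplyr_counts <- dplyr::count(mydata, LanguageGroup, Gender, Hobbyist)
print(dplyr_counts)

# data.table: .N is the number of rows in each group defined by `by`
mydt <- as.data.table(mydata)
dt_counts <- mydt[, .N, by = .(LanguageGroup, Gender, Hobbyist)]
print(dt_counts)
```

Both results are in long format, with one row per combination that actually occurs in the data, which is the shape ggplot2 expects.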
Finally, I can use ggplot to visualize the summarized results. The first ggplot graph plots LanguageGroup on the X axis, and the count for each on the Y axis. Fill color is whether someone says they code as a hobby. And, facet_wrap says: Make a separate graph for each value in the Gender column. You can see there are relatively few women in the sample, so it’s hard to compare percentages across genders if both use the same Y-axis scale. I can change that, so each graph uses a separate scale, if I add scales = "free_y".
Now it’s a bit easier to compare multiple variables by gender.
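A minimal sketch of that faceted plot is below. It uses a made-up toy data frame (mydata, the Hobbyist column name, and all values are assumptions), and geom_col() is one reasonable choice for drawing pre-counted bars rather than necessarily the exact geom used on screen.

```r
library(ggplot2)

# Hypothetical toy data standing in for the filtered survey subset
mydata <- data.frame(
  LanguageGroup = c("R", "Python", "Both", "Neither", "Python", "R", "Neither", "Python"),
  Gender        = c("Woman", "Man", "Man", "Man", "Woman", "Man", "Woman", "Man"),
  Hobbyist      = c("Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Summarize first: one row per combination, with the count in column n
counted <- dplyr::count(mydata, LanguageGroup, Gender, Hobbyist)

# One panel per Gender; scales = "free_y" gives each panel its own Y axis
p <- ggplot(counted, aes(x = LanguageGroup, y = n, fill = Hobbyist)) +
  geom_col() +
  facet_wrap(vars(Gender), scales = "free_y")

print(p)
```

Dropping the scales argument puts both panels back on a shared Y axis, which is the version where the smaller group is hard to read.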
There you have it: crosstab counting and visualizing by groups in R.
That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at bit-dot-l-y slash do more with R, all lowercase except for the R.
You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel — where you can subscribe so you never miss an episode. Hope to see you next time. Stay healthy and safe, everyone!