How to create ggplot labels in R

Hi. I’m Sharon Machlis at IDG Communications, here with Episode 54 of Do More With R: Add text labels to your ggplot graphs.

Labeling all or some of your data with text can be helpful in telling a story – even when your graph is using other cues like color and size. ggplot has built-in ways of doing this. And, the ggrepel package adds some functionality to those. Let’s take a look at how those work.

For this demo, I’ll start with a scatter plot looking at known Covid-19 cases per capita in Massachusetts counties. Here I’m interested in whether percent of population with at least a 4-year college degree has any relationship to the virus infection rate. (My theory is that college education may mean you’re more likely to have a job that lets you work safely from home – although of course there are lots of exceptions).

If you want to follow along, you can get the code to re-create this data at the associated InfoWorld article.

In this code, I load 3 libraries I’ll need: ggplot2, ggrepel, and dplyr. I’m also setting scipen to 999 so I don’t get scientific notation in my graphs.
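
A minimal sketch of that setup, assuming the three packages are already installed:

```r
# Load the three packages used in this demo
library(ggplot2)
library(ggrepel)
library(dplyr)

# A large scipen penalty makes R prefer fixed over scientific notation
options(scipen = 999)

# Quick check: prints 100000000 rather than 1e+08
print(100000000)
```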

OK, let’s look at the data. I’ve got 7 variables. The ones I’m interested in are Place, Adult population, PctBachelors degree, known rate of Covid infections per 100,000 people, and region of the state.

The next group of code creates a ggplot scatter plot with that data. In the first line, percent with a bachelor’s degree is the x axis, and known rate of Covid per 100,000 people is the y axis. I’m sizing my points by total county population and coloring them by region. geom_point() creates the scatter plot, geom_smooth() adds a linear regression line, and then I do some tweaking to the ggplot design defaults. The graph is stored in a variable called ma_graph. Let’s see what that looks like.

To see which point is which county, I can add labels. Here’s what the default geom_text() function produces.
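
Here is a rough sketch of what a graph like that could look like in code. The data frame below is a made-up stand-in with placeholder county names and invented numbers, not the real Massachusetts data, and the column names (Place, AdultPop, PctBachelors, CovidPer100K, Region) are assumptions based on the description above. Giving geom_smooth() its own x and y mapping is also a choice made for this sketch, so the line ignores the size and color mappings.

```r
library(ggplot2)

# Hypothetical stand-in data: placeholder counties, invented values
ma_data <- data.frame(
  Place        = c("County A", "County B", "County C", "County D", "County E", "County F"),
  AdultPop     = c(50000, 120000, 300000, 800000, 1200000, 90000),
  PctBachelors = c(22, 28, 35, 45, 55, 30),
  CovidPer100K = c(900, 750, 1100, 1300, 1000, 600),
  Region       = c("West", "West", "Central", "Metro Boston", "Metro Boston", "Southeast")
)

ma_graph <- ggplot(ma_data,
                   aes(x = PctBachelors, y = CovidPer100K,
                       size = AdultPop, color = Region)) +
  geom_point() +
  # One overall linear regression line, without a confidence band
  geom_smooth(aes(x = PctBachelors, y = CovidPer100K), inherit.aes = FALSE,
              method = "lm", formula = y ~ x, se = FALSE,
              color = "grey40", linetype = "dotted") +
  theme_minimal() +
  guides(size = "none")  # hide the point-size legend

ma_graph
```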

geom_text() uses the same color and size aesthetics as the graph by default. Sizing the text along with the points makes the small points’ labels hard to read. I can stop that behavior by setting size to NULL. It’s still hard to read the labels with them right over the points, so I can “nudge” them a bit higher with the nudge_y argument. It looks like some of the text is cut off, but that’s partly the display pane size. That’s a bit better. There’s another built-in ggplot labeling function called geom_label(), which adds a box around the text.
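
The labeling steps just described could be sketched like this. The data is again a made-up stand-in with assumed column names, size = NULL is assumed to go inside the label layer's aes(), and the nudge_y value of 50 is an arbitrary amount chosen to suit these invented numbers.

```r
library(ggplot2)

# Hypothetical stand-in data: placeholder counties, invented values
ma_data <- data.frame(
  Place        = c("County A", "County B", "County C", "County D", "County E", "County F"),
  AdultPop     = c(50000, 120000, 300000, 800000, 1200000, 90000),
  PctBachelors = c(22, 28, 35, 45, 55, 30),
  CovidPer100K = c(900, 750, 1100, 1300, 1000, 600),
  Region       = c("West", "West", "Central", "Metro Boston", "Metro Boston", "Southeast")
)

base_graph <- ggplot(ma_data, aes(x = PctBachelors, y = CovidPer100K,
                                  size = AdultPop, color = Region)) +
  geom_point() +
  theme_minimal() +
  guides(size = "none")

# Default: labels inherit both the color and the size mapping
p_default <- base_graph + geom_text(aes(label = Place))

# size = NULL drops the inherited size mapping, so all labels are the same size
p_fixed <- base_graph + geom_text(aes(label = Place, size = NULL))

# nudge_y shifts each label upward, in y-axis units
p_nudged <- base_graph + geom_text(aes(label = Place, size = NULL), nudge_y = 50)

# geom_label() takes the same arguments and draws a box around the text
p_boxed <- base_graph + geom_label(aes(label = Place, size = NULL), nudge_y = 50)

p_nudged
```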

That’s all fine when your points are well spaced out. But what if they’re not? I added a fake data point close to Middlesex County so you can see what I mean. Let me create the same graph with the new fake data. Do you see the points near each other now at the right of the graph? Let’s add labels now. You can see that by default, the “Middlesex” and “Fake” labels are on top of each other.

ggrepel has its own versions of those text and label geom functions: geom_text_repel() and geom_label_repel(). Here’s the default text. ggrepel automatically moved the “Fake” label below its point so the text isn’t overlapping anymore. This is what geom_label_repel() looks like. I set color to NULL to make it a bit easier to see the label text. I can use the same nudge_y argument to create more space between the labels and the points. You can see that geom_label_repel() even added little pointer lines for Middlesex and Fake.
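
As a sketch of the ggrepel versions, the example below adds a “Fake” point right next to another point in a made-up stand-in data set (placeholder counties, invented values, assumed column names). Where ggrepel ends up placing each label depends on the data and plot size, so the result will not match the video exactly.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical stand-in data, plus a "Fake" point close to County E
ma_data_fake <- data.frame(
  Place        = c("County A", "County B", "County C", "County D", "County E", "County F", "Fake"),
  AdultPop     = c(50000, 120000, 300000, 800000, 1200000, 90000, 1000000),
  PctBachelors = c(22, 28, 35, 45, 55, 30, 54.5),
  CovidPer100K = c(900, 750, 1100, 1300, 1000, 600, 1010),
  Region       = c("West", "West", "Central", "Metro Boston", "Metro Boston",
                   "Southeast", "Metro Boston")
)

ma_graph2 <- ggplot(ma_data_fake, aes(x = PctBachelors, y = CovidPer100K,
                                      size = AdultPop, color = Region)) +
  geom_point() +
  theme_minimal() +
  guides(size = "none")

# Repelled plain-text labels: overlapping labels get pushed apart
p_text <- ma_graph2 + geom_text_repel(aes(label = Place, size = NULL))

# Repelled boxed labels; color = NULL keeps the label text in the default color
p_label <- ma_graph2 +
  geom_label_repel(aes(label = Place, size = NULL, color = NULL),
                   nudge_y = 50)  # arbitrary nudge for these invented values

p_label
```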

I can label only some of the points I want viewers to focus on. One way is by subsetting the data within the geom_label_repel() function, as you can see in the last line of code. Now only the Metro Boston points are labelled.
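
One way to sketch that subsetting, again with made-up stand-in data and assumed column names. Base R's subset() is used here; dplyr's filter() would work the same way.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical stand-in data: placeholder counties, invented values
ma_data_fake <- data.frame(
  Place        = c("County A", "County B", "County C", "County D", "County E", "County F", "Fake"),
  AdultPop     = c(50000, 120000, 300000, 800000, 1200000, 90000, 1000000),
  PctBachelors = c(22, 28, 35, 45, 55, 30, 54.5),
  CovidPer100K = c(900, 750, 1100, 1300, 1000, 600, 1010),
  Region       = c("West", "West", "Central", "Metro Boston", "Metro Boston",
                   "Southeast", "Metro Boston")
)

ma_graph2 <- ggplot(ma_data_fake, aes(x = PctBachelors, y = CovidPer100K,
                                      size = AdultPop, color = Region)) +
  geom_point() +
  theme_minimal() +
  guides(size = "none")

# All points are still drawn; only the label layer gets the subset
p_subset <- ma_graph2 +
  geom_label_repel(
    data = subset(ma_data_fake, Region == "Metro Boston"),
    aes(label = Place, size = NULL, color = NULL),
    nudge_y = 50  # arbitrary nudge for these invented values
  )

p_subset
```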

There’s a lot of customization you can do with ggrepel. For example, you can set the size and color of those pointer lines: segment.size sets the line width, and segment.color sets the line color. The direction argument says whether overlapping text should be shifted along the x axis (horizontally), the y axis (vertically), or both.

You can even turn those lines into arrows with the arrow argument, as in the second code block.
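
A sketch of both customizations, using made-up stand-in data with assumed column names. The specific values (0.2, "grey50", the arrow length) are illustrative choices, not settings taken from the video.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical stand-in data: placeholder counties, invented values
ma_data_fake <- data.frame(
  Place        = c("County A", "County B", "County C", "County D", "County E", "County F", "Fake"),
  AdultPop     = c(50000, 120000, 300000, 800000, 1200000, 90000, 1000000),
  PctBachelors = c(22, 28, 35, 45, 55, 30, 54.5),
  CovidPer100K = c(900, 750, 1100, 1300, 1000, 600, 1010),
  Region       = c("West", "West", "Central", "Metro Boston", "Metro Boston",
                   "Southeast", "Metro Boston")
)
metro <- subset(ma_data_fake, Region == "Metro Boston")

ma_graph2 <- ggplot(ma_data_fake, aes(x = PctBachelors, y = CovidPer100K,
                                      size = AdultPop, color = Region)) +
  geom_point() +
  theme_minimal() +
  guides(size = "none")

# Thin grey pointer lines; labels may only be shifted horizontally
p_lines <- ma_graph2 +
  geom_label_repel(data = metro,
                   aes(label = Place, size = NULL, color = NULL),
                   nudge_y = 50,
                   segment.size = 0.2,
                   segment.color = "grey50",
                   direction = "x")

# Pointer lines drawn as arrows; arrow() and unit() are re-exported by ggplot2
p_arrows <- ma_graph2 +
  geom_label_repel(data = metro,
                   aes(label = Place, size = NULL, color = NULL),
                   nudge_y = 50,
                   arrow = arrow(length = unit(0.02, "npc")))

p_arrows
```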

One more point: You can also use ggrepel to label lines if you have a multi-series line graph.

I loaded another data frame, mydf, with quarterly unemployment data for 4 US states. It’s got 3 columns: the Rate, the State, and the Quarter. I’ll make a ggplot graph and save it to the variable graph2. It’s a little hard here to see which line goes with what state; you have to look back and forth between the lines and the legend. So, I’d like to add a text label for each line. And I’d like it to point to “just before the end of each line” and not at the end of each line.

To do that, I’m going to point my labels to the 2nd-to-last quarter and not the last quarter, so I’ll calculate what that is. Now I’ll add geom_label_repel() just for the data in the second-to-last quarter – that’s the filtered data in my first argument. Next I’m telling geom_label_repel() to use the State column as the label, to “nudge” the data .75 horizontally, and to remove all the other data points. Finally, I’m getting rid of the graph’s default legend, since I won’t need it anymore. And now I have my labels for each line.
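
Those steps could be sketched as follows. The data frame is a made-up stand-in with placeholder state names and invented rates, and storing Quarter as a Date is an assumption. This sketch relies on the filtered data alone to limit the labels to one per line.

```r
library(ggplot2)
library(ggrepel)
library(dplyr)

# Hypothetical stand-in data: 4 placeholder states, 4 quarters, invented rates
mydf <- data.frame(
  State   = rep(c("State A", "State B", "State C", "State D"), each = 4),
  Quarter = rep(as.Date(c("2019-01-01", "2019-04-01", "2019-07-01", "2019-10-01")),
                times = 4),
  Rate    = c(3.1, 3.0, 2.9, 2.8,
              4.2, 4.1, 4.3, 4.0,
              5.0, 4.8, 4.7, 4.9,
              3.6, 3.7, 3.5, 3.4)
)

graph2 <- ggplot(mydf, aes(x = Quarter, y = Rate, color = State, group = State)) +
  geom_line() +
  theme_minimal()

# Latest quarter that is not the very last one
second_to_last_quarter <- max(mydf$Quarter[mydf$Quarter != max(mydf$Quarter)])

graph2_labeled <- graph2 +
  geom_label_repel(
    data = filter(mydf, Quarter == second_to_last_quarter),  # one row per state
    aes(label = State),
    nudge_x = 0.75
  ) +
  theme(legend.position = "none")  # labels replace the legend

graph2_labeled
```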

Why not label the last quarter? I tried that first, and the pointer lines looked like part of the graph.

If you want to find out more about ggrepel, check out the ggrepel vignette.

That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at bit-dot-l-y slash do more with R, all lowercase except for the R.

You can also find the Do More With R playlist on YouTube’s IDG Tech Talk channel — where you can subscribe so you never miss an episode. Hope to see you next time. Stay healthy and safe, everyone!
