I was very excited when the venerable, and occasionally irascible, Tim Wilson asked me to review his excellent post, “Tutorial: From 0 to R with Google Analytics”. Why? I feel strongly that R has an important place in the modern analyst’s toolkit. Tutorial posts with actual code will help many practitioners in the Digital Analytics community extend their toolkits.
In this post, I’ll pick up where Tim left off and describe a few useful examples (with code) of how the functionality of R can be extended with ggplot2, a popular R package. I’m not going to go back and explain everything that Tim did in his post; the assumption here is that you went through and executed his script and wound up with a plot that looks something like this:
And a dataframe that looks something like this:
What’s Up with ggplot2?
ggplot2 is a flexible, powerful plotting package for R. It is flexible enough that I can use it for both exploratory data analysis and for production-level plots. I’ve extended Tim’s script to be able to generate some plots for a very basic introduction to this extremely useful package. Download it, customize it based on your RGA access, and run it along with this example.
Let’s look at the code for the basic line plot from Tim’s example:
plot(gaData$date,gaData$sessions,type="l",ylim=c(0,max(gaData$sessions)))
ggplot2 is a bit more complex. You invoke the plot and define its most basic properties (data source, variables) and then layer in the actual plot itself and its extended properties (which can get very dense). Much like R itself, ggplot has a steep learning curve, but it can be better understood by layering in complexity. Let’s build the same chart in ggplot by executing this line:
ggplot(data = gaData, mapping = aes(x = date, y = sessions))
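…which results in nothing! You have invoked the plot object here, but no data is drawn because no plot type has been defined yet (depending on your ggplot2 version, you will see either an error about missing layers or an empty panel with axes). Let’s try this: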
ggplot(data = gaData, mapping = aes(x = date, y = sessions)) + geom_line()
There is our first plot using the ggplot2 package! Finally, we can “clean up” the look of this plot by “layering in” additional elements. In this case we start the y axis at 0 and apply the “bw” theme, a clean black-and-white theme that typically prints well. These layers are easy to organize: each one is added to the previous plot definition with a “+”. It is good practice to put each directive on its own line, ending the preceding line with the “+” so that R knows the expression continues.
ggplot(data = gaData, mapping = aes(x = date, y = sessions)) +
  geom_line() +
  theme_bw() +
  ylim(0, NA)
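If you don’t have RGA access handy, the layering idea can still be tried out. The following is a minimal sketch, assuming a small made-up data frame as a stand-in for `gaData` (the dates and session counts are invented, and the variable names `basePlot` and `linePlot` are just for illustration). It stores the plot in a variable and checks how many layers it holds before and after a geom is added.

```r
library(ggplot2)

# Hypothetical stand-in for the gaData data frame from Tim's script;
# the values are invented purely for illustration.
gaData <- data.frame(
  date = as.Date("2016-01-01") + 0:6,
  sessions = c(120, 135, 160, 150, 110, 90, 95)
)

# The base object holds the data and the aesthetic mapping, but no layers yet.
basePlot <- ggplot(data = gaData, mapping = aes(x = date, y = sessions))
length(basePlot$layers)

# Adding a geom appends a layer; the theme and axis limits adjust
# the look of the plot but are not layers themselves.
linePlot <- basePlot +
  geom_line() +
  theme_bw() +
  ylim(0, NA)
length(linePlot$layers)

# Draw the finished plot.
print(linePlot)
```

Keeping the plot in a variable like this makes it easy to try different layers on the same base without retyping the data and mapping.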
With a single line change, we can plot a column (or vertical bar) chart:
ggplot(data = gaData, mapping = aes(x = date, y = sessions)) + geom_bar(stat = "identity") + theme_bw() + ylim(0,NA)
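As a side note, ggplot2 versions 2.2.0 and later also offer `geom_col()`, which plots the y values as given and so avoids the `stat = "identity"` argument. A minimal sketch, again assuming an invented stand-in for `gaData`:

```r
library(ggplot2)

# Hypothetical stand-in for gaData; the values are invented for illustration.
gaData <- data.frame(
  date = as.Date("2016-01-01") + 0:6,
  sessions = c(120, 135, 160, 150, 110, 90, 95)
)

# geom_col() uses the y values directly, so no stat argument is needed.
colPlot <- ggplot(data = gaData, mapping = aes(x = date, y = sessions)) +
  geom_col() +
  theme_bw() +
  ylim(0, NA)

print(colPlot)
```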
Or as a scatterplot, in this case looking at users vs. pageviews:
ggplot(data = gaData, mapping = aes(x = users, y = pageviews)) + geom_point() + theme_bw() + ylim(0,NA)
Let’s Get Complicated
What happens when we are dealing with multi-dimensional data? ggplot2 is actually pretty great at handling multi-dimensional data in a dataframe. Let’s “complex up” the original data set by pulling a second dataframe that includes an additional dimension, Device Category:
gaDataExt <- get_ga(profile.id = viewid,
                    start.date = "7daysAgo",
                    end.date = "yesterday",
                    metrics = c("ga:users", "ga:sessions", "ga:pageviews"),
                    dimensions = c("ga:date", "ga:deviceCategory"),
                    sort = NULL, filters = NULL, segment = NULL,
                    sampling.level = NULL, start.index = NULL,
                    max.results = NULL, include.empty.rows = NULL,
                    fetch.by = NULL, ga_token)
With the new data, it is relatively simple to create a multi-line line chart:
ggplot(data = gaDataExt, mapping = aes(x = date, y = sessions, color = device.category) ) + geom_line() + theme_bw() + ylim(0,NA)
Or a multi-faceted chart:
ggplot(data = gaDataExt, mapping = aes(x = date, y = sessions) ) + geom_line() + facet_grid(device.category ~ .) + theme_bw() + ylim(0,NA)
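Both charts rely on the data being in “long” form: one row per date and device category, with the category in its own column. Here is a minimal sketch of that shape, assuming an invented stand-in for `gaDataExt` (the column names mirror the ones used above, while the values and the names `devicePlot` and `facetPlot` are made up for illustration):

```r
library(ggplot2)

# Hypothetical stand-in for gaDataExt: one row per date and device category.
# expand.grid() varies the first argument fastest, so the first seven rows
# are desktop, the next seven mobile, and the last seven tablet.
gaDataExt <- expand.grid(
  date = as.Date("2016-01-01") + 0:6,
  device.category = c("desktop", "mobile", "tablet"),
  stringsAsFactors = FALSE
)
gaDataExt$sessions <- c(
  80, 90, 110, 100, 70, 50, 55,  # desktop
  35, 40, 45, 42, 38, 36, 37,    # mobile
  5, 5, 5, 8, 2, 4, 3            # tablet
)

# One line per device category, distinguished by color.
devicePlot <- ggplot(gaDataExt, aes(x = date, y = sessions, color = device.category)) +
  geom_line() +
  theme_bw() +
  ylim(0, NA)

# The same data split into one panel per device category.
facetPlot <- ggplot(gaDataExt, aes(x = date, y = sessions)) +
  geom_line() +
  facet_grid(device.category ~ .) +
  theme_bw() +
  ylim(0, NA)

print(devicePlot)
print(facetPlot)
```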
Ok, now let’s “Lea Pica” this plot a little bit:
ggplot(data = gaData, mapping = aes(x = date, y = sessions, label = sessions)) +
  geom_bar(stat = 'identity', width = 50000) +
  geom_text(aes(y = sessions + .2), vjust = 0) +
  theme_classic() +
  theme(axis.line = element_line(size = 0)) +
  theme(axis.ticks = element_line(size = 0)) +
  theme(axis.ticks.length = unit(0, "cm")) +
  theme(axis.ticks.margin = unit(-.6, "cm")) +
  labs(title = "Sessions Peaked Early Last Week", x = "Date", y = "Total Site Sessions") +
  ylim(0, NA)
This really only scratches the surface – there are a whole lot of different plot types to explore (think “Tableau-level”, not “Excel-level”), with a whole lot of customization. It is a very dense, very complicated piece of software.
Because everything is controlled with R code, an argument could be made that R is the ultimate distributed Business Intelligence visualization and data exploration tool, since R is built around features like replicability, portability, code transparency, and collaboration. I’ll explore this idea further when I discuss what is perhaps my favorite R package, dplyr, in the next post in this series.
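One small illustration of the replicability point: a chart definition can be wrapped in a function and written to a file, so anyone with the script gets the same output. This is only a sketch under stated assumptions. The helper name `plotSessions`, the stand-in data, and the output file name are all hypothetical; `ggsave()` is the ggplot2 function for saving a plot.

```r
library(ggplot2)

# Hypothetical helper: builds the sessions line chart for any data frame
# that has `date` and `sessions` columns.
plotSessions <- function(df, title = "Sessions by Day") {
  ggplot(df, aes(x = date, y = sessions)) +
    geom_line() +
    theme_bw() +
    ylim(0, NA) +
    labs(title = title, x = "Date", y = "Sessions")
}

# Hypothetical stand-in for gaData; the values are invented for illustration.
gaData <- data.frame(
  date = as.Date("2016-01-01") + 0:6,
  sessions = c(120, 135, 160, 150, 110, 90, 95)
)

sessionsPlot <- plotSessions(gaData)

# Write the chart to a PNG in the session's temporary directory.
outFile <- file.path(tempdir(), "sessions.png")
ggsave(outFile, plot = sessionsPlot, width = 8, height = 4, dpi = 150)
```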
Suggested Reading
ggplot2 Documentation – Good luck!
The R Graphics Cookbook is the de facto ggplot2 introductory text
The Best ggplot2 Cheat Sheet, courtesy of RStudio
ggplot2’s Wikipedia Page – ggplot2 was created by Hadley Wickham, whose contributions to R cannot be overstated. He also created and maintains dplyr, which will be the subject of the second post in this series.