ggplot2 ...

ggplot2 - data and geoms

Last week we covered the ggplot2 package and its role in producing data visualizations, graphs, and figures. We began with the notion that a data visualization consists at a minimum of three components:

  • data
  • some geometric idea or concept to visualize/represent the data
  • a coordinate system

In the context of the ggplot2 package these three components constitute layers that we iteratively add to produce a plot or graph.

the ggplot() function

The function takes one important argument, data, and constitutes the base layer of our plot.

library(ggplot2)

x <- seq(from = -5, to = 5, by = 0.1)
y <- sin(x)
z <- ifelse(y > 0, "positive", "negative")

z <- as.factor(z)

my_data <- data.frame(x, y, z)

ggplot(data = my_data)

The above code generates an empty plot. It initiates a canvas, if you will.

Let’s add a geometric idea … how about a scatter plot. To set geometric ideas we use one of the many geom_*() functions, such as:

One Variable:

  • geom_histogram()
  • geom_density()
  • geom_bar()

Two Variables:

  • geom_point()
  • geom_line()
  • geom_boxplot()

Scatterplots are basically points so here we’d use geom_point() to instruct R to draw a scatterplot.

ggplot(data = my_data) + geom_point()

The problem with the above code is that R does not know which components of the data are supposed to be mapped to which aesthetic attributes of the points in our scatterplot. In fact, R produces an error:

Error: geom_point requires the following missing aesthetics: x, y

All geometric ideas or geoms require some kind of mapping from data to an aesthetic attribute. Points have many aesthetic attributes such as:

  • color
  • size
  • shape
  • alpha (opacity)

Importantly one of their aesthetic characteristics is the location on our canvas: here in terms of x and y coordinates.

To map data to these aesthetic attributes of our geometric ideas we use the aes() function. The aes() function can be used inside of the ggplot() function or inside of the geom_point() function.

ggplot(data = my_data) + 
	geom_point(aes(x = x, y = y))
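
Both placements of aes() draw the same scatterplot. The sketch below is a minimal illustration, assuming ggplot2 is installed; the object names p_global, p_local, d_global, and d_local are made up for this example. Aesthetics set inside ggplot() are inherited by every layer added later, whereas aesthetics set inside a geom apply to that layer only.

```r
library(ggplot2)

x <- seq(from = -5, to = 5, by = 0.1)
my_data <- data.frame(x = x, y = sin(x))

# mapping set globally: inherited by all layers added later
p_global <- ggplot(data = my_data, aes(x = x, y = y)) +
	geom_point()

# mapping set locally: applies to this geom_point() layer only
p_local <- ggplot(data = my_data) +
	geom_point(aes(x = x, y = y))

# layer_data() returns the data a layer actually draws;
# both plots place the points at identical positions
d_global <- layer_data(p_global)
d_local <- layer_data(p_local)
all.equal(d_global$x, d_local$x)
all.equal(d_global$y, d_local$y)
```

The global form tends to be convenient when several geoms share the same x and y; the local form is useful when layers need different mappings.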




mapping data or constants to aesthetic attributes

To change the size of all of our dots or points we set aesthetic attributes to constants. Importantly we have to do so outside of the aes() function but inside of geom_point():

ggplot(data = my_data) + 
	geom_point(aes(x = x, y = y), size = 4)




If we want the aesthetic attributes to vary with the data (i.e., map the aesthetic property to data) we need to do so inside of the aes() function. Let’s map the variable z to the color aesthetic. Let’s also set the opacity to a constant by means of the alpha aesthetic.

ggplot(data = my_data) + 
	geom_point(aes(x = x, y = y, color = z), size = 4, alpha = 0.4) 
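
A common slip is to place a constant inside of aes(). ggplot2 then treats the constant as a data variable with a single category, assigns it a color from the default palette, and adds a legend. The sketch below contrasts the two placements; the object names p_mapped and p_constant are made up for this illustration.

```r
library(ggplot2)

x <- seq(from = -5, to = 5, by = 0.1)
my_data <- data.frame(x = x, y = sin(x))

# inside aes(): "blue" is treated as data (a single category), not as a color
p_mapped <- ggplot(data = my_data) +
	geom_point(aes(x = x, y = y, color = "blue"))

# outside aes(): "blue" is a constant, so every point is drawn in blue
p_constant <- ggplot(data = my_data) +
	geom_point(aes(x = x, y = y), color = "blue")

# colors actually used for drawing
unique(layer_data(p_mapped)$colour)    # a default palette color, not "blue"
unique(layer_data(p_constant)$colour)  # "blue"
```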




coordinate systems

The final component of a graph or visualization in the context of ggplot2 is the coordinate system. If not explicitly added to a plot as a layer, ggplot2 will default to the Cartesian coordinate system. For laughs let’s change the default and set our plot to the polar coordinate system.

ggplot(data = my_data) + 
	geom_point(
		aes(x = x, y = y, color = z),
		size = 4, 
		alpha = 0.4
	) +
	coord_polar(theta = "y")




additional layers

Additional layers can be added to a plot. For example, you may want to add another geometric component: say lines in addition to points. Let’s also revert to the Cartesian coordinate system and explicitly set it by way of the coord_cartesian() function. In the code below, do note that some aesthetic features are set to constants and others are mapped to data inside of the aes() function.

ggplot(data = my_data) + 
	geom_line(
		aes(x = x, y = y), 
		color = "blue",
		size = 1,
		linetype = "solid"
	) +
	geom_point(
		aes(x = x, y = y, color = z),
		size = 4, 
		alpha = 0.4
	) +
	geom_hline(yintercept = 0, color = "red") +
	coord_cartesian()
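
Because a plot is an ordinary R object, the layers can also be added one at a time rather than in a single chained expression. The sketch below is one way to do this under the same toy data; the object name p is made up for this illustration.

```r
library(ggplot2)

x <- seq(from = -5, to = 5, by = 0.1)
my_data <- data.frame(x = x, y = sin(x))

# base layer: data plus the mappings shared by all geoms
p <- ggplot(data = my_data, aes(x = x, y = y))

# add geoms one at a time; each `+` returns a new plot object
p <- p + geom_line(color = "blue")
p <- p + geom_point(alpha = 0.4)
p <- p + geom_hline(yintercept = 0, color = "red")

# inspect the data behind an individual layer (here: the points, layer 2)
head(layer_data(p, i = 2))

# nothing is drawn until the plot is printed
print(p)
```

Layers are drawn in the order they were added, so the points sit on top of the line and the horizontal line sits on top of both.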




beautification

axis labels, titles, subtitles, captions

Features such as titles or axis labels are added by means of layers as well. Consider the layer labs() which allows us to set labels for the x and y axes, as well as add a title, a subtitle, and a caption.

ggplot(data = my_data) + 
	geom_line(
		aes(x = x, y = y), 
		color = "blue",
		size = 1,
		linetype = "solid"
	) +
	geom_point(
		aes(x = x, y = y, color = z),
		size = 4, 
		alpha = 0.4
	) +
	geom_hline(yintercept = 0, color = "red") +
	labs(
		x = "X",
		y = "Sine of X",
		title = "The Sine of X",
		subtitle = "Not the Cosine of X",
		caption = "This figure was created in R."
	) +
	coord_cartesian()




scales

All aesthetic attributes are associated with a scale. You can think of x and y being scaled either linearly and continuously, or discretely, or in some kind of transformation – say a logarithmic scale. The same is true for other aesthetic attributes. Colors could vary discretely or smoothly and continuously. Let’s manipulate the x scale as well as the color scale. The most common scale functions you will encounter are:

  • scale_*_continuous(): map continuous data values to aesthetic attributes (e.g., scale_x_continuous())
  • scale_*_discrete(): map discrete data values such as factors to aesthetic attributes (e.g., scale_y_discrete())
  • scale_*_manual(): map discrete values to manually chosen aesthetic attributes (e.g., scale_color_manual())

ggplot(data = my_data) + 
	geom_line(
		aes(x = x, y = y), 
		color = "blue",
		size = 1,
		linetype = "solid"
	) +
	geom_point(
		aes(x = x, y = y, color = z),
		size = 4, 
		alpha = 0.4
	) +
	geom_hline(yintercept = 0, color = "red") +
	labs(
		x = "X",
		y = "Sine of X",
		title = "The Sine of X",
		subtitle = "Not the Cosine of X",
		caption = "This figure was created in R."
	) +
	coord_cartesian() +
	scale_x_continuous(
		breaks = seq(from = -5, to = 5, by = 1)
	) + 
	scale_color_manual(
		name = "Sign", 
		values = c("red", "green"), 
		labels = c("negative sine", "positive sine")
	)
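
The call to scale_color_manual() above relies on the order of the factor levels of z (alphabetical: "negative", then "positive"). Naming the elements of the values vector makes the pairing explicit and independent of that order. A minimal sketch; the names sign_colors, p, and built are made up for this illustration.

```r
library(ggplot2)

x <- seq(from = -5, to = 5, by = 0.1)
y <- sin(x)
z <- as.factor(ifelse(y > 0, "positive", "negative"))
my_data <- data.frame(x, y, z)

# names must match the factor levels of z exactly
sign_colors <- c(positive = "green", negative = "red")

p <- ggplot(data = my_data) +
	geom_point(aes(x = x, y = y, color = z), size = 4, alpha = 0.4) +
	scale_color_manual(name = "Sign", values = sign_colors)

# check which color each group of points received
built <- layer_data(p)
unique(built$colour[built$y > 0])   # "green"
unique(built$colour[built$y <= 0])  # "red"
```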




a brief word on color

A number of colors are predefined in R. See R-Colors.pdf for a listing of them all. If you’d like to express a color not found on this list, you can define it yourself as one of over 16.7 million combinations of red, green, and blue, with each channel taking one of 256 levels (0 to 255) written in the hexadecimal number system.

Black, which is zero parts red, zero parts green, and zero parts blue, is expressed as "#000000" in hexadecimal notation. White, which is 255 parts each of red, green, and blue, is expressed as "#ffffff". To verify, you can let R convert 255 to hexadecimal notation via as.hexmode(255). The translation to hexadecimal notation can be automated via the rgb() function.

x <- rnorm(n = 100, mean = 0, sd = 1)
y <- rnorm(n = 100, mean = 0, sd = 1)

dat <- data.frame(x, y)

# maxColorValue is the full argument name (max only works through partial matching)
my_mystery_color <- rgb(red = 16, green = 128, blue = 64, maxColorValue = 255)

ggplot(data = dat, aes(x = x, y = y)) +
	geom_point(color = my_mystery_color, size = 5)
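
The arithmetic behind the hexadecimal notation can be checked directly in base R. The sketch below only uses as.hexmode(), rgb(), and col2rgb(); the object name blue_channels is made up for this illustration.

```r
# 256 levels (0 to 255) per channel, three channels
256^3                                   # 16777216 possible colors

# a single channel value in hexadecimal
format(as.hexmode(255))                 # "ff"
format(as.hexmode(16))                  # "10"

# rgb() assembles the full hex string from the three channels
rgb(red = 0, green = 0, blue = 0, maxColorValue = 255)        # "#000000"
rgb(red = 255, green = 255, blue = 255, maxColorValue = 255)  # "#FFFFFF"
my_mystery_color <- rgb(red = 16, green = 128, blue = 64, maxColorValue = 255)
my_mystery_color                        # "#108040"

# col2rgb() goes the other way: hex string to red, green, and blue parts
blue_channels <- col2rgb("#268bd2")
blue_channels                           # red 38, green 139, blue 210
```

R accepts hex strings in either upper or lower case, so "#ffffff" and "#FFFFFF" describe the same color.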




Try to replicate this figure in Excel …

library(ggthemes)

black <- "#073642"
blue <- "#268bd2"
cyan <- "#2aa198"
green <- "#859900"
magenta <- "#d33682"
orange <- "#cb4b16"
red <- "#dc322f"
violet <- "#6c71c4"
white <- "#eee8d5"
yellow <- "#b58900"

my_colors_values <- c(black, blue, cyan, green, magenta, 
    orange, red, violet, white, yellow)
my_colors_names <- c("black", "blue", "cyan", "green", "magenta",
    "orange", "red", "violet", "white", "yellow")

y <- rnorm(n = 1000, mean = 0, sd = 1)
x <- runif(n = 1000, min = 1, max = 100)
z <- rep(my_colors_names, each = 100)

dat <- data.frame(x, y, z)

ggplot(data = dat, aes(x = x, y = y, color = z)) +
	geom_point(size = 12, alpha = 0.1) +
	geom_point(size = 6, alpha = 0.3) +
	geom_point(size = 3, alpha = 0.6) +
	geom_point(size = 1, alpha = 1) +
	geom_smooth(alpha = 0, span = 0.2, size = 1.5) +
	scale_x_continuous(breaks = 
		c(2, 5,	seq(from = 0, to = 100, by = 10))) +
	scale_y_continuous(breaks = 
		seq(from = -5, to = 5, by = 0.5)) +
	scale_color_manual(name = "My Colors", 
		values = my_colors_values) +
	coord_trans(x = "log") +
	theme_solarized_2() +
	ggtitle("Confetti!")




Below find the code we produced in class last week as well as our snazzy animation.
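
The data wrangling in that script leans on pivot_longer() with the special ".value" entry in names_to, which splits column names such as mean_rep into a value column (mean) and a party label (rep). The sketch below shows the mechanics on a toy table; the data frame toy and all of its numbers are made up for illustration and are not taken from the Congress data.

```r
library(tidyr)

# toy summary table in the same wide layout the script builds
toy <- data.frame(
	year = c(2019, 2021),
	n_rep = c(200, 213),
	n_dem = c(235, 222),
	mean_rep = c(0.5, 0.6),
	mean_dem = c(-0.4, -0.5)
)

# the part before "_" becomes a column name (".value"),
# the part after "_" becomes an entry in the new party column
toy_long <- pivot_longer(toy,
	cols = -year,
	names_sep = "_",
	names_to = c(".value", "party")
)

toy_long <- data.frame(toy_long)
toy_long  # one row per year and party, with columns year, party, n, mean
```

The long layout is what lets a single aes(color = party) mapping draw one line per party.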


# ------------------------------------------------------- #
# installing packages and loading libraries
# ------------------------------------------------------- #

# install.packages("tidyverse")
# install.packages("ggthemes")
# install.packages("R.utils")
# install.packages("gganimate")
# install.packages("gifski")

library(ggplot2)
library(ggthemes)
library(scales)
library(tidyr)
library(R.utils)
library(gganimate)
library(gifski)

# ------------------------------------------------------- #
# Congress - DW-Nominate data wrangling
# ------------------------------------------------------- #

cong <- read.csv("Congress.csv")

cong$party <- ifelse(cong$party_code == 100, 
	yes = "Democrat", 
	no = ifelse(cong$party_code == 200, 
		yes = "Republican",
		no = "other")
	)

HR <- subset(cong, chamber == "House")

year <- seq(from = 1789, to = 2021, by = 2)

df <- data.frame(
	year,
	session = NA,
	n_rep = NA,
	n_dem = NA,
	n_all = NA,
	mean_rep = NA,
	mean_dem = NA,
	mean_all = NA,
	sd_rep = NA,
	sd_dem = NA,
	sd_all = NA
	)

for(i in 1:117) {

	tmp.rep <- subset(HR, congress == i & party == "Republican")
	tmp.dem <- subset(HR, congress == i & party == "Democrat")
	tmp.all <- subset(HR, congress == i & party != "other")

	df$session[i] <- i

	df$n_rep[i] <- dim(tmp.rep)[1]
	df$n_dem[i] <- dim(tmp.dem)[1]
	df$n_all[i] <- dim(tmp.all)[1]

	df$mean_rep[i] <- mean(tmp.rep$nominate_dim1, na.rm = TRUE)
	df$mean_dem[i] <- mean(tmp.dem$nominate_dim1, na.rm = TRUE)
	df$mean_all[i] <- mean(tmp.all$nominate_dim1, na.rm = TRUE)

	df$sd_rep[i] <- sd(tmp.rep$nominate_dim1, na.rm = TRUE)
	df$sd_dem[i] <- sd(tmp.dem$nominate_dim1, na.rm = TRUE)
	df$sd_all[i] <- sd(tmp.all$nominate_dim1, na.rm = TRUE)

}

df <- pivot_longer(df, 
	cols = -c(year,session), 
	names_sep = "_", 
	names_to = c(".value", "party")
	)

df <- data.frame(df)

df$se <- df$sd/sqrt(df$n)

# maxColorValue spelled out in full (max only works through partial matching)
my_red <- rgb(red = 220, green = 50, blue = 47, alpha = 255, maxColorValue = 255)
my_blue <- rgb(red = 38, green = 139, blue = 210, alpha = 255, maxColorValue = 255)
my_black <- rgb(red = 0,  green = 43, blue = 54, alpha = 255, maxColorValue = 255)
my_yellow <- "#b58900"

pres <- data.frame(
	Name = c("Reagan\ntakes office", "Obama\ntakes office"),
	Party = c("rep", "dem"),
	Year = c(1981, 2009)
	)

# ------------------------------------------------------- #
# Congress Polarization Plot
# ------------------------------------------------------- #

ggplot(data = subset(df, year > 1865)) +
	coord_cartesian(xlim = c(1865, 2021), ylim = c(-1, 1)) +
	geom_line(aes(x = year, y = mean, color = party)) +
	labs(x = "Year", y = "Average Ideology", color = "Party", fill = "Party", title = "House of Representatives: 1865 to present") +
	scale_x_continuous(breaks = seq(from = 1865, to = 2021, by = 10)) +
	scale_y_continuous(breaks = seq(from = -1, to = 1, by = 0.25 )) +
	scale_color_manual(values = c(my_black, my_blue, my_red), breaks = c("all", "dem", "rep")) +
	scale_fill_manual(values = c(my_black, my_blue, my_red), breaks = c("all", "dem", "rep")) +
	geom_ribbon(aes(x = year, ymin = mean - 2 * se, 
							  ymax = mean + 2 * se, fill = party), alpha = 0.25) +
	geom_vline(data = pres, aes(xintercept = Year), color = c(my_red, my_blue)) +
	annotate("text", x = pres$Year, y = 0.75, label = pres$Name, color = c(my_red, my_blue), hjust = 1.1) +
	theme_solarized()


# ------------------------------------------------------- #
# saving a plot via pdf(); see ?png for alternatives
# ------------------------------------------------------- #


pdf(file = "Polarization.pdf", width = 10, height = 6) 

	ggplot(data = subset(df, year > 1865)) +
		coord_cartesian(xlim = c(1865, 2021), ylim = c(-1, 1)) +
		geom_line(aes(x = year, y = mean, color = party)) +
		labs(x = "Year", y = "Average Ideology", color = "Party", fill = "Party", title = "House of Representatives: 1865 to present") +
		scale_x_continuous(breaks = seq(from = 1865, to = 2021, by = 10)) +
		scale_y_continuous(breaks = seq(from = -1, to = 1, by = 0.25 )) +
		scale_color_manual(values = c(my_black, my_blue, my_red), breaks = c("all", "dem", "rep")) +
		scale_fill_manual(values = c(my_black, my_blue, my_red), breaks = c("all", "dem", "rep")) +
		geom_ribbon(aes(x = year, ymin = mean - 2 * se, 
								  ymax = mean + 2 * se, fill = party), alpha = 0.25) +
		geom_vline(data = pres, aes(xintercept = Year), color = c(my_red, my_blue)) +
		annotate("text", x = pres$Year, y = 0.75, label = pres$Name, color = c(my_red, my_blue), hjust = 1.1) +
		theme_solarized()

dev.off()


# ------------------------------------------------------- #
# Histogram inside for-loop
# ------------------------------------------------------- #

Year <- seq(from = 1789, to = 2021, by = 2)

for(i in 1:117) {

	title <- paste("Session:", i, "-- Year:", Year[i])

	p <- ggplot(data = subset(HR, congress == i)) +
		coord_cartesian(xlim = c(-1,1), ylim = c(0, 50)) +
		geom_histogram(aes(x = nominate_dim1, fill = party, color = party), alpha = 0.3, position = "dodge", binwidth = 0.05) +
		scale_color_manual(values = c(my_black, my_blue, my_red), breaks = c("other", "Democrat", "Republican")) +
		scale_fill_manual(values = c(my_black, my_blue, my_red), breaks = c("other", "Democrat", "Republican")) +
		geom_vline(aes(xintercept = median(nominate_dim1, na.rm = TRUE)), color = "red", linetype = "solid") +
		geom_vline(xintercept = 0, color = "black", linetype = "dashed") +
		labs(y = "Number of Representatives", x = "Average Ideology", color = "Party", fill = "Party", title = title) +
		scale_x_continuous(breaks = seq(from = -1, to = 1, by = 0.2)) +
		theme_solarized()

	print(p)

	Sys.sleep(time = 2)

}

# ------------------------------------------------------- #
# Scatterplot inside for-loop
# ------------------------------------------------------- #

Year <- seq(from = 1789, to = 2021, by = 2)

for (i in 40:117) {

	plot_i <- ggplot(data = subset(HR, congress == i)) + 
				geom_point(aes(x = nominate_dim1, y = nominate_dim2, color = party)) +
				labs(title = paste("Session:", i, "-- Year:", Year[i])) +
				coord_cartesian(xlim = c(-1, 1), ylim = c(-1,1)) +
				scale_colour_manual(values = c(my_black, my_red, my_blue), drop = FALSE, breaks = c("other", "Republican", "Democrat"), name = "Party")

	print(plot_i)

	Sys.sleep(time = 1)

}

# ------------------------------------------------------- #
# Animate via gganimate
# ------------------------------------------------------- #

# you may have to install a renderer for this to work
# install.packages("gifski")

myplot <- ggplot(data = HR, aes(x = nominate_dim1, y = nominate_dim2, color = party)) +
			geom_point() +
			labs(title = "Session: {closest_state}", x = "Ideology (Dimension I)", y = "Ideology (Dimension II)") +
			coord_cartesian(xlim = c(-1, 1), ylim = c(-1,1)) +
			scale_colour_manual(values = c(my_yellow, my_red, my_blue), drop = FALSE, breaks = c("other", "Republican", "Democrat"), name = "Party") +
			transition_states(congress, transition_length = 117, state_length = 1) +
			theme_solarized_2(light = FALSE) +
			enter_fade() +
			exit_fade()

animate(myplot, nframes = 234)

anim_save("foo.gif", animation = last_animation())
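
As an alternative to the pdf() and dev.off() pattern used in the script, ggplot2 ships its own helper, ggsave(), which picks the graphics device from the file extension. The sketch below saves a toy plot to a temporary directory; the object names p and out_file and the file name sine.pdf are made up for this illustration.

```r
library(ggplot2)

x <- seq(from = -5, to = 5, by = 0.1)
p <- ggplot(data = data.frame(x = x, y = sin(x)), aes(x = x, y = y)) +
	geom_line()

# hypothetical output file in a temporary directory
out_file <- file.path(tempdir(), "sine.pdf")

# the device is chosen from the file extension;
# width and height are given in inches by default
ggsave(filename = out_file, plot = p, width = 10, height = 6)

file.exists(out_file)
```

Passing the plot explicitly via the plot argument avoids relying on whichever plot happened to be displayed last.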