class: center, middle, inverse, title-slide

# Data and visualization
### Becky Tang

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: center, middle

# Exploratory data analysis

---

## What is EDA?

- .vocab[Exploratory data analysis (EDA)] is an approach to analyzing data sets to summarize their main characteristics.

<br>

- Often, EDA is visual. That's what we're focusing on today.

<br>

- We can also calculate summary statistics and perform data wrangling/manipulation/transformation at (or before) this stage of the analysis.

---

class: center, middle

# Data visualization

---

## Data visualization

> *"The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey*

<br>

- .vocab[Data visualization] is the creation and study of the visual representation of data.

<br>

- There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations.

- We'll use **`ggplot2`**.

---

## ggplot2 in tidyverse

.pull-left[
<img src="img/02/ggplot2-part-of-tidyverse.png" width="70%" />
]

.pull-right[
- **ggplot2** is tidyverse's data visualization package
- The `gg` in "ggplot2" stands for Grammar of Graphics
- It is inspired by the book **Grammar of Graphics** by Leland Wilkinson*

![](img/02/grammar-of-graphics.png)
]

.footnote[
Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)
]

---

## What is a Grammar of Graphics?

A tool that allows for concisely describing the components of a graphic:

<img src="img/02/grammar-of-graphics.png" width="70%" style="display: block; margin: auto;" />

---

## What function is doing the plotting?

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
* geom_point() +
  labs(title = "GDP vs. Total Ecological Footprint of countries (2016)",
       x = "GDP ($)", y = "Total footprint (hectare)")
```

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## What is the dataset being plotted?

```r
*ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
  labs(title = "GDP vs. Total Ecological Footprint of countries (2016)",
       x = "GDP ($)", y = "Total footprint (hectare)")
```

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Which variable is on the x-axis? On the y-axis?

```r
*ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
* geom_point() +
  labs(title = "GDP vs. Total Ecological Footprint of countries (2016)",
       x = "GDP ($)", y = "Total footprint (hectare)")
```

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## What does the warning mean?

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
* geom_point() +
  labs(title = "GDP vs. Total Ecological Footprint of countries (2016)",
       x = "GDP ($)", y = "Total footprint (hectare)")
```

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## What does `geom_smooth()` do?

```r
*ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* geom_smooth() +
  labs(title = "GDP vs. Total Ecological Footprint of countries (2016)",
       x = "GDP ($)", y = "Total footprint (hectare)")
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Hello ggplot2!

- `ggplot()` is the main function in ggplot2, and plots are constructed in layers
- The structure of the code for plots can often be summarized as

```r
ggplot +
  geom_xxx
```

<br>

--

or, more precisely

.small[
```r
ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
```
]

---

## Hello ggplot2!

To use ggplot2 functions, first load tidyverse

```r
library(tidyverse)
```

For help with ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/)

---

class: center, middle

# Visualizing Ecological Footprint

---

## Dataset terminology

.small[
```r
countries_footprint
```

```
## # A tibble: 188 x 14
##    Country  Region  Population   HDI    GDP Cropland Grazing Forest Carbon  Fish
##    <chr>    <chr>        <dbl> <dbl>  <dbl>    <dbl>   <dbl>  <dbl>  <dbl> <dbl>
##  1 Afghani… Middle…      29.8   0.46   615.     0.3     0.2    0.08   0.18  0
##  2 Albania  Northe…       3.16  0.73  4534.     0.78    0.22   0.25   0.87  0.02
##  3 Algeria  Africa       38.5   0.73  5431.     0.6     0.16   0.17   1.14  0.01
##  4 Angola   Africa       20.8   0.52  4666.     0.33    0.15   0.12   0.2   0.09
##  5 Antigua… Latin …       0.09  0.78 13205.    NA      NA     NA     NA    NA
##  6 Argenti… Latin …      41.1   0.83 13540      0.78    0.79   0.29   1.08  0.1
##  7 Armenia  Middle…       2.97  0.73  3426.     0.74    0.18   0.34   0.89  0.01
##  8 Aruba    Latin …       0.1  NA       NA     NA      NA     NA     NA    NA
##  9 Austral… Asia-P…      23.0   0.93 66604.     2.68    0.63   0.89   4.85  0.11
## 10 Austria  Europe…       8.46  0.88 51274.     0.82    0.27   0.63   4.14  0.06
## # … with 178 more rows, and 4 more variables: Total <dbl>,
## #   EarthsRequired <dbl>, CountriesRequired <dbl>, DataQuality <chr>
```
]

Each row is an .vocab[observation].

Each column is a .vocab[variable].

---

## What's in the Ecological Footprint data?
Take a `glimpse` of the data:

```r
glimpse(countries_footprint)
```

```
## Rows: 188
## Columns: 14
## $ Country           <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Anti…
## $ Region            <chr> "Middle East/Central Asia", "Northern/Eastern Europe…
## $ Population        <dbl> 29.82, 3.16, 38.48, 20.82, 0.09, 41.09, 2.97, 0.10, …
## $ HDI               <dbl> 0.46, 0.73, 0.73, 0.52, 0.78, 0.83, 0.73, NA, 0.93, …
## $ GDP               <dbl> 614.66, 4534.37, 5430.57, 4665.91, 13205.10, 13540.0…
## $ Cropland          <dbl> 0.30, 0.78, 0.60, 0.33, NA, 0.78, 0.74, NA, 2.68, 0.…
## $ Grazing           <dbl> 0.20, 0.22, 0.16, 0.15, NA, 0.79, 0.18, NA, 0.63, 0.…
## $ Forest            <dbl> 0.08, 0.25, 0.17, 0.12, NA, 0.29, 0.34, NA, 0.89, 0.…
## $ Carbon            <dbl> 0.18, 0.87, 1.14, 0.20, NA, 1.08, 0.89, NA, 4.85, 4.…
## $ Fish              <dbl> 0.00, 0.02, 0.01, 0.09, NA, 0.10, 0.01, NA, 0.11, 0.…
## $ Total             <dbl> 0.79, 2.21, 2.12, 0.93, 5.38, 3.14, 2.23, 11.88, 9.3…
## $ EarthsRequired    <dbl> 0.46, 1.27, 1.22, 0.54, 3.11, 1.82, 1.29, 6.86, 5.37…
## $ CountriesRequired <dbl> 1.60, 1.87, 3.61, 0.37, 5.70, 0.45, 2.52, 20.69, 0.5…
## $ DataQuality       <chr> "High", "High", "Medium", "High", "Low", "High", "Lo…
```

---

## Example: What's in the Star Wars data?

If a dataset ships with an R package for anyone to use, it comes with a help file. Run the following **<u>in the Console</u>** to view the help file for the `starwars` dataset

```r
?starwars
```

<img src="img/02/starwars-help.png" width="60%" />

---

## GDP vs. Total Footprint

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point()
```

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## What's that warning?

- Not all countries have both GDP and Total Footprint information, so 15 of them are not plotted

```
## Warning: Removed 15 rows containing missing values (geom_point).
```

- We can suppress warnings to save space on the output documents, but it's important to note them
- To suppress warnings:

.center[
`{r code-chunk-label, warning=FALSE}`
]

---

## GDP vs. Total Footprint

.question[
How would you describe this **relationship**?
]

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## GDP vs. Forest Footprint

.question[
How about here? Which country has low GDP but a large forest footprint, and vice versa?
]

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Which countries?

<img src="img/02/forest.png" width="50%" style="display: block; margin: auto;" />

---

## Additional variables

We can map additional variables to various features of the plot:

- **aesthetics**
    - shape
    - color
    - size
    - alpha (transparency)
- **faceting**: small multiples displaying different subsets

---

class: center, middle

# Aesthetics

---

## Aesthetics options

Visual characteristics of plotting characters that can be **mapped to a specific variable** in the data are

- `color`
- `size`
- `shape`
- `alpha` (transparency)

---

## GDP + Total Footprint + Data Quality

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total,
*                                                color = DataQuality)) +
  geom_point()
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## GDP + Total Footprint + Data Quality

Let's map `shape` and `color` to `DataQuality`

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total,
                                                 color = DataQuality,
*                                                shape = DataQuality)) +
  geom_point()
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

### GDP + Total Footprint + Data Quality + Fish Footprint

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total,
                                                 color = DataQuality,
                                                 shape = DataQuality,
*                                                size = Fish)) +
  geom_point()
```

<img src="02-data-and-viz_files/figure-html/plot-birth-year-1.png" style="display: block; margin: auto;" />

---

## GDP + Total Footprint + Data Quality

Let's increase the size of all points across the board:

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total,
                                                 color = DataQuality,
                                                 shape = DataQuality)) +
* geom_point(size = 3)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---

## Aesthetics summary

- Continuous variables are measured on a continuous scale
- Discrete variables are measured (or often counted) on a discrete scale

.small[
aesthetics    | discrete                 | continuous
------------- | ------------------------ | ------------
color         | rainbow of colors        | gradient
size          | discrete steps           | linear mapping between area and value
shape         | different shape for each | shouldn't (and doesn't) work
]

<br>

.alert[
Use aesthetics (`aes`) for mapping features of a plot to a variable; define the features in `geom_xxx()` for customization **<u>not</u>** mapped to a variable.
]

---

class: center, middle

# Faceting

---

## Faceting options

- Smaller plots that display different subsets of the data
- Useful for exploring conditional relationships and large data

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
  labs(title = "GDP vs. Total Footprint of countries (2016)",
*      subtitle = "Faceted by region",
       x = "GDP ($)", y = "Total footprint (hectare)") +
* facet_grid(. ~ Region)
```

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
  labs(title = "GDP vs. Total Footprint of countries (2016)",
*      subtitle = "Faceted by region",
       x = "GDP ($)", y = "Total footprint (hectare)") +
* facet_grid(. ~ Region)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---

.question[
In the next few slides, describe what each plot displays. Think about how the code relates to the output.
]

--

<br><br><br>

.alert[
The plots in the next few slides do not have proper titles, axis labels, etc., so you can more easily focus on what's happening in the plots. But you should always label your plots!
]

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* facet_grid(DataQuality ~ Region)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* facet_grid(Region ~ .)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* facet_wrap(Region ~ .)
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />

---

```r
ggplot(data = countries_footprint, mapping = aes(x = GDP, y = Total)) +
  geom_point() +
* facet_wrap(Region ~ ., scales = "free_x")
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

---

## Facet summary

- `facet_grid()`:
    - 2d grid
    - `rows ~ cols`
    - use `.` for no split

--

- `facet_wrap()`: 1d ribbon wrapped into 2d
- set scales using `scales = ` ("free_x", "free_y", "free")

---

## Modifications

You can omit the names of the first two arguments when building plots with `ggplot()`.

```r
*ggplot(countries_footprint, aes(x = GDP, y = Total)) +
  geom_point() +
  facet_wrap(Region ~ ., scales = "free_x")
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---

## Facet and color

```r
ggplot(countries_footprint, aes(x = GDP, y = Total, col = Region)) +
  geom_point() +
  facet_grid(Region ~ DataQuality, scales = "free_x")
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---

## Facet and color, but no legend

```r
ggplot(countries_footprint, aes(x = GDP, y = Total, col = Region)) +
  geom_point() +
  facet_grid(Region ~ DataQuality, scales = "free_x") +
* guides(color = "none")
```

<img src="02-data-and-viz_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />

---

## `ggplot2` supplementary resources

1. [ggplot2.tidyverse.org](https://ggplot2.tidyverse.org/)
2. `ggplot2` [cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf)
3. STA 523 `ggplot2` [slides](https://shawnsanto.com/files/sta523/slides/lec-3b-ggplot2.html#1)
4. [Top 50 `ggplot2` visualizations](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html)
5. [How the BBC uses `ggplot2`](https://medium.com/bbc-visual-and-data-journalism/how-the-bbc-visual-and-data-journalism-team-works-with-graphics-in-r-ed0b35693535)
6. [ggplot2: Elegant Graphics for Data Analysis](https://ggplot2-book.org/)

## Dive further...

--

Data obtained from [https://www.kaggle.com/footprintnetwork/ecological-footprint](https://www.kaggle.com/footprintnetwork/ecological-footprint)
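---

## Checking the missing-value warning

The warning from `geom_point()` reports how many rows were dropped because a plotted variable was missing. Below is a minimal sketch of how one might check such a count by hand. It assumes a small made-up tibble: `toy_footprint` and its values are hypothetical and are not the Ecological Footprint data.

```r
library(tidyverse)

# Hypothetical toy data (not the real dataset): 5 rows, some with missing values
toy_footprint <- tibble(
  Country = c("A", "B", "C", "D", "E"),
  GDP     = c(615, NA, 5431, 4666, NA),
  Total   = c(0.79, 2.21, NA, 0.93, 5.38)
)

# Rows missing either plotted variable are the ones geom_point() would drop
n_dropped <- nrow(filter(toy_footprint, is.na(GDP) | is.na(Total)))
n_dropped  # 3 in this toy example

# Keep only the complete rows, so plotting raises no warning
complete_rows <- filter(toy_footprint, !is.na(GDP), !is.na(Total))

ggplot(data = complete_rows, mapping = aes(x = GDP, y = Total)) +
  geom_point()
```

Filtering first makes the dropped rows explicit rather than silent. Applying the same `filter()` call to `countries_footprint` should list the countries behind the warning.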