class: center, middle, inverse, title-slide

# Data and Visualization
## Visualizing different types of data
### Becky Tang
### 05.19.2021

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle, center

# Identifying variables

---

## Number of variables involved

- .vocab[Univariate data analysis]: distribution of a single variable
<br>

- .vocab[Bivariate data analysis]: relationship between two variables
<br>

- .vocab[Multivariate data analysis]: relationship between many variables at once, usually focusing on the relationship between two while conditioning on others

---

## Types of variables

- .vocab[Numerical variables] can be classified as .vocab[continuous] or .vocab[discrete] based on whether the variable can take on any value within an interval or only non-negative whole numbers, respectively.
  - *height* is continuous
  - *number of siblings* is discrete

--

- If the variable is .vocab[categorical], we can determine if it is .vocab[ordinal] based on whether or not the levels have a natural ordering.
  - *hair color* is unordered
  - *year in school* is ordinal

---

class: center, middle

# Visualizing numerical data

---

## Describing numerical distributions

- .vocab[shape:]
  - skewness: right-skewed, left-skewed, symmetric
  - modality: unimodal, bimodal, multimodal, uniform
- .vocab[center:] mean (`mean`), median (`median`), mode (not always useful)
- .vocab[spread:] range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`)
- .vocab[outliers:] observations outside of the usual pattern

---

## Diamonds data

```r
diamonds
```

```
## # A tibble: 53,940 x 11
##    carat cut   color clarity depth table price     x     y     z price_per_carat
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>           <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43           1417.
##  2  0.21 Prem… E     SI1      59.8    61   326  3.89  3.84  2.31           1552.
##  3  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31           1422.
##  4  0.29 Prem… I     VS2      62.4    58   334  4.2   4.23  2.63           1152.
##  5  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75           1081.
##  6  0.24 Very… J     VVS2     62.8    57   336  3.94  3.96  2.48           1400
##  7  0.24 Very… I     VVS1     62.3    57   336  3.95  3.98  2.47           1400
##  8  0.26 Very… H     SI1      61.9    55   337  4.07  4.11  2.53           1296.
##  9  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49           1532.
## 10  0.23 Very… H     VS1      59.4    61   338  4     4.05  2.39           1470.
## # … with 53,930 more rows
```

---

## Diamonds data, glimpse

```r
glimpse(diamonds)
```

```
## Rows: 53,940
## Columns: 11
## $ carat           <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, …
## $ cut             <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very G…
## $ color           <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, …
## $ clarity         <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI…
## $ depth           <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, …
## $ table           <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54…
## $ price           <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,…
## $ x               <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, …
## $ y               <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, …
## $ z               <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, …
## $ price_per_carat <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400…
```

---

## Diamonds help file

.pull-left[
<img src="img/03/diamonds_help1.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/03/diamonds_help2.png" width="80%" style="display: block; margin: auto;" />
]

---

## Diamonds: clarity

<img src="img/03/diamond_clarity.png" width="60%" style="display: block; margin: auto;" />

---

## Diamonds: color

<img src="img/03/diamond_colors.png" width="60%" style="display: block; margin: auto;" />

---

## Histograms

.small[
```r
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram()
```

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Histograms

.pull-left[
```r
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram(binwidth = 1000)
```

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
```r
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram(bins = 12)
```

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-10-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Density plots

.small[
```r
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_density()
```

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Side-by-side box plots

.small[
```r
ggplot(data = diamonds, mapping = aes(y = price, x = cut)) +
  geom_boxplot()
```

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-12-1.png" width="80%" style="display: block; margin: auto;" />
]

---

class: center, middle

# Visualizing categorical data

---

## Bar plots

.small[
```r
ggplot(data = diamonds, mapping = aes(x = clarity)) +
  geom_bar()
```

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-13-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Segmented bar plots, counts

.small[
```r
ggplot(data = diamonds, mapping = aes(x = clarity, fill = cut)) +
  geom_bar()
```

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-14-1.png" width="70%" style="display: block; margin: auto;" />
]

---

## Segmented bar plots, proportions

.small[
```r
ggplot(data = diamonds, mapping = aes(x = clarity, fill = cut)) +
* geom_bar(position = "fill") +
  labs(y = "proportion")
```

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-15-1.png" width="70%" style="display: block; margin: auto;" />
]

---

## Which bar plot is more appropriate?

.question[
Which plot is more useful for visualizing the relationship between clarity and cut? Why?
]

.pull-left[
<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]

---

class: center, middle

# Data visualization

---

## What is data visualization?

Anything that converts data sources into a visual representation

- charts
- plots
- maps
- tables
- etc.

.footnote[
Source: https://guides.library.duke.edu/datavis
]

---

class: center, middle

# Why do we visualize?

---

## Data: `datasaurus_dozen`

Below is an excerpt from the `datasaurus_dozen` dataset:

```
## # A tibble: 142 x 8
##    away_x away_y bullseye_x bullseye_y circle_x circle_y dino_x dino_y
##     <dbl>  <dbl>      <dbl>      <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
##  1   32.3   61.4       51.2       83.3     56.0     79.3   55.4   97.2
##  2   53.4   26.2       59.0       85.5     50.0     79.0   51.5   96.0
##  3   63.9   30.8       51.9       85.8     51.3     82.4   46.2   94.5
##  4   70.3   82.5       48.2       85.0     51.2     79.2   42.8   91.4
##  5   34.1   45.7       41.7       84.0     44.4     78.2   40.8   88.3
##  6   67.7   37.1       37.9       82.6     45.0     77.9   38.7   84.9
##  7   53.3   97.5       39.5       80.8     48.6     78.8   35.6   79.9
##  8   63.5   25.1       39.6       82.7     42.1     76.9   33.1   77.6
##  9   68.0   81.0       34.8       80.0     41.0     76.4   29.0   74.5
## 10   67.4   29.7       27.6       72.8     34.6     72.7   26.2   71.4
## # … with 132 more rows
```

---

## Summary statistics

```r
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(r = cor(x, y))
```

```
## # A tibble: 13 x 2
##    dataset          r
##    <chr>        <dbl>
##  1 away       -0.0641
##  2 bullseye   -0.0686
##  3 circle     -0.0683
##  4 dino       -0.0645
##  5 dots       -0.0603
##  6 h_lines    -0.0617
##  7 high_lines -0.0685
##  8 slant_down -0.0690
##  9 slant_up   -0.0686
## 10 star       -0.0630
## 11 v_lines    -0.0694
## 12 wide_lines -0.0666
## 13 x_shape    -0.0656
```

---

.question[
How similar do the relationships between `x` and `y` look based on the plots? Based on the summary statistics?
]

<img src="03-data-and-viz2_files/figure-html/datasaurus-plot-1.png" width="80%" style="display: block; margin: auto;" />

---

## Anscombe's quartet

```r
library(Tmisc)
quartet
```

.pull-left[
```
##    set  x     y
## 1    I 10  8.04
## 2    I  8  6.95
## 3    I 13  7.58
## 4    I  9  8.81
## 5    I 11  8.33
## 6    I 14  9.96
## 7    I  6  7.24
## 8    I  4  4.26
## 9    I 12 10.84
## 10   I  7  4.82
## 11   I  5  5.68
## 12  II 10  9.14
## 13  II  8  8.14
## 14  II 13  8.74
## 15  II  9  8.77
## 16  II 11  9.26
## 17  II 14  8.10
## 18  II  6  6.13
## 19  II  4  3.10
## 20  II 12  9.13
## 21  II  7  7.26
## 22  II  5  4.74
```
]
.pull-right[
```
##    set  x     y
## 23 III 10  7.46
## 24 III  8  6.77
## 25 III 13 12.74
## 26 III  9  7.11
## 27 III 11  7.81
## 28 III 14  8.84
## 29 III  6  6.08
## 30 III  4  5.39
## 31 III 12  8.15
## 32 III  7  6.42
## 33 III  5  5.73
## 34  IV  8  6.58
## 35  IV  8  5.76
## 36  IV  8  7.71
## 37  IV  8  8.84
## 38  IV  8  8.47
## 39  IV  8  7.04
## 40  IV  8  5.25
## 41  IV 19 12.50
## 42  IV  8  5.56
## 43  IV  8  7.91
## 44  IV  8  6.89
```
]

---

## Summarising Anscombe's quartet

```r
quartet %>%
  group_by(set) %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    r = cor(x, y)
  )
```

```
## # A tibble: 4 x 6
##   set   mean_x mean_y  sd_x  sd_y     r
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 I          9   7.50  3.32  2.03 0.816
## 2 II         9   7.50  3.32  2.03 0.816
## 3 III        9   7.5   3.32  2.03 0.816
## 4 IV         9   7.50  3.32  2.03 0.817
```

---

## Visualizing Anscombe's quartet

```r
ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set, ncol = 4)
```

<img src="03-data-and-viz2_files/figure-html/quartet-plot-1.png" width="75%" style="display: block; margin: auto;" />

---

## Do you see anything out of the ordinary?

<img src="img/03/kiss.png" width="75%" style="display: block; margin: auto;" />

---

## Reporting lower vs. higher values

<img src="img/03/fb.png" width="70%" style="display: block; margin: auto;" />

---

class: center, middle

# Designing effective visualizations

---

## Keep it simple

<img src="img/03/pie-3d.jpg" width="300" style="display: block; margin: auto;" />

<img src="03-data-and-viz2_files/figure-html/pie-to-bar-1.png" width="600" style="display: block; margin: auto;" />

---

## Use color to draw attention

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-20-1.png" width="500" style="display: block; margin: auto;" />

<img src="03-data-and-viz2_files/figure-html/unnamed-chunk-21-1.png" width="600" style="display: block; margin: auto;" />

---

## Tell a story

<img src="img/03/time-series.story.png" width="800" style="display: block; margin: auto;" />

.footnote[
Credit: Angela Zoss and Eric Monson, Duke DVS
]
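
---

## Sketch: using color to draw attention

The "Use color to draw attention" slide shows only the finished plots, so here is one possible way to get a similar effect with the `diamonds` data: mute every bar except one. This is a minimal sketch, not the code behind the slide images. The highlighted level (`"VS2"`), the helper column `highlighted`, and the two fill colors are arbitrary choices made for illustration.

```r
library(ggplot2)

# Hypothetical choice for illustration: the clarity level to emphasize.
highlight_level <- "VS2"

# Copy the data (diamonds ships with ggplot2) and flag the rows to highlight.
# `highlighted` is an illustrative helper column, not part of the dataset.
diamonds_flagged <- diamonds
diamonds_flagged$highlighted <-
  as.character(diamonds_flagged$clarity) == highlight_level

# One saturated color for the highlighted bar, a muted gray for the rest.
p <- ggplot(
  data = diamonds_flagged,
  mapping = aes(x = clarity, fill = highlighted)
) +
  geom_bar(show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "gray70")) +
  labs(y = "count")

p
```

Mapping a logical column to `fill` keeps the counts computed by `geom_bar()` unchanged; only the coloring differs from the plain bar plot of `clarity`.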