class: center, middle, inverse, title-slide

# Data types
### Daniel Anderson
### Week 1, Class 2

---

layout: true

<script>
    feather.replace()
  </script>
  
  <div class="slides-footer">
  <span>
  
  <a class = "footer-icon-link" href = "https://github.com/uo-datasci-specialization/c3-fp-2021/raw/main/static/slides/w1p2.pdf">
    <i class = "footer-icon" data-feather="download"></i>
  </a>
  
  <a class = "footer-icon-link" href = "https://fp-2021.netlify.app/slides/w1p2.html">
    <i class = "footer-icon" data-feather="link"></i>
  </a>
  
  <a class = "footer-icon-link" href = "https://fp-2021.netlify.app/">
    <i class = "footer-icon" data-feather="globe"></i>
  </a>
  
  <a class = "footer-icon-link" href = "https://github.com/uo-datasci-specialization/c3-fp-2021">
    <i class = "footer-icon" data-feather="github"></i>
  </a>
  
  </span>
  </div>

---
# Agenda
* Finishing up on coercion

* Attributes

* Missing values

* Intro to lists

* Subsetting

---
# Learning objectives
* Understand the fundamental difference between lists and atomic vectors

--
* Understand how atomic vectors are coerced, implicitly or explicitly

--
* Understand various ways to subset vectors, and how subsetting differs for lists

--
* Understand what an attribute is, and how to set and modify attributes

---
# Pop quiz
Without actually running the code, predict which type each of
the following will coerce to.

```r
c(TRUE, 1L, 0L, "False")
c(1L, FALSE)
c(7L, 6.23, "eight")
c(1.25, TRUE, 4L)
```

<div class="countdown" id="timer_6075f86c" style="right:0;bottom:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">01</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---
# Answers

```r
typeof(c(TRUE, 1L, 0L, "False"))
```

```
## [1] "character"
```

```r
typeof(c(1L, FALSE))
```

```
## [1] "integer"
```

```r
typeof(c(7L, 6.23, "eight"))
```

```
## [1] "character"
```

```r
typeof(c(1.25, TRUE, 4L))
```

```
## [1] "double"
```

---
# Challenge

### Work with a partner
One of you share your screen: 
* Create four atomic vectors, one for each of the fundamental types

* Combine two or more of the vectors. Predict the implicit coercion of each.

* Apply explicit coercions, and predict the output for each.

(basically quiz each other)

<div class="countdown" id="timer_6075f7f8" style="right:0;bottom:0;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">08</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>

---
class: inverse-blue middle
# Attributes

---
# Attributes
* What are attributes?

--
  + metadata... what's metadata?

--
  + Data about the data

---
# Other data types

Atomic vectors by themselves make up only a small fraction of the total number of data types in R

--
### What are some other data types?

--
* Data frames

--
* Matrices & arrays

--
* Factors

--
* Dates

--
Remember, atomic vectors are the atoms of R. Many other data structures are
built from atomic vectors.

* We use attributes to create other data types from atomic vectors

---
# Attributes

### Common
* Names
* Dimensions

### Less common
* Arbitrary metadata

---
# Examples
### Please follow along!
* See **all** attributes associated with a give object with `attributes`

```r
library(palmerpenguins)
attributes(penguins[1:50, ]) # limiting rows just for slides
```

```
## $names
## [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"    
## [5] "flipper_length_mm" "body_mass_g"       "sex"               "year"             
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [29] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
```

---

```r
head(penguins)
```

```
## # A tibble: 6 x 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct> 
## 1 Big one Torgersen         39.1          18.7                 181        3750 male  
## 2 Big one Torgersen         39.5          17.400               186        3800 female
## 3 Big one Torgersen         40.300        18                   195        3250 female
## 4 Big one Torgersen         NA            NA                    NA          NA <NA>  
## 5 Big one Torgersen         36.7          19.3                 193        3450 female
## 6 Big one Torgersen         39.300        20.6                 190        3650 male  
## # … with 1 more variable: year <int>
```

---
# Get specific attribute
* Access just a single attribute by naming it within `attr`

```r
attr(penguins, "class")
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

```r
attr(penguins, "names")
```

```
## [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"    
## [5] "flipper_length_mm" "body_mass_g"       "sex"               "year"
```

--
Note - this is not generally how you would pull these attributes. Rather, you would use `class()` and `names()`.

---
# Be specific
* Note in the prior slides, I'm asking for attributes on the entire data frame.

* Is that what I want?... maybe. But the individual vectors may have attributes as well

--

```r
attributes(penguins$species)
```

```
## $levels
## [1] "Big one"    "Little one" "Funny one" 
## 
## $class
## [1] "factor"
```

```r
attributes(penguins$bill_length_mm)
```

```
## NULL
```

---
# Set attributes
* Just redefine them within `attr`

```r
attr(penguins$species, "levels") <- c("Big one", 
                                      "Little one", 
                                      "Funny one")

head(penguins)
```

```
## # A tibble: 6 x 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct> 
## 1 Big one Torgersen         39.1          18.7                 181        3750 male  
## 2 Big one Torgersen         39.5          17.400               186        3800 female
## 3 Big one Torgersen         40.300        18                   195        3250 female
## 4 Big one Torgersen         NA            NA                    NA          NA <NA>  
## 5 Big one Torgersen         36.7          19.3                 193        3450 female
## 6 Big one Torgersen         39.300        20.6                 190        3650 male  
## # … with 1 more variable: year <int>
```

Note - you would generally not define levels this way either, but it is a general method for modifying attributes.

---
# Dimensions

* Let's create a matrix (please do it with me)

```r
m <- matrix(1:6, ncol = 2)
m
```

```
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
```

* Notice how the matrix fills

--
* Check out the attributes

```r
attributes(m)
```

```
## $dim
## [1] 3 2
```

---
# Modify the attributes
* Let's change it to a 2 x 3 matrix, instead of 3 x 2 (you try first)

--

```r
attr(m, "dim") <- c(2, 3)
m
```

```
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```

--
* is this the result you expected?

---
# Alternative creation
* Create an atomic vector, assign a dimension attribute

```r
v <- 1:6
v
```

```
## [1] 1 2 3 4 5 6
```

```r
attr(v, "dim") <- c(3, 2)
v
```

```
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
```

---
# Aside
* What if we wanted it to fill by row?

.pull-left[

```r
matrix(6:13, 
       ncol = 2, 
*      byrow = TRUE)
```

```
##      [,1] [,2]
## [1,]    6    7
## [2,]    8    9
## [3,]   10   11
## [4,]   12   13
```

```r
vect <- 6:13
dim(vect) <- c(2, 4)
vect
```

```
##      [,1] [,2] [,3] [,4]
## [1,]    6    8   10   12
## [2,]    7    9   11   13
```
]

.pull-right[

```r
t(vect)
```

```
##      [,1] [,2]
## [1,]    6    7
## [2,]    8    9
## [3,]   10   11
## [4,]   12   13
```
]

---
# Names
* The following (this slide and the next) are equivalent

```r
attr(v, "dimnames") <- list(c("the first", "second", "III"),
                            c("index", "value"))
v
```

```
##           index value
## the first     1     4
## second        2     5
## III           3     6
```

---
# Names

```r
v2 <- 1:6
attr(v2, "dim") <- c(3, 2)
rownames(v2) <- c("the first", "second", "III")
colnames(v2) <- c("index", "value")
v2
```

```
##           index value
## the first     1     4
## second        2     5
## III           3     6
```

---
# Arbitrary metadata
* I don't use this often (wouldn't recommend you do either)

```r
attr(v, "matrix_mean") <- mean(v)
v
```

```
##           index value
## the first     1     4
## second        2     5
## III           3     6
## attr(,"matrix_mean")
## [1] 3.5
```

```r
attr(v, "matrix_mean")
```

```
## [1] 3.5
```

* Note that *anything* can be stored as an attribute (including matrices or data frames, etc.)

---
# Real example
Fit a multilevel model and pull the variance-covariance matrix

```r
m <- lme4::lmer(Reaction ~ 1 + Days + (1 + Days|Subject), 
                data = lme4::sleepstudy)

lme4::VarCorr(m)$Subject
```

```
##             (Intercept)      Days
## (Intercept)  612.100158  9.604409
## Days           9.604409 35.071714
## attr(,"stddev")
## (Intercept)        Days 
##   24.740658    5.922138 
## attr(,"correlation")
##             (Intercept)       Days
## (Intercept)  1.00000000 0.06555124
## Days         0.06555124 1.00000000
```

---
# Matrices vs Data frames

Usually we want to work with data frames because they represent our data better.

Sometimes a matrix is more efficient because you can operat on the **entire** matrix at once.

```r
set.seed(42)
m <- matrix(rnorm(100, 200, 10), ncol = 10)
m
```

```
##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]     [,9]
##  [1,] 213.7096 213.0487 196.9336 204.5545 202.0600 203.2193 196.3277 189.5688 215.1271
##  [2,] 194.3530 222.8665 182.1869 207.0484 196.3894 192.1616 201.8523 199.0981 202.5792
##  [3,] 203.6313 186.1114 198.2808 210.3510 207.5816 215.7573 205.8182 206.2352 200.8844
##  [4,] 206.3286 197.2121 212.1467 193.9107 192.7330 206.4290 213.9974 190.4648 198.7910
##  [5,] 204.0427 198.6668 218.9519 205.0496 186.3172 200.8976 192.7271 194.5717 188.0567
##  [6,] 198.9388 206.3595 195.6953 182.8299 204.3282 202.7655 213.0254 205.8100 206.1200
##  [7,] 215.1152 197.1575 197.4273 192.1554 191.8861 206.7929 203.3585 207.6818 197.8286
##  [8,] 199.0534 173.4354 182.3684 191.4909 214.4410 200.8983 210.3851 204.6377 198.1724
##  [9,] 220.1842 175.5953 204.6010 175.8579 195.6855 170.0691 209.2073 191.1422 209.3335
## [10,] 199.3729 213.2011 193.6001 200.3612 206.5565 202.8488 207.2088 189.0022 208.2177
##          [,10]
##  [1,] 213.9212
##  [2,] 195.2383
##  [3,] 206.5035
##  [4,] 213.9111
##  [5,] 188.8921
##  [6,] 191.3921
##  [7,] 188.6826
##  [8,] 185.4079
##  [9,] 200.7998
## [10,] 206.5320
```

---

```r
sum(m)
```

```
## [1] 20032.51
```

```r
mean(m)
```

```
## [1] 200.3251
```

```r
rowSums(m)
```

```
##  [1] 2048.470 1993.774 2041.155 2025.924 1978.173 2007.265 1998.086 1960.291 1952.476
## [10] 2026.901
```

```r
colSums(m)
```

```
##  [1] 2054.730 1983.654 1982.192 1963.610 1997.978 2001.839 2053.908 1978.212 2025.111
## [10] 1991.281
```

```r
# standardize the matrix
z <- (m - mean(m)) / sd(m)
```

---

```r
z
```

```
##              [,1]       [,2]       [,3]        [,4]       [,5]        [,6]       [,7]
##  [1,]  1.28528802  1.2218239 -0.3256841  0.40613865  0.1665940  0.27791666 -0.3838736
##  [2,] -0.57349498  2.1646089 -1.7417882  0.64562157 -0.3779416 -0.78393268  0.1466507
##  [3,]  0.31748345 -1.3649263 -0.1963133  0.96277141  0.6968297  1.48192480  0.5274934
##  [4,]  0.57650528 -0.2989403  1.1352110 -0.61596668 -0.7290676  0.58614338  1.3129235
##  [5,]  0.35698951 -0.1592501  1.7887033  0.45367758 -1.3451640  0.05497234 -0.7296315
##  [6,] -0.13313334  0.5794704 -0.4445968 -1.68004206  0.3844054  0.23434417  1.2195893
##  [7,]  1.42026916 -0.3041875 -0.2782756 -0.78452812 -0.8103926  0.62108770  0.2912866
##  [8,] -0.12212321 -2.5821792 -1.7243635 -0.84833774  1.3555260  0.05504171  0.9660389
##  [9,]  1.90703954 -2.3747685  0.4106013 -2.34955213 -0.4455350 -2.90544454  0.8529388
## [10,] -0.09144695  1.2364622 -0.6458013  0.00346451  0.5983857  0.24234547  0.6610253
##             [,8]        [,9]       [,10]
##  [1,] -1.0329155  1.42140711  1.30560568
##  [2,] -0.1178282  0.21645471 -0.48848642
##  [3,]  0.5675319  0.05370436  0.59329679
##  [4,] -0.9468782 -0.14731870  1.30463971
##  [5,] -0.5524942 -1.17812024 -1.09789797
##  [6,]  0.5266990  0.55646825 -0.85783015
##  [7,]  0.7064474 -0.23973975 -1.11801576
##  [8,]  0.4141258 -0.20672212 -1.43248556
##  [9,] -0.8818216  0.86505545  0.04558258
## [10,] -1.0873272  0.75791330  0.59603916
```

---
# Stripping attributes
* Many operations will strip attributes (generally why it's not a good idea to store important things in them)

.pull-left[

```r
v
```

```
##           index value
## the first     1     4
## second        2     5
## III           3     6
## attr(,"matrix_mean")
## [1] 3.5
```

```r
rowSums(v)
```

```
## the first    second       III 
##         5         7         9
```
]

--
.pull-right[

```r
attributes(rowSums(v))
```

```
## $names
## [1] "the first" "second"    "III"
```

* Generally `names` are maintained

* Sometimes, `dim` is maintained, sometimes not

* All else is stripped

]

---
# More on `names`

* The `names` attribute corresponds to the individual elements within a vector

```r
names(v)
```

```
## NULL
```

```r
names(v) <- letters[1:6]
v
```

```
##           index value
## the first     1     4
## second        2     5
## III           3     6
## attr(,"matrix_mean")
## [1] 3.5
## attr(,"names")
## [1] "a" "b" "c" "d" "e" "f"
```

---
* Perhaps more straightforward

```r
v3a <- c(a = 5, b = 7, c = 12)
v3a
```

```
##  a  b  c 
##  5  7 12
```

```r
names(v3a)
```

```
## [1] "a" "b" "c"
```

```r
attributes(v3a)
```

```
## $names
## [1] "a" "b" "c"
```

---
# Alternatives

```r
v3b <- c(5, 7, 12)
names(v3b) <- c("a", "b", "c")
v3b
```

```
##  a  b  c 
##  5  7 12
```

```r
v3c <- setNames(c(5, 7, 12), c("a", "b", "c"))
v3c
```

```
##  a  b  c 
##  5  7 12
```

--
* Note that `names` is **not** the same thing as `colnames`, but, somewhat confusingly, both work to rename the variables (columns) of a data frame. We'll talk more about why this is momentarily.

---
# Why names might be helpful

```r
v
```

```
##           index value
## the first     1     4
## second        2     5
## III           3     6
## attr(,"matrix_mean")
## [1] 3.5
## attr(,"names")
## [1] "a" "b" "c" "d" "e" "f"
```

.pull-left[

```r
v["b"]
```

```
## b 
## 2
```
]

.pull-right[

```r
v["e"]
```

```
## e 
## 5
```
]

---
# Implementation of factors
### Quickly

```r
fct <- factor(c("a", "a", "b", "c"))
typeof(fct)
```

```
## [1] "integer"
```

```r
attributes(fct)
```

```
## $levels
## [1] "a" "b" "c"
## 
## $class
## [1] "factor"
```

```r
str(fct)
```

```
##  Factor w/ 3 levels "a","b","c": 1 1 2 3
```

---
# More manually

```r
# First create integer vector
int <- c(1L, 1L, 2L, 3L, 1L, 3L)

# assign some levels
attr(int, "levels") <- c("red", "green", "blue")

# change the class to a factor
class(int) <- "factor"

int
```

```
## [1] red   red   green blue  red   blue 
## Levels: red green blue
```
---
# Implementation of dates
### Quickly

```r
date <- Sys.Date()
typeof(date)
```

```
## [1] "double"
```

```r
attributes(date)
```

```
## $class
## [1] "Date"
```

```r
attributes(date) <- NULL
date
```

```
## [1] 18730
```

* This number represents the days passed since January 1, 1970, known as the Unix epoch.

---
# Missing values

* Missing values breed missing values

```r
NA > 5
```

```
## [1] NA
```

```r
NA * 7
```

```
## [1] NA
```

--

* What about this one?

```r
NA == NA
```

--

```
## [1] NA
```

--
It is correct because there's no reason to presume that one missing
value is or is not equal to another missing value.

---
# When missing values don't propagate

```r
NA | TRUE
```

```
## [1] TRUE
```

```r
x <- c(NA, 3, NA, 5)
any(x > 4)
```

```
## [1] TRUE
```

---
# How to test missingness?

* We've already seen the following doesn't work

```r
x == NA
```

```
## [1] NA NA NA NA
```

--
* Instead, use `is.na`

```r
is.na(x)
```

```
## [1]  TRUE FALSE  TRUE FALSE
```

* When does this regularly come into play?

---
class: inverse-blue middle
# Lists

---
# Lists
* Lists are vectors, but not *atomic* vectors

* Fundamental difference - each element can be a different type

```r
list("a", 7L, 3.25, TRUE)
```

```
## [[1]]
## [1] "a"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] 3.25
## 
## [[4]]
## [1] TRUE
```

---
# Lists

.pull-left[

* Technically, each element of the list is a vector, possibly atomic

* The prior example included all *scalars*, which are vectors of length 1.

* Lists do not require all elements to be the same length

]

.pull-right[

```r
l <- list(
  c("a", "b", "c"),
  rnorm(5),
  c(7L, 2L),
  c(TRUE, TRUE, FALSE, TRUE)
)
l
```

```
## [[1]]
## [1] "a" "b" "c"
## 
## [[2]]
## [1]  1.2009654  1.0447511 -1.0032086  1.8484819 -0.6667734
## 
## [[3]]
## [1] 7 2
## 
## [[4]]
## [1]  TRUE  TRUE FALSE  TRUE
```
]

---
# Check the list

```r
typeof(l)
```

```
## [1] "list"
```

```r
attributes(l)
```

```
## NULL
```

```r
str(l)
```

```
## List of 4
##  $ : chr [1:3] "a" "b" "c"
##  $ : num [1:5] 1.201 1.045 -1.003 1.848 -0.667
##  $ : int [1:2] 7 2
##  $ : logi [1:4] TRUE TRUE FALSE TRUE
```

---
# Data frames as lists
* A data frame is just a special case of a list, where all the elements are of the same length.

.pull-left[

```r
l_df <- list(
  a = c("red", "blue"),
  b = rnorm(2),
  c = c(7L, 2L),
  d = c(TRUE, FALSE)
)
l_df
```

```
## $a
## [1] "red"  "blue"
## 
## $b
## [1]  0.1055138 -0.4222559
## 
## $c
## [1] 7 2
## 
## $d
## [1]  TRUE FALSE
```
]

.pull-right[

```r
data.frame(l_df)
```

```
##      a          b c     d
## 1  red  0.1055138 7  TRUE
## 2 blue -0.4222559 2 FALSE
```
]

---
class: inverse-red middle
# Subsetting Lists

---
# A nested list
Lists are often complicated objects. Let's create a somewhat complicated one

```r
x <- c(a = 3, b = 5, c = 7)
l <- list(
  x = x,
  x2 = c(x, x),
  x3 = list(
    vect = x,
    squared = x^2,
    cubed = x^3)
)
```

---
# Subsetting lists
Multiple methods
* Most common: `$`, `[`, and `[[`

.pull-left[

```r
l[1]
```

```
## $x
## a b c 
## 3 5 7
```

```r
typeof(l[1])
```

```
## [1] "list"
```

```r
l[[1]]
```

```
## a b c 
## 3 5 7
```
]

.pull-right[

```r
typeof(l[[1]])
```

```
## [1] "double"
```

```r
l[[1]]["c"]
```

```
## c 
## 7
```
]

---
# Named list
* Because the elements of the list are named, we can use `$`

```r
l$x2
```

```
## a b c a b c 
## 3 5 7 3 5 7
```

```r
l$x3
```

```
## $vect
## a b c 
## 3 5 7 
## 
## $squared
##  a  b  c 
##  9 25 49 
## 
## $cubed
##   a   b   c 
##  27 125 343
```

---
# Subsetting nested lists
* Multiple `$` if all named

```r
l$x3$squared
```

```
##  a  b  c 
##  9 25 49
```

* Note this doesn't work on named elements of an atomic vector, just the named elements of a list

```r
l$x3$squared$b
```

```
## Error in l$x3$squared$b: $ operator is invalid for atomic vectors
```

---
But we could do something like...

```r
l$x3$squared["b"]
```

```
##  b 
## 25
```

---
# Alternatives

* You can always use logical

* Indexing works too

.pull-left[

```r
l[c(TRUE, FALSE, TRUE)]
```

```
## $x
## a b c 
## 3 5 7 
## 
## $x3
## $x3$vect
## a b c 
## 3 5 7 
## 
## $x3$squared
##  a  b  c 
##  9 25 49 
## 
## $x3$cubed
##   a   b   c 
##  27 125 343
```
]

.pull-right[

```r
l[c(1, 3)]
```

```
## $x
## a b c 
## 3 5 7 
## 
## $x3
## $x3$vect
## a b c 
## 3 5 7 
## 
## $x3$squared
##  a  b  c 
##  9 25 49 
## 
## $x3$cubed
##   a   b   c 
##  27 125 343
```
]

---
# Careful with your brackets

```r
l[[c(TRUE, FALSE, FALSE)]]
```

```
## Error in l[[c(TRUE, FALSE, FALSE)]]: recursive indexing failed at level 2
```

* Why doesn't the above work?

---
## Subsetting in multiple dimensions

* Generally we deal with 2d data frames

* If there are two dimensions, we separate the `[` subsetting with a comma

```r
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

```r
mtcars[3, 4]
```

```
## [1] 93
```

---
# Empty indicators
* An empty indicator implies "all"

--
### Select the entire fourth column

```r
mtcars[ ,4]
```

```
##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97
## [22] 150 150 245 175  66  91 113 264 175 335 109
```

--
### Select the entire 4th row

```r
mtcars[4, ]
```

```
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
```

---
# Data types returned

* By default, each of the prior will return a vector, which itself can be subset

The following are equivalent

```r
mtcars[4, c("mpg", "hp")]
```

```
##                 mpg  hp
## Hornet 4 Drive 21.4 110
```

```r
mtcars[4, ][c("mpg", "hp")]
```

```
##                 mpg  hp
## Hornet 4 Drive 21.4 110
```

---
# Return a data frame
* Often, you don't want the vector returned, but rather the modified data frame.

* Specify `drop = FALSE`

```r
mtcars[ ,4]
```

```
##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97
## [22] 150 150 245 175  66  91 113 264 175 335 109
```

```r
mtcars[ ,4, drop = FALSE]
```

```
##                      hp
## Mazda RX4           110
## Mazda RX4 Wag       110
## Datsun 710           93
## Hornet 4 Drive      110
## Hornet Sportabout   175
## Valiant             105
## Duster 360          245
## Merc 240D            62
## Merc 230             95
## Merc 280            123
## Merc 280C           123
## Merc 450SE          180
## Merc 450SL          180
## Merc 450SLC         180
## Cadillac Fleetwood  205
## Lincoln Continental 215
## Chrysler Imperial   230
## Fiat 128             66
## Honda Civic          52
## Toyota Corolla       65
## Toyota Corona        97
## Dodge Challenger    150
## AMC Javelin         150
## Camaro Z28          245
## Pontiac Firebird    175
## Fiat X1-9            66
## Porsche 914-2        91
## Lotus Europa        113
## Ford Pantera L      264
## Ferrari Dino        175
## Maserati Bora       335
## Volvo 142E          109
```

---
# tibbles
* Note dropping the data frame attribute is the default for a `data.frame` but .b[.bolder[NOT]] a `tibble`.

```r
mtcars_tbl <- tibble::as_tibble(mtcars)
mtcars_tbl[ ,4]
```

```
## # A tibble: 32 x 1
##       hp
##    <dbl>
##  1   110
##  2   110
##  3    93
##  4   110
##  5   175
##  6   105
##  7   245
##  8    62
##  9    95
## 10   123
## # … with 22 more rows
```

---
# You can override this

```r
mtcars_tbl[ ,4, drop = TRUE]
```

```
##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97
## [22] 150 150 245 175  66  91 113 264 175 335 109
```

---
# More than two dimensions
* Depending on your applications, you may not run into this much

```r
array <- 1:12
dim(array) <- c(2, 3, 2)
array
```

```
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
```

---
# Subset array
### Select just the second matrix

--

```r
array[ , ,2]
```

```
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
```

--

### Select first column of each matrix

--

```r
array[ ,1, ]
```

```
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
```

---
# Back to lists
### Why are they so useful?
* Fairly obviously, they're much more flexible

* Often returned by functions, for example, `lm`

```r
m <- lm(mpg ~ hp, mtcars)
str(m)
```

```
## List of 12
##  $ coefficients : Named num [1:2] 30.0989 -0.0682
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "hp"
##  $ residuals    : Named num [1:32] -1.594 -1.594 -0.954 -1.194 0.541 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ effects      : Named num [1:32] -113.65 -26.046 -0.556 -0.852 0.67 ...
##   ..- attr(*, "names")= chr [1:32] "(Intercept)" "hp" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:32] 22.6 22.6 23.8 22.6 18.2 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "hp"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.18 1.08
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 30
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = mpg ~ hp, data = mtcars)
##  $ terms        :Classes 'terms', 'formula'  language mpg ~ hp
##   .. ..- attr(*, "variables")= language list(mpg, hp)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "mpg" "hp"
##   .. .. .. ..$ : chr "hp"
##   .. ..- attr(*, "term.labels")= chr "hp"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x7fd864e4b5c0> 
##   .. ..- attr(*, "predvars")= language list(mpg, hp)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "mpg" "hp"
##  $ model        :'data.frame':	32 obs. of  2 variables:
##   ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##   ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ hp
##   .. .. ..- attr(*, "variables")= language list(mpg, hp)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "mpg" "hp"
##   .. .. .. .. ..$ : chr "hp"
##   .. .. ..- attr(*, "term.labels")= chr "hp"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: 0x7fd864e4b5c0> 
##   .. .. ..- attr(*, "predvars")= language list(mpg, hp)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "hp"
##  - attr(*, "class")= chr "lm"
```

---
# Summary
* Atomic vectors must all be the same type

+ implicit coercion occurs if not (and you haven't specified the coercion
  explicitly)

* Lists are also vectors, but not atomic vectors
  
  + Each element can be of a different type and length

+ Incredibly flexible, but often a little more difficult to get the hang of, particularly with subsetting

---
# Any time left?
Practice: Fit the model on the slide two previous

```r
m <- lm(mpg ~ hp, mtcars)
```

Pull each of the following from the model object

* `effects`
* `names` of the `effects`
* `qr` matrix
* third row of the `qr` matrix
* fifth `rowname` of the `qr` matrix
* `data.frame` used in the estimation of the model

---
class: inverse-green middle
# Next time
### Loops with base R

Notes for current slide

Notes for next slide

Data typesDaniel AndersonWeek 1, Class 21 / 62

Agenda

Finishing up on coercion
Attributes
Missing values
Intro to lists
Subsetting

2 / 62

Learning objectives

Understand the fundamental difference between lists and atomic vectors

3 / 62

Learning objectives

Understand the fundamental difference between lists and atomic vectors
Understand how atomic vectors are coerced, implicitly or explicitly

3 / 62

Learning objectives

Understand the fundamental difference between lists and atomic vectors
Understand how atomic vectors are coerced, implicitly or explicitly
Understand various ways to subset vectors, and how subsetting differs for lists

3 / 62

Learning objectives

Understand the fundamental difference between lists and atomic vectors
Understand how atomic vectors are coerced, implicitly or explicitly
Understand various ways to subset vectors, and how subsetting differs for lists
Understand what an attribute is, and how to set and modify attributes

3 / 62

Pop quiz

Without actually running the code, predict which type each of the following will coerce to.

c(TRUE, 1L, 0L, "False")
c(1L, FALSE)
c(7L, 6.23, "eight")
c(1.25, TRUE, 4L)

01:00

4 / 62

Answers

typeof(c(TRUE, 1L, 0L, "False"))

## [1] "character"

typeof(c(1L, FALSE))

## [1] "integer"

typeof(c(7L, 6.23, "eight"))

## [1] "character"

typeof(c(1.25, TRUE, 4L))

## [1] "double"

5 / 62

Challenge

Work with a partner

One of you share your screen:

Create four atomic vectors, one for each of the fundamental types
Combine two or more of the vectors. Predict the implicit coercion of each.
Apply explicit coercions, and predict the output for each.

(basically quiz each other)

08:00

6 / 62

Attributes

7 / 62

Attributes

What are attributes?

8 / 62

Attributes

What are attributes?
- metadata... what's metadata?

8 / 62

Attributes

What are attributes?
- metadata... what's metadata?
- Data about the data

8 / 62

Other data types

Atomic vectors by themselves make up only a small fraction of the total number of data types in R

9 / 62

Other data types

Atomic vectors by themselves make up only a small fraction of the total number of data types in R

What are some other data types?

9 / 62

Other data types

Atomic vectors by themselves make up only a small fraction of the total number of data types in R

What are some other data types?

Data frames

9 / 62

Other data types

Atomic vectors by themselves make up only a small fraction of the total number of data types in R

What are some other data types?

Data frames
Matrices & arrays

9 / 62

Other data types

Atomic vectors by themselves make up only a small fraction of the total number of data types in R

What are some other data types?

Data frames
Matrices & arrays
Factors

9 / 62

Other data types

Atomic vectors by themselves make up only a small fraction of the total number of data types in R

What are some other data types?

Data frames
Matrices & arrays
Factors
Dates

9 / 62

Other data types

Atomic vectors by themselves make up only a small fraction of the total number of data types in R

What are some other data types?

Data frames
Matrices & arrays
Factors
Dates

Remember, atomic vectors are the atoms of R. Many other data structures are built from atomic vectors.

We use attributes to create other data types from atomic vectors

9 / 62

Attributes

Common

Names
Dimensions

Less common

Arbitrary metadata

10 / 62

Examples

Please follow along!

See all attributes associated with a give object with attributes

library(palmerpenguins)
attributes(penguins[1:50, ]) # limiting rows just for slides

## $names
## [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"    
## [5] "flipper_length_mm" "body_mass_g"       "sex"               "year"             
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [29] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

11 / 62

head(penguins)

## # A tibble: 6 x 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct> 
## 1 Big one Torgersen         39.1          18.7                 181        3750 male  
## 2 Big one Torgersen         39.5          17.400               186        3800 female
## 3 Big one Torgersen         40.300        18                   195        3250 female
## 4 Big one Torgersen         NA            NA                    NA          NA <NA>  
## 5 Big one Torgersen         36.7          19.3                 193        3450 female
## 6 Big one Torgersen         39.300        20.6                 190        3650 male  
## # … with 1 more variable: year <int>

12 / 62

Get specific attribute

Access just a single attribute by naming it within attr

attr(penguins, "class")

## [1] "tbl_df"     "tbl"        "data.frame"

attr(penguins, "names")

## [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"    
## [5] "flipper_length_mm" "body_mass_g"       "sex"               "year"

13 / 62

Get specific attribute

Access just a single attribute by naming it within attr

attr(penguins, "class")

## [1] "tbl_df"     "tbl"        "data.frame"

attr(penguins, "names")

## [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"    
## [5] "flipper_length_mm" "body_mass_g"       "sex"               "year"

Note - this is not generally how you would pull these attributes. Rather, you would use class() and names().

13 / 62

Be specific

Note in the prior slides, I'm asking for attributes on the entire data frame.
Is that what I want?... maybe. But the individual vectors may have attributes as well

14 / 62

Be specific

Note in the prior slides, I'm asking for attributes on the entire data frame.
Is that what I want?... maybe. But the individual vectors may have attributes as well

attributes(penguins$species)

## $levels
## [1] "Big one"    "Little one" "Funny one" 
## 
## $class
## [1] "factor"

attributes(penguins$bill_length_mm)

## NULL

14 / 62

Set attributes

Just redefine them within attr

attr(penguins$species, "levels") <- c("Big one", 
                                      "Little one", 
                                      "Funny one")
head(penguins)

## # A tibble: 6 x 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex   
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct> 
## 1 Big one Torgersen         39.1          18.7                 181        3750 male  
## 2 Big one Torgersen         39.5          17.400               186        3800 female
## 3 Big one Torgersen         40.300        18                   195        3250 female
## 4 Big one Torgersen         NA            NA                    NA          NA <NA>  
## 5 Big one Torgersen         36.7          19.3                 193        3450 female
## 6 Big one Torgersen         39.300        20.6                 190        3650 male  
## # … with 1 more variable: year <int>

Note - you would generally not define levels this way either, but it is a general method for modifying attributes.

15 / 62

Dimensions

Let's create a matrix (please do it with me)

m <- matrix(1:6, ncol = 2)
m

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

Notice how the matrix fills

16 / 62

Dimensions

Let's create a matrix (please do it with me)

m <- matrix(1:6, ncol = 2)
m

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

Notice how the matrix fills
Check out the attributes

attributes(m)

## $dim
## [1] 3 2

16 / 62

Modify the attributes

Let's change it to a 2 x 3 matrix, instead of 3 x 2 (you try first)

17 / 62

Modify the attributes

Let's change it to a 2 x 3 matrix, instead of 3 x 2 (you try first)

attr(m, "dim") <- c(2, 3)
m

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

17 / 62

Modify the attributes

Let's change it to a 2 x 3 matrix, instead of 3 x 2 (you try first)

attr(m, "dim") <- c(2, 3)
m

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

is this the result you expected?

17 / 62

Alternative creation

Create an atomic vector, assign a dimension attribute

v <- 1:6
v

## [1] 1 2 3 4 5 6

attr(v, "dim") <- c(3, 2)
v

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

18 / 62

  
AsideWhat if we wanted it to fill by row?
matrix(6:13, 
       ncol = 2, 
       byrow = TRUE)

##      [,1] [,2]
## [1,]    6    7
## [2,]    8    9
## [3,]   10   11
## [4,]   12   13
vect <- 6:13
dim(vect) <- c(2, 4)
vect

##      [,1] [,2] [,3] [,4]
## [1,]    6    8   10   12
## [2,]    7    9   11   13
t(vect)

##      [,1] [,2]
## [1,]    6    7
## [2,]    8    9
## [3,]   10   11
## [4,]   12   13
19 / 62

Names

The following (this slide and the next) are equivalent

attr(v, "dimnames") <- list(c("the first", "second", "III"),
                            c("index", "value"))
v

##           index value
## the first     1     4
## second        2     5
## III           3     6

20 / 62

Names

v2 <- 1:6
attr(v2, "dim") <- c(3, 2)
rownames(v2) <- c("the first", "second", "III")
colnames(v2) <- c("index", "value")
v2

##           index value
## the first     1     4
## second        2     5
## III           3     6

21 / 62

Arbitrary metadata

I don't use this often (wouldn't recommend you do either)

attr(v, "matrix_mean") <- mean(v)
v

##           index value
## the first     1     4
## second        2     5
## III           3     6
## attr(,"matrix_mean")
## [1] 3.5

attr(v, "matrix_mean")

## [1] 3.5

Note that anything can be stored as an attribute (including matrices or data frames, etc.)

22 / 62

Real example

Fit a multilevel model and pull the variance-covariance matrix

m <- lme4::lmer(Reaction ~ 1 + Days + (1 + Days|Subject), 
                data = lme4::sleepstudy)
lme4::VarCorr(m)$Subject

##             (Intercept)      Days
## (Intercept)  612.100158  9.604409
## Days           9.604409 35.071714
## attr(,"stddev")
## (Intercept)        Days 
##   24.740658    5.922138 
## attr(,"correlation")
##             (Intercept)       Days
## (Intercept)  1.00000000 0.06555124
## Days         0.06555124 1.00000000

23 / 62

Matrices vs Data frames

Usually we want to work with data frames because they represent our data better.

Sometimes a matrix is more efficient because you can operat on the entire matrix at once.

set.seed(42)
m <- matrix(rnorm(100, 200, 10), ncol = 10)
m

##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]     [,9]
##  [1,] 213.7096 213.0487 196.9336 204.5545 202.0600 203.2193 196.3277 189.5688 215.1271
##  [2,] 194.3530 222.8665 182.1869 207.0484 196.3894 192.1616 201.8523 199.0981 202.5792
##  [3,] 203.6313 186.1114 198.2808 210.3510 207.5816 215.7573 205.8182 206.2352 200.8844
##  [4,] 206.3286 197.2121 212.1467 193.9107 192.7330 206.4290 213.9974 190.4648 198.7910
##  [5,] 204.0427 198.6668 218.9519 205.0496 186.3172 200.8976 192.7271 194.5717 188.0567
##  [6,] 198.9388 206.3595 195.6953 182.8299 204.3282 202.7655 213.0254 205.8100 206.1200
##  [7,] 215.1152 197.1575 197.4273 192.1554 191.8861 206.7929 203.3585 207.6818 197.8286
##  [8,] 199.0534 173.4354 182.3684 191.4909 214.4410 200.8983 210.3851 204.6377 198.1724
##  [9,] 220.1842 175.5953 204.6010 175.8579 195.6855 170.0691 209.2073 191.1422 209.3335
## [10,] 199.3729 213.2011 193.6001 200.3612 206.5565 202.8488 207.2088 189.0022 208.2177
##          [,10]
##  [1,] 213.9212
##  [2,] 195.2383
##  [3,] 206.5035
##  [4,] 213.9111
##  [5,] 188.8921
##  [6,] 191.3921
##  [7,] 188.6826
##  [8,] 185.4079
##  [9,] 200.7998
## [10,] 206.5320

24 / 62

sum(m)

## [1] 20032.51

mean(m)

## [1] 200.3251

rowSums(m)

##  [1] 2048.470 1993.774 2041.155 2025.924 1978.173 2007.265 1998.086 1960.291 1952.476
## [10] 2026.901

colSums(m)

##  [1] 2054.730 1983.654 1982.192 1963.610 1997.978 2001.839 2053.908 1978.212 2025.111
## [10] 1991.281

# standardize the matrix
z <- (m - mean(m)) / sd(m)

25 / 62

##              [,1]       [,2]       [,3]        [,4]       [,5]        [,6]       [,7]
##  [1,]  1.28528802  1.2218239 -0.3256841  0.40613865  0.1665940  0.27791666 -0.3838736
##  [2,] -0.57349498  2.1646089 -1.7417882  0.64562157 -0.3779416 -0.78393268  0.1466507
##  [3,]  0.31748345 -1.3649263 -0.1963133  0.96277141  0.6968297  1.48192480  0.5274934
##  [4,]  0.57650528 -0.2989403  1.1352110 -0.61596668 -0.7290676  0.58614338  1.3129235
##  [5,]  0.35698951 -0.1592501  1.7887033  0.45367758 -1.3451640  0.05497234 -0.7296315
##  [6,] -0.13313334  0.5794704 -0.4445968 -1.68004206  0.3844054  0.23434417  1.2195893
##  [7,]  1.42026916 -0.3041875 -0.2782756 -0.78452812 -0.8103926  0.62108770  0.2912866
##  [8,] -0.12212321 -2.5821792 -1.7243635 -0.84833774  1.3555260  0.05504171  0.9660389
##  [9,]  1.90703954 -2.3747685  0.4106013 -2.34955213 -0.4455350 -2.90544454  0.8529388
## [10,] -0.09144695  1.2364622 -0.6458013  0.00346451  0.5983857  0.24234547  0.6610253
##             [,8]        [,9]       [,10]
##  [1,] -1.0329155  1.42140711  1.30560568
##  [2,] -0.1178282  0.21645471 -0.48848642
##  [3,]  0.5675319  0.05370436  0.59329679
##  [4,] -0.9468782 -0.14731870  1.30463971
##  [5,] -0.5524942 -1.17812024 -1.09789797
##  [6,]  0.5266990  0.55646825 -0.85783015
##  [7,]  0.7064474 -0.23973975 -1.11801576
##  [8,]  0.4141258 -0.20672212 -1.43248556
##  [9,] -0.8818216  0.86505545  0.04558258
## [10,] -1.0873272  0.75791330  0.59603916

26 / 62

Stripping attributes

Many operations will strip attributes (generally why it's not a good idea to store important things in them)

##           index value
## the first     1     4
## second        2     5
## III           3     6
## attr(,"matrix_mean")
## [1] 3.5

rowSums(v)

## the first    second       III 
##         5         7         9

27 / 62

Stripping attributes

Many operations will strip attributes (generally why it's not a good idea to store important things in them)

##           index value
## the first     1     4
## second        2     5
## III           3     6
## attr(,"matrix_mean")
## [1] 3.5

rowSums(v)

## the first    second       III 
##         5         7         9

attributes(rowSums(v))

## $names
## [1] "the first" "second"    "III"

Generally names are maintained
Sometimes, dim is maintained, sometimes not
All else is stripped

27 / 62

More on `names`

The names attribute corresponds to the individual elements within a vector

names(v)

## NULL

names(v) <- letters[1:6]
v

##           index value
## the first     1     4
## second        2     5
## III           3     6
## attr(,"matrix_mean")
## [1] 3.5
## attr(,"names")
## [1] "a" "b" "c" "d" "e" "f"

28 / 62

Perhaps more straightforward

v3a <- c(a = 5, b = 7, c = 12)
v3a

##  a  b  c 
##  5  7 12

names(v3a)

## [1] "a" "b" "c"

attributes(v3a)

## $names
## [1] "a" "b" "c"

29 / 62

Alternatives

v3b <- c(5, 7, 12)
names(v3b) <- c("a", "b", "c")
v3b

##  a  b  c 
##  5  7 12

v3c <- setNames(c(5, 7, 12), c("a", "b", "c"))
v3c

##  a  b  c 
##  5  7 12

30 / 62

Alternatives

v3b <- c(5, 7, 12)
names(v3b) <- c("a", "b", "c")
v3b

##  a  b  c 
##  5  7 12

v3c <- setNames(c(5, 7, 12), c("a", "b", "c"))
v3c

##  a  b  c 
##  5  7 12

Note that names is not the same thing as colnames, but, somewhat confusingly, both work to rename the variables (columns) of a data frame. We'll talk more about why this is momentarily.

30 / 62

Why names might be helpful

##           index value
## the first     1     4
## second        2     5
## III           3     6
## attr(,"matrix_mean")
## [1] 3.5
## attr(,"names")
## [1] "a" "b" "c" "d" "e" "f"

v["b"]

## b 
## 2

v["e"]

## e 
## 5

31 / 62

Implementation of factors

Quickly

fct <- factor(c("a", "a", "b", "c"))
typeof(fct)

## [1] "integer"

attributes(fct)

## $levels
## [1] "a" "b" "c"
## 
## $class
## [1] "factor"

str(fct)

##  Factor w/ 3 levels "a","b","c": 1 1 2 3

32 / 62

More manually

# First create integer vector
int <- c(1L, 1L, 2L, 3L, 1L, 3L)
# assign some levels
attr(int, "levels") <- c("red", "green", "blue")
# change the class to a factor
class(int) <- "factor"
int

## [1] red   red   green blue  red   blue 
## Levels: red green blue

33 / 62

Implementation of dates

Quickly

date <- Sys.Date()
typeof(date)

## [1] "double"

attributes(date)

## $class
## [1] "Date"

attributes(date) <- NULL
date

## [1] 18730

This number represents the days passed since January 1, 1970, known as the Unix epoch.

34 / 62

Missing values

Missing values breed missing values

NA > 5

## [1] NA

NA * 7

## [1] NA

35 / 62

Missing values

Missing values breed missing values

NA > 5

## [1] NA

NA * 7

## [1] NA

What about this one?

NA == NA

35 / 62

Missing values

Missing values breed missing values

NA > 5

## [1] NA

NA * 7

## [1] NA

What about this one?

NA == NA

## [1] NA

35 / 62

Missing values

Missing values breed missing values

NA > 5

## [1] NA

NA * 7

## [1] NA

What about this one?

NA == NA

## [1] NA

It is correct because there's no reason to presume that one missing value is or is not equal to another missing value.

35 / 62

When missing values don't propagate

NA | TRUE

## [1] TRUE

x <- c(NA, 3, NA, 5)
any(x > 4)

## [1] TRUE

36 / 62

How to test missingness?

We've already seen the following doesn't work

x == NA

## [1] NA NA NA NA

37 / 62

How to test missingness?

We've already seen the following doesn't work

x == NA

## [1] NA NA NA NA

Instead, use is.na

is.na(x)

## [1]  TRUE FALSE  TRUE FALSE

When does this regularly come into play?

37 / 62

Lists

38 / 62

Lists

Lists are vectors, but not atomic vectors
Fundamental difference - each element can be a different type

list("a", 7L, 3.25, TRUE)

## [[1]]
## [1] "a"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] 3.25
## 
## [[4]]
## [1] TRUE

39 / 62

Lists

Technically, each element of the list is a vector, possibly atomic
The prior example included all scalars, which are vectors of length 1.
Lists do not require all elements to be the same length

l <- list(
  c("a", "b", "c"),
  rnorm(5),
  c(7L, 2L),
  c(TRUE, TRUE, FALSE, TRUE)
)
l

## [[1]]
## [1] "a" "b" "c"
## 
## [[2]]
## [1]  1.2009654  1.0447511 -1.0032086  1.8484819 -0.6667734
## 
## [[3]]
## [1] 7 2
## 
## [[4]]
## [1]  TRUE  TRUE FALSE  TRUE

40 / 62

Check the list

typeof(l)

## [1] "list"

attributes(l)

## NULL

str(l)

## List of 4
##  $ : chr [1:3] "a" "b" "c"
##  $ : num [1:5] 1.201 1.045 -1.003 1.848 -0.667
##  $ : int [1:2] 7 2
##  $ : logi [1:4] TRUE TRUE FALSE TRUE

41 / 62

  
Data frames as listsA data frame is just a special case of a list, where all the elements are of the same length.
l_df <- list(
  a = c("red", "blue"),
  b = rnorm(2),
  c = c(7L, 2L),
  d = c(TRUE, FALSE)
)
l_df

## $a
## [1] "red"  "blue"
## 
## $b
## [1]  0.1055138 -0.4222559
## 
## $c
## [1] 7 2
## 
## $d
## [1]  TRUE FALSE
data.frame(l_df)

##      a          b c     d
## 1  red  0.1055138 7  TRUE
## 2 blue -0.4222559 2 FALSE
42 / 62

Subsetting Lists

43 / 62

A nested list

Lists are often complicated objects. Let's create a somewhat complicated one

x <- c(a = 3, b = 5, c = 7)
l <- list(
  x = x,
  x2 = c(x, x),
  x3 = list(
    vect = x,
    squared = x^2,
    cubed = x^3)
)

44 / 62

Subsetting lists

Multiple methods

Most common: $, [, and [[

l[1]

## $x
## a b c 
## 3 5 7

typeof(l[1])

## [1] "list"

l[[1]]

## a b c 
## 3 5 7

typeof(l[[1]])

## [1] "double"

l[[1]]["c"]

## c 
## 7

45 / 62

Named list

Because the elements of the list are named, we can use $

l$x2

## a b c a b c 
## 3 5 7 3 5 7

l$x3

## $vect
## a b c 
## 3 5 7 
## 
## $squared
##  a  b  c 
##  9 25 49 
## 
## $cubed
##   a   b   c 
##  27 125 343

46 / 62

Subsetting nested lists

Multiple $ if all named

l$x3$squared

##  a  b  c 
##  9 25 49

Note this doesn't work on named elements of an atomic vector, just the named elements of a list

l$x3$squared$b

## Error in l$x3$squared$b: $ operator is invalid for atomic vectors

47 / 62

But we could do something like...

l$x3$squared["b"]

##  b 
## 25

48 / 62

Alternatives

You can always use logical
Indexing works too

l[c(TRUE, FALSE, TRUE)]

## $x
## a b c 
## 3 5 7 
## 
## $x3
## $x3$vect
## a b c 
## 3 5 7 
## 
## $x3$squared
##  a  b  c 
##  9 25 49 
## 
## $x3$cubed
##   a   b   c 
##  27 125 343

l[c(1, 3)]

## $x
## a b c 
## 3 5 7 
## 
## $x3
## $x3$vect
## a b c 
## 3 5 7 
## 
## $x3$squared
##  a  b  c 
##  9 25 49 
## 
## $x3$cubed
##   a   b   c 
##  27 125 343

49 / 62

Careful with your brackets

l[[c(TRUE, FALSE, FALSE)]]

## Error in l[[c(TRUE, FALSE, FALSE)]]: recursive indexing failed at level 2

Why doesn't the above work?

50 / 62

Subsetting in multiple dimensions

Generally we deal with 2d data frames
If there are two dimensions, we separate the [ subsetting with a comma

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars[3, 4]

## [1] 93

51 / 62

Empty indicators

An empty indicator implies "all"

52 / 62

Empty indicators

An empty indicator implies "all"

Select the entire fourth column

mtcars[ ,4]

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97
## [22] 150 150 245 175  66  91 113 264 175 335 109

52 / 62

Empty indicators

An empty indicator implies "all"

Select the entire fourth column

mtcars[ ,4]

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97
## [22] 150 150 245 175  66  91 113 264 175 335 109

Select the entire 4th row

mtcars[4, ]

##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

52 / 62

Data types returned

By default, each of the prior will return a vector, which itself can be subset

The following are equivalent

mtcars[4, c("mpg", "hp")]

##                 mpg  hp
## Hornet 4 Drive 21.4 110

mtcars[4, ][c("mpg", "hp")]

##                 mpg  hp
## Hornet 4 Drive 21.4 110

53 / 62

Return a data frame

Often, you don't want the vector returned, but rather the modified data frame.
Specify drop = FALSE

mtcars[ ,4]

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97
## [22] 150 150 245 175  66  91 113 264 175 335 109

mtcars[ ,4, drop = FALSE]

##                      hp
## Mazda RX4           110
## Mazda RX4 Wag       110
## Datsun 710           93
## Hornet 4 Drive      110
## Hornet Sportabout   175
## Valiant             105
## Duster 360          245
## Merc 240D            62
## Merc 230             95
## Merc 280            123
## Merc 280C           123
## Merc 450SE          180
## Merc 450SL          180
## Merc 450SLC         180
## Cadillac Fleetwood  205
## Lincoln Continental 215
## Chrysler Imperial   230
## Fiat 128             66
## Honda Civic          52
## Toyota Corolla       65
## Toyota Corona        97
## Dodge Challenger    150
## AMC Javelin         150
## Camaro Z28          245
## Pontiac Firebird    175
## Fiat X1-9            66
## Porsche 914-2        91
## Lotus Europa        113
## Ford Pantera L      264
## Ferrari Dino        175
## Maserati Bora       335
## Volvo 142E          109

54 / 62

tibbles

Note dropping the data frame attribute is the default for a data.frame but NOT a tibble.

mtcars_tbl <- tibble::as_tibble(mtcars)
mtcars_tbl[ ,4]

## # A tibble: 32 x 1
##       hp
##    <dbl>
##  1   110
##  2   110
##  3    93
##  4   110
##  5   175
##  6   105
##  7   245
##  8    62
##  9    95
## 10   123
## # … with 22 more rows

55 / 62

You can override this

mtcars_tbl[ ,4, drop = TRUE]

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97
## [22] 150 150 245 175  66  91 113 264 175 335 109

56 / 62

More than two dimensions

Depending on your applications, you may not run into this much

array <- 1:12
dim(array) <- c(2, 3, 2)
array

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

57 / 62

Subset array

Select just the second matrix

58 / 62

Subset array

Select just the second matrix

array[ , ,2]

##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

58 / 62

Subset array

Select just the second matrix

array[ , ,2]

##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

Select first column of each matrix

58 / 62

Subset array

Select just the second matrix

array[ , ,2]

##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

Select first column of each matrix

array[ ,1, ]

##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8

58 / 62

Back to lists

Why are they so useful?

Fairly obviously, they're much more flexible
Often returned by functions, for example, lm

m <- lm(mpg ~ hp, mtcars)
str(m)

## List of 12
##  $ coefficients : Named num [1:2] 30.0989 -0.0682
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "hp"
##  $ residuals    : Named num [1:32] -1.594 -1.594 -0.954 -1.194 0.541 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ effects      : Named num [1:32] -113.65 -26.046 -0.556 -0.852 0.67 ...
##   ..- attr(*, "names")= chr [1:32] "(Intercept)" "hp" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:32] 22.6 22.6 23.8 22.6 18.2 ...
##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "hp"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.18 1.08
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 30
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = mpg ~ hp, data = mtcars)
##  $ terms        :Classes 'terms', 'formula'  language mpg ~ hp
##   .. ..- attr(*, "variables")= language list(mpg, hp)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "mpg" "hp"
##   .. .. .. ..$ : chr "hp"
##   .. ..- attr(*, "term.labels")= chr "hp"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x7fd864e4b5c0> 
##   .. ..- attr(*, "predvars")= language list(mpg, hp)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "mpg" "hp"
##  $ model        :'data.frame':    32 obs. of  2 variables:
##   ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##   ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ hp
##   .. .. ..- attr(*, "variables")= language list(mpg, hp)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "mpg" "hp"
##   .. .. .. .. ..$ : chr "hp"
##   .. .. ..- attr(*, "term.labels")= chr "hp"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: 0x7fd864e4b5c0> 
##   .. .. ..- attr(*, "predvars")= language list(mpg, hp)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "hp"
##  - attr(*, "class")= chr "lm"

59 / 62

Summary

Atomic vectors must all be the same type
- implicit coercion occurs if not (and you haven't specified the coercion explicitly)
Lists are also vectors, but not atomic vectors
- Each element can be of a different type and length
- Incredibly flexible, but often a little more difficult to get the hang of, particularly with subsetting

60 / 62

Any time left?

Practice: Fit the model on the slide two previous

m <- lm(mpg ~ hp, mtcars)

Pull each of the following from the model object

effects
names of the effects
qr matrix
third row of the qr matrix
fifth rowname of the qr matrix
data.frame used in the estimation of the model

61 / 62

Next time

Loops with base R

62 / 62

Agenda

Finishing up on coercion
Attributes
Missing values
Intro to lists
Subsetting

2 / 62

Paused

Help

Keyboard shortcuts

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Esc	Back to slideshow