# Tidy data and its friends {#tidy_data}
```{r include = FALSE}
# Caching this markdown file
#knitr::opts_chunk$set(cache = TRUE)
```
## Setup
- Check that your `dplyr` package is up to date by typing `packageVersion("dplyr")`. If the installed version is older than 1.0.0, update it by typing `install.packages("dplyr")`. You may need to restart R for the update to take effect.
```{r}
# Compare against a version string; install.packages() fetches the latest release
if (packageVersion("dplyr") >= "1.0.0") {
  message("The installed version of dplyr is 1.0.0 or later.")
} else {
  install.packages("dplyr")
}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
tidyverse, # the tidyverse framework
skimr, # skimming data
here, # computational reproducibility
#infer, # statistical inference
tidymodels, # statistical modeling
gapminder, # toy data
nycflights13, # for exercise
ggthemes, # additional themes
ggrepel, # repelling overlapping text labels
patchwork, # arranging ggplots
broom, # tidying model outputs
waldo # comparing objects side by side
)
```
## Base R data structure
The rest of the chapter follows the basic structure in [the Data Wrangling Cheat Sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) created by RStudio.
To make the best use of the R language, you'll need a strong understanding of the basic data types and data structures and how to operate on them. R is an **object-oriented** language, so the importance of this cannot be overstated.
It is **critical** to understand because these are the objects you will manipulate on a day-to-day basis in R, and they are not always as easy to work with as they sound at the outset. Dealing with object conversions is one of the most common sources of frustration for beginners.
> To understand computations in R, two slogans are helpful:
>
> - Everything that exists is an object.
> - Everything that happens is a function call.
>
> __John Chambers__, the creator of S (the mother of R)
1. [Main Classes](#main-classes) introduces you to R's one-dimensional or atomic classes and data structures. R has five basic atomic classes: logical, integer, numeric, complex, and character. Social scientists rarely need the complex class.
2. [Attributes](#attributes) takes a small detour to discuss attributes, R's flexible metadata specification. Here, you'll learn about factors, an important data structure created by setting attributes of an atomic vector. R has many data structures: vector, list, matrix, data frame, factors, tables.
![Concept map for data types. By Meghan Sposato, Brendan Cullen, Monica Alonso.](https://github.com/rstudio/concept-maps/raw/master/en/data-types.svg)
### 1D data: Vectors
#### Atomic classes
`R`'s main atomic classes are:
* character (or a "string" in Python and Stata)
* numeric (integer or float)
* integer (just integer)
* logical (booleans)
| Example | Type |
| ------- | ---- |
| "a", "swc" | character |
| 2, 15.5 | numeric |
| `2L` (add an `L` at the end to denote an integer) | integer |
| `TRUE`, `FALSE` | logical |
Like Python, R is dynamically typed. There are a few differences in terminology, however, that are pertinent.
- First, "types" in Python are referred to as "classes" in R.
What is a class?
![from https://brilliant.org/](https://ds055uzetaobb.cloudfront.net/brioche/uploads/pJZt3mh3Ht-prettycars.png?width=2400)
- Second, R has different names for the types string, integer, and float --- specifically **character**, **integer** (not different), and **numeric**. Because there is no "float" class in R, users tend to default to the "numeric" class when working with numerical data.
The function for recovering an object's class is `class()`. Add the `L` suffix to a number to make it an explicit integer. See more in the [R language definition](https://cran.r-project.org/doc/manuals/R-lang.html).
```{r}
class(3)
class(3L)
class("Three")
class(FALSE)
```
### Data structures
R's base data structures can be organized by their dimensionality (1d, 2d, or nd) and whether they're homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data structures most often used in data analysis:
| | Homogeneous | Heterogeneous |
|----|---------------|---------------|
| 1d | Atomic vector | List |
| 2d | Matrix | Data frame |
| nd | Array | |
Each data structure has its own specifications and behavior. For our purposes, an important thing to remember is that R is generally **faster** (more efficient) when working with homogeneous (**vectorized**) data.
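As a rough illustration of the speed point, the sketch below squares the elements of an atomic vector twice: once with an explicit loop and once with a single vectorized call. The names `vals`, `squares_loop`, and `squares_vec` are made up for this example, and timings depend on your machine, so no timing output is shown.

```{r}
# Hypothetical data: 100,000 random numbers
set.seed(1)
vals <- runif(1e5)

# Element-by-element loop over the vector
loop_time <- system.time({
  squares_loop <- numeric(length(vals))
  for (i in seq_along(vals)) {
    squares_loop[i] <- vals[i]^2
  }
})

# Vectorized: one call operates on the whole vector
vec_time <- system.time(squares_vec <- vals^2)

# Both approaches produce the same values
all.equal(squares_loop, squares_vec)

# Elapsed seconds for each approach (machine-dependent)
c(loop = loop_time[["elapsed"]], vectorized = vec_time[["elapsed"]])
```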
#### Vector properties
Vectors have three common properties:
* Class, `class()`, or what type of object it is (same as `type()` in Python).
* Length, `length()`, how many elements it contains (same as `len()` in Python).
* Attributes, `attributes()`, additional arbitrary metadata.
Atomic vectors and lists differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.
#### Creating different types of atomic vectors
Remember, there are four common types of vectors:
* `logical`
* `integer`
* `numeric` (same as `double`)
* `character`.
You can create an empty vector with `vector()`. (By default, the mode is `logical`. You can be more explicit, as shown in the examples below.) It is more common to use direct constructors such as `character()`, `numeric()`, etc.
```{r}
x <- vector()
# with a length and type
vector("character", length = 10)
## character vector of length 5
character(5)
numeric(5)
logical(5)
```
Atomic vectors are usually created with `c()`, which is short for concatenate:
```{r}
x <- c(1, 2, 3)
x
length(x)
```
`x` is a numeric vector. These are the most common kind. You can also have logical vectors.
```{r}
y <- c(TRUE, TRUE, FALSE, FALSE)
y
```
Finally, you can have character vectors:
```{r}
kim_family <- c("Jae", "Sun", "Jane")
is.integer(kim_family) # integer?
is.character(kim_family) # character?
is.atomic(kim_family) # atomic?
typeof(kim_family) # what's the type?
```
**Short exercise: Create and examine your vector**
Create a character vector called `fruit` containing 4 of your favorite fruits. Then evaluate its structure using the commands below.
```{r, eval = FALSE}
# First, create your fruit vector
# YOUR CODE HERE
fruit <-
# Examine your vector
length(fruit)
class(fruit)
str(fruit)
```
**Add elements**
You can add elements to the end of a vector by passing the original vector into the `c` function, like the following:
```{r}
z <- c("Beyonce", "Kelly", "Michelle", "LeToya")
z <- c(z, "Farrah")
z
```
More examples of vectors
```{r}
x <- c(0.5, 0.7)
x <- c(TRUE, FALSE)
x <- c("a", "b", "c", "d", "e")
x <- 9:100
```
You can also create vectors as a sequence of numbers:
```{r}
series <- 1:10
series
seq(10)
seq(1, 10, by = 0.1)
```
Atomic vectors are always flat, even if you nest `c()`'s:
```{r eval = TRUE}
c(1, c(2, c(3, 4)))
# the same as
c(1, 2, 3, 4)
```
**Types and Tests**
Given a vector, you can determine its class with `class`, or check if it's a specific type with an "is" function: `is.character()`, `is.numeric()`, `is.integer()`, `is.logical()`, or, more generally, `is.atomic()`.
```{r }
char_var <- c("harry", "sally")
class(char_var)
is.character(char_var)
is.atomic(char_var)
num_var <- c(1, 2.5, 4.5)
class(num_var)
is.numeric(num_var)
is.atomic(num_var)
```
NB: `is.vector()` does not test if an object is a vector. Instead, it returns `TRUE` only if the object is a vector with no attributes apart from names. Use `is.atomic(x) || is.list(x)` to test if an object is actually a vector.
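A small sketch of this pitfall, using two made-up objects (`plain` and `tagged`) that hold the same values but differ in whether they carry an extra attribute:

```{r}
plain <- 1:3
tagged <- 1:3
attr(tagged, "note") <- "has an extra attribute" # hypothetical attribute

is.vector(plain) # TRUE: no attributes at all
is.vector(tagged) # FALSE: an attribute other than names is present

# The more reliable test for "is this a vector?"
is.atomic(tagged) || is.list(tagged) # TRUE
```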
**Coercion**
All atomic vector elements must be the same type, so when you attempt to combine different types, they will be __coerced__ to the **most flexible type.** Types from least to most flexible are: logical < integer < double < character.
For example, combining a character and a number yields a character:
```{r}
str(c("a", 1))
```
**Guess what the following do without running them first**
```{r, eval = FALSE}
c(1.7, "a")
c(TRUE, 2)
c("a", TRUE)
```
Notice that when a logical vector is coerced to an integer or double, `TRUE` becomes 1, and `FALSE` becomes 0. This is very useful in conjunction with `sum()` and `mean()`:
```{r}
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
# Total number of TRUEs
sum(x)
# Proportion that is TRUE
mean(x)
```
Coercion often happens automatically. This is called implicit coercion. Most mathematical functions (`+`, `log`, `abs`, etc.) will coerce to a numeric or integer, and most logical operations (`&`, `|`, `any`, etc) will coerce to a logical. You will usually get a warning message if the coercion might lose information.
```{r}
1 < "2"
"1" > 2
```
You can also coerce vectors explicitly with `as.character()`, `as.numeric()`, `as.integer()`, or `as.logical()`. Example:
```{r}
x <- 0:6
as.numeric(x)
as.logical(x)
as.character(x)
```
Sometimes coercions, especially nonsensical ones, won't work. R returns `NA` for the elements it cannot convert, and `as.numeric()` also raises a warning.
```{r}
x <- c("a", "b", "c")
as.numeric(x)
as.logical(x)
```
**Short Exercise**
```{r, eval=FALSE}
# 1. Create a vector holding the sequence of numbers from 1 to 10.
# 2. Coerce that vector into a character vector
# 3. Add the element "11" to the end of the vector
# 4. Coerce it back to a numeric vector.
```
#### Lists
Lists are also vectors, but they differ from atomic vectors because their elements can be of any type. In short, they are generic vectors. Lists are sometimes called recursive vectors because a list can contain other lists, which makes them fundamentally different from atomic vectors. You construct lists with `list()` instead of `c()`:
```{r}
x <- list(1, "a", TRUE, c(4, 5, 6))
x
```
You can coerce other objects using `as.list()`. You can test for a list with `is.list()`
```{r}
x <- 1:10
x <- as.list(x)
is.list(x)
length(x)
```
`c()` will combine several lists into one. If given a combination of atomic vectors and lists, `c()` (con**c**atenate) will coerce the vectors to lists before combining them. Compare the results of `list()` and `c()`:
```{r}
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)
str(y)
```
You can turn a list into an atomic vector with `unlist()`. If the elements of a list have different types, `unlist()` uses the same coercion rules as `c()`.
```{r}
x <- list(list(1, 2), c(3, 4))
x
unlist(x)
```
Lists are used to build up many of the more complicated data structures in R. For example, both data frames and linear model objects (as produced by `lm()`) are lists:
```{r}
is.list(mtcars)
mod <- lm(mpg ~ wt, data = mtcars)
is.list(mod)
```
For this reason, lists are handy inside functions. You can "staple" together many different kinds of results into a single object that a function can return.
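A minimal sketch of that "stapling" idea. The helper `summarize_vector()` is hypothetical and written only for this illustration; it bundles results of different types and lengths into one returned list.

```{r}
# Hypothetical helper: returns several kinds of results in a single list
summarize_vector <- function(v) {
  list(
    n = length(v), # a single integer
    mean = mean(v), # a single double
    range = range(v), # a double vector of length 2
    input_class = class(v) # a character string
  )
}

res <- summarize_vector(c(2, 4, 9))
str(res)
res$mean
```

Because each element keeps its own type, the caller can pick out just the pieces it needs by name.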
A list does not print to the console like a vector. Instead, each element of the list starts on a new line.
```{r}
x.vec <- c(1, 2, 3)
x.list <- list(1, 2, 3)
x.vec
x.list
```
For lists, elements are **indexed by double brackets**. Single brackets will still return a(nother) list. (We'll talk more about subsetting and indexing in the fourth lesson.)
**Exercises**
1. What are the four basic types of atomic vectors? How does a list differ from an atomic vector?
2. Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false?
3. Create three vectors and then combine them into a list.
4. If `x` is a list, what is the class of `x[1]`? How about `x[[1]]`?
### Attributes
Attributes provide additional information about the data to you, the user, and to R. We've already seen the following three attributes in action:
* Names (`names(x)`), a character vector giving each element a name.
* Dimensions (`dim(x)`), used to turn vectors into matrices.
* Class (`class(x)`), used to implement the S3 object system.
**Additional tips**
In an object-oriented system, a [class](https://www.google.com/search?q=what+is+class+programming&oq=what+is+class+programming&aqs=chrome.0.0l6.3543j0j4&sourceid=chrome&ie=UTF-8) (an extensible program-code-template) defines a type of object: what its properties are, how it behaves, and how it relates to other types of objects. Therefore, technically, an object is an [instance](https://en.wikipedia.org/wiki/Instance_(computer_science)) (or occurrence) of a class. A method is a function associated with a particular type of object.
#### Names
You can name a vector when you create it:
```{r}
x <- c(a = 1, b = 2, c = 3)
```
You can also modify an existing vector:
```{r}
x <- 1:3
names(x)
names(x) <- c("e", "f", "g")
names(x)
```
Names don't have to be unique. However, character subsetting, described in the next lesson, is the most important reason to use names, and it is most useful when the names are unique. (For Python users: when names are unique, a named vector behaves somewhat like a Python dictionary, with the names acting as keys.)
Not all elements of a vector need to have a name. If some names are missing, `names()` will return an empty string for those elements. If all names are missing, `names()` will return `NULL`.
```{r}
y <- c(a = 1, 2, 3)
names(y)
z <- c(1, 2, 3)
names(z)
```
You can create a new vector without names using `unname(x)`, or remove names in place with `names(x) <- NULL`.
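The difference between the two approaches can be sketched as follows, using a made-up vector called `named`: `unname()` returns a copy and leaves the original alone, while assigning `NULL` modifies the original.

```{r}
named <- c(a = 1, b = 2, c = 3)

# unname() returns a new vector; `named` itself is unchanged
stripped <- unname(named)
names(stripped)
names(named)

# Assigning NULL removes the names from `named` itself
names(named) <- NULL
names(named)
```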
#### Factors
Factors are special vectors that represent categorical data. Factors can be ordered (ordinal variable) or unordered (nominal or categorical variable) and are important for modeling functions such as `lm()` and `glm()` and also in plot methods.
**Quiz**
1. If you want to enter dummy variables (Democrats = 1, Non-democrats = 0) in your regression model, should you use a numeric or factor variable?
Factors can only contain pre-defined values. Set the allowed values with the `levels` argument of `factor()` and inspect them with `levels()`. Note that a factor's levels will always be character values.
```{r}
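x <- factor(c("a", "b", "b", "a"))
x
class(x)
levels(x)
# Assigning a value that is not among the levels yields NA, with a warning
x[2] <- "c"
x
# NB: since R 4.1.0, c() combines factors into one factor with merged levels;
# earlier versions returned the underlying integer codes instead
c(factor("a"), factor("b"))
# rep() with a vector of counts: each of 1:5 is repeated 6 times
rep(1:5, rep(6, 5))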
```
Factors are pretty much integers that have labels on them. Underneath, it's really numbers (1, 2, 3...).
```{r}
x <- factor(c("a", "b", "b", "a"))
str(x)
```
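To make the "integers with labels" idea concrete, here is a short sketch with a made-up factor `party`. It pulls apart the integer codes and the labels, then puts them back together.

```{r}
party <- factor(c("democrat", "republican", "republican", "democrat"))

typeof(party) # "integer": the storage type underneath
as.integer(party) # 1 2 2 1: the codes
levels(party) # the labels, sorted alphabetically by default

# Looking up each code in the levels recovers the original strings
levels(party)[as.integer(party)]
```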
Factors are better than simple integer labels because they are self-describing. For example, `democrat` and `republican` are more descriptive than `1`s and `2`s.
Factors are useful when you know the possible values a variable may take, even if you don't see all values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:
```{r}
party_char <- c("democrat", "democrat", "democrat")
party_char
party_factor <- factor(party_char, levels = c("democrat", "republican"))
party_factor
table(party_char) # shows only democrats
table(party_factor) # shows republicans too
```
Sometimes factors can be left unordered. Example: `democrat`, `republican`.
Other times you might want factors to be ordered (or ranked). Example: `low`, `medium`, `high`.
```{r}
x <- factor(c("low", "medium", "high"))
str(x)
is.ordered(x)
y <- ordered(c("low", "medium", "high"), levels = c("high", "medium", "low"))
is.ordered(y)
```
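One practical payoff of ordering is that comparisons and `max()` become meaningful. The sketch below uses a hypothetical ordered factor `rating` whose levels run from `low` to `high`.

```{r}
rating <- factor(
  c("low", "high", "medium"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

rating < "high" # TRUE FALSE TRUE: comparisons follow the level order
max(rating) # high
sort(rating) # sorted by level order, not alphabetically
```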
While factors look (and often behave) like character vectors, they are integers. So be careful when treating them like strings. Some string methods (like `gsub()` and `grepl()`) will coerce factors to strings, while others (like `nchar()`) will throw an error, and still others (like `c()`) will use the underlying integer values.
```{r}
x <- c("a", "b", "b", "a")
x
is.factor(x)
x <- as.factor(x)
x
c(x, "c")
```
For this reason, it's usually best to explicitly convert factors to character vectors if you need string-like behavior. There was a memory advantage to using factors instead of character vectors in early versions of R, but this is no longer the case.
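A common case where this matters is a factor whose labels look like numbers. In the sketch below, `years` is a made-up example: converting it straight to numeric returns the integer codes, while going through character first returns the values you probably wanted.

```{r}
years <- factor(c("2010", "2020", "2010"))

as.numeric(years) # 1 2 1: the underlying codes, not the years
as.numeric(as.character(years)) # 2010 2020 2010
```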
Before R 4.0.0, most base data-loading functions (such as `read.csv()`) and `data.frame()` automatically converted character vectors to factors. This is suboptimal because those functions cannot know the set of all possible levels or their optimal order. From R 4.0.0 onward, the default is `stringsAsFactors = FALSE`; in older versions, set that argument explicitly and convert character vectors to factors yourself using your knowledge of the data.
**More attributes**
All R objects can have arbitrary additional attributes used to store metadata about the object. Attributes can be considered a named list (with unique names). Attributes can be accessed individually with `attr()` or all at once (as a list) with `attributes()`.
```{r}
y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")
# str() compactly displays the list of attributes
str(attributes(y))
```
**Exercises**
1. What happens to a factor when you modify its levels?
```{r, results = "hide"}
f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
f1
```
2. What does this code do? How do `f2` and `f3` differ from `f1`?
```{r, results = "hide"}
f2 <- rev(factor(letters))
f3 <- factor(letters, levels = rev(letters))
```
### 2D data: Matrices and dataframes
1. Matrices: data structures for storing 2d data that is all the same class.
2. Dataframes: the most important data structure for storing data in R, because its columns can hold different classes of (2d) data.
#### Matrices
Matrices are created when we combine multiple vectors with the same class (e.g., numeric). This creates a dataset with rows and columns. By definition, if you want to combine multiple classes of vectors, you want a dataframe. You can coerce a matrix to become a dataframe and vice-versa, but as with all vector coercions, the results can be unpredictable, so be sure you know how each variable (column) will convert.
```{r}
m <- matrix(nrow = 2, ncol = 2)
m
dim(m)
```
Matrices are filled column-wise.
```{r}
m <- matrix(1:6, nrow = 2, ncol = 3)
m
```
Other ways to construct a matrix
```{r}
m <- 1:10
dim(m) <- c(2, 5)
m
dim(m) <- c(5, 2)
m
```
You can transpose a matrix (or dataframe) with `t()`
```{r}
m <- 1:10
dim(m) <- c(2, 5)
m
t(m)
```
Another way is to bind columns or rows using `cbind()` and `rbind()`.
```{r}
x <- 1:3
y <- 10:12
cbind(x, y)
# or
rbind(x, y)
```
You can also use the `byrow` argument to specify how the matrix is filled. From R's own documentation:
```{r}
mdat <- matrix(c(1, 2, 3, 11, 12, 13),
nrow = 2,
ncol = 3,
byrow = TRUE,
dimnames = list(
c("row1", "row2"),
c("C.1", "C.2", "C.3")
)
)
mdat
```
Notice that we gave `names` to the dimensions in `mdat`.
```{r}
dimnames(mdat)
rownames(mdat)
colnames(mdat)
```
#### Dataframes
A data frame is an essential data type in R. It's pretty much the **de facto** data structure for most tabular data and what we use for statistics.
##### Creation
You create a data frame using `data.frame()`, which takes named vectors as input:
```{r}
vec1 <- 1:3
vec2 <- c("a", "b", "c")
df <- data.frame(vec1, vec2)
df
str(df)
```
Beware: before R 4.0.0, `data.frame()` turned strings into factors by default. In those versions, use `stringsAsFactors = FALSE` to suppress this behavior as needed (it is the default from R 4.0.0 onward):
```{r}
df <- data.frame(
x = 1:3,
y = c("a", "b", "c"),
stringsAsFactors = FALSE
)
str(df)
```
In reality, we rarely type up our datasets ourselves, and certainly not in R. The most common way to make a data.frame is by reading a file with `read.csv()` (part of base R's `utils` package), `read.dta()` (from the `foreign` package, if you're using a Stata file), or some other data import function.
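As a self-contained sketch of reading a file into a data frame, the chunk below first writes a tiny CSV to a temporary file and then reads it back. The file and its columns (`id`, `name`) are invented for the example; with real data you would pass your own file path to `read.csv()`.

```{r}
# Write a small, made-up dataset to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, name = c("a", "b", "c")),
  tmp,
  row.names = FALSE
)

# Read it back in as a data frame
from_disk <- read.csv(tmp)
str(from_disk)

# Clean up the temporary file
unlink(tmp)
```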
##### Structure and Attributes
Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list.
```{r}
vec1 <- 1:3
vec2 <- c("a", "b", "c")
df <- data.frame(vec1, vec2)
str(df)
```
This means that a dataframe has `names()`, `colnames()`, and `rownames()`, although `names()` and `colnames()` are the same thing.
**Summary**
- Set column names: `names()` in data frame, `colnames()` in matrix
- Set row names: `row.names()` in data frame, `rownames()` in matrix
```{r}
vec1 <- 1:3
vec2 <- c("a", "b", "c")
df <- data.frame(vec1, vec2)
# these two are equivalent
names(df)
colnames(df)
# change the colnames
colnames(df) <- c("Number", "Character")
df
```
```{r}
names(df) <- c("Number", "Character")
df
```
```{r}
# change the rownames
rownames(df)
rownames(df) <- c("donut", "pickle", "pretzel")
df
```
The `length()` of a dataframe is the length of the underlying list and so is the same as `ncol()`; `nrow()` gives the number of rows.
```{r}
vec1 <- 1:3
vec2 <- c("a", "b", "c")
df <- data.frame(vec1, vec2)
# these two are equivalent - number of columns
length(df)
ncol(df)
# get number of rows
nrow(df)
# get number of both columns and rows
dim(df)
```
##### Testing and coercion
To check if an object is a dataframe, use `class()` or test explicitly with `is.data.frame()`:
```{r}
class(df)
is.data.frame(df)
```
You can coerce an object to a dataframe with `as.data.frame()`:
* A vector will create a one-column dataframe.
* A list will create one column for each element; it's an error if they're not all the same length.
* A matrix will create a data frame with the same number of columns and rows as the matrix.
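The three cases above can be sketched as follows. The object names (`from_vector`, `from_list`, `from_matrix`) and the input values are made up for illustration.

```{r}
# A vector becomes a one-column data frame
from_vector <- as.data.frame(c(10, 20, 30))
dim(from_vector)

# A list becomes one column per element
from_list <- as.data.frame(list(id = 1:2, label = c("x", "y")))
from_list

# A matrix keeps its number of rows and columns
from_matrix <- as.data.frame(matrix(1:6, nrow = 2))
from_matrix
```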
##### Combining dataframes
You can combine dataframes using `cbind()` and `rbind()`:
```{r}
df <- data.frame(
x = 1:3,
y = c("a", "b", "c"),
stringsAsFactors = FALSE
)
cbind(df, data.frame(z = 3:1))
rbind(df, data.frame(x = 10, y = "z"))
```
When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match. (If you want to combine rows that don't have the same columns, other functions/packages in R can help.)
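A short sketch of the row-wise rule, using made-up data frames. `rbind()` matches columns by name, so the column order may differ, but a column name that is missing from the first data frame triggers an error. Here the error is captured with `tryCatch()` so the chunk keeps running.

```{r}
top <- data.frame(x = 1:2, y = c("a", "b"))

# Same column names in a different order: columns are matched by name
stacked <- rbind(top, data.frame(y = "c", x = 3L))
stacked

# A column name that does not match (`w` instead of `y`) is an error
err <- tryCatch(
  rbind(top, data.frame(x = 4L, w = "d")),
  error = function(e) conditionMessage(e)
)
err
```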
It's a common mistake to try and create a dataframe by `cbind()`ing vectors together. This doesn't work because `cbind()` will create a matrix unless one of the arguments is already a dataframe. Instead use `data.frame()` directly:
```{r}
bad <- (cbind(x = 1:2, y = c("a", "b")))
bad
str(bad)
good <- data.frame(
x = 1:2, y = c("a", "b"),
stringsAsFactors = FALSE
)
good
str(good)
```
The conversion rules for `cbind()` are complicated and best avoided by ensuring all inputs are of the same type.
**Other objects**
Missing values are specified with `NA`, which is a logical vector of length 1. `NA` will always be coerced to the correct type if used inside `c()`:
```{r}
x <- c(NA, 1)
x
typeof(NA)
typeof(x)
```
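The same coercion can be checked for the other atomic types. This sketch only uses base R; the typed constants `NA_real_` and `NA_character_` are built into the language.

```{r}
typeof(c(NA, TRUE)) # "logical"
typeof(c(NA, 1L)) # "integer"
typeof(c(NA, "a")) # "character"

# Typed missing values exist for when you need one outside c()
typeof(NA_real_) # "double"
typeof(NA_character_) # "character"

# is.na() finds missing values regardless of type
is.na(c(NA, "a"))
```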
`Inf` is infinity. You can have either positive or negative infinity.
```{r}
1 / 0
1 / Inf
```
`NaN` means Not a number. It's an undefined value.
```{r}
0 / 0
```
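The special values are easy to confuse, so here is a small sketch of how the base R test functions treat them. The vector name `specials` is made up for the example.

```{r}
-1 / 0 # negative infinity

specials <- c(1, Inf, -Inf, NaN, NA)

is.finite(specials) # TRUE FALSE FALSE FALSE FALSE
is.nan(specials) # FALSE FALSE FALSE TRUE FALSE
is.na(specials) # FALSE FALSE FALSE TRUE TRUE
```

Note that `is.na()` is `TRUE` for both `NaN` and `NA`, while `is.nan()` is `TRUE` only for `NaN`.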
### Subset
When working with data, you'll need to subset objects early and often. Luckily, R's subsetting operators are powerful and fast. Mastery of subsetting allows you to succinctly express complex operations in a way that few other languages can match. Subsetting is hard to learn because you need to master several interrelated concepts:
* The three subsetting operators, `[`, `[[`, and `$`.
* Important differences in behavior for different objects (e.g., vectors, lists, factors, matrices, and data frames).
* The use of subsetting in conjunction with assignment.
This unit helps you master subsetting by starting with the simplest type of subsetting: subsetting an atomic vector with `[`. It then gradually extends your knowledge to more complicated data types (like dataframes and lists) and then to the other subsetting operators, `[[` and `$`. You'll then learn how subsetting and assignment can be combined to modify parts of an object, and, finally, you'll see a large number of useful applications.
#### Atomic vectors
Let's explore the different types of subsetting with a simple vector, `x`.
```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
```
Note that the number after the decimal point gives the original position in the vector.
**NB:** In R, positions start at 1, unlike Python, where they start at 0. Fun!
There are five things that you can use to subset a vector:
##### Positive integers
```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
x
x[1]
x[c(3, 1)]
# `order(x)` gives the positions of smallest to largest values.
order(x)
x[order(x)]
x[c(1, 3, 2, 4)]
# Duplicated indices yield duplicated values
x[c(1, 1)]
```
##### Negative integers
```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
x[-1]
x[-c(3, 1)]
```
You can't mix positive and negative integers in a single subset:
```{r, error = TRUE}
x <- c(2.1, 4.2, 3.3, 5.4)
x[c(-1, 2)]
```
##### Logical vectors
```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
x[c(TRUE, TRUE, FALSE, FALSE)]
```
This is probably the most useful type of subsetting because you write the expression that creates the logical vector.
```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
# this returns a logical vector
x > 3
x
# use a conditional statement to create an implicit logical vector
x[x > 3]
```
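The logical vector does not have to come from the vector being subset. In this sketch, with two made-up parallel vectors `ages` and `people`, a condition on one vector selects elements of the other; `which()` and `sum()` show the matching positions and their count.

```{r}
ages <- c(23, 35, 17, 41)
people <- c("Ana", "Ben", "Cy", "Di")

# Condition on `ages`, subset of `people`
people[ages >= 21] # "Ana" "Ben" "Di"

which(ages >= 21) # 1 2 4: positions where the condition holds
sum(ages >= 21) # 3: number of TRUEs
```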
You can combine conditional statements with `&` (and), `|` (or), and `!` (not)
```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
# combining two conditional statements with &
x > 3 & x < 5
x[x > 3 & x < 5]
# combining two conditional statements with |
x < 3 | x > 5
x[x < 3 | x > 5]
# negating a conditional statement with !
!(x > 5)
x[!(x > 5)]
```
Another way to generate implicit conditional statements is using the `%in%` operator, which works like the `in` keyword in Python.
```{r}
# generate implicit logical vectors through the %in% operator
x %in% c(3.3, 4.2)
x
x[x %in% c(3.3, 4.2)]
```
##### Character vectors
```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
# apply names