forked from isi-avbulimen/R-cookbook
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path01-basics.Rmd
311 lines (222 loc) · 10 KB
/
01-basics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
# The basics {#basics}
```{r include=FALSE}
# Every file must contain at least one R chunk due to the linting process.
```
## R family
A few of the common R relations are
- R is the programming language, born in 1997, based on S, [honest](https://en.wikipedia.org/wiki/R_(programming_language)).
- RStudio is a useful integrated development environment (IDE) that makes it cleaner to write, run and organise R code.
- Rproj is the file extension for an R project, essentially a working directory marker, shortens file paths, keeps everything relevant to your project together and easy to reference.
- packages are collections of functions written to make specific tasks easier, eg the {stringr} package contains functions to work with strings. Packages are often referred to as libraries in other programming languages.
- .R is the file extension for a basic R script in which anything not commented out with `#` is code that is run.
- .Rmd is the file extension for Rmarkdown an R package useful for producing reports. A .Rmd script is different to a .R script in that the default is text rather than code. Code is placed in code chunks - similar to how a Jupyter Notebook looks.
## DfT R/RStudio - subject to change
Which version of R/RStudio should I use at DfT? A good question. Currently the 'best' version of R we have available on network is linked to RStudio version 11453. This can be accessed via the Citrix app on the Windows 10 devices, or via Citrix desktop. The local version of RStudio on the Windows 10 devices is currently unusable (user testing is ongoing to change this). There is also a 11423 version of RStudio available which uses slightly older versions of packages.
## RStudio IDE
The RStudio integrated development environment has some very useful features which make writing and organising code a lot easier. It's divided into 3 panes;
### Left (bottom left if you have scripts open)
- this is the **Console** it shows you what code has been run and outputs.
### Top right; **Environment**, and other tabs
- **Environment** tab shows what objects have been created in the global environment in the current session.
- **Connections** tab will show any connections you have set up this session, for example, to an SQL server.
### Bottom right
- **Files** tab shows what directory you are in and the files there.
- **Plots** tab shows all the plot outputs created this session, you can navigate through them.
- **Packages** tab shows a list of installed packages, if the box in front of the package name is checked then this package has been loaded this session.
- **Help** tab can be used to search for help on a topic/package function, it also holds any output from `?function_name` help command that has been run in the console, again you can navigate through help topics using the left and right arrows.
- **Viewer** tab can be used to view local web content.
For some pictures have a look at DfE's **R Training Course** [getting started with rstudio](https://dfe-analytical-services.github.io/r-training-course/getting-started-with-rstudio.html)
Or Matt Dray's [Beginner R Featuring Pokemon: the RStudio interface](https://matt-dray.github.io/beginner-r-feat-pkmn/#4_the_rstudio_interface)
### Other handy buttons in RStudio IDE
- top left new script icon; blank page with green circle and white cross.
- top right project icon; 3D transparent light blue cube with R. Use this to create and open projects.
### RStudio menu bar a few pointers
- **View** contains **Zoom In** **Zoom Out**
- **Tools** -> **Global Options** contains many useful setting tabs such as **Appearance** where you can change the RStudio theme, and **Code** -> **Display** where you can set a margin vertical guideline (default is 80 characters).
## Projects
Why you should work in an R project, how to set up and project happiness. See this section of [Beginner R Featuring Pokemon](https://matt-dray.github.io/beginner-r-feat-pkmn/#3_project_working) by Matt Dray.
### Folders
When you set up a project it is good practise to include separate folders for different types of files such as
- data; for the data your R code is using
- output; for files creates by your code
- R; all your code files, eg .R, .Rmd
- images
### sessionInfo()
Include a saved file of the sessionInfo() output, this command prints out the versions of all the packages currently loaded. This information is essential when passing on code as packages can be updated and code breaking changes made.
## R memory
R works in RAM, so its memory is only as good as the amount of RAM you have - however this should be sufficient for most tasks. More info in the **Memory** chapter of Advanced R by Hadley Wickham [here](http://adv-r.had.co.nz/memory.html).
## A note on rounding
For rounding numerical values we have the base function `round(x, digits = 0)`. This rounds the value of the first argument to the specified number of decimal places (default 0).
```{r}
round(c(-1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5))
```
For example, note that 1.5 and 2.5 both round to 2, which is probably not what you were expecting, this is generally referred to as 'round half to even'. The `round()` documentation explains all (`?round`)
> Note that for rounding off a 5, the IEC 60559 standard (see also ‘IEEE 754’)
is expected to be used, *‘go to the even digit’*. Therefore `round(0.5)` is `0`
and `round(-1.5)` is `-2`. However, this is dependent on OS services and on
representation error (since e.g. `0.15` is not represented exactly, the
rounding rule applies to the represented number and not to the printed number,
and so `round(0.15, 1)` could be either `0.1` or `0.2`).
To implement what we consider normal rounding we can use the {janitor} package and the function `round_half_up`
```{r, warning=FALSE}
library(janitor)
janitor::round_half_up(c(-1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5))
```
If we do not have access to the package (or do not want to depend on the package) then we can implement^[see [stackoverflow](https://stackoverflow.com/questions/12688717/round-up-from-5/12688836#12688836)
```{r}
round_half_up_v2 <- function(x, digits = 0) {
posneg <- sign(x)
z <- abs(x) * 10 ^ digits
z <- z + 0.5
z <- trunc(z)
z <- z / 10 ^ digits
z * posneg
}
round_half_up_v2(c(-1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5))
```
## Assignment operators `<-` vs `=`
To assign or to equal? These are not always the same thing. In R to assign a value to a variable it is advised to use `<-` rather than `=`. The latter is generally used for setting parameters inside functions, e.g., `my_string <- stringr::str_match(string = "abc", pattern = "a")`. More on assignment operators [here](https://stat.ethz.ch/R-manual/R-patched/library/base/html/assignOps.html).
## Arithmetic operators
- addition
```{r}
1 + 2
```
- subtraction
```{r}
5 - 4
```
- multiplication
```{r}
2 * 2
```
- division
```{r}
3 / 2
```
- exponent
```{r}
3 ^ 2
```
- modulus (remainder on divsion)
```{r}
14 %% 6
```
- integer division
```{r}
50 %/% 8
```
## Relational operators
- less than
```{r}
3.14 < 3.142
```
- greater than
```{r}
3.14159 > 3
```
- less than or equal to
```{r}
3 <= 3.14
3.14 <= 3.14
```
- greater than or equal to
```{r}
3 >= 3.14
3.14 >= 3.14
```
- equal to
```{r}
3 == 3.14159
```
- not equal to
```{r}
3 != 3.14159
```
## Logical operators
Logical operations are possible only for numeric, logical or complex types. Note that 0 (or complex version 0 + 0i) is equivalent to `FALSE`, and all other numbers (numeric or complex) are equivalent to `TRUE`.
- not `!`
```{r}
x <- c(TRUE, 0, FALSE, -4)
!x
```
- element-wise and `&`
```{r}
y <- c(3.14, FALSE, TRUE, 0)
x & y
```
- first element and `&&`
```{r}
x && y
```
- element-wise or `|`
```{r}
x | y
```
- first element or `||`
```{r}
z <- c(0, FALSE, 8)
y || z
```
## Vectors
### Types {#vector-types}
There are four main atomic vector types that you are likely to come across
when using R^[technically there are more, see
https://adv-r.hadley.nz/vectors-chap.html#atomic-vectors]; **logical** (`TRUE`
or `FALSE`), *double* (`3.142`), *integer* (`2L`) and *character* (`"Awesome"`)
```{r}
v1 <- TRUE
typeof(v1)
v1 <- FALSE
typeof(v1)
v2 <- 1.5
typeof(v2)
v2 <- 1
typeof(v2)
# integer values must be followed by an L to be stored as integers
v3 <- 2
typeof(v3)
v3 <- 2L
typeof(v3)
v4 <- "Awesome"
typeof(v4)
```
As well as the atomic vector types you will often encounter two other vector
types; **Date** and **factor** . As well as some notes here this book also contains fuller sections on both
- Chapter 5 [Working with dates and times]
- Chapter 6 [Working with factors]
Factor vectors are used to represent categorical data. They are actually integer vectors with two additional attributes, levels and class. At this stage it is not worth worrying too much about what attributes are, but is suffiecient to understand that, for factors, the levels attribute gives the possible categories, and combined with the integer values works much like a lookup table. The `class` attribute is just "factor".
```{r}
ratings <- factor(c("good", "bad", "bad", "amazing"))
typeof(ratings)
attributes(ratings)
```
Date vectors are just vectors of class double with an additional class attribute set as "Date".
```{r}
DfT_birthday <- lubridate::as_date("1919-08-14")
typeof(DfT_birthday)
attributes(DfT_birthday)
```
If we remove the class using `unclass()` we can reveal the value of the double, which is the number of days since "1970-01-01"^[a special date known as the Unix Epoch], since DfT's birthday is before this date, the double is negative.
```{r}
unclass(DfT_birthday)
```
### Conversion between atomic vector types
Converting between the atomic vector types is done using the `as.character`, `as.integer`, `as.logical` and `as.double` functions.
```{r}
value <- 1.5
as.integer(value)
as.character(value)
as.logical(value)
```
Where it is not possible to convert a value you will get a warning message
```{r}
value <- "z"
as.integer(value)
```
When combining different vector types, coercion will obey the following hierarchy: character, double, integer, logical.
```{r}
typeof(c(9.9, 3L, "pop", TRUE))
typeof(c(9.9, 3L, TRUE))
typeof(c(3L, TRUE))
typeof(TRUE)
```