######################################################################################################################
### Author: Anni Norring ###
### Date: April 2018 ###
### Content: This script contains the R code for the 3rd week of Advanced R programming course ###
######################################################################################################################
# Access all the needed libraries:
library(dplyr)
library(tidyr)
library(readxl)
library(stringr)
library(readr)
library(lubridate)
library(ggplot2)
######################################################################################################################
### PROFILING
######################################################################################################################
#Some of the R code that you write will be slow. Slow code often isn’t worth fixing in a script that you will only
# evaluate a few times, as the time it will take to optimize the code will probably exceed the time it takes the
# computer to run it. However, if you are writing functions that will be used repeatedly, it is often worthwhile to
# identify slow sections of the code so you can try to improve speed in those sections.
#In this section, we will introduce the basics of profiling R code, using functions from two packages, microbenchmark
# and profvis.
# Profiling can help you identify bottlenecks in R code.
# The microbenchmark package helps you profile short pieces of code and compare functions with each other. It runs
# the code many times and provides summary statistics across the iterations.
# The profvis package allows you to visualize performance across more extensive code. It can be used to profile code
# within functions being developed for a package, as long as the package source code has been loaded locally using
# devtools::load_all.
######################################################################################################################
### MICROBENCHMARK
######################################################################################################################
#The microbenchmark package is useful for running small sections of code to assess performance, as well as for
# comparing the speed of several functions that do the same thing. The microbenchmark function from this package
# will run code multiple times (100 times is the default) and provide summary statistics describing how long the
# code took to run across those iterations. The process of timing a function takes a certain amount of time itself.
# The microbenchmark function adjusts for this overhead time by running a certain number of “warm-up” iterations
# before running the iterations used to time the code.
#You can use the times argument in microbenchmark to customize how many iterations are used. For example, if you are
# working with a function that is a bit slow, you might want to run the code fewer times when benchmarking
# (although with slower or more complex code, it likely will make more sense to use a different tool for profiling,
# like profvis).
#You can include multiple lines of code within a single call to microbenchmark. However, to get separate benchmarks
# of each line of code, you must separate each line by a comma:
library(microbenchmark)
microbenchmark(a <- rnorm(1000),
               b <- mean(rnorm(1000)))
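#As a minimal sketch of the times argument described above, the call below runs only 10 iterations instead of the
# default 100. The value 10 and the object name quick_perf are arbitrary choices for illustration.
library(microbenchmark)
quick_perf <- microbenchmark(mean(rnorm(1000)), times = 10L)
# The returned object stores one timing per iteration, so here it has 10 rows
nrow(quick_perf)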
#The microbenchmark function is particularly useful for comparing functions that take the same inputs and return the
# same outputs. As an example, say we need a function that can identify days that meet two conditions: (1) the
# temperature equals or exceeds a threshold temperature (27 degrees Celsius in the examples) and (2) the
# temperature equals or exceeds the hottest temperature in the data before that day. We are aiming for a function
# that can input a data frame that includes a column named temp with daily mean temperature in Celsius, like this
# data frame:
#date temp
#2015-07-01 26.5
#2015-07-02 27.2
#2015-07-03 28.0
#2015-07-04 26.9
#2015-07-05 27.5
#2015-07-06 25.9
#2015-07-07 28.0
#2015-07-08 28.2
# and outputs a data frame that has an additional binary record_temp column, specifying if that day meets the two
# conditions, like this:
#date temp record_temp
#2015-07-01 26.5 FALSE
#2015-07-02 27.2 TRUE
#2015-07-03 28.0 TRUE
#2015-07-04 26.9 FALSE
#2015-07-05 27.5 FALSE
#2015-07-06 25.9 FALSE
#2015-07-07 28.0 TRUE
#2015-07-08 28.2 TRUE
#Below are two example functions that can perform these actions. Since the record_temp column depends on
# temperatures up to that day, one option is to use a loop to create this value. The first function takes this
# approach. The second function instead uses tidyverse functions to perform the same tasks.
# Function that uses a loop
find_records_1 <- function(datafr, threshold){
  highest_temp <- c()
  record_temp <- c()
  for(i in 1:nrow(datafr)){
    highest_temp <- max(highest_temp, datafr$temp[i])
    record_temp[i] <- datafr$temp[i] >= threshold &
      datafr$temp[i] >= highest_temp
  }
  datafr <- cbind(datafr, record_temp)
  return(datafr)
}
# Function that uses tidyverse functions
find_records_2 <- function(datafr, threshold){
  # Flag days that are at or above the threshold and also equal the running maximum
  datafr <- datafr %>%
    mutate(over_threshold = temp >= threshold,
           cummax_temp = temp == cummax(temp),
           record_temp = over_threshold & cummax_temp) %>%
    select(-over_threshold, -cummax_temp)
  return(as.data.frame(datafr))
}
#If you apply the two functions to the small example data set, you can see that they both create the desired output:
example_data <- tibble(date = c("2015-07-01", "2015-07-02",
                                "2015-07-03", "2015-07-04",
                                "2015-07-05", "2015-07-06",
                                "2015-07-07", "2015-07-08"),
                       temp = c(26.5, 27.2, 28.0, 26.9,
                                27.5, 25.9, 28.0, 28.2))
(test_1 <- find_records_1(example_data, 27))
(test_2 <- find_records_2(example_data, 27))
all.equal(test_1, test_2)
#The performance of these two functions can be compared using microbenchmark:
record_temp_perf <- microbenchmark(find_records_1(example_data, 27),
                                   find_records_2(example_data, 27))
record_temp_perf
#This output gives summary statistics (min, lq, mean, median, uq, and max) describing the time it took to run the
# two functions over the 100 iterations of each function call. By default, these times are given in a reasonable
# unit, based on the observed profiling times (units are given in microseconds in this case).
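#If you would rather fix the reporting unit than rely on the default choice, both the print and summary methods for
# microbenchmark objects accept a unit argument. The sketch below uses a small stand-alone benchmark (unit_perf and
# perf_summary are illustrative names) so that it can be run on its own:
library(microbenchmark)
unit_perf <- microbenchmark(sqrt(1:1000), times = 20L)
# Print the summary statistics in milliseconds
print(unit_perf, unit = "ms")
# summary() returns a data frame, which is convenient for pulling out single statistics
perf_summary <- summary(unit_perf, unit = "us")
perf_summary$median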
#It’s useful to check next to see if the relative performance of the two functions is similar for a bigger data set.
# The chicagoNMMAPS data set from the dlnm package includes temperature data over 14 years in Chicago, IL. Here
# are the results when we benchmark the two functions with that data (note, this code takes a minute or two to
# run):
library(dlnm)
data("chicagoNMMAPS")
record_temp_perf_2 <- microbenchmark(find_records_1(chicagoNMMAPS, 27),
                                     find_records_2(chicagoNMMAPS, 27))
record_temp_perf_2
#While the function with the loop (find_records_1) performed better with the very small sample data, the function
# that uses tidyverse functions (find_records_2) performs much, much better with a larger data set.
#The microbenchmark function returns an object of the “microbenchmark” class. This class has two methods for
# plotting results, autoplot.microbenchmark and boxplot.microbenchmark. To use the autoplot method, you will need
# to have ggplot2 loaded in your R session.
# For small example data
autoplot(record_temp_perf)
# For larger data set
autoplot(record_temp_perf_2)
#By default, this plot gives the “Time” axis on a log scale. You can change this with the argument log = FALSE.
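#The sketch below shows the log = FALSE option together with the boxplot method mentioned earlier. It uses a small
# stand-alone benchmark so that it can be run on its own; plot_perf and linear_plot are illustrative names.
library(microbenchmark)
library(ggplot2)
plot_perf <- microbenchmark(sort(runif(1000)), order(runif(1000)), times = 20L)
# Same kind of plot as above, but with the "Time" axis on a linear scale
linear_plot <- autoplot(plot_perf, log = FALSE)
linear_plot
# Base graphics alternative, which does not need ggplot2
boxplot(plot_perf)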
######################################################################################################################
### PROFVIS
######################################################################################################################
#Once you’ve identified slower code, you’ll likely want to figure out which parts of the code are causing bottlenecks.
# The profvis function from the profvis package is very useful for this type of profiling. This function uses the
# Rprof function from base R to profile code, and then displays it in an interactive visualization in RStudio.
# This profiling is done by sampling, with the Rprof function writing out the call stack every 10 milliseconds
# while running the code.
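#To get a feel for what profvis builds on, you can call Rprof directly. This is only a sketch: the file name
# prof_file, the helper busy_work, and the half-second run time are assumptions made for illustration.
prof_file <- tempfile(fileext = ".out")
# Hypothetical workload that keeps R busy for roughly half a second
busy_work <- function(seconds = 0.5){
  start <- proc.time()[["elapsed"]]
  while(proc.time()[["elapsed"]] - start < seconds){
    sum(sqrt(runif(1e4)))
  }
  invisible(NULL)
}
Rprof(prof_file, interval = 0.01)   # start sampling the call stack every 10 milliseconds
busy_work()
Rprof(NULL)                         # stop profiling
prof_summary <- summaryRprof(prof_file)
head(prof_summary$by.self)          # time spent in each function itself, excluding the functions it calls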
#To profile code with profvis, just input the code (in braces if it is multi-line) into profvis within RStudio. For
# example, we found that the find_records_1 function was slow when used with a large data set. To profile the code
# in that function, run:
library(profvis)
datafr <- chicagoNMMAPS
threshold <- 27
profvis({
  highest_temp <- c()
  record_temp <- c()
  for(i in 1:nrow(datafr)){
    highest_temp <- max(highest_temp, datafr$temp[i])
    record_temp[i] <- datafr$temp[i] >= threshold &
      datafr$temp[i] >= highest_temp
  }
  datafr <- cbind(datafr, record_temp)
})
#The profvis output gives you two options for visualization: “Flame Graph” or “Data” (a button to toggle between the
# two is given in the top left of the profvis visualization created when you profile code). The “Data” output
# defaults to show you the time usage of each first-level function call. Each of these calls can be expanded to
# show deeper and deeper functions calls within the call stack. This expandable interface allows you to dig down
# within a call stack to determine what calls are causing big bottlenecks. For functions that are part of a package
# you have loaded with devtools::load_all, this output includes a column with the file name where a given function
# is defined. This functionality makes this “Data” output pane particularly useful in profiling functions in a
# package you are creating.
#The “Flame Graph” view in profvis output gives you two panels. The top panel shows the code called, with bars on
# the right to show memory use and time spent on the line. The bottom panel also visualizes the time used by each
# line of code, but in this case it shows time use horizontally and shows the full call stack at each time sample,
# with initial calls shown at the bottom of the graph, and calls deeper in the call stack higher in the graph.
# Clicking on a block in the bottom panel will show more information about a call, including which file it was
# called from, how much time it took, how much memory it took, and its depth in the call stack.
#Figure 2.3 shows example output from profiling the code in the find_records_1 function defined earlier in this
# section.
#Based on this visualization, most of the time is spent on line 6, filling in the record_temp vector. Now that we
# know this, we could try to improve the function, for example by doing a better job of initializing vectors
# before running the loop.
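#One possible version of that improvement is sketched below. The name find_records_3 is hypothetical (it is not part
# of the course material), and the sketch assumes the temp column has no missing values. It allocates record_temp
# at its full length before the loop and pulls the temp column out of the data frame once, rather than on every pass.
find_records_3 <- function(datafr, threshold){
  temp <- datafr$temp
  record_temp <- logical(length(temp))   # preallocated to the final length
  highest_temp <- -Inf
  for(i in seq_along(temp)){
    highest_temp <- max(highest_temp, temp[i])
    record_temp[i] <- temp[i] >= threshold & temp[i] >= highest_temp
  }
  datafr <- cbind(datafr, record_temp)
  return(datafr)
}
# Quick check on a small stand-alone data frame with the same temperatures as the example data
small_data <- data.frame(temp = c(26.5, 27.2, 28.0, 26.9, 27.5, 25.9, 28.0, 28.2))
find_records_3(small_data, 27)$record_temp
# Whether this is actually faster on your machine is best checked with microbenchmark against find_records_1.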
#The profvis visualization can be used to profile code in functions you’re writing as part of a package. If some of
# the functions in the code you are profiling are in a package currently loaded with devtools::load_all,
# the top panel in the Flame Graph output will include the code defining those functions, which allows you to
# explore speed and memory use within the code for each function. You can also profile code within functions from
# other packages; for more details on the proper set-up, see the “FAQ” section of RStudio’s profvis documentation.
#The profvis function will not be able to profile code that runs too quickly. Trying to profile functions that are too
# fast will give you the following error message:
#Error in parse_rprof(prof_output, expr_source) :
# No parsing data available. Maybe your function was too fast?
#You can use the argument interval in profvis to customize the sampling interval. The default is to sample every 10
# milliseconds (interval = 0.01), but you can decrease this sampling interval. In some cases, you may be able to
# use this option to profile faster-running code. However, you should avoid using an interval smaller than about
# 5 milliseconds, as below that you will get inaccurate estimates with profvis. If you are running very fast code,
# you’re better off profiling with microbenchmark, which can give accurate estimates at finer time intervals.
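#As a sketch of the interval argument, the call below samples every 5 milliseconds, the lower limit suggested above.
# The half-second loop is a made-up workload that runs long enough to be sampled; fast_profile is an illustrative
# name.
library(profvis)
fast_profile <- profvis({
  start <- proc.time()[["elapsed"]]
  while(proc.time()[["elapsed"]] - start < 0.5){
    sum(sqrt(runif(1e4)))
  }
}, interval = 0.005)
# Printing the object opens the interactive visualization
fast_profile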
#Here are some tips for optimizing your use of profvis:
#1. You may find it convenient to use the “Show in new window” button on the RStudio pane with profiling results to
# expand this window while you are interpreting results.
#2. An “Options” button near the top right gives different options for how to display the profiling results,
# including whether to include memory profiling results and whether to include lines of code with zero time.
#3. You can click-and-drag results in the bottom visualization panel, as well as zoom in and out.
#4. You may need to update your version of RStudio to be able to use the full functionality of profvis. You can
# download a Preview version of RStudio from the RStudio website.
#5. If you’d like to share code profiling results from profvis publicly, you can do that by using the “Publish”
# button on the top right of the rendered profile visualization to publish the visualization to RPubs. The “FAQ”
# section of RStudio’s profvis documentation includes more tips for sharing a code profile visualization online.
#6. If you get a lot of blocks labeled “<Anonymous>”, try updating your version of R. In newer versions of R,
# functions called using package::function() syntax or list$function() syntax are labeled in profiling blocks in
# a more meaningful way. This is likely to be a particular concern if you are profiling code in a package you
# are developing, as you will often be using package::function() syntax extensively to pass CRAN checks.