-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathwork.Rmd
816 lines (650 loc) · 49.5 KB
/
work.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
---
title: "Investigating the Socioeconomic Dynamics of Crime: Exploring the Relationship Between Income and Crime Rates in Canada"
author: "Yinuo Zhao"
date: "Mar 2, 2024"
output:
html_document:
html_preview: false
link-citations: yes
---
```{r setup, message=FALSE, warning=FALSE, include=FALSE}
#install.packages(c("data.table","leaflet"))
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(data.table)
library(tidyverse)
library(kableExtra)
library(dplyr)
library(splines)
library(mgcv)
library(gridExtra)
library(grid)
library(cowplot)
library(broom)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(xgboost)
library(caret)
```
## Introduction
*"Provide background on your data sets and a clear formulated question or hypothesis."*
For this project, my question of Interest is: **"How does crime rate relate to income in Canada?"**
In order to answer this question, data on both crime and socioeconomic status are needed. However, I found no existing data set that contains all desired information, therefore this needs to be achieved through merging more than one data sets. After choosing carefully, the following two separate data sets are obtained:
1. **"Income of individuals by age group, sex and income source, Canada, provinces and selected census metropolitan areas"**. Released 2023-05-02. This data set is annually updated and maintained by Statistics Canada (Table 11-10-0239-01). Data is collected through the Survey of Labor and Income Dynamics, Survey of Consumer Finances, and Canadian Income Survey.
2. **"Incident-based crime statistics, by detailed violations, Canada, provinces, territories, Census Metropolitan Areas and Canadian Forces Military Police"**. Released 2023-07-27. This data set is also annually updated and maintained by Statistics Canada (Table 35-10-0177-01, formerly CANSIM 252-0051). Data is collected through the Uniform Crime Reporting Survey.
Understanding the relationship between crime rates and income in Canada is crucial for policymakers, law enforcement agencies, and social welfare programs. Exploring this correlation can shed light on the socioeconomic factors driving criminal behavior and help formulate targeted interventions to alleviate poverty and reduce crime. Additionally, elucidating this connection can inform broader discussions on social inequality, justice, and community well-being in Canadian society.
## Methods
*"Include how and where the data were acquired, how you cleaned and wrangled the data, what tools you used for data exploration."*
Both data sets are downloaded directly from **Statistics Canada**, which is usually considered to be an reliable source. Because they share the same source, the data sets follows similar structure and all contains the two columns `GEO` and `REF_DATE` where the former one refers to the geographical region and the second one refers to the year of data. Thus, it's possible to combine the two data sets to obtain all information needed.
However, it is worth mentioning that both data sets are huge and contains **unrelated information**. Therefore, cleaning and wrangling are needed for more convenient analysis and more efficient computing & uploading, as the original data sets are oversize thus cannot be pushed to github repository.
Reference:
1. The census data set: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1110023901
2. The crime data set: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3510017701
```{r checking-file}
fn1 <- "https://raw.githubusercontent.com/inorrr/JSC370_project/main/census.csv"
fn2 <- "https://raw.githubusercontent.com/inorrr/JSC370_project/main/crime.csv"
if (!file.exists("data/census.csv"))
download.file(fn1, destfile = "data/census.csv")
census_df <- data.table::fread("data/census.csv")
if (!file.exists("data/crime.csv"))
download.file(fn2, destfile = "data/crime.csv")
crime_df <- data.table::fread("data/crime.csv")
```
### Data Wrangling and Cleaning
1. First I dropped the unrelated columns in both columns, keeping only information revelant to the question of interest.
```{r eval = FALSE}
crime_df <- crime_df[, c("REF_DATE", "GEO", "Violations", "Statistics", "VALUE", "UOM")]
census_df <- census_df[, c("REF_DATE", "GEO", "Age group", "Sex", "Income source", "Statistics", "VALUE", "UOM", "SCALAR_FACTOR")]
```
2. I noticed that in the GEO column:
+ both data sets uses a **mix of province & city names**. I removed all observations with city names and keep only those province data so that the data set is smaller(since the province data contains cities already) and avoids redundancy.
+ Notice that the census data only have data from the 10 provinces, not the **3 territories**, I filtered out those data from the crime data set as there are no matching data.
+ Notice that there is a **square bracket** with a number behind each province name in the crime data, therefore I removed it with regular expression so that we can join the two data sets on province later.
```{r eval = FALSE}
table(census_df$GEO)
provinces1 <- c("Alberta [48]", "British Columbia [59]", "Manitoba [46]", "New Brunswick [13]",
"Newfoundland and Labrador [10]", "Saskatchewan [47]",
"Nova Scotia [12]", "Ontario [35]",
"Prince Edward Island [11]", "Quebec [24]")
provinces2 <- c("Alberta", "British Columbia", "Manitoba", "New Brunswick",
"Newfoundland and Labrador", "Saskatchewan","Nova Scotia",
"Ontario", "Prince Edward Island", "Quebec")
crime_df <- crime_df[crime_df$GEO %in% provinces1, ]
census_df <- census_df[census_df$GEO %in% provinces2, ]
crime_df$GEO <- gsub("\\s*\\[\\d+\\]$", "", crime_df$GEO)
table(crime_df$GEO)
```
2. In the Statistics column of the crime data, there are many measures related to crime, but I'm only interested in the number of incidents (`Actual incidents`) and crime rate (`Rate per 100,000 population`), thus statistics related to charges are removed. The Crime Severity Index(`Percentage contribution to the Crime Severity Index (CSI)`) seems to be interesting and thus is kept.
```{r eval = FALSE}
crime_df <- crime_df %>% filter(Statistics == "Actual incidents" |
Statistics == "Rate per 100,000 population" |
Statistics == "Percentage contribution to the Crime Severity Index (CSI)")
```
3. In the crime data frame,
+ there are **314 different types of crime**, which does much more detailed categorization than I'm interested in. Therefore, to avoid having too much computation to merge data later on, I choose too keep only the bigger categories (i.e. total robbery, total assaults, etc).
+ There are also square brackets at the end so I removed them.
```{r eval = FALSE}
print(length(unique(crime_df$Violations)))
table(crime_df$Violations)
# Identify rows that start with "Total"
total_rows <- grepl("^Total", crime_df$Violations)
# Subset the dataframe to keep only the rows starting with "Total"
crime_df <- crime_df[total_rows, , drop = FALSE]
# Remove square brackets and numbers at the end
crime_df$Violations <- gsub("\\s*\\[\\d+\\]$", "", crime_df$Violations)
```
3. In the census data frame:
+ the column `Age group` specifies the age however since we do not have this information in the crime data frame, we need to combine all age groups. This can be done by taking the average of the categories.
+ Same for `Sex`, same method is used.
```{r eval = FALSE}
table(census_df$"Age group")
table(census_df$"Sex")
# first we merge the age group categories
census_df <- census_df %>%
group_by(REF_DATE, GEO, Sex, `Income source`, Statistics, UOM, SCALAR_FACTOR) %>%
summarise(VALUE = mean(VALUE, na.rm = TRUE))
# next we merge the age group categories
census_df <- census_df_new %>%
group_by(REF_DATE, GEO, `Income source`, Statistics, UOM, SCALAR_FACTOR) %>%
summarise(VALUE = mean(VALUE, na.rm = TRUE))
```
4. I keep only the data between 1998 and 2021, as that's the year range where the two data sets overlap.
```{r eval = FALSE}
crime_df <- crime_df %>% filter(REF_DATE >= 1998 & REF_DATE <= 2021)
census_df <- census_df %>% filter(REF_DATE >= 1998 & REF_DATE <= 2021)
```
5. Write the cleaned data frame to CSV files.
```{r eval = FALSE}
write.csv(crime_df, "/Users/yinuozhao/Desktop/UofT/JSC370/JSC370-2024-main/JSC370_project/crime.csv")
write.csv(census_df, "/Users/yinuozhao/Desktop/UofT/JSC370/JSC370-2024-main/JSC370_project/census.csv")
```
At this point both the crime data frame and census data frame has `REF_DATE` and `GEO` in common, and they each have another **categorical variable**, which is `Income source` for census data and `Violation`(it means crime type) for crime data. While it may seem to make sense to join the two data sets using REF_DATE and GEO directly, the results would involves the data for all combinations of Income source and Violation for each REF_DATE and GEO. This will be a huge data set and thus slow down the computation. Therefore, I choose to **keep the data sets separate** and **join them when necessary** (i.e. after picking out certain categories of interest).
### Exploratory Data Analysis
Notice that right now both data sets are in long format, I converted them to wide for convenience.
```{r }
crime_df <- pivot_wider(crime_df, id_cols = c(REF_DATE, GEO, Violations),
names_from = Statistics, values_from = VALUE)
crime_df <- na.omit(crime_df)
census_df <- pivot_wider(census_df, id_cols = c(REF_DATE, GEO, `Income source`),
names_from = Statistics, values_from = VALUE)
census_df <- na.omit(census_df)
```
*Check the dimensions and headers and footers of the data*
```{r eval=FALSE}
dim(census_df)
dim(crime_df)
head(crime_df)
tail(crime_df)
head(census_df)
tail(census_df)
```
The census data set has **8 variables** with **3613 observations**, the crime data set has **6 variables** with **9271 observations**. By looking at the headers and footers of both data sets, they seems to be imported correctly and contains no missing values (in the displayed rows).
*Check the variable types in the data*
```{r eval = FALSE}
str(census_df)
str(crime_df)
summary(census_df)
summary(crime_df)
```
In both data frames, we see that the variable types are a mix of integer, numeric and characters. All variable types correctly align with the context of the variables. No major problems arises with the data at this stage (i.e. a variable with all missing values.)
*Take a closer look at some/all of the variables*
For both data frame, we need `REF_DATE` and `GEO` to correctly identify a province in Canada with a valid year. For census data frame, we need to look at the values of the different types of income (median, aggregate, etc). For the crime data frame, we need to look at the recorded crime rate and actual number of incidents to be within the reasonable range.
```{r eval = FALSE}
table(census_df$REF_DATE)
table(census_df$GEO)
table(crime_df$REF_DATE)
table(crime_df$GEO)
summary(census_df$`Aggregate income`)
summary(census_df$`Average income (excluding zeros)`)
summary(census_df$`Median income (excluding zeros)`)
summary(crime_df$`Actual incidents`)
summary(crime_df$`Rate per 100,000 population`)
```
Both data sets contains data from 1998 to 2021, on the 10 provinces in Canada as desired because I cleaned the data sets this way. Other variables being checked are within the reasonable range. The aggregate income, average income and median income are all measured in 2021 constant dollars, aggregate income record numbers in millions. The crime rates are measured as number of incidents per 100,000 population.
*Validate with an external source*
Notice that the minimum average income is **677.8**, which seems to be much lower than then mean average income, even 10 times lower than the 1st quantile. Since it seems quite suspicious, we need to validate it.
```{r eval=FALSE}
census_df[which.min(census_df$`Average income (excluding zeros)`), ]
```
This data is from **Prince Edward Island** in 2004, and the income source is **"other government transfers"**. Upon research, Government transfers refers to assistance from provincial and municipal programs, Workers’ Compensation benefits, the GST/HST Credit and provincial refundable tax credits such as the Quebec and Newfoundland and Labrador sales tax credits. However, since many of the above mentioned are made to their own category and excluded from "other government transfers" in the data set, it make sense that the value is low.
## Preliminary Results
*"Provide summary statistics in tabular from and publication-quality figures, take a look at the kable function from knitr to write nice tables in Rmarkdown."*
### Looking at Crime Data
Firstly, I examined the trend in crime rate across provinces and crime types.
```{r, fig.align='center', fig.width=10, fig.height=5}
filtered_data = crime_df %>% filter(Violations=="Total, all violations")
unique_x_values <- unique(crime_df$REF_DATE)
ggplot(filtered_data, aes(x = REF_DATE, y = `Rate per 100,000 population`, color = GEO)) +
geom_line() +
labs(x = "Year", y = "Rate per 100,000 population", title = "Rates of Total Crime by Province") +
scale_x_continuous(breaks = unique_x_values) +
scale_color_discrete(name = "Provinces") +
theme_linedraw() +
theme(legend.position = "right") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 7)) +
theme(legend.title = element_text(face = "bold"))
```
The plot depicts the **total crime rate**, encompassing all types of crimes, from 1998 to 2021, with each province represented by a distinct color. The crime rate is measured per 100,000 population. Notably, Saskatchewan consistently exhibits a significantly higher crime rate compared to other provinces throughout the period of 1998 to 2021. Conversely, Quebec and Ontario consistently demonstrate the lowest crime rates. Across all provinces, there is a discernible **decreasing trend** in crime rates over the years, with many provinces experiencing peak crime rates in 2003-2004.
In addition to analyzing total crime, I delved into specific crime categories often associated with poverty: **break and enter**, **robbery**, and **prostitution**. The three accompanying plots illustrate their respective rates over the years. Overall, there is a decreasing trend in the rates of all three crimes, with occasional exceptions such as robbery rates in Manitoba. Notably, British Columbia stands out with a significantly high rate of prostitution in 2004, doubling the number reported in Saskatchewan, which held the second-highest rate that year.
*Note that the 3 plots below shares the same legend with the above plots. Therefore the legend is omitted for better display purpose. Codes are also not shown as they reuses the above code chunk. Crime rate is measured per 100,000 population.*
```{r, echo = FALSE, fig.align='center', fig.width=10, fig.height=5}
filtered_data = crime_df %>% filter(Violations=="Total breaking and entering")
plot1 <- ggplot(filtered_data, aes(x = REF_DATE, y = `Rate per 100,000 population`, color = GEO)) +
geom_line() +
labs(x = "Year", y = "Rate per 100,000 population", title = "Rates of Break and Enter by Province") +
scale_x_continuous(breaks = unique_x_values) +
scale_color_discrete(name = "Provinces") +
theme_linedraw() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 7)) +
theme(legend.title = element_text(face = "bold"))
filtered_data = crime_df %>% filter(Violations=="Total robbery")
plot2 <- ggplot(filtered_data, aes(x = REF_DATE, y = `Rate per 100,000 population`, color = GEO)) +
geom_line() +
labs(x = "Year", y = "Rate per 100,000 population", title = "Rates of Robbery by Province") +
scale_x_continuous(breaks = unique_x_values) +
scale_color_discrete(name = "Provinces") +
theme_linedraw() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 7)) +
theme(legend.title = element_text(face = "bold"))
filtered_data = crime_df %>% filter(Violations=="Total prostitution")
plot3 <- ggplot(filtered_data, aes(x = REF_DATE, y = `Rate per 100,000 population`, color = GEO)) +
geom_line() +
labs(x = "Year", y = "Rate per 100,000 population", title = "Rates of Prostitution by Province") +
scale_x_continuous(breaks = unique_x_values) +
scale_color_discrete(name = "Provinces") +
theme_linedraw() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 7)) +
theme(legend.title = element_text(face = "bold"))
```
```{r, echo = FALSE, fig.align='center', fig.width=15, fig.height=5}
grid.arrange(plot1, plot2, plot3, nrow = 1)
```
Presented below is a table summarizing the average crime rate across all crime types (excluding `Total, all violations`) categorized by province and year. The cells are color-coded by value, with lighter shades indicating higher values and darker shades representing smaller values. Upon scrolling through the table, it becomes evident that across all provinces, there is a discernible **decreasing trend in the average crime rate**, as evidenced by the darkening shades in each column. This observation aligns with the findings and inferences drawn from the preceding plots.
```{r, echo = FALSE, fig.align='center'}
summary_stats <- crime_df %>% filter(Violations != "Total, all violations") %>%
group_by(REF_DATE, GEO) %>%
summarize(`Average Rate` = mean(`Rate per 100,000 population`))
summary_stats <- summary_stats %>% pivot_wider(names_from = GEO, values_from = 'Average Rate')
summary_stats <- summary_stats %>% rename(Year = REF_DATE)
summary_table <- summary_stats %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = TRUE) %>%
add_header_above(c("Table of Average Crime Rate Across All Types" = 11)) %>%
scroll_box(width = "100%", height = "500px") %>%
kable_styling(fixed_thead = T) %>%
column_spec(2, color = "white", background = spec_color(summary_stats$Alberta)) %>%
column_spec(3, color = "white", background = spec_color(summary_stats$`British Columbia`)) %>%
column_spec(4, color = "white", background = spec_color(summary_stats$Manitoba)) %>%
column_spec(5, color = "white", background = spec_color(summary_stats$`New Brunswick`)) %>%
column_spec(6, color = "white", background = spec_color(summary_stats$`Newfoundland and Labrador`)) %>%
column_spec(7, color = "white", background = spec_color(summary_stats$`Nova Scotia`)) %>%
column_spec(8, color = "white", background = spec_color(summary_stats$Ontario)) %>%
column_spec(9, color = "white", background = spec_color(summary_stats$`Prince Edward Island`)) %>%
column_spec(10, color = "white", background = spec_color(summary_stats$Quebec)) %>%
column_spec(11, color = "white", background = spec_color(summary_stats$Saskatchewan))
summary_table
```
### Looking at Income Data
Given the general decrease in crime rates, I am interested in exploring the **trend of income** to ascertain the potential existence of an association.
The following plot illustrates the average total income of each province over time. It is evident that, on the whole, the average total income for all provinces exhibits a steady **upward trend**. Notably, since 2003, Alberta has surpassed Ontario to become the province with the highest average total income. Additionally, it is noteworthy to observe a slight decrease in income across all provinces around 2019, likely attributed to the impact of the COVID-19 pandemic.
```{r, echo=FALSE, fig.align='center', fig.width=10, fig.height=5}
filtered_data = census_df %>% filter(`Income source`=="Total income")
unique_x_values <- unique(census_df$REF_DATE)
ggplot(filtered_data, aes(x = REF_DATE, y = `Average income (excluding zeros)`, color = GEO)) +
geom_line() +
labs(x = "Year", y = "Total income", title = "Average Total Income (excluding zeros) by Province") +
scale_x_continuous(breaks = unique_x_values) +
scale_color_discrete(name = "Provinces") +
theme_linedraw() +
theme(legend.position = "right") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 7)) +
theme(legend.title = element_text(face = "bold"))
```
The box plot below also shows the average total income for different years, but this time combining data from all provinces. The pink dots represent the actual values for each year in each province. It's clear from the boxes that the average income is **increasing** year by year. Notably, between 2010 and 2016, there are some outliers with very high incomes, which corresponds to Alberta when compared with the previous plot. However, it is worth mentioning that the values for income is not properly adjusted for inflation, therefore the actual growth is not as strong as we see in the current data.
```{r, echo=FALSE, fig.align='center', fig.width=10, fig.height=5}
filtered_data = census_df %>% filter(`Income source`=="Total income")
ggplot(filtered_data, mapping = aes(x = as.factor(REF_DATE), y = `Average income (excluding zeros)`)) +
geom_boxplot() +
geom_jitter(alpha = 0.8, color = "pink") +
labs(x = "Year", y = "Total income", title = "Average Total Income (excluding zeros) by Province") +
theme_linedraw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 7))
```
After examining total income, I delved into specific income sources: employment income, investment income, and market income, which are major income categories. The plots below illustrate that all three types of income are increasing. However, employment and market income show a more steady growth pattern, while investment income fluctuates dramatically from year to year. *(Same as before, the same legend is omitted for display purpose since I'm only interested in the overall trend, not how provinces compare to each other.)*
```{r, echo = FALSE, fig.align='center', fig.width=15, fig.height=5}
filtered_data = census_df %>% filter(`Income source`=="Employment income")
plot1 <- ggplot(filtered_data, aes(x = REF_DATE, y = `Average income (excluding zeros)`, color = GEO)) +
geom_line() +
labs(x = "Year", y = "Average Income (excluding zeros)", title = "Average Employment Income by Province") +
scale_x_continuous(breaks = unique_x_values) +
scale_color_discrete(name = "Provinces") +
theme_linedraw() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 7)) +
theme(legend.title = element_text(face = "bold"))
filtered_data = census_df %>% filter(`Income source`=="Investment income")
plot2 <- ggplot(filtered_data, aes(x = REF_DATE, y = `Average income (excluding zeros)`, color = GEO)) +
geom_line() +
labs(x = "Year", y = "Average Income (excluding zeros)", title = "Average Investment Income by Province") +
scale_x_continuous(breaks = unique_x_values) +
scale_color_discrete(name = "Provinces") +
theme_linedraw() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 7)) +
theme(legend.title = element_text(face = "bold"))
filtered_data = census_df %>% filter(`Income source`=="Market income")
plot3 <- ggplot(filtered_data, aes(x = REF_DATE, y = `Average income (excluding zeros)`, color = GEO)) +
geom_line() +
labs(x = "Year", y = "Average Income (excluding zeros)", title = "Average Market Income by Province") +
scale_x_continuous(breaks = unique_x_values) +
scale_color_discrete(name = "Provinces") +
theme_linedraw() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 7)) +
theme(legend.title = element_text(face = "bold"))
grid.arrange(plot1, plot2, plot3, nrow = 1)
```
The table below provides a summary of the average total income of all provinces from 1998 to 2021. The color scale used is consistent with that of the crime data frame, where lighter shades denote higher values. Over time, there is a discernible increase in income, as evidenced by the progressive lightening of colors.
```{r, echo = FALSE, fig.align='center'}
summary_stats <- census_df %>% filter(`Income source` == "Total income") %>%
group_by(REF_DATE, GEO) %>%
summarize(`Total Income` = `Average income (excluding zeros)`)
summary_stats <- summary_stats %>% pivot_wider(names_from = GEO, values_from = `Total Income`)
summary_stats <- summary_stats %>% rename(Year = REF_DATE)
summary_table <- summary_stats %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = TRUE) %>%
add_header_above(c("Table of Average Total Income" = 11)) %>%
scroll_box(width = "100%", height = "500px") %>%
kable_styling(fixed_thead = T) %>%
column_spec(2, color = "white", background = spec_color(summary_stats$Alberta)) %>%
column_spec(3, color = "white", background = spec_color(summary_stats$`British Columbia`)) %>%
column_spec(4, color = "white", background = spec_color(summary_stats$Manitoba)) %>%
column_spec(5, color = "white", background = spec_color(summary_stats$`New Brunswick`)) %>%
column_spec(6, color = "white", background = spec_color(summary_stats$`Newfoundland and Labrador`)) %>%
column_spec(7, color = "white", background = spec_color(summary_stats$`Nova Scotia`)) %>%
column_spec(8, color = "white", background = spec_color(summary_stats$Ontario)) %>%
column_spec(9, color = "white", background = spec_color(summary_stats$`Prince Edward Island`)) %>%
column_spec(10, color = "white", background = spec_color(summary_stats$Quebec)) %>%
column_spec(11, color = "white", background = spec_color(summary_stats$Saskatchewan))
summary_table
```
### Examining Crime and Income together
After separately exploring the two data sets, there appears to be a potential relationship between crime rate and income. However, further experimentation using both data sets together is necessary to confirm and understand this relationship more comprehensively.
```{r, echo = TRUE, fig.align='center', fig.width=10, fig.height=5}
filtered_census <- census_df %>% filter(`Income source` == "Total income")
filtered_crime <- crime_df %>%filter(Violations=="Total, all violations")
joint_df <- inner_join(filtered_crime, filtered_census, by = c("REF_DATE", "GEO"))
ggplot(data = joint_df, aes(x = `Average income (excluding zeros)`, y = `Rate per 100,000 population`, color = GEO, size = 3)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, size = 1.0) +
guides(size = FALSE) + # remove "size" from legend
labs(x = "Average Total Income (excluding zeros)", y = "Total Crime Rate per 100,000 population`", title = "Average Total Income and Total Crime Rate") +
theme_linedraw() +
scale_color_discrete(name = "Provinces") +
theme(axis.text = element_text(size = 7)) +
theme(plot.title = element_text(face = "bold")) +
theme(legend.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))
```
The scatter plot presented demonstrates the relationship between average total income and total crime rate (ignoring crime type). Although the overall distribution of points may not reveal a strong relationship, upon coloring the points by province, a distinct pattern emerges. It becomes evident that average total income and total crime rate are **negatively correlated** across all provinces, with possibly the exception of Newfoundland and Labrador where the relationship appears to be less pronounced, indicated by a line that is nearly horizontal. To draw a confident conclusion, it is imperative to examine the actual correlation value between these variables.
```{r, echo = FALSE}
summary_stats <- joint_df %>%
group_by(GEO) %>%
summarise(Correlation = cor(`Average income (excluding zeros)`, `Rate per 100,000 population`)) %>%
arrange(Correlation)
summary_stats <- summary_stats %>% rename(Province = GEO)
summary_table <- summary_stats %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
add_header_above(c("Correlation between Average Total Income and Total Crime Rate" = 2)) %>%
kable_styling(fixed_thead = T)
summary_table
```
From here, we can see that all correlation values are less than zero, which includes Newfoundland and Labrador which has a very week but negative correlation. Quebec exhibits the strongest correlation between average total income and total crime rate as the correlation is close to -1.
```{r, echo = FALSE, fig.align='center', fig.width=12, fig.height=6}
filtered_census <- census_df %>% filter(`Income source` == "Employment income")
filtered_crime <- crime_df %>%filter(Violations=="Total robbery")
joint_df <- inner_join(filtered_crime, filtered_census, by = c("REF_DATE", "GEO"))
plot1 <- ggplot(data = joint_df, aes(x = `Average income (excluding zeros)`, y = `Rate per 100,000 population`, color = GEO, size = 3)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, size = 1.0) +
guides(size = FALSE) + # remove "size" from legend
labs(x = "Average Employment Income (excluding zeros)",
y = "Robbery Crime Rate per 100,000 population`",
title = "Avg Employment Income vs Robbery Crime Rate") +
theme_linedraw() +
scale_color_discrete(name = "Provinces") +
theme(axis.text = element_text(size = 7)) +
theme(plot.title = element_text(face = "bold")) +
theme(legend.position = "none") +
theme(legend.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))
filtered_census <- census_df %>% filter(`Income source` == "Market income")
filtered_crime <- crime_df %>%filter(Violations=="Total property crime violations")
joint_df <- inner_join(filtered_crime, filtered_census, by = c("REF_DATE", "GEO"))
plot2 <- ggplot(data = joint_df, aes(x = `Average income (excluding zeros)`, y = `Rate per 100,000 population`, color = GEO, size = 3)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, size = 1.0) +
guides(size = FALSE) + # remove "size" from legend
labs(x = "Average Market Income (excluding zeros)",
y = "Property Crime Rate per 100,000 population`",
title = "Avg Market Income vs Property Crime Rate") +
theme_linedraw() +
scale_color_discrete(name = "Provinces") +
theme(axis.text = element_text(size = 7)) +
theme(plot.title = element_text(face = "bold")) +
theme(legend.position = "none") +
theme(legend.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))
filtered_census <- census_df %>% filter(`Income source` == "Government transfers")
filtered_crime <- crime_df %>%filter(Violations=="Total prostitution")
joint_df <- inner_join(filtered_crime, filtered_census, by = c("REF_DATE", "GEO"))
plot3 <- ggplot(data = joint_df, aes(x = `Average income (excluding zeros)`, y = `Rate per 100,000 population`, color = GEO, size = 3)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE, size = 1.0) +
guides(size = FALSE) + # remove "size" from legend
labs(x = "Average Employment Income (excluding zeros)",
y = "Prostitution Crime Rate per 100,000 population`",
title = "Avg Employment Income vs Prostitution Crime Rate") +
theme_linedraw() +
scale_color_discrete(name = "Provinces") +
theme(axis.text = element_text(size = 7)) +
theme(plot.title = element_text(face = "bold")) +
theme(legend.position = "none") +
theme(legend.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))
```
```{r, echo = FALSE, fig.align='center', fig.width=15, fig.height=5}
grid.arrange(plot1, plot2, plot3, nrow = 1)
```
The three smaller plots investigate the relationship between a specific income source and a particular type of violation, as indicated by the legend. In the first plot, which examines the relationship between **robbery crime rate and employment income**, all provinces display a negative trend except for Newfoundland and Labrador. Manitoba demonstrates a weak relationship, as evidenced by the considerable dispersion of points around the line.
Moving to the second plot, which contrasts **market income with property crime rate**, a strong and negative relationship is apparent. The third plot, depicting **employment income versus prostitution crime rate**, reveals varying degrees of association across provinces, with many showing a weak relationship. Notably, Ontario exhibits a positive relationship between employment income and prostitution crime rate.
### Pick a Province to Look Closer
Since **Ontario** demonstrates a different relationship to other provinces in the previous plot, I choose it to take a closer look. The bar plot below shows the composition of crime incidents each year in Ontario. We see that the major categories are **property crime** and **weapon violations**.
```{r, echo = FALSE, fig.align='center', fig.width=15, fig.height=8}
filtered_census <- census_df %>% filter(GEO == "Ontario")
filtered_crime <- crime_df %>%filter(GEO == "Ontario")
joint_df <- inner_join(filtered_crime, filtered_census, by = c("REF_DATE", "GEO"))
plot_data <- filtered_crime %>% filter(Violations != "Total, all violations" & Violations != "Total, all Criminal Code violations (excluding traffic)" & Violations != "Total, all Criminal Code violations (including traffic)")
ggplot(data = plot_data, aes(x = REF_DATE, fill = Violations, y = `Actual incidents`)) +
geom_bar(stat = "identity", position = "stack", color = "black", width = 0.5) +
labs(title = "Incidents of Crime Cases by Type in Ontario", x = "Year", y = "Number of Incidents") +
theme_linedraw() +
theme(axis.text = element_text(size = 7)) +
theme(legend.position = "none") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))
```
```{r, echo = FALSE, fig.align='right', fig.width=15, fig.height=4}
ggplot(data = plot_data, aes(x = REF_DATE, fill = Violations, y = `Actual incidents`)) +
geom_bar(stat = "identity", position = "stack", color = "white", width = 0.01) +
theme_void() +
theme(legend.position = "left")
```
```{r, echo = FALSE, fig.align='center'}
summary_stats <- joint_df %>% filter(`Income source` != "COVID-19 benefits") %>%
group_by(`Income source`, `Violations`) %>%
summarise(Correlation = cor(`Aggregate income`, `Actual incidents`))
max_row <- summary_stats[which.max(abs(summary_stats$Correlation)), ]
second_max_row <- summary_stats[order(abs(summary_stats$Correlation), decreasing = TRUE)[-1][1], ]
```
```{r,echo = FALSE, fig.width=20, fig.height=7}
ggplot(data = summary_stats, aes(x = `Violations`, y = `Income source`, fill = abs(Correlation))) +
geom_tile() +
scale_fill_gradient2(midpoint = 0.5, low = "blue", mid = "white", high = "red") +
labs(title = "Correlation between Income Source and Violations")+ # Add title here
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + # Wrap x-axis labels
scale_y_discrete(labels = function(y) str_wrap(y, width = 20)) + # Wrap y-axis labels
theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 5), # Adjust size of x-axis text
axis.text.y = element_text(size = 10)) # Adjust size of y-axis text
```
```{r, echo = FALSE, fig.align='center'}
summary_stats <- summary_stats %>% pivot_wider(names_from = `Income source`, values_from = Correlation)
summary_table <- summary_stats %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = TRUE) %>%
add_header_above(c("Table of Average Total Income" = 16)) %>%
scroll_box(width = "100%", height = "500px") %>%
scroll_box(width = "100%", height = "500px") %>%
kable_styling(fixed_thead = T)
summary_table
```
The table provides a summary of the **correlation** between **all types of violations and income sources in Ontario**. Notably, COVID-19 benefits are excluded from the analysis due to their availability only during the pandemic years, which limits the dataset. The strongest correlation is observed between **self-employment income and production under the Cannabis Act**. However, it's important to note that this relationship may not be entirely reliable due to the limited data available (3 observations), as illustrated in the plot below.
```{r, echo = FALSE, fig.align='center', fig.width=9, fig.height=5}
joint_df %>%
filter(`Income source` == "Self-employment income" & `Violations` == "Total production - Cannabis Act") %>%
ggplot(aes(x=`Average income (excluding zeros)`, y=`Actual incidents`)) +
geom_line() +
geom_point(alpha = 0.5) +
geom_text(aes(label = REF_DATE), vjust = 1.5) +
labs(title = "Self-employment income vs Total production - Cannabis Act", x = "Average Self-Employment income (excluding zeros)", y = "Number of Incidents of Production - Cannabis Act") +
theme_linedraw() +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))
```
The second strongest correlation, with a coefficient of 0.98, is observed between **total income and incidents of possession of other Controlled Drugs and Substances Act drugs**. With additional data available, this relationship appears more promising. Upon examination, it becomes evident that as income levels increase, the number of incidents of possession of these drugs also tends to rise. Despite the negative correlation between total income and total crime rate in Ontario, as found in previous analyses, a positive relationship exists between total income and incidents of possession of other Controlled Drugs and Substances Act drugs.
```{r, echo = FALSE, fig.align='center', fig.width=9, fig.height=5}
joint_df %>%
filter(`Income source` == "Total income" & `Violations` == "Total, possession, other Controlled Drugs and Substances Act drugs") %>%
ggplot(aes(x=`Average income (excluding zeros)`, y=`Actual incidents`)) +
geom_line() +
geom_point(alpha = 0.5) +
geom_text(aes(label = REF_DATE), vjust = 1.5) +
labs(title = "Total Income vs Total, possession, other Controlled Drugs and Substances Act drugs", x = "Average Total income (excluding zeros)", y = "Number of Incidents") +
theme_linedraw() +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))
```
### Group by Income Level
I created 4 levels for average total income using the quarantines, from negative infinity to the first quantile is **"Low"**, from first quantile to mean is **"Med_Low"**, from mean to 3rd quantile is **"Med_High"**, from 3rd quantile above is **"High"**.
```{r}
filter_census <- census_df %>% filter(`Income source` == "Total income")
# summary(filter_census$`Average income (excluding zeros)`)
breaks <- c(-Inf, 40460, 45077, 48859, Inf)
filter_census$income_level <- cut(filter_census$`Average income (excluding zeros)`,
breaks = breaks, labels = c("Low", "Med-Low", "Med-High", "High"))
```
From the box plot presented below, it appears that the total crime rate **does not exhibit a clear trend of decreasing with higher levels of total income**. This finding contradicts previous observations when examining the relationship between total crime rate and average total income **by province**. Therefore, it is conceivable that while a relationship exists, it may be influenced by other factors related to the demographics of each province. Consequently, when considering all observations collectively, the relationship becomes less apparent.
```{r, echo = FALSE, fig.align='center', fig.width=10, fig.height=5}
filter_crime <- crime_df %>%filter(Violations=="Total, all violations")
joint_df <- inner_join(filter_census, filter_crime, by = c("REF_DATE", "GEO"))
ggplot(joint_df, aes(x = as.factor(income_level), y = `Rate per 100,000 population`)) +
geom_boxplot() +
geom_jitter(alpha = 0.8, color = "tomato") +
labs(x = "Income Level", y = "Total Crime Rate", title = "Total Crime Rate by Income Level") +
theme_linedraw() +
theme(plot.title = element_text(face = "bold")) +
theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
theme(axis.text = element_text(size = 10))
```
### Fitting Statistical Models
```{r}
joint_df <- joint_df %>% rename(avg_income = `Average income (excluding zeros)`)
```
```{r}
joint_df <- joint_df %>% rename(crime_rate = `Rate per 100,000 population`)
```
Since we've previously observed in plots such that the **slope** of the relationship between average total income and total crime rate is different across provinces, we need to fit a model with **interaction terms** such that the slopes can be different.
```{r, echo = TRUE, fig.align='center', fig.width=10, fig.height=5}
library(broom)
lm_model <- lm(data = joint_df, `crime_rate` ~ `avg_income` * `GEO`)
tidy_coeffs <- tidy(lm_model)
table <- tidy_coeffs %>%
kable("html") %>%
kable_styling(bootstrap_options = "striped", full_width = TRUE)
table
```
```{r echo = FALSE}
summary(lm_model)
```
The linear regression model examines the relationship between **Crime Rate per 100,000 population** and **Average income (excluding zeros)** while considering the categorical variable `GEO` representing different **provinces**. The model reveals several significant findings: firstly, a **negative relationship** exists between average total income and total crime rate, suggesting that higher average income tends to be associated with lower crime rates. Secondly, various provinces exhibit differing baseline rates, with **British Columbia** notably displaying a significantly higher rate compared to the reference province. Additionally, interaction terms between income and provinces indicate varying effects across regions, such as a stronger negative association between income and crime rate in British Columbia. Overall, the model, with an adjusted **R-squared value of 0.9328**, indicates a robust fit, suggesting that both income and province significantly influence the rate per 100,000 population, with nuanced variations across different regions.
```{r echo = FALSE}
set.seed(123)
rf_model <- randomForest(crime_rate ~ GEO + avg_income,
data = joint_df,
na.action = na.omit)
varImpPlot(rf_model, main = "Variale Importance Plot for Random Forest")
```
```{r echo = FALSE}
rf_model
```
The model explains approximately 90.24% of the variance in the crime rate, which suggests that the predictor variables (REF_DATE, GEO, and avg_income) collectively have a strong relationship with the crime rate.
The mean of squared residuals is relatively high (859479.8), indicating that there is still some unexplained variability in the data that the model is not capturing well. This suggests that there may be other factors influencing the crime rate that are not included in the model.
Considering only one variable at each split might simplify the model and make it easier to interpret, but it might not capture complex interactions between variables.
```{r echo = FALSE}
gam_model <- gam(crime_rate ~ s(REF_DATE) + GEO + avg_income, data = joint_df)
summary(gam_model)
```
```{r echo = FALSE}
set.seed(123)
train_control = caret::trainControl(method = "cv", number = 10, search ="grid")
tune_grid<- expand.grid(max_depth = 3, #larger makes model more complex and potentially overfit
nrounds = 300, # number of trees
# default values below
eta = 0.1, # learning rate (equivalent to shrinkage)
gamma = 0, #Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
subsample = 1,
min_child_weight = 1,
colsample_bytree = 0.6 #specify the fraction of columns to be subsampled
)
xgb_model<-caret::train(
crime_rate ~ REF_DATE + GEO + avg_income,
data = joint_df,
method = 'xgbTree',
trControl = train_control,
tuneGrid = tune_grid,
na.action = na.omit,
verbosity=0
)
```
```{r echo = FALSE}
print(xgb_model)
```
## Summary
*"What you found so far from your data in terms of the formulated question."*
Recall that the question of interest is: **“How does crime rate relate to income in Canada?”**
1. **Looking at Crime Data:**
+ Saskatchewan consistently exhibits a significantly higher total crime rate compared to other provinces throughout the period of 1998 to 2021.
+ Quebec and Ontario consistently demonstrate the lowest total crime rates.
+ Across all provinces, there is a discernible decreasing trend in total crime rates over the years.
+ Many provinces experiencing peak total crime rates in 2003-2004.
+ There is a decreasing trend in the rates of break and enter, robbery, and prostitution in all provinces.
+ British Columbia stands out with a significantly high rate of prostitution in 2004, doubling the number reported in Saskatchewan, which held the second-highest rate that year.
+ Across all provinces, there is a discernible decreasing trend in the average crime rate of all types, from year to year.
2. **Looking at Income Data:**
+ On the whole, the average total income for all provinces exhibits a steady upward trend.
+ Since 2003, Alberta has surpassed Ontario to become the province with the highest average total income.
+ There is a slight decrease in average total income across all provinces around 2019, likely attributed to the impact of the COVID-19 pandemic.
+ Employment income, investment income, and market income of provinces are all increasing.
+ Employment and market income show a more steady growth pattern, while investment income fluctuates dramatically from year to year.
3. **Examining Crime and Income Together:**
+ Average total income and total crime rate are negatively correlated across all provinces.
+ Quebec exhibits the strongest correlation between average total income and total crime rate as the correlation is close to -1.
+ In the relationship between robbery crime rate and employment income, all provinces display a negative trend except for Newfoundland and Labrador. Manitoba demonstrates a weak relationship, as evidenced by the considerable dispersion of points around the line.
+ Property crime rate decreases with increasing market income in all provinces.
+ Employment income and prostitution crime rate have varying degrees of association across provinces, with many showing a weak relationship. Notably, Ontario exhibits a positive relationship between employment income and prostitution crime rate while it is negative for other provinces.
4. **Zoom on Province - Ontario:**
+ Major crimes in Ontario are property crime and weapon violations.
+ Strongest correlation is observed between self-employment income and production under the Cannabis Act. However, it’s important to note that this relationship may not be entirely reliable due to the limited data available.
+ The second strongest correlation, with a coefficient of 0.98, is observed between total income and incidents of possession of other Controlled Drugs and Substances Act drugs. As income levels increase, the number of incidents of possession of these drugs also tends to rise.
5. **Group by Income Level:**
+ Created 4 levels for average total income using the quarantines, from negative infinity to the first quantile is “Low”, from first quantile to mean is “Med_Low”, from mean to 3rd quantile is “Med_High”, from 3rd quantile above is “High”.
+ Total crime rate does not exhibit a clear trend of decreasing with higher levels of total income, which contradicts previous observations when examining the relationship between total crime rate and average total income by province. while a relationship exists, it may be influenced by other factors related to the demographics of each province. Consequently, when considering all observations collectively, the relationship becomes less apparent.
6. **Fitting Statistical Models:**
+ The linear regression model analyzed the relationship between Crime Rate per 100,000 population and Average income (excluding zeros) across different provinces represented by the categorical variable GEO.
+ Findings suggest a negative correlation between average income and total crime rate, implying that higher average income tends to coincide with lower crime rates.
+ Provinces exhibit varying baseline rates, with British Columbia notably showing a significantly higher rate compared to the reference province.
+ Interaction terms between income and provinces reveal differing effects across regions, such as a more pronounced negative association between income and crime rate in British Columbia.
+ The model demonstrates a robust fit with an adjusted R-squared value of 0.9328, indicating significant influences of both income and province on the crime rate per 100,000 population, with nuanced variations across different regions.