Global warming is one of the primary global problems we face today. The world is heating up at an unprecedented rate due to emissions produced by human activity, and we are quickly reaching a point of no return. Of all emissions produced globally, about 11% can be attributed to agriculture (1). While not as impactful as something like transportation, which makes up over 30% (1), cutting emissions in any sector can make a difference.
To understand how to cut emissions, we must understand where they come from. Agricultural greenhouse gas emissions come from several sources; the three primary ones are agricultural soil management, enteric fermentation, and manure management:
-
Agricultural soil management: “Nitrous oxide is produced naturally in soils through the microbial processes of nitrification and de-nitrification.” (2)
-
Enteric fermentation: “Methane is produced as part of the normal digestive processes in animals. During digestion, microbes in the animal’s digestive system ferment feed. This process, called enteric fermentation, produces methane as a by-product which can be emitted by the exhaling and belching of the animal.” (2)
-
Manure management: “Methane is produced by the anaerobic (without oxygen) decomposition of manure. When manure is handled as a solid or deposited naturally on grassland, it decomposes aerobically (with oxygen) and creates little methane emissions. However, manure stored as a liquid or slurry in lagoons, ponds, tanks or pits, decomposes anaerobically and creates methane emissions.” (2)
Can we predict a country’s agricultural emissions given the raw agricultural and livestock production of the country that year?
When doing background research, we noticed that the processes that result in agricultural emissions are pretty direct. Fermentation of Nitrogen sources in the agricultural soils leads to Nitrous oxide being produced. This can be reframed as: the more crop farming we do, the more Nitrous oxide release we get (as they are directly related). Enteric fermentation is also directly related to the process of livestocks’ digestive systems. So, the more livestock we have, the more Methane we have released into the environment as a result of enteric fermentation. The same idea applies to manure management. If these processes are so directly related to the release of emissions and global warming, we should be able to predict agricultural emissions from data of agricultural and livestock production.
We found the FAOStat database (3), which is curated by the Food and Agriculture Organization of the United Nations (FAO). The FAOStat database aggregates many smaller datasets on various specific metrics as related to worldwide agriculture. Of these subsets, we selected the datasets on crop production, live animals and agricultural emissions.
From the datasets we chose, we attain area harvested, yield, and production quantity of 129 types of crops and the amount of stock of 14 types of livestock for 225 countries over years 1961 - 2017. The agricultural emissions dataset also gives us access to the amount of CH4 and N2O emissions for these years and countries, each of which we convert to terms of gigagrams of CO2 then combine. We link the different datasets through year and country, with the goal of predicting the CO2 emissions using the crop and livestock data.
df_all_na <- read_csv("./data/processed/combined_total_na.csv")
We do acknowledge that this data contains several sources of uncertainty. To start, there is the possibility of measurement error - we don’t have a guarantee that results aren’t misreported or otherwise flawed. However, we do not have a concrete way of accounting for this type of error. Therefore, given that the FAO is a reputable organization, we assume that our values are relatively accurate.
An additional source of uncertainty comes from the fact that we are trying to approximate a function. As framed, the function we are modeling can take in any input of raw agricultural production and output the amount of agricultural emissions. This would include scenarios that don’t occur in the real world - such as a case where a country only produces goats, or where a country has no livestock at all. And, since these scenarios don’t occur in the real world, we are unable to account for them in training and therefore are unlikely to approximate well. This perhaps points at a larger pattern in modeling, where the predictions of the model generally only hold for inputs similar to the inputs the model was trained on. In our case, this means that we expect our model to only perform well on data relatively similar to out training data. We believe we could run tests where we construct the train and test set in such a way that certain factors are out of distribution to account for this. For example, placing all observations from the United States of America in our test set to get a sense of how countries not in our dataset might be handled. However, this was outside of the scope our project so we accept that we do not know how our model might perform on data not well represented in our dataset.
To start, we process this data so that each observation consists of the
raw agricultural output and CO2 emissions due to agriculture of a
country for a given year. NA values were replaced by 0s, as they tended
to indicate that a country didn’t produce any of the given produce that
year, and aggregate columns (like Goat and Sheep) were discarded in
favor of their individual components. We also identified a list of
Area
s that grouped multiple countries and removed them. This included
China, as it is broken down into 4 major sections that otherwise exist
in the dataset. Finally, we removed rows where the CO2 emissions were 0,
as this means there were no recorded emissions.
# List of areas that (usually) aren't actually countries
areas_bad = c("World", "Americas", "Asia", "Africa", "Australia and New Zealand", "China", "Central Asia", "Central America", "Eastern Africa", "Eastern Asia", "Eastern Europe", "Europe", "Land Locked Developing Countries", "Least Developed Countries", "Low Income Food Deficit Countries", "Middle Africa", "Net Food Importing Developing Countries", "Northern Europe", "Northern Africa", "Northern America", "Pacific Islands Trust Territory", "Southern Asia", "South America", "Southern Africa", "Southern Europe", "Small Island Developing States", "Serbia and Montenegro", "Western Asia", "Western Africa", "Western Europe", "Oceania")
df_processed <-
df_all_na %>%
rename(CO2 = `Agriculture total|CO2`) %>%
select_if(~ !is.logical(.)) %>%
replace(is.na(.), 0) %>%
select(-contains("and")) %>%
filter(CO2 != 0) %>%
filter(!(Area %in% areas_bad))
Next, we split the data into the train set and the hold-out test set.
set.seed(42)
emissions_split = initial_split(df_processed, prop = 0.8)
emissions_train = training(emissions_split)
emissions_test = testing(emissions_split)
Then, we define several pre-processing steps. First, we drop the Year
and Area
columns as they are not inputs we want the model to use. We
also remove columns with zero variance then standardize the variables to
have a mean of 0 and a standard deviation of 1. Both steps are important
for fitting a lasso regression model.
emissions_rec <-
recipe(`CO2` ~ ., data = emissions_train) %>%
step_rm(Year, Area) %>%
step_zv(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes())
emissions_prep <-
emissions_rec %>%
prep()
For the modeling portion, we decided to use lasso regression. As we had
a large amount of inputs, we wanted to leverage lasso regression’s
property of pushing coefficients to 0 in order to have a more tractable
end model. To start, we define our model with a naively chosen penalty
of 0.1
then fit on the train set to get a sense of performance.
We visualize the coefficients of the model with the largest magnitude to get a sense of what the model values in the prediction process. Seeing cattle as the greatest factor is encouraging - we know from other research that cattle greatly contribute to multiple types of emissions.
df_coeff %>%
mutate(
term = fct_reorder(term, estimate)
) %>%
arrange(desc(abs(estimate))) %>%
filter(term != "(Intercept)") %>%
head(20) %>%
ggplot(aes(
x = estimate,
y = term
)) +
geom_col() +
labs(
x = "Coefficient",
y = "Variable",
title = "Top 20 Regression Coefficients by Magnitude"
) +
theme(
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
plot.margin = unit(c(t = +0.5, b = +0.5, r = +0.5, l = +0.5), "cm"),
)
We then sweep across a range of L1 penalty parameters on our train set to determine the setting that minimizes root mean squared error on our data.
The chosen penalty is used to fit a model on our train set to create our
final model. We evaluate the final fit on our test set to see how the
model performs on data that it has not seen during training. Here, we
achieve a r-squared of about 0.998
. Given that this r-squared value is
close to 1, we believe that we have a strong model for predicting
agricultural emissions.
final_fit <- last_fit(
final_lasso,
emissions_split
)
final_fit %>%
collect_metrics() %>%
select(-.config, -.estimator) %>%
filter(.metric == "rsq") %>%
rename("Metric" = .metric, "Estimate" = .estimate)
## # A tibble: 1 x 2
## Metric Estimate
## <chr> <dbl>
## 1 rsq 0.998
However we can address a source of uncertainty in our process - the
random train / test split. To account for this, we apply 10-fold cross
validation on our train data which allows us to quantify how our model’s
performance might vary based on the initial split. We use the produced
matrix to make a 95% confidence interval of about 0.997 - 0.998
for
r-squared. Given that the high end of the r-squared interval is still
close to 1 and the overall range of the interval is relatively speaking
small, we believe that the train / test split does not significantly
affect the performance of our model.
set.seed(44)
folds <- vfold_cv(emissions_train, v = 10)
confidence_level = 0.95
results <-
fit_resamples(
final_lasso,
folds
) %>%
collect_metrics()
stat <- results %>% filter(.metric == "rsq")
lo <- pull(stat, "mean") - qnorm( 1 - (1 - confidence_level) / 2 ) * pull(stat, "std_err")
hi <- pull(stat, "mean") + qnorm( 1 - (1 - confidence_level) / 2 ) * pull(stat, "std_err")
cat("95% Confidence Interval for r^2\n", "Lower bound:", lo, "\n", "Upper bound:", hi)
## 95% Confidence Interval for r^2
## Lower bound: 0.9969372
## Upper bound: 0.9980363
As a final validation step, we produce two graphs. The first is a graph of the predicted CO2 versus the actual CO2 for all observations in our test set. Here, we note that the majority of points lie around the line y = x, which means that the models predictions are close to the actual value for many of the observations in the test set.
predictions <-
collect_predictions(
last_fit(
final_lasso,
emissions_split,
)
) %>%
pull(".pred")
emissions_test %>%
mutate("Predicted CO2" = predictions) %>%
ggplot(aes(CO2, `Predicted CO2`)) +
geom_point() +
geom_abline(aes(color = "y = x", intercept = 0, slope = 1)) +
labs(
x = "Actual CO2 Emissions (Gigagrams CO2)",
y = "Predicted CO2 Emissions (Gigagrams CO2)",
title = "Actual CO2 Emissions versus Predicted CO2 Emissions on Test Set",
colour = "Reference Line:"
) +
theme(
legend.position = "bottom",
legend.box = "horizontal",
)
The second is a graph of the actual CO2 for a set of countries over time next to the predicted CO2 for the same set of countries over time. We filter for countries with more than 10 data points in the test set and exclude the many countries with a small amount of emissions to make a clearer graph. This does potentially ignore issues that we might have in predicting low emission countries, but considering we have validated our results several ways we believe this is alright. Looking at the graph, it is clear that the general shape of the emissions over time for both predicted and actual CO2 are very similar.
emissions_test %>%
mutate("Predicted CO2" = predictions) %>%
filter(CO2 > 40000) %>%
group_by(Area) %>%
filter(n() > 10) %>%
ungroup() %>%
rename("Actual" = "CO2", "Predicted" = "Predicted CO2") %>%
pivot_longer(
cols = c("Actual", "Predicted"),
names_to = "metric",
values_to = "value"
) %>%
ggplot(aes(x = Year, group = Area, color = Area)) +
geom_line(aes(y = value)) +
facet_wrap(~ metric) +
scale_y_log10() +
labs(
x = "Year",
y = "Agricultural Emissions (Gigagrams CO2)",
title = "Actual CO2 Over Time versus Predicted CO2 Over Time (Y Axis Log Scale)"
) +
theme(
legend.position = "bottom",
legend.box = "horizontal",
)
To conclude, it seems that we can indeed predict agricultural emissions given the raw agricultural and livestock production of the country that year. Although we used country and year to link the different datasets, that was for the sole purpose of linking the datasets; our model works independent of country and year, with raw agricultural and livestock production as the only necessary inputs to accurately predict agricultural emissions. We do however note that this is holds for data similar to our dataset - we do not know how the model performs on out of distribution data.
We believe our model is a strong predictor as we have conducted multiple
validation metrics, all of which the model has performed well on. Our
first was an evaluation on our held-out test set, with a r-squared value
of 0.998
, close to 1. We also conducted 10-fold cross-validation on
our training set, creating out train / test split 10 ways, and taking a
confidence interval of our r-squared calculated from each sample. This
produced a 95% confidence interval for r-squared of 0.997 - 0.998
,
which again indicates a strong model while accounting for uncertainty.
Our final two metrics were more qualitative; we plotted our predictions
to see how they compared to the actual data. In our first graph, we
found that the predicted CO2 and actual CO2 tracked, following the line
of y = x
. For our second graph, we presented the same data from the
first graph but in a more visually intuitive manner. We produced two
graphs of predicted and actual CO2, plotting each value over time with
each country as a line. Since our model operated only on raw data of
agricultural and livestock production without the year or country
identified, the fact that we see similar trends in countries over time
further validates our model.
One interesting observation is that the model can make strong predictions without time as an input. This could suggest that the amount of emissions produced by each of our factors doesn’t change over time. This somewhat conflicts with our mental model of emissions over time - we expected that, due to the advancement of technology, the amount of emissions produced per factor would decrease over time. For example, we might have developed a new feed that lead to cows producing less emissions via enteric fermentation. The fact that we can predict well without year could then lead to the (somewhat depressing) conclusion that we have not gotten much better about producing agricultural emissions.