Skip to content
This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Ensure if() is always fed objects of length 1 #442

Open
maurolepore opened this issue Apr 2, 2021 · 9 comments
Open

Ensure if() is always fed objects of length 1 #442

maurolepore opened this issue Apr 2, 2021 · 9 comments

Comments

@maurolepore
Copy link
Contributor

maurolepore commented Apr 2, 2021

#436 is one example of a problem I've seen multiple times. Maybe we should review all calls to if() and ensure they are all fed objects of length 1?

This is a likely source of bugs and the fix is fairly straight forward. Nicely, it will force us to review the logic of our programs and decide when we should wrap in all(), any(), use identical(), extract the first element, or do something else.

We could continue to fix them as we bump into them but likely it's more efficient to fix them all in one go, allowing us to fix a bunch of bugs with relatively little code and time.

@cjyetman
Copy link
Member

cjyetman commented Apr 2, 2021

Yeah, these need to get cleaned up eventually, but I'd hesitate believing that there's one-size-fits-all solution for every instance of this. Let's approach each variation specifically. Maybe we can identify them here as we find them?

@maurolepore
Copy link
Contributor Author

maurolepore commented Apr 3, 2021

Okay. I think whether or not we review all if() statements or tackle them as we bump into them is a strategic decision about how we tackle technical debt and therefore a decision that @AlexAxthelm will likely prefer to make. I would then close this issue because it'll be sitting here for a while and we already know there is a lot of technical debt.

Before I close I paste a snippet of a global search for "if(" in .R files. Just here I see four instances of == inside an if() statement. Compared to identical(), using == inside an if() statement is suspicious because it can return objects of length > 1. Therefore, all such if() statements are good candidates for review.

@cjyetman
Copy link
Member

cjyetman commented Apr 3, 2021

Personally, I don't see an issue like this --with a reasonable goal, well thought options for how to fix them, and some preliminary investigation of where the fixes need to happen-- as tech debt, though I admittedly don't know nor really care so much what the precise definition of that is, so I could be wrong, but in any case, I would leave this issue open.

I'm also not opposed to fixing these all at once, only opposed to blindly applying the same "fix" to all of them... as you've already pointed out, in some cases all() or any() might be more appropriate than identical() or vector[[1]] == "something"... it might not be obvious which is most appropriate in each case and require deeper investigation.

@cjyetman cjyetman reopened this Apr 3, 2021
@AlexAxthelm
Copy link
Contributor

1: I think this qualifies as "tech debt" in the sense that the system (mostly) works now, but this could cause us problems in the future.

2: I agree that resolving this issue is going to be handled on a case-by-case basis. Maybe we can make a coordinated effort to tackle this, but If they haven't been causing (many) serious issues yet, I think that addressing these whenever we're touching those files anyway seems like a reasonable approach.

@cjyetman
Copy link
Member

cjyetman commented Apr 6, 2021

@AlexAxthelm does that mean that this issue thread being open is tech debt, or that the problem that some if statements are not well defined is tech debt?

@AlexAxthelm
Copy link
Contributor

I view the if statements as tech debt. It opens up an avenue for errors, but it seems like #436 is the only one that has popped up as an actual problem so far.

@cjyetman
Copy link
Member

I dug into this today a bit. Just want to note here that in some places, especially when there's a default value, I actually prefer the warning in the if statement because it reveals something that should be fixed upstream, rather than obscuring it by circumventing the problem and passing on garbage. For instance, in the code below, financial_timestamp conceptually should never be more than a single character element. If somehow we make a change that causes financial_timestamp to have more than one element under specific conditions, I would rather the below conditionals use the first element and throw a warning than simply pass garbage back and not leave any message.

https://github.com/2DegreesInvesting/PACTA_analysis/blob/e46ce20befa09278b2f3884340bf2795f97226c1/0_reporting_functions.R#L394-L404

example...

test <- function(financial_timestamp) {
  if (financial_timestamp == "2017Q4") {
    "31.12.2017"
  } else if (financial_timestamp == "2018Q4") {
    "31.12.2018"
  } else if (financial_timestamp == "2019Q4") {
    "31.12.2019"
  } else {
    financial_timestamp
  }
}

test2 <- function(financial_timestamp) {
  if (identical(financial_timestamp, "2017Q4")) {
    "31.12.2017"
  } else if (identical(financial_timestamp, "2018Q4")) {
    "31.12.2018"
  } else if (identical(financial_timestamp, "2019Q4")) {
    "31.12.2019"
  } else {
    financial_timestamp
  }
}

test(c("2018Q4", "2019Q4"))
#> [1] "31.12.2018"
#> Warning messages:
#> 1: In if (financial_timestamp == "2017Q4") { :
#>   the condition has length > 1 and only the first element will be used
#> 2: In if (financial_timestamp == "2018Q4") { :
#>   the condition has length > 1 and only the first element will be used

test2(c("2018Q4", "2019Q4"))
#> [1] "2018Q4" "2019Q4"

@AlexAxthelm
Copy link
Contributor

I think this is a good example of why it's important to document assumptions, even implicitly.

for example, I would say that either of those functions is fine, with the following changes:

# this code not tested, for illustrative purpose only.
validate_financial_timestamp <- function(financial_timestamp) {
  # financial timestamp should be a single point in time, in form of YYYYQN, 
  # where YYYY is a 4 digit year, and N is a number 1-4 (inclusive)
  stopifnot(length(financial_timestamp) == 1L) # could also use assert, if available
  stopifnot(grepl(x = financial_timestamp, pattern = "^[:digit]{4}Q[1-4]$"))
  return(invisible(financial_timestamp))
}

test <- function(financial_timestamp) {
  validate_financial_timestamp(financial_timestamp)
  if (financial_timestamp == "2017Q4") {
    "31.12.2017"
  } else if (financial_timestamp == "2018Q4") {
    "31.12.2018"
  } else if (financial_timestamp == "2019Q4") {
    "31.12.2019"
  } else {
    financial_timestamp
  }
}

@AlexAxthelm
Copy link
Contributor

basically, validation functions should be quick to run, and therefore, we're willing to use them more frequently (basically every time we ingest an object through a public function). The other major alternative to this would be to take a more OOP based approach, but that's a can of worms that I don't know if we want to go down.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants