Black Box Predictive Control Model Causal Inference Library #146

Open
mikepsinn opened this issue Mar 11, 2024 · 0 comments
Labels
enhancement New feature or request

Data Collection and Analysis

We collect data on food and drug intake in addition to symptom severity ratings.

Adaptive Intervention and Predictive Control Models

This data is fed into a predictive control model system, a concept borrowed from behavioral medicine and control systems engineering. This system uses the data to continually refine its suggestions, helping you optimize your health and well-being.

Adaptive intervention is a strategy used in behavioral medicine to create individually tailored strategies for the prevention and treatment of chronic disorders. It involves intensive measurement and frequent decision-making over time, allowing the intervention to adapt to the individual's needs.

A predictive control model is a control system that uses data to predict future outcomes and adjusts its actions accordingly. In the context of Longevitron, this means using the collected data to predict your future health outcomes and adjusting its suggestions to optimize your health.
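
To make the predict-then-adjust loop concrete, here is a toy receding-horizon sketch. Everything in it is an assumption for illustration: the one-step symptom model `predict_next`, its coefficients, the candidate doses, and the cost function are hypothetical and are not Longevitron's actual model.

```python
def predict_next(symptom, dose, persistence=0.8, dose_effect=-0.05, drift=1.0):
    """Hypothetical one-step model: symptoms persist, drift upward, and respond to dose."""
    return max(0.0, persistence * symptom + dose_effect * dose + drift)


def choose_dose(symptom, candidate_doses, horizon=3, dose_cost=0.3):
    """Pick the constant dose with the lowest predicted cost over the horizon."""
    def cost(dose):
        total, state = 0.0, symptom
        for _ in range(horizon):
            state = predict_next(state, dose)
            # Penalize predicted symptoms (squared) and the dose itself.
            total += state ** 2 + dose_cost * dose
        return total
    return min(candidate_doses, key=cost)


CANDIDATE_DOSES = [0, 10, 20]  # made-up dose levels
symptom = 5.0
history = []
for _ in range(5):
    dose = choose_dose(symptom, CANDIDATE_DOSES)
    # In a real system this would be a new measurement, not the model's own prediction.
    symptom = predict_next(symptom, dose)
    history.append((dose, symptom))
```

At each step only the first decision is applied and the plan is recomputed from the newest state, which is the defining feature of predictive control.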

image

A control systems engineering approach for adaptive behavioral interventions: illustration with a fibromyalgia intervention - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4167895/

Real-Life Application and Benefits

Consider a hypothetical scenario where you're dealing with a chronic condition like fibromyalgia. Longevitron would collect data on your symptoms, medication intake, stress levels, sleep quality, and other relevant factors. It would then feed this data into its predictive control model system, which would use it to predict your future symptoms and adjust your treatment plan accordingly.

This could involve suggesting changes to your medication dosage, recommending lifestyle changes, or even alerting your healthcare provider if it detects a potential issue. The goal is to optimize your health and well-being based on your needs and circumstances.

decision-support-notifications

☝️The image above is what we're trying to achieve here.

To determine the effects of various factors on health outcomes, we currently apply pharmacokinetic modeling over various onset delay and duration of action hyper-parameters and combine that with some other parameters for each of Hill's criteria for causality.

The distributions in this type of data are far from normal, and because of the onset delays and durations of action, regular Pearson correlations don't work well. So we mainly focus on change from baseline. There's a lot of room for improvement by controlling for confounding with instrumental variables or by using convolutional recursive neural networks.
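
As a rough illustration of "change from baseline", the sketch below computes the percent change of the mean outcome in a follow-up period relative to a baseline period. The function name and the choice of a simple mean are assumptions, not the project's exact calculation.

```python
from statistics import mean


def percent_change_from_baseline(baseline_values, followup_values):
    """Percent change of the follow-up mean relative to the baseline mean."""
    baseline = mean(baseline_values)
    if baseline == 0:
        return None  # percent change is undefined for a zero baseline
    return 100.0 * (mean(followup_values) - baseline) / baseline


# Made-up severity ratings: baseline days, then days following a high dose.
change = percent_change_from_baseline([4, 4, 4, 4], [3, 3])
```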

Hybrid Predictive Control Black Box Models seem most appropriate.

Test and Training Data

It's a matrix of years of self-reported Arthritis Severity Rating measurements and hundreds of potential factors over time.

https://github.com/curedao/curedao-black-box-optimization-engine/raw/main/data/arthritis-factor-measurements-matrix-zeros-unixtime.csv

Format

The first row is the variable names. The first column is Unix timestamp (seconds since 1970-01-01 00:00:00 UTC).
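
A minimal sketch of reading a file in this format with the Python standard library. The column names in the inline sample are made up, and treating empty cells as missing is an assumption; the real file is linked above.

```python
import csv
import io
from datetime import datetime, timezone


def load_measurement_matrix(handle):
    """Parse a CSV whose first row holds variable names and whose first
    column holds Unix timestamps in seconds (UTC)."""
    reader = csv.reader(handle)
    header = next(reader)
    variable_names = header[1:]
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        timestamp = datetime.fromtimestamp(int(float(record[0])), tz=timezone.utc)
        values = {
            name: (float(cell) if cell != "" else None)
            for name, cell in zip(variable_names, record[1:])
        }
        rows.append((timestamp, values))
    return variable_names, rows


# Tiny made-up sample in the same layout as the real file.
sample = io.StringIO(
    "timestamp,Arthritis Severity Rating,Example Factor\n"
    "0,2,0\n"
    "86400,3,1.5\n"
)
names, rows = load_measurement_matrix(sample)
```

For the real data, pass an open file handle instead of the `io.StringIO` sample.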

Pre-Processing

To make it easier to analyze, some preprocessing has been done. This includes zero-filling where appropriate. Also, the factor measurement values are aggregated values preceding the Arthritis measurements based on the onset delay and duration of action.
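
The aggregation step could look roughly like the sketch below. The exact window boundaries, the use of a mean, and returning `None` (rather than a filling value) for empty windows are all assumptions; the per-variable hyper-parameters determine the real behaviour.

```python
from statistics import mean


def aggregate_preceding(cause, outcome_times, onset_delay, duration_of_action, aggregate=mean):
    """For each outcome time t, aggregate the cause values measured in
    (t - onset_delay - duration_of_action, t - onset_delay].

    cause is a list of (unix_seconds, value); delays are in seconds.
    """
    aggregated = []
    for t in outcome_times:
        window_end = t - onset_delay
        window_start = window_end - duration_of_action
        values = [v for ts, v in cause if window_start < ts <= window_end]
        # None marks "no cause measurement in the window".
        aggregated.append(aggregate(values) if values else None)
    return aggregated


# Made-up measurements: (unix_seconds, dose)
cause = [(0, 1.0), (3600, 3.0), (7200, 5.0)]
result = aggregate_preceding(cause, [100, 9000], onset_delay=1800, duration_of_action=5400)
```

Passing `aggregate=sum` instead of the default mean would suit variables such as total daily intake.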

Hyper-Parameters

The aggregation method and other hyper-parameters can be found by putting the Variable Name in either

  1. the API Explorer or
  2. in the URL https://studies.fdai.earth/VARIABLE_NAME_HERE.

Determining Treatment Effects from Sparse and Irregular Time Series Data

Introduction

Analyzing the effects of a treatment based on observational time series data is a common need in many domains like medicine, psychology, and economics. However, this analysis often faces several key challenges:

  • The data is sparse - there are a limited number of observations.
  • The data is irregular - observations are not at regular time intervals.
  • There is missing data - many timepoints have no observation.
  • The onset delay of the treatment effect is unknown. It may take time to appear.
  • The duration of the treatment effect is unknown. It may persist after cessation.
  • Both acute (short-term) and cumulative (long-term) effects need to be analyzed.
  • Causality and statistical significance need to be established rigorously.
  • The optimal dosage needs to be determined to maximize benefits.

This article provides a comprehensive methodology to overcome these challenges and determine whether a treatment makes an outcome metric better, worse, or has no effect based on sparse, irregular time series data with missingness.

Data Preprocessing

Before statistical analysis can begin, the data must be preprocessed:

  • Resample the time series to a regular interval if needed while preserving original timestamps. This allows handling missing data. For example, resample to 1 measurement per day.
  • Do not do interpolation or forward fill to estimate missing values. This introduces incorrect data. Simply exclude those time periods from analysis.
  • Filter out irrelevant variance such as daily or weekly cycles and long-term trends, for example by detrending or seasonally adjusting the data.

Proper preprocessing sets up the data for robust analysis.
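
A small sketch of the resampling step under these rules: measurements are averaged per UTC day, and days with no observations are left out rather than interpolated. Averaging within a day is an assumption; a sum may suit some variables better.

```python
from collections import defaultdict
from datetime import date, datetime, timezone
from statistics import mean


def resample_daily(measurements):
    """Average (unix_seconds, value) measurements per UTC day.

    Days without any observation are absent from the result; nothing is
    interpolated or forward-filled.
    """
    by_day = defaultdict(list)
    for ts, value in measurements:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        by_day[day].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Made-up data: two readings on day 1, none on day 2, one on day 3.
raw = [(0, 2.0), (3600, 4.0), (2 * 86400, 5.0)]
daily = resample_daily(raw)
```

The raw list keeps the original timestamps, so the daily view can always be rebuilt with a different interval.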

Statistical Analysis Methodology

With cleaned data, a rigorous methodology can determine treatment effects:

Segment Data

First, split the data into three segments:

  • Pre-treatment - Period before treatment began
  • During treatment - Period during which treatment was actively administered
  • Post-treatment - Period after treatment ended

This enables separate analysis of the acute and cumulative effects.

Acute Effects Analysis

To analyze acute effects, compare the 'during treatment' segment vs the 'pre-treatment' segment:

  • Use interrupted time series analysis models to determine causality.
  • Apply statistical tests like t-tests to determine significance.
  • Systematically test different onset delays by incrementally shifting the start of the 'during treatment' segment later, to account for the unknown onset delay.
  • Systematically test excluding various amounts of time after treatment cessation to account for effect duration.
  • Look for acute improvements or decrements right after treatment begins based on the models.
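
The onset-delay scan can be sketched as follows. This version compares the pre-treatment and shifted during-treatment segments with Welch's t statistic, computed by hand so the example needs only the standard library. The data and function names are made up, and a plain t-test ignores autocorrelation, so this is a simplification of a full interrupted time series model.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for mean(b) - mean(a)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_b) - mean(sample_a)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def scan_onset_delays(series, treatment_start, treatment_end, delays):
    """Compare pre-treatment outcomes with during-treatment outcomes,
    shifting the start of the during-treatment segment by each delay."""
    pre = [y for day, y in series if day < treatment_start]
    results = {}
    for delay in delays:
        during = [y for day, y in series
                  if treatment_start + delay <= day < treatment_end]
        if len(pre) >= 2 and len(during) >= 2:
            results[delay] = welch_t(pre, during)
    return results


# Made-up daily severity: treatment starts on day 5, effect appears on day 7.
outcomes = [5.0, 6.0, 5.0, 6.0, 5.0] + [5.0, 6.0] + [3.0, 2.0, 3.0, 2.0, 3.0, 2.0, 3.0, 2.0]
series = list(enumerate(outcomes))
results = scan_onset_delays(series, treatment_start=5, treatment_end=15, delays=[0, 1, 2])
```

A p-value for each `(t, df)` pair can be obtained from a statistics library's t distribution. Scanning many delays is a multiple-comparisons problem, so the strongest result should be read with caution.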

Cumulative Effects Analysis

To analyze cumulative effects, build regression models between the outcome variable and the cumulative treatment dosage over time:

  • Use linear regression, enforcing causality constraints.
  • Apply statistical tests like F-tests for significance.
  • Systematically test excluding various amounts of time after treatment cessation to account for effect duration.
  • Look for long-term improvements or decrements over time based on the regression models.

Overall Effect Determination

Combine the acute and cumulative insights to determine the overall effect direction and statistical significance.

For example, acute worsening but long-term cumulative improvement would imply an initial side effect but long-term benefits. Lack of statistical significance would mean that no effect was detected, which is not the same as showing that there is no effect.

Optimization

To determine the optimal dosage, incrementally adjust the daily dosage amount in the models above. Determine the dosage that minimizes the outcome variable in both the acute and cumulative sense, assuming lower values are better, as with symptom severity.

Analysis Pipeline

Given these constraints and requirements, the methodology can be refined into the following pipeline:

  1. Data Preprocessing:

    • Handling Missingness: Exclude rows or time periods with missing data. This ensures the analysis is grounded in actual observations.
    • Standardization: For treatments with larger scales, standardize values to have a mean of 0 and a standard deviation of 1. This will make regression coefficients more interpretable, representing changes in symptom severity per standard deviation change in treatment.
  2. Lagged Regression Analysis:

    • Evaluate if treatment on previous days affects today's outcome, given the discrete nature of treatment.
    • Examine up to a certain number of lags (e.g., 30 days) to determine potential onset delay and duration.
    • Coefficients represent the change in symptom severity due to a one unit or one standard deviation change in treatment, depending on whether standardization was applied. P-values indicate significance.
  3. Reverse Causality Check:

    • Assess if symptom severity on previous days predicts treatment intake. This helps in understanding potential feedback mechanisms.
  4. Cross-Correlation Analysis:

    • Analyze the correlation between treatment and symptom severity across various lags.
    • This aids in understanding potential onset delays and durations of effect.
  5. Granger Causality Tests:

    • Test if past values of treatment provide information about future values of symptom severity and vice-versa.
    • This test can help in determining the direction of influence.
  6. Moving Window Analysis (for cumulative effects):

    • Create aggregated variables representing the sum or average treatment intake over windows (e.g., 7 days, 14 days) leading up to each observation.
    • Use these in regression models to assess if cumulative intake over time affects symptom severity.
  7. Optimal Dosage Analysis:

    • Group data by discrete dosage levels.
    • Calculate the mean symptom severity for each group.
    • The dosage associated with the lowest mean symptom severity suggests the optimal intake level.
  8. Control for Confounders (if data is available):

    • If data on potential confounding variables is available, incorporate them in the regression models. This helps in isolating the unique effect of the treatment.
  9. Model Diagnostics:

    • After regression, check residuals for normality, autocorrelation, and other potential issues to validate the model.
  10. Interpretation:

    • Consistency in findings across multiple analyses strengthens the case for a relationship.
    • While no single test confirms causality, evidence from multiple methods can offer a compelling case.

By adhering to this methodology, you will be working with actual observations, minimizing the introduction of potential errors from imputation. The combination of lagged regression, Granger causality tests, and moving window analysis will provide insights into both acute and cumulative effects, onset delays, and optimal treatment dosages.
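
Steps 6 and 7 can be sketched in a few lines. The window definition (the days strictly before each observation) and the helper names are assumptions, and the dosage grouping is a naive comparison that does not control for confounders.

```python
from collections import defaultdict
from statistics import mean


def rolling_sum(values, window):
    """Sum over the `window` days before each day; None until a full window exists."""
    return [sum(values[i - window:i]) if i >= window else None
            for i in range(len(values))]


def mean_outcome_by_dose(doses, outcomes):
    """Mean outcome for each discrete dose level, ignoring missing pairs."""
    groups = defaultdict(list)
    for dose, outcome in zip(doses, outcomes):
        if dose is not None and outcome is not None:
            groups[dose].append(outcome)
    return {dose: mean(values) for dose, values in groups.items()}


# Made-up data for illustration.
window_totals = rolling_sum([1, 2, 3, 4], window=2)
by_dose = mean_outcome_by_dose([0, 0, 10, 10, 20, 20], [5, 5, 3, 4, 4, 5])
best_dose = min(by_dose, key=by_dose.get)  # lowest mean severity
```

The windowed totals can be used as the predictor in the regression of step 6, paired with the outcome on the same day.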

Data Schema for Storing User Variable Relationship Analyses

| Property | Type | Nullable | Description |
| --- | --- | --- | --- |
| id | int auto_increment | No | Unique identifier for each correlation entry. |
| user_id | bigint unsigned | No | ID of the user to whom this correlation data belongs. |
| cause_variable_id | int unsigned | No | ID of the variable considered as the cause in the correlation. |
| effect_variable_id | int unsigned | No | ID of the variable considered as the effect in the correlation. |
| qm_score | double | Yes | Quantitative metric scoring the importance of the correlation based on strength, usefulness, and causal plausibility. |
| forward_pearson_correlation_coefficient | float(10, 4) | Yes | Statistical measure indicating the linear relationship strength between cause and effect. |
| value_predicting_high_outcome | double | Yes | Specific cause variable value that predicts a higher than average effect. |
| value_predicting_low_outcome | double | Yes | Specific cause variable value that predicts a lower than average effect. |
| predicts_high_effect_change | int(5) | Yes | Percentage change in the effect when the predictor is near the value predicting high outcome. |
| predicts_low_effect_change | int(5) | No | Percentage change in the effect when the predictor is near the value predicting low outcome. |
| average_effect | double | No | Average value of the effect variable across all measurements. |
| average_effect_following_high_cause | double | No | Average value of the effect variable following high cause variable measurements. |
| average_effect_following_low_cause | double | No | Average value of the effect variable following low cause variable measurements. |
| average_daily_low_cause | double | No | Daily average of cause variable values that are below average. |
| average_daily_high_cause | double | No | Daily average of cause variable values that are above average. |
| average_forward_pearson_correlation_over_onset_delays | float | Yes | Average of Pearson correlation coefficients calculated over different onset delays. |
| average_reverse_pearson_correlation_over_onset_delays | float | Yes | Average of reverse Pearson correlation coefficients over different onset delays. |
| cause_changes | int | No | Count of changes in cause variable values across the dataset. |
| cause_filling_value | double | Yes | Default value used to fill gaps in cause variable data. |
| cause_number_of_processed_daily_measurements | int | No | Count of daily processed measurements for the cause variable. |
| cause_number_of_raw_measurements | int | No | Count of raw data measurements for the cause variable. |
| cause_unit_id | smallint unsigned | Yes | ID representing the unit of measurement for the cause variable. |
| confidence_interval | double | No | Statistical range indicating the reliability of the correlation effect size. |
| critical_t_value | double | No | Threshold value for statistical significance in correlation analysis. |
| created_at | timestamp | No | Timestamp of when the correlation record was created. |
| data_source_name | varchar(255) | Yes | Name of the data source for the correlation data. |
| deleted_at | timestamp | Yes | Timestamp of when the correlation record was marked as deleted. |
| duration_of_action | int | No | Duration in seconds for which the cause is expected to have an effect. |
| effect_changes | int | No | Count of changes in effect variable values across the dataset. |
| effect_filling_value | double | Yes | Default value used to fill gaps in effect variable data. |
| effect_number_of_processed_daily_measurements | int | No | Count of daily processed measurements for the effect variable. |
| effect_number_of_raw_measurements | int | No | Count of raw data measurements for the effect variable. |
| forward_spearman_correlation_coefficient| float | No | Spearman correlation assessing monotonic relationships between lagged cause and effect data. |
| number_of_days | int | No | Number of days over which the correlation data was collected. |
| number_of_pairs | int | No | Total number of cause-effect pairs used for calculating the correlation. |
| onset_delay | int | No | Estimated time in seconds between cause occurrence and effect observation. |
| onset_delay_with_strongest_pearson_correlation | int(10) | Yes | Onset delay duration yielding the strongest Pearson correlation. |
| optimal_pearson_product | double | Yes | Theoretical optimal value for the Pearson product in the correlation analysis. |
| p_value | double | Yes | Statistical significance indicator for the correlation, with values below 0.05 conventionally treated as statistically significant. |
| pearson_correlation_with_no_onset_delay | float | Yes | Pearson correlation coefficient calculated without considering onset delay. |
| predictive_pearson_correlation_coefficient | double | Yes | Pearson coefficient quantifying the predictive strength of the cause variable on the effect. |
| reverse_pearson_correlation_coefficient | double | Yes | Correlation coefficient when cause and effect variables are reversed, used to assess causality. |
| statistical_significance | float(10, 4) | Yes | Value representing the combination of effect size and sample size in determining correlation significance. |
| strongest_pearson_correlation_coefficient | float | Yes | The highest Pearson correlation coefficient observed in the analysis. |
| t_value | double | Yes | Statistical value derived from correlation and sample size, used in assessing significance. |
| updated_at | timestamp | No | Timestamp of the most recent update made to the correlation record. |
| grouped_cause_value_closest_to_value_predicting_low_outcome | double | No | Realistic daily cause variable value associated with lower-than-average outcomes. |
| grouped_cause_value_closest_to_value_predicting_high_outcome | double | No | Realistic daily cause variable value associated with higher-than-average outcomes. |
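
For orientation, the standard textbook relationship between a Pearson correlation, the number of pairs, and a t statistic is sketched below. Whether the stored `t_value` column is computed exactly this way is an assumption; the schema only says it is derived from correlation and sample size.

```python
from math import sqrt


def t_value_from_correlation(r, number_of_pairs):
    """t statistic for testing a Pearson correlation r against zero,
    with number_of_pairs - 2 degrees of freedom."""
    if number_of_pairs < 3 or abs(r) >= 1:
        return None  # undefined for tiny samples or perfect correlations
    return r * sqrt((number_of_pairs - 2) / (1 - r ** 2))


t = t_value_from_correlation(0.5, 27)
```

Comparing the absolute value of `t` with `critical_t_value` for the same degrees of freedom gives the usual significance decision.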

Conclusion

This methodology uses interrupted time series analysis, regression modeling, statistical testing, onset/duration modeling, and optimization to determine treatment effects from sparse, irregular observational data with missingness. It assesses causality and significance in both an acute and a cumulative sense. By finding the optimal dosage, it provides actionable insights for maximizing the benefits of the treatment.

Resources

Links

  1. SunilDeshpande_S2014_ETD.pdf (asu.edu)
  2. LocalControl: An R Package for Comparative Safety and Effectiveness Research | Journal of Statistical Software (jstatsoft.org)
  3. bbotk: A brief introduction (r-project.org)
  4. artemis-toumazi/dfpk (github.com)
  5. miroslavgasparek/MPC_Cancer: Model Predictive Control for the optimisation of the tumour treatment through the combination of the chemotherapy and immunotherapy. (github.com)
  6. Doubly Robust Learning — econml 0.12.0 documentation
  7. A control systems engineering approach for adaptive behavioral interventions: illustration with a fibromyalgia intervention (nih.gov)
  8. The promise of machine learning in predicting treatment outcomes in psychiatry - Chekroud - 2021 - World Psychiatry - Wiley Online Library
  9. CURATE.AI: Optimizing Personalized Medicine with Artificial Intelligence - Agata Blasiak, Jeffrey Khong, Theodore Kee, 2020 (sagepub.com)
  10. Using nonlinear model predictive control to find optimal therapeutic strategies to modulate inflammation (aimspress.com)
  11. Forecasting Treatment Responses Over Time Using Recurrent Marginal Structural Networks (nips.cc)
  12. Estimating counterfactual treatment outcomes over time through adversarially balanced representations | OpenReview
  13. https://dash.harvard.edu/bitstream/handle/1/37366470/AGUILAR-SENIORTHESIS-2019.pdf?isAllowed=y&sequence=1
@mikepsinn mikepsinn added the enhancement New feature or request label Mar 11, 2024