Commit b2acc5e: nbdev2 fixes

amaiya committed Jun 15, 2024
1 parent f41cba2 commit b2acc5e

Showing 23 changed files with 347 additions and 343 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
*.ipynb merge=nbdev-merge
11 changes: 11 additions & 0 deletions .gitconfig
@@ -0,0 +1,11 @@
# Generated by nbdev_install_hooks
#
# If you need to disable this instrumentation do:
# git config --local --unset include.path
#
# To restore:
# git config --local include.path ../.gitconfig
#
[merge "nbdev-merge"]
name = resolve conflicts with nbdev_fix
driver = nbdev_merge %O %A %B %P
147 changes: 94 additions & 53 deletions README.md
@@ -1,55 +1,90 @@
# Welcome to CausalNLP


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## What is CausalNLP?
> CausalNLP is a practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.

## Features
- Low-code [causal inference](https://amaiya.github.io/causalnlp/examples.html) in as little as two commands
- Out-of-the-box support for using [**text** as a "controlled-for" variable](https://amaiya.github.io/causalnlp/examples.html#What-is-the-causal-impact-of-a-positive-review-on-product-views?) (e.g., confounder)
- Built-in [Autocoder](https://amaiya.github.io/causalnlp/autocoder.html) that transforms raw text into useful variables for causal analyses (e.g., topics, sentiment, emotion, etc.)
- Sensitivity analysis to [assess robustness of causal estimates](https://amaiya.github.io/causalnlp/causalinference.html#CausalInferenceModel.evaluate_robustness)
- Quick and simple [key driver analysis](https://amaiya.github.io/causalnlp/key_driver_analysis.html) to yield clues on potential drivers of an outcome based on predictive power, correlations, etc.
- Can easily be applied to ["traditional" tabular datasets without text](https://amaiya.github.io/causalnlp/examples.html#What-is-the-causal-impact-of-having-a-PhD-on-making-over-$50K?) (i.e., datasets with only numerical and categorical variables)
- Includes an experimental [PyTorch implementation](https://amaiya.github.io/causalnlp/core.causalbert.html) of [CausalBert](https://arxiv.org/abs/1905.12741) by Veitch, Sridhar, and Blei (based on [reference implementation](https://github.com/rpryzant/causal-bert-pytorch) by R. Pryzant)

## Install

1. `pip install -U pip`
2. `pip install causalnlp`

**NOTE**: On Python 3.6.x, if you get a `RuntimeError: Python version >= 3.7 required`, try ensuring NumPy is installed **before** CausalNLP (e.g., `pip install numpy==1.18.5`).

## Usage

To try out the [examples](https://amaiya.github.io/causalnlp/examples.html) yourself:

<a href="https://colab.research.google.com/drive/1hu7j2QCWkVlFsKbuereWWRDOBy1anMbQ?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Example: What is the causal impact of a positive review on a product click?

``` python
import pandas as pd
df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')
```

The file `music_seed50.tsv` is a semi-simulated dataset from [here](https://github.com/rpryzant/causal-text). Columns of relevance include:
- `Y_sim`: outcome, where 1 means product was clicked and 0 means not.
- `text`: raw text of review
- `rating`: rating associated with review (1 through 5)
- `T_true`: 0 means rating less than 3, 1 means rating of 5, where `T_true` affects the outcome `Y_sim`.
- `T_ac`: an approximation of true review sentiment (`T_true`) created with [Autocoder](https://amaiya.github.io/causalnlp/autocoder.html) from raw review text
- `C_true`: confounding categorical variable (1=audio CD, 0=other)
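The `error_bad_lines` flag used in earlier versions of this snippet was deprecated in pandas 1.3 and later removed in favor of `on_bad_lines`, which this commit adopts. A small compatibility helper (a sketch, not part of CausalNLP) loads the file on either side of that change:

``` python
import pandas as pd

def read_tsv(path):
    """Load a TSV while skipping malformed rows, across pandas versions.

    `on_bad_lines` replaced the deprecated `error_bad_lines` in pandas 1.3;
    this helper (illustrative only) picks whichever the installed version
    supports.
    """
    try:
        return pd.read_csv(path, sep='\t', on_bad_lines='skip')
    except TypeError:  # pandas < 1.3 does not know `on_bad_lines`
        return pd.read_csv(path, sep='\t', error_bad_lines=False)
```

On current pandas, `read_tsv('sample_data/music_seed50.tsv')` behaves exactly like the call above.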


We'll pretend the true sentiment (i.e., review rating and `T_true`) is hidden and only use `T_ac` as the treatment variable.

Using the `text_col` parameter, we include the raw review text as another "controlled-for" variable.

``` python
from causalnlp import CausalInferenceModel
from lightgbm import LGBMClassifier
```

``` python
cm = CausalInferenceModel(df,
                          metalearner_type='t-learner', learner=LGBMClassifier(num_leaves=500),
                          treatment_col='T_ac', outcome_col='Y_sim', text_col='text',
                          include_cols=['C_true'])
cm.fit()
```

start fitting causal inference model
time to fit causal inference model: 10.361494302749634 sec


#### Estimating Treatment Effects

CausalNLP supports estimation of heterogeneous treatment effects (i.e., how causal impacts vary across observations, which could be documents, emails, posts, individuals, or organizations).

We will first calculate the overall average treatment effect (or ATE), which shows that a positive review increases the probability of a click by **13 percentage points** in this dataset.

**Average Treatment Effect** (or **ATE**):

``` python
print( cm.estimate_ate() )
```

{'ate': 0.1309311542209525}
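For intuition, the T-learner behind this estimate fits one outcome model per treatment arm and averages the difference in their predicted outcomes over all rows. A minimal sketch on synthetic data (variable names and data are illustrative, not CausalNLP internals):

``` python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))           # covariates (stand-ins for text/tabular features)
T = rng.integers(0, 2, size=500)        # binary treatment
Y = (X[:, 0] + 0.5 * T + rng.normal(size=500) > 0).astype(int)  # outcome

# T-learner: one outcome model per arm, then average the predicted difference
m1 = LogisticRegression().fit(X[T == 1], Y[T == 1])
m0 = LogisticRegression().fit(X[T == 0], Y[T == 0])
ate = np.mean(m1.predict_proba(X)[:, 1] - m0.predict_proba(X)[:, 1])
print({'ate': float(ate)})
```

Averaging the same per-row differences over a subset of rows yields the conditional (CATE) estimates shown next.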

**Conditional Average Treatment Effect** (or **CATE**): reviews that mention the word "toddler":

``` python
print( cm.estimate_ate(df['text'].str.contains('toddler')) )
```

{'ate': 0.15559234254638685}

**Individualized Treatment Effects** (or **ITE**):


``` python
test_df = pd.DataFrame({'T_ac' : [1], 'C_true' : [1],
'text' : ['I never bought this album, but I love his music and will soon!']})
effect = cm.predict(test_df)
print(effect)
```

[[0.80538201]]


**Model Interpretability**:

``` python
print( cm.interpret(plot=False)[1][:10] )
```

v_heard 0.028373
dtype: float64


Features with the `v_` prefix are word features. `C_true` is the categorical variable indicating whether or not the product is a CD.

### Text is Optional in CausalNLP

Despite the "NLP" in CausalNLP, the library can be used for causal inference on data **without** text (e.g., only numerical and categorical variables). See [the examples](https://amaiya.github.io/causalnlp/examples.html#What-is-the-causal-impact-of-having-a-PhD-on-making-over-$50K?) for more info.

## Documentation
API documentation and additional usage examples are available at: https://amaiya.github.io/causalnlp/

## How to Cite

Please cite [the following paper](https://arxiv.org/abs/2106.08043) when using CausalNLP in your work:

``` bibtex
@article{maiya2021causalnlp,
title={CausalNLP: A Practical Toolkit for Causal Inference with Text},
author={Arun S. Maiya},
year={2021},
eprint={2106.08043},
archivePrefix={arXiv},
primaryClass={cs.CL},
journal={arXiv preprint arXiv:2106.08043},
}
```
2 changes: 1 addition & 1 deletion causalnlp/analyzers.py
@@ -333,7 +333,7 @@ def get_topics(self, n_words=10, as_string=True):
Returns a list of discovered topics
"""
self._check_model()
feature_names = self.vectorizer.get_feature_names()
feature_names = self.vectorizer.get_feature_names_out()
topic_summaries = []
for topic_idx, topic in enumerate(self.model.components_):
summary = [feature_names[i] for i in topic.argsort()[:-n_words - 1:-1]]
2 changes: 1 addition & 1 deletion causalnlp/preprocessing.py
@@ -144,7 +144,7 @@ def preprocess(self, df,
v_features = self.tv.fit_transform(df[self.text_col])
else:
v_features = self.tv.transform(df[self.text_col])
vocab = self.tv.get_feature_names()
vocab = self.tv.get_feature_names_out()
vocab_df = pd.DataFrame(v_features.toarray(), columns = ["v_%s" % (v) for v in vocab])
X = pd.concat([X, vocab_df], axis=1, join='inner')
outcome_type = 'categorical' if self.is_classification else 'numerical'
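Both hunks above track scikit-learn's renaming of `get_feature_names` to `get_feature_names_out` (added in 1.0; the old name was removed in 1.2). A version-tolerant sketch (the helper name is illustrative, not CausalNLP API):

``` python
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    """Return vocabulary terms across scikit-learn versions.

    `get_feature_names` was removed in scikit-learn 1.2 in favor of
    `get_feature_names_out`; fall back to the old name only when the
    new one is unavailable.
    """
    if hasattr(vectorizer, 'get_feature_names_out'):
        return list(vectorizer.get_feature_names_out())
    return vectorizer.get_feature_names()

vec = CountVectorizer().fit(['causal inference with text'])
print(feature_names(vec))  # → ['causal', 'inference', 'text', 'with']
```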
