Commit b2acc5e: nbdev2 fixes

amaiya committed Jun 15, 2024
1 parent f41cba2 commit b2acc5e

Showing 23 changed files with 347 additions and 343 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
*.ipynb merge=nbdev-merge
11 changes: 11 additions & 0 deletions .gitconfig
@@ -0,0 +1,11 @@
# Generated by nbdev_install_hooks
#
# If you need to disable this instrumentation do:
# git config --local --unset include.path
#
# To restore:
# git config --local include.path ../.gitconfig
#
[merge "nbdev-merge"]
name = resolve conflicts with nbdev_fix
driver = nbdev_merge %O %A %B %P
147 changes: 94 additions & 53 deletions README.md
@@ -1,55 +1,90 @@
# Welcome to CausalNLP


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## What is CausalNLP?
> CausalNLP is a practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.

## Features
- Low-code [causal inference](https://amaiya.github.io/causalnlp/examples.html) in as little as two commands
- Out-of-the-box support for using [**text** as a "controlled-for" variable](https://amaiya.github.io/causalnlp/examples.html#What-is-the-causal-impact-of-a-positive-review-on-product-views?) (e.g., confounder)
- Built-in [Autocoder](https://amaiya.github.io/causalnlp/autocoder.html) that transforms raw text into useful variables for causal analyses (e.g., topics, sentiment, emotion, etc.)
- Sensitivity analysis to [assess robustness of causal estimates](https://amaiya.github.io/causalnlp/causalinference.html#CausalInferenceModel.evaluate_robustness)
- Quick and simple [key driver analysis](https://amaiya.github.io/causalnlp/key_driver_analysis.html) to yield clues on potential drivers of an outcome based on predictive power, correlations, etc.
- Can easily be applied to ["traditional" tabular datasets without text](https://amaiya.github.io/causalnlp/examples.html#What-is-the-causal-impact-of-having-a-PhD-on-making-over-$50K?) (i.e., datasets with only numerical and categorical variables)
- Includes an experimental [PyTorch implementation](https://amaiya.github.io/causalnlp/core.causalbert.html) of [CausalBert](https://arxiv.org/abs/1905.12741) by Veitch, Sridhar, and Blei (based on [reference implementation](https://github.com/rpryzant/causal-bert-pytorch) by R. Pryzant)

## Install

1. `pip install -U pip`
2. `pip install causalnlp`

**NOTE**: On Python 3.6.x, if you get a `RuntimeError: Python version >= 3.7 required`, try ensuring NumPy is installed **before** CausalNLP (e.g., `pip install numpy==1.18.5`).

## Usage

To try out the [examples](https://amaiya.github.io/causalnlp/examples.html) yourself:

<a href="https://colab.research.google.com/drive/1hu7j2QCWkVlFsKbuereWWRDOBy1anMbQ?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Example: What is the causal impact of a positive review on a product click?

``` python
import pandas as pd
df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')
```

The file `music_seed50.tsv` is a semi-simulated dataset from [here](https://github.com/rpryzant/causal-text). Columns of relevance include:
- `Y_sim`: outcome, where 1 means product was clicked and 0 means not.
- `text`: raw text of review
- `rating`: rating associated with review (1 through 5)
- `T_true`: 0 means rating less than 3, 1 means rating of 5, where `T_true` affects the outcome `Y_sim`.
- `T_ac`: an approximation of true review sentiment (`T_true`) created with [Autocoder](https://amaiya.github.io/causalnlp/autocoder.html) from raw review text
- `C_true`: confounding categorical variable (1=audio CD, 0=other)
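The `error_bad_lines` flag used in earlier versions of this snippet was deprecated in pandas 1.3 and later removed in favor of `on_bad_lines`, which this commit adopts. A small compatibility helper (a sketch, not part of CausalNLP) loads the file on either side of that change:

``` python
import pandas as pd

def read_tsv(path):
    """Load a TSV while skipping malformed rows, across pandas versions.

    `on_bad_lines` replaced the deprecated `error_bad_lines` in pandas 1.3;
    this helper (illustrative only) picks whichever the installed version
    supports.
    """
    try:
        return pd.read_csv(path, sep='\t', on_bad_lines='skip')
    except TypeError:  # pandas < 1.3 does not know `on_bad_lines`
        return pd.read_csv(path, sep='\t', error_bad_lines=False)
```

On current pandas, `read_tsv('sample_data/music_seed50.tsv')` behaves exactly like the call above.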


We'll pretend the true sentiment (i.e., review rating and `T_true`) is hidden and only use `T_ac` as the treatment variable.

Using the `text_col` parameter, we include the raw review text as another "controlled-for" variable.

``` python
from causalnlp import CausalInferenceModel
from lightgbm import LGBMClassifier
```

``` python
cm = CausalInferenceModel(df,
                          metalearner_type='t-learner', learner=LGBMClassifier(num_leaves=500),
                          treatment_col='T_ac', outcome_col='Y_sim', text_col='text',
                          include_cols=['C_true'])
cm.fit()
```

start fitting causal inference model
time to fit causal inference model: 10.361494302749634 sec


#### Estimating Treatment Effects

CausalNLP supports estimation of heterogeneous treatment effects (i.e., how causal impacts vary across observations, which could be documents, emails, posts, individuals, or organizations).

We will first calculate the overall average treatment effect (or ATE), which shows that a positive review increases the probability of a click by **13 percentage points** in this dataset.

**Average Treatment Effect** (or **ATE**):

``` python
print( cm.estimate_ate() )
```

{'ate': 0.1309311542209525}
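For intuition, the T-learner behind this estimate fits one outcome model per treatment arm and averages the difference in their predicted outcomes over all rows. A minimal sketch on synthetic data (variable names and data are illustrative, not CausalNLP internals):

``` python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))           # covariates (stand-ins for text/tabular features)
T = rng.integers(0, 2, size=500)        # binary treatment
Y = (X[:, 0] + 0.5 * T + rng.normal(size=500) > 0).astype(int)  # outcome

# T-learner: one outcome model per arm, then average the predicted difference
m1 = LogisticRegression().fit(X[T == 1], Y[T == 1])
m0 = LogisticRegression().fit(X[T == 0], Y[T == 0])
ate = np.mean(m1.predict_proba(X)[:, 1] - m0.predict_proba(X)[:, 1])
print({'ate': float(ate)})
```

Averaging the same per-row differences over a subset of rows yields the conditional (CATE) estimates shown next.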

**Conditional Average Treatment Effect** (or **CATE**): reviews that mention the word "toddler":

``` python
print( cm.estimate_ate(df['text'].str.contains('toddler')) )
```

{'ate': 0.15559234254638685}

**Individualized Treatment Effects** (or **ITE**):


``` python
test_df = pd.DataFrame({'T_ac' : [1], 'C_true' : [1],
'text' : ['I never bought this album, but I love his music and will soon!']})
effect = cm.predict(test_df)
print(effect)
```

[[0.80538201]]


**Model Interpretability**:

``` python
print( cm.interpret(plot=False)[1][:10] )
```

v_heard 0.028373
dtype: float64


Features with the `v_` prefix are word features. `C_true` is the categorical variable indicating whether or not the product is a CD.

### Text is Optional in CausalNLP

Despite the "NLP" in CausalNLP, the library can be used for causal inference on data **without** text (e.g., only numerical and categorical variables). See [the examples](https://amaiya.github.io/causalnlp/examples.html#What-is-the-causal-impact-of-having-a-PhD-on-making-over-$50K?) for more info.

## Documentation
API documentation and additional usage examples are available at: https://amaiya.github.io/causalnlp/

## How to Cite

Please cite [the following paper](https://arxiv.org/abs/2106.08043) when using CausalNLP in your work:

``` bibtex
@article{maiya2021causalnlp,
title={CausalNLP: A Practical Toolkit for Causal Inference with Text},
author={Arun S. Maiya},
year={2021},
eprint={2106.08043},
archivePrefix={arXiv},
primaryClass={cs.CL},
journal={arXiv preprint arXiv:2106.08043},
}
```
2 changes: 1 addition & 1 deletion causalnlp/analyzers.py
@@ -333,7 +333,7 @@ def get_topics(self, n_words=10, as_string=True):
Returns a list of discovered topics
"""
self._check_model()
feature_names = self.vectorizer.get_feature_names()
feature_names = self.vectorizer.get_feature_names_out()
topic_summaries = []
for topic_idx, topic in enumerate(self.model.components_):
summary = [feature_names[i] for i in topic.argsort()[:-n_words - 1:-1]]
2 changes: 1 addition & 1 deletion causalnlp/preprocessing.py
@@ -144,7 +144,7 @@ def preprocess(self, df,
v_features = self.tv.fit_transform(df[self.text_col])
else:
v_features = self.tv.transform(df[self.text_col])
vocab = self.tv.get_feature_names()
vocab = self.tv.get_feature_names_out()
vocab_df = pd.DataFrame(v_features.toarray(), columns = ["v_%s" % (v) for v in vocab])
X = pd.concat([X, vocab_df], axis=1, join='inner')
outcome_type = 'categorical' if self.is_classification else 'numerical'
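Both hunks above track scikit-learn's renaming of `get_feature_names` to `get_feature_names_out` (added in 1.0; the old name was removed in 1.2). A version-tolerant sketch (the helper name is illustrative, not CausalNLP API):

``` python
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    """Return vocabulary terms across scikit-learn versions.

    `get_feature_names` was removed in scikit-learn 1.2 in favor of
    `get_feature_names_out`; fall back to the old name only when the
    new one is unavailable.
    """
    if hasattr(vectorizer, 'get_feature_names_out'):
        return list(vectorizer.get_feature_names_out())
    return vectorizer.get_feature_names()

vec = CountVectorizer().fit(['causal inference with text'])
print(feature_names(vec))  # → ['causal', 'inference', 'text', 'with']
```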
