getting model-GBM.ipynb working on py 3.12.3
new model generated
old models archived
readme updates to document spark installation
kennethjmyers committed Apr 26, 2024
1 parent 780305a commit 2fd02f3
Showing 10 changed files with 5,685 additions and 283 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -271,4 +271,4 @@ scripts/zippedLambdaFunction/*
*reddit*.cfg
!example_reddit.cfg

model/pickledModels/latestModel.sav
model/pickledModels/sklearn-1.0.2/latestModel.sav
11 changes: 11 additions & 0 deletions README.md
@@ -16,6 +16,7 @@ The purpose of this repo is to:
1. Install Terraform CLI
2. Install AWS CLI and run `aws configure` and enter in your aws credentials.
3. JDK 17 installed (8, 11 or 17 are compatible with spark 3.4.0)
1. You will need to add this to your `.zshrc`: `export JAVA_HOME=$(/usr/libexec/java_home)`
3. Clone this repository
4. You can run the tests locally yourself by doing the following (it is recommended that you manage your python environments with something like [asdf](https://asdf-vm.com/) and use python==3.12.3 as your local runtime):

@@ -25,6 +26,16 @@ The purpose of this repo is to:
pip install -e ."[dev]" # installs this package in the local env with its dependencies
pytest . -r f -s # -r f shows extra info for failures, -s disables capturing
```
1. If everything installed without issue, test that pyspark works: open a fresh terminal, type `pyspark`, and hit enter. This depends on `JAVA_HOME` being set in the earlier step. `exit()` out of the shell if it worked.
2. You need to follow the steps in the [Getting Started](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Getting_Started) section for connecting to S3; see also StackOverflow posts [like this one](https://stackoverflow.com/questions/44411493/java-lang-noclassdeffounderror-org-apache-hadoop-fs-storagestatistics/44500698#44500698) for clarifications. The important thing is that you install these 2 JARs in the pyspark classpath and that their versions match each other:
1. **hadoop-aws** JAR must match the version of hadoop required by this version of spark. Spark 3.4.0 requires hadoop 3.3.4.
2. the **AWS SDK For Java Bundle** JAR - for this one you need to find the version that hadoop-aws was built with by looking at its [dependencies](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.4). For hadoop-aws 3.3.4 this is 1.12.262.
3. They can be installed by navigating to the pyspark JARs directory and downloading them, with something like the following:
```shell
cd venv/lib/python3.12/site-packages/pyspark/jars/
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
```
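As a rough sketch of the version pairing described above, the following Python helper checks a list of JAR file names for the two required JARs. The function name `check_s3_jars` and its parameters are hypothetical and not part of this repository. The default versions are the ones quoted above for Spark 3.4.0 and would need to change for other Spark releases.

```python
import re

# Versions quoted in the README for Spark 3.4.0 (assumed defaults for this sketch).
HADOOP_AWS_VERSION = "3.3.4"
AWS_SDK_BUNDLE_VERSION = "1.12.262"


def check_s3_jars(jar_names, hadoop_version=HADOOP_AWS_VERSION,
                  sdk_version=AWS_SDK_BUNDLE_VERSION):
    """Return a list of problems; an empty list means both JARs look right.

    Hypothetical helper: `jar_names` is any iterable of file names, for
    example the result of os.listdir() on the pyspark/jars/ directory.
    """
    expected = {
        "hadoop-aws": hadoop_version,
        "aws-java-sdk-bundle": sdk_version,
    }

    # Collect the version of each relevant JAR that is present.
    found = {}
    for name in jar_names:
        match = re.fullmatch(
            r"(hadoop-aws|aws-java-sdk-bundle)-([\d.]+)\.jar", name
        )
        if match:
            found[match.group(1)] = match.group(2)

    # Compare what was found against the expected versions.
    problems = []
    for artifact, version in expected.items():
        if artifact not in found:
            problems.append(f"{artifact} JAR is missing")
        elif found[artifact] != version:
            problems.append(
                f"{artifact} is {found[artifact]}, expected {version}"
            )
    return problems
```

One way to use it would be `check_s3_jars(os.listdir("venv/lib/python3.12/site-packages/pyspark/jars/"))` after running the `curl` commands, treating any returned message as a sign that the classpath needs attention.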
5. From within this repository run the following:
2,550 changes: 2,271 additions & 279 deletions model/model-GBM.ipynb

Large diffs are not rendered by default.

3,395 changes: 3,395 additions & 0 deletions model/notebookArchive/model-GBM-20240426.ipynb

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions model/pickledModels/README.md
@@ -0,0 +1,3 @@
When I first started this project in April 2023, I used sklearn 1.0.2 which I think was because I was on python 3.7 (already a bit old) and locked into some older packages (older versions of pyspark, numpy, pandas, etc).

Upon returning to this project in April 2024, I had a newer Mac and upgraded to python 3.12.3, which let me upgrade all of the above packages. However, because of this the pickled models could no longer be loaded. Therefore, the sklearn 1.0.2 models have been moved to a separate folder for archival purposes, and newly trained models using sklearn 1.4.2 will be stored in the current directory.
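To illustrate the folder layout this note describes, here is a minimal sketch of how a loader might pick the pickle file that matches the installed sklearn version. The helper name `model_path_for` is hypothetical, and this commit does not show whether any loading code in the project works this way. The two paths are the ones that appear in this commit's `.gitignore` change.

```python
from pathlib import Path

# The archived models were trained with this sklearn version (per the note above).
ARCHIVED_SKLEARN_VERSION = "1.0.2"


def model_path_for(sklearn_version: str,
                   base_dir: str = "model/pickledModels") -> Path:
    """Return the pickle path that matches the given sklearn version.

    Hypothetical helper: models pickled under sklearn 1.0.2 live in a
    versioned subfolder, while newer models sit directly in `base_dir`.
    """
    base = Path(base_dir)
    if sklearn_version == ARCHIVED_SKLEARN_VERSION:
        return base / f"sklearn-{ARCHIVED_SKLEARN_VERSION}" / "latestModel.sav"
    return base / "latestModel.sav"
```

In practice the version string would likely come from `sklearn.__version__`, so that a 1.0.2 environment reads the archived pickle and a 1.4.2 environment reads the newly trained one.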
Binary file not shown.
Binary file modified model/pickledModels/test_latestModel.sav
Binary file not shown.
7 changes: 4 additions & 3 deletions pyproject.toml
@@ -10,16 +10,13 @@ dynamic = ["version"]

dependencies = [
"boto3==1.26.117",
"matplotlib==3.8",
"numpy==1.26",
"pandas==2.2.2", # 1.3 at least needed for M1 Mac
"pg8000==1.29.4", # this was easier to pip install than psycopg2
"pyarrow==15.0.2", # don't use low versions which pin lower versions of numpy that break on M1 Mac
"pyspark==3.4.0",
"requests==2.31.0",
"scikit-learn==1.4.2",
"seaborn==0.11.2",
"shap==0.41.0",
"sqlalchemy==1.4.46", # originally tried 2.0.10, but this was incompatible with old versions of pandas https://stackoverflow.com/a/75282604/5034651,
"viral_reddit_posts_utils @ git+https://github.com/ViralRedditPosts/Utils.git@main",
"Reddit-Scraping @ git+https://github.com/ViralRedditPosts/Reddit-Scraping.git@main",
@@ -48,8 +45,12 @@ build = [
"Reddit-Model[test]"
]
dev = [
"matplotlib==3.8", # packages for plotting and notebook work are only needed in dev
"notebook==7.1.3",
"pre-commit==2.21.0",
"Reddit-Model[build]",
"seaborn==0.11.2",
"shap==0.45.0",
]

[tool.setuptools.packages.find]