getting model-GBM.ipynb working on py 3.12.3

New model generated, old models archived, and README updated to document Spark installation.
ViralRedditPosts · Apr 26, 2024 · 2fd02f3
1 parent 780305a
commit 2fd02f3
Showing 10 changed files with 5,685 additions and 283 deletions.
diff --git a/.gitignore b/.gitignore
@@ -271,4 +271,4 @@ scripts/zippedLambdaFunction/*
 *reddit*.cfg
 !example_reddit.cfg
 
-model/pickledModels/latestModel.sav
+model/pickledModels/sklearn-1.0.2/latestModel.sav
diff --git a/README.md b/README.md
@@ -16,6 +16,7 @@ The purpose of this repo is to:
     1. Install Terraform CLI
     2. Install AWS CLI and run `aws configure` and enter in your aws credentials.
     3. JDK 17 installed (8, 11 or 17 are compatible with spark 3.4.0)
+       1. You will need to add this to your `.zshrc`: `export JAVA_HOME=\$(/usr/libexec/java_home)`
 3. Clone this repository 
 4. You can run the tests locally yourself by doing the following (it is recommended that you manage your python environments with something like [asdf](https://asdf-vm.com/) and use python==3.12.3 as your local runtime):
 
@@ -25,6 +26,16 @@ The purpose of this repo is to:
     pip install -e ."[dev]"  # installs this packages in local env with dependencies
     pytest . -r f -s   # -r f shows extra info for failures, -s disables capturing
     ```
+   1. If everything installed without issue, test that pyspark works: open a fresh terminal, type `pyspark`, and hit enter. This depends on setting JAVA_HOME in the earlier step. `exit()` out of the shell if it worked.
+   2. Follow the steps in the [Getting Started](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Getting_Started) section for connecting to S3. See also StackOverflow posts [like this one](https://stackoverflow.com/questions/44411493/java-lang-noclassdeffounderror-org-apache-hadoop-fs-storagestatistics/44500698#44500698) for clarifications. The important thing is that you install these 2 JARs in the pyspark classpath and that their versions match each other:
+      1. The **hadoop-aws** JAR must match the version of hadoop required by this version of spark. Spark 3.4.0 requires hadoop 3.3.4.
+      2. The **AWS SDK For Java Bundle** JAR must be the version that hadoop-aws was built with, which you can find by looking at its [dependencies](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.4). For hadoop-aws 3.3.4 this is 1.12.262.
+   3. These can be installed by navigating to the pyspark `jars` directory and downloading them, something like the following:
+   ```shell
+   cd venv/lib/python3.12/site-packages/pyspark/jars/
+   curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
+   curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
+   ```
 
 5. From within this repository run the following:
   

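The README change above asks for two JARs whose versions must agree. A minimal sketch of how that could be checked from Python is shown here. It is an illustration, not part of this commit: the function names `find_s3_jars` and `check_s3_jars` are hypothetical, the expected versions are the ones quoted in the README, and the default directory in the usage section is only the example path from the README, which will differ per environment.

```python
import os
import re
import sys

# Versions quoted in the README change; treated here as the expected pair.
EXPECTED = {"hadoop-aws": "3.3.4", "aws-java-sdk-bundle": "1.12.262"}

# Matches e.g. "hadoop-aws-3.3.4.jar" and captures the artifact name and version.
_JAR_PATTERN = re.compile(r"^(hadoop-aws|aws-java-sdk-bundle)-(\d+(?:\.\d+)*)\.jar$")


def find_s3_jars(filenames):
    """Map each S3-related artifact name to the versions found among filenames."""
    found = {name: [] for name in EXPECTED}
    for filename in filenames:
        match = _JAR_PATTERN.match(filename)
        if match:
            found[match.group(1)].append(match.group(2))
    return found


def check_s3_jars(filenames):
    """Return a list of problems; an empty list means the expected pair is present."""
    problems = []
    for name, versions in find_s3_jars(filenames).items():
        if not versions:
            problems.append(f"{name} JAR is missing")
        elif versions != [EXPECTED[name]]:
            problems.append(
                f"{name}: expected {EXPECTED[name]}, found {', '.join(versions)}"
            )
    return problems


if __name__ == "__main__":
    # Assumed default: the example path from the README; pass your own as argv[1].
    jars_dir = (
        sys.argv[1]
        if len(sys.argv) > 1
        else "venv/lib/python3.12/site-packages/pyspark/jars"
    )
    if not os.path.isdir(jars_dir):
        sys.exit(f"not a directory: {jars_dir}")
    issues = check_s3_jars(os.listdir(jars_dir))
    print("\n".join(issues) if issues else "hadoop-aws and AWS SDK bundle look consistent")
```

The check only inspects file names, so it does not prove that Spark can reach S3; it just catches a missing or mismatched JAR before a `NoClassDefFoundError` does.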
diff --git a/model/model-GBM.ipynb b/model/model-GBM.ipynb
diff --git a/model/notebookArchive/model-GBM-20240426.ipynb b/model/notebookArchive/model-GBM-20240426.ipynb
diff --git a/model/pickledModels/README.md b/model/pickledModels/README.md
@@ -0,0 +1,3 @@
+When I first started this project in April 2023, I used sklearn 1.0.2 which I think was because I was on python 3.7 (already a bit old) and locked into some older packages (older versions of pyspark, numpy, pandas, etc). 
+
+Upon returning to this project in April 2024 I had a newer Mac and upgraded to python 3.12.3, which let me upgrade all of the above packages. However, because of this the pickled models could no longer be loaded. Therefore, the sklearn 1.0.2 models have been moved to a separate folder for archival purposes. Newly trained models using sklearn 1.4.2 will be stored in the current directory.
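The note above describes pickles that stopped loading after a library upgrade. One way to make that failure explicit is to record the library version next to each pickle and compare it before unpickling. The sketch below illustrates that idea only; it is not how this repository saves its models. `save_model`, `load_model`, and the `.json` sidecar file are hypothetical names chosen for the example, and the version is passed in as a plain string so the code has no sklearn dependency.

```python
import json
import pickle
from pathlib import Path


def save_model(model, path, library_version):
    """Pickle the model and write the library version to a JSON sidecar file."""
    path = Path(path)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    sidecar = path.with_suffix(".json")  # e.g. model.sav -> model.json
    sidecar.write_text(json.dumps({"library_version": library_version}))


def load_model(path, current_version):
    """Load the pickle only if it was saved with the same library version."""
    path = Path(path)
    meta = json.loads(path.with_suffix(".json").read_text())
    saved_version = meta["library_version"]
    # Check the sidecar first, so a mismatch is reported before unpickling is attempted.
    if saved_version != current_version:
        raise RuntimeError(
            f"{path.name} was saved with version {saved_version}, "
            f"but the current version is {current_version}"
        )
    with open(path, "rb") as f:
        return pickle.load(f)
```

With scikit-learn installed, the version argument would typically be `sklearn.__version__`. Keeping the version in a separate file matters because a pickle that cannot be loaded cannot report its own version.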
diff --git a/model/pickledModels/Reddit_model_20240426-075204_GBM.sav b/model/pickledModels/Reddit_model_20240426-075204_GBM.sav
diff --git a/...odels/Reddit_model_20230414-061009_LR.sav → ...1.0.2/Reddit_model_20230414-061009_LR.sav b/...odels/Reddit_model_20230414-061009_LR.sav → ...1.0.2/Reddit_model_20230414-061009_LR.sav
diff --git a/...dels/Reddit_model_20230503-235329_GBM.sav → ....0.2/Reddit_model_20230503-235329_GBM.sav b/...dels/Reddit_model_20230503-235329_GBM.sav → ....0.2/Reddit_model_20230503-235329_GBM.sav
diff --git a/model/pickledModels/test_latestModel.sav b/model/pickledModels/test_latestModel.sav
diff --git a/pyproject.toml b/pyproject.toml
@@ -10,16 +10,13 @@ dynamic = ["version"]
 
 dependencies = [
     "boto3==1.26.117",
-    "matplotlib==3.8",
     "numpy==1.26",
     "pandas==2.2.2",  # 1.3 at least needed for M1 Mac
     "pg8000==1.29.4",  # this was easier to pip install than psycopg2
     "pyarrow==15.0.2",  # don't use low versions which pin lower versions of numpy that break on M1 Mac
     "pyspark==3.4.0",
     "requests==2.31.0",
     "scikit-learn==1.4.2",
-    "seaborn==0.11.2",
-    "shap==0.41.0",
     "sqlalchemy==1.4.46",  # originally tried 2.0.10, but this was incompatible with old versions of pandas https://stackoverflow.com/a/75282604/5034651,
     "viral_reddit_posts_utils @ git+https://github.com/ViralRedditPosts/Utils.git@main",
     "Reddit-Scraping @ git+https://github.com/ViralRedditPosts/Reddit-Scraping.git@main",
@@ -48,8 +45,12 @@ build = [
     "Reddit-Model[test]"
 ]
 dev = [
+    "matplotlib==3.8",  # packages for plotting and notebook work are only needed in dev
+    "notebook==7.1.3",
     "pre-commit==2.21.0",
     "Reddit-Model[build]",
+    "seaborn==0.11.2",
+    "shap==0.45.0",
 ]
 
 [tool.setuptools.packages.find]