Meeting a problem in Gaussian Mixture clustering part #8

Open
lockbro opened this issue Apr 1, 2020 · 5 comments
lockbro commented Apr 1, 2020

```python
# Gaussian Mixture clustering
from time import time

from pyspark.ml import Pipeline
from pyspark.ml.clustering import GaussianMixture

t0 = time()
gm = GaussianMixture(k=8, maxIter=150, seed=seed, featuresCol="pca_features",
                     predictionCol="cluster", probabilityCol="gm_prob")

gm_pipeline = Pipeline(stages=[pca_slicer, pca, gm])
gm_model = gm_pipeline.fit(scaled_train_df)

gm_train_df = gm_model.transform(scaled_train_df).cache()
gm_cv_df = gm_model.transform(scaled_cv_df).cache()
gm_test_df = gm_model.transform(scaled_test_df).cache()

gm_params = (gm_model.stages[2].gaussiansDF.rdd
             .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
             .collect())
gm_weights = gm_model.stages[2].weights

print(gm_train_df.count())
print(gm_cv_df.count())
print(gm_test_df.count())
print(time() - t0)
```

When I run this part in a Jupyter notebook, an error appears:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in
     14
     15 gm_params = (gm_model.stages[2].gaussiansDF.rdd
---> 16              .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
     17              .collect())
     18 gm_weights = gm_model.stages[2].weights

C:\Spark\python\pyspark\rdd.py in collect(self)
    813         to be small, as all the data is loaded into the driver's memory.
    814         """
--> 815         with SCCallSiteSync(self.context) as css:
    816             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    817         return list(_load_from_socket(sock_info, self._jrdd_deserializer))

C:\Spark\python\pyspark\traceback_utils.py in enter(self)
     70     def enter(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
---> 72             self._context._jsc.setCallSite(self._call_site)
     73         SCCallSiteSync._spark_stack_depth += 1
     74

AttributeError: 'NoneType' object has no attribute 'setCallSite'
```

I did some research, but there are few answers; some people say it's a bug in Spark itself. By the way, I didn't use the Docker image but built an "Anaconda 3.7.6 + pyspark 2.4.5" environment to run this code.

Can you please help me solve the problem? Thank you very much!
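(For reference, the failing `.rdd.map` step is only a driver-side conversion of a tiny table; a workaround sometimes suggested for this class of error is to collect `gaussiansDF` as a DataFrame and do the conversion in plain Python, avoiding the RDD path entirely. The sketch below shows just that conversion with hypothetical `FakeDense` stand-ins for the `pyspark.ml.linalg` objects, so it runs without a Spark session; in Spark, `gm_model.stages[2].gaussiansDF.collect()` would supply `rows`.)

```python
import numpy as np

# Hypothetical stand-in for pyspark.ml.linalg DenseVector / DenseMatrix;
# only the .toArray() method used in the snippet above is mirrored here.
class FakeDense:
    def __init__(self, values):
        self._values = np.asarray(values, dtype=float)

    def toArray(self):
        return self._values

# Two fake mixture components, shaped like rows of gaussiansDF ('mean', 'cov')
rows = [
    {"mean": FakeDense([0.0, 1.0]), "cov": FakeDense([[1.0, 0.0], [0.0, 1.0]])},
    {"mean": FakeDense([2.0, 3.0]), "cov": FakeDense([[2.0, 0.5], [0.5, 2.0]])},
]

# Driver-side conversion: a plain list comprehension over collected rows,
# equivalent to the .rdd.map(...).collect() call but without the RDD API.
gm_params = [[row["mean"].toArray(), row["cov"].toArray()] for row in rows]

print(len(gm_params))          # number of components
print(gm_params[0][0].shape)   # mean vector shape
print(gm_params[0][1].shape)   # covariance matrix shape
```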

thinline72 (Owner)

Hi @lockbro ,
I haven't updated this repo for a while, but I just pushed a commit with simplified instructions to build and run a Docker image with pyspark installed. It uses pyspark 2.4.5 too. I've run the notebook end-to-end without hitting any issues, so I'd advise you to just use the Docker container.

So you just need to run the make nsl-kdd-pyspark command. It will download the latest jupyter/pyspark-notebook Docker image, start a container with Jupyter on port 8889, and print the current Jupyter token after 15 seconds (to make sure Jupyter has had enough time to start).
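(For readers without make, a rough hand-rolled equivalent of what that target does, per the description above. The internal port 8888 and the logs-based token lookup are assumptions about the standard jupyter/pyspark-notebook image, not taken from this repo's Makefile.)

```shell
# Pull and start the stock pyspark notebook image, mapping the container's
# default Jupyter port (assumed 8888) to 8889 on the host.
docker run -d --name nsl-kdd-pyspark -p 8889:8888 jupyter/pyspark-notebook

# Give Jupyter a moment to start, then read the login token from the logs.
sleep 15
docker logs nsl-kdd-pyspark 2>&1 | grep token
```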

Hope it'll help!


lockbro commented Apr 2, 2020

I haven't figured out how to use the Docker image yet, so I'm wondering why this code can't run smoothly in my own environment (Win7 32-bit + Anaconda 3.7 + Pyspark 2.4.5).
(My computer can't run Docker because the OS is Win7 32-bit :( )

thinline72 (Owner)

@lockbro I'm sorry, but I don't use Windows as my main platform, so I'm not able to help here.
I'd suggest just skipping the Gaussian Mixture part if you cannot run it in the cloud or on a different machine.


lockbro commented Apr 2, 2020

Can I ask what platform and which specific edition you used? I want to use a virtual machine to run the code.


thinline72 commented Apr 2, 2020

As I mentioned above, I'm running it in the Docker container, which is already a kind of VM, so I can run it from my MacBook or from my PC with Linux.

If you're interested in which OS is used inside that Docker image, please check here.
It seems to be Ubuntu 18.04 (bionic).
