Problem in the Gaussian Mixture clustering part #8
Hi @lockbro, so you just need to run … Hope it helps!
I haven't figured out how to use the Docker image yet, so I'm wondering why this code won't run smoothly in my own environment (Win7 32-bit + Anaconda 3.7 + PySpark 2.4.5).
@lockbro I'm sorry, but I don't use Windows as the main platform for my work, so I'm not able to help here.
Can I ask what platform and which specific edition you used? I want to use a virtual machine to run the code.
As I mentioned above, I'm running it in a Docker container, which is already a kind of VM, so I can run it from my MacBook or from my PC with Linux. If you are interested in which OS is used inside that Docker image, please check here.
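As a side note for anyone trying to reproduce either setup, here is a minimal sanity check (not from the thread; the app name and `local[*]` master are illustrative) that confirms a local PySpark install can start a session and run a job at all:

```python
# Minimal sanity check for a local PySpark install (illustrative, not the
# project's own code): start a session, run a trivial job, print versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
print(spark.version)                                     # Spark version in use
print(spark.sparkContext.parallelize(range(10)).sum())   # trivial job: prints 45
spark.stop()
```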
```python
# Gaussian Mixture clustering
from time import time

from pyspark.ml import Pipeline
from pyspark.ml.clustering import GaussianMixture

t0 = time()

# pca_slicer, pca, seed and the scaled_*_df DataFrames are defined in
# earlier cells of the notebook.
gm = GaussianMixture(k=8, maxIter=150, seed=seed, featuresCol="pca_features",
                     predictionCol="cluster", probabilityCol="gm_prob")
gm_pipeline = Pipeline(stages=[pca_slicer, pca, gm])
gm_model = gm_pipeline.fit(scaled_train_df)

gm_train_df = gm_model.transform(scaled_train_df).cache()
gm_cv_df = gm_model.transform(scaled_cv_df).cache()
gm_test_df = gm_model.transform(scaled_test_df).cache()

# Pull the per-component means and covariances back to the driver.
gm_params = (gm_model.stages[2].gaussiansDF.rdd
             .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
             .collect())
gm_weights = gm_model.stages[2].weights

print(gm_train_df.count())
print(gm_cv_df.count())
print(gm_test_df.count())
print(time() - t0)
```
When I run this part in a Jupyter notebook, an error appears:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
     14
     15 gm_params = (gm_model.stages[2].gaussiansDF.rdd
---> 16              .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
     17              .collect())
     18 gm_weights = gm_model.stages[2].weights

C:\Spark\python\pyspark\rdd.py in collect(self)
    813         to be small, as all the data is loaded into the driver's memory.
    814         """
--> 815         with SCCallSiteSync(self.context) as css:
    816             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    817             return list(_load_from_socket(sock_info, self._jrdd_deserializer))

C:\Spark\python\pyspark\traceback_utils.py in __enter__(self)
     70     def __enter__(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
---> 72             self._context._jsc.setCallSite(self._call_site)
     73         SCCallSiteSync._spark_stack_depth += 1
     74

AttributeError: 'NoneType' object has no attribute 'setCallSite'
```
I did some research, but there are few answers; some people said it's a Spark bug. By the way, I didn't use the Docker image but built an "Anaconda 3.7.6 + PySpark 2.4.5" environment to run this code.
Can you please help me solve this problem? Thank you very much!
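For reference, this `AttributeError` is typically what you see when the SparkContext behind the notebook has already been stopped (for example, the JVM died under memory pressure, which is easy to hit on a 32-bit Windows setup): once that happens, `sc._jsc` is `None`, and every cached DataFrame or model bound to the old context fails the same way. A hedged sketch of how to check and recover, not from the thread, assuming a `spark` session variable already exists in the notebook:

```python
# Hedged sketch: the AttributeError means self._context._jsc is None, i.e.
# the SparkContext was stopped before the collect().
from pyspark.sql import SparkSession

print(spark.sparkContext._jsc)  # None here confirms the old context is dead

# Rebuild the session; getOrCreate() hands back a fresh one once the old
# context is gone. Everything created on the dead context (the cached
# DataFrames, gm_model) must be recreated, so re-run the pipeline cells.
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.parallelize([1]).count())  # liveness check: prints 1
```

If the liveness check passes but the error returns on the next `collect()`, the context is being killed mid-job, which points back at the environment (driver memory limits on 32-bit Python) rather than at the clustering code itself.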