Problem in the Gaussian Mixture clustering part #8
Hi @lockbro, so you just need to run … Hope it helps!
I haven't figured out how to use the Docker image yet, so I'm wondering why this code won't run smoothly in my own environment (Win7 32-bit + Anaconda 3.7 + PySpark 2.4.5).
@lockbro I'm sorry, but I don't use Windows as the main platform for my work, so I'm not able to help here.
Can I ask what platform and which specific edition you used? I want to use a virtual machine to run the code.
As I mentioned above, I'm running it in a Docker container, which is already a kind of VM, so I can run it from my MacBook or from my PC with Linux. If you are interested in which OS is used inside that Docker image, please check here.
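As a side note for anyone trying to reproduce either setup, here is a minimal sanity check (not from the thread; the app name and `local[*]` master are illustrative) that confirms a local PySpark install can start a session and run a job at all:

```python
# Minimal sanity check for a local PySpark install (illustrative, not the
# project's own code): start a session, run a trivial job, print versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
print(spark.version)                                     # Spark version in use
print(spark.sparkContext.parallelize(range(10)).sum())   # trivial job: prints 45
spark.stop()
```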
```python
# Gaussian Mixture clustering
from time import time

from pyspark.ml import Pipeline
from pyspark.ml.clustering import GaussianMixture

t0 = time()

# pca_slicer, pca, seed and the scaled_*_df DataFrames are defined in
# earlier cells of the notebook.
gm = GaussianMixture(k=8, maxIter=150, seed=seed, featuresCol="pca_features",
                     predictionCol="cluster", probabilityCol="gm_prob")
gm_pipeline = Pipeline(stages=[pca_slicer, pca, gm])
gm_model = gm_pipeline.fit(scaled_train_df)

gm_train_df = gm_model.transform(scaled_train_df).cache()
gm_cv_df = gm_model.transform(scaled_cv_df).cache()
gm_test_df = gm_model.transform(scaled_test_df).cache()

# Pull the per-component means and covariances back to the driver.
gm_params = (gm_model.stages[2].gaussiansDF.rdd
             .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
             .collect())
gm_weights = gm_model.stages[2].weights

print(gm_train_df.count())
print(gm_cv_df.count())
print(gm_test_df.count())
print(time() - t0)
```
When I run this part in a Jupyter notebook, an error appears:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
     14
     15 gm_params = (gm_model.stages[2].gaussiansDF.rdd
---> 16              .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
     17              .collect())
     18 gm_weights = gm_model.stages[2].weights

C:\Spark\python\pyspark\rdd.py in collect(self)
    813         to be small, as all the data is loaded into the driver's memory.
    814         """
--> 815         with SCCallSiteSync(self.context) as css:
    816             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    817             return list(_load_from_socket(sock_info, self._jrdd_deserializer))

C:\Spark\python\pyspark\traceback_utils.py in __enter__(self)
     70     def __enter__(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
---> 72             self._context._jsc.setCallSite(self._call_site)
     73         SCCallSiteSync._spark_stack_depth += 1
     74

AttributeError: 'NoneType' object has no attribute 'setCallSite'
```
I did some research, but there are few answers; some people said it's a Spark bug. By the way, I didn't use the Docker image but built an "Anaconda 3.7.6 + PySpark 2.4.5" environment to run this code.
Can you please help me solve this problem? Thank you very much!
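For reference, this `AttributeError` is typically what you see when the SparkContext behind the notebook has already been stopped (for example, the JVM died under memory pressure, which is easy to hit on a 32-bit Windows setup): once that happens, `sc._jsc` is `None`, and every cached DataFrame or model bound to the old context fails the same way. A hedged sketch of how to check and recover, not from the thread, assuming a `spark` session variable already exists in the notebook:

```python
# Hedged sketch: the AttributeError means self._context._jsc is None, i.e.
# the SparkContext was stopped before the collect().
from pyspark.sql import SparkSession

print(spark.sparkContext._jsc)  # None here confirms the old context is dead

# Rebuild the session; getOrCreate() hands back a fresh one once the old
# context is gone. Everything created on the dead context (the cached
# DataFrames, gm_model) must be recreated, so re-run the pipeline cells.
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.parallelize([1]).count())  # liveness check: prints 1
```

If the liveness check passes but the error returns on the next `collect()`, the context is being killed mid-job, which points back at the environment (driver memory limits on 32-bit Python) rather than at the clustering code itself.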