[notebook] MLlib: machine learning on PySpark

Today I gave a tutorial on MLlib in PySpark. I am posting the notebook here for anyone who might be interested =)

MLlib is a package of Spark (also available in PySpark), so no extra installation is needed once you have Spark up and running. It contains several sub-packages that are useful for machine learning on big data.

In this lab we will look at parts of Statistics, Regression, Classification, and Clustering. The documentation often comes with examples, so I encourage you to take a look: MLlib on PySpark

Dataset

In this lab, we will use data about the 2016 US Presidential elections. The data is available on Kaggle: here
