I have prepared a GitHub repository that provides a set of self-study tutorials on machine learning for big data using Apache Spark (PySpark), from basic topics (DataFrames and SQL) to advanced ones (the Machine Learning library, MLlib), with practical real-world projects and datasets.
https://github.com/iamirmasoud/pyspark_tutorials
Note: I have tested the code on Linux. It should also run on Windows and macOS with minor changes.
Clone the repository and create a conda environment named spark_env with Python 3.7:

```shell
git clone https://github.com/iamirmasoud/pyspark_tutorials.git
cd pyspark_tutorials
conda create -n spark_env python=3.7.10
source activate spark_env
```

If prompted to proceed with the install (Proceed [y]/n), type y. At this point your command line should look something like: (spark_env) <User>:pyspark_tutorials <user>$. The (spark_env) indicates that your environment has been activated, and you can proceed with further package installations:

```shell
pip install -r requirements.txt
```

Finally, start a Jupyter server from the repository directory:

```shell
cd pyspark_tutorials
jupyter notebook
```

Inside a notebook, switch to the spark_env environment by clicking Kernel > Change Kernel > spark_env.
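To verify the setup before opening the notebooks, a minimal smoke test along these lines should print a two-row DataFrame (a sketch of my own; not taken from the repository code):

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session using all available cores
spark = SparkSession.builder.master("local[*]").appName("sanity_check").getOrCreate()

# Build a tiny DataFrame and display it
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

spark.stop()
```

The repository is organized as follows: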
├── 1_Python vs PySpark
│ └── 1_Python vs PySpark
│ ├── Datasets
│ ├── Python vs PySpark [PySpark].ipynb
│ ├── Python vs PySpark [PySpark].py
│ ├── Python vs PySpark [Python].ipynb
│ └── Python vs PySpark [Python].py
├── 2_IO_Filter_SQL
│ ├── 1_Read_Write_and_Validate_Data
│ │ ├── Datasets
│ │ ├── parquet
│ │ ├── partitioned_parquet
│ │ ├── partition_parquet
│ │ ├── part_parquet
│ │ ├── Read_Write_and_Validate_Data_HW.ipynb
│ │ ├── Read_Write_and_Validate_Data_HW.py
│ │ ├── Read_Write_and_Validate_Data_HW_Solutions.ipynb
│ │ ├── Read_Write_and_Validate_Data_HW_Solutions.py
│ │ ├── Read_Write_and_Validate_Data.ipynb
│ │ ├── Read_Write_and_Validate_Data.py
│ │ └── write_test2.csv
│ ├── 2_Search_and_Filter_DataFrames_in_PySpark
│ │ ├── Datasets
│ │ ├── Search and Filter DataFrames in PySpark-HW.ipynb
│ │ ├── Search and Filter DataFrames in PySpark-HW.py
│ │ ├── Search and Filter DataFrames in PySpark-HW-Solutions.ipynb
│ │ ├── Search and Filter DataFrames in PySpark-HW-Solutions.py
│ │ ├── Search and Filter DataFrames in PySpark.ipynb
│ │ └── Search and Filter DataFrames in PySpark.py
│ └── 3_SQL_Options_in_Spark
│ ├── Datasets
│ ├── SQL_Options_in_Spark_HW.ipynb
│ ├── SQL_Options_in_Spark_HW.py
│ ├── SQL_Options_in_Spark_HW_Solutions.ipynb
│ ├── SQL_Options_in_Spark_HW_Solutions.py
│ ├── SQL_Options_in_Spark.ipynb
│ └── SQL_Options_in_Spark.py
├── 3_Manipulation_Aggregation
│ ├── 1_Manipulating_Data_in_DataFrames
│ │ ├── Datasets
│ │ ├── Manipulating_Data_in_DataFrames_HW.ipynb
│ │ ├── Manipulating_Data_in_DataFrames_HW.py
│ │ ├── Manipulating_Data_in_DataFrames_HW_Solutions.ipynb
│ │ ├── Manipulating_Data_in_DataFrames_HW_Solutions.py
│ │ ├── Manipulating_Data_in_DataFrames.ipynb
│ │ └── Manipulating_Data_in_DataFrames.py
│ ├── 2_Aggregating_DataFrames
│ │ ├── Aggregating_DataFrames_in_PySpark_HW.ipynb
│ │ ├── Aggregating_DataFrames_in_PySpark_HW.py
│ │ ├── Aggregating_DataFrames_in_PySpark_HW_Solutions.ipynb
│ │ ├── Aggregating_DataFrames_in_PySpark_HW_Solutions.py
│ │ ├── Aggregating_DataFrames_in_PySpark.ipynb
│ │ ├── Aggregating_DataFrames_in_PySpark.py
│ │ └── Datasets
│ ├── 3_Joining_and_Appending_DataFrames
│ │ ├── Datasets
│ │ ├── Joining_and_Appending_DataFrames_in_PySpark_HW.ipynb
│ │ ├── Joining_and_Appending_DataFrames_in_PySpark_HW.py
│ │ ├── Joining_and_Appending_DataFrames_in_PySpark_HW_Solutions.ipynb
│ │ ├── Joining_and_Appending_DataFrames_in_PySpark_HW_Solutions.py
│ │ ├── Joining_and_Appending_DataFrames_in_PySpark.ipynb
│ │ └── Joining_and_Appending_DataFrames_in_PySpark.py
│ ├── 4_Handling_Missing_Data
│ │ ├── Datasets
│ │ ├── Handling_Missing_Data_in_PySpark_HW.ipynb
│ │ ├── Handling_Missing_Data_in_PySpark_HW.py
│ │ ├── Handling_Missing_Data_in_PySpark_HW_Solutions.ipynb
│ │ ├── Handling_Missing_Data_in_PySpark_HW_Solutions.py
│ │ ├── Handling_Missing_Data_in_PySpark.ipynb
│ │ └── Handling_Missing_Data_in_PySpark.py
│ └── 5_PySpark_Dataframe_Basics
│ ├── Datasets
│ ├── PySpark_Dataframe_Basics_MASTER.ipynb
│ └── PySpark_Dataframe_Basics_MASTER.py
├── 4_Classification_in_PySparks_MLlib
│ ├── 1_Classification_in_PySparks_MLlib
│ │ ├── Classification_in_PySparks_MLlib_with_functions.ipynb
│ │ ├── Classification_in_PySparks_MLlib_with_functions.py
│ │ ├── Classification_in_PySparks_MLlib_without_functions.ipynb
│ │ ├── Classification_in_PySparks_MLlib_without_functions.py
│ │ └── Datasets
│ ├── 2_Classification_in_PySparks_MLlib_with_MLflow
│ │ ├── Classification_in_PySparks_MLlib_with_MLflow.ipynb
│ │ ├── Classification_in_PySparks_MLlib_with_MLflow.py
│ │ └── Datasets
│ └── 3_Classification_in_PySparks_MLlib_Project
│ ├── Classification_in_PySparks_MLlib_Project.ipynb
│ ├── Classification_in_PySparks_MLlib_Project.py
│ ├── Classification_in_PySparks_MLlib_Project_Solution.ipynb
│ ├── Classification_in_PySparks_MLlib_Project_Solution.py
│ └── Datasets
├── 5_NLP_in_Pysparks_MLlib
│ ├── 1_NLP_in_Pysparks_MLlib
│ │ ├── Datasets
│ │ ├── NLP_in_Pysparks_MLlib.ipynb
│ │ └── NLP_in_Pysparks_MLlib.py
│ └── 2_NLP_in_Pysparks_MLlib_Project
│ ├── Datasets
│ ├── NLP_in_Pysparks_MLlib_Project.ipynb
│ ├── NLP_in_Pysparks_MLlib_Project.py
│ ├── NLP_in_Pysparks_MLlib_Project_Solution.ipynb
│ └── NLP_in_Pysparks_MLlib_Project_Solution.py
├── 6_Regression_in_Pysparks_MLlib
│ ├── 1_Regression_in_Pysparks_MLlib
│ │ ├── Datasets
│ │ ├── Regression_in_Pysparks_MLlib_with_functions.ipynb
│ │ ├── Regression_in_Pysparks_MLlib_with_functions.py
│ │ ├── Regression_in_Pysparks_MLlib_without_functions.ipynb
│ │ └── Regression_in_Pysparks_MLlib_without_functions.py
│ └── 2_Regression_in_Pysparks_MLlib_Project
│ ├── Datasets
│ ├── Regression_in_Pysparks_MLlib_Project.ipynb
│ ├── Regression_in_Pysparks_MLlib_Project.py
│ ├── Regression_in_Pysparks_MLlib_Project_Solution.ipynb
│ └── Regression_in_Pysparks_MLlib_Project_Solution.py
├── 7_Unsupervised_Learning_in_Pyspark_MLlib
│ ├── 1_Kmeans_and_Bisecting_Kmeans_in_Pysparks_MLlib
│ │ ├── Datasets
│ │ ├── Kmeans_and_Bisecting_Kmeans_in_Pysparks_MLlib.ipynb
│ │ └── Kmeans_and_Bisecting_Kmeans_in_Pysparks_MLlib.py
│ ├── 2_LDA_in_PySpark_MLlib
│ │ ├── Datasets
│ │ ├── LDA_in_PySpark_MLlib.ipynb
│ │ └── LDA_in_PySpark_MLlib.py
│ ├── 3_GaussianMixture_in_Pysparks_MLlib
│ │ ├── Datasets
│ │ ├── GaussianMixture_in_Pysparks_MLlib.ipynb
│ │ └── GaussianMixture_in_Pysparks_MLlib.py
│ └── 4_Clustering_in_Pysparks_MLlib_Project
│ ├── Clustering_in_Pysparks_MLlib_Project.ipynb
│ ├── Clustering_in_Pysparks_MLlib_Project.py
│ ├── Clustering_in_Pysparks_MLlib_Project_Solution.ipynb
│ ├── Clustering_in_Pysparks_MLlib_Project_Solution.py
│ └── Datasets
├── 8_Frequent_Pattern_Mining_in_PySparks_MLlib
│ ├── 1_Frequent_Pattern_Mining_in_PySparks_MLlib
│ │ ├── Datasets
│ │ ├── Frequent_Pattern_Mining_in_PySparks_MLlib.ipynb
│ │ └── Frequent_Pattern_Mining_in_PySparks_MLlib.py
│ └── 2_Frequent_Pattern_Mining_in_PySparks_MLlib_Project
│ ├── Datasets
│ ├── Frequent_Pattern_Mining_in_PySparks_MLlib_Project.ipynb
│ ├── Frequent_Pattern_Mining_in_PySparks_MLlib_Project.py
│ ├── Frequent_Pattern_Mining_in_PySparks_MLlib_Project_Solution.ipynb
│ └── Frequent_Pattern_Mining_in_PySparks_MLlib_Project_Solution.py
Before running the Python scripts and Jupyter notebooks of each section, please download the necessary datasets for that section and put them in a directory called Datasets next to the scripts. You can find more details about each dataset in the Jupyter notebook files.
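For illustration, once a section's datasets are in place, the scripts read them with relative paths along these lines (the file name below is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_example").getOrCreate()

# Read a CSV from the Datasets directory next to the script/notebook;
# header treats the first row as column names, inferSchema guesses column types.
df = spark.read.csv("Datasets/some_dataset.csv", header=True, inferSchema=True)
df.printSchema()
```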
Project – Genre classification:
Have you ever wondered what makes us, humans, able to tell apart two songs of different genres? How do we inherently know the difference between a pop song and heavy metal? This type of classification may seem easy for us, but it is a very difficult challenge for a computer. So the question is: could an automatic genre classification model be possible? For this project, we will classify songs into a set of 23 electronic genres based on a number of characteristics. This technology could be used by an application like Pandora to recommend songs to users or to create meaningful channels. Super fun!
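As a rough sketch of the kind of multi-class MLlib pipeline this project builds (the column names and toy rows below are hypothetical, not the actual dataset):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("genre_sketch").getOrCreate()

# Hypothetical stand-in for the real songs dataset
df = spark.createDataFrame(
    [(120.0, 0.8, 0.6, "techno"), (90.0, 0.3, 0.9, "house"), (140.0, 0.9, 0.4, "trance")],
    ["tempo", "energy", "danceability", "genre"],
)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="genre", outputCol="label"),  # genre string -> numeric label
    VectorAssembler(inputCols=["tempo", "energy", "danceability"], outputCol="features"),
    LogisticRegression(),  # multinomial by default when there are more than two classes
])
model = pipeline.fit(df)
model.transform(df).select("genre", "prediction").show()
```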
Datasets:
Project – Kickstarter Project Success Prediction:
Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform, focused on creativity and merchandising. The company’s stated mission is to “help bring creative projects to life”. Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology, and food-related projects.
People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This model traces its roots to the subscription model of arts patronage, where artists would go directly to their audiences to fund their work.
The goal is to predict whether or not a project will succeed in collecting the pledged money from its backers.
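Since the outcome is binary, a natural first step is deriving a 0/1 label from the project's final state. A minimal sketch, assuming a hypothetical "state" column with a "successful" value:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kickstarter_sketch").getOrCreate()

# Hypothetical stand-in for the real Kickstarter dataset
df = spark.createDataFrame(
    [("Film about cats", 5000.0, "successful"), ("Indie game", 20000.0, "failed")],
    ["name", "goal", "state"],
)

# Map the project outcome to a 0/1 target for a binary classifier
labeled = df.withColumn("label", F.when(F.col("state") == "successful", 1).otherwise(0))
labeled.groupBy("label").count().show()  # check class balance before training
```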
Datasets:
Project – Indeed Real/Fake Job Posting Prediction:
Indeed.com has just hired you to create a system that automatically flags suspicious job postings on its website. It has recently seen an influx of fake job postings that are negatively impacting its customer experience. Because of the high volume of job postings it receives every day, its employees don't have the capacity to check every posting, so they would like an automated system that prioritizes which postings to review before deleting them. The final task is to use the attached dataset to create an NLP algorithm that automatically flags suspicious posts for review.
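A typical MLlib text-classification pipeline for this kind of task might look as follows (a sketch with toy data; the actual notebook's feature choices may differ):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("fake_jobs_sketch").getOrCreate()

# Hypothetical stand-in for the real job-postings dataset (text, fraudulent flag)
df = spark.createDataFrame(
    [("earn money fast from home", 1.0), ("senior data engineer, on-site, benefits", 0.0)],
    ["description", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="description", outputCol="words"),      # split text into tokens
    StopWordsRemover(inputCol="words", outputCol="filtered"),  # drop common words
    HashingTF(inputCol="filtered", outputCol="tf"),            # term frequencies
    IDF(inputCol="tf", outputCol="features"),                  # down-weight ubiquitous terms
    LogisticRegression(),                                      # flag suspicious postings
])
model = pipeline.fit(df)
model.transform(df).select("description", "prediction").show(truncate=False)
```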
Datasets:
Project – House Price Prediction in California:
Datasets:
Project – Cement Strength Prediction based on Ingredients:
You have been hired as a consultant to a cement production company that wants to improve its customer experience in a number of areas, such as providing recommendations to customers on optimal amounts of certain ingredients in the cement-making process, and perhaps even creating an application where users can input their own values and receive a predicted cement strength!
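In MLlib terms this is a regression task: assemble the ingredient columns into a feature vector and fit a regressor. A minimal sketch with made-up rows and hypothetical column names:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("cement_sketch").getOrCreate()

# Hypothetical stand-in for the real cement dataset (ingredient amounts -> strength)
df = spark.createDataFrame(
    [(540.0, 162.0, 28, 79.99), (332.5, 228.0, 270, 40.27), (198.6, 192.0, 360, 44.30)],
    ["cement", "water", "age", "strength"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["cement", "water", "age"], outputCol="features"),
    LinearRegression(labelCol="strength"),  # predict strength from the ingredient mix
])
model = pipeline.fit(df)
model.transform(df).select("strength", "prediction").show()
```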
Datasets:
Project – Customer Segmentation:
Use customer data to target marketing efforts! We could use clustering to identify similar customer segments. For example, if we research the groups and discover that one mostly shares a certain socioeconomic status and purchasing frequency, we could offer it a cost-savings package that would be beneficial to it. How cool would that be?!
We could also learn a bit more about our clusters by calling on various aggregate statistics for each cluster across each of the variables in our dataframe, like this.
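A minimal sketch of that, with hypothetical columns (KMeans is just one possible clustering choice):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("segmentation_sketch").getOrCreate()

# Hypothetical stand-in for the real customer dataset
df = spark.createDataFrame(
    [(35000.0, 12), (82000.0, 3), (40000.0, 10), (90000.0, 2)],
    ["income", "purchase_freq"],
)

features = VectorAssembler(inputCols=["income", "purchase_freq"], outputCol="features").transform(df)
clustered = KMeans(k=2, seed=1).fit(features).transform(features)

# Aggregate statistics for each cluster across each variable
clustered.groupBy("prediction").agg(
    F.avg("income").alias("avg_income"),
    F.avg("purchase_freq").alias("avg_purchase_freq"),
    F.count("*").alias("n_customers"),
).show()
```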
Datasets:
Project – Topic Modeling for Cooking Recipes from BBC Good Food:
We will be analyzing a collection of Christmas cooking recipes scraped from BBC Good Food. We will try to discover some additional themes among these recipes, imagining that we want to create our own website with a more intelligent tagging system for recipes pulled from multiple data sources.
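MLlib's LDA is the usual tool for this kind of topic modeling. A minimal sketch with toy recipe texts (the real notebook works on the scraped corpus):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("recipes_sketch").getOrCreate()

# Hypothetical stand-in for the scraped recipe texts
df = spark.createDataFrame(
    [("roast turkey with sage and butter",),
     ("chocolate yule log with cream",),
     ("mulled wine with cinnamon and orange",)],
    ["recipe"],
)

tokens = Tokenizer(inputCol="recipe", outputCol="words").transform(df)
cv_model = CountVectorizer(inputCol="words", outputCol="features").fit(tokens)
counted = cv_model.transform(tokens)

# Fit a topic model and inspect the top terms per topic
lda_model = LDA(k=2, maxIter=20).fit(counted)
lda_model.describeTopics(3).show(truncate=False)
```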
Datasets:
Project – Customer Segmentation based on sales:
Datasets:
Project – University Clustering for the Greater Good:
You are a data scientist employed by the ABCDE Foundation, a non-profit organization whose mission is to increase college graduation rates for underprivileged populations. Through advocacy and targeted outreach programs, ABCDE strives to identify and alleviate barriers to educational achievement. ABCDE is driven by the belief that with the right support, an increase in college attendance and completion rates can be achieved, thereby weakening the grip of the cycles of poverty and social immobility affecting many of our communities. ABCDE is committed to developing a more data-driven approach to decision-making. As a prelude to future analyses, ABCDE has requested that you analyze the data to identify clusters of similar colleges and universities.
Your task is to use cluster analysis to identify the groups of characteristically similar schools in the dataset.
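A sketch of one reasonable approach, with hypothetical columns: scale the features so no single variable dominates the distance metric, fit KMeans, and score the result with a silhouette:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("universities_sketch").getOrCreate()

# Hypothetical stand-in for the real universities dataset
df = spark.createDataFrame(
    [(0.62, 25000.0), (0.91, 52000.0), (0.55, 21000.0), (0.88, 48000.0)],
    ["grad_rate", "tuition"],
)

assembled = VectorAssembler(inputCols=["grad_rate", "tuition"], outputCol="raw").transform(df)
# Scale features so tuition (large values) does not dominate graduation rate
scaled = StandardScaler(inputCol="raw", outputCol="features").fit(assembled).transform(assembled)

clustered = KMeans(k=2, seed=1).fit(scaled).transform(scaled)
# Silhouette closer to 1 means better-separated clusters
print(ClusteringEvaluator().evaluate(clustered))
```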
Datasets:
Project – Analyzing Participants in a Personality Test:
Datasets:
Project – Market Basket Analysis:
You own a supermarket mall and, through membership cards, have some basic data about your customers, such as customer ID, age, gender, annual income, and spending score. The spending score is something you assign to each customer based on defined parameters like customer behavior and purchasing data. You want to identify the customers who can easily be grouped together, so that a strategy can be provided to the marketing team to plan accordingly.
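Given the market-basket framing, MLlib's FPGrowth can surface frequently co-purchased item sets. A minimal sketch, assuming transactions arrive as rows of item lists (the data here is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("basket_sketch").getOrCreate()

# Hypothetical stand-in for the real transactions data: one basket of items per row
df = spark.createDataFrame(
    [(0, ["milk", "bread", "eggs"]), (1, ["bread", "butter"]), (2, ["milk", "bread", "butter"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.5).fit(df)
fp.freqItemsets.show()      # frequently co-purchased item sets
fp.associationRules.show()  # rules like {bread} -> {butter}
```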
Datasets: