YouTube View Prediction with Machine Learning

Jingzong Wang

2020-03

Libraries and Setup

In [ ]:
! sudo apt install openjdk-8-jdk
! sudo update-alternatives --config java
In [ ]:
import numpy as np 
import pandas as pd 
import json
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from datetime import datetime
import glob
import seaborn as sns
import re
import os
In [ ]:
import boto3
from botocore import UNSIGNED

from botocore.config import Config

s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
s3.Bucket('penn-cis545-files').download_file('youtube_data.zip', 'youtube_data.zip')

!unzip /content/youtube_data.zip

Section 1 : Machine Learning with Sklearn

1.1 Data loading and Preprocessing

The dataset is a daily record of the top trending YouTube videos.

To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.

This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for numerous countries, with up to 200 listed trending videos per day.

Each region’s data is in a separate file. Data includes:

  • Video Title
  • Channel title
  • Publish time
  • Tags
  • Views
  • Likes
  • Dislikes
  • Description
  • Comment count

The data also includes a category_id field, which varies between regions.

1.1.1: Combining Multiple CSV's

There are multiple csv files in the dataset, each corresponding to a specific country. So, the first step is to read them and combine these csv files into a single dataframe.

In [0]:
# Import all the csv files
files = [i for i in glob.glob('/content/youtube_data/*.csv'.format('csv'))]
sorted(files)
Out[0]:
['/content/youtube_data/CAvideos.csv',
 '/content/youtube_data/FRvideos.csv',
 '/content/youtube_data/INvideos.csv',
 '/content/youtube_data/USvideos.csv']
In [0]:
# Combine all into a single dataframe "combined_data" and add a 'country' column. 
all_dataframes = list()
country=['CA','FR','IN','US']
i=0
for csv in files:
  all_dataframes = all_dataframes+[pd.read_csv(csv,header=[0]).set_index('video_id')]
  all_dataframes[i]['country']=country[i]
  i=i+1
combined_data = pd.concat(all_dataframes)
In [0]:
combined_data
Out[0]:
trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description country
video_id
2kyS6SvSYSE 17.14.11 WE WANT TO TALK ABOUT OUR MARRIAGE CaseyNeistat 22 2017-11-13T17:13:01.000Z SHANtell martin 748374 57527 2966 15954 https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg False False False SHANTELL'S CHANNEL - https://www.youtube.com/s... CA
1ZAPwfrtAFY 17.14.11 The Trump Presidency: Last Week Tonight with J... LastWeekTonight 24 2017-11-13T07:30:00.000Z last week tonight trump presidency|"last week ... 2418783 97185 6146 12703 https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg False False False One year after the presidential election, John... CA
5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 23 2017-11-12T19:05:24.000Z racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146033 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http... CA
puqaWrEC7tY 17.14.11 Nickelback Lyrics: Real or Fake? Good Mythical Morning 24 2017-11-13T11:00:04.000Z rhett and link|"gmm"|"good mythical morning"|"... 343168 10172 666 2146 https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg False False False Today we find out if Link is a Nickelback amat... CA
d380meD0W0M 17.14.11 I Dare You: GOING BALD!? nigahiga 24 2017-11-12T18:01:41.000Z ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095731 132235 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg False False False I know it's been a while since we did this sho... CA
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
iNHecA3PJCo 18.14.06 फेकू आशिक़ - राजस्थान की सबसे शानदार कॉमेडी | ... RDC Rajasthani 23 2018-06-13T08:01:11.000Z twinkle vaishnav comedy|"twinkle vaishnav"|"tw... 214378 3291 404 196 https://i.ytimg.com/vi/iNHecA3PJCo/default.jpg False False False PRG Music & RDC Rajasthani presents फेकू आशिक़... US
dpPmPbhcslM 18.14.06 Seetha | Flowers | Ep# 364 Flowers TV 24 2018-06-13T11:30:04.000Z flowers serials|"actress"|"malayalam serials"|... 406828 1726 478 1428 https://i.ytimg.com/vi/dpPmPbhcslM/default.jpg False False False Flowers - A R Rahman Show,Book your Tickets He... US
mV6aztP58f8 18.14.06 Bhramanam I Episode 87 - 12 June 2018 I Mazhav... Mazhavil Manorama 24 2018-06-13T05:00:02.000Z mazhavil manorama|"bhramanam full episode"|"gt... 386319 1216 453 697 https://i.ytimg.com/vi/mV6aztP58f8/default.jpg False False False Subscribe to Mazhavil Manorama now for your da... US
qxqDNP1bDEw 18.14.06 Nua Bohu | Full Ep 285 | 13th June 2018 | Odia... Tarang TV 24 2018-06-13T15:07:49.000Z tarang|"tarang tv"|"tarang tv online"|"tarang ... 130263 698 115 65 https://i.ytimg.com/vi/qxqDNP1bDEw/default.jpg False False False Nuabohu : Story of a rustic village girl who w... US
wERgpPK44w0 18.14.06 Ee Nagaraniki Emaindi Trailer | Tharun Bhascke... Suresh Productions 24 2018-06-10T04:29:54.000Z Ee Nagaraniki Emaindi|"Ee Nagaraniki Emaindi T... 1278249 22466 1609 1205 https://i.ytimg.com/vi/wERgpPK44w0/default.jpg False False False Check out Ee Nagaraniki Emaindi Trailer #EeNag... US

159906 rows × 16 columns

1.1.2: Map category Id's to categories

In [0]:
# Read the category_id.json file and map the category_id's in the dataframe to the category name.
combined_data['category_id'] = combined_data['category_id'].astype(str)

def find_title(item):
  return item['title']
category_df = pd.DataFrame(pd.read_json('/content/youtube_data/US_category_id.json').iloc[:,2].tolist())
category_df['title'] = category_df['snippet'].apply(find_title)
def map_cat(id_str):
  return category_df.loc[category_df.id==id_str,'title'].values[0]
combined_data.insert(4, 'category',combined_data['category_id'].apply(map_cat))
In [0]:
combined_data
Out[0]:
trending_date title channel_title category_id category publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description country
video_id
2kyS6SvSYSE 17.14.11 WE WANT TO TALK ABOUT OUR MARRIAGE CaseyNeistat 22 People & Blogs 2017-11-13T17:13:01.000Z SHANtell martin 748374 57527 2966 15954 https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg False False False SHANTELL'S CHANNEL - https://www.youtube.com/s... CA
1ZAPwfrtAFY 17.14.11 The Trump Presidency: Last Week Tonight with J... LastWeekTonight 24 Entertainment 2017-11-13T07:30:00.000Z last week tonight trump presidency|"last week ... 2418783 97185 6146 12703 https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg False False False One year after the presidential election, John... CA
5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 23 Comedy 2017-11-12T19:05:24.000Z racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146033 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http... CA
puqaWrEC7tY 17.14.11 Nickelback Lyrics: Real or Fake? Good Mythical Morning 24 Entertainment 2017-11-13T11:00:04.000Z rhett and link|"gmm"|"good mythical morning"|"... 343168 10172 666 2146 https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg False False False Today we find out if Link is a Nickelback amat... CA
d380meD0W0M 17.14.11 I Dare You: GOING BALD!? nigahiga 24 Entertainment 2017-11-12T18:01:41.000Z ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095731 132235 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg False False False I know it's been a while since we did this sho... CA
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
iNHecA3PJCo 18.14.06 फेकू आशिक़ - राजस्थान की सबसे शानदार कॉमेडी | ... RDC Rajasthani 23 Comedy 2018-06-13T08:01:11.000Z twinkle vaishnav comedy|"twinkle vaishnav"|"tw... 214378 3291 404 196 https://i.ytimg.com/vi/iNHecA3PJCo/default.jpg False False False PRG Music & RDC Rajasthani presents फेकू आशिक़... US
dpPmPbhcslM 18.14.06 Seetha | Flowers | Ep# 364 Flowers TV 24 Entertainment 2018-06-13T11:30:04.000Z flowers serials|"actress"|"malayalam serials"|... 406828 1726 478 1428 https://i.ytimg.com/vi/dpPmPbhcslM/default.jpg False False False Flowers - A R Rahman Show,Book your Tickets He... US
mV6aztP58f8 18.14.06 Bhramanam I Episode 87 - 12 June 2018 I Mazhav... Mazhavil Manorama 24 Entertainment 2018-06-13T05:00:02.000Z mazhavil manorama|"bhramanam full episode"|"gt... 386319 1216 453 697 https://i.ytimg.com/vi/mV6aztP58f8/default.jpg False False False Subscribe to Mazhavil Manorama now for your da... US
qxqDNP1bDEw 18.14.06 Nua Bohu | Full Ep 285 | 13th June 2018 | Odia... Tarang TV 24 Entertainment 2018-06-13T15:07:49.000Z tarang|"tarang tv"|"tarang tv online"|"tarang ... 130263 698 115 65 https://i.ytimg.com/vi/qxqDNP1bDEw/default.jpg False False False Nuabohu : Story of a rustic village girl who w... US
wERgpPK44w0 18.14.06 Ee Nagaraniki Emaindi Trailer | Tharun Bhascke... Suresh Productions 24 Entertainment 2018-06-10T04:29:54.000Z Ee Nagaraniki Emaindi|"Ee Nagaraniki Emaindi T... 1278249 22466 1609 1205 https://i.ytimg.com/vi/wERgpPK44w0/default.jpg False False False Check out Ee Nagaraniki Emaindi Trailer #EeNag... US

159906 rows × 17 columns

1.1.3: Fix datetime format and remove rows with NA's

In [0]:
def trending_datetime(time):
  return pd.to_datetime('20'+time,format='%Y.%d.%m', errors='ignore')
combined_data['trending_date'] = combined_data['trending_date'].apply(trending_datetime)
combined_data['publish_time'] = pd.to_datetime(combined_data['publish_time'])

# remove NA's
combined_data = combined_data.dropna()

1.2 EDA and Feature Engineering

1.2.1: Mean, standard deviation, min and max.

In [0]:
# Print some simple statistics like mean, standard deviation, min annd max for each of the numerical features in the dataset.
maxs = combined_data[['views','likes','dislikes','comment_count']].apply(np.max,axis=0)
mins = combined_data[['views','likes','dislikes','comment_count']].apply(np.min,axis=0)
stds = combined_data[['views','likes','dislikes','comment_count']].apply(np.std,axis=0)
means = combined_data[['views','likes','dislikes','comment_count']].apply(np.mean,axis=0)
In [0]:
maxs
Out[0]:
views            225211923
likes              5613827
dislikes           1643059
comment_count      1228655
dtype: int64
In [0]:
mins
Out[0]:
views            223
likes              0
dislikes           0
comment_count      0
dtype: int64
In [0]:
stds
Out[0]:
views            4.605278e+06
likes            1.521485e+05
dislikes         1.825848e+04
comment_count    2.327815e+04
dtype: float64
In [0]:
means
Out[0]:
views            1.281578e+06
likes            4.096105e+04
dislikes         2.056138e+03
comment_count    4.606594e+03
dtype: float64

1.2.2: Rescale features

In [ ]:
# Rescale the likes, views, dislkes and comment_count to log scale (base e) to avoid numerical instability issues
def plusone(num):
  return num+1
combined_data['likes_log'] = combined_data['likes'].apply(plusone).apply(np.log)
combined_data['views_log'] = combined_data['views'].apply(np.log)
combined_data['dislikes_log'] = combined_data['dislikes'].apply(plusone).apply(np.log)
combined_data['comment_log'] = combined_data['comment_count'].apply(plusone).apply(np.log)

1.2.3: Plot the distribution

In [0]:
sns.distplot(combined_data['likes_log'])
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f07474d6240>
In [0]:
sns.distplot(combined_data['views_log'])
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0740f1cc50>
In [0]:
sns.distplot(combined_data['dislikes_log'])
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f07400b8160>
In [0]:
sns.distplot(combined_data['comment_log'])
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f073feee278>

1.2.4: Comparing views, likes, dislikes against categories

As a next step, we try to gain insights into the data using categories, views, likes and dislikes.

  1. Number of videos for each category
In [0]:
# Number of videos for each category
plt.figure(figsize=(10,5))
chart = sns.countplot(
    data=combined_data,
    x='category',
    palette='Set1', 
    order=combined_data['category'].value_counts().index
)

plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light',
    fontsize='x-large'  
)
Out[0]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17]), <a list of 18 Text major ticklabel objects>)
  1. The distribution of category against views
In [0]:
# The distribution of category against views
chart = sns.catplot(
    data=combined_data,
    x='category',
    y='views_log',
    kind='box',
    palette='Set1'
)

plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light' 
)
Out[0]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17]), <a list of 18 Text major ticklabel objects>)
  1. The distribution of dislikes against views
In [0]:
# The distribution of dislikes against views
chart = sns.catplot(
    data=combined_data,
    x='category',
    y='dislikes_log',
    kind='box',
    palette='Set1'
)

plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light' 
)
Out[0]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17]), <a list of 18 Text major ticklabel objects>)

1.2.5: Feature Engineering

a. Processing tags
In [ ]:
# Count the number of tags.
combined_data["num_tags"]=combined_data['tags'].apply(lambda x:x.count("|")+1)
b. Processing description and title
In [ ]:
# Compute the length of description
combined_data["desc_len"]=combined_data['description'].apply(len)
In [ ]:
# Compute the length of title
combined_data["len_title"]=combined_data['title'].apply(len)
c. Processing publish_time.
In [ ]:
# Split 'publish_time' feature into three parts time, date, and weekday
combined_data['publish_weekday']= combined_data['publish_time'].apply(lambda x:x.weekday()+1)
combined_data['publish_date'] = combined_data['publish_time'].apply(lambda x:x.date())
combined_data['publish_time'] = combined_data['publish_time'].apply(lambda x:x.time())
d. Number of videos per weekday
In [0]:
# Compute the number of videos published per day of the week.
plt.figure(figsize=(10,5))
chart = sns.countplot(
    data=combined_data,
    x='publish_weekday',
    palette='Set1', 
    order=combined_data['publish_weekday'].value_counts().index
)

1.2.6: Drop all non numeric columns

In [0]:
combined_data=combined_data.drop(['title', 'channel_title', 'category', 'tags', 'views', 'likes', 'dislikes', 'comment_count', 'thumbnail_link', 'thumbnail_link', 'description','publish_date', 'publish_time','trending_date'], axis=1)  

1.2.7: Convert categorical features in the dataset into one hot vectors.

In [0]:
combined_data.publish_weekday = combined_data.publish_weekday.astype('category')
combined_data.country = combined_data.country.astype('category')

combined_data = pd.concat([combined_data, pd.get_dummies(combined_data['publish_weekday'])], axis=1)
combined_data = pd.concat([combined_data, pd.get_dummies(combined_data['country'])], axis=1)
combined_data = pd.concat([combined_data, pd.get_dummies(combined_data['category_id'])], axis=1)
combined_data = combined_data.drop(['category_id','country','publish_weekday'],axis=1)
In [0]:
# Write out the modified data to a file
combined_data_sec_2 = combined_data.copy()
combined_data_sec_2.rename(columns = {'views_log':'label'}, inplace = True) 
combined_data_sec_2.to_csv('combined_data.csv')

1.2.8: Split into x and y

In [0]:
# Split the data into features and label
combined_data=pd.read_csv('combined_data.csv').set_index('video_id')
In [0]:
label = combined_data['label']
features = combined_data.drop(['label'],axis=1)

1.3 : Machine Learning using sklearn

1.3.1 : Split data into train and test

In [0]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features, label, test_size=0.2)

1.3.2: Train Machine Learning Models.

1.3.2.1 Linear Regression

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

clf = LinearRegression()
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

print("Score:", clf.score(x_test, y_test))
plt.figure(dpi=100)
plt.scatter(y_test, y_pred)

# Different error measures
print("MAE:", mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
Score: 0.8660262246648466
MAE: 0.5068012207962754
MSE: 0.44355818659686275
RMSE: 0.6660016415872132

1.3.2.2 Dimensionality reduction with PCA

Use Principal component analysis to reduce number of dimensions of the dataset

In [0]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit a pca model on your train set
X = StandardScaler().fit_transform(features)
pca = PCA(n_components=38)
X2 = pca.fit_transform(X)
np.set_printoptions(suppress=True)
pca.explained_variance_ratio_
pc_vs_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot the explained_variance_ratio against the number of components
pc_vs_variance
plt.plot(pc_vs_variance)
In [0]:
pca = PCA(n_components=31)
pca.fit(x_train)
x_train_2 = pca.transform(x_train)

1.3.2.3 Random Forest.

In [0]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Grid search 
param_grid = { 
    'n_estimators'     : [20,30,40],
    'max_depth'        : [8,12,16]
}
CV_clf = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid,cv=3)
CV_clf.fit(x_train_2, y_train)
CV_clf.best_params_
Out[0]:
{'max_depth': 16, 'n_estimators': 40}
In [0]:
# Fit the random forest on the traing data
rfr = RandomForestRegressor(max_depth=16,n_estimators=40)
rfr.fit(x_train_2,y_train)
pca = PCA(n_components=31)
pca.fit(x_test)
x_test_2 = pca.transform(x_test)

# Make predictions
y_pred=rfr.predict(x_test_2)
In [ ]:
np.sqrt(mean_squared_error(y_test, y_pred))

Section 2 : Distributed Machine Learning with Spark

Initializing Spark Connection

In [ ]:
!apt install libkrb5-dev
!wget https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install findspark
!pip install sparkmagic
!pip install pyspark
!pip install pyspark --user
!pip install seaborn --user
!pip install plotly --user
!pip install imageio --user
!pip install folium --user
In [ ]:
!apt update
!apt install gcc python-dev libkrb5-dev
In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F

import os

spark = SparkSession.builder.appName('ml-hw4').getOrCreate()
In [0]:
%load_ext sparkmagic.magics
The sparkmagic.magics extension is already loaded. To reload it, use:
  %reload_ext sparkmagic.magics
In [0]:
#graph section
import networkx as nx
# SQLite RDBMS
import sqlite3
# Parallel processing
# import swifter
import pandas as pd
# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure

import os
os.environ['SPARK_HOME'] = '/content/spark-2.4.5-bin-hadoop2.7'
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
import pyspark
from pyspark.sql import SQLContext
In [0]:
try:
    if(spark == None):
        spark = SparkSession.builder.appName('Initial').getOrCreate()
        sqlContext=SQLContext(spark)
except NameError:
    spark = SparkSession.builder.appName('Initial').getOrCreate()
    sqlContext=SQLContext(spark)

2.1 Data for Spark ML

In [0]:
train_sdf = spark.read.format('csv').options(header='true', inferSchema='true').load('combined_data.csv')
In [0]:
train_sdf.show()
+-----------+-----------------+----------------+----------------------+------------------+------------------+------------------+------------------+--------+--------+---------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|   video_id|comments_disabled|ratings_disabled|video_error_or_removed|         likes_log|             label|      dislikes_log|       comment_log|num_tags|desc_len|len_title|111|212|  3|  4|  5|  6|  7| CA| FR| IN| US|122| 10| 15| 17| 19|227| 20| 22| 23| 24| 25| 26| 27| 28| 29| 30| 43| 44|
+-----------+-----------------+----------------+----------------------+------------------+------------------+------------------+------------------+--------+--------+---------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|2kyS6SvSYSE|            false|           false|                 false| 10.96002706478233|13.525658131998265| 7.995306620290822| 9.677527538712344|       1|    1410|       34|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|1ZAPwfrtAFY|            false|           false|                 false|11.484381947152983|14.698775079078011|  8.72371943690701| 9.449672183486918|       4|     630|       62|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|
|5qpjK5DgCt4|            false|           false|                 false| 11.89159475029123| 14.97598090353335| 8.582980931954241| 9.009691898489343|      23|    1177|       53|  0|  0|  0|  0|  0|  0|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|puqaWrEC7tY|            false|           false|                 false| 9.227492430793749|12.745975402155576| 6.502790045915623| 7.671826797878781|      27|    1403|       32|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|
|d380meD0W0M|            false|           false|                 false|11.792343484003547| 14.55541297649217| 7.595889917718538| 9.771041285235823|      14|     636|       24|  0|  0|  0|  0|  0|  0|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|
|gHZ1Qz0KiKM|            false|           false|                 false| 9.186457431512851| 11.68839023430097| 6.238324625039508| 7.268920128193722|       7|    1511|       21|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|
|39idVpFF7NQ|            false|           false|                 false| 9.679968930891835| 14.55907372318811| 7.802209316247118|  7.58629630715272|      42|     503|       41|  0|  0|  0|  0|  0|  0|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|
|nc99ccSXST0|            false|           false|                 false| 10.07171018495058|13.614289933541128| 6.658011045870748| 8.141189793457691|      13|     734|       35|  0|  0|  0|  0|  0|  0|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|
|jr9QtXwC9vc|            false|           false|                 false| 8.173011311724972|13.624421478523645| 4.787491742782046| 5.831882477283517|      28|    1310|       65|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|TUmyygCMMGA|            false|           false|                 false| 9.445807672979225| 12.45459540294377| 7.218176838403408|7.7702232041587855|      20|    2257|       53|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|
|9wRQljFNDW8|            false|           false|                 false| 6.486160788944089|11.306847956781812| 3.258096538021482| 5.181783550292085|      49|    1290|       86|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|VifQlJit6A0|            false|           false|                 false|7.3632795869630385|11.557688483443744| 5.717027701406222| 7.154615356913663|      40|     819|       78|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|
|5E4ZBSInqUU|            false|           false|                 false| 11.64561024932308| 13.44093637413771| 7.195937226475569|  9.03264808356589|      23|    1042|       42|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|GgVmn66oK_A|            false|           false|                 false| 8.968141414126814|13.208118966221953| 7.066466970136958| 8.289539484624141|      25|     871|       38|  0|  0|  0|  0|  0|  0|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|
|TaTleo4cOs8|            false|           false|                 false|  8.91918561004543|12.243040823630162|5.5093883366279774| 7.659642954564682|      14|    1800|       24|  0|  0|  0|  0|  0|  0|  1|  1|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|kgaO45SyaO4|            false|           false|                 false| 9.150590367570409|11.235220125663336| 3.970291913552122| 7.115582126184454|       5|      38|       16|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|
|ZAQs-ctOqXQ|            false|           false|                 false| 8.988695696785708|12.596894394400882| 6.459904454377535| 7.136483208590247|      32|    1145|       48|  0|  0|  0|  0|  0|  0|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|YVfyYrEmzgM|            false|           false|                 false| 8.593969030218288| 11.26502804918979|3.9889840465642745| 5.955837369464831|      24|    1133|       52|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|
|eNSN6qet1kE|            false|           false|                 false| 9.389657419749838| 11.48253841983021|3.6109179126442243| 7.701652362642226|      18|    1512|       26|  1|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|
|B5HORANmzHw|            false|           false|                 false| 9.038602608721874| 12.31882527209005|5.2574953720277815| 7.102499355774649|      17|    3142|       40|  0|  0|  0|  0|  0|  0|  1|  1|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|  1|  0|  0|  0|  0|  0|
+-----------+-----------------+----------------+----------------------+------------------+------------------+------------------+------------------+--------+--------+---------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
only showing top 20 rows

In [0]:
#Print the dataframe schema and verify
train_sdf.printSchema()
root
 |-- video_id: string (nullable = true)
 |-- comments_disabled: boolean (nullable = true)
 |-- ratings_disabled: boolean (nullable = true)
 |-- video_error_or_removed: boolean (nullable = true)
 |-- likes_log: double (nullable = true)
 |-- label: double (nullable = true)
 |-- dislikes_log: double (nullable = true)
 |-- comment_log: double (nullable = true)
 |-- num_tags: integer (nullable = true)
 |-- desc_len: integer (nullable = true)
 |-- len_title: integer (nullable = true)
 |-- 111: integer (nullable = true)
 |-- 212: integer (nullable = true)
 |-- 3: integer (nullable = true)
 |-- 4: integer (nullable = true)
 |-- 5: integer (nullable = true)
 |-- 6: integer (nullable = true)
 |-- 7: integer (nullable = true)
 |-- CA: integer (nullable = true)
 |-- FR: integer (nullable = true)
 |-- IN: integer (nullable = true)
 |-- US: integer (nullable = true)
 |-- 122: integer (nullable = true)
 |-- 10: integer (nullable = true)
 |-- 15: integer (nullable = true)
 |-- 17: integer (nullable = true)
 |-- 19: integer (nullable = true)
 |-- 227: integer (nullable = true)
 |-- 20: integer (nullable = true)
 |-- 22: integer (nullable = true)
 |-- 23: integer (nullable = true)
 |-- 24: integer (nullable = true)
 |-- 25: integer (nullable = true)
 |-- 26: integer (nullable = true)
 |-- 27: integer (nullable = true)
 |-- 28: integer (nullable = true)
 |-- 29: integer (nullable = true)
 |-- 30: integer (nullable = true)
 |-- 43: integer (nullable = true)
 |-- 44: integer (nullable = true)

In [0]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
In [0]:
# Drop unwanted columns
all_columns = train_sdf.columns
drop_columns = ['video_id','label']
columns_to_use = [i for i in all_columns if i not in drop_columns]
In [0]:
# Create a VectorAssembler object
assembler = VectorAssembler(inputCols=columns_to_use, outputCol='features')

In this step, we will create a pipeline with a single stage - the assembler.

In [0]:
from pyspark.ml import Pipeline
# Create a pipeline
pipeline = Pipeline(stages=[assembler])
modified_data_sdf = pipeline.fit(train_sdf).transform(train_sdf).drop(*columns_to_use)
In [0]:
modified_data_sdf.show()
+-----------+------------------+--------------------+
|   video_id|             label|            features|
+-----------+------------------+--------------------+
|2kyS6SvSYSE|13.525658131998265|(38,[3,4,5,6,7,8,...|
|1ZAPwfrtAFY|14.698775079078011|(38,[3,4,5,6,7,8,...|
|5qpjK5DgCt4| 14.97598090353335|(38,[3,4,5,6,7,8,...|
|puqaWrEC7tY|12.745975402155576|(38,[3,4,5,6,7,8,...|
|d380meD0W0M| 14.55541297649217|(38,[3,4,5,6,7,8,...|
|gHZ1Qz0KiKM| 11.68839023430097|(38,[3,4,5,6,7,8,...|
|39idVpFF7NQ| 14.55907372318811|(38,[3,4,5,6,7,8,...|
|nc99ccSXST0|13.614289933541128|(38,[3,4,5,6,7,8,...|
|jr9QtXwC9vc|13.624421478523645|(38,[3,4,5,6,7,8,...|
|TUmyygCMMGA| 12.45459540294377|(38,[3,4,5,6,7,8,...|
|9wRQljFNDW8|11.306847956781812|(38,[3,4,5,6,7,8,...|
|VifQlJit6A0|11.557688483443744|(38,[3,4,5,6,7,8,...|
|5E4ZBSInqUU| 13.44093637413771|(38,[3,4,5,6,7,8,...|
|GgVmn66oK_A|13.208118966221953|(38,[3,4,5,6,7,8,...|
|TaTleo4cOs8|12.243040823630162|(38,[3,4,5,6,7,8,...|
|kgaO45SyaO4|11.235220125663336|(38,[3,4,5,6,7,8,...|
|ZAQs-ctOqXQ|12.596894394400882|(38,[3,4,5,6,7,8,...|
|YVfyYrEmzgM| 11.26502804918979|(38,[3,4,5,6,7,8,...|
|eNSN6qet1kE| 11.48253841983021|(38,[3,4,5,6,7,8,...|
|B5HORANmzHw| 12.31882527209005|(38,[3,4,5,6,7,8,...|
+-----------+------------------+--------------------+
only showing top 20 rows

In [0]:
# Split into an 80-20 ratio between the train and test sets.
train_sdf, test_sdf = modified_data_sdf.randomSplit([0.8, 0.2], seed = 2020)

2.2 Linear regression using Spark ML

Using Spark ML's linear regression to Train a linear regression model and try to predict the views.

In [0]:
from pyspark.ml.regression import LinearRegression

# Train a linear regression model
lr = LinearRegression(featuresCol = 'features', labelCol='label')
lr_model = lr.fit(train_sdf)
In [0]:
trainingSummary = lr_model.summary

print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
RMSE: 0.666726
r2: 0.865483
In [0]:
predictions = lr_model.transform(test_sdf)
In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

# Compute root mean squared error on the test set
test_result = lr_model.evaluate(test_sdf)
test_rmse_orig = test_result.rootMeanSquaredError
In [0]:
test_rmse_orig
Out[0]:
0.6707749535378126
In [0]:
# Add regularization to avoid overfitting, try out L1, L2 and elastic net
lrl1 = LinearRegression(featuresCol = 'features', labelCol='label',regParam=0.1,elasticNetParam=1.0)
lrl1_model = lrl1.fit(train_sdf)
lrl2 = LinearRegression(featuresCol = 'features', labelCol='label',regParam=0.2,elasticNetParam=0.0)
lrl2_model = lrl2.fit(train_sdf)
lre = LinearRegression(featuresCol = 'features', labelCol='label',regParam=0.1,elasticNetParam=0.5)
lre_model = lre.fit(train_sdf)
In [0]:
# Compute predictions using each of the models
l1_predictions = lrl1_model.transform(test_sdf)
l2_predictions = lrl2_model.transform(test_sdf)
elastic_net_predictions = lre_model.transform(test_sdf)

# Compute root mean squared error on test set for each of your models
test_result_l1 = lrl1_model.evaluate(test_sdf)
test_rmse_l1 = test_result_l1.rootMeanSquaredError

test_result_l2 = lrl2_model.evaluate(test_sdf)
test_rmse_l2 = test_result_l2.rootMeanSquaredError

test_result_e = lre_model.evaluate(test_sdf)
test_rmse_elastic = test_result_e.rootMeanSquaredError
In [0]:
print(test_rmse_l1)
print(test_rmse_l2)
print(test_rmse_elastic)
0.6707749535378126
0.681360477784039
0.7007979237874897

2.3 Random Forest Regression

In [0]:
from pyspark.ml.regression import RandomForestRegressor
# Create a random forest regressor model, fit the training data and evaluate using RegressionEvaluator
rf_reg = RandomForestRegressor(labelCol="label", featuresCol="features")
rf_model = rf_reg.fit(train_sdf)
train_pred = rf_model.transform(train_sdf)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
train_rmse_rf =  evaluator.evaluate(train_pred)#TODO: calculate the training rmse
train_rmse_rf
Out[0]:
0.7166346288805886
In [0]:
# Predictions on the test set
predictions = rf_model.transform(test_sdf)
In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

# Calculate the rmse on the test set
rmse_rf = evaluator.evaluate(predictions) 
rmse_rf
Out[0]:
0.7176383770276658

2.4 Dimensionality Reduction using PCA

In [0]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import PCA as PCAml
pca = PCAml(k=31, inputCol='features', outputCol='pcaFeature')
pca_fit = pca.fit(train_sdf)
train_sdf_pca=pca_fit.transform(train_sdf)
pca_model = lr.fit(train_sdf_pca)
pcaSummary = pca_model.summary
training_rmse_pca = pcaSummary.rootMeanSquaredError
training_rmse_pca
In [0]:
training_rmse_pca
Out[0]:
0.6667264867184813
In [0]:
# Your code goes here
from pyspark.ml.evaluation import RegressionEvaluator

test_sdf_pca=pca_fit.transform(test_sdf)
# Get predictions on the test set
predictions_pca = pca_model.transform(test_sdf_pca) 
test_result_pca = pca_model.evaluate(test_sdf_pca)
# Get RMSE for test data
test_rmse_pca = test_result_pca.rootMeanSquaredError #TODO: Get RMSE for test data
test_rmse_pca
Out[0]:
0.6707749535378126