! sudo apt install -y openjdk-8-jdk
! sudo update-alternatives --config java
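The --config step above is interactive; on a non-interactive runtime the same selection can be made with --set (the JVM path below is an assumption about the stock Ubuntu openjdk-8 package layout):
! sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java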
import numpy as np
import pandas as pd
import json
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from datetime import datetime
import glob
import seaborn as sns
import re
import os
import boto3
from botocore import UNSIGNED
from botocore.config import Config
s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
s3.Bucket('penn-cis545-files').download_file('youtube_data.zip', 'youtube_data.zip')
!unzip /content/youtube_data.zip
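As a quick sanity check (assuming the archive extracts to /content/youtube_data, the path used throughout below), list the extracted files:
sorted(os.listdir('/content/youtube_data'))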
The dataset is a daily record of the top trending YouTube videos.
To determine the year’s top-trending videos, YouTube uses a combination of factors, including user interactions (number of views, shares, comments, and likes). Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well known for.
This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for numerous countries, with up to 200 listed trending videos per day.
Each region’s data is in a separate file and includes the video id, trending date, title, channel title, publish time, tags, views, likes and dislikes, thumbnail link, description, and comment count.
The data also includes a category_id field, which varies between regions.
The dataset contains multiple CSV files, one per country, so the first step is to read them and combine them into a single dataframe.
# Import all the csv files (sorted so that the order matches the country list below)
files = sorted(glob.glob('/content/youtube_data/*.csv'))
files
# Combine all into a single dataframe "combined_data" and add a 'country' column.
all_dataframes = []
countries = ['CA', 'FR', 'IN', 'US']
for csv_path, country in zip(files, countries):
    df = pd.read_csv(csv_path, header=0).set_index('video_id')
    df['country'] = country
    all_dataframes.append(df)
combined_data = pd.concat(all_dataframes)
combined_data
# Read the category_id.json file and map the category_id's in the dataframe to the category name.
combined_data['category_id'] = combined_data['category_id'].astype(str)
def find_title(item):
    return item['title']

# The 'items' field of the JSON (the third column after read_json) holds one record per category
category_df = pd.DataFrame(pd.read_json('/content/youtube_data/US_category_id.json').iloc[:, 2].tolist())
category_df['title'] = category_df['snippet'].apply(find_title)

def map_cat(id_str):
    return category_df.loc[category_df.id == id_str, 'title'].values[0]
combined_data.insert(4, 'category',combined_data['category_id'].apply(map_cat))
combined_data
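A slightly faster equivalent for the mapping above (a sketch: one dictionary lookup per row instead of a DataFrame scan) would be:
# cat_map = category_df.set_index('id')['title'].to_dict()
# combined_data['category'] = combined_data['category_id'].map(cat_map)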
def trending_datetime(time):
    return pd.to_datetime('20' + time, format='%Y.%d.%m', errors='ignore')
combined_data['trending_date'] = combined_data['trending_date'].apply(trending_datetime)
combined_data['publish_time'] = pd.to_datetime(combined_data['publish_time'])
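Before dropping rows with missing values (the next step), a quick check of where the gaps are can be useful:
combined_data.isna().sum()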
# remove NA's
combined_data = combined_data.dropna()
# Print some simple statistics (mean, standard deviation, min and max) for each of the numerical features in the dataset.
maxs = combined_data[['views','likes','dislikes','comment_count']].apply(np.max,axis=0)
mins = combined_data[['views','likes','dislikes','comment_count']].apply(np.min,axis=0)
stds = combined_data[['views','likes','dislikes','comment_count']].apply(np.std,axis=0)
means = combined_data[['views','likes','dislikes','comment_count']].apply(np.mean,axis=0)
maxs
mins
stds
means
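The same summary is available in one call via describe(); for reference (note that describe() reports the sample standard deviation, ddof=1, whereas np.std defaults to ddof=0):
combined_data[['views','likes','dislikes','comment_count']].describe()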
# Rescale the likes, views, dislikes and comment_count to log scale (base e) to avoid numerical instability issues.
# np.log1p(x) = log(1 + x), so zero counts in likes, dislikes and comments are handled without a separate +1 helper.
combined_data['likes_log'] = np.log1p(combined_data['likes'])
combined_data['views_log'] = np.log(combined_data['views'])
combined_data['dislikes_log'] = np.log1p(combined_data['dislikes'])
combined_data['comment_log'] = np.log1p(combined_data['comment_count'])
sns.distplot(combined_data['likes_log'])
sns.distplot(combined_data['views_log'])
sns.distplot(combined_data['dislikes_log'])
sns.distplot(combined_data['comment_log'])
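Note that sns.distplot is deprecated in seaborn 0.11+; on newer versions an equivalent plot (shown for one column as a sketch) is:
# sns.histplot(combined_data['views_log'], kde=True)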
As a next step, we try to gain some insight into the data using the categories, views, likes, and dislikes.
# Number of videos for each category
plt.figure(figsize=(10, 5))
chart = sns.countplot(
    data=combined_data,
    x='category',
    palette='Set1',
    order=combined_data['category'].value_counts().index
)
plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light',
    fontsize='x-large'
)
# The distribution of views across categories
chart = sns.catplot(
    data=combined_data,
    x='category',
    y='views_log',
    kind='box',
    palette='Set1'
)
plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light'
)
# The distribution of dislikes across categories
chart = sns.catplot(
    data=combined_data,
    x='category',
    y='dislikes_log',
    kind='box',
    palette='Set1'
)
plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light'
)
# Count the number of tags.
combined_data["num_tags"]=combined_data['tags'].apply(lambda x:x.count("|")+1)
# Compute the length of description
combined_data["desc_len"]=combined_data['description'].apply(len)
# Compute the length of title
combined_data["len_title"]=combined_data['title'].apply(len)
# Split the 'publish_time' feature into three parts: weekday, date, and time
combined_data['publish_weekday']= combined_data['publish_time'].apply(lambda x:x.weekday()+1)
combined_data['publish_date'] = combined_data['publish_time'].apply(lambda x:x.date())
combined_data['publish_time'] = combined_data['publish_time'].apply(lambda x:x.time())
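For reference, the three apply calls above can also be written with pandas' vectorized .dt accessor (against the original datetime column, i.e. before the last line overwrites publish_time with plain time objects):
# combined_data['publish_weekday'] = combined_data['publish_time'].dt.weekday + 1
# combined_data['publish_date'] = combined_data['publish_time'].dt.date
# combined_data['publish_time'] = combined_data['publish_time'].dt.time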
# Plot the number of videos published on each day of the week
plt.figure(figsize=(10, 5))
chart = sns.countplot(
    data=combined_data,
    x='publish_weekday',
    palette='Set1',
    order=combined_data['publish_weekday'].value_counts().index
)
# Drop the columns that are no longer needed for modeling
combined_data = combined_data.drop(['title', 'channel_title', 'category', 'tags', 'views', 'likes', 'dislikes', 'comment_count', 'thumbnail_link', 'description', 'publish_date', 'publish_time', 'trending_date'], axis=1)
combined_data.publish_weekday = combined_data.publish_weekday.astype('category')
combined_data.country = combined_data.country.astype('category')
combined_data = pd.concat([combined_data, pd.get_dummies(combined_data['publish_weekday'])], axis=1)
combined_data = pd.concat([combined_data, pd.get_dummies(combined_data['country'])], axis=1)
combined_data = pd.concat([combined_data, pd.get_dummies(combined_data['category_id'])], axis=1)
combined_data = combined_data.drop(['category_id','country','publish_weekday'],axis=1)
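For reference, the dummy-encoding steps above can be collapsed into a single call; note that this variant prefixes each dummy column with its source column name, so the resulting column names differ from the manual version:
# combined_data = pd.get_dummies(combined_data, columns=['publish_weekday', 'country', 'category_id'])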
# Write out the modified data to a file
combined_data_sec_2 = combined_data.copy()
combined_data_sec_2.rename(columns = {'views_log':'label'}, inplace = True)
combined_data_sec_2.to_csv('combined_data.csv')
# Read the prepared data back in and split it into features and label
combined_data = pd.read_csv('combined_data.csv').set_index('video_id')
label = combined_data['label']
features = combined_data.drop(['label'],axis=1)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, label, test_size=0.2)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
clf = LinearRegression()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print("Score:", clf.score(x_test, y_test))
plt.figure(dpi=100)
plt.scatter(y_test, y_pred)
# Different error measures
print("MAE:", mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
Next, use principal component analysis (PCA) to reduce the dimensionality of the dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize the features and fit PCA to inspect the explained variance of each component
X = StandardScaler().fit_transform(features)
pca = PCA(n_components=38)
X2 = pca.fit_transform(X)
np.set_printoptions(suppress=True)
pca.explained_variance_ratio_
pc_vs_variance = np.cumsum(pca.explained_variance_ratio_)
# Plot the explained_variance_ratio against the number of components
pc_vs_variance
plt.plot(pc_vs_variance)
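This curve motivates the choice of 31 components below; for reference, the cutoff can also be picked programmatically (the 95% threshold here is an assumption, not taken from the original analysis):
n_keep = int(np.argmax(pc_vs_variance >= 0.95)) + 1
n_keep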
# Fit PCA with 31 components on the training set and project it
pca = PCA(n_components=31)
pca.fit(x_train)
x_train_2 = pca.transform(x_train)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Grid search over the number of trees and the maximum depth
param_grid = {
    'n_estimators': [20, 30, 40],
    'max_depth': [8, 12, 16]
}
CV_clf = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid,cv=3)
CV_clf.fit(x_train_2, y_train)
CV_clf.best_params_
# Fit the random forest on the training data
rfr = RandomForestRegressor(max_depth=16,n_estimators=40)
rfr.fit(x_train_2,y_train)
# Project the test set with the PCA fitted on the training data (do not refit on the test set)
x_test_2 = pca.transform(x_test)
# Make predictions
y_pred=rfr.predict(x_test_2)
np.sqrt(mean_squared_error(y_test, y_pred))
!apt install libkrb5-dev
# Note: older Spark releases are moved off the mirrors over time; if this URL 404s, the same archive should be available under https://archive.apache.org/dist/spark/
!wget https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install findspark
!pip install sparkmagic
!pip install pyspark
!pip install seaborn --user
!pip install plotly --user
!pip install imageio --user
!pip install folium --user
!apt update
!apt install gcc python-dev libkrb5-dev
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
import os
spark = SparkSession.builder.appName('ml-hw4').getOrCreate()
%load_ext sparkmagic.magics
#graph section
import networkx as nx
# SQLite RDBMS
import sqlite3
# Parallel processing
# import swifter
import pandas as pd
# NoSQL DB
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure
import os
os.environ['SPARK_HOME'] = '/content/spark-2.4.5-bin-hadoop2.7'
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
import pyspark
from pyspark.sql import SQLContext
try:
    if spark is None:
        spark = SparkSession.builder.appName('Initial').getOrCreate()
    sqlContext = SQLContext(spark)
except NameError:
    spark = SparkSession.builder.appName('Initial').getOrCreate()
    sqlContext = SQLContext(spark)
train_sdf = spark.read.format('csv').options(header='true', inferSchema='true').load('combined_data.csv')
train_sdf.show()
#Print the dataframe schema and verify
train_sdf.printSchema()
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Drop unwanted columns
all_columns = train_sdf.columns
drop_columns = ['video_id','label']
columns_to_use = [i for i in all_columns if i not in drop_columns]
# Create a VectorAssembler object
assembler = VectorAssembler(inputCols=columns_to_use, outputCol='features')
In this step, we create a pipeline with a single stage: the assembler.
from pyspark.ml import Pipeline
# Create a pipeline
pipeline = Pipeline(stages=[assembler])
modified_data_sdf = pipeline.fit(train_sdf).transform(train_sdf).drop(*columns_to_use)
modified_data_sdf.show()
# Split into an 80-20 ratio between the train and test sets.
train_sdf, test_sdf = modified_data_sdf.randomSplit([0.8, 0.2], seed = 2020)
Next, we use Spark ML's linear regression to train a model and predict the (log-scaled) views.
from pyspark.ml.regression import LinearRegression
# Train a linear regression model
lr = LinearRegression(featuresCol = 'features', labelCol='label')
lr_model = lr.fit(train_sdf)
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
predictions = lr_model.transform(test_sdf)
from pyspark.ml.evaluation import RegressionEvaluator
# Compute root mean squared error on the test set
test_result = lr_model.evaluate(test_sdf)
test_rmse_orig = test_result.rootMeanSquaredError
test_rmse_orig
# Add regularization to avoid overfitting: try L1 (lasso), L2 (ridge) and elastic net penalties
lrl1 = LinearRegression(featuresCol = 'features', labelCol='label',regParam=0.1,elasticNetParam=1.0)
lrl1_model = lrl1.fit(train_sdf)
lrl2 = LinearRegression(featuresCol = 'features', labelCol='label',regParam=0.2,elasticNetParam=0.0)
lrl2_model = lrl2.fit(train_sdf)
lre = LinearRegression(featuresCol = 'features', labelCol='label',regParam=0.1,elasticNetParam=0.5)
lre_model = lre.fit(train_sdf)
# Compute predictions using each of the models
l1_predictions = lrl1_model.transform(test_sdf)
l2_predictions = lrl2_model.transform(test_sdf)
elastic_net_predictions = lre_model.transform(test_sdf)
# Compute root mean squared error on test set for each of your models
test_result_l1 = lrl1_model.evaluate(test_sdf)
test_rmse_l1 = test_result_l1.rootMeanSquaredError
test_result_l2 = lrl2_model.evaluate(test_sdf)
test_rmse_l2 = test_result_l2.rootMeanSquaredError
test_result_e = lre_model.evaluate(test_sdf)
test_rmse_elastic = test_result_e.rootMeanSquaredError
print(test_rmse_l1)
print(test_rmse_l2)
print(test_rmse_elastic)
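As an aside (not part of the original workflow), the same regularization sweep can be written with Spark ML's built-in tuning utilities; a minimal sketch using 3-fold cross-validation over the same regParam/elasticNetParam values:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
# Grid over the same penalties tried manually above
reg_grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.1, 0.2])
            .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
            .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=reg_grid,
                    evaluator=RegressionEvaluator(labelCol='label', metricName='rmse'),
                    numFolds=3)
cv_model = cv.fit(train_sdf)
# Evaluate the best model found by cross-validation on the held-out test set
RegressionEvaluator(labelCol='label', metricName='rmse').evaluate(cv_model.transform(test_sdf))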
from pyspark.ml.regression import RandomForestRegressor
# Create a random forest regressor model, fit the training data and evaluate using RegressionEvaluator
rf_reg = RandomForestRegressor(labelCol="label", featuresCol="features")
rf_model = rf_reg.fit(train_sdf)
train_pred = rf_model.transform(train_sdf)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
train_rmse_rf = evaluator.evaluate(train_pred)
train_rmse_rf
# Predictions on the test set
predictions = rf_model.transform(test_sdf)
from pyspark.ml.evaluation import RegressionEvaluator
# Calculate the rmse on the test set
rmse_rf = evaluator.evaluate(predictions)
rmse_rf
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import PCA as PCAml
pca = PCAml(k=31, inputCol='features', outputCol='pcaFeature')
pca_fit = pca.fit(train_sdf)
train_sdf_pca=pca_fit.transform(train_sdf)
pca_model = LinearRegression(featuresCol='pcaFeature', labelCol='label').fit(train_sdf_pca)  # fit on the PCA output, not the original 'features' column
pcaSummary = pca_model.summary
training_rmse_pca = pcaSummary.rootMeanSquaredError
training_rmse_pca
from pyspark.ml.evaluation import RegressionEvaluator
test_sdf_pca=pca_fit.transform(test_sdf)
# Get predictions on the test set
predictions_pca = pca_model.transform(test_sdf_pca)
test_result_pca = pca_model.evaluate(test_sdf_pca)
# Get RMSE for test data
test_rmse_pca = test_result_pca.rootMeanSquaredError
test_rmse_pca