Quantitative Analysis of Online Radio “LastFM” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the online radio dataset lastfm.csv. The project is divided into two main Parts. Part-I examines the dataset in RStudio and involves three tasks to review and understand the dataset variables. Part-II covers the pre-analysis step of converting the dataset to a data frame and involves three tasks to analyze that data frame. The Association Rule data mining technique is used in this project. The support for each of the 1004 artists is calculated, and the support is displayed for all artists with support larger than 8%, indicating that the artists shown in Figure 4 are played by more than 8% of the users. The association rules are constructed with the “apriori” function in the R package arules. The search covers artists or groups of artists that have support larger than 1% and that give confidence larger than 50% for another artist. These requirements rule out rare artists. Antecedents (LHS) that involve more than one artist are also computed and listed. The list is further narrowed by requiring a lift larger than 5, and the resulting list is ordered by decreasing confidence, as illustrated in Figure 6.

Keywords: Online Radio, Association Rule Data Mining Analysis

Introduction

This project examines and analyzes the lastfm.csv dataset, which was downloaded from the CTU course materials. The dataset reflects an online radio service that keeps track of everything the user plays. It has 289,955 observations with four variables. The focus of this analysis is the Association Rule technique. The information in the dataset is used to recommend music the user is likely to enjoy, and it supports focused marketing, which sends the user advertisements for music the user is likely to buy. From the available information, including demographic information (such as age, sex, and location), the support for listening to individual artists can be determined, as well as the joint support for pairs or larger groupings of artists. To calculate this support, the incidences (0/1) are counted across all members of the network, and those frequencies are divided by the number of members. From the support, the confidence and the lift are calculated.

This project addresses two major Parts.  Part-I covers the following key Tasks to understand and examine the Dataset of “lastfm.csv.” 

  • Task-1:  Review the Variables of the Dataset.
  • Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.
  • Task-3:  Examine the Dataset, Summary of the Descriptive Statistics, and Visualization of the Variables.

Part-II covers the following three key Tasks to plot, discuss, and analyze the results.

  • Task-1: Required Computations for Association Rules and Frequent Items.
  • Task-2: Association Rules.
  • Task-3: Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “lastfm.csv”

Task-1:  Review the Variables of the Dataset

The purpose of this task is to understand the variables of the “lastfm.csv” dataset. The dataset describes the artists and the users who listen to the music. From the available information, including demographic information (such as age, sex, and location), the support for listening to individual artists can be determined, as well as the joint support for pairs or larger groupings of artists. There are 4 variables. Table 1 summarizes the selected variables for this project.

Table 1:  LastFm Dataset Variables

Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.

The purpose of this task is to load and understand the Dataset using the names(), head(), and dim() functions.  The task also displays the first three observations.

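  • ## reading the data
  • ## stringsAsFactors = TRUE keeps artist, sex, and country as factors in R >= 4.0
  • lf <- read.csv("C:/CS871/Data/lastfm.csv", stringsAsFactors = TRUE)
  • lf
  • ## number of rows and columns, number of observations, and variable names
  • dim(lf)
  • length(lf$user)
  • names(lf)
  • head(lf)
  • lf <- data.frame(lf)
  • head(lf)
  • str(lf)
  • lf[1:20,]
  • ## plot the first 1,000 observations only
  • lfsmallset <- lf[1:1000,]
  • lfsmallset
  • plot(lfsmallset, col="blue", main="Small Set of Online Radio")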

Figure 1.  First Sixteen Observations for User (1) – Woman from Germany.

Figure 2. The plot of Small Set of Last FM Variables.
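The loading steps above can be tried without access to lastfm.csv. The following is a minimal sketch that assumes a made-up stand-in file with the four column names used in the project code (user, artist, sex, country); the rows, the object names toy and lf_toy, and the temporary file path are hypothetical and are not taken from the dataset.

```r
# Hypothetical stand-in for lastfm.csv: same four column names, made-up rows.
toy <- data.frame(
  user    = c(1, 1, 2, 2, 3),
  artist  = c("muse", "radiohead", "muse", "the beatles", "radiohead"),
  sex     = c("f", "f", "m", "m", "m"),
  country = c("Germany", "Germany", "Canada", "Canada", "Sweden")
)

# Write the toy data to a temporary CSV file.
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

# Read it back the same way the project reads lastfm.csv.
lf_toy <- read.csv(toy_path, stringsAsFactors = TRUE)

dim(lf_toy)      # number of rows and columns
names(lf_toy)    # variable names
head(lf_toy, 3)  # first three observations
str(lf_toy)      # type of each variable
```

Passing 3 as the second argument of head() limits the display to the first three observations; without it, head() shows six rows.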

Task-3:  Examine the Dataset, Summary of the Descriptive Statistics and Visualization of the Variables.

The purpose of this task is to examine the dataset.  This task also converts the user variable to a factor and lists the levels of the user and artist variables.  It also displays the summary of the variables and the visualization of each variable.

  • ### Factor user and levels user and artist
  • lf$user <- factor(lf$user)
  • levels(lf$user)    ## 15,000 users
  • levels(lf$artist)  ## 1,004 artists
  • ## Summary of the Variables
  • summary(lf)
  • summary(lf$user)
  • summary(lf$artist)
  • ## Plot for Visualization of the variables.
  • plot(lf$user, col="blue")
  • plot(lf$artist, col="blue")
  • plot(lf$sex, col="orange")
  • plot(lf$country, col="orange")

Figure 3.  Plots of LastFM Variables.
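The counts reported by summary() for a factor can also be obtained with table(), which makes it easy to rank the levels. The sketch below is an illustration on made-up rows, not on lastfm.csv; the data frame toy and the names artist_counts and sex_counts are hypothetical.

```r
# Made-up observations (one row per play), not taken from lastfm.csv.
toy <- data.frame(
  user   = c(1, 1, 2, 2, 3, 3),
  artist = c("muse", "radiohead", "muse", "coldplay", "muse", "radiohead"),
  sex    = c("f", "f", "m", "m", "m", "m")
)

# Convert the identifiers to factors, as done for lf$user in the project.
toy$user   <- factor(toy$user)
toy$artist <- factor(toy$artist)

nlevels(toy$user)    # number of distinct users
nlevels(toy$artist)  # number of distinct artists

# Number of observations per artist, most frequent first.
artist_counts <- sort(table(toy$artist), decreasing = TRUE)
head(artist_counts, 2)

# Number of observations per sex (counts rows, not distinct users).
sex_counts <- table(toy$sex)
sex_counts
```

Because each row is one observation, these are counts of observations; a user who appears in several rows is counted once per row.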

Part-II:  Association Rules Data Mining, Discussion and Analysis

Task-1:  Required Computations for Association Rules and Frequent Items

The purpose of this task is to first implement computations which are required for the association rules.  The required package arules is first installed.  This task visualizes the frequency of items in Figure 4.

  • ## Install arules library for association rules
  • install.packages("arules")
  • library(arules)
  • ### computational environment for mining association rules and frequent item sets
  • ## split the artists into one list element per user
  • playlist <- split(x=lf[,"artist"], f=lf$user)
  • playlist[1:2]
  • ## Remove Artist Duplicates.
  • playlist <- lapply(playlist,unique)
  • playlist <- as(playlist,"transactions")
  • ## view this as a list of "transactions"
  • ## transactions is a data class defined in arules
  • itemFrequency(playlist)
  • ## lists the support of the 1,004 bands
  • ## number of times band is listened to on the playlist of 15,000 users
  • ## computes relative frequency of artist mentioned by the 15,000 users
  • ## plots the item frequencies.
  • itemFrequencyPlot(playlist,support=0.08, cex.names=1.5, col="blue", main="Item Frequency")

Figure 4.  Plot of Item Frequency.
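The support values listed by itemFrequency() correspond to the share of users who played each artist at least once. The following base R sketch illustrates that calculation on a 0/1 incidence matrix built from made-up rows; it is a stand-in for illustration under that assumption, not the arules implementation, and the names toy, incidence, and support are hypothetical.

```r
# Made-up observations (one row per play), not taken from lastfm.csv.
# User 1 plays "muse" twice to show that duplicates do not inflate support.
toy <- data.frame(
  user   = c(1, 1, 1, 2, 2, 3, 3, 4),
  artist = c("muse", "radiohead", "muse", "muse", "the beatles",
             "radiohead", "muse", "the beatles")
)

# Incidence matrix: one row per user, one column per artist,
# TRUE when the user played the artist at least once.
incidence <- unclass(table(toy$user, toy$artist)) > 0
incidence

# Support of each artist = share of users who played that artist.
support <- colMeans(incidence)
support

# Artists played by more than half of the users in this toy example.
support[support > 0.5]
```

In the project, the same filtering idea is applied with a threshold of 0.08 through the support argument of itemFrequencyPlot().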

Task-2:  Association Rules Data Mining

The purpose of this task is to implement the data mining for the music list (lastfm.csv) using the Association Rules technique.  First, the code builds the association rules, keeping only associations with support > 0.01 and confidence > 0.50.  These thresholds rule out rare bands.  The rules are then filtered by lift and ordered by confidence for a better understanding of the association rules result.

  • ## Build the Association Rules
  • ## Only associations with support > 0.01 and confidence > 0.50
  • ## Rule out rare bands
  • music.association.rules <- apriori(playlist, parameter=list(support=0.01, confidence=0.50))
  • inspect(music.association.rules)
  • ## Filter by lift > 5
  • ## Show only those with lift > 5, among those associations with support > 0.01 and confidence > 0.50.
  • inspect(subset(music.association.rules, subset=lift > 5))
  • ## Order by confidence for better understanding of the association rules result.
  • inspect(sort(subset(music.association.rules, subset=lift>5), by="confidence"))

Figure 5.  Example of Listening to both “Muse” and “Beatles” with a Confidence of 0.507 for Radiohead.

Figure 6.  List Narrowed to Lift > 5 and Ordered by Decreasing Confidence.
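The same build, filter, and sort sequence can be tried on a few hand-made playlists. This is a minimal sketch that assumes the arules package is installed; the playlists are made up, the thresholds are far looser than the project's 0.01 and 0.50 because the toy data is tiny, and minlen = 2 and verbose = FALSE are additions that the project code does not use.

```r
library(arules)

# Made-up playlists for six hypothetical users (not taken from lastfm.csv).
toy_playlist <- list(
  u1 = c("muse", "radiohead", "the beatles"),
  u2 = c("muse", "radiohead"),
  u3 = c("muse", "the beatles", "radiohead"),
  u4 = c("the beatles", "pink floyd"),
  u5 = c("pink floyd", "led zeppelin"),
  u6 = c("led zeppelin", "pink floyd", "the beatles")
)
toy_trans <- as(toy_playlist, "transactions")

# Build the rules. minlen = 2 (an assumption, not in the project code)
# drops rules with an empty antecedent; verbose = FALSE silences the log.
toy_rules <- apriori(
  toy_trans,
  parameter = list(support = 0.3, confidence = 0.6, minlen = 2),
  control   = list(verbose = FALSE)
)

# Keep rules that do better than chance, then order by confidence.
strong        <- subset(toy_rules, subset = lift > 1)
by_confidence <- sort(strong, by = "confidence")
inspect(by_confidence)
```

Storing the filtered and sorted rules in objects, rather than nesting the calls inside inspect(), makes it possible to check the quality measures afterwards with quality(by_confidence).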

Task-3: Discussion and Analysis

The association rules are used to explore the relationship between items and sets of items (Fischetti et al., 2017; Giudici, 2005). Each transaction is composed of one or more items. The interest is in transactions of at least two items because a transaction with a single item cannot show a relationship between items (Fischetti et al., 2017). An association rule is an explicit statement of a relationship in the data, in the form X => Y, where X (the antecedent) can be composed of one or several items and is called an itemset, and Y (the consequent) is always one single item. In this project, the interest is in the antecedents of music since the interest is in promoting the purchase of music. The frequent “itemsets” are the items or collections of items which frequently occur in transactions. The “itemsets” are considered frequent if they occur more frequently than a specified threshold (Fischetti et al., 2017). The threshold is called minimal support (Fischetti et al., 2017). The omission of “itemsets” with support less than the minimum support is called support pruning (Fischetti et al., 2017). The support for an itemset is the proportion of all cases in which the itemset of interest is present. It allows estimation of how interesting an itemset or a rule is: when support is low, the interest is limited (Fischetti et al., 2017). The confidence of a rule X => Y is the proportion of the cases featuring X that also feature Y, which can be computed as the number of cases featuring X and Y divided by the number of cases featuring X (Fischetti et al., 2017). Lift is a measure of the improvement of the rule support over what can be expected by chance, which is computed as support(X => Y) / (support(X) * support(Y)) (Fischetti et al., 2017). If the lift value is not higher than 1, the rule does not explain the relationship between the items better than could be expected by chance.

The goal of “apriori” is to compute the frequent “itemsets” and the association rules efficiently and to compute support and confidence.
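The three measures can be worked through by hand on a small example. The sketch below uses made-up playlists and a hypothetical helper support_of(); it illustrates the definitions above in base R and is not how “apriori” computes them internally.

```r
# Made-up playlists for five hypothetical users (not taken from lastfm.csv).
playlists <- list(
  u1 = c("muse", "radiohead"),
  u2 = c("muse", "radiohead", "the beatles"),
  u3 = c("muse"),
  u4 = c("the beatles"),
  u5 = c("radiohead", "the beatles")
)

# Support of an itemset: share of playlists that contain every item in it.
support_of <- function(items) {
  mean(vapply(playlists, function(p) all(items %in% p), logical(1)))
}

x <- "muse"       # antecedent (LHS)
y <- "radiohead"  # consequent (RHS)

supp_x  <- support_of(x)        # 3 of 5 playlists
supp_y  <- support_of(y)        # 3 of 5 playlists
supp_xy <- support_of(c(x, y))  # 2 of 5 playlists

# Confidence of x => y: cases with x and y divided by cases with x.
confidence <- supp_xy / supp_x

# Lift of x => y: joint support relative to what independence would give.
lift <- supp_xy / (supp_x * supp_y)

c(support = supp_xy, confidence = confidence, lift = lift)
```

Here the confidence is 2/3 and the lift is 10/9, so in this toy example listening to “muse” raises the chance of listening to “radiohead” only slightly above what chance alone would give.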

In this project, the large lastfm dataset (289,955 observations and four variables) is used. The descriptive analysis shows that the number of observations from male users (N=211823) exceeds the number from female users (N=78132), as illustrated in Figure 3. The most frequent artist has a count of 2704, followed by “Beatles” with 2668 and “Coldplay” with 2378. The most frequent country has a count of 59558, followed by the United Kingdom with 27638 and Germany with 24251, as shown by the summaries in Task-3 of Part-I.

As illustrated in Figure 1, the first sixteen observations belong to user 1, a woman from Germany, and form the first sixteen rows of the data matrix. The R package arules was used for mining the association rules and for identifying frequent “itemsets.” The data is transformed into an incidence matrix where each listener represents a row, with 0s and 1s across the columns indicating whether or not the user has played a particular artist. The incidence matrix is stored in the R object “playlist.” The support for each of the 1004 artists is calculated, and the support is displayed for all artists with support larger than 8%, indicating that the artists shown in Figure 4 are played by more than 8% of the users.

The association rules are constructed with the “apriori” function in the R package arules. The search covers artists or groups of artists that have support larger than 1% and that give confidence larger than 50% for another artist. These requirements rule out rare artists. Antecedents (LHS) that involve more than one artist are also computed and listed. For instance, listening to both “Muse” and “Beatles” has support larger than 1%, and the confidence for “Radiohead,” given that someone listens to both “Muse” and “Beatles,” is 0.507 with a lift of 2.82, as illustrated in Figure 5. This rule meets both requirements, whereas antecedents involving three artists do not appear in the list because they do not meet both requirements. The list is further narrowed by requiring a lift larger than 5, and the resulting list is ordered by decreasing confidence, as illustrated in Figure 6. The result shows that listening to both “Led Zeppelin” and “the Doors” has a support of 1%, a confidence of 0.597 (60%), and a lift of 5.69, and is quite predictive of listening to “Pink Floyd,” as shown in Figure 6. Another example is that listening to “Judas Priest” lifts the chance of listening to “Iron Maiden” by a factor of 8.56, as illustrated in Figure 6. Thus, if the user listens to “Judas Priest,” the recommendation is for that user to also listen to “Iron Maiden.” The same interpretation applies to all six rules listed in Figure 6.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.