Modeling and Predicting of News Popularity in Social Media Sources

Date: 2024-07-28

Kemal Akyol and Baha Şen

Abstract: The popularity of news, which conveys the newsworthy events of the day to people, is of substantial importance to its audience. People interact with news websites and share news links or their opinions. This study uses supervised machine learning techniques to predict news popularity in social media sources. These techniques basically consist of two phases: a) the training data is given as input to the learning algorithm, and b) the performance of the trained algorithm is tested on the testing data. In this way, knowledge discovery from the data is performed. In this context, first, twelve datasets are obtained from a set of data within the frame of four categories: Economy, Microsoft, Obama and Palestine. Second, news popularity prediction in social network services is carried out by utilizing the Gradient Boosted Trees, Multi-Layer Perceptron and Random Forest learning algorithms. The prediction performances of all algorithms are examined by considering the Mean Absolute Error, Root Mean Squared Error and R-squared evaluation metrics. The results show that most of the models designed by using these algorithms prove to be applicable to this subject. Consequently, a comprehensive study of news popularity prediction is presented, using different techniques and drawing conclusions about the performances of the algorithms.

Keywords: News popularity, sentiment scores, social network services, Gradient Boosted Machines, Multi-Layer Perceptron, Random Forest.

1 Introduction

News conveys the newsworthy events of the day to people. In modern mining problems, news popularity is substantially important for predicting the audience of a particular news item or journal [Alswiti and Rodan (2017)]. It is measured through people's interaction with news websites, where they share news links or their opinions [Lerman and Ghosh (2010)]. Further, both social sharing websites and news websites are used to read news. Online news popularity is assessed through diverse factors on social media, such as the number of shares, comments and likes. The online consumption of news content, which is a large and still growing market for traditional print media, has undergone major changes [Canneyt, Leroux, Dhoedt et al. (2018)].

The spread of news to a large number of readers within a short period is very important for its popularity. Therefore, different sources compete to produce content for a major subset of the population [Bandari, Asur and Huberman (2012)]. Since user behavior in social media reflects events in the real world, researchers have found that social media data can be used to make predictions about the future. Such data also offers information that may be difficult to collect in comparably large quantities from other sources. That is, news popularity can be measured by means of it [Lawrence, Chase, Kyle et al. (2017)]. The evaluation of this subject is relatively novel for researchers.

Some of the studies addressing this subject are as follows. Alswiti and Rodan examined the effectiveness of feature selection for popularity prediction by using different features, classification models and attribute ranking models. According to their study, the Random Forest classifier achieved the best accuracy with all features, whereas the J48 and AdaBoost classifiers showed varying sensitivity to feature selection [Alswiti and Rodan (2017)]. Canneyt et al. presented a model to predict online news popularity. By analyzing the view patterns of online news, they introduced suitable models via well-chosen base functions. Using a real news dataset, they showed that combining content, meta-data and temporal behavior features leads to significantly improved predictions. The Gradient Tree Boosting algorithm proved to be more successful for news popularity prediction in their study [Canneyt, Leroux, Dhoedt et al. (2018)]. Bandari et al. [Bandari, Asur and Huberman (2012)] built a multi-dimensional feature space derived from attributes of articles and evaluated the effect of these features on online article popularity. By using both regression and classification algorithms, they obtained an overall accuracy of 84% on Twitter despite the randomness in human behavior. Fletcher and Park explored the influence of individual trust in news media on sharing preferences and online news engagement behaviors across eleven countries [Fletcher and Park (2017)]. Anil and Indiramma discussed the importance of recommendation systems, which are useful for finding interesting items, along with different methodologies and social factors [Anil and Indiramma (2015)]. Kywe et al. used a taxonomy to analyze the massive amount of information and the huge number of people interacting through Twitter [Kywe, Lim and Zhu (2012)]. Keneshloo et al. dealt with popularity prediction and built models using metadata, content, temporal and social features. The study was applied to real data from the Washington Post [Keneshloo, Wang, Han et al. (2016)].

Uddin et al. focused on predicting the popularity of online news, in terms of shares, before publication by using the Gradient Boosting Machine algorithm [Uddin, Patwary, Ahsan et al. (2016)]. Lee et al. [Lee, Moon and Salamatian (2012)] proposed a framework for modeling and predicting the popularity of online content based on survival analysis. The framework infers the likelihood that the content will be popular. A model was introduced that uses the lifetime of content and the comment count as popularity metrics, together with a set of explanatory factors. Kümpel et al. reviewed 461 peer-reviewed scientific articles quantitatively and qualitatively. The articles, published from 2004 to 2014, dealt with the relationship between news sharing and social media [Kümpel, Karnowski and Keyling (2015)]. Tatar et al. introduced a valuable study based on user comments. They analyzed the effectiveness of prediction models for automatically ranking online news [Tatar, Antoniadis, Amorim et al. (2014)]. Fernandes et al. [Fernandes, Vinagre and Cortez (2015)] introduced a proactive intelligent decision support system to detect the popularity of news at an early stage. The Random Forest classifier gave the best accuracy, 73%, on the 39,000 articles taken from the Mashable website. Wu and Shen identified the properties of news propagation by tracing data on Twitter. Using these characteristics, they implemented a news popularity prediction model that can predict the final number of retweets of a news tweet very quickly [Wu and Shen (2015)]. Liu and Zhang [Liu and Zhang (2017)] found that the grammatical construction of titles may affect news popularity positively. They calculated a score for the traditional category and author features using a logarithmic transformation, and presented a novel methodology to predict online news popularity before publication.

As can be seen in these studies, diverse features are used as input data for regression or classification approaches. This study handles the sentiment scores (title and headline) and the number of views of news items recorded at 20-minute intervals over 2 days, and presents news popularity prediction models for social media sources by utilizing the Gradient Boosted Machines (GBM), Multi-Layer Perceptron (MLP) and Random Forest (RF) machine learning algorithms. These algorithms are used in many research areas, such as medicine, social media and other areas of daily life.

The main focus of this study is the modeling and prediction of news popularity in social media sources. To this end, the study consists of two modules. The first applies data pre-processing techniques to all datasets. The second demonstrates the performance of machine learning algorithms based on boosting, neural networks and ensemble learning. These algorithms are implemented on the datasets and their performances are discussed.

The rest of the paper is organized as follows. Section 2 presents the materials and methods. Section 3 gives the experimental study and results. Finally, the paper ends with conclusions in Section 4.

2 Material and methods

2.1 Data

The data consists of news items and their respective social feedback on multiple platforms: Facebook, Google+ and LinkedIn. It was collected from public end-points of the social media sources and is already anonymized and aggregated by the data owners. The news data file describes the news items and consists of 93239 instances; each news item is described by 11 attributes, which are explained in Tab. 1. The data descriptors are based on information obtained by querying the official media sources Google News and Yahoo News [Moniz and Torgo (2018)].

A set of data files, called Feedbacks, concerns the evolution of the news items' popularity in the social media sources Facebook, Google+ and LinkedIn. News was collected during a two-year period, from January 7, 2013 to January 7, 2015, for each of the four categories: Economy, Microsoft, Obama and Palestine. News popularity is measured as the number of views at 20-minute intervals during the 2 days following publication. This set is composed of 12 data files, one for each combination of category and social media source.
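The sampling scheme described above fixes the number of popularity measurements per news item. The following is a small arithmetic sketch of that count; the constant names and the `TS1`, `TS2`, ... column labels are illustrative assumptions, not names confirmed by the study.

```python
# Illustrative arithmetic only; constant and column names are assumptions.
OBSERVATION_DAYS = 2   # popularity is tracked for 2 days after publication
SLICE_MINUTES = 20     # one measurement every 20 minutes

minutes_observed = OBSERVATION_DAYS * 24 * 60
time_slices = minutes_observed // SLICE_MINUTES

# Hypothetical column labels, one per 20-minute slice.
slice_columns = ["TS{}".format(i) for i in range(1, time_slices + 1)]

print(time_slices)  # 144
```

Each of the 12 feedback files would therefore carry 144 time-ordered measurements per news item.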

Table 1: Descriptions of attributes in news data file

The dataset, which is very large, is pre-processed and re-structured by discarding the instances that include N/A (null) values. The number of news items remaining in each category after the pre-processing steps is presented in Tab. 2.
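The discarding step described above can be sketched in plain Python (Python 3 assumed) as follows. This is a minimal illustration, not the authors' code: the function name, the set of missing-value markers and the sample rows are all assumptions made for the example.

```python
# Values treated as "N/A (null)"; this list is an assumption for illustration.
MISSING_MARKERS = (None, "", "N/A")

def drop_incomplete(rows, missing=MISSING_MARKERS):
    """Keep only the rows in which no field holds a missing-value marker."""
    return [
        row for row in rows
        if not any(value in missing for value in row.values())
    ]

# Hypothetical sample rows; the field names are placeholders.
rows = [
    {"IDLink": 1, "Facebook": 12, "GooglePlus": 3, "LinkedIn": 0},
    {"IDLink": 2, "Facebook": None, "GooglePlus": 1, "LinkedIn": 4},
    {"IDLink": 3, "Facebook": 7, "GooglePlus": "N/A", "LinkedIn": 2},
]

clean_rows = drop_incomplete(rows)
print(len(clean_rows))  # 1
```

Note that a legitimate zero count is kept, because only the listed markers are treated as missing.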

Table 2: The number of instances in social media sources

2.2 Methods

In this study, the modeling and prediction of news popularity in social media sources are performed using GBM, MLP and RF, which are among the popular machine learning algorithms, and the experimental results are compared.

Briefly, GBM builds new models sequentially during learning in order to predict the target variable better. The goal is to construct new base learners that are maximally correlated with the negative gradient of the loss function associated with the whole ensemble [Friedman (2001)].
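The boosting idea described above can be illustrated with a toy sketch in plain Python (Python 3 assumed). It assumes squared-error loss, for which the negative gradient is simply the residual, and uses one-split decision stumps on a single feature as base learners. The function names, the learning rate, the number of rounds and the data are illustrative assumptions; this is not the configuration used in the study.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump; returns (threshold, left_value, right_value).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        # For the loss 0.5 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return (base, learning_rate, stumps), preds

def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lv if x <= t else rv for t, lv, rv in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy data, invented for the example.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 5.0, 6.0, 6.0]

model, train_preds = boost(xs, ys)
mse_before = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_after = mse(ys, train_preds)
print(mse_before > mse_after)  # True
```

Each round moves the ensemble a small step along the direction that most reduces the training loss, which is why the error after boosting is lower than that of the constant baseline.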

The back-propagation MLP is a feed-forward network that updates its weights based on the differences between the predicted and actual values of the target variable. The main idea is to iteratively minimize the mean squared error between the actual and predicted values [Alpaydin (2010)].

The RF, introduced by Breiman, is an ensemble learning algorithm built from randomized decision trees. The main difference from a single decision tree is that a decision tree searches for the best attribute among all attributes when splitting a node, whereas the RF searches for the best attribute within a random subset of the attributes. The resulting ensemble generally models the data better than a single tree [Breiman (2001)]. The internal parameters of the algorithms and their values were assigned as given in Tab. 3.

Table 3: Internal parameters for algorithms
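In Breiman's formulation, each tree is grown on a bootstrap sample of the instances, and each node split considers only a random subset of the attributes. The sketch below (Python 3 assumed) illustrates just these two sources of randomness. The function names are hypothetical, and the square-root rule for the subset size is a common heuristic assumed here, not a setting reported by the study.

```python
import math
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, rng):
    """Pick the random subset of attribute indices examined at one node split.

    The sqrt(n_features) subset size is an assumed heuristic.
    """
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(42)
sample = bootstrap_indices(10, rng)       # indices may repeat
features = candidate_features(146, rng)   # 146 input attributes, as in this study

print(len(sample), len(features))  # 10 12
```

A single decision tree would instead evaluate all 146 input attributes at every split.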

3 Experiments and results

The proposed study consists of two main modules: data processing and machine learning. The first module carries out the preparation steps given in Pseudo Code 1 for the machine learning module. In addition to the original data retrieved from the social media sources, the pre-processed dataset contains the sentiment scores of both the title and the headline of the news items. Therefore, the pre-processed datasets are described by 147 attributes (2 sentiment values, for title and headline, 144 measurements, and the outcome variable, the news items' popularity). The flowchart of the proposed study is shown in Fig. 1.

Figure 1: The flow chart of the study

The information on the attributes of these datasets is presented in Tab. 4. All data collection and processing procedures mentioned in these steps are implemented in Python 2.7 on the Anaconda platform.

Table 4: The information of attributes for these datasets

The second module, news popularity prediction, receives the processed data and splits it into training and test sets in order to evaluate the performance of the prediction models GBM, MLP and RF. The steps of this module, given in Pseudo Code 2, are executed on the 'Knime' platform with integrated Python programming imported from the 'protobuf' library. Python code can run in a node on this platform. The 'numpy' and 'pandas' libraries are used in building both modules to handle the large volume of data.

In our study, the performances of the models are evaluated using the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE) and the R-squared coefficient (R2) in order to assess how well the predictions match the actual results. These metrics are given by the following equations, respectively.

\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |e_i| \]

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{2}} \]

The MAE and RMSE metrics are based on statistical summaries of $e_i$ ($i = 1, 2, \ldots, n$), where $e_i = P_i - O_i$ is the individual model prediction error, $n$ is the number of data instances, and $P_i$ and $O_i$ are the predicted and observed values, respectively [Willmott and Matsuura (2005)].

\[ R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}} \]

where $y_i$ is the observed response variable, $\bar{y}$ its mean and $\hat{y}_i$ the corresponding predicted values. The R2 coefficient measures the proportion of the variation in the target variable that is explained by the model. It typically takes a value between 0 and 1, where 1 corresponds to a perfect fit of the model [Alexander, Tropsha and Winkler (2015)].
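The three measures described above can be computed with a few lines of plain Python (Python 3 assumed). This is a minimal sketch using the standard textbook definitions; the function names and the sample values are illustrative assumptions.

```python
import math

def mae(observed, predicted):
    """Mean absolute error: the average of |P_i - O_i|."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean squared error: square root of the average of (P_i - O_i)^2."""
    squared = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

def r_squared(observed, predicted):
    """R-squared: 1 minus the residual sum of squares over the total sum of squares."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Invented sample values for illustration.
observed = [3, 5, 2, 7]
predicted = [2, 5, 4, 8]

print(mae(observed, predicted))                   # 1.0
print(round(rmse(observed, predicted), 4))        # 1.2247
print(round(r_squared(observed, predicted), 4))   # 0.5932
```

Because RMSE squares the errors before averaging, the single error of 2 weighs more heavily in it than in MAE.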

This study analyzes the attributes of news data in social media sources and evaluates the performances of the RF, GBM and MLP algorithms for news popularity prediction. A randomly selected 70% of the data is used as the training set, and the remainder is used as the test set. Thus, the models are first trained on the training sets and then tested on the test sets. The R2, MAE and RMSE measures are used to evaluate the performances of the models in all experiments. Tabs. 5-8 compare the performances of the models obtained according to the Pseudo Code 2 algorithm on the datasets. This module also indicates that the sentiment scores of the news and the final value of the news items' popularity are highly influential in predicting news popularity. Sentiment analysis, also known as opinion mining, is a field of text mining that examines people's opinions, judgments and ideas about entities [Liu and Zhang (2012)]. The qdap R package [Rinker (2013)] is used to obtain the sentiment scores.
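The random 70/30 partition mentioned above can be sketched in plain Python (Python 3 assumed) as follows. The function name, the fixed seed and the toy data are assumptions made for reproducibility of the example; the study itself performs this step inside its own workflow.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Ten placeholder instances standing in for news items.
rows = list(range(10))
train, test = train_test_split(rows)

print(len(train), len(test))  # 7 3
```

Shuffling before cutting keeps the split random rather than dependent on the order in which the news items were stored.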

Tab. 5 shows the performances of the models on the social media sources for the Economy dataset. As shown in this table:

a) All algorithms perform satisfactorily on the Facebook source for the Economy dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the MLP-based model on this source.

b) All algorithms perform satisfactorily on the Google+ source for the Economy dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the RF-based model on this source.

c) All algorithms perform satisfactorily on the LinkedIn source for the Economy dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the RF-based model on this source.

Table 5: The performances of the models for Economy dataset

Table 6: The performances of the models for Microsoft dataset

Tab. 6 shows the performances of the models on the social media sources for the Microsoft dataset. As shown in this table:

a) All algorithms perform satisfactorily on the Facebook source for the Microsoft dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the RF-based model on this source.

b) All algorithms perform satisfactorily on the Google+ source for the Microsoft dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the MLP-based model on this source.

c) All algorithms perform satisfactorily on the LinkedIn source for the Microsoft dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the MLP-based model on this source.

Tab. 7 shows the performances of the models on the social media sources for the Obama dataset. As shown in this table, all algorithms perform satisfactorily on the Facebook, Google+ and LinkedIn sources for the Obama dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the RF-based model for all sources.

Table 7: The performances of the models for Obama dataset

Table 8: The performances of the models for Palestine dataset

Tab. 8 shows the performances of the models on the social media sources for the Palestine dataset. As shown in this table:

a) All algorithms perform satisfactorily on the Facebook source for the Palestine dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the MLP-based model on this source.

b) All algorithms perform satisfactorily on the Google+ source for the Palestine dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the RF-based model on this source.

c) All algorithms perform satisfactorily on the LinkedIn source for the Palestine dataset. Further, the MAE measures are the same for all models. The maximum R2 and minimum RMSE measures are obtained with the MLP-based model on this source.

Since the datasets used in this study were newly released in February 2018, there is no published study that uses them. However, because the subject is popular, machine learning studies have been performed on other datasets. For this reason, sample studies on the use of machine learning for different datasets are presented in Tab. 9.

Table 9: Sample of studies performed on different datasets

4 Conclusion

News conveys the newsworthy events of the day to people. News popularity is measured through people's interaction with news websites or social media platforms, where they share their opinions or news links. Scientists use social media data since it reflects user behavior in the real world. This study uses a set of data consisting of news items and their popularity in the social media sources Facebook, Google+ and LinkedIn. It is composed of 12 data files, one for each combination of the Economy, Microsoft, Obama and Palestine categories and the social media sources. The study consists of two phases: the preparation of the data and the design of the prediction models. The pre-processed datasets are described by 147 attributes (2 sentiment values, for title and headline, 144 measurements of popularity at 20-minute intervals over a total of 2 days, and the outcome variable, the news items' popularity). The prediction models designed by utilizing the GBM, MLP and RF learning algorithms are introduced for the twelve datasets and empirical tests are performed. The success of most models on each dataset is approximately the same. Further, this study will provide a beneficial reference for news popularity prediction.

Acknowledgement: The authors would like to thank Fernandes et al. [Fernandes, Vinagre and Cortez (2015)] for providing the datasets.