Automatic Satisfaction Analysis in Call Centers Considering Global Features of Emotion and Duration

Jing Liu, Chaomin Wang, Yingnan Zhang, Pengyu Cong, Liqiang Xu, Zhijie Ren, Jin Hu, Xiang Xie, Junlan Feng and Jingming Kuang

(1. School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China; 2. China Mobile Communications Corporation, Beijing 100053, China)

Call centers are widely used for customer service, technical support and sales, and call center agents are the key to the success of a call center operation. To evaluate the job performance of agents, 7 quantitative indicators are proposed in Ref. [1]: service quality indicator, test score, personal attendance, total calls per hour, first call resolution, survey success rate and customer satisfaction. Measuring and monitoring customer satisfaction is an essential issue, since deficiencies of services and businesses can be clearly identified through the analysis of customer satisfaction. A large amount of dialogue data is produced in call centers every day, far more than can be processed manually, so an intelligent system that performs satisfaction analysis automatically is greatly needed.

Meanwhile, paralinguistic information has received more and more attention recently. Since 2010, INTERSPEECH has held a computational paralinguistics challenge whose goal is to identify additional information from a speaker's voice, such as personality, likability, intoxication, social signals and so on[2-6]. Inspired by these studies, this paper concentrates on the degree of customer satisfaction with the services.

Various studies have been carried out to investigate customer satisfaction. Research in Ref.[7] shows that agent traits can influence customer satisfaction; according to the authors, the knowledgeableness and preparedness of an agent are good indicators of his/her service quality. The work in Ref.[8] predicts customer satisfaction using affective and textual features based on the context of a customer in social media; the affective features contain the customer's and agent's personality traits and emotion expression. It is demonstrated in Ref.[9] that negative emotion between a customer and an agent, especially anger, can deliver useful information for analyzing customer satisfaction. The authors use acoustic and lexical features to recognize the customers' emotions and compute the proportion of emotional turns as the indicator of customer satisfaction.

In our research, we have collected thousands of dialogue recordings from call centers to analyze customer satisfaction without a speech recognizer[10]. Since the customer may talk while driving, riding a bus, taking the subway and so on, channel noise, background noise and talking style can dramatically decrease recognition accuracy, so it is hard to recognize the speech with high accuracy. Our method is to extract acoustic features from the customers' fragments to recognize their emotions, and then to extract global features of emotion and duration to analyze the satisfaction based on the emotion recognition result.

1 Data Processing

The corpus is gathered from China Mobile's call center, which provides support to Shanghai customers in Chinese. After each phone call, the customer is asked to give feedback by short message on whether he/she is satisfied with the agent's service. The database contains 5 684 recorded audio files, of which 1 170 recordings (60 h) are labeled as dissatisfaction and 4 514 recordings (100 h) are labeled as satisfaction. The duration of each dialogue varies from 20 s to 20 min. The sampling frequency is 6 kHz and the resolution is 16 bit. The training set contains 836 dissatisfaction recordings and 4 180 satisfaction recordings. The testing set contains 334 dissatisfaction recordings and 334 satisfaction recordings.

1.1 Segmentation and annotation

The whole segmentation process has two steps: automatic segmentation and manual correction. Each fragment is labeled with one of four labels: agent voice (A), customer voice (B), silence & noise (S) and overlap (AB). Overlapped fragments contain more than one speaker. The automatic segmentations are produced by a commercial ASR engine, and the segmentation points are then corrected manually and labeled with the speaker tag.
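As a small illustration of how the segmentation output could be represented downstream (the field names and structure here are our own assumptions, not part of the paper), consider the following sketch:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container for one segmented fragment; the speaker tags follow the
# scheme above: A (agent), B (customer), S (silence & noise), AB (overlap).
@dataclass
class Fragment:
    start: float    # start time in seconds
    end: float      # end time in seconds
    speaker: str    # "A", "B", "S" or "AB"

def customer_fragments(fragments: List[Fragment]) -> List[Fragment]:
    """Keep only the customer-voice fragments, which are the object of the emotion annotation."""
    return [f for f in fragments if f.speaker == "B"]
```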

The object of emotion annotation is the customer voice. Six emotion labels are used: hot anger (HA), cold anger (CA), boredom (B), disappointment (D), neutral (N) and joy (J). The annotation group has three annotators, all college students of about 24 years old. Before the annotation, the three annotators are trained to recognize the six kinds of emotion, and they label all of the customers' fragments. The customers' emotions are then manually grouped into positive emotions (neutral and joy) and negative emotions (hot anger, cold anger, boredom and disappointment). When a fragment receives the same tag from more than one annotator, we take it as a sample for the study. In total, we obtain 5 478 negative emotion fragments and 5 647 positive emotion fragments.
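The agreement rule described above (a fragment is kept when at least two of the three annotators give it the same tag) could be implemented as in the following sketch; the tag-to-polarity mapping follows the grouping in the text, while the function name is illustrative:

```python
from collections import Counter

NEGATIVE = {"HA", "CA", "B", "D"}   # hot anger, cold anger, boredom, disappointment
POSITIVE = {"N", "J"}               # neutral, joy

def consensus_polarity(tags):
    """Return 'neg' or 'pos' when at least two annotators agree on the same
    emotion tag, otherwise None (the fragment is not used as a sample)."""
    tag, count = Counter(tags).most_common(1)[0]
    if count < 2:
        return None
    return "neg" if tag in NEGATIVE else "pos"

# Two annotators chose cold anger, one chose neutral -> kept as a negative sample.
assert consensus_polarity(["CA", "CA", "N"]) == "neg"
```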

1.2 Negative emotion distribution

Fig.1 Negative emotion distribution (X — ratio of negative emotion fragments in a dialogue; Y — ratio of recordings in the dataset containing that ratio of negative emotion fragments)

The purpose of our system is to identify the satisfaction recordings. An investigation is performed to find the relation between satisfaction and the customers' emotion fragments. Fig.1 shows the correlation between negative emotions and satisfaction. The x axis represents the ratio of negative fragments to all the fragments in a dialogue. The y axis represents, within each panel, the ratio of recordings to all the satisfaction (or dissatisfaction) recordings. For example, 0% on the x axis means the recording does not contain any negative emotion, and 78% on the y axis means that 78% of the satisfaction recordings do not contain any negative emotion.
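The statistics plotted in Fig.1 can in principle be reproduced by histogramming the per-dialogue ratio of negative fragments; the following sketch assumes one list of 'neg'/'pos' fragment labels per dialogue (an input format of our own choosing):

```python
import numpy as np

def negative_ratio_distribution(dialogues, bins=np.linspace(0.0, 1.0, 11)):
    """For each dialogue compute the ratio of negative fragments, then return the
    fraction of recordings that falls into each ratio bin (the y axis of Fig.1)."""
    ratios = [sum(lab == "neg" for lab in d) / len(d) for d in dialogues if d]
    counts, _ = np.histogram(ratios, bins=bins)
    return counts / max(len(ratios), 1)
```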

Fig.1 illustrates that negative emotions are distributed differently between satisfaction recordings and dissatisfaction recordings. Fig.1a shows the negative emotion distribution among all the satisfaction recordings, while Fig.1b shows it among all the dissatisfaction recordings. From Fig.1 we can see that only 22% of the satisfaction recordings contain negative emotions, but all of the dissatisfaction recordings do, so it is effective to analyze customer satisfaction by recognizing emotions. However, it is not sufficient to merely consider the ratio of negative emotions, because some satisfaction recordings also contain negative emotions; the position and duration of the negative emotions in the recordings need to be considered as well.

2 Features

2.1 Acoustic features

We employ openSMILE[11-12], a feature extraction toolkit for speech, to extract 384 features with a predefined configuration file[13]. The details are exhibited in Tab.1. The low level acoustic features are extracted at the frame level. These low level descriptors (LLDs) and their delta coefficients are projected onto 12 statistical functionals, so the total number of features is 16×2×12=384.

Tab.1 Details of 384 features
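For reference, the 384-dimensional feature vector can be produced by calling the openSMILE command-line extractor with the INTERSPEECH 2009 emotion configuration; the binary and configuration paths below are assumptions that depend on the local installation:

```python
import subprocess

def extract_is09_features(wav_path, out_path,
                          smile_bin="SMILExtract",
                          config="config/IS09_emotion.conf"):
    """Run openSMILE on one fragment: 16 LLDs and their deltas are summarized
    by 12 functionals, giving the 16 x 2 x 12 = 384 features listed in Tab.1."""
    subprocess.run([smile_bin, "-C", config, "-I", wav_path, "-O", out_path],
                   check=True)
```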

2.2 Global features of emotion and duration

Global features are extracted based on the emotion confidence, which is the output of the emotion recognition step. The emotion confidence measures the intensity of the emotion expression: the larger its absolute value, the more obviously the emotion is expressed. According to the annotation experiments and data statistics, we find that the position of the customers' negative emotions has an influence on the satisfaction degree of the dialogue; the later a negative emotion appears, the more important it is. So the statistic features contain not only the information of emotion intensity but also the information of emotion position. The dialogue is divided into beginning, middle and ending parts according to its duration and the number of fragments, and the negative emotion rate and negative emotion intensity are calculated for each part. The durations of the customer and the agent also differ greatly between satisfied and unsatisfied dialogues; generally speaking, the customer's duration is longer than the agent's in an unsatisfied dialogue. So 13 rhythm features are added which contain the information of the customers' and agents' durations and interaction. There are 54 global features in total. The details are shown in Tab.2.

Tab.2 Details of 54 features
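A minimal sketch of the idea behind the region-based emotion features is given below; it is not the authors' exact 54-feature set, and it assumes the per-fragment emotion confidences are already available, with positive values indicating negative emotion (as defined in the experiments section):

```python
import numpy as np

def region_emotion_features(confidences, n_regions=3):
    """Split one dialogue's customer fragments into beginning / middle / ending and,
    per region, compute the negative emotion rate and the mean negative emotion
    intensity. Confidences > 0 are treated as negative emotion."""
    feats = []
    for region in np.array_split(np.asarray(confidences, dtype=float), n_regions):
        neg = region[region > 0]
        rate = neg.size / region.size if region.size else 0.0
        intensity = float(neg.mean()) if neg.size else 0.0
        feats.extend([rate, intensity])
    return feats   # 3 regions x 2 statistics; duration/rhythm features would be appended
```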

3 Baseline System and Proposed System

3.1 Baseline system

The baseline system assumes that customer satisfaction is constant during the dialogue. It only extracts the 384 acoustic features from the customers' voice, without considering the global features of emotion and duration, to analyze the customer satisfaction. The basic framework is shown in Fig.2.

Fig.2 Overview of baseline system

3.2 Proposed system

Our system analyzes customer satisfaction based on local acoustic features and global features of emotion and duration. The system consists of two steps: local emotion recognition and global customer satisfaction analysis. In the first step, we detect the customer's emotions at the customer fragment level using the acoustic features. In the second step, we use the result of the first step at the whole-utterance level to analyze the customer satisfaction. Fig.3 shows the diagram of the proposed system.

Fig.3 Overview of the proposed system

Firstly, all the customer fragments are collected from the dialogue, and the acoustic features are extracted from every fragment. The purpose of the emotion recognition is to obtain the emotion confidence of every fragment. Next we perform the satisfaction analysis: the input of the model is the global features of emotion and duration, which are extracted based on the emotion confidence, and the output of the model is the customer satisfaction.
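A compact sketch of this two-step pipeline is shown below, using scikit-learn-style classifier interfaces purely for illustration. Here `emotion_svm` and `satisfaction_svm` stand for two already-trained classifiers, `fragment_features` is the matrix of 384-dimensional acoustic features for one dialogue's customer fragments, and `global_features_fn` is a function such as the region-based sketch in Sec. 2.2; all of these names are assumptions, not the authors' code:

```python
import numpy as np

def analyze_satisfaction(fragment_features, emotion_svm, satisfaction_svm, global_features_fn):
    """Step 1: per-fragment emotion confidence (signed SVM score).
    Step 2: dialogue-level satisfaction decision from the global feature vector."""
    confidences = emotion_svm.decision_function(np.asarray(fragment_features))
    global_vec = global_features_fn(confidences)
    return satisfaction_svm.predict([global_vec])[0]
```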

4 Experiments and Results

An SVM[14-15] classifier with a radial basis function kernel is used in both the baseline and the proposed systems. The optimal cost parameter C and kernel parameter g are obtained by a 5-fold cross validation approach. The performance of the system is measured by the F value, which is defined as the harmonic mean of precision (P) and recall (R):

F = 2PR/(P + R)
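The hyper-parameter search could look like the following sketch; the paper cites LIBSVM[15], and scikit-learn (which wraps LIBSVM) is used here only for illustration. The parameter grids are assumptions, since the actual search ranges are not reported:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid search for the RBF-SVM parameters C and gamma with 5-fold cross validation,
# scored by the F value; the search ranges are illustrative.
param_grid = {"C": [2 ** k for k in range(-5, 6)],
              "gamma": [2 ** k for k in range(-8, 3)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="f1")
# search.fit(X_train, y_train)   # X_train: feature matrix, y_train: 0/1 labels
# best = search.best_params_     # {'C': ..., 'gamma': ...}
```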

To obtain the emotion confidence, an SVM is utilized to classify the customers' emotions into negative and positive emotions. The emotion confidence is the signed distance between a sample point and the hyperplane of the SVM. When the emotion confidence is greater than zero, the corresponding sample is recognized as negative emotion; otherwise it is recognized as positive emotion. The typical emotional fragments are used to validate the emotion model. The training set contains 3 835 negative fragments and 3 953 positive fragments, and the testing set consists of 1 643 negative fragments and 1 694 positive fragments. The results are shown in Tab.3.

Tab.3 Results of emotion recognition

P — precision; R — recall; F — the harmonic mean of precision and recall
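The signed-distance confidence and the zero-threshold decision defined in this section can be illustrated with a small scikit-learn example on synthetic data (not the paper's corpus); the label convention here is 1 = negative emotion, 0 = positive emotion:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) + np.repeat([[0.0], [1.5]], 100, axis=0)  # two synthetic clusters
y = np.repeat([0, 1], 100)                                              # 1 = negative emotion

svm = SVC(kernel="rbf").fit(X, y)
confidence = svm.decision_function(X[:5])    # signed distance to the hyperplane
prediction = (confidence > 0).astype(int)    # > 0 -> recognized as negative emotion
```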

To evaluate the performance of the proposed system, we compare it with the baseline, which extracts the acoustic features over the whole utterance to analyze customer satisfaction without considering the customer's emotion. During the training process, we assign five ratios of dissatisfaction recordings to satisfaction recordings: 1∶1, 1∶2, 1∶3, 1∶4 and 1∶5. In order to ensure the robustness and practicability of the system, we use the recordings without manual correction as the training and test sets. The final sizes of the training set and testing set are shown in Tab.4. Five sets of experiments are conducted for comparison, and Tab.5 shows the results in detail.

Tab.4 Size of training set and testing set

n — the factor which controls the ratio of unsatisfied recordings to satisfied recordings in the training set
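One possible way to assemble a 1∶n training set is to keep all dissatisfaction recordings and subsample the satisfaction recordings; this is an assumption, as the paper does not describe its sampling scheme:

```python
import random

def build_training_set(dissat, sat, n, seed=0):
    """Keep all dissatisfaction recordings and draw n times as many
    satisfaction recordings (or all of them, if fewer are available)."""
    rng = random.Random(seed)
    k = min(len(sat), n * len(dissat))
    return dissat + rng.sample(sat, k)
```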

Tab.5 Satisfaction analysis results

5 Conclusion

Tab.5 shows that the proposed system performs better than the baseline system. The average F value is improved from 0.664 to 0.701, an increase of 5.57%. The baseline assumes that the customer's attitude does not vary, and only the acoustic features are used to analyze the satisfaction. But in a real conversation between a customer and an agent, the interaction happens more than once, and the customer can utter multiple sentences during the interaction. One difficulty in analyzing customer satisfaction is its ambiguity: not all of the sentences exhibit the characteristics of satisfaction or dissatisfaction. However, almost all of the dissatisfaction recordings contain negative emotions. So the proposed system, which combines the local acoustic features and the global features of emotion and duration, can improve the effectiveness of customer satisfaction analysis.

From the experiments, the SVM has the best performance (F value of 0.710) when the ratio is 1∶3. It can be concluded that the ratio of dissatisfaction recordings to satisfaction recordings in the training set has an influence on the system performance.

In summary, a method is proposed to analyze customer satisfaction using acoustic features and global features of emotion and duration. The acoustic features are used to recognize the customer's emotion on the customer's fragments. Then, global features of emotion and duration are extracted based on the emotion recognition results and used to conduct the satisfaction analysis. The global features contain not only the intensity of the customers' emotions but also their position and duration. Experiments show that this method improves the F value of the system by 5.57%.

In future work, we will pay attention to two aspects. Firstly, we will try to classify the customers' emotions into multiple classes to analyze them in more detail. Secondly, we will shorten the unit of emotion recognition: in the current experiments each customer turn is used as a unit no matter how long it is, so we would like to segment customer turns into units of suitable length for emotion analysis.

[1] Hsu H H, Chen T C, Chan W T, et al. Performance evaluation of call center agents by neural networks[C]∥2016 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA). Crans-Montana, Switzerland: IEEE, 2016: 964-968.

[2] Burkhardt F, Schuller B, Weiss B, et al. “Would you buy a car from me?”-On the likability of telephone voices[C]∥INTERSPEECH, Florence, Italy, 2011.

[3] Schuller B, Batliner A, Steidl S, et al. The INTERSPEECH 2011 Speaker state challenge[C]∥Proceedings INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 2011.

[4] Schuller B, Steidl S, Batliner A, et al. The INTERSPEECH 2012 speaker trait challenge[C]∥INTERSPEECH, Portland, Oregon, USA, 2012.

[5] Schuller B, Steidl S, Batliner A, et al. The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism[C]∥INTERSPEECH 2013, Conference of the International Speech Communication Association, 2013.

[6] Schuller B, Steidl S, Batliner A, et al. The INTERSPEECH 2014 computational paralinguistics challenge: cognitive & physical load[C]∥INTERSPEECH, Max Atria, Singapore, 2014.

[7] Froehle C M. Service personnel, technology, and their interaction in influencing customer satisfaction[J]. Decision Sciences, 2006, 37(1): 5-38.

[8] Herzig J, Feigenblat G, Shmueli-Scheuer M, et al. Predicting customer satisfaction in customer support conversations in social media using affective features[C]∥Proceedings of the 2016 Conference on User Modeling, Adaptation and Personalization, Halifax, Canada, 2016.

[9] Vaudable C, Devillers L. Negative emotions detection as an indicator of dialogs quality in call centers[C]∥2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan: IEEE, 2012.

[10] Xie Xiang, Kuang Jingming. Mandarin digits speech recognition using support vector machines[J]. Journal of Beijing Institute of Technology, 2005, 14(1): 9-12.

[11] Eyben F, Wöllmer M, Schuller B. Opensmile: the munich versatile and fast open-source audio feature extractor[C]∥Proceedings of the 18th ACM international Conference on Multimedia, Firenze, Italy, 2010.

[12] Eyben F, Weninger F, Gross F, et al. Recent developments in openSMILE, the munich open-source multimedia feature extractor[C]∥Proceedings of the 21st ACM international conference on Multimedia, Barcelona, Spain, 2013.

[13] Schuller B, Steidl S, Batliner A. The INTERSPEECH 2009 emotion challenge[C]∥INTERSPEECH, Brighton, United Kingdom, 2009: 312-315.

[14] Zhang Xuegong. Introduction to statistical learning theory and support vector machines[J]. Acta Automatica Sinica, 2000, 26(1): 32-42. (in Chinese)

[15] Chang C C, Lin C J. LIBSVM: A library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 27.
