Improved Logistic Regression Algorithm Based on Kernel Density Estimation for Mu

时间：2024-07-28

Yang Yu,Zeyu Xiong,Yueshan Xiongand Weizi Li

Abstract:Logistic regression is often used to solve linear binary classi fication problems such as machine vision,speech recognition,and handwriting recognition.However,it usually fails to solve certain nonlinear multi-classi fication problem,such as problem with non-equilibrium samples.Many scholars have proposed some methods,such as neural network,least square support vector machine,AdaBoost meta-algorithm,etc.These methods essentially belong to machine learning categories.In this work,based on the probability theory and statistical principle,we propose an improved logistic regression algorithm based on kernel density estimation for solving nonlinear multi-classi fication.We havecomparedourapproachwithothermethodsusingnon-equilibriumsamples,theresults show that our approach guarantees sample integrity and achieves superior classi fication.

Keywords:Logistic regression,multi-classi fication,kernel function,density estimation,non-equilibrium.

1 Introduction

Machine Learning has become one of the most popular fields in recent years.There are two main tasks of Machine Learning:1)classi fication,which goal is to divide instances into the appropriate categories,and 2)regression,which goal is to study relationship between samples.The most basic classi fication problem is binary classi fication.which can be solved using algorithms such as Naive Bayes(NB),support vector machine(SVM),decision tree,logistic regression,KNN,neural network,etc.More generally,multi-classi fication problems such as identifying handwritten digits 0～9,and and labeling document topics have gained much attention recently.To provide few examples,Liu et al.[Liu,Liang and Xue(2008)]proposed a multi-classi fication algorithm based on fuzzy support vector machines,whichprovidesbetterclassi ficationaccuracyandgeneralizationabilitycompared with traditional One-vs.-Rest methods.Tang et al.[Tang,Wang and Chen(2005)]proposed a new multi-classi fication algorithm based on support vector machine and binary tree structure to solve the problem of non-separable regions.

In the existing regression algorithm,support vector machines are mostly used for multi-classi fication problem,but there are some limitations in algorithm.The logistic regression algorithm can only solve the problem of dichotomy and linear classi fication.Support vector machines typically support only small training samples and are equally dif ficult to deal with multiple classi fication problems.Naive Bayes is based on the assumption that the characteristic conditions are independent.Once the dataset does not satisfy this assumption,its classi fication accuracy will be greatly affected.

In order to solve the problem above,towards dif ficult for implement large scale samples,not applicable to multi-classi fication and uncertainty to constraint conditions,Chen et al.[Chen,Chen,Mao et al.(2013)]proposed a model of Density-based Logistic Regression(DLR),which has a good result in practical application.Our model is based on kernel density-based logistic regression and we construct a new kernel function for multi-classi fication problems.This has three advantages:1)It makes better improvements to classi fication effect.2)It is an extension of DLR model to multi-classi fication problems.3)It shows good generalization performance on nonlinear and unbalanced data.We will describe the theoretical rationality and check classifying quality according to practical application for our new model.

The rest of the paper is organized as the following.In Section 2,we explain background knowledge including logistic regression binary classi fication,multi-classi fication,SoftMax and DLR model.In Section 3,we introduce several solutions for multi-classi fication problems with imbalanced samples.In Section 4,we explain our approach in details.In Section 5,we compare our approach to other methods and analyze the performances.Finally,we conclude in Section 6.

2 Logistic regression and related knowledge

2.1 Logistic regression

Logistic regression is based on linear regression,and a sigmoid logic function is applied,which is a logarithmic probability function.Logistic regression is represented as follows,

In the model of sigmoid function,zvalues are distributed within the range of[0,1].When the independent variable is taken near 0,thez-value change curve is very steep,while thezvalue is relatively stable at other values.Therefore,the binary classi fication tasks can be handled well if taking 0 as the boundary.However,it is sometimes dif ficult to make the representation model approximate to the expected model.By adding a constant termbto the function,

By substituting Eq.(2)into Eq.(1),we have

Based on these formulae,assuming a given datasetD={xi,yi},i=1,···,N,xi∈ R D,D is the dimension of samples,andyi∈{0,1},logistic regression is described as follows:

wherewstands for feature weight,which is a parameter to be learned.φis the characteristic transformation function.

In LR modelφis usually de fined to be equal tox.The key step is to learn unknown parameterswandb.Ifyin Eq.(3)is regarded as posterior probability estimationp(y=1|x),Eq.(4)can be rewritten as:

Thenwcan be obtained by the maximum likelihood estimate.With the definition ofbi=p(yi=1|xi),y=0 or 1,for a single sample,the posterior probability is,

Then,the maximum likelihood function is represented as follows,

For the convenience of calculation,the negative log of the maximum likelihood function is used as the objective function to be optimized,

Since the maximizing likelihood probability is equivalent to minimizing negative likelihood probability,the last step is to minimize the Loss function.

2.2 Density-based logistic regression

In the DLR model,φis a function that mapsxto the eigenspace,

whereDis the dimension of the input data,lnmeasures the contribution ofxdto the probability ofy=1,andmeasures the degree of imbalance for datasets.p(y=1)is the proportion of data in the training set,whose label isy=1.Nadaraya-Watson is usually used to estimatep(y=k|xd)wherek=0,1.

whereDk⊆Dis the subset of data in class k,andK(x,y)is a Gaussian kernel function de fined as follows,

wherehdis the bandwidth of the kernel density function.Thehdis usually set using the Silverman’s rule of thumb[Silverman and Green(1986)],

whereNis the total number of samples andσis the standard deviation ofxd.

Next we need to trainwthrough the learning algorithm untilwconverges.Givenbi=p(yi=1|xi),the loss function based on likelihood probability is calculated as follows,

2.3 Extension of logistic regression to multiple classi fication

Since the logistic classi fication is a binary classi fication model,it is necessary to extend it for multiple classi fication,common extensions include multiple binary classi fication models or SoftMax models.

2.3.1 N-logistic model

The N-logistic model generally adopts One-vs-Rest or One-vs.-One.When classifying a sample,we first classify the two classi fiers,then vote,and select the category with the highest score.At the same time,to prevent the same vote,we also add the probability of the class to each classi fier in the voting.The predictive accuracy of these two approaches is usually very similar,so unless there is a speci fic need for data characteristics,it is generally arbitrary to choose one approach to calculate.

2.3.2 SoftMax model

SoftMax regression is a generalization of logistic regression to multiple classi fication problems.Its basic form is described as follows,

When in the test,to samplex,if there is a categoryc,for all the other categoryc *(c * /=c)meet thep(y=c|x)＞p(y=c *|x),thenxbelongs to the categoryc.

On the question of choosing N-logistic model or SoftMax model,many scholars have conducted in-depth exploration.Currently,it is accepted that it is necessary to investigate whether the various categories are mutually exclusive.If there is a mutual exclusion relationship between the categories to be classi fied,we’d better choose SoftMax classi fier.Whileifthereisnomutualexclusionbetweencategories,andthecategoriesareintersecting,it is best suited to the N-logistic classi fier.We verify this conclusion according to corresponding datasets in Section 5.

3 Analysis of the classi fication results with unbalanced sample proportion

In our actual classi fication tasks,there are often needs to deal with problems with unbalanced data sample proportions.For example,the ratio of positive and negative samples in a dataset is 10:1,including 100 positive classes and 10 negative classes.If using this kind of data to train a classi fier,it is very likely that the test data will be divided into positive classes.Obviously,this classi fier is invalid.

For this kind of data,traditional logistic regression method usually fails to work.In recent years,studies on the problem of unbalanced classi fication have been very active[Ye,Wen and Lv(2009)].In this section we introduce several common approaches to solve the problem of sample imbalance classi fication.

3.1 Obtain more samples

For unbalanced classi fication,the first solution is to obtain more samples and expand a few samples to balance the sample proportion.However,in most cases,the sampling procedure needs speci fic conditions.Thus,it is generally dif ficult to obtain more samples under the same conditions.

3.2 Sampling methods

The general sampling method is mainly based on modifying the number of unbalanced samples.The research of Estabrooks et al.[Estabrooks,Jo and Japkowicz(2004)]show that the general sampling method has a better effect on solving unbalanced classi fication problems.

3.2.1 Under-sampling method

Under-sampling method is also called down-sampling[Gao,Ding and Han(2008)],which is to eliminate some samples from majority class samples,so that the number of samples in the whole group tends to be balanced.The commonly used method is random under-sampling downward method.The method is based onNmin,the number of minority class samples.We randomly sample from the majority class samples and eliminateNsamples,and thenNmax-N=Nmin,so the samples are balanced.

3.2.2 Over-sampling method

Over-sampling method is also called up-sampling,which refers to increase the number of minority class samples.The method of adding a small number of minority class samples(random over-sampling method)or re-fitting some new data in accordance with some law can be used to make the number of samples balanced.One commonly used method is Synthetic Minority Over-sampling Technique(SMOTE)[Chawla,Bowyer,Hall et al.(2002)].The method analyzes the distribution of the characteristic space of a few samples and proposes new samples.Compared to the random over-sampling method,the data added by SMOTE sampling method is completely new,which can follow the regular pattern in the original sample.The main idea of SMOTE is shown in Fig.1.

For each samplexin a minority class,the Euclidean distance of each sample point of a minority sample is calculated,and itskneighbors are obtained.A suitable sampling ratio is set according to the sample proportion to determine the sampling rateN.For each of the minority samplex,select several samples randomly from itskneighbors.For each random nearest neighborxn,a new sample is constructed with the original sample according to the following equation,

3.3 Modify evaluation index

For unbalanced classi fication,using accuracy to evaluate classi fiers may biases.For example,assuming ratio of positive and negative samples in a dataset is 9:1,and all samples are labelled be positive.Although the accuracy rate is up to 90%,the classi fier is useless.

Figure 1:The main idea of SMOTE method

Table 1:A hybrid matrix of binary classi fication

Therefore,accuracy can serve as a biased indicator.Davis et al.[Davis and Goadrich(2006)]proposed a new evaluation index named Precision and Recall,some factors are listed in Tab.1.

Precisionrefers to the proportion of positive samples in all predicted positive samples,andRecallrefers to the proportion of all actual positive samples that are being correctly predicted.

3.4 Use penalty items to modify the weights

If samples are dif ficult to sample directly,the method of modifying sample weights can be used.It increases the weight of minority class samples and reduces the weight of the majority class samples.Because the weights of minority class samples are high,they can lead to better classi fication results.The commonly used method is to add a penalty item to the majority class samples each time when training the sample weight.In general,we use the regularization method to add a penalty parameter to a objective function,this reduces the chance of the over fitting[Goodfellow,Bengio and Courville(2017)].The regularized objective function is shown below,

whereαis a parameter which represents the contribution of the penalty item and the objective function.The penalty can be adjusted by controllingα.Ifα=0,there is no penalty,otherwise the larger theα,the greater the penalty.

After we chose an appropriate penalty,the training regularize the objective function.In this way,the data error and the parameter scale can be reduced,the computation efficiency can be improved.But in practice,how to select the optimal penalty item is a complicated problem,which needs more tests.

3.5 Kernel-based methods

Towards general classi fication problem,we can assume that the sample data can be classi fieddirectlybylinearmodel.Inotherwords,thereisahyperplanethatcanseparatethe samples and ensure that the classi fication is correct.However,in practice,there is usually no such a hyperplane to partition the original data correctly,which means that the data are not linearly separable.For such a problem,we can consider preprocessing data.Using the principle of support vector machine,data in the low-dimensional space are transformed into the high dimensional space through nonlinear transformation,so that they can be linearly separable[Zhou(2016)].Using this method,the relationship between data samples can be written as dot product.For example,the linear regression function can be rewritten as follows,

wherex(i)is the training data.αis the coef ficient vector.Replacing the dot product with a function of the kernelk(x,x(i))=φ(x)·φ(x(i)),we can get,

This function is nonlinear with respect tox,while it is linear with respect toφ(x).

Kernel function can deal with nonlinear unbalanced classi fication well.It uses a convex optimization technique to address nonlinear problems in a linear manner.At the same time,this method can guarantee convergence and improve the accuracy of classi fication.And there is some simpli fication in parameter determination.In addition,it is much more efficient to use the kernel function to transform data into a transformation function[Goodfellow,Bengio and Courville(2017)].

SVM can convert sample data into high dimensional feature space through a kernel function.According to the principle of maximum spacing of SVM,the hyperplane of the optimal classi fication can be constructed in the characteristic space of high dimension to realizetheclassi fication.Iftheintervalofclassi ficationcanbeextended,especiallybetween minority class samples and the optimal classi fication hyperplane,the generalization performance of the classi fier and the accuracy of classes with small samples can be effectively improved.This enables the correct classi fication of unbalanced data[Liu,Huang,Zhu et al.(2009)].

4 Improved method of kernel density estimation model for multi-classi fication

We extend the DLR model to solve the multi-classi fication problem and design an improved multi-classi fication algorithm.Assuming there areCclasses,fork=1,2,...,C,the DLR model is de fined as follows,

wherewk=(wk1,wk2,...,wkD)is the feature weighting parameter of classk,andφk=(φk1,φk2,...,φkD)is the characteristic transformation function of classk.

According to the Nadaraya-Watson estimator,the probability formula of classkis obtained as follows:

Finally,we need to minimize the loss function,

where,1yi=kis 1 if and only ifyi=k,otherwise it takes value 0.

Now we present the process of evaluating the gradient of the Loss function with respect towk,

We adjust the weightwkaccording to the direction of the gradient descent,until thewkconverges andwkin the model is well trained.During the testing,the same kernel function transformation is performed on the testing data.The transformedφ(x)and trainedwkare substituted into Eq.(25).Then we compare the probability of the different classes and choose the class with the largest probability as the result category.At this point,we have completed the generalization of the logistic regression to multi-classi fication based on kernel density function.

To show the difference between kernel density estimation logistic regression and classical logistic regression,we will compare the corresponding algorithms later.

In the DLR algorithm,the inputxis given a feature transformation to getφbefore calculating the probability in Eq.(25).And then substituteφforxas the input to the probability formula.At the same time,the probability formula is changed from the Sigmoid function to the SoftMax function.

After conducting experiments,we have found that the differences ofφamong different labels obtained using the DLR algorithm are small.There is a large error in the final classi fication result.And the minority class samples cannot be discriminated at all.And the value of loss function is not reduced by training.Therefore,in the process of constructing the bandwidth of kernel function and preprocessing the data,we improve it by the following scheme.

Figure 2:The process of searching for the optimal coef ficient

First,We try to train the parameters of the kernel function by modifying the weight values on the basis of Eq.(14).We conducted 16 groups of experiments,as shown in Fig.2.In the previous experiment,since the value ofwwas too large,the characteristics of the input dataXitself were dif ficult to distinguish.Properly reducingcan limit the complexity of the model,thereby improving the generalization performance of the model.Through comparison experiments,we found that changing 1.06 in Eq.(14)to 0.02 can signi ficantly improve the accuracy of the model.According to Fig.2,we reduce the bandwidth of kernel function in Eq.(14).

In this way,the difference ofhdhas been improved.However,it may cause the value ofybecome too large and over fl ow in subsequent calculations.Feature scaling is a crucial step in the data preprocessing process.For most machine learning and optimization algorithms,scaling the values of features to the same interval can make their performance even better.In order to accelerate loss function convergence rate,we normalizeφusing the min-max method.

The training process of the improved model is established in Algorithm 3.

In the next section,we will conduct a comparative test to analyze the relationship between test results and training results after using Algorithm 3.

5 Application of improved algorithm:datasets and veri fication analysis

In particular,we have implemented the following methods for testing.

1)N-logistic model,One-vs-Rest methods,abbreviated as NLR.

2)N-logistic model,One-vs-Rest methods,combined with the oversampling method,abbreviated as NLR_Sample.

3)N-logistic model,One-vs-Rest methods,combined with the Smote method,abbreviated as NLR_Smote.

4)SoftMax model.

5)SoftMax model combined with Algorithm 3,abbreviated as DLR++.

We choose three datasets for testing.The first one is the fitting datasetNumbconstructed by us.In this dataset,each data element contains 10 fl oating point values,ranging from 0 to 5.The data distribution is divided into three categories:GroupA,GroupBandGroupC.The second dataset is theIrisfrom UCI.There are four features,including calyx length,calyx width and petal width,and the eigenvalue is fl oating-point number.The target value is the classi fication result of irises,includingvirginica,versicolor,andsetosa.The third dataset is theWinefrom UCI,which uses the various parameters of theWineto predict the quality of the Wine.There are 11 characteristic values,including volatileacidity,non-volatile acid,citric acid,residual sugar,chlorine,total sulfur dioxide,free of sulfur dioxide,sulfate,concentration,PH and alcohol.There are three quality classes:1,2,or 3.

Table 2:Accuracy(%)of different methods on three datasets

Table 3:Time(s)for different methods on three datasets

Table 4:The number of iterations of training Loss convergence on three datasets

In order to keep the data more versatile,and the classi fication results more persuasive,we use k-fold cross validation and assign the dataset to the training set and testing set according to the ratio 7:3.The test results are given as follows.

From Tab.2 to Tab.4,we can see that the DLR++algorithm shows better prediction accuracy.In the three datasets,Numbis linear,whileIrisandWineare non-linear.We can see from the results that both N-logistic and SoftMax models can solve the multi-classi fication problem well.Both oversampling and smote sampling method can be used to improve the classi fication results of the sample imbalance problem with the accuracy rate increased by 1.34%and 3.92%respectively.The improved DLR++model based on kernel density is the best among all these methods,and it has an advantage in solving nonlinear multi-classi fication problems.From Tab.2 to Tab.4,we can see that the improved DLR++model converges faster than the original logistic model,using only 1/20 of the training times.At the same time,the accuracy rate has been increased 7.04%,at the cost of a higher operation time.

From Tab.5 to Tab.6,we can see that the improved DLR++model has a better performance on datasets of large scales and multiple categories.It offers an accuracy of 93.0%while LRoffers an accuracy of 47.0%for 10-classi fication problems.

Table 5:Performance of DLR++on different scales of datasets

Table 6:Performance of DLR++on different number of categories

6 Conclusion

In this paper,we propose an improved logistic regression model based on kernel density estimation,and it can be applied to solve nonlinear multi-classi fication problems.We have compared and tested several common algorithms for logistic regression.For the experimental results,we found that the sampling method[Gao,Ding and Han(2008);Chawla,Bowyer,Hall et al.(2002)]can improve the classi fication accuracy,but the training samples obtained are very different from the original samples,which destroys the data characteristics inherently in the original sample.However in contrast,our improved model guarantees the integrity of the samples,it has obvious advantages in classi fication accuracy,and has good generalization ability with an ideal training speed.But there is still room for optimization in training,especially in the matrix operation stage.In the future,we will reduce the size of the matrix and block calculation,expected to decline training time and improve efficiency.Combining application to document retrieval[Xiong and Wang(2018);Xiong,Shen,Wang et al.(2018)],we will also expect to check the improved method in this paper is effect to document classi fication which is interested by us.

Acknowledgement:The authors would like to thank all anonymous reviewers for their suggestions and feedback.This work was supported by National Natural Science Foundation of China(Grant No.61379103).