HE Hongye, YANG Zhiguo, CHEN Xiangning
DOI: 10.12142/ZTECOM.202104010
https://kns.cnki.net/kcms/detail/34.1294.TN.20211104.1636.002.html, published online November 5, 2021
Manuscript received: 2021-02-23
Abstract: Traffic identification becomes more important, yet more challenging, as related encryption techniques are rapidly developing nowadays. Unlike recent deep learning methods that apply image processing to solve such encrypted traffic problems, in this paper we propose a method named Payload Encoding Representation from Transformer (PERT) to perform automatic traffic feature extraction using a state-of-the-art dynamic word embedding technique. By implementing traffic classification experiments on a public encrypted traffic data set and our captured Android HTTPS traffic, we prove that the proposed method achieves obviously better effectiveness than the compared baselines. To the best of our knowledge, this is the first time that encrypted traffic classification with dynamic word embedding has been addressed.
Keywords: traffic identification; encrypted traffic classification; natural language processing; deep learning; dynamic word embedding
Citation (IEEE Format): H. Y. He, Z. G. Yang, and X. N. Chen, "Payload encoding representation from transformer for encrypted traffic classification," ZTE Communications, vol. 19, no. 4, pp. 90–97, Dec. 2021. doi: 10.12142/ZTECOM.202104010.
1 Introduction

Traffic classification, a task to identify certain categories of network traffic, is crucial for Internet service providers (ISPs) to track the source of network traffic and to further ensure their quality of service (QoS). Traffic classification is also widely applied in specific missions like malware traffic identification and network attack detection. However, it is challenging because network traffic nowadays is more likely to be hidden with several encryption techniques, making detection hard with traditional approaches.
Typically, there are the following widely applied traffic classification methods: 1) The port-based method, which simply identifies traffic data using specific port numbers, is susceptible to port number changes and port disguise. 2) Deep packet inspection (DPI), a method which aims to locate patterns and keywords in traffic packets, is not suitable for identifying encrypted traffic because it heavily relies on unencrypted information. 3) The machine learning (ML)-based method focuses on using manually designed traffic statistical features to fit a machine learning model for categorization [1]. 4) The deep learning (DL)-based method is an extension of the ML-based approach where neural networks are applied for automatic traffic feature extraction.
Although encrypted traffic packets are hard to identify, an encrypted traffic flow (a flow is a consecutive sequence of packets with the same source IP, source port, destination IP, destination port and protocol) is still analyzable, because the first few packets of a flow may contain visible information like handshake details [2]. In this way, the ML-based and DL-based methods are considered ideal for encrypted traffic classification since they both extract common features from the traffic data. In fact, the ML-based and DL-based methods share the same concept that traffic flows can be vectorized for supervised training according to their feature extraction strategies.
Rather than extracting hand-designed features from the traffic as the ML-based method does, the DL-based method uses a neural network to perform representation learning (RL) on the traffic bytes, which allows it to avoid complex feature engineering. It provides an end-to-end solution for encrypted traffic classification where the direct relationship between raw traffic data and its categories is learned. The classification effect of a DL-based method is highly related to its capacity for representation learning.
In this paper, we propose a new DL-based solution named Payload Encoding Representation from Transformer (PERT), in which a dynamic word embedding technique called Bidirectional Encoder Representations from Transformers (BERT) [3] is applied during the traffic representation learning phase. Our work is inspired by the great improvements that dynamic word embedding brings to the natural language processing (NLP) domain. We believe that computer communication protocols and natural language share some common characteristics. Based on this point, we shall prove that such a strong embedding technique can also be applied to encode traffic payload bytes and provide substantial enhancement when addressing the encrypted traffic classification task.
2 Related Work
We shall introduce some related traffic identification works that involve the DL domain, and further categorize them into two major groups.
1) Feature engineering: Basically, these methods still use hand-designed features but utilize DL as a means of feature processing. For example, JAVAID et al. proposed an approach using the deep belief network (DBN) to make a feature selection before ML classification [4]. HOCHST et al. introduced an auto-encoder network to perform dimension reduction for manually extracted flow features [5]. REZAEI et al. applied a pre-training strategy similar to ours [6]. The difference is that their work introduced neural networks to reconstruct time series features, so its pre-training plays the role of re-processing hand-designed features, whereas ours performs representation learning on the raw traffic.
2) Representation learning: These works apply DL to learn encoding representations from raw traffic bytes without manual feature engineering, and are also considered end-to-end implementations of traffic classification. WANG et al. proposed such an encrypted traffic classification framework for the first time [7]. They transformed payload data into grayscale images and applied convolutional neural networks (CNN) to perform image processing. Afterward, the emergence of a series of CNN-based works, such as Ref. [8], proved the validity of such end-to-end classification. LOPEZ-MARTIN et al. further discussed a possible combination for traffic identification where a CNN was still used for representation learning, but a long short-term memory (LSTM) network was introduced to learn flow behaviors [9]. It inspired the hierarchical spatial-temporal features-based (HAST) models, which obtained state-of-the-art results in the intrusion detection domain [10].
Nevertheless, for end-to-end encrypted traffic classification nowadays, CNN is still the mainstream, whereas NLP-related networks only work as a supplement to do jobs such as capturing flow information. We can hardly find a full-NLP scheme similar to ours, let alone one which applies current dynamic word embedding techniques.
3 Model Architecture
3.1 Payload Tokenization
According to Ref. [2], the payload bytes of a packet are likely to expose some visible information, especially for the first few packets of a traffic flow. Thus, most DL-based methods use the byte data to construct traffic images as the inputs of a CNN model. This is because byte data are ideal for generating pixel images: their values range from 0 to 255, which exactly fits a grayscale image. Rather than applying such an image processing strategy, we treat the payload bytes of a packet as a language-like string for NLP processing.
However, the range of byte values is rather small considering the size of a common NLP vocabulary. To extend the vocabulary size of traffic bytes, we introduce a tokenization that takes pairs of bytes (with a value range of 0 to 65,535) as basic character units to generate bigram strings (Fig. 1). Afterward, NLP-related encoding methods can be directly applied to the tokenized traffic bytes. Thus, the encrypted traffic identification is transformed into an NLP classification task.
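As a concrete illustration, the following Python sketch tokenizes a raw payload into bigram tokens. The exact token rendering (four hex digits per byte pair), the token cap, and the handling of an odd trailing byte are our assumptions for illustration, not details specified in the paper.

```python
def tokenize_payload(payload: bytes, max_tokens: int = 128) -> list:
    """Turn raw payload bytes into bigram tokens (one token per byte pair).

    Each pair of consecutive bytes is treated as a single unit with a value
    in [0, 65535]; we render it as a 4-hex-digit string so it can be fed to
    an ordinary NLP tokenizer/vocabulary. The 128-token cap and the hex
    rendering are illustrative assumptions.
    """
    tokens = []
    for i in range(0, len(payload) - 1, 2):
        pair_value = (payload[i] << 8) | payload[i + 1]   # 0..65535
        tokens.append(f"{pair_value:04x}")
        if len(tokens) >= max_tokens:
            break
    return tokens


# Example: a short (fake) TLS-like payload
sample = bytes([0x16, 0x03, 0x01, 0x02, 0x00, 0x01])
print(tokenize_payload(sample))   # ['1603', '0102', '0001']
```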
3.2 Representation Learning
When performing representation learning in an NLP task, word embedding is widely utilized. Recently, a breakthrough was made in this research area as the dynamic word embedding technique overcame the drawback that traditional word embedding methods such as Word2Vec [11] are only capable of mapping words to unchangeable vectors. By contrast, vectors trained by dynamic word embedding can be adjusted according to their context inputs, making them more powerful for learning detailed contextual information. This is just what we need for extracting complex contextual features from the encrypted traffic data.
Current popular dynamic word embedding methods such as BERT can be considered a stack of a certain type of encoding layers. Each encoder takes the outputs of its former layer as inputs and further learns a more abstract representation. In other words, the word embedding is dynamically adjusted while passing through the next encoding layer.
In our work, we take the tokenized payload string $[w_1, w_2, \ldots, w_k]$ as our original input. The first group of word embedding vectors $[x_1, x_2, \ldots, x_k]$ at the bottom of the network is randomly initialized. After $N$ rounds of dynamic encoding, we obtain the final word embedding outputs $[h_1^N, h_2^N, \ldots, h_k^N]$, which imply highly abstract contextual information of the original payload.
The illustration of our representation learning is shown in Fig. 2.
An earlier dynamic word embedding model called Embeddings from Language Models (ELMo) [12] uses a bidirectional LSTM as its encoder unit, which is not suitable for large-scale training since the LSTM has poor support for parallel calculation. To solve this problem, the LSTM was replaced with the self-attention encoder first applied in the transformer model [13], and this embedding model was named BERT. This is also what we use for encoding the encrypted payload, as shown in Fig. 3. Taking our first embedding vectors $[x_1, x_2, \ldots, x_k]$ as examples, the transformer encoding consists of the following steps.
1) Linear projections: Each embedding vector $x_i$ is projected to three vectors using linear transformations:
$$K_i = W^K x_i, \quad Q_i = W^Q x_i, \quad V_i = W^V x_i,$$
where $W^K$, $W^Q$ and $W^V$ are the three groups of linear parameters.
2) Self-attention and optional masking: The purpose of the linear projections is to generate the inputs of the self-attention mechanism. Generally speaking, self-attention figures the compatibility between each input $x_i$ and all the other inputs $x_1$–$x_k$ via a similarity function, and further calculates a weighted sum for $x_i$ which implies its overall contextual information. In detail, our self-attention is calculated as follows:
$$att_i = \sum_{j=1}^{k} \frac{\exp\left( Q_i \cdot K_j / \sqrt{d_k} \right)}{Z}\, V_j, \qquad Z = \sum_{j=1}^{k} \exp\left( Q_i \cdot K_j / \sqrt{d_k} \right).$$
The similarity between $x_i$ and $x_j$ is figured by a scaled dot-product operator, where $d_k$ is the dimension of $K_j$ and $Z$ is the normalization factor. It should be noticed that not every input vector is needed for the self-attention calculation: an optional masking strategy that randomly ignores a few inputs while generating attention vectors is allowed to avoid over-fitting.
3) Multi-head attention: In order to grant encoders the ability to reach more contextual information, the transformer encoding applies a multi-head attention mechanism. Specifically, the linear projections are performed $M$ times for each $x_i$ to generate multiple attention vectors $[att_{i,1}, att_{i,2}, \ldots, att_{i,M}]$. Afterward, a concatenation operator is utilized to obtain the final attention vector:
$$att_i = \mathrm{Concat}\left(att_{i,1}, att_{i,2}, \ldots, att_{i,M}\right).$$
4) Feed-forward network (FFN): A fully-connected network is used to provide the output of the current encoder. For $x_i$, it is as follows:
$$h_i = \max\left(0, att_i W_1 + b_1\right) W_2 + b_2,$$
where $W_1$, $b_1$, $W_2$ and $b_2$ are the full-connection parameters and $\max(0, x)$ is the Rectified Linear Unit (ReLU) activation function.
Finally, we get the dynamic embedding $h_i$ encoded from $x_i$. It can be further encoded by the next encoding layer or directly used in downstream tasks. Following the naming of BERT, we name our encoding network PERT, considering its application of the transformer encoder.
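To make the four steps concrete, the sketch below implements a single transformer encoder layer in PyTorch following the equations above (QKV projections, scaled dot-product self-attention, multi-head concatenation, and the ReLU feed-forward network). The dimension choices and the omission of residual connections and layer normalization, which a full BERT/ALBERT encoder would include, are our simplifications for illustration.

```python
import math
import torch
import torch.nn as nn


class TransformerEncoderLayerSketch(nn.Module):
    """One encoder layer: linear projections, multi-head self-attention, FFN.

    Residual connections, layer normalization and dropout used by real
    BERT/ALBERT encoders are omitted to keep the correspondence with the
    equations in Section 3.2 easy to see.
    """

    def __init__(self, d_model: int = 256, num_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # 1) Linear projections W^Q, W^K, W^V (all heads packed into one matrix each)
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # 4) Feed-forward network with ReLU: max(0, x W1 + b1) W2 + b2
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, k, d = x.shape
        # Split into heads: (batch, heads, seq_len, d_k)
        q = self.w_q(x).view(b, k, self.num_heads, self.d_k).transpose(1, 2)
        kk = self.w_k(x).view(b, k, self.num_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, k, self.num_heads, self.d_k).transpose(1, 2)
        # 2) Scaled dot-product self-attention
        scores = torch.matmul(q, kk.transpose(-2, -1)) / math.sqrt(self.d_k)
        att = torch.matmul(torch.softmax(scores, dim=-1), v)
        # 3) Concatenate the heads back into one vector per position
        att = att.transpose(1, 2).contiguous().view(b, k, d)
        # 4) FFN produces the layer outputs h_i
        return self.ffn(att)


# Example: encode a batch of 2 tokenized payloads, each 128 bigrams long
layer = TransformerEncoderLayerSketch()
h = layer(torch.randn(2, 128, 256))
print(h.shape)   # torch.Size([2, 128, 256])
```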
3.3 Packet-Level Pre-Training
A key factor that makes BERT and its extended models continuously achieve state-of-the-art results across a wide range of NLP tasks is their "pre-training + fine-tuning" strategy. To the best of our knowledge, our work is the first to introduce such a strategy to an end-to-end encrypted traffic classification architecture. Performing pre-training initializes the encoding network and gives it the ability of contextual information extraction before it is applied to downstream tasks. The unsupervised language model (LM) is widely used for word embedding pre-training [14]. BERT, specifically, proposes a masked LM that hides several words of the original string with a unique symbol 'unk' and uses the remaining words to predict the hidden ones.
To demonstrate the procedure of the masked LM, we give a masked traffic bigram string $w = [w_1, \ldots, \text{'unk'}, \ldots, \text{'unk'}, \ldots, w_k]$ and a list $msk = [i_1, i_2, \ldots, i_m]$ which indicates the positions of the masked bigram units. After the encoding, each embedding vector $h_i$ encoded from the $i$-th position of the original input is followed by a full connection:
$$o_i = \tanh\left(h_i W_o + b_o\right),$$
where $\tanh$ is an activation function like ReLU. The size of the output vector $o_i = [o_{i,1}, \ldots, o_{i,|V|}]$ is the vocabulary size $|V|$. It stores the likelihoods of what the traffic bigram at the $i$-th position is.
In the end, the masked LM uses the partial outputs $\{o_i, i \in msk\}$ to perform a large softmax classification whose class number is the vocabulary size. The objective is to maximize the predicted probabilities of all the masked bigrams, which can be simply written as:
$$\max_{\theta} \sum_{i \in msk} \log P\left(w_i \mid o_i; \theta\right),$$
where θ represents the parameters of the entire network.
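A minimal PyTorch sketch of this masked-LM pre-training step is given below. It assumes bigram token IDs as input and uses PyTorch's built-in transformer encoder as a stand-in for the PERT/ALBERT encoder stack; the masking ratio, vocabulary size and mask token ID are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 65536 + 2      # all bigram values plus 'cls' and 'unk' (assumed)
MASK_ID = 65537             # id of the 'unk' mask token (assumed)

# Stand-in encoder: embeddings + a small stack of transformer layers
embedding = nn.Embedding(VOCAB_SIZE, 256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=4,
)
# Masked-LM head: project each hidden vector to vocabulary logits
mlm_head = nn.Sequential(nn.Linear(256, 256), nn.Tanh(), nn.Linear(256, VOCAB_SIZE))


def masked_lm_loss(token_ids: torch.Tensor, mask_ratio: float = 0.15) -> torch.Tensor:
    """Mask random bigram positions and predict them from the rest."""
    labels = token_ids.clone()
    msk = torch.rand_like(token_ids, dtype=torch.float) < mask_ratio
    inputs = token_ids.masked_fill(msk, MASK_ID)          # replace masked bigrams with 'unk'
    hidden = encoder(embedding(inputs))                   # (batch, seq_len, 256)
    logits = mlm_head(hidden)                             # (batch, seq_len, |V|)
    # Cross-entropy on the masked positions only (equivalent to maximizing
    # the log-probability of each masked bigram)
    return nn.functional.cross_entropy(logits[msk], labels[msk])


# Example: one training step over a batch of 8 tokenized packets
batch = torch.randint(0, 65536, (8, 128))
loss = masked_lm_loss(batch)
loss.backward()
print(float(loss))
```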
The LM is considered a powerful initialization approach for the encoding network using large-scale unlabeled data, yet it is very time-consuming. Even though we eventually want to perform a flow-level classification for encrypted traffic, we argue that the pre-training should be packet-level considering the possible calculation costs. In particular, we collect raw traffic packets from the Internet regardless of their sources and extract their payload bytes to generate an unsupervised data set. Then, the extracted payload bytes are tokenized as bigram strings and utilized to perform the PERT pre-training. After the training converges, we save the adjusted encoding network.
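As one possible way to build such an unsupervised corpus, the sketch below reads captured pcap files with Scapy and writes one line of bigram tokens per packet payload. The file paths, the hex-bigram rendering (same scheme as the tokenization sketch in Section 3.1), and the decision to keep only non-empty payloads are assumptions for illustration.

```python
from pathlib import Path
from scapy.all import rdpcap, Raw   # Scapy for offline pcap parsing


def bigram_tokens(payload: bytes) -> list:
    """Render each byte pair as a 4-hex-digit bigram token (assumed scheme)."""
    return [f"{(payload[i] << 8) | payload[i + 1]:04x}"
            for i in range(0, len(payload) - 1, 2)]


def build_pretraining_corpus(pcap_dir: str, out_path: str) -> int:
    """Write one whitespace-separated bigram string per packet payload."""
    written = 0
    with open(out_path, "w") as out:
        for pcap_file in sorted(Path(pcap_dir).glob("*.pcap")):
            for pkt in rdpcap(str(pcap_file)):
                if Raw not in pkt:          # skip packets without an application payload
                    continue
                tokens = bigram_tokens(bytes(pkt[Raw].load))
                if tokens:
                    out.write(" ".join(tokens) + "\n")
                    written += 1
    return written


# Example (hypothetical paths):
# n = build_pretraining_corpus("./captures", "./pretrain_corpus.txt")
# print(f"{n} packet payloads written")
```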
3.4 Flow-Level Classification
When implementing a specific task like classification, the pre-trained encoding network is fully reused and further adjusted to learn the real relationship between the inputs and the task objective. This is the concept of "fine-tuning", where a network is trained from a proper initialization to achieve a boosted effect in downstream tasks.
Fig. 4 shows our encrypted traffic classification framework. Below are the detailed descriptions:
1) Packet extraction: When classifying an encrypted traffic flow, only the first $M$ packets (3 for example in Fig. 4) need to be used. The bigram tokenization is performed on the payload bytes of each packet to generate a list of tokenized payload strings $[str_1, str_2, \ldots, str_M]$.
2) Encoding of packets: Before classification, the encoding network of the classifier is initialized with its pre-trained counterpart. As the encoding network is packet-level, each tokenized string is individually passed to the encoders. According to Ref. [3], when carrying out a classification with BERT, a unique token 'cls' should be added at the beginning of the input as the classification mark. For the $i$-th packet, its tokenized string is thus modified as $str_i = [\text{'cls'}, w_{i,1}, w_{i,2}, \ldots, w_{i,k}]$. After encoding, a series of embedding vectors $[h_{i,CLS}^N, h_{i,1}^N, \ldots, h_{i,k}^N]$ is output, yet only $h_{i,CLS}^N$ is picked as the further classification input. We simply denote $h_{i,CLS}^N$ as $emb_i$. In order to make use of all the information extracted from the first $M$ packets, we apply a concatenation to merge the encoded packets:
$$emb(f) = \mathrm{Concat}\left(emb_1, emb_2, \ldots, emb_M\right).$$
3) Final classification: In the end, a softmax classification layer is used to learn the probability distribution of the input flows among possible traffic classes. The objective of the flow-level classification can be written as:
$$\max_{\theta} \sum_{f \in R_{flow}} \log P\left(y_f \mid emb(f); \theta\right),$$
where $R_{flow}$ represents the flow-level training set. Given a flow sample $f$, $y_f$ represents its true label (class) and $emb(f)$ denotes its concatenated embedding. $P$ is the conditional probability that the softmax layer provides. In a manner of speaking, the objective is to maximize the probability that each encoded flow sample is predicted as its corresponding category. The flow-level information is involved in the final softmax classifier and thus will be used to fine-tune the packet-level encoding network during back propagation. The main point of such a fine-tuning strategy is to separate the learning of the packet relationships from the time-consuming pre-training procedure.
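The following sketch outlines this fine-tuning head in PyTorch: each of the first M packets is encoded separately, the embedding at the 'cls' position is taken as the packet representation, and the M representations are concatenated before a softmax classifier. The encoder is again a generic stand-in for the pre-trained PERT encoder, and the packet count, hidden size and class number are illustrative assumptions.

```python
import torch
import torch.nn as nn


class FlowClassifierSketch(nn.Module):
    """Packet-level encoding + concatenation + softmax, as in Fig. 4."""

    def __init__(self, vocab_size: int = 65538, d_model: int = 256,
                 num_packets: int = 5, num_classes: int = 12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=4,
        )  # in practice, initialized from the pre-trained PERT encoder
        self.classifier = nn.Linear(d_model * num_packets, num_classes)

    def forward(self, flow_tokens: torch.Tensor) -> torch.Tensor:
        # flow_tokens: (batch, M packets, seq_len); position 0 of each packet is 'cls'
        b, m, k = flow_tokens.shape
        packets = flow_tokens.view(b * m, k)              # encode every packet independently
        hidden = self.encoder(self.embedding(packets))    # (b*m, k, d_model)
        cls_emb = hidden[:, 0, :].view(b, m, -1)          # emb_i = h_{i,CLS}
        flow_emb = cls_emb.reshape(b, -1)                 # concatenation over the M packets
        return self.classifier(flow_emb)                  # softmax applied inside the loss


# Example: a batch of 4 flows, 5 packets each, 128 tokens per packet
model = FlowClassifierSketch()
logits = model(torch.randint(0, 65536, (4, 5, 128)))
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 3, 7, 1]))
print(logits.shape, float(loss))
```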
4 Experiments
4.1 Experiment Settings
4.1.1 Data Sets
1) Unlabeled traffic data set: This data set is utilized for the pre-training of our PERT encoding network. To generate it, we capture a large amount of raw traffic data from different sources using different devices through a network sniffer. Typically, there is no special requirement for the unlabeled traffic data, except that the collected samples should cover as many of the mainstream protocols as possible.
2) Information Security Centre of Excellence (ISCX) data set: We chose ISCX2016 VPN-nonVPN, a popular encrypted traffic data set, to make our classification evaluations more persuasive. However, this data set only marks where its encrypted traffic data are captured from and whether the capturing is through a VPN session or not, which means further labeling is needed. The ISCX data set has been utilized in several works, yet the results are rather different even when the same model is applied [7–8]. This is mainly due to how the raw data are processed and labeled. Because WANG et al. [7] have provided their pre-processing and labeling procedures on GitHub, we follow this open source project to process the raw ISCX data set and label it with 12 classes.
3) Android application data set: We find that the ISCX data set is not entirely encrypted, as it also contains data of some unencrypted protocols like the Domain Name System (DNS). To make a better evaluation, in this work we manually capture traffic samples from 100 Android applications via Android devices and a network sniffer tool-kit. All the captured data belong to the top activated applications of the Chinese Android app markets. Afterward, we exclusively pick the HTTPS flows to ensure only encrypted data remain.
4.1.2 Parameters
1) Pre-training: First of all, to perform the packet-level PERT pre-training on our unlabeled traffic data, we introduce the public Python library transformers, which provides implementations of the original BERT model and several recently published modified models. In practice, we chose the optimized BERT implementation named A Lite BERT (ALBERT) [15], which is more efficient and less resource-consuming. However, even when properly optimized, the BERT pre-training remains very costly although we use 4 NVIDIA Tesla P100 GPU cards.
Table 1 shows the settings of our pre-training and the corresponding description of each parameter. The settings refer to what common NLP works with BERT encoding use. After sufficient training, the encoding network is saved in a PyTorch format so that it can be reused in our classification networks. All of our other networks are also implemented with PyTorch.
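For reference, the snippet below shows how a small ALBERT masked-LM model can be instantiated with the transformers library. The configuration values are placeholders rather than the settings of Table 1, the mask token id is a hypothetical choice, and the actual pre-training loop (batching the bigram corpus, optimizer schedule, multi-GPU training) is omitted.

```python
import torch
from transformers import AlbertConfig, AlbertForMaskedLM

# Placeholder configuration; Table 1 of the paper defines the real settings.
config = AlbertConfig(
    vocab_size=65538,          # all byte bigrams plus special tokens (assumed)
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=130,
)
model = AlbertForMaskedLM(config)

# One illustrative forward pass: labels carry the true ids at masked positions
# and -100 everywhere else, so the loss only covers the masked bigrams.
MASK_ID = 65537                            # hypothetical id of the mask token
input_ids = torch.randint(0, 65536, (2, 128))
labels = torch.full_like(input_ids, -100)
labels[:, 10] = input_ids[:, 10]           # remember the true bigram at the masked position
input_ids[:, 10] = MASK_ID                 # replace it with the mask token in the input
out = model(input_ids=input_ids, labels=labels)
print(float(out.loss))

# After convergence, the encoder weights are saved for reuse (PyTorch format):
# model.save_pretrained("./pert_pretrained")
```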
2) Classification: The encoding network used at the classification stage strictly shares the same structure with the pre-trained one. Other settings of the classification layers are shown in Table 2. As fine-tuning the encoding network in a classification task is relatively inexpensive [3], a single GPU card is enough.
4.1.3 Baselines
Below are the baseline classification methods we use for comparison:
1) ML-based: We refer to Ref. [16] to implement our ML-based method using a decision tree classifier (named ML-1). However, it only contains basic flow-statistical features, so we do not consider it the most optimized ML-based method. Thus, based on ML-1, we further add some time series features, namely the source ports, destination ports, directions, packet lengths and arrival time intervals of the first 10 packets in a flow, to generate the ML-2 model.
2) CNN: The two types of CNN models are the 1D-CNN and the 2D-CNN, provided by Ref. [7]. They both use the first 784 bytes of a traffic flow to perform the classification.
3) HAST: The two HAST models proposed in Ref. [10] are state-of-the-art end-to-end methods for intrusion detection. HAST-I uses the first 784 bytes of a flow for direct representation learning. HAST-II, however, only performs packet-level encoding and further introduces an LSTM to merge the encoded packets.
During the evaluation, we randomly chose 90% of the samples from each data set as the training set and the remaining 10% for validation. Then, three widely used classification metrics are applied:
$$Precision_i = \frac{TP_i}{TP_i + FP_i}, \quad Recall_i = \frac{TP_i}{TP_i + FN_i}, \quad F1_i = \frac{2 \times Precision_i \times Recall_i}{Precision_i + Recall_i}.$$
Take a class $y_i$ as an example: $TP_i$ is the number of samples correctly classified as $y_i$, $FP_i$ is the number of samples mistakenly classified as $y_i$, and $FN_i$ is the number of samples of $y_i$ mistakenly classified as non-$y_i$. As for the overall evaluation over all classes, we use the average values of these metrics.
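A small, self-contained computation of these per-class metrics and their macro average is sketched below; equivalently, scikit-learn's precision_recall_fscore_support with average='macro' could be used.

```python
from collections import Counter


def macro_metrics(y_true: list, y_pred: list) -> dict:
    """Per-class precision/recall/F1 from TP, FP, FN counts, macro-averaged."""
    classes = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1      # predicted as p but actually t
            fn[t] += 1      # sample of class t was missed
    precision, recall, f1 = [], [], []
    for c in classes:
        pr = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rc = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        precision.append(pr)
        recall.append(rc)
        f1.append(2 * pr * rc / (pr + rc) if (pr + rc) else 0.0)
    n = len(classes)
    return {"precision": sum(precision) / n,
            "recall": sum(recall) / n,
            "f1": sum(f1) / n}


print(macro_metrics([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]))
```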
4.2 Overall Analysis
1) Results on ISCX Data Set
This group of experiments is used to discuss the classification based on the consistent data settings of Ref. [7]. As we can see in Table 3, our flow-level PERT classification achieves the best classification results, where the precision reaches 93.27% and the recall reaches 93.22%. It proves that PERT is a powerful representation learning method for encrypted traffic classification.
As for the other models, using the same manner of data preprocessing, the CNN classification results are pretty close to those provided in Ref. [7]. The CNN methods obviously obtain higher precision and recall than ML-1, which is implemented based on Ref. [16]. However, ML-1 can still be improved: when the time series features are added, the precision of the ML-2 classification exceeds 89%, which is much better than what the basic CNN methods get. In other words, the basic CNN methods actually have no absolute advantage when classifying the ISCX data set.
HAST-I achieves better results than the typical CNN models, yet HAST-II with an LSTM works relatively worse. In fact, we think using the first few bytes of a flow to perform direct deep learning (like HAST-I and CNN-1D) is better than merging packet-level encoded vectors, since the representation learning can directly capture flow-level information. However, the encoding costs on such a long string are not affordable for complex dynamic word embedding. At the current stage, the "packet-level encoding + flow-level merging" is the best option for our PERT classification.
2) Results on Android Data Set
These experiments are based on full HTTPS traffic to evaluate the actual encrypted traffic classification ability of each method. As all the data here are HTTPS flows, in comparison with the ISCX data set whose data cover several traffic protocols, it is harder to distinctly locate different flow behaviors among the chosen applications. Consequently, the ML-based methods that strongly rely on flow statistical features perform extremely poorly. Even when enhanced with time series features, ML-2 still obtains a worse result than the basic DL methods. As for the original ML-1, we find it entirely incapable of addressing this 100-class HTTPS classification, so we omit its result from Table 4.
The results on the Android data set demonstrate that the DL-based methods are more suitable for processing fully encrypted traffic data. More importantly, PERT again shows its superiority as it introduces a more powerful representation learning strategy. Its F1-score on the 100-class encrypted traffic classification exceeds 90%, whereas HAST can only achieve a result of 81.67%.
4.3 Discussion: Selection of the Packet Number
In a flow-level classification model, increasing the number of packets used causes significant costs. This is particularly true when representation learning is applied to traffic packets.
We perform the PERT classification multiple times on the two data sets with different settings of "packet_num" and the results are shown in Fig. 5. As we can see, at the beginning, the classification result on each data set is greatly improved as more packets are used. However, the improvement becomes slight as packets are continuously added. For example, the F1-score reaches 91.35% when classifying the Android data set with 20 packets, yet this result is boosted by merely 1.28% in comparison with using five packets. Using so many packets is therefore not recommended, considering the cost of PERT encoding against such minor further improvements.
We point out that using 5 to 10 packets for our PERT classification is sufficient. Similar conclusions can also be found in other flow-level classification research works such as Refs. [9] and [10].
4.4 Discussion: Merging of Encoded Packets
A major difference between our PERT classification and most flow-level DL-based methods such as HAST-II is how the encoded packets are merged. HAST-II constructs a 2-layer LSTM after encoding the packet data, whereas we simply apply a concatenation. To make a comparison between these two approaches, we modify our PERT model and the HAST-II model.
Firstly, we refer to HAST-II and construct the PERT_lstm model by adding a 2-layer LSTM on top of our PERT encoded packets. Then, we remove the LSTM layer from HAST-II and further generate HAST_con by concatenating the HAST encoded packets to fit an ordinary softmax classifier, just as our original PERT model does (see the sketch below). For all the compared methods, we consistently select 5 packets for classification based on our former discussion.
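To clarify the two merging heads being compared, the sketch below contrasts them on a batch of already-encoded packet embeddings. The hidden sizes and the choice of taking the LSTM's last hidden state as the flow representation are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_PACKETS, D_MODEL, NUM_CLASSES = 5, 256, 100

# Concatenation merging (our PERT / HAST_con): flatten the M packet embeddings
concat_head = nn.Linear(NUM_PACKETS * D_MODEL, NUM_CLASSES)

# LSTM merging (PERT_lstm / HAST-II): 2-layer LSTM over the packet sequence
lstm = nn.LSTM(D_MODEL, D_MODEL, num_layers=2, batch_first=True)
lstm_head = nn.Linear(D_MODEL, NUM_CLASSES)

packet_embs = torch.randn(4, NUM_PACKETS, D_MODEL)   # emb_1..emb_M for 4 flows

# 1) Concatenation merging
logits_concat = concat_head(packet_embs.reshape(4, -1))

# 2) LSTM merging: use the final hidden state of the top layer as the flow vector
_, (h_n, _) = lstm(packet_embs)
logits_lstm = lstm_head(h_n[-1])

print(logits_concat.shape, logits_lstm.shape)   # both (4, 100)
```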
We perform validation at every training epoch for each classification experiment and record the corresponding F1-scores for evaluation. As illustrated in Fig. 6, we cannot actually tell which merging approach gives better classification accuracy: whether the concatenation or the LSTM approach is used for merging has no major influence on the final classification results.
However, the merging approach has an obvious impact on the converging speed of the classification training. In Fig. 6, it always takes fewer training rounds for the model to converge when the concatenation merging is introduced. We believe the LSTM is not a satisfactory option for merging the encoded packets, since a simple concatenation reaches a very close classification result yet is much faster.
5 Conclusions
After a thorough analysis of the possibility of applying a full-NLP scheme to encrypted traffic classification, we point out that the byte data of raw traffic packets can be transformed into character strings by proper tokenization. Based on this, we propose a new method named PERT to encode the encrypted traffic data and to serve as an automatic traffic feature extractor. In addition, we discuss the pre-training strategy of dynamic word embedding under the condition of flow-level encrypted traffic classification. According to a series of experiments on the public ISCX data set and Android HTTPS traffic, our proposed classification framework provides significantly better results than current DL-based methods and traditional ML-based methods.
References
[1] VELAN P, CERMAK M, CELEDA P, et al. A survey of methods for encrypted traffic classification and analysis [J]. International Journal of Network Management, 2015, 25(5): 355–374. DOI: 10.1002/nem.1901
[2] REZAEI S, LIU X. Deep learning for encrypted traffic classification: an overview [J]. IEEE Communications Magazine, 2019, 57(5): 76–81. DOI: 10.1109/MCOM.2019.1800819
[3] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, USA: Association for Computational Linguistics, 2019: 4171–4186. DOI: 10.18653/v1/N19-1423
[4] JAVAID A, NIYAZ Q, SUN W Q, et al. A deep learning approach for network intrusion detection system [C]//Proceedings of the 9th EAI International Conference on Bio-Inspired Information and Communications Technologies. Brussels, Belgium: ICST, 2016: 21–26. DOI: 10.4108/eai.3-12-2015.2262516
[5] HOCHST J, BAUMGARTNER L, HOLLICK M, et al. Unsupervised traffic flow classification using a neural autoencoder [C]//42nd Conference on Local Computer Networks (LCN). Singapore, Singapore: IEEE, 2017: 523–526. DOI: 10.1109/LCN.2017.57
[6] REZAEI S, LIU X. How to achieve high classification accuracy with just a few labels: a semi-supervised approach using sampled packets [EB/OL]. (2020-05-16)[2020-06-01]. https://arxiv.org/abs/1812.09761v2
[7] WANG W, ZHU M, WANG J J, et al. End-to-end encrypted traffic classification with one-dimensional convolution neural networks [C]//IEEE International Conference on Intelligence and Security Informatics (ISI). Beijing, China: IEEE, 2017: 43–48. DOI: 10.1109/ISI.2017.8004872
[8] LOTFOLLAHI M, SIAVOSHANI M J, ZADE R S H, et al. Deep packet: a novel approach for encrypted traffic classification using deep learning [J]. Soft Computing, 2020, 24: 1999–2012. DOI: 10.1007/s00500-019-04030-2
[9] LOPEZ-MARTIN M, CARRO B, SANCHEZ-ESGUEVILLAS A, et al. Network traffic classifier with convolutional and recurrent neural networks for internet of things [J]. IEEE Access, 2017, 5: 18042–18050. DOI: 10.1109/ACCESS.2017.2747560
[10] WANG W, SHENG Y Q, WANG J L, et al. HAST-IDS: learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection [J]. IEEE Access, 2017, 6: 1792–1806. DOI: 10.1109/ACCESS.2017.2780250
[11] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [C]//International Conference on Learning Representations. Scottsdale, USA: ICLR, 2013
[12] PETERS M E, NEUMANN M, IYYER M, et al. Deep contextualized word representations [EB/OL]. (2018-03-22)[2020-06-01]. https://arxiv.org/abs/1802.05365v1
[13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [EB/OL]. (2018-03-22)[2020-06-01]. https://arxiv.org/abs/1706.03762
[14] BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model [J]. The Journal of Machine Learning Research, 2000, 3: 1137–1155
[15] LAN Z Z, CHEN M D, GOODMAN S, et al. ALBERT: a lite BERT for self-supervised learning of language representations [EB/OL]. (2020-02-09)[2020-06-01]. https://arxiv.org/abs/1909.11942v3
[16] DRAPER-GIL G, LASHKARI A H, MAMUN M S I, et al. Characterization of encrypted and VPN traffic using time-related features [C]//2nd International Conference on Information Systems Security and Privacy (ICISSP). Rome, Italy: INSTICC, 2016
Biographies
HE Hongye (he.hongye@zte.com.cn) received his M.S. degree from Central South University, China in 2018. He is currently an algorithm engineer working with ZTE Corporation. His research interests include artificial intelligence and network traffic identification.
YANG Zhiguo received his M.S. degree from Hunan University, China in 2015. He is a senior software engineer at ZTE Corporation. His current research interests include Internet traffic identification and network security.
CHEN Xiangning received his bachelor's degree in communication engineering from Hunan University, China in 2004. He is a software engineer at ZTE Corporation. His research interests include big data technology and AI applications.