MBGM:AGraph⁃Mining ToolBased on sed on MapReduceand BSP nd BSP

时间：2024-05-19

Zhenjiang Dong,Lixia Liu,Bin W u,and Yang Liu(.ZTE Corporation，Nanjing 00，China;.Beijing University of Posts and Telecommunications，Beijing 00876，China)

Zhenjiang Dong1,Lixia Liu1,Bin W u2,and Yang Liu2
(1.ZTE Corporation，Nanjing 210012，China;
2.Beijing University of Posts and Telecommunications，Beijing 100876，China)

This paper proposes an analyticalmining tool for big graph data based on MapReduce and bulk synchronous parallel(BSP)com⁃putingmodel.The tool is named Mapreduce and BSP based Graph⁃mining tool(MBGM).The core of thismining system are four sets of parallel graph⁃mining algorithms programmed in the BSP parallelmodel and one set of data extraction⁃transformation⁃load⁃ing(ETL)algorithms implemented in MapReduce.To invoke these algorithm sets，we designed a workflow engine which optimized for cloud computing.Finally，a well⁃designed datamanagement function enables users to view，delete and input data in the Ha⁃doop distributed file system(HDFS).Experiments on artificial data show that the components of graph⁃mining algorithm in MBGMare efficient.

cloud computing;parallel algorithms;graph data analysis;datamining;social network analysis

1 Introduction

R apid development in fields such as the Internet of Things，cloud computing，social networking，mo⁃bile communication and socialmedia，has ushered in the era of big data.Now，mostbig data exists as a graph or can be expressed using graph structure，which is one of themostwidely used abstractdata structures in comput⁃er science.Graph structure enablesmore complex，comprehen⁃sive presentation of data than link tables or tree structures. Graph⁃mining theories and techniques are improving all the time;however，the amount of information is growing exponen⁃tially，and the scale of graph⁃based data is increasing signifi⁃cantly.The InternetData Center in the United Stateshaspoint⁃ed out the amountof data on the Internet isexpanding by 50% each year，and more than 90%of dataworldwidewas generat⁃ed in the past two years[1].Millions of smartmobile devices，both enterprise and personal，are produce data in everywhere and at any time.YouTube users upload about 70，000 hours of video every day.According to statistics from the China Inter⁃netNetwork Information Center(CNNIC)，at the end ofDecem⁃ber 2013，the number ofwebpages in Chinawas 150 billion，a 22.2%increase over lastyear.At the same time，therewere ap⁃ proximately 308millionmicroblog users，approximately 54.7% of all Internet users[2].The amount of big graph data being generated ishuge，and itwillbe challenging to efficiently ana⁃lyze all this data.We therefore propose a system combining MapReduce and a BSP⁃based graph⁃mining tool(MBGM).This system has a series of parallel graph⁃mining algorithms based on the bulk synchronous parallel(BSP)method.It also has a setof data extract⁃transformation⁃loading algorithms based on MapReduce.Distributed file systemsare also used to store and manage the graph data，and a workflow engine is used to in⁃voke the algorithms.

The remainder of this paper is structured as follows.In sec⁃tion 2，we review related works.In section 3，we describe the MBGMsystem architecture.In section 4，we discuss the imple⁃mentation of several typicalgraph⁃analysis algorithmsand dis⁃cuss an application.In section 5，we discuss the testing and performance of the proposed system.In section 6，we discuss future directions.

2 Related W ork

MBGMis closely related to parallel⁃computing platforms and graph⁃mining tools.

Message⁃passing interface(MPI)is amodel for passing on messages.Many companies and universities have implemented jobs that can be run on almost any type of parallel computer.Such computers supportallexisting graph algorithms[3].How⁃ever，because the MPImodel uses a communicationmethod to integrate computing resources，this model has several draw⁃backs.For example，the inefficiency of parallel computing and high consumption ofmemorymakes itdifficult to properlyman⁃age the resourcesand communication.

MapReducewas created by Google[4].The best known and most successful open⁃source implementation ofMapReduce is Hadoop.An application based on the MapReduce framework can run on large⁃scale clusters in a parallel，fault⁃tolerantway. A MapReduce program has two main steps:map and reduce. Each phase comprises an input，computation，and output step: the outputof each phase is the input for the next phase.When a phase is finished，every machine writes its output data to shared memory，and the data is synchronized.Therefore，each machine is allowed to read the data written in the previous phase.The MapReduce framework relies on the pair to transfer data between the steps.A user can implement their own map and reduce function to finish the computing task.The input and output files are stored in a distributed file system.MapReduce is suitable for processing large⁃scale data and executing algorithms thatdo notneedmuch iteration.How⁃ever，itperforms badlywith algorithms thatare highly iterative. Thismeans that it isnotsuitable formostgraph algorithms.

Bulk synchronous parallel(BSP)[5]is also a widely used parallel computing framework.Itovercomes some of theweak⁃nesses of MapReduce and performs wellwhen a program has much iteration or requires a lot of communication.A BSP pro⁃gram can be divided into several super⁃steps，each of which consists of three stages:local computation，communication，and barrier synchronization.A BSP system has a number of computers with localmemory and disks.Each computer can run several computing processes called peers.In the local com⁃putation stage，each peer is computed using locally stored da⁃ta.After finishing local computation，each peer communicates only necessary data to other peers.When a peer finishes the communication stage，itwaits untilall the peers reach the bar⁃riersynchronization，and a super⁃step is completed.

Spark[6]is newly proposed parallel computing framework. The basis of Spark is the resilient distributed dataset(RDD)，which is a read⁃only collection of objects partitioned across a setofmachines thatcan be rebuilt ifa partition is lost.Theele⁃ments of an RDD do not need to exist in physical storage.A handle to an RDD contains enough information to compute the RDD starting from data in reliable storage.Another feature of Spark is the abstract of parallel operation.Instead ofmap⁃re⁃duce in theMapReducemodeland super⁃step in the BSPmod⁃el，the computing elements in Spark are reduce，collect and foreach.Spark performswellwith iterative jobs and interactive analytics.

Apache Mahout[7]supports supply classification，cluster⁃ing，pattern mining，regression，and dimension reduction，and machine⁃learning algorithms.However，it lacks a graph⁃min⁃ ing function.GraphLab[8]improves on MapReduce abstrac⁃tion by compactly expressing asynchronous iterative algo⁃rithms with sparse computational dependencies.However，the asynchronousmechanism may cause non⁃convergence problem while implementsynchronous iterative graph algorithm.PEGA⁃SUS[9]isa largeopen⁃sourcegraph⁃mining system implement⁃ed on Hadoop.The key idea of PEGASUS is to convertgraph⁃mining operations into iterativematrix⁃vectormultiplications. PEGASUS supports large⁃scale graph data;however，in prac⁃tice，not all the graph⁃mining algorithms can be modeled by matrix⁃vectormultiplications.Dryad[10]is a general parallel computing platform proposed by MicrosoftResearch.This plat⁃form abstracts the computing and communication in data⁃min⁃ing operations into vertexes and edges in order to form a data⁃flow graph.The platform executes the vertexes on work nodes and refines the dataflow graph to optimize the running process. Big Cloud parallel datamining(BC⁃PDM)[11]，developed by China Mobile Research Institute(CMRI)，provides visualiza⁃tion for datamining and analysis of graph data.However，it is based on Hadoop，so the graph⁃mining algorithms do not per⁃form well.Pregel[12]was premised on BSP and implemented by Google.It is a complete solution for large⁃scale graph com⁃puting but has not been published in the public domain.BC⁃BSP[13]is another implementation of the BSP parallel plat⁃form.Although most BSP platforms use memory to exchange the temporary data，BC⁃BSPa data⁃spillmechanism(including static data and dynamic data)on the local disk.This improves data processing for a small cluster，but the management and updating of this data⁃spillmechanism requires extra communi⁃cation and system resources and creates new defects in the platform.GraphX[14]，a resilient distribute graph system based on Spark，was proposed by UCBerkeley.The system has a Spark computing engine and uses the vertex⁃cutmethod in⁃stead of the edge⁃cutmethod for data partitioning.

3 System Architecture

This system focuses on big⁃graph management and graph⁃mining.We noticed that，while the original data is huge，but most graph⁃mining application using part of the original data. For example，the user phone call data provided bymobile com⁃munication companies isGB sized，and containsmuch informa⁃tion about the detail of caller and answerer，butmost graph analysisdoesnotcalculate all the information.We useMapRe⁃duce to extract the graph data from original data and construct the graph.To manage graph data and original data，we de⁃signed a data I/Omanagement component in the parallel plat⁃form layer and a data⁃management component in the logical layer.The algorithm layer is divided into two parts:extraction⁃transformation⁃loading(ETL)algorithm set and graph⁃mining algorithm set.We built the graph⁃mining algorithm set to im⁃plementgraph ranking algorithms，graph clustering algorithms，and graph attribute analysis algorithms and graph partition al⁃gorithms as the foundation of the graph analysis.Finally，we made this system extendable to enable the addition ofother al⁃gorithm components.An overview of the architecture ofMBGMis shown in Fig.1.The system consists of four layers，the func⁃ tion ofwhich are described here.

▲Figure1.ArchitectureofMBGM.

3.1 Parallel Platform Layer

The parallel⁃platform Layer comprises distributed graph file system，YARN，parallel computing engine，and graph I/OMan⁃agement component.We used Hadoop Distributed File System (HDFS)to construct the Distributed Graph File System，en⁃abling the storage of big graph data.YARN，a framework for cluster resource management and job scheduling，comprises an application master(AM)，which should integrate multiple computing frameworks，e.g.，MR，BSP.Because the BSPmodel performs well with graph⁃mining algorithms，we chose Hama BSP[15]as the parallel computing engine and used it to han⁃dlemessage communication，data distribution，and fault toler⁃ance.We used MapReduce as the graph data pre⁃processing engine to extract graph information from original data.The graph I/Omanagement component transfers data from the data⁃base into the graph data form that the MBGMcan handle and then exports theMBGMdata.

3.2 Algorithm Layer

The algorithm layer is themain layer of MBGM.This layer can be roughly divided into two parts:ETL algorithm and graph⁃mining algorithm.In the graph⁃mining algorithm part，we im⁃plemented four sets of 20 graph⁃mining algorithm components in the BSP parallelmodel and four group of data ETL algo⁃rithm for transform original into graph data.The ETL algorithm set comprises data⁃cleaning set for detecting and removing er⁃ ror value，data transform set for transform value into the format user needed，data extractsetand data update set.Those graph⁃mining algorithm components can be divided into four sets: graph ranking set comprises PageRank[16]，Hyperlink⁃In⁃duced Topic Search(HITS)[17]，and Random Walk with Re⁃start(RWR)[18]algorithms components，the graph clustering setcomprisesGauss-Newton

(GN)[19]，Clauset，Newman，and Moore(CNM)[20]，Clique Percolation method(CPM)[21]，and Label Propagation Algo⁃rithm(LPA)[22]algorithm components，and the K⁃meansalgo⁃rithm isused for the processingofgeneraldata.Thegraph attri⁃bute analysis set contains the graph diameter，closeness cen⁃trality，clustering coefficient，network density，betweenness centrality，and fiveotheralgorithm components.The graph par⁃tition set contains components thatare based on the Multilevel Subgraph Patrolling(MSP)[23]algorithm component and the Metis[24]algorithm component.These algorithm components can be run either in the consoleor can be invoked in a user⁃de⁃fined workflow from theuser interface.

3.3 Logical Layer

The logical layer is based on Open Service Gateway Initia⁃tive(OSGI)and is used to implement a stable，efficient，scal⁃able system[25].This is service platform and manager all kinds of services thatare supported by this layer，such as User Management Service，Data(HDFS)Management Service，Al⁃gorithm Service(Each algorithm is a kind of service)，and the workflow engine Service.

Fig.2 shows the operation of the logical layer.Getting the start command from tomcat servletwhich is a container in the interface layer，OSGi container will execute a train of opera⁃tions，starting the lifecycle，activating and registering the each service that hosted in bundle⁃servicesmanagement.With the tomcat servlet invoking one of services，Service Register que⁃ries and gets service that is requested by the tomcat from the services pool，and then execution environment(EE)calls theworkflow engine service to perform the requested service which isultimately implemented in the algorithm layer.

▲Figure2.Operation flow of the logical layer.

3.4 The Interface Layer

The interface layer is built in HTML，and flex provides an interactive interface with which the user can login to the MB⁃GM，management information，and most importantly，use the graph algorithms’mining graph data.

11.return p.pr; 12.}

4 Im p lementation of ParallelGraph Analysis Algorithm

Parallel graph⁃mining algorithms based on the BSPmodel are themost importantaspectsof the BPGM.In the BSP paral⁃lelmodel，graph⁃mining algorithms are“think as vertex.”Pro⁃grammers use the perspective of the vertex tomanage the gath⁃ering of information from other vertexes and to send informa⁃tion toothervertexes.

4.1 Im plementation of PageRank in BSP

PageRank is one of the most important and famous rank algorithm in graph analysis algo⁃rithm.Thebasic idea of PageRank is the impor⁃tance of a vertex is depending on the quantity and quality of its neighbor vertex.The formal of PageRank computation is:

The N is vertex quantity of the graph，M(pj) is the indegree set of vertex pj，and L() pjis the outdegree of vertex pj，d is an empirical value，in this i⁃mplementation we set d=0.85.The pseudo⁃code of PageR⁃ank in BSPmethod described in Algorithm 1.

In the Algorithm 1，p.pr is the PageRank value of vertex p，and k is the iteration limited.While compute PageRank value in a smallgraph data，it is necessary to seta threshold of Pag⁃eRank value change between super⁃steps，if a vertex reached the threshold itvote to haltand when all vertex vote to halt，the procession is finished.But in the big graph data，it is hard to determine the threshold，sowe simply limited themaximal su⁃perstep number，usually set k=30.The application scenario of PageRank is showed in section 5，and performance over vari⁃ousdata sets is listed in Table1.

4.2 Im p lementation of LPA in BSP

Community⁃detection and analysis isan importantmethodol⁃ogy for understanding the organization of various real⁃world networks and has applications in problems as diverse as con⁃sensus formation in social communities or identification of functionalmodules in biochemical networks.Most community⁃detection algorithms in large⁃scale real⁃world networks require a priori information，such as the number and sizes of communi⁃ties，orare computationally expensive.

▼Table1.Runtimeofsomegraph⁃m iningalgorithm components(s)

Algorithm 1.Implementation of PageRank in BSP Input：G(V，E)G is thegraph data，V is the vertex set，E is the edge set; Output:PageRank value ofeach vertex PageRank: 1.foreach p∈V{ 2.p.pr=1N; 3.for superstep from 1 tok{ 4.foreach p∈V 5.sum=0.0; 6.foreach q∈M(p){ 7.sum+= 8.p.pr= 9.} 10.} 1-d N+d·sum; q.pr L(q);

The label propagation algorithm(LPA)is a community⁃de⁃tection algorithm based on label propagation.Suppose a vertex v has neighbors v1，v2，v3，...，vkand that each neighbor vertex carries a label denoting the community to which they belong. Then v determines its community based on the label of its neighbors，assuming thateach vertex in the network chooses to join the community towhich themaximum numberof itsneigh⁃bors belong.The pseudo⁃code of LPA in BSPmethod is de⁃scribed in Algorithm 2.

Algorithm 2.Implementation of LPA in BSP Input:G(V，E)G is thegraph data，V is the vertex set，E is theedge set; Output:Vertex setwith community label LPA: 1.foreach p∈V{

2.v.setLable(RandomLabel); 3.} 4.while(Iteration

In Algorithm 2，line 2 initial ever vertex in graph with ran⁃dom labelwhich indicate the community it belong to.Line 7，sendLabel operator，is based on the operation of“sendMessag⁃eToNeighbors”provided by Hama and sends amessage of ver⁃tex label to its neighbors，and in line 11，each vertex received messages from neighbors，and update its labelas themostcom⁃munity label it received in line 12.If a vertex received an equalnumber of label types，itshould random ly pick one as its new label.The Max_Iteration is the threshold of the superstep number，and we set it as 60.The performance of the LPA on various scale of graph data and different scale of cluster is shown in Fig.3.

4.3 Imp lementation of CPMin BSP

A real⁃life graph tends to have a very complicated structure，and a vertex always not only belongs to one community，most traditional community detection algorithm cannot catch this feature.In response to this challenge，researchers have pro⁃posed the concept of overlapping community(CPM).The clique percolation method is one of the most important algo⁃rithms in this field.

▲Figure3.LPA performanceofMBGMon Cluster A and Cluster B.

The CPMdefines a k⁃clique⁃community as a union of all k⁃cliques(complete subgraphs of size k)that can be reached from each other through a series of adjacent k⁃cliques(where adja⁃cencymeans sharing k-1 vertex).This definition is aimed at representing the fact that it is an essential feature ofa commu⁃nity that itsmembers can be reached through well⁃connected subsets of nodes.There are other parts of the whole network that are not reachable from a particular k⁃clique，but they po⁃tentially contain further k⁃clique⁃communities.In turn，a single node can belong to several communities.All these can be ex⁃plored systematically and can result in a large number ofover⁃lapping communities.

The CPMmainly has two steps，locate all cliques and find community structure.The pseudo⁃code of CPMin BSPmethod described in A lgorithm 3.

A lgorithm 3.Implementation of CPMin BSP Input：G(V，E)G is the graph data，V is the vertex set，E is the edge set;k is the target clique size Output:Vertex setwith community label LPA: locate k⁃clique 1.for each p∈V{ 2.A=Ø; 3.B=getAdjacentVertex(p); 4.forea ch v∈B{ 5.A=A∪v; 6.B.removeVertex(v); 7.B.removeVertexConnect(A); 8.if(||A

In algorithm 2，line 2 the function getAdjacentVertex()return the k⁃1 hops adjacent vertex of p because to complete a k⁃clique，only k⁃1 different adjacent vertex information is need⁃ed.This strategy helps graph data spread over work node in clusterand reduce communication.The line 7 function remove⁃VertexConnect()is to remove all the vertex connect to the ver⁃tex in set A，tomake sure no vertex is compute duplicate.To keep the pseudo⁃code brief，we skipped the steps between lo⁃cate k⁃clique and find community structure that is build the clique setC and put the resultcliquesof“locate k⁃clique”into C.In the“find community structure”process，we use a chro⁃matic method to prevent recomputed cliques and use setColor and getColor to implement thismethod.The function isOverLap ()in line 31 used for compute whether two k⁃cliques have k⁃l identicalvertex，whichmeansoverlapped.

5 Performance

We tested MBGMfor functionality，reliability，usability，effi⁃ciency，maintainability，and portability.The evaluation was performed on clusters with 9 nodes，each of which comprises two Intel(R)Xeon(R)CPU E5530 processors，48 GB main memory，and 1024GB hard drive.Theevaluation data isa ran⁃domly generated graph data setscale ranging from 10，000 edg⁃es to 2000，000 edges.We also deployed a BC⁃PDMon the same cluster and ran some social network analysis algorithms using Google web data.Some of the results are shown in Fig. 4.We compared MBGMand BC⁃BSPwith the PageRank algo⁃rithm on a 4⁃node cluster，butwhere the nodes have the same hardware.The results are recorded in Fig.5.Finally，we use two clusters to test the performance of LPA algorithm，cluster A is the 9 nodes cluster described previous，and cluster B uses 4 nodes of cluster A.The characteristics of these graphs’data are shown in Table2.

The results show thatmostgraph⁃mining jobs can be accom⁃plished in a short timeand benefit from well⁃designed architec⁃ture.Also，theMBGMhasa higher performance than BC⁃PDMand BC⁃BSP.Fig.3 shows the LPA has good performance and scalability to handle variousgraph data.

6 Discussion

The main reason that MBGMperformed better than BC⁃PDMand BC⁃BSP is the different parallel computing engine. MBGMuses BSP as the graph computing engine and MapRe⁃duce as pre⁃processing computing engine.However，BC⁃PDMuses MapReduce as the computing engine and BC⁃BSP uses a modified BSPmodelas the computingengine.

Mostgraph computing algorithms require quantity communi⁃cation among vertex andmuch iteration.The MapReducemod⁃el computing engine can compute big data，but needs to write temporary results to disk every Map⁃Reduce phase.This limits the iterative computing ability.BC⁃BSP is based on BSP com⁃ putingmodel，butmodified the mechanism of read and write data tomemory in order to extend the data capability.We be⁃lieve thismechanism also increase the complexity of BC⁃BSP and reduced the performance.

The MBGMtakes advantage of both MapReduce and BSPcomputingmodels.The BSPmodel parallel computing engine ensures the graph algorithm performance and the MapReduce pre⁃processing engine to reduce the original data scale which also improves the data computing capability of MBGM.Also，with the help of operation flow，the MBGMcan run several graph⁃mining tasks in user defined orderwithoutpeoplewatch⁃ing.This savesmanpower and，compared with other command⁃line parallel computing tool，saves the time of operator config⁃ure tasksoneafterpreviousone finished.

▲Figure4.Comparison ofMBGMand BC⁃PDMon data_set_5.

▲Figure5.PageRank performanceofMBGMand BC⁃BSP.

▼Table2.Networksbasic structuralproperties

7 Conclusion and FutureW ork

In this study，we introduced MBGMbased on Cloud comput⁃ing.Ithas the ability to analyze big graph data and achieved a better performance than the Hadoop⁃based datamining tools BC⁃PDMand BSP⁃based parallelplatform BC⁃BSP.We expect⁃ed to mix more parallel computing model to achieve a higher performance ofgraph⁃mining both in data scale and computing speed.

Acknow ledgment

We thank the following individuals who contributed ideas，feedback，and guidance:Wu Bin，Yang Juan and Wang Bai. We are very grateful of assistance and research that Technolo⁃gy Software EngineeringGroup hasprovided.

[1]Z.H.Liu and Q.L.Zhang，“Research overview ofbig data technology，”Journal ofZhejiang University(Engineering Science)，vol.48，no.6，pp.957-972，2014.

[2]CNNIC.(2014 Jan.).Statistical report on internet development in China.[On⁃line].Available:http://cnnic.cn/hlwfzyj/hlwxzbg/hlwtjbg/201403/ P020140305346585959798.pdf

[3]M.Snir，S.Otto，S.Huss⁃Lederman，D.Walker，and J.Dongarra，MPI:TheCom⁃pleteReference.Cambridge，USA:MITpress，1995.

[4]J.Dean and S.Ghemawat，“MapReduce:Simplified data processing on large clusters，”Communicationsofthe ACM，vol.51，no.1，pp.107-113，Jan.2008.

[5]L.G.Valiant，“A bridgingmodel for parallel computation，”Communications of the ACM，vol.33，no.8，pp.103-111，Aug.1990.doi:10.1145/79173.79181.

[6]M.Zaharia，M.Chowdhury，M.Franklin，S.shenker，and I.Stoica，“Spark:Clus⁃ter computing with working sets，”in Proc.2nd USENIX Conf.on Hot Topics in Cloud Computing，Boston，USA，Jun.2010.

[7]S.Owen，R.Anil，T.Dunning，and E.Friedman，Mahout in Action.Greenwich，USA:Manning Publications，2011.

[8]Y.Low，J.Gonzalez，A.Kyrola，D.Bickson，C.Guestrin，and J.M.Hellerstein，“Graphlab:A new framework for parallelmachine learning，”CMU Tech.Rep.，GraphLab ProjectarXiv:1006.4990，2010.

[9]U.Kang，C.E.Tsourakakis，and C.Faloutsos，“PEGASUS:A peta⁃scale graph⁃mining system implementation and observations，”in Ninth IEEE Int.Conf.on Data Mining，Miami，USA，Dec.2009，pp.229-238.doi:10.1109/ ICDM.2009.14.

[10]M.Isard，M.Budiu，Y.Yu，A.Birrell，and D.Fetterly，“Dryad:Distributed data⁃parallel programs from sequential building blocks，”ACMSIGOPSOperating Systems Review，vol.41，no.3，pp.59-72，Jun.2007.doi:10.1145/ 1272998.1273005.

[11]L.Yu，J.Zheng，W.Shen，B.Wu，B.Wang，L.Qian，and B.Zhang，“BC⁃PDM: Datamining，social network analysis and textmining system based on cloud computing，”presented at18th ACMSIGKDD Int.Conf.on Knowledge Discov⁃ery and DataMining，Beijing，China，2012.

[12]G.Malewicz，M.H.Austern，A.J.C.Bik，J.C.Dehnert，I.Horn，N.Leiser，and G.Czajkowski，“Pregel:A system forlarge⁃scale graph processing，”in Proc. SIGMOD’10，Indianapolis，USA，pp.135-145.

[13]Y.Bao，Z.Wang，Y.Gu，G.Yu，F.Leng，H.Zhang，B.Chen，C.Deng，and L. Guo，“BC⁃BSP:A BSP⁃based parallel iterative processing system for big data on cloud architecture，”in Proc.DASFAAWorkshops 2013，Wuhan，China，pp. 31-45.doi:10.1007/978⁃3⁃642⁃40270⁃8_3.

[14]R.S.Xin，J.E.Gonzalez，M.J.Franklin，and Ion Stoica，“Graphx:A resilient distributed graph system on spark，”in First InternationalWorkshop on Graph Data Management Experiencesand Systems，New York，USA，2013，No.2.doi: 10.1145/2484425.2484427.

[15]S.Seo，E.J.Yoon，J.Kim，S.Jin，J.Kim，and S.Maeng，“Hama:An efficient matrix computation with the MapReduce framework，”in 2010 IEEE Second Int.Conf.on Cloud Computing Technology and Science，Indianapolis，USA，2010，pp.721-726.doi:10.1109/CloudCom.2010.17.

[16]L.Page，S.Brin，R.Motwani，and T.Winograd，“The PageRank citation rank⁃ing:Bringing order to theweb，”Stanford InfoLab，Tech.Rep.SIDL⁃WP⁃1999⁃0120，1999.

[17]H.Tong，C.Faloutsos，and J.Pan.“Fast random walk with restartand its appli⁃cations，”in Proc.6th IEEE Int.Conf.on DataMining,Hong Kong，China，2006，pp.613-622.

[18]J.M.Kleinberg，“Authoritative sources in a hyperlinked environment，”Journal of the ACM(JACM)，vol.46，no.5，pp.604-632，Sept.1999.doi:10.1145/ 324133.324140.

[19]M.Girvan and M.E.J.Newman，“Community structurein social and biological networks，”Proc.Natl Acad Sci USA，vol.99，no.12，pp.7821-7826，Jun. 2002.

[20]A.Clauset，M.E.J.Newman，and C.Moore，“Finding community structure in very large networks，”Physical Review E，vol.70，no.6，2004.doi:10.1103/ PhysRevE.70.066111.

[21]M.E.J.Newman and M.Girvan，“Finding and evaluating community structure in networks，”Physical Review E，vol.69，no.2，2004.doi:10.1103/Phys⁃RevE.69.026113.

[22]U.N.Raghavan，R.Albert，and S.Kumara，“Near linear time algorithm to de⁃tect community structures in largescale networks，”Physical Review E，vol.76，no.3，2007.doi:10.1103/PhysRevE.76.036106.

[23]Z.Zeng，B.Wu，and H.Wang，“A parallel graph partitioning algorithm to speed up the largescale distributed graph⁃mining，”in Proc.1st Int.Workshop on Big Data,Streams and Heterogeneous Source Mining:Algorithms,Systems, Programming Modelsand Applications，Beijing，China，2012，pp.61-68.doi: 10.1145/2351316.2351325.

[24]G.Karypis and V.Kumar，“Metis⁃unstructured graph partitioning and sparse matrix ordering system，version 2.0，”Dept.Computer Science，Univ.ofMinne⁃sota，Tech.Rep.，1995.

[25]J.Yang and D.Zhang，“Lightweightworkflow engine based on Hadoop and OS⁃GI，”presented at5th IEEE Int.Conf.on Broadband Network&Multimedia Technology，Beijing，China，2013.

Biographies phies

Zhenjiang Dong(dong.zhengjiang@zte.com.cn)reveived hisMS degree in telecom⁃munication from Harbin Instituteof Technology in 1996.He is the deputy head of the Service Institute of ZTECorporation，.His research interests include cloud com⁃putingandmobile internet.

Lixia Liu(liu.lixia@zte.com.cn)isa senior engineer in the pre⁃research department of ZTE.She received her MS degree from Ocean University of China in 2008.Her research interests include natural language processing，textmining，data mining，machine learning，mathematical statisticsand cloud computing.

Bin W u(wubin@bupt.edu.cn)received his PhD degree from the InstituteofComput⁃ing Technology，Chinese Academy of Science，Beijing，in 2002.He isa seniormem⁃ber of CCF.He is current a professor at the School of Computer Science，Beijing University of Posts and Telecommunications，China.His research interests include datamining，complex network，and cloud computing.He has published more than 100 papers in refereed journalsand conferences.

Yang Liu(liuyang1984@bupt.edu.cn)received his BS degree in computer science from Henan University of Technology in 2007.He is currently a PhD candidate at the School of Computer Science，Beijing University of Posts and Telecommunica⁃tion，China.His research interests include socialnetwork analysis，datamining，and cloud computing.

2014⁃04⁃04

10.3969/j.issn.1673-5188.2014.04.003

http://www.cnki.net/kcm s/detail/34.1294.TN.20141127.1518.002.htm l,pub lished online November 27,2014

This wo rk is suppo rted by ZTE Industry-Academ ia-Research Cooperaton Funds.