Paper Title
An Enhancing XML Big Data Mining Approach on Spark System
Abstract
With the growth of cloud computing, intelligent mobile applications, and the IoT, XML has become a popular standard
for data exchange among them, and XML data has consequently grown into large-volume data sets. XML is a kind of
semi-structured data and can be modeled as a tree. As data sharing becomes popular, XML features such as
parent-child and ancestor-descendant relationships are widely used to share information in XML big data. These
relationships give XML big data massive tree structures, which leaves mining behavior over such data largely
unconstrained: users can query the tree-structured XML big data through multiple access paths. However, this makes it
more difficult to mine frequent patterns in the data. Therefore, improving the performance of frequent pattern mining
over tree-structured XML big data has become an important issue.
Several XML pattern mining studies have focused on enhancing XML mining performance. However,
these studies model XML data as a tree and thus cannot improve the mining performance on XML big data. They also
do not apply the inclusion-exclusion principle from combinatorial mathematics to reduce the mining time
and I/O costs of generating candidate XML patterns. Thus, the mining performance on tree-structured XML big data cannot
be enhanced effectively. In addition, the existing algorithms were not designed to mine XML big data on a
cloud computing framework, which degrades system performance. We therefore propose a new
approach to mine effective XML frequent patterns on the Spark system. Built on Spark, the approach can achieve higher
mining and query performance for XML big data.
Index Terms - Cloud computing, XML frequent patterns, Spark, Hadoop, XML mining.
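As an illustration of what mining frequent patterns over tree-structured XML can mean in its simplest form, the following minimal sketch counts how many documents contain each parent-child tag pair and keeps the pairs that meet a support threshold. It is not the approach proposed in this paper: the function names (parent_child_pairs, frequent_pairs), the sample documents, and the min_support parameter are all assumptions made for the example, and it uses only the Python standard library rather than Spark.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def parent_child_pairs(xml_text):
    """Return the set of (parent_tag, child_tag) pairs found in one XML document."""
    root = ET.fromstring(xml_text)
    pairs = set()
    for parent in root.iter():      # visit every element in the tree
        for child in parent:        # direct children only (parent-child relationship)
            pairs.add((parent.tag, child.tag))
    return pairs


def frequent_pairs(documents, min_support):
    """Count the documents containing each pair; keep pairs with count >= min_support."""
    support = Counter()
    for doc in documents:
        # A set per document ensures each pair is counted once per document.
        support.update(parent_child_pairs(doc))
    return {pair: n for pair, n in support.items() if n >= min_support}


# Hypothetical sample documents, for illustration only.
docs = [
    "<order><item><name/></item><customer/></order>",
    "<order><item><price/></item></order>",
    "<order><customer/></order>",
]
result = frequent_pairs(docs, min_support=2)
print(result)
```

In a distributed setting, the per-document extraction would plausibly become a map step and the support counting a per-key reduction, but that mapping is an assumption of this sketch and not a description of the proposed system.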