Dr. O. Aly
Computer Science
Introduction
The purpose of this discussion is to examine how XML is used to represent Big Data in its various forms. The discussion begins with some basic information about XML, followed by the three pillars of XML, XML elements and attributes, and the document-centric vs. data-centric views of XML. Big Data and XML representation are then addressed, followed by XML processing efficiency and Hadoop technology.
What is XML
XML stands for eXtensible Markup Language, which can be used to describe data in a meaningful way (Fawcett, Ayers, & Quin, 2012). It has gained a good reputation due to its support for interoperability among various applications and for passing data between different components (Fawcett et al., 2012). It has been used to describe documents and data in a text-based standardized format that can be transferred via standard Internet protocols (Benz & Durant, 2004; Fawcett et al., 2012). Various standardized formats for XML are available, known as “schemas,” which represent various types of data such as medical records, financial transactions, and GPS data (Fawcett et al., 2012).
Unlike HTML, XML has no tags of its own (Howard, 2010). Instead, XML allows users to create their own tags as needed, provided that these tags follow the rules of the XML specification (Howard, 2010). These rules include a single root element, closing tags, properly nested elements, case sensitivity, and quoted attribute values (Howard, 2010). Figure 1 summarizes these XML rules. A document type definition (DTD) or schema enforces these rules.

Figure 1. A Summary of the XML Rules.
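As a brief illustration (a hypothetical fragment, not drawn from the cited sources), the following document satisfies each of these rules: a single root element, explicit closing tags, properly nested elements, consistent case in tag names, and quoted attribute values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A single root element encloses the entire document -->
<library>
  <!-- Opening and closing tags match exactly, including case -->
  <book category="reference">
    <!-- Elements are properly nested: author closes before book -->
    <author>Jane Doe</author>
    <title>Sample Title</title>
  </book>
</library>
```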
Like HTML, XML is based on the Standard Generalized Markup Language (SGML) (Benz & Durant, 2004; Fawcett et al., 2012; Nambiar, Lacroix, Bressan, Lee, & Li, 2002). SGML was developed in 1974 as part of an IBM document-sharing project and was officially standardized by the International Organization for Standardization (ISO) in 1986 (Benz & Durant, 2004). Although SGML was developed to define various types of markup, it proved complicated, and hence few applications could read it (Fawcett et al., 2012; Nambiar et al., 2002). Hyper Text Markup Language (HTML) was the first widely adopted application of SGML (Benz & Durant, 2004). HTML was explicitly designed to describe documents for display in a Web browser. However, with the explosion of the Web and the need for more than just displaying data in a browser, developers struggled with the limitations of HTML and strove to find a method that could describe data on the Web more effectively (Benz & Durant, 2004).
In 1998, the World Wide Web Consortium (W3C) combined SGML's basic characteristic of separating data from format with the HTML tag formats already in use on the Web, and developed the first XML Recommendation (Benz & Durant, 2004). Myer (2005) described HTML as a presentation language and XML as a data-description language. In brief, XML was designed to overcome the limitations of both SGML and HTML (Benz & Durant, 2004; Fawcett et al., 2012; Nambiar et al., 2002).
Three Pillars of XML
(Benz & Durant, 2004) have identified three pillars for XML: extensibility, structure, and validity (Figure 2). The extensibility pillar reflects the ability of the XML to describe structured data as text, and the format is open to extension, meaning that any data which can be described as a text and can be nested in XML tags can be generated as an XML file. The structure is the second pillar of the XML as it is described to be complicated for a human to follow, however, the file is designed to be read by the application. XML parsers and other types of tools which can read XML are designed to read XML format easily. The data representations using XML are much larger than their original format (Benz & Durant, 2004; Fawcett et al., 2012). The validity pillar of XML where the data of the XML file can be optionally validated for structure and content, based on two data validation standards: data type definition (DTD), and XML schema standard (Benz & Durant, 2004).

Figure 2. Three Pillars of XML.
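As a minimal sketch of the validity pillar (a hypothetical example, not taken from Benz and Durant), the internal DTD below declares which elements may appear, how they nest, and which attributes are required; a validating parser would reject any document that violates these declarations.

```xml
<?xml version="1.0"?>
<!DOCTYPE library [
  <!ELEMENT library (book+)>        <!-- root holds one or more books -->
  <!ELEMENT book (author, title)>   <!-- each book: author, then title -->
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ATTLIST book category CDATA #REQUIRED>
]>
<library>
  <book category="reference">
    <author>Jane Doe</author>
    <title>Sample Title</title>
  </book>
</library>
```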
XML Elements and Attributes
Because XML was developed to describe data and documents more effectively than HTML, the W3C XML Recommendation provides strict instructions on the format requirements that distinguish an ordinary text file containing various tags from a true XML file (Benz & Durant, 2004; Fawcett et al., 2012; Howard, 2010).
There are two main features of an XML file, known as “elements” and “attributes” (Benz & Durant, 2004; Fawcett et al., 2012; Howard, 2010). In the example below, “applicationUsers” and “user” are elements, while “firstName” and “lastName” are attributes. Text can be placed between the opening and closing tags of an element to represent the actual data associated with the surrounding elements and attributes (Benz & Durant, 2004; Fawcett et al., 2012; Howard, 2010). Figure 3 shows elements and attributes in a simple way. Figure 4 illustrates a simple XML document whose first line declares which version of the W3C XML Recommendation the document adheres to, in addition to XML rules such as the root element.

Figure 3. XML Elements and Attributes.

Figure 4. Simple XML Document Format. Adapted from Benz and Durant (2004).
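Because the original figures are not reproduced here, the fragment below is a sketch of the kind of document the text describes (the element and attribute names follow the example above; the data values are hypothetical): “applicationUsers” and “user” are elements, “firstName” and “lastName” are attributes of “user,” and the text between the opening and closing tags of “user” carries the data itself. The declaration on the first line states the XML version the document adheres to, and “applicationUsers” serves as the root element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<applicationUsers>
  <!-- firstName and lastName are attributes; the enclosed text is element data -->
  <user firstName="Ada" lastName="Lovelace">Administrator account</user>
  <user firstName="Alan" lastName="Turing">Standard account</user>
</applicationUsers>
```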
Document-Centric vs. Data-Centric View of XML
XML documents derive their structure from a DTD. Thus, XML documents can take any structure, while relational and object-relational data models have a fixed, pre-defined structure (Bourret, 2010; Nambiar et al., 2002). XML used in combination with a DTD or XML Schema specification to represent structured data is described as the data-centric format, or data-centric view, of XML (Bourret, 2010; Nambiar et al., 2002). The data-centric view of XML is highly structured, similar to a relational database, and the order of sibling elements is not essential in such documents. Various query languages were developed for the data-centric format of XML, such as XML-QL, LOREL, and XQL, which require data to be fully structured (Nambiar et al., 2002). Data-centric documents are typically stored in and retrieved from a traditional database, either one that is said to be XML-enabled or through third-party software such as middleware, data integration software, or a Web application server (Bourret, 2010).
In contrast, the document-centric format of XML is highly unstructured (Nambiar et al., 2002). The data in document-centric XML can be stored and retrieved using a native XML database or a document management system (Bourret, 2010). Furthermore, the implicit and explicit order of the elements matters in such XML documents (Nambiar et al., 2002). The implicit order is represented by the order of the elements within the file's tree-like representation, while the explicit order is represented by an attribute or a tag in the document (Nambiar et al., 2002). The explicit order can be expressed in a relational database, whereas capturing the implicit order when converting a document-centric XML document into a relational database has been a challenge (Nambiar et al., 2002). In addition to the implicit-order challenge, XML documents differ from a relational representation by allowing deep nesting and hyperlinked components (Nambiar et al., 2002). Transforming implicit order, nesting, and hyperlinks into tables can be a solution; however, such a transformation is costly in terms of time and space (Nambiar et al., 2002). Thus, XML processing efficiency became a challenge.
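To make the contrast concrete, the two hypothetical fragments below (not drawn from the cited sources) illustrate the difference. In the data-centric record, the order of sibling elements carries no meaning and the fields map naturally onto relational columns; in the document-centric fragment, the implicit order of the paragraphs is part of the meaning and is what makes conversion to tables costly.

```xml
<!-- Data-centric: sibling order is irrelevant; maps cleanly to a table row -->
<order id="1001">
  <customer>Acme Corp</customer>
  <total currency="USD">250.00</total>
</order>

<!-- Document-centric: reordering the paragraphs would change the meaning -->
<section>
  <para>Remove the cover <emphasis>before</emphasis> connecting power.</para>
  <para>Then attach the grounding strap.</para>
</section>
```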
Big Data and XML Representation
Big Data was first defined using the well-known 3V features: volume, velocity, and variety (Wang, Kung, & Byrd, 2018; Ylijoki & Porras, 2016). The volume feature reflects the magnitude and size of the data, from terabytes to exabytes. Velocity reflects the speed of data growth and the speed of data processing, from batch to real-time and streaming. The variety feature reflects the various types of data, from text to graph, encompassing structured as well as unstructured and semi-structured data (Wang et al., 2018; Ylijoki & Porras, 2016).
Big Data development has gone through an evolutionary phase as well as a revolutionary phase. The evolutionary phase spanned the period from 2001 to 2008 (Wang et al., 2018). During that period, it became possible for sophisticated software to meet the needs and requirements of dealing with the explosive growth of data (Wang et al., 2018). Analytics modules were added through software and application developments such as XML web services, database management systems, and Hadoop, in addition to functions added to core modules that focused on enhancing usability for end users (Wang et al., 2018). These developments enabled users to process large amounts of data within and across organizations, collaboratively as well as in real time (Wang et al., 2018). During the 2000s, XML became the standard formatting language for semi-structured data, mostly for online purposes, which led to the development of XML databases, regarded as a new generation of database (Verheij, 2013).
Healthcare organizations, at the same time, began to digitize medical records and aggregate clinical data in substantial electronic databases (Wang et al., 2018). Such developments in software and applications, like XML web services, database management systems, and Hadoop, made the significant volume of healthcare data storable, usable, searchable, and actionable, and helped healthcare providers practice medicine more effectively (Wang et al., 2018).
Starting in 2009, Big Data Analytics entered an advanced, revolutionary phase, in which Big Data computing became a breakthrough innovation for Business Intelligence (Wang et al., 2018). In addition, data management and its techniques were predicted to shift from structured to unstructured data, and from a static environment to a ubiquitous cloud-based environment (Wang et al., 2018). Data for the healthcare industry continued to grow, and as of 2011, the stored healthcare data reached 150 exabytes (1 EB = 10^18 bytes) worldwide, mainly in the form of electronic health records (Wang et al., 2018). Other Big Data Analytics pioneers, including banks and e-commerce, started to experience its impact on business process improvement, workforce effectiveness, cost reduction, and new customer attraction (Wang et al., 2018).
The data management approaches for Big Data include various types of databases, such as columnar, document stores, key-value/tuple stores, graph, multi-model, object, grid and cloud database solutions, XML databases, multi-dimensional, and multi-value (Williams, 2016). Big Data analytics systems are distinguished from traditional data management systems by their ability to analyze semi-structured and unstructured data, a capability that traditional data management systems lack (Williams, 2016). XML, as a textual language for exchanging data on the Web, is regarded as a typical example of semi-structured data (Benz & Durant, 2004; Gandomi & Haider, 2015; Nambiar et al., 2002).
In various industries such as healthcare, semi-structured and unstructured data refer to information that cannot be stored in a traditional database and cannot fit into predefined data models. Examples of such semi-structured and unstructured healthcare data include XML-based electronic healthcare records, clinical images, medical transcripts, and lab results. As a case study, Luo, Wu, Gopukumar, and Zhao (2016) referenced a project in which a hybrid XML database and Hadoop/HBase infrastructure was used to design a Clinical Data Managing and Analyzing System.
XML Processing Efficiency and Hadoop Technology
Organizations can derive value from XML documents, which reflect semi-structured data (Aravind & Agrawal, 2014). To derive this value, the XML data needs to be ingested into Hadoop for analytic purposes (Aravind & Agrawal, 2014). However, Hadoop technology does not offer a standard XML “RecordReader,” even though XML is one of the standard file formats for MapReduce (Lublinsky, Smith, & Yakubovich, 2013).
There is an increasing demand for efficient processing of the large volumes of data stored in XML using Apache Hadoop MapReduce (Vasilenko & Kurapati, 2014). Various approaches have been used to process XML efficiently. An ETL process for extracting the data is one approach (Vasilenko & Kurapati, 2014). Transforming XML into other formats that are natively supported by Hive is another technique (Vasilenko & Kurapati, 2014). Yet another approach is to use the Apache Hive XPath UDFs; however, these functions can only be used in Hive views and SELECT statements, not in CREATE TABLE DDL (Vasilenko & Kurapati, 2014).

Subhashini and Arya (2012) described several attempts by various researchers, such as a generic XML-based Web information extraction solution built on two key technologies: XML-based Web data conversion and XSLT (Extensible Stylesheet Language Transformations). The XML-based Web data conversion technology converts HTML into an XHTML document following XML rules and builds an XML DOM tree; a DOM-based XPath algorithm then generates XPath expressions for the desired information nodes once the information points are marked by the user. The XSLT is used to extract the required information from the XHTML document, and the results of the extraction are expressed in XML (Subhashini & Arya, 2012). XSLT is regarded as one of the most important XML technologies to consider in solving information-processing issues (Holman, 2017). Other attempts include the use of a wrapper based on the XBRL (eXtensible Business Reporting Language) GL taxonomy to extract financial data from the Web (Subhashini & Arya, 2012). These are a few of the attempts to solve the processing issues outlined by Subhashini and Arya (2012).
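As a minimal sketch of the XSLT-based extraction step described above (a hypothetical stylesheet, not the one used in the cited work), the template below applies an XPath expression to an XHTML document and expresses the extracted nodes as a new XML result:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" indent="yes"/>
  <!-- Match the document root and wrap the results in a new XML root -->
  <xsl:template match="/">
    <extractedPrices>
      <!-- The XPath expression selects only table cells marked class="price" -->
      <xsl:for-each select="//xhtml:td[@class='price']">
        <price><xsl:value-of select="."/></price>
      </xsl:for-each>
    </extractedPrices>
  </xsl:template>
</xsl:stylesheet>
```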
References
Aravind, P. S., & Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/.
Benz, B., & Durant, J. R. (2004). XML programming bible (Vol. 129): John Wiley & Sons.
Bourret, R. (2010). XML Database Products. Retrieved from http://www.rpbourret.com/xml/XMLDatabaseProds.htm.
Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.
Holman, G. K. (2017). What is XSLT? Retrieved from https://www.xml.com/articles/2017/01/01/what-is-xslt/.
Howard, G. K. (2010). XML: Visual QuickStart Guide (2nd ed.): Pearson Education India.
Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.
Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical Informatics Insights, 8, BII.S31559.
Myer, T. (2005). A Really, Really, Really Good Introduction to XML. Retrieved from https://www.sitepoint.com/really-good-introduction-xml/.
Nambiar, U., Lacroix, Z., Bressan, S., Lee, M. L., & Li, Y. G. (2002). Efficient XML data management: an analysis. Paper presented at the International Conference on Electronic Commerce and Web Technologies.
Subhashini, C., & Arya, A. (2012). A Framework For Extracting Information From Web Using VTD-XML’s XPath. International Journal on Computer Science and Engineering, 4(3), 463.
Vasilenko, D., & Kurapati, M. (2014). Efficient processing of XML documents in Hadoop MapReduce.
Verheij, B. (2013). The process of big data solution adoption. TU Delft, Delft University of Technology.
Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.
Williams, S. (2016). Business intelligence strategy and big data analytics: a general management perspective: Morgan Kaufmann.
Ylijoki, O., & Porras, J. (2016). Perspectives to definition of big data: a mapping study and discussion. Journal of Innovation Management, 4(1), 69-91.