Dr. Aly, O.
Computer Science
Introduction
The purpose of this discussion is to analyze one research work that involves knowledge extraction from a large-scale dataset with specific real-world semantics by applying machine learning. The discussion begins with an overview of Linked Open Data, Semantic Web applications, and Machine Learning. It then addresses Knowledge Extraction from Linked Data, its eight-phase lifecycle, and examples of Linked Data-driven applications, and focuses on DBpedia as an example of such applications.
Linked Open Data, Semantic Web Application, and Machine Learning
The term Linked Data, as indicated in (Ngomo, Auer, Lehmann, & Zaveri, 2014), refers to a set of best practices for publishing and interlinking structured data on the Web. Linked Open Data (LOD) is regarded as the next generation of systems recommended on the Web (Sakr & Gaber, 2014). LOD follows the principles outlined by Tim Berners-Lee for publishing and connecting data on the Web (Bizer, Heath, & Berners-Lee, 2011; Sakr & Gaber, 2014). These Linked Data principles include the use of URIs as names for things, the use of HTTP URIs so that those names can be looked up, the provision of useful information in RDF and via SPARQL when a URI is dereferenced, and the inclusion of RDF links to other URIs (Bizer et al., 2011; Ngomo et al., 2014; Sakr & Gaber, 2014). The LOD cloud, as indicated in (Sakr & Gaber, 2014), covers more than an estimated fifty billion facts from various domains such as geography, media, biology, chemistry, economy, and energy. The Semantic Web offers cost-effective techniques to publish data in a distributed environment (Sakr & Gaber, 2014); Semantic Web technologies are used to publish structured data on the Web and to set links from data in one data source to data within other data sources (Bizer et al., 2011). LOD implements the Semantic Web concepts whereby organizations can upload their data to the Web for open use of their information (Sakr & Gaber, 2014). The underlying technologies behind LOD are Uniform Resource Identifiers (URIs), the HyperText Transfer Protocol (HTTP), and the Resource Description Framework (RDF). Characteristic properties of LOD are that anyone can publish any data on the Web of Data, that data publishers are not constrained in their choice of vocabularies to represent data, and that entities are connected by RDF links (Bizer et al., 2011).
LOD applications include Linked Data browsers, search engines, and domain-specific applications (Bizer et al., 2011). Examples of Linked Data browsers are the Tabulator Browser (MIT, USA), Marbles (FU Berlin, DE), the OpenLink RDF Browser (OpenLink, UK), the Zitgist RDF Browser (Zitgist, USA), Humboldt (HP Labs, UK), the Disco Hyperdata Browser (FU Berlin, DE), and Fenfire (DERI, Ireland) (Bizer et al., 2011). The search engines come in two types: (1) human-oriented search engines and (2) application-oriented indexes. Examples of human-oriented search engines are Falcons (IWS, China) and Sig.ma (DERI, Ireland); examples of application-oriented indexes are Swoogle (UMBC, USA), VisiNav (DERI, Ireland), and Watson (Open University, UK). The domain-specific applications "mash up" data from various Linked Data sources; examples include Revyu, DBpedia Mobile, Talis Aspire, BBC Programmes and Music, and DERI Pipes (Bizer et al., 2011). There are various challenges for LOD applications, such as user interfaces and interaction paradigms; application architectures; schema mapping and data fusion; link maintenance; licensing; trust, quality, and relevance; and privacy (Bizer et al., 2011).
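To make these principles concrete, the short Python sketch below is a hedged illustration, assuming the rdflib package is installed and the public DBpedia resource URIs are reachable; the example resource and properties are chosen only for illustration. It dereferences an HTTP URI, reads the RDF description returned for it, and follows the RDF links that connect it to other data sets.
    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    # Principle: use URIs as names for things, and HTTP URIs so they can be looked up.
    berlin = URIRef("http://dbpedia.org/resource/Berlin")

    # Principle: dereferencing the URI returns useful information in RDF
    # (rdflib performs the HTTP request and content negotiation for us).
    g = Graph()
    g.parse(berlin)
    print(len(g), "RDF triples describe", berlin)

    # Principle: the returned RDF contains links to other URIs, e.g. owl:sameAs
    # links that connect this description to entities in other data sets.
    for _, _, other in g.triples((berlin, OWL.sameAs, None)):
        print("linked to:", other)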
Machine Learning is described by (Holzinger, 2017) as the fastest growing technical field. The primary goal of Machine Learning (ML) is to develop software that can learn from previous experience, so that usable intelligence can be reached. ML techniques are categorized as "supervised" or "unsupervised" depending on whether the output values are required to be present in the training data (Baştanlar & Özuysal, 2014). "Unsupervised" learning requires only the input feature values in the training data, and the learning algorithm discovers hidden structure in the training data based on them (Baştanlar & Özuysal, 2014). Clustering techniques, which partition the data into coherent groups, fall into this category of unsupervised learning (Baştanlar & Özuysal, 2014). Unsupervised techniques are used in bioinformatics for problems such as microarray and gene expression analysis (Baştanlar & Özuysal, 2014). Market segment analysis, grouping people according to their social behavior, and categorization of articles according to their topics are common tasks that involve clustering and unsupervised learning (Baştanlar & Özuysal, 2014). Typical clustering algorithms are K-means, hierarchical clustering, and spectral clustering (Baştanlar & Özuysal, 2014). "Supervised" learning techniques, by contrast, require the value of the output variable to be known for each training sample (Baştanlar & Özuysal, 2014).
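As a small illustration of the unsupervised setting described above, the following sketch, assuming the NumPy and scikit-learn packages are available and using purely hypothetical data, applies K-means in the spirit of market segment analysis. Because no output labels are supplied, the algorithm is judged only by the coherence of the groups it discovers.
    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical, unlabeled customer data: [visits per month, average basket value].
    X = np.array([[2, 15], [3, 14], [4, 16],
                  [40, 55], [38, 60], [42, 58]])

    # No output variable is given; K-means discovers the hidden group structure
    # from the input feature values alone.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster assignments:", kmeans.labels_)
    print("cluster centers:\n", kmeans.cluster_centers_)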
To improve the structure, semantic richness, and quality of Linked Data, existing ML algorithms need to be extended from basic Description Logics such as ALC to expressive ones such as SROIQ(D), which serves as the basis of OWL 2 (Auer, 2010). The algorithms also need to be optimized for processing very large-scale knowledge bases (Auer, 2010). Moreover, tools and algorithms must be developed for user-friendly knowledge representation, maintenance, and repair, which enable detecting and fixing inconsistencies and modeling errors (Auer, 2010).
Knowledge Extraction from Linked Data
There is a difference between Data Mining (DM) and Knowledge Discovery (KD) (Holzinger & Jurisica, 2014). Data Mining is described in (Holzinger & Jurisica, 2014) as "methods, algorithms, and tools to extract patterns from data by combining methods from computational statistics and machine learning: Data mining is about solving problems by analyzing data present in databases." Knowledge Discovery, on the other hand, is described in (Holzinger & Jurisica, 2014) as "exploratory analysis and modeling of data and the organized process of identifying valid, novel, useful and understandable patterns from these data sets." However, some researchers argue that there is no difference between DM and KD. In (Holzinger & Jurisica, 2014), Knowledge Discovery and Data Mining (KDD) are regarded as of equal importance and necessary in combination.
Linked Data has a lifecycle with eight phases: (1) Extraction, (2) Storage and Querying, (3) Authoring, (4) Linking, (5) Enrichment, (6) Quality Analysis, (7) Evolution and Repair, and (8) Search, Browsing, and Exploration (Auer, 2010; Ngomo et al., 2014). In the first phase, Extraction, information represented in unstructured or semi-structured forms is mapped to the RDF data model (Auer, 2010; Ngomo et al., 2014), as sketched after this paragraph. Once a critical mass of RDF data exists, techniques must be implemented to store, index, and query this RDF data efficiently, which constitutes the second phase, Storage and Querying. New structured information, as well as corrections and extensions to existing information, can be created in the third phase, Authoring (Auer, 2010; Ngomo et al., 2014). If information about the same or related entities is published by different publishers, links between these different information assets have to be established in the Linking phase. Linked Data often lacks classification, structure, and schema information; this is addressed in the Enrichment phase by enriching the data with higher-level structure so that it can be aggregated and queried more efficiently. The Data Web, like the Document Web, can contain information of varying quality, so strategies for evaluating and assessing the quality of the published data must be established in the Quality Analysis phase. When a quality problem is detected, strategies to repair it and to support the evolution of the Linked Data are applied in the Evolution and Repair phase. The last phase enables users to browse, search, and explore the structured information available on the Data Web quickly and in a user-friendly fashion (Auer, 2010; Ngomo et al., 2014).
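The following minimal sketch illustrates the Extraction phase under simplifying assumptions: rdflib is installed, and the record, namespaces, and property names are hypothetical. A semi-structured record is mapped to RDF triples that can then flow through the remaining lifecycle phases.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS, XSD

    EX = Namespace("http://example.org/resource/")    # hypothetical namespaces
    ONT = Namespace("http://example.org/ontology/")

    # A semi-structured record, e.g. parsed from a CSV file or a web page.
    record = {"id": "Berlin", "label": "Berlin", "population": 3769495}

    # Map the record to RDF triples (subject, predicate, object).
    g = Graph()
    city = EX[record["id"]]
    g.add((city, RDF.type, ONT.City))
    g.add((city, RDFS.label, Literal(record["label"], lang="en")))
    g.add((city, ONT.population, Literal(record["population"], datatype=XSD.integer)))

    print(g.serialize(format="turtle"))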
Linked Data-driven applications are categorized into four categories: (1) content reuse applications such as BBC Music, which reuses metadata from DBpedia and MusicBrainz; (2) semantic tagging and rating applications such as Faviki, which employs unambiguous identifiers from DBpedia; (3) integrated question-answering systems such as DBpedia Mobile, which can indicate locations from the DBpedia dataset in the user's vicinity; and (4) event data management systems such as Virtuoso's ODS-Calendar, which can organize events, tasks, and notes (Konstantinou, Spanos, Stavrou, & Mitrou, 2010).
Integrative and interactive Machine Learning is the future of Knowledge Discovery and Data Mining. This concept is demonstrated by (Sakr & Gaber, 2014) in the environmental domain and by (Holzinger & Jurisica, 2014) in the biomedical domain. Other applications of LOD include DBpedia, which is the Linked Data version of Wikipedia, and the BBC's platforms for the 2010 World Cup and the 2012 Olympic Games (Kaoudi & Manolescu, 2015). RDF datasets can involve large volumes of data: data.gov contains more than five billion triples, while the latest version of DBpedia corresponds to more than 2 billion triples (Kaoudi & Manolescu, 2015). The Linked Cancer Genome Atlas dataset consists of 7.36 billion triples and is estimated to reach 20 billion, as cited in (Kaoudi & Manolescu, 2015). Simple triple patterns have been reported to be frequent in real-world SPARQL queries, accounting for about 78% of the DBpedia query log and 99.5% of the Semantic Web Dog Food query log, as cited in (Kaoudi & Manolescu, 2015). The focus of this discussion is on DBpedia to extract knowledge from a large-scale dataset with real-world semantics by applying machine learning.
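As an illustration of how such large RDF datasets are queried, the sketch below assumes the SPARQLWrapper package is installed and the public DBpedia SPARQL endpoint at https://dbpedia.org/sparql is reachable; the query is built from a single, simple triple pattern of the kind reported to dominate real-world query logs.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?city WHERE { ?city a dbo:City } LIMIT 5
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # Print the URIs of the five matching resources.
    for row in results["results"]["bindings"]:
        print(row["city"]["value"])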
DBpedia
DBpedia is a community project which extracts structured, multilingual knowledge from Wikipedia and makes it available on the Web using Semantic Web and Linked Data technologies (Lehmann et al., 2015; Morsey, Lehmann, Auer, Stadler, & Hellmann, 2012). DBpedia is interlinked with several external datasets following the Linked Data principles (Lehmann et al., 2015). It builds a large-scale, multilingual knowledge base by extracting structured data from Wikipedia editions (Lehmann et al., 2015; Morsey et al., 2012) in 111 languages (Lehmann et al., 2015). Using the DBpedia knowledge base, several tools have been developed, such as DBpedia Mobile, Query Builder, Relation Finder, and Navigator (Morsey et al., 2012). DBpedia is also used by several commercial applications such as Muddy Boots, Open Calais, Faviki, Zemanta, LODr, and TopBraid Composer (Bizer et al., 2009; Morsey et al., 2012).
The technical framework of DBpedia extraction involves four phases: Input, Parsing, Extraction, and Output (Lehmann et al., 2015). In the Input phase, Wikipedia pages are read from an external source, either a Wikipedia dump or the MediaWiki API (Lehmann et al., 2015). In the Parsing phase, each Wikipedia page is parsed by a wiki parser, which transforms the source code of the page into an Abstract Syntax Tree (Lehmann et al., 2015). In the Extraction phase, the Abstract Syntax Tree of each Wikipedia page is forwarded to extractors that extract information such as labels, abstracts, or geographical coordinates; each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements (Lehmann et al., 2015). In the Output phase, the collected RDF statements are written to a sink supporting different formats such as N-Triples (Lehmann et al., 2015). The DBpedia extraction framework uses various extractors for translating different parts of Wikipedia pages into RDF statements; these extractors fall into four categories: mapping-based infobox extraction, raw infobox extraction, feature extraction, and statistical extraction (Lehmann et al., 2015).
The DBpedia extraction framework supports two workflows: Dump-based Extraction and Live Extraction. The Dump-based Extraction workflow employs the Wikipedia page collection from database dumps as the source of article texts, and the N-Triples serializer as the output destination (Bizer et al., 2009). The resulting knowledge base is made available as Linked Data, for download, and via DBpedia's main SPARQL endpoint (Bizer et al., 2009). The Live Extraction workflow employs the Wikipedia update stream to extract new RDF whenever an article is changed in Wikipedia (Bizer et al., 2009). The live Wikipedia page collection accesses the article text to obtain the current version of the article, encoded according to the OAI-PMH protocol (The Open Archives Initiative Protocol for Metadata Harvesting) (Bizer et al., 2009). The SPARQL-Update destination removes the existing information and inserts the new triples into a separate triple store (Bizer et al., 2009). The time for DBpedia to reflect the latest Wikipedia update is between one and two minutes, as reported by (Bizer et al., 2009). The update stream constitutes a performance bottleneck, as changes need more than one minute to arrive from Wikipedia (Bizer et al., 2009).
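The sketch below is a simplified, hypothetical rendering of the extractor idea, not the actual DBpedia code: each extractor consumes a parsed representation of a Wikipedia page, here reduced to a plain Python dictionary standing in for the Abstract Syntax Tree, and yields RDF statements that are collected in a sink and serialized as N-Triples. The rdflib package is assumed to be installed, and the property names are illustrative.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    DBR = Namespace("http://dbpedia.org/resource/")
    DBO = Namespace("http://dbpedia.org/ontology/")

    # Stand-in for the output of the wiki parser; the real framework hands each
    # extractor a full Abstract Syntax Tree of the Wikipedia page.
    parsed_page = {"title": "Berlin", "infobox": {"populationTotal": 3769495}}

    def label_extractor(title, page):
        """Toy counterpart of a label extractor: yields one RDF statement."""
        yield (DBR[title.replace(" ", "_")], RDFS.label, Literal(page["title"], lang="en"))

    def infobox_extractor(title, page):
        """Toy counterpart of an infobox extractor: yields one statement per property."""
        subject = DBR[title.replace(" ", "_")]
        for prop, value in page.get("infobox", {}).items():
            yield (subject, DBO[prop], Literal(value))

    # Output phase: collect the statements in a sink and serialize them as N-Triples.
    sink = Graph()
    for extractor in (label_extractor, infobox_extractor):
        for triple in extractor("Berlin", parsed_page):
            sink.add(triple)
    print(sink.serialize(format="nt"))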
References
Auer, S. (2010). Towards creating knowledge out of interlinked data.
Baştanlar, Y., & Özuysal, M. (2014). Introduction to machine learning. In miRNomics: MicroRNA biology and computational analysis (pp. 105-128). Springer.
Bizer, C., Heath, T., & Berners-Lee, T. (2011). Linked data: The story so far.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., & Hellmann, S. (2009). DBpedia: A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3), 154-165.
Holzinger, A. (2017). Introduction to MAchine Learning & Knowledge Extraction (MAKE). Machine Learning & Knowledge Extraction.
Holzinger, A., & Jurisica, I. (2014). Knowledge discovery and data mining in biomedical informatics: The future is in integrative, interactive machine learning solutions. In Interactive knowledge discovery and data mining in biomedical informatics (pp. 1-18). Springer.
Kaoudi, Z., & Manolescu, I. (2015). RDF in the clouds: a survey. The VLDB Journal, 24(1), 67-91.
Konstantinou, N., Spanos, D.-E., Stavrou, P., & Mitrou, N. (2010). Technically approaching the semantic web bottleneck. International Journal of Web Engineering and Technology, 6(1), 83-111.
Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., . . . Auer, S. (2015). DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2), 167-195.
Morsey, M., Lehmann, J., Auer, S., Stadler, C., & Hellmann, S. (2012). DBpedia and the live extraction of structured data from Wikipedia. Program, 46(2), 157-181.
Ngomo, A.-C. N., Auer, S., Lehmann, J., & Zaveri, A. (2014). Introduction to linked data and its lifecycle on the web. Paper presented at the Reasoning Web International Summer School.
Sakr, S., & Gaber, M. (2014). Large scale and big data: Processing and management. CRC Press.