International Journal of Services Computing (ISSN 2330-4472) Vol. 2, No. 2, April-June 2014

DATA MINING FROM DOCUMENT-APPEND NOSQL

Richard K. Lomotey, Ralph Deters
University of Saskatchewan, Saskatoon, Canada
richard.lomotey@usask.ca, deters@cs.usask.ca

Abstract

Due to the unstructured nature of modern digital data, NoSQL storages have been adopted by some enterprises as the preferred storage facility. NoSQL storages can store schema-oriented, semi-structured, and schema-less data. One type of NoSQL storage is the document-append storage (e.g., CouchDB and MongoDB), which has seen high adoption due to its flexibility in storing JSON-based data and files as attachments. However, the ability to perform data mining tasks on such storages remains a challenge, and the required tools are generally lacking. Even though there is growing interest in textual data mining, there is a huge gap in the engineering solutions that can be applied to document-append storage sources. In this work, we propose a data mining tool for term association detection. The flexibility of our proposed tool lies in its ability to perform data mining tasks on the document source directly via HTTP without any copying or formatting of the existing JSON data. We adapt the Kalman filter algorithm to accomplish macro tasks such as topic extraction, term organization, term classification, and term clustering. The work is evaluated in comparison with existing textual mining tools such as Apache Mahout and R, with promising results on term extraction accuracy.

Keywords: Data mining; NoSQL; Kalman filter; Unstructured data; Big data; Association rule

1. INTRODUCTION

There is a huge amount of digital data today that can be used to achieve several goals. This data economy is referred to as "Big data" (Zikopoulos et al., 2012). The high-dimensional data comes in heterogeneous formats and at high streaming velocity. This situation has rendered existing Relational Database Management Systems (RDBMS) inefficient, especially at the storage level. While RDBMS systems are designed to store schema-based data in certain predefined formats, much of the data being generated nowadays has no schema. The greater portion of the generated digital content has no particular structure, and the data comes in several formats such as multimedia, files, text, and so on. Thus, NoSQL (NoSQL, http://nosql-database.org) storages have been proposed as a measure to accommodate the growing style of data in unstructured formats. Different types of NoSQL storages exist; some examples are multimodal databases, graph databases, object databases, and so on.

In this work, we focus on the document-append style of NoSQL storage. This style of storage supports structured, semi-structured, and unstructured data. The testing environment is CouchDB, which supports the upload of files as attachments and actual data in JSON format. Identical to Web services, CouchDB can be interacted with using the HTTP verbs following the REpresentational State Transfer (REST) style: GET to retrieve data, POST to create new records, PUT to update existing data, DELETE to remove an existing record, and HEAD to retrieve the meta-information of a stored document. Further, the data stored is treated as a resource, and each document has a unique identifier.
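To make the REST interaction concrete, the following is a minimal sketch (not taken from the paper) of driving CouchDB's document API with the verbs listed above; the server URL, database name, and document fields are hypothetical.

```python
# Minimal sketch (not from the paper): exercising CouchDB's REST verbs.
# The server URL, database name, and document fields are hypothetical.
import requests

BASE = "http://localhost:5984/patients"  # hypothetical CouchDB database

# POST: create a new record; CouchDB assigns the unique document identifier.
doc = {"PatientID": "SK102", "LNAME": "DOE", "FNAME": "JOHN", "DOC_ID": "PH200"}
created = requests.post(BASE, json=doc).json()
doc_id = created["id"]

# GET: retrieve the stored document as a resource via its identifier.
record = requests.get(f"{BASE}/{doc_id}").json()

# HEAD: retrieve only the meta-information (HTTP headers) of the document.
meta = requests.head(f"{BASE}/{doc_id}").headers

# PUT: update the existing document (CouchDB requires the current revision).
record["OFFICE_LOC"] = "ROOM 61"
updated = requests.put(f"{BASE}/{doc_id}", json=record).json()

# DELETE: remove the record, quoting the latest revision returned by PUT.
requests.delete(f"{BASE}/{doc_id}", params={"rev": updated["rev"]})
```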
The problem, however, is that we are faced with the challenge of performing data mining tasks on such storage facilities. Even though the interaction with such data storage is facilitated as a Web service (and can be accessed via the browser), we are not able to employ tools such as Web crawlers to mine data from them. This is because crawlers rely on the HTML tags within the DOM of a webpage to retrieve information. The presence of anchor and href tags (i.e., <a href=""></a>) on a page enables the crawler to traverse all hyperlinks that are found within the search space. In the case of document-append storages, there are no such tags; rather, the data is JSON or XML per document. Also, while RDBMS storages have a structure and support cross-database queries based on keys, the document-append storage has no keys and relies on map-reduce. This situation has rendered data mining tools built for RDBMS, as well as Web crawlers, inefficient in the world of document-append storages (Figure 1 compares the two).

In this paper, we propose a data mining tool that can be employed to perform analytics of terms and topics on the document-append storage, using CouchDB as a case study. Users are enabled to perform data mining tasks directly on CouchDB via HTTP requests without any copying or formatting of the existing JSON data. We propose the Kalman filter algorithm to achieve requirements such as topic extraction, term organization, term classification, and term clustering. The work is evaluated in comparison with existing textual mining tools such as Apache Mahout and R, with promising results on term extraction accuracy.

Figure 1. The Representation of RDBMS (left) in a Document-Oriented NoSQL (right). The relational side consists of two tables:

Patient Information: PatientID, LNAME, FNAME, DOC_ID
SK100, LOMOTEY, RICH, PH200
SK101, DETERS, RALPH, PH201
SK102, DOE, JOHN, PH200
SK103, GOOD, JOE, PH202

Doctors Information: DOC_ID, OFFICE_LOC, SPECIALTY
PH200, ROOM 61, GEN SURGEON
PH201, ROOM 78, PEDIATRICIAN
PH202, ROOM 21, ORTHODONTIST
PH203, ROOM 98, PEDIATRICIAN

The document-oriented side represents the corresponding record as a single JSON document:

{ "PatientID": "SK102", "LNAME": "DOE", "FNAME": "JOHN", "DOC_ID": "PH200", "OFFICE_LOC": "ROOM 61", "SPECIALTY": "GEN SURGEON" }

Here are the highlights of this work:
- Proposed a data mining framework for Document-Style NoSQL
- Proposed the Kalman filter algorithm to determine term association between topics in the data source
- The framework supports topic extraction, term organization, term classification, and term clustering
- In comparison to Apache Mahout and R, the proposed framework is:
  o more accurate
  o more flexible to adapt to other data sources
  o more scalable to accommodate several data sources
  o less scalable than Apache Mahout in terms of the processing jobs per time

The remaining sections of this work are structured as follows. Section II reviews some works on unstructured data mining, Section III describes our proposed search tool, Section IV details the adapted Kalman filter algorithm, and the preliminary evaluation of the proposed tool is carried out in Section V. The paper concludes in Section VI with our findings and future work.

2. BACKGROUND WORKS

2.1 NoSQL Databases

Today, digital content generation is influenced by multiple stakeholders such as end-users, enterprise consumers, service providers, and prosumers. This has also shaped the nature of the data, which has shifted from schema-based storage to semi-structured and unstructured data (Rupanagunta et al., 2012).
In summary, the concept of unstructured data is described as:
- Data Heterogeneity: The data is represented in varying formats (documents, multimedia, emails, blogs, websites, textual contents, etc.).
- Schema-less: The data has no specific structure because there is no standardization or constraints placed on content generation.
- Multiple Sources: The data is from diverse sources (e.g., social media, data providers such as Salesforce.com, mobile apps, sensors, etc.).
- Varying API Types: The data requires multiple APIs to aggregate data from the multiple sources; the APIs are neither standardized nor backed by SLAs.

While the Web has been at the forefront of content accessibility and delivery, it is not a platform for storage. Thus, a revolution in storage has been proposed that shifts focus from RDBMS storages to NoSQL databases. There are varying forms of NoSQL storages, such as document-append NoSQL that supports textual data and file attachments (e.g., CouchDB), file-based storages (e.g., DropBox, MEGA, Amazon S3), wide-column storages (e.g., Hadoop, HBase), graph-based storages (e.g., Neo4J, InfoGrid), and so on.

While many of the ongoing works on big data and unstructured data mining have targeted Hadoop and its family line of storage, not much work has been done on the document storage front. As of the time of writing, we could not find any tool that is aimed at the document storage style. Our work therefore explores this option, with CouchDB as our use case environment. The storage allows data to be stored in JSON format and does not use keys as in the case of RDBMS. Further, each document is uniquely identified by an identifier which can be used to access the document over a hyperlink. In order to query multiple documents, the MapReduce methodology can be adopted.
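As a brief illustration of the MapReduce-based querying just described, the sketch below defines and queries a CouchDB view over HTTP. In CouchDB, map and reduce functions are written in JavaScript inside a design document; the database URL, design-document name, and field names here are assumptions for illustration.

```python
# Illustrative sketch (assumed names): defining and querying a CouchDB
# MapReduce view over HTTP. The map/reduce functions are JavaScript strings
# stored in a design document.
import requests

BASE = "http://localhost:5984/patients"

# A map function that emits each patient's doctor ID, and the built-in
# "_count" reduce to count patients per doctor across all documents.
design_doc = {
    "_id": "_design/mining",
    "views": {
        "by_doctor": {
            "map": "function(doc) { if (doc.DOC_ID) { emit(doc.DOC_ID, 1); } }",
            "reduce": "_count"
        }
    }
}
requests.put(f"{BASE}/_design/mining", json=design_doc)

# Query the view: group by key to get one count per doctor.
result = requests.get(f"{BASE}/_design/mining/_view/by_doctor",
                      params={"group": "true"}).json()
for row in result["rows"]:
    print(row["key"], row["value"])
```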
Table I. Text-Based Data Mining Requirements and Proposed Techniques. For each requirement, the table reports the goal, data representation, natural language processing (NLP), output, techniques, search space, evaluation metrics, and challenges/future directions:

Information Extraction — Goal: Data retrieval, KDD process. Data Representation: Long or short text, semantics and ontologies, keywords. NLP: Feature extraction. Output: Documents (structured or semi-structured). Techniques: Community data mining, tagging, normalization of data, tokenizing, entity detection. Search Space: Documents, files, databases. Evaluation Metrics: Similarity and relatedness of documents, sentences, and keywords. Challenges: Lack of research on data pedigree.

Association Rules — Goal: Trend discovery in text, documents, or files. Data Representation: Keywords, long or short sentences. NLP: Keyword extraction. Output: Visualization, summary report. Techniques: Ontologies, semantics, linguistics. Search Space: Documents, files, databases. Evaluation Metrics: Similarity, mapping, co-occurrence of features, correlations. Challenges: Uses varying methods over multiple databases (e.g., relational, spatial, and transactional data stores).

Topics — Goal: Topic recommendation. Data Representation: Keywords, lexical chaining. NLP: Lexical chaining. Output: Topic linkages. Techniques: Publish/subscribe; topics-associations-occurrences. Search Space: Documents, files, databases. Evaluation Metrics: Associations, similarity, sensitivity, specificity. Challenges: Identification of topics of interest may be challenging.

Terms — Goal: Establish associative relationships between terms. Data Representation: Keywords. NLP: Lemmas, parts-of-speech. Output: Visualization, crawler depth. Techniques: Linguistic preprocessing, term generation. Search Space: Topics vector space. Evaluation Metrics: Frequency of terms, frequency variations. Challenges: Community based, so cross-domain adoption is challenging.

Document Clustering — Goal: Organize documents into sub-groups with closer identity. Data Representation: Data, documents. NLP: Feature extraction. Output: Visualization. Techniques: K-means clustering, hierarchical clustering. Search Space: Documents, databases. Evaluation Metrics: Sensitivity, specificity. Challenges: Can be resource intensive, therefore needs research on parallelization.

Document Summarization — Goal: Noise filtering in large documents. Data Representation: Terms, topics, documents. NLP: Lexical chaining. Output: Visualization, summary report. Techniques: Document structure analysis, document classification. Search Space: Databases, sets of documents. Evaluation Metrics: Transition and preceding matrix, audit trail. Challenges: Not much is available on data pedigree.

Re-Usable Dictionaries — Goal: Knowledge collaboration and sharing. Data Representation: Words, sentences. NLP: Feature extraction. Output: Summary report. Techniques: Annotations, tagging. Search Space: Databases, sets of documents. Evaluation Metrics: Word extensibility. Challenges: Dictionary adaptation to new words.

While the underlying technology on which document-style databases are built follows Web standards, performing data mining takes on different dimensions. Web data mining is done using a Web crawler, which relies on HTML tags to interpret data blocks (i.e., content) on a web page. The anchor tag and href attributes also aid the crawler in moving to other interconnected pages. The document-style storage, however, does not have tags; rather, it is text-based, following the JSON format. Moreover, RDBMS-specific data mining tools are not designed for textual mining, and a storage facility such as CouchDB that allows file attachments further complicates the issue. This creates the need to research how to perform data mining and knowledge discovery on such storage. Thus, our major goal in this paper is to research and design a data analytics tool that targets document-style NoSQL. However, to achieve this goal, we need to explore the second major aspect of the work, which is textual and unstructured data mining.

2.2 Unstructured Data Mining

Without data mining, we cannot deduce any meaning from the vast data at our disposal. As already posited, existing well-advanced data mining techniques have been designed to work either with schema-oriented databases, which are structured (Kuonen, 2003), or with Web crawlers.
To enhance the data mining process, scientists in both academia and industry are beginning to explore the best methodologies that can aid the unstructured data mining process, especially in the context of textual data mining. As a result, we have witnessed requirements such as: Information Retrieval (IR) algorithms based on templates (Hsu and Yih, 1997), Association Rules (Delgado et al., 2002; Zhao and Bhowmick, 2003), topic tracking and topic maps (Abramowicz et al., 2003), term crawling (Janet and Reddy, 2011; Feldman et al., 1998), Document Clustering (Han et al., 2010), Document Summarization (Dey and Haque, 2009), the Helmholtz search principle (Balinsky et al., 2010), Re-usable Dictionaries (Godbole et al., 2010), and so on. For brevity, the reader is referred to Table I for an overview of the area, including unexplored directions.

There is an ever-growing demand for the consumption of high-dimensional data across different enterprises such as Biomedical Engineering, Telecommunications, Geospatial Data, Climate Data and the Earth's Ecosystems, and Capital Research and Risk Management (Groth and Muntermann, 2010). This has resulted in the employment of some, or a combination, of the various requirements described in Table I to deploy unstructured data mining tools. Though the application of unstructured data mining tools is broad, and some previous works, including Scharber (2007), focus on the available tools in general textual mining, we can identify with the fact that the deployment of these tools requires the active participation of software engineers as well as software engineering principles.

Gonzalez et al. (2012) saw the need to implement a tool that enforces employee data accuracy and completeness. The tool is based on Information Extraction (IE), adaptive learning, and resource classification. The main advantages of the tool are: the transformation of the extracted data into planning decision data, the support for precision and recall metrics, and resource utilization and demand economization.

While all efforts are being made to extract data, it is still challenging to identify tools that can perform storage and classification of extracted unstructured data into structured databases. What is even more challenging is the extraction of data from unstructured data sources that contain text and other data formats such as multimedia. A good way to deal with such challenging instances is the adoption of a flat-file data storage approach. This is evidenced in the C#-oriented framework presented by Abidin et al. (2010), which employs a layered architecture to deal with unstructured data extraction regarding multimedia. The layers consist of: the interface layer (the view that facilitates the interaction between the user and the system), the source layer (the data sources, which can be in a semi-structured or unstructured format), the XML layer (where the extracted data is transformed into a flat-file XML document format), and the multimedia database (the final storage of the structured data). The final storage location in this case must be a data source that supports multimedia data (e.g., Oracle 11g standards). But one of the limitations that we perceive with this type of framework is that it requires the initial data source to be a webpage. Technically, webpages have a structure which is defined within DOM elements. So, parsing the DOM structure into XML is straightforward.
What is difficult is when the source elements are not in any defined tags and are just in documents. Most tools are yet to address this challenge. Another challenge in data extraction, especially in software engineering tasks, is that developers mostly use jargon that is not part of standard dictionaries. When one observes the CTSS tool (Christy and Thambidurai, 2008), for instance, the tool offers a lot of guarantees in terms of short processing time, and different underlying modeling techniques, such as parsing and soft matching rules, are employed in its deployment. But since the input source is a URL, the wide array of benefits that the tool offers is limited to web contents only. It is therefore necessary to look for enhanced ways to deploy applications that can accept inputs from non-webpages as well.

Moving away from Web contents, some researchers focus on Request for Comments (RFC) documents (Gurusamy et al., 2002). While RFC is syntactically different from HTML, the authors' definition of unstructured RFC requires some level of structure or organization of the text in the document. The proposed tool takes a document and reads certain semantic tags in the document, such as Content, Abstract, and so on, before generating the required structured document. This also means that some semi-structured initialization is required to commence the mining process.

In an attempt to understand email and bug report communication between developers (knowing that the communication takes an unstructured textual format and is not web content), Panichella et al. (2012) propose the following workflow: (1) downloading emails and categorizing them into classes, (2) extracting paragraphs from the reports, (3) tracing paragraphs onto methods, (4) filtering the paragraphs, and (5) computing textual similarities between paragraphs and methods. Of course, in this approach the dependency on semi-structured HTML as a data source is alleviated, but the challenge is the interpretation of paragraphs in a general textual document. What the authors propose is the extraction of methods as paragraphs, which limits the scope of the application in terms of general textual data. However, the uniqueness of this approach is the facilitation of re-documentation of source code artifacts, which are extracted from the unstructured data sources and presented in a summarized format. This approach deviates a little from the overly used language generation techniques by writing the summarization through automated techniques. Similarly, Mailpeek (Bacchelli et al., 2012) focuses on text extraction by clustering email contents shared by developers.

Realizing that developers are saddled with the responsibility of interacting with unstructured text (especially source code), Bettenburg et al. (2011) outlined some technical plans to establish meaningful communication between developers. The authors propose a semantic detection module that checks and rectifies the spelling and grammar of the text. Clearly, the authors have seen that the software engineering landscape is crowded with technical words that are not found in standard language dictionaries. A good way to enforce spell-checking is to employ morphological language analysis to identify and describe the morphemes (i.e., the smallest linguistic units that are semantically meaningful). The use of linguistics can further aid developers in adopting rationale detection.
Rationale-containing texts have matching linguistic relationships, which means the extraction of useful information can be done through the identification of rationale features (Rogers et al., 2012).

From the background works, we seek to adapt some of these ideas to design a terms mining tool for document-based NoSQL. The next section details the proposed architectural design.

3. THE ARCHITECTURAL DESIGN OF THE DATA MINING TOOL

3.1 The Process Flow Execution

The entire system (as depicted in Fig. 2) is made of three layers, namely: the view layer, the data mining layer, and the storage layer. The view layer represents a dashboard which allows the user to enter the topics and terms to be mined. This is the output/input layer where the user is enabled to interact with the system. This layer is designed to be deployed on heterogeneous devices such as mobile devices and desktops (including the browser).

When the search term is specified from the input layer, the request is sent to the entry point on the analytics layer, which is designated as the Request Parser. At this point, the specified term is treated as the search artefact. Further, the user has to specify the link to the data source (i.e., the NoSQL storage) in order for the request parser to guarantee that the search can be carried out. Without the specification of the link (which must be a URL over the HTTP protocol), the system will not perform the data extraction task.

Once the initial requirements are met, the search artefact is sent to the analytics layer for Artefact Extraction Definition. The importance of this component is to differentiate topics from terms, and to further determine the exact requirement of the user. Specific to this work, we focus on extracting single-word terms as well as two-word terms. For example, single-word terms may be heart, disease, psychiatry, etc.; two-word terms may include heart disease, psychiatry medication, etc.

The successful decoding of the search term activates the Semantic Engine component, where the analytics framework tries to understand the terms to be extracted. Initially, we consider the artefact specified by the user as Topics. This aids us in doing the traditional information retrieval task, which focuses on extracting the exact artefacts specified by the user. This approach also suggests that when a user specifies an artefact, we look for the exact keyword.

[Figure 2. The architectural design of the data mining platform: the input/output layer (dashboard on web browsers and mobile devices such as notebooks and tablets), the analytics layer (request parser, artefact extraction definition, semantic engine, topic parser with dictionary and thesaurus, topics mapping, search algorithm, association rules, filtering, tagging, serializer, search log, topic/term clustering, and visualization), and the document-style NoSQL storage layer accessed over a REST API with Map/Reduce.]

However, to perform intelligent terms analytics, there is the need to accommodate the concepts of synonyms, antonyms, parts of speech, and lemmatization, as proposed by Zhao and Bhowmick (2003). Thus, the topics are sent to the Topics Parser to check the meaning of the specified artefact. There are two layers designed to check the meanings of the artefacts: the Dictionary and the Thesaurus.
The dictionary contains standard keywords, their meanings, and their antonyms. The thesaurus contains jargon that is otherwise not found in the standard dictionary. This is important in situations where the analytics is done in certain technical domains, such as the medical field, where most of the keywords are not in the standard dictionary. The thesaurus can be updated by the specific domain that is adopting the framework for the analytics task. This further makes the proposed tool highly adaptable and agile.

When the artefact has been reviewed by the topic parser by looking through the thesaurus and dictionary, the Topics Mapping task is enabled. The mapping tasks include the organization of the specified artefact and possible synonyms as may be found in the semantic engine. The expected artefact and the newly found keywords from the dictionary and the thesaurus are categorized as Terms. For instance, when searching for the artefact hemophilia, the terms formation can be formulated as:

{"Hemophilia": "Bleeding Disorder"}
{"Hemophilia": "Blood Clotting"}
{"Hemophilia": "Pseudohemophilia"}
{"Hemophilia": "Hemogenia"}

The successful formation of the terms leads to the activation of the Search Algorithm component. The algorithm we propose is based on the inference-based Apriori, which will be discussed in greater depth in the next section. The search algorithm component is also linked to the main data source where the data is stored. Using the CouchDB framework, for instance, the data source is accessible over a REST API. Primarily, the search algorithm interfaces with the data source through the HTTP GET method. The data source, which is document-style, stores the data in the JSON format. Also, the Map/Reduce feature is employed to query multiple documents within a single database. In our framework, the documents are extracted and written to a flat text file. This helps us read through the entire storage repository easily, regardless of the number of documents.

The extracted terms are then reviewed to determine the associations between them. We apply Association Rules based on the similarities and diversity of the extracted terms. Then, we employ the Filtering methodology to remove the pollutants (or noise) from the data. In most instances, the extracted artifacts are bound to contain keywords that appear valuable but are not actually relevant to the term mining task (e.g., antonyms, or similarly written words that mean different things). The next task is Tagging, which is applied to the filtered data. Tagging is employed to determine the existing relationships between the filtered data and how they are connected to the user-specified artifact. At this stage, the Serializer component checks the format of the categorized data for visual presentation. The data must be modelled in JSON format before it can be parsed to the visual display. Once the data clustering is done in JSON, the output is sent to the visualization layer, as evidenced in Appendix I. The user can then see the output of the result on the dashboard.
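The following minimal sketch illustrates the extraction flow described above under simplifying assumptions: it pulls all documents over HTTP, flattens each JSON document to plain text (mirroring the flat-text-file step), and counts co-occurrences between the artefact and its dictionary/thesaurus expansions as association candidates. The database URL and term lists are hypothetical, and the filtering and tagging stages are omitted.

```python
# Illustrative sketch (assumed endpoint and field layout) of the extraction
# flow: fetch documents over HTTP, flatten the JSON to plain text, and count
# co-occurrences of the search artefact with its expansions.
import requests
from collections import Counter

BASE = "http://localhost:5984/patients"

# Terms formation: the user artefact plus expansions from the semantic engine.
terms = {"hemophilia": ["bleeding disorder", "blood clotting",
                        "pseudohemophilia", "hemogenia"]}

# Fetch every document in one request; include_docs=true returns the bodies.
rows = requests.get(f"{BASE}/_all_docs",
                    params={"include_docs": "true"}).json()["rows"]

cooccurrence = Counter()
for row in rows:
    # Flatten each document's values into one lowercase text line,
    # mirroring the flat-text-file step of the framework.
    text = " ".join(str(v) for v in row["doc"].values()).lower()
    for topic, related in terms.items():
        if topic in text:
            for term in related:
                if term in text:
                    cooccurrence[(topic, term)] += 1  # association candidate

for pair, count in cooccurrence.most_common():
    print(pair, count)
```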
4. THE KALMAN FILTER ALGORITHM

While the Kalman filter algorithm has been used extensively in areas such as navigation control systems, the adaptation of the methodology in data mining is limited (Pietruszkiewicz, 2008). For further mathematical details and formalisms on the Kalman filter, the interested reader is referred to van der Merwe and Wan (2003).

In this section, we discuss how we employ a variant of the Kalman filter which we model as a Bayesian estimate. First of all, we consider the process of storing data in the NoSQL as time dependent, because most storages accept streaming data. The Kalman filter can use recursive estimates to determine states based on incoming data. This is identical to recursive Bayesian estimation over an unobserved Markov process, where the data states (i.e., measurements) follow a hidden Markov model (HMM). The algorithm relies on the search history of the system users to filter the search result. From the specified terms to be extracted, we can establish the following base relationships:

$K \subset Dic, \quad T \subset Db$

to mean that the specified keyword ($K$) is a subset of the vocabularies in the dictionary/thesaurus ($Dic$), and the constructed terms ($T$) from the semantic engine are a subset of the dataset in the NoSQL ($Db$). In a situation where there are no similar vocabularies (i.e., absence of synonyms or antonyms) in $Dic$, then $K = T$. Also, it is practical to get into situations where $K$ is not found at all in the dictionary, and we treat this case as $K = T$ too. The characteristics of $T$ can be defined as follows:

$T = \{n_1, n_2, n_3, n_4, \ldots, n_n\}$

to mean that $T$ can contain a list of several artifacts $n$ that are related to the specified term (described earlier as topics); in the least, $T = \{t_1\}$, meaning the specified term has no related keywords. Thus, $T = \emptyset$ is not supported, because this is an indirect way of saying that there is nothing specified to be searched. The existence of $T$ in the data source is assigned a probability of 1, and the non-existence of $T$ in the data source is represented as 0.

The proposed system has a Search Log that keeps a record of the frequency of the keywords being searched. This is a disk storage that can grow in size, so we use ranking based on the frequency of hits to prioritize which records to keep. When the storage grows out of proportion, less frequently searched (i.e., less prioritized) keywords are deleted.

Based on our goal, we expect the Kalman filter to be the best fit for reliable data mining tasks over a data source that is changing over time. The association rule is designed so that we can determine the frequency of terms $T$ in the database $Db$. Further, we model every pile of documents as a row, identical to RDBMS systems. Also, it is best to consider the interactions within the proposed system as a finite state machine where we can define the states of the data in the NoSQL at every time step. Given a term $A$ in a particular NoSQL document, the conditional probability that the term or its dependency keyword(s) $B$ exist(s) in other appended documents or in another NoSQL database can be expressed as:

$$f_{A|B}(a_i, b_j) = \frac{f_{A,B}(a_i, b_j)}{f_B(b_j)}$$

However, the existence of terms can be independent and random, in which case:

$$f_{A,B}(a_i, b_j) = f_A(a_i) \times f_B(b_j)$$

Following the Markov assumption, the state at the $n$-th timestamp depends only on the immediately previous state:

$$P(A_n \mid A_0, \ldots, A_{n-1}) = P(A_n \mid A_{n-1})$$

Similarly, the measurement of the terms in the NoSQL at the $n$-th timestamp depends only upon the current state of the available list of terms and is conditionally independent of all other states given the current state:

$$P(B_n \mid A_0, \ldots, A_n) = P(B_n \mid A_n)$$

These assumptions can then be extended to design the hidden Markov model using the probability distribution:

$$P(A_0, \ldots, A_n, B_1, \ldots, B_n) = P(A_0) \prod_{i=1}^{n} P(B_i \mid A_i)\, P(A_i \mid A_{i-1})$$

But the probability distribution of interest in the estimation of $A$ is conditioned on the availability of data in the current timestamp if the Kalman filter methodology is employed. This requires a predict step and an update step. The predicted state can be derived as the sum (integral) of the products of the probability distribution associated with the transition from the $(n-1)$-th timestamp to the $n$-th and the probability distribution associated with the previous state, over all possible $A_{n-1}$:

$$P(A_n \mid \mathbf{B}_{n-1}) = \int P(A_n \mid A_{n-1})\, P(A_{n-1} \mid \mathbf{B}_{n-1})\, dA_{n-1}$$

The recorded terms within a given timestamp can be measured in different time slots, since the data is streaming:

$$\mathbf{B}_t = \{B_1, \ldots, B_t\}$$

The product of the measurement likelihood and the next predicted state (of possible terms within the document) gives the updated probability distribution:

$$P(A_n \mid \mathbf{B}_n) = \frac{P(B_n \mid A_n)\, P(A_n \mid \mathbf{B}_{n-1})}{P(B_n \mid \mathbf{B}_{n-1})}$$

where the normalization term is formalized as:

$$P(B_n \mid \mathbf{B}_{n-1}) = \int P(B_n \mid A_n)\, P(A_n \mid \mathbf{B}_{n-1})\, dA_n$$

The Kalman filter algorithm allows us to go recursively through the documents in the NoSQL at various timestamps, and the result of the term mining is updated as new readings are recorded.
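For illustration, the predict/update recursion derived above can be sketched for a binary hidden state (a term being present or absent in the stream). The transition and likelihood probabilities below are made-up values, not parameters from the paper.

```python
# Minimal sketch of the predict/update recursion derived above, for a binary
# hidden state A_n ("term present in the stream" vs. "absent"). The transition
# and likelihood values are illustrative assumptions, not from the paper.

# P(A_n | A_{n-1}): terms tend to persist across appended documents.
transition = {True: {True: 0.8, False: 0.2},
              False: {True: 0.3, False: 0.7}}

# P(B_n | A_n): probability of observing the term given the hidden state
# (observations are noisy because of synonyms and filtering).
likelihood = {True: {True: 0.9, False: 0.1},
              False: {True: 0.2, False: 0.8}}

def bayes_filter(observations, prior=0.5):
    """Recursively compute P(A_n | B_1..B_n) for a stream of observations."""
    belief = prior  # P(A_0 = True)
    for b in observations:
        # Predict: P(A_n | B_{1..n-1}) = sum over A_{n-1} of transition*belief
        predicted = (transition[True][True] * belief
                     + transition[False][True] * (1 - belief))
        # Update: multiply by the measurement likelihood and normalize.
        numer = likelihood[True][b] * predicted
        denom = numer + likelihood[False][b] * (1 - predicted)
        belief = numer / denom
        yield belief

# Each new reading re-adjusts the estimate, as the section describes.
for p in bayes_filter([True, True, False, True]):
    print(round(p, 3))
```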
5. EVALUATION

Two datasets from our medical partners (the Saskatchewan Health Region) are analyzed with the proposed algorithms. The data concern Hemophilia and Psychiatry. We deploy our framework on an Amazon EC2 instance with the following specifications: Intel Xeon CPU E5410 @ 2.14 GHz, 2.37 GHz; RAM: 1.70 GB; 32-bit operating system. The output layer is deployed on a personal computer with the following specifications: Windows 7 32-bit, 4.12/3.98 GHz, 16 GB RAM, 1 TB HDD. The training dataset containing the Hemophilia-related records is 2 terabytes, and the Psychiatry-related data is 10 terabytes. The dataset is spread across multiple NoSQL databases hosted on separate EC2 instances so that not too much latency is experienced during testing. The dictionary and thesaurus are populated with almost 6,000 medical jargon terms specific to Psychiatry and Hemophilia.

For the purpose of testing, we reviewed some of the existing tools for textual data mining; the prominent ones are highlighted in Table II.

Table II. Some Textual Mining Tools
Tool — Major Goal
Orange — Statistical, Machine Learning, Visualization
WEKA — Machine Learning, Knowledge Analysis
Apache Mahout — Machine Learning
RapidMiner — Machine Learning, Text Mining
R — Statistical Computing and Graphics

In this paper, we benchmark the performance of our framework against Apache Mahout (https://mahout.apache.org/) and R (http://www.r-project.org/). In the experimentation, we focus on four major stages of the data mining process, as represented in Figure 3, starting with the basic requirement of topic extraction:

[Figure 3. The term mining steps]

Topic Extraction: The stage where the specified keyword (artefact) is searched for in the NoSQL and the result is returned with several other keywords. This is the initial step considered before proceeding with the other activities.

Term Organization: Topic ranking based on relevance. When the topics are extracted, there is no guarantee that all the keywords are actually (truly) relevant. Term organization is where the topics are ranked and non-relevant ones are either eliminated or rated low. The determination of relevance is based on the co-occurrence of terms at a particular timestamp.

Term Classification: Since the terms are organized into groups from the initial stages, newly identified topics are added to existing groups if they belong to those groups; otherwise, a new group has to be created to properly classify those topics.

Term Clustering: At this stage, the terms are grouped and we try to understand the best-fit relationship between the topics in each group.
5.1 Accuracy Index

In Tables IV and V, we report the results of the experiments on the accuracy of the various mining tasks. To validate the results, we compared the reliability of the proposed framework to Apache Mahout and R, as shown. Four base factors are measured: True Positive (TP) refers to the extraction of expected terms (also referred to as the success ratio); False Positive (FP) refers to the extraction of the desired terms plus unwanted results; True Negative (TN) is when a term is not found because it does not exist in the NoSQL database; and False Negative (FN) is when a term could not be found although it exists in the data source. The four base factors are extended to calculate Precision, Recall, Specificity, Accuracy, and F1 Score. The formalisms are provided in Table III.

Table III. Calculated Parameters
Precision = TP / (TP + FP)
Recall (Sensitivity) = TP / (TP + FN)
Specificity = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

For clarity, we shall refer to our proposed framework in this paper as NoSQL Miner. Considering the two data sources, we studied the performance of each of the tools under discussion. In both Tables IV and V, the following activities are recorded: Topic Extraction (Topic Ext.), Term Organization (Term Org.), Term Classification (Term Class.), and Term Clustering (Term Clust.). We observed that the results for both datasets are symmetrical, so the nature of the data is not a determining factor for performance. This creates the need to study the details of the output.
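The Table III formalisms translate directly into code; the sketch below implements them and reproduces one cell of Table IV as a sanity check.

```python
# Direct implementation of the Table III formalisms. Inputs are the base
# factors (TP, FP, TN, FN); Tables IV and V report them as percentages.

def precision(tp, fp):
    return tp / (tp + fp)                      # TP / (TP + FP)

def recall(tp, fn):
    return tp / (tp + fn)                      # TP / (TP + FN), sensitivity

def specificity(tn, fp):
    return tn / (tn + fp)                      # TN / (TN + FP)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)     # (TP + TN) / all outcomes

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)                 # harmonic mean

# Sanity check against Table IV (NoSQL Miner, Topic Extraction, Hemophilia):
# TP=95.77, FP=9.19, TN=96.32, FN=6.71 should give Accuracy of about 92.36%.
print(round(100 * accuracy(95.77, 96.32, 9.19, 6.71), 2))  # 92.36
```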
Table IV. 2 Terabyte JSON Text on Hemophilia Related Records in CouchDB (all values in %)

NoSQL Miner (Topic Ext. / Term Org. / Term Class. / Term Clust.):
True Positive: 95.77 / 92.87 / 90.33 / 89.22
False Positive: 9.19 / 13.10 / 15.21 / 17.09
True Negative: 96.32 / 97.04 / 94.71 / 94.11
False Negative: 6.71 / 3.66 / 6.98 / 9.45
Precision: 91.24 / 87.64 / 85.59 / 83.92
Recall (Sensitivity): 93.45 / 96.21 / 92.83 / 90.42
Specificity: 91.29 / 88.11 / 86.16 / 84.63
Accuracy: 92.36 / 91.89 / 89.29 / 87.35
F1 Score: 92.34 / 91.72 / 89.06 / 87.05

Apache Mahout (Topic Ext. / Term Org. / Term Class. / Term Clust.):
True Positive: 96.66 / 94.80 / 92.41 / 90.13
False Positive: 16.22 / 23.16 / 28.12 / 30.23
True Negative: 96.43 / 96.98 / 95.82 / 95.23
False Negative: 11.60 / 14.35 / 16.33 / 18.21
Precision: 85.63 / 80.37 / 76.67 / 74.88
Recall (Sensitivity): 89.29 / 86.85 / 84.98 / 83.19
Specificity: 85.60 / 80.72 / 77.31 / 75.90
Accuracy: 87.41 / 83.64 / 80.90 / 79.28
F1 Score: 87.42 / 83.48 / 80.61 / 78.82

R (Topic Ext. / Term Org. / Term Class. / Term Clust.):
True Positive: 97.45 / 94.93 / 93.01 / 91.44
False Positive: 15.23 / 21.77 / 25.90 / 28.80
True Negative: 97.91 / 96.87 / 96.31 / 96.55
False Negative: 9.22 / 10.04 / 12.75 / 17.33
Precision: 86.48 / 81.35 / 78.22 / 76.05
Recall (Sensitivity): 91.36 / 90.44 / 87.94 / 84.07
Specificity: 86.54 / 81.65 / 78.81 / 77.02
Accuracy: 88.88 / 85.77 / 83.05 / 80.30
F1 Score: 88.85 / 85.65 / 82.80 / 79.86

Table V. 10 Terabyte JSON Text on Psychiatry Related Records in CouchDB (all values in %)

NoSQL Miner (Topic Ext. / Term Org. / Term Class. / Term Clust.):
True Positive: 93.19 / 91.11 / 90.20 / 87.32
False Positive: 8.33 / 12.80 / 16.33 / 18.43
True Negative: 95.20 / 95.01 / 94.81 / 92.87
False Negative: 5.70 / 6.09 / 6.98 / 9.45
Precision: 91.79 / 87.68 / 84.67 / 82.57
Recall (Sensitivity): 94.24 / 93.73 / 92.82 / 90.23
Specificity: 91.95 / 88.13 / 85.31 / 83.44
Accuracy: 93.07 / 90.79 / 88.81 / 86.60
F1 Score: 93.00 / 90.61 / 88.56 / 86.23

Apache Mahout (Topic Ext. / Term Org. / Term Class. / Term Clust.):
True Positive: 96.00 / 93.22 / 92.98 / 91.00
False Positive: 15.33 / 24.98 / 27.17 / 31.00
True Negative: 96.00 / 95.23 / 94.67 / 94.02
False Negative: 12.13 / 16.10 / 18.59 / 21.07
Precision: 86.23 / 78.87 / 77.39 / 74.59
Recall (Sensitivity): 88.78 / 85.27 / 83.34 / 81.20
Specificity: 86.23 / 79.22 / 77.70 / 75.20
Accuracy: 87.49 / 82.10 / 80.40 / 78.04
F1 Score: 87.49 / 81.94 / 80.25 / 77.75

R (Topic Ext. / Term Org. / Term Class. / Term Clust.):
True Positive: 96.68 / 95.30 / 93.79 / 93.61
False Positive: 14.21 / 18.32 / 24.43 / 29.33
True Negative: 98.02 / 96.55 / 95.89 / 95.81
False Negative: 11.32 / 13.63 / 15.22 / 19.01
Precision: 87.19 / 83.88 / 79.34 / 76.14
Recall (Sensitivity): 89.52 / 87.49 / 86.04 / 83.12
Specificity: 87.34 / 84.05 / 79.70 / 76.56
Accuracy: 88.41 / 85.72 / 82.71 / 79.67
F1 Score: 88.34 / 85.64 / 82.55 / 79.48

For brevity, the accuracy index of each of the tools is plotted in Figure 5. Clearly, our proposed NoSQL Miner shows better accuracy results than Apache Mahout and R regarding extraction, organization, classification, and clustering. But before we discuss the accuracy, there is a need to understand the level of falsehood and truthfulness in the results. These factors are the underlying principles for the measurement of accuracy. To measure truthfulness, we take the sum of the true positive and true negative values. That is, truthfulness is formalized as:

Truthfulness = TP + TN

Since we record the TP and TN values as percentages, the expected maximum value for truthfulness is 200. This is plotted in Figure 4.

[Figure 4. (a) Truthfulness plot for Hemophilia; (b) Truthfulness plot for Psychiatry]
[Figure 5. (a) Accuracy plot for Hemophilia; (b) Accuracy plot for Psychiatry]
[Figure 6. (a) Falsehood index plot for Hemophilia; (b) Falsehood index plot for Psychiatry]

It is interesting to observe that both Apache Mahout and R outperform our proposed NoSQL Miner in terms of truthfulness. This means that for a given dataset, the other tools are likely to retrieve more desired topics (i.e., a higher success ratio) and also ignore unlikely topics. This behavior is good and desired in certain systems where the focus is not on the overall accuracy of the extracted terms but rather on the success ratio of extraction. However, if the truthfulness of Apache Mahout and R is better, how come they have the lowest accuracy? This question leads us to analyze the level of falsehood. This is where the systems exhibit their levels of wrong perception about topics and terms. In other words, falsehood is the rate at which a tool is wrong. Formally, we express this as:

Falsehood = FN + FP
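As a worked example, the truthfulness and falsehood indices can be computed from the Topic Extraction figures of Table IV:

```python
# Computing the truthfulness and falsehood indices defined above, using the
# Topic Extraction column of Table IV (Hemophilia dataset) as input.
results = {  # tool: (TP, FP, TN, FN), in percent
    "NoSQL Miner":   (95.77,  9.19, 96.32,  6.71),
    "Apache Mahout": (96.66, 16.22, 96.43, 11.60),
    "R":             (97.45, 15.23, 97.91,  9.22),
}

for tool, (tp, fp, tn, fn) in results.items():
    truthfulness = tp + tn   # maximum possible value is 200
    falsehood = fn + fp      # the rate at which the tool is wrong
    print(f"{tool}: truthfulness={truthfulness:.2f}, falsehood={falsehood:.2f}")

# NoSQL Miner has the lowest truthfulness here (192.09 vs. 193.09 and 195.36)
# but also by far the lowest falsehood (15.90 vs. 27.82 and 24.45), which is
# why its final accuracy comes out ahead.
```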
The level of falsehood in Apache Mahout and R is high; in particular, the false positive results are high. This situation explains why, although both tools have high truthfulness results, their final accuracy is poor. The results for both truthfulness and falsehood give us an idea of how Apache Mahout and R are probably doing the topic extraction. When the data source is parsed by Apache Mahout and R, they try to extract as many topics as possible based on their underlying algorithms. This can be advantageous, as more topics (i.e., keywords/artefacts) can be captured, which tends to increase the true positive result. At the same time, this behavior can lower the chances of ignoring a correct topic in the data source. As already posited, the attempt to increase the true positive value can be good for some systems. However, the behavior these tools exhibit shows that while extracting more topics, the false positive value also increases. This is because keywords in the data source that are not needed for a particular query result get extracted and added to the terms.

Since we evaluate the data mining activities in four steps, it is important to state that errors are carried over from one step to the next. So, errors in the topic extraction step are carried into the term organization task, errors in term organization are carried into term classification, and finally into term clustering. Thus, if we observe the results in Tables IV and V, the false positive and false negative values increase from one step to the next.

5.2 Search Duration and Data Scale

One of the key contributions of this paper is offering support for data streaming into document-based NoSQL systems. Currently, we are not able to see such support in Apache Mahout and R. Moreover, these tools require specific data formats to process the data. For example, the R system uses a package called "rjson" which converts JSON objects into R objects. Though some effort is required, this can be automated to retrieve data from incoming sources.

Considering the search duration, our proposed NoSQL Miner is faster. This is because we designed our system to interact with the data source directly and process the JSON data in its native format without any further conversion. In the case of the other two frameworks, we have to observe the conversion period, and when there are errors the procedure has to restart. The search duration is the overall time observed for each tool to perform topic extraction, term organization, term classification, and term clustering. After repeating the experiments more than six (6) times, we found that to process data of approximately 30 million topics, the NoSQL Miner spends 2718 seconds on average, while Apache Mahout and R spend 3958.8 and 4438.8 seconds respectively.

Another observation is that Apache Mahout is the most scalable in terms of job processing per unit time: the framework processes many topics per time slot. If we were considering batch data, Apache Mahout would have been the fastest at accomplishing the unstructured data mining tasks. However, when the data is streaming and the processing is done over time, the NoSQL Miner is the fastest. The reason is that most of the bottlenecks with object conversion are not present in the NoSQL Miner.
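To indicate how such streaming support can be wired up against CouchDB, the sketch below (an assumption-laden illustration, not the NoSQL Miner's actual implementation) consumes the standard _changes feed so new documents are processed as raw JSON the moment they are appended.

```python
# Illustrative sketch (assumed database URL): consuming CouchDB's _changes
# feed so that newly appended documents are mined as they stream in, without
# converting the JSON into another object model first.
import requests

BASE = "http://localhost:5984/patients"

def stream_new_documents(since=0):
    """Long-poll the _changes feed and yield each new document's JSON."""
    while True:
        feed = requests.get(f"{BASE}/_changes",
                            params={"feed": "longpoll",
                                    "include_docs": "true",
                                    "since": since}).json()
        for change in feed["results"]:
            yield change["doc"]          # raw JSON document, no conversion
        since = feed["last_seq"]         # resume from the last sequence number

# Each streamed document can be fed straight into the term-mining steps,
# re-adjusting the Kalman-filter estimate as new readings arrive.
```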
5.3 Limitations

Though our proposed NoSQL Miner outperforms the Apache Mahout and R tools, we cannot conclude that the underlying algorithms of the various tools make the difference. For instance, it is not clear whether the Kalman filter algorithm in the case of the NoSQL Miner is better than the Fuzzy K-Means clustering, Naive Bayes classifier, and Random Forest classifier in the case of Apache Mahout. The study of the underlying algorithms requires independent research, and it is best if these algorithms are tested within the same framework. However, the proposed NoSQL Miner can be considered a preferred choice within the context of our research goal, which focuses on mining unstructured streaming data at different times and re-adjusting the classifier based on incoming data. This goal naturally fits the Kalman filter algorithm, and since we model this algorithm as a Bayesian estimate (Section 4), we are able to attain good results. It will be interesting to know how the results would be affected if the engineering challenges of automatically mining streaming data in Apache Mahout and R were overcome.

6. CONCLUSIONS

There is an increasing amount of unstructured data, and NoSQL databases have been proposed as the preferred storage facilities. While several NoSQL databases exist, our work focuses on the document-append NoSQL database. This type of NoSQL supports actual data and files; some examples are CouchDB (which is tested in this paper) and MongoDB. The challenge, however, is that support for mining data from this type of storage is lacking from both the engineering and data mining perspectives. Most of the textual mining tools available are not easily applicable to document-based NoSQL. Even though the data from document-based NoSQL can be accessed via the browser, Web crawlers are not a good fit for mining data from this storage facility. This is because Web crawlers rely on HTML tags and attributes to understand contents on the Web, while data retrieved from document-based NoSQL is just text (e.g., JSON). This paper therefore proposes and designs a data mining tool for document-based NoSQL systems using the Kalman filter as the underlying algorithm. Two main aspects of big data, variety and velocity, are considered in the design of the tool. The performance of the proposed tool is compared with well-established frameworks for textual mining such as Apache Mahout and R.

In summary, the work accomplishes the following in the area of unstructured data mining in the big data era:
- Proposed a data mining framework for Document-Style NoSQL.
- Showed that the Kalman filter algorithm, though used extensively in areas such as signal processing, can be used in textual mining of streaming data in document-based NoSQL systems.
- The proposed framework supports topic extraction, term organization, term classification, and term clustering.
- In comparison to Apache Mahout and R, the proposed framework is:
  o More accurate when considering data inflow over some time slots.
  o More flexible in adapting to other data sources. Though not tested in this work, the framework can process any text or flat database (e.g., XML).
  o More scalable in accommodating several data sources, because the framework can access several data sources via HTTP connections.
  o Less scalable than Apache Mahout in terms of processing jobs per unit time.

In the future, this work can be extended to other areas of analytics beyond topics and terms. Work can also be done in the areas of re-usable dictionaries and file organization.

7. ACKNOWLEDGMENT

Thanks to our industry partners from the Saskatchewan Health Region, Canada, who are the enterprise consumers of this work. A final big thanks to Prof. Patrick Huang (University of Ontario Institute of Technology), Prof. Jia Zhang (Carnegie Mellon University), and the entire editorial team of the International Journal of Services Computing (IJSC).

8. REFERENCES
Abidin, S. Z. Z., Idris, N. M., and Husain, A. H. (2010). Extraction and classification of unstructured data in WebPages for structured multimedia database via XML. In Information Retrieval & Knowledge Management (CAMP), 2010 International Conference on, pp. 44-49, 17-18 March 2010. doi: 10.1109/INFRKM.2010.5466948

Abramowicz, W., Kaczmarek, T., and Kowalkiewicz, M. (2003). Supporting Topic Map Creation Using Data Mining Techniques. Australasian Journal of Information Systems, Special Issue 2003/2004, Vol. 11, No. 1.

Bacchelli, A., Sasso, T. D., D'Ambros, M., and Lanza, M. (2012). Content classification of development emails. In Proceedings of the 2012 International Conference on Software Engineering (ICSE 2012), IEEE Press, Piscataway, NJ, USA, pp. 375-385.

Balinsky, A., Balinsky, H., and Simske, S. (2010). On the Helmholtz Principle for Data Mining. Hewlett-Packard Development Company, L.P.

Bettenburg, N., Adams, B., Hassan, A. E., and Smidt, M. (2011). A Lightweight Approach to Uncover Technical Artifacts in Unstructured Data. In Program Comprehension (ICPC), 2011 IEEE 19th International Conference on, pp. 185-188, 22-24 June 2011. doi: 10.1109/ICPC.2011.36

Christy, A., and Thambidurai, P. (2008). CTSS: A Tool for Efficient Information Extraction with Soft Matching Rules for Text Mining. Journal of Computer Science, 4(5): 375-381. ISSN 1549-3636

Delgado, M., Martín-Bautista, M., Sánchez, D., and Vila, M. (2002). Mining Text Data: Special Features and Patterns. In Pattern Detection and Discovery, Lecture Notes in Computer Science, Vol. 2447, pp. 175-186. doi: 10.1007/3-540-45728-3_11

Dey, L., and Haque, S. K. M. (2009). Studying the effects of noisy text on text mining applications. In Proceedings of the Third Workshop on Analytics for Noisy Unstructured Text Data (AND '09), ACM, New York, NY, USA, pp. 107-114. doi: 10.1145/1568296.1568314

Feldman, R., Fresko, M., Hirsh, H., Aumann, Y., Liphstat, O., Schler, Y., and Rajman, M. (1998). Knowledge Management: A Text Mining Approach. In Proceedings of the 2nd International Conference on Practical Aspects of Knowledge Management (PAKM98), Basel, Switzerland, 29-30 October 1998.

Godbole, S., Bhattacharya, I., Gupta, A., and Verma, A. (2010). Building re-usable dictionary repositories for real-world text mining. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10), ACM, New York, NY, USA, pp. 1189-1198. doi: 10.1145/1871437.1871588

Gonzalez, T., Santos, P., Orozco, F., Alcaraz, M., Zaldivar, V., Obeso, A. D., and Garcia, A. (2012). Adaptive Employee Profile Classification for Resource Planning Tool. In SRII Global Conference (SRII), 2012 Annual, pp. 544-553, 24-27 July 2012. doi: 10.1109/SRII.2012.67

Groth, S. S., and Muntermann, J. (2010). Discovering Intraday Market Risk Exposures in Unstructured Data Sources: The Case of Corporate Disclosures. In HICSS, pp. 1-10, IEEE Computer Society.

Gurusamy, S., Manjula, D., and Geetha, T. V. (2002). Text Mining in 'Request for Comments Document Series'. In Language Engineering Conference (LEC'02), p. 147.

Han, L., Suzek, T. O., Wang, Y., and Bryant, S. H. (2010). The Text-mining based PubChem Bioassay neighboring analysis. BMC Bioinformatics, 11:549. doi: 10.1186/1471-2105-11-549

Hsu, J. Y., and Yih, W. (1997). Template-Based Information Mining from HTML Documents. American Association for Artificial Intelligence.

Janet, B., and Reddy, A. V. (2011). Cube index for unstructured text analysis and mining. In Proceedings of the 2011 International Conference on Communication, Computing & Security (ICCCS '11), ACM, New York, NY, USA, pp. 397-402. doi: 10.1145/1947940.1948023
Kuonen, D. (2003). Challenges in Bioinformatics for Statistical Data Miners. Bulletin of the Swiss Statistical Society, Vol. 46 (October 2003), pp. 10-17.

Panichella, S., Aponte, J., Di Penta, M., Marcus, A., and Canfora, G. (2012). Mining source code descriptions from developer communications. In Program Comprehension (ICPC), 2012 IEEE 20th International Conference on, pp. 63-72, 11-13 June 2012. doi: 10.1109/ICPC.2012.6240510

Pietruszkiewicz, W. (2008). Dynamical systems and nonlinear Kalman filtering applied in classification. In Cybernetic Intelligent Systems (CIS 2008), 7th IEEE International Conference on, pp. 1-6, 9-10 September 2008.

Rogers, B., Gung, J., Qiao, Y., and Burge, J. E. (2012). Exploring techniques for rationale extraction from existing documents. In Software Engineering (ICSE), 2012 34th International Conference on, pp. 1313-1316, 2-9 June 2012. doi: 10.1109/ICSE.2012.6227091

Rupanagunta, K., Zakkam, D., and Rao, H. (2012). How to Mine Unstructured Data. Information Management, June 29, 2012. http://www.information-management.com/newsletters/data-mining-unstructured-big-data-youtube--10022781-1.html

Scharber, W. (2007). Evaluation of Open Source Text Mining Tools for Cancer Surveillance. NPCR-AERRO Technical Development Team, CDC/NCCDPHP/DCPC.

van der Merwe, R., and Wan, E. (2003). Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models. In Proceedings of the Workshop on Advances in Machine Learning, Montreal, 2003.

Zhao, Q., and Bhowmick, S. S. (2003). Association Rule Mining: A Survey. Technical Report, CAIS, Nanyang Technological University, Singapore, No. 2003116.

Zikopoulos, P. C., Eaton, C., deRoos, D., Deutsch, T., and Lapis, G. (2012). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill. https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Big%20Data%20University/page/FREE%20ebook%20-%20Understanding%20Big%20Data

Authors

Richard K. Lomotey is currently pursuing his Ph.D. in Computer Science at the University of Saskatchewan, Canada, under the supervision of Prof. Ralph Deters. He has been actively researching topics relating to mobile cloud computing and steps towards faster mobile query and search in today's vast data economy (big data).

Prof. Ralph Deters obtained his Ph.D. in Computer Science (1998) from the Federal Armed Forces University (Munich). He is currently a professor in the Department of Computer Science at the University of Saskatchewan (Canada). His research focuses on mobile cloud computing and data management in services ecosystems.

Appendix I

[Visualization on Hemophilia using our Data Mining Tool]