Data Discovery and Anomaly Detection using Atypicality.

Date
2017-12
Authors
Sabeti, Elyas
Department
Electrical Engineering
Abstract
One characteristic of the modern era is the exponential growth of information and the ready availability of this information through networks, including the Internet: "Big Data." The question is what to do with this enormous amount of information. One possibility is to characterize it through statistics (think averages). The perspective of our approach is the opposite, namely that most of the value in the information is in the parts that deviate from the average, that are unusual, atypical. Think of art: the valuable paintings or writings are those that deviate from the norms, that are atypical. The same could be true for venture development and scientific research. The aim of atypicality is to extract small, rare, unusual and interesting pieces out of big data, which complements statistics about typical data. We define atypicality as follows: a sequence is atypical if it can be described (coded) with fewer bits in itself rather than using the (optimum) code for typical sequences. This definition is based on the ability of a universal source coder (atypical encoder) to encode a sequence (or a subsequence) in fewer bits than the optimum encoder of typical data (typical encoder). An inaccurate typical model raises the atypicality flag for all sequences, and a naive atypical encoder can never catch an atypical sequence. So for a given sequence, the difference between the performance of an optimum typical encoder and that of a universal atypical encoder shows how atypical the sequence is. We measure the performance of an encoder by the number of bits it requires to describe a sequence. Thus atypicality can also be viewed as a measure that depends on the performance of the optimum coder and a universal coder in describing a sequence. This is closely related to Rissanen's Minimum Description Length (MDL).
In this work, after defining the notion of atypicality, we first set up a framework for a binary model to analyze our atypicality measure and verify its properties, then we extend our approach to real-valued models. In our approach for real-valued models, we start by introducing two new predictive encoders for accurate description length, called the Normalized Likelihood Method (NLM) and the Sufficient Statistic Method (SSM), which improve the redundancy of predictive MDL; then we use asymptotic MDL to derive analytical results. Our algorithms have been applied to various sources of Big Data, such as heart rate Holter monitoring, DNA, 15 years of stock market data and 2 years of oceanographic data, to find arrhythmias, viral and bacterial infections, unusual stock market behavior and whale vocalization, respectively.
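The definition in the abstract can be illustrated with a small sketch for the binary case. This is not the dissertation's implementation: the typical model is assumed to be a known Bernoulli source, the universal (atypical) encoder is assumed to be the Krichevsky-Trofimov sequential estimator, and the function names and the `overhead_bits` parameter (a stand-in for any cost of signaling that a sequence is coded atypically) are hypothetical.

```python
import math


def typical_codelength(bits, p_one):
    """Bits needed to code `bits` with the (assumed known) Bernoulli(p_one) typical model."""
    if not 0.0 < p_one < 1.0:
        raise ValueError("p_one must lie strictly between 0 and 1")
    return sum(-math.log2(p_one if b else 1.0 - p_one) for b in bits)


def universal_codelength(bits):
    """Bits needed by a Krichevsky-Trofimov sequential coder (assumed universal encoder)."""
    ones = 0
    zeros = 0
    total = 0.0
    for b in bits:
        # KT estimate of the next symbol being 1, from the counts seen so far.
        prob_one = (ones + 0.5) / (ones + zeros + 1.0)
        total += -math.log2(prob_one if b else 1.0 - prob_one)
        if b:
            ones += 1
        else:
            zeros += 1
    return total


def atypicality_gap(bits, p_one, overhead_bits=0.0):
    """Typical codelength minus universal codelength (plus hypothetical overhead)."""
    return typical_codelength(bits, p_one) - (universal_codelength(bits) + overhead_bits)


def is_atypical(bits, p_one, overhead_bits=0.0):
    """A sequence is flagged when coding it 'in itself' takes fewer bits."""
    return atypicality_gap(bits, p_one, overhead_bits) > 0.0


if __name__ == "__main__":
    p_typical = 0.5                 # assumed typical model: fair coin flips
    run_of_ones = [1] * 20          # highly structured, should be flagged
    alternating = [0, 1] * 10       # balanced, should not be flagged
    for name, seq in [("run_of_ones", run_of_ones), ("alternating", alternating)]:
        gap = atypicality_gap(seq, p_typical)
        print(name, round(gap, 3), is_atypical(seq, p_typical))
```

Under these assumptions, a long run of ones is coded far more cheaply by the universal coder than by the fair-coin model, so it is flagged, while a balanced sequence is not. The size of the gap gives a rough sense of how atypical a sequence is.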
Rights
All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.
Email libraryada-l@lists.hawaii.edu if you need this content in ADA-compliant format.