Data Discovery and Anomaly Detection using Atypicality.
Date
2017-12
Authors
Sabeti, Elyas
Department
Electrical Engineering
Abstract
One characteristic of the modern era is the exponential growth of information and the ready
availability of this information through networks, including the Internet: "Big Data." The
question is what to do with this enormous amount of information. One possibility is to
characterize it through statistics (think averages). The perspective of our approach is the
opposite, namely that most of the value in the information is in the parts that deviate from
the average, that are unusual, atypical. Think of art: the valuable paintings or writings are
those that deviate from the norms, that are atypical. The same could be true for venture
development and scientific research. The aim of atypicality is to extract small, rare, unusual
and interesting pieces out of big data, which complements statistics about typical data.
We define atypicality as follows: a sequence is atypical if it can be described (coded) with
fewer bits in itself rather than using the (optimum) code for typical sequences. This definition
is based on the ability of a universal source coder (the atypical encoder) to encode a sequence
(or a subsequence) in fewer bits than the optimum encoder of typical data (the typical
encoder). An inaccurate typical model raises the atypicality flag for all sequences, and a
naive atypical encoder can never catch an atypical sequence. So for a sequence, the difference
between the performance of an optimum typical encoder and a universal atypical encoder shows
how atypical that sequence is. We measure the performance of an encoder by the number
of bits it requires to describe a sequence. Thus atypicality can also be regarded as a measure
that depends on the performance of the optimum coder and a universal coder in describing
a sequence. This is closely related to Rissanen's Minimum Description Length (MDL).
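As a rough illustration of the definition above for a binary sequence, the following minimal sketch compares two code lengths. It assumes the typical model is an i.i.d. Bernoulli source with known parameter `p_one`, and it uses the Krichevsky-Trofimov (KT) sequential estimator as a stand-in universal coder. Both choices, along with the function names and the optional `overhead_bits` penalty, are assumptions made for illustration and are not taken from the dissertation.

```python
import math


def typical_code_length(bits, p_one):
    """Ideal code length (in bits) under an assumed i.i.d. Bernoulli(p_one) typical model."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * math.log2(p_one) + zeros * math.log2(1.0 - p_one))


def universal_code_length(bits):
    """Ideal code length (in bits) under the KT sequential estimator.

    The KT estimator is an illustrative choice of universal coder:
    it needs no prior knowledge of the source parameter.
    """
    ones = 0
    zeros = 0
    total = 0.0
    for b in bits:
        seen = ones if b else zeros
        # KT predictive probability of the next symbol given the counts so far
        prob = (seen + 0.5) / (ones + zeros + 1.0)
        total -= math.log2(prob)
        if b:
            ones += 1
        else:
            zeros += 1
    return total


def atypicality(bits, p_one, overhead_bits=0.0):
    """Bits saved by the universal coder relative to the typical coder.

    overhead_bits is a hypothetical penalty for signalling that the
    atypical encoder is in use. A positive result suggests the
    sequence is atypical under these assumptions.
    """
    return typical_code_length(bits, p_one) - universal_code_length(bits) - overhead_bits


if __name__ == "__main__":
    unusual = [1] * 50              # all ones, very unlikely when p_one = 0.1
    ordinary = [1] * 5 + [0] * 45   # empirical frequency matches p_one = 0.1
    print(atypicality(unusual, 0.1))
    print(atypicality(ordinary, 0.1))
```

Under these assumptions, the all-ones sequence yields a positive value (the universal coder saves bits), while the sequence whose statistics match the typical model yields a negative value, so it would not be flagged.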
In this work, after defining the notion of atypicality, we first set up a framework for binary models
to analyze our atypicality measure and verify its properties; then we extend our approach
to real-valued models. In our approach for real-valued models, we start by introducing
two new predictive encoders for accurate description length, called the Normalized Likelihood
Method (NLM) and the Sufficient Statistic Method (SSM), which improve the redundancy of
predictive MDL; then we use asymptotic MDL to derive analytical results. Our algorithms
have been applied to various sources of Big Data, such as heart rate Holter monitoring, DNA,
15 years of stock market data and 2 years of oceanographic data, to find arrhythmias, viral and
bacterial infections, unusual stock market behavior and whale vocalization, respectively.
Rights
All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.
Email libraryada-l@lists.hawaii.edu if you need this content in ADA-compliant format.