Please use this identifier to cite or link to this item: http://hdl.handle.net/10125/62395

Data Discovery and Anomaly Detection using Atypicality.

File Size Format
2017-12-phd-sabeti.pdf 2.66 MB Adobe PDF

Item Summary

Title:Data Discovery and Anomaly Detection using Atypicality.
Authors:Sabeti, Elyas
Contributors:Electrical Engineering (department)
Date Issued:Dec 2017
Publisher:University of Hawaiʻi at Mānoa
Abstract:One characteristic of the modern era is the exponential growth of information, and the ready
availability of this information through networks, including the Internet: "Big Data." The
question is what to do with this enormous amount of information. One possibility is to
characterize it through statistics: think averages. The perspective of our approach is the
opposite, namely that most of the value in the information is in the parts that deviate from
the average, that are unusual, atypical. Think of art: the valuable paintings or writings are
those that deviate from the norms, that are atypical. The same could be true for venture
development and scientific research. The aim of atypicality is to extract small, rare, unusual
and interesting pieces out of big data, which complements statistics about typical data.
We define atypicality as follows: a sequence is atypical if it can be described (coded) with
fewer bits in itself rather than using the (optimum) code for typical sequences. This definition
is based on the ability of a universal source coder (atypical encoder) to encode a sequence
(or a subsequence) in fewer bits in comparison to the optimum encoder of typical data (typical
encoder). An inaccurate typical model raises the atypicality flag for all the sequences, and a
naive atypical encoder can never catch an atypical sequence. So for a sequence, the difference
between the performance of an optimum typical encoder and a universal atypical encoder shows
how atypical that sequence is. We measure the performance of an encoder by the number
of bits it requires to describe a sequence. Thus atypicality can also be deduced as a measure
that depends on the performance of the optimum coder and a universal coder in describing
a sequence. This is closely related to Rissanen's Minimum Description Length (MDL).
In this work, after defining the notion of atypicality, we first set up a framework for a binary model
to analyze our atypicality measure and verify its properties, then we extend our approach
to real-valued models. In our approach for real-valued models, we start by introducing
two new predictive encoders for accurate description length called Normalized Likelihood
Method (NLM) and Sufficient Statistic Method (SSM), which improve the redundancy of
predictive MDL, then we use asymptotic MDL to derive analytical results. Our algorithms
have been applied to various sources of Big Data such as heart rate Holter monitoring, DNA,
15 years of stock market and 2 years of oceanographic data to find arrhythmias, viral and
bacterial infections, unusual stock market behavior and whale vocalization, respectively.
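The definition in the abstract can be illustrated for the binary case with a small sketch. It is not taken from the thesis: it assumes the typical model is an i.i.d. Bernoulli source with a known parameter, and it swaps in the Krichevsky-Trofimov sequential estimator as one possible universal (atypical) encoder. The function names and the `overhead_bits` parameter (a stand-in for any cost of signalling that the atypical code is in use) are hypothetical.

```python
import math

def typical_bits(seq, p_one):
    """Ideal code length in bits under a known i.i.d. Bernoulli(p_one) model.

    Assumption: this plays the role of the optimum typical encoder.
    """
    return sum(-math.log2(p_one if b else 1.0 - p_one) for b in seq)

def universal_bits(seq):
    """Ideal code length in bits under the Krichevsky-Trofimov estimator.

    Assumption: KT is used here as a stand-in universal coder; it needs
    no prior knowledge of the source parameter.
    """
    counts = [0, 0]  # number of zeros and ones seen so far
    bits = 0.0
    for b in seq:
        prob = (counts[b] + 0.5) / (counts[0] + counts[1] + 1.0)
        bits -= math.log2(prob)
        counts[b] += 1
    return bits

def atypicality_gap(seq, p_one, overhead_bits=0.0):
    """Bits saved by coding the sequence 'in itself' instead of typically.

    A positive gap means the universal coder wins, so the sequence would
    be flagged as atypical under this sketch.
    """
    return typical_bits(seq, p_one) - universal_bits(seq) - overhead_bits

if __name__ == "__main__":
    fair_coin = 0.5
    run_of_ones = [1] * 32      # far from what a fair coin produces
    alternating = [0, 1] * 16   # empirical frequency matches the fair coin
    print(atypicality_gap(run_of_ones, fair_coin) > 0)
    print(atypicality_gap(alternating, fair_coin) > 0)
```

Under these assumptions the run of ones is cheaper to describe with the universal coder than with the fair-coin code, while the alternating sequence is not, because the universal coder pays a learning cost that the typical code avoids.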
Description:Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017.
URI:http://hdl.handle.net/10125/62395
Rights:All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.
Appears in Collections: Ph.D. - Electrical Engineering

