Data Discovery and Anomaly Detection using Atypicality.

Publisher

University of Hawaii at Manoa

Abstract

One characteristic of the modern era is the exponential growth of information and the ready availability of this information through networks, including the Internet: "Big Data." The question is what to do with this enormous amount of information. One possibility is to characterize it through statistics (think averages). The perspective of our approach is the opposite, namely that most of the value in the information is in the parts that deviate from the average, that are unusual, atypical. Think of art: the valuable paintings or writings are those that deviate from the norms, that are atypical. The same could be true for venture development and scientific research. The aim of atypicality is to extract small, rare, unusual and interesting pieces out of big data, which complements statistics about typical data. We define atypicality as follows: a sequence is atypical if it can be described (coded) with fewer bits in itself rather than using the (optimum) code for typical sequences. This definition is based on the ability of a universal source coder (atypical encoder) to encode a sequence (or a subsequence) in fewer bits than the optimum encoder of typical data (typical encoder). An inaccurate typical model raises the atypicality flag for all sequences, and a naive atypical encoder can never catch an atypical sequence. So for a sequence, the difference between the performance of an optimum typical encoder and that of a universal atypical encoder shows how atypical that sequence is. We measure the performance of an encoder by the number of bits it requires to describe a sequence. Thus atypicality can also be viewed as a measure that depends on the performance of the optimum coder and a universal coder in describing a sequence. This is closely related to Rissanen's Minimum Description Length (MDL).
In this work, after defining the notion of atypicality, we first set up a framework for a binary model to analyze our atypicality measure and verify its properties; then we extend our approach to real-valued models. In our approach for real-valued models, we start by introducing two new predictive encoders for accurate description length, called the Normalized Likelihood Method (NLM) and the Sufficient Statistic Method (SSM), which improve the redundancy of predictive MDL; then we use asymptotic MDL to derive analytical results. Our algorithms have been applied to various sources of Big Data, such as heart rate Holter monitoring, DNA, 15 years of stock market data and 2 years of oceanographic data, to find arrhythmias, viral and bacterial infections, unusual stock market behavior and whale vocalization, respectively.
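
The definition of atypicality in the abstract can be illustrated for the binary case with a minimal sketch. It is not the thesis's implementation: it assumes the typical model is an i.i.d. Bernoulli source with known parameter `p_one`, and it swaps in the Krichevsky-Trofimov sequential estimator as a stand-in for the universal (atypical) encoder. The function names and the `overhead_bits` parameter (a placeholder for whatever it costs to signal that a sequence was coded atypically) are hypothetical.

```python
import math


def typical_code_length(bits, p_one):
    """Bits needed under the assumed typical model: i.i.d. Bernoulli(p_one)."""
    return sum(-math.log2(p_one if b else 1.0 - p_one) for b in bits)


def universal_code_length(bits):
    """Bits needed by a Krichevsky-Trofimov sequential coder.

    This coder knows nothing about the typical model; it predicts each
    bit from the counts of zeros and ones seen so far (assumed stand-in
    for the universal atypical encoder).
    """
    counts = [0, 0]
    total = 0.0
    for b in bits:
        prob = (counts[b] + 0.5) / (counts[0] + counts[1] + 1.0)
        total -= math.log2(prob)
        counts[b] += 1
    return total


def is_atypical(bits, p_one, overhead_bits=0.0):
    """Flag a sequence that is cheaper to code 'in itself' than with the typical code.

    overhead_bits is a hypothetical penalty for signalling the atypical coder.
    """
    atypical_cost = universal_code_length(bits) + overhead_bits
    return atypical_cost < typical_code_length(bits, p_one)


if __name__ == "__main__":
    run_of_ones = [1] * 32      # highly structured relative to a fair-coin model
    alternating = [0, 1] * 16   # balanced counts, so this coder gains nothing
    print(is_atypical(run_of_ones, p_one=0.5, overhead_bits=2.0))
    print(is_atypical(alternating, p_one=0.5))
```

Under a fair-coin typical model, the run of ones costs 32 bits with the typical code but only a few bits with the universal coder, so it is flagged. The alternating sequence is not flagged, because this particular universal coder only tracks symbol counts; a richer universal coder would be needed to catch such structure.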

Type

Thesis
