Prevent Low-Quality Analytics by Automatic Selection of the Best-Fitting Training Data

Kiefer, Cornelia; Reimann, Peter; Mitschang, Bernhard

dc.contributor.author	Kiefer, Cornelia
dc.contributor.author	Reimann, Peter
dc.contributor.author	Mitschang, Bernhard
dc.date.accessioned	2020-01-04T07:21:27Z
dc.date.available	2020-01-04T07:21:27Z
dc.date.issued	2020-01-07
dc.description.abstract	Data analysis pipelines consist of a sequence of analysis tools. Most of these tools are based on supervised machine learning techniques and thus rely on labeled training data. Selecting appropriate training data has a crucial impact on analytics quality. Yet, most of the time, domain experts who construct analysis pipelines neglect the task of selecting appropriate training data. They rely on default training data sets, for example because they do not know which other training data sets exist and what they are used for. However, default training data sets may be very different from the domain-specific input data that is to be analyzed, leading to low-quality results. Moreover, these input data sets are usually unlabeled, so analytics quality cannot be measured with evaluation metrics. Our contribution comprises a method that (1) indicates the expected quality to the domain expert while constructing the analysis pipeline, without the need for labels, and (2) automatically selects the best-fitting training data. It is based on a measurement of the similarity between input and training data. In our evaluation, we consider the part-of-speech tagger tool and show that Latent Semantic Analysis (LSA) and Cosine Similarity are suited as indicators for the quality of analysis results and as a basis for an automatic selection of the best-fitting training data.
dc.format.extent	10 pages
dc.identifier.doi	10.24251/HICSS.2020.129
dc.identifier.isbn	978-0-9981331-3-3
dc.identifier.uri	http://hdl.handle.net/10125/63868
dc.language.iso	eng
dc.relation.ispartof	Proceedings of the 53rd Hawaii International Conference on System Sciences
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject	Data, Text, and Web Mining for Business Analytics
dc.subject	data quality
dc.subject	domain-specific data analysis
dc.subject	text analysis
dc.subject	text similarity
dc.subject	training data
dc.title	Prevent Low-Quality Analytics by Automatic Selection of the Best-Fitting Training Data
dc.type	Conference Paper
dc.type.dcmi	Text
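
The abstract describes selecting training data by measuring its similarity to the unlabeled input data. The following is a minimal sketch of that idea, not the authors' implementation: it ranks candidate training sets by cosine similarity over plain term-frequency vectors and omits the LSA step entirely (a simplification to keep the example dependency-free). The function names, the tokenizer, and the toy corpora are all hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_training_sets(input_text, training_sets):
    """Return (name, similarity) pairs, most similar training set first."""
    input_vec = term_frequencies(input_text)
    scores = [
        (name, cosine_similarity(input_vec, term_frequencies(text)))
        for name, text in training_sets.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpora; real use would load full training corpora.
training_sets = {
    "news": "the government announced a new budget and the minister spoke to parliament",
    "chat": "lol omg u there? gonna grab pizza later, u in?",
}
input_text = "omg gonna be late lol, u still there?"

ranking = rank_training_sets(input_text, training_sets)
best_name, best_score = ranking[0]
print(best_name, round(best_score, 3))
```

In this sketch the similarity score plays both roles named in the abstract: the top-ranked entry is the selected training set, and its score could serve as a label-free indicator of expected quality. Whether raw term frequencies are a good enough representation is an open assumption here; the paper evaluates LSA combined with cosine similarity.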

Files

Original bundle

Name: 0103.pdf
Size: 878.68 KB
Format: Adobe Portable Document Format

Collections

Data, Text, and Web Mining for Business Analytics