Item Description

Title: Class document frequency as a learned feature for text categorization 
Author: Sharma, Anand
Date: 2008
Abstract: As the volume of online information grows, most of it in the form of text documents, there is a need to organize these documents so that management and retrieval by search engines become easier. Manual organization is very difficult and error-prone. Machine learning algorithms can be used to classify and then organize documents because they are fast, relatively accurate, and less costly. However, documents need feature representations that are suitable for training such algorithms, and different algorithms use different types of word weightings as features. We find that the class document frequency, dfc, of a word is the most important feature in document classification. Machine learning algorithms trained on the dfc of words classify test documents about as accurately as algorithms trained on more complicated features. The importance of dfc is further verified by a simple algorithm, Algd1, developed solely on the basis of dfc, whose performance compares closely with that of Algt1 and of other, more complex machine learning algorithms. The importance of high dfc is verified by Algd2, which performs comparably with Algt2 and other complex algorithms. This also implies that term frequency contributes little to document classification compared with class document frequency. We also find improved performance when the link information of documents in a class is used along with the word attributes of the documents. The term frequency of links and the class document frequency of links contribute similarly to classification performance. These results show the importance of class document frequency as the learned feature that learning algorithms use for effective text categorization.
We demonstrated the importance of dfc by comparing the algorithms on the Reuters-21578 text categorization test collection, the Cora data set, and the Citeseer data set.
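
The idea of class document frequency can be illustrated with a minimal sketch. The abstract does not define the thesis's algorithms, so the code below is not Algd1 or Algd2; it is a hypothetical toy classifier that assumes class document frequency means the number of training documents in a class that contain a given word. The function names, the whitespace tokenizer, the scoring rule (sum of class document frequencies divided by class size), and the four-document training set are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def class_document_frequency(docs):
    """Count, per class, how many documents contain each word.

    docs: iterable of (label, text) pairs (hypothetical input format).
    Returns (class_df, class_sizes).
    """
    class_df = defaultdict(Counter)
    class_sizes = Counter()
    for label, text in docs:
        class_sizes[label] += 1
        # set() makes each document count once per word,
        # so term frequency inside a document is ignored.
        class_df[label].update(set(text.lower().split()))
    return class_df, class_sizes


def classify(text, class_df, class_sizes):
    """Pick the class whose size-normalized document frequencies sum highest.

    This scoring rule is an assumption for illustration only.
    """
    words = set(text.lower().split())
    scores = {
        label: sum(class_df[label][w] for w in words) / class_sizes[label]
        for label in class_sizes
    }
    return max(scores, key=scores.get)


# Made-up training documents, not taken from Reuters-21578, Cora, or Citeseer.
train = [
    ("grain", "wheat harvest exports rise"),
    ("grain", "wheat and corn prices fall"),
    ("money", "central bank raises interest rates"),
    ("money", "bank lending rates fall"),
]
class_df, class_sizes = class_document_frequency(train)
predicted = classify("wheat prices rise", class_df, class_sizes)
print(predicted)
```

Because each document contributes at most one count per word, repeating a word inside a document does not change its score, which is the distinction from term frequency that the abstract draws.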
Description: Thesis (M.S.)--University of Hawaii at Manoa, 2008. Includes bibliographical references (leaves 56-58). vii, 58 leaves, bound 29 cm
URI: http://hdl.handle.net/10125/20575
Rights: All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.

Item File(s)

File: MS_Q111.H3_4288_r.pdf (PDF, 1.507Mb). Restricted for viewing only.
File: MS_Q111.H3_4288_uh.pdf (PDF, 1.503Mb). For UH users only.
