Please use this identifier to cite or link to this item: http://hdl.handle.net/10125/20575

Class document frequency as a learned feature for text categorization

File | Description | Size | Format
MS_Q111.H3_4288_uh.pdf | Version for UH users | 1.54 MB | Adobe PDF
MS_Q111.H3_4288_r.pdf | Version for non-UH users. Copying/Printing is not permitted | 1.54 MB | Adobe PDF

Item Summary

Title: Class document frequency as a learned feature for text categorization
Authors: Sharma, Anand
Issue Date: 2008
Abstract: With the growth of online information, most of which is in the form of text documents, there is a need to organize these documents so that management and retrieval by search engines become easier. Manual organization of these documents is very difficult and prone to error. Machine learning algorithms can be used for classification and subsequent organization because they are quick, relatively more accurate and less costly. However, documents need feature representations that are suitable for training machine learning algorithms for document classification. Such algorithms use different types of word weightings as features to represent documents. We find that the class document frequency, dfc, of a word is the most important feature in document classification. Machine learning algorithms trained with the dfc of words perform similarly, in terms of correct classification of test documents, to those trained with more complicated features. The importance of dfc is further verified when a simple algorithm, Algd1, developed solely on the basis of dfc, shows performance that compares closely with that of Algt1 and other more complex machine learning algorithms. The importance of high dfc is verified when Algd2 performs comparably with Algt2 and other complex algorithms. This also implies that term frequency contributes little to the classification of documents compared to class document frequency. We also find improved performance when the link information of documents in a class is used along with the word attributes of the document. The term frequency of a link and the class document frequency of a link contribute similarly to classification performance. This shows the importance of class document frequency as the learned feature that learning algorithms use for effective text categorization.
To show the importance of dfc, we compared the algorithms on the Reuters-21578 text categorization test collection, the Cora data set and the CiteSeer data set.
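The abstract does not define class document frequency formally or describe the thesis's algorithms. As a rough illustration only, the Python sketch below assumes that the class document frequency of a word is the number of training documents in a class that contain the word at least once, and it scores a test document by summing those counts per class. The function names, the toy documents and the scoring rule are all hypothetical and are not taken from the thesis.

```python
from collections import Counter

def class_document_frequency(docs, labels):
    """For each class, count how many of its documents contain each word.

    Assumed definition for illustration; the thesis may define it differently.
    """
    class_df = {}
    for text, label in zip(docs, labels):
        # set() makes a word count once per document, ignoring term frequency
        words = set(text.lower().split())
        class_df.setdefault(label, Counter()).update(words)
    return class_df

def classify(text, class_df, class_sizes):
    """Hypothetical scoring rule: sum the class document frequencies of the
    distinct words in `text`, normalised by the number of documents per class."""
    words = set(text.lower().split())
    scores = {
        label: sum(counts[w] for w in words) / class_sizes[label]
        for label, counts in class_df.items()
    }
    return max(scores, key=scores.get)

# Toy training data (invented for this example)
train_docs = [
    "wheat prices rise on export demand",
    "wheat harvest and corn export figures",
    "bank raises interest rates",
    "central bank holds interest rates steady",
]
train_labels = ["grain", "grain", "money", "money"]

class_df = class_document_frequency(train_docs, train_labels)
class_sizes = Counter(train_labels)
prediction = classify("wheat export report", class_df, class_sizes)
```

Because each document is reduced to a set of words, repeating a word within a document changes nothing. This is the sense in which such a feature ignores term frequency.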
Description: Thesis (M.S.)--University of Hawaii at Manoa, 2008.
Includes bibliographical references (leaves 56-58).
vii, 58 leaves, bound 29 cm
URI/DOI: http://hdl.handle.net/10125/20575
Rights: All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.
Appears in Collections: M.S. - Electrical Engineering


