Class document frequency as a learned feature for text categorization

Sharma, Anand

Files

MS_Q111.H3_4288_uh.pdf (1.5 MB)

MS_Q111.H3_4288_r.pdf (1.51 MB)

Date

2008

Authors

Sharma, Anand

Abstract

As the volume of online information grows, most of it in the form of text documents, there is a need to organize these documents so that management and retrieval by search engines become easier. Manual organization of these documents is very difficult and prone to error. Machine learning algorithms can be used to classify and then organize documents because they are quick, relatively more accurate, and less costly. However, documents need feature representations that are suitable for training such algorithms. Machine learning algorithms for document classification use different types of word weightings as features to represent documents. We find that the class document frequency, dfc, of a word is the most important feature in document classification. Machine learning algorithms trained with the dfc of words classify test documents about as accurately as those trained with more complicated features. The importance of dfc is further verified when a simple algorithm, Algd1, developed solely on the basis of dfc, shows performance that compares closely with that of Algt1 and other more complex machine learning algorithms. The importance of high dfc is verified when Algd2 performs comparably with Algt2 and other complex algorithms. This also implies that term frequency contributes little to the classification of documents compared with class document frequency. We also find improved performance when the link information of documents in a class is used along with the word attributes of the document. The term frequency of links and the class document frequency of links contribute similarly to classification performance. This shows the importance of class document frequency as the learned feature that learning algorithms use for effective text categorization.
To show the importance of dfc, we compared the algorithms on the Reuters-21578 text categorization test collection, the Cora data set, and the Citeseer data set.
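
As an illustration only, the following minimal Python sketch shows one way class document frequency might be computed, assuming it means the number of documents in a class that contain a given word. The abstract does not give the thesis's exact definition or the details of its algorithms, so the function names, the toy training data, and the sum-based scoring rule are all hypothetical and should not be read as the author's method.

```python
from collections import Counter, defaultdict


def class_document_frequency(docs):
    """Count, per class, how many documents contain each word.

    docs is an iterable of (label, text) pairs. This definition is an
    assumption made for illustration, not taken from the thesis.
    """
    cdf = defaultdict(Counter)
    for label, text in docs:
        # set() makes each word count once per document,
        # no matter how often it repeats inside that document.
        cdf[label].update(set(text.lower().split()))
    return cdf


def classify(text, cdf):
    """Hypothetical scoring rule: sum the class document frequencies
    of the distinct words in the text and pick the highest-scoring class."""
    words = set(text.lower().split())
    scores = {
        label: sum(counts[w] for w in words)
        for label, counts in cdf.items()
    }
    return max(scores, key=scores.get)


# Toy training data, invented for this example.
train = [
    ("grain", "wheat corn wheat harvest"),
    ("grain", "wheat export tonnes"),
    ("money", "bank rate dollar"),
    ("money", "dollar bank bank market"),
]

cdf = class_document_frequency(train)
predicted = classify("wheat harvest report", cdf)
```

Note that repeated words inside one document ("wheat" in the first toy document, "bank" in the last) do not raise the count, which is what separates this feature from term frequency.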

Description

Thesis (M.S.)--University of Hawaii at Manoa, 2008.
Includes bibliographical references (leaves 56-58).
vii, 58 leaves, bound 29 cm

URI

http://hdl.handle.net/10125/20575

Related To

Theses for the degree of Master of Science (University of Hawaii at Manoa). Electrical Engineering; no. 4288

Rights

All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.

Collections

M.S. - Electrical Engineering
