Class document frequency as a learned feature for text categorization

Date
2008
Authors
Sharma, Anand
Abstract
As the volume of online information grows, most of it in the form of text documents, there is a need to organize it so that management and retrieval by search engines become easier. Manual organization of these documents is very difficult and prone to error. Machine learning algorithms can be used for classification and then organization because they are quick, relatively more accurate, and less costly. However, documents need feature representations that are suitable for training such algorithms. Machine learning algorithms for document classification use different types of word weightings as features for representing documents. We find that the class document frequency, dfc, of a word is the most important feature in document classification. Machine learning algorithms trained with the dfc of words classify test documents about as accurately as those trained with more complicated features. The importance of dfc is further verified when a simple algorithm, Algd1, developed solely on the basis of dfc, shows performance that compares closely with that of Algt1 and other more complex machine learning algorithms. The importance of high dfc is verified when Algd2 performs comparably with Algt2 and other complex algorithms. This also implies that term frequency contributes little to the classification of documents compared to class document frequency. We also find improved performance when the link information of documents in a class is used along with the word attributes of the document. The term frequency of links and the class document frequency of links contribute similarly to classification performance. This shows the importance of class document frequency as the learned feature that learning algorithms use for effective text categorization.
To demonstrate the importance of dfc, we compared the algorithms on the Reuters-21578 text categorization test collection, the Cora data set, and the Citeseer data set.
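As an illustration of the feature the abstract describes, the following minimal Python sketch counts, for each class, how many training documents of that class contain a given word. The abstract does not give the thesis's exact definition or its algorithms, so everything here is an assumption: the whitespace tokenization, the toy training data, and the hypothetical names `class_document_frequency` and `classify`. The scoring rule is a simple stand-in and is not the thesis's Algd1 or Algd2.

```python
from collections import Counter, defaultdict


def class_document_frequency(docs):
    """Count, per class, the number of documents containing each word.

    docs: iterable of (label, text) pairs.
    Returns {label: Counter mapping word -> document count in that class}.
    """
    class_doc_freq = defaultdict(Counter)
    for label, text in docs:
        # set() makes a word count once per document, however often it
        # repeats; this is what separates it from term frequency.
        class_doc_freq[label].update(set(text.lower().split()))
    return class_doc_freq


def classify(text, class_doc_freq, class_sizes):
    """Hypothetical scorer: sum the class document frequencies of the
    document's distinct words, normalized by the class's document count."""
    words = set(text.lower().split())
    scores = {
        label: sum(counts[w] for w in words) / class_sizes[label]
        for label, counts in class_doc_freq.items()
    }
    return max(scores, key=scores.get)


# Toy training data (invented for illustration only).
train = [
    ("grain", "wheat harvest wheat export"),
    ("grain", "corn and wheat prices"),
    ("trade", "export tariff talks"),
    ("trade", "trade deficit and tariff"),
]

class_doc_freq = class_document_frequency(train)
class_sizes = Counter(label for label, _ in train)

print(class_doc_freq["grain"]["wheat"])
print(classify("wheat prices rise", class_doc_freq, class_sizes))
```

In this toy data "wheat" occurs three times in the "grain" documents but in only two of them, so its class document frequency for "grain" is 2 while its term frequency would be 3.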
Description
Thesis (M.S.)--University of Hawaii at Manoa, 2008.
Includes bibliographical references (leaves 56-58).
vii, 58 leaves, bound 29 cm
Related To
Theses for the degree of Master of Science (University of Hawaii at Manoa). Electrical Engineering; no. 4288
Rights
All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.