Information theoretic clustering of astrobiology documents
dc.contributor.author | Miller, Lisa Jane | |
dc.date.accessioned | 2016-02-19T22:15:51Z | |
dc.date.available | 2016-02-19T22:15:51Z | |
dc.date.issued | 2012-08 | |
dc.description | M.S. University of Hawaii at Manoa 2012. | |
dc.description | Includes bibliographical references. | |
dc.description.abstract | Astrobiology is a new and highly interdisciplinary field encompassing research from a range of disciplines including astrophysics, biology, chemistry, and geology. The AIRFrame project has been funded by NASA as part of an attempt to connect astrobiology research and researchers across disciplinary and institutional boundaries. One of the major tasks in building the AIRFrame system is to identify the major topics of research in astrobiology across disciplines and how existing work fits into these topics. While there are two astrobiology-specific scholarly journals, most researchers choose to publish in journals of their own discipline. In this work, an unsupervised learning method was applied to a corpus of astrobiology-related journal articles originating from a variety of disciplines, with the goal of discovering common themes and topics. Unsupervised learning, or clustering, discovers groupings within a dataset without the aid of labeled data samples. The Information Bottleneck method [43] was employed for this project because it has been shown to be one of the most accurate and robust methods of clustering unlabeled multi-dimensional data such as text [40, 4]. Within this same framework, it is also possible to determine the maximum number of meaningful clusters that can be resolved in a finite dataset [41]. This work was the first application of this method to document clustering. Additionally, a new related algorithm was developed for preprocessing data. This new method was evaluated on its ability to indicate which words from the document data are best for use in clustering. These methods were combined to produce a dataset grouped by common topics present in 479 abstracts and full-text articles listed as publications in the NASA Astrobiology Institute 2009 Annual Report. The resulting clusters revealed several themes present in the data and groups of documents that are strongly connected on many levels through different numbers of clusters. | |
dc.identifier.uri | http://hdl.handle.net/10125/100993 | |
dc.language.iso | eng | |
dc.publisher | [Honolulu] : [University of Hawaii at Manoa], [August 2012] | |
dc.relation | Theses for the degree of Master of Science (University of Hawaii at Manoa). Computer Science. | |
dc.subject | clustering | |
dc.subject | information theoretic clustering | |
dc.title | Information theoretic clustering of astrobiology documents | |
dc.type | Thesis | |
dc.type.dcmi | Text | |
Files
Original bundle
Name | Size | Format | Description
Miller_Lisa_r.pdf | 2.09 MB | Adobe Portable Document Format | Version for non-UH users. Copying/Printing is not permitted
Miller_Lisa_uh.pdf | 2.22 MB | Adobe Portable Document Format | Version for UH users