Book contents
- Frontmatter
- Dedication
- Contents
- List of Figures
- List of Tables
- Preface
- Acknowledgments
- 1 Beginning with Machine Learning
- 2 Introduction to Data Mining
- 3 Beginning with Weka and R Language
- 4 Data Preprocessing
- 5 Classification
- 6 Implementing Classification in Weka and R
- 7 Cluster Analysis
- 8 Implementing Clustering with Weka and R
- 9 Association Mining
- 10 Implementing Association Mining with Weka and R
- 11 Web Mining and Search Engines
- 12 Data Warehouse
- 13 Data Warehouse Schema
- 14 Online Analytical Processing
- 15 Big Data and NoSQL
- Index
- Colour Plates
8 - Implementing Clustering with Weka and R
Published online by Cambridge University Press: 26 April 2019
Summary
Chapter Objectives
✓ To apply the K-means algorithm in Weka and R language
✓ To interpret the results of clustering
✓ To identify the optimum number of clusters
✓ To apply classification to unlabeled data by using clustering as an intermediate step
Introduction
As discussed earlier, unlabeled data can be analyzed by clustering, the task of grouping a set of objects into classes of similar objects.
In this chapter, we will apply clustering to Fisher's Iris dataset. We will use clustering algorithms to group flower samples into clusters with similar flower dimensions. These clusters then become candidate groupings of the flower samples into species. We will implement a simple k-means algorithm to cluster numerical attributes with the help of Weka and R.
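The k-means idea mentioned above can be sketched in a few lines. The chapter itself works in Weka and R; the plain-Python version below is only an illustrative sketch, not the book's implementation. The function name `kmeans`, the choice of starting centroids, and the six sample rows (sepal length, sepal width, petal length, petal width, in the style of the Iris data) are assumptions made for this example.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        # (Euclidean distance over all numeric attributes).
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # centroids stopped moving, so the clustering has converged
        centroids = new_centroids
    return assignments, centroids


# Illustrative rows only: (sepal length, sepal width, petal length, petal width).
samples = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.4, 3.2, 4.5, 1.5),
    (6.3, 3.3, 6.0, 2.5),
    (5.8, 2.7, 5.1, 1.9),
]

# Assumption: seed the three clusters with the 1st, 3rd, and 5th rows.
labels, centers = kmeans(samples, [samples[0], samples[2], samples[4]])
print(labels)
```

Fixed starting centroids keep the sketch deterministic. Tools such as Weka and R typically choose starting centroids at random, so repeated runs there can produce different cluster numbering or different clusters.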
In the case of classification, we know the attributes and classes of instances. For example, the flower dimensions and classes were already known to us for the Iris dataset. Our goal was to predict the class of an unknown sample as shown in Figure 8.1.
Earlier, we used the Weka J48 classification algorithm to build a decision tree on Fisher's Iris dataset from samples of known class, which helped in predicting the class of unknown samples. The attributes used were the flower's Sepal length and width and its Petal length and width. Using this tree and the flower dimensions, we can identify an unknown Iris as one of three species: Setosa, Versicolor, or Virginica.
In clustering, we know the attributes of the instances, but we don't know the classes. For example, we know the flower dimensions for samples of the Iris dataset, but we don't know what classes exist, as shown in Figure 8.2. Therefore, our goal is to group instances into clusters with similar attributes or dimensions and then identify the class.
In this chapter, we will learn what happens if we don't know what classes the samples belong to, how many classes there are, or even what defines a class. Since Fisher's Iris dataset is already labeled, we will first make it unlabeled by removing the class attribute, i.e., the species column. Then, we will apply clustering algorithms to cluster this data on the basis of its input attributes, i.e., Sepal length, Sepal width, Petal length, and Petal width.
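The step of making the dataset unlabeled can be sketched as follows. This is a minimal plain-Python illustration rather than the Weka or R procedure the chapter uses; the column names, the helper `drop_class`, and the three sample rows are assumptions introduced for the example.

```python
# Illustrative labeled rows; the column names are hypothetical.
labeled = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4,
     "petal_width": 0.2, "species": "setosa"},
    {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7,
     "petal_width": 1.4, "species": "versicolor"},
    {"sepal_length": 6.3, "sepal_width": 3.3, "petal_length": 6.0,
     "petal_width": 2.5, "species": "virginica"},
]

INPUT_ATTRIBUTES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def drop_class(rows, class_attribute="species"):
    """Return copies of the rows without the class attribute."""
    return [
        {name: value for name, value in row.items() if name != class_attribute}
        for row in rows
    ]


unlabeled = drop_class(labeled)

# Numeric feature vectors in a fixed attribute order, ready for a clustering routine.
feature_vectors = [tuple(row[a] for a in INPUT_ATTRIBUTES) for row in unlabeled]
print(feature_vectors)
```

Copying the rows instead of deleting the column in place leaves the original labels available, which is useful later if the clusters are to be compared against the known species.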
- Type: Chapter
- Information: Data Mining and Data Warehousing: Principles and Practical Techniques, pp. 206-228. Publisher: Cambridge University Press. Print publication year: 2019