Label Encode Algorithm

Label encoding is a common preprocessing technique in machine learning and data science that converts categorical data into numerical values so that learning algorithms can process it. Categorical data usually holds text-based information such as names, labels, or categories, which most machine learning models cannot consume directly because they operate on numbers. Label encoding assigns a unique integer to each category within a feature, turning the categorical column into a numeric one that can be used as model input.

The process consists of finding the unique values in the categorical feature and mapping each one to an integer. For example, if a dataset contains a column named "Color" with three unique values, "Red," "Blue," and "Green," label encoding could assign them 0, 1, and 2, respectively. Note that the integers imply an order and a distance between categories (here, Red < Blue < Green) that may not exist in the original data, which can lead to biased or incorrect predictions, particularly in models that treat the feature as a magnitude, such as linear models. In such cases, One-Hot Encoding or other encoding techniques can be used instead to represent the categories without an artificial order.
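As a minimal sketch of the "Color" example, the base R snippet below label encodes a small hypothetical vector and, for comparison, builds a one-hot representation of the same values. The vector `colors` and the names `color_levels`, `color_codes`, and `one_hot` are illustrative assumptions and are not part of the diamonds example that follows.

```r
# Hypothetical "Color" column (illustrative data, not from the diamonds dataset)
colors <- c("Red", "Blue", "Green", "Blue", "Red")

# Fix the category order explicitly so the codes are predictable;
# by default factor() would sort the levels alphabetically
color_levels <- c("Red", "Blue", "Green")
color_factor <- factor(colors, levels = color_levels)

# Label encoding: as.integer() on a factor returns 1-based level indices,
# so subtract 1 to get 0-based codes (Red = 0, Blue = 1, Green = 2)
color_codes <- as.integer(color_factor) - 1L
print(data.frame(color = colors, code = color_codes))

# One-hot encoding for comparison: one 0/1 indicator column per category;
# the "- 1" drops the intercept so every level gets its own column
one_hot <- model.matrix(~ color_factor - 1)
print(one_hot)
```

The label-encoded column implies Red < Blue < Green, while the one-hot matrix treats the three categories as unordered, at the cost of one column per category.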
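library(tidyverse)   # provides the diamonds dataset (via ggplot2)
library(data.table)  # provides the := syntax used for label encoding below

#Divide data into train and test in roughly 70% and 30%
ind <- sample(2, nrow(diamonds), replace = TRUE, prob = c(0.7, 0.3))
train.set <- diamonds[ind == 1, ]
test.set <- diamonds[ind == 2, ]

#Combine the datasets using rbind (built-in function) and convert to a data.table
combi <- as.data.table(rbind(train.set, test.set))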

##Label Encoding
combi[, cut_num := ifelse(cut == "Fair",0,
                                   ifelse(cut == "Good",1,
                                   ifelse(cut == "Very Good",2,
                                   ifelse(cut == "Premium",3,4))))]
combi[, color_num := ifelse(color == "D",0,
                                   ifelse(color == "E",1,
                                   ifelse(color == "F",2,
                                   ifelse(color == "G",3,
                                   ifelse(color == "H",4,
                                   ifelse(color == "I",5,6))))))]

# The "clarity" column is not label encoded here because it contains more categories.
# The more categories a label-encoded column has, the more the model's performance may suffer.

#Removing categorical variables after label encoding
combi[, c("color", "cut") := NULL]
                                                     
#Divide data back into the original train and test sets
#(rbind stacked the train rows first, followed by the test rows)
n.train <- nrow(train.set)
train.set <- combi[seq_len(n.train), ]
test.set <- combi[(n.train + 1):nrow(combi), ]
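
The nested ifelse() calls above become hard to read as the number of categories grows. One possible alternative, sketched here with base R only and a small hypothetical vector standing in for the cut column, looks up each value's position in an explicit vector of levels with match(). The names `cut_levels`, `cut_values`, and `cut_num` are illustrative assumptions.

```r
# Category order to encode, from lowest to highest quality
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Hypothetical sample of values standing in for the "cut" column
cut_values <- c("Ideal", "Fair", "Premium", "Good", "Very Good", "Ideal")

# match() returns the 1-based position of each value in cut_levels;
# subtracting 1 gives the same 0-based codes as the nested ifelse() version
cut_num <- match(cut_values, cut_levels) - 1L
print(cut_num)
```

Unlike the nested ifelse() version, where any unexpected value silently falls into the last code, match() returns NA for values missing from `cut_levels`, which makes typos or unseen categories easier to spot.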
