XGBoost Algorithm
The XGBoost algorithm, which stands for Extreme Gradient Boosting, is a widely popular and efficient machine learning technique used for regression, classification, and ranking tasks. It is an advanced implementation of the gradient boosting algorithm, which builds an ensemble of weak learners, typically decision trees, in a sequential manner. The main idea behind gradient boosting is to iteratively add new models to the ensemble, with each new model aiming to correct the errors made by the previous models. This is achieved by fitting the new models on the residuals, i.e., the differences between the true values and the predictions made by the existing ensemble.
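To make the residual-fitting idea concrete, here is a minimal hand-rolled sketch of two boosting rounds on the diamonds data, using rpart trees purely for illustration (the predictors, tree depth, and learning rate below are arbitrary choices, not part of XGBoost itself):
library(rpart)
library(tidyverse)
# Round 1: fit a shallow tree to the target itself
f1 <- rpart(price ~ carat + depth, data = diamonds,
            control = rpart.control(maxdepth = 2))
pred <- predict(f1, diamonds)
# Round 2: fit the next tree to the residuals of the current ensemble
stage2 <- mutate(diamonds, resid = price - pred)
f2 <- rpart(resid ~ carat + depth, data = stage2,
            control = rpart.control(maxdepth = 2))
# The ensemble prediction is the running sum of the stages,
# damped by a learning rate (eta)
eta <- 0.3
pred <- pred + eta * predict(f2, stage2)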
XGBoost offers several improvements over the traditional gradient boosting algorithm in terms of performance, scalability, and computational efficiency. One of the key features of XGBoost is the use of a more regularized model formalization to control overfitting, which results in better predictive accuracy. Additionally, XGBoost employs parallel processing and column block techniques to accelerate tree construction, making it highly scalable and suitable for large-scale datasets. It can also handle missing data and sparsity, allowing it to work effectively with sparse datasets. Furthermore, XGBoost provides built-in support for cross-validation, early stopping, and various evaluation metrics, enabling users to monitor the model's performance and prevent overfitting during the training process.
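As a small illustration of these knobs (the dataset and parameter values here are arbitrary, chosen only for demonstration), xgboost's built-in cross-validation can be combined with L1/L2 regularization and early stopping:
library(xgboost)
# Toy regression problem: predict mpg from the other mtcars columns
dtrain <- xgb.DMatrix(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
params <- list(objective = "reg:squarederror",
               eta = 0.1,    # learning rate
               max_depth = 4,
               lambda = 1,   # L2 penalty on leaf weights
               alpha = 0.5,  # L1 penalty on leaf weights
               nthread = 2)  # parallel tree construction
# 5-fold cross-validation, stopping when the held-out RMSE stops improving
cv <- xgb.cv(params = params, data = dtrain, nrounds = 200, nfold = 5,
             early_stopping_rounds = 20, verbose = 0)
cv$best_iteration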
library(tidyverse)
library(xgboost)
# Split diamonds into a 70/30 train/test partition (seed value is an arbitrary choice, set for reproducibility)
set.seed(1234)
ind <- sample(2, nrow(diamonds), replace = TRUE, prob = c(0.7, 0.3))
train.set <- diamonds[ind == 1, ]
test.set <- diamonds[ind == 2, ]
# One-hot encode cut, color, and clarity and keep the numeric columns
# (named train.features/test.features to avoid masking the xgb.train() function)
train.features <- bind_cols(select_if(train.set, is.numeric),
                            model.matrix(~ cut - 1, train.set) %>% as_tibble(),
                            model.matrix(~ color - 1, train.set) %>% as_tibble(),
                            model.matrix(~ clarity - 1, train.set) %>% as_tibble())
test.features <- bind_cols(select_if(test.set, is.numeric),
                           model.matrix(~ cut - 1, test.set) %>% as_tibble(),
                           model.matrix(~ color - 1, test.set) %>% as_tibble(),
                           model.matrix(~ clarity - 1, test.set) %>% as_tibble())
# Wrap the feature matrices and labels in xgboost's DMatrix format
xgboost.train <- xgb.DMatrix(data = as.matrix(select(train.features, -price)), label = train.features$price)
xgboost.test <- xgb.DMatrix(data = as.matrix(select(test.features, -price)), label = test.features$price)
# gamma sets the minimum loss reduction required to make a split (regularization);
# nthread parallelizes tree construction
param <- list(eval_metric = "rmse", gamma = 1, max_depth = 6, nthread = 3)
# Train for up to 500 rounds, stopping early if the test RMSE does not improve for 60 rounds
xg.model <- xgb.train(data = xgboost.train, params = param,
                      watchlist = list(test = xgboost.test),
                      nrounds = 500, early_stopping_rounds = 60, print_every_n = 30)
xg.predict <- predict(xg.model, xgboost.test)
# Root mean squared error on the test set (the formula computes RMSE, not MSE)
rmse.xgb <- sqrt(mean((test.set$price - xg.predict)^2))
# Plot the residuals against observation index
plot(test.set$price - xg.predict)