Creditworthiness of Customers

Introduction

This study investigates the creditworthiness of credit card customers in Taiwan. A primary goal of any lending institution is to assess and manage both the customer and the credit: ensuring that credit is available, evaluating the related risks, optimizing returns, and, most importantly, addressing customers’ payment defaults. Most lending institutions aim to keep full control of the risks associated with providing loans, and incorporating independent third-party evaluations of clients’ creditworthiness has helped mitigate the risk of credit default. The difficulty of assessing credit risk depends on a number of variables, such as an individual’s financial situation, the state of the economy, and anticipated market volatility. Traditional approaches to credit risk assessment have been used for a long time, but they may not be well equipped to handle the large and intricate data available in the digital age.

This project aims to create and validate a credit risk assessment mechanism using the Support Vector Machine (SVM). This learning technique can improve banks’ ability to detect credit-related difficulties by making effective use of the ample data available on individuals. The objective is to reduce the probability of loan defaults by enabling banks to make informed, tactical lending decisions.

Furthermore, implementing such a credit risk assessment system could provide multiple benefits to financial institutions. It is expected to improve credit risk evaluation, thereby reducing the bank’s non-performing loan ratio and increasing overall profit margins. The analysis is conducted in SAS Enterprise Miner, and the results are presented as tables and graphs. The dataset is the Default of Credit Card Clients Dataset, retrieved from Kaggle (kaggle.com). It consists of 25 variables and 30,000 observations: a single binary target, nine nominal inputs, 14 interval inputs, and the client ID. The descriptive statistics for the interval variables are shown in Table 1.

Table 1 Descriptive Statistics Summary

Variable	Mean	Standard Deviation	Non Missing	Missing	Minimum	Median	Maximum	Skewness	Kurtosis
AGE	35.4855	9.217904	30000	0	21	34	79	0.732246	0.044303
BILL_AMT1	51223.33	73635.86	30000	0	-165580	22381	964511	2.663861	9.806289
BILL_AMT2	49179.08	71173.77	30000	0	-69777	21197	983931	2.705221	10.30295
BILL_AMT3	47013.15	69349.39	30000	0	-157264	20088	1664089	3.08783	19.78326
BILL_AMT4	43262.95	64332.86	30000	0	-170000	19052	891586	2.821965	11.30932
BILL_AMT5	40311.4	60797.16	30000	0	-81334	18104	927171	2.87638	12.30588
BILL_AMT6	38871.76	59554.11	30000	0	-339603	17068	961664	2.846645	12.27071
LIMIT_BAL	167484.3	129747.7	30000	0	10000	140000	1000000	0.992867	0.536263
PAY_AMT1	5663.581	16563.28	30000	0	0	2100	873552	14.66836	415.2547
PAY_AMT2	5921.164	23040.87	30000	0	0	2009	1684259	30.45382	1641.632
PAY_AMT3	5225.682	17606.96	30000	0	0	1800	896040	17.21664	564.3112
PAY_AMT4	4826.077	15666.16	30000	0	0	1500	621000	12.90498	277.3338
PAY_AMT5	4799.388	15278.31	30000	0	0	1500	426529	11.12742	180.0639
PAY_AMT6	5215.503	17777.47	30000	0	0	1500	528666	10.64073	167.1614
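
To make the skewness and kurtosis columns of Table 1 concrete, the sketch below computes the same kind of summary in plain Python. It is only an illustration: the `describe` helper and the sample payments are hypothetical, and it uses the simple moment-based formulas, whereas SAS reports bias-corrected estimates that can differ slightly. Kurtosis is given as excess kurtosis, so a normal distribution scores zero.

```python
import statistics

def describe(values):
    """Return summary statistics similar to the columns of Table 1."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments in population form (dividing by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "std": statistics.stdev(values),    # sample standard deviation
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "skewness": m3 / m2 ** 1.5,         # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2 - 3.0,     # excess kurtosis, 0 for a normal
    }

# Hypothetical payment amounts: mostly small, with one very large value.
payments = [0, 1500, 1800, 2000, 2100, 2500, 3000, 90000]
summary = describe(payments)
print(summary["median"], summary["skewness"] > 0)
```

A single very large payment is enough to pull the mean far above the median and produce a large positive skewness, which is the pattern seen in the PAY_AMT rows.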

The binary target variable was default payment next month. In the dataset, 77.88% of the customers did not default on their payment, whereas 22.12% did. The target variable is therefore highly imbalanced. For this reason, once the best SVM model has been determined, a classification cutoff of 22.12% (the proportion of defaulters) will be applied to it so that more of the actual defaulters are identified as true positives.
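
The effect of the imbalance and of the cutoff can be sketched in a few lines of Python. The example assumes that the model yields a predicted default probability for each customer; the `classify` helper and the five probabilities are hypothetical and serve only to show how a lower cutoff flags more customers as defaulters.

```python
def classify(probabilities, cutoff):
    """Label a customer as a defaulter (1) when P(default) >= cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]

# Class shares reported in the text.
share_no_default = 0.7788
share_default = 0.2212

# A model that always predicts "no default" is right for every non-defaulter,
# so its accuracy equals the majority-class share.
baseline_accuracy = share_no_default

# Hypothetical predicted default probabilities for five customers.
probs = [0.05, 0.18, 0.25, 0.40, 0.75]

labels_default_cutoff = classify(probs, 0.5)          # conventional 50% cutoff
labels_prior_cutoff = classify(probs, share_default)  # cutoff at the default share

print(baseline_accuracy)
print(labels_default_cutoff)
print(labels_prior_cutoff)
```

With the conventional 50% cutoff only one of the five hypothetical customers is flagged, whereas the 22.12% cutoff flags three.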

Figure 1 Bar chart of the target variable

A chi-square test was conducted to determine whether there was an association between the target variable and each nominal input variable. Based on the resulting p-values, the association is statistically significant for every nominal input, as each p-value is less than the assumed alpha of 0.05.

Table 2 Chi-square summary

Input	Chi-Square	Degrees of freedom	Prob
PAY_0	5365.9650	10	<.0001
PAY_2	3474.4668	10	<.0001
PAY_3	2622.4621	10	<.0001
PAY_4	2341.4699	10	<.0001
PAY_5	2197.6949	9	<.0001
PAY_6	1886.8353	9	<.0001
EDUCATION	163.2166	6	<.0001
SEX	47.9054	1	<.0001
MARRIAGE	35.6624	3	<.0001
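
As a reminder of how the statistics in Table 2 are obtained, the sketch below computes the Pearson chi-square statistic and its degrees of freedom from a contingency table. The `chi_square` helper and the counts are hypothetical and are not taken from the dataset.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows; each row holds the observed counts for one level
    of the nominal input across the target classes.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of input and target.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows are two levels of an input, columns are target 0 and 1.
observed = [[90, 10],
            [60, 40]]
stat, dof = chi_square(observed)
print(stat, dof)
```

For an input with k levels and a binary target the degrees of freedom are k - 1, which is why SEX has one degree of freedom in Table 2 and why the 10 degrees of freedom for PAY_0 correspond to 11 observed levels.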

Data Cleaning and Preparation

The support vector machine (SVM) was implemented on the binary target variable default payment next month, where 0 means the customer did not default and 1 means the customer defaulted on the payment for the next month. SVMs tend to perform well when the number of features is large relative to the sample size, hence the need for data partitioning. The first preparation step was to split the data into stratified training, validation, and testing sets. Bearing in mind that the target variable was imbalanced, we used a split ratio of 4:3:3: 40% of the data (12,000 observations) was allocated to the training set, and 30% (9,000 observations each) to the validation set and the testing set. The goal was to keep the training sample as small as practical. The next step was to standardize the continuous features to zero mean and unit variance, which places them on a common scale. Standardization alone does not change skewness or kurtosis, so the heavy tails seen in Table 1 remain. All the features were also converted to numeric variables. Finally, we checked for missing data and found none, so no imputation was needed.
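
The partitioning and scaling steps can be sketched as follows. This is a minimal plain-Python illustration, not the SAS Enterprise Miner nodes used in the study: the `stratified_split` and `standardize` helpers, the seed, and the synthetic label vector are assumptions made for the example.

```python
import random
import statistics

def stratified_split(labels, fractions=(0.4, 0.3, 0.3), seed=0):
    """Split row indices into train/validation/test while keeping class shares."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]   # remainder goes to the test set
    return train, valid, test

def standardize(train_values, values):
    """Z-score `values` using the mean and standard deviation of the training set."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return [(v - mean) / std for v in values]

# Synthetic target with the same 22.12% default share on 30,000 rows.
labels = [1] * 6636 + [0] * 23364
train, valid, test = stratified_split(labels)
train_default_share = sum(labels[i] for i in train) / len(train)
print(len(train), len(valid), len(test), round(train_default_share, 4))
```

Splitting within each class separately is what keeps the default share close to 22.12% in every partition. The scaling statistics are taken from the training set only, so that no information from the validation or testing data leaks into the model.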

Predictive Models Developed

This section presents the results for the different SVM models, which varied by kernel and optimization method: the linear and polynomial kernels with the interior point method, and the polynomial, sigmoid, and radial basis function kernels with the active set method. Two scenarios were developed to test the performance of each model: the first without data preparation and the second with data preparation. The results were as follows.
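
For reference, the four kernels named above can be written as short functions. The sketch uses a common textbook parameterization with `gamma`, `coef0`, and `degree`; these parameter names and default values are assumptions for illustration and may not match the settings of the HP SVM node.

```python
import math

def dot(x, z):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    # Depends only on the squared Euclidean distance between the two points.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)

# Two hypothetical standardized feature vectors.
x = [1.0, 2.0]
z = [0.5, -1.0]
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, x))
```

The linear kernel gives a linear decision boundary, while the polynomial, radial basis, and sigmoid kernels let the SVM separate the classes with a non-linear boundary in the original feature space.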

Table 3 Model summary without preparation

Model Description	Validation Misclassification Rate	Train Average Squared Error	Train Misclassification Rate	Validation Average Squared Error
HP SVM Polynomial Interior	0.18133	0.16269	0.17926	0.16484
HP SVM Polynomial 4 active	0.18156	0.20465	0.17326	0.20708
HP SVM Radial Active	0.18700	0.14835	0.18852	0.14858
HP SVM Linear Interior	0.19056	0.15511	0.19285	0.15487
HP SVM Sigmoid 1	0.29467	0.20380	0.29044	0.20691
HP SVM Sigmoid 2	0.30056	0.21052	0.30144	0.21322

Model performance is compared on the validation misclassification rate. The model chosen in this scenario is the HP SVM polynomial interior-point model with a polynomial degree of three, which had the lowest validation misclassification rate, 0.18133; that is, 18.13% of the validation observations were incorrectly predicted. This figure must be read against the heavy imbalance in the target variable: a model that predicts no default for every customer would already have a misclassification rate of about 22.12%. Hence the cutoff node is introduced for the best model.

Table 4 Model summary with preparation

Model Description	Validation Misclassification Rate	Train Average Squared Error	Train Misclassification Rate	Validation Average Squared Error
HP SVM interior point-polynomial	0.18311	0.18250	0.17135	0.18727
HP SVM Active set Radial	0.19167	0.14421	0.18660	0.14740
HP SVM active polynomial p=4	0.19344	0.19605	0.14310	0.20595
HP SVM interior point -linear	0.20056	0.15400	0.19910	0.15448
HP SVM Active Sigmoid 1	0.22056	0.18362	0.22152	0.18452
HP SVM Sigmoid 2	0.22122	0.22106	0.22110	0.22118

As with the summary above, we compare the validation misclassification rates of the SVM models with data preparation applied. The HP SVM polynomial interior-point model with a polynomial degree of three again had the lowest validation misclassification rate, 0.18311.

Results

This section looks at the training accuracy of each model in each scenario, together with the confusion matrix for the best models. The results were as follows.

Model Description	Training accuracies without preparation	Training accuracies with preparation
HP SVM interior point-polynomial	0.8207	0.8287
HP SVM Active set Radial	0.8115	0.8134
HP SVM active polynomial p=4	0.8267	0.8569
HP SVM interior point -linear	0.8072	0.8009
HP SVM Active Sigmoid 1	0.7096	0.7785
HP SVM Sigmoid 2	0.6986	0.7789

Based on the model summary for each scenario, the HP SVM active-set polynomial model with a polynomial degree of 4, trained with data preparation, had the highest training accuracy, 0.8569; that is, 85.69% of the training observations were predicted correctly.
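
The accuracies in the table above follow directly from the train misclassification rates in Table 3 and Table 4, since accuracy is one minus the misclassification rate. The short check below reproduces them; the dictionary keys are shortened model names chosen for the example, and decimal arithmetic is used so that the rounding to four places is exact.

```python
from decimal import Decimal, ROUND_HALF_UP

# Train misclassification rates copied from Table 3 (without preparation)
# and Table 4 (with preparation).
train_misclassification = {
    "interior point-polynomial": ("0.17926", "0.17135"),
    "active set radial": ("0.18852", "0.18660"),
    "active polynomial p=4": ("0.17326", "0.14310"),
    "interior point-linear": ("0.19285", "0.19910"),
    "active sigmoid 1": ("0.29044", "0.22152"),
    "sigmoid 2": ("0.30144", "0.22110"),
}

def accuracy(misclassification):
    """Accuracy = 1 - misclassification rate, rounded to four decimals."""
    value = Decimal("1") - Decimal(misclassification)
    return value.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)

accuracies = {
    name: (accuracy(without), accuracy(with_prep))
    for name, (without, with_prep) in train_misclassification.items()
}

# Model with the highest training accuracy in the "with preparation" scenario.
best = max(accuracies, key=lambda name: accuracies[name][1])
print(best, accuracies[best])
```

Because these figures are computed on the training set, they are best read alongside the validation misclassification rates in Table 3 and Table 4.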

table of statistics

Conclusion

In conclusion, based on the model results, the best SVM model was the HP SVM active polynomial p=4 with data preparation: it had the highest training accuracy, 0.8569 (the fraction of correct predictions), and a validation misclassification rate of 0.19344. The classification report also shows 81.24% for the 1:1 cell (actual default, predicted default) and 83.45% for the 0:0 cell (actual non-default, predicted non-default). This indicates more balanced performance across the two classes of the target variable.
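
The quantities discussed above all derive from the four cells of a confusion matrix. The sketch below shows the arithmetic; the `confusion_metrics` helper and the counts are hypothetical and are not the counts from this study.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Summary rates from confusion-matrix counts (1 = default, 0 = no default)."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # share of actual defaulters predicted as 1
        "specificity": tn / (tn + fp),   # share of non-defaulters predicted as 0
    }

# Hypothetical counts for 1,000 customers; not taken from the study.
metrics = confusion_metrics(tp=160, fn=60, fp=100, tn=680)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting the rate for each class separately, rather than accuracy alone, shows whether a model performs well on the minority class of defaulters and not only on the majority class.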

Appendix

Table 5 Data description

Feature	Description
ID	The ID of each client
LIMIT_BAL	Amount of given credit in NT dollars (includes individual and family/supplementary credit)
SEX	Gender (1=male, 2=female)
EDUCATION	(1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
MARRIAGE	Marital status (1=married, 2=single, 3=others)
AGE	Age in years
PAY_0	Repayment status in September 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
PAY_2	Repayment status in August 2005 (scale same as above)
PAY_3	Repayment status in July 2005 (scale same as above)
PAY_4	Repayment status in June 2005 (scale same as above)
PAY_5	Repayment status in May 2005 (scale same as above)
PAY_6	Repayment status in April 2005 (scale same as above)
BILL_AMT1	Amount of bill statement in September 2005 (NT dollar)
BILL_AMT2	Amount of bill statement in August 2005 (NT dollar)
BILL_AMT3	Amount of bill statement in July 2005 (NT dollar)
BILL_AMT4	Amount of bill statement in June 2005 (NT dollar)
BILL_AMT5	Amount of bill statement in May 2005 (NT dollar)
BILL_AMT6	Amount of bill statement in April 2005 (NT dollar)
PAY_AMT1	Amount of previous payment in September 2005 (NT dollars)
PAY_AMT2	Amount of previous payment in August 2005 (NT dollars)
PAY_AMT3	Amount of previous payment in July 2005 (NT dollars)
PAY_AMT4	Amount of previous payment in June 2005 (NT dollars)
PAY_AMT5	Amount of previous payment in May 2005 (NT dollars)
PAY_AMT6	Amount of previous payment in April 2005 (NT dollars)
default payment next month	Default payment (1=yes, 0=no)
