Data Analysis on Suicide Rates (follow-up analysis after EDA)

4 min read · May 10, 2021

Introduction

For a long time, people around the world have suffered from suicidal thoughts and actions. A while ago, I did an Exploratory Data Analysis on a dataset I found on Kaggle, which includes GDP, suicides, and other data for countries and states. This time, I decided to continue analyzing the data by building models and testing their performance on the data.

Part 1: Exploring/cleaning the data

Since I already explored the data during the EDA, I simply carried over the essential code that sums up the process.

I did some cleaning and testing with the libraries I imported.

Part 2: Using Models

Linear Regression Model

The next step after exploring the data is to split it into training and testing sets. Since I want to analyze the relationship between suicides and GDP, I used GDP as my X and the suicide rate as my y. After splitting them, I used the Linear Regression model to analyze the relationship.
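As a rough sketch of what this step can look like, the example below splits a feature and a target into training and testing sets and fits a linear regression with scikit-learn. The data is synthetic, and the names `gdp_per_capita` and `suicide_rate`, the 25% test size, and the random seed are all assumptions made for illustration, not the values used in the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the real dataset is not reproduced here.
rng = np.random.default_rng(0)
gdp_per_capita = rng.uniform(500, 60_000, size=200)       # hypothetical feature
suicide_rate = rng.normal(loc=12.9, scale=5.0, size=200)  # hypothetical target, unrelated to GDP

X = gdp_per_capita.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = suicide_rate

# Hold out 25% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the line of best fit on the training rows only.
model = LinearRegression().fit(X_train, y_train)
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
```

A slope near zero together with an intercept near the mean of `y` is what a flat regression line looks like in numbers.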

Scatterplot and Linear Regression on suicides vs. GDP

With the Linear Regression model, I first plotted the data as a scatterplot and then found the line of best fit, or regression line. In the scatterplot, it looks as if the suicide rate decreases as GDP increases. However, the regression line is horizontal, which does not support the impression the scatterplot gives. Next, I found the residuals, which are the distances between each sample and the regression line. After that, I found the coefficient of determination R², which is very close to 0. The Lasso and Ridge regressions also score close to 0. Putting all of this together, I can conclude that this is essentially a constant model: the expected suicide rate is about 12.9, and GDP has close to no impact on the suicide rate.
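The quantities mentioned above (residuals, R², and the Lasso and Ridge scores) can be computed along these lines. This is only a sketch on synthetic data where the target is deliberately independent of the feature; the `alpha=1.0` penalties, the units, and the data itself are assumptions, so the printed numbers are not the article's results.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): y is generated independently of X.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 60.0, size=(300, 1))  # hypothetical GDP per capita, in thousands
y = rng.normal(12.9, 5.0, size=300)        # hypothetical suicide rate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# R^2 on the held-out rows for plain, Ridge (L2), and Lasso (L1) regression.
scores = {}
for name, reg in [
    ("linear", LinearRegression()),
    ("ridge", Ridge(alpha=1.0)),
    ("lasso", Lasso(alpha=1.0)),
]:
    reg.fit(X_train, y_train)
    scores[name] = reg.score(X_test, y_test)

# Residuals: observed value minus the value predicted by the regression line.
linear = LinearRegression().fit(X_train, y_train)
residuals = y_test - linear.predict(X_test)

print(scores)
print("mean absolute residual:", np.abs(residuals).mean())
```

When R² is near 0 for all three models, the fitted line explains almost none of the variation in `y`, which matches the reading of a constant model.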

Logistic Regression Model

A different model that I wanted to try is the Logistic Regression model. With the same X and y, I first made a contingency table of the two variables.
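One possible way to build such a table with pandas is sketched below. How the two quantitative variables were turned into categories is not described here, so this example assumes quartile bins created with `pd.qcut`; the column names and the synthetic data are also hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gdp_per_capita": rng.uniform(500, 60_000, size=400),
    "suicide_rate": rng.gamma(shape=2.0, scale=6.0, size=400),
})

# Bin each quantitative variable into quartiles (assumed binning scheme).
labels = ["Q1", "Q2", "Q3", "Q4"]
df["gdp_level"] = pd.qcut(df["gdp_per_capita"], q=4, labels=labels).astype(str)
df["rate_level"] = pd.qcut(df["suicide_rate"], q=4, labels=labels).astype(str)

# Counts per cell, with row and column totals in the "All" margins.
table = pd.crosstab(df["gdp_level"], df["rate_level"], margins=True)

# Share of each suicide-rate level within every GDP level (rows sum to 1).
row_pct = pd.crosstab(df["gdp_level"], df["rate_level"], normalize="index")

print(table)
print(row_pct.round(2))
```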

After that, I found the row and column sums of the contingency table, along with the percentages for GDP and suicide rate. Then I moved on to building a logistic regression model with the data. The accuracy of the regression is about 0.26, which is not very high.
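A minimal sketch of fitting a logistic regression and measuring its accuracy is shown below. It assumes the target has already been turned into four class labels (for example, binned suicide rates), uses synthetic data, and adds a `StandardScaler` step for numerical stability; none of these choices are confirmed by the original analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): one feature, four unrelated class labels.
rng = np.random.default_rng(0)
X = rng.uniform(500, 60_000, size=(400, 1))  # hypothetical GDP per capita
y = rng.integers(0, 4, size=400)             # hypothetical binned suicide rate (0..3)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the feature, then fit a multiclass logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Accuracy: fraction of test rows whose predicted class matches the true class.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print("accuracy:", accuracy)
```

With four roughly balanced classes, random guessing would land near 0.25, so an accuracy close to that figure suggests the feature carries little information about the class.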

Properties of the contingency table and logistic regression

After that, I ran a dummy classifier to double-check the accuracy, and it came out the same, which means the logistic regression does no better than a baseline that ignores GDP. Next, I visualized the confusion matrix. However, since I decided to use only two variables, this is not very effective or accurate.
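The baseline check and the confusion matrix can be sketched as follows. The `most_frequent` strategy for the dummy classifier is an assumption (the strategy actually used is not stated), and the data is synthetic as before.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one feature, four unrelated class labels.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 60.0, size=(400, 1))  # hypothetical GDP per capita, in thousands
y = rng.integers(0, 4, size=400)           # hypothetical binned suicide rate (0..3)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Baseline that ignores X and always predicts the most common training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

model_acc = model.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1, 2, 3])

print("model accuracy:", model_acc)
print("baseline accuracy:", baseline_acc)
print(cm)
```

If the model's accuracy is no higher than the baseline's, the confusion matrix usually shows most predictions piled into one or two columns.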

To conclude the logistic regression model, I finished with a classification report and L1 and L2 regularization. Once again, the classification report shows that the predictions are not very accurate. Still, I decided to apply L1 and L2 regularization to prevent overfitting and stabilize the estimates.
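For illustration, a classification report and L1/L2-regularized logistic regressions could be produced along these lines. The `saga` solver, `C=1.0`, the scaling step, and the synthetic data are assumptions chosen so that both penalties run with the same code; the original settings are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): one feature, four unrelated class labels.
rng = np.random.default_rng(0)
X = rng.uniform(500, 60_000, size=(400, 1))
y = rng.integers(0, 4, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit one model per penalty; smaller C would mean stronger regularization.
accuracies = {}
for penalty in ("l1", "l2"):
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty=penalty, solver="saga", C=1.0, max_iter=5000),
    )
    clf.fit(X_train, y_train)
    accuracies[penalty] = clf.score(X_test, y_test)

# Per-class precision, recall, and F1 for the last fitted model (L2).
report = classification_report(y_test, clf.predict(X_test), zero_division=0)
print(report)
print(accuracies)
```

Regularization limits how large the coefficients can grow, so with a single weak feature it is expected to change the accuracy very little.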

Accuracy of the regression and L1/L2 regularization

Conclusion

After carefully building two models with two quantitative variables, I found pros and cons for both regressions. The linear regression gives valuable information about the relationship between the two variables, but unfortunately it can only provide the range of suicide rates that was most frequent. The logistic regression does predict suicide rates, using the percentages to find the predicted value with high probability. However, the accuracy of the predicted rate is low because X has only one variable. It should be more accurate if more variables are included. I personally prefer Logistic Regression because it can be very accurate if I provide more variables, and it gives me distinct predicted values back, while Linear Regression only returns the most frequent y-value. Last but not least, I found that GDP and suicide rate have close to no relationship, which overturns the conclusion I made in the EDA article.