Predicting Genetic Disorders
Year: 2022
Authors: Tommy Huynh & April Vo
1. Problem Introduction
According to a Britannica article on human genetic disease, "10 percent of adults worldwide have genetic defects" (Fridovich, n.d.). In other words, roughly 790 million adults worldwide are affected. On top of that, "about 30 percent of all postnatal infant mortality in developed countries is due to genetic disease" (Fridovich, n.d.). Despite modern advancements in genetic testing, the results still give "limited information...it cannot tell you whether you will have symptoms, how severe a disease might be, or whether a disease will get worse over time" (U.S. National Library of Medicine, 2021). The need for accurate genetic disorder testing therefore remains critical.
1.1 Dataset Background
As a team, we were interested in working with a dataset related to healthcare or the medical field. While searching on Kaggle, we chose "Predict the Genetic Disorder," uploaded in 2021 by a user named Mukund. There were two datasets available for this challenge, one called "Test" and one called "Train." We chose the "Train" dataset for our data analysis, as it had all of the features of the "Test" dataset but also included the variables 'Genetic Disorder' and 'Disorder Subclass'. Both datasets contained medical information on children in the United States with genetic disorders, with variables ranging from test results to demographic information to parental genetic information (Mukund, 2021).
The dataset originally came from a HackerEarth machine learning challenge that was also held in 2021. The challenge website's "About Challenge" section states that "hereditary illnesses are becoming more common due to a lack of understanding about the need for genetic testing. Often kids die as a result of these illnesses, thus genetic testing during pregnancy is critical" (Of genomes and genetics: Hackerearth machine learning challenge, 2021). While these claims were not cited, they line up with the limitations of genetic testing discussed in the introduction.
2. Purpose
Our purpose for this project was to see how accurately we could predict genetic disorders in children using the dataset we chose. We also wanted to determine which of the three algorithms we explored (logistic regression, agglomerative clustering, and random forest) returns the best prediction accuracy. Finally, we wanted to identify which features make the largest impact on the accuracy of the prediction itself.
3. Data Exploration
3.1 General
To start our data exploration, we opened the dataset in Microsoft Excel to understand which features we would be working with and how the data was recorded in each column. We found N/A or missing values in many of the columns, which prompted us to continue exploring to understand what we were working with.
Next, we built a simple logistic regression model with all of the dataset's features, using 'Genetic Disorder' as our target (y). We removed and added features one by one, comparing how each change affected the accuracy score to gauge impact and relevance. The dataset originally had 44 variables, but we decided on the 26 features below:
Then, utilizing Fordham's (2020) method to count the missing values in our dataset by indexed feature, we found that the top three features with the most missing values were: 'Family Name' [8], 'Mother's Age' [10], and 'Father's Age' [11]. This included two features ('Mother's Age' and 'Father's Age') that we had decided were relevant to our algorithm models. The image below is a screenshot of the output of our code utilizing Fordham's method to display missing values.
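The snippet below is only a minimal pandas sketch of how such a per-feature missing-value summary can be produced; the file name train.csv and the DataFrame name are our assumptions, not part of the original code.

```python
# Minimal sketch: count and rank missing values per column (assumed file name).
import pandas as pd

train = pd.read_csv("train.csv")

# Number of missing cells per feature, largest first, plus percentages.
na_counts = train.isna().sum().sort_values(ascending=False)
na_percent = (na_counts / len(train) * 100).round(2)

print(pd.DataFrame({"missing": na_counts, "percent": na_percent}))
```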
We first used the N/A value percentages to help determine which features to drop from our model. We quickly determined that features such as family name, parental consent, and the location of the hospital where the patient was born would probably not be helpful in training the model. Given that the N/A percentages were fairly similar across our features of interest (with the exception of maternal gene, patient age, and inherited from father), we decided to proceed with agglomerative clustering to see if there were any hidden relationships between the features.
3.2 Agglomerative Clustering
At this point, we knew about the missing values within our dataset, but we still wanted to know whether modeling the dataset with or without N/A values would yield better accuracy. We went with agglomerative clustering to further our exploration, in hopes of figuring out the relationships between the features.
From the results of our agglomerative clustering, we found that the distances between features were greater without N/A values than with N/A values included in our dataset. In agglomerative clustering, the closer the distance between variables/clusters, the more strongly they are linked. Based on this, we can see that when our dataset includes N/A values, there is a stronger correlation between the variables/clusters than when it does not.
When we remove N/A values, we do not just remove the cells in which the N/A values appear; we remove all of the data about that individual (the whole row). Because of that, we hypothesize that including N/A values provides a stronger correlation because the original dataset remains intact. Additionally, the closer distance between variables/clusters also means that the N/A values affect the similarity of the features themselves. One could assume that the N/A values are more similar to each other than we had thought, and that isolating them from the dataset would skew our accuracy results and strip the dataset of impactful data points.
4. Methods Used
4.1 Logistic Regression
Logistic regression is used to predict a categorical dependent variable from a given set of independent variables. We knew that the accuracy score would be quite low, but logistic regression is the most commonly used algorithm for classification problems, so we wanted to use it as a baseline for comparison. We used the standard 70/30 train/test split for our model.
Logistic regression gets its name from the logit model used in statistics to estimate the probability of an event occurring given insight from previous events. Since logistic regression can only use numeric features, we had to convert some features to binary (dummy) variables in order to run the model properly.
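A rough sketch of this baseline is shown below, assuming the training data is in train.csv and using scikit-learn; the exact preprocessing and feature list in our notebook differed, so treat this as illustrative rather than our actual code.

```python
# Sketch of the logistic regression baseline: dummy-encode categoricals,
# drop rows with missing values, then fit on a 70/30 split.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")  # assumed file name

y = train["Genetic Disorder"]
X = train.drop(columns=["Genetic Disorder", "Disorder Subclass"])

# Logistic regression needs numeric inputs, so convert categories to dummies.
X = pd.get_dummies(X, drop_first=True)

# No-imputation baseline: drop any row that still contains an N/A value.
mask = X.notna().all(axis=1) & y.notna()
X, y = X[mask], y[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)  # the standard 70/30 split

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```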
4.2 Agglomerative Clustering
Agglomerative clustering uses a bottom-up approach: each data point is initially treated as its own cluster, and the most similar clusters are then iteratively merged. We wanted to group the data to see whether there were any similarities among the features used to determine the genetic disorder, both with and without N/A values in the dataset. We used Ward's method to create our dendrogram and assigned four clusters.
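The sketch below shows one way this step could look with SciPy and scikit-learn, assuming a numeric feature matrix X (for example, the dummy-encoded features from the logistic regression sketch); the random placeholder data is ours, not the dataset.

```python
# Sketch: Ward-linkage dendrogram plus a flat assignment into 4 clusters.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(200, 10)  # placeholder for the encoded feature matrix

# Dendrogram to inspect at what distance clusters merge.
plt.figure(figsize=(10, 4))
dendrogram(linkage(X, method="ward"))
plt.title("Ward linkage dendrogram")
plt.show()

# Flat clustering with the four clusters used in our exploration.
labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)
```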
4.3 Random Forest
Random forest constructs a multitude of decision trees at training time and outputs the class that is the mode of the individual trees' classes for classification tasks, or their mean prediction for regression tasks. The algorithm is highly accurate, can handle a large number of input variables, and copes well with datasets containing missing values. This made it very attractive for our dataset.
Random forest models don't search for the single most important feature when performing splits the way other models do. Instead, each split selects the best feature from a random subset of features. This makes the selection more diverse, which usually results in a more accurate model.
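A minimal sketch of this model, reusing the X_train/X_test split from the logistic regression sketch (an assumption on our part), might look like the following; the feature-importance printout is illustrative and was not part of our reported results.

```python
# Sketch: random forest on the N/A-dropped data, plus top feature importances.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))

# Feature importances give a rough view of which inputs drive the splits.
for name, score in sorted(zip(X_train.columns, rf.feature_importances_),
                          key=lambda t: t[1], reverse=True)[:5]:
    print(name, round(score, 3))
```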
4.4 Random Forest with Simple Imputer
SimpleImputer is a tool that can be used when a dataset contains many N/A values. It is valuable because most real-life datasets contain N/A values for a wide variety of reasons, and handling them properly lets us create more accurate machine learning models.
In our case, when we created dummy variables and replaced all the N/A values with zero, it significantly skewed our data and our accuracy score was 47 percent, compared to 53 percent when we simply removed all rows with N/A values. SimpleImputer calculates the mean of each column, ignoring the N/A values, and inserts that number into each N/A cell. You can also direct SimpleImputer to use the median, the most frequent value, or a constant, depending on the characteristics of your dataset. The image below was taken from Ponraj's article on DevSkrol (2022).
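A tiny self-contained example of mean imputation with scikit-learn's SimpleImputer (the toy numbers are ours) is shown below.

```python
# Sketch: fill each N/A cell with its column mean.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[35.0, 1.0],
              [np.nan, 0.0],
              [41.0, np.nan]])

imputer = SimpleImputer(strategy="mean")  # "median", "most_frequent", "constant" also work
print(imputer.fit_transform(X))
# Row 2 receives the first column's mean (38.0); row 3 receives the second column's mean (0.5).
```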
5. Results
5.1 Information Synthesized From Models
We created three different random forest models to compare the accuracy of predicting the genetic disorder based on blood tests, symptoms, and genetic testing. When using the white blood cell count and blood test results, the accuracy score was 87 percent. When using symptoms 1-5, the accuracy score was 90 percent. And when using genes in mother's side, inherited from father, maternal gene, and paternal gene, the accuracy score was 99 percent.
All of the models were created using random forest with SimpleImputer replacing each N/A value with its column mean. For evaluation, we used repeated stratified K-fold cross-validation with 10 splits, repeated 3 times.
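The evaluation set-up could be sketched as the pipeline below, where X and y stand for one of the feature subsets and the 'Genetic Disorder' labels; the variable names and the n_estimators value are assumptions, not our exact configuration.

```python
# Sketch: SimpleImputer feeding a random forest, scored with repeated
# stratified 10-fold cross-validation (3 repeats).
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(SimpleImputer(strategy="mean"),
                         RandomForestClassifier(n_estimators=100, random_state=42))

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
print("mean accuracy:", scores.mean().round(3))
```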
5.2 Algorithm Accuracy Results
For logistic regression with N/A values, our accuracy score was 47 percent. To note, when we refer to an algorithm "with N/A values," we mean that N/A values in y are replaced with a category called 'Unknown Genetic Disorder'. For logistic regression with all rows containing N/A values removed, our accuracy score was 53 percent. For random forest with rows containing N/A values removed, our accuracy score was 56 percent. For random forest using SimpleImputer to fill N/A values with column means, our accuracy score was 93 percent.
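For clarity, the "with N/A values" handling of the target amounts to something like the one-liner below, assuming the same pandas DataFrame as in the earlier sketches.

```python
# Keep rows with a missing label by treating "missing" as its own category.
train["Genetic Disorder"] = train["Genetic Disorder"].fillna("Unknown Genetic Disorder")
```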
6. Conclusion
In conclusion, while the accuracy score for random forest with SimpleImputer was higher than the accuracy scores of the other algorithms we used, we recognize that the scores from our logistic regression and plain random forest are more reflective of realistic, real-life results.
This is because the less accurate results are comparable to the CDC's (Centers for Disease Control and Prevention) statements on genetic testing: "Despite the many scientific advances in genetics, researchers have only identified a small fraction of the genetic component of most diseases. Therefore, genetic tests for many diseases are developed on the basis of limited scientific information and may not yet provide valid or useful results to individuals who are tested" (2019). Additionally, "In 2008, the former Secretary's Advisory Committee on Genetics, Health and Society of the U.S. Department of Health and Human Services released a report identifying gaps in the regulation, oversight, and usefulness of genetic testing" (Centers for Disease Control and Prevention, 2019).
Our results provided some insight into the difficulties of accurately predicting genetic disorders and showed that, without machine learning to impute or replace missing values, the predictive accuracy hovers around 50 percent. Once more, the call for more accurate testing and diagnosis of genetic disorders remains as relevant as ever.