Classifying Exoplanet Habitability using Machine Learning

Photo by NASA on Unsplash

Watching all the amazing videos recently of the Perseverance mission on Mars, along with NASA’s goal of sending humans to the red planet in the near future, got me thinking: what other planets in the galaxy might we one day be able to visit? The explosion of exoplanet discoveries from the Kepler mission data over the last few years seemed like a good opportunity to attempt to answer just that question.

Massive increase in exoplanet detections in the last few years.

To do this we will use machine learning techniques and the PHL Exoplanets Catalog.

Exploring and Cleaning our Data:

The first thing we will notice is that the class we want to predict, Habitability, is highly imbalanced. We will deal with this a little later.
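A quick way to see this, assuming the raw label column is P_HABITABLE (the column we collapse during cleaning below):

#Check the distribution of the habitability label
print(df['P_HABITABLE'].value_counts(normalize=True))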

Highly imbalanced target class.

As is typical for astronomical datasets, we have quite a lot of missing values. So we will handle that right away.

#Drop columns with missing values greater than 40%
pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.4].index
df.drop(missing_features, axis=1, inplace=True)

Next we will drop constant columns, high-cardinality categorical columns, highly skewed columns, and columns that have little impact on our models. Our data also has several classifications of habitability, conservative and optimistic, which we will combine into a single binary column, since we are primarily concerned with whether a planet is habitable or not.
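To make the dropping step concrete, here is a rough sketch of how the constant and high-cardinality drops might look; the 50-unique-value cutoff is illustrative, not the exact threshold used on this dataset.

#Illustrative sketch: drop constant columns and high-cardinality categorical columns
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
high_card_cols = [c for c in df.select_dtypes(include='object').columns
                  if df[c].nunique() > 50]
df.drop(columns=constant_cols + high_card_cols, inplace=True)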

#Function to turn conservatively and optimistically habitable into a single binary column
def binary(habitable):
    if habitable == 0:
        return 0
    else:
        return 1

#Create new column from binary function and drop the previous column to prevent leakage
df['HABITABLE'] = df['P_HABITABLE'].apply(binary)
df.drop(columns=['P_HABITABLE'], inplace=True)

After doing some cleaning, we are left with a data frame of 11 features, and things are looking better.

Splitting our data:

We will now move on to preparing and splitting our data. First we will deal with any remaining missing values, then take care of outliers by transforming any highly skewed columns, and finally split our dataset into train, validation, and test sets. To deal with the missing values we will use an Iterative Imputer, which imputes each missing value by modeling that feature as a function of the other features.

#Run Iterative Imputer to replace all missing values
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer

imp_mean = IterativeImputer(random_state=42)
X[:] = imp_mean.fit_transform(X)

Now with no missing values remaining we can take care of our skewed data.

#Run function to transform skewed columns
skew_autotransform(X, exclude = ['S_METALLICITY'], threshold = 24, exp=True)
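Note that skew_autotransform is a helper function rather than part of scikit-learn. If it isn't available, a minimal sketch of the same idea, using scipy's skew with an illustrative shift-and-log transform and the same threshold and exclusion, could look like this:

#Minimal sketch: log-transform numeric columns whose skew exceeds the threshold
import numpy as np
from scipy.stats import skew

for col in X.select_dtypes(include='number').columns:
    if col == 'S_METALLICITY':
        continue  #excluded from the transform, as above
    if abs(skew(X[col])) > 24:
        #Shift so all values are positive before taking the log
        X[col] = np.log1p(X[col] - X[col].min())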

Next we can split our data.

#Split data into training, validation, and test sets
from sklearn.model_selection import train_test_split

# Define ratios, w.r.t. whole dataset.
ratio_train = 0.6
ratio_val = 0.2
ratio_test = 0.2
# Produce test split.
X_remaining, X_test, y_remaining, y_test = train_test_split(X, y, test_size=ratio_test)
# Adjust val ratio, w.r.t. remaining dataset.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining
# Produce train and val splits.
X_train, X_val, y_train, y_val = train_test_split(X_remaining, y_remaining, test_size=ratio_val_adjusted)
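Since habitable planets are so rare, it can also be worth passing stratify= so each split keeps roughly the same class proportions; this is an optional variant, not what was used above:

#Optional: stratified version of the test split so the rare habitable
#class is represented proportionally (not used in the results below)
X_remaining, X_test, y_remaining, y_test = train_test_split(
    X, y, test_size=ratio_test, stratify=y, random_state=42)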

Having split our data, we can now address the imbalanced classes we saw at the beginning. To deal with this we will oversample our minority class by creating new synthetic members using SMOTE (Synthetic Minority Oversampling Technique). Make sure you only oversample from your training data; you can read more about the reasons for this here.

#Oversample our minority class using SMOTE (only on the training data)
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=12)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

Let's check our target class balance now.
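One quick way to check, assuming y_train_res is an array or Series of 0/1 labels:

#Count each class in the resampled training labels
import pandas as pd
print(pd.Series(y_train_res).value_counts())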

Target class is now balanced.

So now with a balanced training set we are ready to look at some machine learning models.

Building our models:

We want to take a look at both a linear model and a tree-based model. For our linear model we will use a logistic regression classifier and for our tree-based model we will use a random forest classifier.

So let's fit our logistic regression model first.

#Run logistic regression model on our training data
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model_logr = make_pipeline(StandardScaler(),
                           LogisticRegression(random_state=42))
model_logr.fit(X_train_res, y_train_res);

Now let's do the same for our random forest model:

#Run random forest classifier on our training data
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=50,
                                  n_jobs=-1,
                                  random_state=42)
model_rf.fit(X_train_res, y_train_res);

Analyzing Performance:

Let's take a look at the classification reports for our two models to check our performance metrics.
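The reports below can be generated with scikit-learn's classification_report; here I am assuming the models are evaluated on the validation split (X_val, y_val):

#Classification reports for both models on the validation set
from sklearn.metrics import classification_report

for name, model in [('Logistic Regression', model_logr), ('Random Forest', model_rf)]:
    print(name, ':')
    print(classification_report(y_val, model.predict(X_val)))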

Logistic Regression :
precision recall f1-score support

0 1.00 0.99 1.00 795
1 0.68 1.00 0.81 15

accuracy 0.99 810
macro avg 0.84 1.00 0.90 810
weighted avg 0.99 0.99 0.99 810

Random Forest :
precision recall f1-score support

0 0.99 1.00 0.99 795
1 1.00 0.33 0.50 15

accuracy 0.99 810
macro avg 0.99 0.67 0.75 810
weighted avg 0.99 0.99 0.98 810

Looking at this, our linear model has much better recall on the habitable class. Let's take a look at their respective confusion matrices to visualize this.
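The matrices below can be plotted with scikit-learn's ConfusionMatrixDisplay (available in recent versions); again I am assuming evaluation on the validation split:

#Plot a confusion matrix for each model on the validation set
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

for name, model in [('Logistic Regression', model_logr), ('Random Forest', model_rf)]:
    disp = ConfusionMatrixDisplay.from_estimator(model, X_val, y_val)
    disp.ax_.set_title(f'{name} Confusion Matrix')
plt.show()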

Logistic Regression Confusion Matrix

Our logistic regression model correctly predicted all of the habitable planets. It did, however, have some issues with false positives (Type I errors).

Random Forest Confusion Matrix

Our random forest model really struggled with this task: it misclassified 10 of the habitable planets (Type II errors) and correctly identified only 5.

Let us now take a look at the partial dependence of two of the features that had an impact on our model.
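The plot below can be produced with scikit-learn's PartialDependenceDisplay. The column names here (P_ESI for Earth Similarity Index and P_RADIUS for planet radius) are my assumption about the PHL catalog naming, so substitute whichever feature names survived the cleaning step:

#Partial dependence of the logistic regression pipeline on two features
#(column names are assumed; requires X_train_res to still be a DataFrame)
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(model_logr, X_train_res,
                                        features=['P_ESI', 'P_RADIUS'])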

Partial Dependence Plot

We can see here that as the Earth Similarity Index increases and the planet radius decreases, a planet is more likely to be classified as habitable, which makes sense and helps increase confidence in our model.

Closing Thoughts:

  • Some features that had a significant impact on the model were the planet’s Earth Similarity Index, whether it resided in the optimistic habitable zone, its planetary flux (how much light energy is radiated onto it from its star), and the planet’s radius.
  • For this dataset the Logistic Regression model appeared to perform the best.
  • The overwhelming majority of planets are classified as not habitable. This is primarily due to the limitations of our current exoplanet detection methods and technology: most planets are now discovered by the transit method, which is best at detecting very large planets orbiting very close to their stars.
  • In the future, with the development of new telescopes and methodologies, I would be interested in revisiting this topic and investigating how these models perform on newly discovered planets.