Comprehensive Modeling Analysis: Predicting Airbnb New User Bookings

10 minute read

The full code can be found here.

Home rental services such as Airbnb allow people to find affordable temporary housing on short notice. The efficiency of these services can be increased by making decisions informed by knowledge about how future customers will use them. By being able to determine how new customers will use home rental services for the first time, proactive changes can be made to the service and the surrounding operations. This results in a better product for the customers and improved operations for stakeholders. Data science techniques allow for the prediction of future user activities at the cost of relatively few resources. The abundance of data generated by short-term home rental services such as Airbnb, allows for plenty of opportunities for data science to be used to improve the operations. In order to utilize this readily usable data, I am interested in predicting booking destinations of first time Airbnb users.

Airbnb’s platform allows customers to efficiently find homes that fit their needs through the use of features such as filtered searches and wishlists. After finding desirable lodging, customers input payment information and book the locations. Throughout this process data is generated through information provided by the users and details saved about the user’s web sessions.

Several types of classification models were used to predict the first booking destination countries of Airbnb users. Models focused on using demographic and web session data to assign booking destination countries to individual users. The different models were compared by their ability to accurately and efficiently predict booking country. A final model was chosen from those that were evaluated and deemed suitable to be scaled for production. The process used to build this product is as follows:


Initiation and Data Preprocessing

  • Import Packages and Files
  • Data Cleaning
  • Feature Engineering

Data Analysis and Exploration

  • Viewing the Distribution of the Different Classes
  • Checking the Correlatedness of Different Variables
  • Interpreting Descriptive Statistics

Preparing The Data For Modeling

  • Class Balancing and Mathematical Transformations
  • Feature Selection and Reduction
  • Establishing Variables for Training and Testing

Supervised Learning Models

  1. Using Unaltered Features
  2. Using Features Selected using F Values
  3. Using Features Selected using Chi Squared
  4. Using Recurrent Neural Network
  5. Using Features Selected using Random Forests

Analysis and Conclusion

  • Final Model Selection and Analysis
  • Conclusion and Discussion


The purpose of this study is to: analyze user demographics and behavioral patterns via data visualization, identify key indicators of future behavior by utilizing statistical inference and machine learning algorithms, create a machine learning model that can predict the booking destination of users that haven’t made a booking yet, and use the aforementioned model to predict the future behavior of users that are currently on the platform.

Initiation and Data Preprocessing

The data used for this model was released by Airbnb in the following datasets: train_users.csv and sessions.csv. The train user dataset contains information about specific users and how they first accessed the service. This dataset has over 200000 records with each one containing information about a unique user. The train user dataset contains the outcome variable, country destination. The sessions dataset contains information about actions performed during the user’s time on the Airbnb platform. This dataset contains over 10 million records with each one reflecting a specific action performed on the platform. Multiple records on the sessions dataset can refer a single user’s actions. A dataframe was created for modeling containing features generated from both datasets.

Data Exploration and Analysis

The Airbnb data contains information about activity on the platform that occurred between January 2010 and June 2014. The train dataset originally contained columns related to the time specific activities first occured, information about the how the user accessed Airbnb, and demographics of the users. Additional features were engineered to include the amount of time users spent doing specific activities on the platform.

png

Above are plots representing the gender frequencies and age frequencies of the data, respectively. There are more females than males included in the data but the disparity between the two groups is not strong. The distribution of the ages is centered around the 30’s and skewed to the right. This likely reflects an age demographic with both the energy and resources to travel. Since both of these variables contain a large amount of null values imputation will be needed to make use of this data prior to modeling.

png

Above are plots of languages used on the platform and country destination, respectively. Both plots are scaled logarithmically for readability because the dominant classes far outnumbered the rest. With regard to the language counts, English outnumbered the other classes greatly; and with regard to the country destinations, ‘NDF’ and the US outnumbered the other classes. NDF represents the class of users that haven’t booked a destination yet. Since this would be the class that the model being built would be predicting, this was not be used as an outcome variable to train the model.

png

png

The above plots refer to the frequencies the accounts were first created and frequencies bookings were made over the course of multiple years. The frequency of accounts being created showed an increasing trajectory over the course of five years, likely reflecting an increase in the userbase of Airbnb. There is a sharp drop in bookings made around the time that the data was collected. This doesn’t reflect a drop in the usage of the platform, but rather people who use the platform but haven’t made a booking yet (these customers would have the country destination label ‘NDF’).

png

The above plots reflect the frequencies use of the platform created across different timespans. There’s a notable drop of accounts created between June and July. Since summer is a season that is popular for travel, people are less likely to need accounts during this time (because they would’ve presumably made accounts earlier than when they would travel using the service). There is a slight drop of accounts made over the weekends as well. Initial activity drops during the day which is likely the result of people having less time during the average US workday to be on Airbnb.

png

The scatterplot matrix gives information about the relationship between specific engineered features.

png

The heatmap is meant to gauge the correlatedness of engineered features. There is much correlatedness among these features, which is expected since they reflect amounts of time spent on the platform.

secs_elapsed view_search_results_count p3_count wishlist_content_update_count user_profile_count change_trip_characteristics_count similar_listings_count user_social_connections_count update_listing_count listing_reviews_count ... dashboard_secs user_wishlists_secs header_userpic_secs message_thread_secs edit_profile_secs message_post_secs contact_host_secs unavailable_dates_secs confirm_email_link_secs create_user_secs
count 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000 ... 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000 73815.000
mean 1514234.959 12.366 8.265 6.093 3.634 4.278 4.172 1.940 2.105 1.128 ... 14830.135 29595.309 4515.531 51134.615 16820.605 74545.834 21243.895 7023.293 94984.878 100.872
std 1913191.475 28.446 20.404 13.310 14.324 9.715 10.018 9.129 8.068 5.697 ... 112658.964 177757.650 45779.443 306877.869 93129.557 268323.753 104188.163 53648.290 259453.037 7967.568
min 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
25% 256920.500 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 143.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
50% 872862.000 2.000 2.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.000 695.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
75% 2043487.500 13.000 9.000 6.000 2.000 4.000 4.000 0.000 1.000 0.000 ... 1268.000 0.000 2356.000 0.000 0.000 1583.000 427.000 0.000 44589.000 0.000
max 38221363.000 987.000 1131.000 524.000 454.000 308.000 325.000 280.000 249.000 198.000 ... 4632786.000 6713259.000 2288760.000 11242289.000 3415239.000 13260005.000 3258808.000 2310246.000 2902601.000 1723704.000

8 rows × 39 columns

Data will be normalized prior to modeling to account for the vast difference in scale of the variables.

Preparing The Data For Modeling

To prepare the data for modeling, values were imputed, the data was resampled to address the class imbalance in the outcome, and multiple forms of feature reduction were implemented. This resulted in four sets of variables: One reflecting all of the unaltered features of the dataset, one reflecting features with the highest F values, one reflecting features with the highest chi squared values, and encoded for deep learning.

%%time

## Train Test Split the Four Sets of Feature and Outcome Variables

# Unaltered Features
x_train, x_test, y_train, y_test = train_test_split(pca_components, y, test_size=0.2, random_state=20)

# F-Value
fx_train, fx_test, fy_train, fy_test = train_test_split(f_pca_components, y, test_size=0.2, random_state=22)

# Chi-squared
xx_train, xx_test, xy_train, xy_test = train_test_split(x_pca_components, y, test_size=0.2, random_state=21)

# Deep Learning
X_train, X_test, Y_train, Y_test = train_test_split(np.asarray(f_pca_components), np.asarray(Y), test_size=0.2, random_state=20)

Modeling the Data using Features Chosen with Random Forests

Random Forest

Test Set Evaluation

accuracy score:
0.978989898989899

confusion matrix:
[[1782    0    0    0    0    0    0    0    0    0    0]
 [   0 1702    0    0    0    0    0    0    0    0    0]
 [   0    0 1840    0    0    0    0    0    0    0    0]
 [   0    0    0 1858    0    0    0    0    0    0    0]
 [   0    0    0    0 1747    0    0    0    0   12    2]
 [   0    0    0    0    0 1749    0    0    0    0    0]
 [   0    0    0    0    0    0 1801    0    0    0    0]
 [   0    0    0    0    0    0    0 1840    0    0    0]
 [   0    0    0    0    0    0    0    0 1808    0    0]
 [   1    4    2   10   50    8    9    2    0 1630  140]
 [   0    3    0    1    6    0    1    0    0  165 1627]]

              precision    recall  f1-score   support

          AU       1.00      1.00      1.00      1782
          CA       1.00      1.00      1.00      1702
          DE       1.00      1.00      1.00      1840
          ES       0.99      1.00      1.00      1858
          FR       0.97      0.99      0.98      1761
          GB       1.00      1.00      1.00      1749
          IT       0.99      1.00      1.00      1801
          NL       1.00      1.00      1.00      1840
          PT       1.00      1.00      1.00      1808
          US       0.90      0.88      0.89      1856
       other       0.92      0.90      0.91      1803

    accuracy                           0.98     19800
   macro avg       0.98      0.98      0.98     19800
weighted avg       0.98      0.98      0.98     19800

The model that relied on features selected using random forests had the best accuracy in the study (with a 97% test set accuracy). An advantage of using all of the useful features is that as much meaningful variance was captured by the models as possible. A downside to this type of feature preparation that did stand out is the lack of efficiency. Since feature reduction didn’t take place with these models, their performance suffered and they had the longest runtimes. However, for reporting purposes, accuracy was prioritized. This method of feature selection also risks including features with variance that doesn’t aid in the predictive power of the models. However, this potential disadvantage didn’t hamper the model’s ability to perform well because many of the features that would noticeably have a negative effect on the models were already left out.

Identifying the Most Important Features

png

The above graph displays factors that contributed the most to the highly accurate predictions of booking destination. The age and the amount of time spent on the platform were the most useful factors in determining where a person was likely to book their next stay.

Generating Predictions of Current Users’ Booking Destinations

png

By using the most successful modeling type on the data of the users that haven’t booked housing with the platform, future usage of the Airbnb platform can be predicted and analyzed. The distribution of the plot above shows that France is the most likely destination for non-US Airbnb users.

Analysis and Conclusion

The random forest model using features chosen based on random forest feature importances was the best model when it came to predicting the first booking destinations, making it the strongest base for a classifier that can scaled for production. It was able to predict the booking destination of users with a 97% accuracy. While other model types had potential to produce more accurate predictions, they had much higher runtimes, making them unfit to run larger amounts of data. By introducing more data to this modeling pipeline, it can be trained to yield even more accurate and consistent results.

This classifier created by pairing the best supervised modeling technique and feature reduction method was built to be both accurate and scalable. Potential improvements in this product includes adding more features and further tuning of the model type. The accuracy of this model will most likely increase as it is trained with more data as well.

Understanding how to better utilize supervised modeling techniques to predict booking destination, will give insight as to how people are using the Airbnb platform and particular habits different types of customers share. This can allow for more direct marketing to specific types of users or changes in the product that better match how the service is used. Through the use of cheap and accessible data, decisions can be made that can result in increased efficiency and revenue for the company.