Classification Analysis: Predicting the Gender of Bikeshare Users
The full code can be found here.
Bike-sharing services offer a means for people to quickly access affordable, short-term transportation in urban areas. Increased automation has allowed these services to grow more robust and to generate more data. Such data can be used to make improvements in areas such as product development and marketing. By building a comprehensive picture of the people using bike-share services, changes can be made that directly benefit the users of the service. Such efficient decision-making can generate growth and increase revenue for bike-sharing companies. For this reason, I am interested in predicting the gender of NYC Citi Bike users by creating a model based on the data generated by the service.
The Citi Bike service works by allowing people to pick up bikes at one of hundreds of stations in Manhattan, Brooklyn, Queens and Jersey City. The users can ride for a predetermined amount of time based on the passes or memberships that they purchase. They end their time with the bikes by returning them to any of the affiliated stations.
In this study, several types of supervised learning classification models were used to predict the gender of Citi Bike users. The models focused on factors related to individual uses of the service. The different models were compared using multiple forms of statistical evaluation to better understand their ability to accurately predict gender. The process used to undertake the study is as follows:
Data Exploration and Analysis
- Viewing the Distribution of the Different Classes
- Checking the Correlatedness of Different Variables
- Interpreting Descriptive Statistics
Preparing The Data For Modeling
- Imputing Outliers
- Class Balancing
- Feature Selection
Modeling the Data
- Using All Useful Features
- Using PCA Components
- Using the SelectKBest Function
Data Exploration and Analysis
The Citi Bike Trip Dataset contains information about 735,502 anonymized trips that took place between January 2015 and June 2017. The data was processed to remove trips taken by staff and any trips shorter than 60 seconds. The dataset originally contained columns related to the locations of the stations, the duration of the trips, and the demographics of the users. Additional features were engineered from this information, such as the distance between the stations used in a trip and the age of the user.
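As a rough sketch, the distance and age features might be engineered along the following lines, assuming the trips have been loaded into a pandas DataFrame named `trips` with the columns shown in the sample below (the haversine helper is illustrative, not necessarily the exact method used in the study):

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two sets of coordinates."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# trips: the raw trip DataFrame (see the sample rows below)
trips["start_time"] = pd.to_datetime(trips["start_time"])
trips["distance"] = haversine_km(trips["start_station_latitude"],
                                 trips["start_station_longitude"],
                                 trips["end_station_latitude"],
                                 trips["end_station_longitude"])
trips["age"] = trips["start_time"].dt.year - trips["birth_year"]
```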
| | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | trip_duration | start_time | stop_time | bike_id | user_type | birth_year | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3212 | Christ Hospital | 40.735 | -74.050 | 3207 | Oakland Ave | 40.738 | -74.052 | 376 | 2015-10-01 00:16:26 | 2015-10-01 00:22:42 | 3212 | Subscriber | 1960.000 | 1 |
| 1 | 3207 | Oakland Ave | 40.738 | -74.052 | 3212 | Christ Hospital | 40.735 | -74.050 | 739 | 2015-10-01 00:27:12 | 2015-10-01 00:39:32 | 3207 | Subscriber | 1960.000 | 1 |
| 2 | 3193 | Lincoln Park | 40.725 | -74.078 | 3193 | Lincoln Park | 40.725 | -74.078 | 2714 | 2015-10-01 00:32:46 | 2015-10-01 01:18:01 | 3193 | Subscriber | 1983.000 | 1 |
| 3 | 3199 | Newport Pkwy | 40.729 | -74.032 | 3187 | Warren St | 40.721 | -74.038 | 275 | 2015-10-01 00:34:31 | 2015-10-01 00:39:06 | 3199 | Subscriber | 1975.000 | 1 |
The bar graphs above reference the number of trips that took place over the period during which the data was collected and only represent the classes with the highest frequencies. While the station names aren't going to be useful for modeling due to the high number of classes relative to the number of observations, they are useful for gaining an understanding of how the service is used.
The distribution of distances between the start and end stations centers around 1 km, and the majority of trips lasted less than 10 minutes. It is important to note that while trip duration and distance are correlated, users aren't necessarily riding the bikes during the entire period in which the bikes are checked out.
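Histograms like the ones described here could be reproduced along these lines, assuming the `trips` DataFrame with the engineered distance feature from above:

```python
import matplotlib.pyplot as plt

# Distributions of trip distance (km) and trip duration (minutes)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(trips["distance"], bins=100, range=(0, 5))
axes[0].set(xlabel="distance between stations (km)", ylabel="number of trips")
axes[1].hist(trips["trip_duration"] / 60, bins=100, range=(0, 60))
axes[1].set(xlabel="trip duration (minutes)", ylabel="number of trips")
plt.tight_layout()
plt.show()
```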
Most of the trips took place during the fall, with October seeing disproportionately high use of the Citi Bike service. The frequency of Citi Bike trips was consistent during weekdays and dropped on weekends. With regard to the time of day, most trips took place during the morning and evening. These trends correspond to the average American work week.
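The calendar features behind these trends (and the related columns that appear in the descriptive statistics below) can be derived directly from the start timestamp; a minimal sketch, reusing the `trips` DataFrame from earlier:

```python
# Calendar features extracted from the parsed start_time column
trips["month"] = trips["start_time"].dt.month
trips["start_day_of_week_number"] = trips["start_time"].dt.dayofweek
trips["start_day_of_month"] = trips["start_time"].dt.day
trips["start_day_of_year"] = trips["start_time"].dt.dayofyear
trips["start_hour"] = trips["start_time"].dt.hour

# Trip counts by hour of day, mirroring the daily usage pattern described above
print(trips["start_hour"].value_counts().sort_index())
```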
The vast majority of the service's users are subscribers (annual membership holders) as opposed to customers (single-day or 3-day pass holders). While there is a sharp difference between the rates of the two user types, the model may be able to capture subtle differences between them that could aid in prediction.
1 525608
2 150426
0 59468
Name: gender, dtype: int64
The 1 class refers to trips taken by a male and the 2 class refers to trips taken by a female. The gender is unknown for observations with a gender class of 0. The sharp imbalance in the gender class will need to be remedied prior to modeling. The 0 class observations will need to be discarded, but even without them there are enough observations to draw a sufficiently large sample from the original dataset.
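One way the 0 class could be dropped and the remaining classes balanced is sketched below; downsampling the majority class is shown as one option, since the exact resampling strategy isn't spelled out here:

```python
import pandas as pd

# Discard trips with unknown gender (class 0)
known = trips[trips["gender"] != 0]

# Downsample the male class (1) to match the size of the female class (2)
n_female = (known["gender"] == 2).sum()
males = known[known["gender"] == 1].sample(n=n_female, random_state=0)
females = known[known["gender"] == 2]

balanced = pd.concat([males, females]).sample(frac=1, random_state=0)  # shuffle rows
print(balanced["gender"].value_counts())
```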
This scatterplot matrix was made to analyze the relationships between the continuous variables. The top-right half of the matrix gives scatterplots of each combination of variables, while the bottom-left half gives the best-fit lines corresponding to the relationship between each pair. The diagonals consist of KDE plots, which show the shape of each variable's distribution.
Aside from the variables directly related to the locations of the stations, most of the features of this dataset have little to no correlation.
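A scatterplot matrix and correlation check along these lines could produce the view described above; seaborn's pairplot is used here as an approximation of the matrix (and a sample keeps the plot manageable), so treat it as a sketch rather than the exact plotting code:

```python
import seaborn as sns

continuous = ["trip_duration", "distance", "age",
              "start_station_latitude", "start_station_longitude",
              "end_station_latitude", "end_station_longitude"]

# Pairwise scatterplots with KDE plots on the diagonal, drawn from a sample of trips
sns.pairplot(trips[continuous].sample(5000, random_state=0), diag_kind="kde")

# Correlation matrix quantifying the relationships shown in the scatterplot matrix
print(trips[continuous].corr().round(2))
```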
| | start_station_latitude | start_station_longitude | end_station_latitude | end_station_longitude | trip_duration | birth_year | gender | distance | month | start_day_of_week_number | start_day_of_month | start_day_of_year | start_hour | start_timestamp | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 | 735502.000 |
| mean | 40.723 | -74.046 | 40.722 | -74.045 | 934.312 | 1979.544 | 1.124 | 1.148 | 7.264 | 2.779 | 15.748 | 206.134 | 13.599 | 1467895633.510 | 36.408 |
| std | 0.008 | 0.011 | 0.095 | 0.173 | 40638.782 | 9.371 | 0.520 | 19.717 | 3.372 | 1.909 | 8.775 | 102.695 | 5.226 | 13579046.873 | 9.382 |
| min | 40.693 | -74.097 | 0.000 | -74.097 | 61.000 | 1900.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 1442847196.000 | 16.000 |
| 25% | 40.718 | -74.051 | 40.717 | -74.050 | 246.000 | 1975.000 | 1.000 | 0.621 | 4.000 | 1.000 | 8.000 | 114.000 | 9.000 | 1458289118.500 | 30.000 |
| 50% | 40.722 | -74.044 | 40.721 | -74.044 | 383.000 | 1982.000 | 1.000 | 0.883 | 8.000 | 3.000 | 16.000 | 231.000 | 14.000 | 1470155248.000 | 34.000 |
| 75% | 40.728 | -74.038 | 40.727 | -74.036 | 652.000 | 1986.000 | 1.000 | 1.467 | 10.000 | 4.000 | 23.000 | 292.000 | 18.000 | 1477541380.000 | 41.000 |
| max | 40.753 | -74.032 | 40.801 | 0.000 | 20260211.000 | 2000.000 | 2.000 | 8449.122 | 12.000 | 6.000 | 31.000 | 366.000 | 23.000 | 1491002734.000 | 116.000 |
These descriptive statistics supplement the scatterplot matrix by quantifying the distributions of the variables. The distributions of the variables related to location and time are relatively consistent because they have a limited range and fixed frequencies (e.g. the max start_day_of_month is 31 because that's the maximum number of days in a month). Potential outliers do exist in the variables that aren't limited to a specific range, such as trip_duration, distance, and age.
These boxplots show that there are outliers in the features that are not fixed to a specific range. Take note of the logarithmic scale used for trip_duration and distance; any gaps in the points shown for those two plots represent a large difference between the actual values.
Preparing The Data For Modeling
To prepare the data for modeling, outliers were imputed, the data was resampled to address the class imbalance in the outcome, and three forms of feature selection were implemented. This resulted in three sets of variables: one reflecting all of the useful features of the dataset, one reflecting PCA components, and one reflecting features chosen by the SelectKBest function.
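The outlier imputation could be handled in several ways; as one option (the exact thresholds aren't given here, so the 1st and 99th percentiles are assumptions), the unbounded features can be capped at extreme percentiles rather than dropping rows:

```python
# Cap (winsorize) the features that are not bounded to a fixed range
for col in ["trip_duration", "distance", "age"]:
    lower, upper = trips[col].quantile([0.01, 0.99])
    trips[col] = trips[col].clip(lower=lower, upper=upper)
```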
Features that were useful for modeling were isolated into a final dataframe that would be used for training and testing, alongside versions of the dataframe that underwent different forms of feature selection.
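A minimal sketch of how the three feature sets could be built, assuming `x` is the final dataframe of useful features and `y` is the gender outcome (the 15 PCA components follow the count mentioned later in the study, while the value of k for SelectKBest is an assumption):

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Standardize before PCA so large-valued features don't dominate the components
scaled = StandardScaler().fit_transform(x)
pca_components = PCA(n_components=15, random_state=0).fit_transform(scaled)

# Keep the k features with the highest ANOVA F-scores against the gender outcome
k_predictors = SelectKBest(score_func=f_classif, k=10).fit_transform(x, y)
```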
%%time
## Train Test Split the Three Sets of Feature and Outcome Variables
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=20)
kx_train, kx_test, ky_train, ky_test = train_test_split(k_predictors, y, test_size=0.2, random_state=21)
px_train, px_test, py_train, py_test = train_test_split(pca_components, y, test_size=0.2, random_state=22)
Training and testing sets were generated for each of the three sets of variables to be used in modeling. The x and y variables reflect all of the useful features of the data. The px and py variables reflect the PCA components of the initial features. The kx and ky variables reflect the features chosen by SelectKBest.
Modeling the Data using all Useful Features
Random Forest
accuracy score:
0.9426571428571429
cross validation:
[0.811942 0.82264286 0.82185714 0.81464286 0.8173441 ]
cross validation with AUC:
[0.90020833 0.90318671 0.89839968 0.89670665 0.90036234]
confusion matrix:
[[32719 2417]
[ 1597 33267]]
precision recall f1-score support
1 0.95 0.93 0.94 35136
2 0.93 0.95 0.94 34864
micro avg 0.94 0.94 0.94 70000
macro avg 0.94 0.94 0.94 70000
weighted avg 0.94 0.94 0.94 70000
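Metrics like those above could be produced along the following lines, assuming the split variables from the previous step (the hyperparameters shown are illustrative, not the exact ones used in the study):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(x_train, y_train)
preds = rf.predict(x_test)

print("accuracy score:", accuracy_score(y_test, preds))
print("cross validation:", cross_val_score(rf, x_train, y_train, cv=5))

# The roc_auc scorer expects {0, 1} labels, so recode 1 (male) / 2 (female) for AUC
print("cross validation with AUC:",
      cross_val_score(rf, x_train, (y_train == 2).astype(int), cv=5, scoring="roc_auc"))

print("confusion matrix:\n", confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
```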
The random forest model had the best performance out of all of the models run with all of the dataset's useful features. Cross validation showed few signs of overfitting with this model. The strength of this model when it comes to making predictions using this data comes from its ability to separate data points using binary splits, which is especially useful for isolating latitude and longitude features. The decision tree had similar accuracy scores but lower cross validation scores. Naive Bayes performed poorly with this data, likely due to the naturally high correlatedness of the variables related to station locations. The support vector classifier didn't have the best performance, but there is room for this model to improve as computational resources become more available. KNN and logistic regression performed poorly (although better than Naive Bayes), likely because these models are unable to capture nuances between different groups of observations and the interactions between the variables.
In general, the models that relied on the dataset's most useful features had the best performance in the study. An advantage of using all of the useful features is that as much meaningful variance as possible was captured by the models. This method of feature selection also risks including features with variance that doesn't aid in the predictive power of the models. However, this potential disadvantage didn't hamper the models' ability to perform well because many of the features that would have had a noticeably negative effect on the models were already left out.
Modeling the Data using PCA Components
Random Forest
accuracy score:
0.9099
cross validation:
[0.7205 0.71721429 0.72242857 0.718 0.71907143]
cross validation with AUC:
[0.79895943 0.79188841 0.79536728 0.79454343 0.79382654]
confusion matrix:
[[31428 3557]
[ 2750 32265]]
precision recall f1-score support
1 0.92 0.90 0.91 34985
2 0.90 0.92 0.91 35015
micro avg 0.91 0.91 0.91 70000
macro avg 0.91 0.91 0.91 70000
weighted avg 0.91 0.91 0.91 70000
The random forest model had the best performance out of all of the models run with the dataset's 15 best PCA components. Cross validation did show a few signs of overfitting with this model, with scores significantly lower than the accuracy score. This is noticeably lower performance than the random forest model that used all of the available features. In general, the models that used the PCA components had lower scores than the models that used all of the dataset's useful features and the models that used features chosen by SelectKBest. However, KNN and Naive Bayes did perform better with PCA components. The increase in performance of these two model types can be attributed to the increased significance of the remaining components, allowing the algorithms to better classify the observations.
Using a limited number of PCA components from the dataset likely removed some variance that was important to the predictive accuracy of the models. Using PCA components does have the advantage of reducing computational complexity and runtimes but this did not make up for the drop in accuracy of the better performing model types.
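One way to quantify how much variance was given up is to inspect the cumulative explained variance of the retained components; a short sketch, reusing the scaled feature matrix from the feature-selection step:

```python
from sklearn.decomposition import PCA

# Cumulative share of the original variance captured by the 15 retained components
pca = PCA(n_components=15, random_state=0).fit(scaled)
print(pca.explained_variance_ratio_.cumsum())
```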
Modeling the Data using Features Chosen with the SelectKBest Function
Random Forest
accuracy score:
0.9357
cross validation:
[0.82772659 0.82151275 0.8145 0.81962997 0.81605829]
cross validation with AUC:
[0.89991833 0.9033447 0.90034001 0.89884141 0.90075802]
confusion matrix:
[[32054 2789]
[ 1712 33445]]
precision recall f1-score support
1 0.95 0.92 0.93 34843
2 0.92 0.95 0.94 35157
micro avg 0.94 0.94 0.94 70000
macro avg 0.94 0.94 0.94 70000
weighted avg 0.94 0.94 0.94 70000
CPU times: user 25min 54s, sys: 33.1 s, total: 26min 27s
Wall time: 26min 36s
The random forest model had the best performance out of all of the models run with features chosen using the SelectKBest function. Cross validation showed few signs of overfitting with this model. The random forest model using SelectKBest still had lower accuracy than the model that used all of the available features in the dataset, implying that useful features were removed prior to modeling. Since SelectKBest removed only a small number of features in this case, this means that the majority of the dataset's useful features had meaningful variance. Logistic regression's performance didn't differ much regardless of feature selection, implying that the model type doesn't suit the bikeshare data.
When it comes to comparing the full-featured models and the models that used SelectKBest, most of the full-featured models slightly outperformed their counterparts. Using SelectKBest allows computational complexity to be reduced without abstracting the individual features (unlike PCA). However, in a situation where most of the features had a positive effect on the models' ability to predict gender, this form of feature selection wasn't particularly helpful.
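To see how few features SelectKBest actually dropped, the selector's support mask and scores can be inspected; a sketch, using the same assumed value of k as earlier:

```python
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10).fit(x, y)
kept = x.columns[selector.get_support()]

# Selected feature names with their ANOVA F-scores
print(dict(zip(kept, selector.scores_[selector.get_support()].round(1))))
```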
Analysis and Conclusion
The random forest model using all of the dataset's useful features was the best model when it came to predicting the gender of Citi Bike users. It is also important to note that the support vector classifier could prove valuable for making predictions if larger amounts of computational resources become available, due to its ability to draw distinct boundaries among the classes; a quality that is especially useful for location-based data.
This study established the best supervised modeling technique and feature selection pairing for predicting the gender of Citi Bike users. The next step in using this data to discern the demographics of the users based on their usage of the service would be to collect more types of data and to go more in depth into which features have a greater impact on the likelihood of a user being a particular gender. This includes collecting data from different sources and engineering more features. Afterwards, the study can be expanded to include other demographic classes, such as age, as outcomes.
Understanding how to better utilize supervised modeling techniques to predict gender will give insight into what kinds of people are using the bike-share service and the particular habits different types of customers share. This can allow for more direct marketing to specific types of users, or changes to the product that better match how the service is used. Through the use of cheap and accessible data, decisions can be made that result in increased efficiency and revenue for the company.