New York City Airbnb Data Analysis, Visualization, and Prediction (V2)

Note:

This is the second version of this notebook. In this version, I add more analysis to the exploratory data analysis section: the top 10 neighbourhoods with the most listings, and the 10 most and least expensive neighbourhoods in Manhattan and Brooklyn.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

1. Introduction

Airbnb is widely used by travelers these days. Here, I explore the NYC Airbnb dataset to answer three questions:

1. What is the status of the Airbnb market in NYC?
2. Which factors affect the price?
3. Can we predict the price?

2. Data Understanding

The dataset is from Kaggle (https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data). It has around 49,000 observations and 16 columns.

A sample of the dataset is shown below.

The 16 columns are ‘id’, ‘name’, ‘host_id’, ‘host_name’, ‘neighbourhood_group’, ‘neighbourhood’, ‘latitude’, ‘longitude’, ‘room_type’, ‘price’, ‘minimum_nights’, ‘number_of_reviews’, ‘last_review’, ‘reviews_per_month’, ‘calculated_host_listings_count’, and ‘availability_365’.

The libraries I use in this work are NumPy, pandas, Matplotlib, seaborn, wordcloud, and scikit-learn.

Check null values

Some columns in the dataset contain null values, as shown below.

I drop unnecessary columns such as ‘id’ and ‘host_name’, then handle the remaining null values.

The cleaned dataset now has no null values and is ready for analysis and modeling.
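A minimal sketch of this cleaning step is shown here for illustration; it is not the notebook's actual code. The rows are made up, dropping ‘last_review’ is an assumption, and so are the fill rules (0 for missing ‘reviews_per_month’, an empty string for a missing ‘name’), since the text does not say how the nulls are handled.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Kaggle CSV; only a few of the 16 columns.
airbnb = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Cozy room", None, "Modern loft"],
    "host_name": ["Ann", "Bob", None],
    "price": [80, 150, 220],
    "last_review": ["2019-05-01", None, "2019-06-15"],
    "reviews_per_month": [0.5, np.nan, 1.2],
})

# Count the null values in each column.
null_counts = airbnb.isnull().sum()
print(null_counts)

# Drop columns that are not needed ('last_review' is an assumed addition).
clean = airbnb.drop(columns=["id", "host_name", "last_review"])

# Assumed fill rules: no reviews -> 0 per month, missing name -> empty string.
clean["reviews_per_month"] = clean["reviews_per_month"].fillna(0)
clean["name"] = clean["name"].fillna("")

# Total number of nulls left after cleaning.
remaining_nulls = int(clean.isnull().sum().sum())
print(remaining_nulls)
```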

3. Exploratory Data Analysis

Through exploratory data analysis, I try to answer the first two questions:

1. What is the status of the Airbnb market in NYC?
2. Which factors affect the price?

Q1: What is the status of the Airbnb market in NYC?

I look at the market in terms of the number of listings, room type, price, and availability, and compare the neighbourhood groups. First, here is a map of the neighbourhood groups of NYC.

Number of listings

Based on the data, most of the listings in NYC are in Manhattan and Brooklyn. Each has more than 20,000 listings, and together they account for more than 85% of all listings.

Here are the top 10 neighbourhoods with the most listings. The neighbourhood with the most listings is Williamsburg in Brooklyn. The rest of the top 10 are Bedford-Stuyvesant, Harlem, Bushwick, Upper West Side, Hell’s Kitchen, East Village, Upper East Side, Crown Heights, and Midtown. All of them are in either Brooklyn or Manhattan.
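Counts like these can be computed with pandas value_counts. The sketch below uses a handful of made-up rows rather than the real dataset, so its numbers are illustrative only; the column names follow the dataset.

```python
import pandas as pd

# Made-up listings; the real dataset has around 49,000 rows.
airbnb = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn",
                            "Brooklyn", "Brooklyn", "Queens"],
    "neighbourhood": ["Harlem", "Midtown", "Williamsburg",
                      "Williamsburg", "Bushwick", "Astoria"],
})

# Share of listings in each neighbourhood group (fractions that sum to 1).
group_share = airbnb["neighbourhood_group"].value_counts(normalize=True)
print(group_share)

# Neighbourhoods with the most listings; head(10) keeps at most ten.
top_neighbourhoods = airbnb["neighbourhood"].value_counts().head(10)
print(top_neighbourhoods)
```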

Room type

As shown in the figure below, about 52% of listings are entire homes/apts, 45% are private rooms, and only 2% are shared rooms.

Among different neighbourhood groups, Brooklyn has the most private rooms, while Manhattan has the most entire homes/apts.

Price

Outliers make it hard to see the price distribution and the price differences among neighbourhood groups directly.

I remove the outliers to compare prices. As shown below, without outliers, the median price in Manhattan is ~$150, much higher than in the other neighbourhood groups (all below $100).

It looks like listings priced above $400 are treated as outliers. Since they make up only ~3.6% of all listings, it is reasonable to drop them for analysis, visualization, and modeling where necessary.

Here is the price distribution after excluding listings priced above $400. For the remaining listings in NYC, the mean price is $125 and the median price is $100.
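The cutoff can be applied with a boolean mask. This is a sketch on made-up prices, so the share and the summary statistics it prints are not the ones reported above. Treating a listing at exactly $400 as kept is an assumption.

```python
import pandas as pd

# Made-up prices, including two extreme values.
prices = pd.Series([50, 80, 100, 120, 150, 200, 350, 400, 999, 5000],
                   name="price")

# Flag listings priced above the $400 cutoff.
is_outlier = prices > 400

# Share of listings that would be dropped.
outlier_share = is_outlier.mean()
print(f"share above $400: {outlier_share:.1%}")

# Keep the rest and summarize.
kept = prices[~is_outlier]
print("mean:", kept.mean(), "median:", kept.median())
```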

Here is NYC's price map (excluding listings priced above $400). As the colors indicate, the most expensive areas are Manhattan and parts of Brooklyn. Let us look at the 10 neighbourhoods in Manhattan and Brooklyn with the highest and lowest median prices.

Here are the 10 neighbourhoods in Manhattan with the highest median price. Their median prices range from $200 to $300. The most expensive neighbourhood is Tribeca.

Here are the 10 neighbourhoods in Manhattan with the lowest median price. Their median prices range from $70 to $130. The cheapest neighbourhood is Washington Heights.

Here are the 10 neighbourhoods in Brooklyn with the highest median price. Their median prices range from $130 to $190. The most expensive neighbourhood is DUMBO.

Here are the 10 neighbourhoods in Brooklyn with the lowest median price. Their median prices range from $50 to $75. The cheapest neighbourhood is Borough Park.
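Rankings of this kind come from grouping by neighbourhood and taking the median. The sketch below shows the idea for Manhattan on made-up prices; the same steps apply to Brooklyn. The values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Made-up listings; prices are illustrative only.
airbnb = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 6 + ["Brooklyn"],
    "neighbourhood": ["Tribeca", "Tribeca", "Harlem", "Harlem",
                      "Washington Heights", "Washington Heights", "DUMBO"],
    "price": [300, 280, 90, 110, 70, 80, 190],
})

# Restrict to one neighbourhood group, then take the median per neighbourhood.
manhattan = airbnb[airbnb["neighbourhood_group"] == "Manhattan"]
median_price = manhattan.groupby("neighbourhood")["price"].median()

# Up to ten highest and ten lowest median prices.
most_expensive = median_price.nlargest(10)
least_expensive = median_price.nsmallest(10)
print(most_expensive)
print(least_expensive)
```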

Availability

Availability is lower in Brooklyn and Manhattan than in the other neighbourhood groups.

What are the popular words in names?

It is interesting to see which words are popular in listing names. Here, I use Wordcloud to check word popularity. Some words are quite common, such as ‘private’, ‘beautiful’, ‘cozy’, ‘modern’, and ‘quiet’, which point to features that guests may appreciate. Other common words show the popular locations, such as ‘central park’, ‘Brooklyn’, ‘east village’, and ‘east side’.
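A word cloud sizes each word by how often it appears, so the underlying step is a frequency count. The sketch below does that count with the standard library instead of the wordcloud package, on made-up listing names. The stopword list and the tokenizer are assumptions chosen for the example.

```python
import re
from collections import Counter

# Made-up listing names.
names = [
    "Cozy private room near Central Park",
    "Beautiful modern loft in Brooklyn",
    "Quiet private room in East Village",
    "Cozy studio near Central Park",
]

# Small assumed stopword list; a real one would be longer.
STOPWORDS = {"in", "near", "the", "a", "of", "and"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

word_counts = Counter()
bigram_counts = Counter()
for name in names:
    tokens = tokenize(name)
    word_counts.update(tokens)
    # Adjacent word pairs capture phrases such as "central park".
    bigram_counts.update(zip(tokens, tokens[1:]))

print(word_counts.most_common(5))
print(bigram_counts.most_common(3))
```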

Q2: Which factors affect the price?

To understand which factors affect the price, I look at the relationship between price and neighbourhood group, room type, minimum nights, host listings count, and availability.

Price vs. Neighbourhood

The figure below shows the price in Manhattan is much higher than in other locations.

Price vs. Room Type

Entire homes/apts are more expensive than private rooms and shared rooms. Private rooms are only slightly more expensive than shared rooms.

Price vs. Minimum nights

There is no clear trend, although listings with a minimum stay of one night look slightly cheaper.

Price vs. Calculated_host_listings_count

Listings whose calculated_host_listings_count equals 1 appear to have higher prices. This may indicate that hosts with only one property for rent expect a higher price.

Price vs. Availability

No clear trend between availability and price.

According to these results, the price appears to be affected mainly by location and room type.

4. Data Modeling

Through data modeling, I try to predict the price using five different models: Linear Regression, Ridge Regression, Lasso Regression, Decision Tree, and Random Forest. I also use grid search to improve the model.

Q3: Can we predict the price?

Data preparation

Here, I use the dataset with listings priced above $400 removed. The categorical columns are encoded with two methods, LabelEncoder and get_dummies, and the results are compared. The following figure shows the correlation between price and the other features. Price is most strongly related to location and room type, which agrees with the earlier analysis. Interestingly, calculated_host_listings_count also shows a high correlation with price.
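The difference between the two encodings is easiest to see on a few rows. In this sketch, the data is made up and only two categorical columns are used. The names airbnb_le and airbnb_dummies match the ones used in the modeling section, but how the notebook builds them is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up listings with two categorical columns.
airbnb = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens", "Manhattan"],
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "price": [200, 80, 45, 120],
})
categorical = ["neighbourhood_group", "room_type"]

# LabelEncoder: one integer code per category (codes follow sorted order).
airbnb_le = airbnb.copy()
for col in categorical:
    airbnb_le[col] = LabelEncoder().fit_transform(airbnb_le[col])

# get_dummies: one 0/1 indicator column per category.
airbnb_dummies = pd.get_dummies(airbnb, columns=categorical)

print(airbnb_le)
print(airbnb_dummies.columns.tolist())

# Correlation of each encoded feature with price.
price_corr = airbnb_le.corr()["price"]
print(price_corr)
```

Label encoding keeps one column per feature but imposes an arbitrary order on the categories. One-hot encoding avoids that order at the cost of more columns.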

Modeling

Five models are used to predict the price: Linear Regression, Ridge Regression, Lasso Regression, Decision Tree, and Random Forest. A general model is first created as below and is then used by each specific model.

The results are printed using a showResults function.
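One way such a shared routine could look is sketched here; it is not the notebook's code. The names fit_and_score and show_results are hypothetical (the latter is only a stand-in for showResults). The 80/20 split, the fixed random seed, and the choice of metrics are assumptions. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def fit_and_score(model, X, y, test_size=0.2, random_state=42):
    """Split the data, fit the model, and return train/test metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    return {
        "train_r2": r2_score(y_train, model.predict(X_train)),
        "test_r2": r2_score(y_test, test_pred),
        "test_rmse": float(np.sqrt(mean_squared_error(y_test, test_pred))),
    }

def show_results(name, results):
    """Print the metrics for one model on a single line."""
    metrics = ", ".join(f"{key}={value:.3f}" for key, value in results.items())
    print(f"{name}: {metrics}")

# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

results = fit_and_score(LinearRegression(), X, y)
show_results("Linear Regression", results)
```

Any scikit-learn regressor can be passed in place of LinearRegression, which is what makes a shared helper convenient when comparing several models.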

Both airbnb_le (based on the LabelEncoder method) and airbnb_dummies (based on the get_dummies method) datasets are used in each model. Their results are also compared.

Linear Regression

Below is the linear regression code, using airbnb_le as an example. The code for the other models is similar. For details, please see my GitHub (https://github.com/tyuion/NYC_airbnb).

The results of linear regression are shown below. The R2 score of the linear regression model based on the LabelEncoder method is 0.422. The model based on the get_dummies method performs poorly. Let us try to regularize the linear models using Ridge Regression and Lasso Regression.

Ridge Regression

Lasso Regression

Ridge and Lasso regression improve the results for the data based on the get_dummies method, and their scores are similar. Because the linear regression model based on the LabelEncoder method is underfitting rather than overfitting, regularization does not help it much.
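For readers unfamiliar with the two penalties, here is a small sketch on synthetic data. The alpha values are arbitrary choices for the example, not the ones used in the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of five features affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Ridge adds an L2 penalty, which shrinks all coefficients toward zero.
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso adds an L1 penalty, which can set some coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("ridge R2:", round(ridge.score(X, y), 3))
print("lasso R2:", round(lasso.score(X, y), 3))
```

Regularization lowers variance, which helps a model that overfits. It cannot add flexibility to a model that is already too simple for the data.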

Decision Tree Regression

The score on the training set is much higher than on the testing set, which indicates overfitting. I try to reduce the overfitting by setting max_depth=10 and min_samples_leaf=2, as below. The score increases to 0.503, and the scores for the LabelEncoder and get_dummies methods are similar.
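The train/test gap and the effect of the two limits can be reproduced on synthetic data. This sketch uses the max_depth and min_samples_leaf values quoted above; the data, the split, and the train_test_r2 helper are made up for the example, so the scores differ from the reported ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def train_test_r2(model):
    """Fit the model and return its R2 on the training and testing sets."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# An unconstrained tree grows until it fits the training noise.
full_train, full_test = train_test_r2(DecisionTreeRegressor(random_state=0))

# Limiting depth and leaf size stops the tree from fitting single samples.
limited_train, limited_test = train_test_r2(
    DecisionTreeRegressor(max_depth=10, min_samples_leaf=2, random_state=0)
)

print(f"unconstrained: train={full_train:.3f}, test={full_test:.3f}")
print(f"limited:       train={limited_train:.3f}, test={limited_test:.3f}")
```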

Random Forest

The random forest model also overfits. I try to reduce the overfitting by setting max_depth=10, as below. This gives a score of 0.551, which is better than both the decision tree and linear regression. The scores of the two models built with the LabelEncoder and get_dummies methods are close.

Grid Search

Then I use the grid search to tune the parameters of the random forest model and try to improve the model further. The best parameters found are

The final model and results are as below. It gives a score of 0.561.
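For reference, a grid search over a random forest with scikit-learn's GridSearchCV can be sketched as follows. The parameter grid, the number of trees, the 3-fold cross-validation, and the synthetic data are all assumptions for the example. They are not the notebook's settings, and the parameters this sketch selects say nothing about the ones found for the Airbnb data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the encoded Airbnb features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hypothetical grid: every combination is scored by cross-validated R2.
param_grid = {"max_depth": [3, 10], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated R2:", round(search.best_score_, 3))

# By default, the best estimator is refit on all the data passed to fit.
final_model = search.best_estimator_
```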

Feature Importance

Let us check out the importance of features. It looks like the most important features for price prediction are room type and location, which agrees with our results in the data analysis section.
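Feature importances can be read from a fitted random forest through its feature_importances_ attribute. In this sketch, the data is synthetic and built so that room type and neighbourhood group drive the price, so the ranking it prints is true by construction and is not evidence about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic label-encoded features.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "room_type": rng.integers(0, 3, size=n),
    "neighbourhood_group": rng.integers(0, 5, size=n),
    "minimum_nights": rng.integers(1, 30, size=n),
})

# Synthetic price that ignores minimum_nights entirely.
y = (100 - 40 * X["room_type"] + 15 * X["neighbourhood_group"]
     + rng.normal(scale=5, size=n))

forest = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```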

5. Conclusions

In this project, I analyze the NYC Airbnb dataset. The main findings are summarized below.

1. What is the status of the Airbnb market in NYC?
— Most of the listings in NYC are in Manhattan and Brooklyn. Each has more than 20,000 listings, and together they account for over 85% of all listings. The neighbourhood with the most listings is Williamsburg in Brooklyn.
— Around 52% of listings are entire homes/apts, and 45% are private rooms. Only 2% are shared rooms.
— The median price in Manhattan is ~$150, which is much higher than other neighbourhood groups (less than $100).
— The 10 most expensive neighbourhoods in Manhattan have median prices between $200 and $300. The most expensive neighbourhood is Tribeca. The 10 least expensive neighbourhoods in Manhattan have median prices between $70 and $130. The cheapest neighbourhood is Washington Heights.
— The availability is lower in Brooklyn and Manhattan compared to other locations.
— Some words are quite popular in the name, such as ‘private’, ‘beautiful’, ‘cozy’, ‘modern’, and ‘quiet’.

2. Which factors affect the price?
— Location: The price in Manhattan is much higher than in other locations.
— Room type: The entire home/apt is much more expensive than the private room and shared room. The private room is slightly more expensive than the shared room.
— Host listings count: Listings whose calculated_host_listings_count equals 1 appear to be more expensive.
— Minimum nights: No clear trend.
— Availability: No clear trend.

3. Can we predict the price?
— Five models are used to predict the price: Linear Regression, Ridge Regression, Lasso Regression, Decision Tree, and Random Forest. Random forest gives a better result than other models.
— Grid search is used to improve the Random forest model further. The best model gives an R2 score of 0.561.
— According to the model, the most important features are room type and location.

For details and source code, please visit my GitHub (https://github.com/tyuion/NYC_airbnb).
