New York City Airbnb Data Analysis, Visualization, and Prediction (V2)

Note:
This is the second version of this notebook. In this version, I
- add more analysis, such as the top 10 neighbourhoods with the most listings and the top 10 most and least expensive neighbourhoods in Manhattan and Brooklyn, in the data exploratory analysis section.
- add Ridge Regression, Lasso Regression, grid search, and feature importance study in the data modeling section.
- update code and add some other discussions.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
1. Introduction
Airbnb is now widely used by travelers. Here, I explore the NYC Airbnb dataset to get some insights, focusing on the following three questions:
1. What is the status of the Airbnb market in NYC?
2. Which factors affect the price?
3. Can we predict the price?
2. Data Understanding
The dataset is from Kaggle (https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data). It has around 49,000 observations with 16 columns.
A sample of the dataset is shown below.

The 16 columns include ‘id’, ‘name’, ‘host_id’, ‘host_name’, ‘neighbourhood_group’, ‘neighbourhood’, ‘latitude’, ‘longitude’, ‘room_type’, ‘price’, ‘minimum_nights’, ‘number_of_reviews’, ‘last_review’,
‘reviews_per_month’, ‘calculated_host_listings_count’, and ‘availability_365’.
The libraries I used in this work include NumPy, Pandas, Matplotlib, Seaborn, Wordcloud, and Sklearn.
Check null values
There are some null values in the dataset as below.

I drop some unnecessary columns such as ‘id’ and ‘host_name’ and then process the null values for data analysis and modeling.
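The null check and cleanup can be sketched as follows. This is a minimal illustration on a made-up three-row frame, not the notebook's actual code: dropping 'last_review' and filling 'reviews_per_month' with 0 are assumptions about how the null values might reasonably be handled.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Kaggle CSV; the values are made up for illustration.
airbnb = pd.DataFrame({
    "id": [1, 2, 3],
    "host_name": ["Ann", None, "Cy"],
    "last_review": ["2019-05-01", None, "2019-06-12"],
    "reviews_per_month": [0.5, np.nan, 2.0],
    "price": [150, 80, 60],
})

# Count missing values per column.
null_counts = airbnb.isnull().sum()

# Drop columns that are not needed for analysis
# (including 'last_review' here is an assumption).
cleaned = airbnb.drop(columns=["id", "host_name", "last_review"])

# Assumption: a listing with no reviews has zero reviews per month.
cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)

# Total number of nulls left after cleaning.
remaining_nulls = int(cleaned.isnull().sum().sum())
print(null_counts)
print("Remaining nulls:", remaining_nulls)
```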

The new dataset has no null values and is ready for analysis.

3. Data Exploratory Analysis
Through data exploratory analysis, I try to answer the first two questions:
1. What is the status of the Airbnb market in NYC?
2. Which factors affect the price?
Q1: What is the status of the Airbnb market in NYC?
I study the market in terms of the number of listings, price, availability, etc., and compare the neighbourhood groups. First, here is a map of the NYC neighbourhood groups.

Number of listings
Based on the data, most of the listings in NYC are in Manhattan and Brooklyn. Each of them has more than 20,000 listings, and together they account for more than 85% of all listings.

Here are the top 10 neighbourhoods with the most listings. The neighbourhood with the most listings is Williamsburg in Brooklyn. The rest of the top 10 are Bedford-Stuyvesant, Harlem, Bushwick, Upper West Side, Hell’s Kitchen, East Village, Upper East Side, Crown Heights, and Midtown. All of them are in either Brooklyn or Manhattan.
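Counts like these can be obtained with value_counts. The sketch below uses a made-up six-row frame with the dataset's column names, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up sample using the dataset's column names.
listings = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Brooklyn",
                            "Manhattan", "Manhattan", "Queens"],
    "neighbourhood": ["Williamsburg", "Williamsburg", "Bushwick",
                      "Harlem", "Midtown", "Astoria"],
})

# Number of listings per neighbourhood group.
group_counts = listings["neighbourhood_group"].value_counts()

# Combined share of Manhattan and Brooklyn among all listings.
top_two_share = group_counts.loc[["Manhattan", "Brooklyn"]].sum() / len(listings)

# Up to 10 neighbourhoods with the most listings.
top_neighbourhoods = listings["neighbourhood"].value_counts().head(10)

print(group_counts)
print(f"Manhattan + Brooklyn share: {top_two_share:.1%}")
print(top_neighbourhoods)
```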

Room type
As shown in the figure below, ~52% of listings are entire homes/apts, and ~45% are private rooms. Only ~2% are shared rooms.

Among different neighbourhood groups, Brooklyn has the most private rooms, while Manhattan has the most entire homes/apts.
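A minimal sketch of how the room type shares and the per-group breakdown could be computed, again on made-up rows rather than the real data:

```python
import pandas as pd

# Made-up sample using the dataset's column names.
listings = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Brooklyn",
                            "Manhattan", "Manhattan", "Manhattan"],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Entire home/apt", "Shared room"],
})

# Percentage of listings per room type.
room_share = listings["room_type"].value_counts(normalize=True) * 100

# Listing counts per neighbourhood group and room type.
rooms_by_group = pd.crosstab(listings["neighbourhood_group"], listings["room_type"])

print(room_share.round(1))
print(rooms_by_group)
```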

Price
It is hard to directly see the distribution of price and the difference in price among different neighbourhoods due to outliers.


I remove the outliers to compare the price. As shown below, without outliers, the median price in Manhattan is ~$150, much higher than other neighbourhood groups (less than $100).

It looks like listings priced above $400 are considered outliers. Since they are only ~3.6% of all listings, it is acceptable to drop them for analysis, visualization, and modeling when necessary.
Here is the price distribution after excluding the listings priced above $400. For the remaining listings in NYC, the mean price is $125, and the median price is $100.
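The filtering step can be sketched like this. The prices are made up, and PRICE_CAP is a hypothetical name for the $400 threshold discussed above.

```python
import pandas as pd

# Made-up prices; the real column has ~49,000 values.
prices = pd.Series([50, 75, 100, 120, 150, 200, 399, 450, 1000, 5000])

PRICE_CAP = 400  # hypothetical name for the threshold used in this article

# Flag listings above the cap and measure how many there are.
is_outlier = prices > PRICE_CAP
outlier_share = is_outlier.mean()

# Keep only the listings at or below the cap.
kept = prices[~is_outlier]

print(f"Share above cap: {outlier_share:.1%}")
print(f"Mean: {kept.mean():.2f}, median: {kept.median():.2f}")
```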

Here is NYC's price map (excluding the listings priced above $400). As the color indicates, the most expensive areas are Manhattan and some parts of Brooklyn. Let us check out the top 10 neighbourhoods in Manhattan and Brooklyn with the highest and lowest median prices.

Here are the top 10 neighbourhoods in Manhattan with the highest median price. The price is between $200 and $300. The most expensive neighbourhood is Tribeca.

Here are the top 10 neighbourhoods in Manhattan with the lowest median price. The price is between $70 and $130. The cheapest neighbourhood is Washington Heights.

Here are the top 10 neighbourhoods in Brooklyn with the highest median price. The price is between $130 and $190. The most expensive neighbourhood is DUMBO.

Here are the top 10 neighbourhoods in Brooklyn with the lowest median price. The price is between $50 and $75. The cheapest neighbourhood is Borough Park.
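Rankings like the four above come down to a groupby on neighbourhood followed by a sort. In the sketch below, median_price_ranking is a hypothetical helper and the prices are made up.

```python
import pandas as pd

# Made-up sample using the dataset's column names.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4,
    "neighbourhood": ["Tribeca", "Tribeca", "Harlem", "Harlem",
                      "DUMBO", "DUMBO", "Bushwick", "Bushwick"],
    "price": [300, 280, 90, 110, 190, 170, 60, 70],
})

def median_price_ranking(df, group, n=10, highest=True):
    """Return up to n neighbourhoods of one group ranked by median price."""
    subset = df[df["neighbourhood_group"] == group]
    medians = subset.groupby("neighbourhood")["price"].median()
    return medians.sort_values(ascending=not highest).head(n)

most_expensive_manhattan = median_price_ranking(listings, "Manhattan")
cheapest_brooklyn = median_price_ranking(listings, "Brooklyn", highest=False)

print(most_expensive_manhattan)
print(cheapest_brooklyn)
```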

Availability
The availability is lower in Brooklyn and Manhattan compared to other locations.

What are the popular words in names?
It is interesting to see which words are popular in the listing names. Here, I use Wordcloud to check the popularity of words in the names. Some words are quite popular, such as ‘private’, ‘beautiful’, ‘cozy’, ‘modern’, and ‘quiet’, which point to features that people may appreciate. Other popular words refer to popular locations, such as ‘central park’, ‘Brooklyn’, ‘east village’, and ‘east side’.
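A word cloud is essentially a picture of word frequencies. As a rough stand-in that needs only the standard library (it is not the Wordcloud code used for the figure), the counts behind it can be sketched as follows; the names and the stopword list are made up.

```python
import re
from collections import Counter

# Made-up listing names.
names = [
    "Cozy private room near Central Park",
    "Modern cozy studio in Brooklyn",
    "Quiet private room, East Village",
]

# A very small, hypothetical stopword list.
STOPWORDS = {"in", "near", "the", "a", "of"}

def word_counts(names):
    """Count lowercase words across all names, skipping stopwords."""
    counts = Counter()
    for name in names:
        words = re.findall(r"[a-z']+", name.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts

counts = word_counts(names)
print(counts.most_common(5))
```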

Q2: Which factors affect the price?
To understand which factors affect the price, I check out the correlation between price and neighbourhood, room type, minimum nights, listings count, availability, etc.
Price vs. Neighbourhood
The figure below shows the price in Manhattan is much higher than in other locations.

Price vs. Room Type
The entire home/apt is more expensive than the private room and shared room. The private room is just slightly more expensive than the shared room.

Price vs. Minimum nights
There is no clear trend, although the price of listings with a minimum stay of 1 night looks slightly lower, which is interesting.

Price vs. Calculated_host_listings_count
It looks like listings whose calculated_host_listings_count equals 1 have higher prices. This may indicate that hosts with only one property for rent expect a higher price.

Price vs. Availability
No clear trend between availability and price.

According to these results, the price appears to be mainly affected by the location and room type.
4. Data Modeling
Through data modeling, I try to predict the price using five different models: Linear Regression, Ridge Regression, Lasso Regression, Decision Tree, and Random Forest. I also use grid search to improve the model.
Q3: Can we predict the price?
Data preparation
Here, I use only the listings priced below $400. The categorical columns are processed using two methods, LabelEncoder and get_dummies, and their results are compared. The following figure shows the correlation between price and the other features. It looks like the price is most strongly related to the location and room type, which agrees with the previous analysis. Interestingly, calculated_host_listings_count also shows a high correlation with the price.
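The two encodings differ in shape: LabelEncoder maps each category to a single integer, while get_dummies creates one indicator column per category. Here is a minimal sketch on a made-up frame; the names airbnb_le and airbnb_dummies follow the article, but the data does not.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample with one categorical column.
listings = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
    "price": [80, 200, 40, 95],
})

# Label encoding: one integer per category, assigned in sorted order.
airbnb_le = listings.copy()
airbnb_le["room_type"] = LabelEncoder().fit_transform(airbnb_le["room_type"])

# One-hot encoding: one indicator column per category.
airbnb_dummies = pd.get_dummies(listings, columns=["room_type"])

print(airbnb_le)
print(airbnb_dummies.columns.tolist())
```

Label encoding imposes an arbitrary order on the categories, which a linear model reads as a numeric scale; one-hot encoding avoids that at the cost of more columns.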

Modeling
Five models are used to predict the price: Linear Regression, Ridge Regression, Lasso Regression, Decision Tree, and Random Forest. A general model is first created as below and is then used by each specific model.

The results are printed using a showResults function.
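One way such a shared workflow might look is sketched below. Both fit_and_score and show_results are hypothetical helpers written for this illustration (the original showResults may differ), and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def fit_and_score(model, X, y, test_size=0.2, random_state=42):
    """Split the data, fit the model, and return train and test R2 scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model.fit(X_train, y_train)
    return {
        "train_r2": r2_score(y_train, model.predict(X_train)),
        "test_r2": r2_score(y_test, model.predict(X_test)),
    }

def show_results(name, scores):
    """Print the scores of one model on a single line."""
    print(f"{name}: train R2 = {scores['train_r2']:.3f}, "
          f"test R2 = {scores['test_r2']:.3f}")

# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([30.0, -10.0, 5.0]) + 120 + rng.normal(scale=5.0, size=200)

scores = fit_and_score(LinearRegression(), X, y)
show_results("Linear Regression", scores)
```

Because the helper accepts any estimator with fit and predict, the same call works for Ridge, Lasso, decision tree, and random forest models.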

Both airbnb_le (based on the LabelEncoder method) and airbnb_dummies (based on the get_dummies method) datasets are used in each model. Their results are also compared.

Linear Regression
The code for linear regression using airbnb_le is shown here as an example. The code for the other models is similar. For details, please see my GitHub (https://github.com/tyuion/NYC_airbnb).

The results of linear regression are shown below. The R2 score of the linear regression model based on the LabelEncoder method is 0.422. The model based on the get_dummies method performs poorly. Let us try to regularize the linear models using Ridge Regression and Lasso Regression.


Ridge Regression


Lasso Regression


The Lasso and Ridge regressions improve the results for the data based on the get_dummies method, and their results are similar. Since the linear regression model based on the LabelEncoder method is actually underfitting, Ridge and Lasso regression would not help much.
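To illustrate what the two penalties do, the sketch below fits all three linear models on synthetic data where only the first two of five features carry signal. The alpha values are arbitrary choices for the demonstration, not the ones used in the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: features 0 and 1 matter, features 2 to 4 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 10.0 * X[:, 0] - 5.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty can set coefficients to zero

for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    print(name, np.round(model.coef_, 2))
```

Ridge pulls every coefficient toward zero, while Lasso drops the noise features entirely. That kind of shrinkage mainly helps an unstable or overfitting model, which fits the observation that it does little for an underfitting one.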
Decision Tree Regression

The score on the training set is much higher than that on the testing set, indicating overfitting. To reduce the overfitting, I limit the tree with max_depth=10 and min_samples_leaf=2 as below. The score increases to 0.503, and the scores using the LabelEncoder and get_dummies methods are similar.
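The effect of these two limits can be seen on synthetic noisy data. This is a sketch under made-up data, so the scores it prints are unrelated to the ones reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = 200 * X[:, 0] + 50 * (X[:, 1] > 0.5) + rng.normal(scale=30.0, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unconstrained tree can memorize the training set.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Limiting depth and leaf size forces the tree to generalize.
limited_tree = DecisionTreeRegressor(
    max_depth=10, min_samples_leaf=2, random_state=0
).fit(X_train, y_train)

full_train = full_tree.score(X_train, y_train)
limited_train = limited_tree.score(X_train, y_train)
print(f"Full tree:    train R2 = {full_train:.3f}, "
      f"test R2 = {full_tree.score(X_test, y_test):.3f}")
print(f"Limited tree: train R2 = {limited_train:.3f}, "
      f"test R2 = {limited_tree.score(X_test, y_test):.3f}")
```

A large gap between the training and testing scores of the full tree is the overfitting signal described above.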

Random Forest
The random forest model also overfits. To reduce the overfitting, I limit max_depth to 10 as below. It gives a score of 0.551, which is better than the decision tree and linear regression. The scores of the two models built using the LabelEncoder and get_dummies methods are close.

Grid Search
I then use grid search to tune the parameters of the random forest model and try to improve the model further. The best parameters found are

The final model and results are as below. It gives a score of 0.561.
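A grid search over a random forest can be sketched with GridSearchCV as below. The parameter grid, the number of trees, and the synthetic data are all assumptions chosen to keep the example fast; they are not the grid or the best parameters from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with a step-like and a linear effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 100 * (X[:, 0] > 0) + 30 * X[:, 1] + rng.normal(scale=5.0, size=300)

# Small, hypothetical grid; a real search would usually cover more values.
param_grid = {"max_depth": [3, 10], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation for each parameter combination
    scoring="r2",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R2: {search.best_score_:.3f}")
```

After fitting, search.best_estimator_ holds a model refit on all the data with the best parameters, which can then be evaluated on a held-out test set.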

Feature Importance
Let us check out the importance of features. It looks like the most important features for price prediction are room type and location, which agrees with our results in the data analysis section.
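Feature importances can be read from a fitted random forest through its feature_importances_ attribute. The sketch below uses synthetic data in which a room type indicator is deliberately given the strongest effect; the column names and coefficients are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; the effects below are made up for illustration.
rng = np.random.default_rng(0)
n = 400
features = pd.DataFrame({
    "is_entire_home": rng.integers(0, 2, size=n),
    "is_manhattan": rng.integers(0, 2, size=n),
    "minimum_nights": rng.integers(1, 30, size=n),
})
price = (
    60
    + 100 * features["is_entire_home"]
    + 40 * features["is_manhattan"]
    + rng.normal(scale=10.0, size=n)
)

forest = RandomForestRegressor(
    n_estimators=50, max_depth=10, random_state=0
).fit(features, price)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances.round(3))
```

These impurity-based importances sum to 1, so each value can be read as a share of the model's total split quality.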

5. Conclusions
In this project, I analyze the NYC Airbnb dataset and arrive at the following findings:
1. What is the status of the Airbnb market in NYC?
— Most of the listings in NYC are in Manhattan and Brooklyn. Each of them has more than 20,000 listings, and together they account for over 85% of all listings. The neighbourhood with the most listings is Williamsburg in Brooklyn.
— Around 52% of listings are entire homes/apts, and 45% are private rooms. Only 2% are shared rooms.
— The median price in Manhattan is ~$150, which is much higher than other neighbourhood groups (less than $100).
— The top 10 most expensive neighbourhoods in Manhattan have median prices between $200 and $300. The most expensive neighbourhood is Tribeca. The top 10 least expensive neighbourhoods in Manhattan have median prices between $70 and $130. The cheapest neighbourhood is Washington Heights.
— The availability is lower in Brooklyn and Manhattan compared to other locations.
— Some words are quite popular in the name, such as ‘private’, ‘beautiful’, ‘cozy’, ‘modern’, and ‘quiet’.
2. Which factors affect the price?
— Location: The price in Manhattan is much higher than in other locations.
— Room type: The entire home/apt is much more expensive than the private room and shared room. The private room is slightly more expensive than the shared room.
— Host listings count: It looks like listings whose calculated_host_listings_count equals 1 are more expensive.
— Minimum nights: No clear trend.
— Availability: No clear trend.
3. Can we predict the price?
— Five models are used to predict the price: Linear Regression, Ridge Regression, Lasso Regression, Decision Tree, and Random Forest. Random forest gives a better result than other models.
— Grid search is used to improve the Random forest model further. The best model gives an R2 score of 0.561.
— According to the model, the most important features are room type and location.
For details and source code, please visit my GitHub.