Udacity’s Data Scientist Nanodegree
This is a brief summary of the results obtained in the Airbnb data analysis: part of the Udacity Data Scientist Nanodegree program.
When I was in Singapore in 2013, I couldn’t help but notice the sheer number of high-rise apartments (South East Asian/South Asian style). Apart from being a financial and entrepôt hub in Asia, Singapore also hosts several tourist destinations. As a tourist, I know that my first question would be: ‘where am I going to stay?’
In the 21st century, one of the most popular short-term stay options has been Airbnb. Therefore, I decided to peek into its market in the Lion City.
The project is split into three main research questions:
- How well can we predict the price of a listing? Are the prices likely to be higher in the Central and Eastern regions?
- Which neighborhoods in Singapore receive more positive reviews? Is this quality reflected in these neighborhoods’ average Airbnb prices?
- How well can we predict the positive reviews for a listing?
The Central and Eastern regions of Singapore are densely packed with tourist spots and, geographically, they are closer to Sentosa, Marina Bay Sands, the Merlion and other architectural marvels. Naturally, we would expect prices to be higher in these regions, but that is not guaranteed. More listings in a region means more supply responding to demand, so prices may settle at a lower equilibrium. Let us first take a look at how the listings are distributed.
The graph below illustrates the neighborhoods and the number of listings in each of them.
Five neighborhoods not shown in the graph below were removed from the analysis, as each of them had only 1 or 2 listings. There was another reason too: one of those neighborhoods, Tuas, had a listing for SG$10,000! That is certainly an anomaly in the data, considering the questions that I seek to answer. Oddly, the more residential a neighborhood is, the fewer listings it has.
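A minimal sketch of that filtering step, using only the standard library. The `neighbourhood` key, the toy rows and the `min_listings` threshold of 3 are illustrative assumptions, not the code from the original notebook.

```python
from collections import Counter

def drop_sparse_neighborhoods(listings, min_listings=3):
    """Keep only listings whose neighborhood has at least `min_listings` rows."""
    counts = Counter(row["neighbourhood"] for row in listings)
    return [row for row in listings if counts[row["neighbourhood"]] >= min_listings]

# Toy rows with hypothetical prices, not the real Singapore dataset.
listings = [
    {"neighbourhood": "Kallang", "price": 120},
    {"neighbourhood": "Kallang", "price": 95},
    {"neighbourhood": "Kallang", "price": 150},
    {"neighbourhood": "Tuas", "price": 10000},
]

kept = drop_sparse_neighborhoods(listings)
print(sorted({row["neighbourhood"] for row in kept}))
```

Filtering on the listing count alone also removes the extreme Tuas price in this toy example, since it sits in a neighborhood with a single listing.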
For the linear regression, I used variables from the dataset that describe the characteristics of each listing. These include: amenities, the neighborhood group the listing is located in (Central, Eastern, Northern and Northeastern), how many people it can accommodate, the number of beds, etc.
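Categorical inputs such as the neighborhood group have to be turned into 0/1 columns before a linear regression can use them. The sketch below shows one common way to do it, leaving out a baseline category so that the remaining coefficients are read relative to it. Which group served as the baseline in the original analysis is not stated; `Central` here is an assumption, and `one_hot` is a hypothetical helper.

```python
def one_hot(values, baseline):
    """Encode categories as 0/1 columns, omitting `baseline`.

    Dropping one category avoids perfect collinearity with the intercept.
    """
    categories = sorted(set(values) - {baseline})
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Toy input, one region label per listing.
regions = ["Central", "Eastern", "Northern", "Central"]
columns, encoded = one_hot(regions, baseline="Central")

for region, row in zip(regions, encoded):
    print(region, dict(zip(columns, row)))
```

With this encoding, a coefficient on the `Eastern` column is the average price difference between an Eastern listing and an otherwise similar listing in the baseline region.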
The Linear Regression Model to predict price
The linear model explained about 15.5% of the variation in price (an R-squared of 0.155). Two variables, reviews_per_month and review_scores_value, each had 40% missing data. These are the number of reviews a listing gets per month, and the review score for the “value” of the listing (stay), respectively. I handled them by adding two new dummy variables indicating whether or not the value in the original column is missing. In the process of debugging, the model reached an R-squared of about 0.20 without the two original columns!
The graph above is important because the regression results suggest that the price of a listing is likely to be $54 higher on average if it is located in the Eastern region, and SGD $20–25 lower than average if it is located in the Northern/Northeastern regions. Other findings include higher prices for listings that accommodate more people, or that are a private room or an entire apartment, etc. All of these are natural and expected. The interesting findings are: i) prices are likely to be higher than average if a host is not a superhost, and ii) prices are lower for listings with more reviews per month.
There are not many superhosts in Singapore. Thus, this result may be statistical noise; alternatively, if a listing’s host is not a superhost and the price is high, other factors such as the location of the listing might be driving it. Reviews per month was one of the columns with missing values, which might very well affect the estimate. This is reflected in the fact that the coefficient of the dummy variable indicating whether the value was missing was approximately 39. In other words, the information missing from reviews per month had an impact on the model’s price predictions.
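The missing-value handling and the R-squared score described in this section can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the original notebook: the post does not say what value replaced the gaps, so filling with the column mean is an assumption, and the toy prices are generated from a made-up formula so that the fitted coefficients are easy to check.

```python
import numpy as np

def add_missing_indicator(column):
    """Return (filled_column, indicator) for a 1-D column containing NaNs."""
    column = np.asarray(column, dtype=float)
    is_missing = np.isnan(column)
    # Assumption: gaps are filled with the mean of the observed values.
    fill_value = column[~is_missing].mean()
    filled = np.where(is_missing, fill_value, column)
    return filled, is_missing.astype(float)

def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def r_squared(features, target, coef):
    """Share of the variance in `target` explained by the fitted model."""
    design = np.column_stack([np.ones(len(target)), features])
    residuals = target - design @ coef
    return 1 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)

# Toy column standing in for reviews_per_month (values are made up).
reviews_per_month = [1.0, np.nan, 3.0, np.nan, 2.0]
filled, indicator = add_missing_indicator(reviews_per_month)
features = np.column_stack([filled, indicator])

# Synthetic prices built from a known formula, purely for illustration.
price = 50 + 20 * filled + 30 * indicator

coef = fit_ols(features, price)
r2 = r_squared(features, price, coef)
print(coef, r2)
```

Because the toy prices follow the formula exactly, the fit recovers it and R-squared is essentially 1. On real data, the coefficient on the indicator column shows how much listings with a missing value differ, on average, from listings with an observed one.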
NLTK, and the sentiment of a listing
For the second (and third) part of the analysis, the calendar dataset and the reviews dataset come into use.
Using the Natural Language Toolkit’s VADER package, we can obtain the sentiment (positivity, negativity and neutrality) of a list of strings (comments/reviews). The reviews dataset contains several reviews for each listing, so I obtained the sentiment score of every review and averaged the values for each listing. This was then combined with the calendar dataset, from which I obtained each listing’s average price variation throughout a year, its average, minimum and maximum prices, etc. These datasets were then merged with the one prepared for the first regression model.
To answer question 2, I grouped the data by neighborhood and averaged the sentiment scores. This is helpful because it shows the relationship between the average prices in a neighborhood and how people perceive the neighborhood (and its listings), as measured by the sentiment scores.
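The per-listing averaging can be sketched as below. `make_vader_scorer` wraps NLTK’s `SentimentIntensityAnalyzer`, whose `polarity_scores` method returns `neg`, `neu`, `pos` and `compound` values; using the `pos` value is my assumption about which score the analysis relied on. The helper names, the toy reviews and the stand-in `toy_scorer` (included so the example runs without NLTK installed) are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def make_vader_scorer():
    """Return a function mapping text to VADER's 'pos' score.

    Requires nltk and a one-time download of the 'vader_lexicon' resource.
    """
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    nltk.download("vader_lexicon", quiet=True)
    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["pos"]

def average_score_per_listing(reviews, scorer):
    """reviews: iterable of (listing_id, comment) pairs -> {listing_id: mean score}."""
    scores = defaultdict(list)
    for listing_id, comment in reviews:
        scores[listing_id].append(scorer(comment))
    return {listing_id: mean(values) for listing_id, values in scores.items()}

def toy_scorer(text):
    """Stand-in scorer for the demo; replace with make_vader_scorer()."""
    return 1.0 if "great" in text.lower() else 0.0

# Made-up reviews: two for listing 1, one for listing 2.
reviews = [
    (1, "Great place"),
    (1, "Noisy street"),
    (2, "Great host, great view"),
]
averages = average_score_per_listing(reviews, toy_scorer)
print(averages)
```

Passing the scorer in as an argument keeps the grouping logic independent of NLTK, so it can be checked on toy data before running the real sentiment model over every review.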
Beware of this result. Central Water Catchment has only 26 listings. Many of the top 10 neighborhoods in the graph above also appeared in the bottom 10 of the listing-count graph earlier in the post. This means that a neighborhood with fewer listings may have a higher average review score, but its standard deviation is also likely to be high. A plain average does not account for the lower standard deviation of the reviews in neighborhoods with more listings (even if they have a lower average review score). Perhaps creating an index that accounts for the “size” of each neighborhood would approximate the “happiness” of its listings better.
Another interesting analysis is the relationship between the positivity score and the average price. Higher prices may be a way to temper high demand in a neighborhood, but such a neighborhood may also have higher review scores (which is why the demand was high in the first place).
The hypothesis above also suggests that the relationship between positive reviews and average prices may not be linear. Many of the scatter points fall outside the confidence interval band. Also, the outlier neighborhoods, which cannot be removed, limit the accuracy of the linear fit. Hence, the two graphs below show this relationship with polynomial fits of orders 2 and 3.
The fit of order 2 predicts better than the linear fit, and the fit of order 3 predicts better than that of order 2. But this is only natural: the in-sample fit never gets worse with higher degrees, and the penalty for adding higher orders is what limits how far this can be pushed. The order 3 graph shows that a higher sentiment has an “eventual” effect on price, not an immediate one.
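One possible form for such a size-aware index is a shrinkage average, which pulls each neighborhood’s mean toward the overall mean and pulls hardest when there are few listings. This is only a sketch of one option, not something the analysis used; the `prior_weight` of 50 and the example scores of 0.40 and 0.30 are made up. Only the count of 26 listings comes from the data above.

```python
def shrunk_mean(neighborhood_mean, n_listings, global_mean, prior_weight=50):
    """Weighted blend of a neighborhood's mean and the global mean.

    `prior_weight` acts like a number of imaginary listings scored at the
    global mean, so neighborhoods with few listings move further toward it.
    """
    total = n_listings * neighborhood_mean + prior_weight * global_mean
    return total / (n_listings + prior_weight)

# Same raw mean score, very different numbers of listings (scores are hypothetical).
small = shrunk_mean(0.40, n_listings=26, global_mean=0.30)
large = shrunk_mean(0.40, n_listings=1000, global_mean=0.30)
print(small, large)
```

Both neighborhoods start from the same raw average, but the one with 26 listings ends up closer to the global mean, which reflects how much less evidence stands behind its score.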
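The point about higher orders always fitting better can be checked with a small NumPy sketch. The data here is synthetic (a straight line plus random noise), and the adjusted R-squared shown is just one common way to penalize extra terms; I am not claiming it is the penalty behind the graphs above.

```python
import numpy as np

def in_sample_r2(x, y, degree):
    """R-squared of a least-squares polynomial fit, evaluated on the same data."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r2(r2, n_points, degree):
    """Penalize R-squared for the number of fitted terms (excluding the intercept)."""
    return 1 - (1 - r2) * (n_points - 1) / (n_points - degree - 1)

# Synthetic data: price depends linearly on sentiment, plus noise.
rng = np.random.default_rng(0)
sentiment = np.linspace(0.1, 0.5, 30)
price = 100 + 200 * sentiment + rng.normal(0, 15, size=30)

scores = {degree: in_sample_r2(sentiment, price, degree) for degree in (1, 2, 3)}
for degree, r2 in scores.items():
    print(degree, round(r2, 4), round(adjusted_r2(r2, len(price), degree), 4))
```

Each lower-degree polynomial is a special case of the next one, so the plain R-squared cannot decrease as the degree grows, even though the true relationship in this toy data is linear. The adjusted score, or a score on held-out data, is a fairer basis for comparing the fits.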
Linear Regression to predict positive reviews
To answer the third and final question, the linear regression model predicted the positive sentiment scores well. It explained about 36.5% of the variation in the scores, with a very low root mean squared error of 0.13. This result is good, considering that the number of observations for the sentiment scores was reduced from 4481 to 2684. Perhaps with more data points, the model could perform better.
The inputs of the model were the qualities of a listing (the same ones used in the first model), the price fluctuation for a listing, the qualities of the host, etc. As established above, the result, despite the good performance score, needs to be taken with a pinch of salt, since assuming a linear relationship between prices (and other factors) and the positive sentiment score may not be accurate.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) approach to solving a data science problem gives the analysis a much-needed flow, coherent with the flow of thought. Since the dawn of the 21st century, data has become exponentially more available, yet its novelty has not worn off. Working on this project has led to immense learning: not just about how an end-to-end analysis works, but also about Singapore!
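For reference, the root mean squared error quoted above is the square root of the average squared gap between actual and predicted scores. A minimal sketch with made-up numbers:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical sentiment scores; every prediction is off by 0.10.
actual = [0.30, 0.45, 0.20, 0.50]
predicted = [0.40, 0.35, 0.30, 0.40]
error = rmse(actual, predicted)
print(round(error, 3))
```

The error is expressed in the same units as the score itself, so it is best read against the range of the sentiment scores being predicted.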
Visit me on Github, and view the full repository of the project!