preloader

Predictive model for AirBnB prices in Istambul

illustrations illustrations illustrations illustrations illustrations illustrations illustrations
Predictive model for AirBnB prices in Istambul

Date

Oct 20, 2020

Client

Self-work

Categories

Data science, modelling, predictive models

Project Link

projects.fivethirtyeight.com/trump-approval-ratings

Summary

In the following Assignment, we forecast the total cost for two tourists staying four nights in an Airbnb in Istanbul. To carry out this order, we have downloaded the data from insideairbnb.com.

We carried out the following steps to prepare the Data:

  1. First, we imported the data with “vroom”. Then we looked at the data set and determined the variables that were important for us. We also examined the data in detail and decided how we would clean up the data set.

  2. For our second step, we started to cleanse the data. We treated missing data, changed the property to the correct variables (character variables to numeric values).

  3. We also created new variables. For example, we clustered the apartment type, taking the four largest and filtering the rest to “Others” or we counted the minimum nights and filtered for minimum nights less than or equal to four since we determine the price for tourists who only want to stay four nights.

  4. We have also noticed that our data set has many outliers. To get rid of outliers, we have taken out prices above 2000 and equal to 0, because we believe that these prices are unrealistic since there, for example, no places for free

After we cleaned, adjusted and visualised the data, we started modelling with different factors. After variations of models with different factors, we found our model with the best fitted R-square. Then we filtered by our assumptions, such as two bedrooms, one bathroom, located in the city centre, and found our predicted price with a confidence interval of 95%. Our model predicted a price for two people to stay four nights in Istanbul of USD 670.8, with a 95% confidence interval between USD 644.74 and USD 697.92.

Preparing the data - Data wrangling, handling missing values, outliers, etc

We load the data:

listings <- vroom("http://data.insideairbnb.com/turkey/marmara/istanbul/2020-06-28/data/listings.csv.gz")
glimpse(listings)
## Rows: 23,728
## Columns: 106
## $ id                                           <dbl> 4826, 20815, 27271, 2827…
## $ listing_url                                  <chr> "https://www.airbnb.com/…
## $ scrape_id                                    <dbl> 2.02e+13, 2.02e+13, 2.02…
## $ last_scraped                                 <date> 2020-06-28, 2020-06-29,…
## $ name                                         <chr> "The Place", "The Bospho…
## $ summary                                      <chr> "My place is close to gr…
## $ space                                        <chr> "A double bed apartment …
## $ description                                  <chr> "My place is close to gr…
## $ experiences_offered                          <chr> "none", "none", "none", …
## $ neighborhood_overview                        <chr> NA, "The lovely neighbor…
## $ notes                                        <chr> NA, "The house may be su…
## $ transit                                      <chr> NA, "The city center, Ta…
## $ access                                       <chr> NA, "Our dear guests may…
## $ interaction                                  <chr> NA, "Depending on our ti…
## $ house_rules                                  <chr> NA, "- Windows facing th…
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url                                  <chr> "https://a0.muscache.com…
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id                                      <dbl> 6603, 78838, 117026, 121…
## $ host_url                                     <chr> "https://www.airbnb.com/…
## $ host_name                                    <chr> "Kaan", "Gülder", "Mutlu…
## $ host_since                                   <date> 2009-01-14, 2010-02-08,…
## $ host_location                                <chr> "Istanbul, Istanbul, Tur…
## $ host_about                                   <chr> "Hello...\r\nI am Kaan a…
## $ host_response_time                           <chr> "N/A", "N/A", "N/A", "N/…
## $ host_response_rate                           <chr> "N/A", "N/A", "N/A", "N/…
## $ host_acceptance_rate                         <chr> "N/A", "N/A", "50%", "10…
## $ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, FAL…
## $ host_thumbnail_url                           <chr> "https://a0.muscache.com…
## $ host_picture_url                             <chr> "https://a0.muscache.com…
## $ host_neighbourhood                           <chr> "Üsküdar", "Beşiktaş", "…
## $ host_listings_count                          <dbl> 1, 2, 1, 20, 1, 1, 1, 2,…
## $ host_total_listings_count                    <dbl> 1, 2, 1, 20, 1, 1, 1, 2,…
## $ host_verifications                           <chr> "['email', 'phone', 'fac…
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified                       <lgl> FALSE, FALSE, TRUE, FALS…
## $ street                                       <chr> "Istanbul Province, Ista…
## $ neighbourhood                                <chr> "Üsküdar", "Beşiktaş", "…
## $ neighbourhood_cleansed                       <chr> "Uskudar", "Besiktas", "…
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, …
## $ city                                         <chr> "Istanbul Province", "Is…
## $ state                                        <chr> "Istanbul", NA, NA, NA, …
## $ zipcode                                      <chr> "34684", "34345", "34433…
## $ market                                       <chr> "Istanbul", "Istanbul", …
## $ smart_location                               <chr> "Istanbul Province, Turk…
## $ country_code                                 <chr> "TR", "TR", "TR", "TR", …
## $ country                                      <chr> "Turkey", "Turkey", "Tur…
## $ latitude                                     <dbl> 41.1, 41.1, 41.0, 41.0, …
## $ longitude                                    <dbl> 29.1, 29.0, 29.0, 29.0, …
## $ is_location_exact                            <lgl> FALSE, TRUE, FALSE, TRUE…
## $ property_type                                <chr> "Apartment", "Apartment"…
## $ room_type                                    <chr> "Entire home/apt", "Enti…
## $ accommodates                                 <dbl> 2, 3, 2, 5, 2, 3, 2, 2, …
## $ bathrooms                                    <dbl> 1.0, 1.0, 1.0, 1.0, 1.0,…
## $ bedrooms                                     <dbl> 0, 2, 1, 1, 2, 1, 1, 1, …
## $ beds                                         <dbl> 1, 2, 1, 3, 2, 1, 1, 1, …
## $ bed_type                                     <chr> "Real Bed", "Real Bed", …
## $ amenities                                    <chr> "{TV,\"Cable TV\",Intern…
## $ square_feet                                  <dbl> 700, NA, NA, 753, 700, 0…
## $ price                                        <chr> "$720.00", "$816.00", "$…
## $ weekly_price                                 <chr> NA, "$1,556.00", "$1,769…
## $ monthly_price                                <chr> NA, "$5,327.00", "$6,307…
## $ security_deposit                             <chr> NA, "$679.00", "$769.00"…
## $ cleaning_fee                                 <chr> NA, NA, "$308.00", "$77.…
## $ guests_included                              <dbl> 2, 4, 2, 2, 6, 1, 1, 2, …
## $ extra_people                                 <chr> "$178.00", "$240.00", "$…
## $ minimum_nights                               <dbl> 1, 365, 30, 3, 3, 3, 1, …
## $ maximum_nights                               <dbl> 730, 900, 90, 360, 60, 1…
## $ minimum_minimum_nights                       <dbl> 1, 365, 30, 3, 3, 3, 1, …
## $ maximum_minimum_nights                       <dbl> 1, 365, 30, 3, 3, 3, 1, …
## $ minimum_maximum_nights                       <dbl> 730, 900, 90, 360, 60, 1…
## $ maximum_maximum_nights                       <dbl> 730, 900, 90, 360, 60, 1…
## $ minimum_nights_avg_ntm                       <dbl> 1, 365, 30, 3, 3, 3, 1, …
## $ maximum_nights_avg_ntm                       <dbl> 730, 900, 90, 360, 60, 1…
## $ calendar_updated                             <chr> "38 months ago", "7 mont…
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30                              <dbl> 30, 13, 28, 30, 28, 29, …
## $ availability_60                              <dbl> 60, 26, 58, 60, 58, 59, …
## $ availability_90                              <dbl> 90, 36, 80, 90, 88, 89, …
## $ availability_365                             <dbl> 365, 279, 289, 365, 88, …
## $ calendar_last_scraped                        <date> 2020-06-28, 2020-06-29,…
## $ number_of_reviews                            <dbl> 1, 41, 13, 0, 0, 0, 1, 1…
## $ number_of_reviews_ltm                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ first_review                                 <date> 2009-06-01, 2010-03-24,…
## $ last_review                                  <date> 2009-06-01, 2018-11-07,…
## $ review_scores_rating                         <dbl> 100, 90, 98, NA, NA, NA,…
## $ review_scores_accuracy                       <dbl> NA, 9, 10, NA, NA, NA, N…
## $ review_scores_cleanliness                    <dbl> NA, 9, 9, NA, NA, NA, NA…
## $ review_scores_checkin                        <dbl> NA, 10, 10, NA, NA, NA, …
## $ review_scores_communication                  <dbl> NA, 10, 10, NA, NA, NA, …
## $ review_scores_location                       <dbl> NA, 10, 10, NA, NA, NA, …
## $ review_scores_value                          <dbl> NA, 9, 10, NA, NA, NA, N…
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, FAL…
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable                             <lgl> FALSE, FALSE, FALSE, TRU…
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy                          <chr> "flexible", "moderate", …
## $ require_guest_profile_picture                <lgl> FALSE, TRUE, FALSE, FALS…
## $ require_guest_phone_verification             <lgl> FALSE, FALSE, FALSE, FAL…
## $ calculated_host_listings_count               <dbl> 1, 2, 1, 19, 1, 1, 1, 2,…
## $ calculated_host_listings_count_entire_homes  <dbl> 1, 1, 1, 6, 1, 0, 0, 1, …
## $ calculated_host_listings_count_private_rooms <dbl> 0, 1, 0, 0, 0, 1, 1, 1, …
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month                            <dbl> 0.01, 0.33, 0.19, NA, NA…

We can rapidly see our listings data frame using glimpse() function. Below is a description of the most important variables for this project:

  • price: cost per night

  • cleaning_fee: cleaning fee

  • extra_people: charge for having more than 1 person

  • property_type: type of accommodation (House, Apartment, etc.)

  • room_type:

    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing

  • review_scores_rating: Average review score (0 - 100)

  • longitude, latitude: geographical coordinates to help us locate the listing

  • neighbourhood: three variables on a few major neighbourhoods

The first step is to prepare the data for our work, by modifying the variables that need so.

With the glimpse() function before, we have seen that price, cleaning_fee and extra_people is a character variable, but it would be more useful and logical for them to be numeric (double). We transform those variables with the following code.

listing <- listings %>% 
  mutate(price = parse_number(price)) %>% 
  mutate(weekly_price = parse_number(weekly_price)) %>%
  mutate(monthly_price = parse_number(monthly_price)) %>%
  mutate(security_deposit = parse_number(security_deposit)) %>% 
  mutate(cleaning_fee = parse_number(cleaning_fee)) %>% 
  mutate(extra_people = parse_number(extra_people))

#With typeof() we check that the type of variable has changed correctly to double
typeof(listing$price)
## [1] "double"
typeof(listing$weekly_price)
## [1] "double"
typeof(listing$monthly_price)
## [1] "double"
typeof(listing$security_deposit)
## [1] "double"
typeof(listing$cleaning_fee)
## [1] "double"
typeof(listing$extra_people)
## [1] "double"

We skim() through the data set to get some more information. For example, for every data set it is worth checking if there is something weird regarding missing values, such as important variables having a lot of (or all) missing values.

skim(listing)
Table 1: Data summary
Name listing
Number of rows 23728
Number of columns 106
_______________________
Column type frequency:
character 40
Date 5
logical 16
numeric 45
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 23728 0
name 54 1.00 1 108 0 22685 0
summary 3779 0.84 1 1000 0 17202 1
space 11361 0.52 1 1000 0 10575 0
description 2876 0.88 1 1000 0 18952 0
experiences_offered 0 1.00 4 4 0 1 0
neighborhood_overview 12658 0.47 1 1000 0 8867 0
notes 18474 0.22 1 1000 0 4221 0
transit 13660 0.42 1 1000 0 8147 0
access 16337 0.31 1 1000 0 5943 0
interaction 15019 0.37 1 1000 0 6607 0
house_rules 16407 0.31 1 1000 0 6306 0
picture_url 0 1.00 80 146 0 22996 0
host_url 0 1.00 38 43 0 14450 0
host_name 1 1.00 1 35 0 4907 0
host_location 83 1.00 2 105 0 775 0
host_about 11902 0.50 1 5717 0 6144 10
host_response_time 1 1.00 3 18 0 5 0
host_response_rate 1 1.00 2 4 0 60 0
host_acceptance_rate 1 1.00 2 4 0 85 0
host_thumbnail_url 1 1.00 55 106 0 14358 0
host_picture_url 1 1.00 57 109 0 14358 0
host_neighbourhood 15027 0.37 4 33 0 59 0
host_verifications 0 1.00 2 158 0 277 0
street 0 1.00 10 116 0 1180 0
neighbourhood 5377 0.77 4 15 0 15 0
neighbourhood_cleansed 0 1.00 4 13 0 39 0
city 773 0.97 2 69 0 641 0
state 397 0.98 1 58 0 293 0
zipcode 2422 0.90 1 43 0 388 0
market 0 1.00 7 21 0 3 0
smart_location 0 1.00 6 77 0 706 0
country_code 0 1.00 2 2 0 3 0
country 0 1.00 6 12 0 3 0
property_type 0 1.00 3 22 0 43 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1297 0 21255 0
calendar_updated 0 1.00 5 14 0 104 0
cancellation_policy 0 1.00 6 27 0 6 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2020-06-28 2020-06-30 2020-06-29 3
host_since 1 1.00 2009-01-14 2020-06-27 2017-08-27 3095
calendar_last_scraped 0 1.00 2020-06-28 2020-06-30 2020-06-29 3
first_review 12375 0.48 2009-06-01 2020-06-29 2019-04-12 2216
last_review 12375 0.48 2009-06-01 2020-06-29 2020-01-01 1424

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnail_url 23728 0 NaN :
medium_url 23728 0 NaN :
xl_picture_url 23728 0 NaN :
host_is_superhost 1 1 0.11 FAL: 21147, TRU: 2580
host_has_profile_pic 1 1 1.00 TRU: 23625, FAL: 102
host_identity_verified 1 1 0.16 FAL: 19951, TRU: 3776
neighbourhood_group_cleansed 23728 0 NaN :
is_location_exact 0 1 0.35 FAL: 15329, TRU: 8399
has_availability 0 1 1.00 TRU: 23728
requires_license 0 1 0.00 FAL: 23728
license 23728 0 NaN :
jurisdiction_names 23728 0 NaN :
instant_bookable 0 1 0.60 TRU: 14249, FAL: 9479
is_business_travel_ready 0 1 0.00 FAL: 23728
require_guest_profile_picture 0 1 0.01 FAL: 23571, TRU: 157
require_guest_phone_verification 0 1 0.01 FAL: 23526, TRU: 202

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.91e+07 1.31e+07 4.83e+03 2.10e+07 3.40e+07 3.97e+07 4.40e+07 ▂▂▂▃▇
scrape_id 0 1.00 2.02e+13 0.00e+00 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ▁▁▇▁▁
host_id 0 1.00 1.49e+08 1.16e+08 6.60e+03 3.29e+07 1.48e+08 2.59e+08 3.52e+08 ▇▂▃▅▃
host_listings_count 1 1.00 2.43e+01 2.24e+02 0.00e+00 1.00e+00 1.00e+00 4.00e+00 3.77e+03 ▇▁▁▁▁
host_total_listings_count 1 1.00 2.43e+01 2.24e+02 0.00e+00 1.00e+00 1.00e+00 4.00e+00 3.77e+03 ▇▁▁▁▁
latitude 0 1.00 4.10e+01 5.00e-02 4.08e+01 4.10e+01 4.10e+01 4.10e+01 4.15e+01 ▁▇▁▁▁
longitude 0 1.00 2.90e+01 1.30e-01 2.80e+01 2.90e+01 2.90e+01 2.90e+01 2.99e+01 ▁▁▇▁▁
accommodates 0 1.00 3.21e+00 2.25e+00 1.00e+00 2.00e+00 2.00e+00 4.00e+00 1.60e+01 ▇▁▁▁▁
bathrooms 86 1.00 1.21e+00 1.04e+00 0.00e+00 1.00e+00 1.00e+00 1.00e+00 5.00e+01 ▇▁▁▁▁
bedrooms 173 0.99 1.39e+00 1.44e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 5.00e+01 ▇▁▁▁▁
beds 698 0.97 2.05e+00 2.04e+00 0.00e+00 1.00e+00 1.00e+00 2.00e+00 7.70e+01 ▇▁▁▁▁
square_feet 23492 0.01 6.05e+02 1.21e+03 0.00e+00 7.00e+01 5.38e+02 8.07e+02 1.74e+04 ▇▁▁▁▁
price 0 1.00 4.85e+02 1.97e+03 0.00e+00 1.37e+02 2.47e+02 4.46e+02 7.69e+04 ▇▁▁▁▁
weekly_price 21993 0.07 2.95e+03 4.13e+03 8.50e+01 1.04e+03 2.06e+03 3.52e+03 8.57e+04 ▇▁▁▁▁
monthly_price 22031 0.07 1.02e+04 1.32e+04 2.06e+02 3.46e+03 7.31e+03 1.27e+04 3.26e+05 ▇▁▁▁▁
security_deposit 15623 0.34 7.26e+02 2.38e+03 0.00e+00 0.00e+00 0.00e+00 7.00e+02 3.52e+04 ▇▁▁▁▁
cleaning_fee 13660 0.42 1.27e+02 1.78e+02 0.00e+00 0.00e+00 8.00e+01 1.92e+02 4.57e+03 ▇▁▁▁▁
guests_included 0 1.00 1.40e+00 1.09e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.60e+01 ▇▁▁▁▁
extra_people 0 1.00 3.01e+01 7.53e+01 0.00e+00 0.00e+00 0.00e+00 4.00e+01 2.06e+03 ▇▁▁▁▁
minimum_nights 0 1.00 4.53e+00 2.76e+01 1.00e+00 1.00e+00 1.00e+00 3.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights 0 1.00 9.13e+04 1.39e+07 1.00e+00 6.00e+01 1.12e+03 1.12e+03 2.15e+09 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 4.41e+00 2.68e+01 1.00e+00 1.00e+00 1.00e+00 2.00e+00 1.12e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 4.71e+00 2.86e+01 1.00e+00 1.00e+00 1.00e+00 3.00e+00 1.12e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 9.28e+02 9.68e+03 1.00e+00 3.60e+02 1.12e+03 1.12e+03 1.00e+06 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 9.29e+02 9.68e+03 1.00e+00 3.60e+02 1.12e+03 1.12e+03 1.00e+06 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 4.51e+00 2.70e+01 1.00e+00 1.00e+00 1.00e+00 3.00e+00 1.12e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 9.29e+02 9.68e+03 1.00e+00 3.60e+02 1.12e+03 1.12e+03 1.00e+06 ▇▁▁▁▁
availability_30 0 1.00 2.21e+01 1.21e+01 0.00e+00 1.70e+01 2.90e+01 3.00e+01 3.00e+01 ▂▁▁▁▇
availability_60 0 1.00 4.53e+01 2.38e+01 0.00e+00 4.10e+01 5.90e+01 6.00e+01 6.00e+01 ▂▁▁▁▇
availability_90 0 1.00 6.90e+01 3.51e+01 0.00e+00 6.60e+01 8.90e+01 9.00e+01 9.00e+01 ▂▁▁▁▇
availability_365 0 1.00 2.28e+02 1.47e+02 0.00e+00 8.90e+01 3.02e+02 3.65e+02 3.65e+02 ▃▂▂▁▇
number_of_reviews 0 1.00 7.87e+00 2.32e+01 0.00e+00 0.00e+00 0.00e+00 4.00e+00 3.45e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 3.04e+00 7.47e+00 0.00e+00 0.00e+00 0.00e+00 2.00e+00 1.08e+02 ▇▁▁▁▁
review_scores_rating 12978 0.45 9.13e+01 1.40e+01 2.00e+01 9.00e+01 9.60e+01 1.00e+02 1.00e+02 ▁▁▁▁▇
review_scores_accuracy 12991 0.45 9.29e+00 1.42e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_cleanliness 12988 0.45 9.06e+00 1.51e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▂▇
review_scores_checkin 12991 0.45 9.52e+00 1.28e+00 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_communication 12987 0.45 9.55e+00 1.24e+00 2.00e+00 1.00e+01 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_location 12991 0.45 9.44e+00 1.23e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
review_scores_value 12993 0.45 9.19e+00 1.40e+00 2.00e+00 9.00e+00 1.00e+01 1.00e+01 1.00e+01 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 5.86e+00 1.65e+01 1.00e+00 1.00e+00 2.00e+00 5.00e+00 1.76e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 2.81e+00 6.37e+00 0.00e+00 0.00e+00 1.00e+00 2.00e+00 6.60e+01 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 2.46e+00 1.51e+01 0.00e+00 0.00e+00 1.00e+00 1.00e+00 1.75e+02 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 9.00e-02 6.00e-01 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.10e+01 ▇▁▁▁▁
reviews_per_month 12375 0.48 7.10e-01 9.00e-01 1.00e-02 1.30e-01 3.30e-01 9.50e-01 9.20e+00 ▇▁▁▁▁

It calls our attention that there are 13600 missing values for cleaning_fee. However, this is an example of data that is missing not at random, since there is a specific pattern/explanation: quite probably, the cleaning fee for those properties is $0 and thus a missing value is actually a 0. We add this modification below.

listing <- listing %>%
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ 0, 
    TRUE ~ cleaning_fee
  ))

Now, let’s examine property_type, as it is an important variable to describe the general characteristics one should expect from the property. We can use the count() function to determine how many categories are there and their frequency.

#List of the type of properties and their frequency:
listing_type <- listing %>% 
  count(property_type) %>% 
  arrange(desc(n)) %>% 
  mutate(share = n / sum(n), share=scales::percent(share))
listing_type
## # A tibble: 43 x 3
##    property_type          n share   
##    <chr>              <int> <chr>   
##  1 Apartment          14958 63.0394%
##  2 Serviced apartment  1700 7.1645% 
##  3 House               1564 6.5914% 
##  4 Boutique hotel      1113 4.6907% 
##  5 Townhouse            692 2.9164% 
##  6 Condominium          629 2.6509% 
##  7 Aparthotel           590 2.4865% 
##  8 Bed and breakfast    585 2.4654% 
##  9 Hotel                545 2.2969% 
## 10 Loft                 436 1.8375% 
## # … with 33 more rows
#Top 4 categories represent the vast majority of properties.
#Calculate this percentage:
listing_type_sum <- listing_type %>% 
  summarise(share = n / sum(n)) %>% 
  head(4) %>% 
  summarise(sum = sum(share))
listing_type_sum
## # A tibble: 1 x 1
##     sum
##   <dbl>
## 1 0.815

We see that the four most common property types are Apartment, Serviced apartment, House, and Boutique hotel; with 81.5% of properties being represented by these four categories.

Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other.

listing <- listing %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Apartment","Serviced apartment", "House","Boutique hotel") ~ property_type, 
    TRUE ~ "Other"
  ))

#Check that prop_type_simplified was correctly made:
listing_prop_sum <- listing %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))
listing_prop_sum
## # A tibble: 43 x 3
##    property_type      prop_type_simplified     n
##    <chr>              <chr>                <int>
##  1 Apartment          Apartment            14958
##  2 Serviced apartment Serviced apartment    1700
##  3 House              House                 1564
##  4 Boutique hotel     Boutique hotel        1113
##  5 Townhouse          Other                  692
##  6 Condominium        Other                  629
##  7 Aparthotel         Other                  590
##  8 Bed and breakfast  Other                  585
##  9 Hotel              Other                  545
## 10 Loft               Other                  436
## # … with 33 more rows

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes.

Thus, our last step in preparing the data for analysis will be to study minimum_nights variable in order to define the most common values and how they are related to the different purposes of Airbnb listings. Remember, we want to exclude other purposes different than traveling, such as short-term rentals for work, etc.

#We check the different values for minimum_nights and their frequency:
listing_nights <- listing %>% 
  count(minimum_nights) %>% 
  arrange(desc(n))
listing_nights  
## # A tibble: 65 x 2
##    minimum_nights     n
##             <dbl> <int>
##  1              1 13228
##  2              2  4511
##  3              3  2682
##  4              7   702
##  5              5   701
##  6              4   490
##  7             30   324
##  8             10   185
##  9             15   179
## 10              6   138
## # … with 55 more rows

We see that the most common value for minimum_nights is 1, following by 2 and 3. These values look like what we would expect from any property listed for tourists. However, 7 and 30 are also relatively common minimum night requirements, reflecting one week and one month stays, respectively. These are reflections of common time intervals that hosts would like to rent their properties for, likely not for tourism purposes, at least not for the average stay for travelers. These hosts could be looking for less turnover on their properties over longer periods of time by having fewer independent guests, targeting a different market from tourism, such as short-term rents for business purposes or newcomers to Istanbul.

Hence, those properties which are not destined for tourists should be out of the scope of our analysis. For that, we will filter out all properties with a minimum stay of more than 4 nights.

listing <- listing %>% 
  filter(minimum_nights <= 4)

Dealing with outliers

Additionally, before starting to build models we should explore if there are outliers that could affect them. We want to build predictive models using data of normal properties, discarding especial situations such as staying at a palace for a price 100 times higher than the average hotel. We can use histograms to check those possible outliers in price.

ggplot(listing, mapping = aes(x = price)) +
  geom_histogram() +
  scale_x_continuous() +
  labs(title = "Distribution of price",
       y = "Number of properties") +
  theme_bw()

We can see extreme outliers in the price variable. As we said before, we are not interested in these extreme cases for our predictive model, as these can be properties with very special characteristics, very luxurious, or with an absurd price because the host does not want to acomodate hosts at that moment. We will filter out the values above 5000, and run the plot again.

listing_clean <- listing %>% 
  filter(price<5000)

ggplot(listing_clean, mapping = aes(x = price)) +
  geom_histogram() +
  scale_x_continuous() +
  labs(title = "Distribution of price",
       y = "Number of properties") +
  theme_bw()

We still see a very fat tail in the last histogram, so we will make one last correction, filtering out outliers with a price greater that 2000, and also discarding the properties with a price of 0, as a property for free does not make sense.

#We will also filter out, just in case, values with a price of 0.
listing_clean <- listing %>% 
  filter(price<2000, price>0)

ggplot(listing_clean, mapping = aes(y = reorder(prop_type_simplified, price, median), x = price)) +
  geom_boxplot() +
  scale_x_continuous() +
  labs(title = "Distribution of price and outliers",
       y = "Type of property") +
  theme_bw()

After dealing with the outliers, the data set is ready for our analysis.

Mapping

Visualisations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While it is not the main objective of this project to go into geospatial visualisations with R, we will make a map of Istanbul, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so we can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.

The following code, having created a dataframe listings with all AirbnB listings in Bordeaux, will plot on the map all AirBnBs where minimum_nights is less than equal to four (4).

leaflet(data = filter(listing, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

Regression analysis

This is the core of this project. We are going develop a regression model to predict the cost for two guests for staying 4 nights in the city.

For the target variable “y”, we will use the cost for two people to stay at an Airbnb location for four (4) nights. For this, we create a new variable called price_4_nights that uses price, cleaning_fee, guests_included, and extra_people to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable “y” we want to explain.

Now, we will use an histogram to examine the distributions of price_4_nights.

listing_price <- listing_clean %>% 
  mutate(price_4_nights = (price * 4) + cleaning_fee + (if_else(guests_included < 2, extra_people, 0)))

ggplot(listing_price, aes(x = price_4_nights)) +
        geom_histogram() +
        labs(title = "Distribution of variable price_4_nights")

As the data is clearly right-skewed, we should use log(price_4_nights) for our regression model, as a majority of the values are clustered at the low end of the scale. This means that the high price listings create a positively skewed distribution that is difficult to interpret without a logarithmic scale.

#Create log(price_4_nights)
listing_price <- listing_price %>% 
  mutate(log_price_4_nights = log(price_4_nights))
#Plot histogram with price_4_nights in logarithmic scale
ggplot(listing_price, aes(x = log_price_4_nights)) +
        geom_histogram() +
        labs(title = "Distribution of variable ln(price_4_nights)")

Following, we will use ggpairs to produce a correlation scatterplot with selected numerical variables from the dataset.

#using ggpairs to explore the relationship between some numeric variables of interest
listing_price %>% 
  select(log_price_4_nights, number_of_reviews, review_scores_rating, reviews_per_month, bathrooms, bedrooms, beds, accommodates) %>% 
  ggpairs(aes(alpha=0.1))+
  theme_bw()

As we see in the scatterplot, there is not very high correlation between log_price_4_nights and the other numerical variables, but they are still significant so we are going to test these variables in our model.

In the next step, we fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

We will also use the package broom::“tidy models”, with: * tidy() for model coefficients * glance() for model fit

model1 <- lm(log_price_4_nights ~ prop_type_simplified +
                                    number_of_reviews +
                                    review_scores_rating,
             data = listing_price)

model1 %>% broom::tidy(conf.int = TRUE)
## # A tibble: 7 x 7
##   term                  estimate std.error statistic  p.value conf.low conf.high
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)            6.79     0.0501     136.    0.        6.69e+0   6.89   
## 2 prop_type_simplified…  0.00960  0.0332       0.289 7.72e- 1 -5.55e-2   0.0747 
## 3 prop_type_simplified… -0.0518   0.0326      -1.59  1.11e- 1 -1.16e-1   0.0120 
## 4 prop_type_simplified… -0.00479  0.0212      -0.226 8.21e- 1 -4.63e-2   0.0367 
## 5 prop_type_simplified…  0.192    0.0283       6.78  1.25e-11  1.36e-1   0.247  
## 6 number_of_reviews      0.00198  0.000225     8.77  2.02e-18  1.53e-3   0.00242
## 7 review_scores_rating   0.00135  0.000534     2.53  1.14e- 2  3.04e-4   0.00240
model1 %>% broom::glance()
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1    0.0141        0.0135 0.721      22.7 1.09e-26     6 -10393. 20802. 20860.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

As we obtained an adjusted R-squared value of 0.0135 in model1, we will assess if there are more significant variables to add to our model. Also, we can confirm our choice of log(price_4_nights) against price_4_nights as target variable by running the model with the latter, which gave a lower R-squared value, 0.0084 —that is, “worse” in terms of predictive power.

First, we look at the coefficient of review_scores_rating in terms of price_4_nights. It is negatively correlated with price, and this can happen because guests may be more willing to give higher ratings for properties that did not cost them a lot. Similarly, those paying for more expensive properties probably expect more from their stay and are more likely to give unflattering reviews for minor inconveniences.

Then, looking at the Property type (variable prop_type_simplified), we know that it is a categorical variable, so the coefficients for each of the property types is a reflection of the expected pricing discrepancy between the average apartment, which is the baseline, and the average Boutique hotel, House, etc. Because apartments represent the most basic offering on Airbnb, it makes sense that other, less common listings like boutique hotels and serviced apartments would command a higher price. The negative coefficient on houses is likely more a function of location, as houses in a major city like Istanbul would likely be much further from the city center than apartments.

We would also like to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. For that, we fit another regression model called model2 that includes all of the explanantory variables in model1 plus room_type.

model2 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type, data = listing_price)
model2 %>% broom::tidy(conf.int = TRUE)
## # A tibble: 10 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          7.03e+0  0.0424     166.    0.         6.95e+0  7.11    
##  2 prop_type_simplifi…  3.92e-1  0.0313      12.5   1.05e- 35  3.31e-1  0.454   
##  3 prop_type_simplifi…  1.74e-2  0.0274       0.633 5.27e-  1 -3.64e-2  0.0711  
##  4 prop_type_simplifi…  1.75e-1  0.0186       9.42  5.62e- 21  1.39e-1  0.211   
##  5 prop_type_simplifi…  1.03e-1  0.0243       4.24  2.30e-  5  5.53e-2  0.151   
##  6 number_of_reviews    1.21e-4  0.000192     0.629 5.29e-  1 -2.56e-4  0.000497
##  7 review_scores_rati…  2.15e-3  0.000450     4.78  1.77e-  6  1.27e-3  0.00303 
##  8 room_typeHotel room -3.63e-1  0.0322     -11.3   2.24e- 29 -4.26e-1 -0.300   
##  9 room_typePrivate r… -8.51e-1  0.0141     -60.4   0.        -8.79e-1 -0.824   
## 10 room_typeShared ro… -1.20e+0  0.0540     -22.2   2.43e-106 -1.30e+0 -1.09
model2 %>% broom::glance()
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.301         0.301 0.607      456.       0     9 -8754. 17530. 17608.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

After running the model, we see that room_type is a significant predictor of the price for 4 nights, as each of the room type categorical variables is statistically significant at the 95% confidence level: their p-value are less than the usual significance level of 0.05.

Let’s keep looking for which predictors are significant for our target variable. For example, it is probable that some other characteristics of the properties will help us to predict its price. So we will add to our model the variables bathrooms, bedrooms, beds, and accommodates (size of the house).

model3 <- lm(log_price_4_nights ~ prop_type_simplified +
                                  number_of_reviews + 
                                  review_scores_rating + 
                                  room_type +
                                  bathrooms +
                                  bedrooms +
                                  beds +
                                  accommodates,
             data = listing_price)
model3 %>% broom::tidy(conf.int = TRUE)
## # A tibble: 14 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.49e+0  0.0420     155.    0.         6.41e+0  6.57    
##  2 prop_type_simplif…   3.48e-1  0.0290      12.0   6.35e- 33  2.92e-1  0.405   
##  3 prop_type_simplif…  -8.42e-3  0.0253      -0.333 7.39e-  1 -5.81e-2  0.0412  
##  4 prop_type_simplif…   1.42e-1  0.0173       8.21  2.57e- 16  1.08e-1  0.176   
##  5 prop_type_simplif…   9.23e-2  0.0224       4.12  3.79e-  5  4.84e-2  0.136   
##  6 number_of_reviews    4.15e-5  0.000177     0.234 8.15e-  1 -3.06e-4  0.000389
##  7 review_scores_rat…   2.63e-3  0.000418     6.30  3.07e- 10  1.81e-3  0.00345 
##  8 room_typeHotel ro…  -1.98e-1  0.0302      -6.56  5.51e- 11 -2.57e-1 -0.139   
##  9 room_typePrivate …  -6.04e-1  0.0145     -41.7   0.        -6.32e-1 -0.576   
## 10 room_typeShared r…  -1.00e+0  0.0501     -20.0   5.15e- 87 -1.10e+0 -0.903   
## 11 bathrooms           -4.26e-2  0.00956     -4.45  8.62e-  6 -6.13e-2 -0.0238  
## 12 bedrooms             3.92e-2  0.00834      4.71  2.54e-  6  2.29e-2  0.0556  
## 13 beds                -4.02e-3  0.00447     -0.899 3.69e-  1 -1.28e-2  0.00474 
## 14 accommodates         1.22e-1  0.00415     29.4   2.46e-182  1.14e-1  0.130
model3 %>% broom::glance()
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.411         0.410 0.557      505.       0    13 -7847. 15725. 15832.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

We see that all of the variables added are significant, especially the size of the house (accommodates). Thus, we will add them to our model.

Now, we will also check if superhosts (host_is_superhost) command a pricing premium, after controlling for other variables.

model4 <- lm(log_price_4_nights ~ prop_type_simplified +
                                  number_of_reviews + 
                                  review_scores_rating + 
                                  room_type +
                                  bathrooms +
                                  bedrooms +
                                  beds +
                                  accommodates +
                                  host_is_superhost,
             data = listing_price)
model4 %>% broom::tidy(conf.int = TRUE)
## # A tibble: 15 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.53e+0  0.0421     155.    0.         6.45e+0  6.61    
##  2 prop_type_simplifi…  3.52e-1  0.0290      12.1   1.17e- 33  2.95e-1  0.408   
##  3 prop_type_simplifi… -1.01e-2  0.0252      -0.398 6.90e-  1 -5.95e-2  0.0394  
##  4 prop_type_simplifi…  1.43e-1  0.0172       8.29  1.26e- 16  1.09e-1  0.177   
##  5 prop_type_simplifi…  1.04e-1  0.0224       4.63  3.64e-  6  5.99e-2  0.148   
##  6 number_of_reviews   -1.92e-4  0.000179    -1.07  2.86e-  1 -5.43e-4  0.000160
##  7 review_scores_rati…  2.03e-3  0.000425     4.78  1.80e-  6  1.20e-3  0.00286 
##  8 room_typeHotel room -1.86e-1  0.0301      -6.18  6.70e- 10 -2.45e-1 -0.127   
##  9 room_typePrivate r… -5.97e-1  0.0145     -41.3   0.        -6.26e-1 -0.569   
## 10 room_typeShared ro… -9.99e-1  0.0500     -20.0   3.71e- 87 -1.10e+0 -0.901   
## 11 bathrooms           -4.39e-2  0.00953     -4.60  4.18e-  6 -6.26e-2 -0.0252  
## 12 bedrooms             4.00e-2  0.00831      4.81  1.56e-  6  2.37e-2  0.0563  
## 13 beds                -4.07e-3  0.00446     -0.914 3.61e-  1 -1.28e-2  0.00466 
## 14 accommodates         1.21e-1  0.00414     29.3   2.16e-180  1.13e-1  0.129   
## 15 host_is_superhostT…  1.09e-1  0.0149       7.35  2.15e- 13  8.02e-2  0.139
model4 %>% broom::glance()
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.414         0.413 0.555      475.       0    14 -7820. 15673. 15787.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Indeed, we see that host_is_superhost is significant. Let us now explore the significance of the location of the properties. Our data set has several variables related to this. To begin with, most owners advertise the exact location of their listing (is_location_exact == TRUE), while a non-trivial proportion don’t. Let’s see if this is a significant predictor of price_4_nights.

model5 <- lm(log_price_4_nights ~ prop_type_simplified +
                                  number_of_reviews + 
                                  review_scores_rating + 
                                  room_type +
                                  bathrooms +
                                  bedrooms +
                                  beds +
                                  accommodates +
                                  host_is_superhost+
                                  is_location_exact,
             data = listing_price)
model5 %>% broom::tidy(conf.int = TRUE)
## # A tibble: 16 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.52e+0  0.0424     154.    0.         6.44e+0  6.60    
##  2 prop_type_simplifi…  3.48e-1  0.0290      12.0   7.51e- 33  2.91e-1  0.405   
##  3 prop_type_simplifi… -9.22e-3  0.0252      -0.365 7.15e-  1 -5.87e-2  0.0403  
##  4 prop_type_simplifi…  1.42e-1  0.0173       8.23  2.17e- 16  1.08e-1  0.176   
##  5 prop_type_simplifi…  1.02e-1  0.0224       4.56  5.10e-  6  5.83e-2  0.146   
##  6 number_of_reviews   -1.90e-4  0.000179    -1.06  2.90e-  1 -5.41e-4  0.000162
##  7 review_scores_rati…  2.03e-3  0.000425     4.79  1.69e-  6  1.20e-3  0.00287 
##  8 room_typeHotel room -1.88e-1  0.0301      -6.25  4.31e- 10 -2.47e-1 -0.129   
##  9 room_typePrivate r… -5.96e-1  0.0145     -41.1   0.        -6.24e-1 -0.567   
## 10 room_typeShared ro… -9.98e-1  0.0500     -20.0   5.61e- 87 -1.10e+0 -0.900   
## 11 bathrooms           -4.41e-2  0.00953     -4.63  3.74e-  6 -6.28e-2 -0.0254  
## 12 bedrooms             4.05e-2  0.00832      4.87  1.11e-  6  2.42e-2  0.0568  
## 13 beds                -3.97e-3  0.00446     -0.892 3.72e-  1 -1.27e-2  0.00476 
## 14 accommodates         1.21e-1  0.00414     29.2   9.58e-180  1.13e-1  0.129   
## 15 host_is_superhostT…  1.07e-1  0.0149       7.21  6.09e- 13  7.82e-2  0.137   
## 16 is_location_exactT…  2.47e-2  0.0118       2.09  3.66e-  2  1.54e-3  0.0478
model5 %>% broom::glance()
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.414         0.413 0.555      444.       0    15 -7818. 15670. 15792.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

We see that is_location_exact is significant, and this seems logical because low price properties can have their not-so-good location as a reason, so there is incentive for the hosts not to show the exact street or address.

Another variable that could be useful for our model is cancellation_policy. As with the previous ones, our hypothesis is that it will be significant, in this case because a flexible cancellation policy is in the interest of many guests thus it could derive in a price premium. As usual, we will confirm its significance by looking at the p-value after running model6.

model6 <- lm(log_price_4_nights ~ prop_type_simplified +
                                  number_of_reviews + 
                                  review_scores_rating + 
                                  room_type +
                                  bathrooms +
                                  bedrooms +
                                  beds +
                                  accommodates +
                                  is_location_exact +
                                  host_is_superhost+
                                  cancellation_policy,
             data = listing_price)

model6 %>% broom::tidy(conf.int = TRUE)
## # A tibble: 21 x 7
##    term                estimate std.error statistic   p.value conf.low conf.high
##    <chr>                  <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 (Intercept)          6.47e+0  0.0423    153.     0.         6.39e+0  6.56    
##  2 prop_type_simplifi…  3.47e-1  0.0288     12.0    3.58e- 33  2.91e-1  0.404   
##  3 prop_type_simplifi… -1.99e-3  0.0251     -0.0794 9.37e-  1 -5.11e-2  0.0472  
##  4 prop_type_simplifi…  1.46e-1  0.0171      8.50   2.17e- 17  1.12e-1  0.179   
##  5 prop_type_simplifi…  1.06e-1  0.0222      4.78   1.78e-  6  6.27e-2  0.150   
##  6 number_of_reviews   -4.73e-4  0.000180   -2.62   8.75e-  3 -8.26e-4 -0.000119
##  7 review_scores_rati…  1.92e-3  0.000422    4.55   5.44e-  6  1.09e-3  0.00275 
##  8 room_typeHotel room -1.84e-1  0.0299     -6.17   7.29e- 10 -2.43e-1 -0.126   
##  9 room_typePrivate r… -5.79e-1  0.0145    -40.0    2.00e-323 -6.07e-1 -0.550   
## 10 room_typeShared ro… -9.84e-1  0.0496    -19.8    9.13e- 86 -1.08e+0 -0.887   
## # … with 11 more rows
model6 %>% broom::glance()
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.423         0.422 0.551      345.       0    20 -7746. 15537. 15694.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Grouping properties by city area

Lastly, we saw when we skimmed through the data that there are 3 more variables related to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. If there are many different neighbourhoods in Istanbul, which is probably the case, it wouldn’t make sense to include them all in our model. First, we will check how many different neighbourhoods are in Istanbul, using neighbourhood_cleansed, which is equal to neighbourhood but excluding accents and special characters from the words. We must say also that neighbourhood_group_cleansed is not relevant, as it is missing in all observations.

#List of the different neighbourhoods and their frequency:
listing_neighbourhood <- listing_price %>% 
  count(neighbourhood_cleansed) %>% 
  arrange(desc(n)) %>% 
  mutate(share = n / sum(n), share=scales::percent(share))
listing_neighbourhood
## # A tibble: 39 x 3
##    neighbourhood_cleansed     n share   
##    <chr>                  <int> <chr>   
##  1 Beyoglu                 5738 28.0669%
##  2 Sisli                   2823 13.8085%
##  3 Fatih                   2671 13.0650%
##  4 Kadikoy                 1998 9.7730% 
##  5 Besiktas                1491 7.2931% 
##  6 Uskudar                  697 3.4093% 
##  7 Kagithane                596 2.9153% 
##  8 Esenyurt                 493 2.4115% 
##  9 Atasehir                 353 1.7267% 
## 10 Maltepe                  320 1.5653% 
## # … with 29 more rows

With 39 different neighbourhoods, we need to group them together so the majority of listings falls in fewer geographical areas. For this, we will create a new categorical variable neighbourhood_simplified and determine whether location is a predictor of price_4_nights.

The areas would be divided in a simple way that is applied by the average tourist when going to the city. First, there is the city centre, at the left of the Bosphorus, with the East area on the right (also called “Asian side”). Then, the remaining areas are the neighbourhoods of North, West, and finally the outskirts of the city.

#List of the different neighbourhood and their frequency:
listing_price <- listing_price %>%
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed %in% c("Catalca", "Silivri", "Buyukcekmece", "Sile", "Arnavutkoy", "Pendik", "Tuzla", "Beykoz", "Cekmekoy", "Adalar", "Kartal", "Sancaktepe", "Sultanbeyli") 
                                  ~ "Outskirt",
    neighbourhood_cleansed %in% c("Eyup", "Sariyer") 
                                  ~ "North",
    neighbourhood_cleansed %in% c("Beylikduzu", "Esenyurt", "Basaksehir", "Avcilar", "Kucukcekmece", "Bakirkoy", "Bahcelievler", "Bagcilar", "Esenler", "Sultangazi", "Gaziosmanpasa", "Gungoren", "Bayrampasa") 
                                  ~ "West",
    neighbourhood_cleansed %in% c("Uskudar", "Maltepe", "Atasehir", "Umraniye", "Kadikoy") 
                                  ~ "East (Asian side)", 
    neighbourhood_cleansed %in% c("Beyoglu", "Sisli", "Fatih", "Besiktas", "Kagithane", "Zeytinburnu") 
                                  ~ "Centre"
  ))

#Test that it has been created correctly:
#listing_test <- listing %>% 
#  count(neighbourhood_simplified) %>% 
#  arrange(desc(n)) %>% 
#  mutate(share = n / sum(n), share=scales::percent(share))
#listing_1_test

With the new geographical areas determined, we can add neighbourhood_simplified to the model and confirm its significance.

model7 <- lm(log_price_4_nights ~ prop_type_simplified +
                                  number_of_reviews + 
                                  review_scores_rating + 
                                  room_type +
                                  bathrooms +
                                  bedrooms +
                                  beds +
                                  accommodates +
                                  is_location_exact +
                                  cancellation_policy +
                                  neighbourhood_simplified+
                                  host_is_superhost+
                                  reviews_per_month,
             data = listing_price)
model7 %>% broom::tidy(conf.int = TRUE)
## # A tibble: 26 x 7
##    term                 estimate std.error statistic  p.value conf.low conf.high
##    <chr>                   <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
##  1 (Intercept)           6.54e+0  0.0416     157.    0.        6.45e+0   6.62   
##  2 prop_type_simplifie…  3.08e-1  0.0283      10.9   2.73e-27  2.52e-1   0.363  
##  3 prop_type_simplifie…  2.17e-2  0.0247       0.877 3.81e- 1 -2.68e-2   0.0701 
##  4 prop_type_simplifie…  1.43e-1  0.0169       8.45  3.49e-17  1.10e-1   0.176  
##  5 prop_type_simplifie…  1.07e-1  0.0218       4.92  8.99e- 7  6.43e-2   0.150  
##  6 number_of_reviews     7.34e-4  0.000233     3.15  1.63e- 3  2.77e-4   0.00119
##  7 review_scores_rating  2.54e-3  0.000415     6.13  9.40e-10  1.73e-3   0.00336
##  8 room_typeHotel room  -2.09e-1  0.0293      -7.13  1.10e-12 -2.66e-1  -0.152  
##  9 room_typePrivate ro… -5.80e-1  0.0143     -40.5   0.       -6.08e-1  -0.552  
## 10 room_typeShared room -1.00e+0  0.0486     -20.7   1.03e-92 -1.10e+0  -0.909  
## # … with 16 more rows
model7 %>% broom::glance()
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.449         0.447 0.539      306.       0    25 -7533. 15121. 15314.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Final model

At this point, we think that model7 has enough predictive capability for price, with an adjusted R-squared of 0.443. The next step is to look for signs of multi-colinearity, as high correlation between independent (explanatory) variables causes problems in the regression calculations, because in the case of multi-colinearity the independent variables “rob” one another of explanatory power. For this test, we will use car::vif(model_x) to calculate the Variance Inflation Factor (VIF) for your predictors. Our benchmark will be that A VIF larger than 5 means that our model suffers from colinearity.

#Look for signs of multi-colinearity (VIF > 5)
car::vif(model7)
##                          GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified     1.49  4            1.05
## number_of_reviews        1.92  1            1.39
## review_scores_rating     1.08  1            1.04
## room_type                1.86  3            1.11
## bathrooms                2.26  1            1.50
## bedrooms                 2.94  1            1.72
## beds                     1.99  1            1.41
## accommodates             2.30  1            1.52
## is_location_exact        1.04  1            1.02
## cancellation_policy      1.13  5            1.01
## neighbourhood_simplified 1.15  4            1.02
## host_is_superhost        1.19  1            1.09
## reviews_per_month        2.12  1            1.46

There is no VIF value above 5, so we do not have to remove any variable.

Comparison of residuals between models

Continuing with the diagnosis of our model, we are also interested in checking the residuals. For this, we will use autoplot combined with ggfortify package that allows autoplot to be used with more object types, such as lm() (regressions).

#Using autoplot to check residuals of our regression model
ggfortify:::autoplot.lm(model1) +
  labs(subtitle = "Model 1") +
  theme_bw()

#Using autoplot to check residuals of our regression model
ggfortify:::autoplot.lm(model2) +
  labs(subtitle = "Model 2") +
  theme_bw()

#Using autoplot to check residuals of our regression model
ggfortify:::autoplot.lm(model3) +
  labs(subtitle = "Model 3") +
  theme_bw()

#Using autoplot to check residuals of our regression model
ggfortify:::autoplot.lm(model4) +
  labs(subtitle = "Model 4") +
  theme_bw()

#Using autoplot to check residuals of our regression model
ggfortify:::autoplot.lm(model5) +
  labs(subtitle = "Model 5") +
  theme_bw()

#Using autoplot to check residuals of our regression model
ggfortify:::autoplot.lm(model6) +
  labs(subtitle = "Model 6") +
  theme_bw()

#Using autoplot to check residuals of our regression model
ggfortify:::autoplot.lm(model7) +
  labs(subtitle = "Model 7") +
  theme_bw()

  • Residuals vs Fitted: There is no clear pattern in the plot. If we had pattern, it would mean that the data have a linear relationship, which was not defined by the model and was omitted out in the residuals.

  • Normal Q-Q: The Normal Q-Q Plot only has a limited S form. If we cut off more outliers, we would get a straighter line. The S shape means that we have extremer values at the tails - meaning we have very low/very high prices for there given characteristics.

  • Scale-Location: This plot shows if residuals are spread equally along with the ranges of predictors. This is how one can check the assumption of equal variance. It’s acceptable if you see a horizontal line with equally spread points. Our Data set looks like the residuals are spread randomly.

  • Residuals vs Leverage: This plot helps us to find influential cases if any, as not all outliers are influential in linear regression. In our models, there are no influential cases, as we can’t see Cook’s distance lines (a red dashed line) because all cases are well inside of the Cook’s distance lines.

Summary of the models developed

During the development of our models, we have been assessing the improvements and the value of our changes looking at the p-values and adjusted R-squared, leading to the conclusion that our last model (model7) has the best predictive capacity. However, let us recap all the models that we have developed so far and compare the improvements between them. We will do this using huxtable.

# Produce summary table comparing models using huxtable::huxreg()
huxtable::huxreg(model1, model2, model3, model4, model5, model6, model7,
       statistics = c('#observations' = 'nobs', 
                      'R squared' = 'r.squared', 
                      'Adj. R Squared' = 'adj.r.squared', 
                      'Residual SE' = 'sigma'), 
       bold_signif = 0.05, 
       stars = NULL
)
(1)(2)(3)(4)(5)(6)(7)
(Intercept)6.793 7.031 6.492 6.529 6.519 6.474 6.535 
(0.050)(0.042)(0.042)(0.042)(0.042)(0.042)(0.042)
prop_type_simplifiedBoutique hotel0.010 0.392 0.348 0.352 0.348 0.347 0.308 
(0.033)(0.031)(0.029)(0.029)(0.029)(0.029)(0.028)
prop_type_simplifiedHouse-0.052 0.017 -0.008 -0.010 -0.009 -0.002 0.022 
(0.033)(0.027)(0.025)(0.025)(0.025)(0.025)(0.025)
prop_type_simplifiedOther-0.005 0.175 0.142 0.143 0.142 0.146 0.143 
(0.021)(0.019)(0.017)(0.017)(0.017)(0.017)(0.017)
prop_type_simplifiedServiced apartment0.192 0.103 0.092 0.104 0.102 0.106 0.107 
(0.028)(0.024)(0.022)(0.022)(0.022)(0.022)(0.022)
number_of_reviews0.002 0.000 0.000 -0.000 -0.000 -0.000 0.001 
(0.000)(0.000)(0.000)(0.000)(0.000)(0.000)(0.000)
review_scores_rating0.001 0.002 0.003 0.002 0.002 0.002 0.003 
(0.001)(0.000)(0.000)(0.000)(0.000)(0.000)(0.000)
room_typeHotel room     -0.363 -0.198 -0.186 -0.188 -0.184 -0.209 
     (0.032)(0.030)(0.030)(0.030)(0.030)(0.029)
room_typePrivate room     -0.851 -0.604 -0.597 -0.596 -0.579 -0.580 
     (0.014)(0.014)(0.014)(0.014)(0.014)(0.014)
room_typeShared room     -1.199 -1.001 -0.999 -0.998 -0.984 -1.005 
     (0.054)(0.050)(0.050)(0.050)(0.050)(0.049)
bathrooms          -0.043 -0.044 -0.044 -0.047 -0.048 
          (0.010)(0.010)(0.010)(0.009)(0.009)
bedrooms          0.039 0.040 0.041 0.043 0.049 
          (0.008)(0.008)(0.008)(0.008)(0.008)
beds          -0.004 -0.004 -0.004 -0.004 -0.007 
          (0.004)(0.004)(0.004)(0.004)(0.004)
accommodates          0.122 0.121 0.121 0.118 0.117 
          (0.004)(0.004)(0.004)(0.004)(0.004)
host_is_superhostTRUE               0.109 0.107 0.095 0.126 
               (0.015)(0.015)(0.015)(0.015)
is_location_exactTRUE                    0.025 0.021 0.014 
                    (0.012)(0.012)(0.012)
cancellation_policymoderate                         0.115 0.110 
                         (0.014)(0.014)
cancellation_policystrict                         0.423 0.310 
                         (0.551)(0.539)
cancellation_policystrict_14_with_grace_period                         0.136 0.122 
                         (0.014)(0.014)
cancellation_policysuper_strict_30                         1.004 0.916 
                         (0.184)(0.180)
cancellation_policysuper_strict_60                         1.283 1.310 
                         (0.551)(0.540)
neighbourhood_simplifiedEast (Asian side)                              -0.236 
                              (0.016)
neighbourhood_simplifiedNorth                              -0.089 
                              (0.044)
neighbourhood_simplifiedOutskirt                              -0.144 
                              (0.028)
neighbourhood_simplifiedWest                              -0.284 
                              (0.023)
reviews_per_month                              -0.094 
                              (0.009)
#observations9522     9522     9423     9423     9423     9423     9422     
R squared0.014 0.301 0.411 0.414 0.414 0.423 0.449 
Adj. R Squared0.013 0.301 0.410 0.413 0.413 0.422 0.447 
Residual SE0.721 0.607 0.557 0.555 0.555 0.551 0.539 

Prediction

Finally, it is the moment to use our best model (model7) for prediction. We will use it to predict the total cost to stay at an Airbnb for four nights. The prediction will be made for an apartment with a private room, have at least ten reviews, and an average rating of at least 90. As we are looking for an Airbnb for tourist, we make the assumption that the tourist wants to live in the city centre. Additionally, we make the assumption that we need two bedrooms and one shared bathroom. The appropriate 95% confidence interval will be included, and both the point prediction and interval will be reported in terms of price_4_nights. As we used a log(price_4_nights) model, we will use anti-log to convert the value in $.

# Adjust the data set to the specifications of our prediction (the type of property whose price we want to predict)

prediction_Istanbul <- listing_price %>% 
  
        filter(room_type == "Private room",
        number_of_reviews >= 10,
        review_scores_rating >= 90,
        prop_type_simplified == "Apartment",
        bathrooms == 1,
        bedrooms == 2,
        beds == 2,
        accommodates == 2,
        is_location_exact = TRUE,
        cancellation_policy == "flexible",
        neighbourhood_simplified == c("Centre"))

#Anti-log price by calculatin its exponential value, because exp(ln(x)) = x

model_predictions_log <- exp(predict(model7, newdata = prediction_Istanbul, interval = "confidence"))


paste("The predicted price for two people to stay four nights in Istanbul is $", round( model_predictions_log [1], 2), "with a 95% confidence interval between $", round(model_predictions_log[2], 2), "and $", round(model_predictions_log[3],2))
## [1] "The predicted price for two people to stay four nights in Istanbul is $ 670.8 with a 95% confidence interval between $ 644.74 and $ 697.92"