Due: June 20, 2018 @ 11:59pm
This is an extra credit homework assignment for you to work on after you have finished and submitted Homework 4. The extra credit counts towards the homework category of your final grade.
For this extra credit assignment, you will be guided through the process of building a regression model that predicts the market value of condominiums in New York City using a dataset published by the New York City Department of Finance.
Obtain the Github repository you will use to complete homework 5 that contains a starter RMarkdown file named homework_5.Rmd, which you will use to do your work and write-up when completing the questions below. Remember to fill in your name at the top of the RMarkdown document and be sure to save, commit, and push (upload) frequently to Github so that you have incremental snapshots of your work. When you’re done, follow the How to submit section below to setup a Pull Request, which will be used for feedback.
Extra credit details
Each correct answer will provide you with some extra credit and you do not need to answer all 6 questions. However, any questions you do try to answer must be completed in the order given. For example, this means you shouldn’t skip the first question while providing answers to questions 2, 3, 4, 5, and/or 6. Skipping over a question like this will invalidate the entire submission. In addition, there is no partial credit, meaning that incomplete answers will be marked as incorrect.
About the dataset
This dataset reports on the market valuations of condominiums in New York City for Fiscal Year 2011/2012. The data1 comes from the New York City Department of Finance and was made available to the general public on the NYC OpenData website (https://opendata.cityofnewyork.us/). The official description for the dataset is as follows:
Condominiums and cooperatives are valued as if they were residential rental apartments. Income information from similar rental properties is applied to determine value. The Department of Finance (DOF) chooses similar properties to value condos and coops. Properties are selected based on a combination of factors such as: land location, income levels, building age and construction and exemptions and subsidies.
The data in the files housing_train.rds and housing_test.rds was adapted from the cleaned and aggregated version by data scientist Jared Lander (https://www.jaredlander.com/data).
|boro||Borough where building is located. New York City is divided into 5 boroughs, Manhattan, The Bronx, Brooklyn, Queens, and Staten Island.|
|neighborhood||Neighborhood of building location. The neighborhood name is assigned by the New York City Department of Finance, and in most cases is the same as the neighborhood’s common name.|
|class||Building classification code assigned by the New York City Department of Finance. There are four building classifications for the condominiums in the dataset, rental, walk-up, elevator, and co-op.|
|year_build||Year the building was built|
|units||Total number of units in the building|
|sqft||Gross square footage of the building|
|value_per_sqft||Total market value per squarefoot of the land and building|
The main goal of this homework assignment is to build a model that predicts the market value per squarefoot — the variable
value_per_sqft — of condominiums in New York City. We should not expect to be able to construct a model with 100% precision, but we would like to uncover trends in the data. This allows us to pose the driving question for our analysis as follows:
What are key factors that affect the overall price of condominiums in New York City?
When building and evaluating predictive models, it is standard protocol to split your dataset into a training dataset and a test dataset. This has already been done for you, with the training dataset loaded into the variable
housing_train and the testing dataset loaded into
housing_test. You will be using
housing_train for most of the homework to build and cross-validate your models. Once you’ve selected your model, as a final step you will use it to predict the
value_per_sqft column in the dataset stored in
For your convenience, the helper function
rep_kfold_cv(data, k, model, cv_reps) is loaded into your R environment and will run the code that cross-validates your models.2 This function requires four inputs:
|data||The training dataset|
|k||the number of folds to use for cross-validation|
|model||Model to cross-validate written in the lm() formula syntax (money ~ work + time)|
|cv_reps||Number of times to repeat cross-validation sequence to improve statistical averaging|
Create the following visualizations to explore the dataset:
A histogram of
value_per_sqftfaceted over boroughs of New York City
Box plots of
units(y axis) for the different boroughs (x axis) plotted two different ways: in a normal scale and in a
log10()scale along the y axis (see http://r4ds.had.co.nz/graphics-for-communication.html#replacing-a-scale for how to scale the axes)
Box plots of
sqft(y axis) for the different boroughs (x axis) plotted two different ways: in a normal scale and in a
log10()scale along the y axis
value_per_sqft(y axis) versus
units(x axis) using
units. Facet over two variables: boroughs and condominium classification.
Based on your plots so far, which variables (columns) in the dataset seem to have the strongest overall impact on the condominium values?
The box plots of
unitsin the previous question should reveal extreme outliers in the plot. Since our goal is to model general trends and not precise values, fitting to these data points may skew our model in an unhelpful way. Filter the dataset to remove these outliers (there are 3 in all).
Note: Besides these 3 extreme points, there are other potential outliers that you might consider removing. If you detect others, you are welcome to remove them to see if it helps you build your model, provided you explain why you’re removing them.
To begin, try building univariate (single variable) models and see how they compare with each other. Use
value_per_sqftas your response variable and then try
sqftfor your explanatory variable (this means you will try out 4 different models). Plug these models into the k-fold cross-validation function
rep_kfold_cv()with k = 10 and cv_reps = 3. Compare the mean-square error (MSE), both unadjusted and adjusted, and \(R^2\) for these models, noting that models with better predictive power will have lower MSE and higher \(R^2\) scores. Which model did the best so far?
These next three questions will guide you through the process of building a more complicated multivariate model and making predictions with your final model on the
Build and cross-validate at least 3 multivariate models that predict
value_per_sqft, using the k-fold cross-validation parameters k = 10 and cv_reps = 3. An example of a multivariate model is
value_per_sqft ~ boro + units. You may also want to consider including interaction terms (see http://r4ds.had.co.nz/model-basics.html#interactions-continuous-and-categorical for a quick review). For example, you might try
value_per_sqft ~ boro + class * sqft. Which of your models performs the best? Is it significantly better than your best model in the last question?
Now that you’ve selected your model, train it on the full dataset:
final_model <- lm(model_formula, data = housing_train)
model_formulais the best performing model from the previous question.
To predict values in the testing set, use
add_predictions()from the modelr package to put the model predictions directly into your testing dataset. Then calculate the mean-square error for the predictions:
test_predictions %>% summarize(mse = sum((value_per_sqft - pred)^2) / n())
This score is useful because it is absolute and allows you to compare how well your model performs against all other model types. Can you do better than a MSE score of 2030.34?
To wrap up, evaluate how well your model obeys the conditions for least squares linear regression, which are summarized on page 238 of the Introductory Statistics with Randomization and Simulation textbook. Make two plots to inspect how well your model conforms to the requirements for linear regression:
To evaluate the residual spread, make a scatterplot of
(value_per_sqft - pred)(y axis) versus
To inspect whether the residual distribution is nearly normal, make a Q-Q plot of
(value_per_sqft - pred).
Explain whether your model obeys the conditions for least squares linear regression.
How to submit
When you are ready to submit, be sure to save, commit, and push your final result so that everything is synchronized to Github. Then, navigate to your copy of the Github repository you used for this assignment. You should see your repository, along with the updated files that you just synchronized to Github. Confirm that your files are up-to-date, and then do the following steps:
Click the Pull Requests tab near the top of the page.
Click the green button that says “New pull request”.
Click the dropdown menu button labeled “base:”, and select the option
Confirm that the dropdown menu button labeled “compare:” is set to
Click the green button that says “Create pull request”.
Give the pull request the following title: Submission: Homework 5, FirstName LastName, replacing FirstName and LastName with your actual first and last name.
In the messagebox, write: My homework submission is ready for grading @jkglasbrenner.
Click “Create pull request” to lock in your submission.
You are encouraged to review and keep the following cheatsheets handy while working on this assignment:
Data set aggregated from the following sources:
For those that are interested in seeing how you would implement k-fold cross-validation using the
tidyversepackages, the code for the function
rep_kfold_cv()can be found in the file
repeated_kfold_cross_validation.Rdistributed with your Github repo.↩