An Exploratory Data Analysis of the US Housing Crisis

Module by: Garrett Grolemund

Summary: This report summarizes work done as part of the Visualizing Large Data Sets PFUG under Rice University's VIGRE program. VIGRE is a program of Vertically Integrated Grants for Research and Education in the Mathematical Sciences under the direction of the National Science Foundation. A PFUG is a group of Postdocs, Faculty, Undergraduates and Graduate students formed around the study of a common problem. This module presents an exploratory analysis of large data sets, specifically data related to the housing crisis.

Introduction

The US housing crisis has undermined the world economy in wide-reaching and poorly understood ways. Although there is a lot of speculation over the causes and effects of the housing crisis, most of these ideas come from opinionated blogs or news articles that do not list their sources. This lack of data becomes perilous as the US government invests trillions of dollars based on untested hypotheses concerning the crisis. Our PFUG's focus is to compile, clean, and analyze data pertaining to the housing crisis to get a clearer picture of what is actually going on.

Overview and Motivation

Real Estate Bubble: Around 2006, house prices rose far above their true value. Eventually, prices became so high that it was difficult for current owners to afford their houses. As foreclosure rates increased, house prices began to plummet. This has greatly affected the global economy.

Little Public Organized Data: There is a lot of speculation over the causes and the effects of the housing crisis. Unfortunately, most of these ideas come from opinionated blogs or news articles that don’t list their sources. Therefore, it is difficult to collect reliable information.

Government Expenditures: The government has already spent millions of dollars to aid those affected by the housing crisis. With so little public data about the crisis, we are left wondering what data the government is using.

Still Unfolding: It is important to realize that the housing crisis is ongoing. This allows us to track its progression and hopefully make predictions for the upcoming years.

Large Data Sets: The housing crisis serves as a perfect model for visualizing large data sets. Most of the data sets we collect cover multiple years, counties, and variables.

Problems with Large Data

Hard To Find: The data we have collected come from many different sources. Currently, there is no central repository where such data can be found.

Licenses and Fees: Some of the data sets have licenses that do not allow us to reproduce or publish any of our findings. Also many of the data sets cost large amounts of money to purchase.

Size: Some data sets were as large as 10 GB. In order to work around this problem, we were able to extract certain parts of the data sets without having to completely download them.
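As an illustration of this kind of partial extraction, the sketch below streams a delimited file one row at a time and keeps only the columns of interest, so the full file never has to sit in memory. It is written in Python for illustration only and is not the group's own code. The function name `extract_columns`, the column names, and the sample rows are all hypothetical.

```python
import csv
import io


def extract_columns(lines, wanted):
    """Stream rows from an iterable of CSV lines, keeping only `wanted` columns."""
    reader = csv.DictReader(lines)
    for row in reader:
        # Build a smaller record with just the requested fields.
        yield {name: row[name] for name in wanted}


# Hypothetical sample standing in for a much larger file or network stream.
sample = io.StringIO(
    "county,year,hpi,notes\n"
    "Merced,2006,250.1,a\n"
    "Harris,2006,180.4,b\n"
)

rows = list(extract_columns(sample, ["county", "hpi"]))
print(rows)
```

Because `extract_columns` is a generator, it accepts any iterable of text lines, including an open file, and processes one row at a time.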

Dirty: Most of the data sets we find are what we call “dirty.” They are usually unorganized and practically unreadable.

Data Sets

To view our most current data sets and work, please visit our PFUG's website: http://github.com/hadley/data-housing-crisis. Some of our major data sets include...

  • American Community Survey
  • Case-Shiller House Price Index (HPI)
  • Census 2007
  • Construction of Housing Units
  • Market Value of 1 month rent in a Room
  • Vacancies
  • Mortgage Rates
  • Federal Housing Finance Agency HPI

Cleaning and Analysis

To facilitate sharing data, we have conducted both data cleaning and analysis with the open source statistical software R, which is available free of charge at http://www.r-project.org. R is widely regarded as a standard tool among statisticians, and it has several advantages. We are able to manipulate extremely large data sets (>2GB) on a normal desktop. It also allows us to produce impressive graphics with minimal coding.

Clean Data is...

  • Consistent: In a few data sets county names change over the course of a few years. This affects how we compare yearly data.
  • Concise: Some data sets contained only parts of information we needed. For example, the American Community Survey contains over 200 questions. We were only interested in the answer to one of those questions.
  • Complete: One of the data sets that was collected was missing around 80% of the data.
  • Correct: We must assume that the data we collect is not corrupt and was recorded properly. Some smaller data sets contained unusual observations. We used our own discretion when deciding what data sets were correct.
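
Two of these checks can be sketched in a few lines. The example below is a minimal illustration in Python, not the group's R code. It normalizes county names so that spellings match across years (consistency) and measures what fraction of a column is missing (completeness). The helper names, the suffix list, and the sample records are assumptions made for the example.

```python
def clean_county(name):
    """Normalize a county name so spellings match across years."""
    name = name.strip().lower()
    # Assumed suffixes to drop; a real data set may need a longer list.
    for suffix in (" county", " parish"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


def missing_fraction(values):
    """Return the fraction of entries that are None or blank."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    return missing / len(values)


# Hypothetical records: the same county spelled two ways, one value missing.
records = [
    {"county": "Merced County ", "year": 2006, "hpi": "250.1"},
    {"county": "merced", "year": 2007, "hpi": ""},
]

names = {clean_county(r["county"]) for r in records}
frac = missing_fraction([r["hpi"] for r in records])
print(names, frac)
```

A data set whose missing fraction approaches the 80% mentioned above would likely need to be supplemented or set aside before analysis.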

Cleaning Process

1. First we start with "dirty" data. (Fig. 1)

Figure 1 (graphics1.png)

2. Next we must download the data. A section of download code is shown below. (Fig. 2)

Figure 2 (graphics2.png)

3. Once we have the data, we clean the data as best we can according to the rules describing clean data above. A section of cleaning code is shown below. (Fig. 3)

Figure 3 (graphics3.png)

4. Now that the data has been cleaned, it may look like the top part of the data below. (Fig. 4)

Figure 4 (graphics4.png)

5. With clean data, we are able to explore it. The code below (Fig. 5) is the command used to produce the plot in Fig. 6.

Figure 5 (graphics5.png)

6. With R, we are able to produce complex plots with a minimal amount of code. (Fig. 6)

Figure 6 (graphics6.png)

Interesting Findings

Location, Location, Location...

The data graphed (Fig. 7 & Fig. 8) are from the Federal Housing Finance Agency (FHFA) house price index (HPI). Both graphs show when the HPI peaked for each metropolitan statistical area (MSA).

Both graphs suggest that timing matters. If an area's HPI peaked earlier than 2006 or later than 2007, its HPI was not as greatly affected. This also supports the claim that California and Florida were impacted the most.

In Figure 7, you can see that both California and Florida peaked around the same time. The graph shows the year in which each MSA reached its maximum housing price.

Figure 7 (graphics7.png)

In Figure 8, every point is an MSA, labeled by state. It plots the time of peak HPI against the percent change in HPI between the maximum HPI and the HPI in the first quarter of 2009. This graph shows that if the HPI peaked between 2006 and 2007, then that MSA typically experienced a much larger percent change in HPI.

Figure 8 (graphics8.png)
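
The two quantities plotted in Figure 8 can be computed for a single MSA as sketched below. The percent change is taken to be 100 × (HPI in 2009 Q1 − peak HPI) / peak HPI, which is our reading of the description above. The sketch is in Python for illustration rather than the group's R code, and the function name `peak_and_change` and the sample index values are hypothetical.

```python
def peak_and_change(series, end_period):
    """Find the peak period and the percent change from the peak to `end_period`.

    `series` maps (year, quarter) tuples to HPI values for one MSA.
    """
    # The period whose HPI value is largest.
    peak_period = max(series, key=series.get)
    peak = series[peak_period]
    change = 100.0 * (series[end_period] - peak) / peak
    return peak_period, change


# Hypothetical HPI values for one MSA, not real FHFA data.
hpi = {
    (2005, 4): 180.0,
    (2006, 2): 200.0,
    (2007, 1): 190.0,
    (2009, 1): 120.0,
}

period, change = peak_and_change(hpi, (2009, 1))
print(period, change)
```

Repeating this for every MSA gives one point per area: the peak period on one axis and the percent change on the other.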

Merced, CA

The city with the greatest percent change in the FHFA HPI was Merced, CA. Such a change is very unusual for a small city. Further research into Merced showed that the University of California, Merced finished construction in late 2005. Using both Figures 9 and 10, we hypothesize that construction increased because of the need for housing for UC Merced students and employees.

Figure 9 (graphics9.png)

Figure 10 (graphics10.png)

Myth Busters

After discovering Merced, CA, we decided to look more closely at college towns. Contrary to popular belief, college towns were not greatly impacted by the housing crisis. They were affected more by their location than by being a "college town". (Fig. 11)

Figure 11 (graphics11.png)

Other Explorations

  • Vacation Spots: Are areas where people own a second home more affected?
  • Renting vs. Owning: Is it better to rent or own a house?
  • Migration: Are cities that experienced massive population change affected?
  • Gross Domestic Product: Can we categorize a certain city by industry? Is there a relationship between cities that were hit by the housing crisis?

Communication and Future Work

It is extremely important that all of our data cleaning and findings are reproducible. We've made both the data and programming code available to the public through our PFUG's website at http://github.com/hadley/data-housing-crisis. GitHub is a website that tracks changes made to data and code by multiple individuals.

GitHub is advantageous to both our research group and the general public. First, we are able to store large amounts of data for free. It also allows us to work on the same data without having to e-mail changes back and forth. In addition, others can view and download our data for free. We hope that by keeping the code transparent and reproducible, others will be able to easily build on our work.

We would like to develop a website that will allow users to easily access the data they are interested in, which would otherwise be a daunting task for those who wish to use a data set of this size. Because our analysis and findings also involve large amounts of information (such as construction price time series for each US metropolitan area), we are exploring interactive graphical methods for displaying this information. Our future research will involve using the internet application Many Eyes, http://manyeyes.alphaworks.ibm.com, and then eventually the program Protovis, http://vis.stanford.edu/protovis, to create this website.

Acknowledgements

This Connexions module describes work conducted as part of Rice University's VIGRE program, supported by National Science Foundation grant DMS-0739420.
