






View the Top 100 Highest "Fnlwgt" Records Below!
View the ERD Diagram of the Database!



A Few Key Takeaways about the 1994 Dataset!




In the past 28 years, wages have inflated considerably!
See what the 1994 wage is worth today!

See how age and marital status continue to be important today!



Select Countries in Average U.S. Dollars Today!
Derived from OECD data
Preliminary Feature Engineering and Feature Selection, and the Decision-Making Behind Them
The outcome variable was income level, a binary label (above or below $50K), so the problem lent itself most naturally to supervised machine learning. We split the data into training and testing sets, setting "iter" to 500 and stratifying on the outcome variable: the stratify parameter splits the data so that the proportion of each class in the training and testing samples matches its proportion in the full dataset.

In the first attempt at machine learning, we set up six supervised models: Cluster Centroids, SMOTEENN, Naive Random Oversampling, SMOTE, Balanced Random Forest Classifier, and Easy Ensemble with Adaptive Boost. As a baseline, we selected nearly all of the predictor variables, using those with a relevance of at least 0.02 to the model: age, relationship, education level, marital status, occupation, working class (i.e., sector), race, and sex. We excluded capital gains/losses because few records contained values in those columns, and the records that did (particularly for capital gains) had substantially large values, which could have skewed the model with outliers. In addition, the average hours worked was comparable across the two income groups.
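The stratified split described above can be sketched as follows. This is a minimal illustration with synthetic stand-in data, not the project's actual loading code; the feature matrix, class proportion, and random seed are all assumptions. The six resampling/ensemble models named above come from the imbalanced-learn package (e.g. `ClusterCentroids`, `SMOTEENN`, `RandomOverSampler`, `SMOTE`, `BalancedRandomForestClassifier`, `EasyEnsembleClassifier`), each of which would be fit on `X_train`/`y_train`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the Census features and the binary income
# label (1 = ">50K", 0 = "<=50K"); the real project loads these from its
# database after feature engineering.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))             # age, relationship, education, ...
y = (rng.random(1000) < 0.25).astype(int)  # roughly a quarter above $50K

# stratify=y makes the >50K/<=50K proportion in the train and test sets
# match the proportion in the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

print(round(y_train.mean(), 2), round(y_test.mean(), 2))  # nearly identical
```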
First Attempt
The best overall model in the first attempt was the Easy Ensemble with Adaptive Boost. We believe it achieved the highest balanced accuracy because it avoids overfitting (which could be useful if future Census samples are added to this dataset) and tends to learn well from weaker classifiers. Since we had already eliminated some of the anticipated noise and outliers, this model seemed the most logical contender at the outset. Ranking the variables in descending order of importance to the model placed age, relationship, and education level among the most relevant. Since race and sex scored lowest, we planned to drop them in our second machine learning attempt; however, we were aware that a variable can be statistically significant yet have little "say" in the model. We therefore did not anticipate that the Easy Ensemble with Adaptive Boost would perform any better on the second attempt, but we did want to see whether other models could outperform it.

Models Eliminated by First Attempt
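The importance ranking described above can be sketched as below. The project used imbalanced-learn's ensembles; this sketch substitutes scikit-learn's `RandomForestClassifier`, which exposes the same `feature_importances_` attribute as `BalancedRandomForestClassifier`. The feature names, toy data, and toy signal are illustrative assumptions, not the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["age", "relationship", "education_level", "marital_status",
            "occupation", "working_class", "race", "sex"]

# Toy data with a planted signal so the ranking is non-trivial.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(500, len(features))), columns=features)
y = (X["age"] + 0.5 * X["relationship"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Rank features by importance, descending; the lowest-scoring features
# (race and sex, in the project) are the candidates to drop next attempt.
ranked = sorted(zip(features, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name:16s} {score:.3f}")
```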

Model Selected for Use in First Attempt: Easy Ensemble with Adaptive (Ada) Boost


Second Attempt
On the second attempt, we removed race and sex from the feature set, but the models performed worse across the board, so we reverted to the original models.
Third Attempt: Exploration of feature combinations
By breaking up the dataset and using different combinations of features, we hoped to see different trends and patterns in our data. The key features considered in this analysis were education, sex, and marital status. The emphasis was on predicting whether an individual was making less than $50K, so only the classification report for the outcome labeled "0" was analyzed.
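Pulling out only the class-"0" row of a classification report, as in this third attempt, can be sketched as follows. The true/predicted labels here are a small illustrative example, not the project's predictions.

```python
from sklearn.metrics import classification_report

# Toy labels: 0 = "<=50K", 1 = ">50K".
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# output_dict=True returns the report as nested dicts, so only the
# class-"0" row (precision, recall, f1-score, support) needs analyzing.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["0"])
```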



