






View the Top 100 Highest "Fnlwgt" Records Below!
View the ERD Diagram of the Database!



A Few Key Takeaways about the 1994 Dataset!




In the past 28 years, wages have inflated considerably!
See what the 1994 wage is worth today!

See how age and marital status continue to be important today!



Select Countries in Average U.S. Dollars Today!
Derived from OECD data
Preliminary Feature Engineering and Feature Selection, and the Decision-Making Behind Them
The outcome variable was income level, a binary label (above or below $50K), so the problem lent itself most naturally to supervised machine learning. We split the data into training and testing sets, setting "iter" to 500 and stratifying on the outcome variable: the stratify parameter splits the data so that the proportion of each class in the training and testing samples matches its proportion in the full dataset.

In the first attempt at machine learning, we set up six supervised models: Cluster Centroids, SMOTEENN, Naive Random Oversampling, SMOTE, Balanced Random Forest Classifier, and Easy Ensemble with Adaptive Boost. As a baseline, we selected nearly all of the predictor variables, using those with a relevance of at least 0.02 to the model: age, relationship, education level, marital status, occupation, working class (i.e., sector), race, and sex. We excluded capital gains/losses because few records contained values in those columns, and the records that did (particularly for capital gains) had substantially large values, which could have skewed the model with outliers. In addition, the average hours worked was comparable across the two income groups.
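The stratified split described above can be sketched as follows. This is a minimal illustration with synthetic stand-in data, not the project's actual loading code; the feature matrix, class proportion, and random seed are all assumptions. The six resampling/ensemble models named above come from the imbalanced-learn package (e.g. `ClusterCentroids`, `SMOTEENN`, `RandomOverSampler`, `SMOTE`, `BalancedRandomForestClassifier`, `EasyEnsembleClassifier`), each of which would be fit on `X_train`/`y_train`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the Census features and the binary income
# label (1 = ">50K", 0 = "<=50K"); the real project loads these from its
# database after feature engineering.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))             # age, relationship, education, ...
y = (rng.random(1000) < 0.25).astype(int)  # roughly a quarter above $50K

# stratify=y makes the >50K/<=50K proportion in the train and test sets
# match the proportion in the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

print(round(y_train.mean(), 2), round(y_test.mean(), 2))  # nearly identical
```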
First Attempt
The best overall model in the first attempt was the Easy Ensemble with Adaptive Boost. We believe it achieved the highest balanced accuracy because it avoids overfitting (which could be useful if future Census samples are added to this dataset) and tends to learn well from weaker classifiers. Since we had already eliminated some of the anticipated noise and outliers, this model seemed the most logical contender at the outset. Ranking the variables in descending order of importance to the model placed age, relationship, and education level among the most relevant. Since race and sex scored lowest, we planned to drop them in our second machine learning attempt; however, we were aware that a variable can be statistically significant yet have little "say" in the model. We therefore did not anticipate that the Easy Ensemble with Adaptive Boost would perform any better on the second attempt, but we did want to see whether other models could outperform it.

Models Eliminated by First Attempt
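The importance ranking described above can be sketched as below. The project used imbalanced-learn's ensembles; this sketch substitutes scikit-learn's `RandomForestClassifier`, which exposes the same `feature_importances_` attribute as `BalancedRandomForestClassifier`. The feature names, toy data, and toy signal are illustrative assumptions, not the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["age", "relationship", "education_level", "marital_status",
            "occupation", "working_class", "race", "sex"]

# Toy data with a planted signal so the ranking is non-trivial.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(500, len(features))), columns=features)
y = (X["age"] + 0.5 * X["relationship"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Rank features by importance, descending; the lowest-scoring features
# (race and sex, in the project) are the candidates to drop next attempt.
ranked = sorted(zip(features, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name:16s} {score:.3f}")
```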

Model Selected for Use in First Attempt: Easy Ensemble with Adaptive (Ada) Boost


Second Attempt
On the second attempt, we removed race and sex from the feature set, but the models performed worse across the board, so we reverted to the original models.
Third Attempt: Exploration of feature combinations
By breaking up the dataset and using different combinations of features, we hoped to see different trends and patterns in our data. The key features considered in this analysis were education, sex, and marital status. The emphasis was on predicting whether an individual was making less than $50K, so only the classification report for the outcome labeled "0" was analyzed.
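Pulling out only the class-"0" row of a classification report, as in this third attempt, can be sketched as follows. The true/predicted labels here are a small illustrative example, not the project's predictions.

```python
from sklearn.metrics import classification_report

# Toy labels: 0 = "<=50K", 1 = ">50K".
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# output_dict=True returns the report as nested dicts, so only the
# class-"0" row (precision, recall, f1-score, support) needs analyzing.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["0"])
```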



