Thursday, July 18, 2019
Bias In Data Visualizations
The chart below shows a media bias layout as proposed by its source, https://www.adfontesmedia.com.
The raw data set gives a horizontal and vertical view of how the source has classified media bias. The X axis, which classifies outlets by Left, Neutral, and Right political leanings, is one of the major issues: these categories can be subjective, and introducing subjective categories for analysis is one of the recurring problems in data science visualization.
The Y axis is somewhat mixed. A classification such as Original Fact Reporting can be defended, as can the rest of the Y-axis categories to a certain degree, but the X-axis classification is substantially more difficult to quantify.
This is much different from converting survey opinions into quantitative data. While the graph's X, Y data set is open to the public, it merely contains each news outlet and its X and Y values. I agree with many of the classifications based on my personal viewing consumption, but there are open questions about the qualitative methodology behind the graph.
The point of this blog entry is not to quarrel with its findings, but to highlight what we see every day. Without insight into the methodology, one would first have to decide whether they agree or disagree with the findings. The next step would be to ask why, and to refer to a white paper on the methods to dig down.
It is ironic to be questioning bias on a graph about bias.
Thursday, July 11, 2019
Data Imputation Models Using Linear Regression
There are many ways to deal with missing values in data sets. For continuous data, we may impute the missing values to the mean or the median; for categorical data, to the dominant category. There is also a model-based method for continuous values worth testing out.
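As a minimal sketch of those simple strategies (the data frame and its column names are made up for illustration), the mean, median, and modal imputations might look like this in pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "income": [50000.0, 62000.0, np.nan, 48000.0, 55000.0],
    "segment": ["a", "b", "a", None, "a"],
})

# Continuous columns: impute to the median or the mean
df["age"] = df["age"].fillna(df["age"].median())        # median of 25, 40, 31 -> 31
df["income"] = df["income"].fillna(df["income"].mean()) # mean -> 53750

# Categorical column: impute to the dominant (modal) category
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])  # "a"
```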
Consider a scenario where we have missing values in a column. We could fit a linear regression model to predict those missing values from the other columns in the data set. One issue is that there may be multicollinearity between the imputed column and the independent variables, which becomes problematic when a large percentage of the data is missing. In that case it may be worthwhile to leave the imputed feature out of the predictive model, because its predicted values would be linear combinations of other features.
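Here is a sketch of that regression-based imputation, using plain NumPy least squares on synthetic data (the column names and the generating formula are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data: "income" depends linearly on the other columns
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(20, 60, n)
education = rng.integers(10, 21, n).astype(float)
income = 15000 + 800 * age + 1500 * education + rng.normal(0, 2000, n)
df = pd.DataFrame({"age": age, "education": education, "income": income})

# Knock out 25% of the income values to simulate missingness
df.loc[df.sample(frac=0.25, random_state=1).index, "income"] = np.nan

# Fit ordinary least squares on the rows where income is observed
obs = df["income"].notna()
X_obs = np.column_stack([np.ones(obs.sum()), df.loc[obs, ["age", "education"]]])
coef, *_ = np.linalg.lstsq(X_obs, df.loc[obs, "income"], rcond=None)

# Predict and fill the missing incomes from the fitted coefficients
X_mis = np.column_stack([np.ones((~obs).sum()), df.loc[~obs, ["age", "education"]]])
df.loc[~obs, "income"] = X_mis @ coef
```

Note that every filled row is now an exact linear combination of age and education, which is precisely the multicollinearity concern described above.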
The end state is to feed your final model enough complete examples with enough variability for it to pick up on. Once that is done, the imputation model you use to fill in missing values in the training set must be the same model used to fill in missing values in the validation set before validating.
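A minimal illustration of that rule, assuming a simple mean imputer (the split and the numbers are made up): the statistic is learned on the training set and reused unchanged on the validation set.

```python
import numpy as np
import pandas as pd

# Hypothetical train/validation split with missing values in both
train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})
valid = pd.DataFrame({"x": [np.nan, 3.0]})

# Learn the fill value from the training data only...
fill_value = train["x"].mean()  # (1 + 2 + 4) / 3

# ...and apply the same value to both splits; never refit on validation
train["x"] = train["x"].fillna(fill_value)
valid["x"] = valid["x"].fillna(fill_value)
```

The same discipline applies to a regression imputer: fit its coefficients on the training rows, then use those frozen coefficients to fill the validation rows.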
With the option presented in this blog post, the challenge to the reader is to test, test, test. Find a model or imputation method that reduces your error rate in training and doesn't degrade during validation. You might find additional value in dummy coding your categories and feeding them into the imputation model as well. Good luck!
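As a sketch of that last suggestion (toy data; the names are invented), dummy-coded categories can serve as predictors in the regression imputer. With only a category as input, the least-squares fill reduces to the per-category mean of the observed rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "sales" is missing for one row of each region
df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north", "south"],
    "sales":  [10.0,    20.0,    np.nan,  22.0,    11.0,    np.nan],
})

# Dummy code the category so it can feed the imputation regression
X = pd.get_dummies(df["region"], prefix="region").astype(float)

# Fit least squares on the observed rows
obs = df["sales"].notna()
A_obs = np.column_stack([np.ones(obs.sum()), X[obs]])
coef, *_ = np.linalg.lstsq(A_obs, df.loc[obs, "sales"], rcond=None)

# Fill the missing rows; each fill equals its region's observed mean
A_mis = np.column_stack([np.ones((~obs).sum()), X[~obs]])
df.loc[~obs, "sales"] = A_mis @ coef
```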