Biases in Data Analysis That One Should Know About
The risk of bias in data analysis is high, and it can enter anywhere from how a question is hypothesized and explored to how the data is sampled and organized. Bias can be introduced at any stage, from defining and capturing the data set to running the analytics, AI, or ML system. Hariharan Kolam, CEO and founder of Findem, a people intelligence company, stated in an interview, “Avoiding bias starts by recognizing that data bias exists, both in the data itself and in the people analyzing or using it.” In practice, it is all but impossible to be completely unbiased, because bias is an inherent part of human nature.
The Human Catalyst
Bias in data analysis can come from human sources: analysts use unrepresentative data sets, ask leading questions in surveys, and report or measure results in biased ways. Often bias goes unnoticed until some decision is made based on the data, such as building a predictive model that turns out to be wrong. Although data scientists can never completely eliminate bias in data analysis, they can take countermeasures to look for it and mitigate its effects in practice.
The Social Catalyst
Bias is also a moving target, because societal definitions of fairness evolve. Reuters reported that the International Baccalaureate (IB) program had to cancel its annual exams for high school students in May 2020 due to COVID-19. Instead of using exams to grade students, the IB program used an algorithm to assign grades, and those grades were substantially lower than many students and their teachers expected.
Bias from Existing Data
Amazon’s earlier recruiting tools showed a preference toward men, who made up more of the company’s existing staff. The algorithms didn’t explicitly know or look at the gender of applicants, but they ended up being biased by other features that were indirectly linked to gender, such as sports, social activities, and adjectives used to describe accomplishments. In essence, the AI was picking up on these subtle differences and trying to find recruits that matched what it had internally identified as successful.
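One rough way to look for such indirect links is to compare how often a feature appears in each group of a protected attribute. The sketch below is only an illustration under stated assumptions: the records, the field names (mentions_sports, has_degree, gender), and the function rate_gap are all hypothetical, and nothing here reflects how Amazon's tools actually worked.

```python
def rate_gap(records, feature, protected):
    """Absolute difference in how often `feature` is true across
    the two groups of the `protected` attribute."""
    groups = {}
    for record in records:
        groups.setdefault(record[protected], []).append(1 if record[feature] else 0)
    if len(groups) != 2:
        raise ValueError("this sketch assumes exactly two groups")
    first, second = (sum(flags) / len(flags) for flags in groups.values())
    return abs(first - second)


# Hypothetical, hand-made records; not real applicant data.
records = [
    {"gender": "m", "mentions_sports": True, "has_degree": True},
    {"gender": "m", "mentions_sports": True, "has_degree": False},
    {"gender": "m", "mentions_sports": True, "has_degree": True},
    {"gender": "m", "mentions_sports": False, "has_degree": True},
    {"gender": "f", "mentions_sports": False, "has_degree": True},
    {"gender": "f", "mentions_sports": False, "has_degree": True},
    {"gender": "f", "mentions_sports": True, "has_degree": False},
    {"gender": "f", "mentions_sports": False, "has_degree": True},
]

sports_gap = rate_gap(records, "mentions_sports", "gender")
degree_gap = rate_gap(records, "has_degree", "gender")
print("mentions_sports gap:", sports_gap)
print("has_degree gap:", degree_gap)
```

A large gap suggests the feature may act as a proxy for the protected attribute even when that attribute is never given to the model. A small gap does not prove the feature is safe, since combinations of features can still encode the same information.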
Under-representing Populations
Another big source of bias in data analysis arises when certain populations are under-represented in the data. This kind of bias has had a tragic impact in medicine by failing to highlight important differences in heart disease symptoms between men and women, said Carlos Melendez, COO and co-founder of Wovenware, a Puerto Rico-based nearshore services provider. Bias shows up in the form of gender, racial, or economic status differences. It appears when the data used to train algorithms does not account for the many factors that go into decision-making.
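A simple check for under-representation is to compare each group's share of the data set with its share of a reference population. The sketch below assumes the reference shares are known; the group labels, the counts, the 0.8 cutoff, and the function name representation_ratios are invented for illustration.

```python
from collections import Counter


def representation_ratios(sample_groups, reference_shares):
    """For each group, divide its share of the sample by its share of
    the reference population. A ratio below 1 means the group is
    under-represented in the sample."""
    total = len(sample_groups)
    if total == 0:
        raise ValueError("sample is empty")
    counts = Counter(sample_groups)
    return {
        group: (counts[group] / total) / share
        for group, share in reference_shares.items()
    }


# Hypothetical study sample: 80 men and 20 women.
sample = ["men"] * 80 + ["women"] * 20
# Assumed reference population shares, chosen for illustration only.
reference = {"men": 0.5, "women": 0.5}

ratios = representation_ratios(sample, reference)
# The 0.8 cutoff is an arbitrary choice for this sketch.
under_represented = [group for group, ratio in ratios.items() if ratio < 0.8]
print(ratios)
print("under-represented:", under_represented)
```

Running a check like this before training makes the gap visible early, when it can still be addressed by collecting more data or reweighting, rather than after a model has already been built on the skewed sample.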
Cognitive Biases
Cognitive bias can lead to statistical bias, such as sampling or selection bias. Often analysis is conducted on whatever data happens to be available, or on data that has been stitched together, instead of on carefully constructed data sets. Both the original collection of the data and an analyst’s choice of what data to include or exclude create sample bias. Selection bias occurs when the sample data that is gathered isn’t representative of the true future population of cases that the model will see. In such cases, it’s useful to move from static facts to event-based data sources that update over time and so more accurately reflect the world we live in. This can include moving to dynamic dashboards and machine learning models that can be monitored and measured over time.
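Monitoring over time can be as simple as comparing the distribution of a categorical field in the original sample with the same field in recent data. The sketch below uses total variation distance, which is one possible measure among several and is not named in the interviews above. The field values, the counts, and the function name are assumptions made for illustration.

```python
from collections import Counter


def total_variation_distance(train_values, live_values):
    """Half the sum of absolute differences between the category shares
    of two samples: 0 means identical distributions, 1 means no overlap."""
    if not train_values or not live_values:
        raise ValueError("both samples must be non-empty")
    train_counts = Counter(train_values)
    live_counts = Counter(live_values)
    categories = set(train_counts) | set(live_counts)
    return 0.5 * sum(
        abs(train_counts[c] / len(train_values) - live_counts[c] / len(live_values))
        for c in categories
    )


# Hypothetical data: the training sample is mostly urban,
# while the cases the model sees later are more mixed.
train = ["urban"] * 90 + ["rural"] * 10
live = ["urban"] * 60 + ["rural"] * 40

drift = total_variation_distance(train, live)
print("drift:", drift)
```

Tracking a number like this on a dashboard gives an early signal that the population the model sees has moved away from the sample it was built on. What counts as too much drift depends on the application and has to be chosen by the team.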