Data Bias and Diversity and Inclusion
Data bias in analytical models can impact their accuracy. Correcting this bias throughout the data life cycle can improve diversity and inclusion.
Diversity and inclusion have been increasingly in the spotlight, both in social media and the news and in executive offices and boardrooms. At the same time, businesses and organizations of all kinds are collecting, analyzing, and using data to drive decision making. One consequence that’s often overlooked in these processes is data bias, which can negatively affect diversity and inclusion. Finance and accounting professionals, along with corporate professionals in all industries, have a responsibility to ensure that data is used in a manner that drives unbiased decision making.
UNDERSTANDING DATA BIAS
The word “bias,” derived from the French word biais, describing an oblique line or a deviation from the horizontal, is often used to describe the systematic favoritism toward a certain group of people. This word has traveled into the field of data science to describe “a deviation from expectation in the data. More fundamentally, bias refers to an error in the data. But, the error is often subtle or goes unnoticed,” according to Will Goodrum, director of research and development at Elder Research.
In other words, data bias is the risk that decisions are swayed by data that’s skewed in favor of, or against, a certain group of people.
Why does data bias occur? “Predictive models only ‘see’ the world through the initial data used for training. In fact, they ‘know’ of no other reality,” Goodrum wrote. “When those initial data are biased, model accuracy and fidelity are compromised. Biased models can limit credibility with important stakeholders. At worst, biased models will actively discriminate against certain groups of people.” Goodrum notes that awareness of these risks helps eliminate bias, leading to higher-quality models that not only improve the adoption of analytics but also enhance the value derived from analytics investment (bit.ly/2IHY7rK).
Although there are several types of data bias, two of the most common biases are selection bias and prejudice bias.
Selection bias occurs when the collected data doesn’t fairly represent the population, typically because it wasn’t properly randomized. For example, in 2015, before bringing its recruiting engine into use, Amazon discovered that the system favored male candidates.
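One simple way to look for this kind of skew is to compare each group’s share of the collected records with its share of the population the data is meant to represent. The following is a minimal sketch under stated assumptions: the function name `representation_gap`, the group labels, and all of the numbers are hypothetical and aren’t drawn from the Amazon case.

```python
from collections import Counter


def representation_gap(sample_labels, population_shares):
    """Return each group's share in the sample minus its reference share.

    A positive gap means the group is overrepresented in the sample;
    a negative gap means it is underrepresented.
    """
    total = len(sample_labels)
    if total == 0:
        raise ValueError("sample_labels must not be empty")
    counts = Counter(sample_labels)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical data: 8 of 10 collected records come from group "A",
# while the reference population is assumed to be split evenly.
sample = ["A"] * 8 + ["B"] * 2
reference = {"A": 0.5, "B": 0.5}

gaps = representation_gap(sample, reference)
for group, gap in sorted(gaps.items()):
    print(f"{group}: {gap:+.2f}")
```

A large gap doesn’t prove that a model trained on the data will be biased, but it can flag where the capture process deserves a closer look.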
Prejudice bias is driven by automation bias, which, according to M.L. Cummings, is “a tendency to disregard or not search for contradictory information in light of a computer-generated solution that is accepted as correct” (bit.ly/37pjrfT). An example of this is COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), an algorithm used to assess potential recidivism risk by the judicial system in various jurisdictions. This system mislabeled Black defendants as “high risk” at nearly twice the rate it mislabeled white defendants, which also resulted in longer sentences for those in the former group (bit.ly/2Hl2rgn).
DATA LIFE CYCLE
Biases can occur at different stages of the data life cycle. For example, bias can occur if the designer/developer has a conscious or unconscious bias. It can also occur if the data sets used are inherently biased. There are also examples in which the data sets aren’t recognizably biased but are skewed in their selection or in their emphasis. The data life cycle involves the following stages:
- Data capture is the first step an enterprise takes to make use of data: acquiring it through data entry, connected devices, or the Internet of Things.
- Data maintenance assesses data for its quality and completeness, using a set of predefined rules to transform and utilize the data.
- Data synthesis, commonly called “analytical modeling,” creates more value from data by applying logic or using other data as input.
- Data usage applies the transformed data to internal management reporting to help enterprises make good business decisions.
- Data publication creates external reporting and publishes the information outside of the enterprise.
- Data archiving moves data from an active state to a passive state so that it can be retrieved and reused as needed.
- Data purge removes the data (and its copies) from the enterprise.
This data life cycle guides professionals in identifying and mitigating data bias before it produces biased results. The previously mentioned Amazon example occurred in the data capture and maintenance phases: The captured data consisted of résumés submitted to the company over a 10-year period, most of which came from men. COMPAS, by contrast, incurred its bias in the data usage phase, where Northpointe, the company that developed the system, applied logic that optimized for true positives (i.e., people at high risk of committing another crime) at the cost of more false positives (i.e., people unjustly classified as likely reoffenders).
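A disparity of the kind described for COMPAS can be checked by comparing, group by group, how often people who didn’t reoffend were nevertheless labeled high risk (the false positive rate). The sketch below is illustrative only: the records, the group names “X” and “Y,” and the helper `false_positive_rate` are all hypothetical and don’t reproduce COMPAS data or Northpointe’s method.

```python
def false_positive_rate(predicted_high_risk, reoffended):
    """Share of people who did not reoffend but were labeled high risk."""
    false_positives = 0
    actual_negatives = 0
    for predicted, actual in zip(predicted_high_risk, reoffended):
        if not actual:
            actual_negatives += 1
            if predicted:
                false_positives += 1
    if actual_negatives == 0:
        raise ValueError("need at least one person who did not reoffend")
    return false_positives / actual_negatives


# Hypothetical records: (group, predicted_high_risk, reoffended)
records = [
    ("X", True, False), ("X", True, False), ("X", False, False),
    ("X", False, False), ("X", True, True),
    ("Y", True, False), ("Y", False, False), ("Y", False, False),
    ("Y", False, False), ("Y", True, True),
]

rates = {}
for group in sorted({g for g, _, _ in records}):
    predictions = [p for g, p, _ in records if g == group]
    outcomes = [a for g, _, a in records if g == group]
    rates[group] = false_positive_rate(predictions, outcomes)

for group, rate in rates.items():
    print(f"{group}: false positive rate = {rate:.2f}")
```

In this made-up data, group X is mislabeled at twice the rate of group Y even though both groups have the same number of people who reoffended, which is the kind of gap a review during the usage phase would be looking for.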
MINIMIZING DATA BIAS
Fair and nonskewed data is necessary to mitigate biased outcomes and to support better decision making. Thus, an enterprise should capture more diverse and inclusive sets of data and review the quality of the data in the earlier stages of the data life cycle. Those who seek to reduce the use of biased data should:
- Have a more diverse workforce, which allows a company to anticipate, spot, and review issues of unfair bias and to better engage the communities likely to be affected by bias.
- Receive feedback on the results from a diverse group of people, which allows better detection of unnoticed bias in the captured data. A diverse group of people can help reduce bias throughout the data life cycle, eventually decreasing biased outcomes.
Big Data has quickly led to a number of advancements within society. Yet with this rapid development of technology also comes a greater responsibility to use the data properly.
As companies look to increase their use of large data sets and automated systems to improve their workflow, it’s becoming increasingly important that they review how the data is captured and actively look for opportunities to minimize bias. This begins with using best practices in hiring to ensure that the team assigned to the project is as diverse and inclusive as possible and is supported from the top down by management that’s aware of the risks connected to data bias. A holistic approach that includes regular communication and ongoing education on the types of bias and the best practices for minimizing them is encouraged.
As technology continues to develop, we need to remember that it can’t produce unbiased results from biased data. It should be used to reduce human bias, not increase it.