Data scientists and business analysts need not only to find answers to their questions by querying data in various repositories, but also to transform that data in order to build sophisticated analyses and models. Read and write operations are at the heart of the data science process and are essential to helping them make quick, well-informed decisions. Supporting both is also an imperative for data infrastructure teams that are tasked with democratizing data while complying with privacy and industry regulations.
Meeting the needs of both groups requires a data governance platform capable of accelerating the data sharing process to satisfy the unique requirements of data consumers, while ensuring the organization as a whole remains in compliance with regulations such as GDPR, CCPA, LGPD, and HIPAA.
Data is the raw material for any type of analytics, whether it is the historical analysis that business analysts present in reports and dashboards or the predictive analysis in which data scientists build a model that anticipates an event or behavior that has not yet occurred. To be truly useful, raw information must be converted into data ready for consumption so business analysts can create reports, dashboards, and visualizations that paint a picture of the overall health of the organization.
Data scientists, too, can benefit from converted data, which they can use to build and train statistical models using techniques such as linear regression, logistic regression, clustering, and time series analysis. The output of these models can then be used to automate decision-making using sophisticated techniques such as machine learning.
But this task is becoming increasingly difficult due to the rise in compliance regulations such as GDPR, CCPA, LGPD, and HIPAA and the need for organizations to secure sensitive data across multiple cloud services. In fact, according to Gartner's Hype Cycle for Privacy, 2021 report[1], "By year-end 2023, 75% of the world's population will have its personal data covered under modern privacy regulations, up from 25% today", and "before year-end 2023, more than 80% of companies worldwide will be facing at least one privacy-focused data protection regulation".
Because data analytics is an exploratory exercise, it requires data consumers such as business analysts and data scientists to analyze large bodies of data to reveal patterns, behaviors, or insights that inform a decision-making process. Machine learning, on the other hand, specifically attempts to understand the features with the biggest influence on the target variable. This requires access to a large amount of data that may contain sensitive elements, including personally identifiable information (PII) such as a person's age, Social Security number, or address.
In many instances, this data is owned by different business units and is subject to strict data sharing agreements. This presents infrastructure teams with unique challenges, such as balancing the need to give data consumers access to enterprise data at the required granularity against the need to comply with privacy regulations and with requirements set by the data owners themselves. Another major challenge for the data infrastructure team is supporting the data science team's rapidly growing demand for data for its analytics and innovation projects.
Data science requires not only reading data but also updating it during the transformation steps described above. Put simply, data science is by nature a read- and write-intensive activity. To address this, data infrastructure teams usually create sandbox instances for these data consumers whenever they start a new project. However, these sandboxes also require robust data access governance so that no sensitive or confidential data is exposed during data exploration.
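As a rough illustration of what read and write governance over a sandbox could look like, the sketch below checks a requested operation against an explicit list of grants. It is not the API of any particular governance product: the `AccessPolicy` class, the `is_allowed` function, and the role and dataset names are all hypothetical, chosen only to show a deny-by-default check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """A hypothetical grant: one role, one dataset, a set of operations."""
    role: str
    dataset: str
    operations: frozenset  # e.g. frozenset({"read", "write"})


def is_allowed(policies, role, dataset, operation):
    """Return True only if some policy explicitly grants the operation.

    Anything not granted is denied, so a missing policy never exposes data.
    """
    return any(
        p.role == role and p.dataset == dataset and operation in p.operations
        for p in policies
    )


# Illustrative grants: full access in the sandbox, read-only in production.
policies = [
    AccessPolicy("data_scientist", "sandbox.customers", frozenset({"read", "write"})),
    AccessPolicy("data_scientist", "prod.customers", frozenset({"read"})),
]

print(is_allowed(policies, "data_scientist", "sandbox.customers", "write"))
print(is_allowed(policies, "data_scientist", "prod.customers", "write"))
```

Under these assumed grants, a data scientist can write to the sandbox copy but only read the production dataset, which mirrors the split between exploration and protected source data described above.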
According to the previously mentioned Gartner Hype Cycle for Privacy, 2021 report, "through 2024, privacy-driven spending on data protection and compliance technology will break through to more than $15 billion worldwide". To support the growing data science activities in a company, data infrastructure teams need to implement a unified data access governance platform that has four important attributes:
Enterprises can only thrive in this economy if data can flow to the far reaches of the organization to help make decisions that improve the company's profitability and competitive position. However, every company must share data with proper guardrails in place so that only authorized personnel can access the required data. This is mandated by an ever-increasing list of privacy regulations, and it is also necessary to preserve the trust that customers have placed in the company. To securely extract insights from their data, companies need a data governance solution that supports both read and write operations, automates the process of identifying and classifying sensitive data, acts on that data by encrypting it, and provides visibility into the company's data ecosystem.
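To make the "identify, classify, and act" sequence more concrete, here is a minimal sketch under simplifying assumptions. It classifies columns with two illustrative regular expressions and then replaces sensitive values with a salted SHA-256 digest. Note that this is pseudonymization used as a stand-in, not encryption, since a hash cannot be reversed. The function names, the pattern set, and the sample rows are all hypothetical, and a real platform would use far richer detection than pattern matching.

```python
import hashlib
import re

# Illustrative patterns only; real classifiers cover many more data types.
PATTERNS = {
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def classify_column(values):
    """Return a label if every non-empty value fully matches one pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for label, pattern in PATTERNS.items():
        if all(pattern.fullmatch(v) for v in non_empty):
            return label
    return None


def pseudonymize(value, salt):
    """Replace a value with a truncated salted SHA-256 digest (one-way)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def protect(rows, salt):
    """Classify each column, then mask the values of sensitive columns."""
    if not rows:
        return {}, []
    labels = {}
    for column in rows[0]:
        label = classify_column([row.get(column, "") for row in rows])
        if label:
            labels[column] = label
    protected = [
        {c: (pseudonymize(v, salt) if (c in labels and v) else v)
         for c, v in row.items()}
        for row in rows
    ]
    return labels, protected


rows = [
    {"name": "Ana", "ssn": "123-45-6789"},
    {"name": "Raj", "ssn": "987-65-4321"},
]
labels, protected = protect(rows, salt="demo-salt")
print(labels)
print(protected[0]["name"], protected[0]["ssn"])
```

The returned `labels` mapping doubles as a simple form of visibility: it records which columns were judged sensitive and why, while the non-sensitive columns pass through unchanged.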
Balaji Ganesan is CEO and co-founder of both Privacera, the cloud data governance and security leader, and XA Secure, which was acquired by Hortonworks. He is an Apache Ranger committer and member of its project management committee (PMC). To learn more, visit www.privacera.com or follow the company on Twitter.