Security And Privacy In The Modern Data World

If you’ve worked as a DBA or data engineer, then you’ve likely had to deal with data access requests. These can grow from quick one-off asks into an endless queue that bogs down entire data teams.

This has been heavily driven by the fact that companies want to be data-driven. That means an ever-increasing list of end-users, from analysts and product managers to external customers who have access via data sharing.

Back when I worked at Facebook we saw several iterations of security.

When I first started, if there was data that was considered sensitive, we had to create two tables: one normal table and a second called “table_name_sensitive”. This was because we were using Presto and an internal version of HDFS, which meant we were constantly having to build out our own security support.

A far more granular approach is to provide column-level security. However, this also becomes very tedious and time-consuming because of how finely data can be grouped.
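To make column-level security concrete, here is a minimal Python sketch. The column names, group names, and the COLUMN_ACL mapping are all hypothetical; this is not how Facebook or any particular database implements it. It only illustrates why the approach gets tedious: every sensitive column needs its own list of permitted groups.

```python
# Hypothetical column-level access list: column name -> groups allowed to read it.
COLUMN_ACL = {
    "employee_id": {"analytics", "hr"},
    "department": {"analytics", "hr"},
    "visa_status": {"hr_immigration"},
    "salary": {"hr_compensation"},
}


def visible_columns(row: dict, user_groups: set) -> dict:
    """Return only the columns that at least one of the user's groups may read.

    Columns missing from COLUMN_ACL are hidden (deny by default).
    """
    return {
        column: value
        for column, value in row.items()
        if COLUMN_ACL.get(column, set()) & user_groups
    }


row = {"employee_id": 42, "department": "Ads", "visa_status": "H-1B", "salary": 150000}

analyst_view = visible_columns(row, {"analytics"})
hr_immigration_view = visible_columns(row, {"hr", "hr_immigration"})

print(analyst_view)         # employee_id and department only
print(hr_immigration_view)  # adds visa_status, still no salary
```

Every new sensitive column or new audience means another entry to maintain by hand, which is where the access tickets come from.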

An Example

Take a topic such as immigration data. You have immigration recruiting data and HR immigration data, and even within the HR data there are varying levels. That means you could have 5-6 extra groups to manage in just one sub-domain of employee data.

This forces us, data engineers, to spend a lot of our time during oncalls managing the varying data access requests. It is tedious.

When I left Facebook I started looking into how other companies were looking to manage their data security.

The standard method is similar to the process we went through at Facebook, except that data teams create groups or roles using either the database-provided security layer or perhaps a broadly accepted company standard.

This leads to lots of tickets and backed-up data teams. Now let’s add in the more complex use cases companies face today, like managing GDPR and sharing data with other companies via Snowflake and Databricks.

Of course as a consultant, part of my role is often to look into solutions for clients and help them find the right fit. Thus, looking into vendors is a must. That’s why I wrote my last newsletter about the varying all-in-one data stacks and that’s why I am writing this one.

I kept seeing and hearing about Immuta and I finally had to stop and ask…

What is Immuta and why should I care?

Digging Into Immuta

Immuta was founded in 2015 and it has grown into a market leader for secure data access. Immuta provides data teams with a universal, cloud-based platform to automate and control access to data sets. Data-driven organizations across the world use Immuta to save money and time by automating the discovery, security, and monitoring of data. This allows organizations to share more data with the right users and mitigate the risk of data leaks and breaches.

What is the problem Immuta is trying to solve?

To get the most value out of their data, organizations prioritize collection and storage. As we move towards a more data-driven future, the scale of our data continues to grow at an alarming rate. This data needs to be accessible to a growing number of users to maximize the information organizations can derive from it. Data will also need to be inaccessible to certain users to maintain security and privacy.

“The CEO Matthew Carroll painted a future to us where analytical datastores will become more polyglot over time, while traditional rule-based approaches wouldn’t scale as the DWH started to be used outside of the traditional data team.” – Ryan Wexler, Dell Technologies Capital

The processes of efficient data collection, storage, security, and access facilitation are collectively known as data management. Organizations can only ensure that their resources are being leveraged to their highest potential when data management is done well.

There are a few components to consider when we look at the modern data stack as it relates to data management. Creating a suitable environment for drawing value from data requires organizations to compile an array of (largely cloud-based) tools that operate jointly. This may include platforms for data access control, governance, storage, computing, analysis, ELT/ETL, data visualization, business intelligence (BI), and more. When these tools work in concert with each other, organizations can accommodate more users without sacrificing regulatory compliance. This is where Immuta comes in.

How does Immuta help with data sharing?

Data analytics is often a collaborative process. Analysts share scripts, models, and dashboards as they derive value from their data. Most controls only protect the raw data, ignoring the security risks present in production-level data. Immuta focuses on solving four common data-sharing challenges:

Collaborate and share data securely inside the organization
Securely share data with third parties
Implement a secure data exchange platform
Assure lawful data use agreements

To achieve these goals, Immuta provides features like Immuta Projects, Equalization, and Masking Techniques.

Immuta Projects are virtual walled gardens for teamwork, protected through a variety of policy enforcement options (Immuta 2022). Users can share data outputs, models, and dashboards freely with other users in the same project.

I reached out to Matt Vogt, Vice President of Global Solution Architecture at Immuta, to further understand what the goal of Projects was, and he gave a great example.

Let’s say you’re working on a fraud detection model that uses sensitive data, and the only time you should be touching that data is when you’re working on that model. Projects lets companies pre-define what the more sensitive data may be used for. Then, as an end-user, you are only allowed to work with that data when you set up your workspace to connect to the project.

This can sound like an extra layer of governance, and no one tends to like extra governance. But when you read stories like the CustomerIO story, where data from their customer OpenSea was used improperly, you can understand why this is important.

Equalization allows users in the same project to have equal data permissions with the click of a button, saving weeks or months of data cleansing time.

Masking Techniques add re-identification techniques such as format-preserving encryption and reversible masking, which allow data to be re-identified during collaboration and sharing events.
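The fraud detection example above boils down to purpose-based access: sensitive data is only reachable while you are working inside a project whose declared purpose matches the data. Below is a rough Python sketch of that idea. The Project class, the DATASET_PURPOSE mapping, and the can_query function are names I made up for illustration; this is not Immuta’s API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Project:
    """Hypothetical stand-in for a purpose-bound workspace."""
    name: str
    purpose: str
    members: set = field(default_factory=set)


# Hypothetical mapping: dataset -> the only purpose it may be used for.
DATASET_PURPOSE = {"transactions_sensitive": "fraud_detection"}


def can_query(user: str, dataset: str, active_project: Optional[Project]) -> bool:
    """Allow access only from inside a project whose purpose matches the dataset."""
    required = DATASET_PURPOSE.get(dataset)
    if required is None:
        return True  # in this sketch, datasets without a purpose are unrestricted
    return (
        active_project is not None
        and user in active_project.members
        and active_project.purpose == required
    )


fraud = Project("fraud-model", "fraud_detection", {"ana"})
marketing = Project("campaign-report", "marketing", {"ana"})

print(can_query("ana", "transactions_sensitive", fraud))      # True
print(can_query("ana", "transactions_sensitive", marketing))  # False
print(can_query("ana", "transactions_sensitive", None))       # False
```

The same user gets a different answer depending on which workspace is active, which is the point: access follows the purpose, not just the person.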

How does Immuta help with data privacy?

Immuta automates the scanning, detection, and generation of standard tags across cloud platforms. This makes it faster and easier for data teams to apply dynamic data masking and advanced access controls. Immuta focuses on solving three main data privacy challenges:

Discover and tag sensitive data
Anonymize PHI/PII data
Mask sensitive data

Immuta discovers sensitive information across millions of fields without manual effort. 60+ prebuilt classifiers work alongside domain-specific, custom classifiers based on desired confidence levels. Once sensitive data has been discovered and tagged, it is anonymized at query runtime with masking strategies such as k-anonymization, format-preserving masking, hashing, regex, conditional masking, differential privacy, and randomized response – all without code or copying data.
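To give a feel for a few of the masking strategies listed above, here is a small Python sketch of one-way hashing, regex masking, and a k-anonymity check. These are generic textbook versions written for illustration, with hypothetical function names and a demo salt; they are not Immuta’s implementations, which apply policies at query runtime rather than in application code.

```python
import hashlib
import re
from collections import Counter


def hash_mask(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a salted SHA-256 hex digest (one-way masking)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def regex_mask_ssn(text: str) -> str:
    """Mask all but the last four digits of SSN-shaped strings (NNN-NN-NNNN)."""
    return re.sub(r"\d{3}-\d{2}-(\d{4})", r"XXX-XX-\1", text)


def is_k_anonymous(rows: list, quasi_identifiers: list, k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


rows = [
    {"zip": "98101", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "98101", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "98102", "age_band": "40-49", "diagnosis": "C"},
]

print(regex_mask_ssn("SSN: 123-45-6789"))               # SSN: XXX-XX-6789
print(is_k_anonymous(rows, ["zip", "age_band"], k=2))   # False: one group has a single row
print(is_k_anonymous(rows[:2], ["zip", "age_band"], k=2))  # True
```

The hashed value is stable for the same input and salt, so it can still be used for joins and counts without exposing the original value.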

How does Immuta help with data regulation?

Organizations are facing pressure to collect and distribute more data across their enterprise while implementing more stringent privacy and security policies. Immuta’s unified access and control layer creates a single intuitive place to manage all data (Immuta 2022). Any data governance personnel can write and implement even the most stringent policies on all data without having to write memos or code. Immuta focuses on solving three main data governance regulation challenges:

Protect data for regulatory compliance: GDPR, HIPAA, DOD/IC
Prove compliance with audits, reporting, and remediation actions
Prevent data breaches with data sovereignty and localization

When you enforce data policies from one place with Immuta, your organization gains regulatory efficiency and transparency. Policies are enforced dynamically, in real-time, as users attempt to access and work with data. This means all security stakeholders can write consistent policies for any data. There is also no longer a need to move or copy data and risk violating data localization laws.

When you put all these features together, Immuta shines as a powerful enforcement tool even for organizations with the most stringent data privacy and governance regulations.

Role-based access control (RBAC) vs Attribute-based access control (ABAC)

Both role-based access control (RBAC) and attribute-based access control (ABAC) govern which users can reach which data, but what is the difference between the two approaches?

Fundamentally, both data access control approaches are more concerned about how the policy is handled than the role or attribute.

RBAC is designed as an approach to data security that allows or restricts system access based on a user’s role within the organization. Essentially, users are only allowed access to data that pertains to their job functions. At the organizational level, RBAC typically manages access to tables, columns, and cells. This is closely tied to data consumer-specific table access control lists (ACL).

Under the RBAC approach, data engineers decide who belongs to a role and what that role has access to. The issue comes in when data engineers attempt to write policies for numerous roles. Even if your policy only has slight changes for each role, everything must be predetermined.

Imagine you wanted a policy to restrict access to a specific U.S. state for each role in your organization. You would have to write 50 policies – one for each state – and maintain 50 roles, one for each policy. If any users had access to more than one state, you would also need to create a policy for those scenarios (Immuta 2022). You can see how this quickly becomes cumbersome for data engineers and policy stakeholders. However, almost all modern and legacy databases still use the RBAC approach to implement data access controls.
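Here is a rough Python sketch of that role explosion, using three states instead of 50 so it stays readable. The role names and the rbac_policies dictionary are hypothetical and not tied to any specific database. It assumes the simplest multi-state case, where some users need exactly two states.

```python
from itertools import combinations
from math import comb

STATES = ["WA", "OR", "CA"]  # shortened from 50 for readability

# One predetermined role (and policy) per state.
rbac_policies = {f"role_{state}": {state} for state in STATES}

# Users who need two states require yet another predetermined role per pair.
for a, b in combinations(STATES, 2):
    rbac_policies[f"role_{a}_{b}"] = {a, b}


def rbac_filter(rows: list, role: str) -> list:
    """Return the rows whose state is allowed for the given role."""
    allowed = rbac_policies[role]
    return [row for row in rows if row["state"] in allowed]


rows = [
    {"id": 1, "state": "WA"},
    {"id": 2, "state": "OR"},
    {"id": 3, "state": "CA"},
]

wa_rows = rbac_filter(rows, "role_WA")
wa_or_rows = rbac_filter(rows, "role_WA_OR")

# Single-state roles plus every two-state pair, for all 50 states.
roles_for_50_states = 50 + comb(50, 2)

print(len(rbac_policies))    # 6 roles for just three states
print(roles_for_50_states)   # 1275
```

And that count only covers one- and two-state combinations. Add three-state users or a second dimension like department and it grows again.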

This is another area where Immuta has an advantage. Research by GigaOm demonstrated that implementing the RBAC framework of Apache Ranger (a common access control platform) led to 75x more policy changes than Immuta’s ABAC approach to satisfy the same security requirements. In this landscape, fewer policy changes are better for efficiency.

Cumulative policy changes under Apache Ranger (RBAC) vs Immuta (ABAC)

Attribute-based access control is a data security approach that permits or restricts data access based on assigned user, object, action, and environmental attributes. This contrasts with RBAC’s reliance on the static privileges inherent to each specific role. The ABAC approach is a “highly dynamic model because policies, users, and objects can be provisioned independently, and policies make access control decisions when the data is requested.”

Imagine, again, that you wanted to build specific policies for users in each state. Instead of having to create separate policies for all 50 states, you can treat attributes as dynamic variables and create a 50-state rule with a single policy. ABAC is like writing code with variables, eliminating the need to write the same blocks of policies repeatedly. This is much more efficient and reduces the chances of error in access control.
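In code, the “policy with variables” idea looks something like the Python sketch below. The attribute name states and both function names are hypothetical, chosen only for this illustration; a real ABAC engine would evaluate the rule at query time inside the data platform.

```python
def abac_state_policy(user_attrs: dict, row: dict) -> bool:
    """Single policy: a row is visible if its state is one of the user's state attributes."""
    return row["state"] in user_attrs.get("states", set())


def abac_filter(rows: list, user_attrs: dict) -> list:
    """Apply the one policy to every row at request time."""
    return [row for row in rows if abac_state_policy(user_attrs, row)]


rows = [
    {"id": 1, "state": "WA"},
    {"id": 2, "state": "OR"},
    {"id": 3, "state": "CA"},
]

# Onboarding a user means setting attributes, not writing a new policy.
single_state_user = {"name": "ana", "states": {"WA"}}
multi_state_user = {"name": "raj", "states": {"WA", "CA"}}
no_attrs_user = {"name": "new-hire"}

print(abac_filter(rows, single_state_user))  # only the WA row
print(abac_filter(rows, multi_state_user))   # the WA and CA rows
print(abac_filter(rows, no_attrs_user))      # nothing: deny by default
```

One rule covers every state and every combination of states, because the variation lives in the user’s attributes instead of in the policy list.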

Another definition for ABAC from NIST is listed below.

“Provisioning ABAC describes attributes to subjects and objects governed by an access control rule set that specifies what operations can take place, this capability enables object owners or administrators to apply access control policy without prior knowledge of the specific subject and for an unlimited number of subjects that might require access.”

Who uses Immuta?

Many large and important organizations choose Immuta for their secure data access needs. Some of the most significant Immuta customers include the U.S. Army, S&P Global, Sony, and the Mercedes-Benz Group. For a full list of Immuta clients, see the figure below:

Who Are Immuta’s Competitors?

Privacera

Privacera is a unified, automated data access governance platform. It was founded in 2014 and closed a $50 million Series B in 2021 as it builds momentum in the landscape of growing data privacy concerns. Privacera enhances data access governance for Google BigQuery, Starburst Enterprise, and their cloud partner ecosystem. BigQuery is a fully managed data warehouse vital to Google Cloud’s analytics. With Privacera, users get a comprehensive set of tools to define, manage, and enforce access control across projects, datasets, tables, columns, and views in BigQuery from a single location.

The line between Privacera and Immuta is drawn based on which data access approach each uses. Privacera uses an Apache-based RBAC model, while Immuta uses an ABAC model for more dynamic roles and policies.

Raito

Raito makes data teams more productive by simplifying data access management. Raito’s productivity-first approach provides observability, collaboration and automation of data access requests, all whilst respecting privacy and security. Raito gives data analysts faster access to data, removes the burdensome data access requests from data engineers, and gives data governance the tools to control, monitor and report on data usage.

What makes Raito unique is its approach to Grassroots Governance: it starts from a 360° view of your current data access controls and data usage, and helps you improve your data access management maturity through use of the product. Raito’s Policy Recommender detects overprivileged users, trends in data access requests and approvals, unused datasets, and conflicts in your data access controls, and makes targeted recommendations to improve your data access controls and metadata. This way Raito grows as you grow, allowing you to transition from ACL to ABAC at your own pace.

Internal IT Teams

The largest competitor for any vendor is always internal IT teams. Facebook built its own solution, which required us all to manage data access requests, and other companies will likely use a combination of Active Directory and database-native security functionality to create their own groupings.

Security Moving Forward

Security and governance have been brought to the forefront of many companies’ data strategies. As GDPR, CCPA, and other laws are implemented, companies are working quickly to ensure that data is being used appropriately. Tools like Immuta can make that simpler.

How does your team manage security?

