Azure Synapse Analytics – new insights into data security
Azure Synapse Analytics is a new product in the Microsoft Azure portfolio. It adds a whole new control plane layer over well-known services such as Azure SQL Data Warehouse (rebranded to Dedicated SQL Pool), integrated Data Factory pipelines and Azure Data Lake Storage, and it introduces new components such as Serverless SQL Pools and Spark Pools. The integrated Azure Synapse Workspace helps handle security and data protection in one place for all data lake, data analytics and warehousing needs, but it also requires learning some new concepts. At GFT, working with financial institutions all over the world, we pay particular attention to the security aspects of the solutions we provide to our customers. Synapse Analytics is a welcome new tool in this area.
The new Workspace portal
The first visible difference, compared to other services, is that Synapse Analytics has a separate Workspace portal: https://web.azuresynapse.net/, which provides access to code, notebooks, SQL, pipelines, monitoring and management panels. The portal is available on the public Internet, with Azure AD access controls governing access to any Synapse Analytics instance in any tenant we have access to. However, Synapse Analytics introduces a new way to reach the portal from Internet-isolated, on-premises networks and offices: Private Link Hubs. Unlike Private Links, which protect access to services and databases, this solution routes traffic to the web portal. In conjunction with an Azure AD Conditional Access policy, the new Synapse Analytics Workspace can be protected with both network and authentication policies.
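As a sketch of how a Private Link Hub might be wired up with the Azure CLI (resource names are illustrative and flag names may differ between CLI versions):

```shell
# Create a Synapse Private Link Hub (names illustrative).
az synapse private-link-hub create \
  --resource-group my-rg \
  --name myplhub \
  --location westeurope

# Then create a Private Endpoint in an internal vNET targeting the hub,
# so that web.azuresynapse.net can resolve to a private IP on-premises.
az network private-endpoint create \
  --resource-group my-rg \
  --name myplhub-pe \
  --vnet-name corp-vnet \
  --subnet pe-subnet \
  --private-connection-resource-id \
    "$(az synapse private-link-hub show -g my-rg -n myplhub --query id -o tsv)" \
  --group-id Web \
  --connection-name myplhub-conn
```

Private DNS still has to be configured so the portal hostname resolves to the endpoint's private IP.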
In terms of authorization, Synapse Analytics introduces a new concept – Synapse Roles. This is an additional layer of role assignment on the Analytics Workspace and its internal items. There are:
- Azure Roles – control access for configuring the Azure Synapse Analytics resource in the Azure Portal (configuration, network, security, creating pools, diagnostic logs, etc.). There are no Synapse-specific built-in Azure roles.
- Synapse Roles – resolved inside Synapse Portal and API, for example, Synapse Spark Administrator, Synapse Artifact Publisher, Synapse Artifact User, Synapse Compute Operator and more. These roles can be assigned at workspace level as well as at the level of individual Spark Pool, Integration Runtime, Linked Service or Credential.
The granularity of role assignments allows for detailed access control for administrators and support specialists as well as data scientists and developers. Creating custom Synapse Roles is currently not supported, and most Synapse Roles are in preview, which means users must be prepared for changes. It is worth mentioning that these roles do not protect access to data sets in ADLS.
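A minimal sketch of assigning built-in Synapse Roles with the Azure CLI (names are illustrative; item-scoped assignment flags depend on the CLI version):

```shell
# Assign a built-in Synapse role at workspace level.
az synapse role assignment create \
  --workspace-name my-synapse-ws \
  --role "Synapse Artifact Publisher" \
  --assignee dev-team@contoso.com

# Scope a role to a single item, e.g. one Spark pool
# (--item-type/--item availability depends on CLI version).
az synapse role assignment create \
  --workspace-name my-synapse-ws \
  --role "Synapse Compute Operator" \
  --assignee ops-team@contoso.com \
  --item-type bigDataPools \
  --item mysparkpool
```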
Azure Synapse Analytics is a PaaS solution, and this is most apparent when using Serverless SQL Pools, Spark Pools and Integration Runtimes. The compute part of the platform is provided by Microsoft and never exists inside the owner's subscription. The same applies even to the more "traditional" data warehouse with Dedicated SQL Pools. As a result, in the default configuration, Synapse Analytics uses public endpoints for communication and cannot connect to isolated vNETs.
The alternative deployment model with a Managed Virtual Network should be considered when using Synapse in an environment with higher network isolation standards. With this solution, all Synapse compute components (SQL Pools, Spark Pools and Integration Runtimes) can use Managed Private Endpoints to connect to other services (databases, storage accounts) that allow access through private IPs only. Enabling the additional Data Exfiltration Protection forces Synapse runtimes deployed to the virtual network to communicate over private endpoints at all times and prevents access to any external resources.
It is worth mentioning that the Managed Virtual Network for Synapse is not a regular vNET. It exists in an external subscription managed by Microsoft, with no access for users. As a result, the network can be used neither for peering nor for VPN gateway configuration. The Managed Private Endpoint feature in Synapse creates a Private Endpoint in this vNET directly from the Synapse management panel.
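The managed network and exfiltration protection are chosen at workspace creation time. A hedged sketch with the Azure CLI (all names and passwords are placeholders; flag names may vary between CLI versions):

```shell
# Create a workspace with a Managed Virtual Network and data
# exfiltration protection enabled (values illustrative).
az synapse workspace create \
  --resource-group my-rg \
  --name my-synapse-ws \
  --location westeurope \
  --storage-account mydatalake \
  --file-system synapsefs \
  --sql-admin-login-user sqladmin \
  --sql-admin-login-password "<password>" \
  --enable-managed-virtual-network true \
  --prevent-data-exfiltration true
```

Neither setting can be switched on later for an existing workspace, which is one reason this decision is strategic.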
A Managed Private Endpoint is a distinct feature, different from both a Private Endpoint Connection and a Private Link Hub:
- Managed Private Endpoint – allows Synapse runtime (for example Spark or Integration Runtime) to access services in Azure with limited access (for example Azure SQL, storage accounts),
- Private Endpoint Connection – allows accessing Synapse storage and runtime (for example Serverless or Provisioned SQL) from other vNETs. This is no different than Private Endpoint in any other PaaS database in Azure,
- Private Link Hub – allows accessing the Synapse Workspace portal from internal networks.
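Creating a Managed Private Endpoint can be sketched as follows (resource IDs and names are illustrative; the exact JSON shape and flags may vary by CLI version):

```shell
# Definition of a Managed Private Endpoint targeting a storage account,
# created from the Synapse workspace towards the data source.
cat > mpe.json <<'EOF'
{
  "privateLinkResourceId": "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mydatalake",
  "groupId": "dfs"
}
EOF

az synapse managed-private-endpoints create \
  --workspace-name my-synapse-ws \
  --pe-name mydatalake-mpe \
  --file @mpe.json

# The owner of the target resource must still approve the pending
# connection on the storage account side.
```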
Enabling the managed virtual network and data exfiltration protection will also prevent installing external Spark libraries from public repositories, which is usually in line with security requirements anyway.
Synapse Managed Network does not apply to the default Azure Storage used by Synapse. All regular Azure Storage network protection mechanisms are applicable for this storage, including using Private Links, disabling public access, enforcing TLS 1.2 and using Managed Identities for authorization.
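Those standard protections for the default storage account can be applied with the regular storage commands, for example (account name illustrative):

```shell
# Harden the workspace's default ADLS account: enforce TLS 1.2,
# disable anonymous blob access and deny public network access
# by default.
az storage account update \
  --resource-group my-rg \
  --name mydatalake \
  --min-tls-version TLS1_2 \
  --allow-blob-public-access false \
  --default-action Deny
```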
Currently, Synapse Workspace with Managed Virtual Network cannot be used together with Azure Synapse Link for Azure Cosmos DB.
Data access control
Controlling access to datasets is usually the most complicated aspect of security in heterogeneous environments that use different tools to address business-critical warehousing and reporting solutions, as well as data analytics and data science environments. Using Synapse Workspace simplifies the process as the same controls can be used with Azure Active Directory. Having a centralized place for access control is one of the most important aspects required for proper access control and attestation.
- For SQL access, Azure AD groups can be used for authorization with integrated AAD-provided authentication rather than internal database local accounts. Database objects (tables, views, etc.) support access controls that are compatible with Microsoft SQL Server/Azure SQL model,
- For Azure Data Lake, access control is based on Azure roles (Storage Blob Data Reader/Owner/Contributor) as well as in the Hierarchical Namespace feature, which allows for setting up read and update access for AAD groups on the level of directory hierarchy, like in a filesystem. This is a unique feature of ADLS Gen2.
When working with Azure Data Lake, authentication pass-through takes place: data is accessed using the user's own principal identity and permissions.
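Directory-level ACLs in an ADLS Gen2 hierarchical namespace can be managed with the CLI; a sketch with illustrative names (note that `az storage fs access set` replaces the ACL on the path):

```shell
# Look up the object ID of an AAD group (property name is "id" in
# recent CLI versions, "objectId" in older ones).
GROUP_ID=$(az ad group show --group data-readers --query id -o tsv)

# Grant the group read+execute on a directory, filesystem-style.
az storage fs access set \
  --account-name mydatalake \
  --file-system curated \
  --path sales/reports \
  --acl "group:${GROUP_ID}:r-x" \
  --auth-mode login
```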
Besides that, Azure Synapse Workspace has its system-assigned Managed Identity. This identity can be used for password-less access to external data sources and targets for Data Pipelines, as well as external Data Lake storage (alternatively to authentication pass-through). This, combined with earlier mentioned Synapse Roles on the Linked Services level, allows for detailed data access control for external sources as well.
Just like almost any other Azure data solution, Synapse Analytics supports built-in encryption at rest with customer-managed encryption keys. For Azure Data Lake Storage, customer-managed keys are configured in the same way as for any regular Azure Storage account outside Synapse. For the Azure Synapse Workspace, a key provided through a Key Vault is used for SQL Pools, Spark Pools and Data Factory runtimes. For Dedicated SQL Pools, the Transparent Data Encryption feature can be enabled and disabled individually per pool. Synapse workspaces support RSA keys of 2048 and 3072 bits.
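The workspace key is set at creation time; a hedged sketch (key URL and names are placeholders; the workspace's managed identity needs get/wrap/unwrap permissions on the key, and flags may differ between CLI versions):

```shell
# Create a workspace encrypted with a customer-managed key
# held in Key Vault (values illustrative).
az synapse workspace create \
  --resource-group my-rg \
  --name my-cmk-ws \
  --location westeurope \
  --storage-account mydatalake \
  --file-system synapsefs \
  --sql-admin-login-user sqladmin \
  --sql-admin-login-password "<password>" \
  --key-identifier "https://my-kv.vault.azure.net/keys/synapse-cmk"
```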
Using Customer Managed Encryption Keys requires additional company policies introduced for key rotation and backups. Such policies are usually established across the organization and they are no different for Azure Synapse Analytics.
Dedicated SQL Pools
The Dedicated SQL Pool service is a re-branded Azure SQL Warehouse service with all security features already known and shared with Azure SQL Database. Besides SQL access controls and Transparent Data Encryption, Synapse provides advanced security features:
- Azure Defender for SQL with the Vulnerability Assessment tool, which provides automated evaluation of permission and feature configuration as well as database settings, with an integrated view in Azure Security Center,
- Data Discovery and Classification – an engine that scans databases to identify columns storing potentially sensitive data that provides data labelling, reporting controls, as well as result-set sensitivity calculation in real-time,
- Dynamic Data Masking – allows masking columns in result sets for users who are not allowed to access sensitive data, including names, credit card numbers, emails and custom data types.
These are proven security controls, familiar to Azure SQL database administrators from services already offered by Azure.
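For example, a sensitivity label from Data Discovery and Classification can be applied to a column in a dedicated pool from the CLI (names illustrative; command availability depends on the CLI version):

```shell
# Classify a column in a dedicated SQL pool as confidential
# contact information.
az synapse sql pool classification create \
  --resource-group my-rg \
  --workspace-name my-synapse-ws \
  --name mydedicatedpool \
  --schema dbo \
  --table Customers \
  --column Email \
  --information-type "Contact Info" \
  --label "Confidential"
```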
Data integration pipelines
Azure Synapse Analytics incorporates the Data Pipelines known from Azure Data Factory, including the same concept of Integration Runtimes (excluding the SSIS IR). It is possible to use the Auto-Resolve Integration Runtime, the Azure Integration Runtime, as well as a Self-hosted IR. For a Managed Network-enabled Synapse Analytics Workspace, the Auto-Resolve IR is not applicable, but the Azure IR will be integrated with the managed network automatically. This provides the option to use Managed Private Endpoints for all data pipelines.
Integration Runtimes can authorize to Linked Services using the Synapse Analytics Workspace Managed Identity and act on behalf of the service. Alternatively, confidential data such as connection strings can be stored in a Key Vault.
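The Key Vault approach can be sketched with a linked service definition that references a secret instead of embedding the connection string (the JSON follows the Data Factory/Synapse linked service schema; all names are illustrative):

```shell
# A linked service whose connection string is pulled from Key Vault
# at runtime rather than stored in the workspace.
cat > sql-ls.json <<'EOF'
{
  "name": "ExternalSqlDb",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "MyKeyVaultLS",
          "type": "LinkedServiceReference"
        },
        "secretName": "sql-connection-string"
      }
    }
  }
}
EOF

az synapse linked-service create \
  --workspace-name my-synapse-ws \
  --name ExternalSqlDb \
  --file @sql-ls.json
```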
Data loss protection
Protection of data is not just about preventing unauthorized data access, but also about guarding against data loss or corruption. In Synapse Analytics all data can be stored using two storage engines: Azure Data Lake Storage Gen2 and Dedicated SQL Pools.
Data protection in ADLS Gen2 is based on Azure Storage redundancy options:
- Zone-redundant storage (ZRS) and geo-redundant storage (GRS), which store redundant copies of data in separate Availability Zones or regions to avoid data loss in case of a local data centre failure or an entire zone or region outage,
- The read-access variants, read-access geo-redundant (RA-GRS) and read-access geo-zone-redundant (RA-GZRS) storage, which additionally allow accessing the data in read-only mode during a regional outage, improving the data availability SLA.
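Choosing a redundancy option is a single flag at account creation; for example, an ADLS Gen2 account with read-access geo-zone-redundant storage (names illustrative):

```shell
# Create an ADLS Gen2 account (hierarchical namespace enabled)
# with RA-GZRS redundancy.
az storage account create \
  --resource-group my-rg \
  --name mydatalake \
  --location westeurope \
  --kind StorageV2 \
  --sku Standard_RAGZRS \
  --enable-hierarchical-namespace true
```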
For data stored in Dedicated SQL Pools, backups are taken automatically and retained for 7 days in geographically redundant storage. This configuration cannot be modified. It is worth mentioning that in a modern multi-layer Data Lake architecture, the warehouse layer should be designed so that it can always be re-created from the underlying Data Lake layer, which provides a data redundancy strategy as well.
Protecting code, notebooks, metadata and pipeline definitions from loss is a matter of establishing proper Infrastructure as Code, CI/CD and Git integration for the Synapse Analytics Workspace. Just like Data Factory, Synapse Analytics has built-in Git integration for Azure DevOps as well as GitHub, including GitHub Enterprise Server. One known limitation here is that establishing a connection to a privately hosted Git, reachable through a private IP only, is not possible.
Azure Synapse Analytics brings together multiple data storage and analysis tools with security concepts embedded. With a single control plane, the workspace makes it easier to manage security. However, all aspects of defence in depth remain relevant and must be considered. As you can see, several key strategic decisions need to be made carefully, concerning aspects such as:
- Enabling Managed Network,
- Enabling Data Exfiltration Protection,
- Working with Private Endpoints,
- Using Customer Managed Encryption Keys,
- Enabling Azure Defender for Dedicated SQL Pools.
As usual, enabling additional security features incurs additional costs and introduces management overhead. These decisions should be preceded by a business impact analysis assessing the financial, legal and reputational loss in case data is corrupted, lost, compromised or unavailable. Finding the right balance between security measures, cost, maintenance and risk is one of the key aspects of moving data to the cloud, and it should be worked out jointly by architects, legal teams and business representatives.