Performance optimization for AWS Glue using Microsoft SQL Server Spark connector

By sleroy April 9, 2025

AWS Glue users working with Microsoft SQL Server frequently encounter two significant challenges in their ETL workflows: performance bottlenecks during large-scale operations and limited authentication options. The default JDBC connector, while functional for basic operations, often falls short for enterprises requiring high-performance data processing and advanced authentication methods.

The Microsoft SQL Server Spark connector emerges as a powerful solution to these limitations. During our testing with large-scale datasets, we observed ETL operations executing several times faster than the default connector. Beyond performance improvements, the connector provides native Active Directory authentication support, enhanced connection management capabilities, and improved reliability through configurable consistency levels.

Key advantages of the Microsoft Spark connector

The Microsoft Spark connector delivers superior performance through several architectural optimizations. At its core, the connector implements native bulk copy protocol support, which significantly reduces data transfer overhead. This approach, combined with intelligent memory management and efficient data type handling, minimizes network roundtrips and optimizes resource utilization.
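The bulk-copy batching described above surfaces as ordinary write options. The sketch below (a helper returning an options dict) shows one hedged way to tune the batch size; the `url` and `dbtable` values are placeholders, and `batchsize` follows the standard Spark JDBC option name that the connector builds on.

```python
# Sketch: write options that exercise the connector's bulk-copy path.
# url/dbtable are placeholders; adjust batch size to your row width.
def bulk_write_options(batch_size: int = 100000) -> dict:
    """Return write options favouring large bulk-insert batches."""
    return {
        "url": "jdbc:sqlserver://your-server:1433;databaseName=your_db",
        "dbtable": "dbo.target_table",
        "batchsize": str(batch_size),  # rows sent per bulk-copy round trip
    }

# Usage inside a Glue job (not executed here):
# df.write.format("com.microsoft.sqlserver.jdbc.spark") \
#     .options(**bulk_write_options()) \
#     .mode("append") \
#     .save()
```

Larger batches reduce network roundtrips at the cost of executor memory, so the right value depends on row size and cluster configuration.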

Implementation guide

The implementation process begins with obtaining the connector JAR and making it available to AWS Glue.

1. Obtain the official Microsoft SQL Server Spark connector

Download the JAR from the official GitHub repository, then verify its integrity:

Example of verifying the file checksum on Linux/macOS

sha256sum spark-mssql-connector_2.12-1.4.0-BETA.jar 

2. Upload the connector to Amazon S3

After downloading the connector, create a dedicated S3 bucket and upload the JAR file:

Create new S3 bucket

aws s3 mb s3://your-company-glue-dependencies

Upload the connector JAR file

aws s3 cp spark-mssql-connector_2.12-1.4.0-BETA.jar s3://your-company-glue-dependencies/jars/

3. Grant your AWS Glue execution role access

Your AWS Glue execution role requires specific permissions to read the connector JAR from S3. Attach the following IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-company-glue-dependencies",
                "arn:aws:s3:::your-company-glue-dependencies/*"
            ]
        }
    ]
}
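If you manage IAM programmatically, the same policy document can be generated in Python and attached with the AWS CLI or boto3. The sketch below builds the JSON; the role and policy names in the commented attach command are placeholders.

```python
import json

def jar_access_policy(bucket: str) -> str:
    """Build the IAM policy JSON granting read access to the connector bucket."""
    doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",       # ListBucket applies to the bucket
                f"arn:aws:s3:::{bucket}/*",     # GetObject applies to the objects
            ],
        }],
    }
    return json.dumps(doc, indent=2)

policy_json = jar_access_policy("your-company-glue-dependencies")

# Attach it to the job's execution role (role/policy names are placeholders):
# aws iam put-role-policy --role-name your-glue-execution-role \
#     --policy-name glue-connector-jar-access \
#     --policy-document file://policy.json
```

Note that `s3:ListBucket` must target the bucket ARN while `s3:GetObject` targets the object ARNs, which is why both `Resource` entries are needed.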

4. Code migration

Transitioning from the default JDBC driver to the Microsoft Spark connector requires minimal code modifications. The primary change involves updating the format specification in your Spark DataFrame operations. Here’s a practical comparison:

Previous approach with AWS Glue JDBC

df.write \
  .format("jdbc") \
  .option("url", "jdbc:sqlserver://...") \
  .save()

Enhanced approach with Microsoft Spark Connector

df.write \
  .format("com.microsoft.sqlserver.jdbc.spark") \
  .option("url", "jdbc:sqlserver://...") \
  .save()

When creating or editing your AWS Glue job, you need to specify the path to the connector JAR. In the AWS Glue console, enter it in the Dependent JARs path field:

s3://your-company-glue-dependencies/jars/spark-mssql-connector_2.12-1.4.0-BETA.jar
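The same dependency path can also be set programmatically when the job is created or updated. The helper below builds the `--extra-jars` default argument; the job and role names in the commented boto3 call are placeholders.

```python
# Assumed location of the uploaded connector JAR (see the upload step above).
JAR_PATH = ("s3://your-company-glue-dependencies/jars/"
            "spark-mssql-connector_2.12-1.4.0-BETA.jar")

def job_arguments(jar_path: str = JAR_PATH) -> dict:
    """Default arguments that make AWS Glue load the connector JAR."""
    return {"--extra-jars": jar_path}

# Sketch of applying it with boto3 (job/role names are placeholders):
# import boto3
# glue = boto3.client("glue")
# glue.update_job(
#     JobName="your-etl-job",
#     JobUpdate={
#         "Role": "your-glue-execution-role",
#         "Command": {"Name": "glueetl", "ScriptLocation": "s3://..."},
#         "DefaultArguments": job_arguments(),
#     },
# )
```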

5. Add job parameters (optional)

If you want to make your job more configurable, you can add job parameters for the connector path:

import sys

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['extra-jars'])
extra_jars = args['extra-jars']

Performance optimization features

The Microsoft SQL Server Spark connector brings advanced performance capabilities that seamlessly integrate with AWS Glue's native features.

Connector-specific optimizations

The connector provides granular control over transaction isolation levels, defaulting to READ_COMMITTED. For workloads requiring different isolation characteristics, you can easily modify this setting through configuration:

df.write \
  .format("com.microsoft.sqlserver.jdbc.spark") \
  .option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
  .option("url", "jdbc:sqlserver://...") \
  .save()

One of the most powerful performance features is the TABLOCK optimization. By enabling table-level locking during write operations, we’ve observed dramatic performance improvements in our testing. Here’s how to implement this optimization:

df.write \
  .format("com.microsoft.sqlserver.jdbc.spark") \
  .option("url", "jdbc:sqlserver://...") \
  .option("tableLock", "true") \
  .option("reliabilityLevel", "BEST_EFFORT") \
  .save()

The connector offers two distinct reliability modes to balance performance and data consistency. The default “BEST_EFFORT” mode prioritizes performance, while “NO_DUPLICATES” ensures data consistency during executor restarts. This flexibility allows you to align the connector’s behavior with your specific requirements and has a real impact on your Glue script performance.
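The trade-off between the two modes can be captured in a small helper that selects the option value per workload. This is a sketch: the `url` and `dbtable` values are placeholders, and only the `reliabilityLevel` option names come from the connector's documented surface.

```python
# Sketch: pick the connector's reliability level per workload.
def reliability_options(exactly_once: bool) -> dict:
    """BEST_EFFORT maximizes throughput; NO_DUPLICATES avoids re-inserting
    rows when an executor restarts, at some performance cost."""
    return {
        "url": "jdbc:sqlserver://your-server:1433;databaseName=your_db",
        "dbtable": "dbo.target_table",
        "reliabilityLevel": "NO_DUPLICATES" if exactly_once else "BEST_EFFORT",
    }

# Usage inside a Glue job (not executed here):
# df.write.format("com.microsoft.sqlserver.jdbc.spark") \
#     .options(**reliability_options(exactly_once=True)) \
#     .mode("append") \
#     .save()
```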

The connector includes built-in schema validation that can be adjusted based on your requirements. For scenarios where strict schema checking isn’t necessary, you can disable this feature:

df.write \
  .format("com.microsoft.sqlserver.jdbc.spark") \
  .option("url", "jdbc:sqlserver://...") \
  .option("schemaCheckEnabled", "false") \
  .save()

Integration with AWS Glue Native Features

AWS Glue’s DynamicFrame operations can be seamlessly combined with the connector’s capabilities. This integration enables powerful ETL transformations while maintaining the performance benefits of the connector:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import ApplyMapping

dynamic_frame = DynamicFrame.fromDF(df, glueContext, "dynamic_frame_name")
mapped_frame = ApplyMapping.apply(
    frame=dynamic_frame,
    mappings=your_mappings
)
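As a concrete illustration, the `your_mappings` placeholder above takes a list of four-element tuples in the format `ApplyMapping` expects: source column, source type, target column, target type. The column names below are hypothetical.

```python
# Hypothetical mapping list for ApplyMapping:
# (source column, source type, target column, target type)
your_mappings = [
    ("order_id", "long", "order_id", "long"),
    ("order_ts", "string", "order_ts", "timestamp"),  # cast string -> timestamp
    ("amount", "double", "amount", "double"),
]
```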

For incremental processing scenarios, the connector works effectively with AWS Glue job bookmarks. This integration helps manage state across job runs:

from awsglue.job import Job

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... ETL logic ...

job.commit()  # Persists bookmark state for the next run

Conclusion

The Microsoft SQL Server Spark connector for AWS Glue represents a significant advancement in enterprise ETL capabilities, addressing two critical challenges: performance optimization and authentication flexibility. By removing the previous limitations around Windows Authentication, organizations can now maintain their existing security protocols while leveraging AWS Glue’s powerful data processing capabilities.

Through our implementation across multiple production workloads, we've observed substantial performance improvements that particularly benefit large-scale data processing scenarios. Organizations using AWS Glue to process millions of rows from SQL Server databases have seen notable success with this connector, especially those requiring Windows Authentication, which was previously not possible with the default JDBC connector.