When we use ADF (Azure Data Factory) to call Databricks we can pass parameters, nice. When the Databricks notebook finishes running we often want to return something back to ADF so that ADF can do something with it. Think of a notebook that creates a file with 100 rows in it (OK, big data, 1,000 rows); we might then want to move that file or write a log entry to say that 1,000 rows have been written.
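To return something, the notebook can finish with dbutils.notebook.exit(), and ADF can read the value from the activity output (for a Databricks Notebook activity it surfaces as @activity('<activity name>').output.runOutput). A minimal sketch, assuming a Databricks notebook where spark and dbutils are already defined; the path and property names are made up for illustration:

import json

# Count the rows we just wrote so the caller can log or act on the number.
row_count = spark.read.parquet("/mnt/output/sales").count()

# dbutils.notebook.exit() ends the notebook and hands this string back to ADF,
# which exposes it as the activity's runOutput.
dbutils.notebook.exit(json.dumps({"rowCount": row_count, "path": "/mnt/output/sales"}))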
When you use delta lake there are a couple of interesting things to note, based around the fact that the data is stored in parquet files, which are read-only, while delta lake includes the ability to delete and update data and to view the state of a table at a specific point in time. Obviously read-only files and updates and deletes don’t exactly sound like they work together, so how does it all work and what do we need to be aware of?
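As a taster, here is a minimal PySpark sketch of an update followed by time travel, assuming a session with the delta extensions available (on Databricks they are there by default; elsewhere you need the delta-spark package); the path is just an example:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/tmp/delta-demo"  # illustrative path

# Write version 0 of the table.
spark.range(0, 1000).write.format("delta").mode("overwrite").save(path)

# "Update" some rows - under the covers delta writes new parquet files and the
# transaction log records which files are now current; nothing is changed in place.
DeltaTable.forPath(spark, path).update(condition="id < 10", set={"id": "id + 1000"})

# Time travel: read the table as it looked at version 0, before the update.
original = spark.read.format("delta").option("versionAsOf", 0).load(path)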
It is a non-null constraint, not a non-ish-null constraint
You are writing an ETL process; as part of this process you need to import a semi-structured file (think CSV, JSON, XM-bleurgh-L, etc.). When you import the data into an RDBMS you get all sorts of things that make schema designers excited, like unique constraints and check constraints. The problem is that the file you are importing is from another system, and all “other” systems in the world make mistakes and changes and send you duff data that won’t work with your lovely constraints. The problem we have is that an RDBMS applies a constraint to the entire column at one time. This columnar approach to constraints means that, although you might have nine hundred and ninety-nine thousand rows that will pass the constraints, if you have even one single solitary row that fails, then the whole constraint fails. It is a non-null constraint, not a non-ish-null constraint.
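One way to deal with this before the data ever reaches the constraint is to split the feed into rows that will pass and rows that won't. A hedged sketch in PySpark; the file path and column name are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical feed file and key column, purely for illustration.
df = spark.read.option("header", "true").csv("/landing/customers.csv")

# Separate the rows that satisfy the target table's NOT NULL constraint from
# those that don't, so one duff row doesn't stop the rest of the file loading.
good = df.filter(F.col("customer_id").isNotNull())
bad = df.filter(F.col("customer_id").isNull())

good.write.mode("overwrite").parquet("/staging/customers")
bad.write.mode("overwrite").parquet("/quarantine/customers")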
It has been a little while, but I have updated SQLCover to include a number of fixes and small features, the majority of which are improvements to the HTML output:
For full details and to download the latest version see:
https://github.com/GoEddie/SQLCover/releases/tag/0.5.0
or
https://www.nuget.org/packages/GOEddie.SQLCover/0.5.0
If you get any issues please comment below or raise an issue on GitHub.
Highlights
Cobertura
Cobertura is a format for code coverage tools. Azure DevOps supports Cobertura files to display code coverage results alongside the build, so this is a really nice thing to be able to have. If you use SQLCover in your Azure DevOps builds (or any CI server that supports Cobertura files) then you can use the Cobertura output to generate this:
This is the final part in the four-part series on testing ETL pipelines, how exciting!
- Part 1 - Unit Testing https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/
- Part 2 - Integration Testing https://the.agilesql.club/2019/08/how-do-we-prove-our-etl-processes-are-correct-how-do-we-make-sure-upstream-changes-dont-break-our-processes-and-break-our-beautiful-data/
- Part 3 - Validating the upstream data https://the.agilesql.club/2019/09/how-do-test-the-upstream-data-is-good-in-an-etl-process-etl-testing-part-3/
This is the last step: you have documented your business logic with unit tests, you have validated your pipeline with sample data (good and bad), you have a step in your pipeline to ensure the upstream data meets your expectations, and you have deployed the code to production where, AND ONLY where, you can be confident the code works. Now that you have code in production you need to do the final type of testing, “testing in production”.
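What does testing in production look like for an ETL pipeline? Typically small checks that run after every load and shout when something looks wrong. A hypothetical sketch, with made-up paths, column names and expectations:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical post-load check: did today's load land any rows at all, and is
# the key column populated?
target = spark.read.parquet("/warehouse/sales/current")

row_count = target.count()
null_keys = target.filter(target["sale_id"].isNull()).count()

# In practice these expectations come from what you know about the feed.
assert row_count > 0, "production check failed: no rows loaded"
assert null_keys == 0, f"production check failed: {null_keys} rows have a null sale_id"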
“[Error] [JvmBridge] java.sql.SQLException: No suitable driver” - unable to connect spark to Microsoft SQL Server.
In Spark, when you want to connect to a database you use Read(), passing in the format “jdbc” and including the options url, driver and either dbtable or query.
DataFrame dataFrame = spark.Read()
.Format("jdbc")
.Option("url","jdbc:sqlserver://localhost;databaseName=dName;")
.Option("user", "user_name")
.Option("password", "password or secret")
.Option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.Option("dbtable", "schema.table_name")
.Load();
The url tells jdbc that we want to connect to sqlserver (jdbc:sqlserver) and then gives the details of the server to connect to. For spark to be able to find a driver for “sqlserver” you need to do two things: firstly, pass the jar containing the driver to spark, and secondly, pass in the name of the driver class that implements a connection to “sqlserver”.
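For example, in PySpark terms the same connection looks like the sketch below; the jar path and file name are illustrative (use whichever version of the Microsoft JDBC driver you downloaded), and if you launch via spark-submit you can pass the jar with the --jars argument instead of setting spark.jars:

from pyspark.sql import SparkSession

# Tell spark where the Microsoft JDBC driver jar lives.
spark = (SparkSession.builder
         .config("spark.jars", "/jars/mssql-jdbc-8.4.1.jre8.jar")
         .getOrCreate())

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://localhost;databaseName=dName;")
      .option("user", "user_name")
      .option("password", "password or secret")
      # The class inside the jar above that knows how to speak to "sqlserver".
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "schema.table_name")
      .load())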
In the first two parts of this series (https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/ and https://the.agilesql.club/2019/08/how-do-we-prove-our-etl-processes-are-correct-how-do-we-make-sure-upstream-changes-dont-break-our-processes-and-break-our-beautiful-data/), I talked about how to unit test your business logic and integration test your ETL infrastructure code. Having these tests ensures that your code is in order; it means you have documented and future-proofed your code, which is a fantastic thing to have. What testing our code doesn’t give us is a way to validate that the data we receive is correct.
I finally got around to updating the tSQLt test adapter for Visual Studio; you can download it from:
https://marketplace.visualstudio.com/items?itemName=vs-publisher-263684.GoEddietSQLt2019
or the search in the Visual Studio extensions thingy finds it as well.
For details on what this is and how it works see the original post:
https://the.agilesql.club/2016/08/tsqlt-visual-studio-test-adapter/
Steps needed
Getting Apache Spark running on Windows involves:
- Installing a JRE 8 (Java 1.8/OpenJDK 8)
- Downloading and extracting SPARK and setting SPARK_HOME
- Downloading winutils.exe and setting HADOOP_HOME
- If using the dotnet driver, also downloading the Microsoft.Spark.Worker and setting DOTNET_WORKER_DIR if you are going to use UDFs
- Making sure java and %SPARK_HOME%\bin are on your path
There are some pretty common mistakes people make (myself included!); the most common I have seen recently are having a semi-colon in JAVA_HOME/SPARK_HOME/HADOOP_HOME, or having HADOOP_HOME not pointing to a directory with a bin folder that contains winutils.exe. A quick sanity check like the one sketched below can catch these.
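A small, hedged Python sketch that checks the environment variables for those traps; the variable names come from the list above, everything else is just illustration:

import os

def check(var):
    # Check one of the Spark-related environment variables for the usual traps.
    value = os.environ.get(var)
    if not value:
        print(f"{var} is not set")
        return
    if ";" in value:
        print(f"{var} contains a semi-colon: {value!r}")
    if not os.path.isdir(value):
        print(f"{var} does not point to a directory: {value!r}")

for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    check(var)

# HADOOP_HOME must contain a bin folder with winutils.exe inside it.
hadoop_home = os.environ.get("HADOOP_HOME", "")
if hadoop_home and not os.path.isfile(os.path.join(hadoop_home, "bin", "winutils.exe")):
    print("HADOOP_HOME does not contain bin\\winutils.exe")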
When you run an application using spark-dotnet, you need to launch it with spark-submit, which starts a Java virtual machine, which starts the spark-dotnet driver, which then runs your program. That leaves us with a problem: how do we write our programs in Visual Studio and press F5 to debug?
There are two approaches. The first is one I have used for years with dotnet when I want to debug something that is hard to get a debugger attached to - think apps which spawn other processes that fail in their startup routine. You can add a Debugger.Launch() to your program; then, when spark executes it, a prompt will be displayed and you can attach Visual Studio to your program. (As an aside, I used to do this manually a lot by writing an __asm int 3 into an app to get it to break at an appropriate point; great memories, but we don’t need to do that anymore, luckily :).