What happens?
When running the following in Spark 3.3.x
linker.profile_columns(
    ["first_name", "surname"]
)
You get a "You hit a query analyzer bug." error.
You do not get this on Spark 3.2.1.
I've narrowed the source of the bug down to a UNION ALL combined with COUNT(DISTINCT ...); see the simplified SQL statement in the 'To Reproduce' section.
I have reported the bug to the Spark team here
To Reproduce
import pandas as pd
from pyspark.context import SparkContext, SparkConf
from pyspark.sql import SparkSession
import splink
from splink.spark.jar_location import similarity_jar_location

# Configure Spark with the similarity-functions jar bundled with Splink
path = similarity_jar_location()
conf = SparkConf()
conf.set("spark.jars", path)
conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")

# Start (or reuse) the Spark context and session
sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)
spark

from splink.spark.spark_linker import SparkLinker

# Load the demo dataset and build the linker
df_spark = spark.read.csv(
    "./tests/datasets/fake_1000_from_splink_demos.csv", header=True
)
linker = SparkLinker(df_spark)
import logging

# This call triggers the error on Spark 3.3.x
linker.profile_columns(
    ["first_name", "surname"]
)
sql = """
SELECT
    *
FROM (
    SELECT
        (
            SELECT
                COUNT(DISTINCT first_name)
            FROM __splink__input_table_0
        ) AS distinct_value_count
    FROM __splink__input_table_0
    GROUP BY
        first_name
)
UNION ALL
SELECT
    *
FROM (
    SELECT
        (
            SELECT
                COUNT(DISTINCT surname)
            FROM __splink__input_table_0
        ) AS distinct_value_count
    FROM __splink__input_table_0
    GROUP BY
        surname
)"""
spark.sql(sql).toPandas()
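The simplified statement follows one pattern per column: a scalar COUNT(DISTINCT ...) subquery inside a grouped select, with the per-column blocks joined by UNION ALL. As a rough sketch for testing other column combinations, the same shape can be generated from a list of column names. The helper name `build_repro_sql` is hypothetical and not part of Splink or Spark; it only builds a string and makes no claim about which combinations trigger the error.

```python
def build_repro_sql(columns, table="__splink__input_table_0"):
    """Build one COUNT(DISTINCT ...) block per column, joined with UNION ALL.

    Hypothetical helper for illustration only; it mirrors the shape of the
    simplified statement in this report and is not a Splink or Spark API.
    """
    if not columns:
        raise ValueError("at least one column is required")

    blocks = []
    for col in columns:
        # Each block wraps a scalar subquery in a select grouped by the column.
        blocks.append(
            "SELECT *\n"
            "FROM (\n"
            "    SELECT (\n"
            f"        SELECT COUNT(DISTINCT {col})\n"
            f"        FROM {table}\n"
            "    ) AS distinct_value_count\n"
            f"    FROM {table}\n"
            f"    GROUP BY {col}\n"
            ")"
        )
    # The report narrows the problem down to blocks combined via UNION ALL.
    return "\nUNION ALL\n".join(blocks)


sql = build_repro_sql(["first_name", "surname"])
print(sql)
```

The resulting string can be passed to `spark.sql(...)` in place of the hand-written statement. To try it without going through the linker, the input DataFrame would presumably need to be registered under the same name first, for example with `df_spark.createOrReplaceTempView("__splink__input_table_0")`.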
OS:
iOS
Splink version:
3.6.0
Have you tried this on the latest master branch?
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?