
Count distinct bug in Spark 3.3.x causing "You hit a query analyzer bug." #1021

@RobinL


What happens?

When running the following in Spark 3.3.x:

linker.profile_columns(
    ["first_name", "surname"]
)

You get a "You hit a query analyzer bug." error.

You do not get this on Spark 3.2.1.
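
A quick way to tell whether a given session falls in the release line reported here is to look at its version string. The helper below is a minimal sketch: the name `in_reported_failing_range` is made up for illustration, and it encodes only what this issue states (3.3.x fails, 3.2.1 does not). It says nothing about any other release.

```python
def in_reported_failing_range(version: str) -> bool:
    """Return True if `version` is a 3.3.x release, the line this issue reports as failing.

    Hypothetical helper: a False result only means the version is outside
    the reported range, not that it has been verified as unaffected.
    """
    major_minor = version.split(".")[:2]
    return major_minor == ["3", "3"]


print(in_reported_failing_range("3.3.1"))
print(in_reported_failing_range("3.2.1"))
```

With a live session, the string to pass in would be `spark.version`.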

I've narrowed the source of the bug down to a UNION ALL combined with COUNT(DISTINCT ...); see the simplified SQL statement in the 'To Reproduce' section.

I have reported the bug to the Spark team here

To Reproduce

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

from splink.spark.jar_location import similarity_jar_location
from splink.spark.spark_linker import SparkLinker

# Point Spark at the jar returned by similarity_jar_location()
conf = SparkConf()
conf.set("spark.jars", similarity_jar_location())
conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")

sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)

df_spark = spark.read.csv(
    "./tests/datasets/fake_1000_from_splink_demos.csv", header=True
)

linker = SparkLinker(df_spark)

# Fails on Spark 3.3.x with "You hit a query analyzer bug."
linker.profile_columns(["first_name", "surname"])

sql = """
SELECT
  *
FROM (
  SELECT
    (
      SELECT
        COUNT(DISTINCT first_name)
      FROM __splink__input_table_0
    ) AS distinct_value_count
  FROM __splink__input_table_0

  GROUP BY
    first_name

)
UNION ALL
SELECT
  *
FROM (
  SELECT

    (
      SELECT
        COUNT(DISTINCT surname)
      FROM __splink__input_table_0
    ) AS distinct_value_count
  FROM __splink__input_table_0

  GROUP BY
    surname

)"""

spark.sql(sql).toPandas()
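
To check the narrowing described above, it may help to build the statement from its parts so that each branch can also be run on its own, without the UNION ALL. The following is a rough sketch, not Splink's own SQL generation: `branch_sql` and `union_all_sql` are hypothetical helper names, and whether the single branches succeed on Spark 3.3.x is an assumption that has not been verified here.

```python
from typing import List

TABLE = "__splink__input_table_0"  # table name taken from the SQL in this issue


def branch_sql(column: str, table: str = TABLE) -> str:
    """One branch of the reproduction: a scalar COUNT(DISTINCT ...) subquery under a GROUP BY."""
    return (
        "SELECT * FROM (\n"
        f"  SELECT (SELECT COUNT(DISTINCT {column}) FROM {table}) AS distinct_value_count\n"
        f"  FROM {table}\n"
        f"  GROUP BY {column}\n"
        ")"
    )


def union_all_sql(columns: List[str], table: str = TABLE) -> str:
    """All branches joined with UNION ALL, the shape reported to fail on Spark 3.3.x."""
    return "\nUNION ALL\n".join(branch_sql(c, table) for c in columns)


columns = ["first_name", "surname"]
branches = [branch_sql(c) for c in columns]  # candidates to run one at a time
combined = union_all_sql(columns)            # same shape as the statement above

# With a live session (hypothetical usage, not executed here):
# for statement in branches:
#     spark.sql(statement).toPandas()
# spark.sql(combined).toPandas()
```

If the single branches run cleanly while `combined` raises the error, that would be consistent with the UNION ALL being a necessary ingredient.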

OS:

iOS

Splink version:

3.6.0

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Labels: bug
