diff --git a/README.md b/README.md index 3cfaae7..7939491 100644 --- a/README.md +++ b/README.md @@ -174,7 +174,8 @@ someRuleSet.addMinMaxRules("Retail_Price_Validation", col("retail_price"), Bound ### Categorical Rules There are two types of categorical rules which are used to validate against a pre-defined list of valid values. As of 0.2 accepted categorical types are String, Double, Int, Long but any types outside of this can -be input as an array() column of any type so long as it can be evaulated against the intput column +be input as an array() column of any type so long as it can be evaluated against the input column. + ```scala val catNumerics = Array( Rule("Valid_Stores", col("store_id"), Lookups.validStoreIDs), @@ -187,6 +188,18 @@ Rule("Valid_Regions", col("region"), Lookups.validRegions) ) ``` +An optional `ignoreCase` parameter can be specified when evaluating against a list of String values to control +case sensitivity. By default, input columns are evaluated against a list of Strings with case sensitivity applied. +```scala +Rule("Valid_Regions", col("region"), Lookups.validRegions, ignoreCase=true) +``` + +Furthermore, the evaluation of categorical rules can be inverted by specifying `invertMatch=true` as a parameter. +This can be handy when defining a Rule that an input column cannot match a list of invalid values. For example: +```scala +Rule("Invalid_Skus", col("sku"), Lookups.invalidSkus, invertMatch=true) +``` + ### Validation Now that you have some rules built up... it's time to build the ruleset and validate it. 
As mentioned above, the dataframe can be a simple df or a grouped df by passing column[s] to perform validation at the diff --git a/demo/Example.scala b/demo/Example.scala index ec278ed..1f52a12 100644 --- a/demo/Example.scala +++ b/demo/Example.scala @@ -50,11 +50,12 @@ object Example extends App with SparkSessionWrapper { val catNumerics = Array( Rule("Valid_Stores", col("store_id"), Lookups.validStoreIDs), - Rule("Valid_Skus", col("sku"), Lookups.validSkus) + Rule("Valid_Skus", col("sku"), Lookups.validSkus), + Rule("Invalid_Skus", col("sku"), Lookups.invalidSkus, invertMatch=true) ) val catStrings = Array( - Rule("Valid_Regions", col("region"), Lookups.validRegions) + Rule("Valid_Regions", col("region"), Lookups.validRegions, ignoreCase=true) ) //TODO - validate datetime @@ -76,18 +77,18 @@ object Example extends App with SparkSessionWrapper { .withColumn("create_dt", 'create_ts.cast("date")) // Doing the validation - // The validate method will return the rules report dataframe which breaks down which rules passed and which - // rules failed and how/why. The second return value returns a boolean to determine whether or not all tests passed -// val (rulesReport, passed) = RuleSet(df, Array("store_id")) - val (rulesReport, passed) = RuleSet(df) + // The validate method will return two reports - a complete report and a summary report. + // The complete report is verbose and will add all rule validations to the right side of the original + // df passed into RuleSet, while the summary report will contain all of the rows that failed one or more + // Rule evaluations. 
+ val validationResults = RuleSet(df) .add(specializedRules) .add(minMaxPriceRules) .add(catNumerics) .add(catStrings) - .validate(2) + .validate() - rulesReport.show(200, false) -// rulesReport.printSchema() + validationResults.completeReport.show(200, false) } diff --git a/demo/Rules_Engine_Examples.dbc b/demo/Rules_Engine_Examples.dbc index f4770a7..3d21bf7 100644 Binary files a/demo/Rules_Engine_Examples.dbc and b/demo/Rules_Engine_Examples.dbc differ diff --git a/demo/Rules_Engine_Examples.html b/demo/Rules_Engine_Examples.html index 15b5d5a..b85b426 100644 --- a/demo/Rules_Engine_Examples.html +++ b/demo/Rules_Engine_Examples.html
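
Taken together, the two new parameters introduced by this diff could be exercised as in the sketch below. This is illustrative only: it assumes the `Rule`, `RuleSet`, and `Lookups` definitions from the demo, an existing SparkSession/`df`, and that the summary report is exposed as `summaryReport` (only `completeReport` appears in the diff, so that accessor name is an assumption).

```scala
// Sketch only: assumes the Rule/RuleSet API shown in this diff, a Lookups
// object defined elsewhere in the demo, and a DataFrame `df` in scope.
// Not runnable standalone without Spark and the rules-engine dependency.
import org.apache.spark.sql.functions.col

val categoricalRules = Array(
  // Case-insensitive match against a list of valid String values
  Rule("Valid_Regions", col("region"), Lookups.validRegions, ignoreCase = true),
  // Inverted match: flag rows whose sku appears in the invalid-value list
  Rule("Invalid_Skus", col("sku"), Lookups.invalidSkus, invertMatch = true)
)

val validationResults = RuleSet(df)
  .add(categoricalRules)
  .validate()

// Verbose report: the original df with each rule's evaluation appended
validationResults.completeReport.show(200, false)
// Summary report (assumed accessor name): only rows failing one or more rules
validationResults.summaryReport.show(200, false)
```

Because `validate()` now returns a results object rather than a `(rulesReport, passed)` tuple, downstream callers pick whichever report suits them instead of destructuring two values.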