Training fails if we have too many features (400+) #3

When char-dist-features and header features are combined for the domain "dbpedia", we end up with more than 400 features. Training a RandomForestClassifier with Spark then fails with the following error:
Cause: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

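This appears to be a bug in Spark's code generation: the generated "compare" method exceeds the JVM's 64 KB limit on the size of a single method. It is not clear whether there is an easy fix for this problem. Related reports: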
https://issues.apache.org/jira/browse/SPARK-16845
http://stackoverflow.com/questions/40044779/find-mean-and-corr-of-10-000-columns-in-pyspark-dataframe
https://issues.apache.org/jira/browse/SPARK-17092

SparkTestSpec currently reproduces this error.
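
For readers without access to the project, the following is a minimal, hypothetical sketch of what a standalone reproduction attempt could look like. It is not a copy of SparkTestSpec. The object name, column names, row count, and the figure of 450 features are all assumptions chosen for illustration, and the data is random. Whether this exact pipeline hits the 64 KB limit depends on the Spark version and on which step of the real pipeline generates the ordering code, so treat it as a starting point rather than a confirmed reproduction.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, rand}

// Hypothetical standalone reproduction attempt (not the project's SparkTestSpec).
object WideFeatureRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("wide-feature-repro")
      .getOrCreate()

    // Assumption: any count above ~400, matching the report.
    val numFeatures = 450
    val featureNames = (0 until numFeatures).map(i => s"f$i")

    // Synthetic data: a two-class string label plus numFeatures random double columns.
    val base = spark.range(200)
      .withColumn("label", (col("id") % 2).cast("string"))
    val wide = base.select(
      col("label") +: featureNames.map(name => rand().as(name)): _*
    )

    // Index the label so the classifier can read the number of classes from metadata.
    val indexed = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(wide)
      .transform(wide)

    // Pack the wide table into a single vector column.
    val assembled = new VectorAssembler()
      .setInputCols(featureNames.toArray)
      .setOutputCol("features")
      .transform(indexed)

    // Training is the step that fails in the report.
    val model = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .fit(assembled)

    println(s"Trained ${model.trees.length} trees on $numFeatures features")
    spark.stop()
  }
}
```

If the sketch trains without error, lowering or raising numFeatures and comparing against the real feature extraction may help narrow down which step produces the oversized "compare" method.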
