2.1 Cache-friendly in-memory hash map layout
2.2 Fallback to external-sort-based aggregation when memory is exhausted
2.3 Code generation is enabled by default for aggregation operations
3.1 Prefer (external) sort-merge join over hash join in shuffle joins (for left/right outer and inner joins)
3.2 Join data size is now bounded by disk rather than memory
3.3 Support using (external) sort-merge join method for left/right outer joins
3.4 Support for broadcast outer join
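The preference for sort-merge join above comes from its bounded memory footprint: each side is consumed as a sorted stream rather than built into a hash table. A minimal in-memory sketch of the merge phase (illustrative only; the function name and data shapes are not Spark APIs):

```python
def sort_merge_inner_join(left, right):
    """Minimal sort-merge inner join on (key, value) pairs.

    Both inputs are sorted by key, then consumed as streams;
    unlike a hash join, neither side must fit into a hash table.
    Illustrative sketch only -- not a Spark API.
    """
    left = sorted(left)
    right = sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the matching key groups.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            # Rewind the right side if the next left row repeats the key.
            if i < len(left) and left[i][0] == lk:
                j = j0
    return out
```

In Spark's external variant the sorted runs live on disk, which is why join size is bounded by disk rather than memory.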
4.1 Cache-friendly in-memory layout for sorting
4.2 Fallback to external sorting when data exceeds memory size
4.3 Code generated comparator for fast comparisons
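To illustrate the fallback in 4.2, here is a toy external merge sort that spills sorted runs to temporary files once an in-memory buffer fills. This is only an analogy: Spark's sorter operates on binary records, but the spill-then-merge shape is the same.

```python
import heapq
import pickle
import tempfile

def external_sort(items, max_in_memory=1000):
    """Toy external merge sort: buffer items in memory, spill each
    full buffer to disk as a sorted run, then k-way merge the runs.
    Illustrative only -- not Spark's implementation.
    """
    runs, buf = [], []
    for item in items:
        buf.append(item)
        if len(buf) >= max_in_memory:
            f = tempfile.TemporaryFile()
            pickle.dump(sorted(buf), f)
            f.seek(0)
            runs.append(f)
            buf = []
    # Merge the final in-memory buffer with all on-disk runs.
    streams = [iter(sorted(buf))]
    for f in runs:
        streams.append(iter(pickle.load(f)))
    return list(heapq.merge(*streams))
```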
Native memory management & representation
5.1 Compact binary in-memory data representation, leading to lower memory usage
5.2 Execution memory is explicitly accounted for, without relying on JVM GC, leading to less GC and more robust memory management
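A rough feel for item 5.1: packing a row into one fixed binary buffer (here via Python's struct module) avoids per-field object headers and pointer chasing. This is only an analogy for Tungsten's binary row format, not its actual layout; the schema below is hypothetical.

```python
import struct

# Hypothetical fixed schema: (id: int64, score: float64, flag: bool).
# One packed buffer per row replaces three boxed objects, cutting
# per-field overhead and improving cache locality.
ROW = struct.Struct("<qd?")   # little-endian, no padding: 8 + 8 + 1 = 17 bytes

def pack_row(id_, score, flag):
    return ROW.pack(id_, score, flag)

def unpack_row(buf):
    return ROW.unpack(buf)
```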
SPARK-8638: Improved performance & memory usage in window functions
Metrics instrumentation, reporting, and visualization
7.1 SPARK-8856: Plan visualization for DataFrame/SQL
7.2 SPARK-8735: Expose metrics for runtime memory usage in web UI
7.3 SPARK-4598: Pagination for jobs with large number of tasks in web UI
SPARK-6797: Support for YARN cluster mode in R
SPARK-6805: GLMs with R formula, binomial/Gaussian families, and elastic-net regularization
SPARK-8742: Improved error messages for R
SPARK-9315: Aliases to make DataFrame functions more R-like
SPARK-8521: New Feature transformers: CountVectorizer, Discrete Cosine transformation, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
New Estimators in Pipeline API: SPARK-8600 naive Bayes, SPARK-7879 k-means, and SPARK-8671 isotonic regression.
New Algorithms: SPARK-9471 multilayer perceptron classifier, SPARK-6487 PrefixSpan for sequential pattern mining, SPARK-8559 association rule generation, SPARK-8598 1-sample Kolmogorov-Smirnov test, etc.
4.1 LDA: online LDA performance, asymmetric doc concentration, perplexity, log-likelihood, top topics/documents, save/load, etc.
4.2 Trees and ensembles: class probabilities, feature importance for random forests, thresholds for classification, checkpointing for GBTs, etc.
4.3 Pregel-API: more efficient Pregel API implementation for GraphX.
4.4 GMM: distribute matrix inversions.
Model summary for linear and logistic regression.
Python API: distributed matrices, streaming k-means and linear models, LDA, power iteration clustering, etc.
Tuning and evaluation: train-validation split and multiclass classification evaluator.
Documentation: document the release version of public API methods
SPARK-7398: Backpressure: Automatic and dynamic rate control in Spark Streaming for handling bursty input streams. This allows a streaming pipeline to dynamically adapt to changes in ingestion rates and computation loads. It works with receivers as well as with the Direct Kafka approach.
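Spark's backpressure is driven by a PID-style rate estimator. As a loose sketch of the idea only (the coefficients, names, and formula here are illustrative, not Spark's internals):

```python
def next_rate(current_rate, processed, batch_secs, scheduling_delay_secs,
              kp=1.0, ki=0.2):
    """Sketch of PID-style rate control: compare the achieved
    processing rate with the current ingestion rate, and subtract
    a penalty for queued-up work (scheduling delay).
    Coefficients and names are illustrative, not Spark's internals.
    """
    processing_rate = processed / batch_secs
    error = current_rate - processing_rate
    # Penalize elements that accumulated while batches waited to run.
    backlog_correction = ki * scheduling_delay_secs * processing_rate / batch_secs
    new_rate = current_rate - kp * error - backlog_correction
    return max(new_rate, 0.0)
```

When a batch processes fewer elements than were ingested, the error term throttles the next batch's ingestion rate; when the pipeline keeps up and there is no backlog, the rate is left unchanged.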
Python API for streaming sources
2.1 SPARK-8389: Kafka offsets of Direct Kafka streams available through Python API
2.2 SPARK-8564: Kinesis Python API
2.3 SPARK-8378: Flume Python API
2.4 SPARK-5155: MQTT Python API
SPARK-3258: Python API for streaming machine learning algorithms: K-Means, linear regression, and logistic regression
SPARK-9215: Improved reliability of Kinesis streams: no need to enable write-ahead logs for saving and recovering received data across driver failures
Direct Kafka API graduated: no longer experimental.
SPARK-8701: Input metadata in UI: Kafka offsets, and input files are visible in the batch details UI
SPARK-8882: Better load balancing and scheduling of receivers across cluster
SPARK-4072: Include streaming storage in web UI
Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with code generation for expression evaluation. These features can both be disabled by setting spark.sql.tungsten.enabled to false.
Parquet schema merging is no longer enabled by default. It can be re-enabled by setting spark.sql.parquet.mergeSchema to true.
Resolution of strings to columns in Python now supports using dots (.) to qualify the column or access nested values, for example df['table.column.nestedField']. However, this means that if your column name contains any dots you must now escape them using backticks (e.g., table.`column.with.dots`.nested).
In-memory columnar storage partition pruning is on by default. It can be disabled by setting spark.sql.inMemoryColumnarStorage.partitionPruning to false.
Unlimited precision decimal columns are no longer supported, instead Spark SQL enforces a maximum precision of 38. When inferring schema from BigDecimal objects, a precision of (38, 18) is now used. When no precision is specified in DDL then the default remains Decimal(10, 0).
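The (38, 18) default above means at most 38 significant digits, of which 18 are after the decimal point. The fractional clamp can be seen with Python's decimal module; this illustrates the semantics only and is not Spark code:

```python
from decimal import Decimal, ROUND_HALF_UP

def clamp_to_38_18(value):
    """Round a value to at most 18 fractional digits and at most
    38 digits overall, mirroring the Decimal(38, 18) inference
    default described above. Illustrative, not Spark's code path.
    """
    d = Decimal(value).quantize(Decimal("1." + "0" * 18),
                                rounding=ROUND_HALF_UP)
    if len(d.as_tuple().digits) > 38:
        raise ValueError("exceeds maximum precision 38")
    return d
```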
Timestamps are now processed at a precision of 1us, rather than 100ns.
Sum function returns null when all input values are null (null before 1.4, 0 in 1.4).
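The new semantics can be mimicked in plain Python, with None standing in for SQL null (an illustration of the behavior, not Spark code):

```python
def sql_sum(values):
    """SUM semantics as of Spark 1.5: nulls (None) are ignored,
    and the result is null when every input is null.
    (1.4 returned 0 in that case; pre-1.4 returned null.)
    """
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None
```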
In the SQL dialect, floating-point numbers are now parsed as decimal. HiveQL parsing remains unchanged.
The canonical names of SQL/DataFrame functions are now lowercase (e.g. sum vs SUM).
Using the DirectOutputCommitter when speculation is enabled has been determined to be unsafe, so Parquet will not use this output committer when speculation is on, regardless of configuration.
The JSON data source will not automatically load new files that are created by other applications (i.e. files that are not inserted into the dataset through Spark SQL). For a JSON persistent table (i.e. a table whose metadata is stored in the Hive Metastore), users can use the REFRESH TABLE SQL command or HiveContext's refreshTable method to include those new files in the table. For a DataFrame representing a JSON dataset, users need to recreate the DataFrame, and the new DataFrame will include the new files.
The Write Ahead Log no longer needs to be enabled for Kinesis streams. The updated Kinesis receiver keeps track of the Kinesis sequence numbers received in each batch, and uses that information to re-read the necessary data while recovering from failures.
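A toy sketch of the recovery idea: per batch, record only the range of sequence numbers received, so after a failure the data can be re-fetched from the source instead of being duplicated into a write-ahead log. The class and names below are illustrative, not Spark's Kinesis receiver.

```python
class SequenceTracker:
    """Toy model of WAL-free recovery: remember only the
    (first, last) sequence numbers of each batch; on failure,
    re-read those ranges from the source rather than replaying
    a write-ahead log. Illustrative sketch, not a Spark class.
    """
    def __init__(self, source):
        self.source = source   # seq -> record; stands in for the Kinesis stream
        self.batches = {}      # batch_id -> (first_seq, last_seq)

    def record_batch(self, batch_id, seqs):
        # Only the range metadata is persisted, not the records themselves.
        self.batches[batch_id] = (min(seqs), max(seqs))

    def recover(self, batch_id):
        # Re-fetch the batch's records directly from the source.
        first, last = self.batches[batch_id]
        return [self.source[s] for s in range(first, last + 1)]
```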
The number of times receivers are relaunched on failure is no longer limited by the maximum number of Spark task attempts. The system will always try to relaunch receivers after failures until the StreamingContext is stopped.
Improved load balancing of receivers across the executors, even after relaunching.
Enabling checkpointing when using queueStream now throws an exception, since queueStream cannot be checkpointed. However, this was found to break certain existing applications, so the change will be reverted in Spark 1.5.1.
In the spark.mllib package, there are no breaking API changes but some behavior changes:
1.1 SPARK-9005: RegressionMetrics.explainedVariance returns the average regression sum of squares.
1.2 SPARK-8600: NaiveBayesModel.labels are now sorted.
1.3 SPARK-3382: GradientDescent has a default convergence tolerance of 1e-3, and hence iterations might end earlier than in 1.4.
In the experimental spark.ml package, there exists one breaking API change and one behavior change:
2.1 SPARK-9268: Java's varargs support is removed from Params.setDefault due to a Scala compiler bug.
2.2 SPARK-10097: Evaluator.isLargerBetter is added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.