final class DataFrameStatFunctions extends Logging
Provides eagerly computed statistical functions for DataFrames.
To access an object of this class, use DataFrame.stat.
- Since
0.2.0
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- def action[T](funcName: String)(func: => T): T
- Attributes
- protected
- Annotations
- @inline()
- def approxQuantile(cols: Array[String], percentile: Array[Double]): Array[Array[Option[Double]]]
For an array of numeric columns and an array of desired quantiles, returns a matrix of approximate values for each column at each of the desired quantiles. For example, result(0)(1) contains the approximate value for column cols(0) at quantile percentile(1).
This function uses the t-Digest algorithm.
For example, the following code:
import session.implicits._
val df = Seq((0.1, 0.5), (0.2, 0.6), (0.3, 0.7)).toDF("a", "b")
val res = df.stat.approxQuantile(Array("a", "b"), Array(0, 0.1, 0.6))
produces the following result:
res: Array(Array(Some(0.05), Some(0.15000000000000002), Some(0.25)), Array(Some(0.45), Some(0.55), Some(0.6499999999999999)))
- cols
An array of column names.
- percentile
An array of double values greater than or equal to 0.0 and less than 1.0.
- returns
A matrix with the dimensions (cols.size * percentile.size) containing the approximate percentile values. If there is not enough data to calculate the quantile, the method returns None.
- Since
0.2.0
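The indexing rule above (row i corresponds to cols(i), column j to percentile(j)) can be illustrated without a Snowflake session. The following is a minimal plain-Scala sketch that reuses the inputs and the result shown in the example; the object name QuantileMatrixSketch and the helper labelled are hypothetical and are not part of the Snowpark API.

```scala
object QuantileMatrixSketch {
  // Inputs and result copied from the example above.
  val cols: Array[String] = Array("a", "b")
  val percentile: Array[Double] = Array(0.0, 0.1, 0.6)
  val res: Array[Array[Option[Double]]] = Array(
    Array[Option[Double]](Some(0.05), Some(0.15000000000000002), Some(0.25)),
    Array[Option[Double]](Some(0.45), Some(0.55), Some(0.6499999999999999))
  )

  // Pair every (column, quantile) combination with its entry in the matrix.
  // A None entry (not enough data) is kept as None rather than unwrapped.
  def labelled: Seq[(String, Double, Option[Double])] =
    for {
      (c, i) <- cols.toSeq.zipWithIndex
      (p, j) <- percentile.toSeq.zipWithIndex
    } yield (c, p, res(i)(j))

  def main(args: Array[String]): Unit =
    labelled.foreach { case (c, p, v) =>
      println(s"column=$c quantile=$p value=${v.map(_.toString).getOrElse("n/a")}")
    }
}
```

Keeping the entries as Option values until the point of use avoids calling .get on a None.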
- def approxQuantile(col: String, percentile: Array[Double]): Array[Option[Double]]
For a specified numeric column and an array of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
This function uses the t-Digest algorithm.
For example, the following code:
import session.implicits._
val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 0).toDF("a")
val res = df.stat.approxQuantile("a", Array(0, 0.1, 0.4, 0.6, 1))
produces the following result:
res: Array(Some(-0.5), Some(0.5), Some(3.5), Some(5.5), Some(9.5))
- col
The name of the numeric column.
- percentile
An array of double values greater than or equal to 0.0 and less than 1.0.
- returns
An array of approximate percentile values. If there is not enough data to calculate the quantile, the method returns None.
- Since
0.2.0
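Because each entry of the returned array is an Option, callers usually need to decide what to do with None. One possible way to handle this, sketched in plain Scala with the values from the example above, is to keep only the quantiles that produced a value. The object QuantileLookupSketch and the helper definedQuantiles are hypothetical names, and the all-None array is made-up input used only to show the empty case.

```scala
object QuantileLookupSketch {
  // Keep only the quantiles that produced a value; None entries are dropped.
  def definedQuantiles(
      percentile: Array[Double],
      res: Array[Option[Double]]): Map[Double, Double] =
    percentile.zip(res).collect { case (p, Some(v)) => p -> v }.toMap

  def main(args: Array[String]): Unit = {
    // Quantiles and values copied from the example above.
    val percentile = Array(0.0, 0.1, 0.4, 0.6, 1.0)
    val res = Array[Option[Double]](Some(-0.5), Some(0.5), Some(3.5), Some(5.5), Some(9.5))
    println(definedQuantiles(percentile, res)(0.4)) // value at the 0.4 quantile

    // Hypothetical result for a column without enough data.
    val noData = Array[Option[Double]](None, None)
    println(definedQuantiles(Array(0.1, 0.9), noData).isEmpty)
  }
}
```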
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @HotSpotIntrinsicCandidate() @native()
- def corr(col1: String, col2: String): Option[Double]
Calculates the correlation coefficient for non-null pairs in two numeric columns.
For example, the following code:
import session.implicits._
val df = Seq((0.1, 0.5), (0.2, 0.6), (0.3, 0.7)).toDF("a", "b")
val res = df.stat.corr("a", "b").get
produces the following result:
res: 0.9999999999999991
- col1
The name of the first numeric column to use.
- col2
The name of the second numeric column to use.
- returns
The correlation of the two numeric columns. If there is not enough data to generate the correlation, the method returns None.
- Since
0.2.0
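To make "correlation coefficient for non-null pairs" concrete, here is a minimal plain-Scala sketch of the Pearson correlation coefficient that skips pairs with a missing value. It illustrates the statistic only and is not Snowpark's implementation; the object CorrSketch and the function pearson are hypothetical names, and returning None for fewer than two complete pairs or for zero variance is an assumption of this sketch.

```scala
object CorrSketch {
  // Pearson correlation over pairs where both values are present.
  // Assumption of this sketch: None is returned when fewer than two complete
  // pairs exist or when either column has zero variance.
  def pearson(pairs: Seq[(Option[Double], Option[Double])]): Option[Double] = {
    val complete = pairs.collect { case (Some(x), Some(y)) => (x, y) }
    if (complete.size < 2) None
    else {
      val n = complete.size.toDouble
      val meanX = complete.map(_._1).sum / n
      val meanY = complete.map(_._2).sum / n
      val sxy = complete.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val sxx = complete.map { case (x, _) => math.pow(x - meanX, 2) }.sum
      val syy = complete.map { case (_, y) => math.pow(y - meanY, 2) }.sum
      if (sxx == 0.0 || syy == 0.0) None else Some(sxy / math.sqrt(sxx * syy))
    }
  }

  def main(args: Array[String]): Unit = {
    // Same data as the example above, plus one incomplete pair that is ignored.
    val data = Seq[(Option[Double], Option[Double])](
      (Some(0.1), Some(0.5)), (Some(0.2), Some(0.6)), (Some(0.3), Some(0.7)), (None, Some(9.9)))
    println(pearson(data)) // close to 1.0: the two columns are perfectly linear
  }
}
```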
- def cov(col1: String, col2: String): Option[Double]
Calculates the sample covariance for non-null pairs in two numeric columns.
For example, the following code:
import session.implicits._
val df = Seq((0.1, 0.5), (0.2, 0.6), (0.3, 0.7)).toDF("a", "b")
val res = df.stat.cov("a", "b").get
produces the following result:
res: 0.010000000000000037
- col1
The name of the first numeric column to use.
- col2
The name of the second numeric column to use.
- returns
The sample covariance of the two numeric columns. If there is not enough data to generate the covariance, the method returns None.
- Since
0.2.0
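The value in the example can be checked by hand: sample covariance divides the sum of the products of deviations by n - 1 rather than n. The following plain-Scala sketch computes it for the same data; it is an illustration of the formula, not Snowpark's implementation, and the names CovSketch and sampleCov are hypothetical. Returning None for fewer than two pairs is an assumption of the sketch.

```scala
object CovSketch {
  // Sample covariance: sum((x - meanX) * (y - meanY)) / (n - 1).
  // Assumption of this sketch: None is returned for fewer than two pairs,
  // because the n - 1 denominator would be zero or negative.
  def sampleCov(pairs: Seq[(Double, Double)]): Option[Double] =
    if (pairs.size < 2) None
    else {
      val n = pairs.size.toDouble
      val meanX = pairs.map(_._1).sum / n
      val meanY = pairs.map(_._2).sum / n
      val sumOfProducts = pairs.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      Some(sumOfProducts / (n - 1))
    }

  def main(args: Array[String]): Unit = {
    // Same data as the example above: deviations are (-0.1, 0, 0.1) in both
    // columns, so the sum of products is about 0.02 and 0.02 / 2 is about 0.01.
    println(sampleCov(Seq((0.1, 0.5), (0.2, 0.6), (0.3, 0.7))))
  }
}
```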
- def crosstab(col1: String, col2: String): DataFrame
Computes a pair-wise frequency table (a contingency table) for the specified columns. The method returns a DataFrame containing this table.
In the returned contingency table:
- The first column of each row contains the distinct values of col1.
- The name of the first column is the name of col1.
- The rest of the column names are the distinct values of col2.
- The counts are returned as Longs.
- For pairs that have no occurrences, the contingency table contains 0 as the count.
Note: The number of distinct values in col2 should not exceed 1000.
For example, the following code:
import session.implicits._
val df = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)).toDF("key", "value")
val ct = df.stat.crosstab("key", "value")
ct.show()
prints out the following result:
---------------------------------------------------------------------------------------------
|"KEY"  |"CAST(1 AS NUMBER(38,0))"  |"CAST(2 AS NUMBER(38,0))"  |"CAST(3 AS NUMBER(38,0))"  |
---------------------------------------------------------------------------------------------
|1      |1                          |1                          |0                          |
|2      |2                          |0                          |1                          |
|3      |0                          |1                          |1                          |
---------------------------------------------------------------------------------------------
- col1
The name of the first column to use.
- col2
The name of the second column to use.
- returns
A DataFrame containing the contingency table.
- Since
0.2.0
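The shape of the contingency table described above can be reproduced on local collections. The following plain-Scala sketch builds the same counts for the example data; it is only an illustration of the table's layout, not how Snowpark computes it. The names CrosstabSketch and crosstab are hypothetical, and sorting the row and column keys is a choice made here for readable output.

```scala
object CrosstabSketch {
  // Returns (distinct col2 values, rows), where each row is a col1 value
  // followed by one Long count per distinct col2 value. Pairs that never
  // occur get a count of 0.
  def crosstab(pairs: Seq[(Int, Int)]): (Seq[Int], Seq[(Int, Seq[Long])]) = {
    val counts: Map[(Int, Int), Long] =
      pairs.groupBy(identity).map { case (pair, hits) => pair -> hits.size.toLong }
    val rowKeys = pairs.map(_._1).distinct.sorted
    val colKeys = pairs.map(_._2).distinct.sorted
    val rows = rowKeys.map { r =>
      r -> colKeys.map(c => counts.getOrElse((r, c), 0L))
    }
    (colKeys, rows)
  }

  def main(args: Array[String]): Unit = {
    // Same (key, value) pairs as the example above.
    val data = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))
    val (header, rows) = crosstab(data)
    println(("key" +: header.map(_.toString)).mkString("\t"))
    rows.foreach { case (k, cs) => println((k +: cs).mkString("\t")) }
  }
}
```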
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def log(): Logger
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logDebug(msg: String, throwable: Throwable): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logDebug(msg: String): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logError(msg: String, throwable: Throwable): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logError(msg: String): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logInfo(msg: String, throwable: Throwable): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logInfo(msg: String): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logTrace(msg: String, throwable: Throwable): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logTrace(msg: String): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logWarning(msg: String, throwable: Throwable): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- def logWarning(msg: String): Unit
- Attributes
- protected[internal]
- Definition Classes
- Logging
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @HotSpotIntrinsicCandidate() @native()
- def sampleBy[T](col: String, fractions: Map[T, Double]): DataFrame
Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
For example, the following code:
import session.implicits._
val df = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)).toDF("name", "age")
val fractions = Map("Bob" -> 0.5, "Nico" -> 1.0)
df.stat.sampleBy("name", fractions).show()
prints a result such as the following (sampling is random, so the rows returned for a stratum with a fraction below 1.0 can vary between runs):
------------------
|"NAME"  |"AGE"  |
------------------
|Bob     |17     |
|Nico    |8      |
------------------
- T
The type of the stratum.
- col
The name of the column that defines the strata.
- fractions
A Map that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the Map, the method uses 0 as the fraction.
- returns
A new DataFrame that contains the stratified sample.
- Since
0.2.0
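The rules above (one fraction per stratum, 0 for strata missing from the Map, no row selected more than once) can be modelled on a local collection. The following plain-Scala sketch keeps each row independently with the probability given for its stratum. This is one simple model of stratified sampling without replacement and is not claimed to match Snowpark's implementation; the names SampleBySketch and sampleBy, and the explicit Random parameter, are assumptions of the sketch.

```scala
import scala.util.Random

object SampleBySketch {
  // Each row is visited once (so no row can appear twice) and is kept with
  // the probability configured for its stratum. Strata that are absent from
  // the Map fall back to a fraction of 0.0 and are never kept.
  def sampleBy[K, R](
      rows: Seq[R],
      key: R => K,
      fractions: Map[K, Double],
      rng: Random): Seq[R] =
    rows.filter { row =>
      val fraction = fractions.getOrElse(key(row), 0.0)
      rng.nextDouble() < fraction // nextDouble() is in [0.0, 1.0)
    }

  def main(args: Array[String]): Unit = {
    // Same rows and fractions as the example above.
    val rows = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12))
    val fractions = Map("Bob" -> 0.5, "Nico" -> 1.0)
    sampleBy(rows, (r: (String, Int)) => r._1, fractions, new Random()).foreach(println)
  }
}
```

In this model a fraction of 1.0 always keeps the row and a fraction of 0.0 never does; for anything in between, the sample size only approximates the requested fraction.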
- def sampleBy[T](col: Column, fractions: Map[T, Double]): DataFrame
Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
For example, the following code:
import com.snowflake.snowpark.functions.col
import session.implicits._
val df = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)).toDF("name", "age")
val fractions = Map("Bob" -> 0.5, "Nico" -> 1.0)
df.stat.sampleBy(col("name"), fractions).show()
prints a result such as the following (sampling is random, so the rows returned for a stratum with a fraction below 1.0 can vary between runs):
------------------
|"NAME"  |"AGE"  |
------------------
|Bob     |17     |
|Nico    |8      |
------------------
- T
The type of the stratum.
- col
An expression for the column that defines the strata.
- fractions
A Map that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the Map, the method uses 0 as the fraction.
- returns
A new DataFrame that contains the stratified sample.
- Since
0.2.0
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- def transformation(funcName: String)(func: => DataFrame): DataFrame
- Attributes
- protected
- Annotations
- @inline()
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
Deprecated Value Members
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable]) @Deprecated
- Deprecated
(Since version 9)