InsideDarkWeb.com

How to efficiently map over DF and use combination of outputs?

Given a DF, let’s say I have 3 classes each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations).

What is the best way to get a resulting DF that contains the original DF plus the 3 added columns?

val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

// in the first class
def addCol(df: DataFrame): DataFrame = {
    df.withColumn("method1", col("num1") / col("num2"))
}
// in the second class
def addCol(df: DataFrame): DataFrame = {
    df.withColumn("method2", col("num1") * col("num2"))
}
// in the third class
def addCol(df: DataFrame): DataFrame = {
    df.withColumn("method3", col("num1") + col("num2"))
}

One option is actions.foldLeft(df) { (df, action) => action.addCol(df) }. The end result is the DF I want, with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. Is there a more efficient way to do this?

Stack Overflow Asked on November 18, 2021

One Answer

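An efficient way to do this is to use select.

select scales better than foldLeft over withColumn when you have very large data or many columns to add, because every withColumn call introduces another projection into the query plan. Check this post.

You can build the required expressions and pass them to select, as in the code below.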

scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1   |2   |
|2   |5   |
|3   |7   |
+----+----+
scala> val colExpr = Seq(
    $"num1",
    $"num2",
    ($"num1" / $"num2").as("method1"),
    ($"num1" * $"num2").as("method2"),
    ($"num1" + $"num2").as("method3")
)

Final Output

scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
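To check the difference on your own data, one option is to build the result both ways and compare the plans Spark prints. The sketch below assumes a local SparkSession; the object name PlanComparison and the actions list are illustrative stand-ins and not part of the original question. Note that withColumn and select are both lazy transformations, so the foldLeft version does not run three separate jobs; it only builds a plan with one projection per call, which the optimizer will typically collapse for simple expressions like these.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PlanComparison {
  // Assumption: a local session, for experimentation only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("plan-comparison")
    .getOrCreate()
  import spark.implicits._

  val df: DataFrame = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

  // Hypothetical stand-ins for the three addCol methods from the question.
  val actions: Seq[DataFrame => DataFrame] = Seq(
    _.withColumn("method1", col("num1") / col("num2")),
    _.withColumn("method2", col("num1") * col("num2")),
    _.withColumn("method3", col("num1") + col("num2"))
  )

  // foldLeft: one withColumn (one projection) per action; nothing executes yet.
  val folded: DataFrame = actions.foldLeft(df)((acc, action) => action(acc))

  // select: all new columns expressed in a single projection.
  val selected: DataFrame = df.select(
    col("num1"),
    col("num2"),
    (col("num1") / col("num2")).as("method1"),
    (col("num1") * col("num2")).as("method2"),
    (col("num1") + col("num2")).as("method3")
  )

  def main(args: Array[String]): Unit = {
    // Prints the parsed, analyzed, optimized and physical plans for each version.
    folded.explain(true)
    selected.explain(true)
  }
}
```

With only three columns the two versions should behave much the same; the gap is more likely to show up in plan analysis time when withColumn is called in a loop over many columns.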

Update

Return a Column instead of a DataFrame. Using a higher-order function, all three of your functions can be replaced with the single function below.

scala> def add(
               num1: Column, // could be replaced by varargs if you prefer
               num2: Column,
               f: (Column, Column) => Column
             ): Column = f(num1, num2)

For example, with varargs the function comes first, and the required columns are passed at the end when invoking the method.

def add(f: (Column,Column) => Column,cols:Column*): Column = cols.reduce(f)
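A call to this varargs version might look like the sketch below. It assumes a local SparkSession and the sample data from the question; the object name VarargsAdd is only illustrative.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

object VarargsAdd {
  // Assumption: a local session, for experimentation only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("varargs-add")
    .getOrCreate()
  import spark.implicits._

  val df: DataFrame = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

  // The combining function comes first, the columns last.
  def add(f: (Column, Column) => Column, cols: Column*): Column = cols.reduce(f)

  val colExpr: Seq[Column] = Seq(
    $"num1",
    $"num2",
    add(_ / _, $"num1", $"num2").as("method1"),
    add(_ * _, $"num1", $"num2").as("method2"),
    add(_ + _, $"num1", $"num2").as("method3")
  )

  val result: DataFrame = df.select(colExpr: _*)

  def main(args: Array[String]): Unit = result.show(false)
}
```

Because the body uses reduce, passing more than two columns combines them left to right, and calling add with no columns at all throws an exception.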

Invoking the add function (the first version, which takes the two columns followed by the function):

scala> val colExpr = Seq(
    $"num1",
    $"num2",
    add($"num1",$"num2",(_ / _)).as("method1"),
    add($"num1", $"num2",(_ * _)).as("method2"),
    add($"num1", $"num2",(_ + _)).as("method3")
)

Final Output

scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
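To tie this back to the three classes in the question, a minimal sketch is shown below. It assumes each class can be changed to expose a Column rather than an addCol(df) method; the trait ColumnAction, the objects Method1 to Method3, and the helper addAll are hypothetical names, not part of the original code.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical interface: each action describes its new column
// instead of transforming the DataFrame itself.
trait ColumnAction {
  def newCol: Column
}

object Method1 extends ColumnAction {
  val newCol: Column = (col("num1") / col("num2")).as("method1")
}
object Method2 extends ColumnAction {
  val newCol: Column = (col("num1") * col("num2")).as("method2")
}
object Method3 extends ColumnAction {
  val newCol: Column = (col("num1") + col("num2")).as("method3")
}

object SelectActions {
  // Keeps every existing column ("*") and appends one column per action,
  // all in a single select.
  def addAll(df: DataFrame, actions: Seq[ColumnAction]): DataFrame =
    df.select(col("*") +: actions.map(_.newCol): _*)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-actions")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")
    addAll(df, Seq(Method1, Method2, Method3)).show(false)
  }
}
```

This keeps the per-class separation the question asks for while still producing the result in one projection. It only works when each new column depends on the original columns alone; a column that depends on another action's output would need a second select or a withColumn.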

Answered by Srinivas on November 18, 2021
