Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed data sets to users. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

0
votes
0 answers
3 views

Using a thread pool to submit Spark tasks one by one through a synchronous queue, and catching exceptions

My Spark program submits tasks to a thread pool through a synchronous blocking queue. The start of the next task depends on the result of the previous task. A single basic task consists of a map stage and a ...
0
votes
0 answers
2 views

Connection problem to Sansa Stack SPARQL server

What am I using? I am using the Sansa-Notebooks repository, which runs as Docker containers. Sansa-Notebooks is built on top of Apache Spark, and Sansa comes bundled with components such as Zeppelin, Hue, etc....
-4
votes
0 answers
11 views

Create a dummy dataframe with 6 columns and one million rows

How can I create a dummy dataframe with 6 columns (e.g. col1, col2, col3, col4, col5, col6) of random integer values, with one million rows?
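A minimal sketch of one way to do this in Scala, assuming a SparkSession named spark: spark.range generates the rows and rand() fills the six integer columns.

    import org.apache.spark.sql.functions._

    val df = spark.range(1000000L)
      .select((1 to 6).map(i => (rand() * 100).cast("int").as(s"col$i")): _*)

    df.printSchema()   // col1 ... col6, all integer columns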
0
votes
0 answers
8 views

Ignore mid-line CR character in a CR-delimited file

hdfs .globStatus(pathPattern) .filter(f => f.isFile) .map{f => f.getPath.getName -> spark .read .option("delimiter", "\u000A") .option("...
0
votes
0 answers
5 views

spark-submit not uploading files to HDFS in yarn-cluster mode

We are having a problem: we use spark-submit --master yarn --deploy-mode cluster ... to submit a Spark job to the YARN cluster. The error, shown below, can be seen in the Spark UI: file:/root/.sparkStaging/...
0
votes
0 answers
15 views

How to use the map function on a Dataset and filter bad data

I want to understand the map function on a Dataset. I have a dataset and I want to map each column based on its datatype; if a wrong type is received, I want to filter out that row. I have input ...
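A minimal sketch of the Try-based approach, assuming hypothetical Raw/Clean case classes and that a bad row is one whose age field does not parse as an Int:

    import scala.util.Try
    import spark.implicits._

    case class Raw(name: String, age: String)     // hypothetical input shape
    case class Clean(name: String, age: Int)      // hypothetical output shape

    // rows that fail the conversion are silently dropped
    val cleaned = rawDs.flatMap(r => Try(Clean(r.name, r.age.toInt)).toOption)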
0
votes
0 answers
5 views

How to read a .dat file with delimiter \u0001, where each record is on a new line, in Spark with Scala

I have a file with a .dat extension that has no header: 1. fields are separated by '\u0001'; 2. each record is on a new line. How can I read this file in Spark with Scala and convert it to a dataframe?
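A minimal sketch, assuming a hypothetical path: the CSV reader accepts '\u0001' as a single-character delimiter, and newline remains the record separator by default.

    val df = spark.read
      .option("header", "false")
      .option("delimiter", "\u0001")
      .csv("/path/to/input.dat")          // hypothetical path
      .toDF("c1", "c2", "c3")             // rename columns as needed (hypothetical names)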
0
votes
1 answer
14 views

Broadcasting Typesafe Config throws exception "User class threw exception: java.io.UTFDataFormatException: encoded string too long: 70601 bytes"?

Just as the question title says, I am trying to broadcast a Typesafe Config to the executors so my code there can access the config. Unfortunately, I'm getting an exception. object AppConfigUtility { ...
0
votes
0 answers
14 views

How to pass current row to User Defined Function? [duplicate]

How do I pass an entire dataframe row to a UDF? I have a UDF in Python and want to pass the current row into the UDF. Sample: df2 = df1.select(myudf(row)); I can see examples of passing a specific column to ...
0
votes
1 answer
19 views

Spark scala - Fetch Dataset column and convert to Seq

I have a Dataset case class MyDS ( id: Int, name: String ) I want to fetch all the names in a sequence without using collect. I have gone through various posts and the only solution I found was ...
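For what it's worth, materializing the names into a local Seq does require bringing them to the driver (collect or toLocalIterator); a minimal sketch:

    import spark.implicits._

    val names: Seq[String] = ds.map(_.name).collect().toSeq
    // or via the untyped API:
    val names2: Seq[String] = ds.select("name").as[String].collect().toSeq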
0
votes
1 answer
10 views

Apache Spark standalone scheduler - why does the driver need a whole core in 'cluster' mode?

In spark's 'client' deploy mode the spark driver does not consume cores, only spark apps do. But why in 'cluster' mode does the spark driver need a core for itself?
0
votes
0 answers
9 views

Controlling network traffic when running pyspark in local mode?

I'm running my preprocessing routine with pyspark in local mode on a 12-core Mac Pro. Although I run it in local mode with --master local[*], I suspect the network traffic actually touches my ...
0
votes
1 answer
13 views

Writing file using FileSystem to S3 (Scala)

I'm using Scala and trying to write a file with string content to S3. I've tried to do that with FileSystem, but I'm getting the error: "Wrong FS: s3a" val content = "blabla" val fs = ...
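"Wrong FS" usually means the default (HDFS) FileSystem was asked to handle an s3a path. A minimal sketch that resolves the FileSystem from the path itself (bucket and key are hypothetical):

    import java.nio.charset.StandardCharsets
    import org.apache.hadoop.fs.{FileSystem, Path}

    val content = "blabla"
    val path = new Path("s3a://my-bucket/out/data.txt")                      // hypothetical bucket/key
    val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)      // resolves the s3a FileSystem
    val out = fs.create(path)
    out.write(content.getBytes(StandardCharsets.UTF_8))
    out.close()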
0
votes
0 answers
14 views

Spark Structured Streaming Databricks Event Hub schema definition issue

I am having an issue with defining the structure for the JSON document. Now I am trying to apply the same schema on the stream read. val jsonSchema = StructType([ StructField("associatedEntities", struct<...
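The StructType([...]) / struct<...> mix in the excerpt is Python/DDL syntax; in Scala the schema would be built roughly like this (the inner fields are assumptions for illustration):

    import org.apache.spark.sql.types._

    val jsonSchema = StructType(Seq(
      StructField("associatedEntities", ArrayType(StructType(Seq(
        StructField("id", StringType, nullable = true)       // assumed inner field
      ))), nullable = true)
    ))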
1
vote
1 answer
30 views

Create a Dataset/DataFrame from a sequence of tuples without using a case class

I have a sequence of tuples from which I made an RDD and converted that to a dataframe, like below. val rdd = sc.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3"))) import spark.implicits._ val ...
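A minimal sketch: the tuples can go straight to a DataFrame (or a Dataset of tuples) without defining a case class.

    import spark.implicits._

    val df = Seq((1, "User1"), (2, "user2"), (3, "user3")).toDF("id", "name")
    val ds = df.as[(Int, String)]      // a Dataset of tuples, no case class required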
0
votes
1 answer
12 views

Load file with schema information and dynamically apply to data file using Spark

I don't want to use the inferSchema and header options. Instead, I need to read a file containing only the column headers and use it dynamically to create a dataframe. I am using Spark 2 and ...
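A minimal sketch under that constraint, assuming hypothetical paths and a header file holding comma-separated column names; every column is read as StringType here:

    import org.apache.spark.sql.types._

    val headerLine = spark.read.textFile("/path/headers.txt").first()          // hypothetical path
    val schema = StructType(headerLine.split(",").map(n => StructField(n.trim, StringType, nullable = true)))

    val df = spark.read
      .option("header", "false")
      .schema(schema)
      .csv("/path/data.csv")                                                   // hypothetical path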
3
votes
4 answers
55 views

How to change date format from yyyy-mm-dd to dd-mm-yyyy using Spark functions

I'm working on an Apache Spark project in Eclipse using Scala. I would like to change my date format from yyyy-mm-dd to dd-mm-yyyy. This is my code: val conf = new SparkConf().setMaster("local")....
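With the DataFrame API this is usually done with to_date plus date_format (the column name here is an assumption); note the month pattern must be MM, not mm:

    import org.apache.spark.sql.functions._

    val out = df.withColumn("date_ddmmyyyy",
      date_format(to_date(col("date"), "yyyy-MM-dd"), "dd-MM-yyyy"))   // "date" is a hypothetical column name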
1
vote
1 answer
31 views

Filtering Dataframe by nested array

Given the following structure: root |-- lvl_1: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- l1_Attribute: String .... | | |-- ...
-1
votes
1 answer
37 views

How to find the next occurring item from current row in a data frame using Spark Windowing?

I have the following Dataframe: +-----+----------+-------------+--------------------+---------+-----+---------+ |ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ | +-----...
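The usual tool for "next occurring item" is the lead window function; a minimal sketch using the column names visible in the excerpt (the exact partitioning and ordering are assumptions):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val w = Window.partitionBy("MEM_ID").orderBy("SVC_DT", "SEQ")      // assumed partition/order columns
    val withNext = df.withColumn("NEXT_BFS", lead(col("BFS"), 1).over(w))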
-1
votes
0 answers
14 views

Why do external jars fail while Maven fixes the issue? scalac compile yields "object apache is not a member of package org"

In Eclipse, while setting up Spark, even after adding the external jars from spark-2.4.3-bin-hadoop2.7/jars/<_all.jar> to the build path, the compiler complains about "object apache is not a member of ...
-2
votes
0 answers
20 views

Custom partitioning

I have the following CustomPartitioner defined, where I am passing a map of key/value pairs to the CustomPartitioner, and for a given ID I want the corresponding value to be returned. I have already calculated the ...
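For reference, a minimal custom Partitioner sketch, assuming the map holds id -> partition index and the RDD is keyed by that id (all names are hypothetical):

    import org.apache.spark.Partitioner

    class IdPartitioner(idToPartition: Map[Int, Int], parts: Int) extends Partitioner {
      override def numPartitions: Int = parts
      override def getPartition(key: Any): Int =
        idToPartition.getOrElse(key.asInstanceOf[Int], 0)   // fall back to partition 0 for unknown ids
    }

    val partitioned = pairRdd.partitionBy(new IdPartitioner(idToPartition, 4))   // pairRdd: RDD[(Int, V)], hypothetical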
1
vote
0 answers
38 views

Scala capture method of field without whole instance

I had a piece of code that looks like this: val foo = df.map(parser.parse) // def parse(str: String): ParsedData = { ... } However, I found out that this passes a lambda that captures this, and I ...
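A common workaround (a sketch, assuming parser itself is serializable) is to copy the field into a local val so the closure captures only that value rather than the enclosing instance:

    // the closure now captures localParser, not `this`
    val localParser = parser
    val foo = df.map(localParser.parse _)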
0
votes
2 answers
57 views

Save file without brackets

I would like my final result to be without brackets. I have tried this, but it gave back many errors: .map(x => x.mkString(",")).saveAsTextFile("/home/amel/new") This is my code: val x = sc....
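The brackets come from calling toString on tuples or collections; formatting each record before saving avoids them. A minimal sketch (RDD names and output paths are hypothetical):

    // for an RDD of tuples
    rdd.map { case (key, value) => s"$key,$value" }.saveAsTextFile("/home/amel/new")

    // for an RDD of sequences/arrays of fields
    rddOfSeqs.map(_.mkString(",")).saveAsTextFile("/home/amel/new2")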
1
vote
2 answers
33 views

Spark: Row filter based on Column value

I have millions of rows in a dataframe like this: val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "...
-1
votes
1 answer
19 views

Load CSVs - unable to pass file paths from dataframe

The code below works fine: val Path = Seq ( "dbfs:/mnt/testdata/2019/02/Calls2019-02-03.tsv", "dbfs:/mnt/testdata/2019/02/Calls2019-02-02.tsv" ) val Calls = spark.read .format("com.databricks....
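If the paths live in a dataframe column, one approach (a sketch; the source dataframe and column name are assumptions) is to collect them into a local Seq and expand it as varargs to load:

    import spark.implicits._

    val paths: Seq[String] = pathsDf.select("path").as[String].collect().toSeq   // hypothetical dataframe/column

    val calls = spark.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")
      .load(paths: _*)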
1
vote
2 answers
15 views

Unable to recognize windowing function in IntelliJ

IntelliJ is unable to recognize the avg and over functions. It says "cannot resolve symbol avg and over". Can someone please tell me which library to import? object DomainSpecificLanguage { def main(args: ...
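For reference, the window functions live in org.apache.spark.sql.functions and org.apache.spark.sql.expressions.Window; a minimal sketch with hypothetical columns:

    import org.apache.spark.sql.functions._               // avg, col, ...
    import org.apache.spark.sql.expressions.Window

    val w = Window.partitionBy(col("dept"))                // hypothetical column
    val out = df.withColumn("avg_salary", avg(col("salary")).over(w))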
0
votes
1 answer
29 views

Drop any row with NULL

I have a small problem. I would like to delete any row that contains 'NULL'. This is my input file: matricule,dateins,cycle,specialite,bourse,sport 0000000001,1999-11-22,Master,IC,Non,Non 0000000002,...
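A minimal sketch: na.drop removes rows with real nulls; if 'NULL' appears as a literal string, an explicit filter over all columns is needed as well.

    import org.apache.spark.sql.functions.col

    val noNulls = df.na.drop()   // drops rows containing actual null values

    // drop rows where any column holds the literal string "NULL"
    val noNullStrings = df.filter(df.columns.map(c => col(c) =!= "NULL").reduce(_ && _))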
0
votes
0 answers
29 views

Show call failing with a grouped and aggregated dataframe

I am trying to use sum after groupBy, like this, val b = a.groupBy($"key").agg(sum($"value")) Here the schema of a is of the following type, |-- key: string (nullable = true) |-- value: integer (...
0
votes
0 answers
8 views

Application job submission without duplication

We are using DataStax Spark 6.0. We are submitting jobs using crontab to run every 5 minutes. We wrote a script to check whether the application is already running, to avoid duplicate submission of the same application. Is there a way ...
-3
votes
1 answer
13 views

Suggestion for a clustering algorithm?

I have a dataset of 590,000 records after preprocessing, and I want to find clusters in it. It contains string data (for now, assume I have only one column with 590,000 unique values in the dataset). ...
-1
votes
1 answer
23 views

How to find the sum of arrays in a column grouped by another column's values in a Spark dataframe using Scala

I have a dataframe like below c1 Value A Array[47,97,33,94,6] A Array[59,98,24,83,3] A Array[77,63,93,86,62] B Array[86,71,72,23,27] B ...
2
votes
0 answers
21 views

Redirect spark driver logs to console instead of stderr file

I am running the Spark master and worker in Kubernetes containers. I am submitting the job using Java (Scala). The Spark worker has a log4j.properties.template file in its conf folder, as below. Here it specifies console ...
0
votes
0 answers
17 views

Hadoop replication property from spark code not working

Hadoop replication property not working from the Spark code. I have a use case for which I want to override the default HDFS replication factor from my Spark code. For this I have set the Hadoop ...
0
votes
0 answers
16 views

How can I prevent z.put from sharing a variable across different notes in Zeppelin?

My intention is to use z.put and z.get to share variables between paragraphs of the same note. My paragraph is of type %spark. However, I notice that when I use z.put, the variable is visible even in ...
0
votes
0 answers
31 views

Loading modified schema to Spark

I have a Postgres database with raw data and I want to analyze it with Spark. However, I don't need all columns to do the manipulations. For example, I have the following columns: name surname age ...
0
votes
2 answers
27 views

How to create a dataframe from a string key\tvalue delimited by '\t'

I have a log file with the structure: log_type time_stamp kvs p 2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-0b1\tvl\t20190605.1833\tvt\t20190605.1833\tvs\t20190508 p 2019-06-05 18:53:20 us\...
2
votes
2 answers
40 views

Check whether a particular identifier is present in the other data frame or not

val df1 = Seq(("[1,10,20]", "bat","43243"),("[20,4,10]","mouse","4324432"),("[30,20,3]", "horse","4324234")).toDF("id", "word","userid") val df2 = Seq((1, "raj", "name"),(2, "kiran","name"),(3,"...
1
vote
1 answer
24 views

Spark partition on nodes foreachpartition

I have a Spark cluster (Dataproc) with a master and 4 workers (2 preemptible). In my code I have something like this: JavaRDD<Signal> rdd_data = javaSparkContext.parallelize(myArray); ...
0
votes
0 answers
13 views

Generating data with spark-bench is not done in parallel

I am running spark on cluster mode, on top of YARN. The goal is to launch spark-bench (a spark benchmarking suite) to stress I/Os on the cluster. Here is the file (csv-vs-parquet.conf) I use to ...
0
votes
1 answer
24 views

Element-wise sum of array across rows of a dataset - Spark Scala

I am trying to group the below dataset based on the column "id" and sum the arrays in the column "values" element-wise. How do I do it in Spark using Scala? Input: (dataset of 2 columns, column1 of ...
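One sketch of this, assuming all arrays in a group have the same length and hold Int elements: collect the arrays per id, then transpose and sum in a UDF.

    import org.apache.spark.sql.functions._

    val sumArrays = udf { arrays: Seq[Seq[Int]] => arrays.transpose.map(_.sum) }

    val summed = ds.groupBy("id")
      .agg(collect_list(col("values")).as("all_values"))
      .withColumn("values_sum", sumArrays(col("all_values")))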
1
vote
0 answers
21 views

How does Trigger work in Spark Structured Streaming?

I was going through the structured streaming docs on trigger interval: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers I used the trigger interval as 30 ...
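For reference, a processing-time trigger is set on the writer; with Trigger.ProcessingTime("30 seconds") a new micro-batch starts every 30 seconds, and if a batch overruns, the next one starts as soon as the previous completes (the sink format here is just an example):

    import org.apache.spark.sql.streaming.Trigger

    val query = streamingDf.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()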
-3
votes
0 answers
21 views

Multiple input for a transformation in pyspark

I have to work with multiple dataframes in pyspark for a transformation, without any join or union. Is it feasible? If so, please let me know how.
1
vote
1 answer
23 views

How to load by partition into spark without knowing database table schema

I am trying to load a table with 40 million rows into Spark using a JDBC connection. Clearly, loading by partition is the answer to this. The problem is that I do not know the schema of the table I need to load ...
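A minimal sketch of a partitioned JDBC read: you only need one numeric column and its bounds to split on, not the full table schema (URL, table, credentials, and bounds are hypothetical):

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   // hypothetical connection
      .option("dbtable", "big_table")                    // hypothetical table
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "40000000")
      .option("numPartitions", "16")
      .option("user", "user")
      .option("password", "pass")
      .load()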
-1
votes
0 answers
22 views

How to sort a file on a key by field position in Scala Spark?

I have a data.txt file. This is my input: 0000000,浜地______,50,F,91 0000000,浜地______,50,F,59 0000000,浜地______,50,F,20 0000000,浜地______,50,F,76 0000003,杉山______,26,F,30 0000003,杉山______,26,F,50 0000003,...
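A minimal RDD sketch, assuming the sort key is the fifth comma-separated field (the numeric score) and a hypothetical output path:

    val sorted = sc.textFile("data.txt")
      .map(_.split(","))
      .sortBy(fields => fields(4).toInt, ascending = false)   // fields(4) = 5th field; adjust index/order as needed
      .map(_.mkString(","))

    sorted.saveAsTextFile("/tmp/sorted_out")                   // hypothetical output path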
0
votes
0 answers
23 views

Spark column with row number

I want to make a column with the row number in my 50M-row dataset and then get batches of 1M rows. I tried this solution: var dfW = cookesWb.withColumn("n", row_number().over(Window.orderBy("uid"))) ...
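Window.orderBy without partitionBy pulls everything into one partition; a common alternative (a sketch built on the names from the excerpt) is zipWithIndex on the underlying RDD:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{LongType, StructField}

    val indexedRdd = cookesWb.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
    val dfW = spark.createDataFrame(indexedRdd, cookesWb.schema.add(StructField("n", LongType, nullable = false)))

    val firstBatch = dfW.filter(col("n") < 1000000L)   // first bunch of 1M rows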
0
votes
2 answers
57 views

Apache Spark: cores vs. executors

Questions related to cores vs. executors have been asked a number of times on SO, e.g. "Apache Spark: The number of cores vs. the number of executors". As each case is different, I'm asking a similar question again. ...
-1
votes
1 answer
17 views

How to fix TF.IDF functions on pyspark?

I'm trying to develop a TF-IDF process on pyspark with MapReduce (the platform is Databricks). Since I'm really new to pyspark, Databricks, and the whole MapReduce process, I get some syntax ...
0
votes
1 answer
30 views

Spark Scala: “cannot resolve symbol saveAsTextFile (reduceByKey)” - IntelliJ Idea

I suppose some dependencies are not defined in the build.sbt file. I've added library dependencies to the build.sbt file, but I'm still getting the error mentioned in the title of this question. I tried to search ...
1
vote
1 answer
32 views

replace column values in spark dataframe based on dictionary similar to np.where

My data frame looks like - no city amount 1 Kenora 56% 2 Sudbury 23% 3 Kenora 71% 4 Sudbury 41% 5 ...
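The np.where-style replacement is usually expressed with chained when/otherwise; a minimal sketch (the new column name and replacement values are assumptions):

    import org.apache.spark.sql.functions._

    val out = df.withColumn("region",
      when(col("city") === "Kenora", "Northwest")
        .when(col("city") === "Sudbury", "Northeast")
        .otherwise("Other"))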
0
votes
1 answer
31 views

Why does loading a BigQuery table require a bucket?

I'm trying to load a BigQuery table into my program using Spark and Scala, but I'm having trouble understanding the role of 'buckets' in BigQuery. I followed the examples at https://github.com/samelamin/...