How to make sure that writing a CSV is complete?

wdmv1981 :

I'm writing a dataset to CSV as follows:

df.coalesce(1)
  .write()
  .format("csv")
  .option("header", "true")
  .mode(SaveMode.Overwrite)
  .save(sink);

sparkSession.streams().awaitAnyTermination();

How do I make sure that the output is written properly when the streaming job gets terminated?

I have the problem that the sink folder gets overwritten and is empty if I terminate too early/late.

Additional info: in particular, if the topic has no messages, my Spark job keeps running and overwrites the result with an empty file.

Jacek Laskowski :

"How do I make sure that the output is written properly when the streaming job gets terminated?"

Spark Structured Streaming runs a streaming query (job) continuously, and whether "the output is done properly" when it terminates depends on how it was terminated.

The question I'd ask is how the streaming query got terminated: by StreamingQuery.stop, or perhaps by Ctrl-C / kill -9?

If a streaming query is terminated forcefully (Ctrl-C / kill -9), you get what you asked for: a partial execution with no way to be sure that the output is correct, since the process (the streaming query) was shut down forcefully.

With StreamingQuery.stop, the streaming query terminates gracefully and writes out whatever it would have written by that time.
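One way to get from "someone wants the job to end" to an actual StreamingQuery.stop call is to poll for a marker file and stop the query once it shows up. The following is only a minimal sketch of that idea, not something from the Spark API: the class name StopOnMarker, the method awaitMarkerThenStop, and the marker path are all hypothetical, and the stop call is passed in as a plain Runnable so the helper has no Spark dependency.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: polls for a marker file and runs a stop action once it appears. */
final class StopOnMarker {

    /**
     * Blocks until the marker file exists, then runs stopAction exactly once.
     * In a Spark job, stopAction is assumed to wrap StreamingQuery.stop().
     */
    static void awaitMarkerThenStop(Path marker, Runnable stopAction, long pollMillis)
            throws InterruptedException {
        while (!Files.exists(marker)) {
            TimeUnit.MILLISECONDS.sleep(pollMillis); // wait before checking again
        }
        stopAction.run();
    }

    // Demo with a plain Runnable standing in for the real stop call.
    public static void main(String[] args) throws Exception {
        Path marker = Files.createTempFile("stop", ".marker"); // marker already exists
        awaitMarkerThenStop(marker, () -> System.out.println("stopping query"), 100L);
        Files.delete(marker);
    }
}
```

In a real job this would take the place of awaitAnyTermination, with the Runnable being a lambda that calls stop() on the StreamingQuery handle you got when the query was started. Depending on the Spark version, stop() may declare a checked exception, in which case the lambda has to catch it. Creating the marker file (for example with touch) then requests the graceful shutdown.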

"I have the problem that the sink folder gets overwritten and is empty if I terminate too early/late."

If you terminate too early/late, what else would you expect? The streaming query could not finish its work. Stop it gracefully and you get the expected output.

"Additional info: in particular, if the topic has no messages, my Spark job keeps running and overwrites the result with an empty file."

That's an interesting observation which requires further exploration.

If there are no messages to process, no batch is triggered, so no jobs run and no tasks get executed. Nothing should therefore "overwrite the result with an empty file".

