Currently Spark is supported for distributed workloads, but if a data science team wants to keep working in the Python data science stack (NumPy, pandas, scikit-learn), WSL offers no supported distributed solution. The Python community has largely converged on Dask as the answer to this. While Dask comes installed on WSL, its distributed functionality is unsupported.
Why is it useful?
|Who would benefit from this IDEA?||Any Data Engineers or Data Scientists who want to work on big data while continuing to use the Python data science stack.|
How should it work?
Dask Distributed should be supported, most likely through its Kubernetes implementation described here. WSL could spin pods up and down as directed by the Dask scheduler.
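The "spin pods up and down as the scheduler directs" behavior corresponds to the adaptive mode of the dask-kubernetes `KubeCluster`. A hedged sketch of what supported usage could look like (the `worker-spec.yml` file name and the scaling bounds are illustrative assumptions, and this requires a live Kubernetes cluster to actually run):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# Launch a Dask cluster on Kubernetes from a worker pod spec.
# "worker-spec.yml" is a hypothetical pod template defining the
# worker image, CPU, and memory requests.
cluster = KubeCluster.from_yaml("worker-spec.yml")

# Adaptive mode: the Dask scheduler adds and removes worker pods
# based on the current task load, within the given bounds.
cluster.adapt(minimum=0, maximum=10)

# Connect a client; subsequent dask.dataframe / dask.array work
# is then executed on the Kubernetes-backed workers.
client = Client(cluster)
```

In this setup the platform would only need to grant the scheduler permission to create and delete worker pods; the scaling decisions themselves come from Dask.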
|Priority Justification||As we ramp up use of WSL, we want our Data Scientists to be able to continue using the Python packages they're comfortable with, and not have to learn an entirely new paradigm with PySpark, or switch to Scala to work in Spark directly.|