Developer Tools

Testing PySpark

To run individual PySpark tests, you can use the run-tests script under the python directory. Test cases are located in the tests package under each PySpark package. Note that if you make changes to the Scala or Python side of Apache Spark, you need to manually rebuild Apache Spark before running the PySpark tests in order to apply the changes; running the PySpark testing script does not build it automatically.
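For reference, PySpark test cases are ordinary unittest classes that create a local SparkSession to test against. A minimal sketch of what such a test module could look like (the module, class, and test names here are hypothetical, not actual tests in the repository):

import unittest

from pyspark.sql import SparkSession


class MyFeatureTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Reuse one local SparkSession across all tests in this class.
        cls.spark = (SparkSession.builder
                     .master("local[2]")
                     .appName("MyFeatureTests")
                     .getOrCreate())

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_range_count(self):
        # Trivial sanity check against the local session.
        self.assertEqual(self.spark.range(10).count(), 10)


if __name__ == "__main__":
    unittest.main()

A module of this shape placed under a tests package can then be selected with --testnames, as shown below.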

Also, note that there is an ongoing issue affecting PySpark on macOS High Sierra and later. OBJC_DISABLE_INITIALIZE_FORK_SAFETY should be set to YES (for example, export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES before invoking run-tests) in order to run some of the tests. See the PySpark issue and Python issue for more details.

To run test cases in a specific module:

$ python/run-tests --testnames pyspark.sql.tests.test_arrow

To run test cases in a specific class:

$ python/run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests'

To run a single test case in a specific class:

$ python/run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion'

You can also run doctests in a specific module:

$ python/run-tests --testnames pyspark.sql.dataframe
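Doctests are the interactive examples embedded in docstrings; run-tests collects and executes them. A minimal sketch of a docstring that would run as a doctest (the helper function below is hypothetical, not part of pyspark.sql.dataframe):

def double_id(df):
    """Return df with its 'id' column doubled.

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.master("local[1]").getOrCreate()
    >>> double_id(spark.range(3)).collect()
    [Row(id=0), Row(id=2), Row(id=4)]
    """
    from pyspark.sql.functions import col
    return df.select((col("id") * 2).alias("id"))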

Lastly, there is another script called run-tests-with-coverage in the same location, which generates a coverage report for PySpark tests. It accepts the same arguments as run-tests.

$ python/run-tests-with-coverage --testnames pyspark.sql.tests.test_arrow --python-executables=python
...
Name                              Stmts   Miss Branch BrPart  Cover
-------------------------------------------------------------------
pyspark/__init__.py                  42      4      8      2    84%
pyspark/_globals.py                  16      3      4      2    75%
...
Generating HTML files for PySpark coverage under /…/spark/python/test_coverage/htmlcov

You can check the coverage report visually via the HTML files under /…/spark/python/test_coverage/htmlcov.

Please check other available options via python/run-tests[-with-coverage] --help.

Setting up PyCharm with PySpark

Import a project to PyCharm: File –> Open –> path_to_project.

After that, open ~/.bash_profile and add the lines below.

# Spark home
export SPARK_HOME=/.../spark-2.4.5-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
# Adjust the py4j version to match the zip file shipped under $SPARK_HOME/python/lib
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH

Source the ~/.bash_profile to reflect the changes.

source ~/.bash_profile
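To verify that the environment now picks up the bundled PySpark, a quick sanity check (not part of the original guide) is to start python from that shell and run:

import pyspark

# Should print the version of the Spark distribution pointed to by SPARK_HOME.
print(pyspark.__version__)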

Install pyspark and pypandoc: PyCharm –> Preferences –> Project Interpreter

https://www.pavanpkulkarni.com/img/2018/04/9_pycharm_project_interpreter.png

Go to PyCharm –> Preferences –> Project Structure. Click on Add Content Root. Here you need to add the same paths that were added to PYTHONPATH above: $SPARK_HOME/python and the py4j source zip under $SPARK_HOME/python/lib.

https://www.pavanpkulkarni.com/img/2018/04/10_pycharm_project_structure.png

Restart PyCharm, and then run the project. You should be able to see output.
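The output below comes from a simple word-count job. A minimal sketch of such a script, assuming a hypothetical input text file, might look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a plain-text file (hypothetical path) and count how often each word occurs.
lines = spark.read.text("input.txt")
counts = (lines
          .select(explode(split(col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count())

# show() prints the first 20 rows, matching the shape of the output below.
counts.show()
spark.stop()

Running it prints something like: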

+---------+-----+
|     word|count|
+---------+-----+
|      ...|    5|
|     July|    2|
|       By|    1|
|   North,|    1|
|   taking|    1|
|    harry|   18|
|     #TBT|    1|
|  Potter:|    3|
|character|    1|
|        7|    2|
|  Phoenix|    1|
|   Number|    1|
|      day|    1|
|   (Video|    1|
| seconds)|    1|
| Hermione|    3|
|    Which|    1|
|      did|    1|
|   Potter|   38|
|Voldemort|    1|
+---------+-----+
only showing top 20 rows

Process finished with exit code 0