To run individual PySpark tests, you can use the run-tests script under the python directory. Test cases are located in the tests package under each PySpark package. Note that if you make changes to the Scala or Python side of Apache Spark, you need to rebuild Apache Spark manually before running the PySpark tests in order for the changes to take effect; the PySpark test script does not build it automatically.
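For example, assuming the Maven build described in Spark's build instructions, a typical rebuild looks like:

$ ./build/mvn -DskipTests clean package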
Also, note that there is an ongoing issue with using PySpark on macOS High Sierra and above: OBJC_DISABLE_INITIALIZE_FORK_SAFETY should be set to YES in order to run some of the tests. See the PySpark issue and the Python issue for more details.
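For example, you can set it for a single test invocation:

$ OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES python/run-tests --testnames pyspark.sql.tests.test_arrow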
To run test cases in a specific module:
$ python/run-tests --testnames pyspark.sql.tests.test_arrow
To run test cases in a specific class:
$ python/run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests'
To run a single test case in a specific class:
$ python/run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion'
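To make the "module Class.method" naming concrete, here is a minimal, hypothetical test module (pyspark.sql.tests.test_example does not exist in Spark) in the plain unittest style the PySpark tests use:

# pyspark/sql/tests/test_example.py  (hypothetical)
import unittest

from pyspark.sql import SparkSession


class ExampleTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Share one local SparkSession across all test methods
        cls.spark = SparkSession.builder.master("local[2]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_count(self):
        # A trivial DataFrame assertion
        self.assertEqual(self.spark.range(10).count(), 10)


if __name__ == "__main__":
    unittest.main()

Such a test could then be run with python/run-tests --testnames 'pyspark.sql.tests.test_example ExampleTests.test_count'.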
You can also run doctests in a specific module:
$ python/run-tests --testnames pyspark.sql.dataframe
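Doctests are the interactive examples embedded in docstrings throughout the PySpark modules. A minimal, hypothetical illustration of what gets executed:

def add_one(x):
    """Return x plus one.

    >>> add_one(1)
    2
    """
    return x + 1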
Lastly, there is another script called run-tests-with-coverage in the same location, which generates a coverage report for the PySpark tests. It accepts the same arguments as run-tests.
$ python/run-tests-with-coverage --testnames pyspark.sql.tests.test_arrow --python-executables=python
...
Name                         Stmts   Miss Branch BrPart  Cover
---------------------------------------------------------------
pyspark/__init__.py             42      4      8      2    84%
pyspark/_globals.py             16      3      4      2    75%
...
Generating HTML files for PySpark coverage under /…/spark/python/test_coverage/htmlcov

You can check the coverage report visually via the HTML files under /…/spark/python/test_coverage/htmlcov.
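On macOS, for example, you can open the generated report directly in a browser (index.html is the entry point that coverage.py's HTML report generates):

$ open /…/spark/python/test_coverage/htmlcov/index.html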
Please check other available options via python/run-tests[-with-coverage] --help.
Import the project into PyCharm: File -> Open -> path_to_project.
After that, open ~/.bash_profile and add the lines below.
# Spark home
export SPARK_HOME=/.../spark-2.4.5-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
Source ~/.bash_profile to apply the changes.
source ~/.bash_profile
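As a quick sanity check that PYTHONPATH is set up correctly, importing pyspark from a plain Python shell should now work and print the version of your Spark distribution (2.4.5 in this example):

$ python -c "import pyspark; print(pyspark.__version__)"
2.4.5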
Install pyspark and pypandoc: PyCharm -> Preferences -> Project Interpreter.
Go to PyCharm -> Preferences -> Project Interpreter. Click on Add Content Root. Here you need to add the same paths you put on PYTHONPATH above: $SPARK_HOME/python and the py4j source zip under $SPARK_HOME/python/lib.
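For this walkthrough, assume the project contains a simple word-count script; a minimal sketch (the file name word_count.py and the input path tweets.txt are hypothetical):

# word_count.py  (hypothetical)
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Split each line on spaces and count how often each word occurs
lines = spark.read.text("tweets.txt")
words = lines.select(explode(split(lines.value, " ")).alias("word"))
words.groupBy("word").count().show()

spark.stop()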
Restart PyCharm and run the project. You should see output like the following.
+---------+-----+
|     word|count|
+---------+-----+
|      ...|    5|
|     July|    2|
|       By|    1|
|   North,|    1|
|   taking|    1|
|    harry|   18|
|     #TBT|    1|
|  Potter:|    3|
|character|    1|
|        7|    2|
|  Phoenix|    1|
|   Number|    1|
|      day|    1|
|   (Video|    1|
| seconds)|    1|
| Hermione|    3|
|    Which|    1|
|      did|    1|
|   Potter|   38|
|Voldemort|    1|
+---------+-----+
only showing top 20 rows

Process finished with exit code 0