Pandas API on Spark

  • Options and settings
    • Getting and setting options
    • Operations on different DataFrames
    • Default Index type
    • Available options
  • From/to pandas and PySpark DataFrames
    • pandas
    • PySpark
  • Transform and apply a function
    • transform and apply
    • pandas_on_spark.transform_batch and pandas_on_spark.apply_batch
  • Type Support in Pandas API on Spark
    • Type casting between PySpark and pandas API on Spark
    • Type casting between pandas and pandas API on Spark
    • Internal type mapping
  • Type Hints in Pandas API on Spark
    • pandas-on-Spark DataFrame and Pandas DataFrame
    • Type Hinting with Names
  • From/to other DBMSes
    • Reading and writing DataFrames
  • Best Practices
    • Leverage PySpark APIs
    • Check execution plans
    • Use checkpoint
    • Avoid shuffling
    • Avoid computation on single partition
    • Avoid reserved column names
    • Do not use duplicated column names
    • Specify the index column in conversion from Spark DataFrame to pandas-on-Spark DataFrame
    • Use distributed or distributed-sequence default index
    • Reduce the operations on different DataFrame/Series
    • Use pandas API on Spark directly whenever possible
  • FAQ
    • Should I use PySpark’s DataFrame API or pandas API on Spark?
    • Does pandas API on Spark support Structured Streaming?
    • How is pandas API on Spark different from Dask?

Created using Sphinx 3.0.4.