Scaling Python for Big Data

Video description

If you have some Python experience and want to take it to the next level, this practical, hands-on course will be a helpful resource. The video tutorials show you how to use Python for distributed task processing and how to perform large-scale data processing in Spark using the PySpark API.

Table of contents

  1. Building Data Pipelines with Python
    1. Welcome To The Course
    2. About The Author
    3. Introduction To Automation
    4. Adventures With Servers
    5. Being A Good Systems Caretaker
    6. What Is A Queue?
    7. What Is A Consumer? What Is A Producer?
    8. Why Celery?
    9. Celery Architecture Set Up
    10. Writing Your First Tasks
    11. Deploying Your Tasks
    12. Scaling Your Workers
    13. Monitoring With Flower
    14. Advanced Celery Features
    15. Why Dask?
    16. First Steps With Dask
    17. Dask Bags
    18. Dask Distributed
    19. What Are Data Pipelines? What Is A DAG?
    20. Luigi And Airflow: A Comparison
    21. First Steps With Luigi
    22. More Complex Luigi Tasks
    23. Introduction To Hadoop
    24. First Steps With Airflow
    25. Custom Tasks With Airflow
    26. Advanced Airflow: Subdags And Branches
    27. Using Luigi With Hadoop
    28. Apache Spark
    29. Apache Spark Streaming
    30. Django Channels
    31. And Many More
    32. Introduction To Testing With Python
    33. Property-Based Testing With Hypothesis
    34. What's Next?
  2. Introduction to PySpark
    1. Introduction And Course Overview
    2. About The Author
    3. Installing Python
    4. Installing IPython And Using Notebooks
    5. Download And Setup
    6. Running The Spark Shell
    7. Running The Spark Shell With IPython
    8. What Is A Resilient Distributed Dataset - RDD?
    9. Reading A Text File
    10. Actions
    11. Transformations
    12. Persisting Data
    13. Map
    14. Filter
    15. Flatmap
    16. MapPartitions
    17. MapPartitionsWithIndex
    18. Sample
    19. Union
    20. Intersection
    21. Distinct
    22. Cartesian
    23. Pipe
    24. Coalesce
    25. Repartition
    26. RepartitionAndSortWithinPartitions
    27. Reduce
    28. Collect
    29. Count
    30. First
    31. Take
    32. TakeSample
    33. TakeOrdered
    34. SaveAsTextFile
    35. CountByKey
    36. ForEach
    37. GroupByKey
    38. ReduceByKey
    39. AggregateByKey
    40. SortByKey
    41. Join
    42. CoGroup
    43. WholeTextFile
    44. Pickle Files
    45. HadoopInputFormat
    46. HadoopOutputFormat
    47. Broadcast Variables
    48. Accumulators
    49. Using A Custom Accumulator
    50. Partitioning
    51. Spark Standalone Cluster
    52. Mesos
    53. Yarn
    54. Client Versus Cluster Mode
    55. Spark Streaming
    56. Dataframes And SQL
    57. MLlib
    58. Resources And Where To Go From Here
    59. Wrap Up

Product information

  • Title: Scaling Python for Big Data
  • Author(s): O'Reilly Media, Inc.
  • Release date: December 2016
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491977798