Practical Data Cleaning with Python Resources

Posted on Wed 03 May 2017 in trainings

Practical Data Cleaning Resources

(O'Reilly Live Online Training)

This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.

This post is meant as a resource for those attending the class, and for anyone else interested in practical data cleaning with Python. If you have tips, ideas for extra content, or links to add, feel free to comment or reach out via Twitter or email.

Hope you enjoy!

Libraries / Repositories

  • Course Repository: https://github.com/kjam/data-cleaning-101

Deduplication

  • Dedupe: https://github.com/dedupeio/dedupe
  • CSV Dedupe: https://github.com/dedupeio/csvdedupe

String Matching

  • FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy
  • textacy: https://github.com/chartbeat-labs/textacy
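
If you want a feel for what fuzzy string matching does before installing anything, here is a minimal sketch that uses only the standard library's difflib. It is a stand-in for the idea, not the FuzzyWuzzy or textacy API. The helper names, the sample company names and the 0.8 threshold are all made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio after light normalization."""
    # Lowercase and collapse runs of whitespace so trivial differences
    # do not lower the score.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def best_match(query, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing clears the threshold."""
    if not candidates:
        return None
    # Score every candidate and keep the highest-scoring one.
    score, match = max((similarity(query, c), c) for c in candidates)
    return match if score >= threshold else None


known_names = ["ACME Corp", "Apex Inc"]  # hypothetical reference list
print(best_match("Acme Corp.", known_names))
print(best_match("Zebra", known_names))
```

The threshold is the part you will likely need to tune against your own data. Too low and you merge records that are genuinely different, too high and you miss obvious typos.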

Managing Nulls

  • Pandas functions: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
  • Dora: https://github.com/NathanEpstein/Dora
  • Badfish: https://github.com/harshnisar/badfish

Normalization & Preprocessing

  • Scikit-learn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html
  • Pandas stats: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics

Specific data cleaning topics

  • Privacy? https://github.com/datascopeanalytics/scrubadub
  • Measurements? http://pint.readthedocs.io/
  • Versioning ML Data? https://github.com/NathanEpstein/Dora
  • Dates? http://arrow.readthedocs.io/en/latest/ or https://github.com/kennethreitz/maya
  • AutoClean? https://github.com/rhiever/datacleaner
  • DIY Parser? https://github.com/datamade/parserator

Simple pipelines / graphs, task processing

  • Dask: https://github.com/dask/dask
  • Distributed: https://github.com/dask/distributed

Schema Validation

  • Voluptuous: https://github.com/alecthomas/voluptuous
  • Validr: https://github.com/guyskk/validr
  • With Serialization: https://marshmallow.readthedocs.io/en/latest/
  • For JVM / Apache: https://avro.apache.org/

Dataframe Validation

  • Engarde: https://github.com/TomAugspurger/engarde
  • Validada: https://github.com/jnmclarty/validada

Constraint Detection

  • TDDA: Test-Driven Data Analysis: https://github.com/tdda/tdda
  • SciPy: https://docs.scipy.org/doc/scipy-0.19.0/reference/stats.html#statistical-functions

Property-based Testing

  • Hypothesis: https://hypothesis.readthedocs.io/
  • Haskell's QuickCheck: https://hackage.haskell.org/package/QuickCheck
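
To show the idea behind property-based testing without extra dependencies, here is a hand-rolled sketch that uses the standard library's random module. This is not the Hypothesis API. Hypothesis generates and shrinks inputs far more cleverly. The normalize_whitespace function and the two properties checked (idempotence and no double spaces) are hypothetical examples.

```python
import random
import string


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return " ".join(text.split())


def random_text(rng: random.Random, max_len: int = 30) -> str:
    """Build a random string of letters mixed with spaces, tabs and newlines."""
    alphabet = string.ascii_letters + " \t\n"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))


def check_properties(trials: int = 500, seed: int = 0) -> int:
    """Check properties that should hold for any input, not just hand-picked ones."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        sample = random_text(rng)
        once = normalize_whitespace(sample)
        # Property 1: cleaning already-clean text changes nothing.
        assert normalize_whitespace(once) == once, repr(sample)
        # Property 2: the output never contains two spaces in a row.
        assert "  " not in once, repr(sample)
    return trials


check_properties()
```

The point is to state what should always be true about your cleaning function and let generated inputs hunt for counterexamples.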

More Validation and Testing

  • Model Cross Validation: http://scikit-learn.org/stable/modules/cross_validation.html
  • Testing ML features: https://github.com/machinalis/featureforge
  • Built-in Stats: https://docs.python.org/3/library/statistics.html
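
As a small taste of the built-in statistics module, here is a sketch of a z-score check that flags suspicious values in a column. The function name, the sample readings and the threshold of 2.0 are assumptions for illustration. A z-score cutoff is only one of many ways to look for outliers, and it works poorly on small or heavily skewed samples.

```python
import statistics


def flag_outliers(values, z_threshold=2.0):
    """Return the values lying more than z_threshold sample standard deviations from the mean."""
    # statistics.stdev needs at least two data points.
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mean) / stdev > z_threshold]


readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 95.0]  # made-up sensor data
print(flag_outliers(readings))
```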

Unit Testing Basics

  • pytest: https://docs.pytest.org/en/latest/
  • Mocking: https://docs.python.org/3/library/unittest.mock-examples.html
  • Faking Data with Faker: https://faker.readthedocs.io/en/master/
  • Faker CSVs: https://github.com/pereorga/csvfaker
  • Watch Ned Batchelder’s testing talk
  • Continuous Integration: TravisCI, Jenkins, TeamCity and many more
  • Better Code Reviews: http://www.bettercode.reviews/
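
Here is a minimal sketch of how mocking can keep a data cleaning test away from a real database or API. It uses unittest.mock from the standard library. The clean_names function and its fetch_rows argument are hypothetical, and the test function is named so that pytest would pick it up.

```python
from unittest.mock import Mock


def clean_names(fetch_rows):
    """Trim and title-case the 'name' field, dropping rows where it is missing or blank."""
    cleaned = []
    for row in fetch_rows():
        name = (row.get("name") or "").strip()
        if name:
            cleaned.append({**row, "name": name.title()})
    return cleaned


def test_clean_names():
    # The mock stands in for whatever would normally load the rows.
    fake_fetch = Mock(return_value=[
        {"name": "  ada lovelace "},
        {"name": None},
        {"name": ""},
    ])
    result = clean_names(fake_fetch)
    assert result == [{"name": "Ada Lovelace"}]
    # Confirm the loader was called exactly once, with no arguments.
    fake_fetch.assert_called_once_with()


test_clean_names()
```

Passing the loader in as an argument is a design choice that makes this easy. If your function reaches out to the data source itself, unittest.mock.patch is the usual alternative.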

Testing Pipelines

  • Data Quality Checks with Spark DataFrames
  • Drunken Data Quality (Spark DF): https://github.com/FRosner/drunken-data-quality
  • Apache Beam: https://beam.apache.org/documentation/pipelines/test-your-pipeline/
  • Tip: Check your framework first!

Open Datasets (to try out your skills!)

Research

That's all for now! Check back, as I plan to keep updating this list with more libraries and examples.