Practical Data Cleaning with Python Resources

Posted on Wed 03 May 2017 in trainings

Practical Data Cleaning Resources

(O'Reilly Live Online Training)

This week I will be giving my first O'Reilly Live Online Training via the Safari platform. I'm pretty excited to share some of my favorite data cleaning libraries and tips for validating and testing your data workflows.

This post is meant as a resource for those attending the class, and for anyone else interested in practical data cleaning with Python. If you have tips or ideas on extra content or links to add, feel free to comment or reach out via Twitter or email.

Hope you enjoy!

Libraries / Repositories

Course Repository: https://github.com/kjam/data-cleaning-101

Deduplication

Dedupe: https://github.com/dedupeio/dedupe
CSV Dedupe: https://github.com/dedupeio/csvdedupe

String Matching

Fuzzy Wuzzy: https://github.com/seatgeek/fuzzywuzzy
Textacy: https://github.com/chartbeat-labs/textacy
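
To get a feel for what string matching does before installing anything, here is a minimal sketch using only the standard library's difflib as a stand-in. It is not the Fuzzy Wuzzy API; the helper names (similarity, best_match), the sample city list, and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio after lowercasing and collapsing whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def best_match(query, choices, threshold=0.8):
    """Return the closest choice, or None if nothing clears the threshold.

    `threshold` is an arbitrary illustrative cutoff; tune it on your own data.
    Expects a non-empty list of choices.
    """
    score, choice = max((similarity(query, c), c) for c in choices)
    return choice if score >= threshold else None


# Hypothetical reference list of canonical spellings.
cities = ["New York", "Los Angeles", "San Francisco"]

print(best_match("new  york ", cities))  # messy input still resolves to "New York"
print(best_match("Berlin", cities))      # nothing is close enough, so None
```

The dedicated libraries above add faster scoring and token-based comparisons, so treat this only as a way to explore the idea.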

Managing Nulls

Pandas functions: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
Dora: https://github.com/NathanEpstein/Dora
Badfish: https://github.com/harshnisar/badfish
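
As a quick taste of the pandas functions linked above, here is a small sketch of counting, dropping, and filling missing values. The column names and values are invented for illustration, and filling with the median is just one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample data with gaps in both columns.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", None, "Dee"],
        "age": [34.0, np.nan, 29.0, 41.0],
    }
)

# Count missing values per column before deciding how to handle them.
null_counts = df.isna().sum()
print(null_counts)

# Drop rows that lack a name, then fill missing ages with the median
# age of the rows that remain.
cleaned = df.dropna(subset=["name"]).copy()
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
print(cleaned)
```

Looking at the counts first helps you decide whether dropping or filling makes sense for each column.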

Normalization & Preprocessing

Scikit-learn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html
Pandas stats: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics

Specific data cleaning topics

Privacy? https://github.com/datascopeanalytics/scrubadub
Measurements? http://pint.readthedocs.io/
Versioning ML Data? https://github.com/NathanEpstein/Dora
Dates? http://arrow.readthedocs.io/en/latest/ or https://github.com/kennethreitz/maya
AutoClean? https://github.com/rhiever/datacleaner
DIY Parser? https://github.com/datamade/parserator

Simple pipelines / graphs, task processing

Dask: https://github.com/dask/dask
Distributed: https://github.com/dask/distributed

Schema Validation

Voluptuous: https://github.com/alecthomas/voluptuous
Validr: https://github.com/guyskk/validr
With Serialization: https://marshmallow.readthedocs.io/en/latest/
For JVM / Apache: https://avro.apache.org/

Dataframe Validation

Engarde: https://github.com/TomAugspurger/engarde
Validada: https://github.com/jnmclarty/validada

Constraint Detection

TDDA: Test-Driven Data Analysis: https://github.com/tdda/tdda
SciPy: https://docs.scipy.org/doc/scipy-0.19.0/reference/stats.html#statistical-functions

Property-based Testing

Hypothesis: https://hypothesis.readthedocs.io/
Haskell's QuickCheck: https://hackage.haskell.org/package/QuickCheck

More Validation and Testing

Model Cross Validation: http://scikit-learn.org/stable/modules/cross_validation.html
Testing ML features: https://github.com/machinalis/featureforge
Built-in Stats: https://docs.python.org/3/library/statistics.html
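
The built-in statistics module is enough for a simple sanity check on a numeric column. The sketch below flags values that sit far from the median, using the median absolute deviation. The function name, the sample readings, and the 3.5 cutoff with the 0.6745 scaling constant (a commonly cited rule of thumb) are assumptions for illustration, not a rule that fits every dataset.

```python
import statistics


def flag_outliers(values, cutoff=3.5):
    """Return values whose median-based score exceeds `cutoff`.

    Uses the median absolute deviation (MAD), which is less sensitive
    to the outliers themselves than the mean and standard deviation.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # All values (or most of them) are identical; nothing to scale by.
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > cutoff]


# Hypothetical sensor readings with one obviously bad entry.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0]
print(flag_outliers(readings))
```

A check like this works well as an assertion inside a pipeline step, so suspicious values get reviewed instead of silently flowing downstream.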

Unit Testing Basics

pytest: https://docs.pytest.org/en/latest/
Mocking: https://docs.python.org/3/library/unittest.mock-examples.html
Faking Data with Faker: https://faker.readthedocs.io/en/master/
Faker CSVs: https://github.com/pereorga/csvfaker
Watch Ned Batchelder’s testing talk
Continuous Integration: TravisCI, Jenkins, TeamCity and many more
Better Code Reviews: http://www.bettercode.reviews/
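
To show how mocking fits into testing a cleaning step, here is a minimal sketch using unittest.mock from the standard library. The load_clean_emails function and the client object with a fetch_rows method are hypothetical, made up for this example; the point is that the test never touches a real data source.

```python
from unittest.mock import Mock


def load_clean_emails(client):
    """Fetch raw rows from `client` and return normalized, de-duplicated emails."""
    seen = set()
    emails = []
    for row in client.fetch_rows():
        # Treat a missing key and an explicit None the same way.
        email = (row.get("email") or "").strip().lower()
        if email and email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


def test_load_clean_emails():
    # The mock stands in for a database or API client.
    client = Mock()
    client.fetch_rows.return_value = [
        {"email": " Ana@Example.com "},
        {"email": "ana@example.com"},
        {"email": None},
        {"name": "no email"},
    ]
    assert load_clean_emails(client) == ["ana@example.com"]
    client.fetch_rows.assert_called_once_with()


test_load_clean_emails()
```

Saved in a file whose name starts with test_, the test function would also be picked up by pytest without the final explicit call.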

Testing Pipelines

Data Quality Checks with Spark DataFrames
Drunken Data Quality (Spark DF): https://github.com/FRosner/drunken-data-quality
Apache Beam: https://beam.apache.org/documentation/pipelines/test-your-pipeline/
Tip: Check your framework first!

Open Datasets (to try out your skills!)

Kaggle Datasets: beyond just competition data, Kaggle also has shared datasets curated by users.
Awesome Datasets GitHub List
Quora: Where can I find large public datasets?
Scikit-learn datasets
Dataquest.io: 17 places to find open datasets for projects
NLTK Data: NLP data such as books, scripts, articles and poems

Research

That's all for now! Check back as I plan to update and evolve this list with more libraries and examples.