As datasets grow beyond what even entire teams of human curators can manage, there is an increasing need to automate the categorization of records and the removal of errors. This talk will discuss advances in machine learning and the kinds of data processing pipelines that enable massively parallel processing, making it possible to automatically clean and categorize even the largest datasets.