Data Cleaning Software Options - Pros And Cons

Posted 2019-07-08 by Tom

One of the biggest grievances data scientists have is that they spend 80% of their time cleaning and prepping data sets. This isn’t just an apocryphal cliché, but a tangible obstacle keeping data scientists from the creative and insightful parts of their work. Our community has highlighted preprocessing as its most common and fundamental issue. Consequently, many companies have sought to remedy this. We’ve reviewed several of these tools to determine whether they live up to their claims, which types of projects they work best for, and whether they make the tedious necessity of data cleaning more tolerable, or even a forgotten ailment entirely.

Ataccama ONE

What is it for:

Data Organisation & Understanding, Data Cleaning, Data Aggregation, Data Access

What does it do:

A ‘Data Curation’ platform that ties together several tools. It covers data organisation (Master Data Management) and profiling (available in the free version) through to data cleaning and data governance.

Pros:

Ataccama ONE offers a collaborative workspace where teams can share projects and leave comments to improve workflow, and it supports a very wide variety of data input options through an intuitive interface. It lets you preview a dataset before spending the time to upload it in full, and is adept at recognising more unusual data types such as enums and datetimes.

The string analysis is noteworthy: it reports a common regex pattern, the variation in length, and string length statistics. The platform also shows its versatility by offering private cloud and on-premises solutions.
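To give a feel for what this kind of string profiling involves, here is a minimal Python sketch. It is our own illustration and not how the product works internally: the function names `shape_of` and `profile_strings`, the digit-to-`9` and letter-to-`A` convention, and the sample postcodes are all assumptions made for the example.

```python
import re
import statistics
from collections import Counter


def shape_of(value: str) -> str:
    """Reduce a string to a coarse pattern: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    shape = re.sub(r"[A-Za-z]", "A", shape)
    return shape


def profile_strings(values):
    """Summarise a column of strings: dominant pattern and length statistics."""
    shapes = Counter(shape_of(v) for v in values)
    lengths = [len(v) for v in values]
    pattern, count = shapes.most_common(1)[0]
    return {
        "most_common_pattern": pattern,
        "pattern_share": count / len(values),  # fraction of rows that fit it
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": statistics.mean(lengths),
    }


# Hypothetical sample column; one value deviates from the dominant shape.
postcodes = ["SW1A 1AA", "EC1A 1BB", "W1A 0AX", "EC2M 7PY"]
profile = profile_strings(postcodes)
print(profile)
```

Here three of the four values share the shape `AA9A 9AA`, so the one that does not stands out as a candidate for a closer look.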

Cons:

However, it does have its drawbacks and limitations, such as the inability to create plots beyond those provided, and the need for additional applications. The profiling does not have the desired granularity, defaulting to quantiles and counts, with no attempt at further analysis beyond this. You cannot plot a distribution.
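For comparison, a profile of quantiles and counts, plus the distribution view we were missing, takes only a few lines once the column is exported. The sketch below is a rough illustration using Python's standard library; the `ages` values and both function names are made up for the example, and the histogram is plain text rather than a real chart.

```python
import statistics
from collections import Counter


def quartiles_and_counts(values):
    """A summary limited to counts and quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"count": len(values), "distinct": len(set(values)),
            "q1": q1, "median": median, "q3": q3}


def text_histogram(values, bin_width):
    """Bucket integer values into fixed-width bins, one row of '#' per bin.

    Bins with no values are simply not listed.
    """
    bins = Counter((v // bin_width) * bin_width for v in values)
    rows = []
    for start in sorted(bins):
        label = f"{start:>4}-{start + bin_width - 1:<4}"
        rows.append(f"{label} {'#' * bins[start]}")
    return rows


# Hypothetical column of ages with one outlier.
ages = [21, 22, 22, 25, 31, 34, 35, 38, 39, 62]
summary = quartiles_and_counts(ages)
histogram = text_histogram(ages, bin_width=10)
print(summary)
print("\n".join(histogram))
```

The quartiles alone hide the gap between the thirties and the lone value in the sixties, which the binned view makes obvious.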


CSV Explorer

What is it for:

Data Understanding, Data Cleaning

What does it do:

Opens big CSV files with millions of rows. Search, filter, plot, or export to Excel for further analysis.

Pros:

CSV Explorer’s greatest asset is its speed: it works quickly on large datasets, which is the archetypal problem with huge CSV files. Try opening millions of rows and see how much your laptop appreciates that.

Its search, sorting, and filtering functionality is commendable (it can filter rows by regex), allowing quick manipulation of the data.
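If you have not filtered rows by regex before, the idea looks roughly like this in Python. This is a sketch of the general technique using the standard `csv` and `re` modules, not a description of how CSV Explorer does it; the `filter_rows` helper, the column names, and the sample data are hypothetical.

```python
import csv
import io
import re


def filter_rows(lines, column, pattern):
    """Yield rows (as dicts) whose `column` value matches the regex `pattern`."""
    regex = re.compile(pattern)
    for row in csv.DictReader(lines):
        if regex.search(row[column]):
            yield row


# A small in-memory stand-in for a large file. With a real file you would
# pass the open file handle instead, so rows are read one at a time.
sample = io.StringIO(
    "id,email\n"
    "1,ann@example.com\n"
    "2,not-an-email\n"
    "3,bob@example.org\n"
)
matches = list(filter_rows(sample, "email", r"@example\.(com|org)$"))
print([row["id"] for row in matches])
```

Because the function is a generator, it never holds more than one row in memory, which is what makes this approach workable on files with millions of rows.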

Cons:

Its simplistic approach limits what you can do with the data afterwards, offering only a basic range of plot types (line charts, bar charts, histograms). There is no functionality to correct nulls, so the user has to remove the affected rows instead. There are good possibilities for working with dates, but it failed to recognise dates in the dataset we used. The aggregation functions are also basic (min, max, mean, sum) and summary statistics are offered. Unlike other software we have reviewed, this one only offers a cloud solution, which could be a deal breaker if you work with sensitive data.
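To show what correcting nulls, rather than dropping rows, can look like alongside those four aggregations, here is a small Python sketch. It is an illustration only: the `fill_missing` and `aggregate` helpers and the `prices` values are invented for the example, and filling with the median is just one common choice among several.

```python
import statistics


def fill_missing(values, fill):
    """Replace None entries with `fill` instead of dropping the rows."""
    return [fill if v is None else v for v in values]


def aggregate(values):
    """The four basic aggregations: min, max, mean and sum."""
    return {"min": min(values), "max": max(values),
            "mean": statistics.mean(values), "sum": sum(values)}


# Hypothetical column with two missing values.
prices = [10, None, 30, 20, None]
present = [v for v in prices if v is not None]

# Fill gaps with the median of the known values (an assumption for this
# example; the right strategy depends on the dataset).
filled = fill_missing(prices, statistics.median(present))
stats = aggregate(filled)
print(filled)
print(stats)
```

All five rows survive, whereas removing the rows with nulls would have discarded 40% of this column.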


Google Cloud Dataprep by Trifacta

What is it for:

Data Exploration, Data Cleaning, Data Aggregation

What does it do:

Trifacta worked with Google on a cloud-optimized, enterprise version of its existing Data Wrangler offering. It allows self-service data preparation on large datasets with a GUI.

Pros:

Trifacta seeks to accelerate data preparation and maximize data quality, and its strengths include native integration with Google Cloud. It brings comprehensive data wrangling functionality over from the existing Data Wrangler, which can be used by those without advanced technical expertise. The real-time data wrangling offers fast feedback on large data sets, and the approachable GUI is aimed at users with a broad range of technical expertise.

Cons:

There are no tutorial features, leaving the user feeling overwhelmed and unsure of how to accomplish anything except by trial and error. It cannot be used outside Google Cloud; users who need that must use Data Wrangler separately.

File uploads are capped at 100 MB, which is a hindrance, and it attempts to plot a distribution even in situations where this is not relevant.
