How to deal with Large Datasets in Machine Learning

Sai Durga Kamesh Kota · Published in Analytics Vidhya · 6 min read · Jun 19, 2020


Not Big Data…

Illustration of large datasets

Datasets are collections of instances that share a common set of attributes. A machine learning project will generally involve a few different datasets, each fulfilling a different role in the system.

When an experienced data scientist works on an ML project, roughly 60 percent of the effort goes into analyzing the dataset, a stage we call Exploratory Data Analysis (EDA). That means data plays a major role in machine learning. In the real world, we often have huge volumes of data to work with, which makes reading and processing it with plain pandas infeasible: it takes too long, and we generally have limited resources. To make this feasible, researchers and practitioners have come up with different techniques for dealing with large datasets.

I will now walk through these techniques with examples. For the practical implementation I am using Google Colab, which has a RAM capacity of 12.72 GB.

Let's consider a dataset of random integers ranging from 0 (inclusive) to 10 (exclusive), with 1,000,000 rows and 400 columns.
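The original code cells are not reproduced here, so below is a minimal sketch of what the generation step could look like; the variable and column names are my own choices.

```python
# Sketch: 1,000,000 rows x 400 columns of random integers in [0, 10).
# (In Colab, prefix the cell with %%time to see the CPU and wall time.)
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randint(0, 10, size=(1_000_000, 400)),
    columns=[f"col_{i}" for i in range(400)],
)
```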

The CPU time and wall time for executing the above code are as follows:-

Now let's convert this data frame to a CSV file.
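A sketch of that step, assuming the file is called dataset.csv:

```python
# Write the DataFrame to disk as a CSV file (~763 MB for this data).
df.to_csv("dataset.csv", index=False)
```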

The CPU time and wall time for executing the above code are as follows:-

Now load the dataset you just generated (nearly 763 MB) using pandas and see what happens.
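Something along these lines, assuming the same dataset.csv file:

```python
import pandas as pd

# Try to load the entire CSV in one call; on a 12.72 GB Colab instance
# this can exhaust the available RAM and crash the session.
df = pd.read_csv("dataset.csv")
```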

When you execute the above line of code, the notebook crashes because of the unavailability of RAM. Here I have taken a relatively small dataset of around 763 MB; now imagine a scenario where you need to work with tons of data. What is the plan to solve this issue?

Techniques for handling large datasets:-

1. Reading CSV files in chunks:-

When we read a large CSV file with the chunksize parameter, the data is not loaded all at once; instead, pandas returns a parser object (an iterator) that yields the file chunk by chunk. We iterate over this object and concatenate the chunks to rebuild the full data frame, which takes less time.

The CSV file generated above consists of 1,000,000 rows and 400 columns, so let's read it with a chunk size of 100,000 rows.
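A minimal sketch of the chunked read, with the same assumed filename:

```python
import pandas as pd

# chunksize makes read_csv return a TextFileReader (an iterator of
# DataFrames) instead of loading everything into memory at once.
chunks = pd.read_csv("dataset.csv", chunksize=100_000)
```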

The CPU time and wall time for executing the above code are as follows:-

Now we iterate over the chunks, store them in a list, and concatenate them to form the complete dataset.
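For example:

```python
import pandas as pd

# Collect each 100,000-row chunk in a list, then concatenate them
# into a single DataFrame.
chunk_list = []
for chunk in pd.read_csv("dataset.csv", chunksize=100_000):
    chunk_list.append(chunk)

df = pd.concat(chunk_list, ignore_index=True)
```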

The CPU time and wall time for executing the above code are as follows:-

We can observe how drastically the reading time improves. In this way, we can read large datasets, reduce reading time, and sometimes avoid system crashes.

2. Changing the size of datatypes:-

Operations on large datasets take more time. To reduce this, we can downcast the datatypes of certain columns, for example int64 → int32 or float64 → float32, which shrinks the space they occupy; the result can then be saved to a CSV file for further use.

For example, if we apply this to the data frame after chunking and compare memory usage before and after, both the file size and the memory usage are roughly halved, which ultimately reduces CPU time.
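A sketch of the downcasting step (here every column is int64, so a single astype call is enough):

```python
import numpy as np

# Downcast int64 columns to int32; float64 -> float32 would follow the
# same pattern. Safe here because all values lie in [0, 10).
df_small = df.astype(np.int32)

# Compare memory usage before and after the conversion.
df.info(memory_usage="deep")
df_small.info(memory_usage="deep")
```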

Memory usage before and after the datatype conversion is as follows:-

Info of data frame before conversion
Info of data frame after conversion

Here we can clearly observe that memory usage is 3 GB before the datatype conversion and 1.5 GB after it. If we measure performance by computing the mean on the data frame before and after the conversion, CPU time is reduced, and our goal is achieved.

3. Removing unwanted columns from the data frame:-

We can remove unwanted columns from the dataset so that the memory used by the loaded data frame is reduced, which improves CPU performance when we perform different operations on the dataset.
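For instance (the column names here are placeholders):

```python
# Drop columns that are not needed for the analysis.
df = df.drop(columns=["col_398", "col_399"])
```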

4. Change the Data Format:-

Is your data stored in raw ASCII text, like a CSV file?

Perhaps you can speed up data loading and use less memory by using another data format. A good example is a binary format like GRIB, NetCDF, or HDF. There are many command-line tools that you can use to transform one data format into another that do not require the entire dataset to be loaded into memory. Using another format may allow you to store the data in a more compact form that saves memory, such as 2-byte integers, or 4-byte floats.
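As one possible illustration with pandas, the data could be stored in HDF5 instead of CSV; this requires the PyTables package, and the filename and key below are my own choices.

```python
import pandas as pd

# Save the DataFrame in a binary HDF5 store instead of CSV.
df.to_hdf("dataset.h5", key="data", mode="w")

# Reading it back is typically much faster than re-parsing the CSV.
df = pd.read_hdf("dataset.h5", key="data")
```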

5. Object Size reduction with correct datatypes:-

Generally, the memory usage of a data frame can be reduced by converting columns to the correct datatypes. Almost every dataset includes object (string) columns, which are not memory efficient. Dates and categorical features such as region, city, or place names are often stored as strings, which takes more memory; converting them to the appropriate types, such as datetime and category, can reduce memory usage by more than a factor of ten.
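A small, hypothetical illustration (this sales data frame and its columns are made up for the example and are not part of the generated dataset):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "order_date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
})

# Strings -> category, date strings -> datetime64; both use far less
# memory than plain object columns on large data.
sales["region"] = sales["region"].astype("category")
sales["order_date"] = pd.to_datetime(sales["order_date"])

sales.info(memory_usage="deep")
```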

6. Using Fast loading libraries like Vaex:-

Vaex is a high-performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, standard deviation, etc, on an N-dimensional grid for more than a billion (10^9) samples/rows per second. Visualization is done using histograms, density plots, and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy, and lazy computations for best performance (no memory wasted).

Now let's implement the vaex library in the above randomly generated dataset to observe the performance.

  1. First, we need to install the vaex library from the command prompt/shell, depending on the OS you are using.
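In a Colab or Jupyter notebook cell this looks like the following (from a plain terminal, drop the leading "!"):

```python
!pip install vaex
```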

2. Then, we need to convert the CSV file to an HDF5 file using the vaex library.
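A sketch of the conversion, assuming the dataset.csv file from before:

```python
import vaex

# convert=True makes vaex write a dataset.csv.hdf5 file next to the CSV
# and return a memory-mapped DataFrame backed by that file.
vdf = vaex.from_csv("dataset.csv", convert=True)
```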

After executing the above code, a dataset.csv.hdf5 file is generated in your working directory. The time taken for the conversion is as follows:-

It is observed that it took nearly 39 seconds to convert the CSV to an HDF5 file, which is a short time relative to the size of the file.

3. Reading the HDF5 file using vaex:-

Now we open the HDF5 file with the open function in the vaex library.
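For example:

```python
import vaex

# vaex.open memory-maps the HDF5 file; the data is not copied into RAM,
# which is why opening a ~3 GB file takes well under a second.
vdf = vaex.open("dataset.csv.hdf5")
```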

Looking at the output of the above code, it took about 697 ms to read the HDF5 file, which shows how fast vaex is at reading a 3 GB HDF5 file. This is the real advantage of the vaex library.

Using vaex, we can perform different operations on large data frames, such as:

  1. Expression System
  2. Out of core data frame
  3. Fast groupby / aggregations
  4. Fast and efficient join

If you want to explore more about the vaex library, check it out here.

Conclusion:-

In this way, we can follow these techniques when handling large datasets in machine learning.

If you liked this article, give it a 👏. If you want to connect with me on LinkedIn, the link is given below.
