MATLAB Answers

Processing Big Data Files

Ugur Acar on 24 Oct 2019
Answered: Fangjun Jiang on 24 Oct 2019
I have a 120 MB txt file with around 3,600,000 rows. I need to read this data using the script generated from the Import Data menu.
But when I try to run the script, it gives an out-of-memory error. Is there any other way to read data this large?
I have an MSI laptop with an i7-7700HQ CPU @ 2.80 GHz and 8 GB of RAM.
%% Initialize variables.
filename = 'sicaklik.txt';
delimiter = '|';
startRow = 2;
formatSpec = '%s%s%s%s%s%s%s%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r','n','UTF-8');
%% Skip the BOM (Byte Order Mark).
fseek(fileID, 3, 'bof');
%% Read columns of data according to the format.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'TextType', 'string', 'HeaderLines' ,startRow-1, 'ReturnOnError', false, 'EndOfLine', '\r\n');
%% Close the text file.
fclose(fileID);
% Convert the contents of columns containing numeric text to numbers.
%% Replace non-numeric text with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
%%
for col=1:length(dataArray)-1
    raw(1:length(dataArray{col}),col) = mat2cell(dataArray{col}, ones(length(dataArray{col}), 1));
end
%%
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[1,3,4,5,6,7]
    % Converts text in the input cell array to numbers. Replaced non-numeric
    % text with NaN.
    rawData = dataArray{col};
    for row=1:size(rawData, 1)
        % Create a regular expression to detect and remove non-numeric prefixes and
        % suffixes.
        regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
        try
            result = regexp(rawData(row), regexstr, 'names');
            numbers = result.numbers;
            % Detected commas in non-thousand locations.
            invalidThousandsSeparator = false;
            if numbers.contains(',')
                thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
                if isempty(regexp(numbers, thousandsRegExp, 'once'))
                    numbers = NaN;
                    invalidThousandsSeparator = true;
                end
            end
            % Convert numeric text to numbers.
            if ~invalidThousandsSeparator
                numbers = textscan(char(strrep(numbers, ',', '')), '%f');
                numericData(row, col) = numbers{1};
                raw{row, col} = numbers{1};
            end
        catch
            raw{row, col} = rawData{row};
        end
    end
end
%% Split data into numeric and string columns.
rawNumericColumns = raw(:, [1,3,4,5,6,7]);
rawStringColumns = string(raw(:, 2));
%% Make sure any text containing <undefined> is properly converted to an <undefined> categorical
idx = (rawStringColumns(:, 1) == "<undefined>");
rawStringColumns(idx, 1) = "";
%% Create output variable
all_cities = table;
all_cities.Istasyon_No = cell2mat(rawNumericColumns(:, 1));
all_cities.Istasyon_Adi = categorical(rawStringColumns(:, 1));
all_cities.YIL = cell2mat(rawNumericColumns(:, 2));
all_cities.AY = cell2mat(rawNumericColumns(:, 3));
all_cities.GUN = cell2mat(rawNumericColumns(:, 4));
all_cities.SAAT = cell2mat(rawNumericColumns(:, 5));
all_cities.SICAKLIK_C = cell2mat(rawNumericColumns(:, 6));
%% Clear temporary variables
clearvars filename delimiter startRow formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp rawNumericColumns rawStringColumns idx;


Answers (1)

Fangjun Jiang on 24 Oct 2019
Split the large file into smaller files and use a tall array.
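As a rough illustration of the tall-array approach, here is a minimal sketch. It assumes the first line of the file names the seven columns (Istasyon_No, Istasyon_Adi, YIL, AY, GUN, SAAT, SICAKLIK_C) and that the fields are separated by '|'. The sample file, its rows, and the variable names sampleFile, meanTemp, total and count are made up for the example; point the datastore at 'sicaklik.txt' instead. The chunk size of 20000 rows is an arbitrary choice.

```matlab
% Write a tiny stand-in file so the sketch runs on its own.
% These rows are hypothetical; use the real 'sicaklik.txt' in practice.
sampleFile = fullfile(tempdir, 'sicaklik_sample.txt');
fid = fopen(sampleFile, 'w');
fprintf(fid, 'Istasyon_No|Istasyon_Adi|YIL|AY|GUN|SAAT|SICAKLIK_C\n');
fprintf(fid, '17130|ANKARA|2018|1|1|0|-2.5\n');
fprintf(fid, '17130|ANKARA|2018|1|1|1|-3.0\n');
fprintf(fid, '17064|ISTANBUL|2018|1|1|0|6.1\n');
fclose(fid);

% A datastore describes the file without loading it; data is read in chunks.
ds = tabularTextDatastore(sampleFile, 'Delimiter', '|', 'TextType', 'string');
ds.ReadSize = 20000;   % rows per chunk (arbitrary, tune to available memory)

% Option 1: tall array. Operations are deferred until gather is called.
mapreducer(0);         % evaluate in the local session, no parallel pool
tt = tall(ds);
meanTemp = gather(mean(tt.SICAKLIK_C, 'omitnan'));

% Option 2: process the chunks yourself and keep only running totals.
reset(ds);
total = 0;
count = 0;
while hasdata(ds)
    chunk = read(ds);                      % ordinary table, at most ReadSize rows
    valid = ~isnan(chunk.SICAKLIK_C);
    total = total + sum(chunk.SICAKLIK_C(valid));
    count = count + nnz(valid);
end
```

Because the datastore already reads in chunks, splitting the file by hand may not be needed. If the file is split anyway, tabularTextDatastore also accepts a folder or a list of files. Check ds.VariableNames and ds.TextscanFormats before reading to confirm that the columns were detected as expected.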
