Access data in a nested structure array and generate an excel file
4 visualizaciones (últimos 30 días)
Mostrar comentarios más antiguos
Hello everyone,
I have a bunch of .json files in a folder called "JSON files". Each file is an article. I would like to put the paper_id, title, abstract, and authors' affiliation in an excel files.
files = dir('JSON files/*.json');
for i=1:numel(files)
filename = fullfile(files(i).folder, files(i).name);
data = jsondecode(fileread(filename));
end
the "data" in the above code for each i looks like
data =
struct with fields:
paper_id: '0015023cc06b5362d332b3baf348d11567ca2fbb'
metadata: [1×1 struct]
abstract: [2×1 struct]
body_text: [20×1 struct]
bib_entries: [1×1 struct]
ref_entries: [1×1 struct]
back_matter: [1×1 struct]
and
data.metadata =
struct with fields:
title: 'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3'
authors: [17×1 struct]
and
data. metadata.authors
ans =
17×1 struct array with fields:
first
middle
last
suffix
affiliation
email
for this specific article the affiliation is empty. However most of the .json files have an authors' affiliation. Moreover, each author may have different affiliation from her co-authors. I would like to keep all afiliations.
Any idea how I can solve this problem? Any input would be greatly appreciated!
0 comentarios
Respuesta aceptada
Sindar
el 6 de Mayo de 2020
Editada: Sindar
el 6 de Mayo de 2020
[Edited with fixed answer, x2]
This throws no errors and appears to produce a correct excel file from your sample data:
files = dir('JSON files/*.json');
% start off a table to put info in, with the correct size and column names
mytable=table('Size',[numel(files) 4],'VariableTypes',{'string';'string';'string';'string'},'VariableNames',{'pid';'title';'abstract';'affiliations'});
for ind=1:numel(files)
% load data
filename = fullfile(files(ind).folder, files(ind).name);
data = jsondecode(fileread(filename));
% for the ind-th row, fill in the pid column
mytable{ind,'pid'} = {data.paper_id};
% title column
mytable{ind,'title'} = {data.metadata.title};
% abstract column
% if the abstract is empty, it will get left as <missing> and print an empty cell in excel
if ~isempty(data.abstract)
this_abstract = {data.abstract.text};
% concatenate abstract lines with spaces
mytable{ind,'abstract'} = {strjoin(this_abstract,' ')};
end
% affiliations column
% check which authors have affiliation info
tmp=false([0 0]);
for ind_A = 1:length(data.metadata.authors)
tmp(ind_A) = ~isempty(fieldnames(data.metadata.authors(ind_A).affiliation));
end
% put all affiliations in a single struct array
if ~isempty(tmp) && nnz(tmp)>0
this_affiliation = [data.metadata.authors(tmp).affiliation];
% put all institutions in a single cell array
this_affiliation = {this_affiliation.institution};
% remove duplicates, sort, and make sure the first element is an empty string
this_affiliation = unique([{''} this_affiliation]);
% combine all the affiliations, separating by '; ' (ignore empty first string)
mytable{ind,'affiliations'} = {strjoin(this_affiliation(2:end),'; ')};
end
% clear temporary variable (tmp especially)
clear tmp this_abstract this_affiliation
end
% write table to myData.xls
writetable(mytable,'myData.xls')
If you have a lot of files and don't want to store all the data in a table before writing, use the append option:
mytable=table('Size',[1 4],'VariableTypes',{'string';'string';'string';'string'},'VariableNames',{'pid';'title';'abstract';'affiliations'});
writetable(mytable,'myData.xls');
for ind
...
mytable{1,'pid'} = {data.paper_id};
...
writetable(mytable,'myData.xls','WriteMode','Append','WriteVariableNames',false)
end
16 comentarios
Sindar
el 6 de Mayo de 2020
BTW, this was an interesting problem. I have a "catstructfield" function that does this sort of thing for nice structure arrays. I actually routinely use it on data loaded from JSON files. Luckily for me, the files I work with don't rarely have any missing data, so I didn't have to worry about that. I may eventually update the code to the point that it could be generally-usable and worth sharing. This question/dataset will certainly help with that!
I tried it on your data and it threw no errors! Because I have a try-catch block that just doesn't return any part of the structure that didn't work. So, I ended up with a mostly empty structure...
Más respuestas (0)
Ver también
Categorías
Más información sobre JSON Format en Help Center y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!