How can I access data from HDFS in Parquet format?
André Horn on 17 Sep 2018
Commented: Hatem Helal on 10 Apr 2019
We have a large dataset stored in Parquet files on a Hadoop file system and would like to use a MATLAB datastore to analyse it. Unfortunately, I couldn't find any reports that anybody has done this yet.
Does MathWorks provide a native way to access Parquet data? Could one use fileDatastore or a custom MATLAB datastore instead? Is there a template for that?
0 Comments
Accepted Answer
Hitesh Kumar Dasika on 20 Dec 2018
MathWorks has added support for Parquet files. It is available at the following link.
5 Comments
Knut Voigtlaender on 16 Jan 2019
Indeed, I am able to access Parquet files hosted on a remote Hadoop Linux cluster from MATLAB on a local Windows PC.
It worked for me with the following steps:
1. I set up a local Hadoop installation on Windows, following
https://github.com/MuhammadBilalYar/Hadoop-On-Window/wiki/Step-by-step-Hadoop-2.8.0-installation-on-Window-10
2. log4j.properties must be copied from \hadoop-X.X.X\etc\hadoop\ to \matlab-parquet-master\Software\MATLAB\lib\jar\
3. The HADOOP_HOME environment variable should then point to the local Hadoop home directory instead of to winutils.exe
4. The check for Unix-style filenames was removed in matlab-parquet-master\Software\MATLAB\+bigdata\+parquet\Reader.m
5. The OS check must be removed from \matlab-parquet-master\Software\MATLAB\functions\parquetDatastore.m
Alternatively, as proposed above, it is possible to use \matlab-parquet-master\Software\MATLAB\+bigdata\+parquet\ParquetDatastore.m directly.
After these steps I could initialize a datastore from a remote Hadoop URL such as
 ds=bigdata.parquet.ParquetDatastore('hdfs://server:port/dir','IncludeSubfolders',true)
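For anyone following along, steps 3 and onward might look roughly like the sketch below. The two folder paths are hypothetical placeholders, and server, port, and dir are the same placeholders as in the call above. Reading with hasdata/read is an assumption that the class follows the standard MATLAB datastore interface; I have not confirmed this against the package itself.

```matlab
% Hypothetical local Hadoop location; replace with your own install path.
hadoopHome = 'C:\hadoop-2.8.0';
% Step 3: point HADOOP_HOME at the Hadoop home folder, not at winutils.exe.
setenv('HADOOP_HOME', hadoopHome);

% Hypothetical checkout location of the matlab-parquet package.
pkgRoot = 'C:\matlab-parquet-master\Software\MATLAB';
if isfolder(pkgRoot)
    addpath(genpath(pkgRoot));
end

% Only try the datastore when the package class is actually on the path.
if exist('bigdata.parquet.ParquetDatastore', 'class') == 8
    % Constructor call as given in the comment above (placeholder URL).
    ds = bigdata.parquet.ParquetDatastore('hdfs://server:port/dir', ...
        'IncludeSubfolders', true);

    % Assumption: standard datastore methods hasdata/read are available.
    while hasdata(ds)
        chunk = read(ds);
        disp(size(chunk));   % show the size of each chunk read
    end
end
```

The guard around the constructor only keeps the script from erroring on a machine where the package is not installed; with the package present you still need a reachable cluster behind the URL.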
Hatem Helal on 10 Apr 2019
R2019a adds support for working with Parquet files. See this answer, and let us know if you have any further feedback.
More Answers (2)
Hatem Helal on 10 Apr 2019
MATLAB R2019a adds support for reading and writing Apache Parquet files (doc). Here are the relevant release notes:
1. Import and export column-oriented data from Parquet files in MATLAB. Parquet is a columnar storage format that supports efficient compression and encoding schemes. To work with the Parquet file format, use these functions.
- parquetread — Read columnar data from a Parquet file.
- parquetwrite — Write columnar data to a Parquet file.
- parquetinfo — Get information about a Parquet file.
2. The write function now supports writing tall arrays to Parquet files. To write a tall array, set the FileType parameter to 'parquet', for example:
write('C:\myData',tX,'FileType','parquet')
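As a quick illustration of the functions listed above, here is a minimal round trip, assuming R2019a or later. The table contents and the temporary file and folder names are made up for the example.

```matlab
% Small example table (made-up values, only for illustration).
T = table([1; 2; 3], ["a"; "b"; "c"], 'VariableNames', {'id', 'label'});

% Write the table to a temporary Parquet file and read it back.
file = [tempname '.parquet'];
parquetwrite(file, T);
T2 = parquetread(file);

% Inspect the file metadata, for example the column names.
info = parquetinfo(file);
disp(info.VariableNames);

% Write a tall array to a (not yet existing) folder of Parquet files.
tX = tall(T);
outDir = tempname;
write(outDir, tX, 'FileType', 'parquet');
```

For data that does not fit in memory, such as the HDFS dataset in the original question, a datastore over the folder is probably the better fit than parquetread on single files. I believe parquetDatastore was introduced in the same release, but check the documentation for your version.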
0 Comments
Hitesh Kumar Dasika on 24 Sep 2018
Currently, there is no support for Apache Arrow or Parquet files in MATLAB.
3 Comments
Hitesh Kumar Dasika on 24 Sep 2018
				Thank you for your feedback. We have raised this concern with our developers and they are actively looking at including this feature in our future releases. Unfortunately, there is no workaround in this case for now. Sorry for the trouble.
Hatem Helal on 10 Apr 2019
R2019a adds support for working with Parquet files. See this answer, and let us know if you have any further feedback.