1
CarbonData Data Load Design
2
Background
The current data loading interface is rigid and not extensible. It uses the Kettle engine to load data, and the main drawback of this approach is that it cannot implement an output format, so it cannot be easily ported to multiple compute engines. The Kettle framework should therefore be removed from CarbonData, and a new data loading design should be put in place.
3
New Data Load Design
The flow diagram describes the new data load design, which is built on InputFormat.
4
Major Interfaces

import java.util.Iterator;

/**
 * Base class for data loading steps. A step can do transformation jobs
 * as per its implementation.
 */
public abstract class AbstractDataLoadProcessorStep {

  public AbstractDataLoadProcessorStep(CarbonDataLoadConfiguration configuration,
      AbstractDataLoadProcessorStep child) {
  }

  /**
   * The output meta for this step. The data returned from this step is as per this meta.
   */
  public abstract DataField[] getOutput();

  /**
   * Initialization process for this step.
   *
   * @param configuration
   * @param child
   * @throws CarbonDataLoadingException
   */
  public abstract void initialize(CarbonDataLoadConfiguration configuration,
      AbstractDataLoadProcessorStep child) throws CarbonDataLoadingException;

  /**
   * Transforms the data as per the implementation.
   *
   * @return Iterator of data
   */
  public abstract Iterator<Object[]> execute() throws CarbonDataLoadingException;

  /**
   * Any closing of resources after step execution can be done here.
   */
  public abstract void close();
}
5
Working
InputProcessorStep: It does two jobs. 1. It reads data from the RecordReader of the InputFormat. 2. It parses each field of a column as per its data type.
EncoderProcessorStep: It encodes each field with a dictionary if required, and combines all no-dictionary columns into a single byte array.
SortProcessorStep: It sorts the data on the dimension columns and writes it to intermediate files.
DataWriterProcessorStep: It merge-sorts the data from the intermediate temp files, generates the MDK (multi-dimensional key), and writes the data to the store in carbondata format.
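The chaining of the four steps above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions, not the CarbonData implementation: every class name below (SketchStep, SketchInputStep and so on) is hypothetical, the sort happens in memory instead of through intermediate files, no MDK is generated, and the "store" is a plain list of strings. What it shows is the pull model, in which each step wraps a child step and consumes the child's iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for AbstractDataLoadProcessorStep.
abstract class SketchStep {
  protected final SketchStep child;
  SketchStep(SketchStep child) { this.child = child; }
  abstract Iterator<Object[]> execute();
}

// Input: parses "name,value" lines into typed fields (String, Integer).
class SketchInputStep extends SketchStep {
  private final List<String> lines;
  SketchInputStep(List<String> lines) { super(null); this.lines = lines; }
  @Override Iterator<Object[]> execute() {
    List<Object[]> rows = new ArrayList<>();
    for (String line : lines) {
      String[] parts = line.split(",");
      rows.add(new Object[] { parts[0], Integer.parseInt(parts[1].trim()) });
    }
    return rows.iterator();
  }
}

// Encode: replaces column 0 with a dictionary key assigned in arrival order.
class SketchEncoderStep extends SketchStep {
  final Map<String, Integer> dictionary = new HashMap<>();
  SketchEncoderStep(SketchStep child) { super(child); }
  @Override Iterator<Object[]> execute() {
    final Iterator<Object[]> in = child.execute();
    return new Iterator<Object[]>() {
      @Override public boolean hasNext() { return in.hasNext(); }
      @Override public Object[] next() {
        Object[] row = in.next();
        String value = (String) row[0];
        Integer key = dictionary.get(value);
        if (key == null) {
          key = dictionary.size() + 1; // keys start at 1 (an assumption)
          dictionary.put(value, key);
        }
        return new Object[] { key, row[1] };
      }
    };
  }
}

// Sort: orders rows by the encoded dimension column (in memory in this sketch).
class SketchSortStep extends SketchStep {
  SketchSortStep(SketchStep child) { super(child); }
  @Override Iterator<Object[]> execute() {
    List<Object[]> rows = new ArrayList<>();
    child.execute().forEachRemaining(rows::add);
    rows.sort((a, b) -> Integer.compare((Integer) a[0], (Integer) b[0]));
    return rows.iterator();
  }
}

// Write: drains the child and appends each row to an in-memory "store".
class SketchWriterStep extends SketchStep {
  final List<String> store = new ArrayList<>();
  SketchWriterStep(SketchStep child) { super(child); }
  @Override Iterator<Object[]> execute() {
    child.execute().forEachRemaining(row -> store.add(Arrays.toString(row)));
    return Collections.emptyIterator();
  }
}

class PipelineSketch {
  // Builds input -> encode -> sort -> write and runs it once.
  static List<String> load(List<String> lines) {
    SketchWriterStep writer = new SketchWriterStep(
        new SketchSortStep(new SketchEncoderStep(new SketchInputStep(lines))));
    writer.execute();
    return writer.store;
  }

  public static void main(String[] args) {
    System.out.println(load(Arrays.asList("b,2", "a,1", "b,3")));
  }
}
```

Because only the outermost step is executed, a compute engine needs to drive a single iterator, which is one plausible reading of why the design favors this step-wrapping shape.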
6
Integration with CarbonOutputFormat
CarbonOutputFormat will use RecordBufferedWriterIterator, which collects the data in batches; the batches are then iterated over by InputProcessorStep.
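The batching idea can be illustrated with a small iterator wrapper. The slides do not show the API of RecordBufferedWriterIterator, so the class below is a hypothetical stand-in (BatchingIteratorSketch) that only demonstrates grouping rows into fixed-size batches for a downstream consumer; the batch size parameter and the row type are assumptions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: groups rows from a source iterator into fixed-size batches.
class BatchingIteratorSketch implements Iterator<List<Object[]>> {
  private final Iterator<Object[]> source;
  private final int batchSize;

  BatchingIteratorSketch(Iterator<Object[]> source, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    this.source = source;
    this.batchSize = batchSize;
  }

  @Override public boolean hasNext() {
    return source.hasNext();
  }

  // Returns up to batchSize rows; the final batch may be smaller.
  @Override public List<Object[]> next() {
    if (!source.hasNext()) {
      throw new NoSuchElementException();
    }
    List<Object[]> batch = new ArrayList<>(batchSize);
    while (batch.size() < batchSize && source.hasNext()) {
      batch.add(source.next());
    }
    return batch;
  }
}

class BatchingDemo {
  public static void main(String[] args) {
    List<Object[]> rows = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      rows.add(new Object[] { i });
    }
    Iterator<List<Object[]>> batches = new BatchingIteratorSketch(rows.iterator(), 2);
    while (batches.hasNext()) {
      System.out.println("batch of " + batches.next().size());
    }
  }
}
```

A real bridge between a push-style writer and a pull-style step would likely also need a thread-safe hand-off between the two sides, which this sketch leaves out.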
7
Dictionary Generation
/**
 * Generates dictionary for the column. The implementation classes can be pre-defined,
 * local, or global dictionary generations.
 */
public interface ColumnDictionaryGenerator {

  /**
   * Generates the dictionary value for the column data.
   *
   * @param data
   * @return dictionary value
   */
  int generateDictionaryValue(Object data);

  /**
   * Returns the actual value associated with the dictionary value.
   *
   * @param dictionary
   * @return actual value
   */
  Object getValueFromDictionary(int dictionary);

  /**
   * Returns the maximum value among the dictionary values. It is used for generating the mdk key.
   *
   * @return max dictionary value
   */
  int getMaxDictionaryValue();
}
8
Dictionary Generation Implementation
The ColumnDictionaryGenerator interface can have three implementations:
PreGeneratedColumnDictionaryGenerator: It gets the dictionary values from an already generated and loaded dictionary.
GlobalColumnDictionaryGenerator: It generates a global dictionary online by using a KV store or a distributed map.
LocalColumnDictionaryGenerator: It generates a local dictionary only for that executor.
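As an illustration of the local variant, here is a minimal in-memory sketch. The interface is repeated from the slide so the example compiles on its own. The class name LocalDictionarySketch, the choice to start keys at 1, and returning null for an unknown key are all assumptions made for this example and are not taken from CarbonData.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Interface as given on the slide, repeated so the sketch is self-contained.
interface ColumnDictionaryGenerator {
  int generateDictionaryValue(Object data);
  Object getValueFromDictionary(int dictionary);
  int getMaxDictionaryValue();
}

// Hypothetical local generator: the dictionary lives only in this process.
class LocalDictionarySketch implements ColumnDictionaryGenerator {
  private final Map<Object, Integer> valueToKey = new HashMap<>();
  private final List<Object> keyToValue = new ArrayList<>();

  // Returns the existing key for a known value, otherwise assigns the next key.
  @Override public int generateDictionaryValue(Object data) {
    Integer existing = valueToKey.get(data);
    if (existing != null) {
      return existing;
    }
    keyToValue.add(data);
    int key = keyToValue.size(); // keys start at 1 (an assumption)
    valueToKey.put(data, key);
    return key;
  }

  // Reverse lookup; returns null for a key that was never assigned.
  @Override public Object getValueFromDictionary(int dictionary) {
    if (dictionary < 1 || dictionary > keyToValue.size()) {
      return null;
    }
    return keyToValue.get(dictionary - 1);
  }

  // Keys are dense and start at 1, so the maximum key equals the entry count.
  @Override public int getMaxDictionaryValue() {
    return keyToValue.size();
  }
}

class LocalDictionaryDemo {
  public static void main(String[] args) {
    ColumnDictionaryGenerator generator = new LocalDictionarySketch();
    System.out.println(generator.generateDictionaryValue("india"));
    System.out.println(generator.generateDictionaryValue("china"));
    System.out.println(generator.generateDictionaryValue("india"));
    System.out.println(generator.getValueFromDictionary(2));
    System.out.println(generator.getMaxDictionaryValue());
  }
}
```

A global generator would presumably keep the same interface but back the two maps with the shared KV store or distributed map, so that every executor sees the same key for the same value.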