Re: Asking about iterating the elements in ListVector. Thanks. [Arrow JAVA API]


Thanks a lot.

On Sat, Sep 1, 2018 at 12:08 AM Jacques Nadeau <jacques@xxxxxxxxxx> wrote:

> Slight correction on code:
>
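> int recordIndexToRead = ...
> ListVector lv = ...
> // Each list offset is a 4-byte int; entries i and i + 1 bound record i.
> ArrowBuf offsetVector = lv.getOffsetBuffer();
> // getDataVector() returns a FieldVector, so cast to the child type.
> VarCharVector vc = (VarCharVector) lv.getDataVector();
> int listStart = offsetVector.getInt(recordIndexToRead * 4);
> int listEnd = offsetVector.getInt((recordIndexToRead + 1) * 4);
> // Reuse one holder for every element to avoid per-element allocation.
> NullableVarCharHolder nvh = new NullableVarCharHolder();
> for (int i = listStart; i < listEnd; i++) {
>   vc.get(i, nvh);
>   // nvh.isSet is 0 for a null element; otherwise nvh.start and nvh.end
>   // delimit the element's bytes inside nvh.buffer (no copy is made).
>   // do something with data.
> }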
>
> On Fri, Aug 31, 2018 at 9:04 AM Jacques Nadeau <jacques@xxxxxxxxxx> wrote:
>
>> Adding the Arrow dev list.
>>
>> Yes, VarCharVector.get(int index, NullableVarCharHolder holder) is a
>> cheaper method.
>>
>> You can get the offsets from the list vector and then use the holder to
>> retrieve pointers into the existing memory. That memory is off-heap, so
>> you'll have to do a copy if you want a byte array.
>>
>> Pseudo code:
>>
>> int recordIndexToRead = ...
>> ListVector lv = ...
>> ArrowBuf offsetVector = lv.getOffsetBuffer();
>> VarCharVector vc = lv.getDataVector();
>> int listStart = lv.offsetBuffer.getInt((recordIndexToRead ) * 4) ;
>> int listEnd = lv.offsetBuffer.getInt((recordIndexToRead + 1) * 4);
>> NullableVarCharHolder nvh = new NullableVarCharHolder();
>> for(int i = listStart; i < listEnd; i++){
>>   vc.get(i, nvh);
>>   // do something with data.
>> }
>>
>> On Fri, Aug 31, 2018 at 2:08 AM Xu,Wenjian <zeroxwj@xxxxxxxxx> wrote:
>>
>>> Hi Jacques,
>>>
>>> I have a question about ListVector in Arrow Java API. Thanks for your
>>> kind help.
>>>
>>> I would like to iterate through *array<string>* in SQL semantics.
>>>
>>> I understand that, in order to represent *array<string>* in Arrow
>>> format, I could use ListVector with VarCharVector as the inner vector. My
>>> question is: how can I efficiently access all the elements (i.e., each
>>> byte[] as a string)?
>>>
>>> By checking the test code:
>>>
>>> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
>>>
>>> one option is to use ListVector.getObject(int index) to get each
>>> ArrayList<Text>, and then access each element in ArrayList<Text>. But this
>>> method is expensive because:
>>>
>>> 1) it calls VarCharVector.get(int index), which involves a memory copy
>>> 2) it calls Text.set(byte[]), which assembles the Text from the byte array.
>>>
>>> My goal is just to retrieve each byte[] and do some filtering. Is there
>>> a less expensive way to achieve this? For example,
>>> VarCharVector.get(int index, NullableVarCharHolder holder) seems to be a
>>> less expensive operation. But how would I use this method in my case?
>>>
>>> Thanks again.
>>>
>>> Best regards,
>>> Wenjian
>>>
>>
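
The offset arithmetic discussed in the thread can be tried out without Arrow on the classpath. The sketch below is only an illustration of the layout, not Arrow API code: it assumes direct java.nio.ByteBuffer instances as stand-ins for the off-heap ArrowBuf memory, 4-byte little-endian offsets, and no null elements (validity bits are ignored). The class name ListOffsetsSketch and the helpers offsets and readList are hypothetical names introduced for this example. It walks the list offsets to find the element range for one record, then the value offsets to find each string's byte range, and copies the bytes on-heap only at the point a byte[] is needed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ListOffsetsSketch {
    // Assumption: offsets are 4-byte ints, matching the "* 4" in the thread.
    static final int OFFSET_WIDTH = 4;

    /** Packs int offsets into a direct (off-heap) little-endian buffer. */
    static ByteBuffer offsets(int... values) {
        ByteBuffer buf = ByteBuffer.allocateDirect(values.length * OFFSET_WIDTH)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        buf.flip();
        return buf;
    }

    /** Returns on-heap copies of the byte strings that make up one list record. */
    static List<byte[]> readList(ByteBuffer listOffsets, ByteBuffer valueOffsets,
                                 ByteBuffer data, int record) {
        // Entries record and record + 1 bound the element range of this list.
        int listStart = listOffsets.getInt(record * OFFSET_WIDTH);
        int listEnd = listOffsets.getInt((record + 1) * OFFSET_WIDTH);
        List<byte[]> out = new ArrayList<>();
        for (int i = listStart; i < listEnd; i++) {
            // Entries i and i + 1 bound the bytes of element i in the data buffer.
            int start = valueOffsets.getInt(i * OFFSET_WIDTH);
            int end = valueOffsets.getInt((i + 1) * OFFSET_WIDTH);
            // The data is off-heap, so a byte[] requires an explicit copy.
            byte[] copy = new byte[end - start];
            ByteBuffer view = data.duplicate(); // independent position, same memory
            view.position(start);
            view.get(copy);
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two records: ["a", "bc"] and ["def"].
        byte[] bytes = "abcdef".getBytes(StandardCharsets.UTF_8);
        ByteBuffer data = ByteBuffer.allocateDirect(bytes.length);
        data.put(bytes);
        data.flip();
        ByteBuffer valueOffsets = offsets(0, 1, 3, 6); // one entry per string, plus one
        ByteBuffer listOffsets = offsets(0, 2, 3);     // one entry per record, plus one

        for (byte[] element : readList(listOffsets, valueOffsets, data, 0)) {
            System.out.println(new String(element, StandardCharsets.UTF_8));
        }
    }
}
```

If the filter can run against the raw byte range, the copy into a byte[] can be skipped and the comparison done against the buffer directly, which is the saving the holder-based get is meant to offer.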