regex pattern to extract repeating groups


On 2018-08-28 00:58, Malcolm wrote:
> On 28/08/2018 7:09 AM, John Pote wrote:
>> On 26/08/2018 00:55, Malcolm wrote:
>>> I am trying to understand why regex is not extracting all of the 
>>> characters between two delimiters.
>>>
>>> The complete string is the xmp IFD data extracted from a .CR2 image 
>>> file.
>>>
>>> I do have a workaround, but it's messy and possibly not future-proof.
>> Do you mean future-proofing your workaround, or that Canon's .CR2 raw image
>> files might change? I guess .CR2s won't change, but Canon have
>> brought out the new .CR3 raw image file, for which I needed to upgrade
>> my photo editing suite (at least I didn't, but used their tool to
>> convert .CR3s from the camera to the digital negative format, which
>> many photo editors can handle). Can send you a sample .CR3 if you want
>> to compare.
>>
>> Regards,
>> John
> John
> 
> Thank you.
> 
> Some background
> The application is for personal use. While I'm familiar with Python
> generally (and thanks to all who post code and answer questions), this
> is the first time I have used structs to read a binary file, XML parsers
> to parse some of the RFD contents, and re.
> 
> First
> I have now discovered that when I print the return value of re.search, the
> match='...' part truncates the matched characters. To see/get all found
> characters I need to use the span as indexes into the original string. I'm
> not sure if this is mentioned in the re documentation, but all the
> samples I've seen on the web use only small strings. This was the cause
> of my question.
> 
re.search returns a "match object". When you print it, you get what is 
basically a summary. If you want the matched portion of the string, use 
the match object's .group method:

[snip]

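import re

# data is the XMP text read from the .CR2 file (set up in the snipped code)
re_pattern = r'( *<dc:.*</dc:)'
x = re.search(re_pattern, data, re.DOTALL)
print(x.group())  # the full matched text, not the shortened summary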
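
To make the difference concrete, here is a minimal sketch comparing the printed summary, .group(), and the span workaround. The contents of data below are a made-up stand-in, not the real XMP block from a .CR2 file; only the pattern and the re calls come from the thread.

```python
import re

# Hypothetical stand-in for the XMP text; the real string comes from the .CR2 file.
data = (
    "<rdf:Description>\n"
    "  <dc:creator>Example Photographer</dc:creator>\n"
    "  <dc:rights>Example rights statement, long enough to be cut off</dc:rights>\n"
    "</rdf:Description>\n"
)

re_pattern = r'( *<dc:.*</dc:)'
x = re.search(re_pattern, data, re.DOTALL)

summary = repr(x)            # what print(x) shows: the span plus a shortened match
full = x.group()             # the entire matched text
start, end = x.span()
via_span = data[start:end]   # the span workaround yields the same text

print(summary)
print(len(full), full == via_span)
```

Because .* is greedy and re.DOTALL lets it cross newlines, the match here runs from the first "<dc:" to the last "</dc:" in the string, which is why the matched text is long enough for the summary to cut it short.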