logo       

Re: Fusion-ifying proto parse trees: msg#00036

parsers.spirit.devel

Subject: Re: Fusion-ifying proto parse trees

Joel de Guzman wrote:
Joel de Guzman wrote:
Joel de Guzman wrote:
Eric Niebler wrote:

Eric Niebler wrote:

Joel de Guzman wrote:

Eric Niebler wrote:

The segmented algorithm looks complicated, but fortunately, I think all the algorithms will follow a similar pattern, so we should be able to crank them out without too much fuss. It is perfectly general and will work with any segmented data structure, not just binary trees, assuming you give it suitably defined iterators.

I have no performance numbers, but I have a good feeling about it. :-)

Thanks! I'm perusing the code now. This would be a good addition
to Fusion. I wonder how algorithms that return views would look
like, or if they are as efficient as suggested.

A view over a segmented sequence is itself a segmented sequence and would need segmented iterators. It would probably be best at this point to come up with some perf numbers to justify further work in this area.

I've done a little perf testing of the various approaches we have for traversing parse trees. I've attached the test.

Very interesting! Alas, I can't compile it :(

Ok, I got it to compile. Be wary of using namespace in the
global scope with VC7.1. I get lots of problems doing that
and I never do it again. In fact, this is most probably the
reason for the VC8 ICE, because now, I can compile on VC8.
See attached main.cpp.

I got mixed results testing with VC7.1, VC8.0 and g++3.4
on my laptop (1.5GHZ centrino) using fusion list and vector
(I fixed the fusion vector bug. It was a minor typo. please
do a cvs update). Fusion gives better results, except for
VC7.1/list/long-test, which I don't quite understand.
See attached results.txt

It's plausible that VC7.1 has an optimization bug, or it
fails to optimize the case for Fusion. I find it hard to
explain the 3x speedup. On VC8.0, that gain is lost.

The only way to verify this is to rewrite the test with
deterministic results. As it is, the numbers outputed
by the accumulator functor do not make sense. The results
of the traversals should be deterministic so that we can
determine that the code is doing the right thing. We
need reproducable results for all the tests.

Hi again,

I'm befuddled. To get a deterministic result from the accumulator,
I simply placed an un-timed for_each_s call at the top and printed
the result:

template<typename T>
double time_for_each_s(T const &t)
{
int i = 0;
boost::fusion::for_each_s(t, accumulator(i));
std::cout << i << std::endl;
...


I think the problem with doing it this way is that the compiler may see that the assignments to i which happen further down in the times calls are never used. They might be getting optimized away.

I've attached a version that tries harder to force the compiler to actually generate code. The results are interesting. I still see the bizarre 3x speed-up on VC7.1 for large ET trees, but now segmented proto is beating the pre-flattened lists for small ET trees, too!

// VC7.1, fusion list
Short test ...
80
80
80
Fusion list time : 2.84985e-009
Non-segmented proto time : 2.79397e-009
Segmented proto time : 2.79397e-009
Long test ...
480
480
480
Fusion list time : 4.1008e-008
Non-segmented proto time : 4.17233e-007
Segmented proto time : 1.86265e-008


Proto still lags badly with gcc:

// gcc, fusion list
Short test ...
80
80
80
Fusion list time : 2.7895e-08
Non-segmented proto time : 1.13606e-07
Segmented proto time : 1.19209e-07
Long test ...
480
480
480
Fusion list time : 1.74999e-07
Non-segmented proto time : 1.10245e-06
Segmented proto time : 9.53674e-07


The reason I used a list in the test is because that's what Spirit-2 was using. Would you really consider switching to vector, which would place limits on the number of nodes in a sequence?

Also, these tests leave out one detail ... the cost of building the expression template. Is building an unflattened proto parse tree more or less expensive than flattening into a list/vector eagerly? We should really be taking that into account. The cost of building the ET would probably dominate.

Regarding your question about taking the minimum result, I lifted this code from John Maddock's regex performance shoot-out. This is how he collects his timing. Yes, this is a black art -- I don't really understand it.


--
Eric Niebler
Boost Consulting
www.boost-consulting.com
/*=============================================================================
Copyright (c) 2006 Eric Niebler

Use, modification and distribution is subject to the Boost Software
License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt)
==============================================================================*/

#define BOOST_PROTO_FUSION_V2
#define FUSION_MAX_LIST_SIZE 25
#define FUSION_MAX_VECTOR_SIZE 25

#ifdef _MSC_VER
// inline aggressively
# pragma inline_recursion(on) // turn on inline recursion
# pragma inline_depth(255) // max inline depth
#endif

#include <iostream>
#include <boost/timer.hpp>
#include <boost/xpressive/proto/fusion.hpp>
#include <boost/fusion/sequence/container/list.hpp>
#include "./for_each_s.hpp"
#include "./segmented_proto.hpp"

#define THE_SEQUENCE list
//~ #define THE_SEQUENCE vector

namespace test
{
using boost::proto::lit;
using boost::proto::unary_op;
using boost::proto::noop_tag;

struct accumulator
{
accumulator(int &i_)
: i(i_)
{}

template<typename T>
void operator()(T const &t) const
{
this->i += arg(t);
}

int &i;
};

int const REPEAT_COUNT = 10;

template<typename T>
double time_for_each_s(T const &t, int &j)
{
boost::timer tim;
int i = 0;
long long iter = 65536;
long long counter, repeats;
double result = 0;
double run;
do
{
tim.restart();
for(counter = 0; counter < iter; ++counter)
{
i = 0;
boost::fusion::for_each_s(t, accumulator(i));
static_cast<void>(i);
}
result = tim.elapsed();
iter *= 2;
} while(result < 0.5);
iter /= 2;

// repeat test and report least value for consistency:
for(repeats = 0; repeats < REPEAT_COUNT; ++repeats)
{
tim.restart();
for(counter = 0; counter < iter; ++counter)
{
i = 0;
boost::fusion::for_each_s(t, accumulator(i));
j += i;
}
run = tim.elapsed();
result = (std::min)(run, result);
}
std::cout << i << std::endl;
return result / iter;
}

#define EXPR1 (lit(3) >> 42) >> (lit(6) >> 29)
#define EXPR2 EXPR1 >> EXPR1 >> EXPR1 >> EXPR1 >> EXPR1 >> EXPR1

int test_short()
{
int j = 0;
typedef unary_op<int, noop_tag> T;
boost::fusion::THE_SEQUENCE<T,T,T,T> const l(
lit(3),lit(42),lit(6),lit(29));

double list_time =
time_for_each_s( l, j );

double non_segmented_time =
time_for_each_s( EXPR1, j );

double segmented_time =
time_for_each_s( boost::proto::make_segmented_view( EXPR1 ), j );

std::cout << "Fusion list time : " << list_time << std::endl;
std::cout << "Non-segmented proto time : " << non_segmented_time <<
std::endl;
std::cout << "Segmented proto time : " << segmented_time <<
std::endl;

return j;
}

int test_long()
{
int j = 0;
typedef unary_op<int, noop_tag> T;
boost::fusion::list<T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T,T>
const l(
lit(3),lit(42),lit(6),lit(29)
,lit(3),lit(42),lit(6),lit(29)
,lit(3),lit(42),lit(6),lit(29)
,lit(3),lit(42),lit(6),lit(29)
,lit(3),lit(42),lit(6),lit(29)
,lit(3),lit(42),lit(6),lit(29));

double list_time =
time_for_each_s( l, j );

double non_segmented_time =
time_for_each_s( EXPR2, j );

double segmented_time =
time_for_each_s( boost::proto::make_segmented_view( EXPR2 ), j );

std::cout << "Fusion list time : " << list_time << std::endl;
std::cout << "Non-segmented proto time : " << non_segmented_time <<
std::endl;
std::cout << "Segmented proto time : " << segmented_time <<
std::endl;

return j;
}
}

int main()
{
std::cout << "Short test ... \n";
int i = test::test_short();

std::cout << "Long test ... \n";
int j = test::test_long();

return i + j;
}
<Prev in Thread] Current Thread [Next in Thread>
Google Custom Search

News | FAQ | advertise