I got several interesting and useful replies from various people. Some of them have experienced the same problem before. Besides improving the algorithm, they gave various solutions, such as: 1) switch to a compiler that can optimize your code better, like icc; 2) optimize cache/memory usage; 3) MMX; 4) call the Intel performance library...
I have gathered them below:
- Intel P4 Manual
- No offense, but I think these made-up test cases are a waste of time.
- He at least sped up his own code by 15% by trying different things.
- Tiling/padding the for loop to improve the cache performance.
- Yes, support LZ.
- These test cases are not designed correctly. For example, looping many times over a simple assignment is not a valid test: if you compile in release mode with optimization on, the compiler will optimize it away. In VC, just compiling in release mode will usually double the speed, so your 15% is really nothing. At the least, you need to redo all the tests in release mode, which might already optimize the performance, so your modifications might not help at all. Double-precision calculation is slow, but float is quite fast. Look at the benchmarks of the functions in the Intel Performance Primitives library: many floating-point operations are faster than their integer counterparts. Memory bandwidth is another bottleneck, so improving the L1/L2 cache hit rate is important. I once optimized an algorithm and made it more than 10 times faster (of course, that included an algorithm change, MMX, and calling the Intel performance library).
- Well, it's good to do some experiments, but I don't think his experiment is really valuable...
- I don't think he has much of a clue either. I think we should still encourage the spirit of trying.
- Most of your 10-fold increase is likely due to the algorithm change; other code optimizations are not likely to improve the efficiency so dramatically.
- For large-dataset applications, optimizing for cache/memory can even give a 100x speedup or more.
- Of course. I just want to point out that the most important thing is to optimize the algorithm and the data/work flow; MMX intrinsics, cache tuning, and so on come after that. He should start from the code he has instead of testing trivial things (not to mention those unrealistic test cases). Find the most time-consuming parts first and then focus on them. I have found that some approximate calculations of arctan are even faster than a lookup table, but such optimizations depend on your algorithm and the accuracy you need. Getting a deep understanding of the code you are optimizing is the most important thing.
- The foremost rule of optimization is to shorten your CODING time. Only in very rare cases does code need special optimization to hit the speed requirement on modern CPUs, even for commercial programs. 90% of code takes much more coding time than run time... If I estimate that a program can produce its result in one week on one CPU, I will not optimize it at all. I had to use SSE float instructions to optimize some of my programs because they ran for more than one month on 7 CPUs. I recently gave up optimizing a program that runs for 2 weeks on 6 CPUs, although I know I could speed it up 2x or more. The coding time is much longer and needs more attention.
- Faint, I wouldn't suggest SSE optimization for this. I still suggest looking into the algorithm or optimizing it with multi-threading, as you might have done. If you need to debug and tune parameters, I am not sure how you can avoid running it several times; by then, several months have passed. Are you saying you spent more than a year writing the code? Otherwise, I think it belongs to the 10% of code you categorized as having more running time than coding time, and it should be worth optimizing. Anyway, I prefer to write code that will be used more frequently at acceptable speed. But it is hard to compare across different fields.
- Once you turn optimization on, compilers can usually make better decisions about how to make array access fast. Optimization by hand is not recommended. But then again, not every compiler can optimize well enough. GCC, for example, is not strong at that.
- Nod. Different compilers perform differently. For some C++ code I recently wrote, the executable optimized by g++ runs faster than the one optimized by icpc.
- That means you didn't turn on the right flags for icpc. icc is very hard for gcc/g++ to beat.
- Probably not. I only turned on -O2 for both g++ and icpc, though. What options are usually recommended for icpc?
- What I usually use is -ipo -fast -fp-model -unroll0
- I recall that -ipo requires some extra work on the programming side, is that correct?
- that's true for some code.
- If there were any possible algorithmic optimization, why would I resort to SSE optimization? I use 7 CPUs of a cluster; if I were not multi-threading, I would really be mad, and maybe the most stupid person in the world. It can burn hundreds of CPUs at the same time if applicable. SSE speeds it up around 4-5x. It is certainly worth it to save 3-4 months of run time on a 7-CPU cluster. And the optimization is only performed on fewer than 500 lines of C++ code, within a project of more than 10k lines. And no one will run the debug version on the full dataset. There will always be a small dataset for debugging. And the program should write intermediate data to disk and restart from the last saved state if you really need to run it for months. You write the code that you need, not the code that you prefer. You have NO choice. For more than 90% of code, execution speed is not an issue at all, even if it will run millions of times.
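
On the tiling suggestion in the replies above: here is a minimal sketch of what loop tiling can look like, using a square matrix transpose as a stand-in workload. The function name transpose_tiled and the default block size of 32 are my own assumptions for illustration, not something anyone in the thread posted; the right block size depends on your cache and should be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Transpose an n x n row-major matrix one tile at a time, so that the
// rows read from `src` and the rows written to `dst` both stay in cache
// while a tile is processed.
// `block` is a tuning knob (the default of 32 is only a guess); choose it
// so that two block x block tiles of floats fit comfortably in L1.
void transpose_tiled(const std::vector<float>& src, std::vector<float>& dst,
                     std::size_t n, std::size_t block = 32) {
    assert(block > 0 && src.size() == n * n && dst.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += block) {
        for (std::size_t jj = 0; jj < n; jj += block) {
            // Clamp the tile at the matrix edge when n is not a multiple of block.
            const std::size_t i_end = std::min(ii + block, n);
            const std::size_t j_end = std::min(jj + block, n);
            for (std::size_t i = ii; i < i_end; ++i)
                for (std::size_t j = jj; j < j_end; ++j)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Whether this beats the plain two-loop version depends on the matrix size and the machine, so time both in release mode on your real data before keeping it.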
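
On the remark above that an approximate arctan can be faster than a lookup table: the poster did not say which approximation was used, so the sketch below swaps in a well-known published one, atan(x) ≈ (π/4)x + 0.273x(1 - |x|), which is valid only for |x| <= 1 and has a reported maximum error of roughly 0.0038 rad. The function name atan_approx is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Cheap approximation of atan(x) for |x| <= 1.
// NOTE: this is an illustrative formula, not the one the poster used.
// It costs a few multiplies and adds, with no table lookup and no branch.
inline float atan_approx(float x) {
    assert(x >= -1.0f && x <= 1.0f);
    const float quarter_pi = 0.78539816f;
    return quarter_pi * x + 0.273f * x * (1.0f - std::fabs(x));
}
```

Whether an error of a few thousandths of a radian is acceptable depends entirely on the algorithm, which is the point made in that reply: check the accuracy you need before trading it for speed.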
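
On the last reply's advice to write intermediate data to disk and restart from the last saved state: a minimal sketch of that pattern follows. Everything named here (RunState, save_state, load_state, run, discard_checkpoint, the placeholder arithmetic) is an assumption made up for illustration; a real program would save its own intermediate data instead of a single accumulator.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical state of a long-running job.
struct RunState {
    std::uint64_t next_step = 0;  // first step that has not been done yet
    double accumulator = 0.0;     // stand-in for the real intermediate data
};

// Write the state as raw bytes (same machine and build assumed on reload).
bool save_state(const std::string& path, const RunState& s) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(&s), sizeof s);
    return static_cast<bool>(out);
}

// Load a saved state; leaves `s` untouched if no valid checkpoint exists.
bool load_state(const std::string& path, RunState& s) {
    std::ifstream in(path, std::ios::binary);
    RunState tmp;
    if (!in.read(reinterpret_cast<char*>(&tmp), sizeof tmp)) return false;
    s = tmp;
    return true;
}

// Remove the checkpoint file, e.g. before starting a fresh run.
void discard_checkpoint(const std::string& path) {
    std::remove(path.c_str());
}

// Run steps up to total_steps, resuming from the checkpoint if there is
// one and saving a new checkpoint every `interval` steps.
RunState run(const std::string& path, std::uint64_t total_steps,
             std::uint64_t interval) {
    assert(interval > 0);
    RunState s;
    load_state(path, s);  // resumes if a checkpoint exists, else starts at 0
    while (s.next_step < total_steps) {
        s.accumulator += static_cast<double>(s.next_step) * 0.5;  // placeholder work
        ++s.next_step;
        if (s.next_step % interval == 0) save_state(path, s);
    }
    return s;
}
```

If the job dies between checkpoints, calling run again with the same path repeats only the steps after the last save. For a job that really runs for months, it would be safer to write to a temporary file and rename it over the old checkpoint, so a crash during the write cannot corrupt the only saved state.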