Lazy tree splitting

LARS BERGSTROM; MATTHEW FLUET; MIKE RAINEY; JOHN REPPY; ADAM SHAW

doi:10.1017/S0956796812000172

Part of: JFP Research Articles

Published online by Cambridge University Press: 15 August 2012

LARS BERGSTROM: Affiliation:
Department of Computer Science, University of Chicago, Chicago, IL 60637, USA (e-mail: larsberg@cs.uchicago.edu)
MATTHEW FLUET: Affiliation:
Department of Computer Science, Rochester Institute of Technology, Rochester, NY 14623-5603, USA (e-mail: mtf@cs.rit.edu)
MIKE RAINEY: Affiliation:
Max Planck Institute for Software Systems, D-67663 Kaiserslautern, Rheinland-Pfalz, Germany (e-mail: mrainey@mpi-sws.org)
JOHN REPPY: Affiliation:
Department of Computer Science, University of Chicago, Chicago, IL 60637, USA (e-mail: jhr@cs.uchicago.edu)
ADAM SHAW: Affiliation:
Department of Computer Science, University of Chicago, Chicago, IL 60637, USA (e-mail: ams@cs.uchicago.edu)

Abstract

Nested data-parallelism (NDP) is a language mechanism that supports programming irregular parallel applications in a declarative style. In this paper, we describe the implementation of NDP in Parallel ML (PML), which is a part of the Manticore system. One of the main challenges of implementing NDP is managing the parallel decomposition of work. If we have too many small chunks of work, the overhead will be too high, but if we do not have enough chunks of work, processors will be idle. Recently, the technique of Lazy Binary Splitting was proposed to address this problem for nested parallel loops over flat arrays. We have adapted this technique to our implementation of NDP, which uses binary trees to represent parallel arrays. This new technique, which we call Lazy Tree Splitting (LTS), has the key advantage of performance robustness, i.e., it does not require tuning to get the best performance for each program. We describe the implementation of the standard NDP operations using LTS and present experimental data that demonstrate the scalability of LTS across a range of benchmarks.
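The idea of deferring splits until other processors need work can be illustrated with a small, single-threaded sketch. This is not the paper's implementation: the `Leaf`/`Cat` rope type, the `others_idle` callback, and the `splits` log are hypothetical names introduced here, and the decision to split is taken at each internal node for simplicity, whereas a real runtime would hand the split-off subtree to a work-stealing scheduler.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Leaf:
    items: List[int]          # a small contiguous chunk of the array


@dataclass
class Cat:
    left: "Rope"              # concatenation of two sub-ropes
    right: "Rope"


Rope = Union[Leaf, Cat]


def from_list(xs: List[int], leaf_size: int = 2) -> Rope:
    """Build a balanced binary tree whose leaves hold at most leaf_size items."""
    if len(xs) <= leaf_size:
        return Leaf(list(xs))
    mid = len(xs) // 2
    return Cat(from_list(xs[:mid], leaf_size), from_list(xs[mid:], leaf_size))


def to_list(rope: Rope) -> List[int]:
    """Flatten the tree back into a list, left to right."""
    if isinstance(rope, Leaf):
        return list(rope.items)
    return to_list(rope.left) + to_list(rope.right)


def map_lazy(
    f: Callable[[int], int],
    rope: Rope,
    others_idle: Callable[[], bool],   # assumption: stands in for a scheduler query
    splits: List[Rope],                # records where a split would have happened
) -> Rope:
    """Map f over the rope, splitting work only when other workers are idle."""
    if isinstance(rope, Leaf):
        return Leaf([f(x) for x in rope.items])
    if others_idle():
        # A real runtime would make rope.right available to another worker here.
        # This sketch only records the decision and continues sequentially.
        splits.append(rope)
    left = map_lazy(f, rope.left, others_idle, splits)
    right = map_lazy(f, rope.right, others_idle, splits)
    return Cat(left, right)
```

When `others_idle` always returns `False`, the whole tree is processed as one sequential chunk with no splitting overhead. When it always returns `True`, a split is recorded at every internal node. The point of the lazy approach is that the program adapts between these extremes at run time instead of relying on a hand-tuned chunk size.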

Type: Articles
Information: Journal of Functional Programming, Volume 22, Special Issue 4-5: ICFP 2010, September 2012, pp. 382-438

DOI: https://doi.org/10.1017/S0956796812000172
Copyright: Copyright © Cambridge University Press 2012
