
Lazy tree splitting

Published online by Cambridge University Press:  15 August 2012

LARS BERGSTROM
Affiliation:
Department of Computer Science, University of Chicago, Chicago, IL 60637, USA (e-mail: [email protected])
MATTHEW FLUET
Affiliation:
Department of Computer Science, Rochester Institute of Technology, Rochester, NY 14623-5603, USA (e-mail: [email protected])
MIKE RAINEY
Affiliation:
Max Planck Institute for Software Systems, D-67663 Kaiserslautern, Rheinland-Pfalz, Germany (e-mail: [email protected])
JOHN REPPY
Affiliation:
Department of Computer Science, University of Chicago, Chicago, IL 60637, USA (e-mail: [email protected])
ADAM SHAW
Affiliation:
Department of Computer Science, University of Chicago, Chicago, IL 60637, USA (e-mail: [email protected])

Abstract

Nested data-parallelism (NDP) is a language mechanism that supports programming irregular parallel applications in a declarative style. In this paper, we describe the implementation of NDP in Parallel ML (PML), which is a part of the Manticore system. One of the main challenges of implementing NDP is managing the parallel decomposition of work. If we have too many small chunks of work, the overhead will be too high, but if we do not have enough chunks of work, processors will be idle. Recently, the technique of Lazy Binary Splitting was proposed to address this problem for nested parallel loops over flat arrays. We have adapted this technique to our implementation of NDP, which uses binary trees to represent parallel arrays. This new technique, which we call Lazy Tree Splitting (LTS), has the key advantage of performance robustness, i.e., it does not require tuning to get the best performance for each program. We describe the implementation of the standard NDP operations using LTS and present experimental data that demonstrate the scalability of LTS across a range of benchmarks.
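
To make the data structure concrete: the binary trees mentioned above are ropes, balanced trees whose leaves hold small contiguous chunks of elements. The following minimal sketch is written in Standard ML rather than PML, and the names (ropeLength, ropeMap) and representation details are illustrative assumptions, not the Manticore implementation.

(* A rope: a balanced binary tree of element chunks. The Cat
   constructor caches the total number of elements in its subtree. *)
datatype 'a rope
  = Leaf of 'a list
  | Cat of int * 'a rope * 'a rope

fun ropeLength (Leaf xs) = List.length xs
  | ropeLength (Cat (n, _, _)) = n

(* Sequential map over a rope. Under LTS, each Cat node is a
   potential split point: if the scheduler observes idle workers,
   one subtree is handed off for parallel execution; otherwise the
   traversal continues sequentially, avoiding spawn overhead. *)
fun ropeMap f (Leaf xs) = Leaf (List.map f xs)
  | ropeMap f (Cat (n, l, r)) = Cat (n, ropeMap f l, ropeMap f r)

(* Example: squaring the elements of a four-element rope. *)
val r  = Cat (4, Leaf [1, 2], Leaf [3, 4])
val r' = ropeMap (fn x => x * x) r

The performance robustness claimed above comes from making the split-or-continue decision dynamically, from observed load, rather than from a per-program chunk-size parameter that must be tuned for each benchmark.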

Type: Articles
Copyright: © Cambridge University Press 2012
