In this chapter, we focus on statistics and measures that quantify a network's structure and characterize how it is organized. These measures have been central to much of network science, and a vast array of material is available to us, spanning all scales of the network. The measures we discuss include general-purpose measures and those specialized to particular circumstances, which allow us to get a better handle on the network data. Network science has generated a dizzying array of valuable measures over the years. For example, we can measure local structures, motifs, patterns of correlations within the network, clusters and communities, hierarchy, and more. These measures are used for exploratory and confirmatory analyses, which we discussed in the previous chapter. With the measures of this chapter, we can understand the patterns in our networks, and using statistical models, we can put those patterns on a firm foundation.
Most scientists receive training in their domain of expertise but, with the possible exception of computer science, students of science receive little training in computer programming. While software engineering has brought forth sound principles for programming, training in software engineering only partially translates to scientific coding. Simply put, coding for science is not the same as coding for software. This chapter discusses best practices for writing correct, clear, and concise scientific code. We aim to ensure code is readable to others and supports data provenance rather than hindering it. We also want the code to be a lasting record of work performed, helping research reproducibility. Practices to address these concerns that we cover include clear variable names and code comments, favoring simple code, carefully documenting program dependencies and inputs, and using version control and logging. Together, these practices will enable your code to work better and more reliably for yourself and your collaborators.
Network science has exploded in popularity since the late 1990s. But it flows from a long and rich tradition of mathematical and scientific understanding of complex systems. We can no longer imagine the world without evoking networks. And network data is at the heart of it. In this chapter, we set the stage by highlighting network science's ancestry and the exciting scientific approaches that networks have enabled, followed by a tour of the basic concepts and properties of networks.
Much of the power of networks lies in their flexibility. Networks can successfully describe many different kinds of complex systems. These descriptions are useful in part because they allow us to organize data associated with the system in meaningful ways. These associated attributes and their connections to the network are often the key drivers behind new insights. For example, in a social network, these may be demographic features, such as the ages and occupations of members of a firm. In a protein interaction network, gene ontology terms may be gathered by biologists studying the human genome. We can gain insight by collecting data on those features and associating them with the network nodes or links. In this chapter, we study ways to associate data with the network elements, the nodes and links. We describe ways to gather and store these attributes, what analysis we can do using them, and the most crucial questions to ask about these attributes and their interplay with our networks.
Many tools exist to help scientists work computationally. In addition to general-purpose and domain-specific programming languages, a wide assortment of programs exist to accomplish specific tasks. We call attention to a number of tools in this chapter, focusing on good practices, both computational and scientific, when using them. Important computing tools for data scientists include computational notebooks, data pipelines and file transfer tools, UNIX-style operating systems, version control systems, and data backup systems. Of course, the world of computing moves fast, and tools are always coming and going, so we conclude with advice and a brief workflow to guide you through evaluating new tools.
Scientists must be ethical and conscientious, always. Data bring with them much promise to improve our understanding of the world around us, and improve our lives within it. But there are risks as well. Scientists must understand the potential harms of their work, and follow norms and standards of conduct to mitigate those concerns. But network data are different. As we discuss in this chapter, network data are some of the most important but also most sensitive data. Before we dive into the data, we discuss the ethics of data science in general and of network data in particular. The ethical issues that we face often do not have clear solutions but require thoughtful approaches and an understanding of complex contexts and difficult circumstances.
In this chapter, we discuss how to represent network data inside a computer, with some examples of computational tasks and the data structures that enable those computations. When working with network data using code, you have many choices of data structures, but which ones are best for your goals? Writing your own code to process network data can be valuable, yet existing libraries, which feature extensively tested and efficiently engineered functionality, are worth considering as well. Python and R, both excellent programming languages for data science, come well-equipped with third-party libraries for working with network data, and we describe some examples. We also discuss choosing and using typical file formats for storing network data, as many standard formats exist.
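As a small illustration of the data-structure choice described above, the following sketch converts an edge list into an adjacency structure using only the Python standard library. The function name `edge_list_to_adjacency` and the toy edges are hypothetical, chosen for illustration, and the network is assumed to be undirected and unweighted; this is not code from the chapter itself.

```python
from collections import defaultdict


def edge_list_to_adjacency(edges):
    """Map each node to the set of its neighbors (undirected network assumed)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        # Record the link in both directions because the network is undirected.
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


# Hypothetical toy network stored as an edge list.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
adjacency = edge_list_to_adjacency(edges)

# With an adjacency structure, a node's degree is the size of its neighbor set.
degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
print(degrees)
```

An edge list is compact and easy to read from a file, while the adjacency structure makes neighbor lookups fast; which one is preferable depends on the computation at hand.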
Network data, like all data, are imperfect measures of objects of study. There may be missing information or false information. For networks, these measurement errors can lead to missing nodes or links (network elements that exist in reality but are absent from the network data) or spurious nodes or links (nodes or links present in the data but absent in reality). More troubling is that these conditions exist in a continuum, and there is a spectrum of scenarios where nodes or links may exist but not be meaningful in some way. In this chapter, we describe how such errors can appear and affect network data and introduce some ways to handle such errors in the data processing steps. Fixes for errors can lead to different networks, before and after processing, for example, and we must be careful and circumspect in identifying and planning for such errors.
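The distinction between missing and spurious links can be made concrete with a minimal sketch. It assumes, unrealistically, that a trusted reference edge list is available to compare against; in practice such ground truth is rarely known. The function names and toy data are hypothetical, and links are treated as undirected.

```python
def normalize(edges):
    """Represent undirected links as frozensets so (u, v) and (v, u) match."""
    return {frozenset(edge) for edge in edges}


def link_errors(observed, reference):
    """Return the links missing from, and spurious in, the observed data."""
    observed_links = normalize(observed)
    reference_links = normalize(reference)
    return {
        # In the reference network but absent from the observed data.
        "missing": reference_links - observed_links,
        # In the observed data but absent from the reference network.
        "spurious": observed_links - reference_links,
    }


# Hypothetical toy data: one link was never observed, one was recorded in error.
reference = [(1, 2), (2, 3), (3, 4)]
observed = [(2, 1), (3, 4), (4, 5)]
errors = link_errors(observed, reference)
print(errors)
```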
What are the nodes? What are the links? These questions are not the start of your work—the upstream task makes sure of that—but they are an inflection point. Keep them front of mind. Your methods, the paths you take to analyze and interrogate your data, all unfold from the answers (plural!) to these questions. This chapter reflects on where we have gone, where we can go for more, and, perhaps, what the future has in store for data science, networks and network data.
Machine learning has revolutionized many fields, including science, healthcare, and business. It is also widely used in network data analysis. This chapter provides an overview of machine learning methods and how they can be applied to network data. Machine learning can be used to clean, process, and analyze network data, as well as make predictions about networks and network attributes. Methods that transform networks into meaningful representations are especially useful for specific network prediction tasks, such as classifying nodes and predicting links. The challenges of using machine learning with network data include recognizing data leakage and detecting dataset shift. As with all machine learning, effective use of machine learning on networks depends on practicing good data hygiene when evaluating a predictive model’s performance.
Some networks, many in fact, vary with time. They may grow in size, gaining nodes and links. Or they may shrink, losing links and becoming sparser over time. Sitting behind many networks are drivers that change the structure, predictably or not, leading to dynamic networks that exhibit all manner of changes. This chapter focuses on describing and quantifying such dynamic networks, recognizing the challenges that dynamics bring, and finding ways to address those challenges. We show how to represent dynamic networks in different ways, how to devise null models for dynamic networks, and how to compare and contrast dynamical processes running on top of the network against a network structure that is itself dynamic. Dynamic network data also brings practical issues, and we discuss working with date and time data and file formats.
In this chapter, we explore several important statistical models. Statistical models allow us to perform statistical inference—the process of selecting models and making predictions about the underlying distributions—based on the data we have. Many approaches exist, from the stochastic block model and its generalizations to the edge observer model, the exponential random graph model, and the graphical LASSO. As we show in this chapter, such models help us understand our data, but using them may at times be challenging, either computationally or mathematically. For example, the model must often be specified with great care, lest it seize on a drastically unexpected network property or fall victim to degeneracy. Or the model must make implausibly strong assumptions, such as conditionally independent edges, leading us to question its applicability to our problem. Or even our data may be too large for the inference method to handle efficiently. As we discuss, the search continues for better, more tractable statistical models and more efficient, more accurate inference algorithms for network data.
In working with network data, data acquisition is often the most basic yet the most important and challenging step. The availability of data and norms around data vary drastically across different areas and types of research. A team of biologists may spend more than a decade running assays to gather a cell's interactome; another team of biologists may only analyze publicly available data. A social scientist may spend years conducting surveys of underrepresented groups. A computational social scientist may examine the entire network of Facebook. An economist may comb through large financial documents to gather tables of data on stakes in corporate holdings. In this chapter, we move one step along the network study life-cycle. Key to data gathering is good record-keeping and data provenance. Good data gathering sets us up for future success (otherwise, garbage in, garbage out), making it critical to ensure the best quality and most appropriate data is used to power your investigation.
This chapter covers ways to explore your network data using visual means and basic summary statistics, and how to apply statistical models to validate aspects of the data. Data analysis can generally be divided into two main approaches, exploratory and confirmatory. Exploratory data analysis (EDA) is a pillar of statistics and data mining and we can leverage existing techniques when working with networks. However, we can also use specialized techniques for network data and uncover insights that general-purpose EDA tools, which neglect the network nature of our data, may miss. Confirmatory analysis, on the other hand, grounds the researcher with specific, preexisting hypotheses or theories, and then seeks to understand whether the given data either support or refute the preexisting knowledge. Thus, complementing EDA, we can define statistical models for properties of the network, such as the degree distribution, or for the network structure itself. Fitting and analyzing these models then recapitulates effectively all of statistical inference, including hypothesis testing and Bayesian inference.
Realistic networks are rich in information. Often too rich for all that information to be easily conveyed. Summarizing the network then becomes useful, often necessary, for communication and understanding, though we must be wary that a summary necessarily loses information about the network. Further, networks often do not exist in isolation. Multiple networks may arise from a given dataset or multiple datasets may each give rise to different views of the same network. In such cases and more, researchers need tools and techniques to compare and contrast those networks. In this chapter, we'll show you how to summarize a network, using statistics, visualizations, and even other networks. From these summaries we then describe ways to compare networks, defining a distance between networks for example. Comparing multiple networks using the techniques we describe can help researchers choose the best data processing options and unearth intriguing similarities and differences between networks in diverse fields.
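One simple way to define a distance between networks, sketched here as an illustration rather than as the chapter's own method, is the Jaccard distance between their link sets. The sketch assumes both networks are undirected and share the same node labels; the function name `jaccard_distance` and the toy edge lists are hypothetical.

```python
def jaccard_distance(edges_a, edges_b):
    """Jaccard distance between the link sets of two undirected networks.

    Assumes the two networks use the same node labels, so that links
    can be compared directly.
    """
    links_a = {frozenset(edge) for edge in edges_a}
    links_b = {frozenset(edge) for edge in edges_b}
    union = links_a | links_b
    if not union:
        # Two empty networks are treated as identical.
        return 0.0
    return 1.0 - len(links_a & links_b) / len(union)


# Hypothetical toy networks sharing two of their links.
network_1 = [(1, 2), (2, 3), (3, 4)]
network_2 = [(1, 2), (2, 3), (4, 5)]
distance = jaccard_distance(network_1, network_2)
print(distance)  # 2 shared links out of 4 distinct links, so 0.5
```

A distance of 0 means the two link sets are identical and 1 means they share no links. This measure ignores node attributes and higher-order structure, so it is only a starting point for comparison.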
Machine learning, especially neural network methods, is increasingly important in network analysis. This chapter will discuss the theoretical aspects of network embedding methods and graph neural networks. As we have seen, much of the success of advanced machine learning is thanks to useful representations—embeddings—of data. Embedding and machine learning are closely aligned. Translating network elements to embedding vectors and sending those vectors as features to a predictive model often leads to a simpler, more performant model than trying to work directly with the network. Embeddings help with network learning tasks, from node classification to link prediction. We can even embed entire networks and then use models to summarize and compare networks. But not only does machine learning benefit from embeddings, but embeddings benefit from machine learning. Inspired by the incredible recent progress with natural language data, embeddings created by predictive models are becoming more useful and important. Often these embeddings are produced by neural networks of various flavors, and we explore current approaches for using neural networks on network data.
This chapter discusses record keeping, like maintaining a lab notebook. Historically, lab notebooks were analog, pen-and-paper affairs. With so much work being performed on the computer and with most scientific instruments creating digital data directly, most record-keeping efforts are digital. Therefore, we focus on strategies for establishing and maintaining records of computer-based work. Keeping good records of your work is essential. These records inform your future thoughts as you reflect on the work you have already done, acting as reminders and inspiration. They also provide important details for collaborators, and scientists working in large groups often have predefined standards for group members to use when keeping lab notebooks and the like. Computational work differs from traditional bench science, and this chapter describes practices for good record-keeping habits in the more slippery world of computer work.