So whether A trusts B, B trusts A, or there is a reciprocal trusting relationship between A and B, you would treat A and B as having a trust relationship and ignore the directionality of the trust. For either of these reasons, R makes it easy to transform a directed network into a non-directed network. To do this you can use the symmetrize function. The name of the function should remind you that when network data are stored in a sociomatrix, symmetry around the diagonal indicates that the ties are non-directed. The symmetrize procedure is relatively straightforward, except that it returns a sociomatrix or, optionally, an edgelist.
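
The two most common symmetrizing rules can be sketched in base R without any packages. The small sociomatrix below is made up for illustration. As far as I know, the symmetrize function in sna exposes the same idea through a rule argument (for example "weak" or "strong"), but the code here is only a sketch of the logic, not that function's implementation.

```r
# Hypothetical directed sociomatrix: rows send ties, columns receive them.
actors <- c("a", "b", "c")
A <- matrix(c(0, 1, 0,
              0, 0, 1,
              0, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(actors, actors))

# "Weak" rule: keep a tie if it exists in either direction.
weak <- (A | t(A)) * 1

# "Strong" rule: keep a tie only if it is reciprocated.
strong <- (A & t(A)) * 1

# Both results are symmetric around the diagonal, i.e. non-directed.
isSymmetric(weak)
isSymmetric(strong)
```

Under the weak rule the one-way tie from a to b survives as an undirected tie; under the strong rule only the reciprocated b and c tie remains.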
So we need to then turn it into a network object, as we have done previously. This creates a symmetric network where the only ties preserved are the fully reciprocated ties. Above all else, show the data. Edward R. As suggested in Chap. The overall purpose of a network graphic as with any information graphic is to highlight the important information contained in the underlying data. However, there are innumerable ways to visually layout network nodes and ties in two-dimensional space, as well as using graphical elements e.
In the next three chapters we go over basic principles of effective network graph design, and how to produce effective network visualizations in R. An effective network graphic will convey the important information in a social network, such as the overall structure, location of important actors in the network, presence of distinctive subgroups, etc. At the same time, the graphic should do its best to minimize irrelevant information. For example, tie length in a network graphic is arbitrary in the sense that the length of a tie is not meaningful.
An effective network figure will be designed and laid out in a way that minimizes the chance that a viewer will misinterpret the meaning of tie lengths. The purpose of this chapter is to introduce basic plotting techniques for networks in R, and discuss the various options for specifying the layout of the network on the screen or page. The following example shows how interpretation of a network graphic can be impeded or enhanced by its basic layout. At first glance it may appear that the figures are showing two quite different networks. In fact, they are two different visual representations of the same underlying network.
Despite representing the same network data, the righthand figure is easier for us to interpret. In particular, it is much easier to see that the network is made up of two separate components, and that the large component has two fairly distinct cohesive subgroups. That is, the important structural characteristics of the network are easier to determine with the second layout compared to the first.
Although it is possible to lay out a network in 3D-space, the vast majority of network visualizations are two-dimensional. Nodes are represented by shapes, typically circles, and ties are represented by straight or sometimes curved lines. The lines themselves can be tricky to interpret for somebody new to network visualization. In particular, the length of the line has no real meaning. Consider the following two graphs, which display the same simple network Fig. At a quick glance it might appear that node D is further away from B and C in the second graph. But the ties simply indicate which nodes are adjacent to one another, so the length of each line does not communicate any substantive information.
However, as the Moreno 4th grade friendship network example illustrated Fig. This is the fundamental challenge of network visualization: to reveal important structural characteristics of the network without distortion or, as Edward Tufte stated, "The minimum we should hope for with any display technology is that it should do no harm" (Tufte). Although there are not in fact an infinite number of ways to display a network on a screen, the number of possibilities might as well be. For example, consider a moderate sized network of 50 nodes, and a display grid of 10 by 10. In actuality, the display grid would be much larger than this.
The first node in the network could be placed in any one of 100 positions, the 2nd node in 99 positions, and so on. In this example, there are roughly 3 × 10^93 possible layouts (100 × 99 × ... × 51). Most of the possibilities will produce ugly or confusing layouts, so there must be some way to pick a layout that has a better than average chance of being visually acceptable. Fortunately, network and visualization scientists have studied what makes network graph layouts easier to understand and interpret.
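
The size of that count is easy to check with a line or two of R. This sketch assumes, as the example does, that each of the 50 nodes occupies its own cell of a 10 by 10 grid; the variable names are mine.

```r
n_positions <- 10 * 10   # cells in the display grid
n_nodes <- 50            # nodes to place, one per cell

# 100 choices for the first node, 99 for the second, ..., 51 for the last.
n_layouts <- prod(seq(n_positions, by = -1, length.out = n_nodes))

format(n_layouts, digits = 3, scientific = TRUE)
```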
What has emerged from this line of work is a set of aesthetic principles that can be used to more effectively display networks. Network graphics are easier to understand if they follow as much as possible the following five guidelines: A large number of approaches have been developed for automatic layout of network graphics. One general class of algorithms, called force-directed, has proven to be a flexible and powerful approach to automatic network layouts.
These algorithms work iteratively to minimize the total energy in a network, where the energy can be defined in a number of ways. A popular approach is to have connected nodes have a spring-like attractive force, while simultaneously assigning repulsive forces to all pairs of nodes. The springs in this algorithm act to pull connected nodes closer to one another, while the repulsive forces push unconnected nodes away from each other.
The resulting network system will move around and oscillate for a while before settling into a steady state that tends to minimize the energy in the network system. This describes how the algorithm works, but the remarkable feature is that the resulting network graph tends to produce displays that are aesthetically pleasing, in the sense described above (Fruchterman and Reingold). To see the positive results of using one of these algorithms, consider the comparison in Fig.
On the left-hand side, the Moreno network is displayed randomly. On the right-hand side we are using the Fruchterman-Reingold algorithm for the network display. In fact, it is the default algorithm used by the statnet network plotting functions. On the right-hand side the nodes are displayed more symmetrically, there are relatively fewer edge crossings, and the tie lengths are more uniform.
All of this makes it easier to interpret the structural information contained in the network. As stated above, a force-directed algorithm works by iteratively adjusting the overall network layout until some measure of overall network energy is minimized.
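
To make the spring-and-repulsion idea concrete, here is a deliberately simplified force-directed layout in base R. It follows the general Fruchterman-Reingold recipe (repulsion between all pairs, attraction along ties, a step size that cools over time), but the function name fr_sketch, the constants, and the stopping rule are my own assumptions; it is not the code used by statnet or igraph.

```r
fr_sketch <- function(adj, iterations = 200, seed = 1) {
  set.seed(seed)
  n <- nrow(adj)
  tied <- (adj + t(adj)) > 0              # treat ties as undirected
  k <- sqrt(1 / n)                        # preferred distance between nodes
  pos <- matrix(runif(2 * n), ncol = 2)   # random starting positions
  temp <- 0.1                             # largest allowed move per iteration
  for (iter in seq_len(iterations)) {
    disp <- matrix(0, nrow = n, ncol = 2)
    for (i in seq_len(n)) {
      delta <- cbind(pos[i, 1] - pos[, 1], pos[i, 2] - pos[, 2])
      d <- pmax(sqrt(rowSums(delta^2)), 1e-9)
      push <- k^2 / d                         # repulsion from every other node
      pull <- ifelse(tied[i, ], d^2 / k, 0)   # spring attraction along ties
      net <- push - pull
      net[i] <- 0                             # a node exerts no force on itself
      disp[i, ] <- colSums(delta / d * net)
    }
    len <- pmax(sqrt(rowSums(disp^2)), 1e-9)
    pos <- pos + disp / len * pmin(len, temp) # move, capped by the temperature
    temp <- temp * 0.98                       # cool down so the layout settles
  }
  pos
}

# Hypothetical network: two triangles joined by a single bridging tie.
adj <- matrix(0, 6, 6)
ties <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
adj[ties] <- 1
adj[ties[, 2:1]] <- 1

coords <- fr_sketch(adj)
plot(coords, pch = 19, cex = 2, axes = FALSE, xlab = "", ylab = "")
segments(coords[ties[, 1], 1], coords[ties[, 1], 2],
         coords[ties[, 2], 1], coords[ties[, 2], 2])
```

Changing the seed argument gives a different but similarly structured layout.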
The details of this are usually not of interest, but to see how this works in practice consider Fig. Starting from a circle layout, it shows how the Fruchterman-Reingold layout algorithm works through successive iterations, beginning at iteration 0 (the starting circle). The Fruchterman-Reingold algorithm, like other force-directed approaches, is iterative and non-deterministic.
That means that each time you run the plotting algorithm you will not get the exact same layout. However, you will get a layout that tends to be symmetrical, minimize edge crossings, etc. Network visualization in statnet is handled by two closely related functions, plot and gplot. The latter has more layout options, so it may be more generally useful. To use a different layout algorithm, it is as simple as specifying the appropriate layout option.
Figure 4. The layout options provided in statnet (and igraph, see below) work algorithmically or heuristically, usually with some randomness. So, even with the same layout option, a different graphic layout will be produced each time the network is plotted. Fortunately, R provides a way to have exact control over the layout coordinates. This allows for exact positioning, or saving the layout coordinates after a particular network is plotted.
The coord option in the plot function is used for this. This option expects a matrix with two columns. Each row corresponds to one node, the first column gives the X coordinate, and the second column gives the Y coordinate. This is demonstrated in the next example. Here, we produce an initial plot of the Bali network, saving the coordinates. We then stretch out the layout of the graph by multiplying the Y coordinates by a constant. Both plots are shown in the figure, along with axes to make it easier to see how the coordinates have changed Fig.
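
The coordinate handling described above can be sketched as follows. The four-node network and its coordinates are hypothetical stand-ins for the Bali data, and the stretch factor of 2 is my choice. The last part runs only if the sna package is installed; it relies on my understanding that gplot returns the plotted coordinates and accepts them back through coord.

```r
# Hypothetical layout for four nodes: column 1 is X, column 2 is Y.
coords <- cbind(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))

# Stretch the layout vertically by multiplying the Y column by a constant.
stretched <- coords
stretched[, 2] <- coords[, 2] * 2

# Optional: save and reuse coordinates from an actual network plot.
if (requireNamespace("sna", quietly = TRUE)) {
  adj <- matrix(c(0, 1, 0, 1,
                  1, 0, 1, 0,
                  0, 1, 0, 1,
                  1, 0, 1, 0), nrow = 4, byrow = TRUE)
  saved <- sna::gplot(adj, gmode = "graph")          # keep the layout coordinates
  saved[, 2] <- saved[, 2] * 2                       # stretch the Y coordinates
  sna::gplot(adj, gmode = "graph", coord = saved)    # replot with the modified layout
}
```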
There are many other ways to use specific coordinates, but the main use is to preserve a particular layout for future production and examination. The igraph package provides the user with a similar set of options for controlling the layouts of network graphics. The layout option is used to specify an existing layout function or refer to a set of vertex coordinates.
[Figure panels: Original coordinates; Modified coordinates]
As with any graphic, networks are used in order to discover pertinent groups or to inform others of the groups and structures discovered.
It is a good means of displaying structures. However, it ceases to be a means of discovery when the elements are numerous. The figure rapidly becomes complex, illegible and untransformable. Jacques Bertin. Achieving effective network graphic design is not that different from any other type of information graphic.
The goal for any network graphic design should be to produce a figure that reveals the important or interesting information that is contained in the network data. To do this, the analyst must make decisions about every graphical element that can appear in the figure. R, and the plotting functions contained in the statnet and igraph packages, give the analyst almost complete programmatic control over the appearance of the network graphic. The purpose of this chapter is to walk through many of the most useful design elements in network graphics, and discuss how to use them and why they should be used in certain ways.
Like any other type of information graphic, network visualizations are made up of a large number of distinct visual elements. These individual elements include things that are distinctive to network graphics, such as nodes and ties, as well as other elements common to most graphics, such as titles, legends, etc. The plotting functions in statnet and igraph provide a great deal of programmatic control to the user. Although a simple call to a plotting function is enough to produce a default network graphic, it is almost always the case that you will need to take time to set appropriate function options and develop some additional R code to produce an effective graphic.
Some design decisions will be made on aesthetic grounds, while many others will be based on the most important pattern or story that you wish to convey with the graphic and that is supported by the underlying network data. The following sections present a quick tour of the most commonly used individual graphing elements for network visualizations. They will each be covered on their own in turn. By default, statnet produces a network graphic with red circles as nodes. To designate a different color, the vertex.col option is used.
The gmode option is also used here to tell statnet not to handle Bali as a directed graph. So, for example, it is a simple matter to produce a plot with attractive light blue nodes Fig. In general, all of the basic color-handling options of R are available for plotting networks. This opens up a lot of power and flexibility for graphic design, but to use color effectively will require some homework. In particular, it will be useful to read more in-depth treatments of color use in R e.
As the above example suggests, a color can be designated by its color name. To see all of the possible color names recognized by R, use the colors command. The following code will produce the same network graphic with the same light blue nodes (figure not shown), showing how you can obtain colors using the rgb and hexadecimal approaches. To get the appropriate rgb values for a particular color name, you can use the col2rgb function.
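
The three ways of naming the same color can be compared directly in base R. The sketch uses the built-in color name lightblue as a stand-in for whichever light blue was used in the figure; the resulting strings can be passed wherever a plotting function accepts a color.

```r
# Is the name recognized by R?
"lightblue" %in% colors()

# Look up the red, green, and blue components (0 to 255) of the named color.
rgb_vals <- col2rgb("lightblue")

# Rebuild the color from its components; the result is a hexadecimal string.
from_rgb <- rgb(rgb_vals["red", 1], rgb_vals["green", 1], rgb_vals["blue", 1],
                maxColorValue = 255)

# The same color written directly in hexadecimal notation.
from_hex <- "#ADD8E6"

identical(from_rgb, from_hex)
```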
One less common color feature in R can come in handy for network diagrams, especially with large networks where the nodes overlap in the graphic. The rgb function can be used to specify the amount of transparency, from 0 (fully transparent) to 1 (fully opaque). Figure 5. Both graphics show the same random network of nodes. The layouts are different because each is using the Fruchterman-Reingold force-directed algorithm. The figure on the left is using a fully opaque dark blue color. The overlapping nodes are much easier to see when transparent colors are used.
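
A small base R sketch of the transparency idea, using overlapping points in place of network nodes. The particular blue and the alpha value of 0.3 are my own choices, not values from the original figure.

```r
# Dark blue on the 0-to-1 scale, once opaque and once mostly transparent.
opaque      <- rgb(0, 0, 0.55)
translucent <- rgb(0, 0, 0.55, alpha = 0.3)

# Overlapping points stand in for overlapping nodes.
set.seed(10)
x <- rnorm(200)
y <- rnorm(200)

op <- par(mfrow = c(1, 2))
plot(x, y, pch = 19, cex = 3, col = opaque, main = "Opaque")
plot(x, y, pch = 19, cex = 3, col = translucent, main = "Transparent")
par(op)
```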
Note that some graphics devices in R may not support transparent colors. In these previous examples, every node has the same color. A more important use of color is to communicate some characteristic of the node or network by having different nodes have different colors. Specifically, information stored in a categorical node attribute can often be communicated through judicious node color choices.
For example, the Bali terrorist network has the role vertex attribute which stores the categorical description of the role that each member played in the network. CT means a member of the command team, BM is a bomb maker, etc. Bali for more information. Node colors can be used to effectively distinguish the network member roles. Since this information is already stored in a vertex attribute, statnet can use this to automatically pick node colors.
This is only true for plot, not gplot. The node labels are also printed out to facilitate interpretation. Node labeling will be discussed in Sect. The network is much more interpretable by using color coding in this way. For example, we can more easily understand the subgroup structure by noting the greater density among the members of Team Lima (TL, cyan), as well as among the bombmakers (BM, black).
However, by simply using the name of an existing vertex attribute, statnet picks node colors from the existing default color palette in R. Using the default palette has a number of disadvantages. First, it is limited to eight colors. R will cycle through the set of eight colors if there are more than eight types of nodes to color. Second, the default palette starts with black which is often not a good color choice to include with other colors in a network graphic. More generally, the default colors do not represent an aesthetically pleasing or useful set of colors for displaying categorical classifications.
A more flexible, and usually aesthetically more satisfying, approach is to set up your own color palette and then index into it for color selection. The RColorBrewer package provides a number of predesigned color palettes that are very useful when using color to distinguish between a relatively small set of categories. For more information, see? More details on the ColorBrewer. In the following code, a user-defined palette is created by selecting five colors from a larger palette called Dark2 provided by RColorBrewer.
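
A minimal sketch of the palette-and-index approach, written in base R so it runs without extra packages. The role vector is a made-up stand-in for the Bali role attribute, and the five hexadecimal colors are intended to match what brewer.pal(5, "Dark2") returns in RColorBrewer; treat both as assumptions.

```r
# Hypothetical role codes, one per node.
role <- c("CT", "BM", "OA", "CT", "TL", "SB", "BM")

# Five-color palette (assumed to correspond to the first five Dark2 colors).
my_pal <- c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E")

# Indexing with a factor uses its integer codes (levels sorted alphabetically).
role_factor <- as.factor(role)
node_cols <- my_pal[role_factor]

# Indexing an unnamed palette with the raw character vector finds nothing.
bad_cols <- my_pal[role]

data.frame(role, node_cols, bad_cols)
```

The node_cols vector can then be supplied as the node color argument of a plotting call.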
Once the palette has been defined, it can be used in the network plotting call. This approach produces more pleasing sets of colors, and is much more flexible than relying on the default color palette Fig. Note that we convert the role vertex attribute character vector to a factor so that the indexing will work. This means that if you have a numeric vector stored as a vertex attribute that you do not have to turn it into a factor. In other words, the indexing works with factors or numeric vectors, but not character vectors Fig. In addition to using color to distinguish between different types of nodes, statnet can be directed to use different shapes for the nodes.
It is also particularly useful for situations where you will not be able to use color to distinguish nodes or help viewers who may be color-blind. Unfortunately, statnet has only a limited ability to distinguish nodes by shapes, by designating the number of sides used to plot the node polygon normally, the number of sides is 50, which produces a circle. If the number of sides is 3 you get a triangle, 4 a square, and so on. This is only useful for a very small number of node types Fig. If you have a particular need to use node shapes in a network graphic, igraph is much more flexible in this regard.
See Sect. Network node sizes are controlled by the vertex.cex option. The overall sizes of the nodes should be set so that the nodes are large enough to be distinguishable, but small enough that they do not extensively overlap. Rather than setting the same overall size for every node, it is often useful to use the node size in a network graphic to communicate some important quantitative characteristic.
For example, nodes vary in their positions in the overall network. Some nodes are very central, while others are more peripheral. Chapter 7 discusses node prominence and centrality in more detail, but for now we will simply calculate some node characteristics such that larger numbers indicate more central nodes. To set this up, we will calculate three different measures of node centrality.
Each of these lines of code produces a vector of centrality measures for each node, and larger numbers indicate greater centrality. Once you have this node-level vector of quantitative information, it can be used to set the relative sizes of the nodes. This is done by using the same vertex. However, as we can see by comparing the two panels in Fig. This results in node sizes where we can more easily see the nodes with higher degree relative to nodes with lower degree.
The next two examples show other types of adjustments that might be necessary when setting relative node sizes. Using cls (closeness) we have the opposite problem from the previous example, where the node sizes start out too small. So an appropriate adjustment is to multiply the original values Fig. The bet vector (betweenness) provides a more complex example.
First, the raw vector sizes vary across several orders of magnitude with one node with a size of In addition, some of the nodes have 0 for their bet values. These zeros would result in the nodes being plotted with 0 size, so we need to handle this by adding 1 to the entire vector before taking the square root Fig.
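
The kinds of adjustments described above can be sketched in base R. The centrality values below are made up for illustration rather than computed from the Bali network, and the specific transformations (a log for degree, a multiplier of 4 for closeness) are my assumptions. The last lines add a small min-max helper, here called rescale, as one possible way to map any numeric vector onto a chosen range of node sizes.

```r
# Hypothetical node-level centrality scores.
deg <- c(9, 4, 4, 15, 9, 9, 4)                        # degree: sizes too large
cls <- c(0.55, 0.40, 0.42, 0.70, 0.50, 0.52, 0.38)    # closeness: sizes too small
bet <- c(0, 0, 1.5, 120, 3, 0, 20)                    # betweenness: zeros, huge range

size_deg <- log(deg)         # compress large values
size_cls <- 4 * cls          # inflate small values
size_bet <- sqrt(bet + 1)    # add 1 so zero scores still get a visible node

# Min-max helper: map x linearly onto the interval [low, high].
rescale <- function(x, low, high) {
  rng <- range(x)
  (high - low) * (x - rng[1]) / (rng[2] - rng[1]) + low
}

size_rescaled <- rescale(deg, 1, 6)
size_rescaled
```

Any of these vectors could then be passed to the node size option of the plotting call.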
The adjustments for relative node sizes can be tedious, although R does give you complete control over how to adjust the sizes. A small helper function can save some time when figuring out the best node sizes. Such a function, rescale, takes a vector of node characteristics (it can actually be any numeric vector) and rescales the values to fit between the low and high values. The next plot shows how the function works and rescales the raw degree values for the Bali network to set the node sizes to vary from one to six Fig.
A network graphic is often more interesting and easier to interpret if nodes are labelled so that the audience can see who or what makes up the network.
This is particularly helpful for smaller networks; if networks get too large then the labels themselves may get in the way of the network information. If a network object in statnet contains the special vertex attribute vertex.names, those names can be used as the node labels. Other characteristics of the node labels can be controlled such as font size, color, and distance from node Fig.
The automatic labels based on information stored in the vertex. For example, in the case of the Bali network the actual names of the terrorists are not that interesting to most viewers. Fortunately, you can use other text information to label the nodes.
We saw an example of this earlier in Fig. In this case we are using the text stored in the role vertex attribute to label the nodes. The key here is to use the label option to specify what text vector to use for the labels Fig. If your network data include valued ties, or in general any quantitative information that can be related to ties between nodes, then you can communicate that information visually by altering the width of the displayed ties in a network graphic. For example, the strength of friendship ties might be known, or the amount of money that flows between organizations in a directed network might be measured.
In these cases, thicker ties can denote greater strength or greater flow Fig. The Bali network includes a tie attribute called IC, which is a simple five-level ordinal scale that was used to measure the amount of interaction between members of the network. This attribute can be used to set the width of the ties in the network visualization. In the example below the IC values are extracted from the stored edge attribute. This allows us to transform the vector, multiplying it by a constant, to better distinguish among the five IC levels.
While edge width can be set to communicate quantitative information about network ties, the color of the edge can be set to communicate qualitative information about the tie, similar to how node colors can be set. For example, you could use different colors of line graphics to distinguish between positive and negative ties in a social network Fig.
The Bali network does not contain categorical or qualitative information stored in an edge attribute, so here we create a random categorical vector to demonstrate how to use different edge colors in a network graphic. For this example, we set up a color palette that can be used to index the correct color choice, based on the categorical edge vector.
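
A base R sketch of that palette-indexing step. The number of edges, the seed, and the variable names are hypothetical; the point is only that an integer category vector can index directly into a vector of colors.

```r
set.seed(42)
n_edges <- 10

# Random categorical edge vector with three types, coded 1, 2, and 3.
edge_cat <- sample(1:3, n_edges, replace = TRUE)

# Palette in the order of the category codes.
linecol_pal <- c("blue", "red", "green")

# One color per edge, chosen by its category code.
edge_cols <- linecol_pal[edge_cat]

table(edge_cat, edge_cols)
```

The edge_cols vector would then be passed to the edge color option of the plotting function.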
In this case blue will be used for edge type 1, red for edge type 2, and green for edge type 3. This might reflect neutral ties (blue), negative ties (red), and positive ties (green). Also see Fig. While edge width can be set to communicate quantitative information about network ties, the type of the edge can be set to communicate qualitative information about the ties. For example, you could use different types of line graphics to distinguish between positive and negative ties in a social network.
The Bali network does not contain categorical or qualitative information stored in an edge attribute, so here we create a random categorical vector to demonstrate how to use different edge types in a network graphic. Also, the different line types do not show up clearly using plot, so gplot is used here Fig. Although this works as intended, the resulting graphic is not very attractive and in my mind is hard to interpret.
Different line types should be used sparingly, and probably only for very small networks with only two different line types. Most published network graphics stick to color and maybe line width to distinguish among different types of network ties. The examples above show how network graphic elements such as node color, node shape, node size, edge type, and edge width can be used to communicate important characteristics of the network.
As with other types of information graphics, it is often useful to provide a legend so that the meaning of this information is clear to the user. The basic plotting functions contained in statnet do not have built-in functionality for providing a network graphic legend. Fortunately, it is easy to use the legend function provided by basic R to add a legend to a network graphic.
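
Because legend draws on top of whatever plot is currently open, the same call works after a network plot. The sketch below uses a plain scatterplot of made-up coordinates as a stand-in for the network graphic; the roles, colors, and legend position are illustrative assumptions.

```r
# Hypothetical node roles and a palette with one color per role level.
role <- factor(c("CT", "BM", "OA", "CT", "TL", "BM"))
my_pal <- c("steelblue", "darkorange", "purple", "forestgreen")

# Stand-in for layout coordinates produced by a network plotting call.
set.seed(1)
xy <- matrix(runif(2 * length(role)), ncol = 2)

plot(xy, pch = 19, cex = 2, col = my_pal[role],
     axes = FALSE, xlab = "", ylab = "")

# One legend entry per role level, in the same order as the palette.
lg <- legend("bottomleft", legend = levels(role), col = my_pal,
             pch = 19, pt.cex = 1.5, bty = "n", title = "Role")
```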
In the example below we replicate the network graphic from Fig. It uses node size, node color, and a legend to efficiently and clearly communicate the most important information contained in the Bali network. As the previous two chapters demonstrate, both statnet and igraph have sophisticated plotting capabilities that can produce a very wide variety of network graphics. However, these plotting functions cannot meet all of the analytic or presentation needs. In particular, network scientists may wish to produce more specialized network graphics.
Also, while statnet and igraph excel at producing high-quality publication ready network graphics, these graphics are static. Fortunately, developers have started exploring how to take network graphics and deliver them to web-based platforms where users can interact with the diagrams. This chapter explores a few of these more specialized network graphic techniques, as well as demonstrating how to produce some simple web-based interactive network diagrams.
One of the useful features of many other network analysis packages such as UCINet and Pajek is the ability to produce network diagrams that are interactive at some level. These capabilities can be very useful for exploring the network, as well as fine-tuning a network graphic for subsequent dissemination. There are a few exceptions to this, as well as some new packages that allow for creating interactive network diagrams that can be published to the web.
In this section a few of these options are demonstrated. The igraph package includes the tkplot function which supports simple interactive network plots through a Tk graphics window. Only some features of the network graphics can be modified. A typical use for this feature is to produce the interactive graphic, adjust the node positions to improve the network layout, save the node position coordinates and then use the coordinates to produce a final non-interactive network diagram.
This work flow is illustrated below with the Bali network, see Chap. None of these approaches yet have come close to matching what a fully-developed network graphics application such as Gephi can do. However, I anticipate that we will be seeing rapid development of more R-connected approaches to web-based network visualization in the next few years. The networkD3 package is a small set of functions that can be used to build simple interactive network graphics that can be displayed in shiny-aware documents i. The following code shows how simple it is to produce an interactive graphic.
The first set of lines will send a graphic to the Viewer window if you run the commands within RStudio. The simpleNetwork function expects the network data in the form of an edgelist stored in a dataframe. The output from the examples in this section is not shown here, because it requires RStudio or a web browser to view. The output from simpleNetwork is so simple that it mainly is useful as a proof-of-concept or tech demo.
Slightly more sophisticated network graphics can be produced using the forceNetwork function.
For this example, we are using the Bali network again. The function expects data to be passed to it in two data frames. Currently only a categorical grouping variable is allowed. If the nodes have numeric ids, they must start at 0. So, the main work to use the function is putting the data into the correct format.
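
The data preparation can be sketched in base R. The tiny edgelist, node names, and group labels below are hypothetical stand-ins for the Bali data. The final call runs only if networkD3 is installed, and uses the forceNetwork arguments as I understand them (Links, Nodes, Source, Target, NodeID, Group).

```r
# Hypothetical edgelist using node names.
edges <- data.frame(from = c("A", "A", "B", "C"),
                    to   = c("B", "C", "C", "D"),
                    stringsAsFactors = FALSE)

# Nodes data frame: one row per node, with a categorical grouping variable.
nodes <- data.frame(name = sort(unique(c(edges$from, edges$to))),
                    stringsAsFactors = FALSE)
nodes$group <- c("x", "x", "y", "y")

# Links data frame: numeric ids that start at 0, hence the "- 1".
links <- data.frame(source = match(edges$from, nodes$name) - 1,
                    target = match(edges$to, nodes$name) - 1)

if (requireNamespace("networkD3", quietly = TRUE)) {
  widget <- networkD3::forceNetwork(Links = links, Nodes = nodes,
                                    Source = "source", Target = "target",
                                    NodeID = "name", Group = "group",
                                    opacity = 0.8)
}
```

The row order of nodes matters here: the zero-based ids in links refer to row positions in the nodes data frame.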
Once again, this can be saved to an external file. Be careful, you will get an error if you try to overwrite an existing file, even if it is not open in your browser. The visNetwork package is a similar set of tools that uses the vis.js JavaScript library. This package also requires network data to be provided in a nodes data frame and an edges data frame. The nodes data frame should include an id column, and the edges data frame should have from and to columns.
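
A minimal sketch of the two data frames in the shape visNetwork expects, as far as I understand its interface. The ids and labels are hypothetical; the visNetwork call itself runs only if the package is installed.

```r
# Nodes: an id column is required; label is optional display text.
nodes <- data.frame(id = 1:4, label = c("A", "B", "C", "D"))

# Edges: from and to refer to values in nodes$id.
edges <- data.frame(from = c(1, 1, 2, 3),
                    to   = c(2, 3, 3, 4))

# Every edge endpoint should correspond to a known node id.
all(c(edges$from, edges$to) %in% nodes$id)

if (requireNamespace("visNetwork", quietly = TRUE)) {
  widget <- visNetwork::visNetwork(nodes, edges, width = "100%")
}
```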
Using the Bali network, the following code sets up the data and produces a minimal example of an interactive network graphic. Like in the previous example, this code produces an interactive network in the Viewer window of RStudio. The visNetwork package has a large number of options that can be used to control the appearance of the network diagram, as well as for controlling how the plot can be embedded in Shiny web applications.
The next code shows off some of these options. First, some of the display options are controlled by saving node or edge information into the nodes or edges data frames. The visNetwork and visOptions functions are used to display the network, add a legend based on the grouping variable, set default colors for each group, and then allow for the user to highlight individual nodes and their immediate neighbors when clicking on a node in the diagram. As before, these interactive plots will appear in a plot window if you are using RStudio. Once the plot has been designed, it can be exported to a freestanding webpage or embedded in other web platforms e.
This example adds a set of navigation buttons to the final network plot that allows moving the network and zooming in or out. As evidence of the rapid development of interactive network tools, the Statnet development team has recently published a web-based version of their R network analytic tools using the shiny web application framework. Statnet Web can be used by connecting directly to the shinyapps. Or, the tools can be run locally by installing the statnetWeb package.
In addition to producing basic network plots by selecting parameters and options from drop-down boxes, statnetWeb can produce a variety of network statistics as well as fit and test ERGMs see Chap. Although web-based statnet does not give as much control over or reproducibility of network analytic results as a programming approach does, it is an impressive platform for quickly exploring network characteristics and will be useful for teaching as well as disseminating network analytic results.
Traditionally, network diagrams are plotted to illustrate fundamental network and node properties such as prominence see Chap. However, there are a number of more specialized plotting techniques that can be used that are appropriate for highlighting other important or interesting aspects of the networks. Three of these approaches are demonstrated in this section: arc diagrams, chord diagrams, and heatmaps. Arc diagrams can be used when the positioning of nodes in a network is of less interest than the pattern of ties.
Here is a simple example of an arc diagram, using the arcdiagram package. Note that this has to be installed using GitHub. The set-up for this example includes loading all the required libraries, then creating an edgelist object for the arcdiagram function. For this example, we are using the Simpsons dataset, which contains a set of fictitious network data that shows the primary interaction ties between 15 of the characters on the Simpsons television show. The arc diagram can be enhanced in a number of ways to highlight node and other network characteristics. Also, the degree of each node is used to adjust its size Fig.
Chord diagrams are a specialized type of information graphic that uses a circular layout to display the interrelationships between data in a matrix. They have become particularly popular in genetics research. Because network information can be organized in matrices, chord diagrams are an interesting graphic option for network plots. The circlize package, by Zuguang Gu, implements a variety of circular graphics, including chord diagrams.
The package has a lot of features, giving the user great control over the graphical appearance. The included vignette, circular visualization of matrix is suggested reading. In this example, we return to the network of the Netherlands World Cup soccer team. Although Fig. Here we will create a chord diagram to further examine these patterns. The first steps are to load the required packages and prepare the data. The main requirement is to have the network data in the form of a sociomatrix, with the entries corresponding to the strength or size of the tie if it is a valued network.
The matrix will also have to have names assigned for the rows and columns. Ralph Wiggum. Ned Flanders Mr. Milhouse Smithers Maggie Homer. Moe Bart. Carl Pr. To make the subsequent graphics a little easier to interpret we drop all ties with less than ten passes. With a sociomatrix that has names assigned, a basic chord diagram can be pro- duced by a simple call to the chordDiagram function Fig. Chord diagrams can contain a lot of information, especially for larger networks, so it is usually important to fine tune the plot to highlight the most important infor- mation. In this next plot, a number of options are used to make the graphic a little easier to interpret.
First, colors are set so that players in the same position (Forward, Midfielder, etc.) share a color. Then, because this is a directed network, flows (passes, in this case) go in both directions. The directional option is used so that the departing passes start further away from the outer circle, making it easier to see the difference between passes sent and passes received. Finally, the order option is used to sort the players by their position. In the resulting chord diagram (Fig.), we can see that FW7 receives more than twice as many passes as the other two forwards.
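The data preparation described above can be sketched in base R. This is a minimal illustration only: the pass counts below are made up rather than taken from the Netherlands data, and the object names (`passes`, `strong`) are hypothetical.

```r
# Hypothetical pass counts among four players (NOT the Netherlands data).
# Rows are the passers, columns are the receivers.
players <- c("GK1", "DF4", "MF6", "FW7")
passes <- matrix(c( 0, 14,  3,  1,
                    9,  0, 12,  4,
                    2, 11,  0, 15,
                    0,  5, 13,  0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(players, players))

# Drop weak ties: any cell with fewer than ten passes is set to zero.
strong <- passes
strong[strong < 10] <- 0

# Passes sent (row sums) and received (column sums) after thresholding.
sent     <- rowSums(strong)
received <- colSums(strong)
print(strong)
```

A named, thresholded matrix such as `strong` is the kind of input that the text describes handing to chordDiagram; the directional, order, and color options would then be set in that call.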
Heatmaps are another example of a specialized graphic that can be used for networks, especially valued or weighted networks. Here, a heatmap is produced to highlight the players who are passing or receiving the most among the Netherlands teammates. Once the data are set up, the heatmap is relatively easy to produce (Fig.).
The colorRampPalette function is used to designate a color range that will be used for the low and high ends of the values in the sociomatrix. The color ranges chosen here were taken from color chooser tools at paletton. The heatmap also shows the same pattern of heavy passers as Fig. The darkest square is for the passes from the goalkeeper to DF4.
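As a rough sketch of the heatmap step, the following uses the base `heatmap` function together with `colorRampPalette`. Whether the original example used `heatmap` or another plotting function is not recoverable here, so treat that choice, the made-up pass counts, and the two hex colors as assumptions.

```r
# Hypothetical pass counts (NOT the Netherlands data); rows pass, columns receive.
players <- c("GK1", "DF4", "MF6", "FW7")
passes <- matrix(c( 0, 14,  3,  1,
                    9,  0, 12,  4,
                    2, 11,  0, 15,
                    0,  5, 13,  0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(players, players))

# colorRampPalette() returns a function; calling it with n yields n colors
# interpolated between the low-end and high-end colors (arbitrary choices here).
pal  <- colorRampPalette(c("#F7FBFF", "#08306B"))
cols <- pal(20)

# Rowv = NA and Colv = NA turn off clustering and dendrograms;
# scale = "none" keeps the raw counts rather than standardizing rows.
heatmap(passes, Rowv = NA, Colv = NA, col = cols, scale = "none",
        xlab = "Receiver", ylab = "Passer")
```

Keeping `scale = "none"` matters for a sociomatrix: the default row scaling would hide the fact that some players simply pass much more than others.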
Although ggplot2 is not designed to handle all of the requirements of a full-fledged network visualization package, some of its advanced graphics capabilities can be used to create specialized network plotting routines. The bulk of the work is done by the edgeMaker function, which creates the curved ties between each connected dyad. In addition to the core sna and ggplot2 packages, the Hmisc package is used, which provides the bezier function used by edgeMaker.
As has been typical with the examples in this chapter, the network data has to be transformed to an edgelist format prior to using the plotting functions. Now, we use edgeMaker to create the curved edges. Also, gplot from sna is called once to store the layout coordinates for the ggplot2 function. This means that any set of coordinates can be fed to ggplot2. Before producing the plot, we create an empty ggplot2 theme. This is used to clean up after producing the plot. And now the final step is to create the plot using ggplot.
Familiarity with ggplot2 will help in understanding this code. The scale_colour_gradient option controls the intensity of the gradient, and the scale_size option controls the amount of the taper (Fig.). Jean Genet. Networks are interesting because of their specific structural patterns, and how those structures affect the members of the network. Stated more simply, networks affect their members based on where those members are located in the networks.
A person who is connected to many other members of a network is likely to view the rest of the network quite differently from somebody who is relatively isolated from the other members. Network analysis provides many tools for viewing, analyzing, and assessing the locations of individual nodes and ties. This is often the first type of network analysis that is performed once network data are obtained, beyond simple network description. By examining the location of individual network members, we can assess the prominence of those members.
An actor is prominent if the ties of the actor make that actor visible to the other members in the network (Knoke and Burt). In the rest of this chapter, we will cover a number of the most common ways to assess network member prominence. For non-directed networks we will look at centrality, where we view a central actor as one who is involved in many direct or indirect ties.
For directed networks, prominence is usually referred to as prestige; a prestigious actor is one who is the object of extensive ties. This chapter will also cover how individual node-level measures of centrality and prestige can be aggregated into network-level centralization measures. An example of how to report the results of prominence analysis will be presented. Finally, there will be a short discussion of identifying cutpoints and bridges in networks. It makes intuitive sense that a network member who is connected to many other members of the network is in a prominent position.
For non-directed networks, we will say that this type of actor has high centrality, or that it is in a central position. However, there are a number of ways of operationalizing this type of prominence. In fact, there are dozens of centrality statistics available to the network analyst. To see how we can come up with different types of centrality measures, consider the example network displayed in Fig. Which node is most central? Nodes c and g are both positioned in the center of the graph, but as we learned in Chap. However, node c is directly connected to more network members than any other node, so in that sense we could view c as a central node.
Alternatively, node g does not have as many direct network ties, but it is positioned in such a way that it connects two different parts of the network. In particular, the only way that information from nodes h, i, and j gets to the rest of the network is through node g. Finally, even though node g is only directly connected to two other nodes, it is positioned so that it is fairly close to every other node in the network.
Specifically, node g can reach every other node in only one or two steps. That is, node g is connected to the rest of the network by paths of length one or two. So, in these two very different senses, node g can also be thought of as a central node. In the next three sections, we will cover the three most commonly used measures of centrality.
The simplest measure of centrality by far is based on the notion that a node that has more direct ties is more prominent than nodes with fewer or no ties. Degree centrality, thus, is simply the degree of each node. We first introduced node degree in Chap. The degree of a node is the number of ties it has with other nodes. Following the notation of Wasserman and Faust, degree centrality is defined as: $$C_D(n_i) = d(n_i)$$ where $d(n_i)$ is the degree of node $n_i$.
The network in Fig. However, here is how degree centrality can be calculated in statnet, assuming that we have the data stored in a network object called net. The first line of code simply reminds you of the names of the nodes and their order. The degree function calculates and returns the degree centrality scores for each node. The gmode option tells the function to treat the network object as a non-directed network graph. This option needs to be used, even if the network is created and stored as a non-directed network. The results confirm what we had already suggested above. Node c has the highest degree centrality.
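For readers without the example data at hand, degree centrality can also be computed directly from an adjacency matrix in base R. The ten-node network below is hypothetical; it is loosely inspired by the description in the text but is not the network shown in the figure, and its edge list is an assumption made for illustration.

```r
# Hypothetical ten-node, non-directed network (NOT the network in the figure).
nodes <- letters[1:10]   # "a" through "j"
edges <- rbind(c("a", "b"), c("a", "c"), c("b", "c"), c("c", "d"),
               c("c", "f"), c("d", "e"), c("e", "f"), c("c", "g"),
               c("g", "h"), c("h", "i"), c("h", "j"), c("i", "j"))

# Build a symmetric sociomatrix; a two-column character matrix can index
# a matrix by its row and column names.
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# For a non-directed network, degree centrality is the row sum.
deg <- rowSums(adj)
print(deg)
print(names(which.max(deg)))
```

With statnet loaded and the data held in a network object, the call described in the text, `degree(net, gmode = "graph")`, serves the same purpose.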
It is connected to five other nodes in the network, more than any other node. Instead of examining only the direct connections of the nodes, we can focus on how close each node is to every other node in a network. This leads to the concept of closeness centrality, where nodes are more prominent to the extent they are close to all other nodes in the network.
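Closeness centrality, then, is the inverse of the sum of all the distances between node i and all the other nodes in the network: $$C_C(n_i) = \left[\sum_{j=1}^{g} d(n_i, n_j)\right]^{-1}$$ where $d(n_i, n_j)$ is the length of the shortest path between nodes $n_i$ and $n_j$, and $g$ is the number of nodes in the network. Betweenness centrality instead measures the extent to which a node sits on the shortest paths between other pairs of nodes. A node with high betweenness is prominent, then, because that node is in a position to observe or control the flow of information in the network. The equation for betweenness centrality is $$C_B(n_i) = \sum_{j<k} \frac{g_{jk}(n_i)}{g_{jk}}$$ where $g_{jk}$ is the number of geodesics between nodes j and k, and $g_{jk}(n_i)$ is the number of those geodesics that pass through node i. A geodesic is the shortest path between two nodes. This shows that node c has the highest betweenness score, with nodes g and h not far behind. These quick examples show that different measures of centrality will emphasize different aspects of the prominence of nodes in a network.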
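To make the closeness definition concrete, here is a small base-R sketch that computes geodesic distances with the Floyd-Warshall algorithm and then inverts their row sums. The network is the same hypothetical ten-node example used for illustration elsewhere, not the book's figure, and the sketch assumes a connected, non-directed network.

```r
# Hypothetical ten-node, non-directed network (NOT the network in the figure).
nodes <- letters[1:10]
edges <- rbind(c("a", "b"), c("a", "c"), c("b", "c"), c("c", "d"),
               c("c", "f"), c("d", "e"), c("e", "f"), c("c", "g"),
               c("g", "h"), c("h", "i"), c("h", "j"), c("i", "j"))
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Geodesic (shortest-path) distances via Floyd-Warshall.
n   <- length(nodes)
geo <- ifelse(adj == 1, 1, Inf)   # direct ties have distance 1
diag(geo) <- 0
for (k in seq_len(n)) {
  # Allow paths that pass through node k if they are shorter.
  geo <- pmin(geo, outer(geo[, k], geo[k, ], "+"))
}

# Closeness: inverse of the summed distances to all other nodes.
closeness <- 1 / rowSums(geo)
print(round(closeness, 4))
```

In this made-up network node c has the smallest total distance to everyone else, so it has the highest closeness score.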
R can handle many different measures of centrality and prestige. See the accompanying table for a list of the measures currently included in the statnet and igraph packages (Table 7.). As we can see, R provides a wide variety of ways to examine the centrality and prestige of individual actors in a network.
The choice of which measure of centrality or prestige to use is driven in part by the type of network data you have; in particular, whether the network is directed or not. However, as suggested in Sect. That being said, it is also useful to keep in mind that in many real-world social networks there is a great deal of overlap in the various centrality and prestige measures.
Nodes that are identified as highly central using eigenvector centrality are also likely to be identified as central with other measures, especially those most closely related to eigenvector centrality e. We can illustrate this by showing the correlations among a set of centrality measures available in statnet applied to the DHHS Collaboration network.

Measures           statnet       igraph
Degree             degree        degree
Closeness          closeness     closeness
Betweenness        betweenness   betweenness
Eigenvector        evcent        evcent
Bonacich power     bonpow        bonpow
Flow betweenness   flowbet
Load               loadcent
Information        infocent
Stress             stresscent
Harary graph       graphcent
Bonacich alpha                   alpha.
Centrality and prestige are characteristics of nodes in a network, based on the position of the node in the overall network. The variability of the individual centrality scores in a network can be very informative. For example, consider the following two extreme examples: a star graph, and a circle graph (Fig.). In statnet, centralization is calculated using the centralization function.
The function accepts the name of an existing centrality or prestige function, and returns the appropriate network-level centralization score. Note that despite its name and the information presented in the help file for the function, centralization can be used for directed graphs. Using the star and circle graphs, we can see that every node has the same centrality score for the circle graph, leading to a minimum centralization score.
The star graph shows the opposite pattern, where there is high variability between the node-level centrality scores, leading to higher centralization scores. All centrality and prestige functions in statnet as well as igraph produce a vector of node-level scores, one for each actor in the network. Using the Bali terrorist network, we can see that centrality varies widely across the network members. These scores can be examined individually, but for both analysis and reporting, it is usually more informative to examine patterns of prominence across nodes, across different prominence measures, and even across different networks.
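The star and circle comparison can be reproduced with a few lines of base R. The sketch below uses Freeman's degree centralization for a non-directed network, that is, the summed differences between the largest degree and every node's degree, divided by the maximum possible sum, (n - 1)(n - 2). The helper names are hypothetical, and this is an illustration of the idea rather than the statnet implementation.

```r
# Freeman degree centralization for a non-directed network,
# computed from a symmetric 0/1 adjacency matrix.
degree_centralization <- function(adj) {
  n   <- nrow(adj)
  deg <- rowSums(adj)
  sum(max(deg) - deg) / ((n - 1) * (n - 2))
}

# Star graph: node 1 is tied to every other node.
make_star <- function(n) {
  adj <- matrix(0, n, n)
  adj[1, -1] <- 1
  adj[-1, 1] <- 1
  adj
}

# Circle graph: each node is tied to the next, and the last to the first.
make_circle <- function(n) {
  adj <- matrix(0, n, n)
  idx <- cbind(1:n, c(2:n, 1))
  adj[idx] <- 1
  adj[idx[, 2:1]] <- 1
  adj
}

star_score   <- degree_centralization(make_star(6))
circle_score <- degree_centralization(make_circle(6))
print(c(star = star_score, circle = circle_score))
```

The star graph reaches the maximum possible score because all of the variability is concentrated in one node, while the circle graph scores zero because every node has the same degree.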
If the network is small enough, it can be useful to examine the individual node-level prominence scores. [Table 7.: degree, closeness, and betweenness scores for individual network members, including Muklas, Rauf, Idris, Arnasan, and Samudra; values not recoverable.] This can be non-network information such as age or weight. More useful here is to use information from the network itself; in this case the centrality scores for each node.
This can easily be done using the network plotting options. In fact, only one additional parameter vertex. This parameter can be a constant, in which case it simply controls the overall size of each vertex in the graph. However, you can also pass it a vector of numeric scores. All of the node-level prominence measures return a numeric vector, so that is what can be used to scale node size based on centrality or prestige.
The only tricky issue is that R reads the raw numbers passed to vertex. Typically, you will need to play around with some type of scaling factor to ensure that the graphic is interpretable. The following two figures illustrate this. This rescales the raw degree scores so that they all fall between 0 and 1.
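One way to experiment with scaling factors is to map the raw scores onto a fixed range of plotting sizes. The helper below is a hypothetical base-R sketch; it assumes that the size parameter discussed above is a vertex-size argument such as `vertex.cex` in the network plotting functions, and the degree scores and the chosen size range are made up.

```r
# Map raw prominence scores onto a chosen range of vertex sizes.
# 'low' and 'high' are arbitrary plotting sizes picked for illustration.
rescale <- function(x, low = 0.5, high = 3) {
  rng <- range(x)
  if (diff(rng) == 0) {
    # All scores are identical: use a single mid-range size.
    return(rep((low + high) / 2, length(x)))
  }
  low + (x - rng[1]) / diff(rng) * (high - low)
}

# Hypothetical raw degree scores.
deg   <- c(a = 2, b = 2, c = 5, g = 2, h = 3)
sizes <- rescale(deg)
print(round(sizes, 2))
```

The resulting vector could then be passed as the vertex-size argument of the plotting call, so that the smallest nodes stay visible and the largest do not swamp the figure.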
Douglas A. Luke-A User's Guide to Network Analysis in R-Springer (2015)
The first figure shows that the normalized degree scores are too small. The second graph uses the same information, but the vertex. A network graphic that includes node-level prominence information can be an effective analysis and communication tool. The overall structure of the network can be made clear, as well as the importance of individual positions. There are two additional concepts from graph theory that can be useful tools when assessing locational properties of individual nodes or ties.
The first is a cutpoint, which is defined as a node that, if dropped, would increase the number of components in the network. In many types of networks cutpoints thus occupy important positions connecting different parts of the network. If they were dropped, that would result in two subsets of actors that would not be able to communicate with each other (Fig.). You can use the cutpoints function in statnet to quickly identify any cutpoints in a network.
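The definition of a cutpoint translates directly into code: drop each node in turn and check whether the number of components goes up. The following base-R sketch does this for a hypothetical ten-node network (not the one in the figure); the helper names `count_components` and `find_cutpoints` are made up for illustration and are not statnet functions.

```r
# Hypothetical ten-node, non-directed network (NOT the network in the figure).
nodes <- letters[1:10]
edges <- rbind(c("a", "b"), c("a", "c"), c("b", "c"), c("c", "d"),
               c("c", "f"), c("d", "e"), c("e", "f"), c("c", "g"),
               c("g", "h"), c("h", "i"), c("h", "j"), c("i", "j"))
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Count components: repeatedly square the "reachable in one step or zero
# steps" matrix; nodes in the same component end up with identical rows.
count_components <- function(adj) {
  n     <- nrow(adj)
  reach <- diag(n) + (adj > 0)
  for (i in seq_len(n)) reach <- (reach %*% reach > 0) * 1
  nrow(unique(reach))
}

# A node is a cutpoint if removing it increases the component count.
find_cutpoints <- function(adj) {
  base   <- count_components(adj)
  is_cut <- vapply(seq_len(nrow(adj)), function(i) {
    count_components(adj[-i, -i, drop = FALSE]) > base
  }, logical(1))
  rownames(adj)[is_cut]
}

cut_nodes <- find_cutpoints(adj)
print(cut_nodes)
```

This brute-force approach is fine for small networks; for large ones a dedicated function such as the statnet one mentioned in the text is the better choice.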
So, in addition to the two central nodes c and g we had identified earlier, we can see that h is also a cutpoint. Although simple to see in this example, we can confirm the nodes as cutpoints in a few different ways (Fig.). Bridges are the edge equivalent to cutpoints. That is, an edge is a bridge if removing it will split one component into two. There is no bridge identification function built into statnet, but it is relatively easy to create a function that will detect bridges.
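A bridge detector of this kind can be sketched in base R. Unlike the statnet-based function the text goes on to describe, this hypothetical version takes a two-column edge list plus a vector of node names, and it handles non-directed networks only; like that function, it returns a logical vector with one entry per tie. The network and the helper names are assumptions made for illustration.

```r
# Hypothetical ten-node, non-directed network (NOT the network in the figure).
nodes <- letters[1:10]
edges <- rbind(c("a", "b"), c("a", "c"), c("b", "c"), c("c", "d"),
               c("c", "f"), c("d", "e"), c("e", "f"), c("c", "g"),
               c("g", "h"), c("h", "i"), c("h", "j"), c("i", "j"))

# Build a symmetric adjacency matrix from an edge list.
build_adj <- function(el, nodes) {
  adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
  adj[el] <- 1
  adj[el[, 2:1, drop = FALSE]] <- 1
  adj
}

# Count components via repeated squaring of the reachability matrix.
count_components <- function(adj) {
  n     <- nrow(adj)
  reach <- diag(n) + (adj > 0)
  for (i in seq_len(n)) reach <- (reach %*% reach > 0) * 1
  nrow(unique(reach))
}

# A tie is a bridge if deleting it increases the component count.
find_bridges <- function(el, nodes) {
  base <- count_components(build_adj(el, nodes))
  vapply(seq_len(nrow(el)), function(i) {
    count_components(build_adj(el[-i, , drop = FALSE], nodes)) > base
  }, logical(1))
}

is_bridge <- find_bridges(edges, nodes)
print(edges[is_bridge, , drop = FALSE])
```

In this made-up network only the c-g and g-h ties are bridges, because every other tie sits on a cycle.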
This function takes a statnet directed or non-directed network, and examines each tie to see if removing it changes the component count. A logical vector with length equal to the number of ties is returned indicating which ties are bridges. This shows us that there are three ties that are bridges in the example network. We can also use the bridges function similarly to the cutpoints function in a graphic to display which edges are bridges (Fig.). Our young people are faced by a series of different groups which believe different things and advocate different practices, and to each of which some trusted friend or relative may belong.
Margaret Mead. The social systems contained in networks often exhibit complex structures. For example, in his classic The strength of weak ties, Granovetter suggested that many social networks are made up of relatively densely connected subgroups e. It then follows that it will be important to be able to define and identify such subgroups.
Many disciplines have theories that assume that larger social systems are made up of distinguishable subgroups; for example, sociologists consider social classes, psychologists examine small group behavior, and public health researchers examine health disparities between different social groups. This chapter covers a number of techniques available within R to identify and examine subgroups that may be contained in larger social networks. The igraph package is used extensively in this chapter, because of the depth of its coverage of subgroup and community detection techniques.
At times, it may not be necessary to use specific subgroup techniques. Here, it is self-evident that the network is made up of two primary groups, even if we did not know beforehand that this depicts a primary school class. However, in most real-world social networks the subgroup structure is not as clear, if it even exists at all. Figure 8. The color coding and labels suggest that there may be some cohesion among members from the same DHHS agency, but it is not crystal clear. One way to think about network subgroups is through social cohesion.
Cohesive subgroups are sets of actors that are tied together through frequent, strong, and direct ties (Wasserman and Faust). This approach is so intuitive that it led to a number of the earliest techniques for identifying network subgroups. Cliques are one of the simplest types of cohesive subgroups, and because of their straightforward definition are also one of the easiest types to understand. A clique is a maximally complete subgraph; that is, it is a subset of nodes that have all possible ties among them.
Consider the example graph in Fig. Technically, connected dyads also are cliques, but typically only cliques of size 3 or larger are of interest. Also, by definition any clique of size k will also contain all the cliques sized k-1, k-2, etc. The following commands demonstrate how to get information about any cliques in a network. Despite what the name suggests, clique. To get a list of all the cliques, constrained by a minimum or maximum size, use cliques.
When there are a large number of cliques in a network, maximal. Finally, as the name suggests, largest. Note that the latter three functions return lists of vertex ids. When the igraph object has vertex names, the following syntax shows how names rather than ids can be displayed. Cliques, however, have two major disadvantages that reduce their utility in real-world social network analysis. First, a clique is a very conservative definition of a cohesive subgroup. Consider a subgraph made up of seven vertices.
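The "all possible ties" requirement is easy to check numerically. The base-R sketch below counts the ties a subgraph of k vertices needs, k(k - 1)/2, and tests whether a set of nodes is complete. The helper `is_clique` is hypothetical and checks completeness only, not whether the set is maximal; it is not one of the igraph clique functions.

```r
# TRUE if every pair of 'members' is tied in the symmetric matrix 'adj'.
# Note: this checks completeness only, not maximality.
is_clique <- function(adj, members) {
  sub <- adj[members, members, drop = FALSE]
  k   <- length(members)
  sum(sub[upper.tri(sub)] > 0) == choose(k, 2)
}

# Number of possible ties among seven vertices.
ties_needed <- choose(7, 2)

# A complete graph on seven vertices.
k7 <- matrix(1, 7, 7)
diag(k7) <- 0

# The same graph with a single tie removed.
k7_minus <- k7
k7_minus[1, 2] <- 0
k7_minus[2, 1] <- 0

print(ties_needed)
print(is_clique(k7, 1:7))
print(is_clique(k7_minus, 1:7))
print(is_clique(k7_minus, 2:7))
```

Removing just one tie is enough for the seven vertices to stop qualifying as a clique, even though the six remaining fully tied vertices still form one.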
To be a clique, all of the 21 possible ties must exist between all seven members. A consequence of this fragility is the second major issue of cliques: they simply are not very common in larger social networks. Table 8. Four random networks were created with 25, 50, , and nodes. For each network, the average degree was constrained to approximately 6. With this information it is possible to calculate a similarity coefficient, such as the Jaccard Index.
In case of partitioning results, the Jaccard Index measures how frequently pairs of items are joined together in two clustering data sets and how often pairs are observed only in one set. These indices also consider the number of pairs d that are not joined together in any of the clusters in both sets.
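The pair-counting idea can be sketched in base R. For every pair of items, the code below records whether the pair is joined in both clusterings, in only one of them, or in neither, and then computes the Jaccard index from the first three counts. The Rand index is included as one example of an index that also uses the pairs joined in neither set. The function name and the two example clusterings are hypothetical.

```r
# Compare two clusterings of the same items by counting item pairs.
pair_similarity <- function(cl1, cl2) {
  stopifnot(length(cl1) == length(cl2))
  same1 <- outer(cl1, cl1, "==")   # pair joined in clustering 1?
  same2 <- outer(cl2, cl2, "==")   # pair joined in clustering 2?
  up    <- upper.tri(same1)        # count each pair once

  n_both    <- sum(same1[up] & same2[up])
  n_only1   <- sum(same1[up] & !same2[up])
  n_only2   <- sum(!same1[up] & same2[up])
  n_neither <- sum(!same1[up] & !same2[up])

  c(both = n_both, only1 = n_only1, only2 = n_only2, neither = n_neither,
    jaccard = n_both / (n_both + n_only1 + n_only2),
    rand    = (n_both + n_neither) /
              (n_both + n_only1 + n_only2 + n_neither))
}

# Two hypothetical clusterings of six items.
cl1 <- c(1, 1, 1, 2, 2, 2)
cl2 <- c(1, 1, 2, 2, 2, 2)
res <- pair_similarity(cl1, cl2)
print(round(res, 3))
```

Because the Jaccard index ignores the pairs that are separated in both clusterings, it is not inflated when there are many small clusters, whereas the Rand index is.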
A variety of alternative similarity coefficients can be considered for comparing clustering results. An overview of available methods is given on this cluster validity page. In addition, the Consense library contains a variety of functions for comparing cluster sets, and the mclust02 library contains an implementation of the variation of information criterion described by M. Meila (J Mult Anal 98).
The associated Bioconductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, next generation sequence and genome analysis. The R software is free and runs on all common operating systems. This R tutorial provides a condensed introduction to the usage of the R environment and its utilities for general data analysis and clustering.
It also introduces a subset of packages from the Bioconductor project. Many packages were chosen, because the author uses them often for his own teaching and research. To obtain a broad overview of available R packages, it is strongly recommended to consult the official Bioconductor and R project sites. Due to the rapid development of most packages, it is also important to be aware that this manual will often not be fully up-to-date.
Because of this and many other reasons, it is absolutely critical to use the original documentation of each package PDF manual or vignette as primary source of documentation. Users are welcome to send suggestions for improving this manual directly to its author. In this format all commands are represented in code boxes, where the comments are given in blue color. To save space, often several commands are concatenated on one line and separated with a semicolon ' ; '. This way several commands can be pasted with their comment text into the R console to demo the different functions and analysis steps.
Windows users can simply ignore them. Commands highlighted in red color are considered essential knowledge. They are important for someone interested in a quick start with R and Bioconductor. Where relevant, the output generated by R is given in green color. Both of them work the same way and in both directions. For consistency reasons one should use only one of them. R Startup Behavior The R environment is controlled by hidden files in the startup directory:.
Rhistory and. Rprofile optional. The link 'Packages' provides a list of all installed packages. After initiating 'start. The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists. Basics on Functions and Packages. R for loading into R IDE e. RData' when exiting R and the workspace is saved. Removes objects. This is sometimes useful to clean up memory allocations after deleting large objects. More details on this topic can be found here. This option is intended to support programs which use R to compute results for them.
The output file lists the commands from the script file and their outputs. Rout' is appended to outfile. R', then nothing will be saved in the. Rdata file which can get often very large. Remember, single escapes e. If the 'header' argument is set to FALSE, then the first line of the data set will not be used as column titles.
Export to files write. It writes the data of an R data frame object into the clipboard from where it can be pasted into other applications. The argument 'col. Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Subsequent exports to the same file will arrange several tables in one HTML document. This library is usually not installed by default. Data and Object Types.
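The export and batch-import steps mentioned above can be sketched as follows. This is a hypothetical example: it writes two small tab-delimited files to a temporary directory and then reads every file back in a for loop, using `assign` to name each data frame after its file. The file names and the use of the built-in `iris` and `mtcars` data are assumptions.

```r
# Write two small example tables to a temporary directory.
tmp <- tempfile("tables_")
dir.create(tmp)
write.table(iris[1:3, ], file.path(tmp, "iris_part.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(mtcars[1:4, 1:3], file.path(tmp, "cars_part.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# First, collect the file names; second, import the files one by one,
# assigning each data frame to an object named after its file.
files <- list.files(tmp, pattern = "\\.txt$")
for (f in files) {
  obj_name <- sub("\\.txt$", "", f)
  assign(obj_name, read.delim(file.path(tmp, f), header = TRUE))
}

print(files)
print(iris_part)
```

Setting `header = FALSE` in the import call would instead treat the first line of each file as data, as the text notes.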
Assigning values to object components. Calculations [ Function Index ] Four basic arithmetic functions: addition, subtraction, multiplication and division. A list of the basic R functions can be found on the function and variable index page. Iterative calculations. With the argument setting '1', row-wise iterations are performed and with '2' column-wise iterations.
Generates the same result as 'sqrt x '. Regular expressions R's regular expression utilities work similarly to those in other languages. Vectors are ordered collections of 'atomic' same data type components or modes of the following four types: numeric, character, complex and logical. Missing values are indicated by 'NA'. R inserts them automatically in blank fields. The sort function sorts the items by size.
The rev function reverses the order. The order function is usually the one that needs to be used for sorting complex objects, such as data frames or lists. The resulting logical vector can be used for the actual subsetting step of vectors and data frames. Factors are vector objects that contain grouping classification information of its components. Appending arrays and matrices cbind matrix1, matrix2 Appends columns of matrices with same number of rows. Data Frames Data frames are two dimensional data objects that are composed of rows and columns. They are very similar to matrices.
The main difference is that data frames can store different data types, whereas matrices allow only one data type e. These names need to be unique. By adding a "-" sign one can reverse the sort order. This syntax returns for duplicates only the index of their first occurrence. To return all, use the following syntax. This returns all occurrences of duplicates. The results are returned as vectors. In this example, they are appended to the original data frame with the data.
The argument '1' in the apply function specifies row-wise calculations. If '2' is selected, then the calculations are performed column-wise. First, an example matrix 'x' is created. However, this will be very slow for data frames with millions of rows. This approach is about times faster than the loop-based alternatives: sd t myDF or apply myDF, 1, sd. To work around this limitation, one can replace the NA fields with a value that doesn't affect the result, e.
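The row-wise and column-wise behavior of `apply` is easy to see on a small matrix. The sketch below also shows one vectorized way to compute row standard deviations without looping over rows; the specific fast approach the text refers to is not recoverable, so this formula is offered as an assumption about what such an alternative can look like.

```r
# A small example matrix with named rows and columns.
x <- matrix(1:12, nrow = 3,
            dimnames = list(paste0("r", 1:3), paste0("c", 1:4)))

# Margin 1 iterates over rows, margin 2 over columns.
row_means <- apply(x, 1, mean)
col_means <- apply(x, 2, mean)

# Row standard deviations, loop-based via apply ...
row_sd_loop <- apply(x, 1, sd)

# ... and vectorized: subtract each row's mean, square, sum by row,
# then divide by (number of columns - 1) and take the square root.
row_sd_vec <- sqrt(rowSums((x - rowMeans(x))^2) / (ncol(x) - 1))

print(row_means)
print(col_means)
print(rbind(loop = row_sd_loop, vectorized = row_sd_vec))
```

The vectorized version relies on `rowSums` and `rowMeans`, which avoid calling an R function once per row and therefore scale much better to data with very many rows.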
Reformatting data frames with reshape. Length Sepal. Width Petal. Length Petal. Length 5. Length 6. Width 3. Width 2. Species Sepal. In this example the list component names are prepended to the corresponding vectors. A much faster alternative is given in the data frame section. R" Imports the colAg function. The columns in the resulting object are named after the chosen aggregates. The following list provides an overview of some very useful plotting functions in R's base graphics. To get familiar with their usage, it is recommended to carefully read their help documentation with?
The environment greatly simplifies many complicated high-level plotting tasks, such as automatically arranging complex graphical features in one or several plots. The syntax of the package is similar to R's base graphics; however, high-level lattice functions return an object of class "trellis", that can be either plotted directly or stored in an object. Important functions for accessing and changing global parameters are:? The environment streamlines many graphics routines for the user to generate with minimum effort complex multi-layered plots.
The ggplot function accepts two arguments: the data set to be plotted and the corresponding aesthetic mappings provided by the aes function. Additional plotting parameters such as geometric objects e. Their settings can be changed with the opts function. The following graphics sections demonstrate how to generate different types of plots first with R's base graphics device and then with the lattice and ggplot2 packages. A selection palette for 'pch' plotting symbols can be opened with the command 'example points '. As alternative, one can plot any character string by passing it on to 'pch', e.
Please consult the '? Scatter Plot Generated with Base Graphics. The argument as. Change plotting parameters show. Length, Sepal. The 'split. More details on this topic are provided in the 'Arranging Plots' section. A very nice line plot function for time series data is available in the Mfuzz library. Line Plot Generated with Base Graphics lattice. Scatter Plot Generated with lattice.
Scatter Plot Generated with ggplot2. The argument 'ncol' controls the number of columns that are used for printing the legend. Bar Plot Generated with lattice. C Customizing colors library RColorBrewer ; display. Wind Rose Pie Chart Generated with ggplot2. Several Heatmaps in One Plot Generated with lattice. The latter defines the height of each heatmap. R " Imports required functions. The Regular Intersect approach not compatible with Venn diagrams!
Their frequency is provided in the result. This could be any data type! With the current implementation, the computation time is about 0. OLlist[]; OLlist[] Returns the corresponding intersect matrix and complexity levels. More details on this function are provided in the Venn diagram section. This transformation can give reasonable results for sample sets with large size differences. Histogram Generated with ggplot2. The featureMap. R script plots simple feature maps of biological sequences based on provided mapping coordinates.
The usage of plotted values will connect the data points. More on this can be found in the documentation for 'par'. The last step sets the palette back to its default setting. In the second plot a modified palette is called the same way. The start and end values need to be between 0 and 1. The wider their distance the more diverse are the resulting colors. The col2rgb function can translate them into the RGB color code.
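A short base-R sketch of the color utilities discussed above: it generates a restricted rainbow palette, activates it, resets the palette to its default, and converts color names to RGB codes. The number of colors and the particular color names are arbitrary choices made for illustration.

```r
# Five colors from the first half of the rainbow; start and end must lie
# between 0 and 1, and a wider interval gives more diverse colors.
cols <- rainbow(5, start = 0, end = 0.5)

# Activate the modified palette, so that col = 1, 2, ... refer to 'cols'.
palette(cols)
n_active <- length(palette())

# Set the palette back to its default setting.
palette("default")

# Translate color names into RGB codes (rows: red, green, blue).
rgb_codes <- col2rgb(c("red", "steelblue"))

print(cols)
print(rgb_codes)
```

Resetting the palette at the end keeps later plots in the same session from silently inheriting the modified colors.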
System returns the corresponding x-y-coordinates after clicking on right mouse button. The actual image data are not written to the file until the 'dev.
The pdf and svg formats often provide the best image quality, since they scale to any size without pixelation. A much more detailed introduction to writing functions in R is available in the Programming in R section of this manual. The following exercises introduce a variety of useful data analysis utilities available in R. Import from spreadsheet programs e. Download the following molecular weight and subcellular targeting tables from the TAIR site, import the files into Excel and save them as tab delimited text files.
Check tables and rename gene ID columns. Problem 1: How can the merge function in the previous step be executed so that only the common rows among the two data frames are returned? Prove that both methods - the two step version with na. Subset the data frame accordingly and sort it by MW to check that your result is correct.
As an alternative approach, assign the second column to the row index of the data frame and then perform the same query again using the row index. Explain the difference of the two methods. Export data frame to Excel. R" Imports the venn diagram function. Problem 5: Generate two key lists each with 4 random sample sets. Compute their overlap counts and plot the results for both lists in one venn diagram. Problem 6: Write all commands from the previous exercises into an R script exerciseRbasics. R and execute it with the source function like this: source "exerciseRbasics.
This will execute all of the above commands and generates the corresponding output files in the current working directory. Programming in R This section of the manual is available on the Programming in R site. Bioconductor Introduction Bioconductor is an open source and open development software project for the analysis of genome data e.
This section of the manual provides a brief introduction to the usage and utilities of a subset of packages from the Bioconductor project. The included packages are a 'personal selection' of the author of this manual that does not reflect the full utility spectrum of the Bioconductor project. The introduced packages were chosen because the author uses them often for his own teaching and research.
To obtain a broad overview of available Bioconductor packages, it is strongly recommended to consult its official project site. Due to the rapid development of many packages, it is also important to be aware that this manual will often not be fully up-to-date. Because of this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as the primary source of documentation. Finding Help The instructions for installing BioConductor packages are available in the administrative section of this manual.
Documentation for Bioconductor packages can be found in the vignette of each package. A listing of the available packages is available on the Bioc Package page. Annotation libraries can be found here. Another valuable information resource is the Bioconductor Book. The basic R help functions provide additional information about packages and their functions: library affy Loads a particular package here affy package.
For large data sets use the more memory efficient justRMA function. The 'library gcrma ' needs to be loaded first. The 'library plier ' needs to be loaded first. To access them see below. Works for mas5, rma and gcrma. See HT-Seq manual for more details. See affyQCReport for details. A summary list and a plot are returned. The function 'call. One can also use here the "gcrma" method after loading it with the command 'library gcrma '. See description on page 5 of vignette "simpleaffy".
CEL files it also reads experiment layout from covdesc. The "get. Type "? Meaning of colors: red - all present, orange - all present in one group or the other, yellow - all that remain. Write the results into separate files. Create scatter plots for the filtered data sets and save them to external image files. Compare the differences between the three methods. Analysis of Differentially Expressed Genes. Limma Limma is a software package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression.
The package includes pre-processing capabilities for two-color spotted arrays. The differential expression methods apply to all array platforms and treat Affymetrix, single channel and two channel experiments in a unified way. The methods are described in Smyth and in the limma manual.
On Windows, simply install ActiveTcl. For a quick start, follow the instructions for the Estrogen data set. For a quick start, follow the instructions for the Swirl Zebrafish data set. Data objects in limma There are four main data objects created and used by limma:. Limma: Dual Color Arrays. Requirements Have all intensity data files in one directory. If an intensity data file format is not supported then one can specify the corresponding column names during the data import into R see below.
Type '? Targets file : Format of targets file. This file defines which RNA sample was hybridized to each channel of each array. Only for SPOT: separate gene list file Optional: spot type file for identifying special probes such as controls. This argument allows the import of a spot ID or annotation column. RG1, RG2 into one large object. Provided example with 'wt. The appropriate way of computing quality weights depends on the image analysis software.
The command '? QualityWeight' provides more information on available weight functions. The incorporation of the spot type information has the advantage that controls can be highlighted in plots or separated in certain analysis steps. The format of this file is specified in the limma pdf manual. The default background correction method is bc. Use bc. This is a graphical summary of the MA distribution between the arrays. Inconsistent spreads of hinges and whiskers between arrays can indicate normalization issues.
Usually one wants to base gene selection on the adj. Value rather than the t- or B-values. The MAS 5.