Introduction
This is a very basic introduction to Gephi. It assumes no prior knowledge and explains how to import the network dataset you receive from ARCH and perform basic transformations on it yourself. This introduction was written in May 2023 with Gephi 0.9.7.
Used in this tutorial:
- Dataset: Domain graph from the Art Galleries web archive collection
- Tools: Gephi
- Time: ~20 minutes to complete
Instructions
Get to know your data
- Locate the domain-graph.csv file in the ARCH workshop archive and open it with your preferred spreadsheet program (Excel, Calc, Numbers, Sheets, etc.).
- Take note of the four attributes included in each Domain graph extraction from ARCH. Each row in the spreadsheet represents the number of times that a selected site or page links to another web domain when it was collected for the archive:
- crawl_date: a timestamp representing when each link between domains was collected.
- source: each domain that hosts web content selected for the collection.
- target: each host domain to which a source domain in the collection above links.
- count: the sum of links collected from each source to each target host domain.
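If you would rather check these four attributes outside of a spreadsheet program, the following is a minimal Python sketch of the same inspection. It assumes the file has a header row with the column names listed above; the sample rows and domain names are made up for illustration and do not come from the Art Galleries collection.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the domain graph layout (values are invented).
SAMPLE_CSV = """crawl_date,source,target,count
20230115083000,example-gallery.org,example-museum.org,12
20230115083000,example-gallery.org,example-social.com,40
20230220101500,example-museum.org,example-social.com,7
"""

def summarize_domain_graph(handle):
    """Return the column names, the row count, and the total links per target."""
    reader = csv.DictReader(handle)
    links_per_target = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        # "count" is read as text, so convert it before adding it up.
        links_per_target[row["target"]] += int(row["count"])
    return reader.fieldnames, row_count, links_per_target

columns, rows, per_target = summarize_domain_graph(io.StringIO(SAMPLE_CSV))
print(columns)
print(rows)
print(per_target.most_common(2))
```

To try it on your own download, replace io.StringIO(SAMPLE_CSV) with an open file handle for the CSV file on your computer.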
Create a network graph
Importing your data
- Download and install the Gephi platform: https://gephi.org/.
- Open Gephi and start a "New Project."
- Under the "File" menu, select "Import Spreadsheet" and choose the domain graph CSV file from your local storage.
- Set the "Separator:" value to "Comma" and "Import as:" to "Edges table." Then click the "Next >" button.
- In the "Import settings (2 of 2)" dialog, set the "Time representation:" value to "Intervals" and leave the remaining settings at their defaults. Then click the "Finish" button.
- In the "Import report" dialog, set the "Edges merge strategy:" value to "Don't merge." Then click the "OK" button.
Setting up the dates in your data
- Ensure that Gephi can recognize the dates found in this file so you can explore the graph dynamically. First, click on the "Data Laboratory" tab at the top-left, followed by the "Edges" sub-navigational tab. Then click on the "Merge columns" button at the bottom.
- Select both the "Interval" and "crawl_date" options from the left column and click on the right-arrow (→) button to mark them for merging. Set the "Merge strategy" value to "Create time interval" and click the "OK" button.
- Set the "Start time column" and "End time column" values both to "crawl_date" and select the option to parse dates in the yyyyMMddHHmmss format.
- You will now see a timeline at the bottom of the screen that you can click to enable. You may alternatively open the timeline anytime later by opening the "Window" menu and selecting "Timeline."
- Click on the "time options" button (which appears as a cog icon) at the bottom-left and select "set time format," followed by "datetime."
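To see what the yyyyMMddHHmmss format means in practice, here is a small Python sketch that parses a timestamp of that shape. The sample value is invented, and the function names are hypothetical helpers rather than anything Gephi provides; the sketch only mirrors the idea of using the same crawl date as both the start and the end of an interval.

```python
from datetime import datetime

# Python's strptime pattern for a yyyyMMddHHmmss timestamp.
CRAWL_DATE_FORMAT = "%Y%m%d%H%M%S"

def parse_crawl_date(value):
    """Turn a 14-digit timestamp string into a datetime object."""
    return datetime.strptime(value, CRAWL_DATE_FORMAT)

def to_interval(value):
    """Use one crawl date as both the start and the end of an interval."""
    moment = parse_crawl_date(value)
    return (moment, moment)

sample = "20230115083000"  # invented example value
start, end = to_interval(sample)
print(start.isoformat(), end.isoformat())
```

A value that does not match the pattern raises a ValueError, which can be a quick way to spot rows whose dates would not parse.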
Adding labels
- Now let's do a similar transformation to migrate the domain names over to each node. To begin, click on "Nodes" at the top-left, followed by the "Copy data to other column" button at the bottom.
- Select the "Id" option in the drop-down menu. In the dialog window, select the drop-down option to copy data from "Id" to "Label." Then click the "OK" button. You should now see the same values in both columns.
Basic graph layouts
- A basic layout is now available under the "Overview" tab. Let's make it more legible and descriptive, starting with the options in the "Layout" pane on the left:
- Select "Yifan Hu Proportional" from the drop-down menu of layout options, then click the "Run" button to see the layout applied to the graph in the center pane.
- Now let's add some labels to see the graph develop in a more meaningful way. Click on the "T" button below the graph in order to label each node with its domain name in our dataset.
- To make them more legible, resize the nodes (domains) based on how many other nodes in the diagram link to them. This is called "in-degree" in Gephi and is a common measure in the network analysis literature. From the "Appearance" window located in the left pane, click on the "size" icon, followed by "ranking," and then select the "In-Degree" option from the drop-down menu. Set the "Min size" to 3, the "Max size" to 40, and click "Apply."
- Now replicate these steps to adjust the size of each label. Click on the "text size" icon, followed by "ranking," and then select the "In-Degree" option. Set a "Min size" of 0.1, a "Max size" of 3, and click "Apply."
- Some of the labels now overlap, so let's apply an additional layout to clean them up. This time, select "Label Adjust" from the drop-down menu in the "Layout" pane and press the "Run" button.
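For a sense of what the in-degree ranking does, the sketch below counts how many distinct domains link to each domain and then maps those counts onto the 3 to 40 size range. It is an illustration only: the domain names are invented, the scaling is assumed to be a simple straight-line interpolation, and Gephi's own numbers may differ (for example, when repeated links from different crawl dates are kept as separate edges).

```python
# Hypothetical (source, target) pairs; the repeated pair stands for two crawl dates.
EDGES = [
    ("a.example", "c.example"),
    ("b.example", "c.example"),
    ("a.example", "c.example"),
    ("c.example", "a.example"),
]

def in_degree(edges):
    """Count the distinct domains that link to each domain."""
    linking_sources = {}
    for source, target in edges:
        # Register the source too, so domains with no inbound links appear with 0.
        linking_sources.setdefault(source, set())
        linking_sources.setdefault(target, set()).add(source)
    return {domain: len(sources) for domain, sources in linking_sources.items()}

def scale_size(value, low, high, min_size=3.0, max_size=40.0):
    """Map a value in [low, high] onto [min_size, max_size] along a straight line."""
    if high == low:
        return min_size
    return min_size + (value - low) / (high - low) * (max_size - min_size)

degrees = in_degree(EDGES)
low, high = min(degrees.values()), max(degrees.values())
sizes = {domain: scale_size(value, low, high) for domain, value in degrees.items()}
print(degrees)
print(sizes)
```

In this toy example the most linked-to domain receives the maximum size and the domain that nobody links to receives the minimum, which is the same visual effect you should see on the graph.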
Applying more algorithms
- Now let's run an algorithm to learn more about our graph. For demonstration, let's run a rudimentary community detection algorithm found in the "Statistics" pane on the right-hand side. Press the "Run" button next to the option labeled "Modularity" and click through the next report dialog.
- The final step is to apply the modularity categories to the graph. We will use color to denote the communities nodes appear in. Return to the "Appearance" pane and this time click on the painter's palette. Select "Partition" and then choose the "Modularity Class" option from the drop-down menu. Click the "Apply" button to color each node by its community.
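If you are curious what a modularity score measures, the following Python sketch computes it for a tiny, invented graph. It does not reproduce the algorithm Gephi runs; it only scores a grouping that you hand to it, and it assumes the links are undirected and unweighted. A higher score means more links fall inside the groups than would be expected by chance.

```python
from collections import defaultdict

# Hypothetical toy graph: two triangles joined by a single link.
EDGES = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]

def modularity(edges, community_of):
    """Modularity of a grouping, treating edges as undirected and unweighted."""
    m = len(edges)
    internal_edges = defaultdict(int)  # edges with both ends in the same community
    degree_sum = defaultdict(int)      # total degree of each community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(round(modularity(EDGES, two_groups), 4))
print(round(modularity(EDGES, one_group), 4))
```

Splitting the toy graph into its two triangles scores about 0.357, while putting every node into a single group scores 0, so the two-group split is the better description of this graph's communities.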
Congratulations! You now have a nicely laid-out graph. Next, try experimenting with other features in Gephi. There are several community-based and video tutorials available at: https://gephi.org/users/.
Troubleshooting
Increase memory
It is possible to run out of local memory available to Gephi on a personal computer like a laptop when importing or processing large web archive collection datasets. The relevant error message from Gephi looks like this:
The project file couldn't be opened. Please check the file has .gephi extension.
ArrayIndexOutOfBoundsException - 0
This error may cause the application to quit and/or corrupt the dataset file. Follow the instructions here to locate the local Gephi installation’s configuration file and increase the memory allocation manually in order to continue working with your dataset.