Honors Oral Exam

Exploring Sampling Techniques in Large Graphs and Networks

Anna Myakushina (University of Rochester)

Thursday, May 4th, 2023
3:00 PM - 3:50 PM
Hylan 203

Graphs are an important branch of data science that are essential to visualizing and representing relationships, particularly in spaces with many observations. Working with large graphs (10,000 nodes or more), however, can be extremely difficult- runtime increases dramatically, and results are often difficult to visualize and interpret due to how busy the resulting model is. For this reason, efficient and accurate sampling techniques are necessary to scale down data to a manageable magnitude. This study will explore sampling methods of three different classes: random selection techniques, sampling by exploration, and deletion methods. Various existing methods are introduced and explained, and the probability of node selection is calculated for each method in order to compare and discuss advantages and pitfalls of each method. In total 15 existing methods are explored and discussed, including Forest Fire Selection (FFS), a popular method utilizing graph traversal.

The last section of this study then proposes a new sampling technique, Spontaneous Forest Fire Sampling (SFFS), which has been modified from FFS. Both FFS and SFFS are modeled on a Facebook social network at three different forward burning probabilities (pf=0.2, 0.5, and 0.7) and four different sample sizes (5%, 10%, 15%, and 20% of the population). A modification to both FFS and SFFS to include more edges, named edge restoration, is also proposed and modeled. Finally, SFFS is tested separately at “jumping frequencies” of N=1, 2, 3, 5, and 7. In total, 80 samples are generated, modeled, and compared using Kolmogorov-Smirnov D-statistics. It is found that depending on sampling goals, either edge-restored SFFS or traditional FFS performed the best.

Event contact: jonathan dot pakianathan at rochester dot edu