Academic Publications


Replication Management in Self-Organizing Clusters for Parallel Data-Intensive Applications [pdf] December 2004

Aziz Gulbeden

Abstract:
One way to improve availability and performance in distributed storage systems is to replicate the data on multiple storage providers. However, a number of challenges arise when a replication protocol is employed in a distributed storage system. Two of the major problems are how to maintain consistency among the replicas and how to propagate updates to the replicas effectively, with minimal degradation in overall system performance. This thesis studies the design and implementation of replication management in a self-organizing storage cluster called Sorrento, which targets data-intensive parallel applications with highly concurrent requests and low write-sharing patterns. We introduce a replication protocol combined with versioning, which performs replication lazily in the background and makes replication decisions based on the network loads of the storage providers. We evaluate the overhead that our replication scheme adds to the system and measure the gain one can achieve by using the network load to make replication decisions. The results show that the system can deliver high throughput by employing asynchronous replication. Additionally, making the replication decisions based on the network loads increases overall system performance by reducing the load imbalance in the system.
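As a rough illustration of load-aware replica placement, the sketch below picks the least-loaded storage providers that do not yet hold a copy. It is not taken from the thesis: the function name `pick_replica_targets`, the node names, and the idea of representing load as a single number per provider are all assumptions made for the example.

```python
def pick_replica_targets(loads, current_holders, count):
    """Return up to `count` providers with the lowest load that hold no replica yet.

    `loads` maps a provider name to a load figure (hypothetical metric);
    ties are broken by provider name so the result is deterministic.
    """
    candidates = [p for p in loads if p not in current_holders]
    return sorted(candidates, key=lambda p: (loads[p], p))[:count]


# Hypothetical load readings for four providers; node-d already holds the data.
loads = {"node-a": 0.82, "node-b": 0.15, "node-c": 0.40, "node-d": 0.05}
targets = pick_replica_targets(loads, current_holders={"node-d"}, count=2)
print(targets)
```

A background task could call such a function periodically, which matches the lazy, asynchronous flavor described in the abstract without blocking writers.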

 

A Self-Organizing Storage Cluster for Parallel Data-Intensive Applications [pdf] - Supercomputing, November 2004

Hong Tang, Aziz Gulbeden, Jingyu Zhou, William Strathearn, Tao Yang, and Lingkun Chu

Abstract:
Cluster-based storage systems are popular for data-intensive applications, and it is desirable yet challenging to provide incremental expansion and high availability while achieving scalability and strong consistency. This paper presents the design and implementation of a self-organizing storage cluster called Sorrento, which targets data-intensive workloads with highly parallel requests and low write-sharing patterns. Sorrento automatically adapts to storage node joins and departures, and the system can be configured and maintained incrementally without interrupting its normal operation. Data location information is distributed across storage nodes using consistent hashing, and the location protocol differentiates small and large data objects for access efficiency. Sorrento adopts versioning to achieve single-file serializability and replication consistency. In this paper, we present experimental results to demonstrate the features and performance of Sorrento using microbenchmarks, application benchmarks, and application trace replay.
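The consistent hashing mentioned in the abstract can be sketched generically as follows. This is a textbook hash ring, not Sorrento's actual location protocol: the class `HashRing`, the use of SHA-1, and the node and object names are assumptions for illustration, and the sketch omits virtual nodes and the small/large object distinction.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=()):
        self._points = []  # sorted list of (position, node name)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        bisect.insort(self._points, (_hash(node), node))

    def remove(self, node: str) -> None:
        self._points.remove((_hash(node), node))

    def locate(self, key: str) -> str:
        """Return the node at the first ring position at or after the key's hash."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]  # wrap around the ring


ring = HashRing(["n1", "n2", "n3", "n4"])
owner = ring.locate("obj-42")
```

The property that matters for node joins and departures is that removing one node only remaps the keys that node owned; every other key keeps its owner.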

 

PRISM: Indexing MultiDimensional Data in P2P Networks using Reference Vectors [pdf] - ACM Multimedia, Singapore, 2005

Ozgur Dogan Sahin, Aziz Gulbeden, Fatih Emekci, Divyakant Agrawal, Amr El Abbadi

Abstract:
Peer-to-peer (P2P) systems research has gained considerable attention recently with the increasing popularity of file sharing applications. Since these applications are used for sharing huge amounts of data, it is very important to efficiently locate the data of interest in such systems. However, these systems usually do not provide efficient search techniques. Existing systems offer only keyword search functionality through a centralized index or by query flooding. In this paper, we propose a scheme based on reference vectors for sharing multi-dimensional data in P2P systems. This scheme effectively supports a larger set of query operations (such as k-NN queries and content-based similarity search) than current systems, which generally support only exact key lookups and keyword searches. The basic idea is to store multiple replicas of an object’s index at different peers based on the distances between the object’s feature vector and the reference vectors. Later, when a query is posed, the system identifies the peers that are likely to store the index information about relevant objects using reference vectors. Thus the system is able to return accurate results by contacting a small fraction of the participating peers.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval Models; C.2.4 [Computer-Communication Networks]: Distributed Systems

General Terms: Algorithms, Design, Experimentation

Keywords: Peer-to-Peer Systems, Similarity Search, Reference Vectors
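The reference-vector idea can be illustrated with a toy, single-process sketch. It is not the PRISM algorithm as published: here an object's index entry is simply stored under its `k` nearest reference vectors, each bucket stands in for a peer, and the names `ToyIndex`, `nearest_refs`, and the sample points are all hypothetical.

```python
import math
from collections import defaultdict


def nearest_refs(vector, refs, k):
    """Indices of the k reference vectors closest to `vector` (Euclidean distance)."""
    order = sorted(range(len(refs)), key=lambda i: (math.dist(vector, refs[i]), i))
    return order[:k]


class ToyIndex:
    def __init__(self, refs, k=2):
        self.refs = refs
        self.k = k
        self.buckets = defaultdict(list)  # reference index -> [(object id, vector)]

    def publish(self, obj_id, vector):
        """Replicate the object's index entry under its k nearest references."""
        for r in nearest_refs(vector, self.refs, self.k):
            self.buckets[r].append((obj_id, vector))

    def query(self, vector, top):
        """Inspect only the buckets near the query, then rank candidates by distance."""
        seen = {}
        for r in nearest_refs(vector, self.refs, self.k):
            for obj_id, v in self.buckets.get(r, []):
                seen[obj_id] = math.dist(vector, v)
        return sorted(seen, key=lambda o: (seen[o], o))[:top]


refs = [(0, 0), (10, 0), (0, 10), (10, 10)]
index = ToyIndex(refs, k=2)
index.publish("a", (1, 2))
index.publish("b", (9, 2))
index.publish("c", (2, 7))
result = index.query((1, 3), top=2)
```

In this example the query touches two of the four buckets and never visits the ones holding only "b", which mirrors the abstract's point about contacting a small fraction of peers.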

 

Privacy Preserving Query Processing using Third Parties [pdf] - ICDE, April 2006

Fatih Emekci, Divyakant Agrawal, Amr El Abbadi, Aziz Gulbeden

Abstract:
Data integration from multiple autonomous data sources has emerged as an important practical problem. The key requirement for such data integration is that the owners of the data need to cooperate, in most cases within a competitive landscape. The research challenge in developing a query processing solution is that the answers to the queries need to be provided while preserving the privacy of the data sources. In general, allowing unrestricted read access to the whole data may create vulnerabilities and may have legal implications. Therefore, there is a need for privacy-preserving database operations for querying data residing at different parties. In this paper, we propose a new query processing technique using third parties in a peer-to-peer system. We propose and evaluate two different protocols for various database operations. Our scheme is able to answer queries without revealing any useful information to the data sources or to the third parties. Analytical comparison of the proposed approach with other recent proposals for privacy-preserving data integration establishes the superiority of the proposed approach in terms of query response times.