Looking up data in P2P systems
The main challenge in P2P computing is to design and implement a robust and scalable distributed system composed of inexpensive, individually unreliable computers in unrelated administrative domains. The participants in a typical P2P system might include computers at homes, schools, and businesses, and can grow to several million concurrent participants.
Characteristics:
- The barriers to starting and growing such systems are low, since they usually don't require any special administrative or financial arrangements, unlike centralized facilities;
- P2P systems offer a way to aggregate and make use of the tremendous computation and storage resources on computers across the Internet.
Lookup problem: How do you find any given data item in a large P2P system in a scalable manner, without any centralized servers or hierarchy?
The lookup problem is simple to state: Given a data item X stored at some dynamic set of nodes in the system, find it. This problem is an important one in many distributed systems, and is the critical common problem in P2P systems. One approach is to maintain a central database that maps a file name to the locations of servers that store the file. Napster (www.napster.com) adopted this approach for song titles, but this approach has inherent scalability and resilience problems: the database is a central point of failure.
At one end of the symmetric lookup spectrum, the consumer broadcasts a message to all its neighbors with a request for X. When a node receives such a request, it checks its local database. If it contains X, it responds with the item. Otherwise, it forwards the request to its neighbors, which execute the same protocol. Gnutella has a protocol in this style with some mechanisms to avoid request loops. However, this "broadcast" approach doesn't scale well because of the bandwidth consumed by broadcast messages and the compute cycles consumed by the many nodes that must handle these messages.
One approach to handling such scaling problems is to add "superpeers" in a hierarchical structure, as is done in FastTrack's P2P platform, and has been popularized by applications like KaZaA.
Freenet uses an innovative symmetric lookup strategy. Here, queries are forwarded from node to node until the desired object is found based on unstructured routing tables dynamically built up using caching.
The recent crop of P2P algorithms, including CAN, Chord, Kademlia, Pastry, Tapestry, and Viceroy are both structured and symmetric, unlike all the other systems mentioned here. This allows them to offer guarantees while simultaneously not being vulnerable to individual node failures. They all implement the DHT abstraction.
Indexing and keyword search. These DHT algorithms retrieve data based on a unique identifier. In contrast, the widely deployed P2P file-sharing services are based on keyword search. While it is expected that distributed indexing and keyword lookup can be layered on top of the distributed hash model, it is an open question if indexing can be done efficiently.