In our previous blog, we discussed Horizontal Scaling and how scaling across multiple servers is a key feature of Cloud Computing with potential benefits for smart objects and smart networks. A concept that goes hand in hand with horizontal scaling is parallelization. With the advent of Cloud Computing, the scale at which parallelization can be implemented has changed. Parallelization can increase the throughput of software operations or reduce response time. On a single machine, Vertical Scaling can be used on symmetric multiprocessors by spawning multiple program threads.
However, as the Sun Microsystems White Paper on Cloud Computing Architecture points out, vertical scaling only has as much parallel processing capability as the server has processors (or cores), or at least as many cores as have been purchased and allocated to a particular Virtual Machine (VM). This ceiling is low because today's computing environments are shifting towards x86-architecture servers with two or four processor sockets (i.e. the physical slots on the motherboard that each hold one CPU). It is for this reason that parallelization should be considered on a more macro scale than our previous description: software that can use parallelization across many servers can scale to potentially thousands of servers. This offers far greater potential for scalability than was possible with symmetric multiprocessing.
In the traditional physical world of computing, parallelization has frequently been implemented using load balancers or content switches that distribute incoming requests from software programs across a number of servers. Similarly, parallelization in a cloud computing world can be implemented with a load balancing application or a content switch, except that incoming requests are distributed across a number of virtual machines. In both scenarios, applications can be designed to recruit additional resources to accommodate workload spikes.
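To make the idea of distributing requests concrete, here is a minimal sketch of the simplest distribution policy, round robin. The class name and the backend names ("vm-a" and so on) are our own inventions for illustration; a real load balancer or content switch would also track server health and load.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next backend in a fixed rotation."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(list(backends))

    def route(self, request):
        # Pair the request with whichever backend is next in line.
        return next(self._rotation), request


# Hypothetical pool of three virtual machines.
balancer = RoundRobinBalancer(["vm-a", "vm-b", "vm-c"])
assignments = [balancer.route(f"req-{i}")[0] for i in range(6)]
```

Six requests spread over three backends means each backend receives two of them, in rotation order.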
The classic example of parallelization with load balancing is a pool of stateless web servers (i.e. servers that treat each request as an independent transaction, unrelated to any other request) across which the incoming workload is distributed. Of course, there are many other ways to use parallelization in Cloud Computing environments. For example, a Cloud Computing application that uses a significant amount of CPU time to process user data might use a scheduler to receive jobs from users. The scheduler places the data into a repository, starts a new VM for each job and hands the VM a token that allows it to retrieve the data from the repository. When the VM has completed its task, it passes a token back to the scheduler, which the scheduler uses to return the completed result to the user, and the VM then terminates.
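The scheduler pattern can be sketched in a few lines. This is only an illustration under loud assumptions: threads stand in for the per-job VMs, an in-memory dictionary stands in for the repository, and the function names (store, worker, schedule) and the sum-of-squares "work" are ours, not part of any real cloud API.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

repository = {}  # token -> data; stands in for a shared data store


def store(data):
    """Place data in the repository and return a token that retrieves it."""
    token = uuid.uuid4().hex
    repository[token] = data
    return token


def worker(input_token):
    """Stand-in for a per-job VM: fetch the input, process it, store the result."""
    data = repository.pop(input_token)
    result = sum(x * x for x in data)  # placeholder for CPU-heavy work
    return store(result)  # the token handed back to the scheduler


def schedule(jobs):
    """Start one worker per job and collect each job's result via its token."""
    input_tokens = [store(job) for job in jobs]
    with ThreadPoolExecutor(max_workers=max(1, len(input_tokens))) as pool:
        result_tokens = list(pool.map(worker, input_tokens))
    return [repository.pop(token) for token in result_tokens]


results = schedule([[1, 2, 3], [4, 5], [6]])
```

Only tokens travel between the scheduler and the workers; the data itself moves through the repository, which is what lets the workers run on separate machines.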
Applications can be parallelized only to the extent that their data can be partitioned so that independent systems can operate on it in parallel. Any credible application architecture should include a plan for dividing and conquering data. How the data is partitioned also has a significant impact on the volume of data transferred over networks. There are several examples of parallelization that leverage data partitioning. We have previously discussed Hadoop (http://hadoop.apache.org). As noted previously, this is an implementation of the MapReduce design pattern, which is itself an implementation of the master/workers parallelization design pattern. Database sharding, which we also discussed previously, can be accomplished through a range of partitioning techniques including vertical partitioning (i.e. partitioning by database table column), range-based partitioning (e.g. by date) and directory-based partitioning (i.e. a lookup that maps each distinct domain to its partition). The approach taken really depends on how the data is to be used.
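As a rough sketch of two of these techniques, the functions below decide which shard a record belongs to. The shard names, date boundaries and domain names are made up for the example; they are assumptions, not taken from any particular database.

```python
from bisect import bisect_right
from datetime import date

# Range-based: each shard holds records dated before the matching boundary;
# the last shard takes everything from the final boundary onward.
BOUNDARIES = [date(2011, 1, 1), date(2012, 1, 1)]
RANGE_SHARDS = ["shard-2010-and-earlier", "shard-2011", "shard-2012-and-later"]


def range_shard(captured_on):
    """Pick a shard by finding which date range the record falls into."""
    return RANGE_SHARDS[bisect_right(BOUNDARIES, captured_on)]


# Directory-based: a lookup table maps each domain to its shard.
DIRECTORY = {"sensors": "shard-a", "meters": "shard-b"}


def directory_shard(domain):
    """Pick a shard by consulting the directory; unknown domains raise KeyError."""
    return DIRECTORY[domain]
```

The range scheme needs no lookup service but can leave shards unevenly loaded, while the directory scheme can be rebalanced by editing the table at the cost of an extra lookup on every request.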
Parallelization is also being used in the finance industry. Major financial institutions have refactored their fraud detection algorithms so that what was once a batch data-mining operation, in which patterns and trends were detected from large data sets, now runs on a large number of systems in parallel and provides real-time analysis of incoming data. Some High Performance Computing (HPC) applications that deal with three-dimensional data have been designed so that the state of one cubic volume of a gas, liquid or solid can be calculated for time t by one process. The state of that cube is then passed on to the parallel processes representing the adjoining cubes, and the state is calculated for time t+1.
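The cube-by-cube calculation can be illustrated with a toy grid. This sketch makes several assumptions of its own: each cube exchanges state only with the cubes that share a face with it (six at most), the update rule is a simple averaging we invented for the example, and threads stand in for what would be separate processes in a real HPC code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

N = 3  # cubes per axis in this toy grid
FACE_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def next_state(grid, cell):
    """Compute one cube's state at t+1 from its own and its neighbours' state at t."""
    x, y, z = cell
    neighbours = [
        grid[(x + dx, y + dy, z + dz)]
        for dx, dy, dz in FACE_OFFSETS
        if (x + dx, y + dy, z + dz) in grid  # cubes on the edge have fewer neighbours
    ]
    # Toy update rule: relax toward the mean of the neighbouring cubes.
    return 0.5 * grid[cell] + 0.5 * sum(neighbours) / len(neighbours)


def step(grid):
    """Advance every cube by one time step; each update reads only the old grid."""
    cells = list(grid)
    with ThreadPoolExecutor() as pool:
        values = pool.map(lambda c: next_state(grid, c), cells)
        return dict(zip(cells, values))


grid_t = {cell: 0.0 for cell in product(range(N), repeat=3)}
grid_t[(1, 1, 1)] = 6.0  # a single hot cube in the centre
grid_t1 = step(grid_t)
```

Because every cube's new state depends only on the previous time step, all of the updates within a step are independent and can run in parallel.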
The argument for the use of parallelization is therefore clear. The data management of smart objects and smart networks would also benefit from the adoption of a parallelization strategy as the volume of data and the conversion of that data into meaningful information may necessitate the use of parallelization techniques. The myriad of devices and the lack of standardization in packet formats and data transmission may lead to many different types of data packet listeners and data capture and interpretation software being needed. Consider the example of a system that captures data from a wireless sensor network (WSN) and a smart grid. The smart grid may transfer data to the system using 2.5G or 3G telecommunications while the WSN may transfer data using Zigbee. The packets would be in different formats, would contain different data and would require different software to capture and translate the packets. When one factors in the different Operating Systems (TinyOS, Contiki or indeed none in many cases) and Programming Languages (nesC, C++, Java among others) used, it is clear that bespoke software would be required for the different smart objects, be they sensors, smart meters, GPS readers or RFID tags. These data capture modules would ideally run in parallel so that data could be captured from these devices simultaneously thus providing a richer snapshot of the condition and activities taking place within the environment or infrastructure being monitored.
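A small sketch may help show what running capture modules in parallel could look like. Both packet layouts here are hypothetical formats we made up for the example (they are not the real Zigbee or smart grid formats), and the function names are ours.

```python
import struct
from concurrent.futures import ThreadPoolExecutor


def parse_wsn(packet):
    """Hypothetical binary sensor frame: node id (uint16), temperature in 0.01 C (int16)."""
    node_id, centi_c = struct.unpack(">Hh", packet)
    return {"source": "wsn", "id": node_id, "temperature_c": centi_c / 100}


def parse_meter(packet):
    """Hypothetical text smart-meter frame: b'meter=<id>;kwh=<reading>'."""
    fields = dict(part.split("=") for part in packet.decode("ascii").split(";"))
    return {"source": "grid", "id": int(fields["meter"]), "kwh": float(fields["kwh"])}


def capture(listeners):
    """Run each (parser, packets) listener in its own thread and merge the readings."""
    def run(listener):
        parser, packets = listener
        return [parser(p) for p in packets]

    with ThreadPoolExecutor(max_workers=max(1, len(listeners))) as pool:
        return [reading for batch in pool.map(run, listeners) for reading in batch]


readings = capture([
    (parse_wsn, [struct.pack(">Hh", 7, 2150)]),
    (parse_meter, [b"meter=42;kwh=3.5"]),
])
```

Each device type keeps its own bespoke parser, but all of the parsers emit readings in a common dictionary shape, so the rest of the system does not need to know which network a reading came from.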
Partitioning strategies could also play a key role, in conjunction with parallelization, in the data management of smart objects. Smart networks (or smart dust) could comprise tens of thousands of computing devices. By adopting a mechanism by which data is organised and partitioned by group, by location or by date captured, data could be distributed horizontally across the Cloud. Similarly, partitioning could be undertaken on a vertical basis, where database table columns are split logically.
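The difference between the two directions of partitioning is easiest to see on a handful of rows. The field names and sample readings below are invented for illustration only.

```python
from collections import defaultdict


def partition_horizontally(rows, key_fields):
    """Group whole rows into partitions keyed by the chosen fields."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[f] for f in key_fields)].append(row)
    return dict(partitions)


def partition_vertically(rows, id_field, column_groups):
    """Split each row's columns into separate tables that share the id column."""
    return [
        [{id_field: row[id_field], **{c: row[c] for c in cols}} for row in rows]
        for cols in column_groups
    ]


# Hypothetical sensor readings.
sample = [
    {"id": 1, "location": "dublin", "date": "2011-03-01", "temp": 11.5, "battery": 90},
    {"id": 2, "location": "cork", "date": "2011-03-01", "temp": 12.0, "battery": 75},
    {"id": 3, "location": "dublin", "date": "2011-03-01", "temp": 11.0, "battery": 60},
]

by_site = partition_horizontally(sample, ["location", "date"])
measurements, health = partition_vertically(sample, "id", [["temp"], ["battery"]])
```

Horizontal partitioning keeps every row intact and decides where it lives, whereas vertical partitioning keeps every row in every table but with only a subset of its columns.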
Like the other aspects of Cloud Computing that we have discussed in previous blogs, parallelization is a technique that is helping to make Cloud Computing an enabling technology for the data management of smart objects. Vertoda provides data management and middleware that can be used in the Cloud to organize and store smart object data. We are also developing a platform that will greatly enhance the ability to capture data from the myriad of smart objects and to manage this data in both Cloud and Enterprise Computing environments.