Monday, November 25, 2013

Streaming Systems history, part 2

Here I am again with some history on streaming systems! Today I would like to dive into systems born from 2005 to 2010. If you find any mistake please comment the post and let me know!

In this post I'm going to talk about the second generation of streaming systems: Borealis (2005), SPC (2006), DryadLINQ (2008) and S4 (2010).

Borealis is a distributed stream processing engine which inherits much from Aurora and Medusa. Aurora is a framework for monitoring application born around 2002. At the high level, the system model is made out of operators which receive and forward data thorough continuous queries. After Aurora was born, Aurora* was proposed, which is a version of Aurora with distribution and scalability. Borealis implements all the functionality of Aurora and is designed to support not only dynamic query modification and revision but also to be flexible an highly scalable.


Borealis can run Directed Acyclic Graphs (DAGs) as pipelines but not Arbitrary pipelines. A bachelor thesis (if I'm not mistaken) proposed a graphical user interface for Borealis that let users build pipelines using visual building blocks. Otherwise, one can implement a pipeline directly in C# (imperative programming). Borealis is the only one system in the second generation that supports not only deployment on clusters and the cloud, but also on pervasive devices (i.e. microcontrollers).
At runtime, it is flexible at node level to handle failures: it is capable of reinitialising a failed node while the stream is running, thanks to a replication mechanism that sends the data to the closest upstream replica. Load balancing is again performed with load shedding.

SPC stands for Stream Processing Core and is a middleware for distributed stream processing which targets data mining applications. It was built to support applications that extract data from multiple digital data streams. Topologies are composed by Processing Elements (which are what I call nodes) which implement application-defined operators that are connected by stream subscriptions.

Likewise Borealis, this system supports DAG as topology and has an imperative programming language as well as a graphical user interface. The deployment is only available on cluster of machines and the cloud. It is flexible at pipeline level, in the sense that it can change pipeline while the stream is flowing. The idea is to have the pipeline to be extended at runtime: new nodes can join the pipeline and connected to the stream while at runtime. Again like Borealis, it has replication to cope with faults and has no mean to cope with load balancing.

DyadLINQ imports a new programming model for distributed computing on large scale. It supports both general-purpose declarative and imperative operations on datasets thanks o a high-level programming language. A streaming application built with DryadLINQ is composed by LINQ expressions automatically translated by the compiler into a distributed execution plan (which DryadLINQ calls "job graph", and is a "topology") passed to the Dryad executing platform.

DryadLINQ supports DAG pipelines and both imperative and declarative programming models. The deployment is only targeting cluster of machines, while likewise SPC, it supports dynamic topology reconfiguration at runtime. Fault tolerance is indeed tackled with reconfiguration while load balancing is faced by a dynamic adaptive controller. More in details, DryadLINQ exploits some hooks in the Dryad API to mutate on runtime the pipeline to improve performance. Ideally performance is improved by aggregating nodes, thus decreasing I/O times.

Last but not least, we have S4 (Simple Scalable Streaming System), which is again a general-purpose distributed streaming platform developed by Yahoo! and inspired by the Actor model and MapReduce. It allows developers to program applications that process unbounded streams of data. Developers can set up pipelines with Processing Nodes that host Processing Elements. Each Processing Element (PE) is associated with a function and the type of event that it consumes.

Also S4, like almost all the others, supports DAGs as topology. The programming model is imperative and the deployment is possible on cluster of machines and the cloud. The pipeline is flexible at Node level, while for fault tolerance it supports reconfiguration (it can reconfigure a pipeline splitting the nodes deployed on a failed host among the remaining available execution resources) and for load balancing it again shows a dynamic adaptive controller.

And that's it. Hopefully I haven't made much mistakes.

See you next time with the last part!

Friday, November 15, 2013

Streaming Systems history, part 1

Lately I've been very busy studying streaming systems in general. I've took a look at the last decade streaming systems and what literature proposed. I want to share a little bit my findings summing up things here a little bit. I'm going to divide this series of posts in three, one for each generation of streaming systems. All the data should be more or less correct, but feel free to comment below and ask question of correct mistakes if there are! I would like to be the more precise as possible.

The first generation of streams proposes, among all, StreamIt (2002), Sawzall (2003) and CQL (2003).

StreamIt is a compilation infrastructure and programming language created to setup pipelines of streams. Users can create any kind of topology of the pipeline thanks to different kind of filters available. The filter structure available are:

  •  Pipeline: Let the user build a filter which has one input channel and one output channel, thus with a series of pipeline filters you can only build linear streaming pipelines.
  • SplitJoin: The first filter is a Split which has two output streams, then there can be different pipeline filters and at the end a Join filter which joins the work performed in parallel.
  • Feedback Loop: Let the user build a node which has two output streams, one that goes on in the pipeline and the other that feeds itself. Useful for tail recursive computations (i.e. Fibonacci).
The programming model is imperative, with a much Java/C++-like syntax. Pipelines built with StreamIt can be run on clusters of machines. To cope with load problems it has a mechanism of reinitialisation, where if a bottleneck is found, the pipeline is restarted with a different configuration to cope with the load changes.

Sawzall is a procedural domain-specific programming language which includes support for statistical aggregation of values read or computed from the input, very similar to Pig. Sawzall was developed by Google to process log data generated by Google's server. It processes one record at a time, and emits an output in tabular form. Sawzall is stateless, and thanks to MapReduce, each Sawzall program can run on multiple hosts (cluster of machines). 

Last but not least, CQL (Continuous Query Language) is an SQL based language for writing and maintaining continuous queries over a stream of data. Those queries are suitable for reactive and real-time programs; for example to keep an up to date view of data. It was implemented as a part of a project named STREAM.
Here is an example of a CQL query:

Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = "Alice"
    And O.customer = "Bob"

As you can see, CQL is much alike SQL, with the difference of having a timespan on the query (i.e. the day range in the example). This is because queries are performed over continuous streams of data, thus a time range has to be specified to get a result. In the example, the query selects the sum of the total costs of orders in one day, performed by Alice and bought by Bob.

And that's it for the first part of this streaming system history. I'm sorry about not writing something more about Sawzall, if you have some suggestions, comments or corrections please comment the post!
Bye!

Thursday, November 7, 2013

I'm still alive

Hi all, I'm sorry I couldn't post much later. I've been in vacation for a while, and then suddenly a big chunk of work arrived, and I had practically 0 time to write here. I'm not even programming anymore, just paper writing and teaching assisting. Hopefully I'll be back on track after the 15th, the last deadline for the last paper I wrote.
Until next time, see ya!