SDN solutions are complex, and troubleshooting and monitoring them is even harder. It seems that while we gain a better way to automate the network, we lose visibility and operability. For example, in order to troubleshoot an issue you have to understand the network in general, but also have a deep understanding of how the SDN solution implements it. And if you have multiple SDN solutions deployed, perhaps nested ones such as a container network inside VMs, finding the root cause of an issue becomes really hard.
In this context I will introduce a new project we started a few months ago at Red Hat, which aims to bring visibility and operability back to such environments. In this first post I will not describe in detail how it works; instead I will explain the overall design and how to set up a lab environment.
What is Skydive?
Skydive is a project that aims to collect, store and analyze the state of a network infrastructure and the flows going through it. Skydive is SDN-agnostic, which means it does not rely on any particular SDN solution but provides a way to gather information from SDN controllers.
So much for the definition; now let's see how it works.
Skydive is composed of two components:
- Skydive agents, which collect local topology information (interfaces, bridges, …) and capture traffic locally.
- Skydive analyzers, which collect and aggregate topology and flows from the agents. The analyzers can leverage information from SDN controllers such as Neutron or any other SDN solution.
Having this data collected and aggregated in one place allows us to:
- Find where packets are dropped
- Identify what kind of packets lead to issues
- Find the congestion points: bandwidth, number of sessions, etc.
- Get latency and RTT metrics
- Collect metrics that help with capacity planning and billing
The lab!
For this lab I will explain how to deploy an all-in-one node, that is, Agent + Analyzer. It will be easy to start another Agent later.
Let's begin by installing the Analyzer. Currently Skydive relies on ElasticSearch as its data store, so ElasticSearch needs to be deployed first.
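As a quick sketch, one way to get ElasticSearch running is with Docker. The image tag below is an assumption; adjust it to whatever version your Skydive release supports, or install ElasticSearch from your distribution packages instead:

```shell
# Hypothetical quick start: run ElasticSearch in a container if Docker
# is available (image tag is an assumption, adjust as needed).
if command -v docker >/dev/null 2>&1; then
    docker run -d --name skydive-es -p 9200:9200 elasticsearch:2.3 \
        || echo "could not start the container; check Docker and network access"
else
    echo "Docker not found; install ElasticSearch from your distribution packages instead"
fi
```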
Skydive makes use of Open vSwitch and sFlow for flow capture, a limitation that we will remove soon. For this lab we therefore need an up-and-running OVS with an ovsdb-server listening on a TCP port:
$ sudo ovs-appctl -t ovsdb-server ovsdb-server/add-remote ptcp:6400
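To make sure the TCP listener is actually up, you can point ovs-vsctl at the socket we just opened; an answer (even an empty bridge list) confirms ovsdb is reachable:

```shell
# Query the bridge list through the TCP socket opened above;
# guarded so it fails soft when OVS is not installed or not listening.
sudo ovs-vsctl --db=tcp:127.0.0.1:6400 list-br 2>/dev/null \
    || echo "ovsdb not reachable on tcp:6400"
```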
Once you have ElasticSearch and Open vSwitch up and running, you are ready to download the Skydive binary:
$ wget https://github.com/redhat-cip/skydive/releases/download/v0.2.0/skydive
$ chmod +x skydive
You will just need a small config file to tell the agents where the Analyzer is running, which addresses to listen on, and which probes to use:
analyzer:
  listen: 0.0.0.0:8082

agent:
  listen: 0.0.0.0:8081
  analyzers: 127.0.0.1:8082
  topology:
    probes:
      - netns
      - netlink
      - ovsdb
  flow:
    probes:
      - ovssflow
$ ./skydive analyzer -c skydive.yml
$ sudo ./skydive agent -c skydive.yml
Once the Skydive components are started, you can check that the APIs are responding. Since the Agent and the Analyzer offer the same API and WebUI, you can check both:
$ curl http://localhost:8081/rpc/topology
$ curl http://localhost:8082/rpc/topology
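If you only care about liveness rather than the topology payload itself, a small loop over both ports checking the HTTP status code does the job (guarded so it degrades gracefully when Skydive is not running):

```shell
# Expect HTTP 200 from both endpoints once Agent and Analyzer are up.
for port in 8081 8082; do
    curl -s -o /dev/null -w "port $port -> HTTP %{http_code}\n" \
        "http://localhost:$port/rpc/topology" || echo "port $port -> no response"
done
```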
or by checking the WebUI.
Within the Skydive repository there is a script that can be used to build a test topology: two namespaces connected by an Open vSwitch bridge.
$ wget https://raw.githubusercontent.com/redhat-cip/skydive/master/scripts/simple.sh
$ chmod +x simple.sh
$ ./simple.sh start 192.168.0.1/24 192.168.0.2/24
Once this script is executed, the WebUI should look like this:
In order to fill ElasticSearch a bit and get some flows into the WebUI, we can do the classical ping test between the two namespaces created by the script. I'll let the more audacious of you test with netcat and beyond.
$ sudo ip netns exec vm1 ping 192.168.0.2
In the WebUI, moving the mouse pointer over the bridge named “br-int” shows the flows going through this bridge.
By checking the Skydive flows API we can see the flows that were captured, the interfaces involved and where they were captured:
$ curl http://localhost:8082/rpc/flows
Deploying a multi-node lab is quite easy: we just need to start another Agent after changing the Analyzer address in its configuration file. There is another script in the Skydive repository that creates two namespaces connected through a GRE tunnel.
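On a node that runs only an Agent, the agent section of skydive.yml would point at the Analyzer on the first node. A sketch, where the address is a placeholder for your Analyzer host:

```yaml
# Sketch only: point this agent at the remote Analyzer.
# 192.168.50.10 is a placeholder; use your Analyzer host's address.
agent:
  listen: 0.0.0.0:8081
  analyzers: 192.168.50.10:8082
```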
$ wget https://raw.githubusercontent.com/redhat-cip/skydive/master/scripts/multinode.sh
$ chmod +x multinode.sh
On the first node:
$ ./multinode.sh start 192.168.0.1/24 <tunnel endpoint IP>
On the second node:
$ ./multinode.sh start 192.168.0.2/24 <tunnel endpoint IP>
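A ping across the GRE tunnel should then generate cross-node flows. The namespace name "vm1" is an assumption based on the single-node script; check `ip netns list` for the names your script created:

```shell
# From the first node, ping the namespace on the second node through
# the GRE tunnel; guarded so it fails soft outside the lab environment.
if sudo -n ip netns list 2>/dev/null | grep -q '^vm1'; then
    sudo ip netns exec vm1 ping -c 3 192.168.0.2
else
    echo "namespace vm1 not found; run multinode.sh first"
fi
```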
Conclusion and Roadmap
While the project is young, Skydive can already be deployed for testing purposes. We will add more flow probes in the coming weeks in order to capture traffic outside of Open vSwitch. Neutron and Docker connectors are already available, and more external connectors will be added in order to enrich topology and flows with extra information.
I couldn't finish this post without saying that Skydive is an open source project under the Apache license, written in Go, and open to contributions.
We are hosted on Software Factory, leveraging the Gerrit contribution model made popular by OpenStack, and we have a mirror on GitHub. Patchsets welcome!