Becoming a data-driven organization sounds so simple. But fulfilling the vision of making smarter decisions takes more than simply providing analysts with tools.
In this series of blog posts, I shall show you how to build an end-to-end data pipeline. The pipeline captures information at source, then processes and analyses it according to business need. I shall configure the pipeline using cloud services as a business user, thereby removing the dependency on traditional IT infrastructure and on the setup and maintenance of data platforms.
My pipeline is made up of three main elements today, though I have plans to augment it with more complex processing using additional cloud services.
Ships are required to broadcast their positions using the Automatic Identification System (AIS). Messages are picked up by a beacon in southern England. This AIS data is processed at the edge by an MQTT broker and published by topic. This is the data source for my demonstration.
- The first step is to pick up the AIS JSON data feed published by the MQTT broker and process it for ingest into a database. I have done this without writing any code. I configured nodes in Node-RED to construct a flow, which inserts the AIS data into a Cloudant database. I used the Node-RED boilerplate service in the IBM Cloud, which includes a bound Cloudant service.
- Secondly, I used the catalog service in the Watson Data Platform on the IBM Cloud to create a connection and a data asset in my catalog. The catalog allows me to describe and share data assets so that they are easy for people to find and use subject to access controls and governance policies.
- Finally, I accessed the catalog from the Data Science Experience (DSX) to give my Jupyter Notebook access to my database. DSX provisions a Spark instance automatically for my analysis, which is to plot the positions of ships on a graph.
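The flow itself needs no code, but it may help to see roughly what the three steps do to the data. The following is only a minimal sketch in plain Python, not the Node-RED flow or the notebook from this series. The message field names (`mmsi`, `lat`, `lon`, `timestamp`), the sample values and the helper functions `to_document` and `latest_positions` are all assumptions made up for illustration; the real AIS feed may be structured differently.

```python
import json

# Made-up sample messages standing in for the AIS JSON feed.
# Field names and values are assumptions, not the real feed format.
SAMPLE_MESSAGES = [
    '{"mmsi": 235000001, "lat": 50.80, "lon": -1.10, "timestamp": "t1"}',
    '{"mmsi": 235000002, "lat": 50.75, "lon": -1.30, "timestamp": "t2"}',
    '{"mmsi": 235000001, "lat": 50.81, "lon": -1.12, "timestamp": "t3"}',
    'not valid json',
]


def to_document(raw):
    """Turn one raw message into a dict suitable for a document database.

    Returns None when the message cannot be parsed or has no usable position.
    """
    try:
        msg = json.loads(raw)
        lat = float(msg["lat"])
        lon = float(msg["lon"])
        mmsi = msg["mmsi"]
    except (ValueError, TypeError, KeyError):
        return None
    # Reject coordinates outside the valid latitude/longitude ranges.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return {"mmsi": mmsi, "lat": lat, "lon": lon, "timestamp": msg.get("timestamp")}


def latest_positions(documents):
    """Return {mmsi: (lon, lat)}, keeping the last document seen for each ship."""
    latest = {}
    for doc in documents:
        latest[doc["mmsi"]] = (doc["lon"], doc["lat"])
    return latest


# Step 1: clean the feed into documents (what the ingest flow does conceptually).
documents = [doc for doc in map(to_document, SAMPLE_MESSAGES) if doc is not None]

# Step 3: reduce the documents to one point per ship, ready for plotting.
positions = latest_positions(documents)
for mmsi, (lon, lat) in sorted(positions.items()):
    print(mmsi, lon, lat)
```

In a notebook, the resulting `(lon, lat)` pairs could then be handed to a plotting library, for example as the x and y values of a matplotlib scatter plot.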
Data scientists typically work individually, struggle to find the data they need and are often unaware of the assets, code and algorithms that already exist in their organisations.
These challenges are overcome by using the cloud-native data pipeline services available on the IBM Cloud. The analyst can get started on analysis within minutes of deciding what data is needed to tackle a business problem. The data is easy to find and access using the catalog, and the infrastructure needed to run analytics on large amounts of data is provisioned automatically when the analyst creates a notebook. (Furthermore, the Spark instance is de-provisioned when the notebook is closed.) Data assets and notebooks can be shared so that data science becomes a team sport.
Get ready to try for yourself by signing up to the Watson Data Platform on the IBM Cloud at dataplatform.ibm.com.
Acknowledgements: thanks to Dave Conway-Jones, Richard Hopkins and Joe Plumb for their contributions.