Becoming a data-driven organisation

Organisations struggle to become data-driven while they retain traditional, siloed business functions.  The hand-offs between silos, each pursuing its own goals, and the overhead of communicating across them create too much inertia.

The real question is:  how do you become outcome-driven?  It requires those who interact with customers to be informed (understanding what is happening in context), empowered to make decisions, and equipped to act in line with the business goal.

It takes an end-to-end approach to become an outcome-driven organisation

I have shown how to build a slice of a data pipeline in previous posts on my blog.  This end-to-end approach is the enabler of shared situational awareness.  Data is available from source in a shared platform, which in turn feeds information to all parts of the business.  However, vertical organisational silos also need to be dissolved in favour of outcome-driven value streams.  Those at the front line must be able to see all the way back to the start of the information cycle, safely within the organisation’s information governance policies.  Everyone then has improved and timelier shared awareness.

Each area of the business that interacts with customers operates as a business value stream.  These streams enshrine the concept of bringing the work to the people, rather than shipping people to the work.  This increases quality and employee engagement and reduces internal conflict.

Consuming higher value services releases business capacity

Teams are assembled around value streams.  They are multi-disciplinary and remove the need for traditional IT programmes and shared services.  Migrating workloads to the cloud reduces the burden of sustaining existing IT systems, so technical expertise that was previously scarce and shared can be dedicated to each business area.  Each business area can then concentrate on optimising its outcomes.

Teams can find and access the information they need using the data platform and configure a pipeline to produce the insights required for decision-making.  This draws on techniques including data analysis, pattern identification and algorithm development.  The pipeline can be augmented with AI and machine learning for greater automation and accuracy.

Micro-services architectures provide teams with the technical capabilities to act, but that is the subject of a future post.  Suffice it to say that they offer a step change in automation and agility.

Such automation enables business operations to react more quickly to change.  It frees up time for people to learn new skills, to engage with each customer in greater depth, and to focus on tasks that rely on imagination, intuition and empathy.

The profile of technical skills an organisation needs to compete has shifted

Each business area will use the platform to create, maintain, grow, shrink and decommission its own systems with ease, exploiting automation, sophisticated analytics and machine learning.  As I have shown in previous posts, the barriers to deployment are so low that a team can start small, experiment and enhance its capabilities in days or weeks on the platform, without creating unsupportable or under-the-desk IT.

Only then can you truly become data-driven and maximise the benefits of a data pipeline.

Simplifying data science

Cloud computing is changing the way IT services are accessed and consumed.  By engaging higher up the stack, organisations are becoming less dependent on infrastructure expertise.

In my previous post, I showed how to ingest ship positioning data into a Cloudant NoSQL database using Node-RED on the IBM Cloud.  This time I shall show you how an analyst or data scientist can find and access information quickly so that they can spend more of their time using their expertise to derive insight, discover patterns and develop algorithms.

I use the Catalog service to create a connection to, and a description of, my Cloudant data.  The connection simplifies access to the data source for analysts, and I can associate data assets, with their descriptions and tags, with that connection so that the data is easy to find.  The catalogue, connections and data assets are all subject to access control, and I can implement governance policies, showing lineage, for example.

Let’s see how we create the catalogue entries on the Watson Data Platform in the IBM Cloud.

See how to share data assets using a catalogue.

Analysts can access the data assets in the Catalog from notebooks in the Data Science Experience (DSX).  I create a project and add the required assets simply by picking them from the catalogue.  I can then create a Jupyter notebook within DSX and generate the Python code to connect to the data asset I need.  Furthermore, DSX automatically provisions a Spark instance for my analysis when I create (or reopen) the notebook.
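To make that generated connection code concrete, here is a minimal sketch of the kind of Python a notebook ends up with, using the python-cloudant client library.  The credentials and database name are placeholders; in DSX they are filled in from the catalogue entry rather than typed by hand.

    # Minimal sketch: connect to a Cloudant database from a Python notebook.
    # The credentials dictionary and database name below are placeholders.
    from cloudant.client import Cloudant

    credentials = {
        "username": "CLOUDANT_USERNAME",
        "password": "CLOUDANT_PASSWORD",
        "host": "CLOUDANT_HOST",
    }

    client = Cloudant(
        credentials["username"],
        credentials["password"],
        url="https://" + credentials["host"],
        connect=True,
    )

    db = client["ais_positions"]   # hypothetical database name
    print("Documents in database:", db.doc_count())
    client.disconnect()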

All this only takes a few minutes as I show here.

See how to use a data pipeline for analysis.

This degree of automation is achieved by concentrating on configuring services that make up an overall data pipeline.  The links between the services make it simpler for analysts to find and access data from environments they are already familiar with.  The dependency on IT resources to provide and manage data platforms is removed, because the analytics engine is provisioned as required and released once the analysis is complete.

In addition, analysts can share their work, and teams can be built to work on problems, assets and notebooks together.  Data science has become a team sport.

I shall describe how such an end-to-end data pipeline might be implemented at scale in the next post in this series.  In the meantime, try out the Watson Data Platform services for yourself on the IBM Cloud.

This is the third in a series of posts on building an end-to-end data pipeline.  You can find my notebook and the other data pipeline artifacts on GitHub.

Building an end-to-end data pipeline

Becoming a data-driven organisation sounds so simple.  But fulfilling the vision of making smarter decisions takes more than simply providing analysts with tools.

In this series of blog posts, I shall show you how to build an end-to-end data pipeline.  It allows information to be captured from source, then processed and analysed according to business need.  I shall configure the pipeline using cloud services as a business user, thereby removing the dependency on traditional IT infrastructure and the setup and maintenance of data platforms.

My pipeline is made up of three main elements today, though I have plans to augment it with more complex processing using additional cloud services.

Ships are required to broadcast their positions.  Messages are picked up by a beacon in southern England.  This AIS data is processed at the edge by an MQTT broker and broadcast by topic.  This is the data source for my demonstration.
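To give a feel for that feed, here is a minimal Python sketch of subscribing to an AIS topic with an MQTT client and printing each position report.  The broker address, topic name and message fields are placeholders for illustration; they are not the details of the actual feed used in this series.

    # Minimal sketch: subscribe to an AIS topic on an MQTT broker and print
    # each JSON position report. Broker address and topic are placeholders.
    import json
    import paho.mqtt.client as mqtt

    BROKER_HOST = "mqtt.example.org"   # hypothetical broker address
    AIS_TOPIC = "ais/positions/#"      # hypothetical topic hierarchy

    def on_message(client, userdata, message):
        report = json.loads(message.payload.decode("utf-8"))
        print(report.get("mmsi"), report.get("lat"), report.get("lon"))

    client = mqtt.Client()
    client.on_message = on_message
    client.connect(BROKER_HOST, 1883)
    client.subscribe(AIS_TOPIC)
    client.loop_forever()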


  1. The first step is to pick up the AIS JSON data feed published by the MQTT broker and process it for ingestion into a database.  I did this without writing any code: I configured nodes in Node-RED to construct a flow that inserts the AIS data into a Cloudant database, using the Node-RED boilerplate service in the IBM Cloud, which includes a bound Cloudant service.
  2. Secondly, I used the Catalog service in the Watson Data Platform on the IBM Cloud to create a connection and a data asset in my catalogue.  The catalogue allows me to describe and share data assets so that they are easy for people to find and use, subject to access controls and governance policies.
  3. Then I access the catalogue from the Data Science Experience (DSX) to populate my Jupyter notebook with access to my database.  DSX provisions a Spark instance automatically for my analysis, which is to plot the positions of ships on a graph; a simplified sketch of this plotting step follows the list.
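The sketch below illustrates that final step in plain Python: read the AIS position documents from Cloudant and scatter-plot them.  The database and field names (mmsi, lat, lon) are assumptions for illustration, and it deliberately skips the Spark instance that DSX provisions; the actual notebook is on GitHub.

    # Simplified sketch: plot ship positions stored in Cloudant.
    # Database and field names are placeholders, not the real schema.
    import pandas as pd
    import matplotlib.pyplot as plt
    from cloudant.client import Cloudant

    client = Cloudant("CLOUDANT_USERNAME", "CLOUDANT_PASSWORD",
                      url="https://CLOUDANT_HOST", connect=True)
    db = client["ais_positions"]   # hypothetical database name

    # Pull each document's identifier and position into a DataFrame.
    positions = pd.DataFrame(
        [{"mmsi": doc.get("mmsi"), "lat": doc.get("lat"), "lon": doc.get("lon")}
         for doc in db]
    )
    client.disconnect()

    positions.plot.scatter(x="lon", y="lat", s=2)
    plt.title("Reported ship positions")
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.show()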

Data scientists typically work individually, struggle to find the data they need, and are often unaware of the assets, code and algorithms that already exist in their organisations.

These challenges are overcome by using the cloud-native data pipeline services available on the IBM Cloud.  An analyst can get started on analysis within minutes of deciding what data is needed to tackle a business problem: the data is easy to find and access through the catalogue, and the infrastructure needed to execute analytics on large amounts of data is provisioned automatically when they create a notebook.  (Furthermore, the Spark instance is de-provisioned when the notebook is closed.)  Data assets and notebooks can be shared, so that data science becomes a team sport.

Get ready to try for yourself by signing up to the Watson Data Platform on the IBM Cloud.

Other posts in this series include:

Acknowledgements: thanks to Dave Conway-Jones, Richard Hopkins and Joe Plumb for their contributions.