Simplifying data science

Cloud computing is changing the way IT services are accessed and consumed.  By engaging higher up the stack, teams depend less and less on infrastructure expertise.

In my previous post, I showed how to ingest ship positioning data into a Cloudant NoSQL database using Node-RED on the IBM Cloud.  This time I shall show you how an analyst or data scientist can find and access information quickly so that they can spend more of their time using their expertise to derive insight, discover patterns and develop algorithms.

I use the Catalog service to create a connection to, and description of, my Cloudant data.  The connection simplifies access to data sources for analysts, and I can associate data assets, including their descriptions and tags, with that connection so that data can be easily found.  The catalogue, connections and data assets are subject to access control, and I can implement governance policies, showing lineage, for example.
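To give a sense of what such a connection encapsulates, here is a minimal sketch of reading documents directly from Cloudant with the python-cloudant client library.  The account, credentials and database name are placeholders; in practice the catalogue connection stores these details so analysts never have to handle them themselves.

```python
from cloudant.client import Cloudant

# Placeholder credentials -- a catalogue connection stores these
# details so that analysts do not have to manage them directly.
client = Cloudant("my-username", "my-password",
                  url="https://my-account.cloudant.com",
                  connect=True)

db = client["ship_positions"]  # hypothetical database of ship positioning documents

# Print the first few documents to confirm the connection works
for doc in list(db)[:5]:
    print(doc)

client.disconnect()
```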

Let’s see how we create the catalogue entries on the Watson Data Platform in the IBM Cloud.

See how to share data assets using a catalogue.

Analysts can access data assets from the Catalog in notebooks using the Data Science Experience (DSX).  I create a project and simply add the required assets by picking them from the catalogue.  I can then create a Jupyter notebook within DSX and generate the Python code to connect to the data asset I need.  Furthermore, DSX automatically provisions a Spark instance for my analysis when I create (or reopen) the notebook.
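The generated code looks something like the following sketch, which assumes the Cloudant Spark connector bundled with DSX; the host, credentials and database name are placeholders that the catalogue connection fills in for a real asset.

```python
from pyspark.sql import SparkSession

# DSX provisions the Spark instance; in a notebook a session is
# already available, but building one keeps this sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Placeholder connection details -- the catalogue supplies the real
# values when the generated code is inserted into the notebook.
ship_df = (spark.read
    .format("com.cloudant.spark")                       # Cloudant Spark connector
    .option("cloudant.host", "my-account.cloudant.com")
    .option("cloudant.username", "my-username")
    .option("cloudant.password", "my-password")
    .load("ship_positions"))                            # hypothetical database name

# Inspect the schema and a sample of the ship positioning data
ship_df.printSchema()
ship_df.show(5)
```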

All this only takes a few minutes as I show here.

See how to use a data pipeline for analysis.

This degree of automation is achieved by concentrating on configuring the services that make up an overall data pipeline.  The links between the services make it easier for analysts to find and access data from environments they are familiar with.  The dependency on IT resources to provide and manage data platforms is removed, because the analytics engine is provisioned as required and released once the analysis is complete.

In addition, analysts can share their work, and teams can be built to work on problems, assets and notebooks together.  Data science has become a team sport.

I shall describe how such an end-to-end data pipeline might be implemented at scale in the next post in this series.  In the meantime, try out the Watson Data Platform services for yourself on the IBM Cloud at dataplatform.ibm.com.

This is the third in a series of posts on building an end-to-end data pipeline.  You can find my notebook and the other data pipeline artifacts on GitHub.
