AI needs IA

Too few artificial intelligence (AI) projects succeed.  Many organisations approach AI by collecting data for an algorithm and hoping it will realise the anticipated benefits.  Instead, start from the data and design a system to address a problem, not an algorithm.

Here are some keys to success for adopting AI.

  • Select the right business problem. Choose one for which a team already exists and already has the data.  This avoids the pitfall where “We need to test AI” results in a deceptively attractive initiative with low business value that is hard to deliver.  Collecting data in the first place requires a business process to be in place, yet many organisations face a conundrum: the business case for gathering that data demands a demonstration of AI.
  • Look at the data. Organisations typically underestimate, often significantly, the effort needed to orchestrate data in readiness for AI.  AI needs accurate data, and data cleansing and preparation typically takes around 80% of the effort.  This is a hard engineering problem that requires a sound approach to information architecture, a range of technologies and a breadth of skills, not just data scientists.  A minimal illustration of this kind of preparation follows this list.
  • Build systems, not algorithms. Many assume that a simple sequence of steps is sufficient to generate insight and recommendations.  However, feedback is crucial to improving overall accuracy.  Such a system is complex, with many moving parts, and demands a multi-disciplinary approach.
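
To make the data-preparation point concrete, here is a minimal sketch of the kind of cleansing work involved.  It is illustrative only and not taken from any particular project: pandas is assumed, and the file and column names (positions.csv, vessel_id, timestamp, speed_knots) are hypothetical.

    import pandas as pd

    # Hypothetical raw export of ship position reports
    raw = pd.read_csv("positions.csv")

    clean = (
        raw.drop_duplicates(subset=["vessel_id", "timestamp"])                        # remove repeated reports
           .assign(timestamp=lambda df: pd.to_datetime(df["timestamp"],
                                                       errors="coerce"))              # normalise timestamps
           .dropna(subset=["vessel_id", "timestamp"])                                 # discard unusable rows
           .query("speed_knots >= 0 and speed_knots <= 60")                           # reject implausible speeds
    )

    clean.to_csv("positions_clean.csv", index=False)

Even this toy example touches deduplication, normalisation and validation; real datasets multiply each of those concerns.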

AI must be transparent for the public to trust it.  This is especially significant for the public sector, where important decisions must be explainable.  It is essential to understand who trained the AI system, what data was used to train it, and what went into the recommendations it makes.  This extends the realm of information governance.

In summary, AI needs IA: Information Architecture.

Simplifying data science

Cloud computing is changing the way IT services are accessed and consumed.  By engaging higher up the stack, organisations are becoming less dependent on infrastructure expertise.

In my previous post, I showed how to ingest ship positioning data into a Cloudant NoSQL database using Node-RED on the IBM Cloud.  This time I shall show how an analyst or data scientist can find and access information quickly, so that they can spend more of their time applying their expertise to derive insight, discover patterns and develop algorithms.

I use the Catalog service to create a connection to, and description of, my Cloudant data.  The connection simplifies access to data sources for analysts, and I can associate data assets, including their descriptions and tags, with that connection so that data can be found easily.  The catalogue, connections and data assets are subject to access control, and I can implement governance policies, showing lineage, for example.

Let’s see how we create the catalogue entries on the Watson Data Platform in the IBM Cloud.

See how to share data assets using a catalogue.

Analysts can access data assets from the Catalog in notebooks using the Data Science Experience (DSX): create a project and simply add the required assets by picking them from the catalogue.  I can then create a Jupyter notebook within DSX and generate the Python code to connect to the data asset I need.  Furthermore, DSX automatically provisions a Spark instance for my analysis when I create (or reopen) the notebook.
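
The exact code generated depends on the asset.  As a hedged sketch, a Python connection to a Cloudant database using the python-cloudant client might look like the following; the credentials, URL and database name are placeholders, and in DSX they come from the catalogued connection rather than being typed by hand.

    from cloudant.client import Cloudant

    # Placeholder credentials; in DSX these are injected from the connection asset
    client = Cloudant("USERNAME", "PASSWORD",
                      url="https://ACCOUNT.cloudant.com", connect=True)

    db = client["ship_positions"]          # hypothetical database name

    # Materialise the documents for exploration, e.g. before loading into pandas
    docs = [doc for doc in db]
    print("Loaded {} documents".format(len(docs)))

    client.disconnect()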

All this only takes a few minutes as I show here.

See how to use a data pipeline for analysis.

This degree of automation is achieved by concentrating on configuring the services that make up an overall data pipeline.  The links between the services make it simpler for analysts to find and access data from environments they are already familiar with.  The dependency on IT resources to provide and manage data platforms is removed because the analytics engine is provisioned as required and released once the analysis is complete.
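
As a sketch of what that looks like inside the notebook, the provisioned Spark instance can read the same Cloudant database directly into a DataFrame.  This assumes the Apache Bahir spark-sql-cloudant connector is available on the Spark classpath; the host, credentials and field name below are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("ship-positions")
             .config("cloudant.host", "ACCOUNT.cloudant.com")
             .config("cloudant.username", "USERNAME")
             .config("cloudant.password", "PASSWORD")
             .getOrCreate())

    # Load the Cloudant database as a Spark DataFrame
    positions = spark.read.format("org.apache.bahir.cloudant").load("ship_positions")

    positions.printSchema()
    positions.groupBy("vessel_type").count().show()   # "vessel_type" is a hypothetical field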

In addition, analysts can share their work, and teams can be built to work on problems, assets and notebooks together.  Data science has become a team sport.

I shall describe how such an end-to-end data pipeline might be implemented at scale in the next post in this series.  In the meantime, try out the Watson Data Platform services for yourself on the IBM Cloud at dataplatform.ibm.com.

This is the third in a series of posts on building an end-to-end data pipeline.  You can find my notebook and the other data pipeline artifacts on GitHub.

Data is not the new oil

You’ve heard it many times and so have I:  “Data is the new oil.”

Well it isn’t.  At least not yet.

I don’t care how I get oil for my car or heating.  I simply decide what to cook and where to drive when I want.  I’m unconcerned which mechanism is used to refine oil or how oil is transported, so long as what comes out of the pump at the garage makes my car go.  Unless you have a professional interest or bias I suspect you’re much the same.

Why can’t it be the same with data?

Well, for a start, the consumer of data is often all too aware of the complexity of the supply chain and the multiple skills and technologies it takes to deliver the data they wish to consume.  Systems take forever to create and are inflexible in the wrong places.  The ability to aggregate data is over-constrained by blanket security rules that enforce sensible policies but result in slow-moving, over-bureaucratic processes and systems.

Today’s cloud technologies have helped, but even here, data services are aimed at developers as the consumer of data, not the end user of it.

The consumers of the new oil would love to be ignorant of where it came from, but instead they are all too aware of, and involved in, a supply chain they must coax into doing what they want.

Wouldn’t it be wonderful if those who make business decisions could find naturally described information whenever they wanted it?  If they could use it as they wish, without regard for the underlying infrastructure?  All with the confidence that access controls and data protection measures are built in.  Enforcing governance policies within the platform builds trust and helps achieve regulatory compliance, for example with GDPR.

These are the characteristics of a data pipeline: services that ingest data from sources, then govern, enrich, store, analyse and apply it.  How the data is stored is no longer a concern.  Data is available to all without aggravation.
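
As a purely illustrative sketch, and not a description of any particular product, those stages compose like this; every function below is a hypothetical stand-in for a managed service.

    def ingest(source):
        return [{"id": 1, "reading": "raw"}]               # pull raw records from a source system

    def govern(records):
        return [r for r in records if "id" in r]           # apply access and data-protection policies

    def enrich(records):
        return [dict(r, region="EU") for r in records]     # add context from reference data

    def store(records):
        return records                                      # persist to a managed data service

    def analyse(records):
        return {"count": len(records)}                      # derive insight from the stored data

    def apply_insight(insight):
        print("Acting on insight:", insight)                # feed results back into the business

    apply_insight(analyse(store(enrich(govern(ingest("ship_positions"))))))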

Companies like IBM are now delivering cloud platforms that do precisely this.  IBM has even published the Data Science Experience, which enables a data scientist to build their own pipelines from a rich palette of ingest, machine learning and storage technologies.

We take oil for granted.  Can you say the same for the data you need to drive your business forward?

Try out the Data Science Experience on the IBM Cloud.

Introducing Tech Insights

Thanks for joining me!

I shall be sharing personal perspectives on cloud, data and analytics, and the use of technology in Government in line with IBM’s Social Computing Guidelines.

I have blogged extensively in recent years, as well as providing copy and interviews to trade press and national media.  Here are some of those articles and publications for your convenience.

IBM Insights on Business (generating insight sooner)

Many posts on IBM’s Big Data & Analytics Hub

Big data patterns and maturity model on IBM developerWorks

Big data is multi-disciplinary

Making data serve society (The IET)

Behind the scenes of IBM’s Wimbledon data bunker (The Guardian)

Cyber Security: Protecting the Public Sector and subsequent articles in SC Magazine, including “Councils must improve their digital security” and “Building a risk aware cyber secure culture”

An author of the IBM Redbook: Implementing a service-oriented architecture using an enterprise service bus

Check out my profile.