The 3L’s that kill #data projects

The typical data project starts with the BA or systems architect asking: “fast, cheap or good – which one do you want?” But in my experience, no matter how much time you have, or how much money you are willing to throw at it, or what features you are willing to sacrifice, many initiatives are doomed to fail before you even start because of inherent obstacles – what I like to refer to as the 3L’s of data projects.

Image taken from "Computers at Work" © 1969 The Hamlyn Publishing Group

Image taken from “Computers at Work” © 1969 The Hamlyn Publishing Group

Reflecting on work I have been doing with various clients over the past few years, it seems to me that despite their commitment to invest in system upgrades, migrate their content to new delivery platforms and automate their data processing, they often come unstuck due to fundamental flaws in their existing operations:

Legacy

This is the most common challenge – overhauling legacy IT systems or outmoded data sets. Often, the incumbent system is still working fine (provided someone remembers how it was built, configured or programmed), and the data in and of itself is perfectly good (as long as it can be kept up-to-date). But the old applications won’t talk to the new ones (or even each other), or the data format is not suited to new business needs or customer requirements.

Legacy systems require the most time and money to replace or upgrade. A colleague who works in financial services was recently bemoaning the costs being quoted to rewrite part of a legacy application – it seemed an astronomical amount of money to write a single line of code…

As painful as it seems, there may be little alternative but to salvage what data you can, decommission the software and throw it out along with the old mainframe it was running on!

Latency

Many data projects (especially in financial services) focus on reducing systems latency to enhance high-frequency and algorithmic securities trading, data streaming, real-time content delivery, complex search and retrieval, and multiple simultaneous user logins. From a machine-to-machine data handover and transaction perspective, such projects can deliver spectacular results – with the goal being end-to-end straight through processing in real-time.

However, what often gets overlooked is the level of human intervention – from collecting, normalizing and entering the data, to the double- and triple-handling to transform, convert and manipulate individual records before the content goes into production. For example, when you contact a telco, utility or other service provider to update your account details, have you ever wondered why they tell you it will take several working days for these changes to take effect? Invariably, the system that captures your information in “real-time” needs to wait for someone to run an overnight batch upload or someone else to convert the data to the appropriate format or yet another person to run a verification check BEFORE the new information can be entered into the central database or repository.

Latency caused by inefficient data processing not only costs time, it can also introduce data errors caused by multiple handling. Better to reduce the number of hand-off stages, and focus on improving data quality via batch sampling, error rate reduction and “capture once, use many” workflows.
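To make the batch sampling point concrete, here is a minimal sketch (in Python, with invented field names, validation rules and an assumed 2% error tolerance) of checking a random sample of records before a batch is released downstream:

```python
import random

# Hypothetical validation rules - real rules would come from your data governance model
def record_is_valid(record):
    return (
        bool(record.get("customer_id"))
        and "@" in record.get("email", "")
        and record.get("postcode", "").isdigit()
    )

def estimate_error_rate(batch, sample_size=200, seed=42):
    """Estimate the share of invalid records by checking a random sample of the batch."""
    random.seed(seed)
    sample = random.sample(batch, min(sample_size, len(batch)))
    errors = sum(1 for record in sample if not record_is_valid(record))
    return errors / len(sample)

# Example: only promote the batch to production if the sampled error rate is acceptable
batch = [{"customer_id": "C001", "email": "jo@example.com", "postcode": "3000"}] * 500
if estimate_error_rate(batch) <= 0.02:   # assumed 2% tolerance, purely for illustration
    print("Batch accepted for processing")
else:
    print("Batch held back for data cleansing")
```

The point is not the specific rules, but that a cheap sampling check up-front can catch systemic data entry problems before they multiply through every subsequent hand-off.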

Which leads me to the third element of the troika – data governance (or the lack thereof).

Laissez-faire

In an ideal world, organisations would have an overarching data governance model, which embraces formal management and operational functions including: data acquisition, capture, processing, maintenance and stewardship.

However, we often see that the lack of a common data governance model (or worse, a laissez-faire attitude that allows individual departments to do their own thing) means there is little co-operation between functions, additional costs arising from multiple handling and higher error rates, plus inefficiencies in getting the data to where it needs to be within the shortest time possible and within acceptable transaction costs.

Some examples of where even a simple data capture model would help (a rough sketch follows the list) include:

  • standardising data entry rules for basic information like names and addresses, telephone numbers and postal codes
  • consistent formatting for dates, prices, measurements and product codes
  • clear data structures for parent/child/sibling relationships and related parties
  • coherent tagging and taxonomies for field types, values and other attributes
  • streamlining processes for new record verification and de-duplication
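As flagged above, here is a rough sketch of what standardised entry rules and consistent formatting might look like in code – the accepted formats and the +61 default country code are assumptions for illustration, not a recommended standard:

```python
import re
from datetime import datetime

def normalise_phone(raw, default_country="+61"):
    """Strip punctuation and apply a single storage format for phone numbers."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("0"):
        digits = default_country.lstrip("+") + digits[1:]
    return "+" + digits

def normalise_date(raw):
    """Accept a few common entry formats, store one canonical ISO format."""
    for fmt in ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {raw!r}")

print(normalise_phone("(03) 9123 4567"))   # +61391234567
print(normalise_date("07/08/2015"))        # 2015-08-07
```

Agreeing even a handful of canonical formats like these, and applying them at the point of capture, removes a great deal of downstream reconciliation work.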

From experience, autonomous business units often work against the idea of a common data model because of the way departmental IT budgets are handled (including the P&L treatment of and ROI assumptions used for managing data costs), or because every team thinks they have unique and special data needs which only they can address, or because of a misplaced sense of “ownership” over enterprise data (notwithstanding compliance firewalls and other regulatory requirements necessitating some data separation).

Conclusion

One way to think about major data projects (systems upgrades, database migration, data automation) is to approach them rather like a house renovation or extension: if the existing foundations are inadequate, or if the old infrastructure (pipes, wiring, drains, etc.) is antiquated, what would your architect or builder recommend (and how much would they quote) if you said you simply wanted to incorporate what was already there into the new project? Would your budget accommodate a major retrofit or complex re-build? And would you expect to live in the property while the work is being carried out?

Next week: AngelCube15 – has your #startup got what it takes?

The New Alchemy – Turning #BigData into Valuable Insights

Here’s the paradox facing the consumption and analysis of #BigData: the cost of data collection, storage and distribution may be decreasing, but the effort to turn data into unique, valuable and actionable insights is actually increasing – despite the expanding availability of data mining and visualisation applications.

One colleague has described the deluge of data that businesses are having to deal with as “the firehose of information”. We are almost drowning in data and most of us are navigating up river without a steering implement. At the risk of stretching the aquatic metaphor, it’s rather like the Sorcerer’s Apprentice: we wanted “easy” data, so the internet, mobile devices and social media granted our wish in abundance. But we got lazy/greedy, forgot how to turn the tap off and now we can’t find enough vessels to hold the stuff, let alone figure out what we are going to do with it. Switching analogies, it’s a case of “can’t see the wood for the trees”.

Perhaps it would be helpful to provide some terms of reference: what exactly is “big data”?

First, size definitely matters, especially when you are thinking of investing in new technologies to process more data more often. For any database smaller than, say, 0.5TB, the economies of scale may dissuade you from doing anything other than deploying more processing power and/or capacity, as opposed to paying for a dedicated, super-fast analytics engine. (Of course, the situation also depends on how fast the data is growing, how many transactions or records need to be processed, and how often those records change.)

Second, processing velocity, volume and data variety are also factors – for example, unless you are a major investment bank with a need for high-frequency, low-latency algorithmic market trading solutions, then you can probably make do with off-the-shelf order routing and processing platforms. Even “near real-time” data processing speeds may be overkill for what you are trying to analyze. Here’s a case in point:

Slick advertorial content, and I agree that the insights (and opportunities) are in the delta – what’s changed, what’s different? But do I really need to know what my customers are doing every 15 seconds? For a start, it might have been helpful to explain what APM is (I had to Google it, and CA did not come up in the Top 10 results). Then explain what it is about the resulting analytics that NAB is now using to drive business results. For instance, what does it really mean if peak mobile banking usage is 8-9am (and did I really need an APM solution to find this out)? Are NAB going to lease more mobile bandwidth to support client access on commuter trains? Has NAB considered push technology to give clients account balances at scheduled times? Is NAB adopting technology to shape transactional and service pricing according to peak demand? (Note: when discussing this example with some colleagues, we found it ironic that a simple inter-bank transfer can still take several days before the money reaches your account…)

Third, there are trade-offs when dealing with structured versus unstructured data. Buying dedicated analytics engines may make sense when you want to do deep mining of structured data (“tell me what I already know about my customers”), but that might only work if the data resides in a single location, or in multiple sites that can easily communicate with each other. Often, highly structured data is also highly siloed, meaning the efficiency gains may be marginal unless the analytics engine can do the data trawling and transformation more effectively than traditional data interrogation (e.g., query and matching tools). On the other hand, the real value may be in unstructured data (“tell me something about my customers I don’t know”), typically captured in a single location but usually monitored only for visitor volume or stickiness (e.g., a customer feedback portal or user bulletin board).

So, to data visualisation.

Put simplistically, if a picture can paint a thousand words, data visualisation should be able to unearth the nuggets of gold sitting in your data warehouse. Our “visual language” is capable of identifying patterns as well as discerning abstract forms, of describing subtle nuances of shade as well as defining stark tonal contrasts. But I think we are still working towards a visual taxonomy that can turn data into meaningful and actionable insights. A good example of this might be so-called sentiment analysis (e.g., derived from social media commentary), where content can be weighted and scored (positive/negative, frequency, number of followers, level of sharing, influence ranking) to show what your customers might be saying about your brand on Twitter or Facebook. The resulting heat map may reveal what topics are hot, but unless you can establish some benchmarks, or distinguish between genuine customers and “followers for hire”, or can identify other connections with this data (e.g., links with your CRM system), it’s an interesting abstract image but can you really understand what it is saying?
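To make the weighting and scoring idea concrete, here is a minimal sketch – the posts, weights and “influence” formula are entirely invented, and a real sentiment tool would be far more sophisticated:

```python
# Hypothetical social media posts, purely for illustration
posts = [
    {"sentiment": +1, "followers": 1200, "shares": 45},
    {"sentiment": -1, "followers": 300,  "shares": 5},
    {"sentiment": -1, "followers": 9000, "shares": 210},
]

def influence_weight(post):
    """Crude influence proxy: baseline plus reach plus amplification."""
    return 1 + post["followers"] / 1000 + post["shares"] / 10

# Weighted brand sentiment: positive means net favourable commentary
score = sum(p["sentiment"] * influence_weight(p) for p in posts)
total = sum(influence_weight(p) for p in posts)
print(f"Weighted sentiment: {score / total:+.2f}")
```

Even with a sensible weighting scheme, the benchmark problem remains: a score of -0.66 means little until you know what last month’s score was, or how it compares with your competitors.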

Another area where data visualisation is being used is in targeted marketing based on customer profiles and sales history (e.g., location-based promotion using NFC solutions powered by data analytics). For example, with more self-serve check-outs, supermarkets have to re-think where they place the impulse-buy confectionery displays (and those magazine racks that were great for killing time while queuing up to pay…). What if they could scan your shopping items as you place them in your basket and, combined with what they already know about your shopping habits, map your journey around the store to predict what’s on your shopping list, thereby prompting you via your smartphone (or the basket itself?) towards your regular items, even saving you time in the process? And then they reward you with a special “in-store only” offer on your favourite chocolate. Sounds a bit spooky, but we know retailers already do something similar with their existing loyalty cards and reward programs.
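A crude sketch of how such a prediction might work – counting which items have historically appeared in the same basket and suggesting the strongest co-occurrences. The purchase history here is invented, and real retail systems use far richer models:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history; a retailer would mine this from loyalty-card data
history = [
    {"bread", "milk", "chocolate"},
    {"bread", "butter"},
    {"milk", "chocolate", "coffee"},
    {"bread", "milk", "butter"},
]

# Count how often pairs of items appear in the same basket
pair_counts = Counter()
for basket in history:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def suggest(current_basket, top_n=3):
    """Suggest likely next items based on what co-occurs with the basket so far."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a in current_basket and b not in current_basket:
            scores[b] += count
        elif b in current_basket and a not in current_basket:
            scores[a] += count
    return [item for item, _ in scores.most_common(top_n)]

print(suggest({"bread", "milk"}))   # e.g. ['butter', 'chocolate', 'coffee']
```

Combine something like this with the loyalty-card history the retailer already holds, and the “spooky” in-store prompt is not far away.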

Finally, what are some of the tools that businesses are using? Here are just a few that I have heard mentioned recently (please note I have not used any of these myself, although I have seen sales demos of some applications – these are definitely not personal recommendations, and you should obviously do your own research and due diligence):

For managing and distributing big data, Apache Hadoop was name-checked at a financial data conference I attended last month, along with kdb+ to process large time-series data, and GetGo to power faster download speeds. Python was cited for developing machine learning and even predictive tools, while DataWatch is taking its data transformation platform into real-time social media sentiment analysis (including heat and field map visualisation). YellowFin is an established dashboard reporting tool for BI analytics and monitoring, and of course Tableau is a popular visualisation solution for multiple data types. Lastly, ThoughtWeb combines deep data mining (e.g., finding hitherto unknown connections between people, businesses and projects via media coverage, social networks and company filings) with innovative visualisation and data display.

Next week: a few profundities (and many expletives) from Dave McClure of 500 Startups

Defining RoDA: Return on #Digital Assets

How do we measure the Return on Investment for digital assets? It’s a question that is starting to challenge digital marketers and IT managers alike, but there don’t appear to be too many guidelines. Whether your social media campaign is being expensed as direct marketing costs, or your hardware upgrade is being capitalised, how do you work out the #RoDA?

In most businesses, measuring the expected RoI of plant or equipment is usually quite easy: it’s normally a financial calculation that takes the initial acquisition price, amortized over the useful life of the asset, and then forecasts the “yield” in definable terms such as manufacturing output or capacity utilisation.
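For a piece of plant, the arithmetic is straightforward enough to sketch – the figures below are invented:

```python
def simple_roi(acquisition_cost, useful_life_years, annual_yield):
    """Classic plant-and-equipment view: straight-line amortisation versus forecast yield."""
    annual_cost = acquisition_cost / useful_life_years    # straight-line amortisation
    annual_return = annual_yield - annual_cost
    return annual_return / annual_cost                    # return per dollar of annual cost

# Hypothetical example: a $50,000 machine over 5 years producing $18,000 p.a. of output value
print(f"RoI: {simple_roi(50_000, 5, 18_000):.0%}")        # 80%
```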

However, when we look at digital assets, many of those traditional calculations won’t apply, either because the usage value is harder to define, or the benchmarks have not been established. Also, while hardware costs may be easy to capture, how are digital assets such as websites, social media accounts, software (proprietary and 3rd party) and domain names being reported in the P&L, cash-flow analysis and balance sheet?

Sure, most hardware (servers, PCs and physical networks) can be treated as capex (e.g., if the purchase price is more than $1,000 and the useful life is 2-5 years). But how do you make sure you are getting value for money – is it based on some sort of productivity analysis, or is it simply treated as fixed overhead – regardless of your turnover or operating costs?

As we move to cloud hosting and #BYOD, many of these assets utilised in the course of doing business won’t actually appear on the company balance sheet. Yet they will have some sort of impact on the operating costs. Most software is sold under a licensing model, where the customer does not actually own the asset. (But, if the international accounting standards change the treatment of operating leases longer than 12 months, that 2-year cloud hosting fee might just become a balance sheet item.)

I was once involved in the acquisition of a publishing business that was converting legacy print products to digital content. Not only did they capitalise (and amortize) the servers and the conversion software, they also capitalised the data entry costs (using freelance editors) to avoid the expense hitting the P&L. Nowadays, that’s a bit like putting the HTML coding team on the balance sheet and not the payroll…

In some cases, the costs associated with maintaining an e-commerce website or registering a URL will remain as overhead or operating expenses. But over time, businesses will want a better understanding of their RoI for different online sales and digital marketing channels, especially if they have been investing considerably in their design, build and maintenance. Metrics such as online visitor data, customer conversion rates and average yield per sale are becoming well established for many B2C sites. Having a good grasp of your #RoDA may just give you a competitive edge, or at least provide a benchmark on effective marketing costs.
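A back-of-the-envelope sketch of those channel metrics – the figures are invented, and how you attribute channel costs is itself a judgment call:

```python
# Hypothetical monthly figures for one digital channel
visitors = 42_000
orders = 630
revenue = 47_250.0
channel_cost = 9_500.0    # design, build, hosting and campaign spend attributed to the channel

conversion_rate = orders / visitors                 # share of visitors who buy
average_yield = revenue / orders                    # average revenue per sale
cost_per_acquisition = channel_cost / orders        # marketing cost per customer won
roda = (revenue - channel_cost) / channel_cost      # crude "return on digital assets"

print(f"Conversion rate: {conversion_rate:.1%}")
print(f"Average yield per sale: ${average_yield:.2f}")
print(f"Cost per acquisition: ${cost_per_acquisition:.2f}")
print(f"RoDA (this channel): {roda:.0%}")
```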

F for Facsimile: What are ‘Digital Forgeries’?

Last week, I attended the 2014 Foxcroft Lecture, given by Nicholas Barker, entitled “Forgery of Printed Documents”. The lecture prompted the question, what would we consider to be a ‘digital forgery’?

Make Up

The lecture was an investigation into a practice that emerged in the 18th century, when reproductions (‘fac similes’ – Latin for ‘make alike’) of early printed texts were created either as honest replicas, or to enable missing pages from antiquarian books to be restored to ‘make up’ a complete work. In some cases, the original pages had been removed by the censors, for others the pages had been left out in error during the binding process, and mostly they had simply been lost through damage or age.

Other factors created the need for these facsimiles: the number of copies of a book that could be printed at a time was often limited by law (censorship again at work), or works were licensed to different publishers in different markets, but printed using the original plates to save time and money.

Despite the innocent origins of facsimiles, unscrupulous dealers and collectors found a way to exploit them for financial gain – and of course, there were also attempts to pass off completely bogus works as genuine texts.

Replication vs Authentication

Technology has not only made the mass reproduction of written texts so much easier, it has also changed the way physical documents are authenticated – for example, faxed and scanned copies of signed documents are sometimes deemed sufficient proof of their existence, as evidence of specific facts, or in support of a contractual agreement or commercial arrangement. But this was not always the case, and even today, some legal documents have to be executed in written, hard-copy form, signed in person by the parties and in some situations witnessed by an independent party. For certain transactions, a formal seal needs to be attached to the original document.

Authenticating digital documents and artifacts presents us with various challenges. Quite apart from the need to verify electronic copies of contracts and official documents, the ubiquity of e-mail (and social media) has made these channels a target for exploitation by hackers and others, making it increasingly difficult to place our trust in these forms of communication. As a result, we use encryption and other security devices to protect our data. But what about other digital content?

Let’s define ‘digital artifacts’ in this context as things like software; music; video; photography; books; databases; or digital certificates, signatures and keys. We know that it is much easier to fabricate something that is not what it purports to be (witness the use of photo-editing in the media and fashion industries), and there is a corresponding set of tools to help uncover these fabrications. Time stamping, digital watermarks, metadata and other devices can help us to verify the authenticity and/or source of a digital asset.
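As a simple illustration of one such device, the sketch below records a content digest (and a timestamp) when an artifact is first published, so that any later copy can be checked against it – the artifact and its contents are invented for the example:

```python
import hashlib
from datetime import datetime, timezone

def fingerprint(content: bytes) -> str:
    """SHA-256 digest of the artifact's bytes - any change produces a different digest."""
    return hashlib.sha256(content).hexdigest()

# Hypothetical artifact: in practice this would be the bytes of a contract, image or audio file
original = b"Executed agreement between Party A and Party B, dated 1 March 2014."

# Record the digest and a timestamp when the artifact is first published...
record = {
    "sha256": fingerprint(original),
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

# ...later, a received copy can be checked against the recorded digest
received = original                                   # a faithful copy
tampered = original.replace(b"Party B", b"Party C")   # a doctored 'copy'

print(fingerprint(received) == record["sha256"])   # True  - bytes unchanged
print(fingerprint(tampered) == record["sha256"])   # False - the copy has been altered
```

Of course, this only proves the bytes have not changed since the digest was recorded; it says nothing about whether the original was genuine in the first place – which is where timestamping services, signatures and provenance metadata come in.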

Multiplication

In the case of fine art, the use of digital media (as standalone images or video, as part of an installation, or as a component in mixed media pieces) has meant that some artists have made only a single unique copy of their work, while others have created so-called ‘multiples’ – large-scale editions of their work. (The realm of ‘digital works’ and ‘digital prints’ produced by photographers and artists is worthy of a separate article.)

Making copies of existing digital works is relatively simple – the technology to reproduce and distribute digital artifacts on a widespread scale is built into practically every device linked to the Internet. Not all digital reproduction and file sharing is theft or piracy – in fact, through the wonders of social media ‘sharing’, we are actually encouraged to disseminate this content to our friends and followers.

The song doesn’t remain the same

Apart from the computer industry’s use of product keys to manage and restrict the distribution of unlicensed copies of their software, the music and film industries have probably done the most to tackle illegal copying since the introduction of the CD/DVD. At various times, the entertainment industries have deployed the following technologies (a toy sketch of the last two follows the list):

  • copy-protection (to prevent copies being ripped and burned on computers)
  • encryption (discs and media files are ‘locked’ to a specific device or user account)
  • playback limits (mp3 files will become unplayable after a specific number of plays)
  • time expiry (content will be inaccessible beyond a specific date)
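As flagged above, here is a toy sketch of the last two mechanisms – the licence record and its limits are invented, and this resembles no real DRM scheme:

```python
from datetime import date

# Toy licence record - invented values, purely for illustration
licence = {"plays_used": 0, "max_plays": 10, "expires": date(2015, 12, 31)}

def can_play(licence, today=None):
    """Allow playback only while the play count and expiry date permit it."""
    today = today or date.today()
    return licence["plays_used"] < licence["max_plays"] and today <= licence["expires"]

if can_play(licence):
    licence["plays_used"] += 1
    print("Playing track...")
else:
    print("Content locked: playback limit reached or licence expired")
```

It is easy to see why checks like these were abandoned: in this toy form they live entirely on the user’s device, and anyone who can edit the licence record can defeat them.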

Most of these technologies have been abandoned because they either hamper our use and enjoyment of the content, or they have been easy to over-ride.

One technical issue to consider is ‘digital decay’ (*) – mostly, this relates to backing up and preserving digital archives, since we know that hard drives die, file formats become obsolete and software upgrades don’t always retrofit to existing data. But I wonder whether each subsequent copy of a digital artifact introduces unintentional flaws, which over time would generate copies that bear little resemblance to the original?

In the days of analogue audio tape, second, third and fourth generation copies were self-evident – namely, the audible tape hiss, wow and flutter caused by copying copies, by using machines with different motor speeds, and by minor fluctuations in power. Today, different file formats and things like compression and conversion can render very different versions of the ‘same’ digital content – for example, most mp3 files are highly compressed (for playback on certain devices) while audiophiles prefer FLAC. Although this is partly a question of taste, how do we know what the original should sound like? With a bit of effort, we can re-process an ‘original’ downloaded mp3 into our own unique ‘copy’ which may sound very different to the version put out by the record company (who probably mastered the commercially released mp3 from studio recordings created using high-quality audio processing and much higher sampling rates).

So, would the re-processed version be a forgery?

(*) Thanks to Richard Almond for his article on Digital Decay which I found very useful.