The Data Stack – Download the most complete overview of the data centric landscape.

This blog post offers an overview and PDF download of the data stack, that is, all the tools that might be needed for data collection, processing, storage, and analysis, through to integrated business intelligence solutions.

(Web) developers are used to stacks, the most prominent probably being the LAMP stack or, more recently, the MEAN stack. By contrast, I have not heard many data scientists talk about data stacks, maybe because we think that in a lot of cases all you need is some Python, a CSV, pandas, and scikit-learn to do the job.

But when we sat down recently with our team, I realized that we do use a myriad of different tools, frameworks, and SaaS solutions. I thought it would be useful to organize them in a meaningful data stack. I did not stop at the tools we are using: I sat down and started researching. It turned into an extensive list, a.k.a. the data stack PDF. This poster will:

  • provide an overview of solutions available in the 5 layers (Sources, Processing, Storage, Analysis, Visualization)
  • offer you a way to discover new tools and
  • offer orientation in a very densely populated area

Continue reading about The Data Stack – Download the most complete overview of the data centric landscape.

Generator for Google Tag Manager

In the past two years, tag management has evolved into the leading technology for managing tracking (and more) on the web. Meanwhile, Google Tag Manager has established itself as a key player in the field.

While tag management empowers online marketers to do “powerful things”, it also adds a layer of complexity to their practice of web analytics.

For the marketers who have never set up tag management, the new concepts are off-putting.
For the ones who have done it before, there is still no guarantee of quality.
For the ones who have done it more than a couple of times, most of the “clicking through” of a container can feel repetitive.

These are all good reasons to come up with a boilerplate container for Google Tag Manager. We first used it internally for client work, and soon decided to share it with the community in the form of a container generator.

Continue reading about Generator for Google Tag Manager

A recommender system for Slack with Pandas & Flask

Recommender systems have been a pet topic of mine for a long time, and recently I thought: why not use these things to make my life easier at Liip? We have a great community within the company, and most of our communication takes place on Slack. To the people born before 1990: Slack is something like IRC channels, only you use it for your company and try to replace email with it. (Whether replacing email with Slack is a good idea is quite a debated topic.)

So at Liip we have a Slack channel for everything: #machine-learning (for topics related to machine learning), #zh-staff (where Zürich staff announcements are made), #lambda (my team's channel) and so on. Everybody can create a Slack channel, invite people, and discuss interactively there. What I always found a little bit hard was «How do I know which channels to join?», since we have over 700 of those nowadays.

Wouldn’t it be cool if I had a tool that tells me: if you like #machine-learning, why don’t you join our #bi (Business Intelligence) channel? Since Slack does not have this built in, I thought: let’s build it, and show you how to integrate the Slack API, Pandas (a multipurpose data science tool), Flask (a tiny Python web framework) and Heroku (a place to host your apps).
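To make the idea concrete, here is a minimal sketch of one simple way such a recommendation could be computed: score each channel you have not joined by how many channels you share with its members. This is an assumption for illustration, not necessarily the approach of the full post. It uses plain Python rather than Pandas to stay dependency-free, and the function name `recommend_channels`, the user names and the sample memberships are all made up.

```python
from collections import Counter


def recommend_channels(memberships, user, top_n=3):
    """Suggest channels for `user` based on channel co-membership.

    memberships: dict mapping a user name to the set of channels they joined.
    Returns up to `top_n` channel names, best match first.
    """
    mine = memberships[user]
    scores = Counter()
    for other, channels in memberships.items():
        if other == user:
            continue
        # Weight each colleague by how many channels we already share.
        overlap = len(mine & channels)
        if overlap == 0:
            continue
        # Credit every channel they are in that we have not joined yet.
        for channel in channels - mine:
            scores[channel] += overlap
    return [channel for channel, _ in scores.most_common(top_n)]


# Hypothetical sample data, only for illustration.
memberships = {
    "alice": {"machine-learning", "lambda"},
    "bob": {"machine-learning", "bi", "lambda"},
    "carol": {"machine-learning", "bi"},
    "dave": {"zh-staff"},
}
suggestions = recommend_channels(memberships, "alice")
print(suggestions)
```

With real data, the `memberships` dictionary would be filled from the channel member lists returned by the Slack API.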

Continue reading about A recommender system for Slack with Pandas & Flask

What’s your twitter mood?

The idea

  • Analyze a user's tweets as positive, negative or neutral using machine learning techniques
  • Show how the mood of your tweets changes over time


  • Fun way to experiment with Sentiment Analysis
  • Experiment with language detection


Gathering data

We analyzed tweets from Switzerland, England, and Brazil. We took extra care to make sure our model does well on Swiss German text.

Make an awesome model in Node

We created a custom, fast natural language processor in Node.js. Why Node? It performs very well at run time when dealing with lots and lots of strings. We used unsupervised machine learning techniques to teach our model how Swiss German and English are written. Once we had a working model, we added a couple of other models using Bayesian inference to create an ensemble.

Make a nice front end

[Screenshot: Portuguese sentiment analysis]

Once we got our server working, we thought about improving the UI. We asked our User Experience specialist Laura to suggest improvements. See for yourself:


Problems and learnings

Language detection is needed to use the right sentiment model

Designing a model for Swiss German is especially hard: the language builds on German and incorporates a lot of French and Italian words. The spelling of words also changes from canton to canton. Add to that the fact that most people are forced to use abbreviations when writing tweets, and you get the whole picture of the challenge.

An accurate model needs a lot of data

In order to get a good result we needed to incorporate data from various people and different nationalities. The good thing is that the more you use our model the more accurate it gets.

Training data is available

One of the problems is that irony and sarcasm are hard for humans to recognize, especially in short tweets. So they are hard for a machine too.

If you want to play with our results in this machine learning experiment:

I would like to thank Andrey Poplavskiy for his “css love”, and Adrian Philipp for his huge contribution and encouragement towards this project.


Some of the comments we received were not so nice, but as always we are happy to receive any feedback.


Predicting how long the böögg is going to burn this year with a bit of eyeballing and machine learning.

So apparently there is the tradition of the böögg in Zürich. It is a little snowman made out of straw that you put up on top of a pole, stuff with explosives and then light up. Eventually the explosives inside the head of the snowman will catch fire and blow up with a big bang. Tradition has it that if the böögg explodes after a short time, there will be a lot of summer days; if it takes longer, we will have more rainy days. It reminds me a bit of Groundhog Day. If you want to know more about the böögg, you should check out the Wikipedia page.

Now people have started to bet on how long it will take for the böögg to explode this year. There is even a website that lets you bet on it, and you can win something. On first instinct I entered a random number (13 min 06 seconds), but then thought – isn’t there a way to predict it better than with our gut feeling? Well, it turns out there is, since we live in 2016 and have open data on all kinds of things. Using this data, what is the prediction for this year?

590 seconds – approximately 10 minutes.

We will have to wait until Monday to see whether this prediction was right, but I can show you now how I got to it with a bit of eyeballing and machine learning. (Actually, our dataset is so small that we wouldn’t need any of the tools that I will show you, but it’s still fun.)

Continue reading about Predicting how long the böögg is going to burn this year with a bit of eyeballing and machine learning.

Get started exploring Google Analytics data with Python Pandas

The latest release of Pandas (v0.17.1) has brought the deprecation of the Google Analytics data reader submodule. This deprecation is actually good news, since the submodule depended on packages that are not currently Python 3 compatible and was hard to get up and running, even under Python 2.7.

After updating my system to the newest versions of pandas, I had to find a new connector to fetch Google Analytics data, and found an advantageous replacement in the google2pandas module from Panalysis.

This blog post walks you through the setup of Pandas and google2pandas, and briefly introduces you to fetching Google Analytics data into Pandas dataframes for further exploration.

Continue reading about Get started exploring Google Analytics data with Python Pandas

IP anonymizing and its impact on city and country dimensions in Google Analytics

IP anonymization means blurring part of an IP address, generally to protect its owner’s anonymity. One way to do it is to zero the last octet of an IPv4 address, or the last 80 bits of an IPv6 address. That’s actually how Google Analytics does it.
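As a minimal sketch of that masking rule, here is how it could be written with Python's standard `ipaddress` module. The function name `anonymize_ip` is my own, and the code only illustrates the bit arithmetic described above. It is not Google's actual implementation.

```python
import ipaddress


def anonymize_ip(address: str) -> str:
    """Zero the last octet of an IPv4 address or the last 80 bits of an IPv6 one."""
    ip = ipaddress.ip_address(address)
    # Number of trailing bits to blank out, as described in the text above.
    drop_bits = 8 if ip.version == 4 else 80
    # Mask with ones in the kept (leading) bits and zeros in the dropped ones.
    mask = (2 ** ip.max_prefixlen - 1) ^ (2 ** drop_bits - 1)
    # Rebuild with the same class so a small IPv6 value is not read as IPv4.
    return str(type(ip)(int(ip) & mask))


print(anonymize_ip("192.168.1.77"))
print(anonymize_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348"))
```

Every address in the same IPv4 /24 block (or IPv6 /48 block) therefore collapses to the same anonymized value, which is why geolocation can only be as precise as that block.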

Now, anonymizing IPs might have some impact on the attribution of an IP to a country and city. How much? I often read that it does not impact geolocation “too much”, yet I never found studies about it.

A few months ago, I launched a small experiment: I tracked traffic to this very blog twice, once with IP anonymization enabled and once without.

I recently started analyzing this data. Here are the early findings of this experiment.

Continue reading about IP anonymizing and its impact on city and country dimensions in Google Analytics


Why Piwik matters now

Piwik is an open-source web analytics solution that has been around for quite some years now and has seen a recent revival with the advent of Piwik 2.

It offers all the necessary tools to capture, collect, process and analyse traffic data. Yes, it has an API; yes, fancy reports, segments, dashboards and goals; yes also to custom variables, …

Although I have immense respect for the product team behind Google Analytics, I must admit that Piwik brings three features that Google Analytics does not offer.

Continue reading about Why Piwik matters now

Machine learning on Google Analytics (part 2)

In my previous blog post, we saw that by using machine learning (ML) on Google Analytics (GA), we can go one step beyond traffic analysis. ML will shed light on correlations in your traffic data, let hidden rules emerge and help make predictions. A typical use case could be discovering customer segments for an eCommerce website.

After running a short experiment, we already discussed the requirements for ML and the limitations of doing it on GA.

In this article, I will describe the quickest way to test ML on your traffic data. For that, we will first need to transform GA statistical data into raw data compatible with ML. Then, using free ML software, we will import, visualize and transform the data to optimize predictions. Finally, we’ll compute a first decision tree for predicting the class of a visitor based on their characteristics.

Continue reading about Machine learning on Google Analytics (part 2)


Machine learning on Google Analytics

Data mining using machine learning (ML) is a field that has always fascinated me. You give a learning system millions of collected data points, and it hands back unexpected insights that can help you focus on what matters, make you drop things that don’t work, clear up the mystery of your customers’ behavior and potentially reorient your company strategy.

It’s compelling, but it makes you imagine the machine is doing the hard job … Of course not; the machine remains stupid, as always. Data mining is a long iterative process that requires a good load of intuition and a deep understanding of machine learning algorithms. However, because of its intrinsically empirical approach, it remains more accessible and more fun to experiment with than statistics. Sorry for feeding an already unfair preconception that has favored ML over statistics for decades…

I’ve been longing to put what I have learned in that field into practice with Google Analytics (GA) data. What more can ML offer in addition to the great analysis features Google Analytics provides? Is it even possible? Suspecting that GA is not designed for that, I started a short experiment to explore the possibility.

In this article – addressed to GA novices and ML enthusiasts – I will give a basic introduction about the requirements and benefits of ML, and list some constraints with GA.

In a future blog post (edit: here), I will share my findings on how to quickly get machine-learnable traffic data, describe the technicalities of the exploration, and provide the minimum you need to run your own experiments.

Looking for a job?

Let’s imagine a website, like the Liip blog you are on right now, and a visitor like you, reading this blog post. Since Liip almost always has some open positions, we want to make sure that, if you’re a developer, you won’t move on to your next data mining article before you’ve visited our open positions. If you are not a developer, then you might well be a future client and thus interested in our service offering and expertise.

ML could help us create a decision model to predict how much of a developer you are according to your attributes (e.g. region, browser, visiting hours).

Our Google Analytics account contains thousands of examples where a visitor ended up visiting our job page, or not. That’s food for an automatic learner!

Continue reading about Machine learning on Google Analytics