<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.2">Jekyll</generator><link href="http://localhost:4000/atom.xml" rel="self" type="application/atom+xml" /><link href="http://localhost:4000/" rel="alternate" type="text/html" /><updated>2025-08-24T21:35:47+02:00</updated><id>http://localhost:4000/atom.xml</id><title type="html">Science &amp;amp; technology experiments</title><subtitle>coding, data science, ml, cloud, greentech</subtitle><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><entry><title type="html">Lack of operational excellence threatens us, not AI</title><link href="http://localhost:4000/ml/2025/08/24/operational-excellence-vs-ai/" rel="alternate" type="text/html" title="Lack of operational excellence threatens us, not AI" /><published>2025-08-24T00:00:00+02:00</published><updated>2025-08-24T00:00:00+02:00</updated><id>http://localhost:4000/ml/2025/08/24/operational-excellence-vs-ai</id><content type="html" xml:base="http://localhost:4000/ml/2025/08/24/operational-excellence-vs-ai/"><![CDATA[<p>Can we talk about AI without talking about AI? Here we go. While traversing Germany, I booked a hotel for the night that fit my needs: outside a city and, most importantly, with self-check-in, as I was planning to stop at midnight, have a rest, and continue. Of course, there was a surprise. At 12:30 a.m., when I was trying to unlock the room with the keyless system, the lock wouldn’t budge. Another guest confirmed I was doing it right; the lock simply wasn’t responding. In such a situation, the fallback is to call a support number, wait on the line for 20 minutes, and get the issue resolved. The catch? The support line’s working hours are from 9 a.m. to 6 p.m. And if you run into a problem outside those hours, that’s your problem!</p>

<p>We can disentangle this situation into two problems - a technical one and a business one. On the technical side, even a junior engineer knows that systems crash, hit a bug, or “misbehave.” Therefore, you must have a workflow for dealing with such situations. If you are running a sensitive, customer-facing system, then you have to have someone on call for an overnight shift, which is definitely an extra expense. Secondly, you want to capture such events, analyze the cause, potentially interview angry customers, and resolve the issue, which in turn will reduce on-call expenses (here I assume that there are two different rates for active and passive on-call duties). For example, Amazon has a <a href="https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.coe.en.html" target="_blank">Correction of Error (COE)</a> process, which captures what happened, the impact on customers and/or the business, the root cause, the actions taken, and the lessons learned. Now, if you shut the blinds on your support at 6 p.m. while customers are actively checking in outside your support hours, sure, you are saving a lot of cost. And that brings us to the business problem.</p>

<p>Have you tried to book a hotel at 1 a.m.? If I ran a hospitality business, I would expect two types of customers at such hours: a hedonist type, very likely, and someone dumped by another hotel, very unlikely. As a business owner, you might think that you want to strive for a balance between your profit and customer satisfaction, which I now believe leads to mediocrity. You see, you can stomach some negative reviews and pay Booking.com or Google to be less affected by them. However, I really like the idea described in <a href="https://www.amazon.com/You-Are-Badass-Making-Money/dp/0735223130?&amp;linkCode=ll1&amp;tag=quantitativ0e-20&amp;linkId=fa3dd55b0602a5c176db2c80829129f1&amp;language=en_US&amp;ref_=as_li_ss_tl" target="_blank">“Badass: Making Users Awesome”</a> by Kathy Sierra, which says that your business should make the customer awesome or enable them to do epic things. A happy customer will share why they are happy or what makes them awesome, which will lead to healthy growth for your business. And letting a customer sleep under a bridge is the opposite of that.</p>

<p>The biggest marketplace in the world is actually driven by a similar idea — customer obsession. The customer is the center of the business, and everything is designed to make the customer feel like they are on an epic journey: you don’t like the product? We are happy to take it back. Was it damaged or delivered too late? No questions asked, we are happy to solve that. You might already be spoiled by their service, but believe me, some marketplaces turn resolving any of these issues into an Everest climb.</p>

<p>Now, Sam (OpenAI/ChatGPT), Dario (Anthropic/Claude), etc., are selling a fantasy that you can ignore the exact problems described above because an army of agents will take care of it, leaving you to just dream about how to make money and consume more agents. My bet is that if we suck at building businesses involving rule-based systems, we are going to multiply those failures with GenAI. An interesting nugget about the difference is this: in a rule-based system, programmers describe all potential outcomes. Any unwanted behavior is a bug or an exception, making the system’s behavior deterministic. With smart programmers and the right methodology, those errors can potentially be eliminated. But GenAI is a probabilistic model, so by its very definition, it will cause issues in a small number of cases.</p>

<p>I’m not a GenAI denier or an AI doomer, but I want to sell a different view of the near future—let’s say 2-10 years ahead. For the builders who can utilize GenAI tools, incorporate operational excellence practices into the process, and build an epic experience for their customers, this is actually the golden age. Sure, you might not be the subject of a flashy news story, like the ones about interns getting $100M+ offers, but businesses are ready to compensate you dearly for running their business flawlessly.</p>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="ml" /><summary type="html"><![CDATA[Your business's biggest threat isn't the rise of AI, but the lack of operational excellence, a problem that new technology will only amplify.]]></summary></entry><entry><title type="html">The power of prototyping</title><link href="http://localhost:4000/datascience/2024/12/16/the-power-of-prototyping/" rel="alternate" type="text/html" title="The power of prototyping" /><published>2024-12-16T00:00:00+01:00</published><updated>2024-12-16T00:00:00+01:00</updated><id>http://localhost:4000/datascience/2024/12/16/the-power-of-prototyping</id><content type="html" xml:base="http://localhost:4000/datascience/2024/12/16/the-power-of-prototyping/"><![CDATA[<p>Why would a company or person be interested in creating a prototype? The answer is simple - to test a new idea, improve a product, or find a better or more optimal way to solve a business problem. According to Statista, 56% of organizations worldwide expect generative AI to enhance these aspects, though many likely don’t have an exact plan for implementation. In this blog post, I want to share my experience of building prototypes, quickly testing ideas, and, most importantly, innovating.</p>

<p>You would be right to question my authority on prototypes, as I don’t have a lifetime of experience. However, I spent over two years working on the AWS prototyping team, where I engaged with dozens of companies from various verticals and of different sizes - from startups to regional banks and beyond. So, I have a couple of anecdotes from that experience.</p>

<h3 id="the-good-one">The good one</h3>

<p>One example in particular highlights why it’s important to test and validate ideas. We joined an ongoing customer project with the main goal of onboarding them onto AWS ML offerings, especially deep learning. Prior to our engagement, the sales team had pitched that an ML model could be built and the process automated by eliminating the human in the loop - a low-level role filled by students. The sales play was that $70,000 in expenses could be replaced by AI, even if the AI might have lower accuracy.</p>

<p>We spent some time enabling the team to build the models and use the tools on the AWS platform, but you could feel that the customer’s engineering team was getting frustrated - the model was giving us 80% accuracy, while a student could do the job with almost 100% accuracy. I relayed the message to the sales team and management that the customer’s tech team wasn’t really happy just getting educated about cloud and AI capabilities. However, the leadership teams on both sides were very happy about this progress.</p>

<p>Nonetheless, I had a hunch that there might be a simple solution avoiding deep learning altogether, so I scoped some time to sift through research papers for a potential solution. A few days later, a colleague happily pinged me saying that he had found a solution - 3 lines of code based on the <a href="https://opencv.org/">opencv</a> library. Most importantly, we were able to deploy our code into a serverless function (AWS Lambda) and avoid the expensive compute instances used for deep learning, spending only $3 vs. the original $70,000.</p>
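<p>The actual task isn’t described in this post, so the snippet below is purely hypothetical - a sketch of the shape such a solution can take: a few lines of classical computer vision wrapped in an AWS Lambda-style handler, with no GPU in sight.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch - the real task from the engagement isn't described here.
import cv2

def handler(event, context=None):
    # The three core lines: load the image, binarize it, count the blobs.
    img = cv2.imread(event["image_path"], cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return {"objects_detected": len(contours)}
</code></pre></div></div>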

<p>Was the customer happy? Very! Was the sales team happy? Hell, no! What’s great about Amazon is that everything revolves around customer obsession, and the long-term game is that by saving money there, we returned that budget to the customer for other experiments and expansion. A takeaway from this experience is that without extra effort and experimentation, the customer would have kept spending $70K on what could be automated with a library from 1999.</p>

<h3 id="prototyping-today">Prototyping today</h3>

<p>I agree with <a href="https://en.wikipedia.org/wiki/Andrew_Ng">Andrew Ng</a>, who was a keynote speaker at an internal ML/AI conference: nowadays, it’s very cheap and quick to validate an idea. What he meant is that with advancements in AI and ML, we can build a prototype in a week or even a day, which is an amazing thing. Now the catch is that other parts of the project might still be lagging.</p>

<p><img src="/images/innovation_timeline.webp" alt="Prototyping takes much less time than it used to" /></p>

<p>I wrote <a href="https://dzidas.com/ml/2024/10/22/implementing-data-science-projects/">a blog post</a> where I shared my mental model of running data science projects and implicitly mentioned that preparation and scoping will take most of your time in a project. GenAI might help you with writing a scoping doc or a proposal, but the savings won’t be significant. Meanwhile, not only can the prototyping stage be shortened significantly, but you can also try 10x-100x more ideas and choose the best one from an enormous pool of good ones.</p>

<p>To contrast possibilities today with those from a few years back, here is a relevant example. I had a prototyping engagement with a customer who needed to extract and structure data from reports of different companies. After four weeks of heavy development with cloud tools such as Amazon Textract, which supposedly helps to extract entities and specific keywords like profit or expenses, we concluded that the approach was still too fragile to be automated and the added value was not significant.</p>

<p>Nowadays, with the help of LLMs, unstructured data has a higher chance of being structured, and additional insights can be extracted more easily. The lesson to take away is that not every prototype will end with business growth, and stakeholders need to be prepared for that. It’s a risky investment decision - you pull in a team or more for a week and burn IT resources, which can add up to $10K-$50K a week or more. And your probability of success should be below 100%; otherwise, why prototype?</p>

<p>The cost of not doing it might be much greater, but this cost is often invisible to management or decision-makers. For example, in this case, the company needed to test new ground because the speed at which it parsed documents was core to its business.</p>

<h3 id="constant-push-for-innovation">Constant push for innovation</h3>

<p>Let’s zoom out from prototyping - it is just a small piece in the grand scheme of innovation. How do you make sure that you, your team, or the company with which you are engaging is constantly innovating?</p>

<p>For innovation at Amazon, I would emphasize two high-impact factors: an influx of new people and the Working Backwards innovation framework. For the former, a person without historical baggage will be keen to try an idea that may have failed 10 times already, simply because they don’t know its history. Meanwhile, due to the rapid pace of technological advancement, the timing now might be right for success.</p>

<p>I recall a colleague suggesting setting up a phone line with a voice message box for internal usage, where an issue would be dictated to a machine, analyzed and dispatched to the right person for resolution. Back in 2018, every step of this idea, such as setting up an international line, converting voice to text, or classifying the message, was either a complete blocker (e.g., a person’s accent in a voice message) or months of work (e.g., text classification with NLP tools). If someone is assigned a similar task today, it would be a walk in the park - LLMs to the rescue!</p>

<p>The lesson here is to be conscious of your biases, try to break out of your bubble, and review previous failures; maybe now is the right opportunity.</p>

<p>Now, there are many frameworks to help with innovation and the creation of new products. What’s special about the Working Backwards and PR/FAQ framework is that it’s customer-centric, pushing you to think about the customer rather than the product. Basically, you move through the following stages:</p>

<ul>
  <li>Listen. Understand the customer’s perspective. We are interested here in pain points, challenges, desires, etc.</li>
  <li>Define. The inputs from the previous stage translate into a clear problem or an opportunity - our North Star that will guide us through the project.</li>
  <li>Invent. Here we look for the best solutions to the problem, dream big, and involve others in the development process.</li>
  <li>Refine. At this stage, we want to be crystal clear about how the idea works. Here’s an interesting part: we write a mock press release, not for customers, but to peek into the future when the product is developed. This helps clarify how customers will experience the product and allows for internal communication and discussion. The main document should not be longer than one page, but the appendix will have an FAQ section, which is why it’s called PR/FAQ.</li>
  <li>Test. Finally, we validate with experiments, identify the issues, and continuously iterate.</li>
</ul>

<h3 id="connecting-the-dots">Connecting the dots</h3>

<p>Now that we’ve discussed prototyping and innovation, how are these applied in practice? Let me share an example. A finance industry customer was interested in pushing the boundaries with innovations, so we allocated a couple of days for business and tech teams to learn about the innovation framework and think big. As a result, we scoped one big, challenging idea for a prototype. A prototype was needed because the idea was ambiguous, with some known unknowns and many unknown unknowns. For example, during the prototype development, we learned that the data possessed by the company was insufficient as counterfactuals were missing, i.e., what would be the outcome if the world were slightly different.</p>

<p>As planned, we built data pipelines, created a user interface, identified challenging or missing parts, and presented a minimum viable product. Through this exercise, we found the project was far from delivering a positive outcome, which was totally fine, as we learned about the gaps and a potential path forward. But a big surprise to me was when the business side took that outcome and broadcast it as a big win, with an international press release about cooperation and pushed boundaries. The lesson here is that there are always different points of view, and one should be ready to accept that.</p>

<h3 id="the-difference">The difference</h3>

<p>For data science projects, you might wonder which approach is better - prototyping or the scientific approach? The former came from product design and engineering fields and is driven by agile methodology and rapid development. The latter requires rigorous methodology based on statistical principles, peer review, hypothesis testing, and reproducible results. In my experience, the choice depends on the team and organization you’re working for. The scientific approach is less common in business environments when it’s not a requirement (unlike in the pharmaceutical industry, for example).</p>

<h3 id="final-word">Final word</h3>

<p>Probably everyone agrees that innovation is a must for a company’s or country’s survival. However, there is an edge case where this is not true - there are companies, no more than a dozen I would say, whose strategy is to “keep the lights on” or basically extract profits. We know that there are different stages in a company’s lifecycle, and if a company has entered this particular stage, you’re unlikely to succeed with innovation there. The reasons for a company to be in this stage might be many - investors interested in value extraction rather than growth, a lack of ownership mentality among the owners, lack of investment, etc. Regardless of the reasons, if the leadership is not interested in innovation, pursuing it has a lower, if any, probability of success.</p>

<p>Good luck with prototyping and innovations! And if you’re interested in discussing this in more detail or running an innovation project, <a href="mailto:blog@dzidas.com">let’s connect</a>.</p>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="datascience" /><summary type="html"><![CDATA[Learn practical strategies for innovation, prototyping, and creating data-driven products from AWS experience. Explore the Working Backwards framework and real-world examples of successful prototyping.]]></summary></entry><entry><title type="html">Expected returns under a Republican president</title><link href="http://localhost:4000/datascience/2024/12/05/republican-presidents-stock-market-returns/" rel="alternate" type="text/html" title="Expected returns under a Republican president" /><published>2024-12-05T00:00:00+01:00</published><updated>2024-12-05T00:00:00+01:00</updated><id>http://localhost:4000/datascience/2024/12/05/republican-presidents-stock-market-returns</id><content type="html" xml:base="http://localhost:4000/datascience/2024/12/05/republican-presidents-stock-market-returns/"><![CDATA[<p><img src="/images/presidents.webp" alt="4-year S&amp;P500 returns during presidential terms" /></p>

<p>In this post I want to test an internet myth: that under a Republican president the stock market underperforms and the returns are negative. It felt like nonsense, but I had to validate it myself. Turns out, it is… nonsense. Since 1969, only 3 presidential terms, all Republican, have ended with negative returns over the term. But 5 out of 8 Republican terms ended with positive returns.</p>

<p>You might argue that 4 years is a long period and bearing negative returns might be painful. You might be right, but when should you sell and buy back? As the chart below shows, the stock market under Bush Jr.’s second term was doing great until it tanked at the end in 2008-2009. And it is more of a coincidence that the recoveries under Nixon and the first term of Bush Jr. started at similar times.</p>

<p><img src="/images/negative_presidents.webp" alt="Negative S&amp;P500 returns during Bush's and Nixon terms" /></p>

<p>Let’s test another hypothesis - maybe a president needs time to change the course of the economy and it takes a while to kick in? How would the results look if we delay each investment period by 6 months? As per the chart below, we have some changes for each term, but the overall picture doesn’t change - a Republican president doesn’t imply a negative return.</p>

<p><img src="/images/presidents_lagged.webp" alt="4 years S&amp;P500 returns during presidencial terms, lagged 6 months" /></p>

<p>I’m surprised by how much growth we observe during any term, but it makes sense as it spans four years. Meanwhile, due to active economic crisis management by the FED, the span of a bear market has shortened to a year or two. So, maybe instead of looking at who was the president, we should be looking at who was the chairman of the FED.</p>

<h3 id="the-fed">The FED</h3>

<p>The Federal Reserve System (FED) was established in 1913 with responsibilities to implement and execute monetary policy, supervise and regulate banks, and maintain financial system stability. Interestingly, until the 2008 crisis and FED Chairman Bernanke, its primary levers to influence markets were interest rates, open market operations, and forward guidance. During that crisis, a new approach was added - Quantitative Easing (QE), which mainly included large-scale asset purchases and liquidity injections into the financial system. While the FED had engaged in large-scale asset purchases before, QE was an exception due to its scale. Fun fact: it’s the FED that can increase the amount of money circulating in the system - or, put differently, print money.</p>

<p>By comparison, a president has tools to influence short-term economic growth, like import tariffs, negotiating trade deals, using executive orders, etc. But the president’s main responsibility is the long-term impact on the economy. Most importantly for this post, the president nominates the FED chair, but past chairs have managed to resist demands from presidents due to the FED’s high independence. There have been cases where the same chairman, nominated by one president, was kept by another, regardless of political party affiliation. For example, the current chairman, Jerome Powell, was originally nominated by Obama as a FED governor, picked as chair by Trump, and then kept on by Biden.</p>

<p><img src="/images/fed_chairs.webp" alt="S&amp;P500 returns under different FED chairpersons" /></p>

<p>The FED chairpersons’ terms do not exactly match presidential terms, resulting in different returns compared to those of the presidents. Nevertheless, the picture is clear - since 1970, only 3 terms resulted in negative returns despite very different economic challenges and unique management styles of the chairpersons. For example, chairman A. Burns dealt with the OPEC embargo in 1973-74, which led to double-digit inflation. A. Greenspan faced the dot-com bubble burst and 9/11 terrorist attacks, while B. Bernanke was tested by the subprime mortgage crisis. To summarize, the economy and stock returns are influenced by multiple forces, and just having a Democratic president is not enough to make the road less bumpy.</p>

<p>This post is not advice to buy or sell financial assets, just an analysis of historical data. To complicate the question even further, ask yourself: what would be an alternative to your investment during a Republican term?</p>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="datascience" /><summary type="html"><![CDATA[There's a myth that the stock market underperforms under Republican presidents. Does data support this claim? We analyze S&P 500 returns to test this popular belief.]]></summary></entry><entry><title type="html">How to run data science projects</title><link href="http://localhost:4000/ml/2024/10/22/implementing-data-science-projects/" rel="alternate" type="text/html" title="How to run data science projects" /><published>2024-10-22T00:00:00+02:00</published><updated>2024-10-22T00:00:00+02:00</updated><id>http://localhost:4000/ml/2024/10/22/implementing-data-science-projects</id><content type="html" xml:base="http://localhost:4000/ml/2024/10/22/implementing-data-science-projects/"><![CDATA[<p>In this article, I will outline my mental model for running a science project. Specifically, I’m referring to data or applied science projects, drawing from my experience of over 9 years at AWS and Amazon. You might argue that in agile environments like startups or smaller companies, the approach could differ, but aside from an additional layer of hierarchy, I don’t anticipate significant deviations.</p>

<p>The most crucial question to ask at the start of a project is whether the business problem is well-defined. A well-defined problem could be something like: what is the impact of a sales promotion? Or, what is the value of adding a new feature? However, as a scientist, you’re more likely to encounter vague business problems — if any are presented at all. For instance, you might be introduced to a business organisation and asked to figure out how you can make an impact on profitability.</p>

<p>Early in my career at AWS, I was introduced to a sales leader who wanted help with data science. We had a few meetings where I tried to identify the business problems they aimed to solve. To my surprise, they didn’t have any clear issues in mind and simply hoped I would work some “magic.” In hindsight, I should have proposed concrete projects focused on improving profitability, rather than an optimization project. It took me a few years to realize that a rapidly growing sales organization is primarily concerned with increasing profit by selling more, not through optimization efforts like churn analysis.</p>

<p>To avoid similar challenges, I developed the chart below, which outlines four key areas involved in running a science project. The red section represents the stakeholders and sponsors, the blue section covers scoping, the green defines the metrics and the definition of success, and the yellow focuses on what’s needed to implement and execute the project. Some might argue that only the right side of the diagram, particularly the yellow box, defines a science project, but in my experience, that’s usually the “easy” part.</p>

<p><img src="/images/science_flow.webp" alt="A flow diagram for a science based project" /></p>

<h3 id="stakeholders">Stakeholders</h3>

<p>It’s quite rare for business leaders to present a clear business problem or a well-defined question. In large organizations like Amazon, the left side, comprising the red and blue squares, tends to be where senior or principal roles are deployed, as many problems are vague or poorly defined. You might argue that handling stakeholders and scoping projects are responsibilities typically assigned to managers or product managers. However, in my experience, individual contributors in tech roles are often heavily involved or expected to take the lead, particularly in FAANG-type companies.</p>

<p>When identifying stakeholders, I usually start from the ground up—finding a team that needs a solution, building a plan, and scaling from there. The top-down approach may target higher impact but often has a much higher failure rate due to added layers of hierarchy and tends to result in multi-year projects.</p>

<h3 id="scoping-a-project">Scoping a project</h3>

<p>Once stakeholders are identified, the next step is to assess the potential impact of the project. You want to avoid working on projects whose return is lower than the cost of your involvement. Additionally, this helps you prioritize projects if you’re choosing between several. One critical question during scoping is whether there are already deployed solutions or ones that can be acquired. This can be tricky, as you might receive biased responses claiming no solution exists or that it’s inadequate. A shiny new project is always more appealing to those involved, but repurposing or improving an existing solution can lead to significant cost savings and knowledge transfer.</p>

<p>When it comes to project sponsorship, the discussion can quickly turn sour. However, for a project to continue successfully, you need to ensure that your stakeholders, or “customers,” are truly willing to invest. Don’t assume that the only cost is your time—it’s a multi-layered issue. Can you access the necessary data, or do you need support from a partner team? What does the testing phase entail, and do you have the resources allocated for it? If the project succeeds, what will the continuation look like? Do you have the means to take it to production and maintain it? In my experience, the latter is often overlooked until it becomes a significant problem.</p>

<p>One of the common pitfalls in data science projects is aligning on the results. By this, I mean that stakeholders are eager to use the project’s output to validate their assumptions—if the results align with their needs. However, when the results conflict with their expectations, the adoption of those findings can be quickly “killed.” For example, we once developed a methodology to assess the impact of specific business actions. The quantitative approach revealed that not every action aligned with qualitative measurements, which are often prone to misinterpretation. As a result, we faced significant resistance when attempting to scale the project across the organization.</p>

<h3 id="metrics-and-definition-of-success">Metrics and definition of success</h3>

<p>At this stage, your project is well-defined and outlined on paper. You can begin assembling a team to execute it, but before diving in, you need to define what success will look like. By that I mean you need to look into the future and estimate what a good outcome looks like. For example, you might decide that your model should perform 10% above a baseline, or that your new method should be able to estimate the causal effect of a specific action. A clear definition is like the North Star that will lead you to the goal.</p>

<p>Next come the metrics and a plan for the experimentation phase. This means establishing a baseline for your model and planning how the experimentation will be conducted. You should determine the methods you’ll use to test your model (e.g., A/B testing for online models) and set clear expectations for the impact you hope to achieve. It’s crucial that your stakeholders are heavily involved at this point, and they must commit resources to support the process. In most cases, this means getting their agreement on the exact steps that will be taken.</p>

<p>For example, I once built a highly promising model, but testing its performance required the involvement of multiple people. Initially, everyone was motivated to follow the AI-enabled recommendations, but after some time, they reverted to their usual routines. It took significant resources to get the project back on track. In hindsight, I should have had clear metrics in place and handed over the responsibility for test implementation to the business unit manager.</p>

<h3 id="execution">Execution</h3>

<p>As I mentioned earlier, this part of the project is the easiest in terms of delivery. While you may encounter difficult scientific problems to solve, the three pillars discussed previously (stakeholders, scoping, and experimentation) provide a stable foundation for successful execution.</p>

<p>I purposely left the boxes in the lower right square unconnected because execution can happen either sequentially or in parallel. You might want to ensure the project has solid scientific grounds before progressing, or you might prioritize checking data availability and quality upfront. Flexibility here allows you to adapt the execution based on the nature of the project.</p>

<p>For tracking the project’s progress, weekly or biweekly check-ins work well, especially when the milestones are clear. Usually, the first iteration results in a quick prototype, which helps to realign with stakeholders before diving deeper into the work. This way, you stay on track with the project’s goals and can adjust as needed without wasting time or resources.</p>

<p>To wrap things up, I realize my mental model for running science projects is shaped by my own experiences, and it might not apply to every situation. I’m open to hearing different perspectives and learning about other approaches. If you have ideas or suggestions you’d like to share, feel free to reach out to me at <a href="mailto:blog@dzidas.com">blog@dzidas.com</a>.</p>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="ml" /><summary type="html"><![CDATA[The article describes a framework on how to run and implement a data science project]]></summary></entry><entry><title type="html">Why are sellers leaving profit on the table?</title><link href="http://localhost:4000/ml/2024/09/03/money_on_a_table/" rel="alternate" type="text/html" title="Why are sellers leaving profit on the table?" /><published>2024-09-03T00:00:00+02:00</published><updated>2024-09-03T00:00:00+02:00</updated><id>http://localhost:4000/ml/2024/09/03/money_on_a_table</id><content type="html" xml:base="http://localhost:4000/ml/2024/09/03/money_on_a_table/"><![CDATA[<p><img src="/images/money_on_table.webp" alt="Sellers left bags of money on a table" /></p>

<p>In the last decade or so, the online retail business has become completely liberalized - anyone with a product idea can start selling with minimal investment. Typically, the process unfolds as follows: source an idea, which might be an improvement on an existing product or a copycat of a bestselling item from other marketplaces such as Temu or Alibaba, then purchase or manufacture in China and ship it to a local warehouse. With the rise of smart manufacturing, local production of some products has become more cost-effective, and there are instances where sellers prefer labels that reference local origins. However, despite these developments, China continues to be a dominant global manufacturing hub.</p>

<p>With the product in hand, a seller embarks on the challenging journey of product promotion. To accelerate sales, a seller allocates a budget for promoting the product directly on a marketplace or uses Google and/or social platforms to drive customers to their page and make sales. This creates a major objective: reducing the cost of acquiring a new customer while maintaining or increasing sales volume. When you speak to sellers, this strategy is reiterated like a mantra. However, other options exist, which could lead to higher profit margins, though they may not be immediately apparent to sellers.</p>

<h3 id="test-everything">Test everything</h3>

<p>Experimentation has always been integral to the marketing and sales process. However, in the pre-ecommerce era, changes in packaging, branding, or products took much longer compared to the present, where everything happens in the blink of an eye. Despite this speed, sellers, particularly those with a few successful products, often show resistance to change, whether in pricing or the context of a product. The reasoning is straightforward: if it sells, don’t fix it. Nevertheless, A/B testing, as simple as changing a product background color or text font, offers the opportunity to experiment on a small scale, thereby minimizing the risk of disrupting overall sales while potentially enhancing them. Such tests can be conducted with 1-10% of customers, and if a change results in lower conversion or margin, the situation can be swiftly reverted. Implementing A/B testing is not overly complex. For example, you can create a page nearly identical to the main product page, alter a few details, and direct every tenth customer there. The critical aspect is to gather sufficient evidence that the change leads to a positive impact, e.g. a better conversion rate or higher sales.</p>
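<p>As a minimal sketch of that evidence-gathering step, assuming the statsmodels package and made-up visitor counts, a two-proportion z-test is enough to decide whether the variant page really converts better:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch; the visitor and conversion counts are made up.
from statsmodels.stats.proportion import proportions_ztest

conversions = [180, 25]   # control page, variant page
visitors = [9000, 1000]   # roughly every tenth customer sees the variant

stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
# With these counts the p-value is large (~0.29), so the 2.5% vs. 2.0%
# conversion difference is not yet sufficient evidence - keep collecting data.
</code></pre></div></div>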

<p><img src="/images/multi_arm_bandit.webp" alt="Multi-arm bandit is ready for an action" class="align-left" />
For those looking to elevate their experimentation efforts, the Multi-Arm Bandit method is worth considering. Unlike A/B testing, where routing allocations are fixed until enough data is collected to make a decision, the Multi-Arm Bandit method dynamically allocates traffic based on the most recent performance. It occasionally makes random selections to continue exploring the ever-changing environment. Furthermore, it offers a more efficient approach for exploring multiple options beyond just A and B. Implementing this method may require assistance, but it is particularly valuable in dynamic environments where opportunity cost is a crucial factor.</p>
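<p>To make the idea concrete, here is a minimal epsilon-greedy sketch of a Multi-Arm Bandit; the variant conversion rates are made-up stand-ins for real page variants:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal epsilon-greedy sketch; the conversion rates are made up.
import random

true_rates = [0.020, 0.024, 0.018]  # unknown to the seller in reality
pulls, wins = [0] * 3, [0] * 3
epsilon = 0.1                       # share of traffic reserved for exploration

for visitor in range(100_000):
    if 0 in pulls or epsilon > random.random():
        arm = random.randrange(3)   # explore a random variant
    else:                           # exploit the best-performing variant so far
        arm = max(range(3), key=lambda a: wins[a] / pulls[a])
    pulls[arm] += 1
    wins[arm] += true_rates[arm] > random.random()

print("traffic share per variant:", [round(p / sum(pulls), 3) for p in pulls])
</code></pre></div></div>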

<h3 id="price-discovery">Price discovery</h3>

<p>The methods outlined earlier are important for experimentation, but they don’t specify the optimal price at which to conduct such experiments. In most cases, we want to optimize our sales strategy for one of three criteria - profit, revenue, or market share.</p>

<p>In scenarios where you have historical sales data across at least two different price points, you are well-positioned to estimate demand elasticities. This concept, developed in the late 19th century, is based on a straightforward principle: when the price rises, the volume of sales falls for almost any good, but it falls more for some than for others. Conversely, reducing prices often leads to an increase in sales volumes. Thus, by observing how changes in price affect sales volumes, you can begin to understand the price elasticity of demand for your product.</p>

<p>The chart below illustrates how we can estimate price elasticity using five data points of price and sales volume. A simple formula allows us to estimate demand at different price points.</p>

<p><img src="/images/elasticity.webp" alt="An example of price elasticity of demand" /></p>

<p>From the example above, let’s take a price decrease from 18.43 to 15.20, where the quantities sold were respectively 88 and 95. Now we can estimate the elasticity:</p>

\[Elasticity = \frac{(Quantity_{1} - Quantity_{2}) / Quantity_{1}}{(Price_{1} - Price_{2})/Price_{1}} = \frac{(88 - 95)/88}{(18.43 - 15.20)/18.43} = -0.45\]

<p>The demand elasticity is negative for virtually all goods, as it reflects the inverse relationship between price and quantity demanded—when the price of a product increases, the quantity demanded typically decreases, and vice versa. In our scenario, the product’s demand is inelastic—lowering the price by 1% would increase sales by only 0.45%, while raising the price by 1% would decrease sales by the same amount. Therefore, the optimal choice is to maintain higher price levels.</p>
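<p>The napkin math above is easy to wrap into a couple of helper functions, which also let us project demand at a price we haven’t tried yet:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of the elasticity formula above, using the same two data points.
def elasticity(p1, q1, p2, q2):
    return ((q1 - q2) / q1) / ((p1 - p2) / p1)

def projected_quantity(q, elast, pct_price_change):
    # first-order projection: a 1% price change scales demand by elast percent
    return q * (1 + elast * pct_price_change)

e = elasticity(18.43, 88, 15.20, 95)
print(f"elasticity: {e:.2f}")             # -0.45, i.e., inelastic demand
print(projected_quantity(88, e, -0.10))   # expected units if we cut the price 10%
</code></pre></div></div>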

<p>In general, inelastic demand, where the price elasticity is above -1, indicates that customers won’t significantly change their demand if prices increase. For example, necessities like food, water, and electricity are required regardless of price. Similarly, products with addictive components, such as tobacco and alcohol, tend to have inelastic demand. Another case is brand loyalty, with Apple being a perfect example—their products exhibit highly inelastic demand, allowing prices to increase without losing their customer base.</p>

<p>On the contrary, if a price elasticity is -2 or lower, a different approach is needed. In this case, lowering the price will lead to more sales; however, there is a risk that the customer base might quickly shift to a competitor if their price is lower.</p>

<p>So far, we have only considered two price points for a single product, allowing you to estimate demand elasticities with nothing more than a napkin. When dealing with a dozen products and multiple data points, Excel or Python/R/data science tools will prove invaluable. However, scaling up to a larger product portfolio will require the development of a custom model, which offers additional advantages. Firstly, it allows us to factor in the seasonal fluctuations of a product, such as a sales spike in December or slow summer months. Moreover, it enables us to estimate demand elasticities at a higher level, for instance, by product category. The benefit here is that we can apply the elasticities derived from a category to new products within that category. For example, if you have a new herbal tea product, the elasticities of the tea category can be utilized for this new offering.</p>
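<p>As a hedged sketch of what such a custom model can look like - assuming the statsmodels package and an illustrative dataset with price, units, category, and month columns - a log-log regression with a price-by-category interaction yields one elasticity per category, while month dummies absorb seasonality:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A hedged sketch; "sales.csv" and its column names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

sales = pd.read_csv("sales.csv")  # columns: price, units, category, month

model = smf.ols(
    # C(month) soaks up seasonality (December spikes, slow summers);
    # the interaction gives a separate elasticity for each category.
    "np.log(units) ~ np.log(price):C(category) + C(category) + C(month)",
    data=sales,
).fit()

print(model.params.filter(like="np.log(price)"))  # per-category elasticities
</code></pre></div></div>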

<h3 id="examples-of-application-of-price-elasticities">Examples of application of price elasticities</h3>

<ul>
  <li><strong>Promotions</strong>: Businesses often launch promotions to increase market share, enhance product recognition, or boost sales volume. A key question in such scenarios is: How substantial should the discount be? What is the expected sales volume at the new price point?</li>
  <li><strong>Inventory Clearance</strong>: Similarly, a campaign may be initiated to accelerate sales for the purpose of freeing up storage space. In this context, determining the optimal price is crucial. What price should be set to achieve the desired results quickly and efficiently?</li>
  <li><strong>Stock Depletion</strong>: There may be instances where stock levels drop quickly, posing the risk of running out of inventory. In such situations, increasing prices can help decelerate the sales volume, allowing for time to restock. The important question, however, is by how much to raise the price to decelerate sales without completely eliminating customer demand.</li>
</ul>

<h3 id="optimal-prices-for-a-portfolio-of-products">Optimal prices for a portfolio of products</h3>

<p>Price elasticities allow us to simulate demand for each product at various price points, enabling the development of a pricing strategy that optimizes for profit, revenue, or both across our entire product portfolio. This approach can also be applied to optimize other key metrics, such as market share or any other objective.</p>

<table>
  <thead>
    <tr>
      <th>Price</th>
      <th>Sales</th>
      <th>Revenue</th>
      <th>Profit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>14</td>
      <td>106</td>
      <td>1479</td>
      <td>140</td>
    </tr>
    <tr>
      <td>15</td>
      <td>101</td>
      <td>1511</td>
      <td>144</td>
    </tr>
    <tr>
      <td>16</td>
      <td>96</td>
      <td>1533</td>
      <td><strong>150</strong></td>
    </tr>
    <tr>
      <td>17</td>
      <td>91</td>
      <td>1545</td>
      <td>145</td>
    </tr>
    <tr>
      <td>18</td>
      <td>87</td>
      <td><strong>1566</strong></td>
      <td>130</td>
    </tr>
    <tr>
      <td>19</td>
      <td>79</td>
      <td>1501</td>
      <td>125</td>
    </tr>
    <tr>
      <td>20</td>
      <td>77</td>
      <td>1543</td>
      <td>120</td>
    </tr>
  </tbody>
</table>

<p>The fictitious example above illustrates profit and revenue levels at various price points for a single product. By applying this approach across all products, we can construct the following graph:</p>

<p><img src="/images/optimal_profit.webp" alt="Optimal pricing for your product portfolio" /></p>

<p>In this example, the maximum profit is $30M. If maximizing profit is our sole objective, we can work backwards to extract the optimal price for each product and implement this pricing strategy. However, if increasing revenue is also a priority, you can move along the curve to the right, allowing for higher revenue with only a minimal sacrifice in profit. For example, at $40M in revenue, your profit will be $2.5M lower than at the $35M revenue point, where profit peaks at $30M. In this case, you would be gaining an additional $5M in revenue.</p>
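<p>Mechanically, the search behind such a chart is a grid evaluation per product; the sketch below uses a constant-elasticity demand curve with a made-up anchor point, elasticity, and unit cost:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of the per-product grid search; all numbers are made up.
def demand(price, q0=96, p0=16.0, elast=-2.0):
    # constant-elasticity demand curve anchored at the observed point (p0, q0)
    return q0 * (price / p0) ** elast

unit_cost = 8.0
rows = [(p, p * demand(p), (p - unit_cost) * demand(p)) for p in range(14, 21)]

print("revenue-maximizing price:", max(rows, key=lambda r: r[1])[0])  # 14 here
print("profit-maximizing price:", max(rows, key=lambda r: r[2])[0])   # 16 here
# Repeating this per product and summing the chosen rows traces out the
# profit/revenue frontier shown above.
</code></pre></div></div>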

<h3 id="challenges-of-applying-price-optimization-techniques">Challenges of applying price optimization techniques</h3>

<p>When building a portfolio with optimized prices, we often assume that we can set the price independently of the market. However, this is not always the case. One challenge is competition. If you’re selling an identical or very similar product as a competitor, customers will likely prioritize the lowest price if all else is equal (such as quality, delivery speed, and brand reputation). In such a scenario, lowering your price might help you outcompete the rival if the product has high elasticity (elasticity less than -1). However, raising the price can give your competitor an advantage, allowing them to capture more sales. Similarly, for inelastic products, it can be difficult to increase prices in a competitive market without losing customers to competitors.</p>

<p>So far, we’ve assumed a linear relationship between price and demand. While a log-log model can better capture this relationship, experience shows that deep discounts (e.g., more than 20% off) can attract a significant increase in customers, deviating from the expected demand curve. As a result, price elasticity estimates are more reliable near current price points, but uncertainty increases as prices move further away from these points. This doesn’t entirely invalidate the use of price elasticities, but it is an important consideration to keep in mind.</p>

<p>Finally, if you have limited data points or only a few products, relying on price elasticities and optimization might be premature. In such cases, simple deliberate steps, such as experimenting with different prices, can provide the data needed to apply the techniques discussed above effectively.</p>

<h3 id="whats-next">What’s next?</h3>

<p>In this blog post, I intentionally focused less on the modeling and implementation aspects to introduce the topic more broadly. If you find price optimization intriguing and want to explore it further, I highly recommend the following book:</p>

<p><img src="/images/price_opt.webp" alt="" class="align-left" /> <a target="_blank" href="https://www.amazon.com/Pricing-Revenue-Optimization-Robert-Phillips/dp/1503610004?crid=2680LSPOM5N66&amp;dib=eyJ2IjoiMSJ9.HQhjU-2rUpa_1Z6zKvlXNf-La03mAG95L_luWlQQiGCdVWTsobWsEa2NTRNZR9_LXPPNtyXG2rgDZ26nMkK6yEkxKRofU0os-Mw8b-U6Fi5XYQ-93O-V_quXIFiSUT_q6kS5cV2V8CfLDKGayr56RF2s-xUbUokQmYO_TXho2LmHJTkexudwusKrbqFGeowo0190hrBorX-qd3uPpi-wVsoKoS20sGXxlUxaMYu8wQc.H4S8o2-H_vvfw8O-WX6ewlVJR4-aQgnMJTTLvnE2yLE&amp;dib_tag=se&amp;keywords=pricing+and+revenue+optimization&amp;qid=1724877272&amp;sprefix=pricing+and+%2Caps%2C207&amp;sr=8-1&amp;linkCode=ll1&amp;tag=quantitativ0e-20&amp;linkId=7132e633241cd9e185d7288ba84cce9c&amp;language=en_US&amp;ref_=as_li_ss_tl">Pricing and Revenue Optimization</a> by Robert L. Phillips</p>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="ml" /><summary type="html"><![CDATA[How e-commerce sellers, such as Amazon, Shopify, Temu and etc. can optimise the prices and have a higher revenue and profit. Price elasticities and A/B techniques can really make a big impact on your earnings.]]></summary></entry><entry><title type="html">How does a 1% increase in traffic cost your health?</title><link href="http://localhost:4000/ml/2023/12/19/traffic-impact-on-pollution/" rel="alternate" type="text/html" title="How does a 1% increase in traffic cost your health?" /><published>2023-12-19T00:00:00+01:00</published><updated>2023-12-19T00:00:00+01:00</updated><id>http://localhost:4000/ml/2023/12/19/traffic-impact-on-pollution</id><content type="html" xml:base="http://localhost:4000/ml/2023/12/19/traffic-impact-on-pollution/"><![CDATA[<p><img src="/images/co2_girl.webp" alt="A girl suffering from air pollution in Luxembourg" /></p>

<p>A causal analysis run on traffic data collected in Luxembourg, a small and green country in Western Europe, indicates that a 1% increase in traffic leads to a 0.45% rise in nitrogen dioxide (NO2), a major air pollutant primarily emitted by cars and factories. Elevated levels of NO2 can adversely affect health in various ways. It can make breathing problems worse, negatively impact heart health, increase susceptibility to infections, and particularly affect vulnerable groups such as individuals with asthma, children, the elderly, and those with existing heart and lung conditions.</p>

<p>In <a href="https://environment.ec.europa.eu/topics/air/air-quality/eu-air-quality-standards_en">the air quality directive</a> the EU has set two limit values for NO2 for the protection of human health: the NO2 hourly mean value may not exceed 200 micrograms per cubic metre (µg/m3) more than 18 times in a year and the NO2 annual mean value may not exceed 40 micrograms per cubic metre (µg/m3).</p>

<p><img src="/images/lux_map.webp" alt="Pollution data from Luxembourg city and Esch sur Alzette" class="align-right" /></p>

<p>The data for this analysis were collected from two different locations in Luxembourg, about 30 kilometers apart. In both locations, traffic and air quality sensors were placed in close proximity to provide accurate measurements of traffic-related pollution. Among these, the city of Esch sur Alzette stands out. Despite having similar traffic counts to the other location, Esch sur Alzette consistently shows higher pollution levels. This difference could be attributed to the city’s history of being surrounded by heavy industry. Interestingly, at night, pollution levels between the two locations tend to converge, indicating that the elevated pollution in Esch sur Alzette fluctuates with daily activities and is not a constant feature.</p>

<p><img src="/images/avg_pollution.webp" alt="Average pollution in Luxembourg country" /></p>

<h3 id="recommendations">Recommendations</h3>

<p><img src="/images/esch.webp" alt="Pollution data from Luxembourg city and Esch sur Alzette" class="align-left" /></p>

<p>The results of the causal analysis can be used to precisely control traffic overflow in hotspot areas. They can help calculate a range of acceptable traffic values, and upon nearing maximum capacity, traffic can be diverted through alternative routes. In the case of exceeding the maximum allowed NO2 levels, we can calculate how many vehicles need to be diverted from the affected area to normalize the pollution levels.</p>

<p>The question is whether this is feasible, as Luxembourg has already invested in traffic flow optimization - it offers free public transport nationwide, and numerous upgrades have been made to roads and other transportation links. Yet, these improvements do not alleviate the high pollution levels in areas at the crossroads with France. Any enhancements, such as increased highway capacity or better traffic management, are quickly offset by an influx of more vehicles.</p>

<p>For residents in highly polluted areas, the advice might be to relocate. However, these areas are known for their lower real estate prices, making relocation impractical for many.</p>

<p>Additionally, a national goal should be to decrease the number of older-generation cars and ideally replace them with electric vehicles (EVs). The chart below shows that the percentage growth of EVs in Luxembourg is impressive - around 35% per year for the last 4 years. At the same time, the prevalence of diesel and petrol cars is declining. However, their absolute numbers remain high, with 268,000 diesel and 237,000 petrol cars. Assuming a consistent growth rate, it will take about 8 years for EVs to reach the current levels of diesel and petrol cars, starting from the current count of 23,400 EVs. On a positive note, not all older cars are in active use, despite being registered. Additionally, while newer-generation cars emit fewer pollutants than their older counterparts, they are not entirely NO2 neutral.</p>

<p><img src="/images/ev_growth.webp" alt="Percentage of EV in Luxembourg" /></p>

<h3 id="data">Data</h3>

<p>As mentioned earlier, the data was collected from two different locations where the NO2 sensors and traffic counters were situated approximately 700m-800m apart. Both locations are in proximity to a city. The traffic and pollution data were acquired from <a href="http://data.public.lu/en/datasets/">data.public.lu</a> - a public data repository of the Luxembourg government. The weather data was obtained from a third party, as data.public.lu does not currently provide it, but plans to do so in the second half of 2024. For the analysis, the following filters were applied:</p>
<ul>
  <li>The date range is from 2020-03-01 to 2022-12-31, as earlier data is very sparse and, presumably, unreliable.</li>
<li>The models are built on daytime data, spanning from 7:00 to 21:00. Early investigations indicated that the data distributions for night and day are quite different; therefore, they were modeled separately.</li>
</ul>

<h3 id="a-simple-regression-model">A simple regression model</h3>

<p>In this analysis, I explored various methodologies, all tied to a causal question: what is the effect on NO2 pollution of a one percent rise in traffic? The need for causal inference comes from the limitations of simple linear regression, which often yields inaccurate or biased results, preventing reliable conclusions from mere traffic and NO2 correlations. An alternative approach, a controlled experiment involving random assignment of vehicle passage near sensors, while methodologically sound, is impractical to implement.</p>

<p>Let’s begin by considering a simple linear regression model, where we assume that NO2 pollution is solely driven by traffic.</p>

\[log(pollution) = \alpha + \beta * log(traffic) + \varepsilon\]

<p><img src="/images/poll_traffic.webp" alt="Total effect of pollution from traffic" class="align-center" /></p>

<p>The estimate of \(\beta\), the slope of this model, gives us <code class="language-plaintext highlighter-rouge">0.4898</code>, or roughly a 0.49% effect, and \(R^2\) is <code class="language-plaintext highlighter-rouge">0.1851</code>, meaning that only 18.5% of the variability can be explained with this simple model. But let me remind you that this approach captures <strong>the total</strong> effect on NO2, which potentially means that we overlook other confounding factors. For those unfamiliar with the term, a confounder is a variable that influences both the outcome (in this instance, pollution) and the independent variable of interest (in our case, traffic).</p>
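<p>As a minimal sketch of how such an estimate can be obtained - assuming the statsmodels package and an illustrative dataframe with hourly traffic and NO2 columns - the model is a one-liner:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch; "traffic_no2.csv" and its column names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("traffic_no2.csv")  # columns: no2, traffic, city, temperature, wind, precipitation

simple = smf.ols("np.log(no2) ~ np.log(traffic)", data=df).fit()
print(simple.params["np.log(traffic)"])  # the slope, ~0.49 in this post
print(simple.rsquared)                   # ~0.185 in this post
</code></pre></div></div>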

<p>From extensive research on NO2 pollution, we know that major contributors include cars and factories. In my analysis, I have considered the following variables as potential confounders:</p>

<ul>
  <li><strong>Air temperature</strong>: Temperatures below 0°C may lead individuals to turn on heating, some of which might still use coal or other ‘dirty’ materials. This also affects traffic - we can expect more cars during both cold and hot days.</li>
  <li><strong>Wind speed</strong>: High wind can lower NO2 concentration in an area. However, the absence of wind, combined with a sunny summer day, might increase NO2 concentration. Windy days, perceived as unpleasant, might also increase traffic.</li>
  <li><strong>Precipitation</strong>: Rainy days might affect both NO2 concentration and traffic flow.</li>
  <li><strong>City code</strong>: Due to varying landscapes and urban planning, different cities are expected to have distinct effects on traffic and pollution. For example, Luxembourg City has many valleys where pollutants can be trapped.</li>
  <li><strong>Time (omitted from the analysis)</strong>: Initially, including time variables such as hour, workday, month, and year might seem beneficial due to the high seasonality in the dataset. However, the traffic data captures the seasonality very well and time doesn’t have a direct effect on the pollution levels.</li>
</ul>

<h3 id="an-advanced-regression-model">An advanced regression model</h3>

<p>The additional features outlined above allow us to build a causal graph and more precisely estimate the effect of the treatment, namely traffic. The estimated effect is <code class="language-plaintext highlighter-rouge">0.4505</code>, or 0.45%, roughly 0.04 percentage points lower than the simple model’s estimate. Meanwhile, \(R^2\) has increased to 0.48, which is a positive development.</p>

<div class="align-center">\[log(pollution) =\]
</div>

<div class="align-center">\[\alpha + \beta_t * log(traffic) + \beta_c * city + \beta_t * temperature + \beta_w * wind + \beta_p * precipitation + \varepsilon\]
</div>

<p><img src="/images/adv_model.webp" alt="Total effect of pollution from traffic" class="align-center" /></p>

<p>Beyond yielding more accurate estimates, the multiple input model offers detailed insights into each variable. Consider the table presented below:</p>

<p><img src="/images/adv_params.webp" alt="Total effect of pollution from traffic" class="align-center" /></p>

<p>Of particular interest is the <code class="language-plaintext highlighter-rouge">code_Luxembourg</code> variable. This indicates that, on average, the difference in pollution levels between Esch sur Alzette city and Luxembourg is 54%. Essentially, this implies that the model estimates a 54% reduction in pollution levels in Luxembourg compared to Esch sur Alzette, assuming all other variables remain constant. Furthermore, the model shows that an increase in precipitation, wind speed, or temperature leads to a decrease in NO2 pollution by 3.6%, 2.7%, and 1.6%, respectively.</p>
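<p>In code, the advanced model is the same sketch as before with the confounders added to the formula (city enters as a dummy variable):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Extends the earlier illustrative sketch with the confounders.
advanced = smf.ols(
    "np.log(no2) ~ np.log(traffic) + C(city) + temperature + wind + precipitation",
    data=df,
).fit()
print(advanced.params["np.log(traffic)"])  # ~0.45 in this post
</code></pre></div></div>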

<h3 id="a-double-ml-model">A double ML model</h3>

<p><img src="/images/double_sword.webp" alt="A doubled sword represents DoubleML" class="align-right" /> I have incorporated a DoubleML model into the analysis, recognizing that it may not be a perfect fit for the problem at hand. A key selling point of this approach is its ability to handle high-dimensional data and effectively manage a large number of potential confounders. Supposedly, DoubleML reduces the necessity for deep domain expertise, allowing one to construct models even with a basic understanding of the system and data, provided there is ample data and some familiarity with machine learning techniques. My contention, however, is that constructing a theoretical model is still feasible and beneficial. For instance, in this analysis, we can derive numerous variables from the ‘datetime’ variable, such as year, month, and day of the week. Yet, in constructing a theoretical model, we could simply use a consolidated time variable to gauge its overall impact on the system.</p>

<p>While the DoubleML approach might be excessive for the current problem, its practical implementation reveals several issues. First, the model primarily yields an estimate of the effect of the treatment on the outcome, but it does not inherently provide insights into other parameters like city, temperature, or rainfall. This means additional steps are needed to analyze these factors. Second, the complexity of the DoubleML implementation can obscure the model&#8217;s internal workings, leading to a reliance on superficial understanding. Users must trust the model&#8217;s outcome without fully grasping the underlying mechanics.</p>

<p>The implementation of a DoubleML model is relatively straightforward. First, you build a model to predict the treatment using the input variables. To capture non-linear relationships, any modeling technique can be employed. In my analysis, I utilized XGBoost. We build the model on a training dataset and then make predictions on the test set. However, our focus is on the residuals, i.e., the difference between the true values of the test set and the predicted values.</p>

<div class="align-center">\[log(traffic) = city + temperature + wind + precipitation\]
</div>

<div class="align-center">\[residuals_{traffic} = Y_{traffic} - \tilde{Y}_{traffic}\]
</div>

<p>Next, we proceed to calculate the residuals of the outcome model, constructed without including the treatment variable.</p>

<div class="align-center">\[log(pollution) = city + temperature + wind + precipitation\]
</div>

<div class="align-center">\[residuals_{pollution} = Y_{pollution} - \tilde{Y}_{pollution}\]
</div>

<p>Finally, we obtain our estimate via a simple linear regression:</p>

<div class="align-center">\[residuals_{pollution} = \alpha + \beta * residuals_{traffic}\]
</div>

<p>The result is <code class="language-plaintext highlighter-rouge">0.44</code>, which is close to the multiple-input regression model&#8217;s estimate of 0.45; however, \(R^2\) is lower, at 22.5%.</p>
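<p>Condensed into code, the procedure looks roughly like this (a sketch, assuming pre-split data where <code class="language-plaintext highlighter-rouge">X</code> holds the confounders; the full implementation is in the notebook linked under Useful resources):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of the manual DoubleML procedure; variable names are assumptions.
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression

m = XGBRegressor().fit(X_train, t_train)   # predicts log(traffic)
g = XGBRegressor().fit(X_train, y_train)   # predicts log(pollution)

res_t = t_test - m.predict(X_test)         # treatment residuals
res_y = y_test - g.predict(X_test)         # outcome residuals

final = LinearRegression().fit(res_t.values.reshape(-1, 1), res_y)
print(final.coef_[0])                      # ~0.44 in this analysis
</code></pre></div></div>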

<h3 id="bayesian-inference-model">Bayesian inference model</h3>

<p><img src="/images/yin_yang.webp" alt="A doubled sword represents DoubleML" class="align-left" /> To me, Bayesian inference is an ideal complement to Causal Inference. Firstly, there’s a nuanced difference in how Bayesian inference presents estimated parameters compared to frequentist frameworks. Frequentist confidence intervals indicate the range within which we would expect <strong>the true parameter value</strong> to fall in <strong>repeated samples</strong>, not the probability of the parameter being within that range in a given sample. In contrast, Bayesian credible intervals offer a probability-based interpretation: they indicate <strong>the likelihood of the parameter</strong> being within a certain range, given <strong>the observed data and prior</strong> knowledge.</p>

<p>Secondly, we can incorporate existing knowledge about the system under study. In our case, we know the following:</p>

<ul>
  <li>The NO2 pollution rate is always positive.</li>
  <li>A positive relationship exists between traffic count and NO2 pollution.</li>
  <li>Traffic count can’t be negative.</li>
  <li>The log-normal distribution can be used to describe the NO2 pollution rate.</li>
</ul>

<p>The Bayesian model definition is almost identical to the multiple-input regression model described earlier, except that we need to encode the priors we defined above.</p>

\[log(pollution) \sim Normal(\mu, \sigma)\]

\[\mu = \alpha + \beta_t * log(traffic) + \beta_c * city + \beta_{temp} * temp + \beta_w * wind + \beta_p * prcp\]

\[\alpha \sim HalfNormal(5)\]

\[\beta_t \sim LogNormal(0, 3)\]

\[\beta_c \sim Normal(0, 2)\]

\[\beta_{temp} \sim Normal(0, 2)\]

\[\beta_w \sim Normal(0, 2)\]

\[\beta_p \sim Normal(0, 2)\]

\[\sigma \sim HalfNormal(1)\]
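<p>Translated into PyMC, the model above reads almost like the math. A minimal sketch, where the column names and the dummy-coded <code class="language-plaintext highlighter-rouge">city_lux</code> variable are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of the Bayesian model above in PyMC (v5-style API).
import pymc as pm

with pm.Model() as no2_model:
    alpha = pm.HalfNormal("alpha", sigma=5)
    beta_t = pm.LogNormal("beta_t", mu=0, sigma=3)  # traffic effect is positive
    beta_c = pm.Normal("beta_c", mu=0, sigma=2)
    beta_temp = pm.Normal("beta_temp", mu=0, sigma=2)
    beta_w = pm.Normal("beta_w", mu=0, sigma=2)
    beta_p = pm.Normal("beta_p", mu=0, sigma=2)
    sigma = pm.HalfNormal("sigma", sigma=1)

    mu = (alpha + beta_t * df["log_traffic"] + beta_c * df["city_lux"]
          + beta_temp * df["temp"] + beta_w * df["wind"] + beta_p * df["prcp"])
    pm.Normal("log_no2", mu=mu, sigma=sigma, observed=df["log_no2"])

    idata = pm.sample()  # posterior draws behind the table below
</code></pre></div></div>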

<p>The table below presents the results from the Bayesian inference. The estimated parameters align closely with those obtained from previously described methods. A significant enhancement, however, is the inclusion of credible intervals for each parameter. Specifically, the estimated effect of traffic on NO2 is <code class="language-plaintext highlighter-rouge">0.451</code>, with a credible interval ranging from a lower bound of <code class="language-plaintext highlighter-rouge">0.441</code> to an upper bound of <code class="language-plaintext highlighter-rouge">0.461</code>. This approach, with its emphasis on credible intervals, represents the correct and most informative way to report such results.</p>

<p><img src="/images/bayes_table.webp" alt="The results of Bayesian inference" class="align-center" /></p>

<h3 id="useful-resources">Useful resources</h3>

<ul>
  <li>For those interested in replicating the analysis, which is highly encouraged, or simply exploring the model implementations, I have shared <a href="https://github.com/kafka399/pollution_impact/blob/main/doubleml-modelling.ipynb">my notebook</a> on GitHub.</li>
  <li>A highly recommended resource for understanding Causal and Bayesian Inference is the following book, which I found extremely valuable:</li>
</ul>

<p><a href="https://www.amazon.com/Statistical-Rethinking-Bayesian-Examples-Chapman/dp/036713991X?crid=3TB96CJT5VVDG&amp;keywords=Statistical+Rethinking&amp;qid=1702984593&amp;sprefix=statistical+rethinking%2Caps%2C164&amp;sr=8-1&amp;linkCode=ll1&amp;tag=quantitativ0e-20&amp;linkId=0a63f754a2d656a19df15453cfac0ca8&amp;language=en_US&amp;ref_=as_li_ss_tl"><img src="/images/rethinking.webp" alt="Statistical Rethinking book, highly recommended" class="align-center" /></a></p>

<h3 id="final-remarks">Final remarks!</h3>

<p><img src="/images/vince.webp" alt="The results of Bayesian inference" class="align-center" /></p>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="ml" /><summary type="html"><![CDATA[The article shares a hands on example on an application of causal inference, double ML and Bayesian inference.]]></summary></entry><entry><title type="html">I got hit by HackerNews - a luck or a skill?</title><link href="http://localhost:4000/cloud/2023/11/01/luck_or_skill/" rel="alternate" type="text/html" title="I got hit by HackerNews - a luck or a skill?" /><published>2023-11-01T00:00:00+01:00</published><updated>2023-11-01T00:00:00+01:00</updated><id>http://localhost:4000/cloud/2023/11/01/luck_or_skill</id><content type="html" xml:base="http://localhost:4000/cloud/2023/11/01/luck_or_skill/"><![CDATA[<p><img src="/images/hn.webp" alt="What happens when your post is top2 on HackerNews" /></p>

<p>This is what happens to your website traffic when <a href="http://dzidas.com/ml/2023/10/15/blind-spot-ds/">your post</a> makes it to the front page of tech news aggregators like <a href="https://news.ycombinator.com/item?id=37890685">HackerNews</a> and <a href="https://www.reddit.com/r/datascience/comments/178ifs7/causal_inference_as_a_blind_spot_of_data/">Reddit</a>. It appears that my previous post struck a nerve, sparking extensive discussions on these platforms. In this article, I&#8217;ll share my experience, key takeaways, and, most importantly (or perhaps not at all), whether it translated into any financial gains.</p>

<p>Just in case you haven’t come across HackerNews - it’s not a secretive group of hackers. It’s actually a tech news aggregator that’s highly favored by tech enthusiasts. And, as the story often goes, if your content makes it to the front page, be ready for a massive influx of traffic to your website.</p>

<h3 id="luck-versus-skill">Luck versus skill</h3>

<p><img src="/images/luck_skill.webp" alt="Luck versus skill, what is most important?" class="align-left" />
Was it a lucky coincidence that my post went viral? That&#8217;s what I thought when I first shared it on HackerNews and suddenly everyone had an opinion in the comments. But then the same thing happened on Reddit. More discussions, questions, and a bunch of shared experiences. That makes me think it&#8217;s more than just luck. I won&#8217;t lie; the post took time. It wasn&#8217;t just a quick write-up. I spent hours on the text, brainstorming ideas for the images, and doing a bunch of rewrites. Now the big question - can I do it again? I used to joke with my friends that you get credit for one viral post in your life, and if you use it up like I did, the game is over. But the response to the post shifted my thinking. It seems people on the internet still appreciate content that&#8217;s thoughtful yet fun, and either teaches you something new or throws in a different viewpoint. But I get it - my single success isn&#8217;t much of a voice. Have you heard of Mr. Beast, though? Believe it or not, I only learned about him recently when I listened to Lex Fridman&#8217;s <a href="https://lexfridman.com/mrbeast/">podcast</a> featuring this viral content creator from YouTube. One key takeaway from that podcast: virality is replicable, but you need to practice the craft and believe in yourself.</p>

<h3 id="scalable-blogging-platform">Scalable blogging platform</h3>
<p><img src="/images/hn_top_small.webp" alt="What happens when your post is top2 on HackerNews" />
<em>After an hour, the post made it to second place</em></p>

<p>The day I shared on HN and Reddit, the average traffic went from 250 requests per day to 25,000 in a single day, excluding bot requests. My website and the infrastructure beneath it survived a 100x spike in traffic as the post remained in the top 3 of HackerNews for 12 hours and on the front page for 24 hours. Now ask yourselves - can your infrastructure cope with a 100x increase in traffic and, more importantly, what would it cost?</p>

<p>I&#8217;m lucky that, for a personal blog site, I don&#8217;t need a fancy platform; therefore, my hosting costs are close to zero. However, my approach might be unconventional. You see - my platform is static, meaning that pages are generated once, when a new post is created. Contrary to my setup, a dynamic content platform such as WordPress requires computational power to generate dynamic content with every query. On top of that, I write my posts in a &#8220;language&#8221; called <a href="https://en.wikipedia.org/wiki/Markdown">Markdown</a>, a simplified alternative to HTML, which gives me flexibility when it comes to formatting, be it code, formulas, or just image alignment. To convert a post from Markdown to HTML output, I use the <a href="https://jekyllrb.com/">Jekyll</a> tool, which, as far as I know, has only a command-line interface, putting us already into the &#8220;hackers&#8221; space for most bloggers.</p>

<p>To summarize, this approach yields highly optimized content suitable for hosting, but its steep learning curve and perceived inflexibility might be a blocker for many.</p>

<p><img src="/images/website.webp" alt="A chain to generate and deploy the content of the website" class="align-center" /></p>

<h3 id="serverless-hosting-on-aws">Serverless hosting on AWS</h3>

<p>As my website is just a bunch of HTML files, I upload them to the AWS (Amazon Web Services) platform for storage on the S3 service. This alone enables running a serverless website, meaning that there is no cost associated with a continuously operating computer. Additionally, AWS offers an interesting service, CloudFront, that distributes your content globally, delivering it to recipients much faster. This becomes crucial when there&#8217;s a surge in traffic to your infrastructure — it&#8217;s not just a single point serving your content; instead, it is evenly distributed and closer to the user. Does it cost a fortune? Absolutely not — for small-scale websites like mine, AWS provides a free tier, essentially reducing the cost to nothing!</p>

<p><img src="/images/globe_cf.webp" alt="CloudFront serving around the globe" class="align-right" /> While AWS does offer a free tier for its services, it’s essential to note that exceeding these limits will incur costs. I mentioned this earlier, but it’s worth emphasizing. On the second day of publishing the post, a colleague pointed out the potential costs associated with egress traffic on AWS. In simpler terms, egress traffic refers to the data that exits AWS, such as when users access your website or if you’re transferring to a different cloud provider. However, incoming data, known as ingress traffic, doesn’t come with a charge. This feedback prompted me to revisit AWS’s pricing. For the CloudFront service, you receive 1TB free, which is substantial. A rough estimate, considering my post’s size is 3MB and it’s been accessed by 50K users, suggests I’ve only used about 150GB of outgoing traffic. So, I’m well within the free limit.</p>

<p>While it took some pressure off, I felt it necessary to delve further into the AWS billing dashboard and the free tier page. A positive note: the billing dashboard is pretty much real-time, so what you see is likely what you&#8217;ll pay at month&#8217;s end. However, a hiccup: CloudFront doesn&#8217;t appear on the free tier page, and its metrics can only be found on the service page. Moreover, the CloudFront metrics dashboard doesn&#8217;t allow you to aggregate data transfer (egress) into one number, e.g. for a week; it only offers finer granularity. To nail down an exact figure, I had to fetch a CSV file and do the math myself. Not really customer-obsessed, right?</p>

<p><img src="/images/cloudfront.webp" alt="A complicated way to check CloudFront egress" class="align-center" /></p>

<p>At this point, you might feel overwhelmed by the intricacy of just one service and wonder if there’s a simpler solution out there. Well, there is one – the budget alert. You can set a threshold for a specific amount, and you’ll be alerted when that limit is reached. Can things still go sideways? Of course, especially if you overlook a notification. But it’s definitely better than being in the dark!</p>

<p><img src="/images/bill_alert.webp" alt="AWS billing console and alerts" class="align-center" /></p>

<h3 id="what-went-wrong">What went wrong?</h3>

<p>With the infrastructure holding up well, there was one issue with the post. I use the MathJax library, which compiles text into mathematical formulas, and it usually works very well. However, I noticed that, randomly, the formulas were not being rendered on the page, making it look ugly. It took a lot of effort to debug this issue, and it turned out that the library was declared twice in the code. Fixing a system under heavy load is not fun - it&#8217;s stressful. One lesson from this experience: always test your posts in different browsers.</p>

<p>So, we’ve discussed the technical challenges, but the main question remains unanswered: Is going viral worth it?</p>

<h3 id="benefits-of-a-viral-post">Benefits of a viral post</h3>

<p><img src="/images/twitter.webp" alt="Twitter slinged" class="align-left" /> You might have noticed, if you’re brave enough these days to browse without an ad blocker, that I don’t run any ads. While I’m not aiming for profit, I do mention books I’ve personally read; I wouldn’t recommend them otherwise. Thanks to a recent post, I might’ve earned enough for a new book. Yet, I found it surprising how few readers reached out on <a href="https://www.linkedin.com/in/dzidas/">LinkedIn</a> to connect. In the past, I expected some readers to follow me on <a href="https://twitter.com/dzidorius">Twitter</a>, but I believe, sadly, that Twitter is dead. Interestingly, I received a lot of LinkedIn requests when someone shared my post to their connections. With RSS feeds slaughtered ages ago (thanks, Google) and Twitter slung recently, I don’t see a way for independent bloggers to build a persistent audience. Now, every post seems to be a hit or miss on news aggregators.</p>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="cloud" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Causal inference as a blind spot of data scientists</title><link href="http://localhost:4000/ml/2023/10/15/blind-spot-ds/" rel="alternate" type="text/html" title="Causal inference as a blind spot of data scientists" /><published>2023-10-15T00:00:00+02:00</published><updated>2023-10-15T00:00:00+02:00</updated><id>http://localhost:4000/ml/2023/10/15/blind-spot-ds</id><content type="html" xml:base="http://localhost:4000/ml/2023/10/15/blind-spot-ds/"><![CDATA[<p>Throughout much of the 20th century, frequentist statistics dominated the field of statistics and scientific research. Frequentist statistics primarily focus on the analysis of data in terms of probabilities and observed frequencies. Causal inference, on the other hand, involves making inferences about cause-and-effect relationships, which often goes beyond the scope of traditional frequentist statistical methods.</p>

<p>Causal inference has a long history, but it gained more prominent attention in the latter half of the 20th century. This increased interest was partly due to advancements in statistical methods and the development of causal inference frameworks. In the 1980s, the work of Judea Pearl on causal inference significantly contributed to the field, and this work continued into the 21st century. Economists and social scientists were among the first to recognize the advantages of these emerging causal inference techniques and incorporated them into their research.</p>

<p>However, based on my personal anecdote, the data science community didn’t truly prioritize causal inference until around 2015 or later. It was during this period that less technically oriented economists faced significant challenges related to the scaling of big data, prompting them to seek assistance from data scientists. Unfortunately, data scientists often lacked the necessary expertise in causal inference, resulting in limited knowledge transfer to business stakeholders. As a result, we, the data scientists, missed an important development for quite a bit, so let’s catch up on that!</p>

<h3 id="what-is-causal-inference">What is causal inference?</h3>

<p><img src="/images/ny.webp" alt="New York in the parallel worlds" />
<em>New York in the parallel worlds</em></p>

<p>To explain causal inference, I like the analogy of a parallel or alternative world. Nowadays, with the help of Generative AI, we can really simulate or create new worlds. Have you ever wondered what New York would look like if the Aztecs had taken over the Americas? Or what if the Roman Empire still ruled the world?</p>

<p>Now, how does this relate to data science and business decisions? Well, businesses make important choices every day, like where to invest money, who to hire or fire, and what the consequences of public policies might be. The data they collect only shows one side of reality. To really understand the results of their decisions, they need to explore an unseen or simulated reality. That’s where causal inference comes in, helping us make better decisions by considering alternative outcomes.</p>

<p>Let&#8217;s take an example to make it clear. Imagine you&#8217;re in charge of expanding the business of a company that makes snack bars for kids. Your goal is to boost sales, and you&#8217;re considering adding more sugar to your products because you have a hunch that kids love sugar. After enhancing all your snacks with sugar, you want to measure the impact: how much would your sales be in a parallel world where kids were stuck with bland snacks instead of your sweet treats? This is where causal inference steps in to provide the solution.</p>

<p><img src="/images/sweet.webp" alt="" />
<em>Sales of sweet snacks vs bland snacks</em></p>

<p>The chart above illustrates the difference between an observed scenario represented by the red line and an unobserved scenario represented by the black line. The technical term for the black line is ‘counterfactual’—it represents what the sales would be if we didn’t enhance the snacks with sugar.</p>

<p>To continue this intriguing story, let’s fast forward a bit. Now, you’re the CEO of the same company, which has gained international recognition and is traded on stock markets worldwide. However, recently, some pesky Facebook groups formed by moms and dads have launched a public campaign, claiming that your products and the entire concept behind them are making their children overweight and prone to diabetes.</p>

<p>In an effort to address these concerns and launch a PR campaign, you reach out to a university with whom you’ve collaborated in the past to improve your products. You ask them to investigate these claims. To your surprise, they request the same sales data that initially sparked the sugary product campaign. A few nights later, an underpaid PhD student conducts a causal inference analysis, constructs counterfactuals and uncovers the following findings.</p>

<p><img src="/images/diabetes.webp" alt="" />
<em>Percentage of people with diabetes, simulated data</em></p>

<p>Upon seeing these results, you become convinced that there is a conspiracy against you and your company. You quickly instruct your lawyers to halt any further funding to the university and come up with a plan to take legal action against all parties threatening you. From this intriguing tale, we can take two valuable lessons about how Causal Inference played a pivotal role in two critical business scenarios:</p>

<ul>
  <li>Assessing the impact on sales by adding more sugar to your products.</li>
  <li>Assessing the effects of a sugary diet on children’s health.</li>
</ul>

<p>Here are more examples of causal inference:</p>

<ul>
  <li>Effect of attending a data science meetup on a person’s future and earnings</li>
  <li>Air quality and free public transport</li>
  <li>Percentage of electric vehicles and air quality</li>
  <li>Impact of your campaigns (sales, marketing or support) on revenue, profit, employee satisfaction, etc.</li>
  <li>The effect of a product or service price change on demand</li>
</ul>

<h3 id="running-a-causal-inference-analysis">Running a causal inference analysis</h3>

<p><img src="/images/umbrellas_sun_small.webp" alt="" class="align-left" />
Looking back at our fictional story, the lawyers could raise a valid argument that association or correlation doesn&#8217;t necessarily imply causation. In simpler terms, just because there&#8217;s a correlation between high sugar consumption and an increase in the number of people with diabetes doesn&#8217;t mean that one directly causes the other. It&#8217;s another case of a spurious correlation, isn&#8217;t it? Take, for example, the high correlation between umbrellas and wet streets. Does that imply that people with umbrellas cause puddles on the streets? Of course not. It&#8217;s more likely that a common factor - in this case, rain (or, to use the technical term, a confounder) - affects both variables. Now, the question is: how can we estimate the impact, given that there is a causal link between diabetes and a sugary diet?</p>

<p>One approach to understanding the impact of sugary snack consumption on diabetes risk would involve conducting an experiment. In this experiment, children would be randomly assigned to receive either unsweetened or sugary snacks. After a decade, we’d analyze how many children developed diabetes in each group and calculate any observed differences. If differences exist, we could confidently attribute them to causation since the random assignment eliminates biases. However, I can already see parents rolling their eyes and rightly questioning my moral values. And they’re correct - we can’t conduct cruel experiments on children or people. Secondly, it would take 10 years to figure out the difference, and thirdly, there are other biases to consider. Last but not least, it would be really expensive.</p>

<h3 id="linear-regression">Linear regression</h3>

<p>Let’s set aside for the moment the need to estimate <strong>causal</strong> impact and explore modeling the problem using linear regression. If we represent snack consumption as a binary variable (sugary or not), we can proceed as follows:</p>

<div class="align-center">\[diabetes_{i} = \alpha_0 + \beta * sugar_i + \varepsilon_i\]
</div>

<p>Now, here comes the exciting part – we can enhance our model by incorporating all the additional variables available to us. In our example, we might have access to demographic data, an individual’s activity or behavioral information. For instance, we can consider factors like how frequently and for how many hours a person exercises each week, their dietary habits, and so on.</p>

<div class="align-center">\[diabetes_{i} = \alpha_0 + \beta_s * sugar_i + \beta_r * race_i + \beta_g * gender_i + \beta_x * X_i + \varepsilon_i\]
</div>

<p><em>\(X_i\) denotes extra variables that we might include</em></p>
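<p>To make this concrete, here is a minimal sketch of the extended regression; all the column names are hypothetical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of the extended model; `sugar` is the binary treatment.
import statsmodels.formula.api as smf

model = smf.ols("diabetes ~ sugar + C(race) + C(gender) + exercise_hours",
                data=df).fit()
print(model.params["sugar"])  # the estimated average effect of sugary snacks
</code></pre></div></div>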

<p>You might be wondering whether adding more variables to the model is a good idea, and the short answer is: it depends. When dealing with a causal question, it’s crucial to include variables known as confounders. These are variables that can influence both the treatment and the outcome. By including confounding variables, we can better isolate and estimate the true causal effect of the treatment. Failing to add or account for confounding variables may lead to incorrect estimates.</p>

<p><img src="/images/confounder.webp" alt="" class="align-center" />
<em>Example of a confounder</em></p>

<p>Additionally, including variables that are only predictors of the outcome can be beneficial. It reduces the variance and allows for a more precise estimation of the causal effect. However, adding a variable that predicts only the treatment can lead to a less accurate estimation of causal effect. This occurs because it increases the variance, making it more challenging to estimate the causal effect accurately.</p>

<p>It is worth emphasizing that a regression model gives an average estimate based on the given inputs. In causal inference, this outcome is referred to as the <strong>average treatment effect</strong>, \(ATE\), which provides an estimate across the entire group rather than on an individual basis. Depending on the problem at hand, you might need to estimate an individual treatment effect. The Synthetic Control method, discussed below, allows you to assess the impact at an individual level.</p>

<h3 id="causal-graphical-models">Causal graphical models</h3>

<p>Now that we&#8217;ve discussed the cases to consider, let&#8217;s dive into the process of deciding which variables to include and which to omit in your model. Causal graphical models, championed by Judea Pearl since the 1980s, offer an appealing approach at first glance. The fundamental concept is to construct a Directed Acyclic Graph (DAG) that contains all variables in your analysis. Using this graphical representation, you can make informed decisions about which variables to retain and which to exclude.</p>

<p>In this framework, each node in the graph represents a variable, and an arrow pointing to another node signifies a causal relationship. The process of constructing this graph involves utilizing three building blocks: a pipe, a fork, and a collider, which help describe the causal flow between variables. This approach forces you to engage in a thoughtful and clarifying exploration of your model’s causal structure.</p>

<p><img src="/images/components_small.webp" alt="" class="align-center" />
<em>From left to right: a pipe, a fork and a collider</em></p>

<p>While learning about causal graphs can be challenging, it offers substantial benefits in understanding and addressing various causal inference problems and their solutions. Some argue that this approach doesn&#8217;t scale well to a model with 20+ input variables. In my experience, most of the time we start with a limited set of variables and build derivatives of them. Therefore, in most cases there is an opportunity to build a structural map of the primary variables.</p>

<p>Additionally, to alleviate the pain and speed up the process, frameworks such as DoWhy and econml have been developed. In summary, starting your modeling journey with a causal graph may indeed be a challenging task; however, it is a proven way to arrive at a robust and insightful model.</p>
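<p>To give a flavor of DoWhy, here is a hedged sketch for our snack-bar example; the variable names and the toy graph are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch of the DoWhy workflow; the DOT graph encodes our assumed DAG.
from dowhy import CausalModel

model = CausalModel(
    data=df,
    treatment="sugar",
    outcome="diabetes",
    graph="digraph { exercise -> sugar; exercise -> diabetes; sugar -> diabetes; }",
)
estimand = model.identify_effect()             # finds the backdoor set
estimate = model.estimate_effect(estimand,
                                 method_name="backdoor.linear_regression")
print(estimate.value)
</code></pre></div></div>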

<h3 id="instrumental-variable">Instrumental Variable</h3>

<p><img src="/images/instrument_small.webp" alt="" class="align-left" />
In a nutshell, the causal graph should facilitate the solution of a causal problem. However, there are many methods to tackle the problem, depending on the data and challenge you have at hand. The Instrumental Variable (IV) method is quite unique. In theory, it seems like a magical solution - you find a variable that has a causal link to the treatment but doesn&#8217;t impact the outcome directly, and voilà, you&#8217;ve cracked the code of causation. However, there&#8217;s a crucial requirement: the IV must remain completely unrelated to any unobservable factors. This means you need prior knowledge ensuring that the instrument affects the outcome only through the treatment - an undertaking that can be quite challenging, if not nearly impossible, in the real world. In my experience, I haven&#8217;t come across any instances of its use, but interestingly, nearly every causal inference course dedicates a chapter to this intriguing concept.</p>

<h3 id="difference-in-differences">Difference in Differences</h3>

<p>DiD is a straightforward method that can be implemented in Excel without the need for advanced tools. The concept revolves around comparing two versions of a subject or a unit under investigation: one before a particular event or treatment and the other after. To enhance the analysis, you introduce a control group — a similar entity that remains unaffected by the treatment.</p>

<p><img src="/images/diff-diffs.webp" alt="" />
<em>What are the potential effects of a young girl taking her chemistry classes seriously</em></p>

<p>Let&#8217;s consider an illustrative example: before the introduction of the free transport policy in San Diego, the average air pollution level was 10. After the policy was implemented, it decreased to 8. However, we can&#8217;t simply subtract 10 from 8, as it would yield biased results.
To address this issue, we turn to data from a neighboring city, Tijuana, located just across the border. In Tijuana, the pollution level before the policy was 11, and after its introduction, it dropped to 10. Notably, Tijuana was not affected by the policy. Applying this approach, correcting for the bias shrinks the estimated impact from \(-2\) to \(-1\).</p>

\[diff = (SanDiego_{after} - SanDiego_{before}) - (Tijuana_{after} - Tijuana_{before}) =\]

\[(8 - 10) - (10 - 11) = -1\]
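<p>The same calculation can be framed as a regression with an interaction term, which conveniently also yields standard errors. A sketch with the toy numbers above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># DiD as a regression: the interaction coefficient is the estimate.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "pollution": [10, 8, 11, 10],
    "treated":   [1, 1, 0, 0],   # San Diego vs Tijuana
    "post":      [0, 1, 0, 1],   # before vs after the policy
})
did = smf.ols("pollution ~ treated * post", data=df).fit()
print(did.params["treated:post"])  # -1.0, the DiD estimate
</code></pre></div></div>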

<h3 id="synthetic-control">Synthetic control</h3>

<p>In our previous example, we made an assumption that the cities were similar, with the treatment being the only differing factor. Furthermore, we simplified the problem by considering only four parameters, which inevitably introduced a degree of uncertainty into our estimation. But what if we had access to a wealth of data from various units (cities in the previous example) and the ability to observe changes over time?</p>

<p><img src="/images/Zuckerberg_twin_small.webp" alt="" class="align-left" /></p>

<p>Enter the world of Synthetic Control. Like a magician, we build a synthetic twin for our treatment group. To do that, we take the treatment group and regress it against a bunch of similar units, and with regularization we select the most relevant features and determine their weights.</p>

<p>For instance, let&#8217;s evaluate the impact of Mark Zuckerberg having children on teenagers&#8217; happiness in social networks. In order to construct a synthetic control group, we certainly include Bezos, Musk, and Gates, but we might also add Jon from the UK and Giuseppe from France. Never heard of the latter two? That&#8217;s precisely the point!
With this model, built from data preceding the event (Zuckerberg having children), we project the data into the future, thereby generating our imaginary future, namely the counterfactuals, which gives us a way to measure the impact.</p>

<p><img src="/images/zucker_small.webp" alt="" />
<em>Fictitious example</em></p>

<p>Previously, we discussed the average treatment effect, a measure that helps us estimate the overall causal impact on a treatment group. Now, with the synthetic control method, it becomes possible to estimate an individual treatment effect (in this case, the effect of Zuckerberg having children), provided that we have a sufficient amount of data to create a synthetic control.</p>
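<p>Mechanically, a toy sketch looks like this (the data layout and names are assumptions): fit a regularized regression of the treated unit on the donor pool using only pre-event data, then project forward to obtain the counterfactual:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A toy synthetic-control sketch; `donors` has one column per donor unit,
# one row per time period, and `treated` is the treated unit's series.
from sklearn.linear_model import Lasso

pre = slice(None, event_date)   # pre-event window
sc = Lasso(alpha=0.1, positive=True).fit(donors.loc[pre], treated.loc[pre])

synthetic = sc.predict(donors)  # counterfactual for all periods
effect = treated - synthetic    # individual (per-period) treatment effect
</code></pre></div></div>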

<p>Now, let the magic fade and look at the disadvantages of synthetic control. The method assumes that all potential confounding variables are measured and controlled for, as with linear regression. It also assumes that the pre-treatment data accurately represent the underlying data-generating process. And on top of that, you get the standard ML problems - overfitting and difficult (but not impossible) validation, to name a few.</p>

<h3 id="double-ml">Double ML</h3>

<p>Since the ’90s, when causal inference gained popularity, the data landscape has changed significantly. More often than not, we now find ourselves dealing with enormous amounts of data for a given problem, often without a clear understanding of how all the variables interconnect. In theory, DoubleML can be helpful in such cases.</p>

<p><img src="/images/doubleml.webp" alt="" class="align-center" /></p>

<p>The DoubleML method is founded on machine learning modeling and consists of two key steps. First, we build a model \(m(X)\) that predicts the treatment variable \(T\) based on the input variables \(X\). Then, we create a separate model \(g(X)\) that predicts the outcome variable \(Y\)
using the same set of input variables \(X\). Subsequently, we calculate <strong>the residuals</strong> of both models and regress the outcome residuals against the treatment residuals. An important feature of this method is its flexibility in accommodating non-linear models, which allows us to capture non-linear relationships — a distinctive advantage of this approach.</p>

\[\tilde{T}_{sugar} = T_{sugar} - m(X)\]

\[\tilde{Y}_{outcome} = Y_{outcome} - g(X)\]

\[\tilde{Y}_{outcome} = \alpha_0 + \beta_t *\tilde{T}_{sugar}\]

<p>Consider using DoubleML when you have high-dimensional data (many features) or when the relationships between inputs and treatment/outcome are not linear. DoubleML is particularly useful when you need to estimate treatment effects that vary across different subgroups or individuals. However, it’s essential to remember that DoubleML cannot magically eliminate the influence of poorly considered confounding variables.</p>

<h3 id="useful-resources">Useful resources</h3>

<p>The purpose of this blog post was to introduce readers to the world of Causal Inference and inspire them to explore it further. I&#8217;m planning to follow up with a hands-on post using public data. Below, you can find a list of resources that I found helpful during my own journey of learning about Causal Inference.</p>

<ul>
  <li>
    <p><a href="https://matheusfacure.github.io/python-causality-handbook/landing-page.html">Causal Inference for The Brave and True</a>. It’s freely available on the internet, hands-on focused, and presents easily understandable formulas.</p>
  </li>
  <li>
    <p><a href="https://www.aeaweb.org/webcasts/2020/mastering-mostly-harmless-econometrics-part-1">Mastering mostly harmless econometrics</a>. A very nice introduction to Causal Inference. (video)</p>
  </li>
  <li>
    <p><img src="/images/ci_python.webp" alt="" /><a href="https://amzn.to/3M2myOG">Causal Inference in Python: Applying Causal Inference in the Tech Industry</a></p>
  </li>
  <li>
    <p><img src="/images/why.webp" alt="" class="align-left" /> <a target="_blank" href="https://www.amazon.com/Book-Why-Science-Cause-Effect-ebook/dp/B075DCKP7V?&amp;_encoding=UTF8&amp;tag=quantitativ0e-20&amp;linkCode=ur2&amp;linkId=2e09720775e30ccd7e76b53f970711c9&amp;camp=1789&amp;creative=9325">The Book of Why</a> by Judea Pearl,  the inventor of causal diagrams, not only delves deep into the theory but also offers valuable insights from a historical perspective.</p>
  </li>
</ul>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="ml" /><summary type="html"><![CDATA[Throughout much of the 20th century, frequentist statistics dominated the field of statistics and scientific research. Frequentist statistics primarily focus on the analysis of data in terms of probabilities and observed frequencies. Causal inference, on the other hand, involves making inferences about cause-and-effect relationships, which often goes beyond the scope of traditional frequentist statistical methods.]]></summary></entry><entry><title type="html">Challenges and ideas for charging an EV in Europe</title><link href="http://localhost:4000/greentech/2023/09/15/europe_charging_stations/" rel="alternate" type="text/html" title="Challenges and ideas for charging an EV in Europe" /><published>2023-09-15T00:00:00+02:00</published><updated>2023-09-15T00:00:00+02:00</updated><id>http://localhost:4000/greentech/2023/09/15/europe_charging_stations</id><content type="html" xml:base="http://localhost:4000/greentech/2023/09/15/europe_charging_stations/"><![CDATA[<p><img src="/images/parking_view.webp" alt="A charging station in Germany!" />
<em>A view from a Tesla charging station in Germany</em></p>

<p>The photo above captures the state of charging stations in <strong>Germany</strong>, and let me explain why I believe it does so. During a recent journey, I had the opportunity to travel across Germany, Poland, Lithuania, and back in an electric vehicle. While it wasn&#8217;t my first long trip with an EV, I noticed a difference when heading north as opposed to the south of Europe.</p>

<p>In countries like France, Italy, Switzerland, Belgium, Luxembourg, and the Netherlands, most electric superchargers are situated outside of gas stations, often integrated with hotels or shopping malls. This setup provides less crowded, litter-free areas, complete with amenities and eating options.</p>

<p>However, in Germany, known for its pragmatic approach, charging stations are typically integrated into existing infrastructure not primarily designed for electric cars or tourists. Yet, I believe they are missing out on a significant opportunity by opting for the default or cheapest option. Electric cars are clean and exceptionally quiet, and their drivers make attractive customers. Any business can benefit from incorporating a charging station and attracting additional, and in some cases well-off, clientele.</p>

<p>In the past, there have been instances where companies tried to upsell customers who used a Mac computer to access their websites. With electric charging stations, there’s no need for engaging in dubious marketing schemes; you simply build one and gain access to a stream of potential customers. Some might argue that a short detour may be necessary to reach such places, but based on my observations, Tesla and other EV drivers are more than willing to make that small effort.</p>

<p>Tesla charging stations in <strong>Poland</strong> are located away from traditional gas stations. However, it&#8217;s worth noting that the network&#8217;s coverage is quite sparse, with a station typically available only every 300 kilometers. This leaves little room for error and can be challenging for long-distance travelers. While there are alternative superchargers available, the lack of competition in Poland, and similarly in Germany, has resulted in high pricing, often double what Tesla charges. So, there&#8217;s a business opportunity here: open a charging station that offers competitive pricing!</p>

<p>The situation in <strong>Lithuania</strong> is quite chaotic. It appears that the government is attempting to promote electric vehicles (EVs) by offering free charging stations, but with charging time limited to 15 minutes or 1 hour. This complicates matters further, as some stations are either broken or consistently occupied, making it a challenge for travelers passing through to find an available spot. This situation creates unrealistic expectations, particularly when potential EV buyers compare the expenses to those of a 20-year-old diesel car. Additionally, a few superchargers in the country charge extremely high prices. Interestingly, there&#8217;s a silver lining, as a single Tesla station in the country offers free charging. As a first step, I strongly recommend abandoning the concept of free charging stations and instead focusing on building a reliable and efficient EV charging infrastructure.</p>

<p>And lastly, the most annoying thing, common across different countries, is the need to install separate apps and register on various websites for non-Tesla charging stations. As a consumer, I want a &#8220;tap to pay&#8221; or single-app solution to manage all my charging needs. A business opportunity, right?</p>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="greentech" /><summary type="html"><![CDATA[A view from a Tesla charging station in Germany]]></summary></entry><entry><title type="html">Navigating the Future of AI: Strategies for Survival</title><link href="http://localhost:4000/ml/2023/03/19/chatgtp-eats-the-world/" rel="alternate" type="text/html" title="Navigating the Future of AI: Strategies for Survival" /><published>2023-03-19T00:00:00+01:00</published><updated>2023-03-19T00:00:00+01:00</updated><id>http://localhost:4000/ml/2023/03/19/chatgtp-eats-the-world</id><content type="html" xml:base="http://localhost:4000/ml/2023/03/19/chatgtp-eats-the-world/"><![CDATA[<p>Lately, reading the news and following updates about advancements in AI, specifically in Generative AI and chatGPT, has given me mixed feelings - on one hand, we are onto something big and impactful, but at the same time it feels like a potential threat to the future. And I&#8217;m not alone - NLP students lost their field of research overnight, meanwhile <a href="https://old.reddit.com/r/MachineLearning/comments/11rizyb/d_anyone_else_witnessing_a_panic_inside_nlp_orgs/" target="_blank">some orgs at FAANG</a> became obsolete. It is old news that chatGPT can pass software developer tests at FAANG, the exam to become a lawyer, or generate inspirational phrases for your <a href="https://youtu.be/TBZM7ZQsCQU" target="_blank">YouTube shorts</a>. But I&#8217;m sceptical that we will experience a radical transformation within a few years; rather, it will be an iterative change that can take a decade or more. But as the story goes, the slowly boiled frog was too comfortable to jump out of the pot - a fate we shall avoid.</p>

<p>My scepticism has been growing since 2010, when we were promised self-driving cars, tomorrow! Looking back, it felt like we just needed a bit more, maybe a year or two, and Uber or Bolt would be driverless. Do you see it coming next year, or in two? I&#8217;m less optimistic this time, giving it another decade or so. And more recently, with the birth of Stable Diffusion back in 2022, it felt like we were going to generate movies for ourselves, find business ideas based on recent trends or build product promotions from a single photo. Where are we today? Well, we can generate &#8220;artistic&#8221; content with limited applicability in the business context, at best. Sure, startups are burning the midnight oil to come up with innovative products, but so far it hasn&#8217;t changed much.</p>

<p>In defense of chatGPT and LLMs, I agree that they already have an impact - as <a href="https://github.com/features/copilot" target="_blank">a coding assistant</a>, a translator, an initial knowledge bank or a sentence generator, but we still need that connector between the keyboard and the chair. But don&#8217;t forget that you are already taxed with useless content generated in a blink of an eye, meanwhile AI wars are yesterday&#8217;s news. Marketing companies rely heavily on auto-generated text, as in the example with YouTube shorts, meanwhile social platforms deploy AI to recognise and ban such content, so the former parties now use yet another service to paraphrase the auto-generated content.</p>

<p>Let me offer you a different angle against this doomsday outlook. A few years back, as a part of the Amazon Cloud (AWS) organization, I worked with AWS customers to transform their businesses by employing machine learning solutions. My main takeaway from that experience was that a business doesn&#8217;t care about the latest state-of-the-art ML/AI technique unless it gives a competitive advantage. As a consequence, they happily run a 20-year-old logistic regression model or a rule-based system, which they call an ML model to please shareholders or investors.
As a personal anecdote, I was leading a team of engineers with the goal of building a deep learning model for a computer vision problem. After 3 weeks of development, it became obvious that the approach, favored by the sales team, gave 80% accuracy at best, meanwhile the customer was insisting on human-level, 100% accuracy. So, we gave ourselves a chance to look beyond a deep learning approach and, sure enough, within a few days we found a solution based on <a href="https://opencv.org/" target="_blank">an old computer vision library</a>. In the end, it was a rule-based approach with ~20 lines of code, and it was 100% accurate. To put more salt on the AI wound, the cost of running it in a serverless environment was $3 versus $70K for a deep learning solution, with no maintenance at all. What a beauty, right?</p>

<p>So, my dear reader, how will we survive a future ruled by cruel and fearless AI? I would emphasize two things: putting business ideas and challenges upfront, as in the example above or <a href="https://twitter.com/ID_AA_Carmack/status/1637087219591659520" target="_blank">in this short tweet</a> by the former CTO of Oculus VR; and diversifying our skills. If today&#8217;s trend is any good for predicting the future, then the majority of today&#8217;s AI innovations will end up in the hands of a few, and we will happily consume them as cloud services. Meanwhile, we will be instrumental in bridging the gap for the business, therefore our skills need to be diversified, yet adjacent. Below, you can find a suggestive list of actions to diversify your skills. Let me know if you have a suggestion in the comments on a social platform or on <a href="https://twitter.com/dzidorius" target="_blank">Twitter</a>.</p>

<ul>
  <li>If you are an AI/ML practitioner, brace yourself and look beyond your field. For me, it was Bayesian statistics and causal inference. As with all religions, there is <a href="https://www.amazon.com/Statistical-Rethinking-Bayesian-Examples-Chapman/dp/036713991X?crid=2E6UIQV6V5LK8&amp;keywords=Statistical+Rethinking&amp;qid=1679240800&amp;s=books&amp;sprefix=statistical+rethinking%2Cstripbooks-intl-ship%2C162&amp;sr=1-1&amp;linkCode=ll1&amp;tag=quantitativ0e-20&amp;linkId=49ace44496b8b6342a6626901bbe2a14&amp;language=en_US&amp;ref_=as_li_ss_tl" target="_blank">one Bible</a>, with which all followers must get acquainted. And to strengthen your belief, <a href="https://www.youtube.com/playlist?list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN" target="_blank">a very good YouTube course</a> is provided as well.</li>
  <li>If you are in an IT related field, the cloud knowledge is a must nowadays. <a href="https://acloudguru.com" target="_blank">ACloud.guru</a> really helped me to learn about Amazon, Google and Azure clouds and pass a couple of certificates.</li>
  <li>Don’t skip over Data Engineering which is tightly coupled with ML Operations. For the former, I would suggest <a href="https://www.amazon.com/Fundamentals-Data-Engineering-Joe-Reis-ebook/dp/B0B4VH4T37?_encoding=UTF8&amp;qid=1679330656&amp;sr=1-1&amp;linkCode=ll1&amp;tag=quantitativ0e-20&amp;linkId=9323b2b7cb1ede6ab419ca340033f382&amp;language=en_US&amp;ref_=as_li_ss_tl" target="_blank">“Fundamentals of Data Engineering”</a> and for the latter - <a href="https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969?crid=2G2L23MXMDIRX&amp;keywords=Designing+Machine+Learning+Systems&amp;qid=1679330730&amp;s=books&amp;sprefix=designing+machine+learning+systems%2Cstripbooks-intl-ship%2C309&amp;sr=1-1&amp;linkCode=ll1&amp;tag=quantitativ0e-20&amp;linkId=0fecd993bed86f87a7325e6474918147&amp;language=en_US&amp;ref_=as_li_ss_tl" target="_blank">“Designing Machine Learning Systems”</a>.</li>
  <li>Incorporate AI advancements into your life - Copilot, ChatGPT, etc. There is no shame in working smarter, but be conscious of potential information leaks. To my surprise, some schools in Europe are encouraging students to use ChatGPT for writing.</li>
  <li>Learn about internet marketing and keep an eye on it in order to better understand your customers.</li>
</ul>]]></content><author><name>Dzidas Martinaitis</name><email>blog@dzidas.com</email></author><category term="ml" /><summary type="html"><![CDATA[Lately, reading the news and following updates about advancements in AI, specifically in Generative AI and chatGPT, gave me mixed feelings - on one hand, we are on something big and impactful, but at the same time it feels like a potential threat to the future. And I’m not alone - NLP students lost their field of research overnight, meanwhile some orgs at FAANG became obsolete. It is an old news that chatGPT can pass a software developer tests at FAANG, an exam to become a lawyer or generate inspirational phrases for your YouTube shorts. But I’m sceptical that we will experience a radical transformation in a short time of a few years, but rather, it will be an iterative change which can take a decade or more. But as a story goes, a slowly boiled frog was too comfortable to jump out of a pot, the fate we shall avoid.]]></summary></entry></feed>