Latest Posts (20 found)
emiruz 1 month ago

Modelling beliefs about sets

Here is an interesting scheme I encountered in the wild, generalised and made abstract for you, my intrepid reader. Let \(X\) be a set of binary variables. We are given information about subsets of \(X\), where each update assigns a probability to a concrete subset, the state of which is described by an arbitrary quantified logic formula. For example, \[P\bigg\{A \subset X \mid \exists_{x_i, x_j \in A} \big(x_i \ne x_j\big)\bigg\} = p\] The above assigns a probability \(p\) to some concrete subset \(A\), with the additional information that at least one pair of its members does not have the same value.
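As a toy illustration (my own, not from the post): for a concrete three-element subset, the states consistent with the formula above can simply be enumerated.

```python
from itertools import product

# Toy illustration: enumerate the states of a concrete subset
# A = {x1, x2, x3} of binary variables, keeping those that satisfy
# the formula "there exist x_i, x_j in A with x_i != x_j",
# i.e. the members are not all equal.
states = list(product([0, 1], repeat=3))
satisfying = [s for s in states if len(set(s)) > 1]

# Of the 2^3 = 8 states, only all-zeros and all-ones fail the
# formula, so 6 remain. A probability p asserted for the formula
# is a belief ranging over exactly this set of states.
print(len(satisfying))
```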

emiruz 3 months ago

A short statistical reasoning test

Here are a few practical questions of my own invention which are easy to comprehend but very difficult to solve without statistical reasoning competence. They are provided in order of difficulty. The answers are at the end. If you find errors or have elegant alternative solutions, please email me (address in bio)! QUESTIONS 1. Sorting fractions under uncertainty You are given the number of trials and successes for a set of items, and you are asked to sort them by the fraction #successes / #trials.

emiruz 6 months ago

Fitting models from noisy heuristic labels

Summary I present a weak supervision paradigm called “data programming” which uses maximum likelihood estimation to produce soft labels from heuristics. These soft labels can then be used to train other models, without true labels being required at any stage. I’ve included a simple example from first principles to show that the methods work. The original authors have a fully featured package called Snorkel which provides sophisticated data programming and related features.
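The flavour of the method can be shown from first principles (a toy Dawid-Skene-style EM of my own, not Snorkel's actual label model): treat each heuristic's unknown accuracy as a parameter, then alternate between inferring soft labels and re-estimating accuracies by maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground truth and three heuristic labellers ("labelling
# functions") voting +1/-1, each correct with a different accuracy.
n, accs = 500, [0.9, 0.8, 0.7]
y = rng.choice([-1, 1], size=n)
votes = np.array([np.where(rng.random(n) < a, y, -y) for a in accs]).T

# EM: the E-step infers soft labels from current accuracy estimates;
# the M-step re-estimates each accuracy from the soft labels.
a = np.full(3, 0.7)  # initialise above chance to break the symmetry
for _ in range(20):
    # Likelihood of the observed votes under y=+1 and y=-1.
    l_pos = np.prod(np.where(votes == 1, a, 1 - a), axis=1)
    l_neg = np.prod(np.where(votes == -1, a, 1 - a), axis=1)
    soft = l_pos / (l_pos + l_neg)  # P(y = +1 | votes)
    agree = (votes == 1)
    a = (soft[:, None] * agree + (1 - soft[:, None]) * ~agree).mean(axis=0)

# These soft labels can now train a downstream model; no true labels
# were used in the estimation.
pred = np.where(soft > 0.5, 1, -1)
print((pred == y).mean())
```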

emiruz 8 months ago

Bootstrapping ranking models with an LLM judge

SUMMARY I use 500 Hacker News (HN) titles and an LLM to derive an article ranking model from a user-supplied preference description. The LLM supplies the labelled data, whilst Ridge regression and cheap sentence-transformer embeddings provide the features. The surrogate has 0.74 Spearman correlation with the LLM labels, which is remarkable given that the experiment is entirely unoptimised. INTRODUCTION Preference-based ordering is useful for lists because it makes it more likely that you'll find what you're looking for when browsing from the top down.
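The shape of the pipeline can be sketched with synthetic stand-ins (my own toy; random vectors play the embeddings and a noisy linear score plays the LLM judge): fit ridge regression in closed form, then measure rank agreement with Spearman correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: X plays the role of sentence embeddings,
# scores the role of LLM judge labels (linear signal plus noise).
n, d = 500, 32
X = rng.standard_normal((n, d))
scores = X @ rng.standard_normal(d) + rng.standard_normal(n)

# Ridge regression in closed form: beta = (X'X + lam*I)^-1 X'y,
# fit on the first 400 rows and evaluated on the held-out 100.
lam = 1.0
Xtr, ytr, Xte, yte = X[:400], scores[:400], X[400:], scores[400:]
beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)
pred = Xte @ beta

# Spearman correlation is the Pearson correlation of the ranks.
def spearman(a, b):
    return np.corrcoef(a.argsort().argsort(), b.argsort().argsort())[0, 1]

print(round(spearman(pred, yte), 2))
```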

emiruz 10 months ago

Kelly fractions for independent simultaneous bets

INTRODUCTION This post is about sizing independent simultaneous bets through methods related to the Kelly criterion. I’ll start by explaining what the Kelly criterion is and how to derive it. I’ll then discuss a simple way to extend it to simultaneous independent binary bets. KELLY CRITERION The Kelly criterion imagines a single bet made sequentially an infinite number of times. It aims to maximise the geometric mean of the returns in that scenario.
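For reference, the single-bet Kelly criterion has a closed form (the standard textbook result, not the post's simultaneous-bet extension): for a bet paying b-to-1 with win probability p, maximising the expected log of wealth gives the fraction f* = p - q/b.

```python
# Kelly fraction for a single binary bet paying b-to-1 with win
# probability p: maximise E[log wealth] = p*log(1 + f*b) + q*log(1 - f),
# whose closed-form maximiser is f* = p - q/b.
def kelly(p: float, b: float) -> float:
    q = 1 - p
    return p - q / b

# A 60% coin at even odds: stake 20% of the bankroll.
print(kelly(0.6, 1.0))
```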

emiruz 1 year ago

RBF kernel approximation with random Fourier features

A basic application of linear methods is the linear regression model. However, in some settings, it can be limiting, at least because: (1) it may have few degrees of freedom and therefore saturate quickly, and (2) it may imply a rigid geometry (a hyper-plane) which is often unrealistic. It turns out that if we can express similarity between data points as a Gram matrix, we can do linear regression directly on it, otherwise known as "kernel" regression.
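The construction itself is short (the standard Rahimi-Recht recipe, assumed here rather than quoted from the post): sample frequencies from the Gaussian kernel's spectral distribution so that inner products of cosine features approximate the RBF kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
sigma, d, D = 1.0, 5, 4000

# Random Fourier features: z(x) = sqrt(2/D) cos(Wx + b) with rows of
# W ~ N(0, sigma^-2 I) and b ~ U[0, 2pi], so z(x).z(y) ~= k(x, y).
W = rng.standard_normal((D, d)) / sigma
b = rng.uniform(0, 2 * np.pi, D)
z = lambda x: np.sqrt(2.0 / D) * np.cos(W @ x + b)

x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
approx = z(x) @ z(y)
print(abs(exact - approx))  # shrinks as D grows
```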

emiruz 1 year ago

Metric learning with linear methods

I read this paper a while ago, which sets out the problem of linear metric learning nicely. I wanted to see whether metric learning was possible to carry out in closed form. It turned out to be relatively straightforward. Say we have some feature vectors \(x_i \in \mathbb{R}^p\) and some responses \(y_i \in \mathbb{R}^k\). We want: \[(Ax_i - Ax_j)^\top (Ax_i - Ax_j) \approx (y_i - y_j)^\top (y_i - y_j)\] where \(A\) is a \(p \times p\) matrix.
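One way the closed form can go (my own least-squares sketch, which may differ in detail from the post's derivation): the constraint is linear in the entries of M = A'A, so solve for M by least squares over pairs, project to the PSD cone, and factor A back out of M.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: responses generated by a hidden linear map, so a
# metric reproducing ||y_i - y_j||^2 exactly exists.
n, p, k = 200, 4, 4
X = rng.standard_normal((n, p))
A_true = rng.standard_normal((k, p))
Y = X @ A_true.T

# (x_i - x_j)' M (x_i - x_j) = vec(M) . vec(d d') is linear in M = A'A,
# so M is recoverable by least squares over a sample of pairs.
pairs = [(i, j) for i in range(40) for j in range(i + 1, 40)]
Phi = np.array([np.outer(X[i] - X[j], X[i] - X[j]).ravel() for i, j in pairs])
t = np.array([np.sum((Y[i] - Y[j]) ** 2) for i, j in pairs])

M = np.linalg.lstsq(Phi, t, rcond=None)[0].reshape(p, p)
M = (M + M.T) / 2                      # symmetrise
vals, vecs = np.linalg.eigh(M)
vals = np.clip(vals, 0, None)          # project to the PSD cone
A = np.diag(np.sqrt(vals)) @ vecs.T    # then M = A'A up to rotation

d_hat = lambda i, j: (A @ (X[i] - X[j])) @ (A @ (X[i] - X[j]))
print(abs(d_hat(100, 150) - np.sum((Y[100] - Y[150]) ** 2)))
```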

emiruz 1 year ago

The "Billion Row Challenge!" with Fortran

SUMMARY I tackle the 1BRC in Fortran, which requires processing 1B rows of weather station data (~15GB) to obtain the min/max/mean for each station as quickly as you can muster. I started out with a time of 2m8s and reduced it to a best run time of <6s on a 4-core i7 laptop with 16GB RAM. I herein document how. INTRODUCTION The 1BRC data looks like this: Hamburg;12.0 Bulawayo;8.9 Palembang;38.8 St.
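For orientation, the task itself is compact (a naive Python baseline of my own, nothing like the optimised Fortran): one pass over station;temperature lines, tracking min, max, sum and count per station.

```python
from collections import defaultdict

# Naive baseline for the 1BRC task, on a tiny in-memory sample.
lines = ["Hamburg;12.0", "Bulawayo;8.9", "Hamburg;-3.4", "Palembang;38.8"]

# Per-station running [min, max, sum, count].
stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
for line in lines:
    name, raw = line.split(";")
    t = float(raw)
    s = stats[name]
    s[0], s[1] = min(s[0], t), max(s[1], t)
    s[2] += t
    s[3] += 1

for name in sorted(stats):
    mn, mx, total, count = stats[name]
    print(f"{name}: {mn}/{mx}/{total / count:.1f}")
```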

emiruz 1 year ago

Advent of Code in Prolog, Haskell, Python and Scala

Here are some Advent of Code solutions: 2023 (Prolog), 2022 (Haskell), 2021 (Python & Scala) (in progress at the time of writing). Here are some comparative notes: My Haskell solutions were mostly < 27 LoC. The Prolog solutions were considerably longer. The Prolog solutions were, on average, much harder to code for me. My Prolog solutions ended up looking rather functional for the most part.

emiruz 2 years ago

Domicles: a novel logic puzzle using Domino tiles

INTRODUCTION [If you want to have a go straight away, jump to the examples at the bottom of this post.] Making a novel logic puzzle has been a bucket list item for me since yesteryear and I was finally handy enough with Prolog to endeavour for something elegant without having to write reams of code. I arbitrarily decided that I wanted the puzzle to be expressed in terms of Domino tiles.

emiruz 2 years ago

A minimal probabilistic Prolog meta-interpreter

What follows are some notes about a minimal proof-of-concept for a stochastic simulator in Prolog via a meta-interpreter. META-INTERPRETER Here is a Prolog meta-interpreter which supports probabilistic head clauses through the use of the p/2 predicate:

prove(true) :- !.
prove((A,B)) :- !, prove(A), prove(B).
prove(Head) :-
    clause(Head, Body),
    (p(Head,P) -> (random(X), 1-P < X) ; true),
    prove(Body).

sim(_,0,S,S) :- !.
sim(Goal,N0,Acc0,S) :-
    (prove(Goal) -> Acc is Acc0+1 ; Acc is Acc0),
    N is N0-1,
    sim(Goal,N,Acc,S).
sim(Goal,N,P) :- sim(Goal,N,0,S), P is S/N.

Here is an example program:

emiruz 2 years ago

Better data analysis with logic programming

INTRODUCTION Gentle reader, permit me to try and convince you that data analysis is better with logic programming. In this post I’ll analyse a staple dataset – the ggplot2 diamond prices – using a symbolic approach which, I will demonstrate, is able to establish a robust model, otherwise difficult to recover. DATA I’ll use the diamond prices data which comes with the R ggplot2 package. It consists of information about 50k+ round-cut diamonds.

emiruz 2 years ago

Hidden information and solving Dominoes

Summary Some notes about the construction of a Block Dominoes playing algorithm for a hidden information variant of the game. I build a game simulator, learn from a heuristic algorithm and then develop some play-out based algorithms which seem fairly good. I conjecture the final algorithm approximates optimal play. The final SWI Prolog implementation is available here. I am selling an optimised JavaScript library version, embeddable in both the browser and the backend.

emiruz 2 years ago

Analysis of the data job market using "Ask HN: Who is hiring?" posts

SUMMARY I parse HackerNews (HN) "Ask HN: Who is hiring?" posts from 2013 to the time of writing and analyse them to better understand the trends in the data job market, with a focus on the fate of data science. Here are my main conclusions: It is likely that the Data Scientist role is in a long term decline and that skills such as data mining and visualisation are also out of favour.

emiruz 2 years ago

An optimal-stopping quant riddle

Introduction I happened upon a post by Gwern discussing, in some detail, various solutions to riddle #14 from Nigel Coldwell’s list of quant riddles. I initially got as far as the problem description in Gwern’s article and avoided reading further so I could first solve it for myself. The problem is stated as follows: You have 52 playing cards (26 red, 26 black). You draw cards one by one. A red card pays you a dollar.
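The riddle (in its standard form: a black card costs you a dollar, and you may stop at any time) yields to dynamic programming. This is a common approach, not necessarily any of the solutions Gwern discusses: with r red and b black cards remaining, the value is the better of stopping now or drawing once more.

```python
from functools import lru_cache

# With r red and b black cards remaining, the option to stop means the
# value is max(0, expected value of drawing one more card).
@lru_cache(maxsize=None)
def value(r: int, b: int) -> float:
    if r == 0 and b == 0:
        return 0.0
    ev = 0.0
    if r:
        ev += r / (r + b) * (1 + value(r - 1, b))
    if b:
        ev += b / (r + b) * (value(r, b - 1) - 1)
    return max(0.0, ev)

print(round(value(26, 26), 4))  # roughly 2.62 for a full deck
```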

emiruz 2 years ago

Estimating gym goers: a mark and recapture experiment

Introduction I had recently started going to a new specialist gym that runs 3 classes per day during the working week and is closed the rest of the time. I’ve been at a few different times on a few different days, and already I was seeing many of the same people from the first class. It occurred to me that the chance of seeing the same faces should somehow scale with the number of people going to the gym, hence it may be possible to estimate the total number of gym members from the number of people I repeatedly see.
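The classical tool for this is the Lincoln-Petersen estimator (the textbook mark-and-recapture formula; the numbers below are hypothetical, not the post's data): mark M individuals on a first visit, then on a second visit of C individuals count the R you recognise.

```python
# Lincoln-Petersen: if M marked individuals mix into a population of
# size N, a second sample of C should contain about C * M / N marked
# ones, so observing R recaptures gives the estimate N ~= M * C / R.
def lincoln_petersen(M: int, C: int, R: int) -> float:
    return M * C / R

# Chapman's variant is less biased for small samples.
def chapman(M: int, C: int, R: int) -> float:
    return (M + 1) * (C + 1) / (R + 1) - 1

# E.g. recognise 8 of the 15 people at a second class, after having
# seen 20 distinct people earlier (hypothetical numbers).
print(lincoln_petersen(20, 15, 8))   # 37.5
print(round(chapman(20, 15, 8), 1))  # 36.3
```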

emiruz 2 years ago

Blocking, covariate adjustment and optimal experiment design

Summary I explain blocking, optimal design and covariate adjustment as methods to improve power in design of experiments. I try to motivate this as something data scientists working with online experiments ought to be doing since it can drastically improve the power of an experiment and make design of experiments tractable where otherwise it would not be. I also implement a D-optimal design fitting algorithm from first principles in Python to give the reader a deeper sense of what optimal design does, and I provide a slightly hand-wavy example to sketch how all these methods could be used together in the real world.
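The flavour of a D-optimal fitting algorithm can be conveyed with a greedy point-exchange sketch (my own simplification in the spirit of Fedorov exchange, not the post's implementation): keep swapping design points for candidate points whenever a swap increases det(X'X).

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate design points (rows); D-optimality picks the n-run design
# X maximising det(X'X), i.e. shrinking the confidence ellipsoid of
# the regression coefficients.
cands = rng.uniform(-1, 1, (100, 3))
n = 8
idx = list(rng.choice(len(cands), n, replace=False))
start = list(idx)

def logdet(rows):
    X = cands[rows]
    return np.linalg.slogdet(X.T @ X)[1]

# Greedy exchange: replace a design point with a candidate point
# whenever the swap increases log det(X'X); stop when no swap helps.
improved = True
while improved:
    improved = False
    for i in range(n):
        for c in range(len(cands)):
            trial = list(idx)
            trial[i] = c
            if logdet(trial) > logdet(idx) + 1e-9:
                idx, improved = trial, True

print(round(logdet(idx) - logdet(start), 2))  # gain in log det
```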

emiruz 2 years ago

Semi-supervised clustering with logic programming

Summary I motivate clustering as a problem well suited to logic programming in the general case, and I volunteer a couple of artisanal clustering algorithms in Prolog demonstrated on some mock data. Note: the code herein is my own. If you see bugs, or are a Prolog mage and can write it even more concisely, I'd be grateful if you could let me know. Introduction There are many clustering algorithms born in specific circumstances such as k-means (via vector quantisation), biclustering (via gene expression analysis), DBSCAN (via spatial analysis) and so on, which went on to mostly be abused in the general setting.

emiruz 2 years ago

Prolog for data science

Summary I demonstrate a widely applicable pattern which integrates Prolog as a critical component in a data science analysis. Analytic methods are used to generate properties about the data under study and Prolog is used to reason about the data via the generated properties. The post includes some examples of piece-wise regression on timeseries data by symbolic reasoning. I also discuss the general pattern of application a bit. Introduction Given some data, the bulk of “data science” for me is the study of what the data implies and whether it can be coerced into the context specific role usually decided by someone other than me.

emiruz 2 years ago

SQL + M4 = Composable SQL

Introduction I often work with clients who have large “data lakes” or big star schema style enterprise databases with fact and dimension tables as far as the eye can see. Invariably said clients end up with a substantial SQL codebase composed of hundreds of independent queries with lots of overlap between them. I want to be able to treat SQL repositories like I’d treat other codebases. That is, I’d like to create libraries, share code, test blocks independently, and so on.
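The idea can be sketched with a hypothetical pair of files (my own illustration; all file, macro and table names are invented): shared query fragments live in an m4 macro library and are expanded into concrete queries with `m4`.

```m4
dnl lib.m4 -- shared, independently testable SQL fragments as m4 macros.
define(`RECENT_ORDERS',
  `SELECT * FROM orders WHERE order_date > CURRENT_DATE - 30')dnl

dnl report.sql.m4 -- composes the shared fragment into a full query.
dnl Render it with: m4 lib.m4 report.sql.m4 > report.sql
SELECT customer_id, COUNT(*) AS n
FROM (RECENT_ORDERS) AS recent
GROUP BY customer_id;
```

The macro body is quoted with m4's `` ` ``/`'` quotes so it expands only where used, which is what lets fragments be shared across many queries like library code.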
