OpenRefine

August 12, 2016

A long, long while ago, I had a close look at Google Refine.  This tool’s sole purpose is to extract, clean, transform and reconcile data.  And the messier your data, the better you’ll like this tool!

At first glance Google Refine was very interesting but, at the time, the whole thing was more promising than useful.  But recently, while looking for Google Refine again (I just could not remember the name!), I found its successor: OpenRefine!

Since then, Java has matured, web services are more robust, the tool has progressed quite a lot and OpenRefine uses everything in its power to make your job easier!  More ways to reconcile data, many more ways to transform it, more predefined functions and features!

Custom transformations can be written in 3 ways with some easy coding: GREL (Google Refine Expression Language), Jython (a Python implementation that runs on Java) or Clojure.  Many more ways to reconcile data are now available, including from web services, along with more import formats (TSV, CSV, Excel, JSON, XML, etc.) and the list goes on.  I must say OpenRefine has lots to offer!
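To give an idea of what "easy coding" means on the Jython side, here is a minimal sketch of a cell clean-up written as plain Python.  The function name clean_cell and the sample values are mine, purely for illustration.  As far as I know, inside OpenRefine the cell content is exposed to a Jython expression as `value` and the expression ends with a `return`, so only the body of the function would be pasted in.

```python
import re


def clean_cell(value):
    """Trim a cell, collapse runs of whitespace and title-case the result.

    Hypothetical helper: in OpenRefine the body would operate on the
    `value` variable provided to the Jython expression.
    """
    if value is None:
        # Leave empty cells untouched.
        return None
    collapsed = re.sub(r"\s+", " ", value.strip())
    return collapsed.title()


if __name__ == "__main__":
    # Illustrative messy values, not taken from a real dataset.
    for raw in ["  new   york ", "QUEBEC  city", None]:
        print(repr(clean_cell(raw)))
```

The same kind of clean-up can probably be done with GREL's built-in functions; the point here is only that a few lines of ordinary Python are enough.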

So instead of writing a novel about how cool this tool is, I’ll leave you with a compilation of videos, tutorials, documents and websites that demonstrate what OpenRefine can do for you!

School of data

Enipedia Tutorial

OpenRefine.org

Hope this helps!

In the I-have-to-clean-up-this-mess department, DataCleaner is another useful tool.  But that’s going to be the topic of another post!


What’s new?

July 19, 2016


After a major data loss (I haven’t given up on getting back all my data, mostly code repositories and databases!), I had to start all my pet projects from scratch. Luckily, it’s easier the second time around, as they say! And, lucky me, I store all my personal stuff on the web! So here’s a list of what’s coming up on this blog.

Ruzzle

Even though I had a decent working version of the genetic algorithm program to find the best ruzzle grid (original posts in French here, here and here), I wasn’t satisfied with the code.  It slowly evolved from a bunch of code snippets into something I could somehow call a genetic algorithm.  The problem was that my solution was tailored to this specific problem only!  Since I lost all the Smalltalk code, I redid the whole thing from scratch: better design, simpler API, more flexible framework.  It can currently solve a TSP (travelling salesman problem), the best ruzzle grid search and a diophantine equation.

I also plan to provide examples of the 8 queens problem, the knapsack problem, a quadratic equation problem, a resource-constrained problem and a simple bit-based example with the GA framework.  Besides, there are now more selection operators, more crossover operators, more termination detectors (as well as support for sets of termination criteria!), cleaner code and the list goes on!  So I’ll soon publish a GA framework for Pharo.
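While waiting for the Pharo code, here is a rough idea of what a simple bit-based example looks like, sketched in Python.  This is not the framework's API: the function names, the choice of tournament selection, one-point crossover and bit-flip mutation, and all the parameter values are my own assumptions for illustration.  The goal is the classic "maximize the number of 1 bits" toy problem, with two termination criteria (a perfect individual or a maximum number of generations).

```python
import random


def fitness(bits):
    # Toy objective: count the 1 bits.
    return sum(bits)


def tournament(population, rng, k=3):
    # Selection operator: best of k randomly picked individuals.
    return max(rng.sample(population, k), key=fitness)


def one_point_crossover(a, b, rng):
    # Crossover operator: head of one parent, tail of the other.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(bits, rng, rate):
    # Flip each bit independently with probability `rate`.
    return [1 - bit if rng.random() < rate else bit for bit in bits]


def evolve(length=20, pop_size=30, max_generations=200,
           mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(max_generations):      # termination: generation limit
        if fitness(best) == length:       # termination: perfect individual
            break
        children = [best]                 # elitism: keep the current best
        while len(children) < pop_size:
            child = one_point_crossover(tournament(population, rng),
                                        tournament(population, rng), rng)
            children.append(mutate(child, rng, mutation_rate))
        population = children
        best = max(population, key=fitness)
    return best


if __name__ == "__main__":
    winner = evolve()
    print(winner, fitness(winner))
```

Swapping in another selection or crossover function only means changing the two calls inside evolve, which is the kind of flexibility a framework would offer in a cleaner way.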

As most of you know, the Rush fan in me had to pick a project name somehow related to my favorite band!  So the framework will be called Freewill, after these lyrics from the song:

Each of us
A cell of awareness
Imperfect and incomplete
Genetic blends
With uncertain ends
On a fortune hunt that’s far too fleet

Bingo

A stupid quest I’ll address after the first version of my GA framework is published.  It all started with a simple question related to the game of bingo (don’t ask!): can we estimate the number of bingo cards sold at an event based on how many numbers it takes for each card configuration to have a winner?  So it’s just a matter of generating millions of draws and cards à la Monte Carlo and averaging how many numbers it takes for every configuration.  Why am I doing that?  Just because I’m curious!
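The Monte Carlo idea can be sketched in a few lines of Python.  Everything specific here is an assumption of mine: the usual 75-ball card (5 columns of 15 numbers each, free center square), a single winning configuration (any full row, column or diagonal), and the function names.  The sketch averages how many numbers are called before the first winner shows up for a given number of cards in play.

```python
import random


def make_card(rng):
    # Assumed 75-ball layout: column c draws 5 numbers from 15c+1 .. 15c+15.
    columns = [rng.sample(range(15 * c + 1, 15 * c + 16), 5) for c in range(5)]
    columns[2][2] = 0  # 0 marks the free center square
    return columns     # indexed as card[column][row]


def winning_lines(card):
    # Assumed configuration: any full row, column or diagonal.
    rows = [{card[c][r] for c in range(5)} for r in range(5)]
    cols = [set(column) for column in card]
    diags = [{card[i][i] for i in range(5)},
             {card[i][4 - i] for i in range(5)}]
    # The free square never needs to be called.
    return [line - {0} for line in rows + cols + diags]


def draws_until_winner(n_cards, rng):
    lines = [line
             for _ in range(n_cards)
             for line in winning_lines(make_card(rng))]
    called = set()
    for count, number in enumerate(rng.sample(range(1, 76), 75), start=1):
        called.add(number)
        if any(line <= called for line in lines):
            return count
    return 75


def average_draws(n_cards, trials=200, seed=0):
    rng = random.Random(seed)
    total = sum(draws_until_winner(n_cards, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(n, average_draws(n))
```

Estimating the number of cards sold would then mean running this for many values of n_cards and finding which one best matches the number of calls observed at the event.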

Glorp

There’s been a lot of action on the Pharo side and with Glorp.  I plan on having a serious look at the latest Glorp/Pharo combo and even participating in its development!

Sudoku

I’ll translate my articles on the SQL sudoku solver (in French here, here and here) into English and test the whole thing on the latest MySQL server.  Besides, db4free has upgraded to a new MySQL server version!

NeoCSV

I had done a port of NeoCSV to Dolphin right before losing all my code data.  Wasn’t hard to port so I’ll redo it as soon as I reinstall Dolphin!

Smalltalk

It’s time to reinstall VisualAge, VisualWorks, Squeak, ObjectStudio and Dolphin and see what’s new in each environment!  From what I saw, there’s a lot of new and interesting stuff on the web side.  Add to that the fact that most social media platforms have had significant changes in their respective APIs recently, so there’s a lot to learn there!

 

That’s a wrap, folks!


Last name, first name and country

January 19, 2015

I’m developing a utility and I need data, lots of data, which is why I’m calling on you!

I’m looking for data (e.g. baseball, hockey, football or soccer leagues, lists of chess players, miscellaneous associations or groups, etc.) containing names (last name and first name, in separate fields or not) AND the players’ country of origin.  CSV, TSV, DBF, SQL, MySQL or MS-Access format will all do, as long as the files can be downloaded and easily imported into a database.

Ideally, I hope to collect names from every country in the world.  Obviously, since most of my current data covers North America, I’m more interested in collecting data from other geographic regions!

If you have links to suggest (direct links to the downloadable files), let me know by sending me an email at:

Thanks for your help!