Nettoyer et préparer des données avec OpenRefine

1 mai 2017

Un autre excellent tutoriel sur OpenRefine : Nettoyer et préparer des données avec OpenRefine.


Nettoyer les données avec OpenRefine

13 avril 2017

Un autre excellent article sur l’utilisation d’OpenRefine.

OpenRefine : quelques ressources en français

21 septembre 2016

Vous le savez, je ne taris pas d’éloges à l’égard d’OpenRefine.  Le malheur, c’est que ce puissant outil a peu de vitrine dans la communauté informatique francophone.  Qu’à celà ne tienne, voici une courte liste de ressources en français sur le web!



How do you parse fractions with GREL and OpenRefine?

13 août 2016

I just came across a question related to parsing fractions on StackOverflow and I thought about how easy that kind of stuff could be done Smalltalk (because we have real fractions!).  And then, I wondered how hard that challenge would be with something as limited as GREL (the language used in OpenRefine, namely Google Refine Expression Language).  So I tried to do it in GREL!

Here’s the challenge : how do you go from a textual representation of a fraction like -10 2/3 to the number -10.66666 with GREL?

Here’s my test data:

fraction, target
10 2/3, 10.666666666666666
-10 2/3, -10.666666666666666
2/3, 0.666666666666666
-2/3, -0.666666666666666
+10 2/3, 10.666666666666666
+2/3, 0.666666666666666
2 2/3, 2.666666666666666
-2 2/3, -2.666666666666666
+2 2/3, 2.666666666666666
+2/3, 0.666666666666666
10 11/13, 10.8461538461538461
-10 11/13, -10.8461538461538461
11/13, 0.8461538461538461
-11/13, -0.8461538461538461
+10 11/13, 10.8461538461538461
+11/13, 0.8461538461538461
17/4, 4.25
+17/4, 4.25
-17/4, -4.25

The trick is to start small and focus on one case at a time! Just make sure your logic isn’t flawed first and then you can work on each specific case.  Here’s how I did it.

Start by making sure your code logic is right:

if(contains(value, « / »), if(contains(value,  » « ), « integer with fraction », « fraction »), « integer »)

Then, for each case, dig deeper and make sure you identify each case precisely. In the expression above, replace :

« integer with fraction » with:
if(startsWith(value, « -« ), « negative integer with fraction », « positive integer with fraction »)

« fraction » with:
if(startsWith(value, « -« ), « negative fraction », « positive fraction »)

« integer » with:
if(startsWith(value, « -« ), « negative integer », « positive integer »)

You should end up with something like this:

if(contains(value, « / »), if(contains(value,  » « ), if(startsWith(value, « -« ), « negative integer with fraction », « positive integer with fraction »), if(startsWith(value, « -« ), « negative fraction », « positive fraction »)), if(startsWith(value, « -« ), « negative integer », « positive integer »))

Now test!

fractions logic

Once every case is covered, now you’ll have to transform every of the 6 cases in code to extract and convert those numbers. It’s just a matter of splitting every part of the number with GREL’s usual friendly methods, namely indexOf, substring and startsWith.  And since the division operator (/) in GREL behaves like the floor function, you need to multiply your results (the strings that have been converted to numbers using toNumber) by 1.0 to force a conversion to a decimal number.

It’s really painful to code but, honestly, all you need is a bit of patience and make sure you deal with one part of the string at a time (and one case at a time)!

Wanna test your code?  Just select the « fraction » column and apply the code below in a transform expression : the target value and the transformed value should be the same in the preview pane!

strings to fractions

So here’s the GREL code to parse fractions (and convert them to decimal) :

if(contains(value, "/"), if(contains(value, " "), 
if(startsWith(value, "-"), -1.0*(toNumber(substring(value, 1, indexOf(value, " ")))+
(1.0*toNumber(substring(value, indexOf(value, " "), indexOf(value, "/")))/
toNumber(substring(value, indexOf(value, "/")+1)))), 
(toNumber(substring(value, 0,indexOf(value, " ")))+
(1.0*toNumber(substring(value, indexOf(value, " "), 
indexOf(value, "/")))/toNumber(substring(value, indexOf(value, "/")+1))))), 
if(startsWith(value, "-"), 1.0*toNumber(substring(value, 0, indexOf(value, "/")))/
toNumber(substring(value, indexOf(value, "/")+1)), 
1.0*toNumber(substring(value, 0, indexOf(value, "/")))/
toNumber(substring(value, indexOf(value, "/")+1)))), 
if(startsWith(value, "-"), toNumber(value), toNumber(value)))

NOTE: for some strange reason, WordPress keeps changing my double quotes 
with the French quotes.  Don't worry, it's all regular English quotes you 
should see!










12 août 2016

A long long while ago, I had a close look at GoogleRefine.  This tool’s sole purpose is to extract, clean, transform and reconcile data.  And  the more the messy is your data, the better you’ll like this tool!

At first glance GoogleRefine was very interesting but, at the time, the whole thing was more promising than useful.  But recently, while looking for GoogleRefine again (I just could not remember the name!), I found its successor: OpenRefine!

Since then, Java has matured, web services are more robust, the tool has progressed quite a lot and OpenRefine uses everything in it’s power to facilitate your job!  More ways to reconcile the data, many different ways to transform your data, more predefined functions and functionalities!

Custom transformations can be done in 3 ways with some easy coding : with GREL (Google Refine Expression Language), Jython (a Python implementation that runs on Java) or Clojure.  Many many many ways to reconcile the data are now available, more import formats (TSV, CSV, Excel, JSON, XML, etc), more ways to reconcile data from webservices and the list goes on.  I must say OpenRefine has lots to offer!

So instead of writing a novel about how cool this tool is, I’ll leave you with a list/compilation of videos, tutorials, documents and websites that demonstrate what OpenRefine do for you!

School of data

Enipedia Tutorial

Hope this help!

In the I-have-to-clean-up-this-mess department, DataCleaner is another useful tool.  But that’s going to be the topic of another post!