Tag Archive : programming


Viral document prediction using Java machine learning

This blog post is about two machine-learning frameworks, Smile and DeepLearning4j. I am presenting my feedback and some hints useful for any Java developer interested in machine learning.

Introduction and disclaimer

This blog post is based on my own experience with machine-learning frameworks and Java coding. I am not the most efficient or skilled data scientist, nor an expert in the ML domain, though I have used machine learning for some research use-cases in the past and for a real in-production service this year.

I have been using machine learning (and deep learning) for the following use-cases (classification problems):

  • How to identify code and bugs based on a certain type of code, i.e. initialization code, exception-handling code, I/O code, etc.
  • How to identify a security flaw in code (the source, the sink, the incriminated leak), especially injection faults
  • An NLP problem: how to identify newspaper titles or articles that may become viral

Experiments

I began my NLP experiments with frameworks such as Weka and KNIME, and they were great.

I quickly obtained some results while I was extracting, refining and evaluating my dataset.

However, for the next step, I had to increase my dataset size (by 10x to 30x, to reach more than 30,000 documents). That is still a bit small, but I was already hitting the limits of Weka and KNIME on some algorithms.

Why not use XYZ or XYY?

To implement my latest use-case (which has been running in production for a month), I had to identify a technological solution for NLP classification.

This kind of problem is hard since it consumes a lot of resources (software and hardware) and may still result in low accuracy.

I basically had three machine-learning options:

  • use a SaaS service such as Google Cloud Machine Learning or an AMI from Amazon
  • use one of the de facto leaders such as Keras, TensorFlow or PyTorch
  • use another solution

Here are my reasons for going with a Java framework. Everything is debatable; if you disagree with me, please write it in the comments.

How the heck do you go to production with Python?

There was a remaining issue to solve before switching to Python. I basically had three goals to achieve:

  • Identify a machine-learning algorithm or technology to predict viral content; fortunately, I found several research papers on this.
  • Implement a prototype quickly (in less than 2 months, in fact) to assess whether we could use the technology effectively.
  • Go to production in 3 months, with some hard requirements such as the prediction computation time.

Going to production with Python frameworks definitely looked tough given my schedule and my (current) lack of high-level Python programming skills.

Why not a Saas service ?

I wanted one. And I still want to use one. I wanted to use GCML, but our current infrastructure is entirely based on Amazon. I gave AMI more than a try and was dumbfounded by how obscure it is to get started with. It was a big fail: the UI was awkward, and I understood that Amazon was providing EC2 instances with all the preinstalled tools to perform deep learning... in Python.

Basically, I wanted an API-style service where I could do the training and the prediction without worrying about the computation resources.

GCML (and probably Azure) are the best for this kind of service.

Why not PyTorch, Keras or TensorFlow?

Python frameworks are obviously the best way to implement machine learning. However, going live is another matter, since my Python skills were not good enough to stick to the schedule. In parallel with the prediction system, I had to develop a whole integration with our current infrastructure to obtain the prediction, process it and modify the system behaviour, including the REST API, security, monitoring and database connection.

My decision to use TensorFlow ran into the issue that I had to integrate the tool with MongoDB and build a REST API in a really short time, as well as process and extract the dataset and build the vectors.

OK, you do not know Python; what can Java do?

A notice before going further: Java frameworks are not as good (accurate, performant) as the previously mentioned solutions, and not as fast.

I did not know whether deep learning was the ideal solution for my problem, therefore I needed frameworks allowing me to compare classic machine-learning algorithms (forests, Bayesian classifiers) with deep learning.

I used Smile for its wide list of implemented machine-learning algorithms, and DeepLearning4j for my neural-network experiments.

Clearly, expect some surprises with both frameworks, but you can achieve some great results.

Machine learning with the Smile framework

You will find more information about this framework on https://github.com/haifengl/smile.

The framework is licensed under Apache 2.0 and is therefore business-friendly.

Several algorithms are provided in the framework; for classification these include, among others, k-nearest neighbours, logistic regression, naive Bayes, decision trees, random forests, gradient boosting, AdaBoost, SVM and neural networks.

The documentation is super great. Although the framework has been designed with Scala developers in mind, a Java developer will find enough help to build their tool.

Most of the time I spent was on converting MongoDB entries (documents) into a valid dataset. Using AttributeSelection and building nominal attributes was not so easy, and the implementation is quite memory-consuming.
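To give an idea of what that conversion looks like, here is a rough, simplified sketch: it reads documents from a hypothetical press.articles collection (the title and viral field names are made up), builds a dense bag-of-words matrix and trains a random forest using the Smile 1.x constructor-style API. It is an illustration of the approach, not the production code.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class MongoToSmileDataset {

        public static void main(String[] args) {
            List<String> titles = new ArrayList<>();
            List<Integer> labels = new ArrayList<>();

            // 1. Pull the raw documents out of MongoDB (collection and field names are hypothetical).
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> articles = client.getDatabase("press").getCollection("articles");
                for (Document doc : articles.find()) {
                    titles.add(doc.getString("title"));
                    labels.add(doc.getBoolean("viral", false) ? 1 : 0);
                }
            }

            // 2. Build a dense bag-of-words matrix: the classic Smile API expects double[][] features
            //    and int[] labels. The dense representation is exactly what makes this memory-hungry.
            Map<String, Integer> vocabulary = new LinkedHashMap<>();
            List<String[]> tokenized = new ArrayList<>();
            for (String title : titles) {
                String[] tokens = title.toLowerCase().split("\\W+");
                tokenized.add(tokens);
                for (String token : tokens) {
                    vocabulary.putIfAbsent(token, vocabulary.size());
                }
            }
            double[][] x = new double[titles.size()][vocabulary.size()];
            int[] y = new int[titles.size()];
            for (int i = 0; i < tokenized.size(); i++) {
                for (String token : tokenized.get(i)) {
                    x[i][vocabulary.get(token)] += 1.0; // raw term frequency
                }
                y[i] = labels.get(i);
            }

            // 3. Train a classifier (200 trees picked arbitrarily) and print the out-of-bag error.
            smile.classification.RandomForest forest = new smile.classification.RandomForest(x, y, 200);
            System.out.println("Out-of-bag error: " + forest.error());
        }
    }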

I also had some difficulties matching the algorithms I wanted to use with sparse arrays and optimized data structures. Some algorithms were failing at runtime, blaming me for using unsupported features. I clearly lost some time with such limitations.

Deep learning with the DeepLearning4J framework

 



        // Build the graph from the configuration and register the training listeners.
        cnnComputationGraph = new ComputationGraph(config);
        cnnComputationGraph.setListeners(
                new ScoreIterationListener(100),                                    // log the score every 100 iterations
                new EvaluativeListener(trainIterator, 1, InvocationType.EPOCH_END), // evaluate at the end of each epoch
                new PerformanceListener(1, true));                                  // report throughput and score
        cnnComputationGraph.init();

        // Workspaces drive DL4J's off-heap memory management.
        log.info("[CNN] Training workspace config: {}", cnnComputationGraph.getConfiguration().getTrainingWorkspaceMode());
        log.info("[CNN] Inference workspace config: {}", cnnComputationGraph.getConfiguration().getInferenceWorkspaceMode());

        log.info("[CNN] Training launch...");
        cnnComputationGraph.fit(trainIterator, nEpochs);

        // Print the number of parameters per layer, then for the whole graph.
        log.info("[CNN] Number of parameters by layer:");
        for (final Layer l : cnnComputationGraph.getLayers()) {
            log.info("[CNN] \t{}\t{}\t{}", l.conf().getLayer().getLayerName(), l.type().name(), l.numParams());
        }

        log.info("[CNN] Number of parameters for the graph numParams={}, summary={}", cnnComputationGraph.numParams(), cnnComputationGraph.summary());
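The configuration object (config) fed to the graph is not shown in the post. For context, here is a minimal, hypothetical sketch of what a text-CNN ComputationGraphConfiguration could look like, loosely following the public DeepLearning4j CNN sentence-classification example; the layer names, sizes and hyper-parameters are illustrative, not the ones used in production.

    import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.inputs.InputType;
    import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;
    import org.deeplearning4j.nn.conf.layers.GlobalPoolingLayer;
    import org.deeplearning4j.nn.conf.layers.OutputLayer;
    import org.deeplearning4j.nn.conf.layers.PoolingType;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.learning.config.Adam;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    public class CnnConfigSketch {

        /** vectorSize: dimension of the word vectors; maxTitleLength: titles padded/truncated to this many tokens. */
        public static ComputationGraphConfiguration build(int vectorSize, int maxTitleLength) {
            return new NeuralNetConfiguration.Builder()
                    .updater(new Adam(1e-3))
                    .graphBuilder()
                    .addInputs("input")
                    // Convolution over windows of 3 consecutive word vectors.
                    .addLayer("conv3", new ConvolutionLayer.Builder()
                            .kernelSize(3, vectorSize)
                            .stride(1, vectorSize)
                            .nOut(100)
                            .activation(Activation.RELU)
                            .build(), "input")
                    // Max-over-time pooling collapses each feature map to a single value.
                    .addLayer("pool", new GlobalPoolingLayer.Builder(PoolingType.MAX).build(), "conv3")
                    // Two classes: viral / not viral.
                    .addLayer("out", new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                            .nOut(2)
                            .activation(Activation.SOFTMAX)
                            .build(), "pool")
                    .setOutputs("out")
                    .setInputTypes(InputType.convolutional(maxTitleLength, vectorSize, 1))
                    .build();
        }
    }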

I led experiments with a CNN network to identify text patterns that may indicate a viral title in a document.

To build this CNN network, I also wanted to use Word2Vec to obtain a more flexible prediction, based not only on lexical similarity but also along a semantic axis.
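As an illustration, training such embeddings with DeepLearning4j's NLP module is fairly compact. The sketch below assumes the deeplearning4j-nlp dependency and treats every title as a sentence; the sample titles and hyper-parameters are made up.

    import org.deeplearning4j.models.word2vec.Word2Vec;
    import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator;
    import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
    import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

    import java.util.Collection;
    import java.util.List;

    public class TitleWord2Vec {

        /** Trains word embeddings from a collection of raw titles (one title = one sentence). */
        public static Word2Vec train(Collection<String> titles) {
            SentenceIterator sentences = new CollectionSentenceIterator(titles);

            // Lower-case, strip punctuation, split on whitespace.
            TokenizerFactory tokenizer = new DefaultTokenizerFactory();
            tokenizer.setTokenPreProcessor(new CommonPreprocessor());

            Word2Vec vec = new Word2Vec.Builder()
                    .minWordFrequency(1)   // raise this on a real corpus to drop rare tokens
                    .layerSize(100)        // dimension of the embeddings
                    .windowSize(5)
                    .seed(42)
                    .iterate(sentences)
                    .tokenizerFactory(tokenizer)
                    .build();
            vec.fit();
            return vec;
        }

        public static void main(String[] args) {
            Word2Vec vec = train(List.of(
                    "this simple trick will change your life",
                    "scientists discover a simple trick to save money",
                    "ten reasons why this video went viral"));
            // Semantic neighbours of a word (meaningful only on a real corpus).
            System.out.println(vec.wordsNearest("trick", 5));
        }
    }

The resulting vectors can then be stacked into the (title length x vector size) matrices that a text CNN such as the one sketched above consumes.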

DeepLearning4j is not a mature framework, despite the interesting features it provides. As of May 2019, the code is still in beta. Moreover, the Maven dependencies pull in a whole big set of transitive dependencies that may break your program.

The Maven releases of DeepLearning4j are not frequent (you may wait for months), and during that time many bug fixes land on the master branch without you having the benefit of using them. The snapshot version is not available, and building the project is a pain.

If I have not frightened you yet: the documentation is quite uneven; for example, the memory management (workspaces) is quite a mystery. Some exceptions and errors during the building of your network are genuinely disturbing, especially when I tried to build an RNN network based on this example.

However, with some patience, I have been able to use the CNN network and make some nice predictions.

My main current issues with the framework are a really high memory consumption (20 GB) and slow performance, because I still do not have access to an EC2 instance with a GPU.

However, I have been able to build a prediction system based on a CNN network and NLP using DeepLearning4j, and it is for the moment a success. Clearly, I am planning to replace DeepLearning4j with a Python equivalent, but for now I have some months to develop it in parallel.


And you, what would your choice be in such a situation?

Book : Java by Comparison, Simon Harrer

Book Review : Java by Comparison

This is a book review of Java by Comparison (you may check and buy it at https://java.by-comparison.com), an ultimate compilation of exercises and good tips for anyone who wants to improve their coding skills. How many books about code quality have you read? Books having a real impact on your code quality, your coder philosophy?

(more…)

Example of Slideshow preview

I have recently been using several JS frameworks to produce slideshows and, ultimately, programmatic videos. Here is my feedback about two frameworks, Reveal.js and Eagle.js.

(more…)

Vue.js code example with extends

This article will illustrate how to extend some parts or the whole of a Vue.js component. We'll look at two different practices: mixins and extends.

(more…)

Cobol Custom Rule

In this article, I present how to write custom Cobol rules with SonarQube and some caveats I encountered. The targeted audience should have some basic compiler knowledge (AST, lexical analysis, syntactic analysis).

(more…)

A short article to relay that Pivotal has announced a bunch of security fixes for its products.

(more…)

About me

18th March 2018

You are on this site because we share the same passion: software.

My name is Sylvain Leroy and I am a software programmer and startup founder. My speciality is to save software from all kinds of sickness.

About Me in Switzerland 🙂

What defines me most precisely is my passion for software and coding.

As explained on my company site (byoskill.com), I have three passions :

  • Software craftsmanship (SQA)
  • Legacy software migration
  • and startup environments

I have been doing this for a long time

I discovered coding at around 10 years old, on our family Commodore 64/128.

Commodore 64/128

Back in those days, I didn't know about coding; I learnt how to use the machine to get what I wanted: games and other interests. I was curious to manipulate and understand this strange creature.

Basic program

Of course I had normal activities, friends, sports. However, this machine fascinated me. We had two big books, in English (a foreign language to me), full of code listings. I spent numerous hours painstakingly typing them into this computer. Some of the code was games; some turned out to be funny noise and sound effects, like a plane engine.

I had the chance to be from the generation that grew up with computers and their incredible progress. One day we switched to x86, a 286 with a single-colour screen. I remember the screen glowing pale in my room at night. I made obvious progress in Basic, QBasic and Visual Basic (my college passion), then switched to Pascal (Delphi) at 14 during middle school. At 16, I was efficient in Delphi and I tried the C language without enthusiasm.

At the same time, I discovered ASM/Z80 programming for my Texas Instruments calculator and switched from Pascal to ASM, with the whole toolset: TASM, TLINK.

The book "The Art of Assembly Language" had a huge impact on me. I printed it with our good old printer into 4 big binders. And I started to love it.

Until I was 18, I continued my assembly programming experience in two fields:

I finally switched to C/C++ at around 19, using Visual C++, and slowly mastered it. I was always tempted to fall back to ASM using asm statements. Programming with limited resources is so much more fun than with high-level languages.

A lucky meeting changed my life and course

I followed a two-year computer science diploma at the University of Rennes 1 (Lannion). Then I discovered I could not work in industrial automation because of a wrong choice of course, so I switched to a general computer science Licence (3rd year).

During my master's degree, I chose as my exam project a technical project in which we were supposed to write a Java syntactic analysis tool (a linter). The meeting with this professor, Francois Bodin, influenced my final year of study and, at minimum, the next ten years of my life.

Together, under his tutelage, we imagined a research project, then a company-creation project, and we launched it. It was Tocea, which lived from 2009 to 2015 before being absorbed by a software vendor, Metrixware. I am and will be eternally grateful for the opportunity (the seed) Francois offered me. It has been an incredible adventure. This environment was totally new for me, my family and my surroundings, given our social origins.

Serenitec : research project

Tocea : my passion, and my initiatory route

Officially, Tocea was created in March 2010 after 3 years as a research project and one year of incubation.

Research project Serenitec

We were three at the beginning, and the project was called Navis.

There again, Marie-Anne and Florent were of great help and positively influenced the view of what Tocea could be, both socially and professionally.


A French article from this period is available here.

Francois Morin is also important to me, since we brought Tocea to its maturity together. Co-founding a company is never simple. We had to learn from each other to be able to work together. The stability and trust of our relationship were like the warm fireplace that attracts frozen voyagers. And we simply attracted the best people to reach, together, our ambitions as the small software vendor we were.

Our company had its life, successes and failures, joy and pain, but I remember it as a wonderful social experience. I have seen students coming for their first experience, getting their first job, growing up and becoming our real assets. Tocea has been a success (humanely) thanks to our people. They gave us their trust, and we tried together to create something great.

Links :

The transition

Tocea ended peacefully, becoming a more serious business under the acquisition by Metrixware. Fair enough: Metrixware, a well-known software vendor specialized in legacy systems and migrations, saved us at that time; our business was in full transition after some critical mistakes and a hard business year (for the whole sector).

I learnt a lot there about processes, change management and also company culture: three things crucial for the success of any project.

Now in Switzerland

I am enjoying my new road, currently in Switzerland. I have been discovering this great country and its particular job environment since the beginning of 2017.

Since then, I have been working full time as an IT consultant for large institutions. Recently, I created a small structure, www.byoskill.com, through which I provide my experience for dedicated missions.

Each encounter, with either a software developer or a team, pushes me forward to what I love the most:

Empowering people, saving Software and developing great tools.

 


Leave your comfort zone

To be or not to be (happy), that is the question. In this article, I share some thoughts about what could make a software developer happy in their work. I wrote this article with several target audiences in mind: junior developers, senior tech leads and HR people.

(more…)

This article shows you how to use SonarQube with ReactJS and its JSX files. I will use both the SonarQube JavaScript plugin and the additional Sonar ESLint plugin.

(more…)

My weekly DZone's digest #1

23rd December 2017 | Digest, DZone, State of the art

This is my first post offering a digest of a selection of DZone's articles. I will pick DZone articles based on my interests.

This week the subjects are: BDD testing, bad code, database connection pooling, Kotlin, and enterprise architecture.

A few benefits you get by doing BDD

A few benefits you get by doing BDD: this article is an introduction to the Behaviour Driven Development practice. It's interesting because we regularly meet teams, developers, architects (pick your favourite) that confuse technical details with functionality. As a result, the design, the tests and the architecture hide the user behaviour (the use cases?) under a pile of technical stones. This article is a nice introduction. To go further, I recommend these articles: Your boss won't appreciate TDD, try BDD; BDD Programming Frameworks; and the Java framework JBehave.

Gumption Traps: Bad Code

Bad code, how my code…

Gumption Traps: Bad Code: an article about bad code and how to deal with it.

As Grzegorz Ziemoński puts it: "The first step to avoid the bad code trap is to stop producing such code yourself. When faced with existing bad code, one must work smart to maintain motivation."

This is a good introductory sentence. This week, I had a meeting with a skilled and amazing team. The meeting's goal was to find a way to tackle the technical debt, the very technical debt that is ruining the application and undermining the team's motivation. What I found interesting and refreshing in this article is the pragmatic tone and the advice.

Grzegorz Ziemoński again: "To avoid bad code, try to minimize the amount of newly produced bad code."

How to avoid the depression linked to bad code? First of all, I want to say that developers are not receiving enough training on how to improve code. Usually, university and college courses are dedicated to how to use a framework. Therefore, few developers are able to qualify what bad code is, what its characteristics are, and de facto the ways to improve it. To avoid bad code, I try to demonstrate the personal benefits for developers of improving their skills. Quality is not only a question of money (how much the customer is paying) but rather of how much your company is paying attention to your training and personal development.

A lot of developers are overwhelmed by technical debt without the appropriate tools (mindset, techniques, theory) to handle it. I try to give them gumption about the benefits of being a better developer and how to handle the weaknesses of a sick application. To save a piece of software rather than practicing euthanasia 🙂

Database Connection Pooling in Java With HikariCP

When we discuss database connection pooling, most of my colleagues rely on the good old Tomcat DBCP. However, there is a niche, really funny and interesting: the guys competing for the best connection pool. And HikariCP is clearly a step ahead of everyone.

The article Database Connection Pooling in Java With HikariCP presents how to use a custom connection pool in your software.
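For readers who have never swapped out the default pool, here is a minimal, hypothetical HikariCP setup; the JDBC URL, credentials and pool sizing are placeholders to adapt to your own database.

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HikariExample {
        public static void main(String[] args) throws Exception {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://localhost:5432/mydb"); // hypothetical database
            config.setUsername("app");
            config.setPassword("secret");
            config.setMaximumPoolSize(10);       // upper bound on pooled connections
            config.setConnectionTimeout(30_000); // ms to wait for a free connection

            // The data source IS the pool: connections are borrowed and returned on close().
            try (HikariDataSource dataSource = new HikariDataSource(config);
                 Connection connection = dataSource.getConnection();
                 Statement statement = connection.createStatement();
                 ResultSet rs = statement.executeQuery("SELECT 1")) {
                rs.next();
                System.out.println("Pool is alive: " + rs.getInt(1));
            }
        }
    }

The nice part is that HikariDataSource is a regular javax.sql.DataSource, so switching pools usually only touches the construction and configuration code.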

Hikari Performance

I think it would have been great to present the differences with the standard DBCP and further debate the advantages and disadvantages of the solutions. A good idea for a next article 🙂

Concurrency: Java Futures and Kotlin Coroutines

Java Futures and Kotlin Coroutines: an interesting article about how Java Futures and Kotlin coroutines can coexist. Honestly, I am a little bit disappointed; I thought that Kotlin would make things easier, like in Node.js.

Are Code Rules Meant to Be Broken?

Another article about code quality, and we could be dubious as to whether an answer to that question even exists: Are Code Rules Meant to Be Broken.

I won’t enter too much in the details, the author’s point of view seems to be Code Rules are good if they are respected. If they are broken, it implies that the Code rules need to evolve 🙂 What do you think about it ?

Java vs. Kotlin: First Impressions Using Kotlin for a Commercial Android Project

This article is interesting since it presents feedback on using Kotlin in a commercial Android project.

The big pluses of using Kotlin are:

  • Null safety through nullable and non-nullable types, safe calls, and safe casts.
  • Extension functions.
  • Higher-order functions / lambda expressions.
  • Data classes.
  • Immutability.
  • Coroutines (added on Kotlin 1.1).
  • Type aliases (added on Kotlin 1.1).

Quality Code Is Loosely Coupled

This article explains one of the most dangerous aspects of coding: coupling. A must-read article despite the lack of diagrams.

Five Habits That Help Code Quality

This article is a great introduction to code assessment. These five habits are indeed things to track in your software code as signs of decay and code sickness.

The habits are:

  • Write (Useful) Unit Tests
  • Keep Coupling to a Minimum
  • Be Mindful of the Principle of Least Astonishment
  • Minimize Cyclomatic Complexity
  • Get Names Right

10 Good Excuses for Not Reusing Enterprise Code

This article is really useful in the context of digital transformation, to assess which software you should keep and which you should throw away.

Examples of excuses:

  • I didn't know that code existed.
  • I don't know what that code does.
  • I don't know how to use that code.
  • That code is not packaged in a reusable manner.

Test proven design

An interesting article and example on how to improve your own code using different skills. I really recommend reading this article and the future ones: Test proven design.
