Whether we like it or not, artificial intelligence is permeating every economic stratum and, soon, every social one. Here are some basics on the subject, in the first article of a series that will hopefully help us separate the wheat from the chaff in this area.
When it is not serving as the unwilling vehicle of deceptive marketing, artificial intelligence (AI) generally refers to machine learning algorithms, of which artificial neural networks are the figureheads. At the heart of AI, they constitute a field of study called deep learning.
A neural network relies on a myriad of parameters that allow it, given an <red-highlight>Input<red-highlight>, to return an <red-highlight>Output<red-highlight>.
<quote-alt>M𝜽(Input) = Output<quote-alt>
The network is our model <red-highlight>M<red-highlight>. To identify the parameters <red-highlight>θ<red-highlight>, we need data. We then train the model <red-highlight>M<red-highlight> on that data, which fixes the parameters <red-highlight>θ<red-highlight>. Finally, we can use it on any input. These last two tasks, training and use, both require computing power.
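To make the notation concrete, here is a minimal sketch rather than anything taken from a real framework: a "model" with only two parameters, θ = (w, b), fitted to a handful of made-up data points by gradient descent. The names `model` and `train`, the toy data and the learning rate are all assumptions chosen for illustration; real neural networks have millions to billions of parameters.

```python
# Toy illustration of M_theta(Input) = Output with theta = (w, b).
# All names, data and settings here are illustrative assumptions.

def model(theta, x):
    """Apply the model M with parameters theta to one input x."""
    w, b = theta
    return w * x + b

def train(data, steps=2000, lr=0.05):
    """Identify theta from (input, output) pairs by gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (model((w, b), x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (model((w, b), x) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up training data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

theta = train(data)              # training: needs data and compute
prediction = model(theta, 10.0)  # use: apply the model to a new input
print(theta, prediction)
```

Both steps loop over arithmetic operations, which is where the need for computing power comes from once the parameters number in the billions.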
Computing power means energy consumption. For Meta's latest:
This means that developing these models would have cost around 2,638 MWh under our assumptions, and a total emission of 1,015 tCO2eq.
So reusing existing models rather than developing and pre-training one's own models - when feasible - is a digital sobriety issue.
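As a quick sanity check on the two figures quoted above, dividing the emissions by the energy gives the carbon intensity they imply. The sketch below uses only those two numbers; the variable names are ours and the derived value is our own arithmetic, not a figure stated in the quote.

```python
# The two figures quoted above; the derived intensity is our own arithmetic.
energy_mwh = 2638          # energy consumed, in MWh
emissions_tco2eq = 1015    # emissions, in tonnes of CO2-equivalent

# tCO2eq per MWh is numerically equal to kgCO2eq per kWh.
intensity = emissions_tco2eq / energy_mwh
print(f"Implied carbon intensity: {intensity:.3f} kgCO2eq/kWh")
```

This prints an implied intensity of 0.385 kgCO2eq per kWh.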
The form an <red-highlight>Input<red-highlight> takes is as varied as that of an <red-highlight>Output<red-highlight>, but a few uses are particularly telling.
Image & probability - Classification
From an image, give the percentage of certainty that it contains a type of object.
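One common way to obtain such percentages, sketched here under assumptions, is to pass the raw scores of a model through a softmax function so that they become probabilities summing to 1. The labels and scores below are hypothetical; no actual vision model is run.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical raw scores a vision model might output for one image.
scores = {"traffic light": 3.1, "crosswalk": 1.2, "bus": -0.5}

probabilities = softmax(scores)
for label, p in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.1%}")
```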
The ever-present CAPTCHA is a visible trace of the model training task, an essential task that consists in giving the model an exercise together with its answer key. The traffic lights and crosswalks that millions of Internet users click on every day are used to find out whether or not you are a robot, but also to feed computer vision models, such as those used in autonomous vehicles.
After all, a video is only a succession of images, so it is only a short step from here to uses in video or biometric surveillance.
Text, always more text - Text-generation
This video by Mister Phi is very useful for understanding what lies behind text generation algorithms. They are nothing more and nothing less than colossal machines for predicting the most probable next word.
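The idea of predicting the most probable next word can be illustrated with a deliberately tiny stand-in: a bigram counter that looks at which word most often follows the current one. This is not how large language models work internally (they use neural networks and much longer contexts); the function names and the corpus are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_bigram_counts(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def most_probable_next(followers, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Made-up miniature corpus.
corpus = "the cat sat on the mat and the cat slept on the sofa"
followers = build_bigram_counts(corpus)

next_word = most_probable_next(followers, "the")
print(next_word)  # "cat" follows "the" twice, more than any other word
```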
Draw me an algo - Text-to-image
Reinforced deference to Paul Valéry
The way these algorithms are constructed means that the same input will not necessarily produce the same output. This only confirms what the poet said.
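One common reason for this behaviour, though not necessarily the only one in every tool, is that the output is drawn at random from a probability distribution instead of always being the single most likely option. The sketch below assumes a hypothetical distribution over three outputs for one fixed prompt, and shows that fixing the random seed restores reproducibility.

```python
import random

def sample(probabilities, rng):
    """Draw one option at random, weighted by its probability."""
    options = list(probabilities)
    weights = [probabilities[o] for o in options]
    return rng.choices(options, weights=weights, k=1)[0]

# Hypothetical distribution over possible outputs for one fixed prompt.
probabilities = {"a red fox": 0.5, "an orange fox": 0.3, "a fox at dusk": 0.2}

# Unseeded generator: the same input can give different outputs each run.
rng = random.Random()
outputs = [sample(probabilities, rng) for _ in range(5)]

# Same seed, same "causes": the two runs are identical.
rng_a = random.Random(42)
rng_b = random.Random(42)
run_a = [sample(probabilities, rng_a) for _ in range(5)]
run_b = [sample(probabilities, rng_b) for _ in range(5)]
print(outputs, run_a == run_b)
```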
There are many tools in this field. Stable Diffusion, DALL-E and Midjourney have been in the news a lot lately, and they are all related, as shown in this state of the art of text-to-image.
Midjourney has the wind in its sails and stands out for accepting images as part of the prompt.
"When we say that the same causes produce the same effects, we are not saying anything. Because the same things never happen again - and besides, we can never know all the causes." Paul Valéry
Father Castor, tell us a story - Text-to-speech
Before Louis Braille's efforts become obsolete, mankind will probably explore many other paths and voices with neural networks trained to read text. Players in the field are actively developing the synthetic voices of tomorrow, for example PaddleSpeech from PaddlePaddle or TensorFlowTTS, based on TensorFlow 2. Some platforms, like Uberduck, make these models easy to use.
Artificial Paradise - Text-to-video
Will text-to-video be the ultimate outcome of image generation algorithms? Meta and others have been working on this complex, seductive and slippery subject. The current results sometimes look as if they came straight out of the most cringe-inducing depths of the Internet, but there is no doubt that the meteoric evolution of these networks will exceed predictions.
Natural Language Processing (NLP)
Introduced by Google in 2017, the Transformer quickly became the state of the art in NLP. It is at the origin of several models and applications in this field. For the past few months, the Big Tech companies have been presenting their Large Language Models (LLMs). These models are trained on large datasets containing hundreds of millions to billions of words. LLMs rely on complex algorithms, including transformer architectures, that work through these large datasets and recognize word-level patterns. This data helps the model better understand natural language and its use in context, and then make predictions for tasks such as text generation and text classification.
Microsoft and OpenAI: GPT, GPT-2, GPT-3, GPT-4, ChatGPT, Codex
Google: LaMDA, BERT, Bard, Vision Transformer (ViT)
Meta: LLaMA
DeepMind: Gopher, Chinchilla
HuggingFace's BigScience team + Microsoft + Nvidia: BLOOM
When we talk about Transformer-based architecture, we mean algorithms that are based on an attention mechanism. This is a mechanism analogous to "human" attention, which consists in increasing the importance of certain incoming data and decreasing it for others.
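The weighting described above can be sketched with scaled dot-product attention, the form of attention used in Transformer architectures, written here in plain Python for a single query. The vectors are made-up toy values and the function names are ours; real implementations operate on large matrices with learned projections.

```python
import math

def softmax(values):
    """Turn a list of scores into weights that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity between the query and each key, scaled by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Importance given to each incoming item.
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return output, weights

# Toy data: three incoming items, each with a key and a value vector.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

output, weights = attention(query, keys, values)
print(weights, output)
```

The second item, whose key is orthogonal to the query, receives the smallest weight: its importance is decreased while that of the other two is increased.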
Computer Vision
Vision Language Models: OpenAI-CLIP, Google ALIGN, OpenAI-DALL-E, Stable Diffusion, Vokenization
Object Detection: YOLO, SSD, R-CNN, Fast R-CNN
Once past the uplifting puns about ChatGPT, the hype around a chair shaped like an avocado - or the other way around - and the excitement of the first prompts, what are we left with? And what is AI doing at Docloop?
OCR (Optical Character Recognition) consists in transforming an image made of pixels into text that can be interpreted by a computer program. It is a mature technology based on two AI models that have been mastered for a long time: text detection (e.g. with DBNet or LinkNet in docTR) and text recognition (e.g. with ViTSTR, MASTER or CRNN in docTR).
Nanonets, GCP OCR, ABBYY, Tesseract, PaddleOCR, Azure, docTR and EasyOCR are all OCR engines that can retrieve all the strings of a document together with their position in that document.
If the starting point is a PDF file, there are two cases:
- The PDF contains images, for example a scanned version of a paper document. For simple processing, tools like OCRmyPDF use the Tesseract OCR engine and overlay the "selectable" text generated by the OCR on the image of the scanned PDF file.
- The PDF contains text, for example a locked version of a Word document. This is called a text-based PDF file. In this case, the strings are already explicitly available in the file and tools like camelot-py make it possible to retrieve the values.
Once the OCR is done, we are left with thousands of loose words. The role of the extraction stage is to structure this information and give it meaning.
Schematically, classification is what a human naturally does on seeing a document whose header contains INVOICE: we mentally file the document in the "INVOICE" category, with all that this implies: the document contains detailed amounts and a total amount, there is probably a VAT number and a SIRET somewhere, the due date is a function of the date of issuance of the invoice...
Classification can be an end in itself as part of a company's Electronic Document Management (EDM) to archive the documentation received on a daily basis. When receiving a 50-page document bundle, it can be useful to immediately separate packing lists, purchase orders, invoices and customs declarations.
Classification can also be the starting point of an intelligent document processing pipeline. On the identified pages, an OCR step is applied and precise information is extracted from the document for later use, for example, to pick a case at random, to reconcile invoices.
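To illustrate the idea of splitting a bundle by document type, here is a deliberately naive sketch that replaces the machine learning classifier with simple keyword counting. The keyword lists, function names and sample pages are all hypothetical; a production system would learn such signals from labelled examples and would also use layout and visual cues.

```python
# Hypothetical keyword lists; a real system would learn these from examples.
KEYWORDS = {
    "invoice": ["invoice", "vat", "amount due"],
    "packing list": ["packing list", "gross weight", "number of packages"],
    "purchase order": ["purchase order", "po number"],
    "customs declaration": ["customs", "declarant", "tariff"],
}

def classify_page(text):
    """Return the category whose keywords appear most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        category: sum(lowered.count(keyword) for keyword in keywords)
        for category, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def split_bundle(pages):
    """Group page numbers (starting at 1) by predicted category."""
    groups = {}
    for number, text in enumerate(pages, start=1):
        groups.setdefault(classify_page(text), []).append(number)
    return groups

# Made-up page texts, as they might come out of an OCR step.
pages = [
    "INVOICE No 123 - VAT number FR00 - Amount due: 1,200 EUR",
    "PACKING LIST - Gross weight 540 kg - Number of packages: 12",
    "Purchase Order - PO number 7781",
    "Thank you for your business",
]
bundle = split_bundle(pages)
print(bundle)
```

Substring counting is fragile (a keyword can match inside an unrelated word), which is one reason learned classifiers are preferred in practice.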
The OCR step allows us to obtain a list of words and their position in the document. The art of information extraction algorithms (or Document understanding) is to be able to categorize certain strings of characters.
The difficulty ranges from simple to complex:
- Simple field with a label: You can use the words around the value to determine what it means.
- Simple field without a label: You can only rely on rules external to the document to determine what the values mean. You need to assimilate business rules such as "The dangerous goods class of a dangerous package is an integer between 1 and 9" or "A UNO code is composed of 4 digits".
- Structured table: It doesn't exist in nature.
- Unstructured table: This is the object of the current arms race in the field of information extraction. The document section below could be represented as a row in a table where each field is a column, but the layout is not explicit on this point, and we even see that the information in one box spills over into another box.
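The two simple cases in the list above can be sketched as follows, assuming the OCR step has produced words with a position. The `Word` class, the helper names, the tolerance `max_dy` and the sample words are hypothetical; the two business rules are the ones quoted in the list. Tables, the hard cases, are not covered here.

```python
import re
from dataclasses import dataclass

@dataclass
class Word:
    """One OCR result: a string and the position of its box on the page."""
    text: str
    x: float  # left edge
    y: float  # top edge

def value_right_of(words, label, max_dy=5.0):
    """Field with a label: nearest word to the right on the same line."""
    anchors = [w for w in words if w.text.lower().rstrip(":") == label.lower()]
    if not anchors:
        return None
    anchor = anchors[0]
    candidates = [
        w for w in words
        if abs(w.y - anchor.y) <= max_dy and w.x > anchor.x
    ]
    return min(candidates, key=lambda w: w.x).text if candidates else None

def is_dangerous_class(text):
    """Business rule from the text: an integer between 1 and 9."""
    return re.fullmatch(r"[1-9]", text) is not None

def is_uno_code(text):
    """Business rule from the text: exactly 4 digits."""
    return re.fullmatch(r"\d{4}", text) is not None

# Made-up OCR output: two lines of a document.
words = [
    Word("Total:", 10, 100), Word("1200.00", 80, 101),
    Word("1263", 10, 140), Word("3", 60, 140),
]

total = value_right_of(words, "total")
uno_codes = [w.text for w in words if is_uno_code(w.text)]
classes = [w.text for w in words if is_dangerous_class(w.text)]
print(total, uno_codes, classes)
```

Rules of this kind only narrow down candidates: any isolated digit would pass the class rule, so in practice they are combined with position and context.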
Beyond the mess of acronyms and a mishmash of pseudo-technical terms, we finally come back to eminently classical considerations and trivial conclusions, not to say Rabelaisian ones: "Science without conscience is but the ruin of the soul".
At Docloop, we want to make good use of these developing technologies to provide the right model in the right place.
What we want to do is to extract information that has been inaccessible until now and apply all the rules that allow us to use it with the right software or the right person.