Our Maiky AI is built on several advanced AI algorithms that work together:
1. OCR – Optical Character Recognition is the process in which pattern recognition identifies the characters in an image of text and stores them in machine-readable form. In other words, the text in an image is converted into editable text. An example of this is automatic vehicle number plate recognition.
2. SA – Smart Annotator is the algorithm that comes with the Matcher tool, which can be used to specify custom rules for phrase matching. Using the Matcher tool is straightforward. First, you define the patterns that you want to match, which we call keywords. Next, you add the patterns to the Matcher tool. Then you apply the Matcher tool to the document that you want to match your rules against. Finally, once the text has been matched, we extract and annotate the identified phrases that are described by our keywords.
3. NER – Named Entity Recognition is the first and most important step in Information Extraction, the technique of extracting important and useful information from unstructured raw text documents. NER works by locating the named entities present in unstructured text and classifying them into standard categories such as policies, regulations, laws, person names, locations, organizations, time expressions, quantities, monetary values, percentages, codes, etc. Our Maiky AI comes with an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The spaCy NER system that Maiky AI builds on combines a word embedding strategy using subword features and "Bloom" embeddings with a deep convolutional neural network with residual connections. The system is designed to give a good balance of efficiency, accuracy and adaptability.
4. LDA – Latent Dirichlet Allocation is one of the most popular topic modelling methods. LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topics. Put simply, each document is made up of various words, and each topic also has various words belonging to it. The aim of LDA is to find the topics a document belongs to, based on the words in it.
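The generative assumption behind LDA in item 4 can be written down compactly. The notation below is the conventional one and is an assumption here, not taken from the Maiky AI implementation: K topics, D documents, N_d words in document d, Dirichlet hyperparameters alpha and beta, a word distribution phi_k for each topic, a topic distribution theta_d for each document, and a topic assignment z_{d,n} for each observed word w_{d,n}.

```latex
\begin{align*}
\varphi_k &\sim \operatorname{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \operatorname{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} &\sim \operatorname{Categorical}(\varphi_{z_{d,n}})
\end{align*}
```

Fitting LDA means running this process in reverse: given only the observed words, infer the topic mixture of each document and the word mixture of each topic.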
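The three Matcher steps described in item 2 (define the keywords, add them to the matcher, apply the matcher to a document) can be sketched as follows. This is a minimal, standard-library-only illustration and not the Maiky AI or spaCy Matcher API: the `KeywordMatcher` class, the `annotate` helper, the labels and the sample sentence are all hypothetical, and the tokenization is deliberately naive.

```python
from typing import Dict, List, Tuple

PUNCT = ".,;:!?()\"'"


class KeywordMatcher:
    """Hypothetical phrase matcher used only to illustrate the workflow."""

    def __init__(self) -> None:
        self._patterns: Dict[str, List[List[str]]] = {}

    def add(self, label: str, phrases: List[str]) -> None:
        # Step 2: register each keyword phrase as lowercase tokens under a label.
        self._patterns.setdefault(label, []).extend(
            phrase.lower().split() for phrase in phrases
        )

    def __call__(self, text: str) -> List[Tuple[str, int, int]]:
        # Step 3: slide every pattern over the (naively) tokenized text.
        tokens = [tok.strip(PUNCT).lower() for tok in text.split()]
        matches = []
        for label, phrases in self._patterns.items():
            for phrase in phrases:
                n = len(phrase)
                for start in range(len(tokens) - n + 1):
                    if tokens[start:start + n] == phrase:
                        matches.append((label, start, start + n))
        # Return (label, start_token, end_token) in document order.
        return sorted(matches, key=lambda m: (m[1], m[2]))


def annotate(text: str, matcher: KeywordMatcher) -> List[Tuple[str, str]]:
    """Extract the matched phrases and pair each with its keyword label."""
    tokens = text.split()
    return [
        (label, " ".join(tokens[start:end]).strip(PUNCT))
        for label, start, end in matcher(text)
    ]


# Step 1: define the keywords (hypothetical compliance examples).
matcher = KeywordMatcher()
matcher.add("REGULATION", ["data protection regulation"])
matcher.add("POLICY", ["anti-money laundering policy"])

text = (
    "The bank must follow the data protection regulation "
    "and the anti-money laundering policy."
)
annotations = annotate(text, matcher)
print(annotations)
```

A production matcher would typically match on richer token attributes (lemmas, part-of-speech tags, and so on) rather than lowercase strings, but the define, add and apply sequence stays the same.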
All of these algorithms combined make up the Maiky AI, which is able to understand text in context and extract the necessary information, in particular information relating to compliance. Our preliminary results are promising, and we are already able to find the needle in the haystack.
Remark: Our research into spaCy and the NLTK interface to Stanford NLP concludes that both can be used for NER to achieve good results. spaCy is fast and accurate, and it has support for word vectors. We recommend spaCy NER over Stanford NER for production use. Both models can be customized for NER. This requires data labelling and annotation, which means assigning tags to entities.
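To make "assigning tags to entities" concrete, the sketch below converts character-offset annotations into per-token BIO tags (B = beginning of an entity, I = inside, O = outside), a common format for NER training data. It uses only the standard library. The `find_span` and `bio_tags` helpers, the sample sentence and the `REGULATION` label are hypothetical and serve only as illustration; this is not the labelling pipeline used for Maiky AI.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def find_span(text: str, phrase: str, label: str) -> Span:
    """Build a character-offset annotation for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def bio_tags(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated token with B-, I- or O."""
    tagged = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for span_start, span_end, label in spans:
            # A token belongs to an entity if it lies fully inside the span.
            if start >= span_start and end <= span_end:
                tag = ("B-" if start == span_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


sentence = "Acme Bank must comply with GDPR by 25 May 2018"
labelled_spans = [
    find_span(sentence, "Acme Bank", "ORG"),
    find_span(sentence, "GDPR", "REGULATION"),
    find_span(sentence, "25 May 2018", "DATE"),
]
tagged_tokens = bio_tags(sentence, labelled_spans)
for token, tag in tagged_tokens:
    print(f"{token}\t{tag}")
```

Custom categories such as policies or regulations are added simply by introducing new labels in the annotated spans; the model is then trained on many such labelled sentences.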