Big Data Terminology for Non-Geeks: NLP, NER and More

Big Data combines well-known and less common technologies, and it is used mostly by large organizations and innovative start-ups to solve problems and uncover insights.

Retail companies, for example, use big data to correlate large quantities of customer data in order to optimize their stock or their delivery systems. To do this, they must have systems in place that can gather various types of data, largely in real time: video, audio, written text and speech.

Let’s take a look at some of the technologies.

Natural language processing (NLP)

Once data has been compiled from all these sources, structured information must be extracted from the unstructured data, and especially from text. This is where Natural Language Processing comes in.

Natural Language Processing, a branch of Artificial Intelligence, gives machines the ability to read, understand and derive meaning from the languages humans speak. Its techniques, which also power large language models, include text classification and sentiment analysis, and they typically combine machine learning algorithms, statistical models and linguistic rules.
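
As a concrete, if toy, illustration of text classification for sentiment, here is a minimal sketch using the scikit-learn library; the library choice, the tiny training set and the labels are assumptions made purely for illustration, not something the article prescribes.

```python
# A minimal sentiment / text-classification sketch using scikit-learn.
# The training sentences and labels below are invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "great service, very happy",
    "terrible delivery, never again",
    "loved the product",
    "awful experience",
]
train_labels = ["positive", "negative", "positive", "negative"]

# TF-IDF turns raw text into numeric features; Naive Bayes is a simple statistical model.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["very happy with the product"]))  # expected: ['positive']
```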

NLP is at the heart of modern software that processes and understands human language, leveraging the vast amount of language data on the web and in social media. One of the best-known examples of NLP is IBM’s Watson, which won the TV quiz show Jeopardy! by beating two of the show’s greatest champions.

As BigData-Startups.com founder Mark van Rijmenam writes, “the key stumbling block here is that computers understand ‘unambiguous and highly structured’ programming language, while human language is a minefield of nuance, emotion, and implied intent.”

Geoffrey Pullum, a professor of general linguistics at the University of Edinburgh, outlines three prerequisites for computers to master human language: “First, enough syntax to uniquely identify the sentence; second, enough semantics to extract its literal meaning; and third, enough pragmatics to infer the intent behind the utterance, and thus discerning what should be done or assumed given that it was uttered.”

Named entity recognition (NER)

Named entity recognition (NER) is an NLP task that consists of tagging groups of words that correspond to “predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages.”

NER finds mentions of these specified entities in text, turning unstructured text into structured data from which insights can be drawn.
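
For instance, here is a minimal NER sketch using the open-source spaCy library; the tool choice and the example sentence are assumptions for illustration only, and the labels shown are typical spaCy categories rather than anything prescribed by the article.

```python
# A minimal NER sketch using the open-source spaCy library.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp opened a store in Athens on Monday, spending $2 million.")

# Each entity comes back as a span of text plus its predicted category.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Acme Corp" ORG, "Athens" GPE, "Monday" DATE, "$2 million" MONEY
```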

Natural Language Generation (NLG)

Natural Language Generation does the reverse of NLP: it turns data and analysis into readable language. A Forbes article, “Why Big Data Needs Natural Language Generation to Work,” explains this well:

“But the bigger game of NLG is not about the language but about handling the growing number of insights that are being produced by big data through automated forms of analysis. If your idea of big data is that you have a data scientist doing some sort of analysis and then presenting it through a dashboard, you are thinking far too small. The fact of the matter is that big data really can’t be understood without machine learning and advanced statistical algorithms. While it takes skill and expertise to apply these methods, once you have them running, they continue to pump out the insights.”

“The data, once extracted, would then be sent to the semantic engine which would first determine what was true and then determine which of those signals are important and impactful to various audiences.”

"What is true is determined through the application of techniques that would be familiar to any data scientist: time series and regression analysis, histogramming, ranking, etc. The semantic engine then decides what’s important based on an understanding of what’s normal for the whole population of the data.”

“The second type of analysis the semantic engine does is to determine what is interesting or impactful to a particular audience. A retail representative at a bank may be interested in a whole different set of signals than someone who is originating mortgages.”

You need a systematic approach to a semantic model. That’s the secret to having a big impact from big data.

SophoTree’s platform applies Natural Language Processing (NLP) and techniques like Named Entity Recognition (NER) to Big Data. Try it out and see the difference in your search results.

Bias

A type of error that can occur in a large language model if its output is skewed by the model’s training data. For example, a model may associate specific traits or professions with a certain race or gender, leading to inaccurate predictions and offensive responses.

Emergent behavior

Unexpected or unintended abilities in a large language model, enabled by the model’s learning patterns and rules from its training data. For example, models that are trained on programming and coding sites can write new code. Other examples include creative abilities like composing poetry, music and fictional stories.

Generative A.I.

Technology that creates content — including text, images, video and computer code — by identifying patterns in large quantities of training data, and then creating original material that has similar characteristics. Examples include ChatGPT for text and DALL-E and Midjourney for images.

Hallucination

A well-known phenomenon in large language models, in which the system provides an answer that is factually incorrect, irrelevant or nonsensical, because of limitations in its training data and architecture.

Large language model

A type of neural network that learns skills — including generating prose, conducting conversations and writing computer code — by analyzing vast amounts of text from across the internet. The basic function is to predict the next word in a sequence, but these models have surprised experts by learning new abilities.
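
To see “predict the next word” in action, here is a small sketch using the openly available GPT-2 model through the Hugging Face transformers library; GPT-2 is a much smaller ancestor of today’s large models, and the prompt is arbitrary.

```python
# Sketch of next-word prediction with the openly available GPT-2 model.
# Assumes: pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model extends the prompt by repeatedly predicting a likely next token.
print(generator("Retail companies use big data to", max_new_tokens=10))
```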

Neural network

A mathematical system, modeled on the human brain, that learns skills by finding statistical patterns in data. It consists of layers of artificial neurons: The first layer receives the input data, and the last layer outputs the results. Even the experts who create neural networks don’t always understand what happens in between.
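
A toy forward pass, sketched here in NumPy with random stand-in weights rather than values learned from data, shows the layered structure described above; the layer sizes are arbitrary.

```python
# Toy neural network forward pass: input layer -> one hidden layer -> output layer.
# The weights are random stand-ins for values a real network would learn from data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # input layer: 4 features

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # hidden layer: 8 artificial neurons
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)  # output layer: 1 result

hidden = np.maximum(0, W1 @ x + b1)            # each neuron: weighted sum + activation
output = W2 @ hidden + b2
print(output)
```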

Parameters

Numerical values that define a large language model’s structure and behavior, like clues that help it guess what words come next. Systems like GPT-4 are thought to have hundreds of billions of parameters.
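
As a back-of-the-envelope illustration, counting the weights and biases of the tiny network sketched above shows what a “parameter” is; the layer sizes are arbitrary, and real large language models simply have vastly more of them.

```python
# Every weight and every bias is one parameter the model can adjust during training.
layers = [(4, 8), (8, 1)]  # (inputs, neurons) for the tiny network sketched above

total = sum(n_in * n_out + n_out for n_in, n_out in layers)  # weights + biases
print(total)  # 49 parameters; GPT-4 is thought to have hundreds of billions
```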


SophoTree Inc, Alexander D. Kostopoulos (ST) September 15, 2023