Corpus

Blog/News 30/09/2022

We work with corpus data in 36 languages that is used to develop chatbots and machine translation systems. Uncovering crucial information from unstructured data is what makes Uptempo's cognitive data annotation and labeling services valuable to enterprises.

The Uptempo Data team systematically builds and maintains text data of more than 10 million sentences every year. With a pool of professional translators from over 50 countries, more than 30 language pairs, and “crowdworking”, we can handle large and specialized language corpus projects that are difficult for other companies to carry out.

Our process

Step 1 – File design:

We review the entire work file to detect erroneous and unproductive sentences.

Step 2 – File assignment:

Files are subdivided by difficulty and field, and suitable professionals are selected and assigned.

Step 3 – Live monitoring:

Because the work is done in the cloud, crowd workers can view the work status in real time.

Step 4 – AI Translator Contrast:

If a translation's match rate with the output of machine translators is high, we go through the work again as a priority.

Step 5 – Quality evaluation:

Files go through an objective quality evaluation, and those with low scores undergo a second round of work.

Step 6 – File collection/delivery:

Once the final files are assembled, a last review is carried out and the finished files are delivered to the client.

Quality control for building corpus data

From preparing text data to final data building and utilization, the Uptempo Data team guarantees a ‘high level of quality’ for ‘a huge amount’ of data.

Step 1: Check Domain

Check whether the domain (such as legal, medical, game, or IT) matches

Step 2: Check sentence length

Compare sentence lengths between the source language and the target language; retranslate if the two lengths differ greatly
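
As a rough illustration of the length check, the sketch below flags sentence pairs whose character lengths differ by more than a chosen ratio. The function names and the 2.5 threshold are assumptions made for this example, not Uptempo's actual settings; a real pipeline would likely tune the threshold per language pair, since some languages are naturally more compact than others.

```python
def length_ratio(source: str, target: str) -> float:
    """Return the longer/shorter character-length ratio (always >= 1.0)."""
    src_len, tgt_len = len(source.strip()), len(target.strip())
    if src_len == 0 or tgt_len == 0:
        # An empty side is treated as an extreme mismatch.
        return float("inf")
    return max(src_len, tgt_len) / min(src_len, tgt_len)


def flag_for_retranslation(pairs, max_ratio=2.5):
    """Return the (source, target) pairs whose lengths differ too much.

    max_ratio is a hypothetical threshold chosen for illustration.
    """
    return [(s, t) for s, t in pairs if length_ratio(s, t) > max_ratio]


pairs = [
    ("The contract is valid for one year.", "Le contrat est valable un an."),
    ("Please restart the server before installing the update.", "Redémarrez."),
]
flagged = flag_for_retranslation(pairs)
```

Here only the second pair would be flagged, because its target is far shorter than its source.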

Step 3: Deduplication

Remove sentences that are exact duplicates
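
Exact-match deduplication can be sketched in a few lines. This example assumes the corpus is held as a list of (source, target) tuples and that a duplicate means both sides match character for character; the `deduplicate` name is hypothetical.

```python
def deduplicate(pairs):
    """Drop exact duplicate (source, target) pairs, keeping the first
    occurrence of each and preserving the original order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


corpus = [
    ("Hello.", "Bonjour."),
    ("Thank you.", "Merci."),
    ("Hello.", "Bonjour."),  # exact duplicate of the first pair
]
unique_corpus = deduplicate(corpus)
```

Keeping the first occurrence and the original order makes the result easy to compare against the input file during review.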

Step 4: Machine translation similarity analysis

Similarity analysis against machine translator output using edit distance (retranslation for strings with high similarity)
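
The edit-distance comparison can be illustrated with a plain Levenshtein implementation. This is a minimal sketch: normalizing the distance by the longer string's length and the 0.9 threshold are assumptions for the example, and the source does not say which edit-distance variant or cutoff is actually used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two rows of dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def needs_retranslation(human: str, machine: str, threshold: float = 0.9) -> bool:
    """Flag a translation that is nearly identical to machine output.

    threshold is a hypothetical cutoff chosen for illustration.
    """
    return similarity(human, machine) >= threshold
```

A submitted translation that scores at or above the threshold against machine output would be sent back for retranslation, on the reasoning that it adds little beyond what the machine already produces.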

Step 5: Semantic conformity verification

Semantic conformity quality evaluation through a third-party expert.

Step 6: AI Modeling Validation

Data validation utilizing AI Solutions

Step 7: Delivery

Delivery in the form of files requested by the client such as CSV, JSON, etc.
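
As an example of the delivery formats mentioned above, the sketch below serializes sentence pairs to CSV and JSON using only the Python standard library. The `source` and `target` field names and both function names are assumptions for illustration; the actual layout would follow whatever the client requests.

```python
import csv
import io
import json


def to_csv(pairs, header=("source", "target")):
    """Serialize (source, target) pairs to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(pairs)
    return buffer.getvalue()


def to_json(pairs):
    """Serialize pairs to a JSON array of objects, keeping non-ASCII text readable."""
    records = [{"source": s, "target": t} for s, t in pairs]
    return json.dumps(records, ensure_ascii=False, indent=2)


pairs = [("Hello, world", "Bonjour le monde")]
csv_text = to_csv(pairs)
json_text = to_json(pairs)
```

The `csv` module quotes fields that contain commas, so a sentence such as "Hello, world" survives a round trip intact, and `ensure_ascii=False` keeps accented or non-Latin characters human-readable in the JSON output.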
