CHINA IWOM Blog

Text Mining at CIC

2007.04.17 :: by: Sam


I am posting Paul’s own translation of his post on text mining, demonstrating the “tech” side of CIC.

__________________________________________________________________

At CIC, the core of our business is our data mining technology. Data mining gained some awareness among the general public through the urban legend of “Beer and Diapers” (see here and here).

CIC uses a particular data mining technology, text mining, to discover hidden and untold stories like that of “Beer and Diapers” within millions of online user-generated messages. In contrast to conventional data mining, text mining delves into unstructured data without columns, fields, or even a date. The relevant information is first pulled out of this unstructured data through a process called Information Retrieval, which involves either a Vector Space Model or Natural Language Processing. Afterwards, we can put structured records into the database and work with the data as needed.
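As a rough illustration of the Vector Space Model idea (a minimal sketch, not CIC's actual system), each message can be turned into a vector of word counts, and two messages compared by the cosine of the angle between their vectors. The function name and the sample tokens below are assumptions made up for this example.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words vectors (0.0 to 1.0)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the words the two messages share.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty message is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already-tokenized messages.
msg_1 = ["new", "laptop", "battery", "is", "great"]
msg_2 = ["laptop", "battery", "died", "again"]
msg_3 = ["best", "noodle", "shop", "downtown"]

print(cosine_similarity(msg_1, msg_2))  # shares "laptop" and "battery"
print(cosine_similarity(msg_1, msg_3))  # shares no words
```

Messages that share more vocabulary score closer to 1, while messages with no words in common score 0.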

We definitely want our results to be accurate. We have two measurements for accuracy: 1) recall: of all the messages that are really related to product A, the percentage that we retrieved; 2) precision: of the messages we retrieved as related to product A, the percentage that are really related to product A.
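The two measurements can be written out in a few lines. This is a minimal sketch: the function name and the message IDs are hypothetical, chosen only to show the arithmetic.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) given retrieved and truly relevant message IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of what we retrieved that is really relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of what is really relevant that we retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs: 4 messages are really about product A;
# the system retrieved 5 messages, 3 of which are correct.
relevant = {1, 2, 3, 4}
retrieved = {2, 3, 4, 8, 9}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 3/5 and 3/4
```

Retrieving everything pushes recall to 100% while precision collapses, and retrieving almost nothing does the opposite, which is why both numbers are needed.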

If processing a handful of messages with acceptable accuracy is not hard enough, consider the work to be done with millions of messages on a daily basis. This poses a challenge to the text mining system as a whole: the algorithm must be efficient, and the architecture must be stable and scalable.

The Chinese language itself poses another challenge. Unlike Western alphabet-based languages, written Chinese is made up of characters rather than space-delimited words, and there are no boundaries marked between words. Imagine processing this string: “thisistobesegmentedfirstandthenwecandosomethingaboutit.” The available text mining technology used for English can’t be directly applied. We first have to segment the text into words. Without this step, meanings are easily lost. For example, the phrase “我可乐坏了” (I’m so happy) will show up as a result for the query “可乐” (Coke) simply because it contains the character string “可乐.” In the query that string means Coke, but within the phrase it spans two different words: “可” means rather and “乐” means happy (yes, individual Chinese characters often have their own definitions).
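To make the “可乐” example concrete, here is a minimal sketch of dictionary-based segmentation using dynamic programming: it picks the split whose words have the highest combined probability, then matches the query against whole words instead of raw character strings. The word counts are hypothetical and hand-picked for this illustration, and the unigram model is an assumption, not a description of CIC's segmenter.

```python
import math

# Hypothetical word counts, hand-picked for this illustration only.
WORD_COUNTS = {
    "我": 500, "喝": 100, "可": 300, "乐": 50, "坏": 50, "了": 500,
    "可乐": 100, "坏了": 20, "乐坏了": 60,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(word) for word in WORD_COUNTS)
UNKNOWN_COUNT = 1  # assumed smoothing count for single characters not in the lexicon


def word_log_prob(word):
    """Log-probability of a candidate word, or None if it is not allowed."""
    if word in WORD_COUNTS:
        return math.log(WORD_COUNTS[word] / TOTAL)
    if len(word) == 1:
        return math.log(UNKNOWN_COUNT / TOTAL)
    return None


def segment(text):
    """Split text into the most probable word sequence under a unigram model."""
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            log_prob = word_log_prob(text[start:end])
            if log_prob is None:
                continue
            score = best[start][0] + log_prob
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


def matches(query, text):
    """True only if the query appears as a whole word after segmentation."""
    return query in segment(text)


print(segment("我可乐坏了"))        # segmented phrase meaning "I'm so happy"
print(matches("可乐", "我可乐坏了"))  # query "Coke" against that phrase
print(matches("可乐", "我喝可乐"))    # query "Coke" against "I drink Coke"
```

With these toy counts, the first phrase is split so that “可乐” never appears as a word, so the Coke query no longer matches it, while “我喝可乐” (I drink Coke) still does.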

What makes the situation even more difficult is the use of slang by Chinese netizens. We need to be able to tell when “粉丝” means not vermicelli but fans, and when “玉米” means not corn but fanatical fans of a celebrity. Depending on the context, the term “KK” can refer to the Chevrolet Spark, Toyota Camry, or Citroen Fukang (a model made by Citroen China). We also need to know that “小黑” (the little black) refers to the IBM ThinkPad.

When we search for “laptop,” we also need to search for “notebook” and for “benben” (a nickname for laptops used by Chinese net users), since many relevant messages never mention the word “laptop.” This goes beyond the keyword-matching approach employed by the majority of current search engines and achieves a certain level of semantic-based searching.
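One simple way to picture this kind of searching is query expansion against an alias lexicon. The sketch below assumes messages have already been split into tokens; the lexicon, function names, and sample messages are all hypothetical, and real semantic search involves much more than a lookup table.

```python
# Hypothetical alias lexicon mapping a concept to the terms net users write.
CONCEPT_ALIASES = {
    "laptop": {"laptop", "notebook", "benben"},
}


def expand_query(term):
    """Return every known alias of the concept the term belongs to."""
    term = term.lower()
    for aliases in CONCEPT_ALIASES.values():
        if term in aliases:
            return set(aliases)
    return {term}  # unknown term: fall back to plain keyword matching


def search(term, tokenized_messages):
    """Return the messages containing the term or any of its aliases."""
    aliases = expand_query(term)
    return [
        tokens for tokens in tokenized_messages
        if aliases & {token.lower() for token in tokens}
    ]


# Hypothetical, already-tokenized messages.
messages = [
    ["my", "new", "benben", "is", "fast"],
    ["lost", "my", "notebook", "charger"],
    ["great", "noodles", "tonight"],
]
hits = search("laptop", messages)
print(len(hits))  # the first two messages match, though neither says "laptop"
```

A lookup table like this cannot tell a notebook computer from a paper notebook, which is where context has to come in.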

The tech team at CIC welcomes these challenges. It works closely with CIC’s social media analysts across industries, including consumer electronics, sports, beverage, online gaming and automobiles. By mining and analyzing millions of messages, we piece together little clues here and there, and some interesting stories usually materialize. We might find connections between entities that nobody ever thought of, or detect a potential brand crisis.

For people who love to learn about fascinating new technological developments, this work rarely disappoints. By conducting research and development in this field, we are constantly exposed to the knowledge of various disciplines including computer science, linguistics, statistics, social science and even mass communication. We study the information entropy of the language, calculate the similarity of texts within a multidimensional space, segment Chinese text into words using dynamic programming, find correlations between words within the framework of probability theory, work out the eigenvectors of the text vector matrix, and implement large-scale computing and storage with a distributed system. Sound like rocket science? Not exactly, but there is a great deal of resemblance between quantum mechanics and text mining. The CIC tech team enjoys its work greatly, especially overcoming the variety of challenges. Its close work and communication with the analysts lead to meaningful and practical results for CIC’s clients.


3 comments

  1. Cristian says:
    2008.04.01

    Good luck with data mining and Chinese: how can you translate, given that the process requires years of language study, among many other things, and years of study is something persons do, not computers? Computers can’t tell the difference between orange, less orange, a little bit less orange, etc. A hard job to do, but my best wishes to you and your crew.

  2. Sam Flemming says:
    2008.04.02

    Thanks for the comment. We actually don’t translate as part of the text mining. If a client requires an English report, we will translate the results of the text mining as part of the report.

    Regarding the subtle differences and nuances within language, agreed, they are not easy to distinguish. In the end, our reports are put together via a combination of human and automated analysis, and our analysts have the final say in “making sense of the buzz.”

  3. Dubai Web Design, Development says:
    2008.08.12

    Good data mining practice, but a problem still exists: what do we do if a client needs a report in English? It means we need to hire someone who knows Chinese to translate it into English. If the data mining automatically translated the data into English, that would give it an edge and make it a high priority.
