{"id":7186,"date":"2023-10-10T14:16:16","date_gmt":"2023-10-10T13:16:16","guid":{"rendered":"https:\/\/seecheck.org\/?p=7186"},"modified":"2024-05-03T12:35:01","modified_gmt":"2024-05-03T11:35:01","slug":"black-boxes-and-palindromes-chatgpt-in-the-serbian-language-class","status":"publish","type":"post","link":"https:\/\/seecheck.org\/index.php\/2023\/10\/10\/black-boxes-and-palindromes-chatgpt-in-the-serbian-language-class\/","title":{"rendered":"Black boxes and palindromes: ChatGPT in the Serbian language class"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote q-bc-cred is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em><a href=\"https:\/\/fakenews.rs\/2023\/09\/25\/crne-kutije-i-palindromi-chatgpt-na-casu-srpskog-jezika\/\" title=\"\">Original article<\/a>&nbsp;(in Serbian) was published on 25\/09\/2023; Author: Marija Zemunovi\u0107<\/em><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">A reader asked us to check the statements made by marketing expert Petar Vasic about ChatGPT in a short <a href=\"https:\/\/www.facebook.com\/reel\/1051217809178089\">video<\/a> posted on Instagram. The video was taken from Milan Strongman&#8217;s <a href=\"https:\/\/youtu.be\/dRTkFmWfzn4?si=2t5GJKLeT7NExQE-&amp;t=1196\">podcast<\/a>, where Vasic, as a guest, said the following, among other things:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>\u201cWhen OpenAI was training ChatGPT, it was actually all in English. They do not know, as they have publicly admitted, how it learned Serbian and other languages. It did it by itself. So it looks like it has loosened up a bit. Who knows what is really happening. That Black Box that they call, the neuro network. In fact, we don&#8217;t know what&#8217;s going on inside. We have input that we give, it goes through the Black Box. And then the output. 
But we don&#8217;t know what it developed there, and we see that it wants to learn some forbidden things. Especially things related to chemistry. Since the fear is for bombs, for things like that. We see that it already knows chemistry, probably at the level of someone who has finished college. Which is very dangerous\u201d.<\/em><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/www.sciencefocus.com\/future-technology\/gpt-3\">ChatGPT<\/a> is a chatbot based on artificial intelligence. It was trained on a large and varied corpus of texts from the Internet (hundreds of billions of units) and thus \u201clearned\u201d to statistically predict the next word in a sentence. In this way, it more or less successfully responds to user inquiries. We checked the claims made by Vasic about this chatbot, and in the rest of the text, we will try to briefly present what is true and what is not.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>ChatGPT and the English language<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Vasic&#8217;s statement that \u201cwhen OpenAI trained ChatGPT, everything was actually in English\u201d can be assessed as incorrect. Although it is not known exactly which units were used to train this system, since there are too many of them, the main <a href=\"https:\/\/gregoreite.com\/drilling-down-details-on-the-ai-training-datasets\/\">corpora<\/a> are known. The largest among them &#8211; Common Crawl &#8211; provided 60% of the \u201ctokens\u201d, while, for the sake of comparison, the contents pulled from Wikipedia make up a share of the corpus that is 12 times smaller. 
<a href=\"https:\/\/commoncrawl.org\/\">Common Crawl<\/a> has been collecting material from the Internet for 16 years and is an open repository with 240 billion pages.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"533\" src=\"https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-1-1024x533.png\" alt=\"\" class=\"wp-image-7188\" srcset=\"https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-1-1024x533.png 1024w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-1-300x156.png 300w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-1-768x400.png 768w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-1-1200x624.png 1200w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-1.png 1536w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Common Crawl stores texts in different languages, which we could divide into <a href=\"https:\/\/arxiv.org\/pdf\/2304.05613.pdf\">several categories<\/a> according to the level of participation: languages with high, medium, low and extremely low participation. In this regard, English is a \u201ccategory by itself\u201d, because it occupies 45-46% of the space. In this sense, we can really say that ChatGPT is biased towards English (like the Internet as a whole). However, <a href=\"https:\/\/commoncrawl.github.io\/cc-crawl-statistics\/plots\/languages\">other languages<\/a> are also included in the training material, primarily German, Russian, Chinese, Japanese, French and Spanish &#8211; each of them has more than 4% participation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Serbian is at the bottom of the language category of medium participation (between 0.1% and 1%), after Turkish, Swedish, Arabic, Persian, Korean, Greek, Hungarian and Bulgarian, and ahead of Hindi, Lithuanian and Slovenian. 
For example, Albanian, Malay, Tamil and Georgian belong to the low-participation category, and Scottish Gaelic, Tibetan, Yiddish and Kyrgyz to the extremely low one. Although the \u201csmall\u201d languages occupy a very small part of the corpus, this still amounts to extensive material that has been incorporated.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"415\" src=\"https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-2-1024x415.png\" alt=\"\" class=\"wp-image-7189\" srcset=\"https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-2-1024x415.png 1024w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-2-300x121.png 300w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-2-768x311.png 768w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-2-1200x486.png 1200w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-2.png 1536w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To the question \u201cAre you trained exclusively on the corpus of content in English?\u201d, ChatGPT answers no, stating that the corpus also includes content in other languages (Spanish, French, Italian&#8230;). The dominant position of the English language is unquestionable. The mentioned bias towards English is noticeable on different levels, from grammatical to cultural. If we ask ChatGPT to write us a rhyming poem in Serbian, to give us an example of a word game or to write a palindrome, we will see that it does not do well, i.e. that &#8211; to put it simply &#8211; it answers in Serbian, but still \u201cthinks\u201d in English. 
However, it is also unquestionable that a significant part of the \u201cknowledge\u201d of this chatbot is built on content from other languages, including some very small ones.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>ChatGPT as a black box<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Vasic&#8217;s thesis about the \u201cblack box\u201d, that is, the claim that we do not know how this system processes and connects data, is largely correct. The way this model processes a huge amount of information is not completely known <a href=\"https:\/\/www.vox.com\/unexplainable\/2023\/7\/15\/23793840\/chat-gpt-ai-science-mystery-unexplainable-podcast\">even to its creators<\/a>. Why? Because the system is set up in such a way that, from a certain point on, it learns by itself how to connect data. As a result, this process has become too complex to be comprehensively analyzed and deconstructed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">An additional note of mystification is introduced by the fact that the companies that develop AI models are quite <a href=\"https:\/\/www.theverge.com\/2023\/3\/15\/23640180\/openai-gpt-4-launch-closed-research-ilya-sutskever-interview\">secretive<\/a> when it comes to the mechanisms they implement and test. 
However, there are <a href=\"https:\/\/archive.ph\/OdTDt\">researchers<\/a> who study this system and ask <a href=\"https:\/\/www.youtube.com\/watch?v=TO0J2Yw7usM&amp;t=4696s\">questions<\/a> about its functioning and its future.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"657\" src=\"https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-3-1024x657.png\" alt=\"\" class=\"wp-image-7190\" srcset=\"https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-3-1024x657.png 1024w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-3-300x193.png 300w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-3-768x493.png 768w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-3-1200x770.png 1200w, https:\/\/seecheck.org\/wp-content\/uploads\/2023\/10\/Black-boxes-3.png 1536w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>ChatGPT and bombshell questions<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">ChatGPT is indeed able to provide <a href=\"https:\/\/futurism.com\/amazing-jailbreak-chatgpt\">answers<\/a> to questions that it should not be able to answer. Although there is an ethical filter that blocks immoral or \u201cdangerous\u201d answers, users have found and are still finding various \u201choles\u201d in the system. One such \u201chole\u201d is precisely the case in which this chatbot was indirectly prompted to provide instructions for making a <a href=\"https:\/\/archive.ph\/Ngtwv\">bomb<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When it comes to the assessment that this system knows chemistry at the level of a university graduate, we can draw attention to two analyses in parallel. <a href=\"https:\/\/openai.com\/research\/gpt-4\">OpenAI<\/a> wrote about the first one when presenting GPT-4. 
The chatbot managed to solve the chemistry test (<a href=\"https:\/\/apcentral.collegeboard.org\/about-ap\/ap-a-glance\">AP Chemistry<\/a>) better than 70% of the real test takers. On the other hand, the research called \u201cChatGPT also needs a chemistry teacher\u201d is also interesting. It proved that by changing the context it is easily possible to induce ChatGPT to give the wrong answer to a question from an entrance exam for chemistry studies.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Original article&nbsp;(in Serbian) was published on 25\/09\/2023; Author: Marija Zemunovi\u0107 A reader asked us to check the statements made by marketing expert Petar Vasic about ChatGPT in a short video posted on Instagram. The video was taken from Milan Strongman&#8217;s podcast, where Vasic, as a guest, said the following, among other things: \u201cWhen OpenAI was [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":7187,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[316],"tags":[231,230,357,28],"class_list":["post-7186","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-fact-checks","tag-ai","tag-artificial-intelligence","tag-chatgpt","tag-serbia"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/posts\/7186","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/comments?post=7186"}],"version-history":[{"count":3
,"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/posts\/7186\/revisions"}],"predecessor-version":[{"id":8772,"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/posts\/7186\/revisions\/8772"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/media\/7187"}],"wp:attachment":[{"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/media?parent=7186"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/categories?post=7186"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/seecheck.org\/index.php\/wp-json\/wp\/v2\/tags?post=7186"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}