I need to write a method that reads an HTML file and prints the number of times each word occurs.

For example: String [] word = {"happy", "nice", "good"};

The word happy was used 7 times. The word nice was used 1 time. The word good was used 2 times.

Here is what I have so far:

    public static void ReadWriteDisplay() {
    
    Path in = Paths.get("E:\\TextToHTML.html");
    Path out = Paths.get("E:\\HTMLToText.txt");
    String s = "";
    String str = "";
    try {
        
        InputStream input = new BufferedInputStream(Files.newInputStream(in));
        BufferedReader reader = new BufferedReader(new InputStreamReader(input));
        
        OutputStream output = new BufferedOutputStream(Files.newOutputStream(out, CREATE, WRITE, TRUNCATE_EXISTING));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(output));
        
        s = reader.readLine();
        while(s != null) {
            
            str += s;
            writer.write(s);
            writer.newLine();
            s = reader.readLine();
        }
        
        
        
        reader.close();
        writer.close();
        
        String a[] = str.split(" ");
        
        System.out.println("str: "+str);
        
        String [] positive = {"happy", "nice", "good", "joy", "love"};
        int [] count = {0, 0, 0, 0, 0};
        
        for (int i = 0; i < a.length; i++) {
            
            if(positive[0].equalsIgnoreCase(a[i]))
                count[0]++;
            if(positive[1].equalsIgnoreCase(a[i]))
                count[1]++;
            if(positive[2].equalsIgnoreCase(a[i]))
                count[2]++;
            if(positive[3].equalsIgnoreCase(a[i]))
                count[3]++;
            if(positive[4].equalsIgnoreCase(a[i]))
                count[4]++;
        }
        
        for (int x = 0; x < 5; x++) {
            
            System.out.println("The word "+positive[x]+" was used "+count[x]+" times.");
        }
        
        
    }catch(Exception e) {
        
        System.err.println("Message: "+ e);
    }
    
}

My method runs, but it does not give accurate counts. The reason is that some words in the HTML are wrapped in <>, so <>Hello<> ends up in my string array instead of the word Hello.

Here is a sample of the output:

str: <!DOCTYPE html><html lang="en"><head>    <meta charset="utf-8">    <meta http-equiv="X-UA-Compatible" content="IE=edge">    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>    <meta http-equiv="content-language" content="en" />    <meta name="viewport" content="width=device-width, initial-scale=1">    <meta name="google-site-verification" content="rUp8isOBygjhxPJ2qyy6QtBi9vWRFhIboMXucJsCtrE" />    <title>JustPaste.it - Share Text &amp; Images the Easy Way</title>    <link rel="preload" href="/static/img/jp_logo_1_en_v4.png" as="image" />                <meta name="robots" content="noindex, nofollow" />        <meta name="googlebot" content="noindex, nofollow" />                                <link rel="preload" href="/build/global.395f53d0.css" as="style" />            <link rel="stylesheet" type="text/css"  href="/build/global.395f53d0.css" />                    <link rel="shortcut icon" href="/static/other/fav.ico" />             <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->        <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->        <!--[if lt IE 9]>            <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>            <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>        <![endif]-->        <script>      window.article = 
{"id":42017684,"url":"https:\/\/justpaste.it\/6fn9m","shortUrl":"https:\/\/jpst.it\/2wiek","pdfUrl":"https:\/\/justpaste.it\/6fn9m\/pdf","qrCodeData":"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFcAAABXCAIAAAD+qk47AAAACXBIWXMAAA7EAAAOxAGVKw4bAAACCklEQVR4nO2by27DMAwEx0X\/\/5fTAwFdaNB8SEmB7BzjSDEWy4ikpOv1evH1\/Hz6Bf4FUgGkgiEVQCoYv\/6j67omM65FJzOPX6HWKD9PaebSj8oLIBWMm4hYlBIq79Jg+Pqyd3vpR4dvuJAXQCoYUUQsAi9lPOlt74dnloZzbygvgFQwUhExpJft9EKjh7wAUsF4R0QE+Bh5g\/898gJIBSMVEUNzDjOiDMN55AWQCkYUEcOWTqlrtL18KCEvgFQwbiJie7qSMXkpELa\/obwAUsFI7UcEpXHw397bmMh0cXtJVzBKXgCpYFyB3xYlT\/Ye3bzZ7q264EflBZAKRmqHLmPyYJR\/5IeXEqrt8SgvgFQwojoiY9feEpN5VCLo4maQF0AqGLVzTcM\/50UpEdpVj+sUxwNSAao7dJk6erHrhN65umYhL4BUMGoRUTJ56TsBw\/UoM0peAKlg1CrrRamgLnEu6VLW9IBUgLj7Ouz\/DJePHr16RF4AqWA096yDc92lCXs3hjzDyJIXQCoYB+\/Q9Q4vDS9cBPOojnhAKsDRO3R+nl3dp94uhrKmB6QCHL1Dlznp1GsWbUdeAKlgvOPGUK8juqt5mymx5QWQCsbBiCglS5+9KCEvgFQwDt6hO3djdHtfV14AqWAcvEO36B1M6mVNvQpFXgCpYNzs0H0h8gJIBUMqgFQwpALAH\/JvmLtnlWjnAAAAAElFTkSuQmCC"};      window.statsUrl = 'https\u003A\/\/stats.justpaste.it';      window.viewKey = 'x6ER';      window.barOptions = 
{"isLoggedIn":false,"hasPublicProfile":false,"displayOwnership":false,"isArticleOwner":false,"isPasswordProtected":false,"isCaptchaRequired":null,"isCaptchaEntered":false,"captchaSettings":null,"premiumUserData":null,"isPrivate":false,"isExpired":false,"expireAfterRead":false,"isShared":false,"defaultAvatar":"\/static\/img\/avatar60.jpg","createdText":"6h","showLastEdit":false,"modifiedText":"6h","isInTrash":false,"viewsText":"2","favouritesCount":0,"onlineText":"1","getFavouriteArticleUrl":"https:\/\/justpaste.it\/api\/account\/v1\/favourite-article\/42017684","addFavouriteArticleUrl":"https:\/\/justpaste.it\/api\/account\/v1\/favourite-article","removeFavouriteArticleUrl":"https:\/\/justpaste.it\/api\/account\/v1\/favourite-article-delete\/42017684","apiShowArticleDynamicUrl":"\/api\/v1\/article-dynamic","voteUrl":"\/api\/account\/v1\/vote","contentLang":"en","positiveVotes":0,"negativeVotes":0,"currentVote":"empty","linkSharingUrl":null,"linkSharingSecret":null};          </script>        <script src="/build/runtime.a1e5a72a.js" async></script>        <script src="/build/1676.2c557867.js" async></script>        <script src="/build/8452.a9a1e0c5.js" async></script>        <script src="/build/5936.ad26e56d.js" async></script>        <script src="/build/9412.4a605741.js" async></script>        <script src="/build/showarticlewidget.3bbca334.js" async></script>        </head><body marginwidth="0" dir="ltr" marginheight="0"><!-- Static navbar --><div class="navbar navbar-default navbar-static-top mainTableTopMiddle" role="navigation">    <div class="container">        <div class="navbar-header pull-left">            <a href="/"><img src="/static/img/jp_logo_1_en_v4.png" width="186px" height="54px" alt="JustPaste.it" /></a>        </div>        <div class="navbar-header pull-left">            <div class="nav navbar-nav mainTableTopMiddleRight hidden-xs hidden-sm">                <img src="/static/img/jp_logo_2_en_v5.png" width="390px" height="54px" />            </div> 
       </div>        <div class="navbar-header pull-right" style="padding-top:8px">            <div id="mainPanelButtons"></div>        </div>    </div><!--/.nav-collapse --></div><div id="headContainer" class="container" style="max-width: 960px">    <div class="row">        <div class="col-md-12">            <div id="mainTableContent">                <div style="max-width: 960px; vertical-align: top">            <div id="showArticleWidget"><div class="showArticleWidgetPlaceholder"></div></div>        <div id="articleContent">        <p>happy</p> <p>nice nice</p> <p>good good good</p> <p>joy Joy joy Joy joy</p> <p>Love love Love love Love</p>    </div>            <div id="showArticleBottomWidget"><div class="articleBottomWidgetPlaceholder"></div></div>    <span style="visibility:hidden" class="glyphicon glyphicon-link"></span></div>            </div>        </div>    </div> <!-- /row --></div> <!-- /container --><div id="footer" style="min-height: 30px;">    <div class="container" style="vertical-align: middle">        <div class="col-md-3 col-xs-5 col-sm-4 text-muted" style="font-size: 95%;" align="left">            &copy; 2021 <span class="hidden-xs">justpaste.it</span>        </div>        <div class="col-md-9 col-xs-7 col-sm-8 text-muted"  align="right">            <ul class="list-inline basePageFooterList">                <li class="hidden-xs">                    <a href="/login">Account</a>                </li>                <li class="hidden-xs">                    <a href="/terms">Terms</a>                </li>                <li class="hidden-xs">                    <a href="/privacypolicy">Privacy</a>                </li>                <li class="hidden-xs">                    <a href="/cookies">Cookies</a>                </li>                <li>                    <a href="/u/justpasteit">Blog</a>                </li>                <li>                    <a href="/about">About</a>                </li>            </ul>        </div>    </div></div>   
     <script>      window.mainPanelOptions = {        addArticleUrl: '/',        loginUrl: '/login',        logoutUrl: '/logout',        favouriteArticlesUrl: '/account/favourite',        subscribedArticlesUrl: '/account/subscribed',        sharedArticlesUrl: '/account/shared',        manageAccountUrl: '/account/manage',        messagesUrl: '/account/messages',        articlesStatsUrl: '/account/articles-stats',        premiumUrl: '/premium/subscription',        unreadMessagesUrl: 'https://msg.justpaste.it/api/v1/conversation/unread',        profileSettings: '/account/settings',        isLoggedIn: false,        userEmail: null,        userPermalink: null,        userProfileIsPublic: false,        userProfileLink: null      };          </script>        <script src="/build/mainpanelwidget.80530742.js" async></script>        </body></html>

    The word happy was used 0 times.
    The word nice was used 0 times.
    The word good was used 1 times.
    The word joy was used 3 times.
    The word love was used 3 times.

How can I split the text or count the occurrences correctly? Thank you!

0
X4c3 May 28, 2021, 21:33

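2 answers

Best Answer

This removes special characters and keeps only letters. For example, <>Hello<> becomes Hello. Apply it to each token after splitting rather than to the whole string, because the pattern also removes spaces. Note that it keeps the letters of tag names, so <p>happy</p> becomes phappyp.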

String alphaOnly = input.replaceAll("[^a-zA-Z]+","");
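
To count words in real HTML with regular expressions alone, the tags have to be removed before the text is split. The following is a minimal sketch of that idea, not a full HTML parser: the class name TagStripWordCount and the helper names stripTags and countWord are made up for illustration, and the tag pattern assumes that no attribute value contains a > character.

```java
final class TagStripWordCount {

    // Removes script/style blocks, HTML comments, and then any remaining tags,
    // replacing each with a space so neighbouring words do not run together.
    static String stripTags(String html) {
        return html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")
                .replaceAll("(?s)<!--.*?-->", " ")
                .replaceAll("<[^>]*>", " ");
    }

    // Splits the text on runs of non-letters and counts tokens equal to word, ignoring case.
    static int countWord(String text, String word) {
        int count = 0;
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.equalsIgnoreCase(word)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Small hand-written sample, not the page from the question.
        String html = "<div id=\"articleContent\"><p>happy</p> <p>nice nice</p> <p>joy Joy joy</p></div>";
        String text = stripTags(html);
        for (String word : new String[] {"happy", "nice", "joy"}) {
            System.out.println("The word " + word + " was used " + countWord(text, word) + " times.");
        }
    }
}
```

Splitting on non-letters also drops punctuation attached to a word, so a token such as "good," would still count as good. For anything beyond simple pages, a real HTML parser is the safer choice.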

0
VijayS May 28, 2021, 18:53

You can simply use the jsoup library (a Java HTML parser) to extract all the text from the HTML structure.

Download the jar file from: https://jsoup.org/download

The code below counts the occurrences of the words:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Parses the HTML with jsoup and prints how often each target word occurs in the body text.
static void countOccurance(String htmlStructure) {
    String[] positive = { "happy", "nice", "good", "joy", "love" };
    Document document = Jsoup.parse(htmlStructure);
    // body().text() returns the text of the body without tags; split it on whitespace.
    String[] text = document.body().text().split("\\s+");
    for (String word : positive) {
        int wordCount = countWord(text, word);
        System.out.println("The word " + word + " was used " + wordCount + " times.");
    }
}

// Counts the tokens equal to wordToFind, ignoring case.
static int countWord(String[] documentText, String wordToFind) {
    int count = 0;
    for (String token : documentText) {
        if (wordToFind.equalsIgnoreCase(token)) {
            count++;
        }
    }
    return count;
}
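
The counting step above scans the token array once per target word. As a possible variation, the counts for all words can be collected in a single pass with a map. This is only a sketch: the class name WordTally and the method tally are hypothetical, and the text is hard-coded as a stand-in for the result of document.body().text() so that the example runs without jsoup on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class WordTally {

    // Counts each target word in already-extracted plain text, ignoring case.
    // Keys are the lower-cased targets, kept in the order they were passed in.
    static Map<String, Integer> tally(String text, String... targets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String target : targets) {
            counts.put(target.toLowerCase(Locale.ROOT), 0);
        }
        for (String token : text.split("\\s+")) {
            String key = token.toLowerCase(Locale.ROOT);
            if (counts.containsKey(key)) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for document.body().text(); an assumption made for this sketch.
        String text = "happy nice nice good good good joy Joy joy Joy joy Love love Love love Love";
        tally(text, "happy", "nice", "good", "joy", "love")
                .forEach((word, n) -> System.out.println("The word " + word + " was used " + n + " times."));
    }
}
```

Like the code above, this splits on whitespace only, so a token with attached punctuation such as "good," is not counted as good.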
2
Koustubh Madkaikar May 28, 2021, 19:00