Split text lines into words and select the correct ones

Category: visual studio vb

Question

myahia72 on Tue, 27 Feb 2018 20:21:11


The following code splits each line into words and stores the first word of each line in one list, the second word of each line in another list, and so on. It then selects the most frequent word from each list as the correct word.

Module Module1

    Sub Main()
        Dim correctLine As String = ""
        Dim line1 As String = "Canda has more than ones official language"
        Dim line2 As String = "Canada has more than one oficial languages"
        Dim line3 As String = "Canada has nore than one official lnguage"
        Dim line4 As String = "Canada has nore than one offical language"

        Dim wordsOfLine1() As String = line1.Split(" ")
        Dim wordsOfLine2() As String = line2.Split(" ")
        Dim wordsOfLine3() As String = line3.Split(" ")
        Dim wordsOfLine4() As String = line4.Split(" ")
 

        For i As Integer = 0 To wordsOfLine1.Length - 1
            Dim wordAllLinesTemp As New List(Of String)(New String() {wordsOfLine1(i), wordsOfLine2(i), wordsOfLine3(i), wordsOfLine4(i)})
            Dim counts = From n In wordAllLinesTemp
                         Group n By n Into Group
                         Order By Group.Count() Descending
                         Select Group.First
            correctLine = correctLine & counts.First & " "
        Next
        correctLine = correctLine.Remove(correctLine.Length - 1)
        Console.WriteLine(correctLine)
        Console.ReadKey()

    End Sub

End Module

So this is my code. How can I make it work with lines that have different numbers of words? Each line here has 7 words, and the For loop runs over that length (Length - 1). Suppose that line 3 contains only 5 words.


Replies

Acamar on Tue, 27 Feb 2018 20:53:01


So I want to split all text line words into arrays and then apply voting method on these words.

You haven't indicated what the problem is. How far into the process have you got, and what is the difficulty you have run into? Post the code you have so far.

Simple Samples on Tue, 27 Feb 2018 23:00:02


Yeah, what is the question?

Well, maybe you are asking how to "Split text lines into words", but then the "select the correct ones" part is very vague. Note that you should put the entire question in the body of the post; don't try to ask the entire question in the subject (title). There is absolutely no question in the body.

As for splitting lines you need to decide how complex you want to make it. For example if you get "it.For" then is that the end of a sentence and the beginning of another and the space has been mistakenly omitted? What if you get ".Net" then is that another mistake? Some people (marketing types of people especially) exist to violate rules and like to do things however they want to so you might have a period in the middle of a name. How complicated do you need (want) to be? You need to decide that first.
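As a starting point for the simplest level of complexity, here is a minimal sketch that only handles repeated or trailing spaces. It deliberately does nothing about punctuation such as "it.For" or ".Net", which remains a design decision. The module name `SplitSketch` and the helper `SplitWords` are hypothetical names used only for illustration.

```vb
Imports System

Module SplitSketch
    ' Hypothetical helper: splits on spaces and drops empty entries,
    ' so doubled or trailing spaces do not produce empty "words".
    ' Punctuation is left attached to the word it touches.
    Function SplitWords(line As String) As String()
        Return line.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        Dim words As String() = SplitWords("Canada  has more than one official language ")
        Console.WriteLine(words.Length)             ' prints 7
        Console.WriteLine(String.Join("|", words))  ' prints Canada|has|more|than|one|official|language
    End Sub
End Module
```

Without `StringSplitOptions.RemoveEmptyEntries`, the same input would yield two extra empty entries, which would shift every later word into the wrong column.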

Mr. Monkeyboy on Wed, 28 Feb 2018 01:56:23


What's the question? How to split lines of text? How to get a percentage of possible correct words for an index of the 5 string arrays? Provide you all the code for a graduate project idea you came up with? What do you want?

La vida loca

Cherry Bu on Wed, 28 Feb 2018 05:20:17


Hi myahia72,

If you want to split text lines into words, you can use the String.Split method:

        Dim str As String = "Canda has more than ones official language"
        Dim words() As String = str.Split(" ")

You said that array1 contains the first word from each line and that you select the correct one from it. Can you post your existing code here? It would help us understand what you want to do.

Best regards,

Cherry

myahia72 on Wed, 28 Feb 2018 18:01:30


I have posted the code and also I have updated the question

Reed Kimble on Wed, 28 Feb 2018 20:42:15


Instead of accessing wordsOfLineX(i) directly, create a lambda or helper method to safely get a string from the array, returning an empty string if the index is invalid.  For example:

Module Module1

    Sub Main()
        Dim correctLine As String = ""
        Dim line1 As String = "Canda has more than ones official language"
        Dim line2 As String = "Canada has more than one oficial languages"
        Dim line3 As String = "Canada has nore than one official lnguage"
        Dim line4 As String = "Canada has nore than one offical language"

        Dim wordsOfLine1() As String = line1.Split(" ")
        Dim wordsOfLine2() As String = line2.Split(" ")
        Dim wordsOfLine3() As String = line3.Split(" ")
        Dim wordsOfLine4() As String = line4.Split(" ")

        Dim getWordSafely = Function(array As String(), index As Integer)
                                If index > -1 AndAlso index < array.Length Then Return array(index)
                                Return String.Empty
                            End Function

        For i As Integer = 0 To wordsOfLine1.Length - 1
            Dim wordAllLinesTemp As New List(Of String)(New String() {getWordSafely(wordsOfLine1, i), getWordSafely(wordsOfLine2, i),
                                                        getWordSafely(wordsOfLine3, i), getWordSafely(wordsOfLine4, i)})
            Dim counts = From n In wordAllLinesTemp
                         Group n By n Into Group
                         Order By Group.Count() Descending
                         Select Group.First
            correctLine = correctLine & counts.First & " "
        Next
        correctLine = correctLine.Remove(correctLine.Length - 1)
        Console.WriteLine(correctLine)
        Console.ReadKey()

    End Sub

End Module
Just keep in mind that now an empty string could be a predominant result depending on how many short strings there are.
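One possible way to keep the empty placeholders from winning the vote is to filter them out before counting. The following is only a sketch under that assumption; `VoteSketch` and `MostFrequentWord` are hypothetical names, and whether placeholders should be excluded at all depends on the requirements.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module VoteSketch
    ' Hypothetical helper: returns the most frequent non-empty word,
    ' or an empty string when every candidate is empty.
    Function MostFrequentWord(candidates As IEnumerable(Of String)) As String
        Dim winner As String = candidates.
            Where(Function(w) Not String.IsNullOrEmpty(w)).
            GroupBy(Function(w) w).
            OrderByDescending(Function(g) g.Count()).
            Select(Function(g) g.Key).
            FirstOrDefault()
        ' FirstOrDefault returns Nothing when no non-empty candidate exists.
        Return If(winner, String.Empty)
    End Function

    Sub Main()
        ' Three short lines contributed empty placeholders for this column.
        Dim column As String() = New String() {"", "", "", "language"}
        Console.WriteLine(MostFrequentWord(column))  ' prints language
    End Sub
End Module
```

With the unfiltered query, the same column would produce an empty string, because the three placeholders outnumber the single real word.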

Acamar on Wed, 28 Feb 2018 20:46:27


I have posted the code and also I have updated the question

You have also made all the previous responses look like nonsense.  If you are provided with code for the project it should be posted as an additional post, not by rewriting your question. 

The code works for lines of any length because the For loop runs from 0 to wordsOfLine?.Length - 1, not from 0 to 6.   You should work through that code line by line to ensure that you understand exactly what each statement does, because it is likely you will need to make changes to do what you describe.

myahia72 on Wed, 28 Feb 2018 20:53:27


Thanks very much

Reed Kimble on Wed, 28 Feb 2018 20:55:16


The code works for lines of any length because the For loop runs from 0 to wordsOfLine?.Length - 1, not from 0 to 6.   You should work through that code line by line to ensure that you understand exactly what each statement does, because it is likely you will need to make changes to do what you describe.

But there's only one loop over the first line, so it becomes the maximum length line.  The following lines are all accessed by that same iteration variable so if they parsed shorter, there would be an index out of range exception when building the List(Of String).

-EDIT-

Though I agree that the post appeared to begin with a question about how to organize the words and now is more about dealing with one of the problems (varying length strings) that one might encounter with this kind of thing.


Reed Kimble - "When you do things right, people won't be sure you've done anything at all"


Reed Kimble on Wed, 28 Feb 2018 20:59:13


I mentioned it in a reply to Acamar above but it is worth reiterating - the first line is deciding the maximum number of words to test.  It might be better to get the longest string and use that length:

        Dim maxLength = (Aggregate a In {wordsOfLine1, wordsOfLine2, wordsOfLine3, wordsOfLine4} Select a.Length Into Max)

        For i As Integer = 0 To maxLength - 1
            Dim wordAllLinesTemp As New List(Of String)(New String() {getWordSafely(wordsOfLine1, i), getWordSafely(wordsOfLine2, i),
                                                        getWordSafely(wordsOfLine3, i), getWordSafely(wordsOfLine4, i)})

Acamar on Wed, 28 Feb 2018 21:11:19


But there's only one loop over the first line, so it becomes the maximum length line.  The following lines are all accessed by that same iteration variable so if they parsed shorter, there would be an index out of range exception when building the List(Of String).

Then don't do it like that. If the lines do not have an equal number of words then OP has much bigger problems than the range exception - the whole voting concept becomes irrelevant. If lines with unequal numbers of words are allowed then there are several options. OP could ignore lines that don't match in number of words, or do some sort of similarity ranking to work out which column each word goes into (that is, where to insert a blank dummy word). Whatever the choice, just extending the lines so they match is going to corrupt the voting.
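The first option, ignoring lines whose word count does not match, could be sketched roughly as follows. This assumes that the most common word count among the lines is the "expected" one, which is a guess rather than a stated requirement. `FilterSketch` and `VoteOnLines` are hypothetical names.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module FilterSketch
    ' Hypothetical helper: keeps only the lines whose word count equals the
    ' most common word count, then votes on each word position.
    Function VoteOnLines(lines As IEnumerable(Of String)) As String
        Dim wordArrays As List(Of String()) = lines.
            Select(Function(l) l.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)).
            ToList()
        If wordArrays.Count = 0 Then Return String.Empty

        ' Assumption: the most common word count is the correct one.
        Dim expected As Integer = wordArrays.
            GroupBy(Function(w) w.Length).
            OrderByDescending(Function(g) g.Count()).
            First().Key

        Dim kept As List(Of String()) = wordArrays.Where(Function(w) w.Length = expected).ToList()

        Dim result As New List(Of String)
        For i As Integer = 0 To expected - 1
            Dim column As Integer = i ' copy the loop variable before capturing it in a lambda
            result.Add(kept.
                Select(Function(w) w(column)).
                GroupBy(Function(w) w).
                OrderByDescending(Function(g) g.Count()).
                First().Key)
        Next
        Return String.Join(" ", result)
    End Function

    Sub Main()
        Dim lines As String() = New String() {
            "Canda has more than ones official language",
            "Canada has more than one oficial languages",
            "Canada has nore than one official lnguage",
            "Canada has nore than one offical language",
            "Canada has nore than one language"}
        ' The six-word line is dropped; the remaining four lines vote.
        Console.WriteLine(VoteOnLines(lines))  ' prints Canada has more than one official language
    End Sub
End Module
```

Note that "more" and "nore" tie at two votes each once the short line is dropped. The LINQ sort is stable, so the word seen first wins the tie, which here happens to be the correct one. A real implementation would need an explicit tie-breaking rule.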

Reed Kimble on Wed, 28 Feb 2018 21:18:36


If the lines do not have an equal number of words then OP has much bigger problems than the range exception - the whole voting concept becomes irrelevant.  

I completely agree that short lines may skew the results, but we don't really know what the expected results are supposed to be or what the input will actually look like.

Simple Samples on Wed, 28 Feb 2018 21:35:27


The following is a possibility. It will adapt to the number of words in each line. This does not do everything but it does most of it and the rest should be easy.

Class classWordsOfLine
    Public line As String
    Public Words() As String
    Public Sub New(line As String)
        Me.line = line
        Words = line.Split(" ")
    End Sub
End Class

Module Module1

    Sub Main()
        Dim correctLine As String = ""
        Dim WordsOfLine(4) As classWordsOfLine
        Dim maxwords As Integer = 0
        '
        WordsOfLine(0) = New classWordsOfLine("Canda has more than ones official language")
        maxwords = Math.Max(maxwords, WordsOfLine(0).Words.Length)
        WordsOfLine(1) = New classWordsOfLine("Canada has more than one oficial languages")
        maxwords = Math.Max(maxwords, WordsOfLine(1).Words.Length)
        WordsOfLine(2) = New classWordsOfLine("Canada has nore than one official lnguage")
        maxwords = Math.Max(maxwords, WordsOfLine(2).Words.Length)
        WordsOfLine(3) = New classWordsOfLine("Canada has nore than one offical language")
        maxwords = Math.Max(maxwords, WordsOfLine(3).Words.Length)
        WordsOfLine(4) = New classWordsOfLine("Canada has nore than one language")
        maxwords = Math.Max(maxwords, WordsOfLine(4).Words.Length)
        '
        For fromx As Integer = 0 To maxwords - 1
            Dim words(4) As String
            Dim tox As Integer = 0
            For linex As Integer = 0 To WordsOfLine.Length - 1
                ' if the number of words are less than the current index then don't try it
                If WordsOfLine(linex).Words.Length - 1 >= fromx Then
                    words(tox) = WordsOfLine(linex).Words(fromx)
                    tox = tox + 1
                End If
            Next
            ReDim Preserve words(tox - 1)
            ' words now has the words and just the right number of them
            Console.WriteLine(String.Join(" | ", words))
        Next
    End Sub

End Module

myahia72 on Thu, 01 Mar 2018 11:35:52


I think the suggestion to ignore lines with missing words may be a good one, since I have about 70 lines resulting from one run and I have 5 runs, so there will be five sets of 70 lines. The probability of having lines with missing words is low, and ignoring these lines will not affect the results.

myahia72 on Thu, 01 Mar 2018 12:04:30


Actually, the program here does not ignore the lines with missing words. Instead, it adds a word from the next line to the words array, as follows:

Canda | Canada | Canada | Canada | Canada
has | has | has | has | has
more | more | nore | nore | nore
than | than | than | than | than
ones | one | one | one | one
official | oficial | official | offical | language
language | languages | lnguage | language

I think this line 

If WordsOfLine(linex).Words.Length - 1 >= fromx Then


should be updated to

If WordsOfLine(linex).Words.Length >= maxwords Then
 

Simple Samples on Thu, 01 Mar 2018 15:40:48


Yes, the problem of what to do when there is a mismatch in the number of words is a design problem. The solution needs to be defined in the requirements.

This is obviously a theoretical exercise intended to show a specific methodology not disclosed here. I agree that if the requirements were clarified then the implementation can be improved correspondingly.

A more realistic implementation would likely include some kind of spell check. A dictionary would help for recognition of words in it. A sophisticated solution could use a Natural Language form of recognition of words that could help match words to columns when there are fewer words. This application could be much more complex so I certainly understand there are fundamental imperfections.
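To illustrate the dictionary idea only, here is a rough sketch in which candidates found in a word list are preferred over candidates that merely occur most often. The tiny `KnownWords` set is a hypothetical stand-in for a real dictionary, and `DictionaryVoteSketch` and `PickWord` are likewise invented names. This is not a spell checker and makes no attempt at the natural-language matching mentioned above.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module DictionaryVoteSketch
    ' Hypothetical word list; a real application would load a full dictionary.
    Private ReadOnly KnownWords As New HashSet(Of String)(
        New String() {"Canada", "has", "more", "than", "one", "official", "language"},
        StringComparer.OrdinalIgnoreCase)

    ' Prefers candidates found in the word list, then the most frequent one.
    Function PickWord(candidates As IEnumerable(Of String)) As String
        Dim ranked As String = candidates.
            Where(Function(w) Not String.IsNullOrEmpty(w)).
            GroupBy(Function(w) w).
            OrderByDescending(Function(g) KnownWords.Contains(g.Key)).
            ThenByDescending(Function(g) g.Count()).
            Select(Function(g) g.Key).
            FirstOrDefault()
        Return If(ranked, String.Empty)
    End Function

    Sub Main()
        ' Plain majority voting would pick "nore" here (three votes to one).
        Dim column As String() = New String() {"nore", "nore", "nore", "more"}
        Console.WriteLine(PickWord(column))  ' prints more
    End Sub
End Module
```

When no candidate is in the word list, the sketch falls back to plain frequency, so it behaves like the original voting code for unknown words.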