Question
myahia72 on Tue, 27 Feb 2018 20:21:11
The following code splits each line into words and stores the first word of each line in one list, the second word of each line in another list, and so on. It then selects the most frequent word from each list as the correct word.
Module Module1
    Sub Main()
        Dim correctLine As String = ""
        Dim line1 As String = "Canda has more than ones official language"
        Dim line2 As String = "Canada has more than one oficial languages"
        Dim line3 As String = "Canada has nore than one official lnguage"
        Dim line4 As String = "Canada has nore than one offical language"
        Dim wordsOfLine1() As String = line1.Split(" ")
        Dim wordsOfLine2() As String = line2.Split(" ")
        Dim wordsOfLine3() As String = line3.Split(" ")
        Dim wordsOfLine4() As String = line4.Split(" ")
        For i As Integer = 0 To wordsOfLine1.Length - 1
            Dim wordAllLinesTemp As New List(Of String)(New String() {wordsOfLine1(i), wordsOfLine2(i), wordsOfLine3(i), wordsOfLine4(i)})
            Dim counts = From n In wordAllLinesTemp Group n By n Into Group Order By Group.Count() Descending Select Group.First
            correctLine = correctLine & counts.First & " "
        Next
        correctLine = correctLine.Remove(correctLine.Length - 1)
        Console.WriteLine(correctLine)
        Console.ReadKey()
    End Sub
End Module
So this is my code. How can I make it work with lines that have different numbers of words? Each line here is 7 words long, and the For loop works with that length (Length - 1). Suppose that line 3 contains only 5 words.
Replies
Acamar on Tue, 27 Feb 2018 20:53:01
So I want to split all text line words into arrays and then apply voting method on these words.
You haven't indicated what the problem is. How far into the process have you got, and what is the difficulty you have run into? Post the code you have so far.
Simple Samples on Tue, 27 Feb 2018 23:00:02
Yeah, what is the question?
Well, maybe you are asking how to "Split text lines into words", but then the "select the correct ones" part is very vague. Note that you should put the entire question in the body of the post; don't expect to ask the entire question in the subject (title). There is absolutely no question in the body.
As for splitting lines you need to decide how complex you want to make it. For example if you get "it.For" then is that the end of a sentence and the beginning of another and the space has been mistakenly omitted? What if you get ".Net" then is that another mistake? Some people (marketing types of people especially) exist to violate rules and like to do things however they want to so you might have a period in the middle of a name. How complicated do you need (want) to be? You need to decide that first.
Mr. Monkeyboy on Wed, 28 Feb 2018 01:56:23
What's the question? How to split lines of text? How to get a percentage of possible correct words for an index of the 5 string arrays? Provide you all the code for a graduate project idea you came up with? What do you want?
La vida loca
Cherry Bu on Wed, 28 Feb 2018 05:20:17
Hi myahia72,
If you want to split text lines into words, you can use the String.Split method to do this:
Dim str As String = "Canda has more than ones official language"
Dim words() As String = str.Split(" ")
You said that array1 contains the first word from each line and that you select the correct one. Can you provide your existing code here? It would help us understand what you want to do.
Best regards,
Cherry
myahia72 on Wed, 28 Feb 2018 18:01:30
I have posted the code and also I have updated the question
Reed Kimble on Wed, 28 Feb 2018 20:42:15
Instead of accessing wordsOfLineX(i) directly, create a lambda or helper method to safely get a string from the array, returning an empty string if the index is invalid. For example:
Module Module1
    Sub Main()
        Dim correctLine As String = ""
        Dim line1 As String = "Canda has more than ones official language"
        Dim line2 As String = "Canada has more than one oficial languages"
        Dim line3 As String = "Canada has nore than one official lnguage"
        Dim line4 As String = "Canada has nore than one offical language"
        Dim wordsOfLine1() As String = line1.Split(" ")
        Dim wordsOfLine2() As String = line2.Split(" ")
        Dim wordsOfLine3() As String = line3.Split(" ")
        Dim wordsOfLine4() As String = line4.Split(" ")
        Dim getWordSafely = Function(array As String(), index As Integer)
                                If index > -1 AndAlso index < array.Length Then Return array(index)
                                Return String.Empty
                            End Function
        For i As Integer = 0 To wordsOfLine1.Length - 1
            Dim wordAllLinesTemp As New List(Of String)(New String() {getWordSafely(wordsOfLine1, i), getWordSafely(wordsOfLine2, i), getWordSafely(wordsOfLine3, i), getWordSafely(wordsOfLine4, i)})
            Dim counts = From n In wordAllLinesTemp Group n By n Into Group Order By Group.Count() Descending Select Group.First
            correctLine = correctLine & counts.First & " "
        Next
        correctLine = correctLine.Remove(correctLine.Length - 1)
        Console.WriteLine(correctLine)
        Console.ReadKey()
    End Sub
End Module
Just keep in mind that now an empty string could be a predominant result depending on how many short strings there are.
Acamar on Wed, 28 Feb 2018 20:46:27
I have posted the code and also I have updated the question
You have also made all the previous responses look like nonsense. If you are provided with code for the project it should be posted as an additional post, not by rewriting your question.
The code works for lines of any length because the For loop runs from 0 to wordsOfLine?.Length - 1, not from 0 to 6. You should work through that code line by line to ensure that you understand exactly what each statement does, because it is likely you will need to make changes to do what you describe.
myahia72 on Wed, 28 Feb 2018 20:53:27
Thanks very much
Reed Kimble on Wed, 28 Feb 2018 20:55:16
The code works for lines of any length because the For loop runs from 0 to wordsOfLine?.Length - 1, not from 0 to 6. You should work through that code line by line to ensure that you understand exactly what each statement does, because it is likely you will need to make changes to do what you describe.
But there's only one loop over the first line, so it becomes the maximum length line. The following lines are all accessed by that same iteration variable so if they parsed shorter, there would be an index out of range exception when building the List(Of String).
-EDIT-
Though I agree that the post appeared to begin with a question about how to organize the words and now is more about dealing with one of the problems (varying length strings) that one might encounter with this kind of thing.
Reed Kimble - "When you do things right, people won't be sure you've done anything at all"
Reed Kimble on Wed, 28 Feb 2018 20:59:13
I mentioned it in a reply to Acamar above but it is worth reiterating - the first line is deciding the maximum number of words to test. It might be better to get the longest string and use that length:
Dim maxLength = (Aggregate a In {wordsOfLine1, wordsOfLine2, wordsOfLine3, wordsOfLine4} Select a.Length Into Max)
For i As Integer = 0 To maxLength - 1
    Dim wordAllLinesTemp As New List(Of String)(New String() {getWordSafely(wordsOfLine1, i), getWordSafely(wordsOfLine2, i), getWordSafely(wordsOfLine3, i), getWordSafely(wordsOfLine4, i)})
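For reference, a minimal sketch of how the two ideas in this reply (loop up to the longest line, read words through a safe getter) might fit together in one routine. The names `PaddedVoting`, `GetWordSafely` and `VoteWithPadding` are hypothetical, and the sketch makes one extra assumption that nobody in the thread specified: empty placeholders from short lines are left out of the vote so that an empty string cannot win a column.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module PaddedVoting
    ' Hypothetical helper: returns "" when the line has no word at this index.
    Function GetWordSafely(words As String(), index As Integer) As String
        If index >= 0 AndAlso index < words.Length Then Return words(index)
        Return String.Empty
    End Function

    ' Votes column by column, looping up to the length of the longest line.
    Function VoteWithPadding(lines As String()) As String
        If lines.Length = 0 Then Return String.Empty
        Dim split = lines.Select(Function(l) l.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)).ToArray()
        Dim maxLength = split.Max(Function(w) w.Length)
        Dim result As New List(Of String)
        For i As Integer = 0 To maxLength - 1
            Dim column = i ' local copy so the lambda does not capture the loop variable
            ' Assumption: placeholders from short lines do not take part in the vote.
            Dim winner = split.Select(Function(w) GetWordSafely(w, column)).
                               Where(Function(s) s.Length > 0).
                               GroupBy(Function(s) s).
                               OrderByDescending(Function(g) g.Count()).
                               First().Key
            result.Add(winner)
        Next
        Return String.Join(" ", result)
    End Function

    Sub Main()
        ' Sample lines from the question, with line 3 cut to five words.
        Dim lines = {"Canda has more than ones official language",
                     "Canada has more than one oficial languages",
                     "Canada has nore than one",
                     "Canada has nore than one offical language"}
        Console.WriteLine(VoteWithPadding(lines))
    End Sub
End Module
```

Ties are resolved in favour of the word that appears first, because `GroupBy` keeps first-seen order and `OrderByDescending` is a stable sort. This sketch only avoids the exception; it does not realign words when a line is missing a word in the middle.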
Acamar on Wed, 28 Feb 2018 21:11:19
But there's only one loop over the first line, so it becomes the maximum length line. The following lines are all accessed by that same iteration variable so if they parsed shorter, there would be an index out of range exception when building the List(Of String).
Then don't do it like that. If the lines do not have an equal number of words then OP has much bigger problems than the range exception - the whole voting concept becomes irrelevant. If lines with unequal numbers of words are allowed then there are several options. OP could ignore lines that don't match in number of words, or do some sort of similarity ranking to work out which column each word goes into (that is, where to insert a blank dummy word). Whatever the choice, just extending the lines so they match is going to corrupt the voting.
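As a rough illustration of the first option (ignoring lines whose word count does not match), here is a minimal sketch. It assumes, without any confirmation from the thread, that the most common word count among the lines is the correct one. `FilteredVoting` and `VoteIgnoringMismatchedLines` are hypothetical names.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module FilteredVoting
    ' Drops lines whose word count differs from the most common count, then votes per column.
    Function VoteIgnoringMismatchedLines(lines As String()) As String
        Dim split = lines.Select(Function(l) l.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)).
                          Where(Function(w) w.Length > 0).
                          ToList()
        If split.Count = 0 Then Return String.Empty

        ' Assumption: the word count shared by the most lines is the expected one.
        Dim expected = split.GroupBy(Function(w) w.Length).
                             OrderByDescending(Function(g) g.Count()).
                             First().Key
        Dim kept = split.Where(Function(w) w.Length = expected).ToList()

        Dim result As New List(Of String)
        For i As Integer = 0 To expected - 1
            Dim column = i ' local copy so the lambda does not capture the loop variable
            result.Add(kept.Select(Function(w) w(column)).
                            GroupBy(Function(s) s).
                            OrderByDescending(Function(g) g.Count()).
                            First().Key)
        Next
        Return String.Join(" ", result)
    End Function

    Sub Main()
        ' Four seven-word lines plus one six-word line, which is ignored.
        Dim lines = {"Canda has more than ones official language",
                     "Canada has more than one oficial languages",
                     "Canada has nore than one official lnguage",
                     "Canada has nore than one offical language",
                     "Canada has nore than one language"}
        Console.WriteLine(VoteIgnoringMismatchedLines(lines))
    End Sub
End Module
```

Because every kept line has the same length, no padding is needed and the columns stay aligned. The cost is that a short line contributes nothing, even for the words it got right.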
Reed Kimble on Wed, 28 Feb 2018 21:18:36
If the lines do not have an equal number of words then OP has much bigger problems than the range exception - the whole voting concept becomes irrelevant.
I completely agree that short lines may skew the results, but we don't really know what the expected results are supposed to be or what the input will actually look like.
Simple Samples on Wed, 28 Feb 2018 21:35:27
The following is a possibility. It will adapt to the number of words in each line. This does not do everything but it does most of it and the rest should be easy.
Class classWordsOfLine
    Public line As String
    Public Words() As String
    Public Sub New(line As String)
        Me.line = line
        Words = line.Split(" ")
    End Sub
End Class

Module Module1
    Sub Main()
        Dim correctLine As String = ""
        Dim WordsOfLine(4) As classWordsOfLine
        Dim maxwords As Integer = 0
        '
        WordsOfLine(0) = New classWordsOfLine("Canda has more than ones official language")
        maxwords = Math.Max(maxwords, WordsOfLine(0).Words.Length)
        WordsOfLine(1) = New classWordsOfLine("Canada has more than one oficial languages")
        maxwords = Math.Max(maxwords, WordsOfLine(1).Words.Length)
        WordsOfLine(2) = New classWordsOfLine("Canada has nore than one official lnguage")
        maxwords = Math.Max(maxwords, WordsOfLine(2).Words.Length)
        WordsOfLine(3) = New classWordsOfLine("Canada has nore than one offical language")
        maxwords = Math.Max(maxwords, WordsOfLine(3).Words.Length)
        WordsOfLine(4) = New classWordsOfLine("Canada has nore than one language")
        maxwords = Math.Max(maxwords, WordsOfLine(4).Words.Length)
        '
        For fromx As Integer = 0 To maxwords - 1
            Dim words(4) As String
            Dim tox As Integer = 0
            For linex As Integer = 0 To WordsOfLine.Length - 1
                ' if the number of words are less than the current index then don't try it
                If WordsOfLine(linex).Words.Length - 1 >= fromx Then
                    words(tox) = WordsOfLine(linex).Words(fromx)
                    tox = tox + 1
                End If
            Next
            ReDim Preserve words(tox - 1)
            ' words now has the words and just the right number of them
            Console.WriteLine(String.Join(" | ", words))
        Next
    End Sub
End Module
myahia72 on Thu, 01 Mar 2018 11:35:52
I think the suggested solution of ignoring lines with missing words may be a good one, since I have about 70 lines from one run and I have 5 runs, so there will be five sets of 70 lines. The likelihood of having lines with missing words is low, and ignoring these lines will not affect the results.
myahia72 on Thu, 01 Mar 2018 12:04:30
Actually, the program here will not ignore the lines with missing words. Instead, it adds whatever word the short line has at that position to the words array, as follows:
Canda | Canada | Canada | Canada | Canada
has | has | has | has | has
more | more | nore | nore | nore
than | than | than | than | than
ones | one | one | one | one
official | oficial | official | offical | language
language | languages | lnguage | language
I think this line
If WordsOfLine(linex).Words.Length - 1 >= fromx Then
should update to
If WordsOfLine(linex).Words.Length >= maxwords Then
Simple Samples on Thu, 01 Mar 2018 15:40:48
Yes the problem of what to do when there is a mismatch in the number of words is a design problem. The solution needs to be defined in the requirements.
This is obviously a theoretical exercise intended to show a specific methodology not disclosed here. I agree that if the requirements were clarified then the implementation can be improved correspondingly.
A more realistic implementation would likely include some kind of spell check. A dictionary would help for recognition of words in it. A sophisticated solution could use a Natural Language form of recognition of words that could help match words to columns when there are fewer words. This application could be much more complex so I certainly understand there are fundamental imperfections.
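A minimal sketch of how a dictionary could be combined with the vote, under stated assumptions: `KnownWords` is a tiny hypothetical word list standing in for a real dictionary or spell checker, and `PickWord` is a hypothetical helper that prefers the most frequent candidate found in that list and falls back to the plain majority otherwise. It expects at least one candidate.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module DictionaryVoting
    ' Hypothetical word list; a real dictionary file or spell checker would replace it.
    Private ReadOnly KnownWords As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase) From {
        "Canada", "has", "more", "than", "one", "official", "language"}

    ' Picks the most frequent candidate that is a known word,
    ' or the most frequent candidate overall when none is known.
    Function PickWord(candidates As IEnumerable(Of String)) As String
        Dim ranked = candidates.GroupBy(Function(s) s).
                                OrderByDescending(Function(g) g.Count()).
                                Select(Function(g) g.Key).
                                ToList()
        Dim known = ranked.FirstOrDefault(Function(w) KnownWords.Contains(w))
        Return If(known, ranked.First())
    End Function

    Sub Main()
        ' Third column of the five-line example: "nore" has the majority, but "more" is a known word.
        Console.WriteLine(PickWord({"more", "more", "nore", "nore", "nore"}))
    End Sub
End Module
```

With the five-line sample, a plain majority vote picks "nore" for the third column (three votes against two), while this sketch picks "more" because it is the only candidate in the word list.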