xsharp.eu • Alines() used with multiple String „Parsechar“ to parse large files ?

Alines() used with multiple String „Parsechar“ to parse large files ?

Posted: Sun Jul 11, 2021 1:40 pm
by mainhatten
Hi,
I volunteered to code an alines() implementation in X#: it fits with the GetWord* functions I did earlier and is often used here, and I don't think I know enough X# yet to attempt the TableUpdate(), CursorSetProp()/CursorGetProp(), buffer mode and CursorAdapter package that I currently miss most in X#.

In my use cases there are ~2.8 main usages for aLines():
Splitting text into an array of lines, as the name suggests, often with the default parameters
Splitting CSV lines (often Excel-style SSV, semicolon-separated values) into an array of items
Splitting into words using multiple char separators when GetWordNum does not fit (the 0.8 usage)

For these use cases I have ample test material from War and Peace. That is needed because pushing .NET performance to where Fox is when it runs inside its C runtime is not easy, especially as .NET works on Unicode "char", which is more demanding than the always-1-byte char (and strings thereof) used in Fox.

Now the Alines() parameter name "cParseChar" is a bit misleading: you can use Len(Separator)>1 and it works. The default parse "chars" are Chr(13), Chr(13)+Chr(10), Chr(10). The worst case would be a long file with many separators all starting with the same char. One possible and relatively demanding scenario I came up with is a special "partial parsing" of XML/HTML, with most separators starting with "<". Perhaps as a multi-step sequence, eliminating void tags in all forms of tagging first and getting to the interesting meat afterwards...
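To make the multi-separator scenario concrete, here is a minimal sketch in Python (not X# or VFP) of splitting on several separators that may be longer than one character. The function name split_multi is hypothetical, and the matching rule (separators tried in the order given, first match wins) is an assumption, not a statement about how VFP or any X# implementation works. Grouping the separators by first character only helps when the first characters differ. In the "<" scenario every "<" still has to be compared against every separator.

```python
from collections import defaultdict


def split_multi(text, separators):
    """Split text on any of several separators (each may be longer than 1 char).

    Assumption: at each position the separators are tried in the order
    given and the first one that matches wins.
    """
    # Group separators by first character, keeping caller order per group,
    # so positions whose character starts no separator are skipped cheaply.
    by_first = defaultdict(list)
    for sep in separators:
        if sep:  # ignore empty separators
            by_first[sep[0]].append(sep)

    parts, start, i = [], 0, 0
    while i < len(text):
        for sep in by_first.get(text[i], ()):
            if text.startswith(sep, i):
                parts.append(text[start:i])
                i += len(sep)
                start = i
                break
        else:
            i += 1  # no separator starts here
    parts.append(text[start:])  # trailing piece, kept even if empty
    return parts


html = "<p>one</p><br/><p>two</p>"
print(split_multi(html, ["<p>", "</p>", "<br/>"]))
# prints ['', 'one', '', '', 'two', '']
```

The sketch always keeps the trailing empty element and does no trimming. Flag handling would sit on top of it.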

I can build a table with heavy HTML content by extracting .chm files with 7-Zip and use those in somewhat concocted runs with various tags, but I prefer to employ anything formed by real outside needs.

If anybody has a data file and alines() calls doing hefty, but real-world work with
alines(taAlines, cHeftySourceStrOrTableofSuchStrings, nWhatFitsThePurpose ;
, cStringWithLenGT1_1, cStringWithLenGT1_2, cStringWithLenGT1_3 ;
[ , cStringWithLenGT1_4, cStringWithLenGT1_5, cStringWithLenGT1_6...] )

to check and optimize my implementation on, receiving such data and alines() calls here or via private msg or directly via email would be splendid.

tia
thomas

Posted: Sun Jul 11, 2021 8:55 pm
by lumberjack
Hi Thomas,
Have you looked at the .NET String.Split overloads that accept arrays of char and strings?

Posted: Mon Jul 12, 2021 6:43 am
by mainhatten
lumberjack wrote: Hi Thomas,
Have you looked at the .NET String.Split overloads that accept arrays of char and strings?
Hi Johan,
thx for chiming in. Yes, I have looked at String.Split(), which might resolve ~90% of the calls/use cases if everything falls into place - have not even started to check in earnest.

OTOH it is already clear that not all of the VFP nFlags can be handled, even if those use cases might crop up less often. In the current .NET 5 docs StringSplitOptions has [None, RemoveEmptyEntries, TrimEntries], while XIDE only shows [None, RemoveEmptyEntries] as options. I must check whether TrimEntries is available in all .NET versions, as the issue proposing it,
https://github.com/dotnet/runtime/issues/31038
seems to be from 2019. Perhaps Chris or Robert know a shortcut and can chime in before I try to search versions.

Then there might be "different opinions" on what counts as whitespace / empty: VFP Empty() includes Chr(13), the default separators of GetWord* do not, and .NET has its own idea of whitespace. In GetWord* I resolved this by giving X# coders the explicit option to use the .NET whitespace definition for separating words. If trimming results differ in a similar way for alines(), I will probably offer a new flag to run with the .NET specification. So testing edge cases will have to look into more edges.
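The kind of difference meant here can be illustrated with a tiny Python sketch. It only contrasts "trim blanks only" with "trim everything the runtime calls whitespace". Both helper names are hypothetical, and Python's whitespace set is neither VFP's nor .NET's, so the result shows the effect, not the exact behaviour of either platform.

```python
def trim_blanks_only(s):
    # Assumption: models a trim that removes only the space character.
    return s.strip(" ")


def trim_runtime_whitespace(s):
    # str.strip() without arguments removes every character that Python
    # itself classifies as whitespace (space, tab, CR, LF, ...).
    return s.strip()


sample = " value\r"
print(repr(trim_blanks_only(sample)))         # prints 'value\r'
print(repr(trim_runtime_whitespace(sample)))  # prints 'value'
```

An element that ends in a stray Chr(13) therefore survives one kind of trim and not the other, which is exactly the sort of edge case the tests would have to cover.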

Use cases for the other flags?
Include the last element in the array even if the element is empty: might be needed when searching for Eof, Eot, Eos and similar markers used in last-century OSes with different values.
Case-insensitive parsing: might be easy to retrofit on char/char arrays, but would it need all case permutations in string[], given that the result probably needs to keep its original case?
Include the parsing characters in the array: might be beneficial for ground-level parsing.

So yes, Split() will be offered and perhaps be used for a subset of flags, but how much it will cover is yet uncertain.

thx again
thomas

Posted: Mon Jul 12, 2021 2:54 pm
by Chris
Hi Thomas,

Indeed, this is not included in the version of the .NET Framework I have either; I suspect it is only included in .NET Core. In any case, I wouldn't write code for the runtime that depends on it!

Posted: Mon Jul 12, 2021 3:24 pm
by mainhatten
Hi Chris,
thx for confirming. Adding .Trim() in fluent interfaces probably won't inflate code size very much ;-)
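A minimal sketch of that idea, in Python rather than X#: split first, then trim each element, then optionally drop the empties, so that no built-in "trim entries" option is needed. The function split_trim and its parameters are hypothetical. Trimming before filtering is a design choice made here so that whitespace-only elements count as empty.

```python
def split_trim(text, sep, remove_empty=False):
    """Split on one separator, trim each element, optionally drop empties."""
    parts = [p.strip() for p in text.split(sep)]  # trim after splitting
    if remove_empty:
        parts = [p for p in parts if p]  # whitespace-only items are gone too
    return parts


print(split_trim(" a ; ;b;", ";"))                     # prints ['a', '', 'b', '']
print(split_trim(" a ; ;b;", ";", remove_empty=True))  # prints ['a', 'b']
```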

More a reminder to myself to think about such possible side effects when developing with the "latest" .NET version. VFP did not give us that many options, including options to shoot yourself in the foot ;-)

regards
thomas
Chris wrote: Indeed, this is not included in the version of the .NET Framework I have either; I suspect it is only included in .NET Core. In any case, I wouldn't write code for the runtime that depends on it!

Posted: Tue Jul 13, 2021 7:54 am
by atlopes
Thomas, although not aimed directly at the issue of parsing large files, and since you're working with the ALINES() function implementation, please keep in mind that the order of the parsing characters is relevant to how the string is split into array elements.

Also, contrary to the VFP documentation, CHR(10) + CHR(13) is not handled as a single splitter.

Code:

CLEAR

LOCAL ARRAY Test1[1], Test2[1], Test3[1]
LOCAL Source AS String

m.Source = "A" + CHR(13) + CHR(10) + "B" + CHR(10) + CHR(13) + CHR(10) + "C"

ALINES(m.Test1, m.Source)
ALINES(m.Test2, m.Source, 0, CHR(13) + CHR(10), CHR(13), CHR(10))  && same as default
ALINES(m.Test3, m.Source, 0, CHR(13), CHR(10), CHR(13) + CHR(10))

DISPLAY MEMORY LIKE Test?

Posted: Tue Jul 13, 2021 7:00 pm
by mainhatten
Hi Antonio,
atlopes wrote: please keep in mind that the order of the parsing characters is relevant to how the string is split into array elements.
Yupp, this is an implementation detail I have to follow, even if in my very own POV it borders on a bug (I'd divide on the longest fitting string first, but the VFP approach might lead to an implementation that runs faster, as less search/comparison effort is needed). But keeping VFP compatibility is the target, not my sometimes warped sense of fitness ;-)
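The two policies can be compared with a small Python model. It assumes that "order is relevant" means: at each position the separators are tried in the order passed and the first match wins. The function split_ordered and its longest_first switch are hypothetical. The model's output should be checked against the DISPLAY MEMORY result of the VFP code and is not a substitute for it.

```python
def split_ordered(text, separators, longest_first=False):
    """Split on multi-character separators under one of two match policies.

    longest_first=False: try separators in caller order (assumed VFP-like).
    longest_first=True:  try longer separators before shorter ones.
    """
    seps = [s for s in separators if s]
    if longest_first:
        seps = sorted(seps, key=len, reverse=True)  # stable sort
    parts, start, i = [], 0, 0
    while i < len(text):
        for sep in seps:
            if text.startswith(sep, i):
                parts.append(text[start:i])
                i += len(sep)
                start = i
                break
        else:
            i += 1
    parts.append(text[start:])
    return parts


source = "A\r\nB\n\r\nC"
print(split_ordered(source, ["\r\n", "\r", "\n"]))
# prints ['A', 'B', '', 'C']
print(split_ordered(source, ["\r", "\n", "\r\n"]))
# prints ['A', '', 'B', '', '', 'C']
print(split_ordered(source, ["\r", "\n", "\r\n"], longest_first=True))
# prints ['A', 'B', '', 'C']
```

In this model the Chr(10)+Chr(13) pair is never treated as one splitter, and putting the single characters first prevents Chr(13)+Chr(10) from ever matching.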
Also, contrary to the VFP documentation, CHR(10) + CHR(13) is not handled as a single splitter.
Again ACK, as strange as CHR(13) not being classified as a separator in GetWord* while it counts as empty for Empty().
Added your code to my examples. I had reached the same conclusion when splitting long strings of hex digits.
thx & regards
thomas