Reading a large file that doesn't fit in memory



$begingroup$


We have a large log file which stores user interactions with an application. Each entry in the log file follows the schema userId, timestamp, actionType, where actionType is one of two possible values: open or close.



Constraints:



  1. The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.

  2. Code has to be able to run on a single machine.

  3. The solution should not use an out-of-the-box implementation of MapReduce or a third-party database; don’t assume we have Hadoop, Spark, or another distributed computing framework.

  4. There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.

  5. Timestamps will come in strictly ascending order.
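
One possible way to respect constraints 1 to 3 is to partition the log on disk by userId first, so that every event of a given user lands in the same, much smaller file and each file can be aggregated on its own. The sketch below is only an illustration under assumptions: the function name `partition_by_user`, the bucket count, and the bucket file naming are hypothetical and not part of the original problem.

```python
import os
import zlib


def partition_by_user(log_path, out_dir, num_buckets=16):
    """Split a log into num_buckets files, keyed by a hash of userId.

    Hypothetical helper: all lines of one user end up in one bucket file,
    and lines keep their original (ascending timestamp) order per bucket.
    """
    paths = [os.path.join(out_dir, "bucket_%03d.csv" % i)
             for i in range(num_buckets)]
    outs = [open(p, "w") for p in paths]
    try:
        with open(log_path) as log:
            for line in log:  # reads one line at a time, never the whole file
                user_id = line.split(",", 1)[0].strip()
                if not user_id:
                    continue  # skip blank lines
                # crc32 is stable across runs; the built-in hash() of a str
                # is randomized per process in Python 3.
                bucket = zlib.crc32(user_id.encode("utf-8")) % num_buckets
                outs[bucket].write(line if line.endswith("\n") else line + "\n")
    finally:
        for out in outs:
            out.close()
    return paths
```

With this layout, the number of buckets can be raised until the per-user state of a single bucket fits in memory; the results of each bucket can then be written out before the next bucket is read.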

For this problem, we need to implement a class or classes that compute the average time each user spends between open and close. Keep in mind that some users have missing entries, so we have to choose how to handle them in our calculations. The code should follow a consistent policy for that choice.



The desired output for the solution should be [userId, timeSpent,….] for all the users in the log file.



Sample log file (comma-separated, text file)



1,1435456566,open 
2,1435457643,open
3,1435458912,open
1,1435459567,close
4,1435460345,open
1,1435461234,open
2,1435462567,close
1,1435463456,open
3,1435464398,close
4,1435465122,close
1,1435466775,close
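
To make the expected result concrete, here is a minimal sketch of one consistent policy applied to the sample log above. The policy is an assumption chosen for illustration, not the required one: an open that follows another open replaces the pending open, and a close without a pending open is ignored. The function name `average_session_times` is hypothetical, and the sketch keeps per-user state in memory, so on its own it does not address constraint 1.

```python
def average_session_times(lines):
    """Return {userId: average time between open and close}.

    Assumed policy (one of several reasonable choices):
    - an "open" following another "open" replaces the pending one;
    - a "close" with no pending "open" is ignored.
    Users without any complete open/close pair are left out of the result.
    """
    pending_open = {}  # userId -> timestamp of the currently unmatched open
    totals = {}        # userId -> [sum of durations, number of pairs]
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip malformed rows
        user_id, raw_ts, action = parts
        try:
            ts = int(raw_ts)
        except ValueError:
            continue  # skip rows with a non-numeric timestamp
        if action == "open":
            pending_open[user_id] = ts
        elif action == "close":
            opened = pending_open.pop(user_id, None)
            if opened is not None:
                entry = totals.setdefault(user_id, [0, 0])
                entry[0] += ts - opened
                entry[1] += 1
    return {user: total / float(count)
            for user, (total, count) in totals.items()}
```

Under this policy, user 1 in the sample has two complete sessions (3001 and 3319 time units, the open at 1435461234 being discarded), giving an average of 3160; users 2, 3 and 4 each have one session of 4924, 5486 and 4777 time units respectively.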


Approach



Here is the code I've written in Python and Scala. It does not seem efficient enough to meet the expectations of the given scenario, and I'd like feedback on how I could optimise it for that scenario.





Scala implementation



import java.io.FileInputStream
import java.util.{Scanner, Map, LinkedList}
import java.lang.Long
import scala.collection.mutable

object UserMetrics extends App {
  if (args.length == 0) {
    println("Please provide input data file name for processing")
  }
  val userMetrics = new UserMetrics()
  userMetrics.readInputFile(args(0),if (args.length == 1) 600000 else args(1).toInt)
}

case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)

class UserMetrics {

  val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()

  def readInputFile(stArr:String, timeOut: Int) {
    var inputStream: FileInputStream = null
    var sc: Scanner = null
    try {
      inputStream = new FileInputStream(stArr);
      sc = new Scanner(inputStream, "UTF-8");
      while (sc.hasNextLine()) {
        val line: String = sc.nextLine();
        processInput(line, timeOut)
      }

      for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {
        val userInfo:UserInfo = userLs.get(0)
        val timespent = if (userInfo.occurence>0) userInfo.timeSpent/userInfo.occurence else 0
        println("" + key +","+timespent + "")
      }

      if (sc.ioException() != null) {
        throw sc.ioException();
      }
    } finally {
      if (inputStream != null) {
        inputStream.close();
      }
      if (sc != null) {
        sc.close();
      }
    }
  }

  def processInput(line: String, timeOut: Int) {
    val strSp = line.split(",")

    val userId: Integer = Integer.parseInt(strSp(0))
    val curTimeStamp = Long.parseLong(strSp(1))
    val status = strSp(2)
    val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)
    val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()

    val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)

    if (lsUserInfo != null && lsUserInfo.size() > 0) {
      val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)
      val prevTimeStamp: Long = lastUserInfo.prevTimeStamp
      val prevStatus: String = lastUserInfo.prevStatus

      if (prevStatus.equals("open")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          val timeDiff = lastUserInfo.timeSpent + timeSelector
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        } else if(!status.equals(lastUserInfo.prevStatus)) {
          val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        }
      } else if(prevStatus.equals("close")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence+1))
        } else if(!status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))
        }
      }
    } else if(lsUserInfo.size()==0) {
      lsUserInfo.add(uInfo)
    }
    usermap.put(userId, lsUserInfo)
  }
}





Python Implementation





import sys

def fileBlockStream(fp, number_of_blocks, block):
    #A generator that splits a file into blocks and iterates over the lines of one of the blocks.

    assert 0 <= block and block < number_of_blocks #Assertions to validate number of blocks given
    assert 0 < number_of_blocks

    fp.seek(0,2) #seek to end of file to compute block size
    file_size = fp.tell()

    ini = file_size * block // number_of_blocks #compute start & end point of file block (integer offsets)
    end = file_size * (1 + block) // number_of_blocks

    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini-1)
        fp.readline()

    while fp.tell() < end:
        yield fp.readline() #iterate over lines of the particular chunk or block

def computeResultDS(chunk,avgTimeSpentDict,defaultTimeOut):
    countPos,totTmPos,openTmPos,closeTmPos,nextEventPos = 0,1,2,3,4
    for rows in chunk.splitlines():
        if len(rows.split(",")) != 3:
            continue
        userKeyID = rows.split(",")[0]
        try:
            curTimeStamp = int(rows.split(",")[1])
        except ValueError:
            print("Invalid Timestamp for ID:" + str(userKeyID))
            continue
        curEvent = rows.split(",")[2]
        if userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "close":
            #Check if already existing userID with expected Close event 0 - Open; 1 - Close
            #Array value within dictionary stores [No. of pair events, total time spent (Close tm-Open tm), Last Open Tm, Last Close Tm, Next expected Event]
            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 0 #Change next expected event to Open

        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "open":
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 1 #Change next expected event to Close

        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "open":
            curTotalTime,closeTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][openTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][closeTmPos]=closeTime
            avgTimeSpentDict[userKeyID][openTmPos]=curTimeStamp
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount

        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "close":
            curTotalTime,openTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][closeTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][openTmPos]=openTime
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount

        elif curEvent == "open":
            #Initialize userid with Open event
            avgTimeSpentDict[userKeyID] = [0,0,curTimeStamp,0,1]

        elif curEvent == "close":
            #Initialize userid with missing handler function since there is no Open event for this User
            totaltime,OpenTime = missingHandler(defaultTimeOut,0,curTimeStamp)
            avgTimeSpentDict[userKeyID] = [1,totaltime,OpenTime,curTimeStamp,0]

def missingHandler(defaultTimeOut,curTimeVal,lastTimeVal):
    if lastTimeVal - curTimeVal > defaultTimeOut:
        return defaultTimeOut,curTimeVal
    else:
        return lastTimeVal - curTimeVal,curTimeVal

def computeAvg(avgTimeSpentDict,defaultTimeOut):
    resDict = {}
    for k,v in avgTimeSpentDict.items():
        if v[0] == 0:
            resDict[k] = 0
        else:
            resDict[k] = v[1]//v[0] #integer average, as in the original Python 2 division
    return resDict

if __name__ == "__main__":
    avgTimeSpentDict = {}
    if len(sys.argv) < 2:
        print("Please provide input data file name for processing")
        sys.exit(1)

    fileObj = open(sys.argv[1])
    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])
    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])
    for chunk_number in range(number_of_chunks):
        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):
            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)
    print (computeAvg(avgTimeSpentDict,defaultTimeOut))
    avgTimeSpentDict.clear() #Nullify dictionary
    fileObj.close() #Close the file object


Both programs give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have anything better or any suggestions on the existing implementation.










$endgroup$











  • How is it possible that the aggregated data doesn't fit in memory? It's ~20 bytes per user. Do you really have a user base of billions?
    – Oh My Goodness
    14 hours ago











  • This is meant to encourage memory-efficient solutions and critical thinking among programmers in one of our internal org forums.
    – Wiki_91
    12 hours ago










  • Is the problem real or imaginary?
    – Oh My Goodness
    12 hours ago










  • Imaginary. We handle such huge volumes in a distributed Hadoop cluster with Spark, but this challenge is to avoid that and solve the same problem on a single machine.
    – Wiki_91
    11 hours ago






  • You've taken a real problem and applied made-up constraints, like a programming puzzle would have, and got the worst of both worlds. The arbitrary One True Solution character of a puzzle is combined with the vagueness, length and tedium of a real problem. I suggest removing a bunch of detail to create a short puzzle, or dropping the fake restrictions and adding real context like "actual size of input" and "actual available memory" to describe an authentic engineering problem.
    – Oh My Goodness
    11 hours ago

















avgTimeSpentDict.clear() #Nullify dictionary
fileObj.close #Close the file object


Both programs give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have anything better or any suggestions on the existing implementation.










share|improve this question









New contributor




Wiki_91 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$




We have a large log file that stores user interactions with an application. Each entry follows the schema userId, timestamp, actionType, where actionType is one of two values: open or close.



Constraints:



  1. The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.

  2. Code has to be able to run on a single machine.

  3. Should not use an out-of-the-box implementation of MapReduce or a third-party database; don't assume we have Hadoop, Spark, or any other distributed computing framework.

  4. There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.

  5. Timestamps will come in strictly ascending order.

For this problem, we need to implement one or more classes that compute the average time each user spends between open and close. Keep in mind that some users have missing entries, so we have to choose how to handle them in our calculations. The code should follow a consistent policy with regard to that choice.



The desired output for the solution should be [userId, timeSpent,….] for all the users in the log file.
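As an illustration of constraint 1 only (it is not part of either implementation in this question), one common way to keep per-user state within memory is to first split the log into bucket files keyed by a hash of userId, then aggregate one bucket at a time. The function name `partition_log`, the bucket file names, and the default of 16 buckets are all assumptions made for this sketch.

```python
import os
import zlib


def partition_log(log_path, out_dir, num_buckets=16):
    """Split the log into num_buckets files so that every record of a
    given user lands in the same file, in the original order."""
    paths = [os.path.join(out_dir, "bucket_%d.csv" % i) for i in range(num_buckets)]
    buckets = [open(p, "w") for p in paths]
    try:
        with open(log_path) as log:
            for line in log:  # only one line is held in memory at a time
                user_id = line.split(",", 1)[0].strip()
                if not user_id:
                    continue  # skip blank lines
                # crc32 is stable across runs, unlike the built-in hash() of str
                index = zlib.crc32(user_id.encode("utf-8")) % num_buckets
                buckets[index].write(line if line.endswith("\n") else line + "\n")
    finally:
        for handle in buckets:
            handle.close()
    return paths
```

Because lines are appended in the order they are read, the ascending timestamp order from constraint 5 is preserved inside each bucket. Each bucket then holds roughly 1/num_buckets of the users, so its aggregate can be computed, written out, and discarded before the next bucket is opened.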



Sample log file (comma-separated, text file)



1,1435456566,open 
2,1435457643,open
3,1435458912,open
1,1435459567,close
4,1435460345,open
1,1435461234,open
2,1435462567,close
1,1435463456,open
3,1435464398,close
4,1435465122,close
1,1435466775,close
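To make the expected numbers concrete, here is a minimal sketch run against the sample above. It assumes one particular policy for missing entries, which is deliberately simpler than the timeout-based policy used in the implementations in this question: an open that is followed by another open is discarded, and a close with no pending open is ignored. The name `average_time_spent` is hypothetical.

```python
from collections import defaultdict

SAMPLE = """\
1,1435456566,open
2,1435457643,open
3,1435458912,open
1,1435459567,close
4,1435460345,open
1,1435461234,open
2,1435462567,close
1,1435463456,open
3,1435464398,close
4,1435465122,close
1,1435466775,close
"""


def average_time_spent(lines):
    open_at = {}               # userId -> timestamp of the pending open
    total = defaultdict(int)   # userId -> summed open-to-close time
    pairs = defaultdict(int)   # userId -> number of matched pairs
    for line in lines:
        parts = [field.strip() for field in line.split(",")]
        if len(parts) != 3:
            continue           # skip malformed rows
        user_id, raw_ts, action = parts
        ts = int(raw_ts)
        if action == "open":
            open_at[user_id] = ts  # a later open replaces an unmatched one
        elif action == "close" and user_id in open_at:
            total[user_id] += ts - open_at.pop(user_id)
            pairs[user_id] += 1
        # a close without a pending open is ignored under this policy
    return {user: total[user] / pairs[user] for user in pairs}


print(average_time_spent(SAMPLE.splitlines()))
```

Under this policy user 1 has two matched pairs (3001 and 3319 time units) and averages 3160, while users 2, 3, and 4 have one pair each: 4924, 5486, and 4777. Users with no matched pair are omitted from the result.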


Approach



Here is the code I've written in Python and Scala. It does not seem efficient enough to meet the expectations of the given scenario, so I'd like feedback on how to optimise it.





Scala implementation



import java.io.FileInputStream
import java.util.{Scanner, Map, LinkedList}
import java.lang.Long
import scala.collection.mutable

object UserMetrics extends App {
  if (args.length == 0) {
    println("Please provide input data file name for processing")
  }
  val userMetrics = new UserMetrics()
  userMetrics.readInputFile(args(0),if (args.length == 1) 600000 else args(1).toInt)
}

case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)

class UserMetrics {

  val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()

  def readInputFile(stArr:String, timeOut: Int) {
    var inputStream: FileInputStream = null
    var sc: Scanner = null
    try {
      inputStream = new FileInputStream(stArr);
      sc = new Scanner(inputStream, "UTF-8");
      while (sc.hasNextLine()) {
        val line: String = sc.nextLine();
        processInput(line, timeOut)
      }

      for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {
        val userInfo:UserInfo = userLs.get(0)
        val timespent = if (userInfo.occurence>0) userInfo.timeSpent/userInfo.occurence else 0
        println("" + key +","+timespent + "")
      }

      if (sc.ioException() != null) {
        throw sc.ioException();
      }
    } finally {
      if (inputStream != null) {
        inputStream.close();
      }
      if (sc != null) {
        sc.close();
      }
    }
  }

  def processInput(line: String, timeOut: Int) {
    val strSp = line.split(",")

    val userId: Integer = Integer.parseInt(strSp(0))
    val curTimeStamp = Long.parseLong(strSp(1))
    val status = strSp(2)
    val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)
    val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()

    val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)

    if (lsUserInfo != null && lsUserInfo.size() > 0) {
      val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)
      val prevTimeStamp: Long = lastUserInfo.prevTimeStamp
      val prevStatus: String = lastUserInfo.prevStatus

      if (prevStatus.equals("open")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          val timeDiff = lastUserInfo.timeSpent + timeSelector
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        } else if(!status.equals(lastUserInfo.prevStatus)) {
          val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        }
      } else if(prevStatus.equals("close")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence+1))
        } else if(!status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))
        }
      }
    } else if(lsUserInfo.size()==0) {
      lsUserInfo.add(uInfo)
    }
    usermap.put(userId, lsUserInfo)
  }
}



Python Implementation





import sys

def fileBlockStream(fp, number_of_blocks, block):
    #A generator that splits a file into blocks and iterates over the lines of one of the blocks.

    assert 0 <= block and block < number_of_blocks #Assertions to validate number of blocks given
    assert 0 < number_of_blocks

    fp.seek(0,2) #seek to end of file to compute block size
    file_size = fp.tell()

    ini = file_size * block // number_of_blocks #compute start & end point of file block
    end = file_size * (1 + block) // number_of_blocks

    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini-1)
        fp.readline()

    while fp.tell() < end:
        yield fp.readline() #iterate over lines of the particular chunk or block

def computeResultDS(chunk,avgTimeSpentDict,defaultTimeOut):
    countPos,totTmPos,openTmPos,closeTmPos,nextEventPos = 0,1,2,3,4
    for rows in chunk.splitlines():
        if len(rows.split(",")) != 3:
            continue
        userKeyID = rows.split(",")[0]
        try:
            curTimeStamp = int(rows.split(",")[1])
        except ValueError:
            print("Invalid Timestamp for ID:" + str(userKeyID))
            continue
        curEvent = rows.split(",")[2]
        if userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "close":
            #Check if already existing userID with expected Close event 0 - Open; 1 - Close
            #Array value within dictionary stores [No. of pair events, total time spent (Close tm-Open tm), Last Open Tm, Last Close Tm, Next expected Event]
            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 0 #Change next expected event to Open

        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "open":
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 1 #Change next expected event to Close

        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "open":
            curTotalTime,closeTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][openTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][closeTmPos]=closeTime
            avgTimeSpentDict[userKeyID][openTmPos]=curTimeStamp
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount

        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "close":
            curTotalTime,openTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][closeTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][openTmPos]=openTime
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount

        elif curEvent == "open":
            #Initialize userid with Open event
            avgTimeSpentDict[userKeyID] = [0,0,curTimeStamp,0,1]

        elif curEvent == "close":
            #Initialize userid with missing handler function since there is no Open event for this User
            totaltime,OpenTime = missingHandler(defaultTimeOut,0,curTimeStamp)
            avgTimeSpentDict[userKeyID] = [1,totaltime,OpenTime,curTimeStamp,0]

def missingHandler(defaultTimeOut,curTimeVal,lastTimeVal):
    if lastTimeVal - curTimeVal > defaultTimeOut:
        return defaultTimeOut,curTimeVal
    else:
        return lastTimeVal - curTimeVal,curTimeVal

def computeAvg(avgTimeSpentDict,defaultTimeOut):
    resDict = {}
    for k,v in avgTimeSpentDict.items():
        if v[0] == 0:
            resDict[k] = 0
        else:
            resDict[k] = v[1]/v[0]
    return resDict

if __name__ == "__main__":
    avgTimeSpentDict = {}
    if len(sys.argv) < 2:
        print("Please provide input data file name for processing")
        sys.exit(1)

    fileObj = open(sys.argv[1])
    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])
    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])
    for chunk_number in range(number_of_chunks):
        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):
            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)
    print (computeAvg(avgTimeSpentDict,defaultTimeOut))
    avgTimeSpentDict.clear() #Nullify dictionary
    fileObj.close() #Close the file object


Both programs give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have anything better or any suggestions on the existing implementation.
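For comparison, and purely as a sketch under stated assumptions, a Python file object is already a lazy line iterator, so a single sequential pass holds one line in memory at a time without seeking to block offsets. The generator name `iter_records` is hypothetical, and skipping malformed rows silently is an assumption rather than the policy of the code above.

```python
def iter_records(path):
    """Yield (user_id, timestamp, action) tuples, reading one line at a time."""
    with open(path) as log:
        for line in log:  # the file object yields lines lazily
            parts = [field.strip() for field in line.split(",")]
            if len(parts) != 3:
                continue  # skip rows that do not have exactly three fields
            user_id, raw_ts, action = parts
            try:
                timestamp = int(raw_ts)
            except ValueError:
                continue  # skip rows whose timestamp is not an integer
            yield user_id, timestamp, action
```

A consumer can then write `for user_id, timestamp, action in iter_records(path):` and update its per-user state, splitting each line only once.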







python scala memory-optimization






edited 5 mins ago by Jamal

asked 15 hours ago by Wiki_91

  • How is it possible that the aggregated data doesn't fit in memory? It's ~20 bytes per user - you really have a userbase of billions?
    – Oh My Goodness, 14 hours ago

  • This is meant to encourage memory-efficient solutions and critical thinking among programmers in one of our internal org forums.
    – Wiki_91, 12 hours ago

  • Is the problem real or imaginary?
    – Oh My Goodness, 12 hours ago

  • Imaginary. We handle such huge volumes in a distributed Hadoop cluster with Spark, but this challenge is to avoid that and handle the same problem on a single machine.
    – Wiki_91, 11 hours ago

  • You've taken a real problem and applied made-up constraints, like a programming puzzle would have, and got the worst of both worlds. The arbitrary One True Solution character of a puzzle is combined with the vagueness, length and tedium of a real problem. I suggest removing a bunch of detail to create a short puzzle, or dropping the fake restrictions and adding real context like "actual size of input" and "actual available memory" to describe an authentic engineering problem.
    – Oh My Goodness, 11 hours ago















