Iteratively Build a Summary Dataset in an Effective Way

$begingroup$


This is a problem I run into a lot. Can I achieve this goal without it taking so much time?

The code below does what I want. However, I believe it could be a lot more efficient and Pythonic.

PROBLEM:
I want to extract summary data from a larger dataset, and I only know how to do so using nested for loops. For example, I have a large dataset containing golf data, and I would like to extract summary statistics for the individual golf holes.



This code creates a scoring distribution and mean score for each Season-Hole-Round-Score vs. Par combination (48 rows in total).



    import itertools

    import numpy as np
    import pandas as pd

    seasons = [2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002]
    holes = [1, 1, 2, 2, 1, 1, 2, 2]
    rounds = [3, 4, 3, 4, 3, 4, 3, 4]
    scores = [1, -1, 0, 0, 0, 1, -1, 1]  # actual scores vs. par

    df = pd.DataFrame({'season': seasons, 'hole': holes,
                       'round': rounds, 'score': scores})

    all_seasons = set(seasons)
    all_holes = set(holes)
    all_scores = [-1, 0, 1]
    all_rounds = ["R3", "R4", "Weekend"]  # some averages combine rounds
    round_iter = np.arange(0, 4)  # positions in the rounds lists
    round_ids = [[3], [4], [3, 4]]  # weekend includes rounds 3 and 4

    hold_list = []  # rows of the summary table

    # Note: `round` shadows the built-in, and round - 1 takes the values
    # -1, 0, 1, 2, so "Weekend" (index -1 and index 2) is visited twice.
    for season, round, hole in itertools.product(all_seasons, round_iter, all_holes):
        # rows for this season and hole in the rounds covered by this label
        hold_data = df[((df['season'] == season) & (df['hole'] == hole))
                       & (df['round'].isin(round_ids[round - 1]))]

        mean_score = hold_data['score'].mean()
        vspar_distro = hold_data['score'].value_counts().to_dict()
        for score in all_scores:
            count_score = 0
            if score in vspar_distro:
                count_score = vspar_distro[score]
            hold_list.append([season, all_rounds[round - 1],
                              hole, mean_score, score, count_score])

    historical_df = pd.DataFrame(
        hold_list,
        columns=['season', 'round', 'hole', 'mean_score', 'vspar_score', 'count'])


This produces the DataFrame that I want (the first 5 rows are shown here), but applying it to a file with 100k+ records takes a long time, and I believe there is a more efficient way. Thanks!



       season    round  hole  mean_score  vspar_score  count
    0    2001  Weekend     1         0.0           -1      1
    1    2001  Weekend     1         0.0            0      0
    2    2001  Weekend     1         0.0            1      1
    3    2001  Weekend     2         0.0           -1      0
    4    2001  Weekend     2         0.0            0      2
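
For comparison, here is a minimal sketch of one way the same summary might be built without explicit loops, using `groupby`. It is an illustration under assumptions, not a definitive answer: the names `round_groups`, `labelled`, `full_index`, and `summary` are made up for this example, and the sample data is copied from the snippet above. The idea is to stack one labelled copy of the rows per round group (so "Weekend" is just rounds 3 and 4 together), aggregate once, and reindex so that scores that never occur get a count of 0. Note that it emits each round label once, giving 36 rows for the sample data, whereas the loop above visits "Weekend" twice.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "season": [2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002],
    "hole": [1, 1, 2, 2, 1, 1, 2, 2],
    "round": [3, 4, 3, 4, 3, 4, 3, 4],
    "score": [1, -1, 0, 0, 0, 1, -1, 1],  # score vs. par
})

all_scores = [-1, 0, 1]
# Hypothetical mapping: label -> rounds it covers (mirrors all_rounds/round_ids).
round_groups = {"R3": [3], "R4": [4], "Weekend": [3, 4]}

# One labelled copy of the matching rows per round group.
labelled = pd.concat(
    [df.loc[df["round"].isin(ids)].assign(round_label=label)
     for label, ids in round_groups.items()],
    ignore_index=True,
)

keys = ["season", "round_label", "hole"]

# Mean score per (season, round label, hole).
means = labelled.groupby(keys)["score"].mean().rename("mean_score")

# Every (season, round label, hole, score) combination we want a row for.
full_index = pd.MultiIndex.from_product(
    [sorted(df["season"].unique()), list(round_groups),
     sorted(df["hole"].unique()), all_scores],
    names=keys + ["vspar_score"],
)

# Count of each score per group; missing combinations are filled with 0.
counts = (
    labelled.groupby(keys + ["score"]).size()
    .rename_axis(keys + ["vspar_score"])
    .reindex(full_index, fill_value=0)
    .rename("count")
)

# Attach the group mean to every count row and restore the column layout.
summary = (
    counts.reset_index()
    .merge(means.reset_index(), on=keys, how="left")
    .rename(columns={"round_label": "round"})
    [["season", "round", "hole", "mean_score", "vspar_score", "count"]]
)

print(summary.head())
```

Because the filtering happens once per round group instead of once per season-hole-round combination, the amount of Python-level looping no longer grows with the number of seasons and holes. Whether this is actually faster on the 100k+ record file would need to be measured.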








$endgroup$
















Tags: python, python-3.x





asked 2 mins ago by python_rube
